Welcome to C—one of the world’s most senior computer programming languages and, according to the Tiobe Index, the world’s most popular. You’re probably familiar with many of the powerful tasks computers perform. In this textbook, you’ll get intensive, hands-on experience writing C instructions that command computers to perform those and other tasks. Software (that is, the C instructions you write, which are also called code) controls hardware (that is, computers and related devices).

C is widely used in industry for a wide range of tasks. Today’s popular desktop operating systems—Windows, macOS and Linux—are partially written in C. Many popular applications are partially written in C, including popular web browsers (e.g., Google Chrome and Mozilla Firefox), database management systems (e.g., Microsoft SQL Server, Oracle and MySQL) and more.

Hardware and Software

Computers can perform calculations and make logical decisions phenomenally faster than human beings can. Today’s personal computers and smartphones can perform billions of calculations in one second—more than a human can perform in a lifetime.

Supercomputers already perform thousands of trillions (quadrillions) of instructions per second! As of December 2020, Fujitsu’s Fugaku is the world’s fastest supercomputer—it can perform 442 quadrillion calculations per second (442 petaflops)! To put that in perspective, this supercomputer can perform in one second almost 58 million calculations for every person on the planet! And supercomputing upper limits are growing quickly.

Computers process data under the control of sequences of instructions called computer programs (or simply programs). These programs guide the computer through ordered actions specified by people called computer programmers.

Computer Organisation

Regardless of physical differences, computers can be envisioned as divided into various logical units or sections.

Input Unit

This “receiving” section obtains information (data and computer programs) from input devices and places it at the other units’ disposal for processing. Computers receive most user input through keyboards, touch screens, mice and touchpads, though there are many more input devices available.

Output Unit

This “shipping” section takes information the computer has processed and places it on various output devices to make it available outside the computer. Most information that’s output from computers today is displayed on screens, played as audio/video, or transmitted over the internet.

Memory Unit

This rapid-access, relatively low-capacity “warehouse” section retains information entered through the input unit, making it immediately available for processing when needed. The memory unit also retains processed information until it can be placed on output devices by the output unit. Information in the memory unit is volatile—it’s typically lost when the computer’s power is turned off. The memory unit is often called either memory, primary memory or RAM (Random Access Memory). Main memories on desktop and notebook computers contain as much as 128 GB of RAM, though 8 to 16 GB is most common. GB stands for gigabytes; a gigabyte is approximately one billion bytes. A byte is eight bits. A bit (short for “binary digit”) is either a 0 or a 1.
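
To see these sizes concretely on your own machine, here is a minimal C sketch (ours, not the book’s) that uses the standard sizeof operator and the CHAR_BIT constant from <limits.h> to report how many bytes a few common types occupy and how many bits make up a byte:

#include <stdio.h>
#include <limits.h>  // CHAR_BIT: the number of bits in a byte

int main(void) {
   printf("bits per byte: %d\n", CHAR_BIT);
   printf("sizeof(char):   %zu byte(s)\n", sizeof(char));
   printf("sizeof(int):    %zu byte(s)\n", sizeof(int));
   printf("sizeof(double): %zu byte(s)\n", sizeof(double));
}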

Arithmetic and Logic Unit (ALU)

This “manufacturing” section performs calculations (e.g., addition, subtraction, multiplication and division) and makes decisions (e.g., comparing two items from the memory unit to determine whether they’re equal). In today’s systems, the ALU is part of the next logical unit, the CPU.

Central Processing Unit (CPU)

This “administrative” section coordinates and supervises the operation of the other sections. The CPU tells the input unit when to read information into the memory unit, the ALU when to use information from the memory unit in calculations, and the output unit when to send information from the memory unit to specific output devices.

Most computers today have multicore processors that economically implement multiple processors on a single integrated circuit chip. Such processors can perform many operations simultaneously. A dual-core processor has two CPUs, a quad-core processor has four and an octa-core processor has eight. Intel has some processors with up to 72 cores.

Secondary Storage Unit

This is the long-term, high-capacity “warehousing” section. Programs and data not actively being used by the other units are placed on secondary storage devices until they’re again needed, possibly hours, days, months or even years later. Information on secondary storage devices is persistent—it’s preserved even when the computer’s power is turned off. Secondary storage information takes much longer to access than information in primary memory, but its cost per byte is much less. Examples of secondary storage devices include solid-state drives (SSDs), USB flash drives, hard drives and read/write Blu-ray drives. Many current drives hold terabytes (TB) of data. A terabyte is approximately one trillion bytes. Typical desktop and notebook-computer hard drives hold up to 4 TB, and some recent desktop-computer hard drives hold up to 20 TB. The largest commercial SSD holds up to 100 TB (and costs $40,000).

Data Hierarchy

Data items processed by computers form a data hierarchy that becomes larger and more complex in structure as we progress from the simplest data items (called “bits”) to richer ones, such as characters and fields.

Bits

A bit is short for “binary digit”—a digit that can assume one of two values—and is a computer’s smallest data item. It can have the value 0 or 1. Remarkably, computers’ impressive functions involve only the simplest manipulations of 0s and 1s—examining a bit’s value, setting a bit’s value and reversing a bit’s value (from 1 to 0 or from 0 to 1). Bits form the basis of the binary number system, which we discuss in our “Number Systems” appendix.
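
Those three manipulations (examining, setting and reversing a bit’s value) map directly onto C’s bitwise operators. The short sketch below is ours rather than the book’s; it sets, examines and toggles one bit of an unsigned integer:

#include <stdio.h>

int main(void) {
   unsigned int value = 0;       // all bits start at 0
   unsigned int mask = 1u << 3;  // a mask that selects bit 3

   value |= mask;                                       // set bit 3 to 1
   printf("bit 3 is %u\n", (value & mask) ? 1u : 0u);   // examine bit 3
   value ^= mask;                                       // reverse (toggle) bit 3
   printf("bit 3 is now %u\n", (value & mask) ? 1u : 0u);
}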

Characters

Working with data in the low-level form of bits is tedious. Instead, people prefer to work with decimal digits (0–9), letters (A–Z and a–z) and special symbols such as $ @ % & * ( ) – + " : ; , ? and /. Digits, letters and special symbols are known as characters. The computer’s character set contains the characters used to write programs and represent data items. Computers process only 1s and 0s, so a computer’s character set represents each character as a pattern of 1s and 0s. C uses the ASCII (American Standard Code for Information Interchange) character set by default. C also supports Unicode® characters composed of one, two, three or four bytes (8, 16, 24 or 32 bits, respectively).
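
Because C stores each character as a small integer code, a program can display both a character and the numeric value behind it. A minimal illustrative sketch (assuming the ASCII character set mentioned above):

#include <stdio.h>

int main(void) {
   char letter = 'A';
   // In ASCII, 'A' is stored as the bit pattern for the value 65
   printf("character: %c   numeric code: %d\n", letter, letter);
}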

Fields

Just as characters are composed of bits, fields are composed of characters or bytes. A field is a group of characters or bytes that conveys meaning. For example, a field consisting of uppercase and lowercase letters could represent a person’s name, and a field consisting of decimal digits could represent a person’s age in years.

Records

Several related fields can be used to compose a record. In a payroll system, for example, the record for an employee might consist of the following fields (possible types for these fields are shown in parentheses):

• Employee identification number (a whole number). 
• Name (a group of characters). 
• Address (a group of characters). 
• Hourly pay rate (a number with a decimal point). 
• Year-to-date earnings (a number with a decimal point). 
• Amount of taxes withheld (a number with a decimal point).

Thus, a record is a group of related fields. All the fields listed above belong to the same employee. A company might have many employees and a payroll record for each.
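
In C, such a record is naturally represented as a struct. The sketch below is purely illustrative; the member names, types and array sizes are our assumptions rather than a layout prescribed by the payroll example:

#include <stdio.h>

// One employee's payroll record, with one member per field listed above
struct EmployeeRecord {
   unsigned int id;            // employee identification number (whole number)
   char name[40];              // group of characters
   char address[80];           // group of characters
   double hourlyPayRate;       // number with a decimal point
   double yearToDateEarnings;  // number with a decimal point
   double taxesWithheld;       // number with a decimal point
};

int main(void) {
   struct EmployeeRecord e =
      {12345, "Pat Jones", "123 Main Street", 18.50, 9250.00, 1387.50};
   printf("employee %u earns $%.2f per hour\n", e.id, e.hourlyPayRate);
}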

File

A file is a group of related records. More generally, a file contains arbitrary data in arbitrary formats. Some operating systems view a file simply as a sequence of bytes—any organization of the bytes in a file, such as organizing the data into records, is a view created by the application programmer. You’ll see how to do that in Chapter 11, File Processing. It’s not unusual for an organization to have many files, some containing billions, or even trillions, of characters of information. As we’ll see below, with big data, far larger file sizes are becoming increasingly common.

Databases

A database is a collection of data organized for easy access and manipulation. The most popular model is the relational database, in which data is stored in simple tables. A table includes records and fields. For example, a table of students might include first name, last name, major, year, student ID number and grade-point-average fields. The data for each student is a record, and the individual pieces of information in each record are the fields. You can search, sort and otherwise manipulate the data based on its relationship to multiple tables or databases. For example, a university might use data from the student database combined with data from databases of courses, on-campus housing, meal plans, etc.

Machine Languages, Assembly Languages and High-Level Languages

Programmers write instructions in various programming languages, some directly understandable by computers and others requiring intermediate translation steps. Hundreds of such languages are in use today. These may be divided into three general types:

Machine Languages

Any computer can directly understand only its own machine language, defined by its hardware design. Machine languages generally consist of strings of numbers (ultimately reduced to 1s and 0s) that instruct computers to perform their most elementary operations one at a time. Machine languages are machine-dependent—a particular machine language can be used on only one type of computer. Such languages are cumbersome for humans. For example, here’s a section of an early machine-language payroll program that adds overtime pay to base pay and stores the result in gross pay:

+1300042774 
+1400593419 
+1200274027

Assembly Languages and Assemblers

Programming in machine language was simply too slow and tedious for most programmers. Instead of using the strings of numbers that computers could directly understand, programmers began using English-like abbreviations to represent elementary operations. These abbreviations formed the basis of assembly languages.

Translator programs called assemblers were developed to convert assembly-language programs to machine language at computer speeds. The following section of an assembly-language payroll program also adds overtime pay to base pay and stores the result in gross pay:

load   basepay
add    overpay
store  grosspay

Although such code is clearer to humans, it’s incomprehensible to computers until it’s translated to machine language.

High-Level Languages and Compilers

With the advent of assembly languages, the use of computers increased rapidly. However, programmers still had to use numerous instructions to accomplish even simple tasks. To speed the programming process, high-level languages were developed in which single statements could accomplish substantial tasks. A typical high-level-language program contains many statements, known as the program’s source code. Translator programs called compilers convert high-level-language source code into machine language. High-level languages allow you to write instructions that look almost like everyday English and contain common mathematical notations. A payroll program written in a high-level language might contain a single statement such as:

grossPay = basePay + overTimePay

From the programmer’s standpoint, high-level languages are preferable to machine and assembly languages. C is among the world’s most widely used high-level programming languages.
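
For context, here is how that single payroll statement might look inside a complete, compilable C program. The surrounding declarations and the sample values are ours:

#include <stdio.h>

int main(void) {
   double basePay = 800.00;      // sample value
   double overTimePay = 150.00;  // sample value

   double grossPay = basePay + overTimePay;  // the high-level statement shown above

   printf("grossPay = %.2f\n", grossPay);
}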

Interpreters

Compiling a large high-level language program into machine language can take considerable computer time. Interpreters execute high-level language programs directly, avoiding compilation delays, but interpreted code runs slower than compiled programs. Some programming languages, such as Java and Python, use a clever mixture of compilation and interpretation to run programs.

Operating Systems

Operating systems are software that make using computers more convenient for users, software developers and system administrators. They provide services that allow applications to execute safely, efficiently and concurrently with one another. The software that contains the core operating-system components is called the kernel. Linux, Windows and macOS are popular desktop computer operating systems—you can use any of these with this book. Each is partially written in C. The most popular mobile operating systems used in smartphones and tablets are Google’s Android and Apple’s iOS.

Windows - A Proprietary Operating System

In the mid-1980s, Microsoft developed the Windows operating system, consisting of a graphical user interface built on top of DOS (Disk Operating System)—an enormously popular personal-computer operating system that users interacted with by typing commands. Windows 10 is Microsoft’s latest operating system—it includes the Cortana personal assistant for voice interactions. Windows is a proprietary operating system—it’s controlled by Microsoft exclusively. It is by far the world’s most widely used desktop operating system.

Linux—An Open-Source Operating System

The Linux operating system is among the greatest successes of the open-source movement. Proprietary software for sale or lease dominated software’s early years. With open source, individuals and companies contribute to developing, maintaining and evolving the software. Anyone can then use that software for their own purposes—normally at no charge, but subject to a variety of (typically generous) licensing requirements. Open-source code is often scrutinised by a much larger audience than proprietary software, so errors can get removed faster, making the software more robust. Open source increases productivity and has contributed to an explosion of innovation.

The Linux kernel is the core of the most popular open-source, freely distributed, full-featured operating system. It’s developed by a loosely organized team of volunteers and is popular in servers, personal computers and embedded systems (such as the computer systems at the heart of smartphones, smart TVs and automobile systems). Unlike Microsoft’s Windows and Apple’s macOS source code, the Linux source code is available to the public for examination and modification and is free to download and install. As a result, Linux users benefit from a huge community of developers actively debugging and improving the kernel, and from the ability to customise the operating system to meet specific needs.

Apple’s macOS and Apple’s iOS for iPhone and iPad Devices

Apple, founded in 1976 by Steve Jobs and Steve Wozniak, quickly became a leader in personal computing. In 1979, Jobs and several Apple employees visited Xerox PARC (Palo Alto Research Center) to learn about Xerox’s desktop computer that featured a graphical user interface (GUI). That GUI served as the inspiration for the Apple Macintosh, launched in 1984.

Google's Android

Android—the most widely used mobile and smartphone operating system—is based on the Linux kernel, the Java programming language and, now, the open-source Kotlin programming language. Android is open source and free. Though you can’t develop Android apps purely in C, you can incorporate C code into Android apps. According to idc.com, 84.8% of smartphones shipped in 2020 use Android, compared to 15.2% for Apple. The Android operating system is used in numerous smartphones, e-reader devices, tablets, TVs, in-store touch-screen kiosks, cars, robots, multimedia players and more.

The C Programming Language

C evolved from two earlier languages, BCPL and B. BCPL was developed in 1967 by Martin Richards as a language for writing operating systems and compilers. Ken Thompson modeled many features in his B language after their counterparts in BCPL, and in 1970 he used B to create early versions of the UNIX operating system at Bell Laboratories.

The C language was evolved from B by Dennis Ritchie at Bell Laboratories and was originally implemented in 1972. C initially became widely known as the development language of the UNIX operating system. Many of today’s leading operating systems are written in C and/or C++. C is mostly hardware-independent—with careful design, it’s possible to write C programs that are portable to most computers.

Built for Performance

C is widely used to develop systems that demand performance, such as operating systems, embedded systems, real-time systems and communications systems:

Operating systems - C’s portability and performance make it desirable for implementing operating systems, such as Linux and portions of Microsoft’s Windows and Google’s Android. Apple’s macOS is built in Objective-C, which was derived from C.

Embedded systems - The vast majority of the microprocessors produced each year are embedded in devices other than general-purpose computers. These embedded systems include navigation systems, smart home appliances, home security systems, smartphones, tablets, robots, intelligent traffic intersections and more. C is one of the most popular programming languages for developing embedded systems, which typically need to run as fast as possible and conserve memory. For example, a car’s antilock brakes must respond immediately to slow or stop the car without skidding; video-game controllers should respond instantaneously to prevent lag between the controller and the game action.

Real-time systems - Real-time systems are often used for “mission-critical” applications that require nearly instantaneous and predictable response times. Real-time systems need to work continuously. For example, an air-traffic-control system must continuously monitor planes’ positions and velocities and report that information to air-traffic controllers without delay so they can alert the planes to change course if there’s a possibility of a collision.

Communications systems - Communications systems need to route massive amounts of data to their destinations quickly to ensure that things such as audio and video are delivered smoothly and without delay.

By the late 1970s, C had evolved into what’s now referred to as “traditional C.” The publication in 1978 of Kernighan and Ritchie’s book, The C Programming Language, drew wide attention to the language. This became one of the most successful computer-science books of all time.

Standardization

C’s rapid expansion to various hardware platforms (that is, types of computer hardware) led to many similar but often incompatible C versions. This was a serious problem for programmers who needed to develop code for several platforms. It became clear that a standard C version was needed. In 1983, the American National Standards Committee on Computers and Information Processing (X3) created the X3J11 technical committee to “provide an unambiguous and machine-independent definition of the language.” In 1989, the standard was approved in the United States through the American National Standards Institute (ANSI), then worldwide through the International Organization for Standardization (ISO). This version was simply called Standard C.

The C Standard Library and Open-Source Libraries

C programs consist of pieces called functions. You can program all the functions you need to form a C program. However, most C programmers take advantage of the rich collection of existing functions in the C standard library. Thus, there are really two parts to learning C programming:

  • learning the C language itself, and
  • learning how to use the functions in the C standard library.

When programming in C, you’ll typically use the following building blocks:

  • C Standard library functions,
  • open-source C library functions,
  • functions you create yourself, and
  • functions other people (whom you trust) have created and made available to you.

The advantage of creating your own functions is that you’ll know exactly how they work. The disadvantage is the time-consuming effort that goes into designing, developing, debugging and performance-tuning new functions.

Using C standard library functions instead of writing your own versions can improve program performance, because these functions are carefully written to perform efficiently. Using C standard library functions instead of writing your own comparable versions also can improve program portability.
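
As a small illustration, the sketch below (ours) calls the standard library’s strlen function from <string.h> to count the characters in a string instead of writing that loop by hand; the string itself is just sample data:

#include <stdio.h>
#include <string.h>  // declares strlen and other C standard library string functions

int main(void) {
   const char *greeting = "Welcome to C!";
   // strlen is already written, tested and tuned by the library implementers
   printf("\"%s\" contains %zu characters\n", greeting, strlen(greeting));
}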

Open-Source Libraries

There are enormous numbers of third-party and open-source C libraries that can help you perform significant tasks with modest amounts of code. GitHub lists over 32,000 repositories in their C category:
https://github.com/topics/c
In addition, pages such as Awesome C:
https://github.com/kozross/awesome-c
provide curated lists of popular C libraries for a wide range of application areas.

Typical C Program-Development Environment

C systems generally consist of several parts: a program-development environment, the language and the C standard library. C programs typically go through six phases to be executed—edit, preprocess, compile, link, load and execute. Although C How to Program, 9/e, is a generic C textbook (written independently of any particular operating system), we concentrate in this section on a typical Linux-based C system.

Phase 1: Creating a Program

Phase 1 consists of editing a file in an editor program. Two editors widely used on Linux systems are vi and emacs. C and C++ integrated development environments (IDEs) such as Microsoft Visual Studio and Apple Xcode have integrated editors. You type a C program in the editor, make corrections if necessary, then store the program on a secondary storage device such as a hard disk. C program filenames should end with the .c extension.
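
The remaining phases refer to a file named welcome.c. Its contents aren’t shown in this section, but a minimal program along the following lines (our sketch) would serve:

// welcome.c -- a minimal program to carry through the six phases
#include <stdio.h>

int main(void) {
   puts("Welcome to C!");
}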

Phases 2 and 3: Preprocessing and Compiling a C Program

In Phase 2, you give the command to compile the program. The compiler translates the C program into machine-language code (also referred to as object code). In a C system, the compilation command invokes a preprocessor program before the compiler’s translation phase begins. The C preprocessor obeys special commands called preprocessor directives, which perform text manipulations on a program’s source-code files. These manipulations consist of inserting the contents of other files and various text replacements.
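
The two most common kinds of preprocessor directives appear in the brief sketch below: #include inserts the contents of another file (here, the <stdio.h> header), and #define performs a simple text replacement (the name PI is our own example, not a standard definition):

#include <stdio.h>   // the header's contents are inserted here before compilation
#define PI 3.14159   // later occurrences of PI are replaced with 3.14159

int main(void) {
   printf("PI is approximately %f\n", PI);
}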

In Phase 3, the compiler translates the C program into machine-language code. A syntax error occurs when the compiler cannot recognize a statement because it violates the language rules. The compiler issues an error message to help you locate and fix the incorrect statement. The C standard does not specify the wording for error messages issued by the compiler, so the messages you see on your system may differ from those on other systems. Syntax errors are also called compile errors or compile-time errors.

Phase 4: Linking

The next phase is called linking. C programs typically use functions defined elsewhere, such as in the standard libraries, open-source libraries or private libraries of a particular project. The object code produced by the C compiler typically contains “holes” due to these missing parts. A linker links a program’s object code with the code for the missing functions to produce an executable image (with no missing pieces). On a typical Linux system, the command to compile and link a program is gcc (the GNU C compiler). To compile and link a program named welcome.c using the latest C standard (C18), type:

gcc -std=c18 welcome.c

Linux commands are case sensitive. If the program compiles and links correctly, the compiler produces a file named a.out (by default), which is welcome.c’s executable image.
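
If you’d rather not use the default name a.out, gcc’s -o option lets you choose the executable’s name yourself, for example:

gcc -std=c18 welcome.c -o welcome
./welcome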

Phase 5: Loading

The next phase is called loading. Before a program can execute, the operating system must load it into memory. The loader takes the executable image from disk and transfers it to memory. Additional components from shared libraries that support the program also are loaded.

Phase 6: Execution

Finally, in the last phase, the computer, under control of its CPU, executes the program one instruction at a time. To load and execute the program on a Linux system, type ./a.out at the Linux prompt and press Enter.

Problems that May Occur at Execution Time

Programs do not always work on the first try. Each of the preceding phases can fail because of various errors that we’ll discuss. For example, an executing program might attempt to divide by zero (an illegal operation on computers just as in arithmetic). This would cause the computer to display an error message. You would then return to the edit phase, make the necessary corrections and proceed through the remaining phases again to determine that the corrections work properly.

Errors such as division-by-zero that occur as programs run are called runtime errors or execution-time errors. Divide-by-zero is generally a fatal error that causes the program to terminate immediately without successfully performing its job. Nonfatal errors allow programs to run to completion, often producing incorrect results.
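
A program can often avoid a fatal divide-by-zero error by testing the divisor before dividing, as in this brief sketch of ours (the values are illustrative):

#include <stdio.h>

int main(void) {
   int numerator = 10;
   int denominator = 0;  // imagine this value came from user input

   if (denominator != 0) {
      printf("result: %d\n", numerator / denominator);
   }
   else {
      printf("error: attempted to divide by zero\n");  // report instead of failing
   }
}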

Standard Input, Standard Output and Standard Error Streams

Most C programs input and/or output data. Certain C functions take their input from stdin (the standard input stream), which is normally the keyboard. Data is often output to stdout (the standard output stream), which is normally the computer screen. When we say that a program prints a result, we normally mean that the result is displayed on a screen. Data also may be output to devices such as disks and printers. There’s also a standard error stream referred to as stderr, which is normally connected to the screen and used to display error messages. It’s common to route regular output data, i.e., stdout, to a device other than the screen while keeping stderr assigned to the screen so that the user can be immediately informed of errors.
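
The sketch below (ours) touches all three streams: it prompts and reports on stdout with printf, reads from stdin with scanf, and sends an error message to stderr with fprintf:

#include <stdio.h>

int main(void) {
   int number = 0;

   printf("Enter an integer: ");                // written to stdout
   if (scanf("%d", &number) == 1) {
      printf("you entered %d\n", number);       // regular output to stdout
   }
   else {
      fprintf(stderr, "error: that was not an integer\n");  // error message to stderr
   }
}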

Internet, World Wide Web, the Cloud, and IoT

In the late 1960s, ARPA—the Advanced Research Projects Agency of the United States Department of Defense—rolled out plans for networking the main computer systems of approximately a dozen ARPA-funded universities and research institutions.

The computers were to be connected with communications lines operating at speeds on the order of 50,000 bits per second, a stunning rate at a time when most people (of the few who even had networking access) were connecting over telephone lines to computers at a rate of 110 bits per second. Academic research was about to take a giant leap forward. ARPA proceeded to implement what quickly became known as the ARPANET, the precursor to today’s Internet.

Things worked out differently from the original plan. Although the ARPANET enabled researchers to network their computers, its main benefit proved to be the capability for quick and easy communication via what came to be known as electronic mail (e-mail). This is true even on today’s Internet, with e-mail, instant messaging, file transfer and social media, such as Snapchat, Instagram, Facebook and Twitter, enabling billions of people worldwide to communicate quickly and easily.

The protocol (set of rules) for communicating over the ARPANET became known as the Transmission Control Protocol (TCP). TCP ensured that messages, consisting of sequentially numbered pieces called packets, were properly delivered from sender to receiver, arrived intact and were assembled in the correct order.

The Internet: A Network of Networks

In parallel with the early evolution of the Internet, organizations worldwide were implementing their own networks for intra-organization (that is, within an organization) and inter-organization (that is, between organizations) communication. A huge variety of networking hardware and software appeared. One challenge was to enable these different networks to communicate with each other. ARPA accomplished this by developing the Internet Protocol (IP), which created a true “network of networks,” the Internet’s current architecture. The combined set of protocols is now called TCP/IP. Each Internet-connected device has an IP address—a unique numerical identifier used by devices communicating via TCP/IP to locate one another on the Internet.

The World Wide Web: Making the Internet User-Friendly

The World Wide Web (simply called “the web”) is a collection of hardware and software associated with the Internet that allows computer users to locate and view documents (with various combinations of text, graphics, animations, audios and videos) on almost any subject.

In 1989, Tim Berners-Lee of CERN (the European Organization for Nuclear Research) began developing HyperText Markup Language (HTML)—the technology for sharing information via “hyperlinked” text documents. He also wrote communication protocols such as HyperText Transfer Protocol (HTTP) to form the backbone of his new hypertext information system, which he referred to as the World Wide Web.

In 1994, Berners-Lee founded the World Wide Web Consortium (W3C, https://www.w3.org), devoted to developing web technologies. One of the W3C’s primary goals is to make the web universally accessible to everyone regardless of disabilities, language or culture.

The Cloud

More and more computing today is done “in the cloud”—that is, using software and data distributed across the Internet worldwide, rather than locally on your desktop, notebook computer or mobile device. Cloud computing allows you to increase or decrease computing resources to meet your needs at any given time, which is more cost-effective than purchasing hardware to provide enough storage and processing power to meet occasional peak demands. Cloud computing also saves money by shifting to the service provider the burden of managing these apps (such as installing and upgrading the software, security, backups and disaster recovery).

The apps you use daily are heavily dependent on various cloud-based services. These services use massive clusters of computing resources (computers, processors, memory, disk drives, etc.) and databases that communicate over the Internet with each other and the apps you use. A service that provides access to itself over the Internet is known as a web service.

Software as a Service

Cloud vendors focus on service-oriented architecture (SOA) technology. They provide “as-a-Service” capabilities that applications connect to and use in the cloud.

Mashups

The applications-development methodology of mashups enables you to rapidly develop powerful software applications by combining (often free) complementary web services and other forms of information feeds. One of the first mashups, www.housingmaps.com, combined the real-estate listings from www.craigslist.org with Google Maps to show the locations of homes for sale or rent in a given area. Check out www.housingmaps.com for some interesting facts, history, articles and how it influenced real-estate industry listings.

The Internet of Things

The Internet is no longer just a network of computers—it’s an Internet of Things (IoT). A thing is any object with an IP address and the ability to send, and in some cases receive, data automatically over the Internet. Such things include:
- a car with a transponder for paying tolls,
- monitors for parking-space availability in a garage,
- water-quality monitors,
- radiation detectors,
- smart thermostats that adjust temperatures based on weather forecasts, and
- intelligent home appliances.

According to statista.com, there are already over 23 billion IoT devices in use today, and there could be over 75 billion IoT devices in 2025.

Software Technologies

As you learn about and work in software development, you’ll frequently encounter the following buzzwords:

  • Refactoring: Reworking programs to make them clearer and easier to maintain while preserving their correctness and functionality. Many IDEs contain built-in refactoring tools to do major portions of the reworking automatically.
  • Design patterns: Proven architectures for constructing flexible and maintainable object-oriented software. The field of design patterns tries to enumerate those recurring patterns, encouraging software designers to reuse them to develop better-quality software using less time, money and effort.
  • Software Development Kits (SDKs): The tools and documentation that developers use to program applications.

How Big is Big Data?

For computer scientists and data scientists, data is now as crucial as writing programs. According to IBM, approximately 2.5 quintillion bytes (2.5 exabytes) of data are created daily, and 90% of the world’s data was created in the last two years. The Internet, which will play an important part in your career, is responsible for much of this trend. According to IDC, the global data supply will reach 175 zettabytes (equal to 175 trillion gigabytes or 175 billion terabytes) annually by 2025. Consider the following examples of various popular data measures.

Megabytes (MB)

One megabyte is about one million (actually 2^20) bytes. Many of the files we use daily require one or more MBs of storage. Some examples include:

  • MP3 audio files—High-quality MP3s range from 1 to 2.4 MB per minute.
  • Photos—JPEG format photos taken on a digital camera can require about 8 to 10 MB per photo.
  • Video—Smartphone cameras can record video at various resolutions. Each minute of video can require many megabytes of storage. For example, on one of our iPhones, the Camera settings app reports that 1080p video at 30 frames-per-second (FPS) requires 130 MB/minute and 4K video at 30 FPS requires 350 MB/minute.

Gigabytes (GB)

One gigabyte is about 1000 megabytes (actually 2^30 bytes). A dual-layer DVD can store up to 8.5 GB, which translates to:

  • as much as 141 hours of MP3 audio,
  • approximately 1000 photos from a 16-megapixel camera,
  • approximately 7.7 minutes of 1080p video at 30 FPS, or
  • approximately 2.85 minutes of 4K video at 30 FPS.

The current highest-capacity Ultra HD Blu-ray discs can store up to 100 GB of video. Streaming a 4K movie can use between 7 and 10 GB per hour (highly compressed).

Terabytes (TB)

One terabyte is about 1000 gigabytes (actually 2^40 bytes). Recent disk drives for desktop computers come in sizes up to 20 TB, which is equivalent to:

  • approximately 28 years of MP3 audio,
  • approximately 1.68 million photos from a 16-megapixel camera,
  • approximately 226 hours of 1080p video at 30 FPS, or
  • approximately 84 hours of 4K video at 30 FPS.

Nimbus Data now has the largest solid-state drive (SSD) at 100 TB, which can store five times the 20-TB examples of audio, photos and video listed above.

Petabytes, Exabytes and Zettabytes

There are over four billion people online, creating about 2.5 quintillion bytes of data each day—that’s 2500 petabytes (each petabyte is about 1000 terabytes) or 2.5 exabytes (each exabyte is about 1000 petabytes). A March 2016 AnalyticsWeek article stated that by 2021 there would be over 50 billion devices connected to the Internet (most of them through the Internet of Things; Section 1.11.4) and, by 2020, there would be 1.7 megabytes of new data produced per second for every person on the planet. At today’s numbers (approximately 7.7 billion people), that’s about:

  • 13 petabytes of new data per second,
  • 780 petabytes per minute,
  • 46,800 petabytes (46.8 exabytes) per hour, or
  • 1,123 exabytes per day.

That’s the equivalent of over 5.5 million hours (over 600 years) of 4K video every day or approximately 116 billion photos every day!
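
As a rough back-of-the-envelope check of those figures (rounded a little differently than above), the short C sketch below recomputes them from the cited rates of 1.7 MB of new data per second per person and roughly 7.7 billion people:

#include <stdio.h>

int main(void) {
   double mbPerSecondPerPerson = 1.7;  // cited rate of new data per person
   double people = 7.7e9;              // approximate world population

   // 1 petabyte is about 1e9 megabytes; 1 exabyte is about 1000 petabytes
   double petabytesPerSecond = mbPerSecondPerPerson * people / 1.0e9;

   printf("about %.1f petabytes per second\n", petabytesPerSecond);
   printf("about %.0f petabytes per minute\n", petabytesPerSecond * 60);
   printf("about %.1f exabytes per hour\n", petabytesPerSecond * 3600 / 1000);
   printf("about %.0f exabytes per day\n", petabytesPerSecond * 86400 / 1000);
}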

Big-Data Analytics

Data analytics is a mature and well-developed discipline. The term “data analysis” was coined in 1962, though people have been analyzing data using statistics for thousands of years, going back to the ancient Egyptians. Big-data analytics is a more recent phenomenon—the term “big data” was coined around 1987.

Consider four of the V's of big data:

  1. Volume - the amount of data the world is producing is growing exponentially.
  2. Velocity - the speed at which data is being produced, the speed at which it moves through organizations and the speed at which data changes are all growing quickly.
  3. Variety - data used to be alphanumeric (that is, consisting of alphabetic characters, digits, punctuation and some special characters)—today, it also includes images, audios, videos and data from an exploding number of Internet of Things sensors in our homes, businesses, vehicles, cities and more.
  4. Veracity - the validity of the data. Is it complete and accurate? Can we trust it when making crucial decisions? Is it real?

Most data is now being created digitally in a variety of types, in extraordinary volumes and moving at astonishing velocities. Moore’s Law and related observations have enabled us to store data economically and process and move it faster—and all at rates growing exponentially over time. Digital data storage has become so vast in capacity, and so cheap and small, that we can now conveniently and economically retain all the digital data we’re creating. That’s big data.

The following Richard W. Hamming quote—although from 1962—sets the tone for the rest of this book:

"The Purpose of computing is insight, not numbers."

Data science is producing new, deeper, subtler and more valuable insights at a remarkable pace. It’s truly making a difference. Big-data analytics is an integral part of the answer.

Data Science and Big Data Are Making a Difference

The data-science field is growing rapidly because it’s producing significant results that are making a difference. We enumerate data-science and big-data use cases in the following list. We expect that the use cases and our examples, exercises and projects will inspire interesting term projects, directed-study projects, capstone-course projects and thesis research. Big-data analytics has resulted in improved profits, better customer relations, and even sports teams winning more games and championships while spending less on players.

  • anomaly detection
  • credit scoring
  • automated captions
  • brain mappings
  • computer vision
  • dynamic pricing
  • diagnostic medicine

Case Study - A Big Data Mobile Application

In your career, you’ll work with many programming languages and software technol- ogies. With its 130 million monthly active users, Google’s Waze GPS navigation app is one of the most widely used big-data apps. Early GPS navigation devices and apps relied on static maps and GPS coordinates to determine the best route to your destination. They could not adjust dynamically to changing traffic situations. Waze processes massive amounts of crowdsourced data—that is, the data that’s continuously supplied by their users and their users’ devices worldwide.

They analyze this data as it arrives to determine the best route to get you safely to your destination in the least amount of time. To accomplish this, Waze relies on your smartphone’s Internet connection. The app automatically sends location updates to their servers (assuming you allow it to). They use that data to dynamically re-route you based on current traffic conditions and to tune their maps. Users report other information, such as roadblocks, construction, obstacles, vehicles in breakdown lanes, police locations, gas prices and more. Waze then alerts other drivers in those locations.

Waze uses many technologies to provide its services. We’re not privy to how Waze is implemented, but we infer below a list of technologies they probably use. For example,

  • Most apps created today use at least some open-source software. You’ll take advantage of open-source libraries and tools in the case studies.
  • Waze communicates information over the Internet between their servers and their users’ mobile devices. Today, such data typically is transmitted in JSON (JavaScript Object Notation) format. Often the JSON data will be hidden from you by the libraries you use.
  • Waze uses speech synthesis to speak driving directions and alerts to you, and uses speech recognition to understand your spoken commands. Many cloud vendors provide speech-synthesis and speech-recognition capabilities.
  • Waze uses your phone as a streaming Internet of Things (IoT) device. Each phone is a GPS sensor that continuously streams data over the Internet to Waze.
  • Waze receives IoT streams from millions of phones at once. It must process, store and analyze that data immediately to update your device’s maps, display and speak relevant alerts and possibly update your driving directions. This requires massively parallel processing capabilities implemented with clusters of computers in the cloud. You can use various big-data infrastructure technologies to receive streaming data, store that big data in appropriate databases and process the data with software and hardware that provide massively parallel processing capabilities.
  • Waze probably stores its routing information in a graph database. Such databases can efficiently calculate shortest routes. You can use graph databases, such as Neo4J.
  • Waze uses artificial-intelligence capabilities to perform the data-analysis tasks that enable it to predict the best routes based on the information it receives. You can use machine learning and deep learning to analyze massive amounts of data and make predictions based on that data.

AI - at the Intersection of Computer Science and Data Science

When a baby first opens its eyes, does it “see” its parents’ faces? Does it understand any notion of what a face is—or even what a simple shape is? Babies must “learn” the world around them. That’s what artificial intelligence (AI) is doing today. It’s looking at massive amounts of data and learning from it. AI is being used to play games, implement a wide range of computer-vision applications, enable self-driving cars, enable robots to learn to perform new tasks, diagnose medical conditions, translate speech to other languages in near real-time, create chatbots that can respond to arbitrary questions using massive databases of knowledge, and much more.

Who’d have guessed just a few years ago that artificially intelligent self-driving cars would be allowed on our roads—or even become common? Yet, this is now a highly competitive area. The ultimate goal of all this learning is artificial general intelligence—an AI that can perform intelligent tasks as well as humans can.

Artificial-Intelligence Milestones

Several artificial-intelligence milestones, in particular, captured people’s attention and imagination, made the general public start thinking that AI is real and made businesses think about commercializing AI:

  • In a 1997 match between IBM’s Deep Blue computer system and chess Grandmaster Garry Kasparov, Deep Blue became the first computer to beat a reigning world chess champion under tournament conditions. IBM loaded Deep Blue with hundreds of thousands of grandmaster chess games. Deep Blue was capable of using brute force to evaluate up to 200 million moves per second! This is big data at work. IBM received the Carnegie Mellon University Fredkin Prize, which in 1980 offered $100,000 to the creators of the first computer to beat a world chess champion.
  • In 2011, IBM’s Watson beat the two best human Jeopardy! players in a $1 million match. Watson simultaneously used hundreds of language-analysis techniques to locate correct answers in 200 million pages of content (including all of Wikipedia) requiring four terabytes of storage. Watson was trained with machine-learning and reinforcement-learning techniques. Powerful libraries enable you to perform machine-learning and reinforcement-learning in various programming languages.
  • Go—a board game created in China thousands of years ago—is widely considered one of the most complex games ever invented, with 10^170 possible board configurations. To give you a sense of how large a number that is, it’s believed that there are (only) between 10^78 and 10^82 atoms in the known universe! In 2015, AlphaGo—created by Google’s DeepMind group—used deep learning with two neural networks to beat the European Go champion Fan Hui. Go is considered to be a far more complex game than chess. Powerful libraries enable you to use neural networks for deep learning.
  • More recently, Google generalized its AlphaGo AI to create AlphaZero—a game-playing AI that teaches itself to play other games. In December 2017, AlphaZero learned the rules of and taught itself to play chess in less than four hours using reinforcement learning. It then beat the world champion chess program, Stockfish 8, in a 100-game match—winning or drawing every game. After training itself in Go for just eight hours, AlphaZero was able to play Go vs. its AlphaGo predecessor, winning 60 of 100 games.

AI: A Field with Problems but No Solutions

For many decades, AI has been a field with problems and no solutions. That’s because once a particular problem is solved, people say, “Well, that’s not intelligence; it’s just a computer program that tells the computer exactly what to do.” However, with machine learning, deep learning and reinforcement learning, we’re not pre-programming solutions to specific problems. Instead, we’re letting our computers solve problems by learning from data—and, typically, lots of it. Many of the most interesting and challenging problems are being pursued with deep learning. Google alone has thousands of deep-learning projects underway.