The role of data in computer programming¶

In this lesson, you will identify and explore some fundamental concepts that underlie computer programming. Understanding these will enable you to give clear, high-level explanations of the way that programs work and help you to design and implement software in an effective way.

The lesson will focus primarily on the concepts of data and data structure, and outline in general terms the role of data in algorithms, architecture and applications. Specific algorithms, architectures and applications will be introduced and explored in detail later in the module.

Learning outcomes

After completing this lesson you should be able to:

describe software in terms of the key concepts of data, data structure, algorithm, architecture and application
give examples of the wide variety of data that is involved in computer applications
explain how data can be used to support various kinds of software functionality.

Key concepts¶

Below are definitions of some of the key concepts in this lesson:

Data – information that is stored in some specific format.
Data Structure – a particular way of formatting or organising data.
Algorithm – a specification of a computational process.
Architecture – the overall structure of a computer program in terms of its component algorithms and data.
Application – a piece of software that provides one or more useful functions.

These definitions are intentionally very general. They refer to aspects of computer programs and programming that are often highly interconnected. However, considering them as separate aspects can make it much easier to think about and design programs that can perform very complex tasks.

What is data?¶

In science, the word data refers to information in the form of a set of measurements or records that have been collected for reference or analysis. This can be directly from measurement of the world or from other data sources.

A computer program can only access and operate on information that is represented by some kind of structured format. It cannot determine meaning or the origin of the information it receives, except in so far as its meaning and origin may be encoded within that format. This is why, in computing, the word data can refer to any kind of information that is stored in some specific format, for which there is some convention for interpreting the information stored in that format.

To be brief: data is any kind of information stored in a specific format.
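As a small illustration of why the interpreting convention matters, the sketch below reads the same four bytes in three different ways. The byte values are made up for this example, and the three interpretations are just a few of many possible conventions.

```python
import struct

# Four bytes; on their own they carry no meaning (values chosen for illustration)
raw = bytes([72, 105, 33, 0])

# Convention 1: the first three bytes are ASCII characters
as_text = raw[:3].decode("ascii")

# Convention 2: all four bytes form one unsigned integer, least significant byte first
as_int = int.from_bytes(raw, "little")

# Convention 3: the bytes form two unsigned 16-bit integers, least significant byte first
as_shorts = struct.unpack("<2H", raw)

print(as_text, as_int, as_shorts)
```

The stored bytes never change. Only the convention used to read them does, and each convention yields a different piece of information.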

Nearly all computer programs involve some kind of data processing. But this data can vary greatly in its type, quantity and complexity, and there are a huge number of ways that a computer program can operate with data.

Data examples¶

Data can be found in a large variety of forms. The following examples illustrate some of the kinds of data that you may deal with:

3 numbers describing the size of a box in terms of height, width and depth
a sequence of temperature measurements
financial information (e.g. bank accounts)
a database of information about employees of a company
an inventory of products and stock held by a supermarket
a 3D representation of a human body
the text of a book
audio data (e.g. in mp3 or flac format)
image data (e.g. photos in jpeg format)
video data (e.g. in mpeg format)
a large repository of URLs and textual data harvested from the web.

Question

There are many other kinds of data. What other kinds of data can you think of?

Representing information: data and data structures¶

Data is always encoded within some kind of representational system. This system enables us to interpret the data and give it meaning. The meaning of an item of data is often some fact about the world (for example the height of a building, or the birth date of a person) but it could also represent non-factual information such as a picture or sound, or an abstract mathematical object.

Whatever is represented by data, it needs to be in a format for which there are specific conventions that define its meaning. At the lowest level of detail, nearly all information handled by computers is in the form of binary digits (0s and 1s), which are stored as the states of an electronic medium. Using these as building blocks, formatting conventions are used to encode much more complex types of information. The following list gives the main types of data format in order of increasing structural complexity:

Binary digits (0, 1)
Numbers, characters
Sequences of numbers, strings (sequences of characters)
Complex data structures: lists, dictionaries, sets, trees
Structured data objects - CSV files; standard formats for images, video, audio, etc; structured objects defined by classes.
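The levels in this list can be sketched in Python as follows. The variable names and values are chosen purely for illustration (they echo the Pokemon data used later in this lesson), and the details of each type are covered in later lessons.

```python
# A single binary digit
bit = 1

# A number and a character, each ultimately stored as a pattern of bits
number = 318
character = "B"
number_bits = format(number, "b")   # the binary digits of the number
character_code = ord(character)     # the numeric code that stands for the character

# Sequences: a list of numbers and a string of characters
numbers = [45, 49, 49]
name = "Bulbasaur"

# A more complex structure: a dictionary containing a string, a set and a number
record = {"Name": name, "Types": {"Grass", "Poison"}, "HP": numbers[0]}

print(number_bits, character_code, record["Name"])
```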

The handling of numbers, characters, sequences and complex data structures in Python will be explored in following lessons. To illustrate the diverse ways in which data may be formatted there are two examples below: CSV files and images. You will see many other ways data can be formatted throughout the course.

CSV files¶

CSV stands for Comma-Separated Values. The CSV file format provides a general and convenient way to store data, similar to a table or spreadsheet. In fact, the relation between spreadsheets and CSV files is so close that nearly all spreadsheet software (for example, Excel and Gnumeric) allows data to be both imported from and exported to CSV files.

CSV files normally end in the extension .csv. Each line of the file represents a data record, which is a sequence of values, separated by commas.

Here is an example of the first few lines of a file pokemon.csv:

#,Name,Type 1,Type 2,Total,HP,Attack,Defence,Sp. Atk,Sp. Def,Speed,Generation
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1
4,Charmander,Fire,,309,39,52,43,60,50,65,1
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1

Here, as is often (but not always) the case, the first line is a header line. This is a comma-separated sequence of headers, each of which gives a description of the type of data item in the column below.

As in most CSV data files, each record (i.e. each line apart from the header) represents a group of data values that relate to a particular item, and each of these values represents a particular attribute of that item. Also each item has the same kinds of attribute in the same sequence (corresponding to the column headers). So, each record or line has a definite meaning (in this case giving the vital statistics of a species of cute imaginary creature). However, you should note that the CSV format itself does not specify how the contained information should be interpreted. The header line, if present, just associates certain characters with each column, which may or may not be informative to a programmer. Hence, when writing code that operates on data extracted from a CSV file, it is the responsibility of the programmer to ensure that the data is meaningful and is processed in a way that is appropriate to what it means.

We shall later see many examples of information in CSV files, and will also see how this type of data can easily be read into and manipulated within Python (e.g. using lists or using the powerful DataFrame class provided by the pandas library).
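As a first taste, here is a minimal sketch that reads a few of the records shown above using Python's standard csv module. To keep the example self-contained, the text is held in a string and wrapped in io.StringIO as a stand-in for an open file; the variable names are illustrative only.

```python
import csv
import io

# Three records copied from the pokemon.csv extract, plus the header line
csv_text = """#,Name,Type 1,Type 2,Total,HP,Attack,Defence,Sp. Atk,Sp. Def,Speed,Generation
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
4,Charmander,Fire,,309,39,52,43,60,50,65,1
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1
"""

# DictReader uses the header line to label the values in each record
records = list(csv.DictReader(io.StringIO(csv_text)))

# Every value arrives as a string; the program must decide what it means
hp_values = [int(record["HP"]) for record in records]
average_hp = sum(hp_values) / len(hp_values)

# An empty field is read as an empty string
no_second_type = [record["Name"] for record in records if record["Type 2"] == ""]

print(average_hp, no_second_type)
```

Note that nothing in the file says that HP is a whole number. Converting it with int is a decision made by the programmer, in line with the point above about interpreting CSV data.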

Question

The first column of the example pokemon.csv data file seems to be an ID number. But can you see something odd about it that might cause problems?
What if some of the data you want to store in a CSV file contains commas? For example an address might contain commas. Can we use a CSV file to store such information?

Answers

One of the ID numbers is repeated. If this is not corrected then this could lead to one of the rows with the repeated ID number being lost in some analysis, or misleading results if the two rows become mixed up.
A CSV file can be used to store data that itself contains commas. Two ways to approach this are by quoting values, so that a comma within quote marks is not counted as a delimiter, or using an escape character to indicate that the comma immediately following the escape character should not be understood as a delimiter.
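Both answers can be sketched with Python's standard library. The ID list below is copied from the first column of the extract, while the name and address are invented for illustration. The quoting behaviour shown is the default of the csv module, which is one convention among several.

```python
import csv
import io
from collections import Counter

# 1. Detecting repeated ID numbers (IDs copied from the pokemon.csv extract)
ids = ["1", "2", "3", "3", "4", "5", "6"]
repeated_ids = [value for value, count in Counter(ids).items() if count > 1]

# 2. Storing a value that contains a comma (hypothetical name and address)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "address"])
writer.writerow(["A. Reader", "1 Example Street, Leeds"])
csv_text = buffer.getvalue()

# Reading the text back recovers the address as a single value
rows = list(csv.reader(io.StringIO(csv_text)))

print(repeated_ids)
print(csv_text)
print(rows[1])
```

The writer wraps the address in quote marks because it contains the delimiter, and the reader removes them again.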


Note on file extensions

File extensions may be hidden by your file browsing tool. For example, Windows File Explorer does not show file extensions by default. It is possible to change this setting to show the extension, and you are recommended to do this, since it is often useful to see file extensions when programming. If you do not understand the term file extension (also known as filename extension), or how to change the default view setting, you can read more at the Microsoft Support page.

Images¶

There are several different types of data format that are used to store and to display images.

Each image is stored in a complex digital format that can be interpreted by software and rendered as screen pixels by sending corresponding streams of digits to a computer's GPU (Graphics Processing Unit). Many different formats are used for storing images, but some are much more common than others: GIF, JPEG and PNG are currently the most common.

The image displayed on a computer screen is produced by sending a signal to its video display adapter that is generated from a special area of memory known as a frame buffer. The frame buffer stores a colour value for each pixel of the display in some binary form. In a simplified view of pixel colour encoding, the colour of each screen pixel is represented by the magnitudes of red, green and blue light to be emitted from the pixel (this is the well-known RGB colour encoding). Each of these magnitudes is then represented by a byte (corresponding to a number in the range 0-255).
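The simplified picture of a frame buffer can be sketched as a flat block of bytes, three per pixel. This is a toy model only: the tiny display size, the row-by-row layout and the function names are assumptions made for illustration, and real hardware uses a variety of layouts.

```python
# A toy frame buffer for a 4 x 2 pixel display: 3 bytes (R, G, B) per pixel
width, height = 4, 2
frame_buffer = bytearray(width * height * 3)   # all zeros, so every pixel is black

def set_pixel(x, y, rgb):
    """Store the three colour bytes for the pixel at column x, row y."""
    offset = (y * width + x) * 3
    frame_buffer[offset:offset + 3] = bytes(rgb)

def get_pixel(x, y):
    """Return the (R, G, B) values stored for the pixel at column x, row y."""
    offset = (y * width + x) * 3
    return tuple(frame_buffer[offset:offset + 3])

set_pixel(1, 0, (255, 255, 0))   # full red + full green, no blue: yellow

print(len(frame_buffer), get_pixel(1, 0), get_pixel(0, 0))
```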

The ways in which an image is represented are complex, and vary depending on whether the image is stored in a file or rendered to the screen via a frame buffer. High-level languages such as Python provide simpler ways of accessing and manipulating images. For example, the Python module PIL (Python Imaging Library) defines a class of image data objects which can be:

created from or exported to image files of various formats
displayed on the screen
and modified (e.g. clipped, re-sized, rotated, etc) by means of convenient functions that can be executed within a Python program.

A yellow smiley face with its RGB data showing. — **Figure 2.1:** This image of a smiley yellow face is represented by a grid of pixels; and the colour of each pixel is specified by three numbers representing the proportions of red, green and blue that combine to make the colour.

A simple image processing example using PIL¶

The following example code illustrates how a very simple image processing operation can be accomplished in Python by means of the high-level functionality provided by the PIL library. The code reads in the image from the PNG file images/yellow-smiley.png, and replaces every yellow pixel in the image by a pink pixel. It then saves the result into the file images/pink-smiley.png, as well as displaying the result using Jupyter's display function (which can be used to display a wide variety of kinds of output value in the output area of a Jupyter code cell).

Note that the colour encoding used in PIL's representation of a pixel consists of a tuple of four values, (R, G, B, A), which are each numbers in the range 0-255 and correspond to red, green, blue, and alpha values. The alpha value determines opacity of the pixel, with 0 being fully transparent and 255 being fully opaque.

Don't worry if you do not fully understand this program. It should become clear once you know more about the Python language.

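from PIL import Image

# Open the image and make sure each pixel is an (R, G, B, A) tuple
smiley_image = Image.open("images/yellow-smiley.png").convert("RGBA")

(width, height) = smiley_image.size

yellow = (255, 255,   0, 255)
pink   = (255, 200, 200, 255)

# Visit every pixel and recolour the yellow ones
for x in range(width):
    for y in range(height):
        col = smiley_image.getpixel((x, y))
        if col == yellow:
            smiley_image.putpixel((x, y), pink)

smiley_image.save("images/pink-smiley.png")

# display is available automatically inside a Jupyter code cell
display(smiley_image)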

When run, this displays the smiley face with its yellow pixels replaced by pink ones.

Note on efficient image representation and encapsulation

Real display hardware and image formats often use a more compact encoding of colours than the one described above, for example in terms of a colour palette. A palette allows the colours displayed to be selected from a much larger set of possible colours. Most screen images involve only a small subset of the possible colours, so all the required colours can be represented using a much smaller number of bits than would be required to represent every colour. As a result, less memory is needed and the screen display can be updated more quickly.

A Python image data object may also make use of some form of compression to reduce the amount of memory required to store it. But even if this is the case, it is still possible to access and manipulate the image object in a convenient way. The image can be accessed as if it were an array of pixels, each of which is located at an x, y coordinate and has a colour specified by R, G, B (and A) values. This illustrates the important concept of encapsulation.

Encapsulation is an important concept in computer programming. It arises from the idea that a component of a computer program should be usable without a programmer needing to know all the details of how it works. Although the component may contain a complex implementation, the programmer can use it by just passing it certain information (e.g. the arguments of a function call) and can rely on it computing results according to a specification of what it should do. Thus, the complex implementation is contained within (in other words, 'encapsulated by') the component.

This has several advantages. Perhaps most importantly, it reduces what the programmer needs to know and therefore makes programming easier. Another advantage is that it is possible to change the implementation of the component (perhaps to make it more efficient) without affecting other parts of the program that make use of that component.

Encapsulation applies to data structures as well as to functions. A data type may be represented in a complex and unintuitive format (perhaps because that enables it to be accessed and operated upon efficiently). But when accessing and manipulating the data, it should appear to have a simple intuitive form. So the complexity is encapsulated within the implementation of the data type, rather than being allowed to affect the whole program.

Encapsulation is related to modularity, which is the concept that the functionality of a computer program should be broken down into independent modules. Creating smaller and simpler pieces of functionality means that each one can be understood by a programmer more easily. The parts can then be used together like building blocks to create large, complex systems.
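To make the idea concrete, here is a minimal sketch of an image class that stores its pixels compactly using a colour palette, yet lets the programmer work with plain colour tuples. The class name PaletteImage and its internals are hypothetical and written only for this illustration; this is not how PIL itself is implemented.

```python
class PaletteImage:
    """A toy image that hides a palette-based representation from its users."""

    def __init__(self, width, height, background):
        self.width = width
        self.height = height
        # Internal details: each pixel stores a one-byte index into the palette,
        # so this sketch supports at most 256 distinct colours.
        self._palette = [background]
        self._indices = bytearray(width * height)   # every pixel starts at index 0

    def getpixel(self, xy):
        x, y = xy
        return self._palette[self._indices[y * self.width + x]]

    def putpixel(self, xy, colour):
        x, y = xy
        if colour not in self._palette:
            self._palette.append(colour)
        self._indices[y * self.width + x] = self._palette.index(colour)


image = PaletteImage(3, 2, background=(255, 255, 0))
image.putpixel((1, 1), (255, 200, 200))
print(image.getpixel((1, 1)), image.getpixel((0, 0)))
```

Code that calls getpixel and putpixel never sees the palette. The internal representation could be swapped for a different one without changing any of the calling code.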

The uses of data in computer programs¶

External input data vs internal program data¶

In many coding scenarios, data is derived from a source that is external to a computer program and is then read into and manipulated by the program. So, from the point of view of programming, data has both an external and an internal form.

Programming languages provide a variety of input functions that allow data to be read into a program, either from streams of bytes, or, more commonly, from files. Later in the module, you will look at how Python enables data to be input from files, and how you can access and process various particular types of external data. For now, the module will assume that the data you want to work with has already been input to the program and is stored as internal program data.

When considering the fundamentals of how a program manipulates its internal data, you don’t need to be concerned with what the data means, or how much of it there is. So, this part of the module will look at examples involving small amounts of very simple data, such as just a few numbers or words. Later, you will see that the same programming techniques can be applied to analysing large sets of real data containing meaningful information about the world.

You should also note that not all internal program data is derived from external data. Data is often created for book-keeping purposes, either to control algorithms or to help with the processing of other data. In some cases, for example simulation programs, functionality may depend on processing large amounts of data that is all generated internally.
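The following toy simulation is a sketch of a program whose data is generated entirely internally. The starting population and the 10% growth rule are arbitrary assumptions chosen for illustration.

```python
# All of the data here is created by the program itself; nothing is read in
population = 100
history = [population]          # internal record of the population in each year

for year in range(5):           # 'year' is book-keeping data that controls the loop
    population = population + population // 10   # grow by 10%, rounded down
    history.append(population)

print(history)
```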

Data and algorithms¶

As defined above, an algorithm is a specification of a computational process. Prior to the development of electronic computers, mathematicians specified algorithms by means of symbolic representations and rules for transforming sequences of symbols to give desired results. For example, you can multiply two numbers by writing them down as sequences of digits and then following a sequence of rules that manipulate these digits to obtain a new sequence of digits that represents the result of the multiplication.
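The pencil-and-paper multiplication method can itself be written as a program that works only on sequences of digits. The sketch below is one possible rendering of the familiar long multiplication rules; the function name multiply_digits is chosen for illustration.

```python
def multiply_digits(a, b):
    """Multiply two numbers given as lists of decimal digits (most significant first)."""
    result = [0] * (len(a) + len(b))       # digits of the answer, least significant first
    for i, digit_a in enumerate(reversed(a)):
        carry = 0
        for j, digit_b in enumerate(reversed(b)):
            total = result[i + j] + digit_a * digit_b + carry
            result[i + j] = total % 10     # keep one digit in this column
            carry = total // 10            # carry the rest to the next column
        result[i + len(b)] += carry
    # Remove unused leading zeros, then put the most significant digit first
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result[::-1]


print(multiply_digits([1, 2], [3, 4]))     # the digits of 12 x 34
```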

In the age of electronic computation, the most obvious way to specify an algorithm is by a computer program. Computer languages (such as Python) provide us with convenient symbolic systems to specify algorithms, in a form that can be automatically executed by computer hardware. An algorithm does not necessarily operate on any input data. For instance, one may define an algorithm to search for a solution to a mathematical problem (for example, what is the largest number whose cube is less than 10000). Most algorithms, however, do operate on some kind of data: in some cases just a few numbers, in other cases a huge database of statistics, or a library of images. And nearly all algorithms output some kind of data as a result.
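The cube example can be written as a very short program that takes no input data at all. This sketch assumes that "number" means a whole number, and simply counts upwards until the next cube would be too large.

```python
# Find the largest whole number whose cube is less than 10000
n = 0
while (n + 1) ** 3 < 10000:
    n += 1

print(n, n ** 3)
```

The program reads nothing from outside, but it does still produce output data: the number it has found.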

Typical algorithms will both operate on some input data and generate some output data. So, in a simplified view of computation, all data is either input data or output data. In this simplified view, algorithms would be sequences of computational steps that transform the input data into the desired output data.

Although some kinds of simple computation do fit this simple picture, nearly all complex programs make use of data in a way that is highly interconnected and interwoven with algorithmic computations. Within a computation, data may be created to represent many intermediate forms between its input and output (for example, between the stages of a data processing pipeline). This could involve building complex structures, such as hash tables, networks and trees. Building these intermediate forms supports complex processing operations. Many key algorithms of data analysis, artificial intelligence and machine learning are based upon special-purpose data structures that are generated and manipulated during their execution.
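As a small sketch of an intermediate data structure, the code below finds the most frequent word in a piece of text by first building a dictionary of counts (Python dictionaries are implemented as hash tables). The sentence is invented for illustration.

```python
text = "the cat saw the dog and the dog saw the cat"

# Intermediate structure: a dictionary mapping each word to its number of occurrences
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# The output is derived from the intermediate structure, not directly from the input
most_common = max(counts, key=counts.get)

print(counts)
print(most_common)
```

The dictionary is neither the input nor the output of the computation, but it is what makes the final step simple.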

Question

Can you think of any kinds of algorithm or software system that do not output any form of data?

Possible Answers

A locking mechanism reads input signals and opens a door when a signal matches its stored codes.
A garbage collection routine for memory management.


Question

Could there even be an algorithm that has neither input nor output but still does something its programmer intended?

Possible Answers

Malicious software that is intended to consume computer resources such as processing power or memory does not require either input or output.


Types of computation and application involving data¶

The importance of data in computation is not only due to its variety of forms and the diversity of its content. Data’s importance also comes from the many different ways it can be put to use in algorithms, and the many kinds of applications it can support.

Here are three of the main types of computation involving data, together with some typical examples:

1. Manipulating data to convert it into new forms

Converting input data (e.g. data in a file – perhaps compressed) into a form that is more easily operated on by a program
Formatting an address, so it can be printed nicely on a label
Converting a bibliographic database to a web-site
Compressing text, audio or video, so it can be stored in a smaller file
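Two of these conversions can be sketched in a few lines of Python. The address fields are hypothetical, and zlib is just one of several compression modules in the standard library.

```python
import zlib

# Formatting an address so that it can be printed on a label (hypothetical data)
address = {"name": "A. Reader", "street": "1 Example Street", "city": "Leeds"}
label = "\n".join([address["name"], address["street"], address["city"]])

# Compressing text so that it takes up less space, then restoring it
text = ("data " * 200).encode("utf-8")
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(label)
print(len(text), len(compressed), restored == text)
```

In both cases the information is unchanged. Only its form is different.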

2. Analysing data to extract useful new information

Calculating average life expectancy from birth and death dates of a population
Finding the most powerful earthquake recorded in a dataset of seismic activity
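Both analyses can be sketched using Python's built-in functions. The figures below are invented purely for illustration and are not real measurements.

```python
# Invented sample data: (birth year, death year) pairs
lifespans = [(1900, 1975), (1920, 2000), (1935, 2005)]
ages = [death - birth for birth, death in lifespans]
average_age = sum(ages) / len(ages)

# Invented sample data: (location, magnitude) pairs
earthquakes = [("Site A", 5.1), ("Site B", 7.3), ("Site C", 6.4)]
strongest = max(earthquakes, key=lambda quake: quake[1])

print(average_age, strongest)
```

In each case, the result is new information that was not stored explicitly anywhere in the original data.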

3. Using data to enable some useful functionality

Finding the quickest route between two places on a map
Diagnosing an illness
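The route-finding example can be sketched with a small road map stored as a dictionary of dictionaries. The search used here is Dijkstra's algorithm, a standard choice for this task; the place names, travel times and the function name quickest_route are all made up for illustration.

```python
import heapq

def quickest_route(graph, start, goal):
    """Return (total_minutes, path) for the quickest route, or None if unreachable."""
    queue = [(0, start, [start])]          # entries are (time so far, place, path taken)
    visited = set()
    while queue:
        time, place, path = heapq.heappop(queue)   # always explore the quickest option next
        if place == goal:
            return time, path
        if place in visited:
            continue
        visited.add(place)
        for neighbour, minutes in graph[place].items():
            if neighbour not in visited:
                heapq.heappush(queue, (time + minutes, neighbour, path + [neighbour]))
    return None


# Hypothetical map: travel time in minutes between directly connected places
roads = {
    "A": {"B": 5, "C": 2},
    "B": {"A": 5, "D": 4},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 4, "C": 8},
}

print(quickest_route(roads, "A", "D"))
```

Here the map data is not being converted or summarised. It is being used to provide a function that is useful to a traveller.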

Of course, these types of data-use are not completely separate. Many applications will involve both analysis and manipulation of data, as well as using it to support some functionality.

Application example: a medical diagnosis system¶

Consider the example of a medical diagnosis system and the ways in which it may interact with and manipulate data:

The system might make use of many different forms of input data. Examples could include medical histories, blood test results, genetic information, blood pressure figures and X-ray images.
It may perform a variety of transformations on data items. These could include combining data from different sources, converting between different types of record, calculating derived quantities or adjusting the scale and orientation of images.
It will carry out various forms of analysis, such as finding abnormal measurements, or changes over time. It could also detect unusual image features or identify correlations between different sets of data values.
Based on the results of its data manipulation and analysis, a medical diagnosis system could support several different functions, such as: diagnosing diseases, predicting outcomes of interventions, suggesting treatment and identifying likely causes of illnesses.

Question

What kinds of data would you expect to be generated internally within a medical diagnosis system? When you have thought of some examples, compare them with the possible answers below.

Possible Answers

There are many forms of internal data that could be created. Some examples could include:

Derived values such as BMI could be computed from known measurements.
Data objects could be created that group together multiple pieces of information relating to each individual patient or course of treatment.
Averages and frequencies of measurements over a population of patients could be computed.
Possible paths through a treatment process could be generated in the form of trees or networks.
Images could be scaled and rotated to standard sizes and orientations (for example, so that they can be compared to see the progress of a disease).
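The first three of these answers can be sketched together. The Patient class, its fields and the measurements below are hypothetical; BMI is computed with the usual formula of weight in kilograms divided by the square of height in metres.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    """Groups together several pieces of information about one patient."""
    patient_id: str
    height_m: float
    weight_kg: float

    @property
    def bmi(self):
        # Derived value: not stored in the input, but computed from it
        return self.weight_kg / self.height_m ** 2


# Hypothetical patients with invented measurements
patients = [Patient("P1", 2.0, 80.0), Patient("P2", 1.5, 54.0)]

# An average over the (very small) population of patients
average_bmi = sum(patient.bmi for patient in patients) / len(patients)

print([patient.bmi for patient in patients], average_bmi)
```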


Exercise

Consider the following examples of applications. For each, list:

the types of data that might be used or generated
the types of computation that might operate on the data
any internal forms of data that might be created within the application.

The examples below are all complex kinds of software. There will be a lot of different types of data and operations that could be involved in such systems, so there is no need to make a comprehensive list. But it should be interesting to consider the possibilities.

A route planning and navigation system (for example, Google Maps or something similar)
Web-based software for selling and recommending books or other products (for example, something like Amazon)
A system that simulates the interactions and population dynamics of plant and animal species in a natural environment.

Lesson complete¶

In this lesson, we have discussed how to describe software in terms of the key concepts of: data, data structure, algorithm, architecture and application. We have seen the wide variety of data that is involved in computer applications and how data can be used to support various kinds of software functionality.

Most software and computer applications are complex and use a variety of data and algorithms. Encapsulation and modularity allow complex applications to be understood and maintained more easily.

The first part of this unit shows how functions and classes are defined in Python. Both functions and classes provide ways to structure your code so that the functionality is broken down into smaller parts. Later in the unit we will look at how to load and use data from files, particularly focusing on structured data in CSV files.
