
Data transformations

This lesson covers the following general and overlapping aspects of computational data processing:

  • data transformations
  • manipulating data (including an introduction to the pandas library)
  • data quality
  • data analysis (which will be studied further in the next unit).

Learning outcomes

After completing this lesson you will be able to:

  • explain key concepts surrounding data transformation, manipulation and quality
  • use Python to carry out simple operations on textual information.

Data transformations

Converting data from one form to another is an extremely common part of many computational tasks. Common kinds of transformation include:

  • converting between different representations for the same information
  • cleaning or enhancing data
  • data compression
  • transforming numerical or other abstract information into a visual form (visualisation)
  • combining two or more sources of data into a single dataset.
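
As a minimal sketch of the first and last items in the list above, the following Python fragment converts a date between two textual representations and then combines two small sources of data into one. All names and values here are invented for illustration.

```python
from datetime import datetime

# Converting between representations: the same date written as
# day/month/year text and as ISO 8601 text.
uk_date = "25/12/2023"
iso_date = datetime.strptime(uk_date, "%d/%m/%Y").strftime("%Y-%m-%d")
print(iso_date)

# Combining two sources: prices and stock levels keyed by product name
# (made-up values).
prices = {"apple": 0.5, "pear": 0.75}
stock = {"apple": 10, "pear": 4}
combined = {
    name: {"price": prices[name], "stock": stock[name]}
    for name in prices
    if name in stock  # keep only products present in both sources
}
print(combined)
```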

Of course, any data transformation must be implemented by some kind of algorithm, which carries out a sequence of data manipulations. Resolving data quality issues is itself a kind of data transformation, in which we apply certain manipulations to make the data more suitable for further analysis. As we shall see, data analysis can also be seen as a special type of data transformation, by which we uncover interesting and useful information that is hidden within data.

Transforming text to Morse code

This video gives an example of a data transformation: from text to Morse code.

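
One possible way to express a text-to-Morse transformation in Python is sketched below. It is not necessarily the approach taken in the video: the function name to_morse is hypothetical, only the letters A to Z are covered, and separating words with " / " is an assumed convention.

```python
# International Morse code for the letters A-Z.
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def to_morse(text):
    """Encode text as Morse: letters separated by spaces, words by ' / '."""
    encoded_words = []
    for word in text.upper().split():
        # Characters with no entry in the table (digits, punctuation) are skipped.
        codes = [MORSE[ch] for ch in word if ch in MORSE]
        encoded_words.append(" ".join(codes))
    return " / ".join(encoded_words)


print(to_morse("SOS"))          # ... --- ...
print(to_morse("Hello world"))  # .... . .-.. .-.. --- / .-- --- .-. .-.. -..
```

The transformation is just a dictionary lookup applied character by character, which is why a mapping type is a natural fit here.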

Data manipulation

By data manipulation we mean the application of a wide range of operations and techniques that can be used to implement the transformation from one form of data to another. These range from conceptually simple operations, such as splitting text into sentences or sorting a list of cities by population, to very complex processing, such as creating a statistical model of natural language by feeding a corpus of text data into a neural network.
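
The two simple operations mentioned above might look like this in core Python. This is a rough sketch: the sentence splitting is deliberately naive (it breaks on every full stop, so abbreviations would confuse it), and the city names and populations are made-up placeholders.

```python
text = "Data is everywhere. It is not always clean. We can transform it."

# Naive sentence splitting: break on full stops and drop empty pieces.
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)

# Sorting (city, population) pairs from largest to smallest population.
cities = [("Avalon", 120_000), ("Brigadoon", 45_000), ("Camelot", 310_000)]
by_population = sorted(cities, key=lambda pair: pair[1], reverse=True)
print(by_population)
```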

Fortunately, Python is very well suited to carrying out a wide range of data manipulations. The core language gives the programmer a flexible set of operations, and a large number of modules can be imported to provide easy access to powerful processing capabilities developed by other programmers. In this unit, we will further develop expertise in applying Python's core capabilities to data manipulation tasks, and will also look at how the pandas package provides a rich library of additional functionality specially designed to support complex manipulation of large datasets.
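
As a small preview of the kind of manipulation pandas supports, the sketch below builds a tiny table, derives a new column and then filters and sorts the rows. It assumes pandas is installed; the column names and values are invented for illustration.

```python
import pandas as pd

# A small table with made-up values; real datasets are usually loaded from a file.
df = pd.DataFrame({
    "city": ["Avalon", "Brigadoon", "Camelot"],
    "population": [120_000, 45_000, 310_000],
    "area_km2": [60.0, 30.0, 155.0],
})

# Derive a new column from existing ones (computed for every row at once).
df["density"] = df["population"] / df["area_km2"]

# Keep only the larger cities and sort them by population, largest first.
large = df[df["population"] > 100_000].sort_values("population", ascending=False)
print(large)
```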

Data quality

Nowadays it is easy to get access to vast amounts of data of a huge variety of types. However, we cannot always rely on data to provide exactly what we want from it. There are many ways in which data can mislead: it may be inaccurate, incomplete or unrepresentative, and often it will be just a little different in form or content from what we were expecting. These issues present a major obstacle for data science. However, data scientists (at least the more conscientious ones) are well aware of the problems that can arise from poor data, and many techniques have been developed to mitigate them. As mentioned above, these techniques take the form of transformations whose purpose is to convert data into a 'cleaner', more reliable or more appropriate form.
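
A cleaning transformation of this kind can be very simple. The sketch below assumes a hypothetical list of place names containing typical problems (missing values, stray whitespace and inconsistent capitalisation) and converts it into a more consistent form.

```python
# Hypothetical raw values showing common quality problems.
raw_records = ["  Leeds ", "york", "", "LEEDS", None, "York"]

cleaned = []
for value in raw_records:
    if value is None:
        continue              # drop missing values
    value = value.strip()     # remove surrounding whitespace
    if not value:
        continue              # drop empty strings
    cleaned.append(value.title())  # normalise capitalisation

# Remove duplicates that only became visible after cleaning.
unique_names = sorted(set(cleaned))
print(cleaned)
print(unique_names)
```

Whether dropping incomplete records is acceptable depends on the task; in other situations it may be better to flag them or fill in a default value.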

Data analysis

Data analysis can be regarded as an extreme case of transformation, by which certain significant quantities, patterns, relationships and statistics are extracted from data. Certainly, any particular analytical computation is a transformation of its input data into a result. However, analysis can also be regarded as a more general exploratory activity, in which investigation of data is driven by questions that we attempt to answer by applying computations to process and extract information from the data.
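
For instance, the question "which words occur most often in this text?" can be answered by transforming the text into a table of word counts. The following is a minimal sketch using an invented sample sentence and Python's standard collections.Counter class.

```python
from collections import Counter
import string

text = "The cat sat on the mat. The mat was flat."

# Split into words, strip surrounding punctuation and ignore case.
words = [w.strip(string.punctuation).lower() for w in text.split()]
words = [w for w in words if w]  # discard anything that was only punctuation

counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```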

An example of applying basic text processing operations

This video gives an example of some simple exploratory text analysis.


Lesson complete

The rest of this unit will build on the core Python knowledge you already have and give you the tools to work with larger amounts of data and more complex kinds of data manipulation and transformation.
