Data analysis¶
In this video, Brandon welcomes you to Unit 4 and gives an overview of the topics you will cover.
This lesson will introduce you to the general ideas and methodology of data analysis. We will examine the structure of the process by which data can be analysed in order to reveal significant answers to meaningful questions. We shall see that this is a dynamic process which involves many choices and may require one to deal with a wide variety of issues.
Learning outcomes
After completing this lesson you should:
- understand the pipeline architecture and how it can be applied to data analysis
- be able to explain various different ways in which data analysis can progress in the light of evaluation of results.
The concept of analysis¶
Let us consider the meaning of the word analysis. This may not be directly relevant to practical programming tasks; but, in order to achieve insight as a data scientist, you will need to develop a broad perspective on the field. Awareness of the meanings and interpretations of key terms will help you develop this.
Here are some possible definitions of the word analysis:
- The process of separating something into its constituent elements.
- Resolution of anything complex into simple elements.
- Detailed examination of the elements or structure of something.
These definitions correspond well with certain types of data analysis that we might perform. For example, there are many cases where we have a set of heterogeneous information and want to split it into classes (and often also quantify or rank those classes). This could include:
- finding the breakdown of different types of people visiting a website in terms of attributes such as age, gender, nationality etc.
- quantifying and ranking the volume of information stored on servers in each country
- classifying causes of death in a population.
Analysis of this kind is often very useful and sometimes may reveal surprising results that were not so obvious from the mixed-up data that we started with. However, the methods of data science are certainly not limited to these purely reductionist forms of investigation.
A fundamental aspect of data science is that (when done properly) it is not just a set of techniques that can be applied to data to give results. It is a dynamic process of investigation and interpretation in which each finding guides subsequent investigation. In the rest of this lesson, we shall consider ways in which this process can be organised and can progress towards finding useful and interesting answers to significant questions.
Data analysis pipelines¶
It was mentioned in the previous unit that data analysis can be considered as a particular type of transformation, where we apply processing operations to data in order to obtain useful or interesting results. Hence, the simplest picture of data analysis is as follows:
However, the transformation required to derive anything useful or interesting from the input data may be very complex and will often involve a long sequence of processing operations. Hence, in order to make the programming task manageable and to ensure that the analysis is robust and reliable, the processing will typically be broken down into several steps. In fact, in many cases it will be broken down into a large number of relatively simple steps. This leads us to the following picture of data analysis:
This way of organising computational processing is known as a pipeline architecture. Typically, the pipeline will begin with low level operations such as 'cleaning' the data to resolve quality issues. It will then progress through various stages of selection and organisation. Towards the end of the pipeline, sophisticated data mining, machine learning or other AI algorithms might be applied. As well as helping programmers to organise code into manageable modules, the pipeline architecture makes it easy to check intermediate results at many stages, which facilitates identification of bugs and hence helps ensure that the analysis is reliable.
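To make this concrete, here is a minimal sketch of a pipeline in Python. The stage functions, the example records and the run_pipeline helper are all invented for this illustration; a real pipeline would use domain-specific steps (and often libraries such as pandas), but the overall shape is the same.

```python
# A minimal sketch of a pipeline architecture: each stage is a small function,
# and the pipeline is simply the sequence in which they are applied.
# The stages and the data below are hypothetical, for illustration only.

def clean(records):
    """Low-level 'cleaning' stage: drop records with a missing age."""
    return [r for r in records if r.get("age") is not None]

def select_adults(records):
    """Selection stage: keep only visitors aged 18 or over."""
    return [r for r in records if r["age"] >= 18]

def summarise(records):
    """Final stage: count visitors by country."""
    counts = {}
    for r in records:
        counts[r["country"]] = counts.get(r["country"], 0) + 1
    return counts

def run_pipeline(data, stages):
    """Apply each stage in turn, keeping intermediate results for checking."""
    intermediate = [data]
    for stage in stages:
        data = stage(data)
        intermediate.append(data)   # easy to inspect at every step
    return data, intermediate

visitors = [
    {"age": 34, "country": "UK"},
    {"age": None, "country": "FR"},
    {"age": 17, "country": "UK"},
    {"age": 52, "country": "FR"},
]

results, steps = run_pipeline(visitors, [clean, select_adults, summarise])
print(results)   # {'UK': 1, 'FR': 1}
```

Note how keeping the list of intermediate results makes it straightforward to check the output of every stage, which is one of the practical benefits of the pipeline architecture mentioned above.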
An iterative cycle of analysis¶
The pipeline architecture organises data analysis in a way that is conceptually simple and beneficial for reliable implementation. However, in relation to the process of analysis that is actually carried out by data scientists, it leaves out several very important aspects of the situation, and thus is somewhat misleading.
In fact, the pipeline picture misses out the essential motivation for data analysis, which is this:
We have questions and we want answers.
So, perhaps the following picture is more appropriate:
Maybe the above picture is accurate, but it is not very informative. Yes, data analysis is about finding answers to questions. But how does it work? Well, as we have seen, in the field of data science it is achieved by processing data by computational transformations, in a way that is usually organised in terms of a pipeline architecture. To get a better picture, we need to fill in the structure of how data and transformations play their role in data analysis.
Let us first consider the simple picture of a single transformation from data to results. In the context of our goal of finding answers to questions, we can incorporate our data-driven, computational method of analysis into the picture as follows:
Here we see that questions motivate us to procure relevant data, which we explore by means of computational transformations in order to derive results, which we then evaluate with the intention of finding answers.
Of course, as outlined above, we will most likely break down a complex transformation into a sequence of steps forming a pipeline.
We must also note another very important feature of Figure 4.4: it indicates that the results of a transformation do not directly give us answers. In order to find meaningful answers, we need to evaluate the results that have been obtained. Moreover, when results are evaluated we do not only find answers; we are very likely to realise that many other questions remain. In fact, one may realise that only limited answers can be drawn from the results obtained so far and that further analysis is needed to get any significant or useful answers. This is shown in the diagram by the split in the arrow that represents the consequences of evaluation: it can reveal answers, but it will usually also lead one to reassess the issues and raise more questions.
Since reassessment brings us back to questions, and the purpose of analysis is to answer questions, it may seem that we can simply repeat the process of analysis, applying it to the questions arising from the first analysis, and then, if further questions arise, repeat the process again to reach more and more significant answers.
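This naive 'repeat until answered' idea could be sketched as follows. The helper functions procure_data, build_pipeline and evaluate are hypothetical placeholders for the real work a data scientist would do; the sketch only shows the overall loop.

```python
# A purely illustrative sketch of the naive iterative cycle:
# keep analysing until no open questions remain.  The helper functions
# passed in are hypothetical placeholders for real data-science work.

def analyse(questions, procure_data, build_pipeline, evaluate):
    answers = []
    open_questions = list(questions)
    while open_questions:
        question = open_questions.pop(0)
        data = procure_data(question)          # find relevant data
        pipeline = build_pipeline(question)    # decide on processing steps
        results = data
        for step in pipeline:
            results = step(results)            # run the pipeline
        new_answers, new_questions = evaluate(question, results)
        answers.extend(new_answers)            # evaluation yields answers...
        open_questions.extend(new_questions)   # ...but usually also new questions
    return answers
```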
But it is not so simple. There are many ways in which we could address the issues that have arisen and continue on to further analysis. Here are some possibilities:
- Extend the processing pipeline. This is the most straightforward way to continue. We realise that by adding further steps to the analysis pipeline we can get more detailed or more useful results.
- Modify the pipeline. In many cases, we may realise that we cannot simply add processing steps to the end of the pipeline, since there are limitations or problems occurring earlier in the process. Hence, we may need to insert additional steps into the pipeline, or modify or replace earlier steps.
- Use different data. This may seem radical, since we may have been thinking of the data as the origin of the analysis process and so be reluctant to abandon or change it. However, the motivation and true origin of analysis is questions, and it may be that the data that we have is unsuitable or inadequate to answer these questions. Recognising when data is not adequate to yield the information we seek is a vital skill for a data scientist. Having appropriate and sufficiently high quality data is essential for revealing interesting and potentially useful insights.
- Incorporate additional data. Another very common outcome of evaluating the results of analysis is that we realise that we simply do not have enough data, either in terms of sheer quantity or perhaps in the variety of samples or the range of attributes that have been recorded. Many algorithms, especially those involving statistical calculations and/or machine learning, require large amounts of data to give reliable and accurate results. With only a small amount of data you may get results that will not apply to other cases, and hence the answers one appears to get may be misleading. A similar problem occurs in relation to the variety of attributes in the data. If this is too limited, it may not be possible to yield the information we wish to obtain. (In particular, we may be unable to accurately establish a classification that we wish to make.) It is actually very common for a data analysis project to begin by investigating a fairly small dataset, and then, once the results of this pilot study have been analysed, look for a larger and/or more diverse dataset that can yield more information that is relevant to the questions one wishes to answer.
- Refine or revise our questions. This is the truly radical form of reassessment, which does take us back to the beginning of our endeavour. After evaluating the results of data analysis, one may realise that the questions one originally wanted to answer are in fact poorly defined or ill-conceived. Here are some ways in which we may find that our questions are not as precise or interesting as we had assumed:
  - The question is too narrow. It only applies to a small number of cases and is not significant in a wider context.
  - The question is too broad. Although the question may be answered in a general way, it hides a large amount of variation which has a significant and interesting form.
  - The question is not sufficiently clear. This may be revealed if, when trying to interpret results in order to answer the question, it is found that there is ambiguity and subjective judgement needs to be applied.
  - The answer to the question varies depending upon other factors, such as when and where it is asked.
  - The question incorporates unwarranted assumptions about causal relationships. For example, the question could ask: how does property \(X\) depend on property \(Y\)? Analysis may appear to give answers to this. However, upon further analysis it may be recognised that properties \(X\) and \(Y\) are both strongly affected by a third, previously unconsidered factor, \(Z\). Hence, the apparent correspondence between \(X\) and \(Y\) may be only due to their both being dependent on \(Z\). For example, it may be found that people who use a certain kind of toothpaste have, on average, teeth in much better condition than those who do not. But that could just be because, for other reasons (such as price or marketing), that toothpaste is mostly bought by people who do not eat much sugary food.
  - Answering the question would require kinds of investigation that cannot be conducted simply by gathering and analysing data.
Question about questions
Consider the following questions with regard to why they may need to be clarified, refined or revised:
- What is the relationship between nationality and life expectancy?
- Does food affect intelligence?
- How does the human brain work?
- Who is the most popular pop star?
- Do Californian LISP programmers earn more than French Prolog programmers?
- Does eating sushi increase life expectancy?
- Are people who can program in Python cleverer than those who cannot?
- Why are people who can program in Python cleverer than those who cannot?
When you have considered the questions, share your thoughts in the discussion thread in Minerva. Access the thread for this topic by selecting the 'Discussion' tab in the side navigation in Minerva and selecting the 'Questions about questions' thread for this lesson.
A branching picture of analysis¶
The considerations of the previous section suggest that analysis is not a linear process where each successive iteration is determined by the results of the last. It is actually a process that involves many choices among possible ways to proceed.
There are choices even at the outset, when we decide what data to procure and how to explore it. And each time we evaluate some results there are further choices: whether we are content with our answers or whether we want to proceed with further analysis; and if we want to extend the analysis, in what direction should we proceed? Do we need different data, more data, more processing steps, different kinds of processing, or some combination of these? Hence, we should view the process of analysis as a branching structure in which we choose how to progress after each evaluation. Bearing this in mind, we can picture the scenario of data analysis culminating in evaluation as follows:
Since each question that arises can lead to further analysis, the possible paths that we could take form a tree, where each branch corresponds to a possible sequence of analysis tasks. Of course the answers we may find along some paths may be much more interesting or useful than those we find on others.
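To make the branching idea concrete, here is a small illustrative sketch in which possible ways of continuing an analysis are stored as a tree, and each root-to-leaf path corresponds to one possible sequence of analysis tasks. The particular steps named are made up for the example.

```python
# A minimal sketch: possible ways of continuing the analysis form a tree.
# Each key is a (hypothetical) analysis step; its children are the choices
# available after evaluating that step's results.

analysis_tree = {
    "explore pilot dataset": {
        "extend pipeline with classification step": {},
        "procure larger dataset": {
            "extend pipeline with classification step": {},
            "refine the original question": {},
        },
    }
}

def paths(tree, prefix=()):
    """Enumerate every root-to-leaf path, i.e. every possible sequence of tasks."""
    for step, subtree in tree.items():
        route = prefix + (step,)
        if subtree:
            yield from paths(subtree, route)
        else:
            yield route

for route in paths(analysis_tree):
    print(" -> ".join(route))
```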
We shall see in the next unit that this branching structure of data analysis is directly analogous to the abstract structure known as a search space, which underpins the way in which many artificial intelligence algorithms find solutions to a wide variety of problems.
Analysis vs synthesis¶
In the study of philosophy and literature, the concept of analysis is often contrasted with synthesis:
- Analysis is a kind of enquiry that breaks down its subject of examination to find its basic elements and structure.
- Synthesis (in this sense) compares and brings together different subjects and phenomena to reveal connections and correspondences between them.
Synthesis in data science¶
In data science, the term data analysis can mean any of a very wide range of data processing techniques, algorithms and evaluation tools that may be applied to derive information from data. Hence, the term data analysis can include both analysis in its original sense of breaking down data into underlying components and structures; and also synthesis, since we are often interested in comparing and identifying correlations between different kinds of data value or dataset.
The possibility and potential benefit of synthetically combining two or more sets of findings from different prior investigations mean that the branching tree picture shown in Figure 4.6 is still a simplification of the ways in which one can progress towards answering questions and revealing insights. Hence, we should really imagine an even more complex picture in which, as well as branching, there may also be convergence and merger between different paths of investigation.
Data mining¶
Although we shall not be covering data mining in any detail in this module, now is a good time to briefly introduce the idea, since it relates to the idea of synthesis.
Data mining is a kind of investigation where we try to use systematic means, in the form of algorithms, to automatically explore patterns and connections that exist within diverse data values and data sources, in order to identify relationships of which we had no prior awareness. Hence, it attempts to provide a kind of automatic synthesis, which could potentially give insights that go beyond prior expectations and are not necessarily answers to previously posed questions. To achieve this, it employs techniques that process data in a very general way, looking for general kinds of correspondence, relationships or structure. So data mining can be regarded as a kind of unsupervised machine learning.
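As a very small illustration of this unsupervised flavour, the sketch below checks every pair of numeric attributes in a made-up dataset and reports any pair whose correlation is strong. Real data mining algorithms (association rule mining, clustering and so on) are far more sophisticated, but the idea of searching for relationships we did not ask about in advance is similar. The dataset and the 0.8 threshold are assumptions invented for the example.

```python
# A toy sketch of unsupervised pattern search: compute the correlation between
# every pair of numeric attributes and report the strong ones.
from itertools import combinations
from statistics import correlation   # requires Python 3.10+

records = [
    {"age": 23, "income": 21000, "screen_hours": 6.5},
    {"age": 35, "income": 34000, "screen_hours": 4.0},
    {"age": 48, "income": 51000, "screen_hours": 2.5},
    {"age": 60, "income": 58000, "screen_hours": 2.0},
]

attributes = records[0].keys()
for a, b in combinations(attributes, 2):
    r = correlation([rec[a] for rec in records], [rec[b] for rec in records])
    if abs(r) > 0.8:   # arbitrary threshold for 'strong'
        print(f"possible relationship between {a!r} and {b!r}: r = {r:.2f}")
```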
Exercise
Research the concept of data mining and gain a basic knowledge of one or more types of algorithm that are used to implement it.
Aside
Inductive Logic Programming (ILP) is a logic-based approach to knowledge acquisition that bears a similarity to data mining. It is essentially a method of automatically extracting logical relationships from data, which can be used to identify rules that are satisfied by the information contained in that data and can potentially be applied to reason about and draw conclusions from other sets of data with similar content.
Work involving the use of ILP was conducted at Leeds some time ago, as part of the CogVis Project on interpreting video data. An interesting paper entitled Protocols from perceptual observations presents an ILP-based method for identifying the rules of a game (such as a simple card game) from videos of people playing the game.
Data synthesis and synthetic data¶
In fact, in the field of data science, the term data synthesis does not mean comparing or bringing together different kinds of data, but something completely different. It actually means creating artificial data, which we can then use to develop and test our software.
Question
Why would we want to create artificial data?
How would we create artificial data?
Answer
The main reason that we might want to do this is that we may not have access to sufficient quantities of actual data to carry out this development and testing.
In some cases, synthetic data may be based on some actual data, but in most cases it is created directly by an algorithm that produces data that we believe is similar to real data that we might find. Typically the algorithm would involve randomisation or simulation or both. The algorithm builds in certain frequencies, dependencies and conditions that we know to hold in the real case. This is why we believe it is similar to real data.
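As a minimal sketch of this idea, the generator below produces random records with one dependency deliberately built in (spend tends to rise with age). The attribute names, ranges and distributions are all assumptions invented for the illustration.

```python
# A minimal sketch of data synthesis by randomisation.  The attributes,
# distributions and the built-in age/spend dependency are invented
# assumptions, chosen only to show the general shape of such a generator.
import random

def synthesise_visitors(n, seed=0):
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(18, 80)
        # Build in a dependency we believe holds in the real data:
        # spend tends to rise with age, plus some random noise.
        spend = max(0.0, 0.5 * age + rng.gauss(0, 5))
        records.append({"age": age, "spend": round(spend, 2)})
    return records

print(synthesise_visitors(3))
```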
An example of data synthesis:
Suppose we want to develop software that can identify places on the Earth's surface where we are likely to find diamonds – and that would certainly be useful. From theories of minerals and geology, we may know enough about the formation of diamonds and the way that the Earth's surface can change over time to be able to write a simulation program that would generate a semi-random example of a landscape, along with locations of different minerals that have been created by geological processes. To use this artificial data to test our software, we can hide any information that we would not have access to if the data were real (i.e. the locations of the diamonds) but keep the information we would easily be able to obtain (such as the land surface elevation and the regions occupied by different types of rock). Then we use the synthetic land surface and rock types data as test input to develop our diamond finding software. And of course, after we have run the software we can check how well the regions it identifies correspond to the locations of the diamonds.
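The overall workflow of the diamond example might be sketched as follows. The 'geology' here is entirely made up; the point is only the shape of the process: simulate a landscape, hide the ground truth (the diamond locations), predict from the visible data, then check the predictions against the hidden truth.

```python
# A toy sketch of the diamond example: simulate a landscape, hide the
# ground truth, predict from the visible data, then check the predictions.
# The 'geological' rule used here is an invented assumption for illustration.
import random

def simulate_landscape(size=10, seed=1):
    rng = random.Random(seed)
    elevation = [[rng.uniform(0, 1000) for _ in range(size)] for _ in range(size)]
    # Invented rule: diamonds are more likely to form under high ground.
    diamonds = {(i, j) for i in range(size) for j in range(size)
                if elevation[i][j] > 800 and rng.random() < 0.5}
    return elevation, diamonds          # diamonds = hidden ground truth

def predict_diamond_sites(elevation, threshold=750):
    """Our 'diamond finding software': guess from the visible data only."""
    size = len(elevation)
    return {(i, j) for i in range(size) for j in range(size)
            if elevation[i][j] > threshold}

elevation, true_sites = simulate_landscape()
guessed = predict_diamond_sites(elevation)
hits = guessed & true_sites
print(f"found {len(hits)} of {len(true_sites)} diamond sites, "
      f"with {len(guessed - true_sites)} false alarms")
```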
Of course, there can be problems with using synthetic data. It may not actually be similar to real data. The assumptions we made or the data creation algorithm may be incorrect. We may have built in some correlation that does not exist in real life. Hence, in most cases it is important to also do some reality check testing with real data. But this may only require a small amount of real data compared with the amount we need for developing our system (especially if the system is using some kind of machine learning).
Lesson complete¶
You should now have a better understanding of pipeline architecture and how it applies to data analysis, and the different ways in which data analysis can progress in the light of evaluation of results. In the next lesson, you’ll explore how to acquire data from the internet.
Select the next button to continue to the next lesson.