Introduction to Data Analysis and Libraries
Ahren Stevens-Taylor
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, we introduce data analysis and the libraries that support it. Data is raw information that can exist in any form, usable or not. We can easily find data everywhere in our lives; for example, the price of gold today is $1,158 per ounce. On its own, this figure has no meaning beyond describing the gold price. This also shows that the usefulness of data depends on context.
By relating pieces of data to one another, information emerges and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might obtain is that the price has risen continuously from $1,152 to $1,158 over three days. Such information is useful to someone who tracks gold prices.
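As a minimal sketch of how relating data points produces information, the snippet below checks whether a short price series rises from day to day. The dates and prices are illustrative values in US dollars per ounce, and the helper name is_rising is an assumption made for this example.

```python
from datetime import date

# Illustrative daily gold prices (USD per ounce); dates and values are hypothetical.
prices = [
    (date(2015, 10, 1), 1152.0),
    (date(2015, 10, 2), 1155.0),
    (date(2015, 10, 3), 1158.0),
]


def is_rising(series):
    """Return True if every price is higher than the one before it."""
    values = [price for _, price in series]
    return all(earlier < later for earlier, later in zip(values, values[1:]))


# A single price is data; the relation between prices is information.
rising = is_rising(prices)
change = prices[-1][1] - prices[0][1]
print(f"Rising for {len(prices)} days: {rising}, total change: {change:+.2f}")
```

Each tuple alone is just data; the rising flag and the total change only exist once the values are compared with one another.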
Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. For example, knowing that after the gold price has risen for three consecutive days it tends to decrease slightly on the next day is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process and we will introduce it in the next section:
In this article, we will cover the following topics:
- Data analysis and process
- Overview of libraries in data analysis using different programming languages
- Common Python data analysis libraries
Data analysis and process
Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need a variety of skills drawn from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and the knowledge domain, as shown in the following figure:
Let us go through data analysis and its knowledge domains:
- Computer science: We need this knowledge to provide abstractions for efficient data processing. Basic Python programming experience is required. We will introduce Python libraries used in data analysis.
- Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products.
- Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions.
- Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect? Based on the expertise, we explore and analyze raw data by applying the above techniques, step by step.
Data analysis is a process composed of the following steps:
- Data requirements: We have to define what kind of data will be collected based on the requirements or problem analysis. For example, if we want to detect a user’s behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user’s time spent on different pages.
- Data collection: Data can be collected from a variety of sources: mobile devices, personal computers, cameras, or recording devices. It may also be obtained in different ways: through communication, events, and interactions between person and person, person and device, or device and device. Data appears everywhere and at all times. The problem is, how can we find and gather it to solve our problem? This is the mission of this step.
- Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: How fast can we create, insert, update, or query data? For building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store data? What kind of data structure is suitable for our purposes, such as analysis, statistics, or visualization?
- Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those problems and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, and column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user’s history of visits to a news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user’s behavior, so we should remove them before saving the data to our database. Another situation we may encounter is click fraud on news: someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user’s behavior. We can use thresholds to check whether a page visit event comes from a real person or from malicious software.
- Exploratory data analysis: Now, we can start to analyze data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems in data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of data sets, especially if they are large or high-dimensional.
- Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users’ gender based on their news reading behavior by applying classification models such as Support Vector Machine (SVM) or logistic regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product.
- Data product: The goal of this step is to build data products that receive data input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms as well as manage the data storage.
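The data cleaning step described above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the event layout (user, URL, timestamp), the sample rows, the helper clean_visits, and the threshold of three visits per page are all hypothetical and would need tuning against real traffic.

```python
from collections import Counter

# Hypothetical page-visit events: (user_id, article_url, unix_timestamp).
visits = [
    ("alice", "/news/economy-1", 1000),
    ("alice", "/news/economy-1", 1002),  # page refresh, same article
    ("alice", "/news/sports-7", 1100),
    ("bot42", "/news/economy-1", 1000),
    ("bot42", "/news/economy-1", 1001),
    ("bot42", "/news/economy-1", 1002),
    ("bot42", "/news/economy-1", 1003),
]

MAX_VISITS_PER_PAGE = 3  # assumed threshold, not a recommended value


def clean_visits(events, max_visits=MAX_VISITS_PER_PAGE):
    """Drop suspected automated traffic, then keep one row per (user, url)."""
    counts = Counter((user, url) for user, url, _ in events)
    # A user who exceeds the threshold on any page is treated as suspicious.
    suspicious = {user for (user, _), n in counts.items() if n > max_visits}

    seen = set()
    cleaned = []
    for user, url, ts in events:
        if user in suspicious or (user, url) in seen:
            continue
        seen.add((user, url))
        cleaned.append((user, url, ts))
    return cleaned


cleaned = clean_visits(visits)
print(cleaned)
```

A fixed count per page is a deliberately crude rule; a real product would more likely combine several signals, such as the time between requests.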
Overview of libraries in data analysis
There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages and have different advantages and disadvantages in solving various data analysis problems. Now, we introduce some common libraries that may be useful for you. They should give you an overview of libraries in the field. However, the rest of this article focuses on Python-based libraries.
Some of the libraries that use the Java language for data analysis are as follows:
- Weka: This is the library that I became familiar with when I first learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance, sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/).
- Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package to Mallet, called GRMM, that contains support for inference in general graphical models and training of conditional random fields (CRFs) with arbitrary graphical structure. In my experience, the library’s performance and algorithms are better than Weka’s. However, it focuses only on text processing problems. The reference page is at http://mallet.cs.umass.edu/.
- Mahout: This is Apache’s machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The homepage is at http://mahout.apache.org/.
- Spark: This is a relatively new Apache project; supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html.
Here are a few libraries that are implemented in C++:
- Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research. It has been used to learn a tera-feature (10^12) dataset on 1000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198.
- MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012, in the Journal of Machine Learning Research, MultiBoost: A Multi-purpose Boosting Package, D.Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl.
- MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focusses on scalability, speed, and ease-of-use and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html.
- Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/.
Other libraries for data processing and analysis are as follows:
- Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis.
- Modular toolkit for data processing (MDP): This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html).
- Orange: This is an open source data visualization and analysis tool for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/).
- Mirador: This is a tool for the visual exploration of complex datasets, supporting Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data (http://fathom.info/mirador/).
- RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/).
- Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations and is, therefore, a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer.
- Natural Language Toolkit (NLTK): This is a Python library for natural language processing, with features such as tokenization, stemming, tagging, and parsing.
I cannot list all the libraries for data analysis here. However, I think the above libraries are enough to keep you busy learning and building data analysis applications for a long time.
Python libraries in data analysis
Python is a multi-platform, general purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and the .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries such as NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.
One of the fundamental packages used for scientific computing with Python is NumPy. Among other things, it contains the following:
- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions for performing array computations
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra operations, Fourier transforms, and random number capabilities.
Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases.
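The NumPy features listed above can be tried out in a few lines. The following is a minimal sketch with made-up values, assuming a reasonably recent NumPy release (np.random.default_rng is not available in very old versions).

```python
import numpy as np

# An N-dimensional array object (here 2-D) holding illustrative values.
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Broadcasting: the 1-D row is stretched across both rows of `a`.
shifted = a + np.array([10.0, 20.0])

# Linear algebra: solve a @ x = b for x.
b = np.array([5.0, 11.0])
x = np.linalg.solve(a, b)

# Fourier transform of a constant signal: all energy lands in the zero-frequency bin.
spectrum = np.fft.fft(np.ones(4))

# Random numbers from a seeded generator, so runs are repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)

print(shifted)
print(x)
print(spectrum.real)
print(samples.shape)
```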
Pandas is a Python package that supports rich data structures and functions for analyzing data and is developed by the PyData Development Team. It is focused on the improvement of Python’s data libraries. Pandas consists of the following things:
- A set of labelled array data structures, the primary ones being Series, DataFrame, and Panel
- Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
- An integrated group by engine for aggregating and transforming datasets
- Date range generation and custom date offsets
- Input/output tools that load and save data from flat files or PyTables/HDF5 format
- Memory-efficient versions of the standard data structures
- Moving window statistics and static and moving window linear/panel regression
Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions such as financial data analysis applications.
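A few of these Pandas features are sketched below: date range generation, a labelled Series, a moving window statistic, and the group by engine. The dates, prices, and visit records are made up for illustration, and the rolling() call assumes a reasonably recent Pandas release.

```python
import pandas as pd

# Date range generation: five consecutive calendar days (hypothetical dates).
days = pd.date_range("2015-10-01", periods=5, freq="D")

# A labelled Series indexed by date; the prices are made-up values.
prices = pd.Series([1152.0, 1155.0, 1158.0, 1154.0, 1160.0], index=days)

# Moving window statistic: 3-day rolling mean (the first two entries are NaN).
rolling_mean = prices.rolling(window=3).mean()

# A DataFrame plus the group by engine for aggregation.
visits = pd.DataFrame({
    "user": ["alice", "bob", "alice", "bob", "alice"],
    "category": ["economy", "sports", "sports", "sports", "economy"],
    "seconds": [30, 45, 10, 60, 20],
})
time_per_user = visits.groupby("user")["seconds"].sum()

print(rolling_mean)
print(time_per_user)
```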
Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats. It supports many kinds of plots, such as line plots, contour plots, scatter plots, and Basemap plots. It comes with a set of default settings, but allows customizing all kinds of properties. In most cases, however, we can easily create a chart using the defaults for almost every property.
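As a minimal sketch of plotting with mostly default settings, the example below draws a line plot and a scatter plot of made-up data and renders the figure to an in-memory PNG. The non-interactive Agg backend is selected here only as an assumption, so that the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up sample data for illustration.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, label="line plot")  # defaults for colour, line width, and so on
ax.scatter(x, y, marker="o", label="scatter plot")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.legend()

# Render to an in-memory PNG; fig.savefig("figure.pdf") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)

print(len(png_bytes), "bytes written")
```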
PyMongo is a Python distribution containing tools for working with MongoDB. Many tools, such as MongoKit, Humongolus, MongoAlchemy, and Ming, have also been written on top of PyMongo to add more features.
scikit-learn is an open source machine learning library for the Python programming language. It supports various machine learning tasks, such as classification, regression, and clustering, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. At the time of writing, the latest scikit-learn version is 0.16.1, published in April 2015.
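To give a feel for the library, here is a minimal classification sketch using a support vector classifier. The training points and labels are toy values invented for this example, and the linear kernel is simply one reasonable choice for data this well separated.

```python
from sklearn.svm import SVC

# Toy, made-up training set: two features per sample, two well-separated classes.
X_train = [
    [0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
    [4.0, 4.0], [4.5, 5.0], [5.0, 4.0],
]
y_train = [0, 0, 0, 1, 1, 1]

# A support vector classifier with a linear kernel.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Classify two unseen points, one near each cluster.
predictions = model.predict([[0.2, 0.3], [4.8, 4.6]])
print(predictions)
```

Most scikit-learn estimators follow the same fit and predict pattern, so swapping in a different model usually means changing only the line that constructs it.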
In this article, we presented three main points. First, we described the relationship between raw data, information, and knowledge. Because of the value that knowledge contributes to our lives, we then gave an overview of data analysis and its processing steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications, with a focus on the Python libraries.