Exploratory Data Analysis, popularly known as EDA, is an approach to analyzing data sets to summarize their main characteristics. It is usually the first step when analyzing your data. The goal of EDA is to give you an overview of your dataset before you start more in-depth analysis; its outcome often guides how you set parameters and choose appropriate algorithms. In short, it is your first look at the dataset.
A good EDA of your data can:
- Uncover the type of distribution within the dataset.
- Help identify important variables and uncover underlying features in the data.
- Identify missing values, hidden patterns, and outliers.
- Present the relationships and correlation between the data features/variables.
- Provide insights needed for optimal model tuning for machine learning.
By the end of the article, the reader will:
- Understand what exploratory data analysis is.
- Have a basic understanding of the Python pandas library.
- Know how to use pandas functions when performing EDA.
The reader requires a basic understanding of:
- Python and data analytics.
- Jupyter Notebooks or any other notebook-based technology, e.g., Google Colab.
What is Pandas?
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. And yes, I mean the Python data analysis module, not the bear.
The module is built on the Python programming language. Pandas shares some similarities with Excel: it stores data in tabular form, as series (a single labeled column) and data frames (a table of columns). It also has a rich set of functions that make data processing and analysis easier.
You can install pandas by entering the command below in your terminal.
pip install pandas
You can also learn more about Pandas by reading the official documentation.
In this tutorial, you are going to learn and explore some useful pandas functions which you should use when performing EDA. They should be part of your toolset.
Pandas Functions for EDA
For this post, we will be using the Top 20 World Container Ports dataset, obtained from Data World.
To get started, run the code snippet below:
import pandas as pd
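As a rough sketch of the loading step, the snippet below reads CSV text into a data frame named df with pd.read_csv. The inline text and its column names (port, region, volume) are made-up placeholders, not the container ports data; with the real download you would pass the path of your own file to pd.read_csv instead.

```python
import io

import pandas as pd

# Placeholder CSV text with made-up values and hypothetical column names,
# standing in for the file downloaded for this tutorial.
csv_text = """port,region,volume
A,East,10.0
B,East,20.0
C,West,30.0
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The later examples assume a data frame called df created in a similar way.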
The pandas methods df.head(), df.tail(), and df.sample(), together with the df.columns attribute, are useful for EDA. They give you a quick view of a section of your dataset or of its overall layout.
The .head() method returns the first few rows of your data (five by default). You can specify a different number of rows inside the ().
The .tail() method is similar to .head(), but it returns the last few rows of your data.
The .sample() method randomly picks rows from your data (one row by default), while df.columns returns the labels of all columns.
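Here is a small sketch of these four in action. The data frame is a toy one with made-up values and hypothetical column names, so the output will differ from what you see on the ports dataset.

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E", "F"],
    "region": ["East", "East", "West", "West", "North", "North"],
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

first_rows = df.head()        # first five rows by default
first_two = df.head(2)        # first two rows
last_two = df.tail(2)         # last two rows
# random_state makes the random pick repeatable between runs.
random_three = df.sample(n=3, random_state=0)
column_labels = df.columns    # an attribute, so no parentheses

print(first_two)
print(last_two)
print(random_three)
print(column_labels)
```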
Size and Shape
The .shape attribute returns the dimensions of your dataset. For a two-dimensional data frame, it returns a tuple with the number of rows and columns. The .size attribute returns the number of elements: for a series, that is the number of rows; for a data frame, it is the number of rows times the number of columns. Both are attributes, so they are written without parentheses.
Lastly, .memory_usage() returns how much memory, in bytes, each column uses.
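A minimal sketch of the size and shape checks, again on a toy data frame with made-up values rather than the ports dataset:

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E", "F"],
    "region": ["East", "East", "West", "West", "North", "North"],
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

dimensions = df.shape               # (rows, columns)
total_cells = df.size               # rows * columns
series_length = df["volume"].size   # number of elements in one column
memory = df.memory_usage()          # bytes used by the index and each column

print(dimensions)
print(total_cells)
print(series_length)
print(memory)
```

For text columns, passing deep=True to .memory_usage() asks pandas to measure the strings themselves, which usually gives a larger and more realistic figure.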
Data type is an important concept in data analysis. In Python, your variables can be stored in any of the supported data types. Some examples of these data types are:
- Text (str)
- Numeric (int, float, complex)
- Boolean Type (bool)
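You can find out the data type of each column in your dataset by using the dtypes attribute. Note that pandas reports its own type names, such as int64 and float64, rather than the plain Python names.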
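The sketch below shows dtypes on a toy data frame (made-up values, hypothetical column names). It also uses select_dtypes, which the article does not cover, as one possible way to keep only the numeric columns.

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C"],
    "rank": [1, 2, 3],
    "volume": [10.0, 20.0, 30.0],
    "is_hub": [True, False, True],
})

column_types = df.dtypes    # one dtype per column
print(column_types)

# Keep only integer and float columns (boolean columns are not included).
numeric_only = df.select_dtypes(include="number")
print(numeric_only.columns)
```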
Missing data is the absence of a value for your variable. This often reduces the representativeness of your sample, which can lead to bias and complicate the analysis. There are multiple ways to deal with this; in this post, we will focus on how to find missing values and remove them.
You can find missing values by using df.isnull() or df.isna(). The two are aliases, and each returns a data frame of booleans with the same shape as your data, with True wherever a value is missing.
You can get a more digestible result by chaining the sum() function onto isna() or isnull(), which gives the number of missing values in each column. To remove rows that contain missing values, you can use df.dropna().
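As a small illustration of finding and removing missing values, the sketch below uses a toy data frame with a few gaps. The values and column names are made up, and dropping rows with dropna is only one of several possible ways to handle the gaps.

```python
import pandas as pd

# Toy data with made-up values; None marks a missing entry.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D"],
    "volume": [10.0, None, 30.0, None],
    "region": ["East", None, "West", "West"],
})

missing_mask = df.isna()              # True wherever a value is missing
missing_per_column = df.isna().sum()  # count of missing values per column
cleaned = df.dropna()                 # drop every row with a missing value

print(missing_mask)
print(missing_per_column)
print(cleaned)
```

Note that dropna returns a new data frame and leaves df unchanged, so you need to keep the result.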
You can find out how many distinct values your data contains using the .nunique() function. You can use the axis parameter to specify whether to count per column (axis=0, the default) or per row (axis=1).
The .duplicated() function returns a boolean Series telling you which rows are duplicates of an earlier row.
The .unique() function returns all the unique values in a column, while .value_counts() counts how many times each value occurs in the column specified.
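The following sketch runs these uniqueness checks on a toy data frame with one repeated row. The values and column names are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values; the last row repeats the third one.
df = pd.DataFrame({
    "port": ["A", "B", "C", "C"],
    "region": ["East", "East", "West", "West"],
})

unique_per_column = df.nunique()              # axis=0 is the default
unique_per_row = df.nunique(axis=1)           # distinct values in each row
duplicate_mask = df.duplicated()              # True for repeated rows
regions = df["region"].unique()               # distinct values, in order seen
region_counts = df["region"].value_counts()   # occurrences of each value

print(unique_per_column)
print(duplicate_mask)
print(regions)
print(region_counts)
```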
The pandas .describe() function returns summary statistics of your numeric data: the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles of the values in each column.
The .info() function prints information on the index dtype, column dtypes, non-null counts, and memory usage.
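A minimal sketch of both summaries on a toy data frame (made-up values, hypothetical column names):

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E", "F"],
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# describe() returns a data frame of statistics for the numeric columns.
summary = df.describe()
print(summary)

# info() prints its report directly instead of returning it.
df.info()
```

Because .describe() returns a data frame, you can pick out a single figure with something like summary.loc["mean", "volume"].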
Smallest and Largest
Pandas .nsmallest() and .nlargest() return the specified number of rows with the smallest or largest values in a column. The rows from .nsmallest() are ordered from smallest to largest, and those from .nlargest() from largest to smallest.
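For example, on a toy data frame with made-up values, picking the two smallest and two largest rows by a hypothetical volume column could look like this:

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E", "F"],
    "volume": [30.0, 10.0, 60.0, 20.0, 50.0, 40.0],
})

# First argument is the number of rows, second is the column to rank by.
two_smallest = df.nsmallest(2, "volume")   # ordered smallest first
two_largest = df.nlargest(2, "volume")     # ordered largest first

print(two_smallest)
print(two_largest)
```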
The .groupby() function is a great way to group your data based on the criteria specified. Once the rows are grouped, you can aggregate each group efficiently, for example with a sum or a mean.
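The sketch below groups a toy data frame by a hypothetical region column and aggregates a hypothetical volume column. All values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values; not the container ports dataset.
df = pd.DataFrame({
    "port": ["A", "B", "C", "D", "E", "F"],
    "region": ["East", "East", "West", "West", "North", "North"],
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Total volume per region.
totals = df.groupby("region")["volume"].sum()
print(totals)

# Several aggregations at once: one column per statistic.
stats = df.groupby("region")["volume"].agg(["mean", "max"])
print(stats)
```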
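Correlation gives you a better understanding of the relationships between your features/variables. The correlation coefficient ranges from -1 (a perfect negative linear relationship between the variables) to +1 (a perfect positive linear relationship). A correlation of 0 suggests that there is no linear relationship between two variables; it does not by itself mean that they are independent, because they may still be related in a non-linear way. In pandas, you can compute pairwise correlations between numeric columns with df.corr().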
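To make the range of the coefficient concrete, here is a small sketch using df.corr(), which computes the Pearson correlation by default. The columns are constructed by hand (they are not from the ports dataset) so that the results are easy to predict, including a case where the correlation is 0 even though one column is fully determined by the other.

```python
import pandas as pd

# Hand-built columns, chosen so the correlations are easy to reason about.
df = pd.DataFrame({
    "x": [-2, -1, 0, 1, 2],
    "double_x": [-4, -2, 0, 2, 4],   # rises with x: correlation +1
    "minus_x": [2, 1, 0, -1, -2],    # falls as x rises: correlation -1
    "x_squared": [4, 1, 0, 1, 4],    # depends on x, but not linearly
})

corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The x_squared column shows why a coefficient of 0 is not proof of independence: it is computed directly from x, yet its linear correlation with x is 0.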
Data Visualization (Scatter plot, histogram and heatmap)
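Data visualization is a great technique to use when performing EDA. A histogram shows the distribution of a single variable, a scatter plot shows the relationship between two variables, and a heatmap gives a color-coded view of a table of values such as a correlation matrix. Histograms and scatter plots are also a great way to uncover outliers in your dataset.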
You can perform data visualization in Python using seaborn, matplotlib, and a lot of other plotting libraries.
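As one possible sketch of the three plots named in the heading, the snippet below uses matplotlib on a toy data frame. The values and the column names (volume, growth) are made up, with one deliberately large value to show how an outlier stands out. The heatmap is drawn with matplotlib's imshow here; seaborn offers a dedicated heatmap function if you prefer it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with made-up values; the last row is a deliberate outlier.
df = pd.DataFrame({
    "volume": [10.0, 12.0, 15.0, 18.0, 20.0, 60.0],
    "growth": [1.0, 1.5, 1.2, 2.0, 2.2, 6.0],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of a single column.
axes[0].hist(df["volume"], bins=5)
axes[0].set_title("Histogram of volume")

# Scatter plot: relationship between two columns.
axes[1].scatter(df["volume"], df["growth"])
axes[1].set_xlabel("volume")
axes[1].set_ylabel("growth")
axes[1].set_title("volume vs growth")

# Heatmap: color-coded correlation matrix.
corr = df.corr()
image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.index)))
axes[2].set_yticklabels(corr.index)
axes[2].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
```

In a notebook the figure is normally rendered at the end of the cell; in a plain script you would call plt.show() or fig.savefig(...) to see it.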
Where to Go From Here?
The goal of today’s post was to provide you with a quick overview of exploratory data analysis. We began by defining the term, then walked through some common Python pandas functions that you will use to run an EDA on your data.
There’s a lot more to EDA than what a single blog article can cover. You should check out our training programs to learn more. And remember that UrBizEdge has many other courses as well.
So keep studying and keep an eye on our blog for more information on data analysis.
Thanks for reading and see you next time!