Intro to Data Exploration & Visualization in Python

Every data set has a hidden pattern behind it. The power of machine learning is to find that pattern and use it to make predictions. Python has good libraries for exploratory data analysis (EDA). What is EDA? As the name suggests, it is an approach to analyzing data in order to understand its underlying structure, often with the help of visualization techniques.

Before I start, here is a brief outline of the scope of work. You can find the data set here. First, we will learn how to import the libraries and the data set. Then we will run an initial analysis of the data and go through a few methods used for data visualization, such as:

  1. Histogram
  2. Scatter Plot
  3. Box plot

Remember, this is just an introduction; we will go through other methods of analysis at a later time. Now let us begin.

Here we have customer data for four regions, covering both online and in-store shopping. Let us now delve into the data to see how visualization helps us understand it.

The libraries below are the ones necessary for this analysis:
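The original import list is not reproduced here. A minimal sketch, assuming the analysis relies on pandas, NumPy and Matplotlib (an assumption; the original may have used other libraries as well):

```python
import numpy as np               # numerical helpers
import pandas as pd              # tabular data handling
import matplotlib.pyplot as plt  # plotting
```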

Use the code below to import the data set:
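A minimal sketch of loading a CSV file with pandas. The file path, the column names and the three rows are made up for illustration; an in-memory buffer stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real file: made-up rows with hypothetical column names.
sample_csv = io.StringIO(
    "in-store,age,items,amount,region\n"
    "0,37,4,281.03,2\n"
    "1,35,2,219.51,4\n"
    "1,45,3,1525.70,1\n"
)

# With the real data set, pass the path to your file instead of the buffer.
df = pd.read_csv(sample_csv)
print(df.shape)  # (3, 5)
```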

Let us look into the data set. Type the code below to obtain the first few rows of the data set.
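A sketch of previewing the first rows with head(). The data frame here is synthetic and its column names are assumptions, used only so the snippet runs without the original file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
    "region": rng.integers(1, 5, size=n),
})

first_rows = df.head()  # first 5 rows by default; df.head(10) returns 10
print(first_rows)
```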

The describe() method displays summary statistics, such as the count, mean, standard deviation and percentiles, for the numerical columns.
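A sketch of describe() on a synthetic data frame (the columns and values are assumptions, not the original data). The "count" row is the one to compare against the number of rows when looking for missing values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
    "region": rng.integers(1, 5, size=n),
})

summary = df.describe()      # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
print(summary.loc["count"])  # equals len(df) for columns without missing values
```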

The “count” row shows whether any data are missing. As shown above, all attributes have 80000 values, so none of the columns has missing values. How can one read the above table? Let us go through an example. The youngest customer is 18 years old and the oldest is 85. Age and amount have a high standard deviation compared to the other attributes. This information can help when deciding whether to normalize the data before modelling.

The above provides a quick look at the data types and the total number of values in each variable.
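The call that produced this summary is not shown here; one common way to get it is DataFrame.info(), sketched below on synthetic data with assumed column names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
    "region": rng.integers(1, 5, size=n),
})

df.info()                   # prints column names, non-null counts and dtypes
non_null = df.notna().sum() # the same non-null counts as a Series
print(non_null)
```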

Before we dive into EDA, let us clean the data by removing duplicates, if any. As we noticed above, we do not have any null values.
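A minimal sketch of checking for duplicates and nulls and then dropping the duplicate rows. The tiny data frame is made up so that it contains exactly one duplicated row.

```python
import pandas as pd

# Made-up rows; the second row repeats the first.
df = pd.DataFrame({
    "age": [25, 25, 40, 61],
    "amount": [120.5, 120.5, 980.0, 45.25],
    "region": [1, 1, 3, 2],
})

print(df.duplicated().sum())  # 1
print(df.isnull().sum())      # null count per column

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 3
```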

Data Visualization

From Wikipedia:

“A histogram is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson.[1] To construct a histogram, the first step is to "bin" (or "bucket") the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but not required to be) of equal size.[2]”

Histograms are widely used to understand a distribution. Let us use a histogram to understand the age distribution.
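A sketch of a basic histogram with Matplotlib. The ages are synthetic (uniform between 18 and 85), so the shape will not match the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic ages standing in for the real "age" column (hypothetical name).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 86, size=500), name="age")

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(ages)  # 10 equal-width bins by default
# plt.show()  # uncomment when running as a script
```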

Basic Histogram

The above gives you a histogram, but what about the x- and y-axis labels?
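A sketch of adding axis labels and a title. The label texts are my own choices, and the ages are synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic ages standing in for the real "age" column (hypothetical name).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 86, size=500), name="age")

fig, ax = plt.subplots()
ax.hist(ages)
ax.set_xlabel("Age")                  # x-axis label
ax.set_ylabel("Number of customers")  # y-axis label
ax.set_title("Age distribution")
# plt.show()  # uncomment when running as a script
```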

The above is a simple histogram which speaks for itself. Most customers are in the range of 25 to 60, and the number gradually drops as age increases.

How can we do binning, that is, divide the entire range of values into a series of intervals?
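One way to control binning is the bins argument of hist(): an integer asks for that many equal-width bins, while a list gives the bin edges explicitly. The edges below are an arbitrary choice for illustration, applied to synthetic ages.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic ages standing in for the real "age" column (hypothetical name).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 86, size=500), name="age")

# Explicit edges: [18, 25), [25, 35), ..., [75, 86]; the last bin is closed.
bin_edges = [18, 25, 35, 45, 55, 65, 75, 86]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(ages, bins=bin_edges, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Number of customers")
# plt.show()  # uncomment when running as a script
```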

To get histograms of all the attributes at once, we could simply use the below:
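A sketch using DataFrame.hist(), which draws one histogram per numeric column. The data frame is synthetic and its column names are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "items": rng.integers(1, 9, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
    "region": rng.integers(1, 5, size=n),
})

# Returns an array of Axes, one subplot per numeric column.
axes = df.hist(figsize=(8, 6))
plt.tight_layout()
# plt.show()  # uncomment when running as a script
```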

According to Wikipedia:

A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram)[3] is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.[4]

In short, a scatter plot helps us understand how one variable relates to another. A scatter plot can reveal three types of relationship:

  • Positive Correlation: as one variable increases, so does the other
  • Negative Correlation: as one variable increases, the other decreases
  • No Correlation: there is no apparent relationship between the variables.

The below shows a simple example:
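A sketch of a simple scatter plot with plt.scatter-style plotting, here age against amount. Both the choice of columns and the data are assumptions; the synthetic values are independent, so expect no visible correlation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
})

fig, ax = plt.subplots()
sc = ax.scatter(df["age"], df["amount"])  # one point per customer
ax.set_xlabel("Age")
ax.set_ylabel("Amount")
# plt.show()  # uncomment when running as a script
```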

You could also use this, which will give you the same plot as above:
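The alternative is presumably plt.plot with a marker-only format string; a sketch under that assumption, on synthetic data with hypothetical column names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
})

fig, ax = plt.subplots()
# The "o" format string draws circle markers with no connecting line.
lines = ax.plot(df["age"], df["amount"], "o")
ax.set_xlabel("Age")
ax.set_ylabel("Amount")
# plt.show()  # uncomment when running as a script
```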

Scatter plots offer much more than a comparison of the relationship between two features. The main difference between plt.scatter and plt.plot is that in plt.scatter the properties of each point, such as color and size, can be mapped to individual attributes of the data. We will explain this with an example.
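A sketch of mapping point properties to data: color follows a region column and marker size follows an items column. The columns, the scaling factor and the color map are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the customer data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 86, size=n),
    "items": rng.integers(1, 9, size=n),
    "amount": rng.exponential(scale=800, size=n).round(2),
    "region": rng.integers(1, 5, size=n),
})

fig, ax = plt.subplots()
sc = ax.scatter(
    df["age"], df["amount"],
    c=df["region"],      # color encodes the region
    s=df["items"] * 10,  # marker area encodes the number of items
    cmap="viridis",
    alpha=0.6,
)
fig.colorbar(sc, ax=ax, label="Region")
ax.set_xlabel("Age")
ax.set_ylabel("Amount")
# plt.show()  # uncomment when running as a script
```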

Let us make this a little more interesting. How can we view the total amount spent by the customers in each region? There are various ways to do it, but I will take this as an opportunity to introduce “pivot_table”. A pivot_table provides a summary of the data in a simple table. This is useful when we have many rows and need aggregate functions such as sum, median or mean.
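A minimal sketch of pivot_table summing the amount per region. The rows are made up and the column names are assumptions.

```python
import pandas as pd

# Made-up purchases; column names are hypothetical.
df = pd.DataFrame({
    "region": [1, 1, 2, 3, 3, 4],
    "amount": [100.0, 250.0, 80.0, 40.0, 60.0, 500.0],
})

# One row per region, with the amounts in that region added up.
totals = pd.pivot_table(df, index="region", values="amount", aggfunc="sum")
print(totals)
```

Passing a list such as aggfunc=["sum", "mean", "median"] returns one column per aggregate.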

A boxplot, sometimes called a box-and-whisker plot, is useful for visualizing the distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3) and maximum). In short, it shows where the bulk of the data lies and is very useful for identifying outliers. Why are outliers important? Outliers are points in the data set that are distant from the rest of the points. They can distort statistics such as the mean and, to a lesser extent, the median, which in turn can introduce errors into the analysis. Some outliers are good ones from which we can gain new knowledge. Others are bad and can hurt the results, especially when training a model.
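A sketch of computing the quartiles by hand and drawing a horizontal boxplot. The amounts are synthetic (right-skewed by construction), and the 1.5 × IQR fences match Matplotlib's default whisker rule rather than anything stated in the original analysis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, right-skewed amounts standing in for the real column.
rng = np.random.default_rng(0)
amount = pd.Series(rng.exponential(scale=800, size=500), name="amount")

# Quartiles and interquartile range.
q1, median, q3 = amount.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = amount[(amount < lower_fence) | (amount > upper_fence)]
print(len(outliers), "points outside the whiskers")

fig, ax = plt.subplots()
box = ax.boxplot(amount, vert=False)  # horizontal box
ax.set_xlabel("Amount")
# plt.show()  # uncomment when running as a script
```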

Boxplot showing Amount

A boxplot helps show whether the data are symmetric or skewed (lopsided). Symmetric data have the median in the middle of the box, whereas skewed data divide the box into unequal parts. The above image shows that amount is skewed to the right: the part of the box to the left of the median, which represents customers who purchased below 500, is shorter than the part to the right. A larger part does not mean that part holds more data. It simply means the values in that part span a wider range. Each section contains 25% of the data no matter what. From the above we can also note that amount has a lot of outliers.

Summary

This page mainly goes through the basics. Hopefully it gives you a strong foundation.

If you have any thoughts, comments or questions, please leave a comment below or contact me on LinkedIn. You can also find more similar projects on my GitHub.
