There is a hidden pattern behind every data set. The power of machine learning is to find that hidden pattern and use it to make predictions. Python has good libraries that can be used for EDA. What is EDA (exploratory data analysis)? As the name suggests, it is an approach to analyzing data to understand its underlying structure, using different visualization techniques.
Before I start, I would like to give you a brief introduction to the scope of work. You can find the data set here. First, we will learn how to import the data set and the libraries. Then we will conduct an initial analysis of the data, and finally we will go through a few methods used for data visualization, such as:
- Histogram
- Scatter Plot
- Box Plot
Remember that this is just an introduction; we will go through other methods of analysis at a later time. Now let us begin.
Here we have customer data for four regions, covering both online and in-store shopping. Let us dig into the data to see how visualization helps us understand it better.
The libraries listed below are the ones needed for this analysis:
import matplotlib as mpl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
%matplotlib inline
Use the code below to import the data set:
data = pd.read_csv('Demographic_Data.csv')
Let us look into the data set. Calling head() on the DataFrame returns its first few rows (five by default).
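As a minimal sketch of this step, the snippet below calls head() on a small stand-in DataFrame. The column names follow the data set, but the values in `sample` are invented for illustration and are not rows from Demographic_Data.csv.

```python
import pandas as pd

# Hypothetical stand-in for the real data set (values are invented).
sample = pd.DataFrame({
    "in-store": [0, 1, 1, 0, 1, 0, 1],
    "age": [37, 35, 45, 46, 33, 24, 43],
    "items": [4, 2, 3, 3, 4, 3, 6],
    "amount": [281.03, 219.51, 1525.70, 715.25, 1937.50, 1314.20, 8.55],
    "region": [2, 2, 4, 3, 1, 4, 2],
})

first_rows = sample.head()     # first five rows by default
first_three = sample.head(3)   # or ask for a specific number of rows
print(first_rows)
```

With the real data you would call head() on `data` in the same way.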
The describe() method displays summary statistics for the numerical columns, such as the count, mean, standard deviation, minimum, maximum and percentiles.
The "count" row tells us whether any data is missing. As shown above, every attribute has 80000 values, so none of the columns has missing values. How can one read the above table? Let us go through an example. The youngest customer is 18 years old and the oldest is 85 years old. Age and amount have a high standard deviation compared to the other attributes. This information can help when deciding whether to normalize the data before modelling.
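To make the reading of that table concrete, here is a small sketch on invented data. The `sample` DataFrame and its values are assumptions for illustration only; the real table comes from calling describe() on `data`.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
sample = pd.DataFrame({
    "age": [18, 25, 40, 62, 85],
    "items": [1, 4, 3, 7, 5],
    "amount": [12.5, 250.0, 980.0, 1500.0, 3000.0],
})

summary = sample.describe()

# The "count" row reports non-null values per column, so comparing it
# with the number of rows reveals missing data.
has_missing = bool((summary.loc["count"] < len(sample)).any())

youngest = summary.loc["min", "age"]
oldest = summary.loc["max", "age"]

# Columns ordered by standard deviation, largest first.
spread = summary.loc["std"].sort_values(ascending=False)
print(summary)
```

Columns that end up at the top of `spread` are the ones whose scale differs most from the rest, which is the hint to consider normalization.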
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
 0   in-store  80000 non-null  int64
 1   age       80000 non-null  int64
 2   items     80000 non-null  int64
 3   amount    80000 non-null  float64
 4   region    80000 non-null  int64
dtypes: float64(1), int64(4)
memory usage: 3.1 MB
The output above gives a quick look at the data type and the number of non-null values in each column.
Before we dive into EDA, let us clean the data by removing duplicate rows, if there are any. As we noticed, we do not have any null values.
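One way to do this cleaning, sketched here on a tiny invented DataFrame, is to count duplicates with duplicated() and remove them with drop_duplicates(). The `sample` values are assumptions for illustration; with the real data the same calls would be made on `data`.

```python
import pandas as pd

# Hypothetical stand-in data; the third row repeats the first one.
sample = pd.DataFrame({
    "age": [37, 35, 37, 46],
    "items": [4, 2, 4, 3],
    "amount": [281.03, 219.51, 281.03, 715.25],
})

# Rows that are identical to an earlier row are flagged as duplicates.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = sample.drop_duplicates().reset_index(drop=True)

# Total number of null cells across the whole table.
n_nulls = int(sample.isnull().sum().sum())
print(n_duplicates, len(deduped), n_nulls)
```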
1. Histogram
“A histogram is an approximate representation of the distribution of numerical data. It was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" (or "bucket") the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but not required to be) of equal size.”
Histograms are widely used to understand a distribution. Let us use a histogram to understand the age distribution:
plt.hist(data['age'])
That gives you a histogram, but what about a title and axis labels?
plt.hist(data['age'])
plt.title('Histogram - Age distribution')
plt.xlabel('Age')
plt.ylabel('Number of customers')
The above is a simple histogram that speaks for itself. Most customers fall in the range of 25 to 60, and the count gradually drops as age increases.
How can we do binning, that is, divide the entire range of values into a series of intervals?
fig = plt.hist(data['age'],
               bins = 5,
               edgecolor = 'white')
plt.title('Histogram - Age distribution')
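To see what the bins argument does numerically, the sketch below uses np.histogram, which performs the same binning without drawing anything. The `ages` array and the explicit edges in `custom_edges` are invented for illustration.

```python
import numpy as np

# Hypothetical ages, invented for illustration.
ages = np.array([18, 22, 25, 31, 38, 44, 47, 52, 60, 66, 73, 85])

# An integer asks for that many equal-width bins between min and max.
counts, edges = np.histogram(ages, bins=5)

# A sequence gives the bin edges explicitly, so bins may differ in width.
# Every bin is half-open [left, right) except the last, which is closed.
custom_edges = [18, 30, 45, 60, 85]
custom_counts, _ = np.histogram(ages, bins=custom_edges)

print(edges)
print(counts)
print(custom_counts)
```

plt.hist accepts the same two forms of bins, so either an integer or a list of edges can be passed to the histogram calls shown earlier.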
To get a histogram of every numerical attribute at once, we can simply call hist() on the DataFrame itself.
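A minimal sketch of this, assuming a stand-in DataFrame with randomly generated values (the `sample` data below is invented and only mimics three of the columns):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data generated at random for illustration.
rng = np.random.RandomState(0)
sample = pd.DataFrame({
    "age": rng.randint(18, 86, size=200),
    "items": rng.randint(1, 9, size=200),
    "amount": rng.exponential(scale=800.0, size=200),
})

# DataFrame.hist draws one histogram per numeric column and returns
# an array of matplotlib Axes, one per subplot.
axes = sample.hist(bins=10, edgecolor="white", figsize=(9, 3), layout=(1, 3))
plt.tight_layout()
```

In a notebook with %matplotlib inline the figure appears automatically; in a plain script you would add plt.show() at the end.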
2. Scatter Plot
According to Wikipedia:
A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
In short, a scatter plot helps us understand how one variable relates to another. There are three kinds of relationship a scatter plot can reveal:
- Positive Correlation: as one variable increases, so does the other
- Negative Correlation: as one variable increases, the other decreases
- No Correlation: there is no apparent relationship between the variables
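The three cases can also be checked numerically. The sketch below builds synthetic data for each case and computes the Pearson correlation coefficient with np.corrcoef; the data and the helper `pearson` are invented for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 10.0, 200)
noise = rng.normal(scale=1.0, size=x.size)

y_positive = 2.0 * x + noise            # rises as x rises
y_negative = -2.0 * x + noise           # falls as x rises
y_unrelated = rng.normal(size=x.size)   # generated independently of x


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


r_pos = pearson(x, y_positive)
r_neg = pearson(x, y_negative)
r_none = pearson(x, y_unrelated)
print(r_pos, r_neg, r_none)
```

A coefficient close to +1 or -1 corresponds to a clear upward or downward trend in the scatter plot, while a value near 0 corresponds to a shapeless cloud of points.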
The code below shows a simple example:
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
fig, ax = plt.subplots()
ax.plot(t, s, 'o')
You could also use the following, which gives the same plot as above:
plt.scatter(t, s, marker='o')
A scatter plot offers much more than a comparison of two features. The main difference between plt.scatter and plt.plot is that in a scatter plot the properties of each point, such as color and size, can be mapped to individual attributes of the data. Let us illustrate this with an example:
rdm = np.random.RandomState(0)
t = rdm.random(100)
s = rdm.random(100)
color = rdm.random(100)
sizes = 1000*rdm.random(100)
fig = plt.scatter(t, s, c=color, s=sizes)
Let us make this a little more interesting. How can we view the total amount spent by the customers in each region? There are various ways to do it, but I will take this as an opportunity to introduce pivot_table. A pivot table summarizes the data in a simple table. This is useful when we have many rows and need aggregates such as the sum, median or mean.
amount_table = data.pivot_table(
aggfunc = [np.sum, len, np.mean, np.median]
color = 'green',
linestyle = '--')
plt.legend()
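As a rough sketch of the idea, the snippet below summarizes amount per region with pivot_table. The choice of values="amount" and index="region" is my assumption based on the question above, the aggregation names are passed as strings, and the `sample` data is invented.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
sample = pd.DataFrame({
    "region": [1, 1, 2, 2, 3, 4, 4],
    "amount": [100.0, 300.0, 50.0, 150.0, 400.0, 20.0, 80.0],
})

# One row per region, one column per aggregate of "amount".
# Assumption: grouping by "region" and aggregating "amount".
amount_table = sample.pivot_table(
    values="amount",
    index="region",
    aggfunc=["sum", "count", "mean", "median"],
)
print(amount_table)

# Columns are (aggregate, value) pairs, so a single cell is read like this:
total_region_1 = amount_table[("sum", "amount")].loc[1]
```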
3. Box Plot
A box plot, sometimes called a box-and-whisker plot, is useful for visualizing the distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3) and maximum). In short, it helps us understand where the bulk of the data lies, and it is very useful for identifying outliers. Why are outliers important? Outliers are points in the data set that are distant from the rest of the points. They can distort statistics such as the mean and standard deviation (the median is far less sensitive), which in turn can lead to misleading conclusions. Some outliers are good ones, from which we can gain new knowledge. Others are bad and can harm the analysis, especially when training a model.
A box plot also helps to show whether the data is symmetric or skewed (lopsided). Symmetric data has the median in the middle of the box, whereas skewed data divides the box into unequal parts. The above image shows that amount is skewed to the right. The part of the box to the left of the median, which represents customers who spend below 500, is shorter than the part to the right. A larger part does not mean that it holds more data; it simply means that the values in that part cover a wider range. Each quartile section contains about 25% of the data regardless of its length. From the above we can also note that amount has a lot of outliers.
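To connect the picture with the numbers, the sketch below computes the quartiles by hand, flags outliers with the 1.5 x IQR rule that matplotlib uses by default for its whiskers, and then draws the box plot. The `amounts` array is invented for illustration and is not taken from the data set.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical purchase amounts, invented for illustration.
amounts = np.array([5, 40, 90, 150, 220, 310, 420, 560, 750, 1000, 2900.0])

q1, median, q3 = np.percentile(amounts, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the length of the box

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

# Right skew: the upper half of the box is longer than the lower half.
right_skewed = (q3 - median) > (median - q1)

fig, ax = plt.subplots()
ax.boxplot(amounts)
ax.set_ylabel("amount")
ax.set_title("Box plot - amount")
```

With the real data, the same calls would be applied to the amount column of `data`.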
This page mainly goes through the basics. Hopefully it gives you a strong foundation. Please let me know if you have any questions.