Data Exploration & Visualization in Python — Part1

Rayon Susan Koshy
6 min read · Jan 2, 2021
Photo by Edward Howell on Unsplash

In my previous blog, we went through the fundamentals of data analysis. Python has plenty of libraries that help analyze data in depth. Features are central to a data set: in machine learning, we look for patterns among them. That raises the question: what are dependent and independent variables? Just as the names suggest, a dependent variable is the output of a process and an independent variable is an input to it. For example, in the data set below, “Species” is the dependent variable and the remaining variables are independent variables.

Iris Data set

Independent variables are also known as “predictors”. Dependent variables are also known as “response” or “target” variables.
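As a minimal sketch of that split, the snippet below separates the predictors from the target with pandas. The four rows are made-up illustrative values, not records from the real Iris file, and the names toy, X and y are chosen here only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real Iris file.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Independent variables (predictors): every column except the target.
X = toy.drop(columns=["Species"])

# Dependent variable (response/target): the column we would like to predict.
y = toy["Species"]

print(list(X.columns))
print(y.name)
```

Keeping the predictors and the target in separate objects makes it harder to feed the answer into a model by accident later on.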

For this article we will use a public Kaggle data set called Iris, which is very commonly used for practice. You can find the data set at the link below:

Let us take a look at the data set and its contents. The Iris data consists of six columns:

  • ID: Identification Number
  • SepalLengthCm: Length of Sepal in cm
  • SepalWidthCm: Width of Sepal in cm
  • PetalLengthCm: Length of Petal in cm
  • PetalWidthCm: Width of Petal in cm
  • Species: Type of Species

For our analysis we removed the ID column and renamed the measurement columns (the plotting code below refers to them as “Petal.Width” and “Petal.Length”). We can drop a column using “data.drop('name', axis=1)”, where 'name' is the column to remove. This is one way to do it. If we know the column names in advance, we can also choose which columns to load into the data frame using df = pd.read_csv("sampleData.csv", usecols = ['Col1','Col2']). Let us see how the data frame looks now.
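The two approaches can be sketched end to end as follows. This is only an illustration: the in-memory CSV holds made-up values, the identifier header is spelled "Id" here (adjust it to match your file), and the "Sepal.Length" and "Sepal.Width" names are assumptions modeled on the "Petal.Width" and "Petal.Length" names used in this article's plotting code.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are made up).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,6.3,3.3,6.0,2.5,Iris-virginica
"""

# Option 1: read everything, then drop the identifier column.
data = pd.read_csv(io.StringIO(csv_text))
data = data.drop("Id", axis=1)

# Rename the measurement columns. The Sepal.* names are an assumption;
# only the Petal.* names appear in the article's code.
data = data.rename(columns={
    "SepalLengthCm": "Sepal.Length",
    "SepalWidthCm": "Sepal.Width",
    "PetalLengthCm": "Petal.Length",
    "PetalWidthCm": "Petal.Width",
})

# Option 2: pick the columns at read time with usecols.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["PetalWidthCm", "Species"])

print(list(data.columns))
print(list(subset.columns))
```

Dropping after reading is convenient while exploring. Selecting with usecols avoids loading columns that will never be used, which matters more for large files.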

data.describe()
Fig1: Data set Information
data.info()
Fig3: Data’s description

Import the necessary packages

These are the packages that we need to import first. Each of them must already be installed in your environment.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
%matplotlib inline
from pandas.plotting import scatter_matrix
import seaborn as sns
import plotly.io as pio

Visualization

First, I am going to get the total count of data for each Species:

# Pass the column by keyword; recent seaborn versions reject it as a positional argument
sns.catplot(x='Species', data=data, kind='count', aspect=1.2)
Fig4: Bar graph showing the count of species
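The numbers behind a count plot can also be read directly with pandas. The sketch below uses a handful of made-up labels, not the real data, to show that value_counts() returns the same per-category totals that the bars represent.

```python
import pandas as pd

# Made-up species labels for illustration only.
toy = pd.DataFrame({
    "Species": ["setosa", "setosa", "versicolor",
                "virginica", "virginica", "virginica"],
})

# Rows per species: the same totals a count plot draws as bar heights.
counts = toy["Species"].value_counts()
print(counts)
```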

When we are given a data set, it is important to understand the relationships between its variables. How does one variable change when another changes? The strength and direction of that relationship is measured by the correlation. Correlation can be:

  • Positive: one value tends to increase as the other increases
  • Negative: one value tends to decrease as the other increases
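plt.rcParams['figure.figsize'] = (10, 7)
# Correlate only the numeric columns; 'Species' holds text labels
corre = data.select_dtypes('number').corr()
print(corre)
fig = plt.figure()
ax = fig.add_subplot()
# Show the matrix as a colored grid on a fixed -1 to 1 scale
cat = ax.matshow(corre, vmin=-1, vmax=1)
fig.colorbar(cat)
# One tick and one label per numeric column
ticks = np.arange(len(corre.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(corre.columns)
ax.set_yticklabels(corre.columns)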
Fig 5: Correlation Plot
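To see what the sign of a correlation means in numbers, here is a small sketch on made-up values, chosen so that one column rises with x and another falls with it. The column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up values chosen so the direction of each relationship is obvious.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rises_with_x": [2.1, 3.9, 6.2, 8.0, 9.8],
    "falls_with_x": [9.0, 7.2, 4.9, 3.1, 1.0],
})

# DataFrame.corr() computes the Pearson correlation by default.
corr = toy.corr()
print(corr.round(2))
```

The entry for x and rises_with_x comes out close to +1, the entry for x and falls_with_x close to -1, and every column has a correlation of exactly 1 with itself, which is why the diagonal of a correlation plot is always the brightest.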

Bar Graph/Bar Chart: Represents categorical data with rectangular bars whose height or length is proportional to the values they represent. We can use a bar graph to get the count of each value in the column “Petal.Width”.

sns.catplot(x='Petal.Width', data=data,
            kind='count', aspect=1.5)
Fig6: Bar graph showing the count of Petal Width

The hue parameter draws a separate set of bars for each of its unique values and distinguishes them by color. If we want the count of values for the petal width, but also want to categorize them by Species, we can use the code below.

sns.catplot(x='Petal.Width', data=data, kind='count', hue='Species', aspect=1.5)
Fig7: Bar graph showing the count of Petal Width categorized based on Species

The graph speaks for itself. For example, petal widths from 0.1 to 0.6 mainly belong to the Species “Setosa”. Do you see any relation between petal width and Species?
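The same grouped counts can be produced as a table. The sketch below uses pd.crosstab on a few made-up rows (not the real measurements) to count how often each petal width occurs within each species, which is what the colored bars encode.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Petal.Width": [0.2, 0.2, 0.4, 1.3, 1.3, 2.0],
    "Species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "virginica"],
})

# Rows: petal width values. Columns: species. Cells: number of flowers.
table = pd.crosstab(toy["Petal.Width"], toy["Species"])
print(table)
```

A table like this is handy when bars are too small to read off exactly, or when the counts are needed for a later calculation.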

Scatter Plot: Put simply, a scatter plot places points on a horizontal and a vertical axis to show how much one variable is affected by another. This relation is called correlation. We can call two variables highly correlated if the data points fall close to a straight line.

The code below gives a simple scatter plot from which we can understand the relationship between the width and the length of the petal.

# catplot defaults to a strip plot, which treats the x values as categories
chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    aspect=1.5)
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length',
          weight='bold', fontsize=16)
Fig8: Scatter plot

Let us gather a few more details from the plot by color-coding the points based on the Species. This will help us understand the relation between petal width and length across the different Iris Species.

chart = sns.catplot(x='Petal.Width', y='Petal.Length', data=data,
                    hue='Species', aspect=1.5)
# Customize chart
chart.set_xlabels('Petal Width', weight='bold', fontsize=13)
chart.set_ylabels('Petal Length', weight='bold', fontsize=13)
plt.title('Relation between petal width & length',
          weight='bold', fontsize=16)
Fig9

seaborn.jointplot

According to the seaborn 0.11.0 documentation, seaborn.jointplot will:

Draw a plot of two variables with bivariate and univariate graphs.

To get a simple jointplot, assign x and y to create a scatterplot (using scatterplot()) with marginal histograms (using histplot()).

sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data)
Fig10

The graph gives both a scatter plot and a separate histogram for each variable. It is clear from this that there is a positive correlation between petal width and length.

kind is a parameter of seaborn.jointplot that determines the kind of plot to draw. It can be scatter, kde, hist, hex, reg, or resid.

Setting kind = ‘kde’ will draw both bivariate and univariate KDEs.

sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='kde')
Fig11

Kernel Density Estimation (KDE) is a way to estimate the probability density function of a continuous random variable.

According to the seaborn 0.11.0 documentation:

A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range.
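To make the difference between a density and a probability concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not seaborn's implementation: the helper names gaussian_kde_1d and trapezoid_area are hypothetical, the samples are made up, and the bandwidth is picked by hand rather than estimated from the data.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

def trapezoid_area(values, step):
    """Trapezoidal rule on an evenly spaced grid."""
    return step * (values.sum() - 0.5 * (values[0] + values[-1]))

# Made-up measurements: four small values and three large ones.
samples = [1.3, 1.4, 1.5, 1.5, 4.5, 4.7, 5.1]
grid = np.linspace(-2.0, 9.0, 2201)
step = grid[1] - grid[0]
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# The whole curve integrates to (approximately) 1.
total_area = trapezoid_area(density, step)

# A probability comes from integrating over a range, here everything up to 3.
below = grid <= 3.0
prob_below_3 = trapezoid_area(density[below], step)

print(round(total_area, 3), round(prob_below_3, 3))
```

Four of the seven samples lie below 3, so the integrated probability lands near 4/7, while the height of the curve at any single point is only a density value.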

Similar to a heatmap, a bivariate KDE plot smooths the x and y observations with a 2D Gaussian.

If we set kind = ‘reg’, it will plot the data and a linear regression model fit using regplot(), along with univariate KDE curves.

sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='reg')
Fig12
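The straight line in a regression plot is a least-squares fit. As a rough numeric counterpart, and not seaborn's internal code, the sketch below fits such a line with np.polyfit on made-up petal measurements.

```python
import numpy as np

# Made-up petal measurements in cm, for illustration only.
width = np.array([0.2, 0.4, 1.0, 1.3, 1.8, 2.3])
length = np.array([1.4, 1.6, 3.5, 4.2, 5.5, 6.1])

# Least-squares fit of: length = slope * width + intercept.
# For deg=1, polyfit returns the slope first and the intercept second.
slope, intercept = np.polyfit(width, length, deg=1)
predicted = slope * width + intercept

print(round(slope, 2), round(intercept, 2))
```

A positive slope matches the positive correlation seen in the scatter plot: wider petals tend to be longer.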

There are two ways to obtain a bin-based jointplot. One is to use kind = ‘hist’.

This uses histplot() on all of the axes.

sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hist')
Fig13

The other way is to use kind = ‘hex’.

This will use matplotlib.axes.Axes.hexbin() to compute a bivariate histogram using hexagonal bins.

sns.jointplot(x='Petal.Width', y='Petal.Length',
              data=data, kind='hex')
Fig14
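Both bin-based plots boil down to counting how many points fall into each cell of a grid. The sketch below shows the rectangular version of that counting with np.histogram2d on made-up values. Hexagonal binning follows the same idea with a different cell shape. The bin count and ranges are chosen by hand for illustration.

```python
import numpy as np

# Made-up petal measurements in cm, for illustration only.
width = np.array([0.2, 0.2, 0.4, 1.3, 1.5, 1.8, 2.3, 2.5])
length = np.array([1.4, 1.5, 1.6, 4.2, 4.5, 5.5, 6.1, 6.0])

# Three bins per axis: width in [0, 3] and length in [0, 7.5].
counts, width_edges, length_edges = np.histogram2d(
    width, length, bins=[3, 3], range=[[0.0, 3.0], [0.0, 7.5]]
)

# counts[i, j] is the number of points in width bin i and length bin j.
print(counts)
```

Cells with a high count are the ones a bin-based jointplot draws in the darkest color.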

Summary

In this article we discussed the correlation plot, the scatter plot, seaborn.catplot and seaborn.jointplot. These are a few among many visualization techniques. Data visualization is very important for analyzing data sets, and it is a lot like storytelling. In the current age of big data, visualization is a key tool for telling stories: it makes data much easier to understand and highlights trends and outliers. I hope this helps you build a foundation in data analytics.

Please let me know if you have any questions.

If you have any thoughts, comments or questions, please leave a comment below or contact me on LinkedIn. You can also find more similar projects on my GitHub.
