Pushpita Chatterjee

EDA ~ Exploratory Data Analysis

In a good movie, the director first introduces you to the characters of the story and digs a little into their history or background. By the interval, the main plot has been introduced and the stage set to keep you hooked, so you stay back after the interval and finish the story for the joy of storytelling and cinema!

EDA is like that first half, only in Data Science or Machine Learning. As the name suggests, it is the process of exploring and investigating every aspect of the data and analyzing it. The more you explore the data, the more it unfolds like a story. It is a process of storytelling through numbers and visualization techniques.

The main steps involved in EDA are:
* Loading the data
* Checking the data information
* Five-point summary
* Checking for null and duplicate values
* Visualizing the data distribution
* Checking for outliers
* Imputing values — nulls/outliers
* Removing duplicates
* Encoding categorical variables
* Normalizing and scaling

To explain EDA, Python and Jupyter Notebooks have been used. "df" is the name of the DataFrame used throughout the explanation; it holds a dataset of cubic zirconia, a type of stone used to make jewelry.

Step 1

Import the required Python libraries and load the data into a Jupyter Notebook.

Using the head function — head(), check whether the dataset has loaded correctly; it shows the first few rows and the corresponding value of each column.
Within the parentheses you can specify the number of rows to display; five is the default.
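
A minimal sketch of this step, assuming the data lives in a CSV file (the filename cubic_zirconia.csv is hypothetical):

```python
import pandas as pd

# Load the dataset into a DataFrame named "df" (the filename is hypothetical)
df = pd.read_csv("cubic_zirconia.csv")

# Preview the dataset; head() shows the first five rows by default,
# head(10) would show the first ten
df.head()
```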

Step 2

Check the shape of the dataset, i.e. the number of rows and columns, using the shape attribute — df.shape.

Using the info() function we see the data type of each variable and whether any null values are present.
It also reports the number of rows and columns.
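
In code:

```python
# Number of rows and columns (note: shape is an attribute, not a method)
df.shape

# Data type of each column, non-null counts, and memory usage
df.info()
```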

Step 3

Using the describe function — describe(), we get the five-point summary. This works best for continuous data and reports the following for each variable (illustrated after the list):
a. Maximum
b. Minimum
c. Mean
d. Standard Deviation
e. Values at the 25%, 50%, and 75% marks (the quartiles).
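
A quick illustration; note that describe() also reports the count of non-null values in each column:

```python
# Summary statistics for the numerical columns:
# count, mean, std, min, 25%, 50%, 75% and max
df.describe()
```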

Step 4

Check for null values using the isnull function — isnull().
It is only helpful in the case of a very small dataset, as with a big dataset we cannot inspect every data point of every variable.

isnull().sum() counts the total number of null values in each variable, which is much more useful than isnull() on its own. This dataset does not have any null values; if it did, we would need to treat them, as discussed in Step 8.
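
Both checks side by side:

```python
# Boolean mask of missing values -- only readable for very small datasets
df.isnull()

# Count of missing values per column -- far more practical
df.isnull().sum()
```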

Step 5

Check for duplicate rows in the data, as their presence lowers the accuracy of a model.
The code below shows that no duplicates are present in the data.
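
A one-line way to check this with pandas:

```python
# Number of fully duplicated rows; 0 means the data has no duplicates
df.duplicated().sum()
```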

Step 6

Visualize the data distribution using Histograms

This tells us whether the data is normally distributed, right-skewed, or left-skewed, and gives an indication of the presence of outliers.
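
A minimal sketch using pandas' built-in hist(), which plots one histogram per numerical column (the figure size is arbitrary):

```python
import matplotlib.pyplot as plt

# One histogram per numerical column to inspect each distribution
df.hist(figsize=(12, 10))
plt.show()
```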

Step 7

Checking for outliers — these are extreme values lying far from the rest of the data, at either the higher or the lower end.

These can be visualized using the boxplot function from the Seaborn library.
The normal range of the data lies within the box and whiskers, whereas the outliers are indicated by dots.

The boxplots show that the variable carat has many outliers whereas cut has none.
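
A sketch of the boxplot for a single variable (carat is used here since it is mentioned above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot for a single variable: the box spans the interquartile range,
# the whiskers extend up to 1.5 * IQR beyond it, and dots mark outliers
sns.boxplot(x=df["carat"])
plt.show()
```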

Step 8

Imputing Values — Treating Outliers

We first work out the lower and upper bounds of each variable from its quartiles (the same 1.5 × IQR limits used for the boxplot whiskers), then replace any value below the lower bound with the lower bound and any value above the upper bound with the upper bound (cut does not have outliers; the steps are shown for explanation purposes).
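
A minimal sketch of this capping approach, using 1.5 × IQR bounds; the helper name cap_outliers and the choice of column are illustrative, not taken from the original notebook:

```python
import numpy as np

def cap_outliers(series):
    # Quartiles and interquartile range of the column
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values below the lower bound become the lower bound,
    # values above the upper bound become the upper bound
    return np.where(series < lower, lower,
                    np.where(series > upper, upper, series))

# Apply the capping to a column with outliers, e.g. carat
df["carat"] = cap_outliers(df["carat"])
```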

Treating Missing values

For treating missing values, the np.where approach shown above can be used again; depending on the variable and its distribution, a null value is replaced using one of the processes below (a short sketch follows the list).

a. Replacing by Mean — if the distribution is normal and the variable is continuous, the null value is replaced by the mean.

b. Replacing by Median — if the distribution is not normal and the variable is continuous, the null value is replaced by the median.

c. Replacing by Mode — when the variable is categorical, the null value is replaced by the mode.

d. Dropping the missing values — here, the rows with missing values are dropped from the dataset.
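
A minimal sketch of these options, using pandas' fillna()/dropna() as an equivalent to the np.where approach; the column chosen for each case is illustrative:

```python
# a. Mean imputation for a normally distributed continuous column
df["depth"] = df["depth"].fillna(df["depth"].mean())

# b. Median imputation for a skewed continuous column
df["carat"] = df["carat"].fillna(df["carat"].median())

# c. Mode imputation for a categorical column
df["cut"] = df["cut"].fillna(df["cut"].mode()[0])

# d. Or simply drop the rows that still contain missing values
df = df.dropna()
```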

Step 9

Dropping the duplicates — if the duplicates are few in number they can be deleted, but this should always be checked with the client; if removing them does not affect the dataset, we go ahead and delete them using the drop_duplicates function.
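
In pandas this is a one-liner:

```python
# Remove exact duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()
```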

Step 10

Encoding categorical variables means converting a categorical variable into numerical form. This is done using get_dummies(). After encoding, all the object variables are transformed into numerical data: a new column is created for each category of an object variable, and a binary code denotes its presence or absence.
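
A minimal sketch with pandas:

```python
import pandas as pd

# One-hot encode the object (categorical) columns: each category becomes
# a binary column marking its presence (1) or absence (0)
df = pd.get_dummies(df)
```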

Step 11

Scaling is done when a dataset contains variables on different scales, making them difficult to compare. Using the z-score, each variable is scaled so that its mean is 0 and its standard deviation is 1.
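
One way to do this, using scipy's zscore (assuming all columns are numeric at this point, which they are after one-hot encoding):

```python
from scipy.stats import zscore

# Scale every column so that its mean is 0 and its standard deviation is 1
df_scaled = df.astype(float).apply(zscore)
```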

Once the data is scaled, we can find the correlations among variables using corr(), and then plot a heatmap to visualize them.
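
A sketch of the correlation matrix and heatmap (the colormap choice is arbitrary):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation matrix of the scaled variables
corr = df_scaled.corr()

# Heatmap to visualize the correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```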

Now that EDA has unfolded all the relations between your variables, quickly grab some popcorn and watch how the model performs after the interval.
