Exploratory Data Analysis: Iris dataset

Zion Oladiran
6 min read · Jul 22, 2021

In this article, I am going to walk you through the data analysis process with Python. I’ll be using a dataset from Kaggle called Iris Species, which you can download to perform the analysis.

The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species.

With only a handful of variables (columns), this dataset is easy to analyze as a beginner.

Iris flower

I’ll start with downloading and cleaning the dataset, move on to the analysis and visualization, and then tell a story about the data findings.

For this entire analysis, I will be using a Jupyter Notebook. You can use any Python IDE you like.

The Analysis

Exploratory Data Analysis is essential: it is good practice to first understand the problem statement and the relationships between the data features before drawing insights and making predictions.

Read the data

After downloading the dataset, you will need to read the .csv file as a data frame in Python. You can do this using the Pandas library, which is used for data manipulation and aggregation.

After reading the dataset, we should inspect it by checking the first few rows (i.e., the head). To check the head of the data frame, run:
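The exact code for this step lives in the screenshot, so here is a minimal sketch of what it might look like. To keep it runnable without the download, a few rows in the Kaggle column layout stand in for the file; the file name Iris.csv mentioned in the comment is an assumption.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few rows in the Kaggle column layout.
# With the real download, pass its path instead, e.g. pd.read_csv("Iris.csv")
# (the file name is an assumption).
sample_csv = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,4.9,3.0,1.4,0.2,Iris-setosa\n"
    "51,7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "52,6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "101,6.3,3.3,6.0,2.5,Iris-virginica\n"
    "102,5.8,2.7,5.1,1.9,Iris-virginica\n"
)

iris_df = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default; pass a number to see more or fewer
print(iris_df.head())
```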

Image by Author

From the screenshot above, we can see 5 different variables related to the Iris flower (i.e., Sepal Length, Sepal Width, Petal Length, Petal Width, and Species).

Gaining information from data

We need to know further information about the columns like the column name, the number of non-null values in each column, the data type of the data, and memory usage. We can check that using the info() method.

Missing values in the dataset need to be checked and handled because they can create imbalanced observations, cause biased estimates, and, in extreme cases, even lead to invalid conclusions.
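As a rough sketch of both checks, the snippet below calls info() and then counts missing values per column. It uses a tiny toy frame (not the real Iris data) with one value deliberately left out, so that the missing-value count has something to find.

```python
import pandas as pd

# Toy frame (NOT the real Iris data) with one deliberately missing value
toy_df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, None],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Prints column names, non-null counts, dtypes, and memory usage
toy_df.info()

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```

On the real data frame, every entry of this count would be zero.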

Image by Author

From this, we can see that the data frame contains 4 numerical variables and 1 categorical variable (Species). There are also no missing values in the data frame.

Statistical Insight

We need to know the overall statistical information of the dataset before analyzing further. This includes the mean, median, and other statistical properties of the numerical variables.

The describe method of Pandas generates descriptive statistics of the data in the data frame and helps in getting a quick overview of the dataset.
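Here is a small sketch of describe() on a single column, using six sepal lengths as a toy sample rather than the full dataset. The row labels of the result (count, mean, std, min, 25%, 50%, 75%, max) can be used to pull out individual statistics.

```python
import pandas as pd

# Toy sample of six sepal lengths, not the full 150-row dataset
toy_df = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]})

# count, mean, std, min, quartiles, and max for each numeric column
summary = toy_df.describe()
print(summary)

# The "50%" row is the median
median_sepal_length = summary.loc["50%", "SepalLengthCm"]
print(median_sepal_length)
```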

Image by Author

From the image above, we can say that the maximum sepal length of an Iris flower is 7.9 cm while the minimum sepal length is 4.3 cm.

Checking the distribution of each species in the data set

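We need to check the distribution of the Iris Species. A balanced dataset, with roughly the same number of samples per class, keeps comparisons between species fair and avoids biasing later models toward the largest class.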
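One way to do this check, sketched on a toy Species column with two samples per class, is value_counts(). The is_balanced flag is just a helper name I am introducing here for illustration.

```python
import pandas as pd

# Toy Species column: two samples per class instead of the real 50
species = pd.Series(
    ["Iris-setosa", "Iris-setosa",
     "Iris-versicolor", "Iris-versicolor",
     "Iris-virginica", "Iris-virginica"],
    name="Species",
)

# Number of samples for each species
counts = species.value_counts()
print(counts)

# Balanced here means every class has exactly the same count
is_balanced = counts.nunique() == 1
print(is_balanced)
```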

Image by Author

From the screenshot above, we see that each species of the iris flower has 50 samples. This means our data is balanced.

Since there are no missing or duplicate rows in the data frame as seen above, we don’t need to do any additional data cleaning.
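If you want to verify the duplicate check yourself, a sketch along these lines could work. It uses a toy frame with one repeated row. Note that because the Id column is unique for every row, checking the full Kaggle frame can never find repeats, so comparing only the measurement and species columns is the stricter test.

```python
import pandas as pd

# Toy frame (NOT the real data) where the second row repeats the first
toy_df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# duplicated() marks rows that are identical to an earlier row
n_duplicates = int(toy_df.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy_df.drop_duplicates()
print(len(deduplicated))
```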

Data Visualization

I’ll be using two libraries, Matplotlib and Seaborn, to explore the relationships between the variables.

Importing relevant libraries

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

Species count

plt.title('Species Count')
sns.countplot(x='Species', data=iris_df)

This shows the number of samples of each species of the Iris flower. Iris-setosa, Iris-versicolor, and Iris-virginica have 50 samples each.

Comparison between various species based on sepal length and width

plt.figure(figsize=(14,6))
plt.title('Comparison on various species based on Sepal length and width')
sns.scatterplot(x=iris_df['SepalLengthCm'], y=iris_df['SepalWidthCm'], hue=iris_df['Species'], s=50)

Image by Author

From the plot above, we can say:

  1. Iris-setosa has smaller sepal lengths and larger sepal widths.
  2. Iris-versicolor lies in the middle for both sepal length and sepal width.
  3. Iris-virginica has larger sepal lengths and smaller sepal widths.

Comparison between various species based on petal length and width

plt.figure(figsize=(14,6))
plt.title('Comparison on various species based on Petal length and width')
sns.scatterplot(x=iris_df['PetalLengthCm'], y=iris_df['PetalWidthCm'], hue=iris_df['Species'], s=50)

Image by Author

From the plot above, we can say:

  1. Iris-setosa has the smallest petal length and petal width.
  2. Iris-versicolor has average petal length and petal width.
  3. Iris-virginica has the largest petal length and petal width.
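A numeric way to back up a visual reading like this is to average each petal measurement per species. The sketch below does that with groupby() on just two rows per species, so the averages are only illustrative and will differ from those of the full dataset.

```python
import pandas as pd

# Two rows per species only, so these averages are illustrative
sample_df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor",
                "Iris-virginica", "Iris-virginica"],
})

# Mean petal length and width for each species
petal_means = sample_df.groupby("Species")[["PetalLengthCm", "PetalWidthCm"]].mean()
print(petal_means)

# Species ordered from smallest to largest average petal length
order_by_length = petal_means.sort_values("PetalLengthCm").index.tolist()
print(order_by_length)
```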

Checking Correlation

We can find the pairwise correlation between the numeric columns of the data using the corr() method. Note that non-numeric columns such as Species cannot be correlated: older versions of Pandas dropped them silently, while Pandas 2.0 and later raise an error unless they are excluded first (or numeric_only=True is passed). Missing values are automatically excluded.

The resulting coefficient is always a value between -1 and 1 inclusive, where:

1: Total positive linear correlation

0: No linear correlation (the two variables may still be related in a nonlinear way)

-1: Total negative linear correlation

Pearson correlation is the default method of corr().
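To make the coefficient less of a black box, here is a sketch that computes Pearson's r by hand (the covariance of the two variables divided by the product of their spreads) and compares it with what Pandas returns. The function name pearson_r is my own, and the six value pairs are a toy sample rather than the full dataset.

```python
import math

import pandas as pd


def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the
    product of the summed squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


# Toy sample: six petal length / petal width pairs
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

r_manual = pearson_r(petal_length, petal_width)
r_pandas = pd.Series(petal_length).corr(pd.Series(petal_width))

# The two values should agree up to floating-point rounding
print(r_manual, r_pandas)
```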

A correlation matrix helps us gain a better understanding of how the variables in the dataset relate to each other.

To plot the heatmap:

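plt.figure(figsize=(8,6))
# keep only the numeric columns; Species is text and cannot be correlated
iris_corr = iris_df.select_dtypes(include='number').corr()
sns.heatmap(iris_corr, annot=True)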

Image by Author

From the plot, we see that petal length and petal width are highly correlated, and that petal width and sepal length are also well correlated.

Checking distribution for each species

A box plot is a way of summarizing a set of data measured on an interval scale. It is a graph that gives you a good indication of how the values in the data are spread out. It is also useful in comparing the distribution of data across data sets.

The box plot comprises:

  1. a box: this illustrates the interquartile spread of the distribution; its length is determined by the 25th (Q1) and 75th (Q3) percentiles. The line inside the box marks the median (50th percentile) of the distribution.
  2. whiskers: the lines extending from the box. They cover the data points that fall within the interval [Q1 − 1.5·IQR, Q3 + 1.5·IQR], where IQR = Q3 − Q1 is the interquartile range.
  3. outliers: data points that differ significantly from the majority of the data taken from a sample or population. In a box plot, they are the points beyond the whiskers and are drawn individually.
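The three parts above can be worked out by hand. The sketch below uses made-up numbers (not Iris measurements) chosen so the quartiles come out exact, and the names lower_fence and upper_fence are simply my labels for the two ends of the whisker interval.

```python
import pandas as pd

# Made-up numbers (NOT Iris measurements) with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box: 25th and 75th percentiles and the distance between them
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The whisker interval: 1.5 * IQR beyond each end of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers: anything outside that interval
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

In the plot itself, each whisker stops at the most extreme data point that still lies inside this interval, so it can be shorter than the interval allows.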

Visualizing the distribution and median of each feature using box plots:

fig, axes = plt.subplots(2,2, figsize=(16,9))

sns.boxplot(x='Species', y='PetalWidthCm', data=iris_df, orient='v', ax=axes[0,0])
sns.boxplot(x='Species', y='PetalLengthCm', data=iris_df, orient='v', ax=axes[0,1])
sns.boxplot(x='Species', y='SepalLengthCm', data=iris_df, orient='v', ax=axes[1,0])
sns.boxplot(x='Species', y='SepalWidthCm', data=iris_df, orient='v', ax=axes[1,1])

Image by Author

From the plot above, we can say:

  1. Setosa has some outliers in petal length and petal width (i.e., data points that differ significantly from the majority of the data).
  2. Setosa generally has smaller features (dimensions), and its values are less spread out.
  3. Versicolor has mid-sized features (dimensions) with a moderate spread.
  4. Virginica generally has larger features (dimensions), and its values are more spread out.

The above are the steps that I personally follow for Exploratory Data Analysis on datasets, but there are various other plots and techniques which we can use to explore more into the data. Thanks for reading.

Feel free to comment and share. You can also access the notebook here.
