Exploratory Data Analysis: Students Performance in Exam
The dataset contains the marks scored in various subjects by high school students in the United States and is available on Kaggle (Student Performance in Exams). It has 1000 rows (occurrences) and 8 columns:
- gender
- race / ethnicity
- parental level of education
- lunch
- test preparation course
- math score
- reading score
- writing score
We will look at the performance of the class in each subject, the effect of the parental level of education on student performance, and the relationship between gender and student performance.
For this entire analysis, I will be using a Jupyter Notebook. You can use any Python IDE you like. We will start by downloading and cleaning the dataset, then move on to the dataset analysis and visualization.
Import the libraries
NumPy is used for numerical calculations and scientific computing.
Pandas is used for data manipulation operations such as merging, reshaping, and selecting, as well as for data cleaning and data wrangling.
Matplotlib and Seaborn are libraries for data visualization.
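The import cell could look like the sketch below. The aliases (np, pd, plt, sns) are the usual community conventions rather than anything the libraries require.

```python
import numpy as np               # numerical computing
import pandas as pd              # data frames and data manipulation
import matplotlib.pyplot as plt  # base plotting library
import seaborn as sns            # statistical plots built on Matplotlib

# In a Jupyter Notebook you can uncomment the next line to render plots inline.
# %matplotlib inline
```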
Read the dataset
Before performing any analysis, we need to read the data into a data frame, which can be done with the Pandas read_csv function.
We then use the head method to look at the first few rows and confirm that the data frame holds the kind of data we expect.
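Here is a minimal sketch of this step. So that it runs without the download, it reads a few made-up rows from an in-memory string. The file name StudentsPerformance.csv in the comment is an assumption, as is the exact spelling of the column headers, so adjust both to match your copy of the file.

```python
import io
import pandas as pd

# Illustrative rows only (not taken from the real file).
# With the download in place you would instead call something like
#   df = pd.read_csv("StudentsPerformance.csv")   # assumed file name
sample_csv = io.StringIO(
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,none,71,73,75\n"
    "female,group C,some college,standard,completed,68,89,87\n"
    "male,group A,associate's degree,free/reduced,none,48,56,45\n"
    "male,group C,some college,standard,none,77,79,74\n"
)

df = pd.read_csv(sample_csv)

print(df.head())   # first rows (five by default, fewer if the frame is shorter)
print(df.dtypes)   # text columns versus integer score columns
```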
The data frame contains 8 columns: 5 categorical variables and 3 numerical variables.
To find the number of occurrences in the data frame, we can use the shape attribute.
The data frame consists of 1000 rows (occurrences) and 8 columns (variables).
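Both checks can be sketched as follows. The three rows are made up, so the row count here is 3 rather than 1000, and the column headers are assumed to match the CSV.

```python
import pandas as pd

# Tiny made-up frame with the same eight columns as the dataset.
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "race/ethnicity": ["group B", "group C", "group A"],
    "parental level of education": ["some college", "high school", "master's degree"],
    "lunch": ["standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none"],
    "math score": [71, 55, 88],
    "reading score": [73, 60, 93],
    "writing score": [75, 58, 91],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape

# Split the columns by dtype to count numerical and categorical variables.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(numeric_cols)
print(categorical_cols)
```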
Statistical insights
To get a quick summary of the statistical characteristics of the data frame, we can use the Pandas describe() method. It calculates summary statistics such as the percentiles, mean, and standard deviation of the numerical columns of the DataFrame.
Count: Number of non-missing values
Mean: Average value
Std: Standard deviation
Min: Minimum value
25%: First quartile
50%: Median, or second quartile
75%: Third quartile
Max: Maximum value
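The rows listed above are exactly the index of the frame that describe() returns. A small sketch on made-up scores:

```python
import pandas as pd

# Made-up scores, only to show the layout of the describe() output.
scores = pd.DataFrame({
    "math score": [40, 55, 60, 70, 85],
    "reading score": [45, 58, 66, 72, 90],
    "writing score": [42, 57, 63, 74, 88],
})

summary = scores.describe()   # one column of statistics per numerical column
print(summary)

# Individual statistics can be read by row label.
print(summary.loc["50%"])     # the median of each score
```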
Checking for missing values
Missing values in the dataset need to be checked and handled because they create imbalanced observations, cause biased estimates, and in extreme cases can even lead to invalid conclusions.
The check shows that there are no missing values in the data frame.
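One common way to run this check is isnull() followed by sum(). The toy frame below deliberately contains two gaps so the counts are visible. On the real data every count should be 0, as stated above.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column, for illustration.
df = pd.DataFrame({
    "gender": ["female", "male", None, "female"],
    "math score": [72, np.nan, 65, 90],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole data frame.
has_missing = df.isnull().values.any()
print(has_missing)
```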
Renaming columns
We need to rename our columns to a more readable and consistent format: convert each column name to title case and replace the spaces with underscores (_).
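A possible one-liner for this uses the string methods on the column index. I am assuming the original headers are lowercase with spaces, and that the ethnicity header is written race/ethnicity without spaces around the slash. Note that title() also capitalizes "Of".

```python
import pandas as pd

# Empty frame carrying only the (assumed) original column names.
df = pd.DataFrame(columns=[
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
])

# Title-case every name, then swap spaces for underscores.
df.columns = df.columns.str.title().str.replace(" ", "_", regex=False)

print(list(df.columns))
```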
1. Univariate Visualization
Univariate plots help in understanding the location, distribution, and dispersion of the values of a single variable. They describe the pattern of responses for that variable. Univariate analysis looks at one feature at a time.
When we analyze a feature independently, we are usually most interested in the distribution of its values.
1.1 Quantitative features
Quantitative features take on ordered numerical values. Those values can be discrete (e.g. integers) or continuous (e.g. real numbers) and usually express a count or a measurement.
In this data frame, we have 3 quantitative features (math score, reading score, and writing score).
Let’s see the class distribution of students' performance in math, reading, and writing subjects.
A histogram provides a visual interpretation of numerical data by showing the number of data points that fall within each of a set of value ranges called bins.
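A rough sketch of the three histograms with Matplotlib is below. The scores are made up, the column names follow the renaming step, and the bin width of 10 points is my own choice rather than anything the data requires.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up scores; the real frame has 1000 rows.
scores = pd.DataFrame({
    "Math_Score":    [35, 48, 52, 61, 66, 70, 74, 79, 88, 95],
    "Reading_Score": [42, 55, 58, 67, 70, 73, 80, 84, 90, 99],
    "Writing_Score": [40, 50, 57, 65, 69, 74, 78, 83, 91, 100],
})

bins = list(range(0, 101, 10))   # bin edges 0, 10, ..., 100

# One panel per subject, sharing the y-axis so the counts are comparable.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, column in zip(axes, scores.columns):
    ax.hist(scores[column], bins=bins, edgecolor="black")
    ax.set_title(column)
    ax.set_xlabel("score")
axes[0].set_ylabel("number of students")
fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes. In a plain script you would add plt.show() at the end.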
To get a good indication of how the values are spread out in a dataset, you can use a box plot. Its main components are:
Median: The median (second quartile) marks the mid-point of the data and is shown by the line that divides the box into two parts.
Whiskers: The whiskers are the lines extending from the box. They cover the data points that fall within the interval (Q1 − 1.5⋅IQR, Q3 + 1.5⋅IQR), where IQR = Q3 − Q1 is the interquartile range.
Inter-quartile range: The box spans the range from the lower quartile (Q1) to the upper quartile (Q3), known as the inter-quartile range. The middle 50% of scores fall within it.
Outliers: An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. In a box plot, points beyond the whiskers are drawn individually as outliers.
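To make these components concrete, the sketch below computes the quartiles, the IQR, and the 1.5⋅IQR fences by hand on a made-up series, then draws the matching box plot. Matplotlib's default whisker setting uses the same 1.5⋅IQR rule.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up math scores with one unusually low value.
math = pd.Series([8, 45, 52, 58, 61, 66, 70, 74, 79, 85, 92], name="Math_Score")

# Quartiles and inter-quartile range.
q1, median, q3 = math.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = math[(math < lower_fence) | (math > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())

# The same information as a box plot.
fig, ax = plt.subplots(figsize=(4, 5))
ax.boxplot(math)
ax.set_xticklabels(["Math_Score"])
ax.set_ylabel("score")
```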
To compare, by gender, the students who scored above 80% in each subject:
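One way to do this count, sketched on made-up rows, is to turn each score into a True/False flag for "above 80" and sum the flags per gender. The column names assume the renaming step has been applied.

```python
import pandas as pd

# Made-up rows; the counts below say nothing about the real dataset.
df = pd.DataFrame({
    "Gender": ["female", "female", "female", "male", "male", "male"],
    "Math_Score": [78, 85, 60, 91, 83, 70],
    "Reading_Score": [92, 88, 65, 75, 81, 68],
    "Writing_Score": [95, 86, 70, 72, 79, 66],
})

score_columns = ["Math_Score", "Reading_Score", "Writing_Score"]

# gt(80) flags scores strictly above 80; summing True values counts students.
above_80 = df[score_columns].gt(80).groupby(df["Gender"]).sum()
print(above_80)
```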
From the analysis above, we can say that more males than females scored above 80% in mathematics, while more females than males scored above 80% in reading and writing.
1.2 Categorical/Binary features
Categorical features can take on only a limited, and usually fixed, number of possible values. They assign each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.
Frequency table
Let’s create a frequency table for the categorical variables in our data frame (i.e., Gender, Race/Ethnicity, Parental Level of Education, Lunch, and Test Preparation Course).
To check the distribution of the students’ ethnicity:
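A frequency table with both counts and percentages can be built from value_counts(). This is a sketch on eight made-up rows, so the percentages will not match the real figures.

```python
import pandas as pd

# Made-up group labels in the style of the dataset.
df = pd.DataFrame({
    "Race/Ethnicity": ["group C", "group B", "group C", "group A",
                       "group D", "group C", "group B", "group E"],
})

# Absolute counts, and shares of the total expressed as percentages.
counts = df["Race/Ethnicity"].value_counts()
percent = df["Race/Ethnicity"].value_counts(normalize=True).mul(100).round(1)

frequency_table = pd.DataFrame({"count": counts, "percent": percent})
print(frequency_table)
```

The same two lines work for any of the other categorical columns.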
We see that 31.9% of the students belong to ethnic group C.
To check the distribution of the parents' level of education:
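Because education levels have a natural order, it can help to list the percentages in that order instead of by frequency. The sketch below does so on made-up rows. The six level labels are my assumption about how they are spelled in the CSV, so check them against your own data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed spelling of the six levels, from lowest to highest.
education_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]

# Made-up rows for illustration.
df = pd.DataFrame({
    "Parental_Level_Of_Education": [
        "some college", "associate's degree", "high school", "some college",
        "master's degree", "some high school", "associate's degree",
        "bachelor's degree", "some college", "high school",
    ],
})

# Percentage per level, reordered from lowest to highest level.
percent = (
    df["Parental_Level_Of_Education"]
    .value_counts(normalize=True)
    .reindex(education_order, fill_value=0)
    .mul(100)
)
print(percent)

# Horizontal bars keep the long labels readable.
ax = percent.plot(kind="barh")
ax.set_xlabel("percent of students")
```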
From the analysis above, only 5.9% of the parents have attained a master's degree. A larger percentage of parents have some college education or an associate's degree.
2. Multivariate visualization
Multivariate plots allow us to see relationships between two or more different variables.
Correlation matrix
Let’s look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.
First, we will call the corr() method on the DataFrame, which calculates the correlation between each pair of numerical features. Then, we pass the resulting correlation matrix to heatmap() from seaborn, which renders a color-coded matrix for the provided values.
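The two steps might look like the sketch below, again on made-up scores. The annot, fmt, cmap, vmin, and vmax arguments are styling choices of mine and not required.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up scores; column names assume the renaming step.
scores = pd.DataFrame({
    "Math_Score":    [35, 48, 52, 61, 66, 70, 74, 79, 88, 95],
    "Reading_Score": [42, 55, 50, 67, 73, 70, 80, 77, 90, 99],
    "Writing_Score": [40, 52, 49, 65, 74, 72, 78, 80, 91, 100],
})

# Pairwise Pearson correlation (the default method of corr()).
corr_matrix = scores.corr()
print(corr_matrix)

# Color-coded matrix with the coefficients written in each cell.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between scores")
```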
From the heatmap, we see that reading scores and writing scores have a very strong positive linear correlation, while math scores and reading scores have a correlation of 0.8.
To check the gender distribution of the class population:
From the analysis above, a slightly larger share (51.8%) of the student population is female.
Next, the share of students who completed the test preparation course:
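A pie chart is one simple way to show this share. In the sketch below, the two labels (none and completed) are my assumption about the values in this column, and the ten rows are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up column values for illustration.
prep = pd.Series(
    ["none", "completed", "none", "none", "completed",
     "none", "none", "completed", "none", "none"],
    name="Test_Preparation_Course",
)

# Share of each category as a percentage of all students.
share = prep.value_counts(normalize=True).mul(100)
print(share)

fig, ax = plt.subplots()
ax.pie(share, labels=list(share.index), autopct="%1.1f%%", startangle=90)
ax.set_title("Test preparation course")
```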
From the chart, we see that only 35.8% of the students completed the test preparation course.
Is completing the test preparation course associated with better performance in maths?
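Alongside a chart, a grouped summary gives the numbers directly. Here is a sketch on made-up rows, assuming the renamed columns. Keep in mind that a gap between the groups shows an association and does not prove that the course caused it.

```python
import pandas as pd

# Made-up rows; the values are not results from the real dataset.
df = pd.DataFrame({
    "Test_Preparation_Course": ["none", "completed", "none",
                                "completed", "none", "completed"],
    "Math_Score": [60, 75, 58, 82, 66, 71],
})

# Group size, mean, and median math score for each course status.
summary = (
    df.groupby("Test_Preparation_Course")["Math_Score"]
      .agg(["count", "mean", "median"])
)
print(summary)
```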
From the chart above, we see that those who completed the test preparation course tended to score higher on the maths exam than those who did not.
We should now have a good understanding of our variables, along with both univariate and multivariate visualizations of them. You can access the notebook here. Feel free to like, share and comment. Gracias!