46 Exploratory Data Analysis and Summary Statistics
46.1 Introduction
Summary statistics are very useful in Exploratory Data Analysis (henceforth EDA) because they condense a large amount of information into a few numbers that are easier to interpret.
When deciding which summary statistics to use, we need to keep two things in mind:
- the research question (and how many variables this question involves)
- the data (if it is quantitative or categorical)
46.2 Univariate Statistics
Summary statistics that focus on only one variable are called univariate statistics.
For example, suppose we have a dataset about the books in a particular bookstore, and one of its columns contains the price of each book.
Our question is how much a book costs on average in this bookstore.
To answer it, we calculate the mean of the values in the prices column. Because this involves only one variable, the mean is a univariate summary statistic here.
46.2.1 Quantitative variables
When we deal with quantitative variables, we usually need to describe the central location and the spread of their values.
46.2.1.1 Central location
By finding the central location (also known as central tendency), we find the “typical” value of a variable. There are several ways to measure this typical value:
- Mean (also known as the average): the sum of all values divided by the number of values.
- Median: the middle value of a variable when the values are sorted.
- Mode: the most frequent value of the variable.
- Trimmed mean: the mean calculated after excluding a chosen percentage of the lowest and highest data points.
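The four measures listed above can be sketched with pandas and SciPy. This is a minimal illustration, not a prescribed method: the prices below are made-up values (the last one is deliberately extreme), and the 10 percent trimming proportion is an arbitrary choice for the example.

```python
import pandas as pd
from scipy.stats import trim_mean

# Hypothetical book prices; the last value is deliberately extreme.
prices = pd.Series([10.0, 12.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 136.0])

# Mean: sum of the values divided by how many there are.
mean_price = prices.mean()

# Median: middle value of the sorted data (average of the two middle values here).
median_price = prices.median()

# Mode: most frequent value; mode() returns all ties, so we take the first.
mode_price = prices.mode().iloc[0]

# Trimmed mean: drop the lowest 10% and highest 10% of values, then average.
trimmed_mean_price = trim_mean(prices.to_numpy(), proportiontocut=0.1)

print(mean_price, median_price, mode_price, trimmed_mean_price)
```

In this made-up sample the single extreme price pulls the mean well above the median and the trimmed mean, which stay close to each other.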
Sometimes in order to choose the most appropriate summary statistic for a variable we need to combine it with a visualization and domain knowledge.
For example, let’s assume that we get the following values as summary statistics from a variable of cars’ prices:
- Mean = Rs. 63827.18
- Median = Rs. 45000.00
- Mode = Rs. 30000.00
- Trimmed Mean = Rs. 47333.61
If we look closely at these values, we see that the mean is much larger than the median and the trimmed mean, which might mean that there are some outliers in the dataset.
46.2.1.2 Spread
Spread, or dispersion, describes the variability within a feature. This is important because it provides context for measures of central location. For example, if there is a lot of variability in car prices, we can be less certain that any particular car will be close to Rs. 45000.00 (the median price). As with central location, there are several measures that can describe the spread:
- Range: The difference between the maximum and minimum values in a variable.
- Inter-Quartile Range (IQR): The difference between the 75th and 25th percentile values.
- Variance: The average of the squared distance from each data point to the mean.
- Standard Deviation (SD): The square root of the variance.
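- Mean Absolute Deviation (MAD): The mean absolute value of the distance between each data point and the mean.
For example, for a column column_name of a pandas DataFrame called data:
# Range
data.column_name.max() - data.column_name.min()
# Interquartile range
data.column_name.quantile(0.75) - data.column_name.quantile(0.25)
# Alternative way, using SciPy
from scipy.stats import iqr
iqr(data.column_name)
# Variance (pandas computes the sample variance, ddof=1, by default)
data.column_name.var()
# Standard deviation (also ddof=1 by default)
data.column_name.std()
# Mean absolute deviation (Series.mad() was removed in pandas 2.0, so compute it directly)
(data.column_name - data.column_name.mean()).abs().mean()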
Choosing the most appropriate measure of spread is much like choosing a measure of central tendency, in that we need to consider the data holistically. For example, below are measures of spread calculated for selling_price:
- Range: Rs. 9970001
- IQR: Rs. 420001
- Variance: 650044550668.61 (Rs^2)
- Standard Deviation: Rs. 806253.40
- Mean Absolute Deviation: Rs. 42213.14
We see that the range is almost 10 million Rupees; however, this could be due to a single 10 million Rupee car in the dataset. If we remove that one car, the range might be much smaller. The IQR is useful in comparison because it trims away outliers.
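To make the effect of a single extreme car concrete, here is a small sketch. The prices are invented for the illustration (they are not the selling_price data), and value_range and interquartile_range are hypothetical helper names.

```python
import pandas as pd

# Hypothetical selling prices in Rupees; the last car is an extreme outlier.
prices = pd.Series(
    [100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 10000000]
)

def value_range(s):
    """Difference between the maximum and minimum values."""
    return s.max() - s.min()

def interquartile_range(s):
    """Difference between the 75th and 25th percentile values."""
    return s.quantile(0.75) - s.quantile(0.25)

# The same data with the single 10 million Rupee car removed.
without_outlier = prices[prices < 10000000]

print("Range:", value_range(prices), "->", value_range(without_outlier))
print("IQR:  ", interquartile_range(prices), "->", interquartile_range(without_outlier))
```

Removing one car shrinks the range from 9.9 million to 0.7 million Rupees, while the IQR only moves from 400000 to 350000.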
Meanwhile, we see that the variance is extremely large. This happens because variance is calculated using squared differences, and is therefore not in the same units as the original data, making it less interpretable. Both the standard deviation and MAD are expressed in the original units and so solve this issue, and MAD is less affected by extreme outliers than the standard deviation because it does not square the differences.
For highly skewed data or data with extreme outliers, we therefore might prefer to use IQR or MAD. For data that is more normally distributed, the variance and standard deviation are frequently reported.
46.2.2 Categorical Variables
Categorical variables can be either ordinal (ordered) or nominal (unordered). For ordinal categorical variables, we may still want to summarize central location and spread. However, because ordinal categories are not necessarily evenly spaced (like numbers), we should NOT calculate the mean of an ordinal categorical variable (or anything that relies on the mean, like variance, standard deviation, and MAD).
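Measures that depend only on order or frequency, such as the median and the mode, remain available for ordinal data. The sketch below assumes a hypothetical condition rating with four ordered levels and an odd number of observations, so that the median falls on a single category.

```python
import pandas as pd

# Hypothetical ordered levels, from worst to best.
levels = ["Poor", "Fair", "Good", "Excellent"]

# Hypothetical ratings stored as an ordered categorical variable.
condition = pd.Series(
    pd.Categorical(
        ["Good", "Fair", "Excellent", "Good", "Poor", "Fair", "Good"],
        categories=levels,
        ordered=True,
    )
)

# Each category has an integer code reflecting its position in the order.
# The median of the codes identifies the middle category.
median_code = int(condition.cat.codes.median())
median_level = levels[median_code]

# The mode is simply the most frequent category.
mode_level = condition.mode().iloc[0]

print(median_level, mode_level)
```

With an even number of observations the median code can fall between two categories, in which case we would need to decide how to report it.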
For nominal categorical variables (and ordinal categorical variables), another common numerical summary statistic is the frequency or proportion of observations in each category. This is often reported using a frequency table and can be visualized using a bar plot.
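A frequency table can be sketched with value_counts in pandas. The fuel types below are made-up values used only for illustration.

```python
import pandas as pd

# Hypothetical nominal variable: the fuel type of eight cars.
fuel = pd.Series(
    ["Petrol", "Diesel", "Petrol", "CNG", "Diesel", "Petrol", "Petrol", "Diesel"]
)

# Frequency: number of observations in each category.
counts = fuel.value_counts()

# Proportion: share of observations in each category (sums to 1).
proportions = fuel.value_counts(normalize=True)

print(counts)
print(proportions)
```

The counts object can then be passed to a plotting method such as counts.plot.bar() to draw the bar plot, assuming matplotlib is installed.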
46.3 Bivariate Statistics
In contrast to univariate statistics, bivariate statistics are used to summarize the relationship between two variables. They are useful for answering questions like:
- Do manual transmission cars tend to cost more or less than automatic transmission?
- Do older cars tend to cost less money?
- Are automatic transmission cars more likely to be sold by individuals or dealers?
The appropriate summary statistic depends on the types of the two variables whose relationship we want to summarize.
46.4 Relationship between types of variables
46.4.1 One Quantitative Variable and One Categorical Variable
If we want to know whether manual transmission cars tend to cost more or less than automatic transmission cars, we are interested in the relationship between transmission (categorical) and selling_price (quantitative). To answer this question, we can use a mean or median difference.
For example, we could calculate that the median price of automatic transmission cars is 100000 Rupees higher than for manual transmission cars.
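One way to compute such a difference is to group by the categorical variable and compare the group summaries. The six cars below are invented for the sketch; only the column names transmission and selling_price come from the text.

```python
import pandas as pd

# Hypothetical cars; prices are in Rupees.
cars = pd.DataFrame(
    {
        "transmission": ["Manual", "Manual", "Manual",
                         "Automatic", "Automatic", "Automatic"],
        "selling_price": [200000, 300000, 400000, 350000, 400000, 900000],
    }
)

# Median and mean selling price within each transmission type.
medians = cars.groupby("transmission")["selling_price"].median()
means = cars.groupby("transmission")["selling_price"].mean()

# Differences, automatic minus manual.
median_diff = medians["Automatic"] - medians["Manual"]
mean_diff = means["Automatic"] - means["Manual"]

print(median_diff, mean_diff)
```

In this made-up sample the mean difference is larger than the median difference because one expensive automatic car pulls the automatic mean upward.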
46.4.2 Two Quantitative Variables
If we want to know whether older cars tend to cost less money, we are interested in the relationship between year and selling_price, both of which are quantitative. To answer this question, we can use the Pearson correlation.
For example, if we calculate that the correlation between year and selling_price is 0.4, we can conclude that there is a moderate positive association between these variables: cars with a later year (newer cars) tend to cost more, so older cars do tend to cost less money.
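In pandas, the Pearson correlation between two columns can be computed with Series.corr, whose default method is Pearson. The five cars below are hypothetical and chosen only to show the call.

```python
import pandas as pd

# Hypothetical cars: manufacturing year and selling price in Rupees.
cars = pd.DataFrame(
    {
        "year": [2010, 2012, 2014, 2016, 2018],
        "selling_price": [150000, 250000, 200000, 400000, 450000],
    }
)

# Pearson correlation coefficient, a value between -1 and 1.
r = cars["year"].corr(cars["selling_price"])

print(round(r, 2))
```

For this made-up sample the coefficient is about 0.92, a strong positive association. Real data would usually be noisier.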
46.4.3 Two Categorical Variables
If we want to know whether automatic transmission cars are more likely to be sold by individuals or dealers, we are interested in the relationship between transmission and seller_type, both of which are categorical. We can explore this relationship using a contingency table and the Chi-Square statistic.
For example, based on the following contingency table, we might conclude that a higher proportion of cars sold by dealers are automatic (compared to cars sold by individuals):
seller_type    Dealer  Individual  Trustmark Dealer
transmission
Automatic         217         212                19
Manual            777        3032                83
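The proportions behind this conclusion, and the Chi-Square statistic, can be sketched as follows. The counts are copied from the contingency table above. With the raw data, a call such as pd.crosstab(data.transmission, data.seller_type) would build the same kind of table, assuming the DataFrame is called data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts taken from the contingency table in the text.
table = pd.DataFrame(
    {
        "Dealer": [217, 777],
        "Individual": [212, 3032],
        "Trustmark Dealer": [19, 83],
    },
    index=["Automatic", "Manual"],
)

# Share of automatic cars within each seller type (column-wise proportion).
automatic_share = table.loc["Automatic"] / table.sum(axis=0)

# Chi-Square test of independence between transmission and seller type.
chi2, p_value, dof, expected = chi2_contingency(table)

print(automatic_share.round(3))
print(chi2, p_value, dof)
```

About 22 percent of the cars sold by dealers are automatic, compared with roughly 7 percent of the cars sold by individuals. A large Chi-Square statistic with a small p-value suggests that transmission and seller type are associated.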