42 Data Types and Quality of the Data
42.1 Introduction to Data Types and Quality
Each individual object in a dataset is called an entity, instance, or observation; these terms are used interchangeably.
Each feature of an observation is considered a variable. A well-defined variable:
- Measures only one characteristic.
- Should not itself be a characteristic on its own.
42.2 Working with Missing Data
There are three types of "missing-ness":
- Missing Completely at Random: data is missing and there is no deeper explanation for why it is not there.
- Missing at Random: whether a value is missing can be predicted from the value of another variable. For example, knowing the type of a tree might tell us whether its height is likely to be missing.
- Structurally Missing: data that is missing because we would not expect a value there to begin with. Say we want to count the fruits on trees: on some trees the fruits are visible and thus countable, whereas on others they are not visible and thus uncountable; the latter does not mean that they do not exist.
We need to decide what to do with the missing data; even deciding to do nothing affects our analysis. Take care of missing data at the early stages of the analysis!
Hint: A value of `0` in the dataset is not a `NaN` value. A `NaN` value is a completely missing value.
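A quick sketch of this distinction, using NumPy (the values here are purely illustrative):

```python
import numpy as np

values = [0.0, np.nan, 3.5]

# 0 is a real, present measurement; NaN marks a value that is absent entirely.
print(0 == np.nan)       # False: zero is not "missing"
print(np.isnan(values))  # [False  True False]: only the NaN counts as missing
```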
42.2.1 Missing data
pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. See the Missing Data section.
Missing values should not be included in a Categorical's `categories`, only in the `values`. Instead, it is understood that NaN is different, and is always a possibility. When working with a Categorical's `codes`, missing values will always have a code of `-1`. Methods for working with missing data include `isna()`, `fillna()`, and `dropna()`.
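The behaviour above can be sketched with a small categorical series (the tree species here are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series(["oak", "pine", np.nan, "oak"], dtype="category")

# NaN never appears among the categories, only among the values:
print(list(s.cat.categories))  # ['oak', 'pine']
print(list(s.cat.codes))       # [0, 1, -1, 0] -- missing values get code -1

# The usual missing-data methods apply:
print(s.isna().tolist())       # [False, False, True, False]
print(s.dropna().tolist())     # ['oak', 'pine', 'oak']
print(s.fillna("oak").tolist())
```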
42.3 Accuracy
Accuracy is how well the data captures reality. When we try to measure accuracy, we should distinguish reliable from unreliable variables.
Ways to work toward accuracy:
- Standardization.
- Looking at the outliers and the distribution in order to get an idea of what the data looks like.
- Checking for errors in the data (e.g., inconsistency in how two variables were measured).
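As a sketch of the "look at outliers and the distribution" step, one common convention is the 1.5 × IQR rule; the height values below are hypothetical:

```python
import pandas as pd

# Hypothetical tree heights (metres); one entry looks suspicious.
heights = pd.Series([4.1, 3.8, 4.5, 4.0, 42.0, 3.9])

# Quick look at the distribution:
print(heights.describe())

# Flag outliers with the 1.5 * IQR rule (one possible convention, not the only one):
q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [42.0] -- worth checking against real-world knowledge
```

Flagged values are not automatically errors; real-world knowledge decides whether a 42 m tree is a typo or a genuine giant.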
Keep in mind:
- Every solution has to be tailored to the dataset that we analyze.
- Use real-world knowledge to be sure that the dataset reflects reality.
42.4 Validity
A dataset is valid when it actually measures what we think it measures. Validity is a special kind of quality measure: it is not only about the dataset itself but also connected to the purpose the dataset is used for.
Some datasets are good for answering one question but not others. We undermine the validity of a dataset when we use it to answer a question that it cannot answer.
For instance, we know that the age of a tree can be measured by counting its rings, but we didn't do that; say we measured the width of the tree instead.
We decide that, since the number of rings and the width are related, we will use width as a proxy for age. With that decision, we have just compromised the validity of our dataset. Our data doesn't measure age; it measures width. And even though there is a relationship between the number of rings and the width, it is not a direct relationship, so one cannot be substituted for the other without affecting the validity of our dataset and measures.
Now let's say that we want to know how much our trees grow every year. We found a dataset for the same region from 20 years ago, and we use the locations to match up the old and new measurements. But this data can only tell us how much the trees grew over 20 years, not per year. If we try to use these two datasets to measure yearly growth, we again compromise the validity of the dataset.
Using proxies and inappropriate time spans are just two ways to compromise the validity of a dataset. There are countless ways in which a given dataset can be invalid for answering a given question.
One easy way to spot issues with the validity of the dataset is to ask: Does this variable measure what I think it does?
42.5 Representative Samples
The population is all the observations for the thing we want to measure (e.g., if we want to measure the number of abortions in New York, the population should include every person who can become pregnant).
A sample is the set of observations that we have data about. A sample should be representative; in other words, it should include observations from every sub-category of the population (e.g., people of every skin colour, race, ethnicity, etc.). In some cases, choosing a sample randomly from the population yields more robust results, since we avoid cherry-picking a sample that answers our questions.
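Simple random sampling can be sketched with the standard library; the population of tree IDs here is hypothetical:

```python
import random

# Hypothetical population: IDs for every tree in the region we want to describe.
population = list(range(1000))

# Simple random sampling: every tree has the same chance of being chosen,
# which guards against cherry-picking observations that fit our expectations.
random.seed(0)  # seeded only to make this sketch reproducible
sample = random.sample(population, k=50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 -- sampling without replacement, no duplicates
```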
42.5.1 Sampling errors
If we have a sampling error, then we introduce bias into our dataset.
- Convenience samples: when we include only the measurements and observations that are convenient to collect. For example, if we can't reach a certain area that is important for our sample, we collect data only from areas to which we have some kind of access. To some extent, this is also connected to the validity of the dataset, as we saw earlier.