44 Categorical Variable Data Type

Author

Paschalis Agapitos

44.1 Good practice

Just like with numerical variables, best practices for categorical data storage say that we should match the data type of the column with its real-world variable type. However, the types are a little more nuanced:

Nominal variables are often represented by the object data type. Columns with the object data type can contain any combination of values, including strings, integers, booleans, etc. This means that string operations like .str.lower() are not reliable on object columns: non-string values are returned as missing (NaN) rather than lowercased.
Nominal variables can also be represented by the string data type. However, pandas usually guesses object rather than string, so if you want a column to be a string, you will likely have to convert it explicitly, for example with .astype('string'). This matters most when you want to do string manipulations on a column, such as .str.lower().
Ordinal variables are often stored as objects when their labels are text, but pandas often guesses int since they are frequently encoded as whole numbers. Neither type records the order of the categories.
Binary variables can be represented as bool, but pandas often guesses int or object data types.
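As a minimal sketch of the points above, the snippet below builds a small, made-up DataFrame (the column names color, rating, and in_stock are hypothetical), prints the data types pandas inferred, and then converts two columns explicitly. The exact inferred type for text columns can differ between pandas versions, so the printed dtypes are not shown here.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green"],  # nominal variable
    "rating": [1, 3, 2],                # ordinal variable encoded as whole numbers
    "in_stock": [1, 0, 1],              # binary variable encoded as 0/1
})

# Inspect what pandas guessed for each column.
print(df.dtypes)

# Explicitly convert columns to the types that match the real-world variables.
df["color"] = df["color"].astype("string")
df["in_stock"] = df["in_stock"].astype(bool)

# String operations are available through the .str accessor.
lowered = df["color"].str.lower()
print(lowered.tolist())
```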

44.2 Working with Ordinal Categorical Variables

For ordinal categorical variables, we often want to store two different pieces of information: category labels and their order. None of the data types we have covered so far can store both of these at once.

We can use the .unique() method to inspect the category names:

print(data['column_name'].unique())

At this point, pandas does not know that these categories have an inherent order. Luckily, pandas has a specific data type for categorical variables, called category, that addresses this problem. The pd.Categorical() constructor can be used to store data as type category and to indicate the order of the categories.

dataframe['column_name'] = pd.Categorical(dataframe['column_name'], categories=['bottom', 'mid', 'top'], ordered=True)

bottom, mid, and top are example categories. If the values do not already appear in this order in the data, pd.Categorical lets us set the order we want through the list of categories, written from lowest to highest.

Now, not only does pandas recognize that the column (a shelf column, in this example) is an ordinal variable, it also understands that top > mid > bottom. If we call .unique() on this column again, the output lists the categories in their ranked order.

This is helpful in the event that we would like to sort the column by category; if we use .sort_values(), the DataFrame will be sorted by the logical order of a given column as opposed to the alphabetical order.
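The difference between alphabetical and logical sorting can be sketched with a small, made-up example. The level column and its labels (low, medium, high) are hypothetical. They are used here instead of bottom, mid, and top because those three labels happen to be in alphabetical order already, which would hide the effect.

```python
import pandas as pd

# Hypothetical data: a text column whose labels have a natural order.
df = pd.DataFrame({"level": ["medium", "high", "low", "medium"]})

# Before conversion, sorting is alphabetical.
alphabetical = df.sort_values("level")["level"].tolist()
print(alphabetical)  # ['high', 'low', 'medium', 'medium']

# Convert to an ordered categorical, listing categories from lowest to highest.
df["level"] = pd.Categorical(
    df["level"], categories=["low", "medium", "high"], ordered=True
)

# After conversion, sorting follows the declared category order.
logical = df.sort_values("level")["level"].tolist()
print(logical)  # ['low', 'medium', 'medium', 'high']

# Ordered categoricals also support comparisons against a category label.
at_least_medium = df[df["level"] >= "medium"]
print(len(at_least_medium))  # 3
```

Comparisons such as >= only work because ordered=True was passed; an unordered categorical would raise an error on the same expression.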

44.2.1 One-Hot Encoding

Suppose we want to encode a categorical variable numerically without assigning any kind of weight to its labels. This applies when the variable is nominal, or when it is ordinal but the order does not matter for our purposes; in other words, the values should not be interpreted as following a specific order.

In that case, we can use a different encoding of categorical variables: One-Hot Encoding (OHE). In pandas, this is done with the pd.get_dummies() function.

44.2.1.1 How does OHE work?

Basically, it creates a new binary variable for each of the categories within our original categorical variable. By passing the dataset and the column that we want to encode into pd.get_dummies(), we create a new dataframe in which the original column is replaced by one binary column per category, which we can view when we scroll to the right in the table. Each new column marks whether a row belongs to that category. Depending on the pandas version, the values appear as 1 and 0 or as True and False; passing dtype=int gives 1 and 0 explicitly. This way, we have not assigned any weighting to our nominal variable.

new_df = pd.get_dummies(data=df, columns=['column_name'])
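To make the result concrete, here is a small sketch on made-up data. The color and price columns are hypothetical, and dtype=int is passed only so that the output shows 1 and 0 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical data: one nominal column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10, 12, 9]})

# Replace 'color' with one binary column per category.
new_df = pd.get_dummies(data=df, columns=["color"], dtype=int)

print(new_df.columns.tolist())  # ['price', 'color_blue', 'color_red']
print(new_df["color_red"].tolist())  # [1, 0, 1]
```

The new columns are named after the original column and the category, and the untouched columns keep their place at the front of the dataframe.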

It is important to note that OHE works best when we do not create too many additional variables, as increasing the dimensionality of our dataframe can create problems when working with certain machine learning models.
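One way to keep an eye on this, sketched below under stated assumptions, is to count the distinct values in each column before encoding and to encode only the columns that stay under a chosen limit. The data, the column names, and the max_categories threshold are all hypothetical; a sensible limit depends on the dataset and the model.

```python
import pandas as pd

# Hypothetical data: 'plan' has few categories, 'user_id' has one per row.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "pro", "free"],
    "user_id": ["u1", "u2", "u3", "u4", "u5"],
})

max_categories = 3  # hypothetical threshold, chosen only for illustration

# Number of distinct values per column.
counts = df.nunique()
print(counts)

# Encode only the columns that would add few new variables.
to_encode = [col for col in df.columns if counts[col] <= max_categories]
encoded = pd.get_dummies(df, columns=to_encode)
print(encoded.columns.tolist())  # ['user_id', 'plan_free', 'plan_pro']
```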