import pandas as pd
23 Pandas Basics — More Useful Methods — Workbook
23.1 Dataset
23.1.1 The Pudding’s Film Dialogue Data
The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”
23.2 Import Pandas
To use the Pandas library, we first need to import
it.
23.3 Change Display Settings
By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.
= 200 pd.options.display.max_rows
23.4 Get Data
For this notebook, we’re going to read in a slightly less clean version of the Pudding data.
= pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8') film_df
23.5 Overview
To look at a random n number of rows in a DataFrame, we can use a method called .sample()
.
5) film_df.sample(
Generate information about all the columns in the data
film_df.describe()
23.6 Rename Columns
= film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'}) film_df
23.7 Clean and Transform Data
23.7.1 Pandas .str
Methods
Remember all the special things that you can do with Python strings aka string methods?
Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str
to the method chain.
Pandas String Method | Explanation |
---|---|
df[‘column_name’].str.lower() |
makes the string in each row lowercase |
df[‘column_name’].str.upper() |
makes the string in each row uppercase |
df[‘column_name’].str.title() |
makes the string in each row titlecase |
df[‘column_name’].str.replace('old string', 'new string') |
replaces old string with new string for each row |
df[‘column_name’].str.contains('some string') |
tests whether string in each row contains “some string” |
df[‘column_name’].str.split('delim') |
returns a list of substrings separated by the given delimiter |
df[‘column_name’].str.join(list) |
opposite of split(), joins the elements in the given list together using the string |
For example, to transform every character’s name in the “character” column from lowercase to title case, we can use .str.title()
To transform every character’s name in the “character” column to lowercase, we can use .str.lower()
'character'] film_df[
'character'].str.title() film_df[
The transformation above was temporary. Change the DataFrame column “character” to title case permanently.
#Your code here
23.8 Filter with .contains()
We can use the .str.contains()
to search for particular words or phrases in a column, such as “Star Wars.”
= film_df['title'].str.contains('Star Wars') star_wars_filter
film_df[star_wars_filter]
= film_df['title'].str.contains('Twilight') twilight_filter
film_df[twilight_filter]
Make your own filter with .str.contains()
= ... new_filter
film_df[new_filter]
23.9 Applying Functions
You can apply Python functions to DataFrames with multiple different methods. We will explore two of them here.
Let’s say we want to transform the “gross” column, which reports how much money each film made, into a full number. Right now if a film made $2 million, it simply says 2.0
. But maybe we want it to say 20000000
.
def make_millions(number):
= number * 1000000
new_number return new_number
2.0) make_millions(
2000000.0
23.9.1 Method 1
You can use regular Python to loop through the Series film_df['gross']
and make a list with all your new values.
= []
gross_in_millions for gross in film_df['gross']:
= make_millions(gross)
gross_mill gross_in_millions.append(gross_mill)
Then you can add that list as a new column to the DataFrame.
'gross_millions'] = gross_in_millions film_df[
film_df
23.9.2 Method 2
You can directly apply a function to every row in a column (or Series) with a special Pandas methods called .apply()
.
'gross_millions2'] = film_df['gross'].apply(make_millions) film_df[
film_df
Let’s transform the column proportion_of_dialogue
into a percentage, e.g., .02 -> 2.
Make a Python function called make_percent()
that will take in a number and then return that number mutlipled by 100.
def ...
Make a new column “percentage_of_dialogue” by applying this function to the “proportion of dialogue” column.
'percentage_of_dialogue'] = ... film_df[
23.10 More Useful Methods
23.10.1 Make a Series into a DataFrame
Here’s a Series:
'title')['words'].sum() film_df.groupby(
You can make a Series object into a DataFrame object by using two square brackets instead of one.
'title')[['words']].sum() film_df.groupby(
But perhaps the most useful method, especially after grouping by a category, is to use .reset_index()
. The index is “reset” to ascending numbers, and the groupby category becomes a regular column.
'title')[['words']].sum().reset_index() film_df.groupby(
23.10.2 Change Data Types
You can convert data types with .astype()
(documentation here).
film_df.info()
'script_id'].astype('str') film_df[
Try to convert “age” to an integer and see what happens…
'age'].astype(int) film_df[
Pandas says that you cannot convert NaN values to integers. If we’re ok with keeping the Nan values, we can simply choose to “ignore” the errors.
'age'].astype(int, errors='ignore') film_df[
Convert to a datetime data type
'release_year'], format='%Y') pd.to_datetime(film_df[
23.10.3 Binning
You can use pd.cut()
to bin values (documentation here).
'age'], bins=[0, 25, 50, 75, 100], labels= ['0-25', '25-50', '50-75', '75-100']) pd.cut(film_df[
= pd.cut(film_df['age'], bins=[0, 25, 50, 75, 100], labels= ['0-25', '25-50', '50-75', '75-100']) binned_age
='bar') binned_age.value_counts().plot(kind
'age'].hist(bins=[0, 25, 50, 75, 100]) film_df[
If there is anything wrong, please open an issue on GitHub or email f.pianzola@rug.nl