23  Pandas Basics — More Useful Methods — Workbook

Author

Melanie Walsh


23.1 Dataset

23.1.1 The Pudding’s Film Dialogue Data

The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”


23.2 Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

23.3 Change Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 200

23.4 Get Data

For this notebook, we’re going to read in a slightly less clean version of the Pudding data.

film_df = pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8')

23.5 Overview

To look at a random n number of rows in a DataFrame, we can use a method called .sample().

film_df.sample(5)

Generate information about all the columns in the data

film_df.describe()

23.6 Rename Columns

film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})

23.7 Clean and Transform Data

23.7.1 Pandas .str Methods

Remember all the special things that you can do with Python strings aka string methods?

Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str to the method chain.

Pandas String Method Explanation
df[‘column_name’].str.lower() makes the string in each row lowercase
df[‘column_name’].str.upper() makes the string in each row uppercase
df[‘column_name’].str.title() makes the string in each row titlecase
df[‘column_name’].str.replace('old string', 'new string') replaces old string with new string for each row
df[‘column_name’].str.contains('some string') tests whether string in each row contains “some string”
df[‘column_name’].str.split('delim') returns a list of substrings separated by the given delimiter
df[‘column_name’].str.join(list) opposite of split(), joins the elements in the given list together using the string

For example, to transform every character’s name in the “character” column from lowercase to title case, we can use .str.title()

To transform every character’s name in the “character” column to lowercase, we can use .str.lower()

film_df['character']
film_df['character'].str.title()

The transformation above was temporary. Change the DataFrame column “character” to title case permanently.

#Your code here

23.8 Filter with .contains()

We can use the .str.contains() to search for particular words or phrases in a column, such as “Star Wars.”

star_wars_filter = film_df['title'].str.contains('Star Wars')
film_df[star_wars_filter]
twilight_filter = film_df['title'].str.contains('Twilight')
film_df[twilight_filter]

Make your own filter with .str.contains()

new_filter = ...
film_df[new_filter]

23.9 Applying Functions

You can apply Python functions to DataFrames with multiple different methods. We will explore two of them here.

Let’s say we want to transform the “gross” column, which reports how much money each film made, into a full number. Right now if a film made $2 million, it simply says 2.0. But maybe we want it to say 20000000.

def make_millions(number):
    new_number = number * 1000000
    return new_number
make_millions(2.0)
2000000.0

23.9.1 Method 1

You can use regular Python to loop through the Series film_df['gross'] and make a list with all your new values.

gross_in_millions = []
for gross in film_df['gross']:
    gross_mill = make_millions(gross)
    gross_in_millions.append(gross_mill)

Then you can add that list as a new column to the DataFrame.

film_df['gross_millions'] = gross_in_millions
film_df

23.9.2 Method 2

You can directly apply a function to every row in a column (or Series) with a special Pandas methods called .apply().

film_df['gross_millions2'] = film_df['gross'].apply(make_millions)
film_df

Let’s transform the column proportion_of_dialogue into a percentage, e.g., .02 -> 2.

Make a Python function called make_percent() that will take in a number and then return that number mutlipled by 100.

def ...

Make a new column “percentage_of_dialogue” by applying this function to the “proportion of dialogue” column.

film_df['percentage_of_dialogue'] = ...

23.10 More Useful Methods

23.10.1 Make a Series into a DataFrame

Here’s a Series:

film_df.groupby('title')['words'].sum()

You can make a Series object into a DataFrame object by using two square brackets instead of one.

film_df.groupby('title')[['words']].sum()

But perhaps the most useful method, especially after grouping by a category, is to use .reset_index(). The index is “reset” to ascending numbers, and the groupby category becomes a regular column.

film_df.groupby('title')[['words']].sum().reset_index()

23.10.2 Change Data Types

You can convert data types with .astype() (documentation here).

film_df.info()
film_df['script_id'].astype('str')

Try to convert “age” to an integer and see what happens…

film_df['age'].astype(int)

Pandas says that you cannot convert NaN values to integers. If we’re ok with keeping the Nan values, we can simply choose to “ignore” the errors.

film_df['age'].astype(int, errors='ignore')

Convert to a datetime data type

pd.to_datetime(film_df['release_year'], format='%Y')

23.10.3 Binning

You can use pd.cut() to bin values (documentation here).

pd.cut(film_df['age'], bins=[0, 25, 50, 75, 100], labels= ['0-25', '25-50', '50-75', '75-100'])
binned_age = pd.cut(film_df['age'], bins=[0, 25, 50, 75, 100], labels= ['0-25', '25-50', '50-75', '75-100'])
binned_age.value_counts().plot(kind='bar')
film_df['age'].hist(bins=[0, 25, 50, 75, 100])

If there is anything wrong, please open an issue on GitHub or email f.pianzola@rug.nl