21 Pandas Basics — Part 3

Author

Melanie Walsh

In this lesson, we’re going to introduce more fundamentals of Pandas, a powerful Python library for working with tabular data like CSV files.

We will review skills learned from the last two lessons and introduce how to:

Check for duplicate data
Clean and transform data
Manipulate string data
Apply functions
Reset index
Bonus! Create an interactive data viz

21.1 Dataset

21.1.1 The Pudding’s Film Dialogue Data

Lately, Hollywood has been taking so much shit for rampant sexism and racism. The prevailing theme: white men dominate movie roles.

But it’s all rhetoric and no data, which gets us nowhere in terms of having an informed discussion. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

-Hannah Anderson and Matt Daniels, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”

The dataset that we’re working with in this lesson is taken from Hannah Andersen and Matt Daniels’s Pudding essay, “Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age”. The dataset provides information about 2,000 films from 1925 to 2015, including characters’ names, genders, ages, how many words each character spoke in each film, the release year of each film, and how much money the film grossed. They included character gender information because they wanted to contribute data to a broader conversation about how “white men dominate movie roles.”

Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They claim, in fact, that one of the most frequently asked questions about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?

Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

film_df = pd.read_csv('../data/Pudding/Pudding-Film-Dialogue-Salty.csv', delimiter=",", encoding='utf-8')

When reading in the CSV file, we also specified the encoding and delimiter. The delimiter parameter specifies the character that separates or “delimits” the columns in our dataset. For CSV files, the delimiter will most often be a comma. (CSV is short for Comma Separated Values.) Sometimes, however, the delimiter of a CSV file might be a tab (\t) or, more rarely, another character.

Display Data

We can display a DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

Pandas Review

NaN is the Pandas value for any missing data. See “Working with missing data” for more information.

film_df

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
0	280	betty	311	f	35.0	tt0112579	The Bridges of Madison County	1995	142.0	0.048639
1	280	carolyn johnson	873	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.136534
2	280	eleanor	138	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.021583
3	280	francesca johns	2251	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.352049
4	280	madge	190	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.029715
...	...	...	...	...	...	...	...	...	...	...
23047	9254	maurice	1107	m	71.0	tt0101414	Beauty and the Beast	1991	452.0	0.108967
23048	9254	monsieur d'arqu	114	m	58.0	tt0101414	Beauty and the Beast	1991	452.0	0.011222
23049	9254	mrs. potts	564	f	66.0	tt0101414	Beauty and the Beast	1991	452.0	0.055517
23050	9254	wardrobe	121	f	54.0	tt0101414	Beauty and the Beast	1991	452.0	0.011911
23051	9254	mrs. potts	564	f	66.0	tt0101414	Beauty and the Beast	1991	452.0	0.055517

23052 rows × 10 columns

There are a few important things to note about the DataFrame displayed here:

Index
- The bolded ascending numbers in the very left-hand column of the DataFrame is called the Pandas Index. You can select rows based on the Index.
- By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.
Truncation
- The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.
- The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.
Rows x Columns
- Pandas reports how many rows and columns are in this dataset at the bottom of the output (23,052 x 10 columns).

Display First n Rows

To look at the first n rows in a DataFrame, we can use a method called .head().

film_df.head(10)

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
0	280	betty	311	f	35.0	tt0112579	The Bridges of Madison County	1995	142.0	0.048639
1	280	carolyn johnson	873	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.136534
2	280	eleanor	138	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.021583
3	280	francesca johns	2251	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.352049
4	280	madge	190	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.029715
5	280	michael johnson	723	m	38.0	tt0112579	The Bridges of Madison County	1995	142.0	0.113075
6	280	robert kincaid	1908	m	65.0	tt0112579	The Bridges of Madison County	1995	142.0	0.298405
7	623	bobby korfin	328	m	NaN	tt0179626	15 Minutes	2001	37.0	0.036012
8	623	daphne handlova	409	f	28.0	tt0179626	15 Minutes	2001	37.0	0.044906
9	623	deputy chief fi	347	m	NaN	tt0179626	15 Minutes	2001	37.0	0.038098

Display Random Sample

To look at a random sample of rows, we can use the .sample() method.

film_df.sample(10)

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
3519	1483	auntie em	295	f	63.0	tt0032138	The Wizard of Oz	1939	839.0	0.031426
5395	1991	holden mcneil	9462	m	25.0	tt0118842	Chasing Amy	1997	22.0	0.300973
5238	1957	lightning mcque	1968	m	43.0	tt1216475	Cars 2	2011	209.0	0.155622
2992	1350	tariq asani	267	m	31.0	tt0360009	Spartan	2004	6.0	0.043337
18281	6858	mara	568	f	NaN	tt0114852	Village of the Damned	1995	18.0	0.099006
8333	2764	nurse #1	162	f	37.0	tt0264796	Life as a House	2001	23.0	0.007482
13345	4394	jonathan	1168	m	47.0	tt1691920	I Melt with You	2011	NaN	0.159041
16889	6053	bomber	392	m	NaN	tt1532958	Battle of the Year	2013	9.0	0.035617
15704	5297	mr. green	825	m	38.0	tt0088930	Clue	1985	7.0	0.087701
75	642	kurt zagon	914	m	60.0	tt0094602	Above the Law	1988	40.0	0.116522

Examine Data

Shape

To explicitly check for how many rows vs columns make up a dataset, we can use the .shape method.

film_df.shape

(23052, 10)

There are 23,052 rows and 10 columns.

Data Types

Just like Python has different data types, Pandas has different data types, too. These data types are automatically assigned to columns when we read in a CSV file. We can check these Pandas data types with the .dtypes method.

Pandas Data Type	Explanation
`object`	string
`float64`	float
`int64`	integer
`datetime64`	date time

film_df.dtypes

script_id                   int64
imdb_character_name        object
words                       int64
gender                     object
age                       float64
imdb_id                    object
title                      object
year                        int64
gross                     float64
proportion_of_dialogue    float64
dtype: object

It’s important to always check the data types in your DataFrame. For example, sometimes numeric values will accidentally be interpreted as a string object. To perform calculations on this data, you would need to first convert that column from a string to an integer.

Columns

We can also check the column names of the DataFrame with .columns

film_df.columns

Index(['script_id', 'imdb_character_name', 'words', 'gender', 'age', 'imdb_id',
       'title', 'year', 'gross', 'proportion_of_dialogue'],
      dtype='object')

Summary Statistics

film_df.describe(include='all')

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
count	23052.000000	23050	23052.000000	23052	18266.000000	23052	23052	23052.000000	19390.000000	23052.000000
unique	NaN	17613	NaN	3	NaN	2000	1994	NaN	NaN	NaN
top	NaN	doctor	NaN	m	NaN	tt0116905	Lone Star	NaN	NaN	NaN
freq	NaN	36	NaN	16135	NaN	40	40	NaN	NaN	NaN
mean	4195.580904	NaN	907.870684	NaN	42.386839	NaN	NaN	1998.133828	106.805570	0.086512
std	2473.518048	NaN	1399.538583	NaN	59.714730	NaN	NaN	14.745342	145.992394	0.107740
min	280.000000	NaN	101.000000	NaN	3.000000	NaN	NaN	1929.000000	0.000000	0.001537
25%	2095.000000	NaN	193.000000	NaN	30.000000	NaN	NaN	1992.000000	22.000000	0.019771
50%	3694.000000	NaN	396.000000	NaN	39.000000	NaN	NaN	2001.000000	56.000000	0.042421
75%	6234.000000	NaN	980.000000	NaN	50.000000	NaN	NaN	2009.000000	136.000000	0.104166
max	9254.000000	NaN	28102.000000	NaN	2013.000000	NaN	NaN	2015.000000	1798.000000	0.923422

Do you notice any outliers, anomalies, or potential problems here?

The maximum value in the “age” column is 2013! That seems like an error.

film_df[film_df['age'] == 2013]

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
11639	3737	lucas solomon	190	m	2013.0	tt0993846	The Wolf of Wall Street	2013	125.0	0.011077

Let’s drop this row from the dataset by using the .drop() method and the Index number of the row.

film_df = film_df.drop(11639)

Now if we look for it again, that row is gone.

film_df[film_df['age'] == 2013]

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue

Rename Columns

film_df = film_df.rename(columns={'imdb_character_name': 'character', 'year': 'release_year'})

film_df.head()

	script_id	character	words	gender	age	imdb_id	title	release_year	gross	proportion_of_dialogue
0	280	betty	311	f	35.0	tt0112579	The Bridges of Madison County	1995	142.0	0.048639
1	280	carolyn johnson	873	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.136534
2	280	eleanor	138	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.021583
3	280	francesca johns	2251	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.352049
4	280	madge	190	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.029715

Drop Columns

film_df = film_df.drop(columns='imdb_id')

Missing Data

.isna() / .notna()

Pandas has special ways of dealing with missing data. As you may have already noticed, blank rows in a CSV file show up as NaN in a Pandas DataFrame.

To filter and count the number of missing/not missing values in a dataset, we can use the special .isna() and .notna() methods on a DataFrame or Series object.

The .isna() and .notna() methods return True/False pairs for each row, which we can use to filter the DataFrame for any rows that have information in a given column.

film_df[film_df['character'].isna()]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
4656	1807	NaN	146	f	NaN	Beloved	1998	42.0	0.005162
19448	7445	NaN	520	f	NaN	House of Games	1987	5.0	0.165184

This is important information for the sake of better understanding our dataset. But it’s also important because NaN values are treated as floats, not strings. If we tried to manipulate this column as text data, we would get an error. For this reason, we’re going to replace or “fill” these NaN values with the string “No Character Data” by using the .fillna() method.

film_df['character'] = film_df['character'].fillna('No Character Data')

film_df[film_df['character'].isna()]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue

21.2 Check for Duplicates

21.2.1 Duplicates

We can check for duplicate rows by using the .duplicated() method and setting the parameter keep=False, which will display all the duplicated values in the dataset — rather than just the first duplicated value keep='first' or the last duplicated value keep='last'.

film_df.duplicated(keep=False)

0        False
1        False
2        False
3        False
4        False
         ...  
23047    False
23048    False
23049     True
23050    False
23051     True
Length: 23051, dtype: bool

The output above is reporting whether each row in the dataset is a duplicate. We can use the .duplicated() method inside a filter to isolate only the rows in the dataframe that are exact duplicates.

film_df[film_df.duplicated(keep=False)]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
21001	8099	c-3po	138	m	69.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.020702
21002	8099	c-3po	138	m	69.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.020702
21963	8560	kilgore trout	2673	m	63.0	Breakfast of Champions	1999	NaN	0.197211
21964	8560	kilgore trout	2673	m	63.0	Breakfast of Champions	1999	NaN	0.197211
22935	9222	groomsman #5	238	m	NaN	Wedding Crashers	2005	283.0	0.012032
22936	9222	groomsman #5	238	m	NaN	Wedding Crashers	2005	283.0	0.012032
23049	9254	mrs. potts	564	f	66.0	Beauty and the Beast	1991	452.0	0.055517
23051	9254	mrs. potts	564	f	66.0	Beauty and the Beast	1991	452.0	0.055517

We can drop duplicates from the DataFrame with the .drop_duplicates() method and choose to keep the first instance of the duplicate or the last instance.

film_df = film_df.drop_duplicates(keep='first')

Now if we check the data for duplicates again, they should be all gone.

film_df[film_df.duplicated(keep=False)]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue

21.3 Clean and Transform Data

21.3.1 Pandas `.str` Methods

Remember all the special things that you can do with Python strings aka string methods?

Pandas has special Pandas string methods, too. Many of them are very similar to Python string methods, except they will transform every single string value in a column, and we have to add .str to the method chain.

Pandas String Method	Explanation
df[‘column_name’]`.str.lower()`	makes the string in each row lowercase
df[‘column_name’]`.str.upper()`	makes the string in each row uppercase
df[‘column_name’]`.str.title()`	makes the string in each row titlecase
df[‘column_name’]`.str.replace('old string', 'new string')`	replaces `old string` with `new string` for each row
df[‘column_name’]`.str.contains('some string')`	tests whether string in each row contains “some string”
df[‘column_name’]`.str.split('delim')`	returns a list of substrings separated by the given delimiter
df[‘column_name’]`.str.join(list)`	opposite of split(), joins the elements in the given list together using the string

For example, to transform every character’s name in the “character” column from lowercase to uppercase, we can use .str.upper()

film_df['character'].str.upper()

0                  BETTY
1        CAROLYN JOHNSON
2                ELEANOR
3        FRANCESCA JOHNS
4                  MADGE
              ...       
23046            LUMIERE
23047            MAURICE
23048    MONSIEUR D'ARQU
23049         MRS. POTTS
23050           WARDROBE
Name: character, Length: 23047, dtype: object

To transform every character’s name in the “character” column to lowercase, we can use .str.lower()

film_df['character'].str.lower()

0                  betty
1        carolyn johnson
2                eleanor
3        francesca johns
4                  madge
              ...       
23046            lumiere
23047            maurice
23048    monsieur d'arqu
23049         mrs. potts
23050           wardrobe
Name: character, Length: 23047, dtype: object

If we want to replace the gender columns’s single letter abbreviation for “male” / “female” (sex) with “man” / “woman” (gender identity), we could use the .str.replace() method.

film_df['gender'] = film_df['gender'].str.replace('m', 'man')

film_df['gender'] = film_df['gender'].str.replace('f', 'woman')

film_df.sample(10)

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
18526	7014	julie	256	woman	22.0	Showgirls	1995	40.0	0.023103
10160	3289	charlie mccorry	584	man	40.0	The Searchers	1956	NaN	0.036878
15578	5214	guy	163	man	NaN	The Wedding Singer	1998	148.0	0.011877
649	793	ellis	223	man	NaN	Cellular	2004	44.0	0.023130
892	851	peterson	207	man	NaN	Deception	2008	5.0	0.024247
1190	915	clear rivers	472	woman	24.0	Final Destination	2000	86.0	0.163946
3032	1360	lama su	350	man	48.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.046574
10153	3288	somerset	8838	man	58.0	Se7en	1995	200.0	0.428281
10421	3352	anna crowe	916	woman	31.0	The Sixth Sense	1999	503.0	0.080549
7132	2442	noah	160	man	32.0	The Grapes of Wrath	1940	NaN	0.006841

We can use the .str.contains() to search for particular words or phrases in a column, such as “Star Wars.”

film_df[film_df['title'].str.contains('Star Wars')]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
3017	1359	admiral ackbar	199	man	61.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.039096
3018	1359	ben 'obi-wan' k	462	man	69.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.090766
3019	1359	c-3po	881	man	37.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.173084
3020	1359	darth vader	381	man	48.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.074853
3021	1359	han solo	835	man	41.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.164047
3022	1359	lando calrissia	379	man	46.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.074460
3023	1359	luke skywalker	915	man	32.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.179764
3024	1359	princess leia	359	woman	27.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.070530
3025	1359	the emperor	516	man	39.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.101375
3026	1359	yoda	163	man	19.0	Star Wars: Episode VI - Return of the Jedi	1983	853.0	0.032024
3027	1360	anakin skywalke	2061	man	21.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.274251
3028	1360	cliegg lars	152	man	62.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.020226
3029	1360	count dooku	321	man	80.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.042715
3030	1360	dexter jettster	151	man	67.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.020093
3031	1360	jar jar binks	294	man	29.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.039122
3032	1360	lama su	350	man	48.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.046574
3033	1360	mace windu	365	man	54.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.048570
3034	1360	madame jocasta	331	woman	82.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.044045
3035	1360	obi-wan kenobi	1477	man	31.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.196540
3036	1360	padme	668	woman	21.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.088889
3037	1360	queen jamillia	135	woman	25.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.017964
3038	1360	supreme chancel	576	man	58.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.076647
3039	1360	taun we	139	woman	40.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.018496
3040	1360	watto	140	man	49.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.018629
3041	1360	yoda	355	man	58.0	Star Wars: Episode II - Attack of the Clones	2002	465.0	0.047239
3058	1362	ben obi-wan ken	995	man	63.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.099084
3059	1362	c-3po	1506	man	31.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.149970
3060	1362	cmdr. praji (im	117	man	42.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.011651
3061	1362	darth vader	492	man	46.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.048994
3062	1362	general dodonna	185	man	66.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.018423
3063	1362	grand moff tark	411	man	64.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.040928
3064	1362	greedo	103	man	33.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.010257
3065	1362	han solo	1730	man	35.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.172276
3066	1362	jabba	185	man	NaN	Star Wars: Episode IV - A New Hope	1977	1798.0	0.018423
3067	1362	luke skywalker	2485	man	26.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.247461
3068	1362	princess leia o	636	woman	21.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.063334
3069	1362	red leader	241	man	37.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.023999
3070	1362	red three (bigg	523	man	38.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.052081
3071	1362	stormtrooper	122	man	36.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.012149
3072	1362	uncle owen	311	man	61.0	Star Wars: Episode IV - A New Hope	1977	1798.0	0.030970
10591	3427	anakin skywalke	3768	man	24.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.271138
10592	3427	c-3po	312	man	59.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.022451
10593	3427	Clone Jedi Knig	104	man	NaN	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.007484
10594	3427	darth sldious	244	man	NaN	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.017558
10595	3427	general grievou	317	man	33.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.022811
10596	3427	Ki-Adi-Mundi /	107	man	40.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.007700
10597	3427	Mace Windu	407	man	57.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.029287
10598	3427	mon mothma	127	woman	28.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.009139
10599	3427	obi-wan kenobi	3181	man	34.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.228898
10600	3427	Padme	1561	woman	24.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.112326
10601	3427	senator bail or	479	man	50.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.034468
10602	3427	Supreme Chancel	2367	man	61.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.170325
10603	3427	yoda	923	man	61.0	Star Wars: Episode III - Revenge of the Sith	2005	515.0	0.066417
10604	3433	admiral piett	232	man	43.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.025127
10605	3433	ben (obi-wan) k	177	man	66.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.019170
10606	3433	C-3PO	1474	man	34.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.159645
10607	3433	captain needa	108	man	42.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.011697
10608	3433	darth vader	853	man	49.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.092386
10609	3433	general veers	114	man	45.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.012347
10610	3433	han solo	2024	man	38.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.219214
10611	3433	lando calrissia	798	man	43.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.086429
10612	3433	luke skywalker	1282	man	29.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.138850
10613	3433	princess leia	1025	woman	24.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.111015
10614	3433	rebel force gen	168	man	50.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.018196
10615	3433	snow creature	227	man	NaN	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.024586
10616	3433	yoda	751	man	36.0	Star Wars: Episode V - The Empire Strikes Back	1980	937.0	0.081339
10617	3437	anakin skywalke	1697	man	10.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.158037
10618	3437	boss nass	209	man	63.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.019464
10619	3437	Captain Panaka	390	man	45.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.036320
10620	3437	chancellor valo	127	man	61.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.011827
10621	3437	darth sidious	277	man	NaN	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.025796
10622	3437	jar jar binks	632	man	26.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.058856
10623	3437	mace windu	247	man	51.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.023002
10624	3437	nute gunray	544	man	34.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.050661
10625	3437	obi-wan kenobi	701	man	28.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.065282
10626	3437	padme	718	woman	18.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.066865
10627	3437	queen amidala	656	woman	18.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.061091
10628	3437	qui-gon jinn	2339	man	47.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.217825
10629	3437	ric oli�	132	man	42.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.012293
10630	3437	rune haako	126	man	NaN	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.011734
10631	3437	Senator Palpati	551	man	55.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.051313
10632	3437	shmi skywalker	414	woman	41.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.038555
10633	3437	sio bibble	164	man	60.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.015273
10634	3437	watto	502	man	46.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.046750
10635	3437	yoda	312	man	55.0	Star Wars: Episode I - The Phantom Menace	1999	813.0	0.029056
21000	8099	bala-tik	140	man	NaN	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.021002
21001	8099	c-3po	138	man	69.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.020702
21003	8099	finn	1447	man	23.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.217072
21004	8099	general hux	356	man	32.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.053405
21005	8099	han solo	1225	man	73.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.183768
21006	8099	kylo ren	618	man	32.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.092709
21007	8099	maz kanata	241	woman	32.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.036154
21008	8099	poe dameron	604	man	36.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.090609
21009	8099	princess leia	252	woman	59.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.037804
21010	8099	rey	1363	woman	23.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.204470
21011	8099	stormtrooper	114	man	NaN	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.017102
21012	8099	supreme leader	168	man	51.0	Star Wars: Episode VII - The Force Awakens	2015	927.0	0.025203

We can use the .str.contains() to search for particular words or phrases in a column, such as “Mean Girls.”

film_df[film_df['title'].str.contains('Mean Girls')]

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
13938	4605	aaron samuels	426	man	23.0	Mean Girls	2004	120.0	0.053890
13939	4605	cady heron	2798	woman	18.0	Mean Girls	2004	120.0	0.353953
13940	4605	damian	624	man	26.0	Mean Girls	2004	120.0	0.078937
13941	4605	gretchen wiener	609	woman	22.0	Mean Girls	2004	120.0	0.077040
13942	4605	janis ian	907	woman	22.0	Mean Girls	2004	120.0	0.114738
13943	4605	karen smith	301	woman	19.0	Mean Girls	2004	120.0	0.038077
13944	4605	mr. duvall	365	man	43.0	Mean Girls	2004	120.0	0.046173
13945	4605	mrs. george	125	woman	33.0	Mean Girls	2004	120.0	0.015813
13946	4605	ms. norbury	720	woman	34.0	Mean Girls	2004	120.0	0.091082
13947	4605	regina george	1030	woman	26.0	Mean Girls	2004	120.0	0.130297

21.4 Applying Functions

With the .apply() method, we can run a function on every single row in a Pandas column or dataframe.

def make_text_title_case(text):
    title_case_text = text.title()
    return title_case_text

make_text_title_case("betty")

'Betty'

film_df['character'].apply(make_text_title_case)

0                  Betty
1        Carolyn Johnson
2                Eleanor
3        Francesca Johns
4                  Madge
              ...       
23046            Lumiere
23047            Maurice
23048    Monsieur D'Arqu
23049         Mrs. Potts
23050           Wardrobe
Name: character, Length: 23047, dtype: object

film_df['character'] = film_df['character'].apply(make_text_title_case)

film_df.sample(10)

	script_id	character	words	gender	age	title	release_year	gross	proportion_of_dialogue
21944	8559	Joey Pinero	288	man	40.0	Primal Fear	1996	110.0	0.021510
1698	1044	Anna Mcdoogles	1346	woman	37.0	The Invention of Lying	2009	21.0	0.160105
11675	3746	Randy 'The Ram'	5220	man	56.0	The Wrestler	2008	31.0	0.496009
801	821	Peter	804	man	NaN	Contact	1997	191.0	0.068449
15636	5236	Michael Jackson	151	man	NaN	Willard	1971	101.0	0.016402
1059	885	Lord Howard	154	man	65.0	Elizabeth: The Golden Age	2007	20.0	0.021050
9011	2949	Cosmo Castorini	1196	man	67.0	Moonstruck	1987	179.0	0.063957
473	745	Deputy Da Ellis	462	man	37.0	The Black Dahlia	2006	29.0	0.040477
17745	6535	Rick Cowley	150	man	46.0	The Dark Half	1993	20.0	0.014219
2086	1140	Dr. Arthur Neum	267	man	50.0	The Mask	1994	249.0	0.042995

Filter DataFrame

We can filter the DataFrames for only characters who are men or women.

men_film_df = film_df[film_df['gender'] == 'man']

women_film_df = film_df[film_df['gender'] == 'woman']

Groupby

We can use the .groupby() function to group all the men characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of men speaking.

{margin} Line Breaks If a line of code gets too long, you can create a line break with a backslash `\`

men_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]

	proportion_of_dialogue
title
The Men Who Stare at Goats	1.0
Kagemusha	1.0
The Wild Bunch	1.0
Killing Them Softly	1.0
Stalag 17	1.0
There Will Be Blood	1.0
Fury	1.0
The Revenant	1.0
Saving Private Ryan	1.0
Crimson Tide	1.0
Die Hard: With a Vengeance	1.0
Jeremiah Johnson	1.0
Conquest of the Planet of the Apes	1.0
Duel	1.0
The Hunt for Red October	1.0
Platoon	1.0
Reservoir Dogs	1.0
The Battle of Algiers	1.0
Armored	1.0
Space Jam	1.0

We can use the .groupby() function to group all the women characters in each film and sum up their total dialogue.

By adding a Python string slice, we can identify the top 20 films with the greatest proportion of women speaking.

{margin} Line Breaks If a line of code gets too long, you can create a line break with a backslash `\`

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False)[:20]

	proportion_of_dialogue
title
The Descent	1.000000
Now and Then	1.000000
Precious	0.993541
Martyrs	0.965550
The Hand That Rocks the Cradle	0.933750
Agnes of God	0.922482
Heavenly Creatures	0.919368
The Help	0.916947
3 Women	0.899530
The Watermelon Woman	0.894676
Grandma	0.894137
The Craft	0.891866
Easy A	0.888131
The Others	0.879972
Black Christmas	0.877467
Wadjda	0.870567
For Colored Girls	0.870247
The Roommate	0.869879
Mrs. Winterbourne	0.864730
Practical Magic	0.856693

21.5 Reset Index

We can transform a Groupby object into a DataFrame with a regular Index by tacking on .reset_index().

{margin} Line Breaks If a line of code gets too long, you can create a line break with a backslash `\`

women_film_df.groupby('title')[['proportion_of_dialogue']]\
.sum().sort_values(by='proportion_of_dialogue', ascending=False).reset_index()

	title	proportion_of_dialogue
0	The Descent	1.000000
1	Now and Then	1.000000
2	Precious	0.993541
3	Martyrs	0.965550
4	The Hand That Rocks the Cradle	0.933750
...	...	...
1935	The Last Castle	0.011139
1936	The Damned United	0.010909
1937	Thirteen Days	0.010834
1938	Men in Black 3	0.007812
1939	Full Metal Jacket	0.007170

1940 rows × 2 columns

21.6 Bonus — Interactive Data Visualization

#Import necessary Bokeh modules
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, NumeralTickFormatter
from bokeh.io import output_notebook, show
from bokeh.palettes import RdBu
from bokeh.transform import linear_cmap, jitter

#Set up Bokeh to work in Jupyter notebook
output_notebook()

Loading BokehJS ...

Here’s an interactive data visualization of these Hollywood films by release year and percentage of women dialogue. This data viz was created with the Python library Bokeh, which we will discuss in some later lessons.

Scroll to zoom in
Hover to see more information about each point

To see the code that created this visualization, select “Click to show” below.

#Make groupby into a new DataFrame
dialogue_df = women_film_df.groupby(['title', 'release_year'])[['proportion_of_dialogue']].sum()\
.sort_values(by='proportion_of_dialogue', ascending=False).reset_index()

# Set up the source data that will suppply the x,y columns and the film title hover text
source = ColumnDataSource(dialogue_df)

# Set the hover tool tip to the film title, release year, and proportion of dialogue

TOOLTIPS = [("Title", "@title"),
            ("Year", "@release_year"),
           ("Women Dialogue", "@{proportion_of_dialogue}{%0.2f}")]

#Set up Bokeh plot with title, labels, 
bokeh_plot = figure(title="How Much Do Women Speak in Hollywood Films?", x_axis_label = 'Film Release Year',
                    y_axis_label = 'Amount of Dialogue Spoken By Women',x_range = [1930, 2018], y_range = [0, 1.01],
                 tooltips=TOOLTIPS, width=800, height=550, active_scroll='wheel_zoom')

# Create a red to blue color palette
color_mapper = linear_cmap(field_name='proportion_of_dialogue', palette=RdBu[4], low=1.1, high=0)

# Supply inidivudal points values
bokeh_plot.circle(y='proportion_of_dialogue', x=jitter('release_year', width=.2),
         size = 10,
        line_color='black',
        line_alpha=.4,
         source=source,
         color=color_mapper, alpha=.5)

bokeh_plot.title.text_font_size='20pt'

#Make Y axis percentages
bokeh_plot.yaxis.formatter = NumeralTickFormatter(format='0 %')

show(bokeh_plot)

21.1 Dataset

21.1.1 The Pudding’s Film Dialogue Data

21.2 Check for Duplicates

21.2.1 Duplicates

21.3 Clean and Transform Data

21.3.1 Pandas .str Methods

21.4 Applying Functions

21.5 Reset Index

21.6 Bonus — Interactive Data Visualization

21.3.1 Pandas `.str` Methods