24 Pandas — Merge Datasets

Author

Melanie Walsh

In this lesson, we’re going to demonstrate how to merge two datasets with Pandas. We’re going to walk through a real-world example of how I merged two different datasets published by The Pudding to create the CSV file that we used in the previous lesson.

24.1 Dataset

24.1.1 The Pudding’s Film Dialogue Data

When Hannah Andersen and Matt Daniels published the data behind their Pudding essay about film dialogue, they published this data as a few different CSV files. But I wanted to combine them into a single CSV file.

Import Pandas

To use the Pandas library, we first need to import it.

import pandas as pd

The above import statement not only imports the Pandas library but also gives it an alias or nickname — pd. This alias will save us from having to type out the entire words pandas each time we need to use it. Many Python libraries have commonly used aliases like pd.

Set Display Settings

By default, Pandas will display 60 rows and 20 columns. I often change Pandas’ default display settings to show more rows or columns.

pd.options.display.max_rows = 100

Read in CSV File

To read in a CSV file, we will use the function pd.read_csv() and insert the name of our desired file path.

Hannah Andersen and Matt Daniels published one CSV file called “meta_data7.csv” that contains, among things, the title of each movie, the year of its release, and its box office gross.

metadata = pd.read_csv('../data/Pudding/meta_data7.csv', delimiter=',', encoding='utf-8')

metadata

	script_id	imdb_id	title	year	gross	lines_data
0	1534	tt1022603	(500) Days of Summer	2009	37.0	7435445256774780000000000000000000000000000000...
1	1512	tt0147800	10 Things I Hate About You	1999	65.0	1777752320274530000000000000000000000000000000...
2	1514	tt0417385	12 and Holding	2005	NaN	5461357777754220000000000000000000000000000000...
3	1517	tt2024544	12 Years a Slave	2013	60.0	4567334777777780000000000000000000000000000000...
4	1520	tt1542344	127 Hours	2010	20.0	453513352345766000000000000000
...	...	...	...	...	...	...
1995	3765	tt0403702	Youth in Revolt	2009	17.0	7766777656545340000000000000000000000000000000...
1996	3766	tt1790885	Zero Dark Thirty	2012	104.0	5677677556654470000000000000000000000000000000...
1997	8158	tt0120906	Zero Effect	1998	3.0	4777774477777650000000000000000000000000000000...
1998	3768	tt0421090	Zerophilia	2005	NaN	7647774446763350000000000000000000000000000000...
1999	6491	tt0443706	Zodiac	2007	41.0	2447767766763570000000000000000000000000000000...

2000 rows × 6 columns

We’re going to drop the column “lines_data”, which contains information about when during the film each character speaks.

metadata = metadata.drop(columns='lines_data')

They published another CSV file called “character_list5.csv” that contains, among other things, the name, gender, and age of each character as well as the number of words the character speaks.

characters = pd.read_csv('../data/Pudding/character_list5.csv', delimiter=',',encoding='utf-8')

characters

	script_id	imdb_character_name	words	gender	age
0	280	betty	311	f	35.0
1	280	carolyn johnson	873	f	NaN
2	280	eleanor	138	f	NaN
3	280	francesca johns	2251	f	46.0
4	280	madge	190	f	46.0
...	...	...	...	...	...
23043	9254	lumiere	1063	m	56.0
23044	9254	maurice	1107	m	71.0
23045	9254	monsieur d'arqu	114	m	58.0
23046	9254	mrs. potts	564	f	66.0
23047	9254	wardrobe	121	f	54.0

23048 rows × 5 columns

As you can see, the characters DataFrame doesn’t include the actual title of the movie in which the character appears or the movie’s release year or box office gross. And the metadata datafram doesn’t contain any information about the characters. We want that info all in one place. So how can we combine all of this data together?

24.2 Merge Datasets

If you look closely, there’s one column that both datasets share in common: “script_id”. If two datasets share at least one column in common, we can merge them together based on this column.

We can use the pd.merge() function and type in the name of the first dataframe, the name of the second dataframe, and the shared column to be merged on.

pd.merge(characters, metadata, on='script_id')

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	lines_data
0	280	betty	311	f	35.0	tt0112579	The Bridges of Madison County	1995	142.0	4332023434343450000000000000000000000000000000...
1	280	carolyn johnson	873	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	4332023434343450000000000000000000000000000000...
2	280	eleanor	138	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	4332023434343450000000000000000000000000000000...
3	280	francesca johns	2251	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	4332023434343450000000000000000000000000000000...
4	280	madge	190	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	4332023434343450000000000000000000000000000000...
...	...	...	...	...	...	...	...	...	...	...
23043	9254	lumiere	1063	m	56.0	tt0101414	Beauty and the Beast	1991	452.0	3245753334377770000000000000000000000000000000...
23044	9254	maurice	1107	m	71.0	tt0101414	Beauty and the Beast	1991	452.0	3245753334377770000000000000000000000000000000...
23045	9254	monsieur d'arqu	114	m	58.0	tt0101414	Beauty and the Beast	1991	452.0	3245753334377770000000000000000000000000000000...
23046	9254	mrs. potts	564	f	66.0	tt0101414	Beauty and the Beast	1991	452.0	3245753334377770000000000000000000000000000000...
23047	9254	wardrobe	121	f	54.0	tt0101414	Beauty and the Beast	1991	452.0	3245753334377770000000000000000000000000000000...

23048 rows × 10 columns

Now we have a combined DataFrame with character and film information.

merged_movie_character = pd.merge(characters, metadata, on='script_id')

24.3 Calculate Dialogue Proportions

We’re going to add one more column to this dataset before the next lesson. We’re going to calculate the proportion of words spoken in each film by each character. To do so, we’re going to .groupby() the movie’s title and calculate the sum total number of words spoken in each movie.

merged_movie_character.groupby('title')[['words']].sum()

	words
title
(500) Days of Summer	18500
10 Things I Hate About You	19680
12 Years a Slave	19628
12 and Holding	15968
127 Hours	5145
...	...
Zero Effect	13927
Zerophilia	16686
Zodiac	14656
eXistenZ	9447
xXx	8285

1994 rows × 1 columns

If we use the .transform(), we can turn this groupby into a single column of data.

merged_movie_character.groupby(['title'])[['words']].transform(sum)

	words
0	6394
1	6394
2	6394
3	6394
4	6394
...	...
23043	10159
23044	10159
23045	10159
23046	10159
23047	10159

23048 rows × 1 columns

total_movie_words = merged_movie_character.groupby(['title'])[['words']].transform(sum)

Then we’re going to divide the total number of words spoken by each character by the total number of words spoken in each film.

total_character_words = merged_movie_character[['words']]

total_character_words / total_movie_words

	words
0	0.048639
1	0.136534
2	0.021583
3	0.352049
4	0.029715
...	...
23043	0.104636
23044	0.108967
23045	0.011222
23046	0.055517
23047	0.011911

23048 rows × 1 columns

dialogue_proportion = total_character_words / total_movie_words

Then we’re going to add it as a new column.

merged_movie_character['proportion_of_dialogue'] = dialogue_proportion

merged_movie_character

	script_id	imdb_character_name	words	gender	age	imdb_id	title	year	gross	proportion_of_dialogue
0	280	betty	311	f	35.0	tt0112579	The Bridges of Madison County	1995	142.0	0.048639
1	280	carolyn johnson	873	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.136534
2	280	eleanor	138	f	NaN	tt0112579	The Bridges of Madison County	1995	142.0	0.021583
3	280	francesca johns	2251	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.352049
4	280	madge	190	f	46.0	tt0112579	The Bridges of Madison County	1995	142.0	0.029715
...	...	...	...	...	...	...	...	...	...	...
23043	9254	lumiere	1063	m	56.0	tt0101414	Beauty and the Beast	1991	452.0	0.104636
23044	9254	maurice	1107	m	71.0	tt0101414	Beauty and the Beast	1991	452.0	0.108967
23045	9254	monsieur d'arqu	114	m	58.0	tt0101414	Beauty and the Beast	1991	452.0	0.011222
23046	9254	mrs. potts	564	f	66.0	tt0101414	Beauty and the Beast	1991	452.0	0.055517
23047	9254	wardrobe	121	f	54.0	tt0101414	Beauty and the Beast	1991	452.0	0.011911

23048 rows × 10 columns

24.4 Write to CSV File

Finally, we’re going to output this merged and more comprehensive dataset to a CSV file by using the .to_csv() method. We set the index parameter to False to remove the index column (the numbers in the left-most column).

merged_movie_character.to_csv('../data/Pudding/Merged-Pudding-Film-Dialogue.csv', encoding='utf-8', index=False)

If there is anything wrong, please open an issue on GitHub or email f.pianzola@rug.nl