40  Reddit Data Collection and Analysis with PSAW

Author

Melanie Walsh

To collect Reddit data, we’re going to use the Pushshift API, specifically a Python wrapper for it called PSAW (PushShift API Wrapper). Why are we using the Pushshift API instead of the official Reddit API, and PSAW instead of Pushshift itself?

Well, as Pushshift’s creator Jason Baumgartner and his co-authors describe it in their published paper, “Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits.” PSAW, meanwhile, makes it easier to work with Pushshift and provides better documentation.

40.1 Install PSAW

To use PSAW, we first need to install it.

!pip install psaw

Then we will import pandas to work with the collected data, and we will change pandas’ default display settings so that our DataFrame shows wider columns and more of them.

import pandas as pd
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 50)

Next we will import the PushshiftAPI from psaw and initialize it.

from psaw import PushshiftAPI

# Initialize PushShift
api = PushshiftAPI()

40.2 Collect Reddit Posts (By Subreddit)

To collect Reddit posts, we will use api.search_submissions() and then establish the parameters of our query, such as which subreddit we want to search in and what threshold of upvote score we want to set.

Below we are setting up a search for posts in the subreddit “AmITheAsshole” that have an upvote score greater than 2,000.

api_request_generator = api.search_submissions(subreddit='AmITheAsshole',
                                               score = ">2000")

Once this generator is set up, we can use it to collect Reddit posts. The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

Pandas

Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out Pandas Basics (1-3) in this textbook!

aita_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])

The cell above may take a while to run because it searches through Reddit’s entire history. It’s OK if you periodically get errors while it’s running.
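Because the generator yields results lazily, it can also be capped before the DataFrame is built. The sketch below is a minimal illustration using only the standard library: `FakeSubmission` and `fake_search` are hypothetical stand-ins for PSAW's result objects and for `api.search_submissions()` (so the example runs without network access), and `itertools.islice` stops after a fixed number of results.

```python
from itertools import islice


class FakeSubmission:
    """Hypothetical stand-in for a PSAW result, which exposes its data as a dict in .d_"""

    def __init__(self, data):
        self.d_ = data


def fake_search():
    """Hypothetical stand-in for api.search_submissions(); yields results lazily."""
    for i in range(1000):
        yield FakeSubmission({"title": f"post {i}", "score": 2001 + i})


# Take only the first 5 results instead of exhausting the generator
rows = [submission.d_ for submission in islice(fake_search(), 5)]
print(len(rows))
print(rows[0])
```

The resulting list of dictionaries can be passed to pd.DataFrame() in the same way as the full list comprehension above.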

Let’s check to see how many Reddit posts we have collected by checking the shape of the DataFrame.

aita_submissions.shape
(2959, 78)

We have 2,959 posts!

Let’s see what metadata is available by listing the columns in the DataFrame.

aita_submissions.columns
Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'edited', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'suggested_sort', 'thumbnail', 'title',
       'total_awards_received', 'treatment_tags', 'upvote_ratio', 'url',
       'whitelist_status', 'wls', 'created', 'gilded', 'top_awarded_type',
       'removed_by_category', 'link_flair_template_id', 'link_flair_text',
       'author_flair_background_color', 'author_flair_text_color', 'post_hint',
       'preview', 'author_flair_template_id', 'link_flair_css_class',
       'banned_by', 'steward_reports', 'updated_utc', 'og_description',
       'og_title', 'author_cakeday', 'rte_mode'],
      dtype='object')

To get a quick peek at the data, we can look at 10 random rows, showing only the “title” and “score” columns.

aita_submissions[['title', 'score']].sample(10)
title score
1787 AITA for not paying for college for my pregnant daughter? 5913
1563 AITA for offering my sister to pay for an abortion but not offering to support the child finacially if she keeps it? 9111
1422 AITA for snapping at my sister when she encouraged her 4 y/o daughter to fingerprint with my makeup? 4420
2908 AITA For purposely stopping my classmate from winning an award and subsequently making her cry? 32356
2412 AITA for not giving up my table (at a restaurant) for a pregnant woman who needed it for accessibility? 11900
2672 AITA for not allowing people to access a swimming hole that has been used for generations, but is on our property? 4618
1220 AITA for giving both of my kids the same money for Back to School Shopping? 3749
1707 WIBTA if I stopped inviting my triggered friend to movie night? 12656
949 AITA for leaving my girlfriend alone in the ER to go workout? 2718
566 AITA for doing the bare minimum babysitting because my mom and stepdad expect me to fo it for free? 4467

To sort and manipulate dates, let’s convert the Unix timestamps in the “created_utc” column into datetime values.

aita_submissions['date'] = pd.to_datetime(aita_submissions['created_utc'], utc=True, unit='s')
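The “created_utc” column stores Unix timestamps, that is, seconds elapsed since January 1, 1970 (UTC), which is why the call above passes unit='s'. As a small standard-library illustration of the same conversion for a single value (the timestamp below is an illustrative number, not copied from the collected data):

```python
from datetime import datetime, timezone

# Illustrative Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
created_utc = 1595618029

# Convert to a timezone-aware datetime in UTC
created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(created.isoformat())
```

pd.to_datetime() applies this kind of conversion to every value in the column at once.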

To look only at columns of interest, we can pass a list of column names inside square brackets, which is why the brackets are doubled.

aita_submissions[['author', 'date', 'title', 'selftext', 'url', 'subreddit', 'score', 'num_comments', 'num_crossposts']]
author date title selftext url subreddit score num_comments num_crossposts
0 Additional-Pizza-805 2020-07-24 19:13:49+00:00 AITA for kicking my cousin off of my sister’s wedding Zoom call? My [27M] older sister [30F] and her fiancé [31M] were planning for over a year for their wedding to be this month. Obviously, they can’t have the wedding as planned, but they still would like to get married, so they decided on a “Zoom” wedding where all of the family/friends would just call in to watch the officiant, my sister, and her fiancé. \n\nMy sister didn’t want to be in charge of hosting the Zoom call because she thought it would stress her out, so she asked me to and I gladly accept... https://www.reddit.com/r/AmItheAsshole/comments/hx80wd/aita_for_kicking_my_cousin_off_of_my_sisters/ AmItheAsshole 11159 2209 4
1 decadel8ter 2020-07-24 14:37:13+00:00 AITA for resenting my family for something that happened over a decade ago? when i was 15 i was in a car accident. i was riding my bike on new bike lanes that my city had installed, and a car hit me. I ended up having to go to the hospital, nothing major, the car was turning onto the main road when it hit me so it was going well below the speed limit. but since it was a car accident, i was forced into the ambulance and shipped off to the hospital to get x-rays.\n\nsince i was a minor, I wasn't allowed to be released from the hospital without my family coming to pick... https://www.reddit.com/r/AmItheAsshole/comments/hx2vvl/aita_for_resenting_my_family_for_something_that/ AmItheAsshole 2541 1143 0
2 Snoo_66130 2020-07-24 12:35:35+00:00 AITA for telling my step dad to stop trying to be my dad? I'm 35, and my mom who is 52 is dating a man who is is 27. This is fucking weird as hell and I really don't like this guy but whatever. He always brings up the fact that's he's my step dad and always talks about how proud of me he is and how good of a son I've become and how he's raised me so well and shit.\n\nLike he's literally acting like my dad, which pisses me off because he is younger htan me and was barely ever in my life but more so, becuase I actually had a dad who u... https://www.reddit.com/r/AmItheAsshole/comments/hx0zk7/aita_for_telling_my_step_dad_to_stop_trying_to_be/ AmItheAsshole 2809 1253 1
3 ohnoihaveabluechair 2020-07-24 10:56:56+00:00 AITA for confronting my SIL for wearing clothes that belonged to me? Some info: A few years ago, my family didn’t have a lot of spare money to buy a lot of things (lower middle class), only recently (around 2016) did my husband get a new job and we got financially better by a lot, so finally we could afford to buy things for our kids and travel.\n\nIn 2018, I was so excited when my husband got both of us tickets to go to India for 2 weeks and gave me 200,000 rupees for punjabi suit shopping (traditional indian clothes). This meant a lot to me because my husba... https://www.reddit.com/r/AmItheAsshole/comments/hwzpbu/aita_for_confronting_my_sil_for_wearing_clothes/ AmItheAsshole 7581 1550 1
4 FormalLettuce3 2020-07-24 10:52:08+00:00 AITA for saying we'd only help with my ex's kid's party if we could tell people we're engaged? This guy, "Jack", and I were together for about a year, and within a couple weeks of ending it I found out I was pregnant. I texted Jack to tell him, and a couple hours later this woman, "Liz", showed up at my place saying she and Jack had been together for 6 months, and she was also pregnant, and when the text arrived she got my address out of Jack's phone so she could talk to me before him. I told her everything, and Liz dumped Jack. I was about 6 weeks along at this stage, and she was 12 ... https://www.reddit.com/r/AmItheAsshole/comments/hwzncq/aita_for_saying_wed_only_help_with_my_exs_kids/ AmItheAsshole 2915 1214 0
... ... ... ... ... ... ... ... ... ...
2927 BackgroundJellyfish 2018-08-31 21:39:49+00:00 AITA for hitting my girlfriend out of reflex for her scaring me? Hi, so my girlfriend and i watched a horror movie recently, called The Last Exorcism. Now keep in mind, i HATE horror, i get very scared easily. but she likes horror, and practically begged me to watch it with her. So, i did watch it, and have been very jumpy lately. It was pretty scary, because horror movies have always given me nightmares.\n\n\nMy girlfriend, however, thinks it's funny how i am. She has picked up me bad habit. She likes to sneak up behind me and make a loud noise to startl... https://www.reddit.com/r/AmItheAsshole/comments/9bxoro/aita_for_hitting_my_girlfriend_out_of_reflex_for/ AmItheAsshole 2719 405 1
2928 Marylebone_Road 2018-08-30 17:00:31+00:00 AITA for thinking that this sub is only so people can have their decisions validated and never actually post something assholish they've done? https://www.reddit.com/r/AmItheAsshole/comments/9blev1/aita_for_thinking_that_this_sub_is_only_so_people/ AmItheAsshole 2001 104 0
2929 treefiddyfive 2018-08-26 09:33:45+00:00 AITA for not believing my daughter is non-binary? Recently my daughter has 'come out' to me as non binary, meaning that she supposedly does not believe she is a man or a woman. I heard her out and let her speak, and tried to calmly ask her how she has come to this conclusion. The conversation was civil until I told her I did not believe she is anything but a woman. At this point, she started crying, calling me a bigot, and my wife had to take over. My wife tells me I am being insensitive. \n\n\nThey want me to refer to them as 'they', which... https://www.reddit.com/r/AmItheAsshole/comments/9aeeed/aita_for_not_believing_my_daughter_is_nonbinary/ AmItheAsshole 2068 905 3
2930 grizzythekid 2018-06-19 02:07:05+00:00 AITA for throwing a soda on the ground near the dude I bought it for? I was going to McDonald's for a quick bite to go, when a drunk maybe homeless, definitely in some state guy asked if I could but him a burger. I said sure, having been drunk plenty of times myself, I know a burger hits the spot when you're on one. So I buy two big cheeseburger meals, and walk out. I've got all the food in a bag, and two drinks in the other. I set the drinks down on an outside table and fish out one burger and hand it to him. I set the fries on the table for him, as he his un... https://www.reddit.com/r/AmItheAsshole/comments/8s56a3/aita_for_throwing_a_soda_on_the_ground_near_the/ AmItheAsshole 3532 90 0
2931 Pettheftthrow 2018-05-11 03:28:07+00:00 AITA for refusing to return a lost pet? So over two years ago a cat appeared in my yard. He was skinny, skittish, unneutered, and had a serious abscess on his rump, likely from a cat bite wound. I took him to the vet that night and had him treated. The vet estimated he was about six months old.\n\nI called the local county shelters to file a found cat report. I also posted on Craigslist, posted his info at local vet offices, and kept an eye out for flyers. He was scanned for a microchip and didn't have one. \n\nAt that point I did... https://www.reddit.com/r/AmItheAsshole/comments/8ikrb0/aita_for_refusing_to_return_a_lost_pet/ AmItheAsshole 2387 169 0

2932 rows × 9 columns

40.3 Collect Reddit Posts (By Keyword)

To search by keyword, we will add the q= parameter with a query phrase, in this case the name of the rapper “Missy Elliott.”

api_request_generator = api.search_submissions(q='Missy Elliott', score = ">2000")

Once this generator is set up, we can use it to collect Reddit posts. The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

missy_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])

The cell above may take a while to run because it searches through Reddit’s entire history. It’s OK if you periodically get errors while it’s running.

To sort and manipulate dates, let’s convert the Unix timestamps in the “created_utc” column into datetime values.

missy_submissions['date'] = pd.to_datetime(missy_submissions['created_utc'], utc=True, unit='s')

To look only at columns of interest, we can pass a list of column names inside square brackets, which is why the brackets are doubled.

missy_submissions[['author', 'date', 'title', 'selftext', 'url', 'subreddit', 'score', 'num_comments', 'num_crossposts']]
author date title selftext url subreddit score num_comments num_crossposts
0 Origai 2020-04-12 17:21:30+00:00 When Missy Elliott is a Heidi Stan https://i.redd.it/nbra20y58fs41.jpg rupaulsdragrace 3682 103 0.0
1 Chriscftb97 2019-10-04 20:21:19+00:00 DaBaby's "KIRK" sells 147K First Week (8K Pure). Kevin Gates' "I'm Him" sells 71K First Week (15K Pure). Young M.A's "Herstory in the Making" sells 22K First Week (5K Pure). Rank| Artist| Album| Label | Pure Sales| Sales + Streaming\n---|---|----|----|----|----\n1 | Post Malone | Hollywood's Bleeding | Republic | 200,000 | 489,000\n2 | Khalid | Free Spirit | Right Hand Music/RCA | 82,000 | 202,000\n3 | Ed Sheeran | No.6 Collaborations Project | Atlantic | 70,000 | 173,000\n4 | Tyler, The Creator | IGOR | Columbia | 77,716 | 172,377\n5 | Juice WRLD | Death Race For Love | Grade A/Interscope | 42,648 | 164,076\n**6** | **DaBaby** | **KIRK** | **Interscope** | **7,... https://www.reddit.com/r/hiphopheads/comments/ddcx0y/dababys_kirk_sells_147k_first_week_8k_pure_kevin/ hiphopheads 2470 575 0.0
2 Chriscftb97 2019-09-14 16:29:34+00:00 Post Malone's "Hollywood's Bleeding" sells 493K First Week (210K Pure). EARTHGANG's "Mirrorland" sells 12K First Week (2K Pure). Rank| Artist| Album| Label | Pure Sales| Sales + Streaming\n---|---|----|----|----|----\n**1** | **Post Malone** | **Hollywood's Bleeding** | **Republic** | **210,283** | **492,854**\n2 | Khalid | Free Spirit | Right Hand Music/RCA | 82,000 | 202,000\n3 | Ed Sheeran | No.6 Collaborations Project | Atlantic | 70,000 | 173,000\n4 | Tyler, The Creator | IGOR | Columbia | 77,716 | 172,377\n5 | Juice WRLD | Death Race For Love | Grade A/Interscope | 42,648 | 164,076\n6 | DJ Khaled | Father Of Asa... https://www.reddit.com/r/hiphopheads/comments/d47eza/post_malones_hollywoods_bleeding_sells_493k_first/ hiphopheads 3332 889 0.0
3 Chriscftb97 2019-08-30 21:00:33+00:00 BROCKHAMPTON's "GINGER" sells 77K First Week (55K Pure). Jeezy's "TM104: The Legend of the Snowman" sells 50K First Week (22K Pure). Missy Elliott's "ICONOLOGY" sells 17K First Week (11K Pure). SAINt JHN's "Ghetto Lenny's Love Songs" sells 13K First Week (4K Pure). Rank| Artist| Album| Label | Pure Sales| Sales + Streaming\n---|---|----|----|----|----\n1 | Khalid | Free Spirit | Right Hand Music/RCA | 82,000 | 202,000\n2 | Ed Sheeran | No.6 Collaborations Project | Atlantic | 70,000 | 173,000\n3 | Tyler, The Creator | IGOR | Columbia | 77,716 | 172,377\n4 | Juice WRLD | Death Race For Love | Grade A/Interscope | 42,648 | 164,076\n5 | DJ Khaled | Father Of Asahd | We The Best/Epic | 32,466 | 131,717\n6| Young Thug | So Much Fun | Atlantic | 5,142 | 131,... https://www.reddit.com/r/hiphopheads/comments/cxmule/brockhamptons_ginger_sells_77k_first_week_55k/ hiphopheads 3317 694 0.0
4 hugh__honey 2019-08-22 18:12:01+00:00 Missy Elliott announces first new album in 14 years, ICONOLOGY, coming out TONIGHT Announced on her IG - - https://www.instagram.com/p/B1ebzOigauK/?utm_source=ig_web_button_share_sheet https://www.reddit.com/r/hiphopheads/comments/cu199g/missy_elliott_announces_first_new_album_in_14/ hiphopheads 4262 237 0.0
5 errgreen 2019-07-26 19:59:15+00:00 Lizzo - Tempo (feat. Missy Elliott) [Official Music Video] https://www.youtube.com/watch?v=Srq1FqFPwj0 Music 7656 603 0.0
6 emitremmus27 2019-06-14 13:59:53+00:00 Missy Elliott becomes first female hip-hop artist inducted into Songwriters Hall of Fame https://abcnews.go.com/GMA/Culture/missy-elliott-female-hip-hop-artist-inducted-songwriters/story?id=63695814 Music 7935 299 0.0
7 DanWofSoc 2018-05-04 15:59:44+00:00 Friday Midday MAGAthread! - 05/04/2018 - Focus on the Midterms Volume 1: House Seats : CA-10 Jeff Denham, CA-25 Steve Knight, CA-48 Dana Rohrabacher _______________\n\n_______________\n\nGood afternoon my fellow Americans. **Your dom sheriff /u/DanWofSoc here to kick off the upcoming election cycle.** Believe it or not, we have about 24 weeks until the [November midterms](https://i.imgur.com/CdgAsM7.png) so it is time to go to work. **My goal is to run through all of the competitive house races each Friday until the election.** It is no secret that the democrat's only goal is the impeachment of our dear GEOTUS and The House of Repres... https://www.reddit.com/r/The_Donald/comments/8h0buy/friday_midday_magathread_05042018_focus_on_the/ The_Donald 2873 28 0.0
8 rejeremiad 2018-01-22 17:02:21+00:00 Malia Obama's Brand New Car is Disgusting | It isn't, refers to "Limosine One" used by the POTUS, saved you 257 clicks screenshot of original click headline: https://i.imgur.com/kCDDZBU.png \n\nsource: http://archive.is/FkEMT\n\nList: \n\n001: Paris Hilton – Bentley GT Continental, Estimated $285K \n002: Jay Leno – 1955 Mercedes 300SL Gullwing Coupe, Estimated $1.8 Million \n003: Kim Kardashian – Ferrari 458 Italia, Estimated $325K \n004: P.Diddy – Rolls Royce Phantom Drophead Coupe, Estimated $440K \n005: Nicolas Cage – 1958 Ferrari 250 GT Pininfarina, Estimated $3.6 Million \n006: Jerry Seinfeld – Po... https://www.reddit.com/r/savedyouaclick/comments/7s751u/malia_obamas_brand_new_car_is_disgusting_it_isnt/ savedyouaclick 9950 339 0.0
9 silverwolfer 2017-10-23 22:34:16+00:00 Missy Elliott doing you a /r/personalfinance https://imgur.com/GzzkVyt BlackPeopleTwitter 3057 46 0.0
10 hennny 2017-01-30 11:40:38+00:00 MORNING MAGAthread: Let's destroy, debunk and decuck some of these CTR talking points. #RISE AND SHINE, CENTIPEDES!\n\n#How are we all today?\n\n#Are we all ready for another week of being **PRESIDENTIAL AF**?\n\nI'm sure you don't need me to add another scratch onto an already very scratched, overplayed record- it's obvious right now that CTR are back with a vengeance. However, **jet fuel can't melt steel memes**. I, for one, absolutely fucking love banning cucks. It takes so much longer for them to create an account and copy and paste a CTR talking point than it does for me... https://www.reddit.com/r/The_Donald/comments/5r06o1/morning_magathread_lets_destroy_debunk_and_decuck/ The_Donald 7000 816 NaN
11 BigLeJaffe 2016-04-26 20:33:14+00:00 Beyoncé’s “Masterpiece of Black Feminism” Was Produced Almost Entirely By Men Beyoncé Knowles, Kevin Garrett, James Black, Wynter Gordon, Jack White, Mike Will Made It, Diplo, Ezra Koenig, Kevin Cossom, Melo-X, Danny Boy Styles, Ben Billions, Boots, Mike Dean, Vincent Berry II, Jonathan Coffer, and Just Blaze; these are the names credited as producers on Lemonade, the album The Hollywood Reporter calls a “Revolutionary Work of Black Feminism.” Notice anything peculiar?\n\nI believe the music is as advertised: a triumphant masterwork created by some of the most talente... https://www.reddit.com/r/hiphopheads/comments/4gktw6/beyoncés_masterpiece_of_black_feminism_was/ hiphopheads 3436 925 NaN
12 TheHHHRobot 2016-04-21 17:50:05+00:00 R.I.P. Prince Megathread http://www.tmz.com/2016/04/21/prince-dead-at-57/\n\nShare your favorite Prince songs, memories, interviews and alike.\n\nR.I.P.\n\n**REACTIONS FROM THE HIP-HOP COMMUNITY:**\n\n[Pete Rock](https://www.instagram.com/p/BEeJtKUGLdr/)\n\n[Killer Mike](https://www.instagram.com/p/BEeKYHCy1LZ/)\n\n[TDE's Punch](https://twitter.com/iamstillpunch/status/723205084151083008)\n\n[Questlove](https://www.instagram.com/p/BEeIpzpwa8I/)\n\n[Juicy J](https://www.instagram.com/p/BEeJU-fI-Kj/)\n\n[Young Thug](h... https://www.reddit.com/r/hiphopheads/comments/4fu6lc/rip_prince_megathread/ hiphopheads 2546 750 NaN
13 RaHxRaH 2015-11-12 14:05:16+00:00 [FRESH] Missy Elliott - WTF (Where They From) ft. Pharrell Williams https://youtu.be/KO_3Qgib6RQ hiphopheads 2905 502 NaN
14 caphis 2015-05-28 02:12:13+00:00 Why haven't Missy Elliott and Baskin Robbins teamed up for "Get Ur Free Cone" day yet? http://www.reddit.com/r/Showerthoughts/comments/37jj0x/why_havent_missy_elliott_and_baskin_robbins/ Showerthoughts 2918 205 NaN
15 All_Under_Heaven 2015-02-02 01:27:53+00:00 MRW Missy fucking Elliott showed up in the Superbowl Halftime show. http://www.reactiongifs.com/r/2013/07/holy-sht.gif reactiongifs 3746 313 NaN

To look at the subreddits where “Missy Elliott” appears the most often, we can use the .value_counts() method.

missy_submissions['subreddit'].value_counts()
hiphopheads           7
Music                 2
The_Donald            2
Showerthoughts        1
rupaulsdragrace       1
savedyouaclick        1
reactiongifs          1
BlackPeopleTwitter    1
Name: subreddit, dtype: int64
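.value_counts() tallies how often each value occurs and lists the tallies from most to least frequent. For intuition, a rough standard-library equivalent is collections.Counter. The list below was typed in by hand from the subreddit column of the table above, purely for illustration.

```python
from collections import Counter

# Subreddit of each of the 16 matching posts, in table order (hand-copied)
subreddits = [
    "rupaulsdragrace", "hiphopheads", "hiphopheads", "hiphopheads",
    "hiphopheads", "Music", "Music", "The_Donald",
    "savedyouaclick", "BlackPeopleTwitter", "The_Donald", "hiphopheads",
    "hiphopheads", "hiphopheads", "Showerthoughts", "reactiongifs",
]

counts = Counter(subreddits)

# most_common() returns (value, count) pairs sorted by descending count
for name, n in counts.most_common():
    print(name, n)
```

The order of tied entries may differ from the pandas output, but the counts themselves are the same.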

40.4 Collect Reddit Comments (By Keyword)

To collect Reddit comments rather than posts, we can use api.search_comments() rather than api.search_submissions().

api_request_generator = api.search_comments(q='Missy Elliott', score = ">2000")
missy_comments = pd.DataFrame([comment.d_ for comment in api_request_generator])

40.5 Collect Reddit Posts and Comments (By Multiple Keywords)

To search for any one of several phrases, such as comments that mention the author George Orwell OR the author J.R.R. Tolkien, we can wrap each phrase in parentheses and join them with the OR operator |

api_request_generator = api.search_comments(q='(George Orwell)|(J. R. R. Tolkien)')

To search for several phrases at once, such as comments that mention Shakespeare AND Beyonce, we can wrap each phrase in parentheses and join them with the AND operator &

api_request_generator = api.search_comments(q='(Shakespeare)&(Beyonce)')
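When the list of phrases gets longer, the query string can be assembled programmatically rather than typed by hand. The helper below is a hypothetical convenience function written for this illustration, not part of PSAW; it only reproduces the parentheses-plus-operator pattern used in the two queries above.

```python
def build_query(phrases, operator):
    """Join phrases into one query string, wrapping each phrase in parentheses.

    Hypothetical helper for illustration; not part of PSAW.
    """
    if operator not in ("|", "&"):
        raise ValueError("operator must be '|' (OR) or '&' (AND)")
    return operator.join(f"({phrase})" for phrase in phrases)


or_query = build_query(["George Orwell", "J. R. R. Tolkien"], "|")
and_query = build_query(["Shakespeare", "Beyonce"], "&")
print(or_query)
print(and_query)
```

Either string can then be passed as the q= argument, for example api.search_comments(q=or_query).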

40.6 Collect Reddit Posts and Comments (By Date Range)

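To restrict a search to a date range, we can convert the start and end dates to Unix timestamps and pass them with after= and before=.

import datetime as dt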

start_epoch=int(dt.datetime(2020, 1, 1).timestamp())
end_epoch=int(dt.datetime(2020, 1, 10).timestamp())

api_request_generator = api.search_comments(q='(Shakespeare)&(Beyonce)', after=start_epoch, before=end_epoch)
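One caveat: calling .timestamp() on a datetime without time zone information interprets it in the computer's local time zone, so the epoch values above can differ by a few hours from one machine to another. A minimal sketch that pins both boundaries to UTC instead, assuming UTC boundaries are what is wanted (the names start_epoch_utc and end_epoch_utc are illustrative):

```python
import datetime as dt

# Attach an explicit UTC time zone so the result is the same on every machine
start_epoch_utc = int(dt.datetime(2020, 1, 1, tzinfo=dt.timezone.utc).timestamp())
end_epoch_utc = int(dt.datetime(2020, 1, 10, tzinfo=dt.timezone.utc).timestamp())

print(start_epoch_utc, end_epoch_utc)
```

These values can be passed to after= and before= in place of start_epoch and end_epoch.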
