25  Users’ Data: Legal & Ethical Considerations

Author

Melanie Walsh

Before we dive into collecting data from the internet, we need to discuss some serious questions. Is it legal or ethical to computationally collect data from the internet? Is it legal or ethical to publish research that includes internet users’ data without their knowledge?

25.2 Institutional Review Boards (IRBs)

Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by an Institutional Review Board (IRB). But research about publicly available internet data does not typically require IRB approval.

The Cornell Institutional Review Board recommends being cautious with regard to data mining from the internet, however, and seeking “formal confirmation of non-human participant research status”:

If the individual or social media/network site has not placed any restrictions on access to information about himself/herself (e.g., information available on a public website, blog, twitter feed, chat room, etc.), the following best practices should be followed:

  • The researcher should send a project description to the IRB office and seek a formal confirmation of non-human participant research status for the study. We believe that in most cases, this will not be considered human participant research, but caution is recommended before a researcher makes his/her own determination, because of the emerging ethical sensitivities in this area.

25.3 Publishing, Privacy, & Citation

Just because something is legal or gets approved by an IRB does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. For these reasons, some researchers attempt to anonymize internet data before sharing it or before publishing an article that cites a specific post. Yet anonymizing internet data has a cost of its own: it withholds credit from internet users as creators and authors.

There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences).

In my own research, I have started seeking explicit permission from internet users when I want to quote them in a published article. In this book, I only share internet data that meets a certain threshold of publicness, such as tweets from verified Twitter accounts or Reddit posts with a certain number of upvotes. This is an approach that I have developed based on some of the models and readings included below.

25.4 Models & Examples of Social Media Data in Published Research

Below are a few examples of how researchers have approached social media data in published research:

25.4.1 Paraphrasing Posts

25.4.2 Linking to Posts & Using “Reasonably Public” Thresholds

  • In their report about the #BlackLivesMatter movement, Deen Freelon, Charlton McIlwain, and Meredith D. Clark included links to tweets rather than the full text of tweets, and they only linked to tweets that had a minimum of 100 retweets and were published by Twitter users who had at least 3,000 followers or were verified. They embargoed their Twitter data for a year and then publicly released a list of tweet IDs. Tweet IDs can be used by third parties to re-download any tweets that have not been deleted yet, as I discuss in the lesson “Twitter Data Sharing”.
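
As a rough illustration, a "reasonably public" threshold like the one described above could be sketched in Python. This is a minimal sketch under stated assumptions, not the researchers' actual code: the `Tweet` class, its field names, and the sample values are hypothetical (they are not Twitter API fields or real data), and the rule is read here as "at least 100 retweets, and an author who is verified or has at least 3,000 followers."

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record; field names are illustrative only.
    tweet_id: str
    retweet_count: int
    author_followers: int
    author_verified: bool


def is_reasonably_public(tweet, min_retweets=100, min_followers=3000):
    """Return True if a tweet meets the assumed publicness threshold."""
    widely_shared = tweet.retweet_count >= min_retweets
    public_author = tweet.author_verified or tweet.author_followers >= min_followers
    return widely_shared and public_author


def shareable_ids(tweets):
    """Keep only the IDs (not the text) of tweets that pass the threshold."""
    return [t.tweet_id for t in tweets if is_reasonably_public(t)]


# Made-up sample records for demonstration.
sample = [
    Tweet("1", retweet_count=250, author_followers=5000, author_verified=False),
    Tweet("2", retweet_count=250, author_followers=120, author_verified=True),
    Tweet("3", retweet_count=250, author_followers=120, author_verified=False),
    Tweet("4", retweet_count=12, author_followers=90000, author_verified=True),
]

print(shareable_ids(sample))
```

Returning only tweet IDs mirrors the practice of sharing identifiers rather than full text, so that deleted tweets cannot be re-downloaded later. The specific cutoffs are parameters because an appropriate threshold depends on the platform and the project.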

25.4.3 Direct Collaboration & Conversation with Users

  • For her article about the #GirlsLikeUs hashtag, which was created by trans advocate Janet Mock, Moya Bailey asked for Mock’s permission to work on the project before it began and collaborated with Mock to develop research questions and determine the project’s direction.