Research Guides: Social Media Research: Social Media and Web Datasets

Social media & web datasets

Awesome Public Datasets
A list of topic-centric, high-quality public data sources, collected from blogs, answers, and user responses.
Blogger Corpus (2004)
The collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts.
Common Crawl
The Common Crawl corpus contains petabytes of data collected from the web since 2008. It contains raw web page data, extracted metadata and text extractions.
Facebook Ad Categories (ProPublica)
This dataset includes two tables: data on the interest categories Facebook shows to users and the ad groups it shows to advertisers (2016).
Internet Archive (data)
How to download files from archive.org in an automated way using wget.
Library of Congress Datasets
Obama Administration Social Media Archives
A directory of sites archiving various social media posts from members of the Obama Administration.
Political Ads from Facebook (ProPublica)
This database, updated daily, contains ads that ran on Facebook and were submitted by thousands of ProPublica users from around the world (via browser extensions).
Quantcast (website traffic estimates)
Use Quantcast's top sites and Measure tools to find traffic estimates for website domains (free account required).
reddit APIs
Access data on posts, threads, comments, users, and more from Reddit and its subreddits. Historical Reddit data has been collected at http://files.pushshift.io/reddit/ as monthly compressed JSON downloads.
Social Computing Data Repository (Arizona)
As a service to the Machine Learning, Data Mining, and Social Sciences communities, the Social Computing data repository currently hosts datasets from a collection of many different social media sites.
Stanford Large Network Dataset Collection (SNAP)
The SNAP collection has gathered data on large social and information networks since 2004.
Tweet Datasets (DocNow)
Directory of open-access tweet datasets on DocNow, available for research use. To convert Tweet IDs to JSON files of full-text tweets and metadata, use DocNow's Hydrator app. See also: UNLV's Twitter Data Tutorial Series.
Twitter: Moral Foundations Corpus
35,108 tweets curated from seven distinct domains of Twitter discourse and hand-annotated for 10 categories of moral sentiment.
Web Corpus (iWeb)
14 billion words from 22 million web pages and 95k websites.
Wikipedia Data Dumps
Monthly database backups of all Wikimedia wikis in various formats.
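
For the Internet Archive entry above, automated downloading can also be scripted in Python rather than wget. The following is a minimal sketch, not an official client. It assumes an item's metadata record has already been fetched as JSON (archive.org serves one per item at https://archive.org/metadata/<identifier>) and that the record carries a `files` list whose entries have a `name` field. The function name, the item identifier `example-item`, and the file names are hypothetical placeholders.

```python
from urllib.parse import quote

# Base of archive.org's per-item download URLs.
DOWNLOAD_BASE = "https://archive.org/download"


def build_download_urls(identifier, metadata, suffix=""):
    """Return one download URL per file listed in an item's metadata record.

    `metadata` is the parsed JSON record for the item (assumed to contain a
    "files" list of dicts with a "name" key). `suffix` optionally keeps only
    files whose names end with it, e.g. ".csv".
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name and name.endswith(suffix):
            # Percent-encode spaces and other unsafe characters in the path.
            urls.append(f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(name)}")
    return urls


# Hypothetical metadata record standing in for a real API response.
sample_metadata = {"files": [{"name": "posts.csv"}, {"name": "read me.txt"}]}
example_urls = build_download_urls("example-item", sample_metadata, suffix=".csv")
```

Each resulting URL can then be passed to any downloader, such as wget or `urllib.request.urlretrieve`. Keeping URL construction separate from the network calls makes the filtering step easy to check before starting a large download.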

Content reused with permission from University of Minnesota Libraries.
