The Common Crawl corpus contains petabytes of data collected from the web since 2008. It includes raw web page data, extracted metadata, and extracted text.
This database, updated daily, contains ads that ran on Facebook and were submitted via browser extensions by thousands of ProPublica users around the world.
Access data on posts, threads, comments, users, and more from Reddit and its subreddits. Historical Reddit data has been collected at http://files.pushshift.io/reddit/ as monthly CSV downloads.
As a service to the Machine Learning, Data Mining, and Social Sciences communities, the Social Computing data repository currently hosts datasets from many different social media sites.