Data Sources

Below is a list of the most interesting data sources I’ve come across:

Big Data: 33 Brilliant And Free Data Sources Anyone Can Use

Awesome Public Datasets – GitHub

One of the most popular NoSQL databases is Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL implementations include SimpleDB, Google Bigtable, MemcacheDB, and Voldemort. Apache Hadoop and MapReduce are often mentioned alongside them, although they are data-processing frameworks rather than databases. Companies that use NoSQL include Twitter, LinkedIn, and Netflix.

Publicly Accessible APIs:

  • Reddit
  • Tumblr
  • Wikipedia
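
As a rough illustration of pulling data from one of these APIs, here is a minimal sketch that searches Wikipedia through the MediaWiki Action API (action=query, list=search). The function names and the User-Agent string are my own placeholders, not anything prescribed by Wikipedia, and the sketch assumes the standard JSON response shape with results under query.search.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public endpoint of the MediaWiki Action API for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, limit=5):
    """Build a full-text search URL (helper name is hypothetical)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_titles(payload):
    """Extract page titles from a decoded search response."""
    hits = payload.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits]


def search_wikipedia(query, limit=5):
    """Run the search over the network and return matching page titles."""
    # Placeholder User-Agent; replace it with something that identifies you.
    headers = {"User-Agent": "data-sources-example/0.1 (placeholder contact)"}
    request = Request(build_search_url(query, limit), headers=headers)
    with urlopen(request, timeout=10) as response:
        return parse_titles(json.load(response))
```

Calling search_wikipedia("open data") should return a short list of page titles. Reddit and Tumblr follow the same request-then-parse pattern, but their endpoints and authentication rules differ, so check their documentation before adapting this.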

NCES, UCI Machine Learning Repository

University websites (Berkeley has lots of data)

Difficult sources of data, mostly because of restrictive APIs / anti-scraping policies:

  • LinkedIn
  • Facebook
  • Yelp
  • Foursquare
  • Craigslist
