Download large data for Hadoop [closed]

Solution 1:

I would suggest downloading the Million Song Dataset from the following website:

http://labrosa.ee.columbia.edu/millionsong/

The best thing about the Million Song Dataset is that you can download a 1 GB (about 10,000 songs), 10 GB, 50 GB, or roughly 300 GB dataset to your Hadoop cluster and run whatever tests you want. I love using it and have learned a lot from this dataset.

To start, you can download the subset of songs beginning with any one letter from A to Z, which ranges from 1 GB to 20 GB. You can also use the Infochimps site:

http://www.infochimps.com/collections/million-songs

In one of my blog posts I showed how to download the 1 GB subset and run Pig scripts against it:

http://blogs.msdn.com/b/avkashchauhan/archive/2012/04/12/processing-million-songs-dataset-with-pig-scripts-on-apache-hadoop-on-windows-azure.aspx
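If you just want to pull one of the smaller subsets from the shell and load it into HDFS, a minimal sketch looks like this. Note that the download URL and the extracted directory name below are placeholders; take the real ones from the dataset page above:

    # Placeholder URL -- get the actual subset link from the Million Song Dataset page
    wget http://example.com/millionsongsubset.tar.gz
    tar -xzf millionsongsubset.tar.gz

    # Copy the extracted files into HDFS so Pig/MapReduce jobs can read them
    # ("MillionSongSubset" is an assumed directory name; adjust to match the archive)
    hadoop fs -mkdir -p /data/millionsong
    hadoop fs -put MillionSongSubset /data/millionsong/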

Solution 2:

Tom White mentions a sample weather dataset in his book (Hadoop: The Definitive Guide).

http://hadoopbook.com/code.html

Data is available for more than 100 years.

I used wget on Linux to pull the data. For the year 2007 alone, the data size is 27 GB.

It is hosted on an FTP server, so you can download it with any FTP utility.

ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
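To grab just one year from the shell, a wget invocation like the following works; this is a sketch that assumes the FTP tree is organized into per-year directories:

    # Recursively fetch the 2007 directory (about 27 GB); -np stops wget from
    # ascending to the parent, while -nH and --cut-dirs=3 drop the hostname and
    # the pub/data/noaa prefix so files land in a local 2007/ directory
    wget -r -np -nH --cut-dirs=3 ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2007/

    # Load the year into HDFS for processing
    hadoop fs -mkdir -p /data/noaa
    hadoop fs -put 2007 /data/noaa/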

For complete details please check my blog:

http://myjourneythroughhadoop.blogspot.in/2013/07/how-to-download-weather-data-for-your.html

Solution 3:

There are public datasets available on Amazon:

http://aws.amazon.com/publicdatasets/

I would suggest running a demo cluster there instead, which saves you the download. There is also a good dataset of the crawled web from Common Crawl, which is also available on Amazon S3:

http://commoncrawl.org/
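The Common Crawl bucket is public, so you can browse it anonymously with the AWS CLI (assuming you have it installed). The object key in the copy command below is a placeholder; pick a real one from the listing:

    # List the public Common Crawl bucket without AWS credentials
    aws s3 ls s3://commoncrawl/ --no-sign-request

    # Copy one object locally for inspection (placeholder key)
    aws s3 cp s3://commoncrawl/path/to/some-file.warc.gz . --no-sign-request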