Large amounts of English text needed [closed]

I must admit that I'm unsure whether this is the right forum for this question; it may also belong on a statistics or AI site. If there is a more suitable forum, please tell me.

The thing is, I want to analyse a lot of English text for an AI project (Confabulation theory). Is there an online collection of freely available English texts? Books and news would be preferred; scientific texts will probably not do, due to large amounts of math etc.


Solution 1:

Project Gutenberg offers more than 33,000 classic books for free, downloadable in several formats.

Also, look at the affiliate sites, where you can find even more books.

EDIT: you may try to contact them to see if there is a way for you to download the books "painlessly" (e.g. with an automated script). I would suggest you ask permission before trying to download all those books automatically. Also, consider making a donation if you end up using their data.

EDIT2: here are the instructions for accessing the site with a robot.
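
If you do end up writing a script, it may help to check a site's robots.txt rules before fetching anything. The following is only a rough sketch using Python's standard `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT`, the `example.org` URLs and the `my-corpus-bot` user agent are made up for illustration and are not Project Gutenberg's actual policy, so read their robot instructions for the real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# A real script would load the site's own file instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)

# "my-corpus-bot" and the example.org URLs are placeholder names.
allowed = parser.can_fetch("my-corpus-bot", "https://example.org/books/1.txt")
blocked = parser.can_fetch("my-corpus-bot", "https://example.org/private/x.txt")
delay = parser.crawl_delay("my-corpus-bot")  # seconds to wait between requests, if declared

print(allowed, blocked, delay)
```

A polite downloader would skip any URL for which `can_fetch` returns False and sleep for at least `delay` seconds between requests.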

Solution 2:

Project Gutenberg is a poor choice if you are looking for contemporary language, as most of the PG texts are from the early 20th century and earlier. For large samples of more recent text, you want one of the many available text corpora, such as the following:

  • American National Corpus
  • British National Corpus
  • Corpus of Contemporary American English
  • The Oxford English Corpus

Some corpora require a fee for access, and others are free to use. Consult each Web site for specifics.

Solution 3:

If n-grams are fine (i.e., you don't need complete sentences and such), you could also try the Google n-grams dataset.
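
To make clear what "n-grams" means here: an n-gram is a run of n consecutive words, and such a dataset stores counts of these runs rather than full sentences. Below is a minimal sketch that counts them from a plain string; the `tokenize` and `ngram_counts` helpers and the sample sentence are illustrative assumptions and say nothing about the Google dataset's actual file format.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))


sample = "the cat sat on the mat and the cat slept"
bigrams = ngram_counts(tokenize(sample), 2)

# Show the most frequent bigrams first
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

If counts like these are all your model needs, a precomputed n-gram dataset saves you from collecting and processing the raw text yourself.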