Creating a new corpus with NLTK
I reckoned that often the answer to my title is to go and read the documentations, but I ran through the NLTK book but it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt
files and I want to be able to use the corpus functions that NLTK provides for the corpus nltk_data
.
I've tried PlaintextCorpusReader
but I couldn't get further than:
>>>import nltk
>>>from nltk.corpus import PlaintextCorpusReader
>>>corpus_root = './'
>>>newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>>newcorpus.words()
How do I segment the newcorpus
sentences using punkt? I tried using the punkt functions but the punkt functions couldn't read PlaintextCorpusReader
class?
Can you also lead me to how I can write the segmented data into text files?
Solution 1:
After some years of figuring out how it works, here's the updated tutorial of
How to create an NLTK corpus with a directory of textfiles?
The main idea is to make use of the nltk.corpus.reader package. In the case that you have a directory of textfiles in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
file1.txt
file2.txt
...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: that the PlaintextCorpusReader
will use the default nltk.tokenize.sent_tokenize()
and nltk.tokenize.word_tokenize()
to split your texts into sentences and words and these functions are build for English, it may NOT work for all languages.
Here's the full code with creation of test textfiles and how to create a corpus with NLTK and how to access the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
# Let's create a corpus with 2 texts in different textfile.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]
# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
os.mkdir(corpusdir)
# Output the files into the directory.
filename = 0
for text in corpus:
filename+=1
with open(corpusdir+str(filename)+'.txt','w') as fout:
print>>fout, text
# Check that our corpus do exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)),corpus):
assert open(corpusdir+infile,'r').read().strip() == text.strip()
# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')
# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
print infile # The fileids of each file.
with newcorpus.open(infile) as fin: # Opens the file.
print fin.read().strip() # Prints the content of the file
print
# Access the plaintext; outputs pure string/basestring.
print newcorpus.raw().strip()
print
# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and
# nltk.tokenize.word_tokenize.
#
# Each element in the outermost list is a paragraph, and
# Each paragraph contains sentence(s), and
# Each sentence contains token(s)
print newcorpus.paras()
print
# To access pargraphs of a specific fileid.
print newcorpus.paras(newcorpus.fileids()[0])
# Access sentences in the corpus. (list of list of strings)
# NOTE: That the texts are flattened into sentences that contains tokens.
print newcorpus.sents()
print
# To access sentences of a specific fileid.
print newcorpus.sents(newcorpus.fileids()[0])
# Access just tokens/words in the corpus. (list of strings)
print newcorpus.words()
# To access tokens of a specific fileid.
print newcorpus.words(newcorpus.fileids()[0])
Finally, to read a directory of texts and create an NLTK corpus in another languages, you must first ensure that you have a python-callable word tokenization and sentence tokenization modules that takes string/basestring input and produces such output:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
Solution 2:
I think the PlaintextCorpusReader
already segments the input with a punkt tokenizer, at least if your input language is english.
PlainTextCorpusReader's constructor
def __init__(self, root, fileids,
word_tokenizer=WordPunctTokenizer(),
sent_tokenizer=nltk.data.LazyLoader(
'tokenizers/punkt/english.pickle'),
para_block_reader=read_blankline_block,
encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle')
.
For a single string, a tokenizer would be used as follows (explained here, see section 5 for punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())