How do I turn this oddly formatted looped print function into a data frame with similar output?

I found a code chunk that is useful for my project, but I can't get it to build a data frame in the same two-column format that it prints.

The code chunk and desired output:

import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# Step Two: Load Data
 
sentence = "Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr."

# Step Three: Tokenise, find parts of speech and chunk words 

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Only named-entity subtrees have a label; plain tokens are tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

Clean output, with the tag in one column and the entity in the other:

PERSON Martin
PERSON Luther King
PERSON Michael King
ORGANIZATION American
GPE American
GPE Christian
PERSON Mahatma Gandhi
PERSON Martin Luther

I tried something like this, but the results are not nearly as clean.

df = []  # presumably a plain list, given the output below
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            df.append(chunk)

Output:

[Tree('PERSON', [('Martin', 'NNP')]),
 Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
 Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
 Tree('ORGANIZATION', [('American', 'JJ')]),
 Tree('GPE', [('American', 'NNP')]),
 Tree('GPE', [('Christian', 'JJ')]),
 Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
 Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')])]

Is there an easy way to turn the printed output into a data frame with just two columns?


Collect each label/entity pair in a nested list, then convert that list to a DataFrame:

L = []
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # One row per entity: [tag, words joined by spaces]
            L.append([chunk.label(), ' '.join(c[0] for c in chunk)])

df = pd.DataFrame(L, columns=['a','b'])
print(df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther

The same solution as a list comprehension:

L = [[chunk.label(), ' '.join(c[0] for c in chunk)]
     for sent in nltk.sent_tokenize(sentence)
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
     if hasattr(chunk, 'label')]

df = pd.DataFrame(L, columns=['a','b'])
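If you have already collected the Tree objects in a list, as in the question's second attempt, you should be able to convert that list directly without re-running the tagger. Below is a minimal sketch under that assumption. The helper name `chunks_to_frame` and the column names `label` and `entity` are made up for illustration, and the `chunks` list is copied from the output shown in the question.

```python
import pandas as pd
from nltk.tree import Tree

# Tree objects copied from the question's output (no NLTK downloads needed here).
chunks = [
    Tree('PERSON', [('Martin', 'NNP')]),
    Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
    Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
    Tree('ORGANIZATION', [('American', 'JJ')]),
    Tree('GPE', [('American', 'NNP')]),
    Tree('GPE', [('Christian', 'JJ')]),
    Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
    Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')]),
]

def chunks_to_frame(chunks):
    """Hypothetical helper: build one (label, entity) row per chunk."""
    rows = [
        # leaves() yields the (word, POS tag) tuples; keep only the words
        (chunk.label(), ' '.join(word for word, _tag in chunk.leaves()))
        for chunk in chunks
    ]
    return pd.DataFrame(rows, columns=['label', 'entity'])

df = chunks_to_frame(chunks)
print(df)
```

Descriptive column names make later filtering easier to read, for example `df[df['label'] == 'PERSON']`, but `['a','b']` works just as well.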