How to capitalize the first letter of every sentence?
I'm trying to write a program that capitalizes the first letter of each sentence. This is what I have so far, but I cannot figure out how to add back the period in between sentences. For example, if I input:
hello. goodbye
the output is
Hello Goodbye
and the period has disappeared.
string=input('Enter a sentence/sentences please:')
sentence=string.split('.')
for i in sentence:
    print(i.capitalize(),end='')
You could use nltk for sentence segmentation:
#!/usr/bin/env python3
import textwrap
from pprint import pprint
import nltk.data # $ pip install http://www.nltk.org/nltk3-alpha/nltk-3.0a3.tar.gz
# python -c "import nltk; nltk.download('punkt')"
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = input('Enter a sentence/sentences please:')
print("\n" + textwrap.fill(text))
sentences = sent_tokenizer.tokenize(text)
sentences = [sent.capitalize() for sent in sentences]
pprint(sentences)
Output:
Enter a sentence/sentences please: a period might occur inside a sentence e.g., see! and the sentence may end without the dot!
['A period might occur inside a sentence e.g., see!',
 'And the sentence may end without the dot!']
You could use regular expressions. Define a regex that matches the first word of a sentence:
import re
p = re.compile(r'(?<=[\.\?!]\s)(\w+)')
This regex contains a positive lookbehind assertion (?<=...) that matches either ".", "?" or "!", followed by a whitespace character (\s). The lookbehind is followed by a group that matches one or more alphanumeric characters (\w+). In effect, the regex matches the next word after the end of a sentence.
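To see what the pattern picks up, here is a small sketch that runs it over the sample string used in this answer. The variable names sample and matches are only for illustration.

```python
import re

# Same pattern as in the answer: a word preceded by '.', '?' or '!'
# and exactly one whitespace character.
p = re.compile(r'(?<=[\.\?!]\s)(\w+)')

sample = 'Your text here. this is fun! yay.'

# findall returns the contents of the single capturing group for each match.
matches = p.findall(sample)
print(matches)  # ['this', 'yay']
```

Note that the first word of the string ("Your") is not matched, because nothing precedes it for the lookbehind to succeed on.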
You can define a function that capitalises regex match objects, and pass this function to sub():
def cap(match):
    return match.group().capitalize()
p.sub(cap, 'Your text here. this is fun! yay.')
You might want to do the same for another regex that matches the word at the beginning of a string:
p2 = re.compile(r'^\w+')
Or make the original regex even harder to read, by combining them:
p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
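Putting the pieces together, a minimal runnable sketch might look like the following. The wrapper name capitalize_sentences and the constant SENTENCE_START are made up for this example; the pattern and the cap function are the ones from the answer.

```python
import re

# Combined pattern: a word after '.', '?' or '!' plus one whitespace
# character, or a word at the very start of the string.
SENTENCE_START = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')


def cap(match):
    # str.capitalize() upper-cases the first letter and lower-cases the rest.
    return match.group().capitalize()


def capitalize_sentences(text):
    # Hypothetical helper name: replace every matched word with its
    # capitalized form and leave everything else, including punctuation, alone.
    return SENTENCE_START.sub(cap, text)


result = capitalize_sentences('hello. goodbye')
print(result)  # Hello. Goodbye
```

Because sub() only rewrites the matched words, the period from the question's example is kept. Two caveats: str.capitalize() lower-cases the rest of the word (so "iPhone" becomes "Iphone"), and an abbreviation followed by a space, such as "e.g. see", is treated as a sentence end.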
You can split on the period, capitalize each piece, and join the pieces back together with '. ':
In [25]: st = "this is first sentence. this is second sentence. and this is third. this is fourth. and so on"
In [26]: '. '.join(list(map(lambda x: x.strip().capitalize(), st.split('.'))))
Out[26]: 'This is first sentence. This is second sentence. And this is third. This is fourth. And so on'
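The split/join one-liner rebuilds the text with a fixed '. ' separator, so a "?" or "!" in the input is not treated as a sentence end, and input that ends with a period comes back with a trailing '. '. As a hedged alternative (not part of the answer above), re.split with a capturing group keeps whatever separators were in the input. The function name capitalize_keep_punctuation is hypothetical.

```python
import re


def capitalize_keep_punctuation(text):
    # Split on sentence-ending punctuation followed by whitespace. The
    # capturing group makes re.split return the separators as well, so
    # they can be written back unchanged.
    parts = re.split(r'([.?!]+\s+)', text)
    # Even indices hold sentence text, odd indices hold the separators.
    return ''.join(
        part[:1].upper() + part[1:] if i % 2 == 0 else part
        for i, part in enumerate(parts)
    )


fixed = capitalize_keep_punctuation('hello. goodbye')
print(fixed)  # Hello. Goodbye
```

Requiring whitespace after the punctuation leaves "e.g.," inside a sentence untouched, although "e.g. see" would still be split. Only the first character of each piece is upper-cased, so the rest of the sentence keeps its original casing.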