Get Block of lines between same pattern using regex [duplicate]
In general, an extraction regex looks like
(?s)pattern.*?(?=pattern|$)
Or, if the pattern is at the start of a line,
(?sm)^pattern.*?(?=\npattern|\Z)
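As a minimal sketch of the difference between the two general forms, the snippet below runs both against a made-up sample (the text and the chapter \d keyword are assumptions for illustration only). The unanchored form also starts a block at a mid-line mention of the keyword, while the line-anchored form does not.

```python
import re

# Hypothetical sample: "chapter 2" also appears in the middle of a line.
text = (
    "chapter 1\n"
    "Intro, see chapter 2 for details.\n"
    "More text\n"
    "chapter 2\n"
    "Details here"
)

# Unanchored form: every occurrence of the keyword starts a new block.
loose = re.findall(r"(?s)chapter \d.*?(?=chapter \d|$)", text)

# Line-anchored form: only a keyword at the start of a line starts a block.
anchored = re.findall(r"(?sm)^chapter \d.*?(?=\nchapter \d|\Z)", text)

print(len(loose))  # 3: the mid-line mention opens an extra block
print(anchored)
# ['chapter 1\nIntro, see chapter 2 for details.\nMore text', 'chapter 2\nDetails here']
```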
Here, you could use
re.findall(r'(?s)chapter [0-9].*?(?=chapter [0-9]|\Z)', text)
Details:
- chapter [0-9] - the word chapter, a space, and a digit
- .*? - any zero or more chars, as few as possible (line breaks are included only when (?s) or re.DOTALL is in effect)
- (?=chapter [0-9]|\Z) - a positive lookahead that matches a location immediately followed by chapter, a space, and a digit, or by the end of the whole string.
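To see why the terminating pattern sits inside a lookahead, here is a small sketch comparing it with a consuming non-capturing group on an invented one-line sample. The lookahead only checks what follows without consuming it, so the next match can begin at that same keyword; the consuming variant swallows the next heading and skips that block.

```python
import re

# Invented sample text, used only for illustration.
text = "chapter 1 aaa chapter 2 bbb chapter 3 ccc"

# Lookahead: the next "chapter N" is checked but left in place for the next match.
with_lookahead = re.findall(r"(?s)chapter [0-9].*?(?=chapter [0-9]|\Z)", text)

# Consuming group: the next "chapter N" becomes part of the current match.
consuming = re.findall(r"(?s)chapter [0-9].*?(?:chapter [0-9]|\Z)", text)

print(with_lookahead)  # ['chapter 1 aaa ', 'chapter 2 bbb ', 'chapter 3 ccc']
print(consuming)       # ['chapter 1 aaa chapter 2', 'chapter 3 ccc']
```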
Here, since the text starts with the keyword, you may use
import re
teststr = 'chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.'
my_result = [x.strip() for x in re.split(r'(?!^)(?=chapter \d)', teststr)]
print(my_result)
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
The (?!^)(?=chapter \d) regex means:
- (?!^) - find a location that is not at the start of the string and
- (?=chapter \d) - is immediately followed by chapter, a space, and any digit.
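The pattern matches only positions and consumes no characters, so the string is split at those locations and the whitespace that preceded each chapter stays at the end of the previous item; hence the results are stripped of whitespace in the list comprehension. Note that splitting on an empty match requires Python 3.7 or later.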
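The following sketch, on a shortened sample string of my own, shows what the split returns before stripping, and what happens if the (?!^) guard is left out: the position at the very start of the string then also counts as a split point and yields an empty first item.

```python
import re

# Shortened, invented sample in the same shape as the string above.
sample = "chapter 1 First block. chapter 2 Second block."

# Zero-width split: nothing is removed, so the separating space stays in place.
raw = re.split(r"(?!^)(?=chapter \d)", sample)
print(raw)  # ['chapter 1 First block. ', 'chapter 2 Second block.']

# Without (?!^), the split also happens at index 0 and adds an empty string.
unguarded = re.split(r"(?=chapter \d)", sample)
print(unguarded)  # ['', 'chapter 1 First block. ', 'chapter 2 Second block.']
```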