Align whitespace after segmentation into sentences

I am using a segmentation library to split a string into sentences:

s = "Lucas ipsum dolor sit amet darth\n 
     mandalore kit. Endor Mr. Wookiee wicket\n 
     jawa yavin ackbar jabba? Padmé\n
     utapau palpatine kenobi moff.\n
     Sidious anakin mace:\n
      - Ben darth.\n
      - Ben vader."

segment(s)
# => 
["Lucas ipsum dolor sit amet darth mandalore kit.",
 "Endor Mr. Wookiee wicket jawa yavin ackbar jabba?",
 "Padmé utapau palpatine kenobi moff.",
 "Sidious anakin mace:",
 "- Ben darth.",
 "- Ben vader."]

Unfortunately, the library has no way to retain the original whitespace: it strips all newlines, removes leading and trailing whitespace, and collapses multiple spaces into a single space.

The segmentation library is otherwise awesome (it handles things like Mr. Wookiee correctly). What would be a concise way to re-insert the whitespace into the split sentences, given that the original text is still available?

Expected outcome:

["Lucas ipsum dolor sit amet darth\nmandalore kit. ",
 "Endor Mr. Wookiee wicket\njawa yavin ackbar jabba? ",
 "Padmé\nutapau palpatine kenobi moff.\n",
 "Sidious anakin mace:\n",
 " - Ben darth.\n",
 " - Ben vader."]

The original code is Ruby, but solutions may be in any language.

I think the easiest and most concise way to do this (although certainly not the most efficient) is with regular expressions. The idea is to escape each segmented sentence, replace its spaces with a pattern for one or more whitespace characters, allow whitespace at the beginning and end of each sentence's pattern, wrap each one in a capture group, glue the groups together, and match the original text against the result.

In Python:

import re

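def insert_whitespace(original, segmented):
    # One capture group per sentence: optional leading spaces/tabs, the
    # sentence with each space widened to any whitespace run, optional
    # trailing spaces/tabs, then optional whitespace ending in a newline.
    inner = "".join(
        r"([ \t]*{}[ \t]*(?:\s*\n)?)".format(
            re.escape(line).replace(r"\ ", r"\s+"))
        for line in segmented)
    pattern = f"^{inner}$"

    match = re.match(pattern, original)
    if match is None:
        # Without this check, a mismatch surfaces as an AttributeError.
        raise ValueError("segmented sentences do not line up with the original text")
    return match.groups()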
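
To make the escaping step more concrete, here is a small sketch of what the pattern for a single sentence looks like before it gets wrapped in a capture group. The variable names `sentence` and `piece` are just for illustration and are not part of the function above.

```python
import re

sentence = "Endor Mr. Wookiee wicket"

# Escape regex metacharacters (here, the dot), then let every escaped
# space match any run of whitespace, including newlines and tabs.
piece = re.escape(sentence).replace(r"\ ", r"\s+")

print(piece)
# Endor\s+Mr\.\s+Wookiee\s+wicket

# The same words still match when they are spread across lines.
print(re.fullmatch(piece, "Endor Mr.\n   Wookiee\twicket") is not None)
# True
```

The replacement relies on `re.escape` turning each space into a backslash followed by a space, which is what makes the `r"\ "` lookup work.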

Example:

from pprint import pprint

original = """Lucas ipsum dolor sit amet darth\n
 mandalore kit. Endor Mr. Wookiee wicket
 jawa yavin ackbar jabba? Padmé
utapau palpatine kenobi moff.
Sidious anakin mace:
 - Ben darth.
 - Ben vader."""

segmented = ["Lucas ipsum dolor sit amet darth mandalore kit.",
 "Endor Mr. Wookiee wicket jawa yavin ackbar jabba?",
 "Padmé utapau palpatine kenobi moff.",
 "Sidious anakin mace:",
 "- Ben darth.",
 "- Ben vader."]

output = insert_whitespace(original, segmented)
pprint(output)
# out: ('Lucas ipsum dolor sit amet darth\n\n mandalore kit. ',
# out:  'Endor Mr. Wookiee wicket\n jawa yavin ackbar jabba? ',
# out:  'Padmé\nutapau palpatine kenobi moff.\n',
# out:  'Sidious anakin mace:\n',
# out:  ' - Ben darth.\n',
# out:  ' - Ben vader.')

Note: You may need to adjust the regex depending on where you want certain whitespace to go (for example, everything up to and including the first newline gets appended to the previous sentence, or all whitespace other than trailing spaces does, etc.). The regex I'm using is admittedly a bit gnarly, but it should handle all forms of whitespace gracefully, and most of the time it appends/prepends sensibly.
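
As one possible variation, the sketch below walks through the original text sentence by sentence instead of building a single large pattern. It assumes a different placement policy from the one above: all whitespace between two sentences is appended to the earlier one, so the list items lose their leading space and the preceding entries gain it. The helper name `realign` is hypothetical, and this is only a rough illustration of the idea, not a drop-in replacement.

```python
import re

def realign(original, segmented):
    """Re-attach whitespace one sentence at a time (hypothetical helper)."""
    result = []
    pos = 0
    for sentence in segmented:
        # Same trick as before: a space may stand for any whitespace run.
        body = re.escape(sentence).replace(r"\ ", r"\s+")
        # Assumed policy: leading whitespace can only occur before the first
        # sentence; after that, all whitespace up to the next sentence is
        # appended to the current one.
        match = re.compile(r"\s*" + body + r"\s*").match(original, pos)
        if match is None:
            raise ValueError(f"cannot locate sentence in original: {sentence!r}")
        result.append(match.group())
        pos = match.end()
    if pos != len(original):
        raise ValueError("original contains text the segments do not cover")
    return result

parts = realign("One  two.\n  - Three.", ["One two.", "- Three."])
print(parts)
# ['One  two.\n  ', '- Three.']
```

Because every character of the original ends up in exactly one entry, `"".join(parts)` reproduces the original text, which makes for a cheap sanity check.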
