How to parse hierarchy based on indents with python

You can mimic the way Python itself parses indentation. First, create a stack that holds the indentation levels seen so far. Then, for each line:

If the indentation is greater than the top of the stack, push it and increase the depth level.
If it is the same, continue at the same level.
If it is smaller, pop the stack while its top is greater than the new indentation. If the top you end up with is not exactly equal to the new indentation, there is an indentation error.

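indentation = [0]  # stack of indentation widths; 0 is the top level
depth = 0

with open("test.txt", "r") as f:
    for line in f:
        line = line.rstrip()  # drop the newline and any trailing whitespace
        if not line:
            continue  # skip blank lines

        content = line.lstrip()
        indent = len(line) - len(content)
        if indent > indentation[-1]:
            # deeper than the current level: open a new one
            depth += 1
            indentation.append(indent)

        elif indent < indentation[-1]:
            # shallower: close levels until we are back at a known one
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()

            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")

        print(f"{content} (depth: {depth})")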

With a "test.txt" file whose content is as you provided:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

Here is the output:

Income (depth: 0)
Revenue (depth: 1)
IAP (depth: 2)
Ads (depth: 2)
Other-Income (depth: 1)
Expenses (depth: 0)
Developers (depth: 1)
In-house (depth: 2)
Contractors (depth: 2)
Advertising (depth: 1)
Other Expenses (depth: 1)
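
To see the error case from the rules above in action, here is a minimal sketch that wraps the same stack logic in a generator. The name `iter_depths` and the sample lines are made up for illustration, and it takes any iterable of lines rather than a file.

```python
def iter_depths(lines):
    """Yield (depth, content) for each non-blank line (hypothetical helper)."""
    indentation = [0]  # stack of indentation widths currently open
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # ignore blank lines
        content = line.lstrip()
        indent = len(line) - len(content)
        if indent > indentation[-1]:
            indentation.append(indent)
        else:
            while indent < indentation[-1]:
                indentation.pop()
            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")
        # the depth is simply the number of levels open above the top level
        yield len(indentation) - 1, content


good = ["Income", "   Revenue", "      IAP", "Expenses"]
print(list(iter_depths(good)))

# "  Ads" dedents to 2 spaces, which matches no open level (0 or 3)
bad = ["Income", "   Revenue", "  Ads"]
try:
    list(iter_depths(bad))
except RuntimeError as exc:
    print("error:", exc)
```

Here the depth is derived from the length of the stack instead of being tracked in a separate counter, which keeps the two from drifting apart.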

So, what can you do with this? Suppose you want to build nested lists. First, create a data stack that initially holds one empty list.

When the indentation increases, append a new empty list to the end of the data stack.
When it decreases, pop the top list and append it to the new top, once for each level you close.

And regardless, for each line, append the content to the list at the top of the data stack.

Here is the corresponding implementation:

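indentation = [0]
depth = 0
data = [[]]  # data stack; data[-1] is the list currently being filled

with open("test.txt", "r") as f:
    for line in f:
        line = line.rstrip()  # drop the newline and any trailing whitespace
        if not line:
            continue  # skip blank lines

        content = line.lstrip()
        indent = len(line) - len(content)
        if indent > indentation[-1]:
            depth += 1
            indentation.append(indent)
            data.append([])

        elif indent < indentation[-1]:
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()
                top = data.pop()
                data[-1].append(top)

            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")

        data[-1].append(content)

# close the levels that are still open at the end of the file
while len(data) > 1:
    top = data.pop()
    data[-1].append(top)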

Your nested list is now the only item left on the data stack (data[0]). The output for the same file is:

['Income',
    ['Revenue',
        ['IAP',
         'Ads'
        ],
     'Other-Income'
    ],
 'Expenses',
    ['Developers',
        ['In-house',
         'Contractors'
        ],
     'Advertising',
     'Other Expenses'
    ]
 ]

This is rather easy to manipulate, although quite deeply nested. You can access the data by chaining the item accesses:

>>> l = data[0]
>>> l
['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expenses']]
>>> l[1]
['Revenue', ['IAP', 'Ads'], 'Other-Income']
>>> l[1][1]
['IAP', 'Ads']
>>> l[1][1][0]
'IAP'
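
Since a sublist always holds the children of the string just before it, a short recursive function can walk the whole structure instead of chaining indexes by hand. This is only a sketch: `walk` is a hypothetical helper, and `nested` is the list from above typed in by hand.

```python
def walk(items, depth=0):
    """Yield (depth, name) pairs from the nested-list structure."""
    for item in items:
        if isinstance(item, list):
            # a sublist holds the children of the preceding string
            yield from walk(item, depth + 1)
        else:
            yield depth, item


nested = ['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'],
          'Expenses', ['Developers', ['In-house', 'Contractors'],
                       'Advertising', 'Other Expenses']]

# print the hierarchy back with 3 spaces per level
for depth, name in walk(nested):
    print("   " * depth + name)
```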

If the indentation is a fixed number of spaces (3 spaces here), you can simplify the calculation of the indentation level.

Note: I use a StringIO to simulate a file.

import io
import itertools

content = u"""\
Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses
"""

stack = []
for line in io.StringIO(content):
    text = line.rstrip()  # drop \n
    row = text.split("   ")  # one empty string per indentation level, then the name
    stack[:] = stack[:len(row) - 1] + [row[-1]]  # keep the parents, replace the rest
    print("\t".join(stack))

You get:

Income
Income  Revenue
Income  Revenue IAP
Income  Revenue Ads
Income  Other-Income
Expenses
Expenses    Developers
Expenses    Developers  In-house
Expenses    Developers  Contractors
Expenses    Advertising
Expenses    Other Expenses
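
With a fixed width, the depth can also be computed directly as the number of leading spaces divided by the width. The following is a sketch under that assumption; `INDENT_WIDTH` and `depth_of` are names chosen for illustration.

```python
INDENT_WIDTH = 3  # assumed fixed indentation width


def depth_of(line, width=INDENT_WIDTH):
    """Return the nesting depth of a line indented with `width` spaces per level."""
    leading = len(line) - len(line.lstrip(" "))
    if leading % width:
        raise ValueError(f"indentation of {leading} is not a multiple of {width}")
    return leading // width


sample = ["Income", "   Revenue", "      IAP", "   Other-Income"]
stack = []
for line in sample:
    depth = depth_of(line)
    stack[depth:] = [line.strip()]  # truncate to the parent path, then add this entry
    print(" > ".join(stack))
```

Unlike splitting on three spaces, this version rejects indentation that is not a whole number of levels instead of silently misreading it.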

EDIT: indentation not fixed

If the indentation isn't fixed (you don't always have 3 spaces) like in the example below:

content = u"""\
Income
   Revenue
    IAP
    Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
  Advertising
  Other Expenses
"""

You need to estimate the shift in level at each new line by comparing its indentation with that of the previous line:

stack = []
last_indent = u""
for line in io.StringIO(content):
    indent = "".join(itertools.takewhile(lambda c: c == " ", line))
    shift = 0 if indent == last_indent else (-1 if len(indent) < len(last_indent) else 1)
    index = len(stack) + shift
    stack[:] = stack[:index - 1] + [line.strip()]
    last_indent = indent
    print("\t".join(stack))
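
The loop above moves up only one level per dedent, so a line that closes two levels at once would end up under the wrong parent. One way around this, sketched here as an alternative rather than as part of the original answer, is to keep each entry's indentation width on the stack and pop every entry indented at least as far as the current line. The sample data and the names `paths` and `name` are assumptions for illustration.

```python
import io

content = """\
Income
   Revenue
    IAP
    Ads
Expenses
   Developers
      In-house
  Advertising
"""

stack = []  # (indent width, name) pairs for the current path
paths = []  # one list of names per line, root first
for line in io.StringIO(content):
    name = line.strip()
    if not name:
        continue  # skip blank lines
    indent = len(line) - len(line.lstrip(" "))
    # drop every entry that is not a strict ancestor of this line
    while stack and stack[-1][0] >= indent:
        stack.pop()
    stack.append((indent, name))
    paths.append([n for _, n in stack])
    print("\t".join(paths[-1]))
```

In this sample, "Expenses" follows "Ads" directly and still lands at the top level, and "Advertising" (2 spaces) is attached to "Expenses" even though its indentation matches no earlier line.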
