How to parse a hierarchy based on indents with Python
You can mimic the way Python itself parses indentation. First, create a stack that will contain the indentation levels. At each line:
- If the indentation is bigger than the top of the stack, push it and increase the depth level.
- If it is the same, continue at the same level.
- If it is lower, pop the stack (decreasing the depth each time) while its top is higher than the new indentation. If the top then does not match the new indentation exactly, there is an indentation error.
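indentation = [0]  # stack of the indentation widths currently open
depth = 0
with open("test.txt", "r") as f:
    for line in f:
        # rstrip("\n") is safer than line[:-1], which eats a character
        # when the last line has no trailing newline
        line = line.rstrip("\n")
        content = line.strip()
        if not content:
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip())  # leading whitespace only
        if indent > indentation[-1]:
            depth += 1
            indentation.append(indent)
        elif indent < indentation[-1]:
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()
            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")
        print(f"{content} (depth: {depth})")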
With a "test.txt" file whose content is as you provided:
Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses
Here is the output:
Income (depth: 0)
Revenue (depth: 1)
IAP (depth: 2)
Ads (depth: 2)
Other-Income (depth: 1)
Expenses (depth: 0)
Developers (depth: 1)
In-house (depth: 2)
Contractors (depth: 2)
Advertising (depth: 1)
Other Expenses (depth: 1)
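The error case in the last rule is easy to overlook, so here is a minimal sketch that wraps the same stack logic in a generator and feeds it one well-formed and one malformed input. The names iter_depths, good and bad are made up for this illustration; they are not part of the question or of the code above.

```python
def iter_depths(lines):
    """Yield (depth, content) for each non-blank line, using an indentation stack."""
    indentation = [0]
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())
        if indent > indentation[-1]:
            indentation.append(indent)
        else:
            while indent < indentation[-1]:
                indentation.pop()
            if indent != indentation[-1]:
                raise RuntimeError(f"Bad formatting: {stripped!r}")
        # the depth is simply the number of open levels above the root
        yield len(indentation) - 1, stripped


good = ["Income", "   Revenue", "      IAP", "Expenses"]
bad = ["Income", "      IAP", "   Revenue"]  # 3 spaces matches no open level

print(list(iter_depths(good)))

try:
    list(iter_depths(bad))
except RuntimeError as exc:
    print(exc)
```

In the malformed input, "Revenue" dedents to 3 spaces, but the only open levels are 0 and 6, so the stack is popped past it and the mismatch is reported.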
So, what can you do with this? Suppose you want to build nested lists. First, create a data stack that holds a single empty list.
- When the indentation increases, append a new list to the end of the data stack.
- When the indentation decreases, pop the top list and append it to the new top.
And regardless, for each line, append the content to the list at the top of the data stack.
Here is the corresponding implementation:
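data = [[]]  # stack of lists under construction; data[0] is the result
indentation = [0]
depth = 0
with open("test.txt", "r") as f:
    for line in f:
        line = line.rstrip("\n")
        content = line.strip()
        if not content:
            continue  # skip blank lines
        indent = len(line) - len(line.lstrip())
        if indent > indentation[-1]:
            depth += 1
            indentation.append(indent)
            data.append([])
        elif indent < indentation[-1]:
            while indent < indentation[-1]:
                depth -= 1
                indentation.pop()
                top = data.pop()
                data[-1].append(top)
            if indent != indentation[-1]:
                raise RuntimeError("Bad formatting")
        data[-1].append(content)

# Fold any lists still open at the end of the file into their parents.
while len(data) > 1:
    top = data.pop()
    data[-1].append(top)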
Your nested list is at the top of your data stack.
The output for the same file is:
['Income',
    ['Revenue',
        ['IAP',
         'Ads'
        ],
     'Other-Income'
    ],
 'Expenses',
    ['Developers',
        ['In-house',
         'Contractors'
        ],
     'Advertising',
     'Other Expenses'
    ]
]
This is rather easy to manipulate, although quite deeply nested. You can access the data by chaining the item accesses:
>>> l = data[0]
>>> l
['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expenses']]
>>> l[1]
['Revenue', ['IAP', 'Ads'], 'Other-Income']
>>> l[1][1]
['IAP', 'Ads']
>>> l[1][1][0]
'IAP'
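Chained indexing gets unwieldy once the tree is deep. A short recursive walk is one way around that. The sketch below assumes the nested-list shape shown above (a name, optionally followed by a list of its children); walk and tree are names chosen here, not part of the original answer.

```python
def walk(items, depth=0):
    """Yield (depth, name) pairs from the nested-list structure."""
    for item in items:
        if isinstance(item, list):
            yield from walk(item, depth + 1)  # a sub-list holds the children
        else:
            yield depth, item


tree = ['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'],
        'Expenses', ['Developers', ['In-house', 'Contractors'],
                     'Advertising', 'Other Expenses']]

# Re-create the indented text from the nested lists.
for depth, name in walk(tree):
    print("   " * depth + name)
```

Because the walk yields flat (depth, name) pairs, filtering or searching the hierarchy no longer requires knowing how deep an item sits.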
If the indentation is a fixed number of spaces (3 spaces here), you can simplify the calculation of the indentation level.
Note: I use io.StringIO to simulate a file.
import io
import itertools

content = u"""\
Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses
"""

stack = []
for line in io.StringIO(content):
    text = line.rstrip()  # drop \n
    row = text.split("   ")  # one empty field per 3-space indent level
    stack[:] = stack[:len(row) - 1] + [row[-1]]
    print("\t".join(stack))
You get:
Income
Income Revenue
Income Revenue IAP
Income Revenue Ads
Income Other-Income
Expenses
Expenses Developers
Expenses Developers In-house
Expenses Developers Contractors
Expenses Advertising
Expenses Other Expenses
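With a fixed indent width, the depth is just the number of leading spaces divided by that width, so the split can be replaced by arithmetic. This is a small sketch of that idea; INDENT_WIDTH, paths and sample are names assumed here, and it also assumes the indentation uses spaces only.

```python
INDENT_WIDTH = 3  # assumed fixed width, as in the sample above


def paths(lines, width=INDENT_WIDTH):
    """Yield the full path (ancestors plus the item itself) for each line."""
    stack = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        spaces = len(line) - len(line.lstrip(" "))
        if spaces % width:
            raise ValueError(f"indent of {spaces} is not a multiple of {width}")
        depth = spaces // width
        if depth > len(stack):
            raise ValueError(f"{text!r} skips an indentation level")
        stack[depth:] = [text]  # drop deeper entries, then add this one
        yield list(stack)


sample = ["Income", "   Revenue", "      IAP", "      Ads",
          "   Other-Income", "Expenses"]
for path in paths(sample):
    print("\t".join(path))
```

The slice assignment does the same job as rebuilding the stack from a prefix: everything at or below the current depth is discarded before the new item is added.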
EDIT: indentation not fixed
If the indentation isn't fixed (you don't always have 3 spaces) like in the example below:
content = u"""\
Income
  Revenue
       IAP
       Ads
  Other-Income
Expenses
    Developers
        In-house
        Contractors
    Advertising
    Other Expenses
"""
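You need to work out the shift relative to the previous line. Note that this only compares each line with the one before it, so a dedent always moves up exactly one level: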
stack = []
last_indent = u""
for line in io.StringIO(content):
    indent = "".join(itertools.takewhile(lambda c: c == " ", line))
    shift = 0 if indent == last_indent else (-1 if len(indent) < len(last_indent) else 1)
    index = len(stack) + shift
    stack[:] = stack[:index - 1] + [line.strip()]
    last_indent = indent
    print("\t".join(stack))
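The shift-based loop above moves up only one level per dedent, so a line that drops several levels at once ends up under the wrong parent. One way to cope with that is to keep a stack of indentation widths alongside the path. The following is a sketch under that assumption, not the original author's code; paths_any_indent is a hypothetical name and the sample text is made up for the illustration.

```python
import io


def paths_any_indent(lines):
    """Yield the path to each item; indent widths may vary between branches."""
    indents = []  # indentation width of each entry in `path`
    path = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip(" "))
        # Close every open level that is at least as deep as this line.
        while indents and indent <= indents[-1]:
            indents.pop()
            path.pop()
        indents.append(indent)
        path.append(text)
        yield list(path)


content = """\
Income
  Revenue
       IAP
Expenses
    Developers
"""

for path in paths_any_indent(io.StringIO(content)):
    print("\t".join(path))
```

Here "Expenses" follows "IAP" directly and drops two levels in one step. Unlike the stack-based parser in the first answer, this version does not raise on an inconsistent dedent; it attaches the line to the nearest shallower ancestor.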