How can I parse free-text time intervals in Python, ranging from years to seconds?
This one is new to me, but based on some googling: have you tried whoosh?
Edit: There's also parsedatetime:
#!/usr/bin/env python
from datetime import datetime
import parsedatetime as pdt  # $ pip install parsedatetime
cal = pdt.Calendar()
for time_str in ['1 second', '2 minutes', '3 hours', '5 weeks', '6 months', '7 years']:
    diff = cal.parseDT(time_str, sourceTime=datetime.min)[0] - datetime.min
    print("{time_str:<10} -> {diff!s:>20} <{diff!r}>".format(**vars()))
Output
1 second -> 0:00:01 <datetime.timedelta(0, 1)>
2 minutes -> 0:02:00 <datetime.timedelta(0, 120)>
3 hours -> 3:00:00 <datetime.timedelta(0, 10800)>
5 weeks -> 35 days, 0:00:00 <datetime.timedelta(35)>
6 months -> 181 days, 0:00:00 <datetime.timedelta(181)>
7 years -> 2556 days, 0:00:00 <datetime.timedelta(2556)>
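The 181 and 2556 in the last two rows are not round multiples of 30 or 365. A likely explanation, assuming parsedatetime applies month and year offsets as calendar arithmetic starting from sourceTime (here datetime.min, i.e. 0001-01-01), is that the result depends on the real lengths of the months and years that follow the start date. The same day counts can be reproduced with the standard library alone:

```python
from datetime import datetime

start = datetime.min  # 0001-01-01 00:00:00, the sourceTime used above

# Six calendar months after 1 January of year 1 is 1 July of year 1.
six_months = datetime(1, 7, 1) - start

# Seven calendar years after year 1 is year 8; year 4 is a leap year.
seven_years = datetime(8, 1, 1) - start

print(six_months.days)   # 181
print(seven_years.days)  # 2556
```

If that assumption holds, a different sourceTime would give slightly different durations for the same text.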
How about the pytimeparse library?
It returns the time as a number of seconds:
>>> from pytimeparse.timeparse import timeparse
>>> timeparse('33m')
1980
>>> timeparse('2h33m')
9180
>>> timeparse('4:17')
257
>>> timeparse('5hr34m56s')
20096
>>> timeparse('1.2 minutes')
72
The source appears to be at https://github.com/wroberts/pytimeparse
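Because timeparse returns a plain number of seconds, turning the result into a timedelta is a single constructor call. The sketch below uses the values from the session above instead of calling pytimeparse, so it runs with the standard library only; to_timedelta is a hypothetical helper name.

```python
from datetime import timedelta


def to_timedelta(seconds):
    """Wrap a number of seconds (as returned by timeparse) in a timedelta."""
    return timedelta(seconds=seconds)


# Values copied from the pytimeparse output shown above.
for text, seconds in [("33m", 1980), ("2h33m", 9180), ("5hr34m56s", 20096)]:
    print(f"{text:<10} -> {to_timedelta(seconds)}")
```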
Original answer
Not a solution, because dateutil can parse points in time, but not intervals:
from dateutil.parser import parse
examples = """
August 3rd, 2019
2019-08-03
2019, 3rd aug, 2:45 pm
"""
formatted_examples = [
    (example, f"{(p := parse(example))} <{p!r}>")
    for example in filter(None, examples.splitlines())
]
longest_example = max(map(lambda tup: len(tup[0]), formatted_examples))
longest_parsed = max(map(lambda tup: len(tup[1]), formatted_examples))
for example, parsed_example in formatted_examples:
    print(f"{example: <{longest_example}s} -> {parsed_example: >{longest_parsed}s}")
On PyPI, the package is called python-dateutil.
Parsing
We can write a parser. It doesn't make a huge difference which parser library is used. I searched for "python parser" and chose lark because it appeared near the top of the results.
First, I defined the units as a mapping. This is where more units could be added, if "centuries" or "microseconds" are needed.
Note: For very small or large numbers, keep in mind timedelta.resolution
units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}
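To make the note about timedelta.resolution concrete: a timedelta stores whole microseconds and is limited to just under a billion days, so scaling these unit values by very small or very large numbers either rounds to zero or overflows. A quick check with the standard library (the units dict is trimmed to two entries for brevity):

```python
from datetime import timedelta

units = {
    "second": timedelta(seconds=1),
    "year": timedelta(days=365),
}

print(timedelta.resolution)  # 0:00:00.000001
print(timedelta.max.days)    # 999999999

# 0.1 microseconds is below the resolution and rounds to zero.
tiny = 1e-7 * units["second"]
print(tiny == timedelta(0))  # True

# Ten million 365-day years exceeds timedelta.max.
try:
    1e7 * units["year"]
except OverflowError as error:
    print(f"OverflowError: {error}")
```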
Next, the grammar is defined using lark's variant of EBNF. Here, WS hopefully matches all whitespace:
time_interval_grammar = r"""
%import common.WS
%import common.NUMBER
?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+
SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR: /h\w*/i
DAY: /d\w*/i
WEEK: /w\w*/i
MONTH: /mo\w*/i
YEAR: /y\w*/i
%ignore WS
%ignore ","
"""
The grammar should allow arbitrary time intervals to be chained together, with or without commas as separators.
Each time interval's unit can be given as the shortest unique prefix:
second -> s
minute -> mi
hour -> h
day -> d
week -> w
month -> mo
year -> y
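To see how these prefixes behave, the terminal patterns from the grammar can be tried on their own with the standard library's re module. This only illustrates the patterns, not how lark tokenizes its input; classify is a hypothetical helper. Note that \w* accepts any trailing word characters, so misspelled units still match as long as the prefix is right.

```python
import re

# Same patterns as the grammar's terminals, compiled case-insensitively.
terminals = {
    "second": re.compile(r"s\w*", re.IGNORECASE),
    "minute": re.compile(r"mi\w*", re.IGNORECASE),
    "hour": re.compile(r"h\w*", re.IGNORECASE),
    "day": re.compile(r"d\w*", re.IGNORECASE),
    "week": re.compile(r"w\w*", re.IGNORECASE),
    "month": re.compile(r"mo\w*", re.IGNORECASE),
    "year": re.compile(r"y\w*", re.IGNORECASE),
}


def classify(word):
    """Return the unit whose pattern matches the whole word, else None."""
    for name, pattern in terminals.items():
        if pattern.fullmatch(word):
            return name
    return None


for word in ["s", "secs", "mi", "Months", "yrs", "miasdf", "m"]:
    print(f"{word!r:>10} -> {classify(word)}")
```

A bare "m" matches nothing, which is why "mi" and "mo" are the shortest usable prefixes for minutes and months.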
Including the ones in the original question, these will serve as the target examples we want to parse:
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years
1 month, 7 years, 2 days, 30 hours, 0.05 seconds
0.0003 years, 100000 seconds
3y 4mo 9min 6d
1mo,3d 1.3e2 hours, 0.04yrs 2mi444
Lastly, I followed one of the lark tutorials and used a transformer:
class IntervalToTimedelta(Transformer):
    # The class itself (not an instance) is handed to Lark below, so these
    # callbacks are looked up on the class and take no `self` argument.
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree, timedelta(seconds=0))

    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)
The grammar is interpreted by lark.Lark. Since it is compatible with lark's LALR(1) parser, that parser is specified to gain some speed and improve memory efficiency by allowing the transformer to be used directly by the parser:
time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)
This produces a mostly working parser. The complete answer.py file is this:
"""
Example parsing date and time interval with lark
"""
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Union
from lark import Lark, Token, Transformer
__all__ = [
    "examples",
    "IntervalToTimedelta",
    "parse",
]
examples = list(
    filter(
        None,
        """
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years
1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
    )
)
units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}
time_interval_grammar = r"""
%import common.WS
%import common.NUMBER
?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+
SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR: /h\w*/i
DAY: /d\w*/i
WEEK: /w\w*/i
MONTH: /mo\w*/i
YEAR: /y\w*/i
%ignore WS
%ignore ","
"""
class IntervalToTimedelta(Transformer):
    # The class itself (not an instance) is handed to Lark below, so these
    # callbacks are looked up on the class and take no `self` argument.
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree, timedelta(seconds=0))

    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)


time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)
parse = time_interval_parser.parse

if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )
Running it prints each example alongside its parsed value:
$ python .\answer.py
1 second -> 0:00:01 <datetime.timedelta(seconds=1)>
2 minutes -> 0:02:00 <datetime.timedelta(seconds=120)>
3 hours -> 3:00:00 <datetime.timedelta(seconds=10800)>
4 days -> 4 days, 0:00:00 <datetime.timedelta(days=4)>
5 weeks -> 35 days, 0:00:00 <datetime.timedelta(days=35)>
6 months -> 180 days, 0:00:00 <datetime.timedelta(days=180)>
7 years -> 2555 days, 0:00:00 <datetime.timedelta(days=2555)>
1 month, 0.05 weeks -> 30 days, 8:24:00 <datetime.timedelta(days=30, seconds=30240)>
0.003y, 100000secs -> 2 days, 6:03:28 <datetime.timedelta(days=2, seconds=21808)>
3y 4mo 9min 6d -> 1221 days, 0:09:00 <datetime.timedelta(days=1221, seconds=540)>
1mo,3d 1.3e2 hours, -> 38 days, 10:00:00 <datetime.timedelta(days=38, seconds=36000)>
0.04yrs 2miasdf -> 14 days, 14:26:00 <datetime.timedelta(days=14, seconds=51960)>
This works fine, and the performance is adequate:
$ python -m timeit -s "from answer import parse, examples" "for example in examples:" " parse(example)"
500 loops, best of 5: 415 usec per loop
Potential improvements
Currently, this does not have any error handling, though this is by omission: lark does raise errors, so the parse() function could catch any that can be handled gracefully.
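As a rough sketch of what such handling could look like, the wrapper below turns parser errors into a single ValueError. It is written against a generic parse callable so that it runs without lark installed; parse_or_raise and fake_parse are hypothetical names. With the lark-based parser, the errors tuple would presumably contain lark's exception base class, lark.exceptions.LarkError.

```python
from datetime import timedelta
from typing import Callable, Tuple, Type


def parse_or_raise(
    parse: Callable[[str], timedelta],
    text: str,
    errors: Tuple[Type[Exception], ...],
) -> timedelta:
    """Call parse(text), converting the given parser errors to ValueError."""
    try:
        return parse(text)
    except errors as error:
        raise ValueError(f"Could not parse interval: {text!r}") from error


def fake_parse(text: str) -> timedelta:
    """Stand-in parser (an assumption for this demo): understands 'N seconds'."""
    number, unit = text.split()
    if not unit.startswith("s"):
        raise KeyError(unit)
    return timedelta(seconds=float(number))


print(parse_or_raise(fake_parse, "5 seconds", (KeyError,)))  # 0:00:05

try:
    parse_or_raise(fake_parse, "5 parsecs", (KeyError,))
except ValueError as error:
    print(error)  # Could not parse interval: '5 parsecs'
```

Chaining with "from error" keeps the original parser exception available for debugging while callers only need to handle ValueError.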
Some other downsides to this particular implementation:
- Doesn't type check with mypy --strict
- It requires the use of a 3rd-party library
- The grammar could better shape the resulting parse tree
Regular Expressions
Alternatively, instead of using a library for parsing, regular expressions can be used with the built-in re module.
This has a few disadvantages:
- Regular expressions are challenging to make flexible
- Complex regular expressions are difficult to read
- Regular expressions generally take longer for a human to interpret
It can be faster, though, and should only need the standard library included in CPython.
Using the previous example as a starting point, this is one way regular expressions could be swapped in:
"""
Example parsing date and time interval with re
"""
import re
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Tuple
__all__ = [
    "examples",
    "parse",
]
examples = list(
    filter(
        None,
        """
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years
1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
    )
)
comma = ","
ws = r"\s"
separator = fr"[{ws}{comma}]+"


def unit_name(string: str) -> re.Pattern:
    # Case-insensitive, to match the /i flags used in the lark grammar.
    return re.compile(fr"{string}\w*", flags=re.IGNORECASE)


second = unit_name("s")
minute = unit_name("mi")
hour = unit_name("h")
day = unit_name("d")
week = unit_name("w")
month = unit_name("mo")
year = unit_name("y")
units = {
    second: timedelta(seconds=1),
    minute: timedelta(minutes=1),
    hour: timedelta(hours=1),
    day: timedelta(days=1),
    week: timedelta(weeks=1),
    month: timedelta(days=30),
    year: timedelta(days=365),
}
unit = re.compile(
    "("
    + "|".join(
        regex.pattern for regex in [second, minute, hour, day, week, month, year]
    )
    + ")",
    flags=re.IGNORECASE,
)
digit = r"\d"
integer = fr"({digit}+)"
decimal = fr"({integer}\.({integer})?|\.{integer})"
signed_integer = fr"([+-]?{integer})"
exponent = fr"([eE]{signed_integer})"
float_ = fr"({integer}{exponent}|{decimal}({exponent})?)"
number = re.compile(fr"({float_}|{integer})")
# The flag is needed here too: finditer() below uses this pattern directly,
# so without it "3 Hours" would pass the interval check but yield no matches.
time = re.compile(
    fr"(?P<number>{number.pattern}){ws}*(?P<unit>{unit.pattern})",
    flags=re.IGNORECASE,
)
interval = re.compile(fr"({time.pattern}({separator})*)+", flags=re.IGNORECASE)
def normalize_unit(text: str) -> timedelta:
    "maps units to their respective timedelta"
    if not unit.match(text):
        raise ValueError(f"Not a unit: {text}")
    for unit_re in units:
        if unit_re.match(text):
            return units[unit_re]
    raise ValueError(f"No matching unit found: {text}")


def parse(text: str) -> timedelta:
    if not interval.match(text):
        raise ValueError(f"Parser Error: {text}")
    parsed_pairs: List[Tuple[float, timedelta]] = list()
    for match in time.finditer(text):
        parsed_number = float(match["number"])
        parsed_unit = normalize_unit(match["unit"])
        parsed_pairs.append((parsed_number, parsed_unit))
    timedeltas = [mul(*pair) for pair in parsed_pairs]
    return reduce(add, timedeltas, timedelta(seconds=0))


if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )
The number parsing mimics lark's built-in grammar definitions.
The performance for this is better:
$ python -m timeit -s "from answer_re import parse, examples" "for example in examples:" " parse(example)"
2000 loops, best of 5: 109 usec per loop
But it's less readable, and making changes to maintain it will require more work.
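One behaviour worth knowing about in the regular-expression version: a compiled pattern's match() only anchors at the start of the string, so trailing text that is not part of an interval is not rejected by the interval check. If stricter validation is wanted, fullmatch() is one option. The sketch below uses a deliberately simplified pattern (integers with second and minute units only, an assumption made to keep it short) to show the difference:

```python
import re

# Simplified stand-in for the interval pattern: integers, s/mi units only.
time = r"\d+\s*(?:s\w*|mi\w*)"
interval = re.compile(fr"(?:{time}[\s,]*)+", flags=re.IGNORECASE)

text = "2 minutes and then some"

# match() succeeds because the string merely starts with an interval.
print(bool(interval.match(text)))      # True

# fullmatch() requires the whole string to be an interval.
print(bool(interval.fullmatch(text)))  # False
print(bool(interval.fullmatch("2 minutes, 30 secs")))  # True
```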
Notes
As-is, both examples behave in a way that doesn't quite match up with how humans expect time intervals to work:
>>> from answer_re import parse
>>> from datetime import datetime
>>> datetime(2000, 1, 1) + parse("9 years")
datetime.datetime(2008, 12, 29, 0, 0)
>>> str(_)
'2008-12-29 00:00:00'
Compare this to what most people would expect it to be, 2009-01-01 00:00:00.
This Stack Overflow question provides a few solutions, one of which uses dateutil. Both of the examples above can be adapted by modifying the units mapping to use appropriate relativedelta objects.
This is what the first example would look like:
...
units = {
    "second": relativedelta(seconds=1),
    "minute": relativedelta(minutes=1),
    "hour": relativedelta(hours=1),
    "day": relativedelta(days=1),
    "week": relativedelta(weeks=1),
    "month": relativedelta(months=1),
    "year": relativedelta(years=1),
}
...
This returns what's expected:
>>> from answer_with_dateutil import parse
>>> from datetime import datetime
>>> datetime(2000, 1, 1) + parse("9 years")
datetime.datetime(2009, 1, 1, 0, 0)
>>> str(_)
'2009-01-01 00:00:00'
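If adding dateutil as a dependency is not an option, calendar-aware addition of whole months and years can be approximated with the standard library. This is a different technique from relativedelta and only a sketch: add_months is a hypothetical helper that handles whole months only and clamps the day to the end of shorter months.

```python
import calendar
from datetime import datetime, timedelta


def add_months(moment: datetime, months: int) -> datetime:
    """Add whole calendar months, clamping the day to the month's length."""
    total = moment.year * 12 + (moment.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


start = datetime(2000, 1, 1)

# Fixed 365-day years drift by one day per leap year crossed.
print(start + 9 * timedelta(days=365))       # 2008-12-29 00:00:00

# Calendar arithmetic lands on the same day of the year.
print(add_months(start, 9 * 12))             # 2009-01-01 00:00:00
print(add_months(datetime(2020, 1, 31), 1))  # 2020-02-29 00:00:00
```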
Also, the use of f-strings and type annotations restricts this to Python 3.6 and up, though this can be changed to use str.format instead for Python 3.5+.
Conclusion
With the currently accepted answer in the running, this is the performance for the more normal examples given in the original question:
Note: for sh, replace ` with \ in the following commands.
$ python -m timeit -s "from answer import examples;examples = examples[:7]" `
-s "from parsedatetime import Calendar; from datetime import datetime" `
-s "parse = Calendar().parseDT; now = datetime.now()" `
"for example in examples:" " parse(example)[0] - now"
1000 loops, best of 5: 232 usec per loop
$ python -m timeit -s "from answer_re import examples;examples = examples[:7]" `
-s "from answer import parse" `
"for example in examples:" " parse(example)"
2000 loops, best of 5: 157 usec per loop
$ python -m timeit -s "from answer_re import examples;examples = examples[:7]" `
-s "from answer_re import parse" `
"for example in examples:" " parse(example)"
10000 loops, best of 5: 39.5 usec per loop
The performance differences are largely negligible for a large variety of use cases.
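The same kind of measurement can also be made from Python instead of the command line. A minimal sketch using timeit.repeat, with a trivial stand-in for parse (an assumption, so that the snippet runs without either answer module); taking the minimum of five repeats corresponds to the "best of 5" figure that python -m timeit reports:

```python
import timeit
from datetime import timedelta

examples = ["1 second", "2 seconds", "3 seconds"]


def parse(text):
    """Stand-in parser for the demo: only understands 'N second(s)'."""
    number, _unit = text.split()
    return timedelta(seconds=float(number))


def run_all():
    """Parse every example once, like the loop in the timeit commands."""
    for example in examples:
        parse(example)


loops = 1000
timings = timeit.repeat(run_all, repeat=5, number=loops)
best_usec = min(timings) / loops * 1e6
print(f"{loops} loops, best of 5: {best_usec:.3g} usec per loop")
```

Swapping the stand-in for one of the real parse functions should give numbers comparable to the ones above, though they will vary by machine.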
Currently, the easiest one to use is the example given in the accepted answer: unless very custom parsing is needed, use parsedatetime.