Text processing - Python vs Perl performance [closed]
Solution 1:
This is exactly the sort of stuff that Perl was designed to do, so it doesn't surprise me that it's faster.
One easy optimization in your Python code would be to precompile those regexes, so they aren't getting recompiled each time.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')
And then in your loop:
mprev = exists_re.search(currline)
and
mcurr = location_re.search(currline)
That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re.search() with a pattern string inside a loop, without compiling it first, is bad practice in Python.
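For reference, a minimal sketch of what the loop might look like with the compiled patterns; the original loop body isn't shown in the question, so currline, the fileinput-based iteration, and the handling inside the if blocks are assumptions:

import fileinput
import re

# Compile once, outside the loop, using the patterns from above.
exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

for currline in fileinput.input():
    # Reuse the compiled objects on every line instead of passing a pattern
    # string to re.search(), which at minimum pays a cache lookup per call.
    mprev = exists_re.search(currline)
    if mprev:
        pass  # handle "record already exists" lines (original handling not shown)
    mcurr = location_re.search(currline)
    if mcurr:
        pass  # handle "AwbLocation ... insert into" lines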
Solution 2:
Hypothesis: Perl spends less time backtracking in lines that don't match due to optimisations it has that Python doesn't.
What do you get by replacing
^(.*?) INFO.*Such a record already exists
with
^((?:(?! INFO).)*?) INFO.*Such a record already exists
or
^(?>(.*?) INFO).*Such a record already exists
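If you want to measure the difference directly, a rough timeit sketch along these lines would do; the sample line below is made up, and note that the atomic-group form (?>...) is only accepted by Python's re module from Python 3.11 onward:

import re
import timeit

# Hypothetical long line that does NOT match, where backtracking cost would show up.
line = '2011-11-11 11:11:11 DEBUG ' + 'x' * 200

patterns = [
    r'^(.*?) INFO.*Such a record already exists',               # original
    r'^((?:(?! INFO).)*?) INFO.*Such a record already exists',  # tempered dot
    # r'^(?>(.*?) INFO).*Such a record already exists',         # atomic group, Python 3.11+
]

for p in patterns:
    rx = re.compile(p)
    t = timeit.timeit(lambda: rx.search(line), number=100_000)
    print(f'{t:.3f}s  {p}')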
Solution 3:
Function calls are relatively expensive in Python, and yet you have a loop-invariant function call to get the file name inside the loop:
fn = fileinput.filename()
Move this line above the for loop and you should see some improvement to your Python timing. Probably not enough to beat out Perl, though.
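If fileinput is iterating over more than one file, a slight variation is needed, since fileinput.filename() returns None until the first line has been read and can change when a new file starts. A minimal sketch, assuming the loop is driven by fileinput.input(), is to refresh the name only on the first line of each file instead of on every line:

import fileinput

fn = None
for currline in fileinput.input():
    # fileinput.filename() only has a value once a line has been read;
    # refresh it when a new file starts rather than on every iteration.
    if fileinput.isfirstline():
        fn = fileinput.filename()
    # ... rest of the loop uses fn as before ...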