How to delete the words between two delimiters?
Use regular expressions:
>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '
[Update]
If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you know it does not work.
>>> re.sub(r'<.+>', '', s)
''
Why? It happens because regular expression quantifiers are "greedy" by default. The .+ matches as much as it can, so the match runs from the first < to the last > in the string, swallowing every > in between, and this is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means "any character but x" (x being >).
Placed after a quantifier, the ? operator makes the match "non-greedy", so this has the same effect:
>>> re.sub(r'<.+?>', '', s)
'something something '
The first pattern is more explicit, this one is less typing; be aware that on its own (not after a quantifier), x? means zero or one occurrence of x.
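To see the three patterns side by side, here is a minimal sketch that lists what each one matches instead of replacing it. The variable names and the `multiline` sample string are made up for illustration. It also shows one difference between the two working patterns: by default the dot does not match a newline, while `[^>]` does.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: .+ runs to the LAST '>', so the single match is the whole string.
greedy = re.findall(r'<.+>', s)

# Non-greedy: .+? stops at the first '>' it can reach.
lazy = re.findall(r'<.+?>', s)

# Negated class: [^>]+ can never cross a '>'.
negated = re.findall(r'<[^>]+>', s)

print(greedy)   # one match covering all of s
print(lazy)     # two matches, one per delimited chunk
print(negated)  # same two matches

# Illustrative sample (not from the question): a delimited chunk with a newline.
multiline = '<a\nb>x'
print(repr(re.sub(r'<.+?>', '', multiline)))    # unchanged, '.' skips '\n'
print(repr(re.sub(r'<[^>]+>', '', multiline)))  # 'x'
```

If the delimited text can span lines, the negated class handles it as is; the non-greedy form would need the `re.DOTALL` flag.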
Of course, you can use regular expressions.
import re
s = '...'  # your string here
t = re.sub('<.*?>', '', s)
The above code should do it.
First, thank you Paulo Scardine: I used your regular expression to do something great. The idea was to have a tag-free LibreOffice po file for printing purposes, so I made the following script, which cleans the help file into a smaller, easier-to-read one.
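import re
# Read the whole input file.
with open('a.csv') as f:
    text = f.read()
# Replace every <...> tag with a single space.
clean = re.sub('<[^>]+>', ' ', text)
# Write the cleaned text to the output file.
with open('b.csv', 'w') as f:
    f.write(clean)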
>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '
The re.sub function takes a regular expression and replaces every match in the string with the second parameter. In this case, we are searching for all characters between < and > ('<.*?>') and replacing them with nothing ('').
The ? after the * is what re uses to make the search non-greedy.
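As a small sketch of the same call with its other knobs: `re.sub` also accepts a `count` argument, and a pattern compiled with `re.compile` exposes the same `sub` method. The names `tag`, `all_removed`, `first_only` and `marked` are just illustrative.

```python
import re

my_str = '<@ """@$ FSDF >something something <more noise>'

# Compile once if the same pattern will be applied to many strings.
tag = re.compile(r'<.*?>')

all_removed = tag.sub('', my_str)          # replace every match
first_only = tag.sub('', my_str, count=1)  # replace only the first match
marked = tag.sub('[noise]', my_str)        # replace with a visible marker

print(repr(all_removed))  # 'something something '
print(repr(first_only))   # 'something something <more noise>'
print(repr(marked))       # '[noise]something something [noise]'
```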
More about the re module.
If that "noise" is actually HTML tags, I suggest you look into BeautifulSoup.
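BeautifulSoup is a third-party package, so to keep things runnable without installing anything, the rough sketch below uses the standard library's html.parser as a stand-in; it is not BeautifulSoup's API. The `TextOnly` class and `strip_tags` function are hypothetical names. The point is the same either way: a real parser understands things a regex does not, such as a > inside a quoted attribute.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps the text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(strip_tags('<p>Hello <b>world</b></p>'))
print(strip_tags('<a title="a > b">link</a>'))
```

On the second input, the '<[^>]+>' regex would stop at the > inside the attribute and leave part of the tag behind.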
Just for interest, you could write some code such as:
# Write a small sample file to work on.
with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")
def filter_line(line):
    count=0
    ignore=False
    result=[]
    for c in line:
        # A ">" closes the current delimited section.
        if c==">" and count==1:
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        # A "<" is kept, then everything up to the next ">" is skipped.
        if c=="<" and count==0:
            ignore=True
            count=1
    return "".join(result)
with open('blah.txt') as f:
    print("".join(map(filter_line,f.readlines())))
>>>
<>one<>asfd<>
<>two<><>three<>