python re.sub, only replace part of match [duplicate]
I am very new to Python.
I need to match all cases with a single regular expression and do a replacement. Here is a sample substring --> desired result:
<cross_sell id="123" sell_type="456"> --> <cross_sell>
I am trying to do this in my code:
myString = re.sub(r'\<[A-Za-z0-9_]+(\s[A-Za-z0-9_="\s]+)', "", myString)
Instead of removing only what follows <cross_sell, it removes the whole match and returns just '>'.
Is there a way for re.sub to replace only the capturing group instead of the entire pattern?
Solution 1:
You can use substitution groups:
>>> import re
>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(\<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'
Notice that I put the first group (the part you want to keep) in parentheses and kept it in the output by using the \1 backreference (first group) in the replacement string.
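As a minimal sketch, the same idea can be wrapped in a small function. The names strip_attributes and TAG_WITH_ATTRS are hypothetical, not from the question, and the pattern assumes attribute text contains only the characters in the original character class.

```python
import re

# Group 1: the opening "<" plus the tag name (kept).
# Group 2: whitespace followed by the attribute text (dropped).
TAG_WITH_ATTRS = re.compile(r'(<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)')

def strip_attributes(text):
    """Return text with the attribute part of each matching tag removed."""
    # r"\g<1>" is the unambiguous spelling of r"\1" in a replacement string.
    return TAG_WITH_ATTRS.sub(r"\g<1>", text)

print(strip_attributes('<cross_sell id="123" sell_type="456">'))
```

Tags that have no attributes do not match the pattern at all, so they pass through unchanged.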
Solution 2:
You can use a group reference to keep the first word and a negated character class to match the rest of the string between < and >:
>>> import re
>>> s='<cross_sell id="123" sell_type="456">'
>>> re.sub(r'(\w+)[^>]+',r'\1',s)
'<cross_sell>'
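\w is equivalent to [A-Za-z0-9_] for ASCII text; in Python 3 it also matches Unicode word characters unless the re.ASCII flag is set.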
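One caveat worth checking: because [^>]+ needs at least one character, a tag with no attributes makes \w+ backtrack and give up its last letter. The sketch below compares the pattern above with a variant that anchors on the opening < and requires whitespace before the attributes. The variable names are hypothetical, and the variant is only one possible adjustment.

```python
import re

# Pattern from the answer above.
original = re.compile(r'(\w+)[^>]+')

# Variant: require "<" before the name and whitespace before the attributes.
anchored = re.compile(r'(<\w+)\s[^>]*')

with_attrs = '<cross_sell id="123" sell_type="456">'
no_attrs = '<cross_sell>'

print(original.sub(r'\1', with_attrs))
print(original.sub(r'\1', no_attrs))   # loses the final letter of the tag name
print(anchored.sub(r'\1', with_attrs))
print(anchored.sub(r'\1', no_attrs))   # left unchanged
```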
Solution 3:
Since the input data is XML, it is better to parse it with an XML parser.
The built-in xml.etree.ElementTree is one option:
>>> import xml.etree.ElementTree as ET
>>> data = '<cross_sell id="123" sell_type="456"></cross_sell>'
>>> cross_sell = ET.fromstring(data)
>>> cross_sell.attrib = {}
>>> ET.tostring(cross_sell, encoding='unicode')
'<cross_sell />'
lxml.etree is another option.
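If the document has nested elements, the same approach extends to the whole tree. This is a rough sketch using only the standard library; the function name strip_all_attributes and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def strip_all_attributes(xml_text):
    """Parse xml_text, remove every attribute, and return the XML as str."""
    root = ET.fromstring(xml_text)
    # iter() yields the root itself and then all of its descendants.
    for element in root.iter():
        element.attrib.clear()
    # encoding="unicode" makes tostring return str instead of bytes.
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample input, not from the question.
sample = '<offers region="eu"><cross_sell id="123" sell_type="456">x</cross_sell></offers>'
print(strip_all_attributes(sample))
```

Unlike the regex answers, this requires well-formed XML with matching closing tags, so it will raise a ParseError on a bare opening tag.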