Python re.sub: replace only part of a match

I am very new to Python.

I need to match all cases with a single regular expression and perform a replacement. Here is a sample substring and the desired result:

<cross_sell id="123" sell_type="456"> --> <cross_sell>

I am trying to do this in my code:

myString = re.sub(r'\<[A-Za-z0-9_]+(\s[A-Za-z0-9_="\s]+)', "", myString)

Instead of removing only the attributes after <cross_sell, it removes the entire match and leaves just '>'.

Is there a way for re.sub to replace only the capturing group instead of the entire match?


Solution 1:

You can capture the part you want to keep and refer to it in the replacement string:

>>> import re
>>> my_string = '<cross_sell id="123" sell_type="456"> --> <cross_sell>'
>>> re.sub(r'(\<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)', r"\1", my_string)
'<cross_sell> --> <cross_sell>'

Notice that I put the first group (the part you want to keep) in parentheses and kept it in the output by using the backreference "\1" (first group) in the replacement string. The second group is matched but not referenced, so it is dropped.
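
If you would rather say "drop this group" than "keep that group", re.sub also accepts a function as the replacement. The following is only a sketch of that idea; the helper name remove_group and its group parameter are made up for illustration, and the pattern is the one from this answer:

```python
import re

def remove_group(match, group=2):
    """Return the matched text with one capturing group cut out."""
    whole_start = match.start()
    g_start, g_end = match.span(group)
    text = match.group(0)
    # Keep everything in the match except the chosen group.
    return text[:g_start - whole_start] + text[g_end - whole_start:]

pattern = re.compile(r'(<[A-Za-z0-9_]+)(\s[A-Za-z0-9_="\s]+)')
my_string = '<cross_sell id="123" sell_type="456">'
result = pattern.sub(remove_group, my_string)
print(result)
```

For a simple case like this one the "\1" backreference is shorter; the function form may be handier when the replacement depends on what was matched.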

Solution 2:

You can use a capturing group to match the tag name and a negated character class to match the rest of the tag up to the closing >:

>>> import re
>>> s='<cross_sell id="123" sell_type="456">'
>>> re.sub(r'(\w+)[^>]+',r'\1',s)
'<cross_sell>'

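\w matches [A-Za-z0-9_]. For str patterns in Python 3 it also matches other Unicode word characters unless the re.ASCII flag is set.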
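
Note that the pattern above is not anchored to the angle brackets, so on longer input it would also consume ordinary words that precede a tag. A possible variant, assuming attribute values never contain ">", is to match the brackets explicitly. The names TAG_WITH_ATTRS and strip_attributes are illustrative only:

```python
import re

# Anchor on '<' and '>' so plain words outside tags are left untouched.
# re.ASCII restricts \w to [A-Za-z0-9_].
TAG_WITH_ATTRS = re.compile(r'<(\w+)\s[^>]*>', flags=re.ASCII)

def strip_attributes(text):
    """Replace '<name attr=...>' with '<name>' everywhere in text."""
    return TAG_WITH_ATTRS.sub(r'<\1>', text)

sample = 'see also <cross_sell id="123" sell_type="456"> and <other>'
stripped = strip_attributes(sample)
print(stripped)
```

This sketch also turns a self-closing tag with attributes into a plain opening tag, so it is only suitable for input shaped like the sample in the question.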

Solution 3:

Since the input data is XML, it is better to parse it with an XML parser.

The built-in xml.etree.ElementTree module is one option:

>>> import xml.etree.ElementTree as ET
>>> data = '<cross_sell id="123" sell_type="456"></cross_sell>'
>>> cross_sell = ET.fromstring(data)
>>> cross_sell.attrib = {}
>>> ET.tostring(cross_sell, encoding="unicode")
'<cross_sell />'
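
The same approach can be extended to nested elements. The sketch below assumes the whole document fits in one string and that every attribute should go; the function name strip_all_attributes and the sample data with a nested item element are invented for the example:

```python
import xml.etree.ElementTree as ET

def strip_all_attributes(xml_text):
    """Parse xml_text and return it with every attribute removed."""
    root = ET.fromstring(xml_text)
    # iter() walks the root element and all of its descendants.
    for element in root.iter():
        element.attrib.clear()
    # encoding="unicode" makes tostring return str instead of bytes.
    return ET.tostring(root, encoding="unicode")

data = '<cross_sell id="123" sell_type="456"><item sku="9">x</item></cross_sell>'
cleaned = strip_all_attributes(data)
print(cleaned)
```

Unlike the regex answers, this requires well-formed XML with matching closing tags, so it will not work on a bare opening tag such as the one in the question.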

lxml.etree is another option.