How to use sed to extract substring

I have a file containing the following lines:

  <parameter name="PortMappingEnabled" access="readWrite" type="xsd:boolean"></parameter>
  <parameter name="PortMappingLeaseDuration" access="readWrite" activeNotify="canDeny" type="xsd:unsignedInt"></parameter>
  <parameter name="RemoteHost" access="readWrite"></parameter>
  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="ExternalPortEndRange" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="InternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="PortMappingProtocol" access="readWrite"></parameter>
  <parameter name="InternalClient" access="readWrite"></parameter>
  <parameter name="PortMappingDescription" access="readWrite"></parameter>

I want to execute command on this file to extract only the parameter names as displayed in the following output:

$sedcommand file.txt
PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription

What could be this command?


Solution 1:

grep was born to extract things:

grep -Po 'name="\K[^"]*'

test with your data:

kent$  echo '<parameter name="PortMappingEnabled" access="readWrite" type="xsd:boolean"></parameter>
  <parameter name="PortMappingLeaseDuration" access="readWrite" activeNotify="canDeny" type="xsd:unsignedInt"></parameter>
  <parameter name="RemoteHost" access="readWrite"></parameter>
  <parameter name="ExternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="ExternalPortEndRange" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="InternalPort" access="readWrite" type="xsd:unsignedInt"></parameter>
  <parameter name="PortMappingProtocol" access="readWrite"></parameter>
  <parameter name="InternalClient" access="readWrite"></parameter>
  <parameter name="PortMappingDescription" access="readWrite"></parameter>
'|grep -Po 'name="\K[^"]*'
PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription

Solution 2:

sed 's/[^"]*"\([^"]*\).*/\1/'

does the job.

explanation of the part inside ' '

  • s - tells sed to substitute
  • / - start of regex string to search for
  • [^"]* - any character that is not ", any number of times. (matching parameter name=)
  • " - just a ".
  • ([^"]*) - anything inside () will be saved for reference to use later. The \ are there so the brackets are not considered as characters to search for. [^"]* means the same as above. (matching RemoteHost for example)
  • .* - any character, any number of times. (matching " access="readWrite"> /parameter)
  • / - end of the search regex, and start of the substitute string.
  • \1 - reference to that string we found in the brackets above.
  • / end of the substitute string.

basically s/search for this/replace with this/ but we're telling him to replace the whole line with just a piece of it we found earlier.

Solution 3:

You want awk.

This would be a quick and dirty hack:

awk -F "\"" '{print $2}' /tmp/file.txt

PortMappingEnabled
PortMappingLeaseDuration
RemoteHost
ExternalPort
ExternalPortEndRange
InternalPort
PortMappingProtocol
InternalClient
PortMappingDescription

Solution 4:

You should not parse XML using tools like sed, or awk. It's error-prone.

If input changes, and before name parameter you will get new-line character instead of space it will fail some day producing unexpected results.

If you are really sure, that your input will be always formated this way, you can use cut. It's faster than sed and awk:

cut -d'"' -f2 < input.txt

It will be better to first parse it, and extract only parameter name attribute:

xpath -q -e //@name input.txt | cut -d'"' -f2

To learn more about xpath, see this tutorial: http://www.w3schools.com/xpath/

Solution 5:

Explaining how you can use cut:

cat yourxmlfile | cut -d'"' -f2

It will 'cut' all the lines in the file based on " delimiter, and will take the 2nd field , which is what you wanted.