I have some HTML that I am trying to extract links from. Right now the file looks like this.

website.com/path/to/file/234432517.gif" width="620">
website.com/path/to/file/143743e53.gif" width="620">
website.com/path/to/file/123473232.gif" width="620">
website.com/path/to/file/634132317.gif" width="620">
website.com/path/to/file/432432173.gif" width="620">

I am trying to use sed to remove the " width="620"> from all the lines. Here is my sed code:

sudo sed -i "s/\"\swidth\=\"\d+\"\>//g" output

Why is this not working? Everything I google leads to code that looks like this, but it does not work for some reason.


Because you are using PCRE (Perl Compatible Regular Expressions) syntax, which sed doesn't understand; it uses Basic Regular Expressions (BRE) by default. It doesn't know \d at all (GNU sed does accept \s as an extension, but that is not portable). You are also escaping things that don't need to be escaped (the \= does nothing useful, and in GNU sed \> is an end-of-word anchor rather than a literal >) while not escaping things that do need to be escaped (+ just means the literal symbol + in BRE; you need \+ for "one or more", and even that is a GNU extension).

This should do what you need:

sed 's/" width="[0-9]\+">//g' file

Or, using Extended Regular Expressions:

sed -E 's/"\s*width="[0-9]+">//g' file
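To see how the two flavors compare, here is a minimal sketch that runs both forms against a throwaway file. The sample file and the variable names are made up for illustration, and the third command is an extra portable spelling ([0-9][0-9]* instead of \+) for seds that lack the GNU extension.

```shell
# Hypothetical sample file, modeled on the lines in the question.
sample=$(mktemp)
printf '%s\n' \
  'website.com/path/to/file/234432517.gif" width="620">' \
  'website.com/path/to/file/143743e53.gif" width="620">' > "$sample"

# BRE with the GNU \+ extension for "one or more".
bre_out=$(sed 's/" width="[0-9]\+">//g' "$sample")

# Portable BRE: "one or more digits" spelled as [0-9][0-9]*.
posix_out=$(sed 's/" width="[0-9][0-9]*">//g' "$sample")

# ERE: + needs no backslash here.
ere_out=$(sed -E 's/" width="[0-9]+">//g' "$sample")

# Print one of the results; all three should be identical.
printf '%s\n' "$ere_out"

rm -f "$sample"
```

If the three variables differ on your system, the sed in use probably does not support one of the extensions.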

Finally, as a general rule, you should never use sed -i without first running the command without -i to make sure it works. If you do use it, at least use -i.bak (any suffix given to -i will do this) to create a backup.
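That workflow might look roughly like this. It is only a sketch: the temporary directory and the one-line file are stand-ins so that nothing real gets modified.

```shell
# Hypothetical working copy so the real file is never touched.
workdir=$(mktemp -d)
printf '%s\n' 'website.com/path/to/file/234432517.gif" width="620">' > "$workdir/output"

# 1. Dry run: without -i, sed prints the result and leaves the file alone.
sed 's/" width="[0-9][0-9]*">//' "$workdir/output"

# 2. Once the output looks right, edit in place and keep the original
#    as output.bak.
sed -i.bak 's/" width="[0-9][0-9]*">//' "$workdir/output"

# Capture both files to confirm the edit and the backup.
edited=$(cat "$workdir/output")
backup=$(cat "$workdir/output.bak")

rm -rf "$workdir"
```

If the edit turns out to be wrong, copying output.bak back over output restores the original.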


Here is my sed solution:

sed -E 's/(.*)" width="[0-9]+">/\1/' filename

As an alternative to sed, I suggest using grep to extract the data from the file. This would work for you:

grep -o "website.*\.gif" filename

And as terdon suggested, here is a lookahead solution using grep (the -P option needs a grep built with PCRE support, such as GNU grep):

grep -Po '.*(?="\swidth="\d*">)' filename

cut is also a good option in your situation, since everything you want comes before the first double quote:

cut -f1 -d'"' filename
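As a quick sanity check, the sketch below runs the sed, grep -o, and cut variants on a small made-up sample and compares the results. The file contents and variable names are assumptions for illustration; the grep -P variant is left out because not every grep supports it.

```shell
# Hypothetical sample file, modeled on the lines in the question.
sample=$(mktemp)
printf '%s\n' \
  'website.com/path/to/file/234432517.gif" width="620">' \
  'website.com/path/to/file/432432173.gif" width="620">' > "$sample"

# Keep everything before the attribute via a capture group.
sed_out=$(sed -E 's/(.*)" width="[0-9]+">/\1/' "$sample")

# Print only the part of each line that matches.
grep_out=$(grep -o 'website.*\.gif' "$sample")

# Take the first field, using the double quote as the delimiter.
cut_out=$(cut -f1 -d'"' "$sample")

# Report whether the three approaches produced the same text.
if [ "$sed_out" = "$grep_out" ] && [ "$grep_out" = "$cut_out" ]; then
  echo "all three agree"
else
  echo "outputs differ"
fi

rm -f "$sample"
```

On input shaped like the question's, the three should agree. They can diverge on messier lines: for example, cut stops at the first double quote wherever it appears, while the grep pattern only matches lines containing "website" and ".gif".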