Regex to get src value from an img tag
I am using the following regex to get the src value of the first img tag in an HTML document.
string match = "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?"
Right now it captures the entire src attribute, which I don't need. I only need the URL inside the src attribute. How can I do that?
Solution 1:
Parse your HTML with something else. HTML is not a regular language, so regular expressions are poorly suited to parsing it.
Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:
//img/@src
XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.
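As a rough illustration of the XPath approach, here is a minimal sketch using XmlDocument from System.Xml. It assumes the input is well-formed XML (strict markup with every tag closed, no HTML-only entities such as &nbsp;, and no default namespace on the root element); real-world HTML will usually fail to load this way. The class and method names (ImgSrcXPath, FirstImgSrc) are made up for the example.

```csharp
using System;
using System.Xml;

public static class ImgSrcXPath
{
    // Returns the src value of the first <img> in a well-formed XML/XHTML string,
    // or null if no <img> element with a src attribute exists.
    public static string FirstImgSrc(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // Same XPath as above; SelectSingleNode returns the first matching node.
        var srcAttribute = doc.SelectSingleNode("//img/@src");
        return srcAttribute?.Value;
    }
}
```

Calling ImgSrcXPath.FirstImgSrc on a string such as `<html><body><img src="a.png" /></body></html>` should yield only the URL, without the attribute name or quotes.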
Solution 2:
See "When not to use Regex in C# (or Java, C++ etc)" and "Looking for C# HTML parser".
PS, how can I put a link to a StackOverflow question in a comment?
Solution 3:
In plain English, your regex should match, inside the src attribute of a tag, every character after the opening quote that is not itself a quote.
In Perl regex syntax, it would look like this:
/src=[\"\']([^\"\']+)/
The URL will be in $1 after running this.
Of course, this assumes that the URLs in your src attributes are quoted. If they are not, you can modify the character classes in the [] brackets accordingly.
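Since the question is about C#, here is a hedged sketch of the same pattern ported to .NET's Regex class. The named group imgSrc (borrowed from the question) plays the role of Perl's $1; the class and method names are invented for the example. Like the Perl version, it assumes quoted attribute values and is not restricted to img tags, so it would also match a src on a script or iframe tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcRegex
{
    // Port of the Perl pattern: src=, an opening quote, then one or more
    // non-quote characters captured in the named group imgSrc.
    private static readonly Regex SrcPattern =
        new Regex(@"src=[""'](?<imgSrc>[^""']+)", RegexOptions.IgnoreCase);

    // Returns the first captured URL, or null if nothing matches.
    public static string FirstSrc(string html)
    {
        Match m = SrcPattern.Match(html);
        // m.Value is the whole match (src= plus the opening quote and URL);
        // the group holds only the URL.
        return m.Success ? m.Groups["imgSrc"].Value : null;
    }
}
```

The key point for the original problem is to read m.Groups["imgSrc"].Value rather than m.Value, because the latter always contains the full matched text.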