Grabbing the href attribute of an A element
Reliable regexes for HTML are difficult to write. Here is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
}
The above would find and output the "outerHTML" of all A elements in the $html string.
To get the combined text content of the node, you do
echo $node->nodeValue;
To check if the href attribute exists you can do
echo $node->hasAttribute( 'href' );
To get the href attribute you'd do
echo $node->getAttribute( 'href' );
To change the href attribute you'd do
$node->setAttribute('href', 'something else');
To remove the href attribute you'd do
$node->removeAttribute('href');
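Putting the calls above together, here is a minimal sketch that collects every href into an array and skips anchors that have none. The sample markup and the $hrefs variable are assumptions for illustration, and the libxml_use_internal_errors() call is an addition to keep parser warnings about imperfect markup quiet.

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$html = '<p><a href="/one">One</a> <a name="top">No link</a> <a href="/two">Two</a></p>';

$dom = new DOMDocument;
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    // Anchors without an href (e.g. named anchors) are skipped.
    if ($node->hasAttribute('href')) {
        $hrefs[] = $node->getAttribute('href');
    }
}

print_r($hrefs);
```

The hasAttribute() check matters because getAttribute() returns an empty string for a missing attribute, which is indistinguishable from href="".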
You can also query for the href attribute directly with XPath
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
echo $href->nodeValue; // echo current attribute value
$href->nodeValue = 'new value'; // set new attribute value
$href->parentNode->removeAttribute('href'); // remove attribute
}
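The XPath approach also lets you filter in the query itself. As a rough sketch, the expression below selects only href attributes whose value starts with "http"; the sample markup and the $external variable are made up for the example.

```php
<?php
// Hypothetical sample markup with an absolute link, a fragment link, and a bare anchor.
$html = '<a href="https://example.com/a">A</a><a href="#top">Top</a><a>None</a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// starts-with() is a standard XPath 1.0 function; anchors without
// an href simply do not match the predicate.
$external = [];
foreach ($xpath->query('//a[starts-with(@href, "http")]/@href') as $href) {
    $external[] = $href->nodeValue;
}

print_r($external);
```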
Also see:
- Best methods to parse HTML
- DOMDocument in php
On a side note: I am sure this is a duplicate and you can find the answer somewhere in here.
I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex you can try this one:
/^<a.*?href=(["\'])(.*?)\1.*$/
This matches <a at the beginning of the string, followed by any number of any characters (non-greedy, .*?), then href= followed by the link surrounded by either " or '.
$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);
Output:
array(3) {
[0]=>
string(37) "<a title="this" href="that">what?</a>"
[1]=>
string(1) """
[2]=>
string(4) "that"
}
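Note that the ^ anchor means the pattern only matches when the string starts with the tag. The following sketch, using made-up input strings, shows the same pattern failing as soon as any text precedes the anchor.

```php
<?php
$pattern = '/^<a.*?href=(["\'])(.*?)\1.*$/';

// The string starts with the anchor tag, so the pattern matches.
$atStart = preg_match($pattern, '<a title="this" href="that">what?</a>', $m1);

// Text precedes the tag, so ^ cannot match and preg_match() returns 0.
$withPrefix = preg_match($pattern, 'See <a href="that">what?</a>', $m2);

var_dump($atStart, $withPrefix);
```

If the tag can appear anywhere in the input, dropping the ^ (and the trailing .*$) is one option, at the cost of an even looser match.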
The pattern you want to look for would be the link anchor pattern, like (something):
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
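One caveat with this pattern: (.*) is greedy, so with more than one link in the string the first group can swallow everything up to the last link. The sketch below compares it with a non-greedy variant; the variant and the sample string are my own assumptions, not part of the answer above.

```php
<?php
// Hypothetical input with two links in one string.
$str = '<a href="one">1</a> <a href="two">2</a>';

// Greedy: group 1 extends to the last '">' that still allows a match.
preg_match('/<a href="(.*)">(.*)<\/a>/', $str, $greedy);

// Non-greedy variant: each group stops at the first possible position.
preg_match('/<a href="(.*?)">(.*?)<\/a>/', $str, $lazy);

var_dump($greedy[1], $lazy[1]);
```

Both versions still require href to be the first attribute, directly after "<a ".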
Why don't you just match
"<a.*?href\s*=\s*['"](.*?)['"]"
<?php
$str = '<a title="this" href="that">what?</a>';
$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);
var_dump($res);
?>
then
$ php test.php
array(2) {
[0]=>
array(1) {
[0]=>
string(27) "<a title="this" href="that""
}
[1]=>
array(1) {
[0]=>
string(4) "that"
}
}
which works. I've just removed the first capture group.
For anyone who still hasn't found a solution, here is a quick and easy one using SimpleXML:
$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com
It's working for me.
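Keep in mind that SimpleXML expects well-formed XML, so typical HTML (an unclosed <br>, for instance) will not parse. Here is a small sketch of a guarded version; hrefOf() is a hypothetical helper name, and it uses simplexml_load_string(), which returns false on failure instead of throwing.

```php
<?php
// Collect libxml parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the href of a single-element snippet,
// or null if the snippet is not well-formed XML or has no href.
function hrefOf(string $markup): ?string
{
    $el = simplexml_load_string($markup);
    if ($el === false || !isset($el['href'])) {
        return null;
    }
    return (string) $el['href'];
}

var_dump(hrefOf('<a href="www.something.com">Click here</a>'));
var_dump(hrefOf('<a href="x">Click<br></a>')); // unclosed <br> is not valid XML
```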