Regex to find a specific word and remove the complete tag attribute from HTML
I want to capture javascript:*** in my HTML text, and when the regex finds it, I want the whole attribute it belongs to removed as well. I have created a regex, but it does not cover all cases. Below are my regex and the cases it does not cover.
regEx:
/(?<=<[_a-zA-Z][^<]*?)\s+href="javascript:[^"]*"/
Cases that are not covered:
href='javascript:
href = "javascript:
href=" javascript:
href=
"javascript:
So I want something that finds javascript: and removes its complete attribute.
Solution 1:
This is a possible solution for simple JavaScript injections.
Hackers can break out of this; a regular expression alone is not enough.
See this regular expression:
(?<=<[_a-zA-Z][^<]*?)\s+href\s*=\s*("?)\s*javascript:(\\?[^"])*[\s<>]?\1?
- HTML attributes are allowed with double quotes ("), and they also work with no quotes (as long as the URL does not have a space in it). The expression handles both.
- Attributes in single quotes (') are valid HTML as well. You should run the regular expression twice, the second time with ' instead of ".
- There is a general problem with the quotes, as the JavaScript could have escaped quotes in it to circumvent your protection. See further below.
- \s* means "zero or more white-space characters (spaces, newlines, etc.)", so optional white-space is ignored.
- The lookbehind keeps the first part of the tag out of the match, so replacing the match removes href completely.
- The regular expression will preserve other attributes; see the demo link below.
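The two-pass idea from the list above could look roughly like this in JavaScript. This is only a sketch: stripJavascriptHref is a hypothetical helper name, the second pattern is the same expression with ' swapped in for ", and variable-length lookbehind needs a reasonably recent engine. It inherits all the weaknesses described further below.

```javascript
// Pattern from this answer: double-quoted or unquoted href values.
const DOUBLE_QUOTED =
  /(?<=<[_a-zA-Z][^<]*?)\s+href\s*=\s*("?)\s*javascript:(\\?[^"])*[\s<>]?\1?/g;

// Same pattern with ' in place of ", used for the second pass.
const SINGLE_QUOTED =
  /(?<=<[_a-zA-Z][^<]*?)\s+href\s*=\s*('?)\s*javascript:(\\?[^'])*[\s<>]?\1?/g;

// Hypothetical helper: runs both passes and returns the cleaned markup.
function stripJavascriptHref(html) {
  return html.replace(DOUBLE_QUOTED, "").replace(SINGLE_QUOTED, "");
}

const doubleQuoted =
  '<a title="Click me!" href="javascript:console.log(document.cookie)" target="_top"></a>';
const singleQuoted =
  "<a title=\"Click me!\" href='javascript:console.log(document.cookie)' target=\"_top\"></a>";

console.log(stripJavascriptHref(doubleQuoted)); // <a title="Click me!" target="_top"></a>
console.log(stripJavascriptHref(singleQuoted)); // <a title="Click me!" target="_top"></a>
```

Text outside a tag is left alone, because the lookbehind requires an opening < before the attribute.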
Demo
You can see the regular expression in action here:
https://regex101.com/r/1SPJKy/5
It works with the following test cases:
<a title="Click me!" href="javascript:console.log(document.cookie)" target="_top"></a>
<a title="Click me!" href = "javascript:console.log(document.cookie)" target="_top"></a>
<a title="Click me!" href=" javascript:console.log('' + document.cookie)" target="_top"></a>
<a title="Click me!" href='javascript:console.log(document.cookie)' target="_top"></a>
<a title="Click me!" href=javascript:console.log(document.cookie) target="_top"></a>
<a title="Click me!"
href=
"javascript:console.log(document.cookie+\"\");"
target="_top">
</a>
Just text: href="javascript:console.log(document.cookie)"
The HTML is replaced with this (using ", so don't forget to run it with ' once more):
<a title="Click me!" target="_top"></a>
<a title="Click me!" target="_top"></a>
<a title="Click me!" target="_top"></a>
<a title="Click me!" href='javascript:console.log(document.cookie)' target="_top"></a>
<a title="Click me!" "_top"></a>
<a title="Click me!" \");"
target="_top">
</a>
Just text: href="javascript:console.log(document.cookie)"
But please don't...
- In cases where the JavaScript gets a little clever using '' +, "" +, \" etc., the regular expression will break.
- The output will not be valid HTML anymore, but at least the javascript: part is removed.
- You must ensure the HTML is normalized anyway, so all attributes are wrapped in " delimiters. There is an HTML Tidy plugin for CKEditor, I am sure.
- Expressions like [^"] or your lookbehind assertion will only look so far. Hackers can just make the HTML several kilobytes or megabytes long, which makes the browser's regex engine give up "early". Otherwise the browser would risk an infinite loop or a lockup of the UI, so it is understandable that there is a limit to what it can find.
- What about events like onclick?
You should check CKEditor's best-practices page:
https://ckeditor.com/docs/ckeditor4/latest/guide/dev_best_practices.html#security
Access the DOM instead
That being said, HTML needs to be parsed. Use it as a DOM, not text.
AFAIK CKEditor makes its HTML elements accessible as DOM elements. For example, you can select these elements with this JavaScript:
document.querySelectorAll('[href*=javascript]')
This will get those elements, no matter how the attribute is written. It will not work with broken HTML, if the browser cannot correct it. But the regular expression has that same problem and won't lead to a valid href anyway.
Maybe give that a try instead.
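As a rough sketch of the DOM approach, the attribute could be removed like this. The helper names isJavascriptUrl and removeJavascriptHrefs are hypothetical, and the normalization step is an assumption about how browsers clean up URLs before reading the scheme. Note that it selects every [href] and checks the value in code instead of using the [href*=javascript] selector, so values such as " JavaScript:..." that the selector may miss are also checked. It does not touch event attributes like onclick and is not a complete sanitizer.

```javascript
// Returns true when an href value looks like a javascript: URL.
// Assumption: tabs and newlines are dropped and leading control characters
// or spaces are trimmed before comparing, roughly as browsers do.
function isJavascriptUrl(value) {
  const normalized = value
    .replace(/[\t\n\r]/g, "")
    .replace(/^[\u0000-\u0020]+/, "")
    .toLowerCase();
  return normalized.startsWith("javascript:");
}

// Hypothetical helper: removes every javascript: href below `root`
// and returns how many attributes were removed.
function removeJavascriptHrefs(root) {
  let removed = 0;
  for (const el of root.querySelectorAll("[href]")) {
    if (isJavascriptUrl(el.getAttribute("href") || "")) {
      el.removeAttribute("href");
      removed += 1;
    }
  }
  return removed;
}

// In a browser this could be run on the editor's content element;
// the guard keeps the snippet loadable where no DOM exists.
if (typeof document !== "undefined") {
  removeJavascriptHrefs(document.body);
}
```

Because the browser has already parsed the markup, quoting style and white-space around = no longer matter here.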