Regex to match URL
$search = "#^((?#
the scheme:
)(?:https?://)(?#
second level domains and beyond:
)(?:[\S]+\.)+((?#
top level domains:
)MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
)COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
)A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
)C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
)E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
)H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
)K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
)N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
)S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
)U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
the path, can be there or not:
)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";
Just cleaned up a bit. This will match only HTTP(s) addresses, and, as long as you copied all top level domains correctly from IANA, only those standardized (it will not match http://localhost
) and with the http://
declared.
Finally you should end with the path part, that will always start with a /, if it is there.
However, I'd suggest to follow Cerebrus: If you're not sure about this, learn regexps in a more gentle way and use proven patterns for complicated tasks.
Cheers,
By the way: Your regexp will also match something.r
and something.h
(between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.
On re-reading the question: Change
)(?:https?://)(?#
to
)(?:https?://)?(?#
(there is a ?
extra) to match 'URLs' without the scheme.
Not exactly what the OP asked for but this is a much simpler regular expression that does not need to be updated each time the IANA introduces a new TLD. I believe this is more adequate for most simple needs:
^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$
no list of TLD, localhost is not matched, the number of subparts must be >= 2 and the length of each subpart must be >= 2 (fx: "a.a" will not match but "a.ab" will match).
This question was surprisingly difficult to find an answer for. The regexes I found were too complicated to understand, and anything more that a regex is overkill and too difficult to implement.
Finally came up with:
/(\S+\.(com|net|org|edu|gov)(\/\S+)?)/
Works with http://example.com
, https://example.com
, example.com
, http://example.com/foo
.
Explanation:
- Looks for .com, etc.
- Matches everything before it up to the space
- Matches everything after it up to the space
This will get any url in its entirety, including ?= and #/ if they exist:
/[A-Za-z]+:\/\/[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_:%&;\?\#\/.=]+/g
Using a single regexp to match an URL string makes the code incredible unreadable. I'd suggest to use parse_url to split the URL into its components (which is not a trivial task), and check each part with a regexp.