How to extract a slug from a URL with a regular expression in Python?
Solution 1:
Use a capturing group: put parentheses (...) around the part of the regex that you want to capture. You can then get the contents of a capturing group by passing its number as an argument to m.group():
>>> import re
>>> url = 'http://www.example.com/this-2-me-4/123456-subj'
>>> m = re.search(r'/([0-9]+)-', url)
>>> m.group(1)
'123456'
From the docs:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].
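One thing the snippet above does not cover is a URL with no match: re.search() returns None in that case, so calling .group() on the result raises AttributeError. Below is a minimal sketch of a guarded version. The helper name extract_id, the named group id, and the second URL are made up for illustration and are not part of the original answer.

```python
import re

# First URL is the one used elsewhere in this thread; the second is a
# made-up example that contains no "/<digits>-" segment.
url = 'http://www.example.com/this-2-me-4/123456-subj'
no_id_url = 'http://www.example.com/about'

# (?P<id>...) is a named capturing group; it can be read by name or by number.
pattern = re.compile(r'/(?P<id>[0-9]+)-')

def extract_id(u):
    """Return the captured digits as a string, or None if nothing matches."""
    m = pattern.search(u)
    return m.group('id') if m else None

print(extract_id(url))        # 123456
print(extract_id(no_id_url))  # None
```

Naming the group is optional; m.group(1) would return the same string here.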
Solution 2:
You may want to use urllib.parse combined with a capturing group for slightly cleaner code. Parsing the URL first means the regex only runs against the path, not the host or query string.
import re
import urllib.parse

url = 'http://www.example.com/this-2-me-4/123456-subj'
# Match only against the path component of the URL.
path = urllib.parse.urlparse(url).path
slug = re.search(r'/(\d+)-', path).group(1)
print(slug)
Result:
123456
In Python 2, use the urlparse module instead of urllib.parse.
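As a rough sketch of how Solution 2 could be wrapped into a reusable function that runs on both Python versions: the name slug_from_url and the second test URL are hypothetical, and returning None for a missing ID is just one possible design choice.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def slug_from_url(url):
    """Return the digits after a '/' and before a '-' in the URL path, or None."""
    path = urlparse(url).path
    m = re.search(r'/(\d+)-', path)
    return m.group(1) if m else None

print(slug_from_url('http://www.example.com/this-2-me-4/123456-subj'))  # 123456
# Made-up URL: the only "/<digits>-" sequence is in the query string,
# so searching the path alone finds nothing.
print(slug_from_url('http://www.example.com/page?next=/42-foo'))        # None
```

The second call shows what parsing buys you: running the same regex over the whole URL string would pick up 42 from the query string.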