Python: Get URL path sections
How do I get specific path sections from a URL? For example, I want a function that operates on this:
http://www.mydomain.com/hithere?image=2934
and returns "hithere"
or operates on this:
http://www.mydomain.com/hithere/something/else
and returns the same thing ("hithere")
I know this will probably use urllib or urllib2 but I can't figure out from the docs how to get only a section of the path.
Solution 1:
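Extract the path component of the URL with urlparse (Python 3 shown here; in Python 2 the function lives in the urlparse module instead of urllib.parse):
>>> from urllib.parse import urlparse
>>> path = urlparse('http://www.example.com/hithere/something/else').path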
>>> path
'/hithere/something/else'
Split the path into components with os.path.split:
>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')
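The dirname and basename functions give you the two pieces of the split. To reach the first section, apply dirname in a while loop until the parent is the root (the extra '' and path checks stop the loop for an empty path or when dirname makes no further progress):
>>> while os.path.dirname(path) not in ('/', '', path):
...     path = os.path.dirname(path)
...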
>>> path
'/hithere'
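For Python 3, the same idea can be wrapped in a small function. This is a minimal sketch rather than part of the original answer: the name first_path_section is hypothetical, and it uses posixpath instead of os.path so that the result does not depend on the host platform.

```python
import posixpath
from urllib.parse import urlparse


def first_path_section(url):
    """Return the first section of the URL path, or '' if there is none."""
    path = urlparse(url).path
    # Walk up until the parent is the root ('/'), empty, or unchanged.
    while posixpath.dirname(path) not in ('/', '', path):
        path = posixpath.dirname(path)
    return posixpath.basename(path)


print(first_path_section('http://www.mydomain.com/hithere?image=2934'))
print(first_path_section('http://www.mydomain.com/hithere/something/else'))
```

Both calls print hithere, because urlparse keeps the query string out of the path component.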
Solution 2:
Python 3.4+ solution:
from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath

url = 'http://www.example.com/hithere/something/else'
PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]
# returns 'hithere' (the same for the URL with parameters)
# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
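To reuse this for both URLs from the question, the expression can be wrapped in a function that also copes with a URL that has no path sections. This is a hedged sketch: the name path_section and its default index are illustrative assumptions, not part of the original answer.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def path_section(url, index=1):
    """Return the entry at `index` in PurePosixPath.parts, or None if absent."""
    parts = PurePosixPath(unquote(urlparse(url).path)).parts
    # parts[0] is '/' for an absolute path, so the first section is parts[1].
    return parts[index] if 0 <= index < len(parts) else None


print(path_section('http://www.mydomain.com/hithere?image=2934'))
print(path_section('http://www.mydomain.com/hithere/something/else'))
print(path_section('http://www.mydomain.com'))
```

The first two calls print hithere; the last prints None because an empty path has no parts, where indexing parts[1] directly would raise IndexError.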
Solution 3:
The best option is to use the posixpath module when working with the path component of URLs. This module has the same interface as os.path and consistently operates on POSIX paths when used on POSIX and Windows NT based platforms.
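As a quick illustration of why the module choice matters, the sketch below splits the same path string with posixpath and with ntpath (which is what os.path resolves to on Windows). The path value is made up for the example.

```python
import ntpath
import posixpath

# A backslash is an ordinary character in a URL path, not a separator.
url_path = '/hithere/some\\thing'

# posixpath only treats '/' as a separator, on every platform.
print(posixpath.split(url_path))

# ntpath also splits on the backslash, which breaks the last section apart.
print(ntpath.split(url_path))
```

Importing posixpath explicitly gives the same result regardless of the operating system the code runs on.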
Sample Code:
#!/usr/bin/env python3
import urllib.parse
import sys
import posixpath
import ntpath


def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    while True:
        ( head, item ) = module.split( tmp )
        if head == tmp:
            # split() made no progress: reached the root or an empty path.
            # (Comparing against "/" alone loops forever on relative paths,
            # on "//", and on the "\" root that ntpath.normpath produces.)
            break
        tmp = head
        result.insert( 0, item )
    return result


def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string


def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n --[n={},m={}]-->\n {}\n".format(
        url, normalize, module.__name__, dump_array( path_parsed ) ) )


test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
    module = ntpath )
Code output:
http://eg.com/hithere/something/else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=False,m=posixpath]-->
[ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=posixpath]-->
[ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=ntpath]-->
[ "see", "if", "this", "works" ]
Notes:
- On Windows NT based platforms os.path is ntpath.
- On Unix/POSIX based platforms os.path is posixpath.
- ntpath will not handle backslashes (\) correctly (see last two cases in code/output), which is why posixpath is recommended.
- Remember to use urllib.parse.unquote.
- Consider using posixpath.normpath.
- The semantics of multiple path separators (/) are not defined by RFC 3986. However, posixpath.normpath collapses multiple adjacent path separators (i.e. it treats ///, // and / the same), with one exception: exactly two leading slashes are preserved.
- Even though POSIX and URL paths have similar syntax and semantics, they are not identical.
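Two of these notes are easy to check interactively. The sketch below shows how posixpath.normpath treats runs of separators, and what happens when unquote is called before versus after splitting. The URL containing %2F is a made-up example, not one from the test cases above.

```python
import posixpath
from urllib.parse import unquote, urlparse

# Runs of separators are collapsed, except that exactly two leading
# slashes are kept as they are.
print(posixpath.normpath('/a//b///c'))
print(posixpath.normpath('//a/b'))
print(posixpath.normpath('///a/b'))

# Hypothetical URL whose first section contains an encoded slash (%2F).
raw_path = urlparse('http://eg.com/a%2Fb/c').path

# Unquoting first makes %2F indistinguishable from a real separator.
print(unquote(raw_path).split('/')[1:])

# Splitting first keeps 'a/b' together as a single section.
print([unquote(part) for part in raw_path.split('/')[1:]])
```

Whether to unquote before or after splitting depends on whether an encoded slash should count as a separator in your application; the sample code above unquotes first.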
Normative References:
- IEEE Std 1003.1, 2013 - Vol. 1: Base Definitions - Section 4.12: Pathname Resolution
- The GNU C Library Reference Manual - Section 11.2: File Names
- IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 3.3: Path
- IETF RFC 3986: Uniform Resource Identifier (URI): Generic Syntax - Section 6: Normalization and Comparison
- Wikipedia: URL normalization