Python: urllib/urllib2/httplib confusion

I'm trying to test the functionality of a web app by scripting a login sequence in Python, but I'm having some trouble.

Here's what I need to do:

  1. Do a POST with a few parameters and headers.
  2. Follow a redirect.
  3. Retrieve the HTML body.

Now, I'm relatively new to Python, but the two things I've tried so far haven't worked. First I used httplib, with putrequest() (passing the parameters within the URL) and putheader(). This didn't seem to follow the redirects.

Then I tried urllib and urllib2, passing both headers and parameters as dicts. This seems to return the login page instead of the page I'm trying to log in to; I'm guessing that's because cookies aren't being handled.

Am I missing something simple?

Thanks.


Focus on urllib2 for this; it works quite well. Don't mess with httplib; it's not the top-level API.

What you're seeing with urllib and urllib2 is almost certainly a cookie problem: the default opener doesn't store cookies, so the session your login creates is lost on the very next request.

You need to fold in an instance of HTTPCookieProcessor, plus an HTTPRedirectHandler if you want control over how the redirects are followed.

Further, you may want to subclass the default HTTPRedirectHandler to capture information that you'll then check as part of your unit testing.

import cookielib, urllib2

cookies = cookielib.CookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)

You can then use this opener object to POST and GET, handling redirects and cookies properly.

You may also want to add your own subclass of HTTPHandler to capture and log various error codes.
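For example, here's a minimal sketch of the redirect-handler subclass suggested above (the class name and the redirects attribute are my own invention, not standard API); it follows redirects normally but records each one so your test can assert on the chain afterwards:

class LoggingRedirectHandler(urllib2.HTTPRedirectHandler):
    def __init__(self):
        self.redirects = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the status code and target, then defer to the default behavior.
        self.redirects.append((code, newurl))
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)

opener = urllib2.build_opener(LoggingRedirectHandler(), cookie_handler)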


Here's my take on this issue.

#!/usr/bin/env python

import urllib
import urllib2


class HttpBot:
    """An HttpBot represents one browser session, with cookies."""
    def __init__(self):
        cookie_handler = urllib2.HTTPCookieProcessor()
        redirect_handler = urllib2.HTTPRedirectHandler()
        self._opener = urllib2.build_opener(redirect_handler, cookie_handler)

    def GET(self, url):
        return self._opener.open(url).read()

    def POST(self, url, parameters):
        return self._opener.open(url, urllib.urlencode(parameters)).read()


if __name__ == "__main__":
    bot = HttpBot()
    ignored_html = bot.POST('https://example.com/authenticator', {'passwd': 'foo'})
    print bot.GET('https://example.com/interesting/content')
    ignored_html = bot.POST('https://example.com/deauthenticator', {})
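A side note on the design: if you'd rather keep using module-level urlopen() calls instead of passing a bot (or opener) around, urllib2.install_opener() makes a custom opener the process-wide default:

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)  # plain urllib2.urlopen() now shares the cookie jar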

@S.Lott, thank you. Your suggestion worked for me, with some modification. Here's how I did it.

from cookielib import CookieJar
import urllib
import urllib2

# params, host, page and headers are defined earlier in my script
data = urllib.urlencode(params)
url = host + page
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)

# pull the session cookie out of the login response
cookies = CookieJar()
cookies.extract_cookies(response, request)

cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)

# re-issue the request through the opener so the cookie is sent back
response = opener.open(request)
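In hindsight, an even simpler variant might be to build the cookie-aware opener before the first request, so the initial POST and the redirect share one session from the start. A rough, untested sketch of that idea:

# Build the opener first; HTTPCookieProcessor stores and resends cookies,
# and redirects are followed by default.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(CookieJar()))
response = opener.open(urllib2.Request(url, data, headers))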

I had to do this exact thing myself recently. I only needed classes from the standard library. Here's an excerpt from my code:

from urllib import urlencode
from urllib2 import urlopen, Request

# encode my POST parameters for the login page
login_qs = urlencode([("username", USERNAME), ("password", PASSWORD)])

# extract my session id by loading a page from the site
set_cookie = urlopen(URL_BASE).headers.getheader("Set-Cookie")
sess_id = set_cookie[set_cookie.index("=") + 1:set_cookie.index(";")]

# construct a headers dictionary using the session id
headers = {"Cookie": "session_id=" + sess_id}

# perform the login and make sure it worked
login_page = urlopen(Request(URL_BASE + "login", headers=headers), login_qs).read()
if "Announcements:" not in login_page:
    print "Didn't log in properly"
    exit(1)

# here's the function I used after this for loading pages
def download(page=""):
    return urlopen(Request(URL_BASE + page, headers=headers)).read()

# for example (download() already prepends URL_BASE):
print download("config")

I'd give Mechanize (http://wwwsearch.sourceforge.net/mechanize/) a shot. It may well handle your cookies and headers transparently.
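For instance, a login like yours might look roughly like this with mechanize (a sketch only; the form index and field names below are guesses you'd adapt to your page):

import mechanize

br = mechanize.Browser()
br.open("https://example.com/login")
br.select_form(nr=0)    # assuming the login form is the first one on the page
br["username"] = "foo"  # field names depend on the app's form
br["passwd"] = "bar"
response = br.submit()  # cookies and the redirect are handled for you
print response.read()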