How to log in and then download a file from ASPX web pages with R
I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user to this login/authentication page. After authenticating, it's easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the headers in Chrome for the Login.aspx page (as described here), but the session still isn't kept, even when I believe I'm passing in all the correct values. I don't care whether it's done with httr, RCurl, or something else; I'd just like something that works inside R so that users of this script don't have to download the files manually or with a separate program. One of my attempts is below, but it doesn't work. Any help would be appreciated. Thanks!! :D
require(httr)
# form fields copied from the Login.aspx page
values <-
    list(
        "ctl00$ContentPlaceHolder3$Login1$UserName" = "user@example.com" ,  # placeholder address
        "ctl00$ContentPlaceHolder3$Login1$Password" = "somepassword" ,
        "ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In" ,
        "_LASTFOCUS" = "" ,
        "_EVENTTARGET" = "" ,
        "_EVENTARGUMENT" = ""
    )
# attempt to log in, then request one of the files
POST( "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx" , body = values )
resp <- GET( "http://simba.isr.umich.edu/Zips/GetFile.aspx" , query = list( file = "1053" ) )
Solution 1:
Besides storing the cookie after authentication (see my comment above), there was another problem with your approach: the ASP.NET site puts a __VIEWSTATE key-value pair in a hidden form field of the login page, and that value has to be sent back with your login request. If you check, you could not even log in with your example (the result of the POST command holds info about how to log in, just check it out).
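To make the hidden-field point concrete, here is a minimal base R sketch of pulling such a value out of a page. The helper name extract_hidden_field() and the sample HTML are made up for illustration, and the regular expression assumes the id attribute appears before value inside the input tag, as it does in typical ASP.NET output. Some ASP.NET pages also emit an __EVENTVALIDATION field; whether this particular site needs it is something I have not checked.

```r
# Hypothetical helper: return the value of a hidden <input> given its id,
# or NA if the page has no such field.
extract_hidden_field <- function(html, id) {
  # assumes id="..." comes before value="..." within the same tag
  pattern <- sprintf('id="%s"[^>]*value="([^"]*)"', id)
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up fragment standing in for the real login page
sample_html <- '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK+abc=" />'

extract_hidden_field(sample_html, "__VIEWSTATE")        # "/wEPDwUK+abc="
extract_hidden_field(sample_html, "__EVENTVALIDATION")  # NA, field not present
```

Any field found this way would go into the parameter list for the login POST next to the username and password.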
An outline of a possible solution:
- Load the RCurl package:
    > library(RCurl)
- Set some handy curl options:
    > curl = getCurlHandle()
    > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
- Load the page for the first time to capture VIEWSTATE:
    > html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
- Extract VIEWSTATE with a regular expression or any other tool:
    > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
- Set the parameters as your username, password and the VIEWSTATE:
    > params <- list(
    +     'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
    +     'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
    +     'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
    +     '__VIEWSTATE'                                  = viewstate
    + )
- Log in at last:
    > html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
  Congrats, now you are logged in and curl holds the cookie verifying that!
- Verify if you are logged in:
    > grepl('Logout', html)
    [1] TRUE
So you can go ahead and download any file; just be sure to pass curl = curl in all your queries.
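As a rough sketch of that last step: the helper below, download_with_handle(), is a name I made up and not part of RCurl. It simply reuses the logged-in handle with getBinaryURL() and writes the bytes to disk. The GetFile.aspx URL is taken from the question and the destination file name is invented; I have not verified what the server actually returns.

```r
# Hypothetical helper (not an RCurl function): fetch a binary file through an
# existing, already logged-in curl handle so the session cookie is sent along.
download_with_handle <- function(url, dest, curl) {
  # getBinaryURL() returns the response body as a raw vector
  bin <- RCurl::getBinaryURL(url, curl = curl)
  writeBin(bin, dest)
  invisible(dest)
}

# Usage, assuming `curl` is the handle from the login steps and that the
# GetFile.aspx URL from the question is the right download endpoint:
# download_with_handle(
#   'http://simba.isr.umich.edu/Zips/GetFile.aspx?file=1053',
#   'psid_1053.zip',   # made-up destination file name
#   curl
# )
```

If the saved file turns out to be an HTML login page rather than the data, the session was probably not established, and the grepl('Logout', html) check is worth repeating before downloading.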