How to download a CSV file using PhantomJS

Solution 1:

I found a solution for PhantomJS. Reading through this discussion, I found a JSFiddle that downloads a URL via jQuery's ajax method and encodes the file as base64.

The file I wanted to download was plain text (CSV), so I removed the encoding functions. My target page also already included jQuery, so I didn't need to inject it.

My code assumes you have already opened the page you want to download the file from in PhantomJS, and that the page has jQuery loaded. In my case I first had to log in to the site to get the download link.

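var fs = require('fs');

// Assumes `this` is the PhantomJS page that is already open (and logged in).
var page = this;

// Runs inside the page, so jQuery ($) must already be loaded there.
var result = page.evaluate(function() {
    var out;
    $.ajax({
        'async' : false, // synchronous, so `out` is set before we return
        'url' : 'fullurltodownload.csv',
        'success' : function(data, status, xhr) {
            out = data;
        }
    });
    return out;
});

// Write the CSV text to disk ('w' = overwrite).
fs.write('mydownloadedfile.csv', result, 'w');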
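
If the target page does not include jQuery, the same idea can probably be done with a plain synchronous XMLHttpRequest. This is only a minimal sketch and not the code I used: fetchTextSync is a hypothetical helper name, and it assumes the file is plain text and is served from the same origin as the page that is open.

```javascript
// Hypothetical helper: fetch a URL synchronously and return its body as text.
// It relies only on XMLHttpRequest, so it can run inside page.evaluate()
// on pages that do not ship jQuery.
function fetchTextSync(url) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url, false); // third argument false = synchronous
    xhr.send(null);
    // Return null on any non-200 status so the caller can detect failure.
    return xhr.status === 200 ? xhr.responseText : null;
}
```

In the PhantomJS script this would be called as page.evaluate(fetchTextSync, 'fullurltodownload.csv'), since page.evaluate passes extra arguments on to the function. Check the result for null before handing it to fs.write.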

Solution 2:

After days and days of investigation, I have found that there are a few solutions:

  • In your evaluate function you can make an AJAX call to download and encode the file, then return its content back to the PhantomJS script
  • You can use one of the custom PhantomJS libraries available on GitHub

If you need to download a file using PhantomJS, my advice is to run away from PhantomJS and use CasperJS. CasperJS is based on PhantomJS, but it has a much better and more intuitive syntax and program flow.

Here is a good post explaining "Why CasperJS is better than PhantomJS". In that post you can find a section about file download.

How to download a CSV file using CasperJS (this works even when the server sends the header Content-Disposition: attachment; filename='file.csv')

Here is a sample CSV file available for download: http://captaincoffee.com.au/dump/items.csv

To download a file like this using CasperJS, execute the following code:

var casper = require('casper').create();

casper.start("http://captaincoffee.com.au/dump/", function() {
    this.echo(this.getTitle());
});
casper.then(function(){
    var url = 'http://captaincoffee.com.au/dump/csv.csv';
    require('utils').dump(this.base64encode(url, 'get'));
});

casper.run();

The code above will download the CSV file at http://captaincoffee.com.au/dump/csv.csv and print its contents as a base64 string. This way you don't even have to save the data to a file: you already have it in memory as a base64 string.
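
To get from the base64 string back to CSV rows, something like the following should work. It is a rough sketch under a few assumptions: decodeBase64 and parseSimpleCsv are hypothetical helper names, atob is assumed to be available in your script context (it is a standard browser function), and the CSV is assumed to be plain ASCII with no quoted fields. The hard-coded string stands in for the value returned by base64encode.

```javascript
// Decode a base64 string to a binary string. atob exists in browsers and
// recent Node.js; the Buffer fallback only applies to older Node.js.
function decodeBase64(b64) {
    if (typeof atob === 'function') {
        return atob(b64);
    }
    return Buffer.from(b64, 'base64').toString('binary');
}

// Split decoded CSV text into rows of fields.
// Naive on purpose: it does not handle quoted fields or embedded commas.
function parseSimpleCsv(text) {
    return text.split(/\r?\n/)
        .filter(function (line) { return line.length > 0; })
        .map(function (line) { return line.split(','); });
}

// 'YSxiCjEsMg==' is the base64 encoding of the two-line CSV "a,b\n1,2".
var rows = parseSimpleCsv(decodeBase64('YSxiCjEsMg=='));
console.log(JSON.stringify(rows));
```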

If you explicitly want to save the file to the file system, you can use the download function that is available in CasperJS.
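
A minimal sketch of that, assuming the casper.download(url, target) call from the CasperJS API. The wrapper name queueCsvDownload and the target filename are my own placeholders, and the casper instance is passed in as a parameter so the helper can be reused across scripts.

```javascript
// Hypothetical helper: queue a step that saves `url` to `target` on disk.
// `casper` is an instance created with require('casper').create().
function queueCsvDownload(casper, url, target) {
    casper.then(function () {
        // Inside a step, `this` is the casper instance.
        this.download(url, target);
    });
    return casper;
}
```

Called between casper.start(...) and casper.run(), for example as queueCsvDownload(casper, 'http://captaincoffee.com.au/dump/csv.csv', 'csv.csv'), this should write the file relative to the current working directory.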

Solution 3:

The previous two answers assume you know the URL of the final CSV file in advance. That won't be the case if the link goes to an HTML page that does a JavaScript-computed redirect to the file and you don't want to evaluate that JavaScript outside of PhantomJS. Your options then are:

  1. put PhantomJS behind an upstream proxy, and use that proxy to intercept the download URL (and its expected Cookie and Referer headers). You would have to be careful to positively identify the real download URL and not some random data 'blob' if the page makes binary XMLHttpRequests as well;
  2. instead of PhantomJS, use Headless Chrome, which can automatically save downloaded files, and monitor the downloads directory (Firefox with PyVirtualDisplay can also be set to do this, or you can wait for Headless Firefox). You would have to work out for yourself when the download has completed. You could use an upstream proxy to monitor it for completion, but Headless Chrome/Firefox cannot currently be set to ignore SSL certificates, so if the site goes "secure" it is much more difficult to monitor the requests of Headless Chrome/Firefox than those of PhantomJS, at least until Chromium issue 721739 is fixed. You could watch a CONNECT request, but if the connection is kept alive you will have no way of knowing for sure that a transfer has finished;
  3. put PhantomJS behind an upstream proxy that changes all unknown content types to text/plain and deletes Content-Disposition headers, so you can read the file from PhantomJS in the normal way. That should work for a CSV file but won't work for binaries with 0-bytes in them.
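
For the third option, the header rewriting inside the proxy could look roughly like this. It is a sketch and not tied to any particular proxy software: rewriteResponseHeaders is a hypothetical function that takes the response headers as a plain object, and the KNOWN_TYPES list is my own guess at what counts as a "known" content type, so adjust it to your needs.

```javascript
// Illustrative (assumed) list of content types to leave untouched.
var KNOWN_TYPES = [
    'text/html', 'text/plain', 'text/css', 'text/javascript',
    'application/javascript', 'application/json',
    'application/xhtml+xml', 'application/xml'
];

// Return a copy of `headers` with lower-cased names, without
// Content-Disposition, and with unknown content types forced to text/plain.
function rewriteResponseHeaders(headers) {
    var out = {};
    Object.keys(headers).forEach(function (name) {
        var lower = name.toLowerCase();
        if (lower === 'content-disposition') {
            return; // drop it so the body is displayed, not downloaded
        }
        out[lower] = headers[name];
    });
    // Strip parameters such as "; charset=utf-8" before comparing.
    var type = (out['content-type'] || '').split(';')[0].trim().toLowerCase();
    var known = type.indexOf('image/') === 0 || KNOWN_TYPES.indexOf(type) !== -1;
    if (!known) {
        // A missing Content-Type is also treated as unknown.
        out['content-type'] = 'text/plain';
    }
    return out;
}
```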

The first of these options (PhantomJS + upstream proxy) is made easier if the upstream proxy can monitor the Accept header that PhantomJS sends to the remote site. At least in PhantomJS version 2.1.1, main requests have Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, stylesheet requests have Accept: text/css,*/*;q=0.1, and all other requests (images, scripts, XMLHttpRequest) default to Accept: */*, although this can be overridden by sites that use XMLHttpRequest.setRequestHeader(). Therefore, if the upstream proxy sees a request with an Accept header containing text/html, and passing this request on to the server results in a CSV file or other non-HTML document, then there's a good chance this is the one to save.
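
The Accept-header heuristic could be expressed as a small predicate inside the proxy. Again, this is only a sketch: isLikelyDownload and headerValue are hypothetical names, headers are assumed to arrive as plain objects, and a real proxy would probably also want to skip redirects and error responses before trusting the result.

```javascript
// Case-insensitive header lookup; returns '' when the header is absent.
function headerValue(headers, name) {
    var wanted = name.toLowerCase();
    var found = '';
    Object.keys(headers).forEach(function (key) {
        if (key.toLowerCase() === wanted) {
            found = String(headers[key]);
        }
    });
    return found;
}

// True when the request asked for HTML (a "main" request) but the
// response turned out to be something other than an HTML document.
function isLikelyDownload(requestHeaders, responseHeaders) {
    var accept = headerValue(requestHeaders, 'accept').toLowerCase();
    var type = headerValue(responseHeaders, 'content-type')
        .split(';')[0].trim().toLowerCase();
    var wantedHtml = accept.indexOf('text/html') !== -1;
    var gotHtml = type === 'text/html' || type === 'application/xhtml+xml';
    return wantedHtml && !gotHtml;
}
```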