How to analyze a link to figure out the actual link
Sometimes when downloading something, I find the links are not the direct ones to the files.
For example, this is a link to download a PDF file:
http://ishare.down.sina.com.cn/15181391.PDF?ssig=2jEFaNQs7K&Expires=1312905600&KID=sina,ishare&IP=1312761745,68.50.222.
I was wondering how to figure out (or hack out) the actual link (i.e. http://*.PDF) to the file?
What are the names for such and similar techniques of not showing direct links? Some references, such as Wikipedia,...?
Yes, sometimes.
There are two things that commonly happen. Your link no longer works, so I can't tell which of them applies in your case; I will illustrate with another link instead.
HTTP Redirection
This is what you see with Bit.ly and other services. What they do is return an HTTP redirect response. When you visit http://bit.ly/oH3410 it redirects to the actual URL. Sometimes that URL in turn redirects to another. You can see this happening if you plug the URL into http://web-sniffer.net/ or run curl -I http://bit.ly/oH3410 and you will see that it returns a 301 pointing to a new Location.
So to deal with HTTP redirection you just need to repeat an HTTP HEAD request, each time against the URL given in the Location header, until you stop getting responses in the 300s (hopefully ending with a 200). Keep in mind that a site can redirect in a loop that never ends, so cap the number of hops. You can do this with curl or any HTTP tool.
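As a rough sketch of that loop, here is one way it might look in Python using only the standard library. The function names head and resolve, the max_hops limit of 10, and the 10 second timeout are all my own choices for illustration, not anything a particular site requires.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head(url):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, fetch=head, max_hops=10):
    """Follow 3xx responses until a non-redirect status or max_hops is hit."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            raise RuntimeError("redirect loop at " + url)
        seen.add(url)
        status, location = fetch(url)
        # Stop on anything that is not a redirect with a target.
        if not (300 <= status < 400) or not location:
            return url, status
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("gave up after %d redirects" % max_hops)
```

Calling resolve("http://bit.ly/oH3410") would return the last URL reached together with its status code, assuming the short link still exists. The fetch parameter is there only so the loop can be tried against a fake lookup table without touching the network. Some servers answer HEAD differently from GET, so treat the result as a hint.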
Downloader Page
This is what most download sites use. You click the download link and it takes you to a page with a bunch of ads that says "Your download will begin shortly" or something similar. With these you can try to parse the actual direct link from the URL, but that would be site specific, and most sites will not include it, to prevent you from circumventing the page. The delayed download is triggered either via a meta http-equiv="refresh" tag in the page's head, or via JavaScript (most common). The JavaScript version usually has a meta tag fallback though.
There is a solution though. If you look at the source of the download page you will usually see a <meta http-equiv="refresh"> tag (usually inside a <noscript> tag) whose content attribute contains a URL value that points to the actual download. So use curl (or any other HTTP tool) to download the page, parse it, and grab that value. A site may leave this tag out if they want to be really nasty, thus requiring you to have JavaScript enabled to download files.
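To make the parsing step concrete, here is a minimal sketch in Python with the standard library's html.parser. The names MetaRefreshFinder and find_refresh_url are made up for this example, and it assumes the common content="5; url=..." form of the tag; a target URL that itself contains a semicolon or a quote character would be cut short by this simple pattern.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remember the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://host/file".
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attrs.get("content") or "", re.I)
        if match:
            self.target = match.group(1).strip()


def find_refresh_url(html, base_url):
    """Return the absolute refresh target found in html, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(base_url, finder.target) if finder.target else None
```

You would fetch the downloader page first (with curl or any HTTP library), then pass its HTML and its own URL to find_refresh_url. The base URL matters because some sites put a relative path in the tag. If the function returns None, the page is probably relying on JavaScript alone.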
There is probably a JavaScript block that links to the download as well. It may be obfuscated, or loaded from another URL. Your mileage may vary trying to parse that out. There may also be a "direct link" on the page. You could try a few techniques to find that, but again it could be obfuscated via JavaScript or even missing altogether.
It might not be possible. The sites could feed you through a hundred redirects before you get to the file.
In addition, JavaScript can be used to give out links based on the URL that was given to the server.