Why do images from some Tumblr pages not load, but using wget on them works?

Solution 1:

UPDATE: It seems the core issue with images not loading stemmed from the way the EFF’s HTTPS Everywhere plugin/extension handled some Tumblr URLs. The developers were notified and a fix appears to be in place. This answer basically breaks down the detective work done to uncover the issue as outlined by the initial question and could prove useful for further debugging/diagnosis if a similar issue appears in the future.


EDIT: The larger content about image leeching seems invalid. So I will add a new idea at the top and leave the image leeching info at the bottom just in case it is useful to someone.

Amazon CloudFront CDN Ideas

Okay, using the URLs you have provided—as well as some of my real world experience with Amazon CloudFront CDN setups—I think I discovered something. It seems like Tumblr’s Amazon CloudFront CDN config is choking for some reason. Here is why I think that is the case.

Let’s take this example URL:

http://36.media.tumblr.com/d685b02fdf2d3f167c22d9a97e27e87a/tumblr_nfpq5qPZ4v1tognpro1_1280.png

Now let’s run curl -I to get header information on that file:

curl -I http://36.media.tumblr.com/d685b02fdf2d3f167c22d9a97e27e87a/tumblr_nfpq5qPZ4v1tognpro1_1280.png

The output for that would be something like this:

HTTP/1.1 200 OK
Content-Type: image/png
Content-Length: 782141
Connection: keep-alive
Accept-Ranges: bytes
Cache-Control: max-age=1209600
Date: Thu, 05 Mar 2015 02:15:44 GMT
Server: nginx
X-Cache: Miss from cloudfront
Via: 1.1 7e54fc06cd70e4752fe050bbe5c130be.cloudfront.net (CloudFront)
X-Amz-Cf-Id: QyIUyzfaJJN3PU_xWkW0P-D2kjg_1cVenKzFAoY2PubgZQlBHWorZQ==

Now the things to pay attention to here are the Date header (the date and time of the file on the CloudFront endpoint) and the X-Cache header (Amazon content delivery status). Typical behavior on Amazon CloudFront is that the first access returns “Miss from cloudfront” and then, if you run curl -I again right away, you should get “Hit from cloudfront”.

But that’s not what I saw just now. Here is a breakdown of the Date and X-Cache status of a bunch of accesses I made:

  • Date: Thu, 05 Mar 2015 02:19:37 GMT = X-Cache: Miss from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:39 GMT = X-Cache: Miss from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:44 GMT = X-Cache: Miss from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Miss from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront
  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront

The reason there are multiple items with the exact same date near the end, all of them Hit from cloudfront, is that this is what happens on a CDN: if the endpoint of the CDN has the file, then Date correlates to the actual creation/modification date of the copy that endpoint has, rather than to the time of each request.

You notice the first four accesses are seconds apart, with different dates/times, and all of them are Miss from cloudfront, right? That means the CDN endpoint is just echoing back that there was an attempt to access that file at those times and all attempts were misses.
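To make that reading of the headers repeatable, here is a minimal Python sketch that pulls Date and X-Cache out of header text as printed by curl -I and counts how many requests missed before the first hit. The function names cache_status and count_leading_misses are made up for this illustration, and the sketch only parses text you have already captured; it does not contact CloudFront.

```python
from email.parser import Parser
from typing import List, Tuple


def cache_status(raw_headers: str) -> Tuple[str, str]:
    """Return (Date, X-Cache) from a header block as printed by `curl -I`."""
    # Drop the status line ("HTTP/1.1 200 OK"); the rest are "Name: value" lines.
    _, _, header_block = raw_headers.strip().partition("\n")
    headers = Parser().parsestr(header_block, headersonly=True)
    return headers.get("Date", ""), headers.get("X-Cache", "")


def count_leading_misses(statuses: List[str]) -> int:
    """Count how many consecutive requests missed before the first hit."""
    count = 0
    for status in statuses:
        if status.lower().startswith("hit"):
            break
        count += 1
    return count


# X-Cache values from the seven accesses listed above.
observed = ["Miss from cloudfront"] * 4 + ["Hit from cloudfront"] * 3
print(count_leading_misses(observed))  # prints 4
```

A healthy endpoint would normally show one leading miss for a cacheable file, so a larger count is the symptom described here.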

So my armchair assessment of this is that Tumblr’s systems are not keeping up with the Amazon CloudFront CDN or the Amazon CloudFront CDN is not keeping up with Tumblr. But in some way, things are amiss on their server side. And since this is a CDN, someone accessing the files in one location might not notice an issue while someone else in another location would have issues viewing the image.

Which is all to say, I don’t think this can easily be cleared up on the client side.


EDIT: So the original poster added some new URLs, and this still points to a server-side issue, but I just wanted to post the details for the record.

EdgeCast & Highwinds CDN Ideas

So the original poster added more specifics, so here are more details based on the blog post that is being used as an example:

http://claystorks.tumblr.com/post/112741831192/soulmister-claystorks-windspeare-explain

And these image URLs are provided as examples of URLs in that post:

https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png

https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_1280.png

And those two image URLs do indeed fail. But from my side, looking at the original source code of the blog post from Brooklyn, New York, USA, I am not seeing those EdgeCast (gs1.wac.edgecastcdn.net) URLs. Rather, these are the URLs I am seeing:

http://41.media.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png

http://41.media.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_1280.png

So my first thought is: why is the original poster seeing those EdgeCast (gs1.wac.edgecastcdn.net) URLs? But then if I run a traceroute to 41.media.tumblr.com, I see that it is a server managed by Highwinds (!?!?). In contrast, the initial URLs passed on by the original poster use the 36.media.tumblr.com hostname, and you can see they are served by Amazon CloudFront CDN servers.

Which is all to say, as I said before, that all of this seems to be a server-side issue with Tumblr and their CDN management. But from my side, in Brooklyn, New York, USA, I am clearly seeing content being delivered as expected from Highwinds CDN servers as well as Amazon CloudFront CDN servers. Where these EdgeCast URLs are coming from, or how and why they are then failing, is out of anyone’s control on the client side. This would definitely be something to contact Tumblr tech staff about, because there is no way a desktop end user could resolve this.


Image Leeching Ideas

Might not be relevant anymore, but here for reference.

You stating this gives me a clue:

Using wget on the images' direct links works.

Many sites have rules in place (usually set via Apache) that prevent image leeching. More details on how those rules work are provided here and are summarized as follows:

Using .htaccess, you can disallow hot linking on your server, so those attempting to link to an image or CSS file on your site, for example, is either blocked (failed request, such as a broken image) or served a different content (ie: an image of an angry man).

Your description, and the fact that you can access the images via wget, leads me to believe that the images you are having issues with are not hosted on Tumblr by users, but are rather images that are placed on a Tumblr blog but actually hosted on another site.

When standard image leeching procedures are put in place, viewing an embedded image on one site that is hosted on another site—which blocks leeching—would result in a broken image link or perhaps a “Stop Leeching!” image being returned. This is because basic anti-leeching rules—such as those in that example page—crosscheck image referrers to make sure the page requesting the image matches the domain hosting the image.

So when you are accessing the image via wget you are accessing the image directly. So image leeching rules would not kick in. Thus you can get the image via wget but not when it is embedded in another page.
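As an illustration only, the referrer check described above can be sketched in Python. This is not any site's actual rule set: allows_request and the host images.example.com are hypothetical, and the sketch assumes the common convention that a request with no Referer at all is let through, which is why a direct wget fetch succeeds.

```python
from typing import Optional
from urllib.parse import urlparse


def allows_request(referer: Optional[str], image_host: str) -> bool:
    """Mimic a basic anti-leeching rule: block only foreign referrers."""
    if not referer:
        # Direct requests (wget, pasting the URL) send no Referer, so they pass.
        return True
    referer_host = urlparse(referer).hostname or ""
    # Pages on the image host itself, or on its subdomains, may embed the image.
    return referer_host == image_host or referer_host.endswith("." + image_host)


# Direct fetch, as wget would do by default: allowed.
print(allows_request(None, "images.example.com"))
# Image embedded in a page on another domain: blocked.
print(allows_request("http://claystorks.tumblr.com/post/112741831192", "images.example.com"))
```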

Solution 2:

I am currently having this very problem. This is a safe-for-work (well, it’s a silly comic) example of an affected blog.

I found, however, that the problem happened only in Chrome for me. After a while, I realized that the cause of the issue was the extension “HTTPS Everywhere.” When I installed it in Firefox, I had the same problem there too. And actually, if I disable the HTTPS rule “Tumblr (partial)” (which I guess means *.tumblr.com), it works fine again.

So, the issue seems to be that, at least sometimes, when HTTPS is used to access an image, you are redirected to an invalid EdgeCast URL. For example, this image URL works fine:

http://36.media.tumblr.com/57d2af15f7b21037364125f9f32c4379/tumblr_nktjzyNkv91s667kio1_1280.png

But if you change the protocol from http to https you get redirected to this URL which does not work:

https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/57d2af15f7b21037364125f9f32c4379/tumblr_nktjzyNkv91s667kio1_1280.png
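The protocol swap can be sketched in Python as follows. This is only a rough stand-in for what an HTTPS-upgrading extension does, not HTTPS Everywhere's actual rule logic, and the helper names to_https and is_cross_host_redirect are invented for this example. The sketch makes no network request; it just rewrites the URL and compares hostnames.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url: str) -> str:
    """Return the same URL with the scheme switched from http to https."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return urlunsplit(parts._replace(scheme="https"))


def is_cross_host_redirect(requested: str, final: str) -> bool:
    """True when the URL you ended up on is served by a different host."""
    return urlsplit(requested).hostname != urlsplit(final).hostname


original = "http://36.media.tumblr.com/57d2af15f7b21037364125f9f32c4379/tumblr_nktjzyNkv91s667kio1_1280.png"
landed_on = "https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/57d2af15f7b21037364125f9f32c4379/tumblr_nktjzyNkv91s667kio1_1280.png"

upgraded = to_https(original)
print(upgraded)
print(is_cross_host_redirect(upgraded, landed_on))
```

To check a live URL, the final address after redirects could be obtained with something like urllib.request.urlopen(upgraded).geturl() and passed in as the second argument; the URLs above date from 2015 and may no longer behave this way.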

I am not sure if this counts as an error on Tumblr’s side or not. I guess that if clients are not supposed to access their media servers with HTTPS, you cannot really blame them for it.

EDIT: And actually the problem seems to have been dealt with as reported in this GitHub thread.

Solution 3:

I’ve noticed this behavior more while on my mobile carrier, T-Mobile. I'm thinking this is some sort of traffic shaping based on image size or some carrier-built “difficulty metric” for retrieving said item.

In previous testing, over a year ago, I shared the broken post with a friend who has Verizon, and the image loaded fine.

While I can’t test this image I’m about to provide—as my friend is unavailable—this image doesn’t load for me. I am running stock Android (5.0.1) on a Nexus 5 using Chrome as a browser.

http://41.media.tumblr.com/efebad51567e927b8f130f9bdc4efae3/tumblr_ndvnpjcBZa1qewacoo1_500.png

When I try to load the image directly I get a 504 gateway timeout error.

EDIT: This is @JakeGould posting the actual image for reference.

Further testing and details: I'm in Baltimore MD, running off of LTE data and the following image did work: http://40.media.tumblr.com/a5e0a96d36170c997aabad7efc630d3e/tumblr_njnalkSD7M1s5cyzso1_500.jpg

Further testing shows that PNG doesn't seem to be the issue. Most of the other images I hit that worked were a mix of png and jpg, but all were on non "41" servers.
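One way to keep track of that kind of observation is to group load results by the numbered media host in each URL. The sketch below is a hypothetical helper, not part of any Tumblr tooling; the results dictionary holds only the two outcomes reported in this answer, and the "NN.media.tumblr.com" pattern is taken from the URLs quoted in this thread.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Matches hosts such as 41.media.tumblr.com and captures the leading number.
_MEDIA_HOST = re.compile(r"^https?://(\d+)\.media\.tumblr\.com/")


def media_server(url: str) -> Optional[str]:
    """Return the numeric prefix of a Tumblr media host, or None if absent."""
    match = _MEDIA_HOST.match(url)
    return match.group(1) if match else None


def group_by_server(results: Dict[str, bool]) -> Dict[str, List[bool]]:
    """Group load outcomes (True = loaded) by media server number."""
    grouped = defaultdict(list)
    for url, loaded in results.items():
        grouped[media_server(url) or "other"].append(loaded)
    return dict(grouped)


# Outcomes reported above: the "41" image timed out, the "40" image loaded.
results = {
    "http://41.media.tumblr.com/efebad51567e927b8f130f9bdc4efae3/tumblr_ndvnpjcBZa1qewacoo1_500.png": False,
    "http://40.media.tumblr.com/a5e0a96d36170c997aabad7efc630d3e/tumblr_njnalkSD7M1s5cyzso1_500.jpg": True,
}
print(group_by_server(results))
```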

Final note: I got home, hopped on my wifi (Comcast) with my phone (the device I have been testing on), and all the photos I couldn't see due to the 504 I can now see.

EDIT: New to superuser, trimmed and edited post so it was more factual and less discussion.

UPDATE: Issue seems to be tied to LTE. Loaded up Tumblr, found some images that wouldn't load, forced my phone down to 3G, reloaded the page, and all images showed. Reverted the phone back to LTE, cleared the cache, and the images that previously didn't load under LTE now load.
(I'm testing again and now I can't reproduce it. So maybe the above behavior was a fluke.)