wget fails to download some images in a webpage

404 Errors

This seems to have something to do with url encoding[.]

Decoding the percent-encoded portions of the failing links reveals that the "paths" are actually template variable names present in the document source (e.g. %7B%7B%20data.avatar_url%20%7D%7D becomes {{ data.avatar_url }}). These placeholders never correspond to real files on the server, so they, rather than the encoding itself, are the likely reason for the 404 responses.
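If you want to check this yourself, the decoding can be reproduced in a few lines of Python. This is only a small sketch: the failing URL below is assembled from the pieces quoted in this answer, and looks_like_template is a hypothetical helper name I made up for illustration.

```python
from urllib.parse import unquote, urlsplit

# A failing URL as wget would have requested it (assembled from the
# prefix and the encoded portion quoted in this answer).
failing_url = "https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D"

# Take the last path segment and undo the percent-encoding.
last_segment = urlsplit(failing_url).path.rsplit("/", 1)[-1]
decoded = unquote(last_segment)
print(decoded)


def looks_like_template(url: str) -> bool:
    """Hypothetical helper: True if the decoded URL contains a {{ ... }} placeholder."""
    text = unquote(url)
    return "{{" in text and "}}" in text
```

A check like looks_like_template could be used to separate these placeholder "links" from genuine failures in a wget log.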

The leading https://www.inhaltsangabe.de/autoren/ is probably (mis)applied by wget because each variable appears in an <img> tag src attribute:

ex. {{ data.images.thumbnail.url }}

<# if ( data.images.thumbnail ) { #>
    <img class="suggestion-post-thumbnail" src="{{ data.images.thumbnail.url }}" alt="{{ data.post_title }}">
<# } #>

ex. {{ data.avatar_url }}

<# if ( data.avatar_url ) { #>
    <img class="suggestion-user-thumbnail" src="{{ data.avatar_url }}" alt="{{ data.display_name }}">
<# } #>

Missing JPEG

Other images work fine in the downloaded file.

Regarding brecht-276fafb8.jpeg, while admittedly a bit of an educated guess, it appears likely that wget is processing <img> tag src and srcset attributes in the document source, but not any data-src or data-srcset attributes. For example:

ex. brecht-276fafb8.jpeg -> data-src, data-srcset (Fail!)

<img class="el-image uk-border-circle uk-box-shadow-small" alt="Bertolt Brecht" data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" data-srcset="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg 350w" data-sizes="(min-width: 350px) 350px" data-width="350" data-height="350" uk-img>

ex. bradbury.jpg -> src, srcset (Success!)

<img width="300" height="300" src="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg" alt="Ray Bradbury" sizes="(min-width: 300px) 300px" srcset="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg 300w, https://www.inhaltsangabe.de/dateien/bradbury-150x150.jpg 150w, https://www.inhaltsangabe.de/dateien/bradbury.jpg 400w"/>

This makes sense as the src and srcset attributes likely affect the general presentation of the document (i.e. images to show), whereas data-* attributes are primarily aimed at scripting, etc. and don't have any presentational value on their own.

As far as I am aware, at least in prior versions, custom attributes (e.g. data-*) were generally unsupported by wget. Regarding src and srcset, you can see them explicitly mentioned in the lists of attributes to process in src/html-url.c in the wget source code.

I have no idea on how to solve this problem.

Unfortunately, I am not aware of a good solution to this issue. One thought would be to do some manual post-processing on the downloaded document source with something like BeautifulSoup to extract any relevant links (e.g. those in data-src and data-srcset attributes). But I am not sure whether that could be considered a "good" solution or not.
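To make the post-processing idea a bit more concrete, here is a rough sketch. Note that it uses Python's built-in html.parser instead of BeautifulSoup, purely so it runs without extra dependencies; the same logic should carry over to BeautifulSoup. The class name ImageLinkParser and the sample markup are my own for illustration, and the srcset handling is simplified (it splits on commas, which would break on URLs that themselves contain commas).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes holding a single URL vs. a comma-separated candidate list.
URL_ATTRS = {"src", "data-src"}
SRCSET_ATTRS = {"srcset", "data-srcset"}


class ImageLinkParser(HTMLParser):
    """Collects absolute image URLs from <img> tags, including data-* variants."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if not value:
                continue  # valueless attributes such as uk-img
            if name in URL_ATTRS:
                candidates = [value]
            elif name in SRCSET_ATTRS:
                # Each entry looks like "URL 350w"; keep only the URL part.
                candidates = [p.split()[0] for p in value.split(",") if p.strip()]
            else:
                continue
            for candidate in candidates:
                if "{{" in candidate:
                    continue  # skip template placeholders like {{ data.avatar_url }}
                url = urljoin(self.base_url, candidate.strip())
                if url not in self.links:
                    self.links.append(url)


# Sample markup (shortened from the snippets quoted in this answer).
sample = """
<img class="el-image" alt="Bertolt Brecht" data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" data-srcset="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg 350w" uk-img>
<img class="suggestion-user-thumbnail" src="{{ data.avatar_url }}" alt="{{ data.display_name }}">
"""

parser = ImageLinkParser("https://www.inhaltsangabe.de/autoren/")
parser.feed(sample)
for link in parser.links:
    print(link)
```

The printed URLs could then be written to a text file and fetched in a second pass with wget -i, though I have not tested this against the actual page.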
