wget fails to download some images in a webpage
404 Errors
This seems to have something to do with url encoding[.]
Decoding the encoded portions of the failing links reveals that the "paths" are actually variable names (template placeholders) present in the document source. For example, %7B%7B%20data.avatar_url%20%7D%7D becomes {{ data.avatar_url }}. That is the likely reason for the 404 responses, not the encoding itself.
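To check this yourself, the failing path can be percent-decoded with Python's standard library. This is only a minimal sketch; the encoded string is the one quoted above, and the variable names are my own.

```python
from urllib.parse import unquote

# Path segment taken from one of the failing (404) requests
encoded = "%7B%7B%20data.avatar_url%20%7D%7D"

# Percent-decoding turns %7B into "{", %20 into " " and %7D into "}"
decoded = unquote(encoded)
print(decoded)  # {{ data.avatar_url }}
```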
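The leading https://www.inhaltsangabe.de/autoren/ is probably (mis)applied by wget because each variable appears in the src attribute of an <img> tag, so wget presumably treats the placeholder as a relative link and resolves it against the page address: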
ex. {{ data.images.thumbnail.url }}
<# if ( data.images.thumbnail ) { #>
<img class="suggestion-post-thumbnail" src="{{ data.images.thumbnail.url }}" alt="{{ data.post_title }}">
<# } #>
ex. {{ data.avatar_url }}
<# if ( data.avatar_url ) { #>
<img class="suggestion-user-thumbnail" src="{{ data.avatar_url }}" alt="{{ data.display_name }}">
<# } #>
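The following sketch shows how an unrendered placeholder in a src attribute could end up as one of the failing URLs. It is not wget's actual code, just an illustration using Python's urllib; it assumes the page was fetched from https://www.inhaltsangabe.de/autoren/, and resolve_like_a_crawler is a hypothetical helper name.

```python
from urllib.parse import quote, urljoin

# Assumption: the address of the page that was downloaded
page_url = "https://www.inhaltsangabe.de/autoren/"

# The literal src value from the template block in the page source
raw_src = "{{ data.avatar_url }}"


def resolve_like_a_crawler(base, link):
    """Percent-encode a relative link and resolve it against the page URL."""
    return urljoin(base, quote(link))


resolved = resolve_like_a_crawler(page_url, raw_src)
print(resolved)
# https://www.inhaltsangabe.de/autoren/%7B%7B%20data.avatar_url%20%7D%7D
```

No file exists at that address on the server, which would explain the 404.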
Missing JPEG
Other images work fine in the downloaded file.
Regarding brecht-276fafb8.jpeg: while admittedly a bit of an educated guess, it appears likely that wget processes the src and srcset attributes of <img> tags in the document source, but not any data-src or data-srcset attributes. For example:
ex. brecht-276fafb8.jpeg -> data-src, data-srcset (Fail!)
<img class="el-image uk-border-circle uk-box-shadow-small" alt="Bertolt Brecht" data-src="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg" data-srcset="/wp-content/themes/yootheme/cache/brecht-276fafb8.jpeg 350w" data-sizes="(min-width: 350px) 350px" data-width="350" data-height="350" uk-img>
ex. bradbury.jpg -> src, srcset (Success!)
<img width="300" height="300" src="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg" alt="Ray Bradbury" sizes="(min-width: 300px) 300px" srcset="https://www.inhaltsangabe.de/dateien/bradbury-300x300.jpg 300w, https://www.inhaltsangabe.de/dateien/bradbury-150x150.jpg 150w, https://www.inhaltsangabe.de/dateien/bradbury.jpg 400w"/>
This makes sense, as the src and srcset attributes directly affect the presentation of the document (i.e. which images are shown), whereas data-* attributes are primarily aimed at scripting and have no presentational value on their own.
As far as I am aware, at least in prior versions, custom attributes (e.g. data-*) were generally unsupported by wget. (Regarding src and srcset, you can see them explicitly mentioned in the lists of attributes to process in src/html-url.c in the wget source code.)
I have no idea on how to solve this problem.
Unfortunately, I am not aware of a good solution to this issue. One option might be to do some manual post-processing on the downloaded document source with something like BeautifulSoup to extract any relevant links. But I am not sure whether that could be considered a "good" solution or not.
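As a rough sketch of that post-processing idea, the snippet below collects the URLs found in data-src and data-srcset attributes of <img> tags and makes them absolute. To keep it dependency-free it uses the standard library's html.parser rather than BeautifulSoup; the class and function names are my own, and the base URL is assumed to be the address of the downloaded page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LazyImageCollector(HTMLParser):
    """Collect image URLs from data-src / data-srcset attributes of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if not value:
                continue  # valueless attributes such as uk-img
            if name == "data-src":
                candidates = [value]
            elif name == "data-srcset":
                # Naive split: each comma-separated entry is "URL [descriptor]".
                # URLs that themselves contain commas would need more care.
                candidates = [p.split()[0] for p in value.split(",") if p.strip()]
            else:
                continue
            for candidate in candidates:
                absolute = urljoin(self.base_url, candidate)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def collect_lazy_images(html, base_url):
    """Return the de-duplicated absolute URLs of lazily loaded images."""
    parser = LazyImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

The resulting list could then be written to a text file, one URL per line, and passed to wget with -i to fetch the missing images in a second run.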