How to copy text content of a webpage

There is the ogimet webpage for weather data. I believe this data is free to use. The webpage provides the following script for requesting surface observation data:

curl "http://www.ogimet.com/cgi-bin/getsynop?block=123&begin=200912010000&end=200912040000" -o "your_desired_file_name"

I am able to use this data for my area. In addition, I would like to access upper-air observation data. No script has been provided for accessing this data. It can be accessed manually, e.g. for one station, with the following link:

https://www.ogimet.com/display_sond.php?lang=en&lugar=63741&tipo=ALL&ord=DIR&nil=SI&fmt=html&ano=2021&mes=09&day=02&hora=19&anof=2021&mesf=09&dayf=03&horaf=19&send=send

This gives me text content as shown in the attached screenshot. I am wondering whether it is possible to copy the text once I arrive at this page, using e.g. a Perl script. Unfortunately I do not have a minimal working example that I could try.


Yes, you can harvest ("scrape") data from web pages like that. Here's a crude roadmap.

Normally you'd get the page -- retrieve from the web server a string with the HTML of that web page -- using a tool like LWP::UserAgent or Mojo::UserAgent, and then parse the HTML to extract the data of interest, using a library like Mojo::DOM or HTML::TreeBuilder.
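
As a minimal sketch of the parsing half, the snippet below feeds a small hand-written HTML string (made up for illustration, not taken from any real page) to Mojo::DOM and pulls the text out of every table. The helper name table_texts is hypothetical; in a real script the string would come from the user agent's response.

```perl
use strict;
use warnings;
use feature qw(say);
use Mojo::DOM;

# Hypothetical helper: return the text content of every <table> in an HTML string.
sub table_texts {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    # find() takes a CSS selector; all_text collects the text of an element
    # and of all its descendants
    return $dom->find('table')->map('all_text')->each;
}

# Made-up HTML standing in for what the server would return
my $html = <<'HTML';
<html><body>
<table><tr><td>Station 63741</td></tr></table>
<table><tr><td><pre>TTAA 52231 63741</pre></td></tr></table>
</body></html>
HTML

my @tables = table_texts($html);
say scalar(@tables), " tables found";
say $tables[1];
```

Keeping the parsing in a function that takes a plain string makes it easy to try out on a saved copy of the page before involving the network.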

There are many posts around here on the use of these tools (and of other tools as well). Here is a rounded example with Mojo::DOM in a Perl.com article.

If that web page uses JavaScript for displaying data of interest to you then that's a different game. It means that the HTML downloaded from the server to your browser also contains JavaScript code -- programs -- which can run right in the browser. They get triggered when you click on (or hover, etc) elements of a page and rework the page without having to go back to the server.

This is a very (over-)simplified explanation, but the point is that the libraries need to understand the JavaScript in order to hand you that last page for parsing, otherwise you'd only get HTML that last came from the server. But the main libraries linked above don't know any JavaScript; they just go to the server with HTTP and hand you what the server returns.

For a tool that understands JavaScript I'd recommend Selenium. It is meant for testing webpages but is perfectly suitable for this job as well, since it drives a real browser, which runs the page's JavaScript. One way to use it in Perl is with Selenium::Chrome (or ::Firefox), and Selenium::Remote::Driver.
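
A rough sketch of the Selenium route, assuming Chrome and the chromedriver binary are installed and on the path. The helper name rendered_html is hypothetical. It opens the page in a real browser, takes the HTML after the browser has run the page's scripts, and shuts the browser down again; the returned string can then go to Mojo::DOM just like HTML fetched over plain HTTP.

```perl
use strict;
use warnings;

# Hypothetical helper: load a page in a real Chrome browser and return the
# HTML as it stands after the browser has run the page's JavaScript.
sub rendered_html {
    my ($url) = @_;
    # Loaded here rather than with "use" so that the script still compiles
    # on a machine where Selenium::Chrome is not installed
    require Selenium::Chrome;
    my $driver = Selenium::Chrome->new;    # starts chromedriver
    $driver->get($url);                    # navigate to the page
    my $html = $driver->get_page_source;   # HTML of the page as the browser sees it
    $driver->shutdown_binary;              # stop the browser and the driver
    return $html;
}
```

It would be called as my $html = rendered_html($url); pages that fill in their data some time after loading may also need a wait before the source is read, which this sketch does not attempt.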


Here is an example that fetches the page with LWP::UserAgent and parses it with Mojo::DOM:

use feature qw(say);
use strict;
use warnings;
use LWP::UserAgent;
use Mojo::DOM;

# Fetch the page over HTTP
my $ua = LWP::UserAgent->new();
my $url = 'https://www.ogimet.com/display_sond.php?' .
  'lang=en&lugar=63741&tipo=ALL&ord=DIR&nil=SI&fmt=html' .
  '&ano=2021&mes=09&day=02&hora=19&anof=2021&mesf=09&dayf=03&horaf=19&send=send';
my $res = $ua->get( $url );
if (!$res->is_success) {
    die $res->status_line;
}
# decoded_content returns characters rather than raw bytes, which is what Mojo::DOM expects
my $html = $res->decoded_content;
my $dom = Mojo::DOM->new($html);
# Collect the text of every <table> on the page
my @tables_raw_txt = $dom->find('table')->map('all_text')->each;
# On this page the second table holds the station header and the third the reports
say $tables_raw_txt[1];
say "--------------- TABLE DATA --------------\n";
say $tables_raw_txt[2];

Output:

63741, Nairobi / Dagoretti (Kenya) 
ICAO index: HKNC. Latitude 01-18S. Longitude 036-45E. Altitude 1798 m.

--------------- TABLE DATA --------------

 TEMP/PILOT from 63741, Nairobi / Dagoretti (Kenya)
TTAA
02/09/2021 23:00->
TTAA 52231 63741 99822 14818 14003 70132 07008 20507 50584 03975
     01508 40756 15970 09018 30967 31160 21008 25094 41557 23506
     20241 53358 22511 15421 66958 14023 10658 76556 09013 88104
     76957 13519 77999 31313 47708 82323=

TTBB
02/09/2021 23:00->
TTBB 52238 63741 00822 14818 11800 12407 22793 12006 33661 04001
     44645 02800 55614 02009 66589 01445 77568 00717 88552 00960
     99526 02366 11513 04160 22500 03975 33450 08773 44389 17571
     55380 17978 66349 21969 77300 31160 88284 33964 99255 40360
     11200 53358 22142 69558 33128 73757 44114 74556 55104 76957
     21212 00822 14003 11569 31016 22522 01503 33398 09019 44359
     08541 55277 24503 66211 25513 77134 12035 88108 15526 31313
     47708 82323 41414 7543/=

TTCC
02/09/2021 23:00->
TTCC 52237 63741 70867 72760 32520 88999 77999 31313 47708 82323=

TTDD
02/09/2021 23:00->
TTDD 5223/ 63741 11877 74556 22831 68956 33700 72760 21212 11908
     07010 22729 30529 33652 00000 44543 20020 31313 47708 82323=
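
To request other stations or dates, the query string of the URL above could be assembled from parameters instead of being hard-coded. The sketch below does that with the URI module. The helper name sounding_url is hypothetical, and the meaning of the parameters (lugar as the station number, ano/mes/day/hora as the start, anof/mesf/dayf/horaf as the end) is an assumption read off the URL in the question, not taken from any ogimet documentation.

```perl
use strict;
use warnings;
use feature qw(say);
use URI;

# Hypothetical helper: build the display_sond.php URL for one station and
# one time window. $begin and $end are array refs: [year, month, day, hour].
sub sounding_url {
    my ($station, $begin, $end) = @_;
    my $uri = URI->new('https://www.ogimet.com/display_sond.php');
    # An array ref keeps the parameters in the order given
    $uri->query_form([
        lang  => 'en',
        lugar => $station,    # assumed to be the station number
        tipo  => 'ALL',
        ord   => 'DIR',
        nil   => 'SI',
        fmt   => 'html',
        ano   => $begin->[0],
        mes   => sprintf('%02d', $begin->[1]),
        day   => sprintf('%02d', $begin->[2]),
        hora  => sprintf('%02d', $begin->[3]),
        anof  => $end->[0],
        mesf  => sprintf('%02d', $end->[1]),
        dayf  => sprintf('%02d', $end->[2]),
        horaf => sprintf('%02d', $end->[3]),
        send  => 'send',
    ]);
    return $uri->as_string;
}

# Rebuilds the URL used in the example above
say sounding_url(63741, [2021, 9, 2, 19], [2021, 9, 3, 19]);
```

The resulting string can be passed to $ua->get in place of the hard-coded $url, for instance inside a loop over station numbers.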