Extract information from HTML using wget and Perl
I'm trying to write a Perl script that works like a TV guide and displays the show currently playing on certain channels, for example Fox (7.1 WSVNH) and ABC (10.1 WPLGH).
The output I'm trying to achieve would look like this:
7.1 - Hell's Kitchen
10.1 - 20/20
... and so on
(Channel number and current show title)
Here's the site I'm trying to extract HTML from: https://nocable.org/tv-listings/2f46-miami-fl
Here's the command I'm using to execute the script:
wget -O - website | ./script.pl
And here's some of the code I'm working on (note: I'm trying to stick to regular expressions in Perl for pattern matching, as I'm still learning Perl):
#!/usr/bin/perl
while ( <> ) {
    @htmlstring = m/wplgh(.*?)br/i
}
print @htmlstring;
I'm able to extract chunks of HTML, but not what I want, which is the show title. I've also been thinking it might be best to store the show titles in a hash after extracting them from the HTML.
%channel;
$channel{'7.1'} = $showtitle;
$channel{'10.1'} = $showtitle;
First things first: processing HTML using regular expressions is a bad idea. They are inadequate for the job in principle and fraught with trouble in practice. A lot has been written on that.
I understand that you "only" want to pick up the titles, but you have on your hands a full-blown HTML document. Issues will keep creeping in, things will get worse, and there'll be no end to that.
Instead, there are many modules that can parse various types of content for you. For tables, which is what you need, HTML::TableExtract in particular is an excellent tool.
The HTML document can also be retrieved easily from within your script, using any of a number of good modules. I use LWP::Simple below but see the full LWP::UserAgent, or Mojo::UserAgent, for example.
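If you want more control over the download than LWP::Simple gives (a timeout, a readable error message), the LWP::UserAgent route could look roughly like this. It is only a minimal sketch: fetch_page is a helper name I made up, and the 15-second timeout is an arbitrary choice.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (not part of LWP): fetch a URL and return the decoded
# body, or die with the response's status line if the request failed.
sub fetch_page {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 15);   # timeout value is an assumption
    my $res = $ua->get($url);
    die "Can't load $url: " . $res->status_line . "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# Usage: perl fetch.pl https://nocable.org/tv-listings/2f46-miami-fl
if (@ARGV) {
    my $page = fetch_page($ARGV[0]);
    printf "Fetched %d characters\n", length $page;
}
```

Unlike the plain get from LWP::Simple, the response object tells you why a request failed, which helps when the site is down or redirects somewhere unexpected.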
For simplicity, I fetch the first table in the document (which happens to be the right one) and only do basic processing for a demo. I hope that you can take it from there.
use warnings;
use strict;
use feature 'say';
use LWP::Simple;
use HTML::TableExtract;
use open qw(:encoding(UTF-8) :std);
my $url = 'https://nocable.org/tv-listings/2f46-miami-fl';
my $page = get($url) or die "Can't load $url";  # LWP::Simple's get() does not set $!
my $tec = HTML::TableExtract->new();
$tec->parse($page);
foreach my $rowref ($tec->rows)
{
    next if not @$rowref;
    # Clean up undefined/whitespace/newlines, often found in HTML
    my @row = map {
        my $cell = $_ // '';       # keep undefined fields (as '') for formatting
        $cell =~ s/^\s+|\s+$//g;   # strip leading and trailing whitespace
        $cell =~ s/\s+/ /g;        # collapse runs of whitespace, including newlines
        $cell                      # return it
    } @$rowref;
    say join ' | ', @row;
}
Note the undef, white-space, and newline cleaning statements, where an arrayref for each row is "unpacked" into an array. There are other ways to do that but I left it raw to show how it goes once you have to get into HTML details with regex.
I change undefined elements to empty strings in case you want to format the table and align its elements for printing. I add | between elements for easier review. Please adjust to your needs.
The first few rows, also cut off for readability:
All | 11:00 pm (ON AIR) | 11:30 pm | 12:00 am | 12:30 am | 1:00 am ...
WPBT2HD 2.1 | Celtic Woman: Ancient Land 11:00 pm | | | | Retire Safe ...
WPBT2-2 2.2 | Globe Trekker Delhi & Agra10:30 pm | Lidia's Kitchen ...
...
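To get from these rows to the hash you had in mind (channel number => current show), something along these lines may work. It is a sketch under assumptions taken from the sample output above: the first cell ends with the channel number, and the second cell is the show on air now with its start time stuck to the end of the title. current_shows is a name I made up, and the rows fed to it here are copied from the output above, not fetched live.

```perl
use strict;
use warnings;

# Hypothetical helper: map channel number => title of the show on air now.
# Assumes each row is an arrayref of cleaned-up cells where the first cell
# looks like "CALLSIGN N.N" and the second like "Title HH:MM am/pm".
sub current_shows {
    my @rows = @_;
    my %channel;
    for my $row (@rows) {
        my ($station, $on_air) = @$row;
        next unless defined $station and defined $on_air;
        # The channel number is the trailing "digits.digits" of the first
        # cell; rows without one (such as the header row) are skipped
        my ($number) = $station =~ /(\d+\.\d+)\s*$/ or next;
        # Drop the start time appended to the title, if there is one
        (my $title = $on_air) =~ s/\s*\d{1,2}:\d{2}\s*[ap]m\s*$//i;
        $channel{$number} = $title;
    }
    return \%channel;
}

# Rows as they appear in the sample output (truncated)
my $channel = current_shows(
    [ 'All', '11:00 pm (ON AIR)', '11:30 pm' ],
    [ 'WPBT2HD 2.1', 'Celtic Woman: Ancient Land 11:00 pm', '', '', '', 'Retire Safe' ],
    [ 'WPBT2-2 2.2', 'Globe Trekker Delhi & Agra10:30 pm', "Lidia's Kitchen" ],
);

print "$_ - $channel->{$_}\n" for sort { $a <=> $b } keys %$channel;
# prints:
# 2.1 - Celtic Woman: Ancient Land
# 2.2 - Globe Trekker Delhi & Agra
```

In the real script you would push a reference to each cleaned-up row onto an array inside the loop and pass that array to the helper; after that, $channel->{'7.1'} and $channel->{'10.1'} hold the titles you want to print. Note that a regex is still used here, but only on small, already-extracted text cells and not on the HTML itself.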