how to fetch a page with a parser [live demo]


dilbertone

Good evening, dear community!

At the moment I am debugging some lines of code.

Purpose: I want to process multiple web pages, much like a web spider/crawler would. I have some pieces working, but now I need better spider logic. See the target URL: http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page has more than 6000 results. How do I get all of them? I use the module LWP::Simple, and I need a way to request every result page so that I end up with all 6150 records.

Attempt: here are the first 5 page URLs:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:

#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  
  
my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  
  
my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   
   
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
$html =~ tr/r//d;     # strip the carriage returns  
$html =~ s/ / /g; # expand the spaces  
  
my $te = new HTML::TableExtract();  
$te->parse($html);  
  
my $csv = Text::CSV->new({ binary => 1 });  
  
foreach my $ts ($te->table_states) {  
	foreach my $row ($ts->rows) {  
			#trim leading/trailing whitespace from base fields  
		s/^s+//, s/\s+$// for @$row;  

		#load the fields into the hash using a "hash slice"  
		my %h;  
		@h{@cols} = @$row;  
  
		#derive some fields from base fields, again using a hash slice  
		@h{qw/name street postal town/} = split /n+/, $h{name};  
		@h{qw/phone fax/} = split /n+/, $h{phone};  
  
		#trim leading/trailing whitespace from derived fields  
		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  
  
		$csv->combine(@h{@fields});  
		print $csv->string, "\n";  
	}  
} 
}
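
The paging arithmetic used in the loop above can be checked on its own, without touching the network. This is only a minimal sketch: it assumes the 6150 records and 50 results per page mentioned in the post, and `build_page_urls` is a hypothetical helper name, not part of LWP::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Values taken from the post; adjust if the site reports other numbers.
my $base      = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50';
my $total     = 6150;   # records reported by the search page
my $page_size = 50;     # the "a" parameter in the URL

# Hypothetical helper: returns one URL per result page,
# with the "s" offset running 0, 50, 100, ...
sub build_page_urls {
    my ($base, $total, $page_size) = @_;
    my @urls;
    for (my $s = 0; $s < $total; $s += $page_size) {
        push @urls, "$base&s=$s";
    }
    return @urls;
}

my @urls = build_page_urls($base, $total, $page_size);
printf "%d pages, first: %s, last: %s\n", scalar @urls, $urls[0], $urls[-1];
```

With these numbers the last offset comes out as 6100, which matches `$i_last` in the script.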


I tested the code and got the results below, including the error messages shown on the command line.

By the way, here are lines 57 and 58:

	#trim leading/trailing whitespace from derived fields  
		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

What do you think?

Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
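
One possible reading of this output, offered as a guess rather than a confirmed diagnosis: several patterns in the script seem to have lost their backslashes, perhaps when the code was copied from a rendered web page. As written, `tr/r//d` deletes the letter "r" (which would explain "Schul-numme" and "Schulat"), `split /n+/` splits on the letter "n" (which would explain "Schul" and "ame"), and `s/^s+//` strips a leading letter "s" instead of whitespace. If the split on the name cell returns only two pieces, `postal` and `town` stay undefined, and two substitutions on each of them would give the four warnings per row at line 58. The sketch below shows the difference using core Perl only; the sample cell is made up to resemble the header row (with "Strasse" to stay ASCII), and `trim` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strips leading/trailing whitespace and maps
# undef to '' so that short rows do not trigger warnings.
sub trim {
    my ($s) = @_;
    return '' unless defined $s;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Made-up sample cell: four logical fields packed into one table cell.
my $cell = "Schulname\n    Strasse\n    PLZ\n    Ort";

# Without the backslash the pattern matches the letter "n", not a newline.
my @wrong = split /n+/,  $cell;   # cuts "Schulname" at its "n": 2 pieces
my @right = split /\n+/, $cell;   # cuts at line ends: 4 pieces

# Hash slice as in the original script, but with the guarded trim.
my %h;
@h{qw/name street postal town/} = map { trim($_) } @right;
print join('|', @h{qw/name street postal town/}), "\n";
# prints: Schulname|Strasse|PLZ|Ort
```

The same backslash check would apply to `tr/\r//d` for the carriage returns. It may also be worth testing whether `get()` returned `undef` before parsing, since LWP::Simple returns `undef` when a fetch fails.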
