Jump to content

Html::tableExtract: how to optimize the CSV-Output?


dilbertone

Recommended Posts

hello good day dear community,

 

 

i like this place. It is a great place for idea and knowlege sharing! But by far the most impressive thing i learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many many great folks.

 

i have a little parser that parses a site - with 6150 records.  But i need to have this in a CSV-formate. First of all see here the  target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

 

i need all the data - with separation in the filed of

 

    number
    schoolnumber
    school-name
    Adress
    Street 
    Postal Code 
     phone
     fax 
    School-type
    website

 

BTW - see here the  target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare!

 

Well - i have a script: i am very interested what you think about this .... not all the fields are gained yet - i need more of them!

 

    #!/usr/bin/perl
    use strict;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    
    my $total_records = 0;
    my $alpha = "x";
    my $results = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $percent = 0;
    
    workDir();
    chdir $processdir;
    processURL();
    print "\nPress <enter> to continue\n";
    <>;
    my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
    processData();
    close $outfile;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
    unlink 'processing.html';
    
    sub processURL() {
    print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
    getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';
    
       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }
    
    sub processData() {
       while ( $range <= $total_records) {
          my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
          getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          foreach my $ts ($te->table_states) {
             foreach my $row ($ts->rows) {
                cleanup(@$row);
    	    # Add a table column delimiter in this case ||
                print $outfile join("||", @$row)."\n";
                }
             }
          $| = 1;  
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
       }
    }
    
    sub cleanup() {
       for ( @_ ) {
          s/\s+/ /g;
       }
    }
    
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }

output:


    1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
    2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
    3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
    4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

With the || being the delimiter.

 

 

My problem is: i need to have more fields - i need to have the following divided:

 

    name: Volksschule Abenberg (Grundschule)
    street: Güssübelstr. 2
    postal-code and town: 91183 Abenberg
    fax and telephone: 09178/215 09178/905060
    type of school: Volksschulen
    website: home.t-online.de/home/vs-abenberg 

 

well - how to add more fields?

This obviously has to be done in this line here, doesn t it!?

 

my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how. I tried out several things - but i dont helped. I allways got bad results. Btw: i played around - and tried another solution - but here i have good csv-data - but unfortunatley no spider logic...

 

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    
    my $te = new HTML::TableExtract();
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
            #  trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;
    
            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;
    
            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};
    
            #  trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }  

 

Well - with this i tried another solution - but here i have good csv-data - but unfortunatley no spider logic.

How to add the spider-logic here... !?

 

look forward to any and all help!

Link to comment
Share on other sites

hi dear Abracadaver,

 

many many thanks - i am very very happy to hear  from you.

 

Why not try a Perl board?

 

i am pretty sure that this can be done in php as well - and the usage of csv-formatted output is also known in php-fields.. But the best argument is - i am a big  big fan of this site here.

 

And yes - you helped me years and years... your code is a live  time saver..!!!  ;)

[ i know you from  the AutoTheme and i am/was a user of your site from the early beginning in 2003....

 

So i would  be glad if you can help me here...

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.