dilbertone Posted February 24, 2011 Share Posted February 24, 2011 hello good day dear community, i like this place. It is a great place for idea and knowlege sharing! But by far the most impressive thing i learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many many great folks. i have a little parser that parses a site - with 6150 records. But i need to have this in a CSV-formate. First of all see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 i need all the data - with separation in the filed of number schoolnumber school-name Adress Street Postal Code phone fax School-type website BTW - see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare! Well - i have a script: i am very interested what you think about this .... not all the fields are gained yet - i need more of them! #!/usr/bin/perl use strict; use HTML::TableExtract; use LWP::Simple; use Cwd; use POSIX qw(strftime); my $total_records = 0; my $alpha = "x"; my $results = 50; my $range = 0; my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q="; my $processdir = "processing"; my $counter = 50; my $percent = 0; workDir(); chdir $processdir; processURL(); print "\nPress <enter> to continue\n"; <>; my $displaydate = strftime('%Y%m%d%H%M%S', localtime); open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file'; processData(); close $outfile; print "Finished processing $total_records records...\n"; print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n"; unlink 'processing.html'; sub processURL() { print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n"; getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page'; while( <tempfile.html> ) { open( FH, "$_" ) or die; while( <FH> ) { if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) { $total_records = $6; print "Total records to process is $total_records\n"; } } close FH; } unlink 'tempfile.html'; } sub processData() { while ( $range <= $total_records) { my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]); getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page'; $te->parse_file('processing.html'); my ($table) = $te->tables; foreach my $ts ($te->table_states) { foreach my $row ($ts->rows) { cleanup(@$row); # Add a table column delimiter in this case || print $outfile join("||", @$row)."\n"; } } $| = 1; print "Processed records $range to $counter"; print "\r"; $counter = $counter + 50; $range = $range + 50; } } sub cleanup() { for ( @_ ) { s/\s+/ /g; } } sub workDir() { # Use home directory to process data chdir or die "$!"; if ( ! -d $processdir ) { mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!"; } } output: 1||9752||Deutsche Schule Alamogordo USA Alamogorde - New Mexico || ||Deutschsprachige Auslandsschule|| 2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET Alexandrien - Ägypten || ||Begegnungsschule (Auslandsschuldienst)|| 3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503 Altötting ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 4||0006||Max-Reger-Gymnasium Amberg Kaiser-Wilhelm-Ring 7 92224 Amberg ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de With the || being the delimiter. My problem is: i need to have more fields - i need to have the following divided: name: Volksschule Abenberg (Grundschule) street: Güssübelstr. 2 postal-code and town: 91183 Abenberg fax and telephone: 09178/215 09178/905060 type of school: Volksschulen website: home.t-online.de/home/vs-abenberg well - how to add more fields? This obviously has to be done in this line here, doesn t it!? my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]); But how. I tried out several things - but i dont helped. I allways got bad results. Btw: i played around - and tried another solution - but here i have good csv-data - but unfortunatley no spider logic... #!/usr/bin/perl use warnings; use strict; use LWP::Simple; use HTML::TableExtract; use Text::CSV; my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; $html =~ tr/r//d; # strip the carriage returns $html =~ s/ / /g; # expand the spaces my $te = new HTML::TableExtract(); $te->parse($html); my @cols = qw( rownum number name phone type website ); my @fields = qw( rownum number name street postal town phone fax type website ); my $csv = Text::CSV->new({ binary => 1 }); foreach my $ts ($te->table_states) { foreach my $row ($ts->rows) { # trim leading/trailing whitespace from base fields s/^s+//, s/\s+$// for @$row; # load the fields into the hash using a "hash slice" my %h; @h{@cols} = @$row; # derive some fields from base fields, again using a hash slice @h{qw/name street postal town/} = split /n+/, $h{name}; @h{qw/phone fax/} = split /n+/, $h{phone}; # trim leading/trailing whitespace from derived fields s/^s+//, s/\s+$// for @h{qw/name street postal town/}; $csv->combine(@h{@fields}); print $csv->string, "\n"; } } Well - with this i tried another solution - but here i have good csv-data - but unfortunatley no spider logic. How to add the spider-logic here... !? look forward to any and all help! Quote Link to comment Share on other sites More sharing options...
AbraCadaver Posted February 24, 2011 Share Posted February 24, 2011 Why not try a Perl board? Quote Link to comment Share on other sites More sharing options...
dilbertone Posted February 24, 2011 Author Share Posted February 24, 2011 hi dear Abracadaver, many many thanks - i am very very happy to hear from you. Why not try a Perl board? i am pretty sure that this can be done in php as well - and the usage of csv-formatted output is also known in php-fields.. But the best argument is - i am a big big fan of this site here. And yes - you helped me years and years... your code is a live time saver..!!! [ i know you from the AutoTheme and i am/was a user of your site from the early beginning in 2003.... So i would be glad if you can help me here... Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.