How can I use Perl's LWP::UserAgent to fetch the same URL with different query strings?


Problem description

I have a running LWP::UserAgent that should be applied to the following URL:

http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503

The same request has to run against many similar targets that differ only in the query string; see the following endings:

html?show_school=5503
html?show_school=9002
html?show_school=5512

I want to do this with LWP::UserAgent:

for my $i (0..10000) {
  $ua->get(' [here the URL should be applied] ', id => 21, extern_uid => $i);
  # process reply
}

In any case, a loop like this seems to be the way to do this kind of job. I guess LWP's API does not aim to replace core Perl functionality, so I can use an ordinary Perl loop to query multiple URLs.
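As a minimal sketch of building one URL per school ID in an ordinary Perl loop: it assumes the URI module (normally installed alongside LWP) and uses the show_school parameter and the IDs from the example endings above. The helper name school_url is hypothetical. Note that the extra arguments passed to $ua->get after the URL are treated as request header fields, not as query parameters, so the query string has to be part of the URL itself.

```perl
use strict;
use warnings;
use URI;

# Base URL from the question; only the show_school value changes per request.
my $base = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html';

# Hypothetical helper: return the full URL for one school ID.
sub school_url {
    my ($id) = @_;
    my $uri = URI->new($base);
    $uri->query_form(show_school => $id);   # sets and escapes the query string
    return $uri->as_string;
}

my @urls = map { school_url($_) } (5503, 9002, 5512);
print "$_\n" for @urls;
```

Each resulting string can then be handed to $ua->get($url) inside the loop.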

The code that does not run yet, because the loop still has to be applied:

#use strict;

use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools
my ($url = '[here the url should be applied] =',id);

for my $id (0..10000) {
  $ua->get(' [here the url should be applied ] ', id => 21, extern_uid => $i);
  # process reply
}  

#my $request = POST $url,
#                 [
#         Schulsuche=> "Ergebnisse anzeigen",
#         order => "schule_ort",
#         schulname => undef, 
#         schulort => undef, 
#         typid => "11",
#         verbinder => "AND"
#                 ];

my $ua = LWP::UserAgent->new;
print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get("[url]=> &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='contentitem']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}


Update on Sunday, October 25: I have applied the advice from OmnipotentEntity.

#!/usr/bin/perl -W

use strict;
use warnings;         # give out some warnings if something does not run well
use diagnostics;      # tell me when something is wrong 
use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools

my $ua = LWP::UserAgent->new;

$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); 

#pretending to be firefox on linux.

for my $i (0..10000) {
  my $request = HTTP::Request->new(GET => sprintf(" here to put the URL into =%d", $i));
  $request->header('Accept' => 'text/html');
  my $response = $ua->request($request);
  if ($response->is_success) {
    $pagecontent = $response -> content;
  }
# now we can do whatever with the $pagecontent

}
my $request = POST $url,
[
          order => "schule_ort",
          schulname => undef, 
          Basisdaten => undef,        
          Profil  => undef, 
          Schulort => undef, 
          typid => "11",
          Fax  => 
          Homepage  => undef, 
          verbinder => "AND"

];

print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get(" here to put the URL into => &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='floatbox']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}

I want to loop over the results, so I tried to apply the corresponding URLs, but I got a bunch of errors:


suse-linux:/usr/perl # perl perl_mecha_example_two.pl
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors (#1)
    (F) You've said "use strict" or "use strict vars", which indicates 
    that all variables must either be lexically scoped (using "my" or "state"), 
    declared beforehand using "our", or explicitly qualified to say 
    which package the global variable is in (using "::").

Uncaught exception from user code:
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors.
at perl_mecha_example_two.pl line 86

Now the debugging part. What do I need to change? How do I apply the URLs correctly?

Under use strict, a variable cannot be used before it is declared. The usual fix is to prepend my, e.g. my $url and my $pagecontent, at the variable's first appearance.
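A small sketch of that fix, assuming nothing beyond core Perl: fake_fetch is a hypothetical stand-in for the HTTP request so the example runs without a network connection. The point is only where the my declaration goes, so that $pagecontent is in scope for the rest of each loop iteration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $response->content: odd IDs "succeed", even IDs fail.
sub fake_fetch {
    my ($i) = @_;
    return $i % 2 ? "page $i" : undef;
}

my @pages;
for my $i (0 .. 3) {
    my $pagecontent;                      # declared before first use, as strict requires
    my $body = fake_fetch($i);
    if (defined $body) {
        $pagecontent = $body;             # assignment only, the declaration is above
    }
    next unless defined $pagecontent;     # skip IDs that returned nothing
    push @pages, $pagecontent;
}
print scalar(@pages), " pages kept\n";
```

Declaring the variable inside the if block instead would make it invisible to the code after the block, which is why the declaration sits at the top of the loop body.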

Recommended answer

It's pretty simple:

#!/usr/bin/perl -W

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Pretend to be Firefox on Linux.
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");

for my $i (0..10000) {
  my $req = HTTP::Request->new(GET => sprintf("http://path/to/url?=%d", $i));
  $req->header('Accept' => 'text/html');
  my $res = $ua->request($req);
  next unless $res->is_success;      # skip pages that could not be fetched
  my $pagecontent = $res->content;   # declared with my so it compiles under strict
  # Do whatever with the $pagecontent
}

This assumes you want to fetch every page in the range 0..10000. If you only want particular ones, put those numbers in an array and have the for loop walk that array rather than 0..10000.
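A sketch of walking an array of specific IDs, assuming the show_school parameter and the three IDs from the question. The function fetch_all and the idea of passing the fetcher in as a code reference are illustrative choices, not part of LWP; a stub fetcher is used here so the example runs offline.

```perl
use strict;
use warnings;

# IDs and base URL taken from the question.
my @ids  = (5503, 9002, 5512);
my $base = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html';

# Hypothetical helper: $fetch is a code reference that takes a URL and
# returns the page body, or undef on failure.
sub fetch_all {
    my ($fetch, @wanted) = @_;
    my %page_for;
    for my $id (@wanted) {
        my $body = $fetch->("$base?show_school=$id");
        $page_for{$id} = $body if defined $body;   # keep successful fetches only
    }
    return \%page_for;
}

# Real use would wrap LWP::UserAgent, for example:
#   my $ua    = LWP::UserAgent->new;
#   my $fetch = sub { my $res = $ua->get(shift);
#                     $res->is_success ? $res->decoded_content : undef };
my $stub  = sub { my ($url) = @_; return "fetched $url" };
my $pages = fetch_all($stub, @ids);
print "$_ => $pages->{$_}\n" for sort keys %$pages;
```

Keeping the HTTP call behind a code reference makes the loop easy to try without hitting the server thousands of times while the parsing code is still being debugged.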
