Scraping multiple items off of a Page into a Neat Row

Problem description

As an example:

I load in the input from a .txt:

Benjamin,Schuvlein,Germany,1912,M,White

I run some code, which I will not post here for brevity, and arrive at this link:

https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ

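  1. I want to scrape multiple items from that page. In the code below, I only scrape one.
  2. I also want each item to be separated by a comma in the output .txt.
  3. And I want the output to be preceded by the input.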

I'm using the following packages in the code:

use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;

Here's the relevant code:

my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";

# Here is the URL, although in practice it is itself scraped using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";

my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error " . $response->code if !$response->is_success;

my $dom_tree = HTML::DOM->new;
$dom_tree->write($response->content);
$dom_tree->close;

# Only one item is extracted so far: the 11th <td> of the first table
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;

Desired Output (from that link) is something like:

Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....

(How much of that output section is scrapable?)

Any help on how to get the link within the link would be much appreciated!

Recommended answer

This is fairly simple to do using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.

I don't know whether you have this web site's permission to extract data this way, but I should draw your attention to the X-Copyright header in the HTTP responses. This approach clearly falls under the heading of programmatic access.

X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.

Am I to expect an email from you? I replied to your first mail but haven't heard since.

use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');
my %data;
for my $item (@results) {
  my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
  $key =~ s/:$//;
  $data{$key} = $val;
}

my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
  'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
  'event place', 'marital status', 'residence in 1935',
  'relationship to head of household (standardized)';

print $record, "\n";

Output

"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
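The question also asks for each output row to be preceded by the original input line. The following is a minimal sketch of one way to do that, not part of the original answer. It assumes the scraped values are already in a hash like %data above; the helper names quote_field and build_row are hypothetical, and the hash contents here are sample values standing in for the scraped page. Unlike the one-liner above, quote_field also doubles any embedded double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a field CSV-style when it contains a comma,
# whitespace, or a double quote; embedded double quotes are doubled.
sub quote_field {
    my ($field) = @_;
    $field = '' unless defined $field;
    return $field unless $field =~ /[,\s"]/;
    (my $escaped = $field) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Hypothetical helper: build one output row with the original input line
# first, followed by the scraped values for the requested labels.
sub build_row {
    my ($input_line, $data, @labels) = @_;
    chomp(my $prefix = $input_line);
    return join ',', $prefix, map { quote_field($data->{$_}) } @labels;
}

# Sample values standing in for the %data hash built from the page.
my %data = (
    'event place'       => 'Queens, New York',
    'marital status'    => 'Married',
    'residence in 1935' => 'Same Place',
);

my $row = build_row(
    "Benjamin,Schuvlein,Germany,1912,M,White\n",
    \%data,
    'event place', 'marital status', 'residence in 1935',
);
print "$row\n";
```

To write to a file instead of the screen, print the row to a filehandle opened on the output .txt, as the question's code does with out2.txt.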
