How to scrape, using LWP and a regex, the date argument to a JavaScript function?

Question

I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')

Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even successfully get one date, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can no longer spot the mistake.

Q1. Why am I not getting a match with the regex below?

Q2. Following on from the above question, how can I scrape all the dates into an array? I was thinking of assuming x dates on the page, looping x times, and assigning the captured group to an array on each pass, but that seems rather clunky.

The problem code is below.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc

Recommended answer

Why do you have two whitespace characters in your pattern?

$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/;
                                                 ^^^^^

They are not in your example format, dateFormat('2012/02/07'), where the closing quote follows the day directly.

I would say this is the reason why your pattern does not match.
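
As a quick check of that diagnosis, here is a minimal sketch that applies the question's pattern without the \s{2} to a sample string. The sample string is an assumption standing in for the fetched page, and the match operator is used in place of the original substitution so the captured date is easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the page content returned by get().
my $content = "<script>dateFormat('2012/02/07');</script>";

# Same pattern as in the question, minus the stray \s{2}.
# Using m{...} as the delimiter avoids having to escape the slashes.
if ( $content =~ m{dateFormat\('(\d{4}/\d{2}/\d{2})'\)} ) {
    print "Matched: $1\n";    # prints "Matched: 2012/02/07"
}
else {
    print "No match\n";
}
```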

Capture all the dates

You can simply collect all matches into an array like this:

( my @Result ) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that the literal text dateFormat(' appears immediately before the date pattern, without including that text in the match.

(?='\)) is a positive lookahead assertion. It ensures that the literal text ') appears immediately after the date pattern, again without including it in the match.

The g modifier lets your pattern search for all matches in the string. In list context, as in the assignment above, the match returns every matched date at once.
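
Putting this together, here is a minimal sketch of collecting every date into an array. It assumes a small hypothetical page fragment in place of the real content from get(), and it uses a capture group rather than the lookaround assertions. In list context with the g modifier, a pattern containing a capture group returns the captured text from each match, so the result should be the same list of dates.

```perl
use strict;
use warnings;

# Hypothetical page fragment; the real content would come from LWP::Simple's get().
my $content = <<'HTML';
<td><script>dateFormat('2012/02/07');</script></td>
<td><script>dateFormat('2012/02/06');</script></td>
<td><script>dateFormat('2012/02/03');</script></td>
HTML

# In list context, /g returns the capture from every match in the string.
my @dates = $content =~ m{dateFormat\('(\d{4}/\d{2}/\d{2})'\)}g;

print "$_\n" for @dates;
```

No loop over an assumed number of dates is needed: the array simply holds however many matches the page contains, and it is empty if there are none.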
