Screen scraping: regular expressions or XQuery expressions?

Question

I was answering some quiz questions for an interview, and one question asked how I would do screen scraping: that is, picking content out of a web page, assuming there is no better-structured way to query the information directly (e.g. a web service).

My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy, and I had to search up through the ancestors a fair way before I found an element with an id attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this:

//a[@id="productDetails"]
/following-sibling::table
//h2[contains(child::text(), "Product Details")]
/following-sibling::div
//li
/b[contains(child::text(), "Product Dimensions:")]
/following-sibling::text()

That's a pretty nasty expression, but that's why Amazon provides a web service API. Anyway, it's just one example. The question is not about Amazon; it's about screen scraping.

The interviewer didn't like my solution. He thought it was fragile, because a change to the page design by Amazon could require rewriting the XQuery expression. Debugging an XQuery expression that matches nothing in the page it's applied to is also hard.
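One way to make a non-matching path easier to debug is to evaluate it one step at a time and report the first step that matches nothing. The sketch below is in Python rather than XQuery and uses the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset and requires well-formed markup (real pages usually need a forgiving HTML parser first). The sample page, the step lists, and the helper name `first_failing_step` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a product page.
PAGE = """<html><body>
<div id="productDetails"><table><tr><td>
<ul><li><b>Product Dimensions:</b> 8.5 x 5.4 x 1.2 inches</li></ul>
</td></tr></table></div>
</body></html>"""


def first_failing_step(root, steps):
    """Apply each path step to the nodes matched so far.

    Returns the index of the first step that matches nothing,
    or None if every step matched at least one node.
    """
    nodes = [root]
    for index, step in enumerate(steps):
        nodes = [hit for node in nodes for hit in node.findall(step)]
        if not nodes:
            return index
    return None


root = ET.fromstring(PAGE)

# The third step expects an <h2> that this sample page does not contain.
stale_steps = [".//div[@id='productDetails']", ".//table", ".//h2", ".//li"]
current_steps = [".//div[@id='productDetails']", ".//table", ".//li"]

print(first_failing_step(root, stale_steps))
print(first_failing_step(root, current_steps))
```

Knowing which step stopped matching narrows the search to one level of the hierarchy, although it does not make the expression any less dependent on the page layout.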

I did not disagree with his statements, but I didn't think his solution was any improvement: he thought it was better to use a regular expression, and search for content and markup near the shipping weight. For example, using Perl:

$html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s;

My counter-argument was that this is also susceptible to Amazon changing their HTML code. They could spell HTML tags in capitals (<LI>), add CSS attributes, change <b> to <span>, change the label "Product Dimensions:" to "Dimensions:", or make many other kinds of changes. My point was that regular expressions don't solve the weaknesses he called out in my XQuery solution.

In addition, regular expressions can find false positives unless you add enough context to the expression. They can also unintentionally match content that happens to be inside a comment, an attribute string, or a CDATA section.
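The comment case can be reproduced with a small Python sketch. It applies a Python translation of the Perl pattern above to a made-up page fragment in which an old copy of the list item has been commented out, then compares the result with a tokenizing approach built on the standard library's `html.parser`. The sample markup and the class name `DimensionsParser` are assumptions for illustration, not anything taken from a real Amazon page.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: the first <li> is inside an HTML comment.
PAGE = """
<ul>
<!-- <li><b>Product Dimensions:</b> 1 x 1 x 1 inches</li> -->
<li><b>Product Dimensions:</b> 8.5 x 5.4 x 1.2 inches</li>
</ul>
"""

# Python equivalent of the Perl pattern (the /s flag becomes re.S).
PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.S
)


def regex_first_match(html):
    """Return the first captured value, or None if nothing matches."""
    match = PATTERN.search(html)
    return match.group(1).strip() if match else None


class DimensionsParser(HTMLParser):
    """Collects the text that follows <b>Product Dimensions:</b> in an <li>."""

    def __init__(self):
        super().__init__()
        self.in_b = False      # currently inside a <b> element
        self.capture = False   # currently collecting the value text
        self.label = []
        self.current = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_b = True
            self.label = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_b:
            self.in_b = False
            if "".join(self.label).strip() == "Product Dimensions:":
                self.capture = True
                self.current = []
        elif tag == "li" and self.capture:
            self.values.append("".join(self.current).strip())
            self.capture = False

    def handle_data(self, data):
        if self.in_b:
            self.label.append(data)
        elif self.capture:
            self.current.append(data)

    # handle_comment is not overridden, so comment contents are skipped.


parser = DimensionsParser()
parser.feed(PAGE)
parser.close()

print(regex_first_match(PAGE))  # value taken from the commented-out item
print(parser.values)            # only the live item
```

The regular expression sees the page as flat text, so the commented-out item wins simply because it comes first; the parser reports comments as a separate kind of token and never treats their contents as markup.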

My question is: what technology do you use to do screen scraping? Why did you choose that solution? Is there some compelling reason to use one, or never to use the other? Is there a third choice besides those I showed above?

PS: Assume for the sake of argument that there is no web service API or other more direct way to acquire the desired content.

Recommended answer

I'd use a regular expression, for the reasons the manager gave, plus a few more (more portable, easier for outside programmers to follow, etc.).

Your counter-argument misses the point that his solution is fragile with regard to local changes, while yours is fragile with regard to global changes. Anything that breaks his will probably break yours, but not vice versa.

Finally, it's a lot easier to build slop/flex into his solution (if, for example, you have to deal with multiple minor variations in the input).
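As a rough illustration of what building in slop might look like, here is a Python sketch of a more tolerant pattern next to a direct translation of the Perl one. It accepts tags in any case, optional attributes, a few alternative wrapper tags, and the shorter "Dimensions:" label. The names `STRICT`, `FLEX`, and `extract_dimensions`, the sample strings, and the particular set of variations tolerated are assumptions chosen for the example.

```python
import re

# Direct translation of the Perl pattern, for comparison.
STRICT = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.S
)

# A more tolerant variant. re.X allows the layout and comments below.
FLEX = re.compile(
    r"""
    <li\b[^>]*>\s*                  # <li> in any case, optional attributes
    <(b|span|strong)\b[^>]*>\s*     # label wrapper: a few plausible tags
    (?:Product\s+)?Dimensions:?\s*  # label with or without "Product"
    </\1\s*>\s*                     # closing tag that matches the opener
    (.*?)                           # the value itself
    \s*</li\s*>
    """,
    re.I | re.S | re.X,
)


def extract_dimensions(html):
    """Return the dimensions text found by FLEX, or None if nothing matches."""
    match = FLEX.search(html)
    return match.group(2).strip() if match else None


ORIGINAL = "<li><b>Product Dimensions:</b> 8.5 x 5.4 x 1.2 inches</li>"
VARIANT = '<LI class="detail"><SPAN>Dimensions:</SPAN> 8.5 x 5.4 x 1.2 inches</LI>'

print(extract_dimensions(ORIGINAL))
print(extract_dimensions(VARIANT))
print(STRICT.search(VARIANT))  # the strict pattern misses the variant
```

The tolerance is still local: `[^>]*` breaks on an attribute value that contains `>`, and any wrapper tag or label outside the listed alternatives needs another edit to the pattern.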
