What is the best practice for writing maintainable web scrapers?


Question

I need to implement a few scrapers to crawl some web pages (because the site doesn't have an open API), extract information, and save it to a database. I am currently using Beautiful Soup to write code like this:

discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])

I guess code like this can very easily become invalid when the web page is changed, even slightly. How should I write scrapers that are less susceptible to these changes, other than writing regression tests to run regularly to catch failures?

In particular, is there any existing "smart scraper" that can make a "best effort guess" even when the original XPath/CSS selector is no longer valid?

Answer

Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and where possible at all, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning et cetera. It's hard to make a scraper that has both trustworthiness and automated flexibility.

Maintainability is somewhat of an art form centered around how selectors are defined and used.

In the past I have rolled my own "two stage" selectors:


  1. (find) The first stage is highly inflexible and checks the structure of the page toward a desired element. If the first stage fails, it throws some kind of "page structure changed" error.

  2. (retrieve) The second stage is then somewhat flexible and extracts the data from the desired element on the page (see the sketch after this list).
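A minimal sketch of such a two-stage selector in Python with Beautiful Soup (the helper and error names here are my own illustration, not an established API):

from bs4 import BeautifulSoup

class PageStructureChangedError(Exception):
    """Raised when the strict first (find) stage no longer matches."""

def two_stage_select(soup: BeautifulSoup, structure_css: str, data_css: str):
    # Stage 1 (find): inflexible check of the expected page structure.
    container = soup.select_one(structure_css)
    if container is None:
        raise PageStructureChangedError("structure selector failed: %s" % structure_css)
    # Stage 2 (retrieve): flexible extraction relative to stage 1's result.
    element = container.select_one(data_css)
    if element is None:
        raise PageStructureChangedError("data selector failed: %s" % data_css)
    return element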

This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.

I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.
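As a hedged illustration of "flexible yet accurate" XPath (the markup below is made up, not from the question's site), using lxml:

from lxml import html

doc = html.fromstring(
    '<div class="content">'
    '<div class="deal"><span class="tag"><del class="price">$19.99</del></span></div>'
    '</div>'
)

# Brittle: tied to the exact nesting, so any new wrapper element breaks it.
strict = doc.xpath("./div[@class='deal']/span[@class='tag']/del[@class='price']/text()")

# Flexible yet accurate: anchored on semantic class names, tolerant of
# extra wrapper elements appearing in between.
flexible = doc.xpath(".//*[contains(@class, 'deal')]//*[contains(@class, 'price')]/text()")

assert strict == flexible == ['$19.99']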

A few important questions to answer are:


  1. What do you expect to change on the page?

  2. What do you expect to stay the same on the page?

When answering these questions, the more accurate you can be, the better your selectors can become.

In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be when both finding and retrieving data on a page; how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.

Edit: small example

Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false-positive element arises, we may want to avoid extracting from this new element.

Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
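With the hypothetical two_stage_select helper sketched above, and assuming soup is the parsed page from the question, that reads as (the price parsing mirrors the question's code):

import re

# Stage 1 (.content > .deal) is strict; stage 2 (.price) is a flexible
# query relative to the stage-1 container.
price_el = two_stage_select(soup, ".content > .deal", ".price")
price = float(re.findall(r"[\d.]+", price_el.get_text(strip=True))[0])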

So why not just use a single selector like .content > .deal .price?

For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
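Under the same assumptions as the earlier sketch, the "fail (or report)" behaviour might look like:

import logging

logger = logging.getLogger(__name__)

try:
    price_el = two_stage_select(soup, ".content > .deal", ".price")
except PageStructureChangedError as exc:
    # Surface the structural change instead of silently extracting
    # from the wrong element; no separate regression-test run needed.
    logger.warning("scraper needs maintenance: %s", exc)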

I shouldn't say this is a "best" practice, but it has worked well.

