What is the best practice for writing maintainable web scrapers?


Problem description

I need to implement a few scrapers to crawl some web pages (because the site doesn't have an open API), extract information, and save it to a database. I am currently using Beautiful Soup to write code like this:

import re
# Select the original-price element, then parse the first number out of its text.
discount_price_text = soup.select("#detail-main del.originPrice")[0].string
discount_price = float(re.findall(r'[\d.]+', discount_price_text)[0])

I guess code like this can very easily become invalid when the web page is changed, even slightly. How should I write scrapers less susceptible to these changes, other than writing regression tests to run regularly to catch failures?

In particular, is there any existing "smart scraper" that can make a "best effort guess" even when the original XPath/CSS selector is no longer valid?

Solution

Pages have the potential to change so drastically that building a very "smart" scraper might be pretty difficult; and even if it were possible, the scraper would be somewhat unpredictable, even with fancy techniques like machine learning. It's hard to make a scraper that has both trustworthiness and automated flexibility.

Maintainability is somewhat of an art form centered around how selectors are defined and used.

In the past I have rolled my own "two stage" selectors:

  1. (find) The first stage is highly inflexible and checks the structure of the page leading to the desired element. If the first stage fails, it throws some kind of "page structure changed" error.

  2. (retrieve) The second stage then is somewhat flexible and extracts the data from the desired element on the page.

This allows the scraper to isolate itself from drastic page changes with some level of auto-detection, while still maintaining a level of trustworthy flexibility.
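The two stages described above can be sketched roughly as follows. This is only an illustration, not the exact code behind this answer: the `PageStructureChanged` exception, the `two_stage` helper, and the sample markup are all made-up names, and the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset and needs well-formed markup) stands in for a real HTML parser so that the snippet runs on its own.

```python
import xml.etree.ElementTree as ET


class PageStructureChanged(Exception):
    """Hypothetical error raised when the strict first stage matches nothing."""


def two_stage(root, find_path, retrieve_path):
    # Stage 1 (find): a strict path that encodes the structure we rely on.
    anchors = root.findall(find_path)
    if not anchors:
        raise PageStructureChanged(f"no match for {find_path!r}")
    # Stage 2 (retrieve): a looser path, evaluated relative to each anchor.
    return [el for anchor in anchors for el in anchor.findall(retrieve_path)]


# Made-up, well-formed sample page (an assumption for illustration).
PAGE = """
<html><body>
  <div class="content">
    <div class="deal">
      <div class="tag"><span class="price">19.99</span></div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)
prices = two_stage(
    root,
    "./body/div[@class='content']/div[@class='deal']",  # strict: exact nesting
    ".//span[@class='price']",                          # flexible: any depth
)
price_texts = [el.text for el in prices]
```

With Beautiful Soup the same shape should carry over: run `select` for the first stage, then call `select` or `select_one` on each element it returns for the second stage.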

I have frequently used XPath selectors, and it is really quite surprising, with a little practice, how flexible you can be with a good selector while still being very accurate. I'm sure CSS selectors are similar. This gets easier the more semantic and "flat" the page design is.

A few important questions to answer are:

  1. What do you expect to change on the page?

  2. What do you expect to stay the same on the page?

When answering these questions, the more accurate you can be, the better your selectors can become.

In the end, it's your choice how much risk you want to take and how trustworthy your selectors will be. When both finding and retrieving data on a page, how you craft them makes a big difference. Ideally, it's best to get data from a web API, which hopefully more sources will begin providing.


EDIT: Small example

Using your scenario, where the element you want is at .content > .deal > .tag > .price, the general .content .price selector is very "flexible" regarding page changes; but if, say, a false positive element arises, we may desire to avoid extracting from this new element.

Using two-stage selectors we can specify a less general, more inflexible first stage like .content > .deal, and then a second, more general stage like .price to retrieve the final element using a query relative to the results of the first.
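To make the false-positive case concrete, here is a small sketch comparing the general selector with the two-stage one. The markup, including the `ad` block, is invented for illustration, and `xml.etree.ElementTree` paths are used as a stand-in for the CSS selectors named above (exact class matching only, which is an assumption that real pages may not satisfy).

```python
import xml.etree.ElementTree as ET

# Invented page: a second .price has appeared inside an ad block.
PAGE = """
<div class="content">
  <div class="deal">
    <div class="tag"><span class="price">19.99</span></div>
  </div>
  <div class="ad"><span class="price">4.50</span></div>
</div>
"""
content = ET.fromstring(PAGE)

# General selector, analogous to `.content .price`: any descendant price.
general = [el.text for el in content.findall(".//*[@class='price']")]

# Two-stage: strict `.content > .deal` first, then a relative `.price` lookup.
deals = content.findall("./div[@class='deal']")
two_stage = [
    el.text for deal in deals for el in deal.findall(".//*[@class='price']")
]
```

Here `general` picks up both prices, while `two_stage` only sees the one inside the deal block.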

So why not just use a selector like .content > .deal .price?

For my use, I wanted to be able to detect large page changes without running extra regression tests separately. I realized that rather than one big selector, I could write the first stage to include important page-structure elements. This first stage would fail (or report) if the structural elements no longer exist. Then I could write a second stage to more gracefully retrieve data relative to the results of the first stage.
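One possible way to get the "fail (or report)" behaviour is sketched below, again with hypothetical names (`scrape_price`, the sample pages) and with `xml.etree.ElementTree` standing in for a real HTML parser. The first stage reports a structural change; the second stage tolerates the price moving around or changing tag inside the deal block.

```python
import re
import xml.etree.ElementTree as ET


def scrape_price(page_xml):
    """Return (price, problem); exactly one of the two is None."""
    root = ET.fromstring(page_xml)
    # Stage 1: structural check. A miss means the layout itself changed.
    deal = root.find("./div[@class='deal']")
    if deal is None:
        return None, "page structure changed: .content > .deal not found"
    # Stage 2: tolerant retrieval anywhere inside the deal block.
    node = deal.find(".//*[@class='price']")
    text = (node.text or "") if node is not None else ""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None, "deal block found, but no price inside it"
    return float(match.group()), None


# Invented pages: the current layout, a minor change, and a redesign.
OLD = '<div class="content"><div class="deal"><span class="price">$19.99</span></div></div>'
MOVED = '<div class="content"><div class="deal"><p><b class="price">now 17.50</b></p></div></div>'
REDESIGN = '<div class="content"><section class="offers"><span class="price">$19.99</span></section></div>'

results = [scrape_price(page) for page in (OLD, MOVED, REDESIGN)]
```

The first two pages still yield a price, while the redesigned page yields a "page structure changed" report instead of silently returning data from an unknown layout.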

I shouldn't say that it's a "best" practice, but it has worked well.
