How to know if the website being scraped has changed?


Question

I'm using PHP to scrape a website and collect some data. It's all done without regular expressions: I use PHP's explode() function to locate particular HTML tags instead.

If the structure of the website changes (CSS, HTML), the scraper may collect the wrong data. So the question is: how do I know whether the HTML structure has changed? And how can I detect this before storing anything in my database, so that wrong data is never saved?

Recommended answer

I don't think there is any clean solution when you are scraping a page whose content changes.

I have developed several Python scrapers, and I know how frustrating it can be when a site makes even a subtle change to its layout.

You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).

Another possible approach would be to code some constraints and check them before storing anything to the database.

For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID, or for anything else you scrape that can be recognized as valid.
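
As a rough illustration of such constraints, here is a minimal Python sketch (Python because that is what I write scrapers in; the question itself is about PHP, where `filter_var` with `FILTER_VALIDATE_URL` or `FILTER_VALIDATE_INT` plays a similar role). The function names `is_valid_url`, `is_valid_id` and `validate_record`, and the `url` / `id` record fields, are hypothetical and chosen only for this example. It assumes every scraped value arrives as a string.

```python
from urllib.parse import urlparse


def is_valid_url(value: str) -> bool:
    """Return True if value looks like an absolute http(s) URL."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse raises ValueError on some malformed inputs (e.g. bad IPv6 hosts).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_valid_id(value: str) -> bool:
    """Return True if value is a positive integer written in ASCII digits."""
    text = value.strip()
    return text.isascii() and text.isdigit() and int(text) > 0


def validate_record(record: dict) -> list:
    """Return the names of the fields that failed; an empty list means OK."""
    problems = []
    if not is_valid_url(record.get("url", "")):
        problems.append("url")
    if not is_valid_id(record.get("id", "")):
        problems.append("id")
    return problems


# Hypothetical usage: only store the record when nothing failed.
scraped = {"url": "https://example.com/item/42", "id": "42"}
if not validate_record(scraped):
    pass  # safe to insert into the database here
```

If a layout change makes the scraper pick up the wrong fragment (say, link text instead of the link target), a check like this will usually fail, and a sudden run of failures is a reasonable signal that the page structure has changed. It cannot catch a change that still produces well-formed but wrong values.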

If you are scraping plain text, it will be more difficult to check.
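
Some weak sanity checks are still possible for plain text. The sketch below is only a heuristic, not a real validation: it assumes the text should be non-empty, should fit within a length range you pick yourself, and should not contain leftover HTML tags. The name `looks_like_plain_text` and the default bounds are hypothetical.

```python
import re

# Matches anything that looks like a leftover HTML tag, e.g. "<div>" or "</a>".
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def looks_like_plain_text(value: str, min_len: int = 1, max_len: int = 500) -> bool:
    """Heuristic check: reasonable length and no HTML tags left in the text."""
    text = value.strip()
    if not (min_len <= len(text) <= max_len):
        return False  # empty, or suspiciously long (maybe a whole page block)
    if TAG_PATTERN.search(text):
        return False  # markup leaked through, so the split points probably moved
    return True
```

A value that fails here is very likely wrong, but a value that passes may still be the wrong piece of text, which is why plain text remains the hard case.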
