How to crawl thousands of pages using scrapy?


Question

I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML code - they are all unique sites. No clean data feed or API is available. I'm hoping to load the captured data into some sort of DB.

Any ideas on how to do this with scrapy if possible?

Answer

If I had to scrape clean data from thousands of sites, with each site having its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach:

  1. Crawler - a scrapy script that crawls these sites with all their subpages (that's the easiest part) and transforms them into plain text (a minimal spider sketch follows this list)
  2. NLP processing - some basic NLP (natural language processing: tokenizing, part-of-speech (POS) tagging, named-entity recognition (NER)) on the plain text
  3. Classification - a classifier that uses the data from step 2 to decide whether a page contains the data we're looking for - either simple rule-based or, if needed, using machine learning. Pages suspected of containing usable data are passed on to the next step:
  4. Extraction - a grammar-based, statistical or machine-learning-based extractor that uses the POS tags and NER tags (and any other domain-specific factors) to extract the data we're looking for
  5. Clean up - some basic matching of duplicate records created in step 4, and perhaps also discarding records that had low confidence scores in steps 2 to 4.
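
Step 1 can be covered with a fairly generic CrawlSpider. The sketch below is only a minimal illustration, not the code from those projects: the seed URL, the output feed file and the XPath used to strip scripts and styles are all assumptions.

    # Step 1 sketch: crawl a site's pages and reduce them to plain text.
    # Assumes Scrapy 2.x; the start URL and the feed file name are placeholders.
    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class PlainTextSpider(CrawlSpider):
        name = "plaintext"
        start_urls = ["https://example.com/"]  # hypothetical seed site
        # Set allowed_domains per site if the crawl should stay on that site.

        # Follow every link that is found and hand each page to parse_page.
        rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

        def parse_page(self, response):
            # Keep only visible text nodes, ignoring <script> and <style> content.
            texts = response.xpath(
                "//body//text()[not(ancestor::script) and not(ancestor::style)]"
            ).getall()
            yield {
                "url": response.url,
                "text": " ".join(" ".join(texts).split()),  # collapse whitespace
            }

    if __name__ == "__main__":
        # A real project would push items into a DB via an item pipeline;
        # here the items simply go to a JSON-lines feed.
        process = CrawlerProcess(settings={
            "FEEDS": {"pages.jsonl": {"format": "jsonlines"}},
        })
        process.crawl(PlainTextSpider)
        process.start()

In practice you would run one crawl per site (or feed the spider a per-site allowed_domains list) and keep the plain text around for the later steps.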

Of course, this goes way beyond building a scrapy scraper and requires deep knowledge and experience in NLP and possibly machine learning.
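
To make steps 2 and 3 a bit more concrete, here is a rough sketch using spaCy. The model name, the keyword list and the relevance rule are illustrative assumptions, not part of the original answer.

    # Steps 2-3 sketch: basic NLP processing plus a simple rule-based classifier.
    # Assumes spaCy and its small English model are installed
    # (pip install spacy && python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    KEYWORDS = {"price", "offer", "contact"}  # hypothetical trigger words

    def analyse(text):
        """Tokenize, POS-tag and run NER on the plain text of one page."""
        doc = nlp(text)
        return {
            "tokens": [(t.text, t.pos_) for t in doc],
            "entities": [(ent.text, ent.label_) for ent in doc.ents],
        }

    def looks_relevant(analysis):
        """Keep pages that mention a keyword and name at least one ORG or PERSON."""
        words = {tok.lower() for tok, _ in analysis["tokens"]}
        labels = {label for _, label in analysis["entities"]}
        return bool(words & KEYWORDS) and bool(labels & {"ORG", "PERSON"})

Pages for which looks_relevant() returns True would be handed to the extraction step; once you have labelled examples, a trained classifier can replace the keyword rule.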

Also, you can't expect to get anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are tuned and trained, such a system will either skip some of the valid data (false negatives) or pick up data where there actually isn't any (false positives) ... or a mix of both.
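
For step 5, a first pass could simply be a confidence threshold plus naive de-duplication; moving the threshold up or down is exactly the false-positive / false-negative trade-off described above. The field names and the 0.6 cut-off below are made up for the example.

    # Step 5 sketch: drop low-confidence records and naive duplicates.
    MIN_CONFIDENCE = 0.6  # raising this trades false positives for false negatives

    def clean(records):
        seen = set()
        kept = []
        for rec in records:
            if rec.get("confidence", 0.0) < MIN_CONFIDENCE:
                continue  # likely a false positive - drop it
            key = rec.get("name", "").strip().lower()  # naive duplicate key
            if key and key in seen:
                continue
            seen.add(key)
            kept.append(rec)
        return kept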

Nonetheless, I hope my answer helps you get a good picture of the situation.

