Scraping all text using Scrapy without knowing webpages' structure

Question

I am conducting research related to distributing the indexing of the internet.

While several such projects exist (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl, etc.), mine is more focused on incentivising such behavior. I am looking for a simple way to crawl real webpages without knowing anything about their URL or HTML structure, and to:

  1. Extract all of their text (for indexing)
  2. Collect all of their URLs and add them to the set of URLs to be crawled
  3. Avoid crashing on malformed webpages and continue gracefully (even without the scraped text)

To clarify - this is only for a Proof of Concept (PoC), so I don't mind that it won't scale, that it's slow, etc. I am aiming at scraping most of the text that is presented to the user, in most cases, with or without dynamic content, and with as little "garbage" (functions, tags, keywords, etc.) as possible. A simple, partial solution that works out of the box is preferred over a perfect solution that requires a lot of expertise to deploy.

A secondary issue is storing the (url, extracted text) pairs for indexing (by a different process?), but I think I will be able to figure that out myself with some more digging.
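
One minimal way to persist such pairs is a standard Scrapy item pipeline; the sketch below uses hypothetical class and file names, and Scrapy's built-in feed exports (e.g. passing -o items.jl on the command line) achieve the same thing with no code at all:

import json

class TextDumpPipeline(object):
    # Hypothetical pipeline: writes each scraped item as one JSON object
    # per line, so a separate indexing process can consume the file.
    # Enable it via ITEM_PIPELINES = {'tutorial.pipelines.TextDumpPipeline': 300}

    def open_spider(self, spider):
        self.file = open('scraped_text.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item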

Any advice on how to augment "itsy"'s parse function will be highly appreciated!

import scrapy

from scrapy_1.tutorial.items import WebsiteItem


class FirstSpider(scrapy.Spider):
    name = 'itsy'

    # allowed_domains = ['dmoz.org']

    start_urls = [
        "http://www.stackoverflow.com"
    ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item

Answer

What you are looking for here is the Scrapy CrawlSpider.

CrawlSpider lets you define crawling rules that are followed on every page. It's smart enough to avoid crawling images, documents and other files that are not web resources, and it pretty much does the whole thing for you.

Here's a good example of how your spider might look with CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'crawlspider'
    start_urls = ['http://scrapy.org']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        # anchor text of the link that led here (set by CrawlSpider)
        item['title'] = response.meta['link_text']
        # extracting a basic text body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better, just save the whole source
        item['source'] = response.body
        return item
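
To try it end to end, a run along these lines should work out of the box (assuming the file is saved as crawlspider.py; the output file name is arbitrary):

scrapy runspider crawlspider.py -o items.jl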

This spider will crawl every webpage it can find on the website and log the title, URL and whole text body.
For the text body you might want to extract it in some smarter way (to exclude JavaScript and other unwanted text nodes), but that's an issue of its own to discuss. Actually, for what you are describing you probably want to save the full HTML source rather than text only, since unstructured text is useless for any sort of analytics or indexing.
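
A minimal sketch of what such a smarter extraction could look like, assuming "unwanted" mostly means script and style content; the XPath and the whitespace handling here are illustrative choices, not the only way to do it:

def extract_visible_text(response):
    # Text nodes under <body> that are not inside <script> or <style>.
    texts = response.xpath(
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
    ).extract()
    # Strip whitespace and drop empty fragments.
    return '\n'.join(t.strip() for t in texts if t.strip())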

There's also a bunch of Scrapy settings that can be adjusted for this type of crawling. It's all very nicely described in the Broad Crawls docs page.
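
For example, a settings.py along these lines reflects the kind of tuning that page discusses; the exact values are starting points to experiment with, not requirements:

# Broad-crawl oriented settings (sketch; tune per the Broad Crawls docs)
CONCURRENT_REQUESTS = 100          # many domains, so more parallelism helps
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # DEBUG logging is expensive at this scale
COOKIES_ENABLED = False            # broad crawls rarely need session state
RETRY_ENABLED = False              # skip failed pages instead of retrying
DOWNLOAD_TIMEOUT = 15              # don't wait long on slow hosts
AJAXCRAWL_ENABLED = True           # handle AJAX-crawlable pages
DEPTH_PRIORITY = 1                 # with the FIFO queues below: breadth-first
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'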
