Is it possible for Scrapy to get plain text from raw HTML data directly instead of using XPath selectors?

Question

For example:

scrapy shell http://scrapy.org/
content = response.xpath('//*[@id="content"]').get()
print(content)

Then I got the following raw HTML:

<div id="content">


    <h2>Welcome to Scrapy</h2>

    <h3>What is Scrapy?</h3>

    <p>Scrapy is a fast high-level screen scraping and web crawling
    framework, used to crawl websites and extract structured data from their
    pages. It can be used for a wide range of purposes, from data mining to
    monitoring and automated testing.</p>

    <h3>Features</h3>

    <dl>

    <dt>Simple</dt><dt>
    </dt><dd>Scrapy was designed with simplicity in mind, by providing the features
    you need without getting in your way</dd>

    <dt>Productive</dt>
    <dd>Just write the rules to extract the data from web pages and let Scrapy
    crawl the entire web site for you</dd>

    <dt>Fast</dt>
    <dd>Scrapy is used in production crawlers to completely scrape more than
    500 retailer sites daily, all in one server</dd>

    <dt>Extensible</dt>
    <dd>Scrapy was designed with extensibility in mind and so it provides
    several mechanisms to plug new code without having to touch the framework
    core

    </dd><dt>Portable, open-source, 100% Python</dt>
    <dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>

    <dt>Batteries included</dt>
    <dd>Scrapy comes with lots of functionality built in. Check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
    section</a> of the documentation for a list of them.</dd>

    <dt>Well-documented &amp; well-tested</dt>
    <dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
    with <a href="http://static.scrapy.org/coverage-report/">very good code
    coverage</a></dd>

    <dt><a href="/community">Healthy community</a></dt>
    <dd>
    1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
    700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
    850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
    200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
    40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
    </dd>

    <dt><a href="/support">Commercial support</a></dt>
    <dd>A few companies provide Scrapy consulting and support</dd>

    <p>Still not sure if Scrapy is what you're looking for?. Check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
    glance</a>.

    </p><h3>Companies using Scrapy</h3>

    <p>Scrapy is being used in large production environments, to crawl
    thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>

    <h3>Where to start?</h3>

    <p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
    then <a href="/download/">download Scrapy</a> and follow the <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.


          </p></dl></div>

But I want plain text like this, directly from Scrapy:

Welcome to Scrapy

What is Scrapy?

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Still not sure if Scrapy is what you're looking for?. Check out Scrapy at a glance.

Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.

Start by reading Scrapy at a glance, then download Scrapy and follow the Tutorial.

I do not want to use XPath selectors to extract the p, h2, h3, and similar tags, because I am crawling a website whose main content is embedded in recursively nested table and tbody elements, and finding those XPaths can be tedious. Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through all of Scrapy's docs but found nothing. This is a sample site that converts raw HTML into plain text: http://beaker.mailchimp.com/html-to-text

Recommended answer

Scrapy doesn't have such functionality built-in. html2text is what you are looking for.

Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

import html2text
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki_spider"
    # Must match the host in start_urls, otherwise follow-up requests are filtered as offsite.
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # response.xpath() replaces the HtmlXPathSelector API of early Scrapy releases.
        sample = response.xpath("//div[@id='mw-content-text']/p[1]").get()

        converter = html2text.HTML2Text()
        converter.ignore_links = True  # keep link text, drop the URLs
        print(converter.handle(sample))  # Python 3 print syntax

This prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
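
If installing an extra package is not an option, a rough alternative (not part of the original answer) is to strip tags with Python's standard-library html.parser module. The sketch below is only an illustration under stated assumptions: the TextExtractor class, the html_to_text helper, and the BLOCK_TAGS set are hypothetical names introduced here, and the result is less polished than what html2text produces.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags should start a new line in the output.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "dt", "dd", "tr", "table"}
# Contents of these tags are never visible text.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            # Collapse source line wraps and indentation to single spaces.
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text(raw_html):
    """Return the visible text of raw_html, one line per block element."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


sample = ('<div id="content"><h2>Welcome to Scrapy</h2>\n'
          '<p>Scrapy is <a href="/doc/">extensively\n documented</a> '
          '&amp; tested.</p><script>var x = 1;</script></div>')
print(html_to_text(sample))
```

Because the parser never looks at the nesting of the tags, deeply nested table and tbody markup like that described in the question is handled the same way as flat markup.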
