Is it possible for Scrapy to get plain text from raw HTML data directly, instead of using XPath selectors?
Question
For example:
scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content
Then I got the following raw HTML:
<div id="content">
<h2>Welcome to Scrapy</h2>
<h3>What is Scrapy?</h3>
<p>Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.</p>
<h3>Features</h3>
<dl>
<dt>Simple</dt><dt>
</dt><dd>Scrapy was designed with simplicity in mind, by providing the features
you need without getting in your way</dd>
<dt>Productive</dt>
<dd>Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you</dd>
<dt>Fast</dt>
<dd>Scrapy is used in production crawlers to completely scrape more than
500 retailer sites daily, all in one server</dd>
<dt>Extensible</dt>
<dd>Scrapy was designed with extensibility in mind and so it provides
several mechanisms to plug new code without having to touch the framework
core
</dd><dt>Portable, open-source, 100% Python</dt>
<dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>
<dt>Batteries included</dt>
<dd>Scrapy comes with lots of functionality built in. Check <a href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
section</a> of the documentation for a list of them.</dd>
<dt>Well-documented & well-tested</dt>
<dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
with <a href="http://static.scrapy.org/coverage-report/">very good code
coverage</a></dd>
<dt><a href="/community">Healthy community</a></dt>
<dd>
1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
200 messages per month on mailing list (<a href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
</dd>
<dt><a href="/support">Commercial support</a></dt>
<dd>A few companies provide Scrapy consulting and support</dd>
<p>Still not sure if Scrapy is what you're looking for?. Check out <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
glance</a>.
</p><h3>Companies using Scrapy</h3>
<p>Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>
<h3>Where to start?</h3>
<p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
then <a href="/download/">download Scrapy</a> and follow the <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.
</p></dl></div>
But what I want is plain text, received directly from Scrapy, like this:
Welcome to Scrapy
What is Scrapy?
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Still not sure if Scrapy is what you're looking for?. Check out Scrapy at a glance.
Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.
Start by reading Scrapy at a glance, then download Scrapy and follow the Tutorial.
I do not want to use any XPath selectors to extract those p, h2, h3, etc. tags, since I am crawling a website whose main content is embedded in recursively nested table and tbody elements. Finding those XPaths can be a tedious task. Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through all of Scrapy's docs, but have found nothing. Here is a sample site that can convert raw HTML into plain text: http://beaker.mailchimp.com/html-to-text
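For reference, here is roughly what I mean by "plain text", sketched with nothing but Python's built-in html.parser module. This is only to illustrate the goal (I would still prefer something built into Scrapy, or a ready-made library):

```python
from html.parser import HTMLParser

# Tags after which a line break is emitted, so block-level text stays separated
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "dt", "dd", "div"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()


print(html_to_text("<h2>Welcome to Scrapy</h2><p>What is <b>Scrapy</b>?</p>"))
# Welcome to Scrapy
# What is Scrapy?
```

This handles the simple cases, but doing whitespace collapsing, entity handling, and edge cases properly is exactly the tedium I want a library to take care of.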
Answer
Scrapy doesn't have such functionality built-in. html2text is what you are looking for.
Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample))
It prints:
**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
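A further option worth noting (my addition, not part of the original answer): Scrapy itself depends on lxml, and lxml's text_content() strips all tags from a subtree in one call. Unlike html2text, though, it inserts no whitespace between block elements, so adjacent headings and paragraphs run together:

```python
import lxml.html

html = ("<div id='content'><h2>Welcome to Scrapy</h2>"
        "<p>What is <b>Scrapy</b>?</p></div>")
doc = lxml.html.fromstring(html)

# text_content() concatenates every text node under the element,
# with all markup removed -- but with no separator between blocks
print(doc.text_content())
# Welcome to ScrapyWhat is Scrapy?
```

So for quick-and-dirty extraction text_content() is enough, but for readable output html2text remains the better fit.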