How can I get all the plain text from a website with Scrapy?


Question


I would like to get all the text visible on a website after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this?

Solution

The easiest option would be to extract //body//text() and join everything found:

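''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance. Older Scrapy releases spelled this method select(); current releases use xpath(), and in a spider callback response.xpath() works the same way.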

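Another option is to use nltk's clean_html(). Note that this works only in NLTK 2.x; NLTK 3 dropped the implementation, and calling it raises NotImplementedError with a pointer to BeautifulSoup's get_text():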

>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
... 
...         <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
... 
...     </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
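On a full page, get_text() may also pick up script and style contents, depending on the parser and the bs4 version. A small sketch of one way to make that explicit is shown below, assuming bs4 is installed; the helper name soup_text and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def soup_text(html):
    """Return the text of an HTML document without script/style contents (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <script> and <style> elements from the tree before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " puts a space between text pieces; strip=True trims each piece.
    return soup.get_text(separator=" ", strip=True)


sample = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(soup_text(sample))
```

The separator and strip arguments are part of get_text() itself, so no extra whitespace handling is needed for simple pages.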

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
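text_content() includes the text inside script and style elements as well, so for a whole page it can help to drop those first. The following is a rough sketch under that assumption; lxml_text and the sample markup are illustrative names, not part of lxml.

```python
import lxml.html


def lxml_text(html):
    """Return the text of an HTML document without script/style contents (hypothetical helper)."""
    tree = lxml.html.fromstring(html)
    # drop_tree() removes the element and its children but keeps any tail text.
    for element in tree.xpath("//script | //style"):
        element.drop_tree()
    # Collapse runs of whitespace left over from the markup's indentation.
    return " ".join(tree.text_content().split())


sample = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<style>p {color: red}</style></body></html>"
)
print(lxml_text(sample))
```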
