How can I get all the plain text from a website with Scrapy?

Posted: 2021/12/17 13:26:12. Tags: python, html, xpath, web-scraping, scrapy


Question


I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this?

Solution

The easiest option would be to extract //body//text() and join everything found:

''.join(sel.xpath("//body//text()").getall()).strip()

where sel is a Selector instance.

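Another option, on NLTK 2.x only, is nltk's clean_html(). In NLTK 3 and later the function is no longer implemented and raises NotImplementedError, pointing to BeautifulSoup's get_text() instead: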

>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
... 
...         <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
... 
...     </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
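
For a whole page rather than a fragment, a slightly fuller sketch with BeautifulSoup might look like the following. The sample markup is invented for illustration. Whether get_text() returns script and style contents can depend on the installed bs4 version and parser, so the sketch removes those elements explicitly before extracting the text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only to make the example self-contained.
html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <h1>Heading</h1>
  <p>First <b>bold</b> paragraph.</p>
  <script>var hidden = 1;</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove elements whose contents are never shown to the reader.
for tag in soup(["script", "style"]):
    tag.decompose()

# separator=" " keeps text from adjacent elements apart;
# strip=True trims each piece and drops whitespace-only pieces.
text = soup.body.get_text(separator=" ", strip=True)
print(text)
```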

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
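
On a full page, text_content() returns every text node under the element, including the contents of <script> and <style>, with the source whitespace intact. A minimal sketch of one way to deal with both, using an invented sample page:

```python
import lxml.html
from lxml import etree

# Hypothetical sample page, used only to make the example self-contained.
html = """<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <h1>Heading</h1>
  <p>First <b>bold</b> paragraph.</p>
  <script>var hidden = 1;</script>
</body></html>
"""

tree = lxml.html.fromstring(html)

# Drop <script> and <style> elements; with_tail=False keeps any text
# that follows the removed element.
etree.strip_elements(tree, "script", "style", with_tail=False)

# text_content() preserves the original whitespace, so collapse it afterwards.
text = " ".join(tree.body.text_content().split())
print(text)
```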
