How can I get all the plain text from a website with Scrapy?
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this?
The easiest option would be to extract //body//text() and join everything found:
''.join(sel.xpath("//body//text()").extract()).strip()
where sel is a Selector instance. (Older Scrapy code spelled this sel.select(...); select() is the deprecated name for xpath().)
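Another option is to use nltk's clean_html() (available in NLTK 2.x only; from NLTK 3.0 on, calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead):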
>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
...
... <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
...
... </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"
Another option is to use BeautifulSoup's get_text():
get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
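get_text() also accepts separator and strip arguments, which help when text from neighbouring tags would otherwise run together. The following is a small sketch under assumptions: the helper name soup_text and the sample markup are hypothetical, and removing script and style tags first is an extra step that the answer above does not take.

```python
from bs4 import BeautifulSoup

# Hypothetical sample fragment, used only for illustration.
SAMPLE = (
    "<div><h1>Title</h1>"
    "<script>var x = 1;</script>"
    "<p>Some <b>bold</b> text</p></div>"
)

def soup_text(html):
    """Return the text of an HTML string without script/style contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code rather than page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip=True trims each text piece; separator puts a space between pieces.
    return soup.get_text(separator=" ", strip=True)

print(soup_text(SAMPLE))
```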
Another option is to use lxml.html's text_content():
.text_content()
Returns the text content of the element, including the text content of its children, with no markup.
>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
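text_content() concatenates text exactly as it appears, so it includes script and style contents and adds no separator between elements. A possible variation, assuming that dropping those elements and collapsing whitespace is acceptable for your pages, is sketched here; lxml_text and the sample markup are hypothetical names for illustration.

```python
import lxml.html

# Hypothetical sample fragment; the newlines between tags are what later
# become the spaces between blocks of text.
SAMPLE = (
    "<div><h1>Title</h1>\n"
    "<script>var x = 1;</script>\n"
    "<p>Some <b>bold</b> text</p></div>"
)

def lxml_text(html):
    """Return whitespace-normalized text without script/style contents."""
    tree = lxml.html.fromstring(html)
    # drop_tree() removes the element and its children but keeps its tail text.
    for node in tree.xpath("//script | //style"):
        node.drop_tree()
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(tree.text_content().split())

print(lxml_text(SAMPLE))
```

If the markup has no whitespace between block elements, their text will still run together, since text_content() itself never inserts any.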