How can I get all the plain text from a website with Scrapy?
Problem description
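I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with the Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks!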
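The easiest option would be to extract //body//text() and join everything found:
''.join(sel.xpath("//body//text()").extract()).strip()
where sel is a Selector instance (very old Scrapy releases spelled the method sel.select() instead of sel.xpath()).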
Another option is to use nltk's clean_html() (NLTK 2.x only; NLTK 3 dropped the implementation, and calling it raises NotImplementedError with a pointer to BeautifulSoup's get_text()):
>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
...
... <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
...
... </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"
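Another option is to use BeautifulSoup's get_text(). From its documentation:
get_text(): If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())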
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
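A small variation on the get_text() call, offered as a sketch rather than the answer's own method: get_text() also returns the contents of script and style elements, so this version removes them first and uses the separator and strip arguments to tidy whitespace. The sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<div class="post-text">
  <p>Visible <b>text</b> here.</p>
  <script>var tracking = true;</script>
  <style>p { margin: 0; }</style>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop elements whose contents are never rendered as page text.
for tag in soup(["script", "style"]):
    tag.decompose()

# separator=" " keeps words from adjacent tags apart; strip=True trims each piece.
text = soup.get_text(separator=" ", strip=True)
print(text)
```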
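Another option is to use lxml.html's text_content(). From its documentation:
.text_content(): Returns the text content of the element, including the text content of its children, with no markup.
>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())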
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
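If none of the libraries above is available, the same idea (collect the text nodes and join them) can be sketched with only the standard library. This is not part of the original answer; the class VisibleTextParser and the helper html_to_text are hypothetical names, and the parser makes no attempt to judge CSS visibility, it only skips script and style contents.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces, then collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<div><p>I only want <code>the text</code>.</p><script>var x = 1;</script></div>"
print(html_to_text(sample))
```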