How can I get all the plain text from a website with Scrapy?

Problem description

I would like to get all the text that is visible on a website after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks!

Solution

The easiest option would be to extract //body//text() and join everything found:

''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

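Another option is to use nltk's clean_html() (this works in NLTK 2.x; NLTK 3 dropped the implementation and tells you to use BeautifulSoup's get_text() instead):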

>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
... 
...         <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
... 
...     </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
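On a full page, get_text() also returns the contents of script and style tags. As a rough sketch (the helper name soup_text and the sample markup are made up for illustration), those tags can be removed first, and get_text() can be asked to strip each fragment and join the pieces with a single space:

```python
from bs4 import BeautifulSoup


def soup_text(html):
    """Hypothetical helper: text of a document without script/style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and everything inside it
    # strip=True trims each text fragment and drops the empty ones.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample markup, used only to exercise the helper.
sample = "<div><p>Hello</p><script>var x = 1;</script><p>world</p></div>"
print(soup_text(sample))
```

The separator argument is a design choice: a space keeps words from neighbouring block elements apart, at the cost of sometimes adding a space before punctuation that follows an inline tag.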

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
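text_content() likewise includes whatever sits inside script and style elements. A minimal sketch of one workaround, assuming the root element is not itself a script or style tag (the helper name lxml_text and the sample markup are made up for illustration):

```python
import lxml.html


def lxml_text(html):
    """Hypothetical helper: text_content() without script/style content."""
    tree = lxml.html.fromstring(html)
    for element in tree.xpath("//script | //style"):
        # drop_tree() removes the element and its children but keeps
        # any text that follows the element (its tail).
        element.drop_tree()
    # Collapse runs of whitespace left over from the markup.
    return " ".join(tree.text_content().split())


# Made-up sample markup, used only to exercise the helper.
sample = "<div><p>Hello </p><script>var x = 1;</script><p>world</p></div>"
print(lxml_text(sample))
```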
