How can I get all the plain text from a website with Scrapy?

Views: 701 · Published: 2018/6/13 10:53:26 · Tags: python html xpath web-scraping scrapy


Problem description

I would like to get all the text visible on a website after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks!

Solution

The easiest option would be to extract //body//text() and join everything found:

    ''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

Another option is to use nltk's clean_html(). It works only in NLTK 2.x; NLTK 3 dropped the implementation and directs users to BeautifulSoup's get_text() instead.

    >>> import nltk
    >>> html = """
    ... <div class="post-text" itemprop="description">
    ...
    ... <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    ... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
    ...
    ... </div>"""
    >>> nltk.clean_html(html)
    "I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text(): If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html, "html.parser")
    >>> print(soup.get_text().strip())
    I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !

Another option is to use lxml.html's text_content():

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

    >>> import lxml.html
    >>> tree = lxml.html.fromstring(html)
    >>> print(tree.text_content().strip())
    I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
    With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
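
If none of the libraries above is available, roughly the same result can be approximated with Python's standard library alone. This is not one of the options from the original answer; it is a swapped-in sketch built on `html.parser.HTMLParser`, and the names `TextExtractor` and `html_to_text` are hypothetical. It assumes reasonably well-formed markup and, like the options above, knows nothing about CSS visibility.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the result reads as one line of text.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the text content of an HTML string with markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


snippet = "<div><p>Only the <code>text</code>, please.</p><script>var x = 1;</script></div>"
print(html_to_text(snippet))
```

The trade-off is that the standard-library parser is more limited than lxml or BeautifulSoup on malformed pages, so this is mainly useful when adding a dependency is not an option.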
