Extract all text from arbitrarily nested HTML


Question

I am using Scrapy to extract the text of news articles from news sites. I am assuming that all of the text within <p> tags is the actual article. (That isn't necessarily a safe assumption, but it's what I'm working with.) To find all of the <p> tags, Scrapy lets me use CSS selectors, like so:

response.css("p::text")

The problem is that some news sites like to put a lot of markup in their articles, like so:

<p>
    Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
</p>

Is there a CSS selector, or otherwise some simple way within Scrapy, to extract the text and strip all formatting, so that it results in something like this?

Senator What's-their-name is furious about politics!

The problem is that these tags could, in theory, be arbitrarily nested:

<p>
    <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
</p>

I would still want to extract the text:

Wow this link must be important

I understand that this is a pretty naive way to extract content from an HTML page, but that's outside the scope of this question. If there's a simpler way to accomplish this, I'll take suggestions, but what I've found on this topic seems to be much more complicated than what I've presented here, so I'm only interested in solving the problem as stated.

Recommended answer

In [6]: from scrapy.selector import Selector

In [7]: sel = Selector(text='''<p>
   ...:     Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
   ...: </p>''')

In [9]: sel.xpath('normalize-space(//p)').extract_first()
Out[9]: "Senator What's-their-name is furious about politics!"

Or, for the nested example:

In [10]: sel = Selector(text='''<p>
    ...:     <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
    ...: </p>''')

In [11]: sel.xpath('normalize-space(//p)').extract_first()
Out[11]: 'Wow this link must be important'

Use XPath's string() function to concatenate all the text under a tag. normalize-space() performs the same conversion on its argument before cleaning up the whitespace.

normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space.
