Using Beautiful Soup Python module to replace tags with plain text

Question

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it.

I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let's take the html code below as an example:

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda x: len(x) > 20)

When I use the above code to get at the long text, it breaks at the tags (the identified text starts from 'and hopefully..'). So I tried to replace the tag with plain text as follows:

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.

So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup? If not, what would be the best way to do so?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real-life website and ran into issues that puzzle me.

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
soup = BeautifulSoup(page)

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, 'Stay up to date..' does not have any previous sibling (I did not know how previous sibling worked until I saw Alex's code, and based on my testing it looks like it is looking for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a.nextSibling is None.

Could you please let me know what I am doing wrong?

Recommended answer

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

anchors = soup.findAll('a')
for a in anchors:
  a.previousSibling.replaceWith(a.previousSibling + a.string)

results = soup.findAll(text=lambda x: len(x) > 20)

print results

This emits:

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']

Of course, you'll probably need to take a bit more care, e.g., what if there's no a.string, or if a.previousSibling is None? You'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it's a string. I'm not sure how that plays with your heuristic len(x) > 20, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle; perhaps you'd want to pick up the lot as a "23-character string"? I can't tell because I don't understand the motivation for your heuristic.)
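The corner cases mentioned above could be handled roughly as follows. This is a minimal sketch, not the answer's own code: it assumes Beautiful Soup 4 (`bs4`) on Python 3 rather than the BeautifulSoup 3 import used above, and `merge_anchor_text` and `is_plain_text` are hypothetical helper names. A neighbour is merged only when it is a plain string, which also sidesteps the `'Tag' + 'NavigableString'` TypeError from the question, where the previous sibling was a tag rather than text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def is_plain_text(node):
    """True only for ordinary text nodes (not None, not a Tag, not a Comment)."""
    # Comment, CData, etc. subclass NavigableString, so compare the exact type.
    return type(node) is NavigableString


def merge_anchor_text(soup, tag_name="a"):
    """Replace each <tag_name> with its text, merged into adjacent strings."""
    for tag in soup.find_all(tag_name):
        parts = [tag.get_text()]  # empty string if the tag has no text
        prev, nxt = tag.previous_sibling, tag.next_sibling
        if is_plain_text(prev):
            parts.insert(0, str(prev))
            prev.extract()
        if is_plain_text(nxt):
            parts.append(str(nxt))
            nxt.extract()
        tag.replace_with(NavigableString("".join(parts)))
    return soup


# Assumed sample input: the question's snippet plus a link preceded by a tag.
html = (
    '<div id="abc">some long text goes <a href="/"> here </a> and hopefully '
    'it will get picked up by the parser as content</div>'
    '<p><b>bold</b><a href="/x">no string before this link</a></p>'
)
soup = merge_anchor_text(BeautifulSoup(html, "html.parser"))
results = soup.find_all(string=lambda s: len(s) > 20)
print(results)
```

With this input, the text around the first link ends up as one string, and the second link is replaced by its own text without touching the `<b>` tag before it.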

I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!
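If several inline tags should be folded into the surrounding text, one possible alternative to merging siblings by hand is `Tag.unwrap()` followed by `smooth()`. The following is a hedged sketch, assuming Beautiful Soup 4.8.0 or later (where `smooth()` is available) on Python 3; the tag list and the name `flatten_inline_tags` are illustrative assumptions, not part of the answer.

```python
from bs4 import BeautifulSoup

# Assumed set of inline tags to flatten; adjust to suit the heuristic.
INLINE_TAGS = ["a", "b", "strong", "em", "i"]


def flatten_inline_tags(html, tag_names=INLINE_TAGS):
    """Parse html, drop the given tags but keep their text, merge the strings."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_names):
        tag.unwrap()  # remove the tag itself, leaving its children in place
    soup.smooth()     # consolidate adjacent strings into single text nodes
    return soup


html = (
    '<div id="abc">some <b>long</b> text goes <a href="/">here</a> and '
    '<strong>hopefully</strong> it gets picked up</div>'
)
soup = flatten_inline_tags(html)
long_strings = soup.find_all(string=lambda s: len(s) > 20)
print(long_strings)
```

Whether block-level tags such as `<p>` or `<br>` belong in the list depends on the heuristic, since unwrapping them joins text that was originally in separate paragraphs or lines.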
