BeautifulSoup Tag Removal

Question

I am looking to parse an HTML table with Python/BeautifulSoup...

This is my first attempt at coding anything in Python, so it's probably not the most efficient.

I grabbed a function from another post here (it works great for the most part), but I am running into a couple of problems.

The code I am running is here:

def strip_tags(html, invalid_tags):
    bs2 = BeautifulSoup(str(html))
    for tag in bs2.findAll(True):
        if tag.name in invalid_tags:
            s = ""      

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)
    return bs2

invalid_tags = ['td','b']

for row in bs.findAll('tr'):
    col = row.findAll('td')

for index,item in enumerate(col):
    t = item.findAll('a')
    for ta in t:
        ta.replaceWithChildren()
        col[index] == item  

for item in col:
    print(strip_tags(item.string, invalid_tags).string)

The raw data table (HTML) looks like this:

<td align="left">11/10</td>
<td>N ARMY</td>
<td>-7.5</td>
<td>NL</td>
<td><b>76-65</b></td>
<td><span style="color:green">W</span></td>
<td><span style="color:green">W</span></td>
<td></td>
<td class="cell4">50.0%</td>
<td class="cell4">76.9%</td>
<td class="cell4">37.5%</td>
<td class="cell5">37.1%</td>
<td class="cell5">90.0%</td>
<td class="cell5">29.4%</td>

When I run the strip_tags function, it works for all the tags except for the second line... 'None' is returned as the output.

If anyone could provide any insight into why this is happening, I would greatly appreciate it.

Edit: wow, thanks for everyone's quick responses. Anyhow, here is what happens when I run the code:


11/10
None
-7.5
NL
76-65
W
W
None
50.0%
76.9%
37.5%
37.1%
90.0%
29.4%

The problem lies around the second line, where it returns 'None' instead of 'N ARMY'. So yes, ideally I would like just the text that is found within the tags.

Recommended answer

If I'm understanding the output you want correctly, you shouldn't need to do any manual removing of tags -- that's why we use BeautifulSoup! ;)

What you need to call is the get_text() method on the tag instances that find_all() returns.

Using your sample HTML:

<table>
    <tr>
        <td align="left">11/10</td>
        <td>N ARMY</td>
        <td>-7.5</td>
        <td>NL</td>
        <td><b>76-65</b></td>
        <td><span style="color:green">W</span></td>
        <td><span style="color:green">W</span></td>
        <td></td>
        <td class="cell4">50.0%</td>
        <td class="cell4">76.9%</td>
        <td class="cell4">37.5%</td>
        <td class="cell5">37.1%</td>
        <td class="cell5">90.0%</td>
        <td class="cell5">29.4%</td>
    </tr>
</table>

A simple iteration over the tds, and a call to get_text() and we're good to go!

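from bs4 import BeautifulSoup

# Local copy of the sample HTML from the question
with open('test.html', 'rb') as html:
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(html.read(), 'html.parser')

# get_text() returns each cell's text with all nested tags stripped
for td in soup.find_all('td'):
    print(td.get_text())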

This gives the output:

11/10
N ARMY
-7.5
NL
76-65
W
W

50.0%
76.9%
37.5%
37.1%
90.0%
29.4%
[Finished in 0.1s]
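
As for why the question's code printed None: a likely cause is the use of .string, which only yields text when a tag has exactly one child (or a single chain of single children) and is None otherwise. The sketch below illustrates the difference on a few cells. The inline HTML is hypothetical: in particular, the nested <a> in the second cell is an assumption about what the real page contained, since the sample above shows plain text there.

```python
from bs4 import BeautifulSoup

# Hypothetical cells for illustration. The <a> inside the second cell is an
# assumption; the sample HTML in the question shows plain text there.
html = """
<table><tr>
  <td align="left">11/10</td>
  <td>N <a href="#">ARMY</a></td>
  <td><b>76-65</b></td>
  <td></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for td in soup.find_all("td"):
    # .string is None when a tag has no children or more than one child;
    # get_text() always returns a string, possibly an empty one.
    results.append((td.string, td.get_text()))

for string_value, text_value in results:
    print(repr(string_value), repr(text_value))
```

Here .string is None for the mixed-content cell and for the empty cell, while get_text() returns the full text and an empty string respectively, which matches the two None lines in the question's output.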
