Beautiful Soup - Get all text, but preserve link html?

Question

I have to convert a large archive of extremely messy HTML, full of extraneous tables, spans and inline styles, into Markdown.

I am trying to use Beautiful Soup for this task. What I want is essentially the output of the get_text() function, except that anchor tags are preserved with their href intact.

As an example, I would like to convert:

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>

into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>

My thought process so far was to grab all the tags and unwrap every one that isn't an anchor. However, this causes the text to be repeated several times, because soup.find_all(True) returns each nested tag as a separate element:

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if tag.name == 'a':
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())

This prints multiple fragments and duplicates as the loop moves down the tree, because each tag's get_text() output already includes the text of all its descendants:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>
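The duplication comes from printing get_text() for every nested tag, not from the unwrapping idea itself. As a minimal sketch (not part of the original post), the tree can be modified in place and serialized once at the end. The helper name unwrap_except_links is hypothetical, and the sketch assumes the built-in "html.parser" and that each <br> should become a newline:

```python
from bs4 import BeautifulSoup


def unwrap_except_links(html):
    """Hypothetical helper: drop all markup except <a href="..."> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list, so it is safe to edit the tree while looping.
    for tag in soup.find_all(True):
        if tag.name == "a":
            # Keep the anchor, but discard every attribute other than href.
            tag.attrs = {k: v for k, v in tag.attrs.items() if k == "href"}
        elif tag.name == "br":
            # Assumption: a line break in the HTML maps to a newline in the text.
            tag.replace_with("\n")
        else:
            # Remove the tag itself but keep its children in place.
            tag.unwrap()
    return str(soup)


example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank" '
    'style="color: #395c99;">Google</a></span></td>'
)
result = unwrap_except_links(example_html)
print(result)
```

Because the result is serialized with str(soup), special characters in the text are written as HTML entities (for example, & becomes &amp;). Note that "Hello" and "World" are joined without a space, since the source HTML has no whitespace between the two spans.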

Recommended answer

One possible way to tackle this problem is to introduce some special handling for <a> elements when the text of an element is collected.

You can do this by overriding the _all_strings() method so that it yields the string representation of each <a> descendant and skips the navigable strings directly inside an <a> element. Something along these lines:

from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

Demo:

In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """

In [2]: soup = MyBeautifulSoup(data, "lxml")

In [3]: print(soup.get_text())

HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
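
Two caveats about the demo above: the link keeps its style and target attributes, and _all_strings() is a private method, so its signature and defaults may differ between Beautiful Soup releases. As a hedged alternative (not from the original answer), the same idea can be sketched as a standalone recursive function that only uses public attributes. The name text_with_links is hypothetical, and the sketch assumes that only href should survive and that <br> maps to a newline:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def text_with_links(node):
    """Hypothetical helper: like get_text(), but re-emits <a> tags with href only."""
    parts = []
    for child in node.children:
        if isinstance(child, Tag):
            if child.name == "a":
                # Rebuild the anchor from its href and its flattened text.
                # Note: no HTML escaping is applied to href or the link text.
                href = child.get("href", "")
                parts.append('<a href="{}">{}</a>'.format(href, child.get_text()))
            elif child.name == "br":
                # Assumption: treat <br> as a line break in the output.
                parts.append("\n")
            else:
                # Any other tag contributes only the text of its children.
                parts.append(text_with_links(child))
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            parts.append(str(child))
    return "".join(parts)


html = (
    '<td><span>Example <b>Link</b>: '
    '<a href="https://google.com" target="_blank"><b>Goo</b>gle</a></span>'
    '<br><span>Foo</span></td>'
)
soup = BeautifulSoup(html, "html.parser")
result = text_with_links(soup)
print(result)
```

Unlike the override, this version flattens any markup nested inside the link (such as the <b> above) into plain link text, so that text is not emitted twice.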
