Beautiful Soup: remove superscripts

Question

How do I remove the superscripts from all of the text? The code below gets all visible text, but the superscripts used for footnotes are messing things up. How do I remove them?

For example, in "Active accounts (1),(2)", the "(1),(2)" part is a visible superscript.

from bs4 import BeautifulSoup
from bs4.element import Comment
import requests


f_url='https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm'

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = requests.get(f_url)
text= text_from_html(html.text)

Recommended answer

The BeautifulSoup function find_all returns a list of every node that matches its arguments; called with text=True, as in the question, it returns every text node in the document (find_all is the proper name in BeautifulSoup 4 and is preferred over findAll). The next function, Python's built-in filter, goes through this list and drops the items for which its callback returns False. The callback, tag_visible, looks at the name of the tag that directly contains each text node and returns False if that name is in the not-wanted list (or if the node is a comment), and True otherwise.
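
To make the two steps concrete, here is a minimal sketch of what find_all and filter see on a tiny document. The SAMPLE_HTML string is a made-up example, not part of the SEC filing, and string=True is used as the current spelling of the question's text=True.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Report</title></head>"
    "<body><p>Revenue</p><!-- internal note -->"
    "<script>var x = 1;</script></body></html>"
)


def tag_visible(element):
    # Same checks as in the question: reject text whose direct parent
    # is a non-visible tag, and reject HTML comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: collect text nodes (the title text and the comment are included here).
text_nodes = soup.find_all(string=True)

# Step 2: keep only the nodes for which the callback returns True.
visible = [str(node) for node in filter(tag_visible, text_nodes)]
print(visible)
```

Only the text inside the p tag survives, because its parent is the only one that is neither in the not-wanted list nor a comment.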

If these superscripts are always marked up with the proper HTML tag sup, then you can add 'sup' to the not-wanted list in the callback function.
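
As a rough sketch of that change, the question's two functions could look like this. The NOT_WANTED name and the sample string are introduced here for illustration only, and it is an assumption that the real page marks its footnotes with sup.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# The question's list plus 'sup'. NOT_WANTED is a name made up for this sketch.
NOT_WANTED = ['style', 'script', 'head', 'title', 'meta', '[document]', 'sup']


def tag_visible(element):
    # Reject text nodes whose direct parent is in the not-wanted list.
    if element.parent.name in NOT_WANTED:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)


# Hypothetical fragment resembling the example in the question.
sample = "<p>Active accounts<sup>(1),(2)</sup> grew</p>"
result = text_from_html(sample)
print(result)
```

Note that the callback only looks at the direct parent of each text node, so text nested one level deeper inside a sup (for example inside a b within the sup) would still get through.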

Possible pitfalls are:

  1. It is assumed that the literal (semantically correct) tag sup is used, and not, for example, a class or a span that merely specifies vertical-align: super; in its CSS;
  2. It is assumed that you want to get rid of all elements that are in this superscript tag. If there are exceptions ("the 20th century"), you can check the text contents; for example, only remove the element if its contents are all numerical. If there are exceptions to that as well ("a² = b² + c²"), you will have to check a wider context, or build a whitelist or blacklist of inclusions/exclusions.
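
Both pitfalls can be handled, at least partly, by removing superscript elements from the tree before extracting text. The sketch below is one possible approach, not a tested solution for the SEC page. The two regular expressions, the helper names, and the idea that footnote markers look like "1", "(1)" or "(1),(2)" are all assumptions made for illustration. Only inline style attributes are checked; superscripts styled through a CSS class would need a separate rule.

```python
import re
from bs4 import BeautifulSoup

# Assumption: footnote markers look like "1", "(1)" or "(1),(2)".
FOOTNOTE_RE = re.compile(r"^\(?\d+\)?(\s*,\s*\(?\d+\)?)*$")
# Assumption: non-semantic superscripts carry an inline "vertical-align: super".
SUPER_STYLE_RE = re.compile(r"vertical-align\s*:\s*super", re.IGNORECASE)


def is_superscript(tag):
    # Pitfall 1: accept both the sup tag and inline-styled elements.
    if tag.name == "sup":
        return True
    return bool(SUPER_STYLE_RE.search(tag.get("style", "")))


def strip_footnote_superscripts(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(is_superscript):
        # Pitfall 2: only remove superscripts that look like footnote markers,
        # so that "20<sup>th</sup>" keeps its "th".
        if FOOTNOTE_RE.match(tag.get_text(strip=True)):
            tag.extract()
    return soup.get_text()


# Hypothetical fragments for illustration.
print(strip_footnote_superscripts(
    "<p>Active accounts<sup>(1),(2)</sup> grew in the 20<sup>th</sup> century</p>"
))
print(strip_footnote_superscripts(
    '<p>Total<span style="vertical-align:super;font-size:8px">1</span> due</p>'
))
```

This sketch would still strip the exponents in a formula such as a² = b² + c², since a lone "2" matches the footnote pattern; handling that case needs the wider context check or the inclusion/exclusion list mentioned above.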
