Beautiful Soup: remove superscripts
Question
How do I remove the superscripts from all of the text? I have code below that gets all visible text, but the superscripts for footnoting are messing things up. How do I remove them?
For example, in "Active accounts (1),(2)", the "(1),(2)" are visible superscripts.
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests

f_url = 'https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm'

def tag_visible(element):
    # Skip text whose enclosing tag is never rendered on the page
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = requests.get(f_url)
text = text_from_html(html.text)
Recommended answer
The BeautifulSoup function `find_all`, called with `text=True`, returns a list of all the discrete text nodes in the input (`find_all` is the proper function to use in BeautifulSoup 4 and is preferred over `findAll`). The next function, Python's built-in `filter`, goes through this list and removes the items for which its callback routine returns `False`. The callback function tests the name of the tag enclosing each snippet and returns `False` if it's in the not-wanted list, `True` otherwise.
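To make the filtering step concrete, here is a minimal sketch that runs the question's callback over a small inline document. The `SAMPLE_HTML` string is invented for illustration and is not taken from the SEC filing; `string=True` is the newer spelling of the `text=True` argument.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample document, used only to show which text nodes survive.
SAMPLE_HTML = """
<html><head><title>Report</title><style>p {color: red;}</style></head>
<body><p>Active accounts</p><!-- hidden note --><script>var x = 1;</script></body></html>
"""

def tag_visible(element):
    # Reject text whose enclosing tag is in the not-wanted list.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Reject HTML comments.
    if isinstance(element, Comment):
        return False
    return True

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
texts = soup.find_all(string=True)  # every text node, including comments
# Keep only the nodes the callback accepts, dropping whitespace-only strings.
visible = [t.strip() for t in filter(tag_visible, texts) if t.strip()]
print(visible)
```

Only the paragraph text is kept; the title, stylesheet, script, and comment are all rejected by the callback.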
If these superscripts are always indicated by the proper HTML tag `sup`, then you can add it to the not-wanted list in the callback function.
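A minimal sketch of that change, assuming the footnote markers sit directly inside `sup` tags. The `UNWANTED_PARENTS` name and the `sample` string are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Same not-wanted list as in the question, with 'sup' added (hypothetical name).
UNWANTED_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]', 'sup'}

def tag_visible(element):
    # Reject text whose direct parent tag is in the not-wanted list.
    if element.parent.name in UNWANTED_PARENTS:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Skip whitespace-only strings so the join does not produce double spaces.
    return " ".join(t.strip() for t in visible_texts if t.strip())

# Invented sample markup, for illustration only.
sample = '<p>Active accounts<sup>(1),(2)</sup> grew</p>'
print(text_from_html(sample))
```

Note that this only inspects the direct parent of each text node, so text nested deeper (for instance a link inside the `sup`) would still get through. In that case, calling `decompose()` on every `sup` tag before extracting the text may be the simpler route.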
Possible pitfalls are:
- It is assumed that the literal (semantically correct) tag `sup` is used, and not, for example, a class or a span that merely specifies `vertical-align: super;` in its CSS.
- It is assumed that you want to get rid of all elements that are in this superscript tag. If there are exceptions ("the 20th century", with a superscript "th"), you can check the text contents; for example, only remove the element if its contents are all numerical. If there are exceptions to that ("a² = b² + c²"), you will have to check a wider context, or build a whitelist or blacklist of inclusions/exclusions.