Delete HTML element if it contains a certain amount of numeric characters

Question

To transform an HTML-formatted file into a plain text file with Python, I need to delete every table whose text contains more than 40% numeric characters.

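Specifically, I would like to:

  1. Identify each table element in the HTML file.
  2. Count the numeric and alphabetic characters in the text and compute the corresponding ratio, ignoring any characters inside HTML tags. Hence, strip all HTML tags first.
  3. Delete the table if more than 40% of its text consists of numeric characters. Keep the table if less than 40% of its text consists of numeric characters.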

I defined a function that is called when the re.sub command is run. The rawtext variable contains the whole HTML-formatted text I want to parse. Within the function, I try to carry out the steps described above and return either a tag-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub command within the function seems to delete not only the tags but everything, including the textual content.

def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
            ratio = numeric / (numeric + alphabetic)
            print('ratio = ' + ratio)
    except ZeroDivisionError as err:
            ratio = 1
    if ratio > 0.4:
            emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)  
            return emptystring
    else:
            return table

rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)

If you have an idea of what might be wrong with this code, I would be very happy if you shared it with me. Thank you!

Recommended answer

Thank you for your replies so far!

After intensive research, I found the solution to the mysterious deletion of the whole match. It seemed that the function only considered the first 150 or so characters of the match. However, if you specify table = table.group(0), the whole match is processed. Calling group(0) accounts for the big difference here.
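A likely explanation for this behaviour: re.sub passes a match object, not a string, to the replacement function, so str(table) yields the object's abbreviated representation rather than the matched text. The following is a minimal sketch of that difference; the sample HTML and the callback name `show` are made up for illustration.

```python
import re

# Made-up sample input: a single, fairly long table (illustrative only).
html = "<table><tr><td>" + "some cell text " * 20 + "</td></tr></table>"

seen = {}

def show(match):
    # re.sub hands the callback a match object, not a string.
    seen["as_str"] = str(match)       # abbreviated repr of the match object
    seen["matched"] = match.group(0)  # the full matched text
    return match.group(0)             # leave the input unchanged

re.sub(r"<table.+?</table>", show, html, flags=re.IGNORECASE | re.DOTALL)

print(seen["as_str"])
print(len(seen["as_str"]), "vs", len(seen["matched"]))
```

Because the representation itself starts with `<` and ends with `>`, a tag-stripping pattern such as `<[^>]*>` also eats part of it, which would fit the observation that most of the content disappeared.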

Below you can find my updated script, which works properly (it also includes some other minor changes):

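import re

def tablereplace(table):
    # re.sub passes a match object; group(0) is the full matched table.
    table = table.group(0)
    # Replace every tag with a newline so only the text content remains.
    table = re.sub(r'<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
    except ZeroDivisionError:
        # No letters or digits at all: treat the table as numeric.
        ratio = 1
    # Drop tables with more than 40% numeric characters.
    if ratio > 0.4:
        return ''
    return table

# rawtext holds the whole HTML-formatted text to be parsed.
rawtext = re.sub(r'<table.+?</table>', tablereplace, rawtext, flags=re.IGNORECASE | re.DOTALL)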
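To see the approach on concrete input, here is a small self-contained sketch that applies the same idea to a made-up HTML string. The helper names `numeric_ratio` and `replace_table` and the sample data are assumptions for illustration, not part of the original script.

```python
import re

def numeric_ratio(text):
    # Share of digits among all digits and letters in the text.
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    total = digits + letters
    # As in the script above, a table without letters or digits counts as numeric.
    return digits / total if total else 1.0

def replace_table(match):
    # Strip the tags from the matched table, then decide whether to keep it.
    text = re.sub(r"<[^>]*>", "\n", match.group(0))
    return "" if numeric_ratio(text) > 0.4 else text

# Made-up input: the first table is mostly letters, the second mostly digits.
sample = (
    "<p>Intro</p>"
    "<table><tr><td>Revenue</td><td>2019</td></tr></table>"
    "<table><tr><td>12345</td><td>67890</td><td>ab</td></tr></table>"
)

result = re.sub(r"<table.+?</table>", replace_table, sample,
                flags=re.IGNORECASE | re.DOTALL)
print(repr(result))
```

The first table has 4 digits out of 11 counted characters (about 36%), so its text is kept; the second has 10 out of 12 (about 83%), so it is removed. For heavily nested or malformed HTML, a real parser may be more robust than a regular expression.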
