Remove text groups from html document


Problem description

I have the "SAMPLE CODE" below from an HTML document, and the "EXPECTED RESULT" shows what I want to end up with. I would like the script to run through the document and check for duplicates. If duplicates are present, based on the "NAME2" line, the script should remove the entire "name group" for each duplicate but leave one group. The script I have so far removes all groups if a duplicate is found and keeps only the first.

<SAMPLE CODE>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>

##Name Group (Name1,Name2,Name3,Name4)
NAME1....... DOUG
NAME2....... 12345
NAME3....... BILL
NAME4....... BOB

NAME1....... ALLAN
NAME2....... 12345
NAME3....... MITCHELL
NAME4....... TOM   
         
NAME1....... CRAIG
NAME2....... 12345
NAME3....... SIMON
NAME4....... ANDREW

NAME1....... GARY
NAME2....... 65897
NAME3....... OLIVER
NAME4....... MICHAEL

</font>
</pre>
</body>
</html>  


<EXPECTED RESULT>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>                                                

NAME1....... DOUG
NAME2....... 12345
NAME3....... BILL
NAME4....... BOB

NAME1....... GARY
NAME2....... 65897
NAME3....... OLIVER
NAME4....... MICHAEL


</font>
</pre>
</body>
</html>

import re
from bs4 import BeautifulSoup


html_data = '''
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>

NAME1......... DOUG
NAME2........... 12345
NAME3... BILL
NAME4...... BOB

NAME1......... ALLAN
NAME2........... 12345
NAME3... MITCHELL
NAME4...... TOM

</font>
</pre>
</body>
</html>
'''

soup = BeautifulSoup(html_data, 'html.parser')
vals = set(re.findall(r'NAME2\.+\s*(.*)\s*', soup.font.text))
if len(vals) == 1:
    soup.font.string = re.search(r'.*?NAME1.*?\n\n', soup.font.text, flags=re.S).group(0)

print(soup.prettify())

Recommended answer

I'm not a big fan of regex, especially not for HTML, so I would suggest doing something like this:

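# Assumes `soup` is the BeautifulSoup object built in the question's script.
data = soup.select_one('font')
# Mark the start of each group, then split the text on that marker.
targets = data.text.replace('NAME1', 'xxxNAME1').split('xxx')
# targets[0] is whatever precedes the first group (e.g. a header line), so skip it.
groups = [target.strip().split('\n') for target in targets[1:]]
# Loop over a copy (the slice) so that removing items from `groups` is safe.
for group in groups[1:]:
    # group[1] is the NAME2 line; this only catches duplicates that sit next to each other.
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)
new_ts = '\n'
for group in groups:
    new_ts += '\n'.join(group)+'\n\n'
data.string.replace_with(new_ts)
print(soup)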

The output is your expected output.
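If groups with the same NAME2 value are not always next to each other, one possible variation is to remember every NAME2 value already seen and keep only the first group for each. The sketch below is not part of the original answer: `dedupe_groups` is a hypothetical helper, and it assumes that groups are separated by blank lines and that every field line starts with "NAME". It works on plain text, so it needs only the standard library.

```python
import re


def dedupe_groups(text, key_field="NAME2"):
    """Return the text with only the first group kept for each key_field value.

    Hypothetical helper: assumes groups are separated by blank lines and
    that every field line starts with "NAME".
    """
    key_pattern = re.compile(rf"^{re.escape(key_field)}\.*\s*(.*)$")
    seen = set()
    kept = []
    # Blank or whitespace-only lines separate the groups.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Keep only field lines; this drops headers such as "##Name Group (...)".
        lines = [ln.strip() for ln in block.splitlines()
                 if ln.strip().startswith("NAME")]
        if not lines:
            continue
        # Find the value on the key line, if the group has one.
        key = None
        for ln in lines:
            match = key_pattern.match(ln)
            if match:
                key = match.group(1).strip()
                break
        if key is not None:
            if key in seen:
                continue  # a group with this key was already kept
            seen.add(key)
        kept.append("\n".join(lines))
    return "\n\n" + "\n\n".join(kept) + "\n\n"


sample = """
##Name Group (Name1,Name2,Name3,Name4)
NAME1....... DOUG
NAME2....... 12345
NAME3....... BILL
NAME4....... BOB

NAME1....... GARY
NAME2....... 65897
NAME3....... OLIVER
NAME4....... MICHAEL

NAME1....... ALLAN
NAME2....... 12345
NAME3....... MITCHELL
NAME4....... TOM
"""

print(dedupe_groups(sample))
```

With the question's script, the result could be written back the same way the question does it, for example `soup.font.string = dedupe_groups(soup.font.text)`.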
