Remove text groups from html document
Problem description
I have the below "SAMPLE CODE" from an HTML document, and my goal is the "EXPECTED RESULT". I would like the script to run through the document and check for duplicates. If duplicates are present based on the "NAME2" line, I would like the script to remove every duplicate "name group" but leave one group. The script I have so far removes all groups if a duplicate is found and keeps only the first....
<SAMPLE CODE>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
##Name Group (Name1,Name2,Name3,Name4)
NAME1....... DOUG
NAME2....... 12345
NAME3....... BILL
NAME4....... BOB
NAME1....... ALLAN
NAME2....... 12345
NAME3....... MITCHELL
NAME4....... TOM
NAME1....... CRAIG
NAME2....... 12345
NAME3....... SIMON
NAME4....... ANDREW
NAME1....... GARY
NAME2....... 65897
NAME3....... OLIVER
NAME4....... MICHAEL
</font>
</pre>
</body>
</html>
<EXPECTED RESULT>
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
NAME1....... DOUG
NAME2....... 12345
NAME3....... BILL
NAME4....... BOB
NAME1....... GARY
NAME2....... 65897
NAME3....... OLIVER
NAME4....... MICHAEL
</font>
</pre>
</body>
</html>
import re
from bs4 import BeautifulSoup
html_data = '''
<html>
<head>
</head>
<body>
<pre>
<font face="courier new" size=-4>
NAME1......... DOUG
NAME2........... 12345
NAME3... BILL
NAME4...... BOB
NAME1......... ALLAN
NAME2........... 12345
NAME3... MITCHELL
NAME4...... TOM
</font>
</pre>
</body>
</html>
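'''
soup = BeautifulSoup(html_data, 'html.parser')
# Collect the distinct values that appear on the NAME2 lines
vals = set(re.findall(r'NAME2\.+\s*(.*)\s*', soup.font.text))
if len(vals) == 1:
    # Keep only the first group (this pattern assumes groups are separated by a blank line)
    soup.font.string = re.search(r'.*?NAME1.*?\n\n', soup.font.text, flags=re.S).group(0)
print(soup.prettify())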
Recommended answer
I'm not a big fan of regex, especially not in html, so I would suggest doing something like this:
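data = soup.select_one('font')
# Mark the start of every group, then split the text on that marker
targets = data.text.replace('NAME1','xxxNAME1').split('xxx')
# One list of lines per group; targets[0] is the text before the first group
groups = [target.strip().split('\n') for target in targets[1:]]
for group in groups[1:]:
    # Drop a group whose NAME2 line equals that of the group now preceding it
    if group[1] == groups[groups.index(group)-1][1]:
        groups.remove(group)
# Rebuild the text with a blank line after each remaining group
new_ts = '\n'
for group in groups:
    new_ts += '\n'.join(group)+'\n\n'
data.string.replace_with(new_ts)
print(soup)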
The output is your expected output.
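Note that the loop above compares each group only with the group before it, which is enough for the sample because the duplicates are adjacent. If duplicates could also be scattered through the document, one option is to remember every NAME2 value already seen. Below is a minimal sketch of that idea using only the standard library; the function name `dedupe_groups` and its `start` and `key` parameters are made up for illustration, and it assumes every group begins with a line starting with NAME1.

```python
import re


def dedupe_groups(text, start="NAME1", key="NAME2"):
    """Keep only the first group for each distinct value on the `key` line."""
    # Split just before every line that starts a new group; the first piece
    # is whatever precedes the first group (for example a leading newline).
    preamble, *groups = re.split(rf"(?m)^(?={re.escape(start)}\b)", text)
    # Capture the value that follows the dots on the key line.
    key_line = re.compile(rf"(?m)^{re.escape(key)}\.+[ \t]*(.*?)[ \t]*$")
    seen = set()
    kept = []
    for group in groups:
        match = key_line.search(group)
        value = match.group(1) if match else None
        if value is not None and value in seen:
            continue  # a group with this key value was already kept
        seen.add(value)
        kept.append(group)
    return preamble + "".join(kept)


sample = (
    "\n"
    "NAME1....... DOUG\nNAME2....... 12345\nNAME3....... BILL\nNAME4....... BOB\n"
    "NAME1....... ALLAN\nNAME2....... 12345\nNAME3....... MITCHELL\nNAME4....... TOM\n"
    "NAME1....... CRAIG\nNAME2....... 12345\nNAME3....... SIMON\nNAME4....... ANDREW\n"
    "NAME1....... GARY\nNAME2....... 65897\nNAME3....... OLIVER\nNAME4....... MICHAEL\n"
)
print(dedupe_groups(sample))
```

With the soup from the question, this could presumably be applied as `soup.font.string = dedupe_groups(soup.font.text)`. Groups without a NAME2 line are kept as they are, and the original spacing between the remaining groups is preserved.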