Extracting HTML between tags
Problem description
I want to extract all HTML between specific HTML tags.
<html>
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>
So I want to grab all the HTML (tags and values) between the class1 div and the class2 span:
Included Text
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
There are also multiple occurrences within the HTML file, and I want to match them all. Here is what I mean:
<html>
(first occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>
(2nd occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>
(third occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>
</html>
I've been searching for answers using BeautifulSoup 4. However, all the questions and answers I found are about extracting text values, which is not what I want. I was also wondering whether this is even possible with BeautifulSoup or whether I must use regex instead.
Recommended answer
You can use bs4 and itertools.takewhile:
from itertools import takewhile

from bs4 import BeautifulSoup

h = """<html>
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>"""

soup = BeautifulSoup(h, "lxml")

def get_html_between(start_select, end_tag, cls):
    start = soup.select_one(start_select)
    all_next = start.find_all_next()
    # Text inside the starting tag
    yield "".join(start.contents)
    # Yield the following tags until we reach <end_tag class="cls">.
    # Use tag.name for the tag name: tag.get("name") would look up an HTML attribute.
    for t in takewhile(lambda tag: not (tag.name == end_tag and cls in tag.get("class", [])), all_next):
        yield t

for ele in get_html_between("div.class1", "span", "class2"):
    print(ele)
Output:
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
To make it a little more flexible, you can pass in the starting tag and a cond lambda/function. For multiple class1 divs, just iterate over them and pass each one in:
def get_html_between(start_tag, cond):
    # Text inside the starting tag
    yield "".join(start_tag.contents)
    all_next = start_tag.find_all_next()
    # Yield the following tags for as long as cond holds
    for ele in takewhile(cond, all_next):
        yield ele

# Keep going until we reach <span class="class2">
cond = lambda tag: not (tag.name == "span" and "class2" in tag.get("class", []))
soup = BeautifulSoup(h, "lxml")
for tag in soup.select("div.class1"):
    for ele in get_html_between(tag, cond):
        print(ele)
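To keep each occurrence separate rather than printing everything in one stream, the generator can be collected into one list per starting tag. The following is only a sketch under a few assumptions: the sample document is rebuilt by repeating the question's snippet three times, html.parser is used so that lxml is not required, and the names occurrence, not_end and blocks are made up for illustration.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# One occurrence of the pattern from the question, repeated three times
occurrence = """<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>
"""
html = "<html>\n" + occurrence * 3 + "</html>"


def get_html_between(start_tag, cond):
    # Text of the starting tag, then every following tag while cond holds
    yield start_tag.get_text()
    for ele in takewhile(cond, start_tag.find_all_next()):
        yield str(ele)


def not_end(tag):
    # False only for <span class="class2">, which ends the run
    return not (tag.name == "span" and "class2" in tag.get("class", []))


soup = BeautifulSoup(html, "html.parser")
# One list of HTML strings per <div class="class1">
blocks = [list(get_html_between(tag, not_end)) for tag in soup.select("div.class1")]

for i, block in enumerate(blocks, start=1):
    print(f"occurrence {i}: {len(block)} pieces")
```

Each inner list can then be joined, written to a file, or processed further on its own.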
Using your latest edit:
In [15]: cond = lambda tag: not (tag.name == "span" and "class2" in tag.get("class", []))

In [16]: for tag in soup.select("div.class1"):
   ....:     for ele in get_html_between(tag, cond):
   ....:         print(ele)
   ....:     print("\n")
   ....:
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
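Note that find_all_next() also visits nested tags, which is why <b>text</b> shows up twice in the output above (once inside the <h1> and once on its own), and that it skips bare text nodes such as [...]. If only the top-level nodes between the two markers are wanted, one possible variant is to walk next_siblings instead. This is a sketch, not part of the original answer: it assumes the start and end tags are siblings, as in the sample, the helper names is_end and html_between_siblings are hypothetical, and html.parser is used only to avoid the lxml dependency.

```python
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import Tag

h = """<html>
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span></html>"""


def is_end(node):
    # True only for <span class="class2">; text nodes are never the end marker
    return (
        isinstance(node, Tag)
        and node.name == "span"
        and "class2" in node.get("class", [])
    )


def html_between_siblings(start_tag):
    # Hypothetical helper: text of the start tag, then every sibling node
    # (tags and text) up to, but not including, the end marker
    yield start_tag.get_text()
    for node in takewhile(lambda n: not is_end(n), start_tag.next_siblings):
        yield str(node)


soup = BeautifulSoup(h, "html.parser")
parts = list(html_between_siblings(soup.select_one("div.class1")))
print("".join(parts))
```

Because each sibling is serialized whole, nested tags such as <b> appear only inside their parent, and the text between tags is kept.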