BeautifulSoup - extracting texts within one class
Question
I'm trying to extract text from the webpage markup below:
<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>
I tried:
for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)
It returns:
<a href="someURL" id="category1">Text1 I want</a>
So I split the text:
for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        category1 = str(url).split('category1">')[1].split('</a>')[0]
        print(category1)
and extracted "Text1 I want", but I still can't get "Text2 I want". Any ideas? Thank you.
There are other <a></a> tags in the source code, so if I remove the id= argument from my code, it returns all the other texts that I don't need. For example:
<div class="MYClass"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>
And
</div><div class=MYClass>
<a href="SomeURL>RandomTextExtracted</a>
Recommended answer
Since the id of an element is unique, you can find the first <a> tag using id="category1". To find the next <a> tag, you can use the find_next() method, which returns the first matching element that appears after the current one in the document.
from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
a_tag1 = soup.find('a', id='category1')
print(a_tag1)  # or use `a_tag1.text` to get the text
a_tag2 = a_tag1.find_next('a')
print(a_tag2)
Output:
<a href="SomeURL" id="category1">Text1 I want</a>
<a href="SomeURL">Text2 I want</a>
(I have tested it on the link that was provided, and it works there as well.)
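If only the link texts are needed rather than the whole tags, the same idea can be sketched as below. This is an illustrative variant, not part of the original answer: the helper name extract_category_texts is hypothetical, the sample markup combines the snippets from the question, and html.parser is used only so the example runs without lxml installed. It also swaps find_next() for find_next_sibling(), on the assumption that the second link always sits inside the same <div> as the first.

```python
from bs4 import BeautifulSoup

# Sample markup assembled from the question; the second <div> stands in for
# the unrelated links that should not be picked up.
html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>
<div class="MYClass"><a href="someURL">RandomTextExtracted.</a></div>'''

# html.parser ships with Python; 'lxml' should work as well if it is installed.
soup = BeautifulSoup(html, 'html.parser')


def extract_category_texts(soup):
    """Hypothetical helper: text of the id="category1" link and its next <a> sibling."""
    first = soup.find('a', id='category1')
    if first is None:
        return []
    texts = [first.get_text(strip=True)]
    # find_next_sibling() only looks at later elements under the same parent,
    # so links in other <div> blocks are never returned.
    second = first.find_next_sibling('a')
    if second is not None:
        texts.append(second.get_text(strip=True))
    return texts


category_texts = extract_category_texts(soup)
print(category_texts)
```

Using get_text(strip=True) avoids splitting the tag's string form by hand, and the None checks keep the helper from raising if either link is missing from a page.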