How to find spans with a specific class containing specific text using Beautiful Soup and re?


Question

How can I find all spans with a class of 'blue' that contain text in the format:

04/18/13 7:29pm

So the span's text could be:

04/18/13 7:29pm

or:

Posted on 04/18/13 7:29pm

In terms of constructing the logic to do this, this is what I have so far:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to figure out a way to do this, but the above is all I have so far.

To clarify the scenario, there are spans such as:

<span class="blue">here is a lot of text that i don't need</span>

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

Note that I only need 04/18/13 7:29pm, not the rest of the content.

Edit 2:

I have also tried:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result

This raises the error:

'TypeError: expected string or buffer'

Recommended answer

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html, naming the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_doc, 'html.parser')

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    # [ap] matches 'a' or 'p'; the original [a|p] would also accept a literal '|'
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)


Output:

04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
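
As a side note on the 'TypeError: expected string or buffer' in the question: find_all returns bs4 Tag objects, and re.findall only accepts strings, so the tag's text has to be extracted first (for example with get_text()). The following is a minimal sketch of that idea, assuming Python 3 and the bs4 package; the name DATE_RE and the shortened html_doc are illustrative choices, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample markup, modelled on the spans from the question.
html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">Posted on 04/18/13 7:29pm</span>
"""

# Date shape from the question: MM/DD/YY followed by H:MM or HH:MM and am/pm.
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')
spans = soup.find_all('span', class_='blue')

# Passing a Tag straight to the regex fails, because a Tag is not a string.
try:
    DATE_RE.findall(spans[0])
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Extracting the text first works; findall returns every date in that text.
found_dates = []
for span in spans:
    found_dates.extend(DATE_RE.findall(span.get_text()))

print(raised_type_error)
print(found_dates)
```

Because the pattern here has no capturing group, findall returns the whole matched date strings, and spans without a date simply contribute nothing to the list.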
