Python BeautifulSoup: extracting text


Question


I would like to extract the bold text, which indicates the latest weather PSI, from this website: http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Does anyone know how to extract it using the code below?

I also need to extract the two values that come before the current PSI reading in order to do a calculation, for a total of three values (the latest and the previous two).

Example: if the current value (bold) is 5AM: 51, I also need 3AM and 4AM. Does anyone know how to do this? Thanks in advance!

    from pprint import pprint
    import urllib2
    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    table_rows = []
    for row in table.find_all('tr'):
        table_rows.append([td.text.strip() for td in row.find_all('td')])

    data = {}
    for tr_index, tr in enumerate(table_rows):
        if tr_index % 2 == 0:
            for td_index, td in enumerate(tr):
                data[td] = table_rows[tr_index + 1][td_index]

    pprint(data)

prints:

    {'10AM': '49',
     '10PM': '-',
     '11AM': '52',
     '11PM': '-',
     '12AM': '76',
     '12PM': '54',
     '1AM': '70',
     '1PM': '59',
     '2AM': '64',
     '2PM': '65',
     '3AM': '59',
     '3PM': '72',
     '4AM': '54',
     '4PM': '79',
     '5AM': '51',
     '5PM': '82',
     '6AM': '48',
     '6PM': '79',
     '7AM': '47',
     '7PM': '-',
     '8AM': '47',
     '8PM': '-',
     '9AM': '47',
     '9PM': '-',
     'Time': '3-hr PSI'}

Solution

Make sure you understand what is going on here:

    import urllib2
    import datetime

    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    data = {}
    bold_time = ''
    # readings start at 12AM; strptime fills in the default date 1900-01-01
    cur_time = datetime.datetime.strptime("12AM", "%I%p")
    for tr_index, tr in enumerate(table.find_all('tr')):
        # skip the rows that hold the hour labels
        if 'Time' in tr.text:
            continue
        for td_index, td in enumerate(tr.find_all('td')):
            # skip the first cell, which is the row label ('3-hr PSI')
            if not td_index:
                continue
            data[cur_time] = td.text.strip()
            # the latest reading is the one wrapped in <strong>
            if td.find('strong'):
                bold_time = cur_time
            cur_time += datetime.timedelta(hours=1)

    print data.get(bold_time)  # bold
    print data.get(bold_time - datetime.timedelta(hours=1))  # before bold
    print data.get(bold_time - datetime.timedelta(hours=2))  # before before bold

This will print the 3-hr PSI value that is marked in bold and the two values before it (if they exist).
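Since the question also asks for the total of the three values, here is a minimal sketch of that last step. It does not fetch the page: the `data` dict is filled with sample readings copied from the output in the question, and treating 5AM as the bold cell is an assumption taken from the question's example. The helper name `latest_three` is made up for illustration.

```python
import datetime

# Sample readings copied from the output printed in the question,
# keyed by a datetime per hour. strptime with "%I%p" yields a
# datetime on the default date 1900-01-01.
labels = ["12AM", "1AM", "2AM", "3AM", "4AM", "5AM"]
values = ["76", "70", "64", "59", "54", "51"]
data = {
    datetime.datetime.strptime(label, "%I%p"): value
    for label, value in zip(labels, values)
}

# Assumption for this sketch: 5AM is the cell marked in bold.
bold_time = datetime.datetime.strptime("5AM", "%I%p")


def latest_three(data, bold_time):
    """Return the bold reading and the two before it, oldest first.

    Hours that are missing from `data` or shown as '-' are skipped.
    """
    readings = []
    for hours_back in (2, 1, 0):
        raw = data.get(bold_time - datetime.timedelta(hours=hours_back))
        if raw is not None and raw.isdigit():
            readings.append(int(raw))
    return readings


readings = latest_three(data, bold_time)
total = sum(readings)
print(readings, total)  # [59, 54, 51] 164
```

Note that the keys all fall on one calendar day, so if the bold cell is 12AM or 1AM the earlier hours are not found and fewer than three values come back. Whether the table should wrap around to the previous evening depends on how the page orders its readings, which is not clear from the question.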

Hope that helps.
