python beautifulsoup extracting text

Views: 162 | Published: 2016/8/5 19:00:57 | Tags: python beautifulsoup extract extraction


Question

I would like to extract the bold text, which indicates the latest weather PSI, from this website: http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Does anyone know how to extract it using the code below?

I also need the two values just before the current weather PSI for a calculation, so three values in total (the latest and the previous two).

Example: if the current value (bold) is 5AM: 51, I also need 3AM and 4AM. Can anyone help me with this? Thanks in advance!

    from pprint import pprint
    import urllib2
    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    table_rows = []
    for row in table.find_all('tr'):
        table_rows.append([td.text.strip() for td in row.find_all('td')])

    data = {}
    for tr_index, tr in enumerate(table_rows):
        if tr_index % 2 == 0:
            for td_index, td in enumerate(tr):
                data[td] = table_rows[tr_index + 1][td_index]

    pprint(data)

prints:

    {'10AM': '49',
     '10PM': '-',
     '11AM': '52',
     '11PM': '-',
     '12AM': '76',
     '12PM': '54',
     '1AM': '70',
     '1PM': '59',
     '2AM': '64',
     '2PM': '65',
     '3AM': '59',
     '3PM': '72',
     '4AM': '54',
     '4PM': '79',
     '5AM': '51',
     '5PM': '82',
     '6AM': '48',
     '6PM': '79',
     '7AM': '47',
     '7PM': '-',
     '8AM': '47',
     '8PM': '-',
     '9AM': '47',
     '9PM': '-',
     'Time': '3-hr PSI'}

Solution

Make sure you understand what is going on here:

    import urllib2
    import datetime

    from bs4 import BeautifulSoup as soup


    url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
    web_soup = soup(urllib2.urlopen(url))

    table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

    data = {}       # maps each hour (as a datetime) to its reading
    bold_time = ''  # stays '' if no cell is bold; the prints below then fail
    cur_time = datetime.datetime.strptime("12AM", "%I%p")
    for tr_index, tr in enumerate(table.find_all('tr')):
        # skip the header rows that hold the hour labels
        if 'Time' in tr.text:
            continue
        for td_index, td in enumerate(tr.find_all('td')):
            # skip the first cell of the row (the row label)
            if not td_index:
                continue
            data[cur_time] = td.text.strip()
            # the latest reading is wrapped in <strong>
            if td.find('strong'):
                bold_time = cur_time
            cur_time += datetime.timedelta(hours=1)

    print data.get(bold_time)  # bold
    print data.get(bold_time - datetime.timedelta(hours=1))  # before bold
    print data.get(bold_time - datetime.timedelta(hours=2))  # before before bold

This will print the 3-hr PSI value that is marked in bold and the two values before it (if they exist).

Hope that helps.
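
The question also asks for the total of the three values. A minimal sketch of that step, assuming the readings are already in a dict keyed by hour label (as in the question's printed output) and that the bold hour is known. The names `parse_hour` and `latest_three` are hypothetical helpers, not part of BeautifulSoup, and the sketch is written for Python 3 rather than the Python 2 used above:

```python
import datetime


def parse_hour(label):
    """Turn a label such as '5AM' into a datetime (date defaults to 1900-01-01)."""
    return datetime.datetime.strptime(label, "%I%p")


def latest_three(readings, bold_label):
    """Return the bold reading and the two before it as ints, oldest first.

    Hours that are missing, or still shown as '-', are skipped.
    """
    by_time = {parse_hour(label): value for label, value in readings.items()}
    bold_time = parse_hour(bold_label)
    values = []
    for hours_back in (2, 1, 0):
        value = by_time.get(bold_time - datetime.timedelta(hours=hours_back))
        if value is not None and value.isdigit():
            values.append(int(value))
    return values


# Sample values copied from the question's printed output.
readings = {"2AM": "64", "3AM": "59", "4AM": "54", "5AM": "51", "10PM": "-"}
values = latest_three(readings, "5AM")
print(values, sum(values))
```

As with the answer's code, an hour that falls before 12AM (for example, two hours before 1AM) is simply not found, so fewer than three values may come back.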
