Python BeautifulSoup: extracting text
Problem description
I would like to extract the bold text, which indicates the latest PSI reading, from this website: http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Does anyone know how to extract it using the code below?
I also need the two values that come before the current PSI reading so that I can do a calculation. That is three values in total (the latest and the previous two).
Example: if the current value (bold) is 5AM: 51, I also need the 3AM and 4AM values. Can anyone help me with this? Thanks in advance!
from pprint import pprint
import urllib2
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

# Collect the stripped text of every cell, row by row.
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])

# Even rows hold the hour labels, odd rows hold the matching readings.
data = {}
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]

pprint(data)
prints:
{'10AM': '49',
'10PM': '-',
'11AM': '52',
'11PM': '-',
'12AM': '76',
'12PM': '54',
'1AM': '70',
'1PM': '59',
'2AM': '64',
'2PM': '65',
'3AM': '59',
'3PM': '72',
'4AM': '54',
'4PM': '79',
'5AM': '51',
'5PM': '82',
'6AM': '48',
'6PM': '79',
'7AM': '47',
'7PM': '-',
'8AM': '47',
'8PM': '-',
'9AM': '47',
'9PM': '-',
'Time': '3-hr PSI'}
Make sure you understand what is going on here:
import urllib2
import datetime
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

data = {}
bold_time = None  # set once the cell containing <strong> is found
# The readings start at 12AM and advance one hour per cell.
cur_time = datetime.datetime.strptime("12AM", "%I%p")

for tr_index, tr in enumerate(table.find_all('tr')):
    # Skip the rows that only hold the hour labels.
    if 'Time' in tr.text:
        continue
    for td_index, td in enumerate(tr.find_all('td')):
        # Skip the first cell of the row (the "3-hr PSI" label).
        if not td_index:
            continue
        data[cur_time] = td.text.strip()
        if td.find('strong'):
            bold_time = cur_time
        cur_time += datetime.timedelta(hours=1)

# Guard against a page with no bold cell, where the subtraction would fail.
if bold_time is not None:
    print data.get(bold_time)  # bold
    print data.get(bold_time - datetime.timedelta(hours=1))  # before bold
    print data.get(bold_time - datetime.timedelta(hours=2))  # before before bold
This will print the 3-hr PSI value that is marked in bold and the two values before it (if they exist).
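For anyone on Python 3, here is a minimal sketch of the same idea that can be run without the live page. It assumes the table still alternates a row of hour labels with a row of readings and that the latest reading is wrapped in `<strong>`. `SAMPLE_HTML` and `latest_readings` are made-up names for illustration, and the sample markup is only a guess at the page layout, so adjust the selectors to the real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the assumed layout of the PSI table:
# a row of hour labels followed by a row of readings, with the latest
# reading wrapped in <strong>.
SAMPLE_HTML = """
<table>
  <tr><td>Time</td><td>3AM</td><td>4AM</td><td>5AM</td><td>6AM</td></tr>
  <tr><td>3-hr PSI</td><td>59</td><td>54</td><td><strong>51</strong></td><td>-</td></tr>
</table>
"""


def latest_readings(html, count=3):
    """Return up to `count` (hour, value) pairs ending at the bold reading."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = [row.find_all("td") for row in table.find_all("tr")]

    readings = []      # (hour label, reading text) in page order
    bold_index = None  # position of the reading wrapped in <strong>
    # Rows come in pairs: labels first, readings second.
    for label_row, value_row in zip(rows[0::2], rows[1::2]):
        # Skip the first cell of each row ("Time" / "3-hr PSI").
        for label_td, value_td in zip(label_row[1:], value_row[1:]):
            if value_td.find("strong"):
                bold_index = len(readings)
            readings.append((label_td.get_text(strip=True),
                             value_td.get_text(strip=True)))

    if bold_index is None:
        return []  # no bold cell found
    start = max(0, bold_index - count + 1)
    return readings[start:bold_index + 1]


latest = latest_readings(SAMPLE_HTML)
print(latest)

# Sum the three readings, ignoring placeholders such as "-".
total = sum(int(value) for _, value in latest if value.isdigit())
print(total)
```

Keeping the readings in a list in page order means the previous two values are simply the two entries before the bold one, so no time arithmetic is needed.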
Hope that helps.
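If you would rather keep the dictionary from the question, keyed by labels such as '5AM', the previous hours can also be found by label arithmetic. This is only a sketch: `to_hour24`, `to_label` and `current_and_previous` are hypothetical helper names, and it assumes every key has the form hour plus AM/PM. You still need to know which label is the current one, for example from the bold cell.

```python
def to_hour24(label):
    """Convert a label such as '5AM' or '12PM' to an hour in 0..23."""
    hour12, suffix = int(label[:-2]), label[-2:].upper()
    return hour12 % 12 + (12 if suffix == "PM" else 0)


def to_label(hour24):
    """Convert an hour in 0..23 back to a label such as '5AM'."""
    suffix = "AM" if hour24 < 12 else "PM"
    return "%d%s" % (hour24 % 12 or 12, suffix)


def current_and_previous(data, current_label, count=3):
    """Return (label, value) pairs for the current hour and the hours
    before it, oldest first. Missing or '-' readings become None."""
    current = to_hour24(current_label)
    result = []
    for offset in range(count - 1, -1, -1):
        label = to_label((current - offset) % 24)  # wraps past midnight
        value = data.get(label)
        number = int(value) if value and value.isdigit() else None
        result.append((label, number))
    return result


# A few entries copied from the output shown in the question.
data = {'3AM': '59', '4AM': '54', '5AM': '51', '6AM': '48',
        '11PM': '-', '12AM': '76', '1AM': '70'}

three = current_and_previous(data, '5AM')
print(three)
print(sum(value for _, value in three if value is not None))
```

Note that the wrap past midnight looks up '11PM' in the same table, which holds '-' when that hour has not been measured yet, so the helper returns None for it instead of a number.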