Parsing HTML data into a Python list for manipulation
Problem description
I am trying to read HTML pages and extract data from them. For example, I would like to read the EPS (earnings per share) of companies for the past 5 years. I can fetch the page and use either BeautifulSoup or html2text to turn it into one huge block of text. I then want to search that text (I have been using re.search), but I can't get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
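A note on why the pattern above finds nothing on the sample line: in that line, "EPS" is followed by " (Basic)", so `\s+(\d+)` hits an opening parenthesis where it expects a digit. The five values are also run together with no separator. The following is only a sketch of one way to split them with regex, written for Python 3 and assuming every value has exactly two decimal places (an assumption taken from the sample line, not something the page guarantees); `parse_eps` is a hypothetical helper name.

```python
import re

# The sample line from the question, as it appears in the flattened page text.
text = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

def parse_eps(text):
    """Split the run-together EPS values, assuming two decimals per value."""
    # Match the literal label, then capture the block of digits and dots after it.
    match = re.search(r"EPS \(Basic\)\s*([\d.]+)", text)
    if match is None:
        return []
    # Each value is "digits, dot, exactly two digits" (assumption).
    return [float(v) for v in re.findall(r"\d+\.\d{2}", match.group(1))]

eps = parse_eps(text)
print(eps)
```

This breaks as soon as a value has a different number of decimals or a minus sign, which is one reason to read the table cells directly instead.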
It's not good practice to parse HTML with regex. Use the BeautifulSoup parser instead: find the cell with the rowTitle class that has EPS (Basic) in its text, then iterate over its next siblings that have the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
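Note that the list above holds strings, while the question asks for numbers. As a rough sketch, the same approach can be written for Python 3 with the `bs4` package (BeautifulSoup 4), converting each cell to a float. The inline `SAMPLE_HTML` is a hypothetical stand-in for the real page, whose markup may differ, and `extract_row` is an illustrative helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the financials page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="valueCell"></td>
  </tr>
</table>
"""

def extract_row(html, row_label):
    """Return the values following the rowTitle cell that contains row_label."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.find_all("td", class_="rowTitle"):
        if row_label in title.get_text():
            cells = title.find_next_siblings("td", class_="valueCell")
            # Skip empty cells, then convert the remaining strings to floats.
            texts = [td.get_text(strip=True) for td in cells]
            return [float(t) for t in texts if t]
    return []

eps = extract_row(SAMPLE_HTML, "EPS (Basic)")
print(eps)
```

Against the live site, the HTML would come from an HTTP request instead of the sample string, and cells containing dashes or thousands separators would need extra handling before `float()`.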
Hope that helps.