Parsing HTML data into a Python list for manipulation


Question

I am trying to read HTML pages and extract data from them. For example, I would like to read the EPS (earnings per share) of companies for the past 5 years. I can fetch a page and use either BeautifulSoup or html2text to turn it into one large block of text. I then want to search that text (I have been using re.search), but I can't get it to work properly. Here is the line I am trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.

from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  #build url

text_soup = BeautifulSoup(urlopen(full_url).read()) #read in 

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)

Solution

Parsing HTML with regular expressions is not good practice. Use the BeautifulSoup parser instead: find the td cell that has the rowTitle class and contains the text EPS (Basic), then iterate over its next siblings that have the valueCell class:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

prints:

['13.46', '20.62', '26.69', '30.17', '32.81']
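
The snippet above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same idea on Python 3 with the bs4 package (beautifulsoup4), the version below runs against a small inline table rather than the live page. The SAMPLE_HTML markup and the extract_row helper are hypothetical, made up for illustration; only the rowTitle and valueCell class names come from the answer, and the real page may be structured differently today. It also converts the values to float, since the question asks for a list of numbers.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the class names used in the answer;
# the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="rowTitle">EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="valueCell"></td>
  </tr>
</table>
"""


def extract_row(html, row_label):
    """Return the non-empty valueCell entries of the first matching row as floats."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.find_all("td", class_="rowTitle"):
        if row_label in title.get_text():
            cells = title.find_next_siblings("td", class_="valueCell")
            texts = [td.get_text(strip=True) for td in cells]
            # Skip empty cells, mirroring the `if td.text` filter in the answer.
            return [float(text) for text in texts if text]
    return []


eps = extract_row(SAMPLE_HTML, "EPS (Basic)")
print(eps)
```

Note that float() will raise ValueError on cells such as "-" or "(1.23)", so a real page may need extra cleaning before conversion.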

Hope that helps.
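
As a side note on why the original re.search call finds nothing on that line: the pattern EPS\s+(\d+) requires whitespace and then a digit directly after "EPS", but in the extracted text "EPS" is followed by " (Basic)". The sketch below shows this on the exact line from the question. It also shows a regex fallback for splitting the run-together numbers, under the assumption (not guaranteed by the source) that every value has exactly two decimal places; split_two_decimal_values is a hypothetical helper name. Parsing the table cells as in the answer remains the more robust route.

```python
import re

# The flattened line quoted in the question.
line = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

# "EPS" is followed by " (Basic)", not by whitespace and a digit,
# so the original pattern does not match this line.
original = re.search(r"EPS\s+(\d+)", line)
print(original)


def split_two_decimal_values(text):
    """Split run-together numbers, ASSUMING each has exactly two decimals."""
    return [float(m) for m in re.findall(r"\d+\.\d{2}", text)]


# Match the full row label, then capture the run of digits after the newline.
match = re.search(r"EPS \(Basic\)\s+(\S+)", line)
values = split_two_decimal_values(match.group(1)) if match else []
print(values)
```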
