Parsing HTML data into a Python list for manipulation

Views: 288  Published: 2016/8/5 19:07:07  Tags: python html parsing html-parsing beautifulsoup


Problem description


I am trying to read HTML pages and extract data from them. For example, I would like to read in the EPS (earnings per share) of a company for the past 5 years. I can read the page in and use either BeautifulSoup or html2text to turn it into one huge block of text. I then want to search that text. I have been using re.search, but can't get it to work properly. Here is the line I am trying to access:



EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].  

Thanks for any help.  
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  #build url

text_soup = BeautifulSoup(urlopen(full_url).read()) #read in 

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)

Solution
Parsing HTML with regular expressions is not good practice. Use the BeautifulSoup parser instead: find the cell with the rowTitle class whose text contains EPS (Basic), then iterate over its next siblings that have the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
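The values above come back as strings, and the code targets Python 2 with BeautifulSoup 3. As a rough sketch of the same idea (find the title cell, then collect the value cells that follow it in the same row) on Python 3, the block below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages, and converts the results to floats. SAMPLE_HTML and the RowValueParser class are made up for illustration; only the rowTitle and valueCell class names come from the answer, and the real page markup may well differ.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; only the class names are taken from the answer.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="rowTitle">Sales/Revenue</td>
    <td class="valueCell">1.0</td>
  </tr>
  <tr>
    <td class="rowTitle"> EPS (Basic)</td>
    <td class="valueCell">13.46</td>
    <td class="valueCell">20.62</td>
    <td class="valueCell">26.69</td>
    <td class="valueCell">30.17</td>
    <td class="valueCell">32.81</td>
    <td class="miniGraphCell"></td>
  </tr>
</table>
"""


class RowValueParser(HTMLParser):
    """Collect valueCell texts from the row whose rowTitle contains `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.values = []             # texts of the matching value cells
        self._cell_class = None      # class attribute of the open <td>, if any
        self._in_target_row = False  # True once the row title matched

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_target_row = False  # every new row starts unmatched
        elif tag == "td":
            self._cell_class = dict(attrs).get("class") or ""

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._cell_class is None:
            return  # whitespace between tags, or text outside a cell
        classes = self._cell_class.split()
        if "rowTitle" in classes:
            self._in_target_row = self.label in text
        elif "valueCell" in classes and self._in_target_row:
            self.values.append(text)


parser = RowValueParser("EPS (Basic)")
parser.feed(SAMPLE_HTML)

# Convert the cell texts to floats to get the list asked for in the question.
eps = [float(v) for v in parser.values]
print(eps)
```

Empty cells are skipped because they produce no text, which plays the same role as the `if td.text` filter in the list comprehension above.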
Hope that helps.
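
A side note on the original re.search attempt, in case the flattened text is all that is available: the pattern EPS\s+(\d+) cannot match, because "EPS" is followed by " (Basic)" and not by whitespace and digits. The sketch below works only on the text line quoted in the question, and it rests on an assumption that may not hold for other rows: every value has exactly two decimal places, which is what makes it possible to split the run-together digits apart.

```python
import re

# The flattened text line quoted in the question.
text = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

# Original pattern: \s+ consumes the space after "EPS", but the next
# character is "(" rather than a digit, so the search finds nothing.
old_match = re.search(r"EPS\s+(\d+)", text)

# Match the full label, then capture the run of digits and dots after it.
match = re.search(r"EPS \(Basic\)\s+([\d.]+)", text)

# Assumption: each value has exactly two decimals, so "\d+\.\d{2}" can
# cut the concatenated run back into individual numbers.
eps = [float(v) for v in re.findall(r"\d+\.\d{2}", match.group(1))] if match else []
print(eps)
```

Parsing the table cells directly, as in the answer above, avoids this guesswork because each value arrives in its own cell.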
