python beautifulsoup4 parsing google finance data


Problem Description


I'm new to using beautifulsoup and scraping in general so I'm trying to get my feet wet so to speak.

I'd like to get the first row of information for the Dow Jones Industrial Average from here: http://www.google.com/finance/historical?q=INDEXDJX%3A.DJI&ei=ZN_2UqD9NOTt6wHYrAE

While I can read the data, and print(soup) outputs everything, I can't seem to get down far enough into the tree. How would I select the rows that I saved into table? And how about just the first row?

Thank you so much for your help!

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
import json
import sys
import os
import time
import csv
import errno

DJIA_URL = "http://www.google.com/finance/historical?q=INDEXDJX%3A.DJI&ei=ZN_2UqD9NOTt6wHYrAE"

def downloadData(queryString):
    with urllib.request.urlopen(queryString) as url:
        encoding = url.headers.get_content_charset()
        result = url.read().decode(encoding)
    return result

raw_html = downloadData(DJIA_URL)
# naming the parser explicitly avoids bs4's "no parser specified" warning
soup = BeautifulSoup(raw_html, "html.parser")

#print(soup)

table = soup.findAll("table", {"class": "gf-table historical_price"})

Solution

You want the second tr table row then:

prices = soup.find('table', class_='historical_price')
rows = prices.find_all('tr')
print(rows[1])
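Since the Google Finance page itself is long gone, here is a self-contained sketch of that rows[1] lookup against a hardcoded stand-in snippet (the HTML and its values are made up); it also shows the CSS-selector route to the same multi-class table:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """<table class="gf-table historical_price">
<tr><th>Date</th><th>Close</th></tr>
<tr><td>Feb 7, 2014</td><td>15,794.08</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# A CSS selector matches both classes without spelling out the full attribute string.
prices = soup.select_one("table.gf-table.historical_price")
rows = prices.find_all("tr")

# rows[0] is the header row; rows[1] is the first data row.
first_row = [td.text.strip() for td in rows[1].find_all("td")]
print(first_row)  # → ['Feb 7, 2014', '15,794.08']
```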

or, to list all rows with prices info, skip the one with any th elements:

for row in rows:
    if row.th:  # the header row is the one holding th cells
        continue
    # process the td cells of this data row here

or use that first header as a source for dictionary keys:

keys = [th.text.strip() for th in rows[0].find_all('th')]
for row in rows[1:]:
    data = {key: td.text.strip() for key, td in zip(keys, row.find_all('td'))}
    print(data)

which produces:

{'Date': 'Feb 7, 2014', 'Open': '15,630.64', 'High': '15,798.51', 'Low': '15,625.53', 'Close': '15,794.08', 'Volume': '105,782,495'}
{'Date': 'Feb 6, 2014', 'Open': '15,443.83', 'High': '15,632.09', 'Low': '15,443.00', 'Close': '15,628.53', 'Volume': '106,979,691'}
{'Date': 'Feb 5, 2014', 'Open': '15,443.00', 'High': '15,478.21', 'Low': '15,340.69', 'Close': '15,440.23', 'Volume': '105,125,894'}
{'Date': 'Feb 4, 2014', 'Open': '15,372.93', 'High': '15,481.85', 'Low': '15,356.62', 'Close': '15,445.24', 'Volume': '124,106,548'}

etc.
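Since the question's script already imports csv, those per-row dicts can be written straight out with csv.DictWriter. A stdlib-only sketch, with made-up sample rows standing in for the scraped data:

```python
import csv
import io

# Hypothetical sample rows shaped like the dicts produced above.
scraped = [
    {"Date": "Feb 7, 2014", "Open": "15,630.64", "Close": "15,794.08"},
    {"Date": "Feb 6, 2014", "Open": "15,443.83", "Close": "15,628.53"},
]

# io.StringIO keeps the example self-contained; swap in
# open("djia.csv", "w", newline="") to write a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Date", "Open", "Close"])
writer.writeheader()
writer.writerows(scraped)
print(buf.getvalue())
```

Note that fields containing commas (the dates and the thousands-separated prices) come out quoted, which is csv's default minimal quoting.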
