Scrape a Table Looping in Specific Dates using Beautiful Soup


Question


I have been driving myself up the wall trying to scrape the necessary historical coffee prices using BeautifulSoup from the table found here: http://www.investing.com/commodities/us-coffee-c-historical-data


I am trying to pull a market week's worth of prices, from 04-04-2016 to 04-08-2016.


My ultimate goal is to scrape the entire table for those dates, pulling all columns from Date to Change %.


My first step was to create a dictionary of the dates I want, using the date format used in the element:

dates={1 : "Apr 04, 2016",
  2 : "Apr 05, 2016",
  3 : "Apr 06, 2016",
  4 : "Apr 07, 2016",
  5 : "Apr 08, 2016"}
dates
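If you would rather not type the dates out by hand, the same dictionary can be built with the standard library (a minimal sketch, assuming the "Apr 04, 2016" format the table uses and skipping weekend days):

```python
from datetime import date, timedelta

start, end = date(2016, 4, 4), date(2016, 4, 8)
dates = {}
day, key = start, 1
while day <= end:
    if day.weekday() < 5:  # Monday=0 .. Friday=4; skip Saturday/Sunday
        dates[key] = day.strftime("%b %d, %Y")
        key += 1
    day += timedelta(days=1)
print(dates)
```

This produces the same five weekday keys as the hand-written dictionary for any market week you pick.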


Next I want to scrape the table, but I can't get it to loop through the dates as needed, so I have tried to pull the individual elements:

import requests
from bs4 import BeautifulSoup

url = "http://www.investing.com/commodities/us-coffee-c-historical-data"
page  = requests.get(url).text
soup_coffee = BeautifulSoup(page)

coffee_table = soup_coffee.find("table", class_="genTbl closedTbl historicalTbl")
coffee_titles = coffee_table.find_all("th", class_="noWrap")

for coffee_title in coffee_titles:
  price = coffee_title.find("td", class_="greenfont")
  print(price)

except the value that is returned is:

None
None
None
None
None
None
None


Firstly, why am I returning a "None" value? I have a feeling it has to do with the coffee_titles part of my code, and it is not recognizing the column titles correctly.


Secondly, is there an efficient way for me to scrape the entire table using my date range in the dates dictionary?


Any suggestions would be greatly appreciated. Thanks

Answer


Your code fails because you are looking for td tags inside the th header tags. If you print coffee_titles, it is pretty clear why you see None:

[<th class="first left noWrap">Date</th>, <th class="noWrap">Price</th>, <th class="noWrap">Open</th>, <th class="noWrap">High</th>, <th class="noWrap">Low</th>, <th class="noWrap">Vol.</th>, <th class="noWrap">Change %</th>]

There are no td tags in there.
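You can verify this without hitting the site at all by running the same find_all against the header markup inlined as a string (a minimal sketch; the HTML is copied from the list above):

```python
from bs4 import BeautifulSoup

html = """<table class="genTbl closedTbl historicalTbl"><tr>
<th class="first left noWrap">Date</th><th class="noWrap">Price</th>
<th class="noWrap">Open</th><th class="noWrap">High</th><th class="noWrap">Low</th>
<th class="noWrap">Vol.</th><th class="noWrap">Change %</th></tr></table>"""

soup = BeautifulSoup(html, "html.parser")
# class_="noWrap" matches any th that has noWrap among its classes
titles = [th.text for th in soup.find_all("th", class_="noWrap")]
print(titles)
```

Each item is a th element, so calling .find("td", ...) on it can only ever return None; to get the column names you want .text, as shown here.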


To get all the table data, you can pull the dates from the table and use them as keys:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

r = requests.get("http://www.investing.com/commodities/us-coffee-c-historical-data")
od = OrderedDict()
soup = BeautifulSoup(r.content,"lxml")

# select the table
table = soup.select_one("table.genTbl.closedTbl.historicalTbl")

# all col names
cols = [th.text for th in table.select("th")[1:]]
# get all rows bar the first i.e the headers
for row in table.select("tr + tr"):
    # get all the data including the date
    data = [td.text for td in row.select("td")]
    # use date as the key and store list of values
    od[data[0]] = dict(zip(cols,  data[1:]))


from  pprint import pprint as pp

pp(dict(od))

Output:

    {u'Jun 01, 2016': {u'Change %': u'0.29%',
                   u'High': u'123.10',
                   u'Low': u'120.85',
                   u'Open': u'121.50',
                   u'Price': u'121.90',
                   u'Vol.': u'18.55K'},
 u'Jun 02, 2016': {u'Change %': u'0.90%',
                   u'High': u'124.40',
                   u'Low': u'122.15',
                   u'Open': u'122.50',
                   u'Price': u'123.00',
                   u'Vol.': u'22.11K'},
 u'Jun 03, 2016': {u'Change %': u'3.33%',
                   u'High': u'127.40',
                   u'Low': u'122.50',
                   u'Open': u'122.60',
                   u'Price': u'127.10',
                   u'Vol.': u'28.47K'},
 u'Jun 06, 2016': {u'Change %': u'3.62%',
                   u'High': u'132.05',
                   u'Low': u'127.10',
                   u'Open': u'127.30',
                   u'Price': u'131.70',
                   u'Vol.': u'30.65K'},
 u'May 09, 2016': {u'Change %': u'2.49%',
                   u'High': u'126.60',
                   u'Low': u'123.28',
                   u'Open': u'125.65',
                   u'Price': u'126.53',
                   u'Vol.': u'-'},
 u'May 10, 2016': {u'Change %': u'0.29%',
                   u'High': u'125.90',
                   u'Low': u'125.90',
                   u'Open': u'125.90',
                   u'Price': u'126.90',
                   u'Vol.': u'0.01K'},
 u'May 11, 2016': {u'Change %': u'2.26%',
                   u'High': u'129.77',
                   u'Low': u'126.88',
                   u'Open': u'128.60',
                   u'Price': u'129.77',
                   u'Vol.': u'-'},
 u'May 12, 2016': {u'Change %': u'-1.21%',
                   u'High': u'128.75',
                   u'Low': u'127.30',
                   u'Open': u'128.75',
                   u'Price': u'128.20',
                   u'Vol.': u'0.01K'},
 u'May 13, 2016': {u'Change %': u'0.47%',
                   u'High': u'127.85',
                   u'Low': u'127.80',
                   u'Open': u'127.85',
                   u'Price': u'128.80',
                   u'Vol.': u'0.01K'},
 u'May 16, 2016': {u'Change %': u'3.03%',
                   u'High': u'131.95',
                   u'Low': u'128.75',
                   u'Open': u'128.75',
                   u'Price': u'132.70',
                   u'Vol.': u'0.01K'},
 u'May 17, 2016': {u'Change %': u'-0.64%',
                   u'High': u'132.60',
                   u'Low': u'132.60',
                   u'Open': u'132.60',
                   u'Price': u'131.85',
                   u'Vol.': u'-'},
 u'May 18, 2016': {u'Change %': u'-1.93%',
                   u'High': u'129.65',
                   u'Low': u'128.15',
                   u'Open': u'128.85',
                   u'Price': u'129.30',
                   u'Vol.': u'0.02K'},
 u'May 19, 2016': {u'Change %': u'-4.14%',
                   u'High': u'129.00',
                   u'Low': u'123.70',
                   u'Open': u'128.95',
                   u'Price': u'123.95',
                   u'Vol.': u'29.69K'},
 u'May 20, 2016': {u'Change %': u'0.61%',
                   u'High': u'125.95',
                   u'Low': u'124.25',
                   u'Open': u'124.75',
                   u'Price': u'124.70',
                   u'Vol.': u'15.54K'},
 u'May 23, 2016': {u'Change %': u'-2.04%',
                   u'High': u'124.70',
                   u'Low': u'122.00',
                   u'Open': u'124.50',
                   u'Price': u'122.15',
                   u'Vol.': u'15.89K'},
 u'May 24, 2016': {u'Change %': u'-0.29%',
                   u'High': u'123.30',
                   u'Low': u'121.55',
                   u'Open': u'122.45',
                   u'Price': u'121.80',
                   u'Vol.': u'15.06K'},
 u'May 25, 2016': {u'Change %': u'-0.33%',
                   u'High': u'122.95',
                   u'Low': u'121.20',
                   u'Open': u'122.45',
                   u'Price': u'121.40',
                   u'Vol.': u'18.11K'},
 u'May 26, 2016': {u'Change %': u'0.08%',
                   u'High': u'122.15',
                   u'Low': u'121.20',
                   u'Open': u'121.90',
                   u'Price': u'121.50',
                   u'Vol.': u'19.27K'},
 u'May 27, 2016': {u'Change %': u'-0.16%',
                   u'High': u'122.35',
                   u'Low': u'120.80',
                   u'Open': u'122.10',
                   u'Price': u'121.30',
                   u'Vol.': u'13.52K'},
 u'May 31, 2016': {u'Change %': u'0.21%',
                   u'High': u'123.90',
                   u'Low': u'121.35',
                   u'Open': u'121.55',
                   u'Price': u'121.55',
                   u'Vol.': u'23.62K'}}
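With the full table stored in od, restricting it to the week you listed in your dates dictionary is just a filter over the keys (a sketch; od is stubbed here with two sample rows instead of the live scrape):

```python
from collections import OrderedDict

# stubbed scrape result -- in practice this is the od built in the loop above
od = OrderedDict([
    ("Apr 04, 2016", {"Price": "122.80", "Change %": "-3.50%"}),
    ("Jun 01, 2016", {"Price": "121.90", "Change %": "0.29%"}),
])

wanted = {1: "Apr 04, 2016", 2: "Apr 05, 2016", 3: "Apr 06, 2016",
          4: "Apr 07, 2016", 5: "Apr 08, 2016"}

# keep only the rows whose date appears in the wanted week
week = OrderedDict((d, row) for d, row in od.items() if d in wanted.values())
print(week)
```

This works, but it only finds dates that happen to be on the first page of the table; the ajax approach below is the reliable way to request an arbitrary range.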


Now to get specific dates, we need to mimic the ajax call with a POST to http://www.investing.com/instruments/HistoricalDataAjax:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

# data to post
data = {"action": "historical_data",
        "curr_id": "8832",
        "st_date": "04/04/2016",
        "end_date": "04/08/2016",
        "interval_sec": "Daily"}

# add a user agent and specify that we are making an ajax request
head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"}

with requests.Session() as s:
    r = s.post("http://www.investing.com/instruments/HistoricalDataAjax", data=data, headers=head)
    od = OrderedDict()
    soup = BeautifulSoup(r.content, "lxml")

    table = soup.select_one("table.genTbl.closedTbl.historicalTbl")
    cols = [th.text for th in table.select("th")][1:]
    for row in table.select("tr + tr"):
        data = [td.text for td in row.select("td")]
        od[data[0]] = dict(zip(cols, data[1:]))

from pprint import pprint as pp

pp(dict(od))


Now we only get the date range from st_date to end_date:

{u'Apr 04, 2016': {u'Change %': u'-3.50%',
                   u'High': u'126.55',
                   u'Low': u'122.30',
                   u'Open': u'125.80',
                   u'Price': u'122.80',
                   u'Vol.': u'25.18K'},
 u'Apr 05, 2016': {u'Change %': u'-1.55%',
                   u'High': u'122.85',
                   u'Low': u'120.55',
                   u'Open': u'122.85',
                   u'Price': u'120.90',
                   u'Vol.': u'25.77K'},
 u'Apr 06, 2016': {u'Change %': u'0.50%',
                   u'High': u'122.15',
                   u'Low': u'120.00',
                   u'Open': u'121.45',
                   u'Price': u'121.50',
                   u'Vol.': u'17.94K'},
 u'Apr 07, 2016': {u'Change %': u'-1.40%',
                   u'High': u'122.60',
                   u'Low': u'119.60',
                   u'Open': u'122.35',
                   u'Price': u'119.80',
                   u'Vol.': u'32.69K'}}
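Note that every scraped value is a string; if you plan to do arithmetic on the prices, a small converter can normalise the table's formats (a hypothetical helper, not part of the site's API; "-" means no volume was reported):

```python
def to_number(value):
    """Convert table strings like '25.18K', '-3.50%' or '-' to numbers."""
    if value == "-":
        return None          # no data reported for this cell
    if value.endswith("K"):
        return float(value[:-1]) * 1000   # thousands of contracts
    if value.endswith("%"):
        return float(value[:-1]) / 100    # percentage as a fraction
    return float(value)

print(to_number("25.18K"))
print(to_number("-3.50%"))
print(to_number("122.80"))
```

Applied over each row dict, this gives you plain floats ready for analysis.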


You can see the POST request in Chrome developer tools under the XHR tab:


