BeautifulSoup Scraping td & tr

Question

I am trying to extract the price data (high and low) from the 3rd table (corn). The code returns "None":

import urllib2                          
from bs4 import BeautifulSoup           
import time                           
import re                               
start_urls = 4539                       
nb_quotes = 10                          
for urls in range (start_urls, start_urls - nb_quotes, -1):

    start_time = time.time()

    # construct the URLs strings
    url = 'http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains' 

    # Read the HTML page content
    page = urllib2.urlopen(url)

    # Create a beautifulsoup object
    soup = BeautifulSoup(page)

    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[2]                 # This is the table to be parsed   

    low_tmp = str(tab.findAll('tr')[0].findAll('td')[1].getText())     #Low price
    low = re.sub('[+]', '', low_tmp)                                
    high_tmp = str(tab.findAll('tr')[0].findAll('td')[2].string)    # High price
    high = re.sub('[+]', '', high_tmp)                             


    stop_time = time.time()


    print low, '\t', high, '(%0.1f s)' % (stop_time - start_time)

Recommended answer

The data in the table is filled in on the browser side by the following JavaScript call:

document.write(getQuoteboardHTML(
    splitQuote(quotes, 'ZC*1,ZC*2,ZC*3,ZC*4,ZC*5,ZC*6,ZC*7,ZC*8,ZC*9'.split(/,/)),
    'shortmonthonly,high,low,last,change'.split(/,/), { nospacers: true }));

BeautifulSoup is an HTML parser; it does not execute JavaScript.
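
A quick way to see this for yourself is to run a static parser over markup shaped like the page's and count what it finds. The sketch below is written for Python 3 and uses the standard-library `html.parser` as a stand-in for BeautifulSoup (an assumption, made only so the example has no dependencies). The HTML snippet and the `TagCounter` class are hypothetical and simplified, not copied from the real page.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count the start tags a static HTML parser actually sees."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Hypothetical, simplified stand-in for the corn section of the page:
# the quote rows only exist after a browser runs the script.
static_html = """
<div class="fixedpage_heading">CORN</div>
<script>document.write(getQuoteboardHTML(quotes));</script>
"""

counter = TagCounter()
counter.feed(static_html)
counter.close()

# Report how many script, tr and td start tags were seen.
print(counter.counts.get("script", 0),
      counter.counts.get("tr", 0),
      counter.counts.get("td", 0))
```

The parser reports the script element but no rows or cells, which is consistent with the original code finding nothing useful when it indexes into `tr` and `td`.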

Basically, you need something to execute that JavaScript for you.

One solution would be to use a real browser with the help of selenium:

from selenium import webdriver


url = "http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains"

driver = webdriver.Firefox()
driver.get(url)

table = driver.find_element_by_xpath('//td[contains(div[@class="fixedpage_heading"], "CORN")]/table[@class="homepage_quoteboard"]')
for row in table.find_elements_by_tag_name('tr')[1:]:
    month = row.find_element_by_class_name('quotefield_shortmonthonly').text
    low = row.find_element_by_class_name('quotefield_low').text
    high = row.find_element_by_class_name('quotefield_high').text

    print month, low, high

driver.close()

Prints:

SEP 329-0 338-0
DEC 335-6 345-4
MAR 348-2 358-0
MAY 356-6 366-0
JUL 364-0 373-4
SEP 372-0 379-4
DEC 382-0 390-2
MAR 392-4 399-0
MAY 400-0 405-0


Another option would be to "get down to the metal" and see what the splitQuote() and getQuoteboardHTML() JS functions actually do. Using browser developer tools, you can see that there is an underlying request going to this url, which returns a piece of JavaScript code containing all of the objects with the data for the tables on the page:

var quotes = { 'ZC*1': { name: 'Corn', flag: 's', price_2_close: '338.75', open_interest: '2701', tradetime: '20140911133000', symbol: 'ZCU14', open: '338', high: '338', low: '329', last: '331.75', change: '-7', pctchange: '-2.07', volume: '1623', exchange: 'CBOT', type: '2', unitcode: '-1', date: '14104 ... ', month: 'May 2015', shortmonth: 'May 2015' } };

If you manage to extract the necessary parts from it, this would be your second option.
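
As a rough illustration of that second option, the following Python 3 sketch pulls the fields out of a `var quotes = {...}` snippet with regular expressions. The `parse_quotes` helper is hypothetical, and `sample` is a shortened copy of the object shown above that keeps only a few of its fields; fetching the real response is left out because the request URL is not reproduced here. It assumes that values are single-quoted and contain no apostrophes or braces, which holds for the excerpt above but may not hold for every response.

```python
import re


def parse_quotes(js_text):
    """Parse a `var quotes = {...}` snippet into {key: {field: value}}.

    Hypothetical helper: assumes single-quoted values without
    apostrophes or braces inside them.
    """
    quotes = {}
    # Each entry looks like: 'ZC*1': { name: 'Corn', high: '338', ... }
    for key, body in re.findall(r"'([^']+)'\s*:\s*\{([^{}]*)\}", js_text):
        # Inside an entry, fields are bare identifiers with quoted values.
        quotes[key] = dict(re.findall(r"(\w+)\s*:\s*'([^']*)'", body))
    return quotes


# Shortened sample built from the fields visible in the excerpt above.
sample = (
    "var quotes = { 'ZC*1': { name: 'Corn', symbol: 'ZCU14', "
    "high: '338', low: '329', last: '331.75' } };"
)

parsed = parse_quotes(sample)
corn = parsed["ZC*1"]
print(corn["symbol"], corn["low"], corn["high"])
```

All values come back as strings, so any numeric conversion is left to the caller. A regex is used here only because the payload is a JavaScript object literal with unquoted keys rather than valid JSON, so `json.loads` would reject it as-is.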
