How to scrape multiple pages with an unchanging URL - python


Question

I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/

So far I've used selenium to execute the javascript and get the table scraped. However, my code right now only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click on next page the URL does not change, so I cannot just iterate over a different URL each time.

Below is my code so far:

from selenium import webdriver
from bs4 import BeautifulSoup  # the "lxml" parser below also needs lxml installed

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0, 25)]
    # PhantomJS has executed the javascript, so the rendered table is in the source
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource, 'lxml')
    # build a pipe-delimited header line, skipping column 2
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
    for i in lst:
        if i != 2:
            hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    # collect the body rows, keyed by the text of the first cell
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    # print each row pipe-delimited, again skipping column 2
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s

if __name__ == "__main__":
    scrape()

Or is it possible for me to click next through the browser if I use webdriver.Chrome() instead of PhantomJS, and then have the Python code run on the new page after each click?

Answer

This is not a trivial page to interact with and would require the use of Explicit Waits to wait for invisibility of "loading" indicators.
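In isolation, the explicit-wait pattern looks like the sketch below; the locator here is a hypothetical placeholder, and the real ones used against this page appear in the full implementation that follows.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll for up to 10 seconds until the condition holds, instead of sleeping
# for a fixed amount of time ("some_loading_indicator" is a made-up id,
# not one taken from the actual page; "driver" is assumed created already).
WebDriverWait(driver, 10).until(
    EC.invisibility_of_element_located((By.ID, "some_loading_indicator")))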

Here is the complete and working implementation that you may use as a starting point:

# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]


# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))


while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text(u"下一页")  # the "Next Page" link
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"

The idea is to have an endless loop which we exit only when the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the next link, and wait for the "loading" spinning circle that appears on top of the grid to become invisible.
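One way to get rid of the hard-coded time.sleep(2) left as a TODO above would be to wait for a cell of the old page to go stale after clicking the link. This is an untested sketch that assumes the grid rows are replaced in the DOM on every page change (EC and WebDriverWait as imported above):

# grab a cell from the page we are about to leave
old_cell = driver.find_element_by_css_selector("table#dt_1 tr[class] td")
next_link.click()
# the click re-renders the grid, detaching the old cell from the DOM;
# waiting for it to go stale replaces the fixed two-second sleep
WebDriverWait(driver, 10).until(EC.staleness_of(old_cell))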
