Need to scrape a table which is loaded through AJAX using Python (Selenium)


Question

I have a page with a table (table id="ctl00_ContentPlaceHolder_ctl00_ctl00_GV", class="GridListings") that I need to scrape. I usually use BeautifulSoup and urllib for this, but here the table takes some time to load, so it isn't captured when I try to fetch it with BS. I cannot use PyQt4, dryscrape or windmill because of installation issues, so the only option left is Selenium/PhantomJS. I tried the following, still without success:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get(url)  # url: address of the page containing the table
wait = WebDriverWait(driver, 10)
# The locator must be passed as a single (By, selector) tuple
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV')))

The above code doesn't give me the desired contents of the table. How do I go about achieving this?

Recommended answer

You can get the data using requests and bs4. With almost all, if not all, ASP.NET sites there are a few POST parameters that always need to be provided, such as __EVENTTARGET and __EVENTVALIDATION:

from bs4 import BeautifulSoup
import requests

data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
    "__EVENTARGUMENT": "LISTINGS;0",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
    "ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
    "__ASYNCPOST": "true"}

For the actual POST, we need to add a few more values to our post data:

post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
    s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
    # Initial GET: the page carries the hidden state fields the server expects back
    soup = BeautifulSoup(s.get(post).content, "html.parser")

    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]

    # POST with the same session so cookies from the GET are reused
    r = s.post(post, data=data)
    soup2 = BeautifulSoup(r.content, "html.parser")
    table = soup2.select_one("div.GridListings")
    print(table)
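
The script above copies three hidden fields by name. ASP.NET WebForms pages may emit other hidden inputs as well, so a more general approach is to collect every hidden input from the first response and merge the result into the POST data. Below is a minimal sketch of that idea. It uses only the standard library's html.parser so that it runs without extra packages. The names HiddenInputCollector and collect_hidden_fields and the sample markup are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is treated as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def collect_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical markup standing in for the page returned by the first GET
sample_page = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""
hidden = collect_hidden_fields(sample_page)
print(hidden)
```

With the session from the answer, something like data.update(collect_hidden_fields(s.get(post).text)) would fill in the same three values plus any other hidden inputs the page emits. Whether this particular server accepts the extra fields has not been tested here.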

Running the requests/bs4 script above prints the table.
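
Printing the element shows raw HTML. A likely next step is turning it into rows of cell text. The sketch below shows one way to do that with the standard library only. The TableRowParser class, the parse_table_rows function and the sample table (including its column names and values) are invented for illustration and do not reflect the real site's data.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Turn <tr>/<td>/<th> markup into a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample standing in for the printed table HTML
sample_table = """
<table class="GridListings">
  <tr><th>Listing Number</th><th>Section</th><th>Price</th></tr>
  <tr><td>1001</td><td> 105 </td><td>$2,500</td></tr>
  <tr><td>1002</td><td>212</td><td>$1,800</td></tr>
</table>
"""
rows = parse_table_rows(sample_table)
for row in rows:
    print(row)
```

Since bs4 is already in use in the answer, the same result can be obtained there by looping over table.find_all("tr") and calling get_text(strip=True) on each td or th cell. The standard-library version is shown only to keep the sketch free of dependencies.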
