Retrieving scripted page urls via web scrape


Question

I'm trying to get all of the article links from a web-scraped search query, but I don't seem to get any results.

The page in question: http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=

My approach: I'm trying to get the ids of the articles and then append them to the already-known URL (http://www.seek.com.au/job/ + id). However, there are no ids in what my request retrieves (using the requests package from http://docs.python-requests.org/en/latest/); in fact, there are no articles at all.
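It helps to see why the plain request comes back empty: everything after the # in that URL is a fragment, which is never transmitted to the server, so an HTTP client only receives the bare page shell before any JavaScript runs. A minimal sketch of the failing approach (the URL is trimmed for brevity, and the "/job/" substring check is just an illustrative probe for result links):

import requests

# All of the search parameters sit after '#', i.e. in the URL fragment.
# Fragments are never sent to the server, so requests only gets the bare
# page shell; the job ids are filled in afterwards by JavaScript/AJAX.
url = ("http://www.seek.com.au/jobs/in-australia/"
       "#dateRange=999&workType=0&keywords=police+check&page=1")  # trimmed for brevity
response = requests.get(url)

# Probe for result links of the form http://www.seek.com.au/job/<id>;
# with a plain HTTP client this is expected to come back False.
print("/job/" in response.text)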

It seems that in this particular case I need to execute the scripts (which generate the ids) in some way to get the full data. How could I do that?

Maybe there are other ways to retrieve all of the results from this search query?

Answer

As mentioned, download Selenium. There are Python bindings.

Selenium is a web-testing automation framework. In effect, by using Selenium you are remote-controlling a web browser. This is necessary because web browsers have JavaScript engines and DOMs, allowing AJAX to occur.

Use this test script (it assumes you have Firefox installed; Selenium supports other browsers if needed):

# Import 3rd party libraries
from selenium import webdriver

class requester_firefox(object):
    """Callable wrapper that fetches a URL in Firefox and returns the rendered page source."""

    def __init__(self):
        self.selenium_browser = webdriver.Firefox()
        self.selenium_browser.set_page_load_timeout(30)  # fail if a page takes longer than 30s

    def __del__(self):
        self.selenium_browser.quit()
        self.selenium_browser = None

    def __call__(self, url):
        try:
            self.selenium_browser.get(url)
            the_page = self.selenium_browser.page_source  # HTML after JavaScript has run
        except Exception:
            the_page = ""  # swallow timeouts/navigation errors and return an empty page
        return the_page

test = requester_firefox()
print(test("http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=").encode("ascii", "ignore"))

It will load SEEK and wait for the AJAX pages. The encode method is necessary (for me at least) because SEEK returns a unicode string which the Windows console seemingly can't print.
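Since the original question asked for all of the article links rather than the raw page source, here is a hedged follow-up sketch building on the same idea. It assumes (not confirmed by the source) that each posting is an anchor whose href contains /job/, trims the URL for brevity, and uses a crude sleep where a WebDriverWait would be more robust:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.get("http://www.seek.com.au/jobs/in-australia/"
            "#dateRange=999&workType=0&keywords=police+check&page=1")  # trimmed for brevity
time.sleep(5)  # crude wait for the AJAX results to render; WebDriverWait would be more robust

# Assumption: each posting links to http://www.seek.com.au/job/<id>,
# so any anchor whose href contains '/job/' is a result link.
links = [a.get_attribute("href")
         for a in browser.find_elements(By.CSS_SELECTOR, "a[href*='/job/']")]
print(links)
browser.quit()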

