Retrieving scripted page urls via web scrape
Question
I'm trying to get all of the article links from a web-scraped search query, but I don't seem to get any results.
My approach: I'm trying to get the ids of the articles and then append them to the already-known URL (http://www.seek.com.au/job/ + id). However, there are no ids in the response I retrieve with the requests package (http://docs.python-requests.org/en/latest/); in fact, there are no articles at all.
It seems that in this particular case I need to somehow execute the scripts that generate the ids in order to get the full data. How could I do that?
Or maybe there are other ways to retrieve all of the results from this search query?
Answer
As mentioned, download Selenium. There are Python bindings.
Selenium is a web-testing automation framework. In effect, by using Selenium you are remote-controlling a web browser. This is necessary because web browsers have JavaScript engines and DOMs, which is what allows AJAX to run.
Use this test script (it assumes you have Firefox installed; Selenium supports other browsers if needed):
# Import 3rd-party libraries
from selenium import webdriver


class requester_firefox(object):
    def __init__(self):
        # Start a remote-controlled Firefox instance
        self.selenium_browser = webdriver.Firefox()
        self.selenium_browser.set_page_load_timeout(30)

    def __del__(self):
        self.selenium_browser.quit()
        self.selenium_browser = None

    def __call__(self, url):
        # Fetch the URL and return the rendered page source,
        # or an empty string on any failure
        try:
            self.selenium_browser.get(url)
            the_page = self.selenium_browser.page_source
        except Exception:
            the_page = ""
        return the_page


test = requester_firefox()
print(test("http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=").encode("ascii", "ignore"))
It will load SEEK and wait for the AJAX pages. The encode method is necessary (for me at least) because SEEK returns a unicode string which the Windows console seemingly can't print.
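Once Selenium has returned the rendered page source, the ids can be pulled out and appended to the known URL as the question describes. Below is a minimal standard-library sketch of that step; the `/job/<id>` href pattern and the sample HTML are assumptions, since SEEK's real markup may differ.

```python
import re
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collects job ids from <a href="/job/<id>"> links in the page source."""

    def __init__(self):
        super().__init__()
        self.job_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                match = re.search(r"/job/(\d+)", value)
                if match:
                    self.job_ids.append(match.group(1))


def extract_job_urls(page_source):
    """Append each found id to the already-known base URL."""
    parser = JobLinkParser()
    parser.feed(page_source)
    return ["http://www.seek.com.au/job/" + job_id for job_id in parser.job_ids]


# Stand-in for the page_source Selenium would return:
sample = '<a href="/job/12345">Job A</a> <a href="/job/67890">Job B</a>'
print(extract_job_urls(sample))
```

In practice you would pass `test(url)` (the page source from the Selenium snippet above) to `extract_job_urls` instead of the sample string.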