Scraping DuckDuckGo with Python 3.6

Question

A simple question. I can scrape results from the first page of a DuckDuckGo search. However, I am struggling to get onto the second and subsequent pages. I have used Python with the Selenium WebDriver, which is fine for the first page of results. The code I have used to scrape the first page is:

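from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()  # assumption: any WebDriver would do here
results_url = "https://duckduckgo.com/?q=paralegal&t=h_&ia=web"
browser.get(results_url)
# Print the text of every element with id "links" on the first page.
results = browser.find_elements(By.ID, 'links')
for result in results:
    print(result.text)
print(len(results))

# Attempt to reach the next page; this is the part that does not work.
nxt_page = browser.find_element(By.LINK_TEXT, "Load More")
if nxt_page:
    nxt_page.send_keys(Keys.PAGE_DOWN)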

There are line breaks indicating the start of a new page, but they do not appear to alter the URL, so I tried the above to move down the page and then repeat the code for finding the links on the next page. However, it does not work. Any help would be very much appreciated.

Recommended answer

If I search for "Load More" in the source code of the results page, I can't find it. Did you try using the non-JavaScript version?

You can use it by simply adding html to the URL: https://duckduckgo.com/html?q=paralegal&t=h_&ia=web There you can find the Next button at the end of the page.
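As a small illustration of that URL change, the sketch below builds the non-JavaScript results URL for an arbitrary query. The helper name `build_html_search_url` is made up for this example, and the `t` and `ia` values are simply copied from the URL above rather than taken from any documentation.

```python
from urllib.parse import urlencode

# Non-JavaScript endpoint mentioned in the answer.
HTML_ENDPOINT = "https://duckduckgo.com/html"


def build_html_search_url(query):
    """Return the non-JavaScript results URL for a query (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the query;
    # the "t" and "ia" values are copied from the URL in the answer.
    params = {"q": query, "t": "h_", "ia": "web"}
    return HTML_ENDPOINT + "?" + urlencode(params)


print(build_html_search_url("paralegal"))
```

The resulting string can be passed straight to `browser.get(...)`.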

This works for me (Chrome version):

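from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
results_url = "https://duckduckgo.com/html?q=paralegal&t=h_&ia=web"
browser.get(results_url)
# Print the text of every element with id "links" on the page.
results = browser.find_elements(By.ID, 'links')
for result in results:
    print(result.text)
print(len(results))
# find_elements returns an empty list when there is no Next button (last page).
nxt_page = browser.find_elements(By.XPATH, '//input[@value="Next"]')
if nxt_page:
    # Scroll the button into view before clicking it.
    browser.execute_script('arguments[0].scrollIntoView();', nxt_page[0])
    nxt_page[0].click()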
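To go beyond the second page, the same steps can be repeated in a loop. The following is only a sketch under a few assumptions: the function name `scrape_pages` and the `max_pages` limit are invented for illustration, the page is assumed to keep the `links` id and the `Next` button on every results page, and the locator strategies are passed as the plain strings `"id"` and `"xpath"` (the values behind Selenium's `By.ID` and `By.XPATH`) so that the function itself does not import Selenium.

```python
def scrape_pages(browser, max_pages=3):
    """Collect the text of the results container from up to max_pages pages."""
    pages = []
    for _ in range(max_pages):
        # "id" and "xpath" are the string values of By.ID and By.XPATH.
        for container in browser.find_elements("id", "links"):
            pages.append(container.text)
        # An empty list (rather than an exception) means there is no Next button.
        next_buttons = browser.find_elements("xpath", '//input[@value="Next"]')
        if not next_buttons:
            break  # assumed to be the last results page
        browser.execute_script("arguments[0].scrollIntoView();", next_buttons[0])
        next_buttons[0].click()
    return pages


def run_with_chrome():
    """Example driver; needs Selenium and a Chrome driver installed."""
    from selenium import webdriver

    browser = webdriver.Chrome()
    try:
        browser.get("https://duckduckgo.com/html?q=paralegal&t=h_&ia=web")
        for text in scrape_pages(browser):
            print(text)
    finally:
        browser.quit()
```

Calling `run_with_chrome()` opens the browser and prints the collected text. The elements are looked up again on every iteration because references obtained before a click generally become stale once the next page has loaded.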

By the way, DuckDuckGo also provides a nice API, which is probably much easier to use ;)
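The answer does not say which API is meant. Assuming it is the Instant Answer API at `https://api.duckduckgo.com/` with `format=json`, a request could be sketched as below using only the standard library. The function names are invented for this example. As far as I know, this API returns instant-answer data such as abstracts and related topics rather than the full list of web results, so it may not replace scraping for every use case.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Instant Answer API endpoint; not named in the answer itself.
API_ENDPOINT = "https://api.duckduckgo.com/"


def build_api_url(query):
    """Return the request URL for a query, asking for a JSON response."""
    return API_ENDPOINT + "?" + urlencode({"q": query, "format": "json"})


def fetch_instant_answer(query, timeout=10):
    """Fetch and decode the JSON response for a query (performs a network call)."""
    with urlopen(build_api_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_instant_answer("paralegal")` should return a dictionary whose contents depend on what the service knows about the query.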

Edit: fixed the selector for the next-page link, which selected the Prev button on the second results page (thanks to @kingbode).
