Web scraping DuckDuckGo, but getting the links in the wrong format


Problem description

I created a Python 3 script using the BeautifulSoup library. It queries the DuckDuckGo search engine using a URL of the form https://duckduckgo.com/?q=searchterm and then displays all the websites on the first page of results.

Here is the code, which runs without errors:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://duckduckgo.com/html/?q=test')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs={'class':'result__a'})

i = 0
while i < len(results):
    link = results[i]
    url = link['href']
    print(url)
    i = i + 1

The problem is that I am not getting the URLs in the proper format (for example, https://www.google.com). Instead, all my URLs come back in the format of a search query.

Here is what I get when I search for test on DuckDuckGo:

/l/?kh=-1&uddg=https%3A%2F%2Fduckduckgo.com%2Fy.js%3Fu3%3Dhttps%253A%252F%252Fr.search.yahoo.com%252Fcbclk%252FdWU9MEQwQzVENEZDNDU0NDlEMyZ1dD0xNTM4MzE4MTI3MzE5JnVvPTc3NTg0MzM1OTYxMTUyJmx0PTImZXM9ZVBGTU9iWUdQUy42cVdRVQ%252D%252D%252FRV%253D2%252FRE%253D1538346927%252FRO%253D10%252FRU%253Dhttps%25253a%25252f%25252fwww.bing.com%25252faclick%25253fld%25253dd3peyDLOVSWraifG78tpZ1GjVUCUzCMDkx%252DfJrFXeY2IfiXIwUmngX%252DYKvZWQ6q7hPHC_3kc%252DzBWS1SE015Or2c3CncFMVc9OjVV5OyB2kJqXdRsOzRnaCGy8gYCPuival0gLe7WCkfk_%252DAVKTWmYxranfh02ficTC7i6oC38n2q9U9KPe%252526u%25253dhttps%2525253a%2525252f%2525252fwww.dotdrugconsortium.com%2525252f%2525253futm_source%2525253dbing%25252526utm_medium%2525253dcpc%25252526utm_campaign%2525253dadcenter%25252526utm_term%2525253ddottest%252526rlid%25253d590f68ae34ff126ed0e3331eebd0c4fb%252FRK%253D2%252FRS%253DeKe3rY19jdg9vb_ayBSboMzPU1g%252D%26ad_provider%3Dyhs%26vqd%3D3%2D12729109948094676568590283448597440227%2D122882305188756590950269013545136161936
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fdictionary%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speedtest.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.dictionary.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thefreedictionary.com%2Ftest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.speakeasy.net%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fwww.humanmetrics.com%2Fcgi%2Dwin%2Fjtypes2.asp
/l/?kh=-1&uddg=https%3A%2F%2Fwww.typingtest.com%2F%3Fab
/l/?kh=-1&uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTest_cricket
/l/?kh=-1&uddg=https%3A%2F%2Fged.com%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.xfinity.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.16personalities.com%2Ffree%2Dpersonality%2Dtest
/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fthesaurus%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Ftest%2Dipv6.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.thesaurus.com%2Fbrowse%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.att.com%2Fspeedtest%2F
/l/?kh=-1&uddg=http%3A%2F%2Fspeedtest.googlefiber.net%2F
/l/?kh=-1&uddg=http%3A%2F%2Ftest.salesforce.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fmy.uscis.gov%2Fprep%2Ftest%2Fcivics
/l/?kh=-1&uddg=https%3A%2F%2Fwww.tests.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fen.wiktionary.org%2Fwiki%2FTest
/l/?kh=-1&uddg=https%3A%2F%2Ftestmy.net%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.google.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fwww.queendom.com%2Ftests%2Findex.htm
/l/?kh=-1&uddg=http%3A%2F%2Fwww.yourdictionary.com%2Ftest
/l/?kh=-1&uddg=http%3A%2F%2Fwww.testout.com%2F
/l/?kh=-1&uddg=https%3A%2F%2Fimplicit.harvard.edu%2Fimplicit%2Ftakeatest.html
/l/?kh=-1&uddg=http%3A%2F%2Fwww.act.org%2Fcontent%2Fact%2Fen%2Fproducts%2Dand%2Dservices%2Fthe%2Dact.html
/l/?kh=-1&uddg=https%3A%2F%2Fwww.ets.org%2Fgre%2F

I would like to know if there is a way to display all these URLs in the standard format.

Edit: This is not a duplicate of my other question. There, I was told that the PyCurl library would not get me what I want (it was not able to catch the JavaScript code in the URLs). Here my code runs, but the output is not what I expect.

Solution

Python's urllib.parse module can help here:

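from bs4 import BeautifulSoup
import urllib.parse
import requests

r = requests.get('https://duckduckgo.com/html/?q=test')
soup = BeautifulSoup(r.text, 'html.parser')
# href=True skips any matching <a> tags that have no href attribute
results = soup.find_all('a', attrs={'class':'result__a'}, href=True)

for link in results:
    url = link['href']
    # Split the redirect link into components (path, query, ...)
    o = urllib.parse.urlparse(url)
    # Parse the query string into a dict of name -> list of decoded values
    d = urllib.parse.parse_qs(o.query)
    # The real target URL is carried in the uddg parameter
    print(d['uddg'][0])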

This would display output starting with:

http://www.speedtest.net/
https://www.merriam-webster.com/dictionary/test
https://en.wikipedia.org/wiki/Test
https://www.thefreedictionary.com/test
https://www.dictionary.com/browse/test

First use urlparse() to split the URL into its components. Take the query string from the result and pass it to parse_qs(), which returns a dictionary mapping each parameter name to a list of percent-decoded values. You can then extract the link using the uddg key.
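
The parsing steps can also be tried without a network request, using a few of the href values from the question as input. The following is only a minimal sketch: the helper name extract_target is hypothetical (it is not part of the original answer), and falling back to the unchanged href when no uddg parameter is present is an assumption about how you might want to handle links that are not redirects.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the URL stored in the uddg parameter of a redirect link.

    Hypothetical helper: if the link has no uddg parameter, the href
    is returned unchanged (an assumed fallback, not DuckDuckGo behaviour).
    """
    query = urlparse(href).query   # e.g. 'kh=-1&uddg=https%3A%2F%2F...'
    params = parse_qs(query)       # values are percent-decoded lists
    return params.get('uddg', [href])[0]


# Sample hrefs copied from the output shown in the question
samples = [
    '/l/?kh=-1&uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fdictionary%2Ftest',
    '/l/?kh=-1&uddg=https%3A%2F%2Fwww.typingtest.com%2F%3Fab',
]

for href in samples:
    print(extract_target(href))
```

Using params.get() with a default avoids the KeyError that d['uddg'] would raise on a link without that parameter.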
