Extracting href from <a> with Beautiful Soup

Views: 72 | Published: 2020/9/20 5:59:03 | Tags: python, beautifulsoup


Question

I'm trying to extract a link from a Google search result. Inspect Element tells me that the section I'm interested in has class="r". The first result looks like this:

<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
    <a href="https://en.wikipedia.org/wiki/Chocolate" 
       ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://en.wikipedia.org/wiki/Chocolate&amp;ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM" 
       saprocessedanchor="true">
        Chocolate - Wikipedia
    </a>
</h3>

To extract the "href" I do:

import bs4
import requests

res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements = googleSoup.select(".r a")  # anchors inside elements with class "r"
elements[0].get("href")

But I unexpectedly get:

'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'

whereas I want:

"https://en.wikipedia.org/wiki/Chocolate"

The "ping" attribute seems to be confusing it. Any ideas?

Recommended answer

What's happening?

If you print the response content (i.e. res.text) you'll see that you're getting completely different HTML. The page source and the response content don't match.

This is not because the content is loaded dynamically: even in that case, the page source and the response content would be the same. (The HTML you see while inspecting the element, however, is different.)
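To confirm that the selector itself is not the problem, it can be run offline against the markup copied from Inspect Element. This is only a minimal sketch: the snippet is a shortened copy of the HTML in the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the browser's Inspect Element view
# (shortened from the question; the "ping" value is abbreviated).
inspected_html = """
<h3 class="r">
    <a href="https://en.wikipedia.org/wiki/Chocolate"
       ping="/url?sa=t&amp;source=web&amp;url=https://en.wikipedia.org/wiki/Chocolate">
        Chocolate - Wikipedia
    </a>
</h3>
"""

soup = BeautifulSoup(inspected_html, "html.parser")
first_link = soup.select(".r a")[0]

# "ping" is just another attribute; reading "href" is unaffected by it.
href = first_link.get("href")
print(href)
```

If this prints the Wikipedia URL while the live request does not, the difference lies in the HTML the server returned, not in the selector or the "ping" attribute.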

A basic explanation for this is that Google recognizes the Python script and changes its response.
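One signal a server can use to tell a script from a browser is the User-Agent header. How Google actually classifies requests is not stated in this answer, so the following is only an illustration of what requests sends by default. It builds the request without sending it, so no network access is needed.

```python
import requests

# Build (but do not send) a request so the outgoing headers can be inspected.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://www.google.com/search", params={"q": "chocolate"})
)

# With no custom headers, requests identifies itself as "python-requests/<version>".
default_agent = prepared.headers["User-Agent"]
print(default_agent)
```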

You can pass a fake User-Agent header to make the script look like a real browser request.

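import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent, so the request does not identify itself as python-requests.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package

elements = soup.select('.r a')
print(elements[0]['href'])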

Output:

https://en.wikipedia.org/wiki/Chocolate
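If a response still contains redirect-style links such as the '/url?q=...' value shown in the question, the target can be recovered from the query string instead. This is a minimal sketch using only the standard library, not part of the original answer: the helper name unwrap_google_redirect is hypothetical, and the "q" parameter is assumed from the example in the question.

```python
from urllib.parse import urlsplit, parse_qs


def unwrap_google_redirect(href):
    """Return the target of a '/url?q=...' style link, or href unchanged."""
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs maps each parameter name to a list of decoded values.
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    return href


# The value returned in the question.
wrapped = ("/url?q=https://en.wikipedia.org/wiki/Chocolate"
           "&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd")
print(unwrap_google_redirect(wrapped))
```

Links that are not wrapped this way are returned unchanged, so the helper can be applied to every href that the selector yields.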

Resources:

Sending "User-agent" using Requests library in Python
How to use Python requests to fake a browser visit?
Using headers with the Python requests library's get method
