Taking all <a> tags from <div> tags with a specific class

Question

I was debugging the lucky.py code from "Automate the Boring Stuff with Python" to get it working. The main problem is that the author's code no longer works (it is probably outdated). The script takes a command-line argument and opens the first five (or fewer) Google search results for that argument in new tabs. The original code extracts all <a> tags with the 'r' class. However, Google no longer puts the 'r' class on the search result hyperlinks themselves; it now wraps that same <a> tag in a div with the 'r' class.

This is what the original code does:

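import sys
import webbrowser

import bs4
import requests

# Search Google for the command-line arguments
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

# Open up to five results, each in a new browser tab
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))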

I've tried taking all the <a> tags encased directly within divs, but I can't find a way to extract only the <a> tags encased directly within tags that have the 'r' class.

Here are some things I have tried, but they don't work properly:

linkElems = soup.select('.r div > a')

And this, since all the tags I want have ping attributes that begin with '\url':

linkElems = soup.select('a')
for link in linkElems:
    # .get() with a default avoids an error on links without a ping attribute
    if link.attrs.get('ping', '').startswith('\\url'):
        ...

Recommended answer

TL;DR: Google sends a different HTML response when the request comes from a Python script.

If you actually print the linkElems variable, you will see that it is empty. I think the reason is that Google changes its HTML based on a number of HTTP headers. In layman's terms, the HTML you see in the browser is not the HTML you get when you make a GET request from Python.
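One way to find out which class wraps the result links in the HTML that Python actually receives is to count the classes of the elements that directly contain them. This is only a rough sketch: the helper name parent_classes_of_result_links is made up for illustration, and it assumes result links have an href starting with '/url?', which matches the output in this answer but may not hold for every response.

```python
from collections import Counter

import bs4


def parent_classes_of_result_links(html):
    """Count the classes of elements that directly contain '/url?' links.

    Hypothetical helper: it assumes result links have an href that starts
    with '/url?', which may not be true for every response.
    """
    soup = bs4.BeautifulSoup(html, 'html.parser')
    counts = Counter()
    for a in soup.select('a[href^="/url?"]'):
        # 'class' is a multi-valued attribute, so bs4 returns it as a list
        counts.update(a.parent.get('class', []))
    return counts
```

Calling it on res.text and printing the most common entries should point to a candidate class to use in soup.select().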

For now, you can use linkElems = soup.select('.jfp3ef > a'), which works. It selects all the <a> tags that are immediate children of elements with the class jfp3ef. That is the class Google seems to use instead of r when the request comes from Python. I would not put this in production, though, because the class name may change at any time.
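To make the difference between the selectors concrete, here is a small sketch that runs them against a hand-written snippet. The markup is hypothetical and is not Google's real HTML; it only assumes that one link sits directly inside the element with class r and another is nested one level deeper.

```python
import bs4

# Hypothetical markup, not Google's real HTML: one link is a direct child
# of the div with class "r", the other is nested inside a <span>.
SAMPLE_HTML = """
<div class="r">
  <a href="/url?q=https://example.com/direct">direct child</a>
  <span><a href="/url?q=https://example.com/nested">nested</a></span>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Descendant combinator (a space): every <a> anywhere inside .r
descendant_hrefs = [a.get('href') for a in soup.select('.r a')]

# Child combinator (>): only <a> tags whose parent is the .r element itself
child_hrefs = [a.get('href') for a in soup.select('.r > a')]

# The selector from the question: <a> children of a div *inside* .r.
# There is no such inner div in this markup, so nothing matches.
inner_div_hrefs = [a.get('href') for a in soup.select('.r div > a')]

print(descendant_hrefs)
print(child_hrefs)
print(inner_div_hrefs)
```

If the real page has the same shape as this snippet, '.r div > a' from the question comes back empty because it looks for a div nested inside the .r element rather than the .r element itself.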

A better and more reliable solution is to use the Google Search API. But since you are doing this to learn, the hack above should be fine.

Code:

import bs4
import requests

res = requests.get('http://google.com/search?q=test')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.jfp3ef > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    print('http://google.com' + linkElems[i].get('href'))

Output:

http://google.com/url?q=https://www.speedtest.net/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAKegQIChAB&usg=AOvVaw0mhIK0jUq5fUfhEJTuA90h
http://google.com/url?q=https://fast.com/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjALegQICRAB&usg=AOvVaw3WERIy0Wo_UNyqmNAVBCeZ
http://google.com/url?q=https://openspeedtest.com/Get-widget.php&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAMegQICBAB&usg=AOvVaw1161mhQBhD75gfmsIzzg4n
http://google.com/url?q=https://www.meter.net/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjANegQIBxAB&usg=AOvVaw2Z3xTSmhoxz6VS7MYAaS2x
http://google.com/url?q=https://speedtest.telstra.com/&sa=U&ved=2ahUKEwjP9eumr97jAhX2GLkGHbGoDuoQFjAOegQIARAB&usg=AOvVaw36SosexF66e8fQUWIG14mZ
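Each link in this output is a Google redirect that carries the real destination in its q parameter. If you would rather have the destination itself, something like the following should work. It is a sketch using only the standard library; unwrap_google_redirect is a hypothetical helper name, and it assumes the redirect keeps the /url?q=... shape shown above.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(link):
    """Return the 'q' target of a /url?q=... link, or the link unchanged.

    Hypothetical helper: it assumes the redirect format seen in the
    output above, which Google may change.
    """
    parts = urlsplit(link)
    if parts.path != '/url':
        return link
    targets = parse_qs(parts.query).get('q')
    return targets[0] if targets else link


print(unwrap_google_redirect('http://google.com/url?q=https://fast.com/&sa=U'))
```

It also accepts the relative href returned by linkElems[i].get('href'), since only the path and query string are inspected.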
