Can't parse a Google search result page using BeautifulSoup


Problem description

I'm parsing webpages using BeautifulSoup from bs4 in Python. When I inspected the elements of a Google search page, the division containing the 1st result had class = 'r', so I wrote this code:

import requests
from bs4 import BeautifulSoup

# Fetch the Google results page, parse it, and collect every div with class "r"
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5')
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

But the command prompt returned just [].

What could've gone wrong, and how can I correct it?

EDIT 1: I edited my code accordingly by adding the dictionary of headers, yet the result is the same []. Here's the new code:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Same request as before, but now with a browser User-Agent header
site = requests.get('https://www.google.com/search?client=firefox-b-d&ei=CLtgXt_qO7LH4-EP6LSzuAw&q=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&oq=%22narendra+modi%22+%22scams%22+%22frauds%22+%22corruption%22+%22modi%22+-lalit+-nirav&gs_l=psy-ab.3...5077.11669..12032...5.0..0.202.2445.1j12j1......0....1..gws-wiz.T_WHav1OCvk&ved=0ahUKEwjfjrfv94LoAhWy4zgGHWjaDMcQ4dUDCAo&uact=5', headers=headers)
page = BeautifulSoup(site.content, 'html.parser')
results = page.find_all('div', class_="r")
print(results)

NOTE: When I tell it to print the entire page there's no problem, and when I take list(page.children) it works fine.
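Since the page object itself is populated, a quick way to narrow this down is to look at the markup the request actually received. The sketch below is not from the original question; the shortened query URL is only an illustration. If 'class="r"' never appears in the raw HTML, the selector is fine and the served markup is what differs, which typically points at the User-Agent being treated as non-browser traffic.

import requests
from bs4 import BeautifulSoup

# Hypothetical shortened query URL, used only for illustration
url = 'https://www.google.com/search?q=%22narendra+modi%22+%22scams%22'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

site = requests.get(url, headers=headers)
page = BeautifulSoup(site.content, 'html.parser')

# Does the raw HTML contain class="r" anywhere?
print('class="r"' in site.text)

# Which div classes are actually present? (first 20, alphabetically, for a quick look)
classes = [c for d in page.find_all('div') if d.get('class') for c in d.get('class')]
print(sorted(set(classes))[:20])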

Recommended answer

Some websites require the User-Agent header to be set, to block fake requests that don't come from a browser. Fortunately, there's a way to pass headers to the request, like so:

import requests

# Define a dictionary of HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Pass the headers in as a keyword argument (url is the Google search URL from the question)
requests.get(url, headers=headers)

Note: A list of user agents can be found here.
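Putting the answer together, here is a minimal end-to-end sketch (my own illustration, not part of the original answer): it sends the User-Agent header, parses the response, and prints the title and link of each result. The shortened query URL is an assumption, and the div.r / h3 structure reflects Google's markup at the time the question was asked; Google changes its result markup regularly, so the class name may need updating.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
}

# Illustrative query URL; the long URL from the question works the same way
url = 'https://www.google.com/search?q=%22narendra+modi%22+%22scams%22+%22frauds%22'

site = requests.get(url, headers=headers)
page = BeautifulSoup(site.content, 'html.parser')

# With a browser-like User-Agent, each organic result was wrapped in a div with
# class "r" at the time of the question (this selector may need updating today)
for result in page.find_all('div', class_='r'):
    link = result.find('a')
    title = result.find('h3')
    if link and title:
        print(title.get_text(), '->', link.get('href'))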
