Beautiful Soup not pulling all the html of a webpage


Problem description


I'm practicing with BeautifulSoup. I want to pull the image address of a football player's picture from this website: https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652

When I 'inspect' the code, the section that has the img src is below:

    <div class="dataBild">
    <img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
<div class="bildquelle"><span title="imago">imago</span></div>            
</div>

So I was thinking that I could just use BeautifulSoup to find the div with class = "DataBild" as this is unique.

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

player_img = soup.find_all('div', {'class':'dataBild'})
print(player_img)

This runs but it doesn't output anything. So I checked by just running print(soup)

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

print(soup)

This outputs

<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr/><center>nginx</center>
</body>
</html>

So it is obviously not pulling all the HTML from the webpage? Why is this happening? And is my logic of looking for div class = DataBild sound?

Solution

The site seems to inspect whether the User-Agent header of the request is valid.

So you need to add the header like this:

import urllib3
import certifi

url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status)

This prints 200. If you remove the headers, you get 404.

Any non-empty User-Agent value (after trimming whitespace) seems to work.
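With the 404 resolved, the original `dataBild` lookup should work as intended. As a minimal sketch of that extraction step, here the HTML snippet from the question is used as a static stand-in for the live page (so no network request is needed), and the image URL is pulled out of the `div`:

```python
from bs4 import BeautifulSoup

# The markup from the question, used as a static stand-in for the live page
html = '''
<div class="dataBild">
<img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
<div class="bildquelle"><span title="imago">imago</span></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Note: class names are case-sensitive, so it must be 'dataBild', not 'DataBild'
container = soup.find('div', class_='dataBild')
img_url = container.find('img')['src']
print(img_url)
```

Against the live site, `html` would instead be `response.data` from the urllib3 request above (with the `User-Agent` header set); the parsing step is the same.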
