Is this site not suited for web scraping using beautifulsoup?


Problem Description

I am trying to use BeautifulSoup to get the odds for each match on the following site:

https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches

The goal is to end up with some kind of text file containing the following:

Match1, Team1, Odds for team1 winning, Team2, Odds for team2 winning
Match2, Team1, Odds for team1 winning, Team2, Odds for team2 winning

And so on...

I am new to BeautifulSoup, so things already go wrong at a very elementary level. My approach is to "walk" through the HTML tree until I arrive at the div tag that I can see contains all the matches. This works well until I hit a div tag with class="sgd-wrapper"; see the picture below for clarification.

(picture omitted)

The following is my code; neither m1 nor m2 works. Python just responds with None.

from bs4 import BeautifulSoup as bs
import requests as res

#Load the webpage content
r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text

#Convert to a beautiful soup object
soup = bs(r,'lxml')

# Walk down the tree step by step until the wrapper div
m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"}).find("div", attrs={"class": "sgd-wrapper"})
# Look for the wrapper div directly
m2 = soup.find("div", attrs={"class": "sgd-wrapper"})

If I remove the last find from m1, or redefine m2 as follows:

m1 = soup.find("div", attrs={"id": "wrapper"}).find("div", attrs={"class": "page-box"}).find("div", attrs={"class": "page-area"}).find("div", attrs={"id": "oddset-nashville"})
m2 = soup.find("div", attrs={"id": "oddset-nashville"})

then I do get a result:

print(m1)
<div data-digital-portal-loader-url="https://assets.sb.danskespil.dk/front-end/digitalPortal.js?noCache=20201011001813" id="oddset-nashville"></div>

Can someone explain to me why this div with class="sgd-wrapper" is so special?

Answer

The problem is in r = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text

The Python requests library just sends your HTTP/HTTPS request to the server and returns the raw HTML; it does not load further resources such as images and scripts. That means some elements are only created by JavaScript (for example, a script creates an element, sets its class name, and inserts it into the DOM tree).
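A quick way to confirm this (a minimal sketch, reusing the request from the question) is to search the raw HTML returned by requests for the class name:

import requests as res

raw = res.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches').text
# The class is only added later by JavaScript, so this likely prints False
print('sgd-wrapper' in raw)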

As another example: if you GET main.html via requests, main.js is never loaded, so the class of the div t1 will never be set to sgd-wrapper:

# main.html
<html>
   <body>
      <div id="t1"></div>
      <script src="main.js"></script>
   </body>
</html>

# in main.js
document.querySelector('#t1').classList.add('sgd-wrapper');

What you need to do is use headless Chrome (e.g. launch it with google-chrome --headless) and use the Chrome API to hook into page-load events, then dump the complete rendered contents.
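One way to do that from Python is Selenium driving headless Chrome (a different way to drive the browser than the raw Chrome API mentioned above). The sketch below assumes selenium and a matching ChromeDriver are installed; waiting 30 seconds for the sgd-wrapper class is my assumption about when the page is fully rendered:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Launch Chrome without a visible window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches')
    # Wait until the JavaScript has inserted the wrapper div
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'sgd-wrapper'))
    )
    # Hand the fully rendered HTML to BeautifulSoup as before
    soup = bs(driver.page_source, 'lxml')
    m2 = soup.find('div', attrs={'class': 'sgd-wrapper'})
    print(m2 is not None)
finally:
    driver.quit()

From soup you can then walk the tree exactly as in the question.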
