Python BeautifulSoup not scraping this URL


Question

I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am fairly sure the code itself is fine, because it works with other statistics websites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    # Fetch the page and parse the HTML that the server returns.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


soup = make_soup("https://www.whoscored.com/Regions/252/Tournaments/7/Seasons/6365/Stages/13832/PlayerStatistics/England-Championship-2016-2017")
# Print the text of every table row found in the parsed page.
for record in soup.find_all('tr'):
    print(record.text)

Recommended answer

Short answer: The player data you are looking for is NOT in that URL.

Then you might ask: why? I've seen the stats on that page, so how come they're not there?

So I'll try to explain what happens when you browse that URL with a modern browser such as Chrome.

You: Type in the URL and hit Enter.

Chrome: Gotcha. I'll get that page for you ASAP, just a second. (fetching content from that URL) Great, now I have it! But wait, let me read/parse it first before I show it to you. (reading what's inside the content) Oh crap, this JavaScript tells me to get additional information from another URL. OK, I'll do it. Oh wait, here's another one telling me to load an ad in the header. Well, I don't like it, but I'm just going to do what I'm told. Just a second, this CSS tells me to display player names in bold. OK, not bad. Oh, here's another photo from URL xxx that I need to load, no problem... Oh man, how much stuff is there for me to process? I'm not happy with this website... (working on a bunch of other stuff...) Finally, everything's ready! Now check it out!

You: Player xxx is actually quite good, I'll check him out. (clicks player xxx)

Chrome: ......

As you can see, every time you browse a web page, the browser does a lot of "behind the scenes" work to display it. So basically: URL entered >> content fetched from URL >> content parsed >> additional content fetched >> everything rendered >> page displayed (one or more steps may happen simultaneously).

Your code only does the "content fetched from URL" step: urllib downloads the HTML and BeautifulSoup parses it, but neither of them runs JavaScript. The stats you want happen to be "additional content" that is loaded from elsewhere, and that's why you got nothing.
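To make this concrete, here is a minimal sketch of what a parser sees when the rows are filled in by a script. The HTML below is a made-up stand-in, not the real page, and the `loadStats` call inside it is purely hypothetical. It uses the standard library's `html.parser` so it runs without extra installs; BeautifulSoup with `"html.parser"` would report the same thing.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what the server returns: an empty table plus a
# script that would add the rows later, but only inside a real browser.
INITIAL_HTML = """
<html><body>
  <table id="player-stats"><tbody></tbody></table>
  <script>
    // In a browser, this would request the stats and add table rows.
    loadStats("/some/other/url");
  </script>
</body></html>
"""


class RowCounter(HTMLParser):
    """Counts the <tr> start tags present in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    counter = RowCounter()
    counter.feed(html)
    counter.close()
    return counter.rows


# The parser never executes the script, so no rows exist for it to find.
print(count_rows(INITIAL_HTML))  # prints 0
```

The loop in the question behaves the same way: it has nothing to iterate over, so it prints nothing.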

How do I get those stats, then? Once you know the URLs responsible for loading them, simply go after those URLs directly. How do I find them? Well, you can always read the JavaScript... if you are patient enough.

The easiest way to get what you want is to analyze the traffic while that page is loading and find all of the behind-the-scenes requests. I would recommend Fiddler, but you can use any tool you see fit.

Now let's see what happens when you load that page:

There are actually hundreds of requests made to fully render the page you visit, and all you need to do is find out which one feeds the "actual" stats. There's one URL that even has "StatisticsFeed" in it. Could it be the one? Let's take a look:

Exactly! So now what? Simulate this request and parse the content. Since it's already JSON formatted, the built-in json module does the job easily; you don't even need BeautifulSoup.
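As a rough sketch of the parsing step: the payload below is invented for illustration, and the key names `players`, `name` and `goals` are assumptions rather than the feed's real structure. Inspect the actual response in your traffic tool and adjust the keys to match.

```python
import json

# Hypothetical feed body; the real feed's keys and nesting will differ.
SAMPLE_FEED = (
    '{"players": ['
    '{"name": "Player A", "goals": 3}, '
    '{"name": "Player B", "goals": 1}'
    ']}'
)


def extract_rows(feed_text, list_key):
    """Decode the JSON text and return the list stored under list_key."""
    data = json.loads(feed_text)
    return data.get(list_key, [])


# Each row is a plain dict, so no HTML parsing is needed.
for row in extract_rows(SAMPLE_FEED, "players"):
    print(row["name"], row["goals"])
```

Once the text is decoded, each player is an ordinary dictionary, which is usually easier to work with than table rows pulled out of HTML.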

You might ask: how come I get nothing when I browse this link directly? That's because they set a restriction on their server so that only requests with valid headers get the feed. So how do I bypass that? Simulate the request "vividly", with the correct parameters (mostly headers), so that the server believes you.
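A sketch of such a request with `urllib.request` might look like the following. The answer does not say which headers the server checks, so the ones here (typical of a browser's background request) are only an assumption; copy the real values from the request you captured. The feed URL and referer are passed in as parameters because they have to come from your own traffic capture, and `build_feed_request` and `fetch_feed` are names made up for this example.

```python
import json
import urllib.request


def build_feed_request(feed_url, referer):
    """Build a request that carries browser-like headers.

    The header set is an assumption; replace it with the headers seen in
    the captured request.
    """
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(feed_url, headers=headers)


def fetch_feed(feed_url, referer):
    """Send the request and decode the JSON body."""
    request = build_feed_request(feed_url, referer)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the feed URL and page URL taken from your traffic capture:
# stats = fetch_feed(feed_url, page_url)
```

If the server still refuses, compare your request with the browser's header by header; cookies or query parameters may also be required.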
