Getting ‘wrong’ page source when calling url from python

Problem description

Trying to retrieve the page source from a website, I get a completely different (and shorter) text than when viewing the same page source through a web browser.

https://stackoverflow.com/questions/24563601/python-getting-a-wrong-source-code-of-web-page-asp-net

This fellow has a related issue, but obtained the home page source instead of the requested one - I am getting something completely alien.

The code is:

from urllib import request

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    response = request.urlopen(url)
    return str(response.read())

n = 1006233
text = get_page_source(n)

This is the page I am targeting in this example: https://www.whoscored.com/Matches/1006233/live

The url in question contains rich information in the page source, but I end up getting only the following when running the above code:

text =

b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX,
NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta 
name="viewport" content="initial-scale=1.0"><meta http-equiv="X-
UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;
height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&
xinfo=0-12919260-0 0NNY RT(1462118673272 111) q(0 -1 -1 -1) r(0 -1) 
B12(4,315,0) U2&incident_id=276000100045095595-100029307305590944&edet=12&
cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" 
marginwidth="0px">Request unsuccessful. Incapsula incident ID: 
276000100045095595-100029307305590944</iframe></body></html>'

What went wrong here? Can a server detect a robot even when it has not sent repetitive requests – if yes, how – and is there a way around?

Recommended answer

There are a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Lots of websites do this simply by checking headers to see whether a request is coming from a browser or from a robot. However, this site looks like it uses Incapsula, which is designed to provide more sophisticated protection. You can try to set up your request differently to fool the security on the page by setting headers, but I doubt this will work.

import requests

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    return response.text

n = 1006233
text = get_page_source(n)
print(text)
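
For comparison, the same header trick can be applied to the urllib code from the question. By default urllib identifies itself with a User-Agent of the form Python-urllib/x.y, which is one simple way a server can spot a script even on its first request. The sketch below is illustrative only: the helper names default_user_agent and build_request are made up here, nothing is sent over the network, and a browser-like header may well not be enough against Incapsula.

```python
from urllib import request

# Same browser-like User-Agent string as in the requests example.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/39.0.2171.95 Safari/537.36')

def default_user_agent():
    """Return the User-Agent urllib sends when none is set."""
    opener = request.build_opener()
    return dict(opener.addheaders)['User-agent']

def build_request(n, user_agent=BROWSER_UA):
    """Build (but do not send) a request for match page n."""
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    return request.Request(url, headers={'User-Agent': user_agent})

req = build_request(1006233)
print(default_user_agent())          # starts with 'Python-urllib/'
print(req.get_header('User-agent'))  # the browser-like string
# To actually send it: request.urlopen(req).read().decode('utf-8', errors='replace')
```

Note that urllib stores header names capitalized as 'User-agent', which is why the lookup uses that spelling.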

Looks like the site also uses captchas - which are designed to prevent web scraping. If a site is trying this hard to prevent scraping - it's likely because the data they provide is proprietary. I would suggest finding another site that provides this data - or try and use an official API.
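
Because the protection page arrives as ordinary HTML (urlopen in the question returned it without raising an error), a script can easily mistake it for real data. Below is a rough sketch of a check, assuming the two marker strings visible in the question's output are typical of the block page. The function name looks_blocked is hypothetical, and the markers may differ for other sites or change over time.

```python
# Marker strings taken from the blocked response shown in the question.
BLOCK_MARKERS = ('_Incapsula_Resource', 'Incapsula incident ID')

def looks_blocked(page_source):
    """Return True if the text contains any known block-page marker."""
    return any(marker in page_source for marker in BLOCK_MARKERS)

# Shortened copy of the blocked response from the question.
blocked_sample = ('<html><body><iframe src="/_Incapsula_Resource?CWUDNSAI=24">'
                  'Request unsuccessful. Incapsula incident ID: '
                  '276000100045095595-100029307305590944</iframe></body></html>')

print(looks_blocked(blocked_sample))                  # True
print(looks_blocked('<html><body>ok</body></html>'))  # False
```

Calling looks_blocked(text) on the result of get_page_source would let a script stop early instead of trying to parse the block page.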

Check out this answer from a while back (https://stackoverflow.com/a/17769971/701449). It looks like whoscored.com uses the OPTA API to provide info. You may be able to skip the middleman and go straight to the source of the data. Good luck!
