Unexpected result from PHP request


Problem description

I'm trying to build a small app that notifies me when a website (my school's, actually) changes, i.e. when its HTML code is modified.

I'm trying to get the HTML code of the following website. http://www.tivon.ort.org.il/%D7%9E%D7%94-%D7%97%D7%93%D7%A9-1/

I've tested several ways to do this in PHP, including:

$html = file_get_contents(URL);
//OR
$html = file_get_html(URL); //Using http://simplehtmldom.sourceforge.net/

As well as using cURL.

All these ways return the following HTML, which isn't the HTML of the page I'm trying to get:

<html><head><meta charset="utf-8"></head><body><script src="//d1a702rd0dylue.cloudfront.net/js/iealml-03/10800.js"></script><script>window.rbzns = {}; rbzns.hosts="schools.ort.org.il www.achva.ort.org.il achva.ort.org.il acotech.ort.org.il afula.ort.org.il afulaalon.ort.org.il www.aliya2.ort.org.il aliya2.ort.org.il arad.ort.org.il www.astro.ort.org.il astro.ort.org.il www.bamaale.ort.org.il bamaale.ort.org.il www.bialik.ort.org.il bialik.ort.org.il dafna.ort.org.il eshkolakko.ort.org.il www.ganyavne.ort.org.il ganyavne.ort.org.il geha-edu.org.il www.geula.ort.org.il geula.ort.org.il www.givatayim.ort.org.il givatayim.ort.org.il givatram.ort.org.il www.guttman.ort.org.il guttman.ort.org.il www.hazor.ort.org.il hazor.ort.org.il hof-carmel.org.il www.hof-carmel.org.il www.holon.ort.org.il holon.ort.org.il www.igalalon.ort.org.il igalalon.ort.org.il www.kramim.ort.org.il kramim.ort.org.il www.lilienthal.ort.org.il lilienthal.ort.org.il lodtech.ort.org.il motzkin.ort.org.il neriya.ort.org.il www.orenafula.ort.org.il orenafula.ort.org.il www.ormat.ort.org.il ormat.ort.org.il www.oumbatin.ort.org.il oumbatin.ort.org.il www.psagot.ort.org.il psagot.ort.org.il www.rogozin.ort.org.il rogozin.ort.org.il www.sajur.ort.org.il sajur.ort.org.il sapirextra.ort.org.il www.shamir.ort.org.il www.sharet.ort.org.il shemer.ort.org.il www.spanian.ort.org.il spanian.ort.org.il tarshiha.ort.org.il technology.ort.org.il www.technology.ort.org.il www.tivon.ort.org.il tivon.ort.org.il www.ulpanit.ort.org.il ulpanit.ort.org.il www.yadshapira.ort.org.il yadshapira.ort.org.il www.yeshmaalot.ort.org.il yeshmaalot.ort.org.il yeshtveria.ort.org.il www.kugel.org.il roz.ort.org.il ylb.ort.org.il tzurarad.ort.org.il www.hilmi.ort.org.il oma.ort.org.il hauashle.ort.org.il vilnai.ort.org.il sheandati.ort.org.il ronsonc.ort.org.il afek.ort.org.il www.dekelvilnae.ort.org.il www.mevoot-eron.org yami-ashdod.ort.org.il www.sheanklali.ort.org.il molada.ort.org.il www.melton.ort.org.il www.sallama.ort.org.il 
www.telnof.ort.org.il ortlaaoc.ort.org.il www.shapira.ort.org.il www.bgg.co.il www.ebin.ort.org.il darski.ort.org.il www.iai.ort.org.il modiin.ort.org.il www.modiin.ort.org.il ortmodiin.ort.org.il neve-sara.ort.org.il ort-yadin.ort.org.il www.lod.ort.org.il"; rbzns.ctrbg="L2Pfvthe2b9jPQUWp0ZxIu248ov5v83+GtxsvLzg1jjDmPckhvTjr0FM3NAO4BEKVXI7AgAz1PMMI2MlLtJDnajFt+6HZ3Zi99Z55YvMvU8ardvckHHwI8/O+x3DhYi0YjF7irWG0sgbbUEDU6m8JdUZsvvzDHnJiVyP7XeiY+gpZM6WCIrZ+NhhuWfwAuvNS5UY6mazB+ZIhvkNA+RObxAUD5VHeqzh8WJIVFYorZ4RCohU28Q2jjbtKqHn7wdJ";rbzns.rbzreqid="2e6d1f6c31343232323037373231cb23df000c96b36c"; winsocks(true);</script></body></html>

I did manage to get the HTML of other websites with these approaches, but not of the particular site I actually need. From my understanding, it's somehow "protected" against bots.

Is there any way around this unnecessary "protection"? Any hints?

Solution

When you visit the website for the first time, it sets a cookie named rbzid. You have to store this cookie and send it back on later requests; curl can do this with a cookie jar. The site also remembers your user-agent. I'm not sure whether it also checks that the user-agent belongs to a browser. I don't think so, but it may. In any case, you have to keep the same user-agent across requests; it's probably encoded in the cookie somehow.
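In PHP, the cookie-jar idea might look roughly like the sketch below. It assumes the php-curl extension is installed and that the rbzid cookie arrives in a Set-Cookie response header; if the site's script sets it on the client side instead, a cookie jar alone will not capture it. The function names, the jar path and the user-agent string are made up for illustration.

```php
<?php
// Sketch only: keep one cookie jar and one user-agent for every request.
// Assumes php-curl; all names here are hypothetical.

/**
 * Build the cURL options shared by every request in the session.
 */
function buildSessionOptions(string $userAgent, string $cookieJarPath): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,           // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow redirects, if any
        CURLOPT_USERAGENT      => $userAgent,     // must stay identical across requests
        CURLOPT_COOKIEJAR      => $cookieJarPath, // received cookies are written here
        CURLOPT_COOKIEFILE     => $cookieJarPath, // stored cookies are sent from here
    ];
}

/**
 * Fetch a URL with the shared session options.
 * Returns the response body, or null if the request failed.
 */
function fetchWithSession(string $url, string $userAgent, string $cookieJarPath): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildSessionOptions($userAgent, $cookieJarPath));
    $body = curl_exec($ch);
    unset($ch); // releasing the handle flushes cookies to the jar file
    return is_string($body) ? $body : null;
}
```

Calling fetchWithSession() twice with the same jar path and user-agent would let the second request send back whatever cookies the first one received.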

You can verify this as follows: open the website in your browser, check the value of the rbzid cookie, and copy your browser's user-agent. Then run this in a terminal, replacing user-agent and cookie with the values you copied:

curl "http://www.tivon.ort.org.il/%D7%9E%D7%94-%D7%97%D7%93%D7%A9-1/" -A "user-agent" --cookie rbzid=cookie
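The same check could be done from PHP, as in this rough sketch. It assumes php-curl is available, and the function names are hypothetical. The isChallengePage() helper simply looks for the rbzns marker that appears in the response quoted in the question; that is a heuristic, not a documented signal.

```php
<?php
// Sketch only: PHP counterpart of the curl command, with a known cookie value.
// Assumes php-curl; all function names are made up for illustration.

/**
 * Heuristic: the challenge response quoted in the question contains "rbzns".
 */
function isChallengePage(string $html): bool
{
    return strpos($html, 'rbzns') !== false;
}

/**
 * Request a URL while sending a user-agent and rbzid value copied from a browser.
 * Returns the response body, or null if the request failed.
 */
function fetchWithKnownCookie(string $url, string $userAgent, string $rbzid): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,              // return the body as a string
        CURLOPT_USERAGENT      => $userAgent,        // same user-agent as the browser
        CURLOPT_COOKIE         => 'rbzid=' . $rbzid, // cookie value copied from the browser
    ]);
    $body = curl_exec($ch);
    unset($ch); // release the handle
    return is_string($body) ? $body : null;
}
```

If fetchWithKnownCookie() returns a body for which isChallengePage() is false, the cookie and user-agent pair was presumably accepted.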
