Getting garbage output when scraping a webpage in PHP

Problem description

I am trying to get the contents of a page from Amazon using file_get_html(), but the output comes out as weird characters when I echo it. Can anyone explain how I can resolve this issue?

I also found the following two related questions on Stack Overflow but they did not solve my issue. :)

  1. file_get_html() returns garbage
  2. Uncompress gzip compressed http response

Here is my code:

$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n"
    )
);
$context = stream_context_create($options);

$amazon_url = 'https://www.amazon.com/my-url';
$amazon_html = file_get_contents($amazon_url, false, $context);

Here is the output I get:

��T]o�6}��`���0��݊-��"[�bh�tN�b0��.%%�$P��@�(Ų�� ������F#����A�

About 115k characters like this show up in the browser window.

These are my new headers:

$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n"
    )
);

Will using cURL resolve this issue?

Update:

I tried cURL and I am still getting the garbage output. Here are my response headers:

HTTP/1.1 200 OK
Date: Sun, 18 Nov 2018 20:29:28 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

Can anyone explain the negative votes?

  1. I did my own research.
  2. I found some related questions on Stack Overflow, but they did not solve my problem.
  3. I provided all the information that I thought would be helpful.

What else should I include in the question?

Here is my complete cURL code at present. This is the URL I am scraping.

$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $amazon_url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($handle);
curl_close($handle);

echo $data;

The output is just the same bunch of characters I mentioned above. Here are my request headers:

Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C17650%7CMCMID%7C67056225185486460220940124683302119708%7CMCAID%7CNONE%7CMCOPTOUT-1524907071s%7CNONE; mjx.menu=renderer%3ACommonHTML; _ga=GA1.1.2019605490.1529649408; csm-hit=adb:adblk_no&tb:s-3521C4J8F2EP1V0MMQEP|1542578145652&t:1542578146256
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache

These are from the Network Tab. The response headers are the same as I mentioned above.

Here is the output after adding curl_setopt($handle, CURLOPT_HEADER, 1); to my code:

HTTP/1.1 200 OK
Server: Server
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 7A162B8JKV6MGZQ3PCH2
Vary: Accept-Encoding,User-Agent,X-Amzn-CDN-Cache
Content-Encoding: gzip
x-amz-rid: 7A162B8JKV6MGZQ3PCH2
Cache-Control: no-transform
X-Frame-Options: SAMEORIGIN
Date: Sun, 18 Nov 2018 22:42:51 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: x-wl-uid=1a4u8+XgF+IhFF/iavy9mKZCAA0g4HiIYZXR8hKjxGtmOtBW+j67wGABv7ZOTxDRcab+7Qmpjqds=;

Solution

Here's the solution:

I ran into the same issue when scraping Amazon. Simply add the following option before sending your cURL request:

curl_setopt($handle, CURLOPT_ENCODING, 'gzip,deflate,sdch');
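For context, the `Content-Encoding: gzip` header in the question's cURL output suggests the "garbage" is simply a compressed body printed as-is. Below is a minimal sketch of both routes, not a definitive implementation. The helper names `fetch_page()` and `decode_body()` are hypothetical and only for illustration. It passes an empty string to `CURLOPT_ENCODING` instead of an explicit list, which the PHP manual describes as sending every encoding the installed libcurl supports. The second helper assumes the zlib extension is available and covers the `file_get_contents()` case, where nothing is decompressed automatically.

```php
<?php
// Sketch only: fetch_page() and decode_body() are hypothetical helper names.

/**
 * Fetch a URL with cURL and let libcurl decompress the response body.
 */
function fetch_page(string $url): string
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // An empty string requests every encoding this libcurl build can decode
    // and enables automatic decompression of the response.
    curl_setopt($handle, CURLOPT_ENCODING, '');

    $data = curl_exec($handle);
    if ($data === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    // The handle is released when it goes out of scope.
    return $data;
}

/**
 * Decode a body obtained via file_get_contents().
 * $headers is the list of raw response header lines, for example the
 * $http_response_header array that PHP fills in after an HTTP request.
 */
function decode_body(string $body, array $headers): string
{
    foreach ($headers as $header) {
        if (preg_match('/^Content-Encoding:\s*(?:x-)?gzip\s*$/i', $header)) {
            $decoded = gzdecode($body);
            // Fall back to the raw body if it was not valid gzip after all.
            return $decoded === false ? $body : $decoded;
        }
    }

    // No gzip header found: return the body unchanged.
    return $body;
}
```

With the question's code, this would be used as `echo fetch_page($amazon_url);` for cURL, or as `echo decode_body($amazon_html, $http_response_header);` right after the `file_get_contents()` call.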
