Getting garbage output when scraping a webpage in PHP

Views: 85 | Published: 2020/10/29 6:14:57 | Tags: php html encoding file-get-contents


Problem description

I am trying to get the contents of a page from Amazon using file_get_html(), but the output is full of weird characters when I echo it. Can anyone explain how I can resolve this issue?

I also found the following two related questions on Stack Overflow but they did not solve my issue. :)

file_get_html() returns garbage
Uncompress gzip compressed http response
Here is my code:
$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n"
    )
);
$context = stream_context_create($options);

$amazon_url = 'https://www.amazon.com/my-url';
$amazon_html = file_get_contents($amazon_url, false, $context);
Here is the output I get:
��T]o�6}��`���0��݊-��"[�bh�tN�b0��.%%�$P��@�(Ų�� ������F#����A�
About 115k characters like this show up in the browser window.

These are my new headers:
$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n"
    )
);
Will using cURL resolve this issue?

Update:

I tried cURL. Still getting the garbage output. Here are my response headers:
HTTP/1.1 200 OK
Date: Sun, 18 Nov 2018 20:29:28 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Can anyone explain the negative votes?

I did my own research.
I found some related questions on Stack Overflow, but they did not solve my problem.
I provided all the information that I thought would be helpful.
What else should I include in the question?

Here is my whole code for curl at present. This is the URL I am scraping.
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $amazon_url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($handle);
curl_close($handle);

echo $data;
The output is just the same bunch of characters I mentioned above. Here are my request headers:
Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C17650%7CMCMID%7C67056225185486460220940124683302119708%7CMCAID%7CNONE%7CMCOPTOUT-1524907071s%7CNONE; mjx.menu=renderer%3ACommonHTML; _ga=GA1.1.2019605490.1529649408; csm-hit=adb:adblk_no&tb:s-3521C4J8F2EP1V0MMQEP|1542578145652&t:1542578146256
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
These are from the Network Tab. The response headers are the same as I mentioned above.

Here is the output after adding curl_setopt($handle, CURLOPT_HEADER, 1); to my code:

HTTP/1.1 200 OK
Server: Server
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 7A162B8JKV6MGZQ3PCH2
Vary: Accept-Encoding,User-Agent,X-Amzn-CDN-Cache
Content-Encoding: gzip
x-amz-rid: 7A162B8JKV6MGZQ3PCH2
Cache-Control: no-transform
X-Frame-Options: SAMEORIGIN
Date: Sun, 18 Nov 2018 22:42:51 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: x-wl-uid=1a4u8+XgF+IhFF/iavy9mKZCAA0g4HiIYZXR8hKjxGtmOtBW+j67wGABv7ZOTxDRcab+7Qmpjqds=;

Solution
Here's the solution:

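I ran into the same issue when scraping Amazon.
The response headers in the question include Content-Encoding: gzip, so the body arrives compressed, and cURL hands back those raw bytes unless it is told to decode them.
Add the following option before sending your cURL request: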
curl_setopt($handle, CURLOPT_ENCODING, 'gzip,deflate,sdch');
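
For context, here is a minimal sketch of how that option might sit in a complete request. The function names build_scrape_options() and fetch_page() are hypothetical and not part of the original answer. The sketch also swaps the explicit list for an empty string, which the PHP manual documents as "send every encoding this cURL build supports"; that is an assumption about what you want, and it avoids advertising an encoding such as sdch that the local libcurl may not be able to decode.

```php
<?php
// Hypothetical helpers for illustration; not taken from the original answer.

/** Build cURL options that let cURL decompress the response body itself. */
function build_scrape_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        // Empty string: send an Accept-Encoding header listing every encoding
        // this cURL build supports, and decode the reply automatically.
        CURLOPT_ENCODING       => '',
    ];
}

/** Fetch a page and return its decoded body, or throw on a transport error. */
function fetch_page(string $url): string
{
    $handle = curl_init();
    curl_setopt_array($handle, build_scrape_options($url));

    $data = curl_exec($handle);
    if ($data === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    return $data;
}
```

With these in place, `echo fetch_page($amazon_url);` should print readable HTML rather than compressed bytes, assuming the server answers with an encoding cURL can decode.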

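If you would rather stay with file_get_contents() as in the question, one possible approach is to decompress the body yourself. The sketch below assumes the zlib extension is available and that the server used gzip (as the Content-Encoding: gzip header in the question suggests); decode_if_gzipped() is a hypothetical helper name. It checks for the two-byte gzip signature instead of trusting headers, so uncompressed responses pass through unchanged.

```php
<?php
// Hypothetical helper for illustration; requires PHP's zlib extension.

/** Return the body decompressed if it starts with the gzip signature. */
function decode_if_gzipped(string $body): string
{
    // Every gzip stream begins with the bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip: leave the body untouched
    }

    $decoded = gzdecode($body);
    if ($decoded === false) {
        throw new RuntimeException('Body looked like gzip but could not be decoded.');
    }

    return $decoded;
}
```

In the question's code this would be applied to the result of file_get_contents($amazon_url, false, $context), after first checking that the call did not return false.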