cURL returns an empty array


Problem description

I have made a simple web crawler with PHP cURL that should grab all the images from the Amazon results page for a search on the keyword samsung.

The code is as follows:

$curl = curl_init(); // $curl is going to be data type curl resource

$search_string = "samsung";

$url = "https://www.amazon.com/s?k$search_string";

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable 

$result = curl_exec($curl);

preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);

print_r($matches);

curl_close($curl);

But now I get an empty array:

Array ( [0] => Array ( ) )

I don't know why it shows that, so if you know what is going wrong or how I can handle this, please let me know. I would really appreciate any ideas.

Thanks in advance.

Note that I used the [^\s]*? regular expression in place of a specific image name so that all available images on the web page are matched.

Update #1:

curl --head https://www.amazon.com/s?k=samsung

HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Content-Length: 2671
Connection: keep-alive
Server: Server
Date: Tue, 15 Jun 2021 20:59:38 GMT
x-amz-rid: 9BVX8KQMWJ4QDJ75ETYV
Vary: Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
Last-Modified: Fri, 14 May 2021 19:08:48 GMT
ETag: "a6f-5c24ef9383000"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Permissions-Policy: interest-cohort=()
X-Cache: Error from cloudfront
Via: 1.1 5345148f0ba8ae3c67b69d035acdbfc5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: AMS50-C1
X-Amz-Cf-Id: AHdq2-QLEtCE4WvXZIEh_P75D8hCrHP09EAkNqBer5VBS-pI-blj1w==

Recommended answer

First issue: your code:

$url = "https://www.amazon.com/s?k$search_string";

should be (note the =):

$url = "https://www.amazon.com/s?k=$search_string";
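
One way to avoid this kind of typo is to let PHP build the query string. This is a minimal sketch, assuming the search page takes the keyword in a k parameter as in the URL above; build_search_url is a hypothetical helper name, not part of the original code:

```php
<?php
// Hypothetical helper: builds the search URL from a keyword.
// http_build_query() inserts the "=" and URL-encodes the value.
function build_search_url(string $keyword): string
{
    return "https://www.amazon.com/s?" . http_build_query(["k" => $keyword]);
}

echo build_search_url("samsung"), "\n"; // https://www.amazon.com/s?k=samsung
```

A keyword containing spaces or other special characters is encoded as well, which plain string interpolation would not do.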

Second issue: Amazon is smart, and they are not going to let you scrape at will. The response is not the search results page; the headers in Update #1 show a 503 Service Unavailable for the same URL.

You can see this with:

$result = curl_exec($curl);
var_dump($result);
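
Dumping the body works, but the HTTP status code is often quicker to read. The following is a sketch only, assuming the same kind of cURL setup as in the question; fetch_page is a hypothetical helper, and the 10-second timeout is an arbitrary choice:

```php
<?php
// Hypothetical helper: fetches a URL and returns the HTTP status,
// the body, and any cURL error message.
function fetch_page(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);          // arbitrary limit so the call cannot hang
    $body   = curl_exec($curl);                        // false on a transport failure
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE); // e.g. 200 or 503; 0 if no response
    $error  = curl_error($curl);                       // empty string when there was no error
    // The handle is released when $curl goes out of scope.
    return ["status" => $status, "body" => $body, "error" => $error];
}

$page = fetch_page("https://www.amazon.com/s?k=samsung");
if ($page["body"] === false) {
    echo "cURL error: ", $page["error"], "\n";
} elseif ($page["status"] !== 200) {
    echo "Unexpected HTTP status: ", $page["status"], "\n";
}
```

Checking the status before running preg_match_all makes it clear whether an empty match array comes from the regex or from a response that never contained the images.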

Third issue: the regex is not working. Test regular expressions at https://www.phpliveregex.com/#tab-preg-match-all (right-click the page, choose View Source, then copy and paste the page content as the test input).

In my test, your regex did not return any results, but this one did: https://m.media-amazon.com/images/I/[^\s]*?.jpg

It may be that the ._AC_UL320_ part of the string is also an Amazon anti-scraping thing... :(
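
In a regular expression an unescaped . matches any character, so it may be worth escaping the dots when testing patterns. This is a minimal sketch run against a made-up HTML fragment; the image file names are invented for illustration and are not real Amazon URLs:

```php
<?php
// Made-up sample markup; the file names are invented for illustration.
$html = '<img src="https://m.media-amazon.com/images/I/abc123._AC_UL320_.jpg">'
      . '<img src="https://m.media-amazon.com/images/I/def456.jpg">';

// Dots are escaped so they match a literal "."; [^\s"]+? stops at
// whitespace or a double quote, so a match cannot run past the attribute.
$pattern = '!https://m\.media-amazon\.com/images/I/[^\s"]+?\.jpg!';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]); // both sample URLs, with and without the size suffix
```

Matching on the .jpg ending alone means the pattern does not depend on a particular size suffix being present in the page source.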
