cURL returns an empty array
Question
I have made a simple web crawler with PHP cURL that should grab all the images from the Amazon search results page for the keyword samsung.
Here is the code:
$curl = curl_init(); // $curl is going to be data type curl resource
$search_string = "samsung";
$url = "https://www.amazon.com/s?k$search_string";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable
$result = curl_exec($curl);
preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);
print_r($matches);
curl_close($curl);
But now I get an empty array:
Array ( [0] => Array ( ) )
I don't know why it shows that, so if you know what is going wrong or how I can handle this, please let me know. I would really appreciate any ideas.
Thanks in advance.
Note that I used the regular expression [^\s]*? in place of the image name so that the pattern matches all the available images on the page.
Update #1:
curl --head https://www.amazon.com/s?k=samsung
HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Content-Length: 2671
Connection: keep-alive
Server: Server
Date: Tue, 15 Jun 2021 20:59:38 GMT
x-amz-rid: 9BVX8KQMWJ4QDJ75ETYV
Vary: Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
Last-Modified: Fri, 14 May 2021 19:08:48 GMT
ETag: "a6f-5c24ef9383000"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Permissions-Policy: interest-cohort=()
X-Cache: Error from cloudfront
Via: 1.1 5345148f0ba8ae3c67b69d035acdbfc5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: AMS50-C1
X-Amz-Cf-Id: AHdq2-QLEtCE4WvXZIEh_P75D8hCrHP09EAkNqBer5VBS-pI-blj1w==
Recommended answer
First issue: your code:
$url = "https://www.amazon.com/s?k$search_string";
should be (note the =):
$url = "https://www.amazon.com/s?k=$search_string";
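One way to avoid this kind of typo is to let PHP assemble the query string. The following is a minimal sketch, assuming the same k parameter as in the question; buildSearchUrl is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper (not in the original code): builds the search URL.
function buildSearchUrl(string $keyword): string
{
    // http_build_query() inserts the "=" and URL-encodes the value.
    return "https://www.amazon.com/s?" . http_build_query(["k" => $keyword]);
}

echo buildSearchUrl("samsung"), "\n"; // https://www.amazon.com/s?k=samsung
```

Keywords containing spaces or other special characters are encoded as well, which plain string interpolation would not do.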
Second issue: Amazon is smart; they are not going to let you scrape as you please. What comes back is not the search results page (the curl --head output in the question's update shows 503 Service Unavailable).
You can see what was actually returned with:
$result = curl_exec($curl);
var_dump($result);
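Besides dumping the body, it may help to look at the HTTP status code, since a blocked request still returns a non-empty string. This is a rough sketch under that assumption; describeResponse is a hypothetical helper, and the live call is commented out because it needs network access and the $curl handle from the question.

```php
<?php
// Hypothetical helper: summarises a cURL result so a blocked request stands out.
function describeResponse($body, int $status, string $error): string
{
    if ($body === false) {
        return "cURL error: $error"; // transport-level failure, no response body
    }
    if ($status !== 200) {
        return "HTTP $status, " . strlen($body) . " bytes (probably not the search page)";
    }
    return "HTTP 200, " . strlen($body) . " bytes";
}

// With the handle from the question it could be called like this:
// $result = curl_exec($curl);
// echo describeResponse($result, curl_getinfo($curl, CURLINFO_RESPONSE_CODE), curl_error($curl)), "\n";

// Offline illustration with a stand-in body and the 503 status seen in the question:
echo describeResponse("<html>error page</html>", 503, ""), "\n";
```

If the status is not 200, the regex has nothing useful to match against, whatever pattern is used.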
Third issue: the regex does not match. It is worth testing a regex at https://www.phpliveregex.com/#tab-preg-match-all (right-click the page, choose View Source, then copy and paste the page content in as the test input).
From what I got, your regex did not return any results, but this one did: https://m.media-amazon.com/images/I/[^\s]*?.jpg
It may be that the ._AC_UL320_ part of the string is also an Amazon anti-scraping thing... :(
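As a side note, an unescaped . in a regex matches any character, so escaping the dots makes the pattern stricter. Here is a small sketch of the looser pattern with escaped dots, run against made-up markup; the image IDs and size suffix in $html are placeholders, not real Amazon file names.

```php
<?php
// Made-up sample markup; the image IDs are placeholders, not real Amazon files.
$html = '<img src="https://m.media-amazon.com/images/I/abc123._AC_UY218_.jpg">'
      . '<img src="https://m.media-amazon.com/images/I/def456.jpg">';

// "\." matches a literal dot; [^\s"]*? stops at whitespace or a closing quote.
$pattern = '!https://m\.media-amazon\.com/images/I/[^\s"]*?\.jpg!';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]); // both image URLs, with or without a size suffix
```

Because the pattern does not require a specific size suffix, it keeps working if the suffix in the page differs from ._AC_UL320_.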