Strange CURL issue with a particular website SSL certificate

Problem description

I am trying to use CURL to get web pages from a particular website, but it gives this error:

curl -q -v -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://www.saiglobal.com/ --output ./Downloads/test.html
....
*  SSL certificate verify ok.
} [5 bytes data]
> GET / HTTP/1.1
> Host: www.saiglobal.com
> User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:11:53 --:--:--     0* OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
* stopped the pause stream!
  0     0    0     0    0     0      0      0 --:--:--  0:11:53 --:--:--     0
* Closing connection 0
} [5 bytes data]
curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

I am not sure what is going on. I can't find much useful information about this error message. On my Mac, the errno is 60 instead of 104.

However, Chrome on these machines loads the page without any issue. The CURL version on one of the machines is 7.58.0.

Any help would be appreciated.

Recommended answer

The problem is not the certificate of this site. The debug output clearly shows that the TLS handshake completed successfully, and outside of this handshake the certificate does not matter.

However, the DNS records show that the site www.saiglobal.com is served through the Akamai CDN, and Akamai features some kind of bot detection:

$ dig www.saiglobal.com
...
www.saiglobal.com.      45      IN      CNAME   www.saiglobal.com.edgekey.net.
www.saiglobal.com.edgekey.net. 62 IN    CNAME   e9158.a.akamaiedge.net.
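
The same check can be scripted. The following is only a rough sketch: it assumes, as the answer does, that a CNAME target ending in edgekey.net or akamaiedge.net indicates Akamai, and the function name points_to_akamai is made up for illustration.

```shell
#!/bin/sh
# Read DNS answer lines on stdin and succeed if any line ends in one of the
# two domains seen in the dig output above (with or without a trailing dot).
# Assumption: these suffixes are treated as a sign that the host sits
# behind Akamai. The function name is hypothetical.
points_to_akamai() {
    grep -Eq '(edgekey\.net|akamaiedge\.net)\.?$'
}
```

It could be used as `dig +short www.saiglobal.com | points_to_akamai && echo "served via Akamai"`. A host that matches is only likely to be behind Akamai; whether bot detection is enabled for it cannot be read from DNS.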

Akamai's bot detection is known to use heuristics to distinguish bots from normal browsers. When it detects a bot, the result can be a 403 (access denied) status code or the site simply hanging. See the questions "Scraping attempts getting 403 error" and "Requests SSL connection timeout".
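
To tell these outcomes apart without waiting minutes for a hang to end, the request can be run with a time limit and the result classified. This is a minimal sketch, not a definitive tool: the function names classify_probe and probe and the 20-second limit are assumptions chosen for illustration. The curl options used (--max-time, -w '%{http_code}', -o, -s, -q) and the exit codes 28 (timeout) and 56 (failure receiving network data, as in the question) are standard curl behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map curl's exit code and the HTTP status it reported
# onto the outcomes described above.
classify_probe() {
    local exit_code="$1" http_status="$2"
    if [ "$exit_code" -eq 28 ]; then
        echo "hang"          # curl exit code 28: operation timed out
    elif [ "$exit_code" -eq 56 ]; then
        echo "reset"         # curl exit code 56: failure receiving network data
    elif [ "$exit_code" -ne 0 ]; then
        echo "other-error"   # any other curl failure
    elif [ "$http_status" = "403" ]; then
        echo "denied"        # request completed, server answered 403
    else
        echo "ok"
    fi
}

# Probe a URL with a hard time limit (20 seconds is an arbitrary choice) so
# a hang is cut short. Extra arguments, such as -H "Name: value", are
# passed through to curl.
probe() {
    local url="$1"; shift
    local status exit_code
    status=$(curl -q -s -o /dev/null --max-time 20 -w '%{http_code}' "$@" "$url")
    exit_code=$?
    classify_probe "$exit_code" "$status"
}
```

For example, `probe https://www.saiglobal.com/` and `probe https://www.saiglobal.com/ -H "User-Agent: Mozilla/5.0"` can be compared to see whether a change in headers changes the outcome.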

In this specific case it currently seems to help if some specific HTTP headers are added: Accept-Encoding, Accept-Language, Connection with a value of keep-alive, and a User-Agent that somehow matches Mozilla. Leaving out these headers or sending the wrong values results in a hang.

The following currently works for me:

$ curl -q -v \
   -H "Connection: keep-alive" \
   -H "Accept-Encoding: identity" \
   -H "Accept-Language: en-US" \
   -H "User-Agent: Mozilla/5.0"  \
   https://www.saiglobal.com/

Note that the curl command above deliberately tries to bypass the bot detection. It might stop working if Akamai changes how the bot detection works.

Please also note that the owner of the site has explicitly enabled bot detection for a reason. This means that by deliberately bypassing the detection for your own gain (for example, to provide some service based on scraped information) you might get into legal trouble.
