Why does curl not work, but wget works?

Problem description

I am using both curl and wget to get this url: http://opinionator.blogs.nytimes.com/2012/01/19/118675/

For curl, it returns no output at all, but with wget, it returns the entire HTML source:

Here are the 2 commands. I've used the same user agent, and both are coming from the same IP, and are following redirects. The URL is exactly the same. For curl, it returns immediately after 1 second, so I know it's not a timeout issue.

curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1

wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 

If NY Times might be cloaking, and not returning the source to curl, what could be different in the headers curl is sending? I assumed since the user agent is the same, the request should look exactly the same from both of these requests. What other "footprints" should I check?

Answer

The way to solve this is to analyze your curl request by running curl -v ... and your wget request by running wget -d ..., which shows that curl is redirected to a login page:

> GET /2012/01/19/118675/ HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
> Host: opinionator.blogs.nytimes.com
> Accept: */*
> 
< HTTP/1.1 303 See Other
< Date: Wed, 08 Jan 2014 03:23:06 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.nytimes.com/glogin?URI=http://opinionator.blogs.nytimes.com/2012/01/19/118675/&OQ=_rQ3D0&OP=1b5c69eQ2FCinbCQ5DzLCaaaCvLgqCPhKP
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

followed by a loop of redirections (which you must have noticed, because you have already set the --max-redirs flag).

On the other hand, wget follows the same sequence except that it returns the cookie set by nytimes.com with its subsequent request(s)

---request begin---
GET /2012/01/19/118675/?_r=0 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: */*
Host: opinionator.blogs.nytimes.com
Connection: Keep-Alive
Cookie: NYT-S=0MhLY3awSMyxXDXrmvxADeHDiNOMaMEZFGdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI

The request sent by curl never includes the cookie.
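The two traces above come from each tool's debug mode. Assuming the same URL and user agent as in the question, they can be reproduced with something like:

```shell
# curl: -v prints request headers (lines starting with >) and response
# headers (lines starting with <) to stderr; -o /dev/null discards the
# body so only the header exchange is visible.
curl -v -L --max-redirs 10 -o /dev/null \
  -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
  "http://opinionator.blogs.nytimes.com/2012/01/19/118675/"

# wget: -d prints the full debug trace, including the Cookie header it
# sends back on the redirected request.
wget -d -O /dev/null \
  --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
  "http://opinionator.blogs.nytimes.com/2012/01/19/118675/"
```

Note that with the redirect loop still in place, the curl invocation will stop with a "maximum redirects followed" error once it hits the --max-redirs limit.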

The easiest way I see to modify your curl command and obtain the desired resource is by adding -c cookiefile to your curl command. This stores the cookie in the otherwise unused temporary "cookie jar" file called "cookiefile" thereby enabling curl to send the needed cookie(s) with its subsequent requests.
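Applying that fix to the original command (the jar name, cookies.txt here, is arbitrary):

```shell
# -c tells curl to write any cookies it receives to cookies.txt; enabling
# the cookie engine also makes curl send those cookies on the follow-up
# requests it issues while chasing the redirects, which breaks the loop.
curl -L -s -c cookies.txt --max-redirs 10 --connect-timeout 20 -m 20 \
  -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
  "http://opinionator.blogs.nytimes.com/2012/01/19/118675/"
```

To reuse the saved session in a later invocation, read the jar back in with -b cookies.txt as well: -c only writes the jar, -b only reads it.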

For example, I added the flag -c x directly after "curl " and I obtained the output just like from wget (except that wget writes it to a file and curl prints it on STDOUT).
