IMDB Scraping issue


Problem description

Possible Duplicates:
Does IMDB provide an API?
How to send a header using a HTTP request through a curl call?

I am using PHP curl to scrape movie details from IMDB. Fetching the data works fine, but I am now running into the following problem with non-English movies like this movie.

When I open this movie in my browser, it shows me the "IMDB English" version of the page, with the movie title "Boarding School". But when I fetch the data through curl, it fetches the original page for this movie, where the title is "Leidenschaftliche Blümchen".

So please suggest how I can fetch the English version of the IMDB page with curl.

Solution

When you request a page with a browser, the browser sends specific request headers to the server. A Firefox extension like Firebug can show these (check the Net panel). As an example, these are the headers I just sent to the server with Firefox:

GET /title/tt0076306/ HTTP/1.1
Host: www.imdb.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
...

The one that probably makes the difference:

Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3

See section 14.4, Accept-Language, of RFC 2616 (HTTP/1.1).
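
To make the q-values in that header concrete, here is a small shell sketch that lists the language ranges of an Accept-Language value in order of preference. The function name `rank_languages` is made up for this illustration, and it only handles the simple `range;q=value` form. It shows how the header expresses a preference order, not how IMDB actually chooses a page from it.

```shell
#!/bin/sh
# Hypothetical helper (not part of curl): print the language ranges of an
# Accept-Language value, most preferred first. A range without an explicit
# q parameter counts as q=1.
rank_languages() {
  printf '%s\n' "$1" |
    tr -d ' ' |
    tr ',' '\n' |
    awk -F';' '{
      q = 1
      for (i = 2; i <= NF; i++)
        if ($i ~ /^q=/) q = substr($i, 3)
      print q, $1
    }' |
    LC_ALL=C sort -s -k1,1nr |
    awk '{ print $2 }'
}

rank_languages "en-us,en;q=0.8,de-de;q=0.5,de;q=0.3"
```

For the header above this prints en-us, en, de-de and de, one per line, so English is asked for first and German is only a fallback.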

When you use curl, it sends request headers as well, but they might differ from the ones your browser sends. However, you can tell curl to send the headers you specify.

You just need to make curl use the headers your browser uses and you should get the same result. See How to send a header using a HTTP request through a curl call?.

For example, to get the German version of the page:

curl -H "Accept-Language: de-de;q=0.8,de;q=0.5" http://www.imdb.com/title/tt0076306/

For the English version:

curl -H "Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3" http://www.imdb.com/title/tt0076306/
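
If Accept-Language alone is not enough and you want curl to mimic the browser more closely, the browser's header dump can be turned into `-H` arguments mechanically. The following is only a rough sketch under some assumptions: the function name `headers_to_curl` and the `DRY_RUN` variable are invented for this example, and the headers are expected one per line on standard input. The request line and the Host header are skipped because curl derives them from the URL, and Accept-Encoding is left out on purpose, since curl would otherwise hand you the compressed body unless you also pass `--compressed`.

```shell
#!/bin/sh
# Hypothetical helper: read "Name: value" header lines from stdin and build
# a curl command that sends them to the given URL.
# DRY_RUN (an assumption of this sketch) defaults to 1, which prints the
# arguments one per line instead of contacting the server.
headers_to_curl() {
  url=$1
  set -- curl -s
  while IFS= read -r line; do
    case $line in
      *:*) ;;          # looks like a header line, keep going
      *) continue ;;   # skip the request line, blanks and "..."
    esac
    case $line in
      Host:*|Accept-Encoding:*) continue ;;
    esac
    set -- "$@" -H "$line"
  done
  set -- "$@" "$url"
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$@"
  else
    "$@"
  fi
}

# Dry run: shows the curl arguments that would be used.
headers_to_curl "http://www.imdb.com/title/tt0076306/" <<'EOF'
GET /title/tt0076306/ HTTP/1.1
Host: www.imdb.com
Accept-Language: en-us,en;q=0.8,de-de;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate
EOF
```

Setting `DRY_RUN=0` makes the function run the assembled curl command instead of printing it. The dry run is the default here so that you can compare the arguments with your browser's headers before sending anything.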
