How to display images when using cURL?


Problem Description

When scraping a page, I would like the images included with the text.

Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's homepage and it only displayed the text, no images (no Google logo).

I also created another test script using Redbox, with no success and the same result. Here's my attempt at scraping the Redbox 'Find a Movie' page:

<?php

$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;

?>

The page was broken: missing box art, missing scripts, etc.

Looking at the 'Net' tool in Firefox's Firebug extension (which lets me check headers and file paths), I discovered that Redbox's images and CSS files were not loading (404 Not Found). I figured out why: my browser was looking for Redbox's images and CSS files in the wrong place.

Apparently the Redbox images and CSS files are addressed relative to the domain, and likewise for Google's logo. So if my script above is using its own domain as the base for the file paths, how can I change this?

I tried altering the Host and Referer request headers with the script below, and I've googled extensively, but no luck.

My attempted fix:

<?php

$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';
$referer = 'http://www.redbox.com/Titles/AvailableTitles.aspx';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Spoof the Host and Referer headers in case the server checks them
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Host: www.redbox.com"));
curl_setopt($ch, CURLOPT_REFERER, $referer);
// Request the full response body, not just the headers
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;

?>

I hope I made sense; if not, let me know and I'll try to explain it better. Any help would be great! Thanks.

UPDATE

Thanks to everyone (especially Marc and Wyatt); your answers helped me figure out a method to implement.
I was able to successfully test by following the steps below:


  1. Download the page and its requisites via Wget.
  2. Add <base href="..." /> to the downloaded page's header.
  3. Upload the revised downloaded page and its original requisites via Wput to a temporary server.
  4. Test the uploaded page on the temporary server via a browser.
  5. If the uploaded page does not display properly, some of the requisites may still be missing (CSS, JS, etc.). Check which ones via a tool that shows header responses (e.g. the 'Net' tool from Firefox's Firebug add-on). After locating the missing requisites, visit the original page that the uploaded page is based on, note the correct locations of the missing requisites, revise the downloaded page from step 1 to use those locations, and begin again at step 3. Otherwise, if the page renders properly: success!

Note: When revising the downloaded page I manually edited the code; I'm sure you could use a regex or a parsing library on cURL's response to automate the process (see the sketch below).
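
For instance, here is a minimal sketch of that automation using a regex: it fetches the page via cURL and splices a <base href="..." /> tag in right after the opening <head> tag. The Redbox URL is the one from the question; the naive <head[^>]*> pattern is an assumption that the page has a single, well-formed head tag, so a real page might need a proper HTML parser instead.

<?php

// Minimal sketch: fetch a page and inject a <base> tag so relative
// links resolve against the original site. Illustrative only.
$url = 'http://www.redbox.com/Titles/AvailableTitles.aspx';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Insert <base href="..."> immediately after the opening <head> tag.
// ASSUMPTION: a single, well-formed <head> tag exists in the markup.
$base = '<base href="http://www.redbox.com/" />';
$html = preg_replace('/<head([^>]*)>/i', '<head$1>' . $base, $html, 1);

echo $html;

?>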

Recommended Answer

When you scrape a URL, you're retrieving a single file, be it HTML, an image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building/displaying the page requires many HTTP requests.

When you scrape the Google home page via cURL and output that HTML to the user, there's no way for the user to know that they're actually viewing Google-sourced HTML; it appears as if the HTML came from your server, and your server only. The user's browser will happily suck in this HTML, find the images, and request those images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 'Not Found' error.

To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's head block. This tells any viewing browser that relative links within the document should be fetched from this 'base' source (e.g. Google).
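
A minimal sketch of this first option, this time using PHP's DOM extension as the parser (a regex variant is shown in the question's update above). The Google URL is reused from the answer's example; real-world markup may need more error handling than shown here.

<?php

// Sketch of the easiest option: fetch the page, then use PHP's DOM
// extension to insert a <base href> tag into the <head> block.
$origin = 'http://www.google.com/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $origin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from real-world markup

// Make <base href="..."> the first child of <head> so every relative
// link in the document resolves against the original site.
$base = $doc->createElement('base');
$base->setAttribute('href', $origin);
$head = $doc->getElementsByTagName('head')->item(0);
if ($head !== null) {
    $head->insertBefore($base, $head->firstChild);
}

echo $doc->saveHTML();

?>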

A harder option is to parse the document and rewrite any references to external files (images, CSS, JS, etc.) to point at the URL of the originating server, so the user's browser goes to the original site and fetches them from there.
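
As a rough illustration of that rewrite, the sketch below prefixes relative src/href attribute values with the origin's base URL using a regex callback. The absolutify() helper is a name invented for this sketch, and it deliberately ignores protocol-relative URLs it already skips, CSS url() references, srcset, and other edge cases a real implementation would have to handle.

<?php

// Sketch: rewrite relative src/href attributes to absolute URLs on
// the originating server. Simplified; a real version should handle
// url() in CSS, srcset, <base> tags, and similar edge cases.
$origin = 'http://www.google.com';

function absolutify($html, $origin) {
    // Match src="..." or href="..." values that do not already start
    // with a scheme (http:/https:), a protocol-relative //, or a #.
    return preg_replace_callback(
        '/\b(src|href)="(?!https?:|\/\/|#)([^"]*)"/i',
        function ($m) use ($origin) {
            $path = $m[2];
            // Root-relative paths hang straight off the domain;
            // anything else is resolved naively against the root.
            $abs = ($path !== '' && $path[0] === '/')
                ? $origin . $path
                : $origin . '/' . $path;
            return $m[1] . '="' . $abs . '"';
        },
        $html
    );
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $origin . '/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

echo absolutify($html, $origin);

?>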

The hardest option is to essentially set up a proxy server: if a request comes in for a file that doesn't exist on your server, try to fetch the corresponding file from Google via cURL and output it to the user.
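
A bare-bones sketch of that proxy idea is below: a single proxy.php (a hypothetical name) that missing requests would be routed to, e.g. via a rewrite rule, with the original path passed in a ?path= parameter. It forwards the path to the origin, relays the Content-Type, and echoes the body. A production proxy would also need input validation, caching, header filtering, and safeguards against being used as an open proxy.

<?php

// proxy.php -- bare-bones sketch of the proxy approach. Route missing
// requests here (e.g. with a rewrite rule), passing the path in ?path=
// NOTE: validate $path in real code; this sketch trusts its input.
$origin = 'http://www.google.com';
$path   = isset($_GET['path']) ? $_GET['path'] : '/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $origin . $path);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$body = curl_exec($ch);

// Relay the origin's Content-Type so images, CSS, etc. render correctly.
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

if ($type !== false && $type !== null) {
    header('Content-Type: ' . $type);
}
echo $body;

?>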
