Retrieve partial web page

Problem description

Is there any way of limiting the amount of data CURL will fetch? I'm screen scraping data off a page that is 50kb; however, the data I require is in the top 1/4 of the page, so I really only need to retrieve the first 10kb of the page.

I'm asking because there is a lot of data I need to monitor which results in me transferring close to 60GB of data per month, when only about 5GB of this bandwidth is relevant.

I am using PHP to process the data, but I am flexible in my data retrieval approach: I can use CURL, WGET, fopen etc.

One approach I'm considering is

$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);

Does the above mean I will only transfer 6kb from www.website.com, or will fopen load www.website.com into memory meaning I will still transfer the full 50kb?

Solution

You may be able to accomplish what you're looking for using CURL as well.

If you look at the documentation for CURLOPT_WRITEFUNCTION, you can register a callback that is called whenever data is available for reading from CURL. You could then count the bytes received and, once you've received more than 6,000 bytes, return 0 to abort the rest of the transfer.

The libcurl documentation describes the callback a bit more:

This function gets called by libcurl as soon as there is data received that needs to be saved. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.

The callback function will be passed as much data as possible in all invokes, but you cannot possibly make any assumptions. It may be one byte, it may be thousands.
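
As a rough illustration of the callback approach in PHP, here is a minimal sketch, assuming the curl extension is available. The helper names makeLimitedWriter() and fetchFirstBytes() are hypothetical and not part of any library. Since chunk sizes are unpredictable, the callback trims the last chunk so the buffer never exceeds the limit. The server may already have sent more than the limit before the abort takes effect, so the bandwidth saving is approximate.

```php
<?php
// Sketch only: makeLimitedWriter() and fetchFirstBytes() are hypothetical
// helper names chosen for this example.

// Build a write callback that collects at most $limit bytes into $buffer.
function makeLimitedWriter(int $limit, string &$buffer): callable
{
    return function ($ch, string $chunk) use ($limit, &$buffer): int {
        $remaining = $limit - strlen($buffer);
        // Keep only as much of this chunk as still fits under the limit.
        $buffer .= substr($chunk, 0, max(0, $remaining));
        if (strlen($buffer) >= $limit) {
            // Returning a count other than strlen($chunk) makes cURL
            // abort the transfer with a write error.
            return 0;
        }
        // Otherwise report the whole chunk as handled.
        return strlen($chunk);
    };
}

// Fetch roughly the first $limit bytes of the body at $url.
function fetchFirstBytes(string $url, int $limit): string
{
    $buffer = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeLimitedWriter($limit, $buffer));
    // curl_exec() returns false once the callback aborts; that is expected here.
    curl_exec($ch);
    return $buffer;
}
```

With these helpers, a call such as fetchFirstBytes("http://www.website.com", 6000) would return at most 6,000 bytes of the page body. Because the abort surfaces as a cURL write error, real code would need to tell this deliberate abort apart from a genuine failure.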
