On-the-fly zipping & streaming of large files, in PHP or otherwise


Problem description


Imagine a web serving scenario where several large files must be zipped and provided to the client for download. The most obvious way to do this on LAMP is to create a temporary zip file using PHP’s native capability, then either echo it to the user or save it to disk and redirect (requiring it to be deleted some time in the future).


However, such a schema has the following drawbacks:


  1. An intensive period of CPU and disk thrashing

  2. Unacceptably high memory usage per request

  3. A large amount of temporary disk space

  4. A considerable initial delay for the user while the archive is prepared.


Additionally, if the user cancels the download halfway through, a large amount of resources will have been wasted.


Paul Duncan’s ZipStream-PHP solves some of these problems, effectively shoveling the data into Apache file by file. However, it still suffers from very high memory usage (files are loaded entirely into memory), and results in disk and CPU usage spikes.


In contrast, consider the following bash snippet:

ls -1 | zip -@ - | dd of=/dev/somewhere


A pipe has an integral buffer, and when it is full, the OS suspends the sending program. So here Info-ZIP (the zip utility provided on OS X, easily apt-got &c. on Linux) operates in streaming mode, has a low memory footprint, and works only as fast as its output can be read by dd.
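To see this streaming behavior concretely, here is a small sketch (using gzip rather than Info-ZIP, purely so it runs anywhere): a megabyte flows through a compress/decompress pipeline with no temporary file, and each stage holds at most a pipe buffer's worth of data in memory at any moment.

```shell
# Round-trip 1 MiB through a pipeline: no temp files, and each stage
# holds at most one pipe buffer (~64 KiB on Linux) in memory at a time.
head -c 1048576 /dev/zero | gzip -c | gzip -dc | wc -c
# prints the original byte count: 1048576
```

Because each stage blocks when its output pipe is full, total memory use stays flat no matter how much data passes through.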


The optimal way, then, would be to do the same: stream the files to the user via a zip utility. This would operate with very little overhead, and would be much akin to the way gzip is applied by modern web servers on the fly.


Is it possible to achieve this using Apache & PHP?


(Aside: are there better web serving technologies that might be better suited to this task?)

Answer


You can use popen() (docs) or proc_open() (docs) to execute a unix command (e.g. zip or gzip), and get back stdout as a PHP stream. flush() (docs) will do its best to push the contents of PHP's output buffer to the browser.


Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).


(Note: don't use flush(). See the update below for details.)


Something like the following can do the trick:

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);
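The pipeline handed to popen() above can be sanity-checked from a shell first. This sketch (the directory and file name are made up for illustration) builds a tar.gz on the fly and lists its contents straight from the pipe, confirming that the archive streams correctly without ever being written to disk:

```shell
# Create a throwaway input file, stream it through the same tar|gzip
# pipeline the PHP example uses, then list the archive's contents
# directly from the pipe -- no temporary archive touches the disk.
mkdir -p /tmp/streamtest
printf 'hello\n' > /tmp/streamtest/file1
tar cf - -C /tmp/streamtest file1 | gzip -c | gzip -dc | tar tf -
# prints: file1
```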




You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.


If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server, and use the child_process module to spawn the tar/zip/whatever pipeline.


Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per processor core.


Update (from Benji's excellent feedback in the comments section on this answer)


1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.


[editorial note] 8192 is almost certainly a platform-dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it returns, allowing the OS to fill the buffer again asynchronously. 8192 is the default buffer size on many popular operating systems.


There are other circumstances that can cause fread to return even fewer than 8192 bytes -- for example, when the "remote" client (or process) is slow to fill the buffer. In most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0 to os_buffer_size bytes get returned.


The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).


2. According to comments on fread docs, a few caveats: magic quotes may interfere and must be turned off.


3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.


4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:

Content-type: application/zip


or 'application/octet-stream' can be used instead (it's a generic content type used for all different kinds of binary downloads):

Content-type: application/octet-stream


and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):

Content-disposition: attachment; filename="file.zip"

A Content-Length header should also be sent, but this is hard with this technique, as you don't know the zip's exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or of unknown length? Does anybody know?


Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');

// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);




Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur in other configurations too. The problem appears when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer, which leads Apache to kill the PHP process; that, of course, causes the download to hang or complete prematurely, with only a partial transfer having taken place.


The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.
