LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing


Problem Description

Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user or save it to disk and redirect (deleting it some time in the future).

However, doing things that way has drawbacks:

  • an initial phase of intensive CPU and disk thrashing, resulting in...
  • ...a considerable initial delay to the user while the archive is prepared
  • very high memory footprint per request
  • substantial temporary disk space used
  • if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted

Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), and large, thrashy spikes in disk and CPU usage.

In contrast, consider the following bash snippet:

ls -1 | zip -@ - | cat > file.zip
  # Note -@ is not supported on MacOS

Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer – when the buffer is full, the OS suspends the writing program (the program on the left of the pipe). This ensures that zip works only as fast as its output can be written by cat.

The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user with it created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.

How can you achieve this on a LAMP stack?

Recommended Answer

You can use popen() (docs) or proc_open() (docs) to execute a unix command (e.g. zip or gzip), and get back stdout as a php stream. flush() (docs) will do its very best to push the contents of php's output buffer to the browser.

Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).

(Note: don't use flush(). See the update below for details.)

Something like the following can do the trick:

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);

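As the comment in the example notes, proc_open() can be used instead of popen() when you also need to control the pipeline's input -- for instance, to feed the file list to zip -@ on stdin, as in the shell snippet earlier. A minimal sketch under those assumptions (file1..file3 are placeholder names, and the zip binary must be on the server's PATH):

<?php
header('Content-Type: application/zip');

// describe the child's stdin and stdout as pipes we can use
$spec = array(
    0 => array('pipe', 'r'),   // child's stdin: we write to it
    1 => array('pipe', 'w'),   // child's stdout: we read from it
);
$proc = proc_open('zip -@ -', $spec, $pipes);
if (is_resource($proc)) {
    // hand zip its file list (one name per line), then close stdin
    // so it knows the list is complete and begins emitting the archive
    fwrite($pipes[0], "file1\nfile2\nfile3\n");
    fclose($pipes[0]);

    // stream the archive to the client as it is produced
    while (!feof($pipes[1])) {
        echo fread($pipes[1], 8192);
    }
    fclose($pipes[1]);
    proc_close($proc);
}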

You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.

If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server; and use the child_process module to spawn the tar/zip/whatever pipeline.

Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per-processor-core.

Update (from Benji's excellent feedback in the comments section on this answer)

1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.

[editorial note] 8192 is almost certainly a platform-dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the os to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.

There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, the "remote" client (or process) is slow to fill the buffer - in most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0..os_buffer_size bytes get returned.

The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).
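
In that spirit, the read loop can treat fread()'s return value defensively. A minimal sketch (using the same placeholder pipeline as the examples above):

<?php
$fp = popen('zip -r - file1 file2 file3', 'r');
while (!feof($fp)) {
    $buff = fread($fp, 8192);   // 8192 is a maximum, not a guarantee
    if ($buff === false) {
        break;                  // read error: stop rather than spin forever
    }
    echo $buff;                 // may be anywhere from 0 to 8192 bytes
}
pclose($fp);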

2. According to comments on the fread() docs, a few caveats apply: magic quotes may interfere and must be turned off (see the sketch after item 3 below).

3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.
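
A short sketch covering caveats 2 and 3 together -- note the function_exists() guard, since the magic-quotes functions only exist on legacy PHP versions (they were removed from modern PHP):

<?php
// caveat 2: make sure magic_quotes_runtime isn't mangling the binary stream
if (function_exists('set_magic_quotes_runtime') && get_magic_quotes_runtime()) {
    set_magic_quotes_runtime(0);
}

// caveat 3: 'pass' is the default, but set it explicitly in case
// something earlier in the request has changed it
mb_http_output('pass');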

4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:

Content-type: application/zip

or... 'application/octet-stream' can be used instead. (it's a generic content type used for binary downloads of all different kinds):

Content-type: application/octet-stream

and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):

Content-disposition: attachment; filename="file.zip"

One should also send the Content-length header, but this is hard with this technique as you don’t know the zip’s exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?

Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');

// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);


Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to result when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.

The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.
