LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing


Question

Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user or save it to disk and redirect (deleting it some time in the future).

However, this approach has drawbacks:

  • an initial phase of intensive CPU and disk thrashing, resulting in...
  • a considerable initial delay for the user while the archive is prepared
  • a very high memory footprint per request
  • use of substantial temporary disk space
  • if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted
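
To make these drawbacks concrete, here is a minimal sketch of the naive temp-file approach (an illustration only; the ZipArchive usage and file paths are assumptions, not taken from the question):

<?php
// naive approach: build the whole archive on disk, then send it
// (file paths here are hypothetical)
$tmp = tempnam(sys_get_temp_dir(), 'zip');

$zip = new ZipArchive();
$zip->open($tmp, ZipArchive::OVERWRITE);
$zip->addFile('/path/to/file1');   // every file is read and compressed
$zip->addFile('/path/to/file2');   // up front, before any output is sent
$zip->close();

header('Content-Type: application/zip');
header('Content-Length: ' . filesize($tmp));
readfile($tmp);   // a second full pass over the data
unlink($tmp);     // temporary disk space is held until this point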

Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), and large, thrashy spikes in disk and CPU usage.

In contrast, consider the following bash snippet:

ls -1 | zip -@ - | cat > file.zip
  # Note -@ is not supported on MacOS

Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer – when the buffer is full, the OS suspends the writing program (the program on the left of the pipe). This ensures that zip works only as fast as its output can be written by cat.

The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user as it is created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.

How can you achieve this on a LAMP stack?

Answer

You can use popen() (docs) or proc_open() (docs) to execute a unix command (eg. zip or gzip), and get back stdout as a php stream. flush() (docs) will do its very best to push the contents of php's output buffer to the browser.

Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).

(Note: don't use flush(). See the update below for details.)

Something like the following can do the trick:

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);
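
If you also need to control the pipeline's input (for example, to feed it the list of files to archive), proc_open() exposes the child's stdin as well. A minimal sketch, assuming GNU tar, whose -T - option reads file names from stdin; the file names are hypothetical:

<?php
header('Content-Type: application/x-gzip');

$descriptors = array(
    0 => array('pipe', 'r'),   // child's stdin: we write the file list here
    1 => array('pipe', 'w'),   // child's stdout: we read the compressed stream
);
$proc = proc_open('tar cf - -T - | gzip -c', $descriptors, $pipes);
if (is_resource($proc)) {
    fwrite($pipes[0], "file1\nfile2\nfile3\n");
    fclose($pipes[0]);                 // signal the end of the file list

    while (!feof($pipes[1])) {
        echo fread($pipes[1], 8192);   // stream to the client as it arrives
    }
    fclose($pipes[1]);
    proc_close($proc);
}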


You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.

If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server; and use child_process module to spawn the tar/zip/whatever pipeline.

Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per processor core.

Update (from Benji's excellent feedback in the comments section on this answer)

1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.

[editorial note] 8192 is almost certainly a platform dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the os to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.

There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, when the "remote" client (or process) is slow to fill the buffer. In most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0..os_buffer_size bytes get returned.

The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).
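
To make that concrete, here is a sketch of a read loop that treats the buffer size strictly as a maximum (the file names are hypothetical, and the error check is an addition, not part of the original example):

<?php
$fp = popen('zip -r - file1 file2 file3', 'r');

while (!feof($fp)) {
    $chunk = fread($fp, 8192);   // 8192 is a maximum, not a guarantee
    if ($chunk === false) {
        break;                   // read error: stop rather than loop forever
    }
    // strlen($chunk) may be anywhere from 0 up to 8192
    echo $chunk;
}
pclose($fp);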

2. According to comments on fread docs, a few caveats: magic quotes may interfere and must be turned off.

3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.
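
Caveats 2 and 3 can both be handled defensively before streaming begins. A sketch, hedged for older PHP versions (set_magic_quotes_runtime() only exists up to PHP 5.x, so the call is guarded):

<?php
// make sure mbstring passes output through untouched
if (function_exists('mb_http_output')) {
    mb_http_output('pass');
}
// magic_quotes_runtime (PHP <= 5.3) can corrupt binary reads; turn it off
if (function_exists('set_magic_quotes_runtime') && get_magic_quotes_runtime()) {
    @set_magic_quotes_runtime(false);
}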

4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:

Content-type: application/zip

or... 'application/octet-stream' can be used instead. (It's a generic content type used for binary downloads of all different kinds):

Content-type: application/octet-stream

and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):

Content-disposition: attachment; filename="file.zip"

One should also send the Content-length header, but this is hard with this technique as you don't know the zip's exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?

Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to 
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');

// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while( !feof($fp) ) {
   $buff = fread($fp, $bufsize);
   echo $buff;
}
pclose($fp);


Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to result when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.

The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.
