Optimizing php command line scripts to process large flat files


Problem description

For the downvote fairies: I know that PHP is the wrong language for this, but I am working under outside constraints. Given that:

I have a large flat file that I need to process in PHP. I convert the flat file into a normalized database in MySQL. There are several million lines in the flat file.

I originally tried to use an ORM system while importing the flat file. There was a massive PHP memory-leak problem with that design, even with careful freeing of objects. Even if I ensured that there was enough memory, the script would have taken about 25 days to run on my desktop.

I stripped out the overhead and rewrote the script to build MySQL commands directly. I removed AUTO_INCREMENT from my design, since it required me to ask MySQL what the last inserted id was in order to make relations between data points. I just use a global counter for database ids instead, and I never do any lookups, only inserts.
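
A minimal sketch of that pattern, assuming hypothetical parent/child tables and a tab-delimited chunk file (the table and column names are placeholders, not the actual schema from the question):

```php
<?php
// Sketch: assign ids from a global counter and emit plain INSERTs so the
// script never has to ask MySQL for the last inserted id.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'pass');

$nextParentId = 1;   // global counters replace AUTO_INCREMENT
$nextChildId  = 1;

$fh = fopen('chunk.txt', 'r');
while (($line = fgets($fh)) !== false) {
    [$parentName, $childName] = explode("\t", rtrim($line, "\n"));

    $parentId = $nextParentId++;
    $childId  = $nextChildId++;

    // The relationship is known up front because the ids are chosen here,
    // not by the database.
    $pdo->exec(sprintf(
        'INSERT INTO parent (id, name) VALUES (%d, %s)',
        $parentId, $pdo->quote($parentName)
    ));
    $pdo->exec(sprintf(
        'INSERT INTO child (id, parent_id, name) VALUES (%d, %d, %s)',
        $childId, $parentId, $pdo->quote($childName)
    ));
}
fclose($fh);
```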

I use the unix split command to make lots of small files instead of one big one, because there is memory overhead associated with using the same file pointer again and again.
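
Roughly, that workflow looks like the sketch below; the chunk size, the chunk_ prefix and the importLine() helper are assumptions for illustration:

```php
<?php
// The big file is assumed to have been chopped up beforehand with:
//   split -l 100000 bigfile.txt chunk_
// which produces chunk_aa, chunk_ab, ... of 100k lines each.

foreach (glob('chunk_*') as $chunkFile) {
    $fh = fopen($chunkFile, 'r');
    while (($line = fgets($fh)) !== false) {
        importLine($line);   // hypothetical per-line import logic
    }
    fclose($fh);             // the pointer is released after each small file
}

function importLine(string $line): void
{
    // placeholder for the real INSERT-building code
}
```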

Using these optimizations (I hope they help someone else), I got the import script to run in about 6 hours.

I rented a virtual instance with 5 times more RAM and about 5 times more processor power than my desktop and noticed that it ran at exactly the same speed. The server runs the process but has CPU cycles and RAM to spare. Perhaps the limiting factor is disk speed. But I have lots of RAM. Should I try loading the files into memory somehow? Any suggestions for further optimization of PHP command line scripts processing large files are welcome!

Recommended answer

You won't like it, but... it sounds like you are using the wrong language for the task at hand. If you want some huge leaps in speed, then a port to a compiled language would be the next step. Compiled languages run much, much faster than a scripting language ever will, so you'll see your processing time drop off.

Additionally, you might be able to dump the data into the DB using a built-in command. Postgres has one (COPY) which reads in a tab-delimited text file whose columns match up with the columns in the table; MySQL's equivalent is LOAD DATA INFILE. That would allow you to focus on getting a text file into the right format and then loading it into the DB with one command, letting the database handle the optimisation rather than doing it yourself.
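
Since the asker's database is MySQL, a minimal sketch of that idea using LOAD DATA LOCAL INFILE might look like the following; the table name, column list and file path are assumptions, and LOCAL INFILE has to be enabled on both the client and the server:

```php
<?php
// Bulk-load a tab-delimited file with MySQL's LOAD DATA instead of issuing
// millions of individual INSERT statements.
$pdo = new PDO(
    'mysql:host=localhost;dbname=import',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]   // needed for LOCAL INFILE
);

$pdo->exec("
    LOAD DATA LOCAL INFILE '/tmp/records.tsv'
    INTO TABLE records
    FIELDS TERMINATED BY '\\t'
    LINES TERMINATED BY '\\n'
    (id, parent_id, name)
");
```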

You've done the right thing by knocking the ORM on the head. Splitting the file should not be needed, though, as your text file reader should just use a buffer internally, so it "should" not matter; but I'm not a *nix guy, so I could be wrong on that front.
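
For what it's worth, a plain fgets() loop over the single big file would look like the sketch below; the memory reporting is only there to illustrate that a buffered streaming read should keep memory flat regardless of file size:

```php
<?php
// Streaming one big file with fgets() goes through PHP's internal stream
// buffer, so peak memory should stay roughly constant for the whole run.
$fh = fopen('bigfile.txt', 'r');
$lines = 0;

while (($line = fgets($fh)) !== false) {
    $lines++;
    if ($lines % 1000000 === 0) {
        printf("%d lines, peak memory %.1f MB\n",
               $lines, memory_get_peak_usage(true) / 1048576);
    }
}
fclose($fh);
```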

We've done something similar with a .NET app that chomps through 20 GB of files every morning, doing a RegExp on every line, keeping an in-memory hash of unique records and then poking new ones into a DB. From that we then spit out 9000+ JS files using a Ruby script for ease (this is the slowest part). We used to have the importer written in Ruby too, and the whole thing took 3+ hours; the rewrite in .NET runs the whole process in about 30-40 minutes, and 20 of that is the slow Ruby script (not worth optimising anymore, though it does the job well enough).
