增加CPU密集型PHP脚本演出 [英] increase performances of CPU-intensive PHP scripts

查看:196
本文介绍了增加CPU密集型PHP脚本演出的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个PHP脚本,这需要时间(也许天)来执行。这是很简单,但非常CPU密集型,大部分的执行时间都花在为(具有成型的剧本后,我可以告诉):

I have a PHP script which takes hours (maybe days) to execute. It is quite simple but very CPU-intensive, most of the execution time is spent into (I can tell after having profiled the script):


  1. $阵列=爆炸('',$ A [$ i]);
    其中, $ a [$ i] 是重新$ P $一个很长的字符串psents用逗号隔开30K元素的矢量

  1. $array = explode(',', $a[$i]); where $a[$i] is a very long string which represents a vector of 30k elements separated by comma

的foreach($数组$关键=> $值)循环;其中,每个回路的一些in_array()和比较和赋值操作被执行

foreach($array as $key => $value) loops; where for each loop some in_array() and comparison and assignment operations are performed

$ A 实际上是一个非常大的稀疏矩阵(30K * 30K),但我不能保持它在内存(8GB似乎没有足够的RAM),所以我一直只是一个疏重新presentation(基本上每行是一个字符串),并使用爆炸()任何时候,我需要在一排工作。

$a is actually a very large and sparse matrix (30k * 30k) but I can't keep it in memory (8GB seems to be not enough RAM) so I keep just a "sparse representation" (basically each row is a string) and use explode() any time I need to work on a row.

我知道,改写C(或其他语言)都将提高性能(多少钱?),但是,这样做,我想知道如果我可以做任何事情,以改善在PHP执行时间之前。

I know that rewriting everything in C (or other languages) would improve performances (how much?) but, before doing that I would like to know if I can do anything to improve the execution time in PHP.

答案后编辑。

我试了你的建议,这里是我的报告:

I tried several of your advises and here is my report:

1)str_getcsv是在最慢于爆炸的情况下

1) str_getcsv is in most of the case slower than explode

2)SPLFixedArray降低要求来存储矩阵记忆,但仍然,8GB是不够的,一个30K点¯x30K矩阵,所以我不认为它可以帮助很多;这里真正的问题是缺乏在PHP稀疏重新presentation矩阵我觉得

2) SPLFixedArray decrease the memory requested to store the matrix but still, 8GB was not enough for a 30k x 30k matrix so I don't think it can help much; the real problem here is the lack of a sparse representation for matrix in PHP I think

3)我不能存储的爆炸行动的所有结果,因为还在,这将意味着在保持内存中的整个矩阵(没有足够的RAM)

3) I can't store all the results of the explode operations because, still, that would mean keeping the whole matrix in memory (not enough RAM)

4)我试过,即使我确信这将是慢的数据库的方法:我已存储三元(I,J值)重新present每个矩阵元素;即使删除不太重要的值(我可以小于一个阈牺牲值,并获得以下precise的结果,但仍然是有用的),并存储刚刚18百万元组,与MySQL的myisam的方法比在存储器我的阵列的方法慢得多

4) I've tried the database approach even if I was sure it would have been slower: I've stored triples (i,j,value) to represent each matrix element; even deleting the less important values (I can sacrifice values less than a threshold and get a less precise result, but still useful) and storing just 18 millions tuples, the approach with mysql myisam is much slower than my array approach in memory.

5)我试着使用MEMORY引擎(RAM中的MySQL表)和存储除了具有零值的那些所有的矩阵元素数据库的方法;有42百万记录这一次...它的速度更快,不是一个数量级,但2-4倍的速度......我想我可以在5天内,而不是15-20完成任务......它仍然是太多了(我想在24小时内完成),如果您有任何其他建议,你都非常欢迎。

5) I've tried the database approach using the MEMORY engine (a mysql table in RAM) and storing all the matrix elements except the ones having value zero; having 42 millions records this time...it is faster, not an order of magnitude but 2-4 times faster...I think I can finish the job in 5 days instead of 15-20...it is still too much (I would like to finish in 24 hours), if you have any other suggestions you are very welcome

编辑2:我解释一下这个问题。

EDIT 2: I explain the problem

我给有关该问题的一些细节,我真的需要简化一切,否则这将是太多时间去解释,但我认为它是足够更好地了解有关情况。

I'll give some details about the problem, I really need to simplify everything otherwise it would be too long to explain but I think it is enough to understand better the situation.

我有节点之间的矩阵重新presenting距离;在一个整数的距离也可以是无限的。

I have a matrix representing distances among nodes; the distance in an integer and could also be infinite.

我有一个内存表重新presenting与三元每个距离:node_1,node_2,距离(只非无限距离重新presented)

I have a memory table representing each distance with triples: node_1, node_2, distance (just the non infinite distances are represented).

我有这种贪心算法的,我没有写,我应该优化一个可行的时间来执行它(比方说,不到一天的时间),在具有8GB内存的笔记本电脑。

I have this sort of greedy algorithm that I did not write and I should optimize to execute it in a feasible time (let's say less than one day) on a laptop having 8GB of RAM.

的算法中基本上得到在输入两个节点,并设计一个起始节点,并根据必须在每一步进行校验以下两个属性一个结束节点步步之间的路径:

The algo basically gets in input two nodes and designs a path between a starting node and an ending node step by step according to the following two properties that must be verified at each step:


  • 新的中间节点,必须在组更接近结束节点相对于当前节点
  • 节点之间选择
  • 这些节点之间,一个是到当前节点越近被选择

  • the new intermediate node must be chosen among the set of nodes that are closer to the ending node respect to the current node
  • among those nodes, the one which is the closer to the current node is chosen

请考虑
1)三角不等式不满意。
2)这不是一个最短路径问题

Please consider that 1) The triangle inequality is NOT satisfied. 2) It is NOT a shortest path problem

下面是一些伪code为我叫了几次的函数,直到我足够接近结束节点:

Here is some pseudo code for the function I call several times until I'm close enough to the ending node:

get_next_node($node_1, $node_2){

    $dist = select distance from distances_table where node_2 = $node_2 and node_1 = $node_1

    $candidates_ar = select node_1 from distances_table where node_2 = $node_2 and distance < $dist

    $distances_ar = select distance from distances_table where node_1 = $node_1 and node_2 in ($candidates_ar) // e.g. $distances_ar[12] contains distance between node 12 and $node_1

    $min = 1000;
    foreach ($candidates_ar as $value){
        if ($distances_ar[$value] < $min){
            $min = $distances_ar[$value]
            $next_node = $value
        }
    }

}

我省略了很多检查和额外的复杂性,但这是基本而这正是算法中最花费时间的。

I have omitted a lot of checks and additional complexity, but this is the basic and this is where the algo spends most of time.

我想它可以与A *的实现来解决,但我想避免它,如果它可以提高性能,这样我可以以小时(而不是几天)执行它。

I guess it can be solved with an implementation of A* but I would like to avoid it if it is possible to increase the performances so that I can execute it in hours (not days).

感谢。

推荐答案

好吧,你已经有了一个性能问题。现在,有趣的部分开始了。

Ok, you've got a performance problem. Now the fun part begins.

第一步,就是不用猜。不要启动C.改写不要切换PHP编译器。这对吸盘。相反,通过试图找到实际的瓶颈开始。

The first step, is don't guess. Don't start rewriting in C. Don't switch PHP compilers. That's for suckers. Instead, start by trying to find the actual bottlenecks.

获取了XDebug并生成应用程序的 cachegrind分析。这将显示您所在的大部分时间都花在。

Get XDEBUG and generate a cachegrind profiling of the application. This will show you where the majority of the time is spent.

您也可以使用 XHProf的

问题的关键是,不用猜,但轮廓。查找算法的缓慢部分,然后工作,以对其进行优化。

The point is, don't guess, but profile. Find the slow parts of the algorithms, and then work to optimize them.

这个问题可能不是code,但您使用的算法。我建议尝试正规化算法,这样你就可以尝试优化和调整部分为您的特定限制。

The problem is likely not the code, but the algorithm that you're using. I'd suggest trying to formalize the algorithm, so that you can then try to optimize and tweak parts for your specific constraints.

例如。现在,你解析大CSV字符串。为什么?为什么不坚持在一个数据库,并让数据库做繁重的吗?显然,它可能无法与您的特定用例,但每当我看到人们在PHP 30K元素的数组操作,通常是因为他们正在做的事情,他们不应该摆在首位。

For example. Right now, you're parsing large CSV strings. Why? Why not stick that in a database and let the database do the heavy lifting for you? Obviously it may not be possible with your specific use-case, but whenever I see people operating on arrays of 30k elements in PHP, typically that's because they are doing something they shouldn't be in the first place.

如果一切都失败了,尝试块的算法,这样就可以在部分运行它。这样,你可以尝试做一个地图,减少或类似的技术来调整运行时间。

And if all else fails, try to chunk the algorithm so that you can run it in parts. That way you can try to do a map-reduce or similar technique to tweak runtime.

总之,它的真的取决于正是你在做什么。但是,重新编码或切换运行时间将是我的最后的度假胜地,而不是第一步...

In short, it really depends on what exactly you're doing. But re-coding or switching runtimes would be my last resort, not the first step...

这篇关于增加CPU密集型PHP脚本演出的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆