Julia: efficient ways to vcat n arrays
Question
I have a data structure, loaded from JSON, that resembles the following:
json_in =
[ Dict("customer" => "cust1", "transactions" => 1:10^6)
, Dict("customer" => "cust2", "transactions" => 1:10^6)
, Dict("customer" => "cust3", "transactions" => 1:10^6)]
I know of two methods to collapse the transactions into one array:
@time methodA = reduce(vcat,[cust["transactions"] for cust in json_in])
@time methodB = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])
However, methodA takes ~0.22s versus ~0.02s for methodB on my computer. I intend to perform this thousands of times, so a 10x speedup is a big deal.
I can see that methodB is not very robust: it only handles exactly 3 Dicts (customers), so even though it performs well, it doesn't generalise.
What is the most efficient way to concatenate arrays that are stored as elements in an array of Dicts?
Recommended answer
As @Gnimuc states in his comment, you should not benchmark in global scope, and benchmarks are best done with BenchmarkTools.jl. Here are the timings done right:
julia> methodA(json_in) = reduce(vcat,[cust["transactions"] for cust in json_in])
methodA (generic function with 1 method)
julia> methodB(json_in) = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])
methodB (generic function with 1 method)
# Gnimuc's syntax from his comment
julia> methodC(json_in) = mapreduce(x->x["transactions"], vcat, json_in)
methodC (generic function with 1 method)
julia> using BenchmarkTools
julia> @benchmark methodA(json_in)
BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 15
--------------
minimum time: 10.584 ms (3.10% GC)
median time: 14.781 ms (32.02% GC)
mean time: 15.112 ms (32.19% GC)
maximum time: 69.341 ms (85.28% GC)
--------------
samples: 331
evals/sample: 1
julia> @benchmark methodB(json_in)
BenchmarkTools.Trial:
memory estimate: 22.89 MiB
allocs estimate: 2
--------------
minimum time: 5.921 ms (5.92% GC)
median time: 8.402 ms (32.48% GC)
mean time: 8.701 ms (33.46% GC)
maximum time: 69.268 ms (91.09% GC)
--------------
samples: 574
evals/sample: 1
julia> @benchmark methodC(json_in)
BenchmarkTools.Trial:
memory estimate: 38.15 MiB
allocs estimate: 12
--------------
minimum time: 10.599 ms (3.37% GC)
median time: 14.843 ms (32.12% GC)
mean time: 15.228 ms (32.24% GC)
maximum time: 71.954 ms (85.95% GC)
--------------
samples: 328
evals/sample: 1
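Method B is still roughly twice as fast. That is because it is more specialized: it works on an array with exactly three elements and concatenates them in a single vcat call. The memory estimates (22.89 MiB for B versus 38.15 MiB for A and C) suggest that methods A and C also allocate an intermediate array along the way.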
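One way to keep method B's single concatenation while handling any number of customers is to splat the arrays into one vcat call, or to preallocate the output and copy each piece into it. The following is a minimal sketch, assuming `json_in` is non-empty and every entry has a `"transactions"` key; `methodB_splat` and `concat_transactions` are hypothetical names, and neither variant has been benchmarked here:

```julia
# Hypothetical helper: one vcat call over all customers via splatting.
# Assumption: the number of customers is moderate (splatting very long
# collections can be slow to compile).
methodB_splat(json_in) = vcat((cust["transactions"] for cust in json_in)...)

# Hypothetical helper: allocate the result once, then copy each piece in.
# Assumption: json_in is non-empty.
function concat_transactions(json_in)
    parts = [cust["transactions"] for cust in json_in]
    T = mapreduce(eltype, promote_type, parts)      # common element type
    out = Vector{T}(undef, sum(length, parts))      # single allocation
    pos = 1
    for p in parts
        copyto!(out, pos, p, firstindex(p), length(p))
        pos += length(p)
    end
    return out
end

# Example with four customers instead of three.
json_in = [Dict("customer" => "cust$i", "transactions" => 1:10^6) for i in 1:4]
a = methodB_splat(json_in)
b = concat_transactions(json_in)
```

Both variants return an ordinary `Vector`, so downstream code does not need to change. Timing them with `@benchmark` against methods A to C on your own data is the only reliable way to see which one wins.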
An alternative that might work well here is a MappedArray (from MappedArrays.jl), which creates a lazy view into the original array:
using MappedArrays
method4(json_in) = mappedarray(x->x["transactions"], json_in)
Of course this doesn't concatenate the arrays, but you can concatenate views using the CatViews package:
using CatViews
julia> method5(json_in) = reduce(CatView, mappedarray(x->x["transactions"], json_in))
method5 (generic function with 1 method)
julia> @benchmark method5(json_in)
BenchmarkTools.Trial:
memory estimate: 1.73 KiB
allocs estimate: 46
--------------
minimum time: 23.320 μs (0.00% GC)
median time: 23.916 μs (0.00% GC)
mean time: 25.466 μs (0.00% GC)
maximum time: 179.092 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Because it doesn't copy the data (it allocates only 1.73 KiB), it is roughly 300x faster than method B. It is possible, though, that using the result is slower because of nonlocality, so that is worth benchmarking.
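If the result only needs to be iterated rather than indexed, Base's `Iterators.flatten` offers a similar lazy concatenation without extra packages. This is a different technique from `CatView`, shown here only as a sketch; `lazy_transactions` is a hypothetical name, and how it compares in speed with a materialized array depends on the workload and is not measured here:

```julia
# Hypothetical helper: lazily chain every customer's transactions without copying.
lazy_transactions(json_in) =
    Iterators.flatten(cust["transactions"] for cust in json_in)

json_in = [Dict("customer" => "cust$i", "transactions" => 1:10^6) for i in 1:3]

lazy = lazy_transactions(json_in)   # nothing is copied at this point
total = sum(lazy)                   # consumes the elements one by one
materialized = collect(lazy)        # copy into a Vector when indexing is needed
```

The lazy form avoids the large allocation up front, while `collect` pays that cost once in exchange for contiguous memory and random access.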