Python Contingency Table

Question

I am generating many, many contingency tables as part of a project that I'm writing.

The workflow is:

  • Take a large data array with continuous (float) rows and convert those to discrete integer values by binning (so that the resulting row has values 0-9, for example); a possible binning sketch follows this list
  • Slice two rows into vectors X & Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution
  • For example, I'd have a 10 x 10 array, counting the number of (xi, yi) that occur
  • Use the contingency table to do some information theory math
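
For reference, a minimal sketch of what the binning step might look like; np.digitize and evenly spaced bin edges are assumptions here, since the question does not show how the binning is actually done:

import numpy as np

def bin_rows(data, num_bins):
    # data: 2-D float array, one continuous row per variable
    # returns integer labels in 0..num_bins-1 for every value
    binned = np.empty(data.shape, dtype=int)
    for i, row in enumerate(data):
        edges = np.linspace(row.min(), row.max(), num_bins + 1)
        # use only the interior edges so labels land in 0..num_bins-1
        binned[i] = np.digitize(row, edges[1:-1])
    return binned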

Initially, I wrote this:

import numpy as np

def make_table(x, y, num_bins):
    # Count how often each (x[i], y[i]) pair occurs, one element at a time.
    ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable

This works fine, but is so slow that it's eating up like 90% of the runtime of the entire project.

The fastest python-only optimization I've been able to come up with is this:

def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
    # Collapse each (x, y) pair into a single flat index: num_bins * x + y
    reindex = np.dot(np.stack((x, y)).transpose(),
                     np.array([num_bins, 1]))
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))

That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in cython?

Also, here's a benchmarking function.

import time
import numpy as np

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    # end - start is the total wall-clock time for all repeats, in seconds
    print("Func {na}: {ti} s".format(na=func.__name__, ti=(end - start)))

Answer

The clever trick for representing the elements of np.stack((x, y)) as integers can be made faster:

In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [94]: %timeit bins*x + y
12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
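
The two expressions produce the same flattened indices, since the dot product of each (x_i, y_i) row with [bins, 1] is exactly bins * x_i + y_i; a quick check (my own addition, not from the answer):

reindex_dot = np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
reindex_fast = bins * x + y
assert np.array_equal(reindex_dot, reindex_fast)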

Moreover, with that, simply consider

np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

What is more, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to

np.bincount(bins * x + y).reshape((bins, bins))
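
One caveat worth adding (my note, not part of the original answer): both one-liners assume that the largest possible index bins * bins - 1 actually occurs in the data; np.unique also drops combinations that never occur, so on sparse data the reshape can fail. Passing minlength to np.bincount avoids that; wrapped here as a hypothetical make_table_bincount so it fits the benchmark above:

def make_table_bincount(x, y, num_bins):
    # minlength guarantees a full num_bins**2-long count vector even when
    # some (x, y) combinations, including the largest one, never occur
    counts = np.bincount(num_bins * x + y, minlength=num_bins ** 2)
    return counts.reshape((num_bins, num_bins))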

All in all, this gives quite a performance improvement over what you are currently doing:

In [78]: %timeit make_table(x, y, bins)  # Your first solution
3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [79]: %timeit make_table2(x, y, bins)  # Your second solution
443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))
307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))
30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You might also want to know about np.histogramdd, which handles the rounding and binning at the same time, although it is likely to be slower than rounding and then using np.bincount.
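
A minimal sketch of that alternative; the bins and range arguments are my assumptions for integer data in 0..bins-1:

sample = np.stack((x, y), axis=-1)  # shape (N, 2)
table, _ = np.histogramdd(sample, bins=bins, range=[(0, bins), (0, bins)])
# table is a (bins, bins) float array of counts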
