Subsetting data in Python


Question

I want to use the equivalent of R's subset command in some Python code I am writing.

Here is my data:

col1    col2  col3  col4  col5
100002  2006  1.1   0.01  6352
100002  2006  1.2   0.84  304518
100002  2006  2     1.52  148219
100002  2007  1.1   0.01  6292
10002   2006  1.1   0.01  5968
10002   2006  1.2   0.25  104318
10002   2007  1.1   0.01  6800
10002   2007  4     2.03  25446
10002   2008  1.1   0.01  6408

I want to subset the data based on the contents of col1 and col2. (The unique values in col1 are 100002 and 10002, and in col2 they are 2006, 2007 and 2008.)

This can be done in R using the subset command. Is there anything similar in Python?

Thanks!

Recommended answer

While the iterator-based answers are perfectly fine, if you're working with numpy arrays (as you mention you are), there are better and faster ways of selecting things:

import numpy as np

data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2006, 1.2, 0.84, 304518],
    [100002, 2006, 2,   1.52, 148219],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002,  2006, 1.1, 0.01, 5968],
    [10002,  2006, 1.2, 0.25, 104318],
    [10002,  2007, 1.1, 0.01, 6800],
    [10002,  2007, 4,   2.03, 25446],
    [10002,  2008, 1.1, 0.01, 6408],
])

# Boolean mask on the first column keeps only the matching rows
subset1 = data[data[:,0] == 100002]
subset2 = data[data[:,0] == 10002]

This yields

subset1:

array([[  1.00002e+05,   2.006e+03,   1.10e+00, 1.00e-02,   6.352e+03],
       [  1.00002e+05,   2.006e+03,   1.20e+00, 8.40e-01,   3.04518e+05],
       [  1.00002e+05,   2.006e+03,   2.00e+00, 1.52e+00,   1.48219e+05],
       [  1.00002e+05,   2.007e+03,   1.10e+00, 1.00e-02,   6.292e+03]])

subset2:

array([[  1.0002e+04,   2.006e+03,   1.10e+00, 1.00e-02,   5.968e+03],
       [  1.0002e+04,   2.006e+03,   1.20e+00, 2.50e-01,   1.04318e+05],
       [  1.0002e+04,   2.007e+03,   1.10e+00, 1.00e-02,   6.800e+03],
       [  1.0002e+04,   2.007e+03,   4.00e+00, 2.03e+00,   2.5446e+04],
       [  1.0002e+04,   2.008e+03,   1.10e+00, 1.00e-02,   6.408e+03]])

If you don't know the unique values in the first column beforehand, you can use either numpy.unique (documented as numpy.unique1d in NumPy 1.3: http://docs.scipy.org/doc/numpy-1.3.x/reference/generated/numpy.unique1d.html) or the builtin function set to find them.
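As a minimal sketch of that idea, the snippet below finds the distinct values of the first column both ways and then builds one subset per value. The shortened `data` array and the names `first_col`, `unique_sorted`, `unique_set` and `subsets_by_col1` are illustrative assumptions, not part of the original answer.

```python
import numpy as np

# Shortened sample of the question's data (assumption: same column layout).
data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002,  2006, 1.1, 0.01, 5968],
    [10002,  2008, 1.1, 0.01, 6408],
])

first_col = data[:, 0]

# np.unique returns the distinct values as a sorted array.
unique_sorted = np.unique(first_col)

# The builtin set also works, but its iteration order is not guaranteed.
unique_set = set(first_col.tolist())

# One subset per distinct value of the first column.
subsets_by_col1 = {float(v): data[first_col == v] for v in unique_sorted}
```

Because the array mixes integers and decimals, numpy stores every entry as a float, so the dictionary keys here are floats such as `10002.0`.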

Edit: I just realized that you wanted to select data where you have unique combinations of two columns. In that case, you might do something like this:

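import itertools

col1 = data[:,0]
col2 = data[:,1]

# Map each (col1, col2) combination that occurs in the data to its rows
subsets = {}
for val1, val2 in itertools.product(np.unique(col1), np.unique(col2)):
    subset = data[(col1 == val1) & (col2 == val2)]
    # Skip combinations with no matching rows
    if subset.size:
        subsets[(val1, val2)] = subset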

(I'm storing the subsets as a dict, with the key being a tuple of the combination. There are certainly other ways to do this, and some may be better depending on what you're doing!)
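One such alternative, when only a single combination is needed rather than all of them, is to build the boolean mask directly, which is probably the closest match to filtering with a condition in R's subset. The sketch below assumes the same array layout as above; `subset_rows` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np

data = np.array([
    [100002, 2006, 1.1, 0.01, 6352],
    [100002, 2006, 1.2, 0.84, 304518],
    [100002, 2006, 2,   1.52, 148219],
    [100002, 2007, 1.1, 0.01, 6292],
    [10002,  2006, 1.1, 0.01, 5968],
    [10002,  2006, 1.2, 0.25, 104318],
    [10002,  2007, 1.1, 0.01, 6800],
    [10002,  2007, 4,   2.03, 25446],
    [10002,  2008, 1.1, 0.01, 6408],
])


def subset_rows(arr, col1_value, col2_value):
    """Return the rows whose first two columns equal the given values.

    Hypothetical helper for illustration; not part of numpy.
    """
    # Combine the two element-wise comparisons with & (logical AND).
    mask = (arr[:, 0] == col1_value) & (arr[:, 1] == col2_value)
    return arr[mask]


rows_10002_2007 = subset_rows(data, 10002, 2007)

# A combination that does not occur gives an empty array with 5 columns.
no_rows = subset_rows(data, 100002, 2008)
```

The parentheses around each comparison matter, because `&` binds more tightly than `==` in Python.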
