Subsetting a Python DataFrame

Question

I am transitioning from R to Python and have just begun using pandas. I have some R code that subsets nicely:

k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))

Now I want to do something similar in Python. This is what I have so far:

import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")


# First, index the dataset by Product, then get everything that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]

# then, index this subset with Time and do more subsetting..

I am beginning to feel that I am doing this the wrong way. Perhaps there is a more elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and then subset on them. Perhaps there is a one-liner that accomplishes all of this, equivalent to:

k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))

Thanks.

Recommended answer

I'll assume that Time and Product are columns in a DataFrame, that df is an instance of DataFrame, and that the other variables are scalar values.

For now, you'll have to reference the DataFrame instance explicitly:

k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]

The parentheses are necessary because of the precedence of the & operator relative to the comparison operators. The & operator is an overloaded bitwise operator, and in Python bitwise operators bind more tightly than comparison operators. Without the parentheses, df.Product == p_id & df.Time >= start_time would be parsed as the chained comparison df.Product == (p_id & df.Time) >= start_time, which is not what you want.
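As a minimal, self-contained sketch of the boolean-mask approach: the sample rows, the Sales column, and the values of p_id, start_time, and end_time below are made up for illustration and are not taken from the asker's CSV file.

```python
import pandas as pd

# Hypothetical sample data standing in for monthly_prod_sales.csv
df = pd.DataFrame({
    "Time": pd.to_datetime(["2013-01-15", "2013-02-15", "2013-03-15", "2013-02-20"]),
    "Product": ["A", "A", "A", "B"],
    "Sales": [10, 20, 30, 40],
})

# Illustrative scalar values (assumptions, not from the question)
p_id = "A"
start_time = pd.Timestamp("2013-02-01")
end_time = pd.Timestamp("2013-03-01")

# Each comparison is parenthesized so that & combines three boolean Series
mask = (df["Product"] == p_id) & (df["Time"] >= start_time) & (df["Time"] < end_time)

# Rows are chosen by the mask, columns by the list of labels
k1 = df.loc[mask, ["Time", "Product"]]
print(k1)
```

Building the mask in a separate variable is only a readability choice; passing the expression directly to .loc behaves the same way.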

pandas 0.13 introduced the DataFrame.query() method (experimental at the time). It is very similar to R's subset, apart from the select argument. With query(), you can do this:

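# Filter first (Month and Year must still exist when query() runs); @ marks local Python variables
df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]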
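The query above assumes that Month and Year columns already exist. If the data only has a timestamp column, one way to derive them is the .dt accessor on a datetime column. The sketch below uses made-up rows and made-up values for p_id, mn, and yr, and assumes Time has already been parsed as datetimes (for example with pd.to_datetime):

```python
import pandas as pd

# Hypothetical sample data; Time is parsed into datetime64 values
df = pd.DataFrame({
    "Time": pd.to_datetime(["2012-11-05", "2013-02-10", "2013-06-21", "2013-03-02"]),
    "Product": ["A", "A", "A", "B"],
})

# Derive the helper columns from the timestamp
df["Month"] = df["Time"].dt.month
df["Year"] = df["Time"].dt.year

# Illustrative scalar values (assumptions, not from the question)
p_id, mn, yr = "A", 6, 2013

# Filter on the derived columns, then keep only the columns of interest
k1 = df.query("Product == @p_id and Month < @mn and Year == @yr")[["Time", "Product"]]
print(k1)
```

Selecting the columns after the filter matters here: if Month and Year were dropped first, query() would have nothing to compare against.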

Here's a simple example:

In [8]: import numpy as np, pandas as pd

In [9]: df = pd.DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': np.random.poisson(100, size=10)})

In [10]: df
Out[10]:
  gender  price
0      m     89
1      f    123
2      f    100
3      m    104
4      m     98
5      m    103
6      f    100
7      f    109
8      f     95
9      m     87

In [11]: df.query('gender == "m" and price < 100')
Out[11]:
  gender  price
0      m     89
4      m     98
9      m     87

The final query that you're interested in can even take advantage of chained comparisons, like this:

k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
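To check that the chained-comparison query selects the same rows as the explicit boolean mask, a small comparison can be sketched as follows. The data and the values of p_id, start_time, and end_time are illustrative assumptions; local variables are referenced inside the query string with an @ prefix.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Time": pd.to_datetime(["2013-01-15", "2013-02-15", "2013-03-15", "2013-02-20"]),
    "Product": ["A", "A", "A", "B"],
})

# Illustrative scalar values (assumptions, not from the question)
p_id = "A"
start_time = pd.Timestamp("2013-02-01")
end_time = pd.Timestamp("2013-03-01")

# query() version with a chained comparison on Time
by_query = df[["Time", "Product"]].query(
    "Product == @p_id and @start_time <= Time < @end_time"
)

# Equivalent boolean-mask version with explicit parentheses
by_mask = df.loc[
    (df["Product"] == p_id) & (df["Time"] >= start_time) & (df["Time"] < end_time),
    ["Time", "Product"],
]

# The two selections should agree
print(by_query.equals(by_mask))
```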
