Splitting a List inside a Pandas DataFrame


Problem description

I have a csv file that contains a number of columns. Using pandas, I read this csv file into a dataframe and have a datetime index and five or six other columns.

One of the columns is a list of timestamps (example below with index)

CreateDate     TimeStamps
4/1/11         [Timestamp('2012-02-29 00:00:00'), Timestamp('2012-03-31 00:00:00'), Timestamp('2012-04-25 00:00:00'), Timestamp('2012-06-30 00:00:00')]
4/2/11         [Timestamp('2014-01-31 00:00:00')]
6/8/11         [Timestamp('2012-08-31 00:00:00'), Timestamp('2012-09-30 00:00:00'), Timestamp('2012-11-07 00:00:00'), Timestamp('2013-01-10 00:00:00'), Timestamp('2013-07-25 00:00:00')]

What I'd like to do is convert the timestamp column into separate rows for each timestamp listed. For example, for row 1 it would convert to 4 rows and row 2 would convert to 1 row. I realize I'd need to reset the index to be able to do this, which is fine.

Everything I've tried just ends up out in left field (taking the values and creating a list outside of pandas, etc.).

Any suggestions appreciated.

Solution

If you want to stay in pure pandas, you can throw in a tricky groupby and apply, which boils down to a one-liner if you don't count the column rename.

In [1]: import pandas as pd

In [2]: d = {'date': ['4/1/11', '4/2/11'], 'ts': [[pd.Timestamp('2012-02-29 00:00:00'), pd.Timestamp('2012-03-31 00:00:00'), pd.Timestamp('2012-04-25 00:00:00'), pd.Timestamp('2012-06-30 00:00:00')], [pd.Timestamp('2014-01-31 00:00:00')]]}

In [3]: df = pd.DataFrame(d)

In [4]: df.head()
Out[4]: 
     date                                                 ts
0  4/1/11  [2012-02-29 00:00:00, 2012-03-31 00:00:00, 201...
1  4/2/11                              [2014-01-31 00:00:00]

In [5]: df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0])).reset_index().drop('level_1', axis = 1)

In [6]: df_new.columns = ['date','ts']

In [7]: df_new.head()
Out[7]: 
     date         ts
0  4/1/11 2012-02-29
1  4/1/11 2012-03-31
2  4/1/11 2012-04-25
3  4/1/11 2012-06-30
4  4/2/11 2014-01-31

Since the goal is to take the value of a column (in this case date) and repeat it for every row you intend to create from the list, it's useful to think in terms of pandas indexing.

We want the date to become the single index for the new rows, so we use groupby, which puts the desired row value into an index. Then, inside that operation, I want to split only the list for this date, which is what apply will do for us.

I'm passing apply a pandas Series that consists of a single list, but I can access that list via .values[0]: .values turns the sole row of the Series into an array with a single entry, and [0] picks out that entry.

To turn the list into a set of rows that will be passed back to the indexed date, I can just make it a DataFrame. This incurs the penalty of picking up an extra index, but we end up dropping that. We could make the values an index themselves, but that would preclude duplicate values.

Once this is passed back out I have a multi-index, but I can force it into the row format we want with reset_index. Then we simply drop the unwanted index.

It sounds involved, but really we're just leveraging the natural behaviors of pandas functions to avoid explicitly iterating or looping.
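To make this concrete, here is a small sketch of what the function given to apply actually receives. It uses a cut-down version of the example frame and pulls out one group with get_group purely for inspection; that inspection step is an addition for illustration and is not part of the original one-liner.

```python
import pandas as pd

# Cut-down version of the example frame: one list of timestamps per date.
df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# Pull out the group for one date, i.e. the object apply would pass in.
group = df.groupby('date').ts.get_group('4/1/11')
print(type(group).__name__)   # a Series holding a single row
print(len(group))

# .values gives a one-element array; [0] is the list stored in that row.
inner = group.values[0]
print(type(inner).__name__)
print(inner)
```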
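The same one-liner can be unrolled into named steps, which may make the intermediate multi-index easier to see. This is only a sketch of the approach described above; the intermediate variable names (stacked, flat) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31'),
         pd.Timestamp('2012-04-25'), pd.Timestamp('2012-06-30')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# Step 1: build one small DataFrame per date. The combined result is indexed
# by (date, position within that date's list).
stacked = df.groupby('date').ts.apply(lambda x: pd.DataFrame(x.values[0]))
print(stacked.index.nlevels)

# Step 2: move both index levels into ordinary columns. The unnamed inner
# level comes out as 'level_1'.
flat = stacked.reset_index()
print(list(flat.columns))

# Step 3: drop the positional column and give the remaining ones real names.
df_new = flat.drop('level_1', axis=1)
df_new.columns = ['date', 'ts']
print(df_new)
```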

Speed-wise this tends to be pretty good, and since it relies on apply, any parallelization tricks that work with apply work here.

Optionally, if you want it to be robust to dates that appear more than once, each with its own nested list:
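For comparison, pandas 0.25 and later include DataFrame.explode, which does this kind of list-to-rows expansion directly. This is a different technique from the groupby/apply approach in this answer, shown only as a sketch for readers on a recent pandas version. The to_datetime call assumes the exploded column may come back with object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31'),
         pd.Timestamp('2012-04-25'), pd.Timestamp('2012-06-30')],
        [pd.Timestamp('2014-01-31')],
    ],
})

# explode repeats the other columns once per list element; the original row
# labels are repeated too, so reset them to get a clean 0..n-1 index.
df_new = df.explode('ts').reset_index(drop=True)

# Convert back to a proper datetime column in case the dtype is object.
df_new['ts'] = pd.to_datetime(df_new['ts'])
print(df_new)
```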

df_new = df.groupby('date').ts.apply(lambda x: pd.DataFrame([item for sublist in x.values for item in sublist]))

at which point the one-liner is getting dense and you should probably move it into a function.
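One way that function might look is sketched below. The name split_list_column and its key/col parameters are hypothetical, chosen for illustration; the body just wraps the flattening lambda together with the reset_index, drop and rename steps from earlier.

```python
import pandas as pd


def split_list_column(df, key='date', col='ts'):
    """Hypothetical helper: one output row per item in each list of `col`."""
    def flatten(group):
        # A group may hold several rows (the same key repeated), each with
        # its own list, so flatten across all of them.
        return pd.DataFrame([item for sublist in group.values for item in sublist])

    out = df.groupby(key)[col].apply(flatten).reset_index()
    out = out.drop('level_1', axis=1)   # positional level added by apply
    out.columns = [key, col]
    return out


# Usage: the date '4/1/11' appears in two separate rows here.
df = pd.DataFrame({
    'date': ['4/1/11', '4/2/11', '4/1/11'],
    'ts': [
        [pd.Timestamp('2012-02-29'), pd.Timestamp('2012-03-31')],
        [pd.Timestamp('2014-01-31')],
        [pd.Timestamp('2012-04-25')],
    ],
})
result = split_list_column(df)
print(result)
```

Note that groupby sorts by key, so rows sharing a date end up next to each other even if they were apart in the input.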
