Split nested array values from a Pandas DataFrame cell over multiple rows

Problem description

I have the following Pandas DataFrame:

There is one row per ID per year (2008 - 2015). For the columns Max Temp, Min Temp, and Rain, each cell contains an array with one value per day of that year, i.e. for the frame above:

  • frame3.iloc[0]['Max Temp'][0] is the value for January 1st 2011
  • frame3.iloc[0]['Max Temp'][364] is the value for December 31st 2011.
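
To make the layout concrete, here is a minimal sketch of a frame with the same shape. The original frame3 is not reproduced on this page, so the values below are hypothetical placeholders (each cell simply holds its own day offsets), and only the Max Temp column is modelled.

```python
import pandas as pd

# Hypothetical stand-in for frame3: one row per ID per year, and each
# 'Max Temp' cell holds one value per day of that year. The values are
# placeholders (the day offset itself), not real temperatures.
frame3 = pd.DataFrame({
    'ID': [11111, 11111],
    'Year': [2011, 2012],
    'Max Temp': [[float(d) for d in range(365)],   # 2011: 365 days
                 [float(d) for d in range(366)]],  # 2012: leap year, 366 days
})

first_year = frame3.iloc[0]['Max Temp']
print(len(first_year))   # number of daily values stored for 2011
print(first_year[0])     # value for January 1st 2011
print(first_year[364])   # value for December 31st 2011
```

The position inside each list is the only place the day of the year is recorded, which is why the unnesting step has to keep it.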

I'm aware this is badly structured, but this is the data I have to deal with. It is stored in MongoDB in this way (where one of these rows equates to a document in Mongo).

I want to split these nested arrays so that, instead of one row per ID per year, I have one row per ID per day. While splitting the arrays, I would also like to create a new column that captures the day of the year, based on the current array index. I would then use this day, plus the Year column, to create a DatetimeIndex.

I searched here for relevant answers, but only found this one which doesn't really help me.

Recommended answer

You can run .apply(pd.Series) for each of your columns, then stack and concatenate the results.

For a Series

import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

s
Out[103]: 
2011       [0, 1]
2012    [2, 3, 4]
dtype: object

it works as follows:

s.apply(pd.Series).stack()
Out[104]: 
2011  0    0.0
      1    1.0
2012  0    2.0
      1    3.0
      2    4.0
dtype: float64

The elements of the series have different lengths (this matters because 2012 was a leap year). The intermediate result, i.e. the frame produced by .apply(pd.Series) before stack, contained a NaN value that stack then dropped, which is also why the values come out as float64.
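
The padding step can be inspected on its own. The sketch below builds the wide table with the DataFrame constructor, which is assumed here to give the same layout as s.apply(pd.Series); the names wide and stacked are illustrative. The explicit dropna() is an addition: depending on the pandas version, stack may or may not discard the NaN padding by itself, and dropna() is harmless either way.

```python
import pandas as pd

s = pd.Series([[0, 1], [2, 3, 4]], index=[2011, 2012])

# One column per list position; shorter lists are padded with NaN.
# Assumed to match the layout of s.apply(pd.Series).
wide = pd.DataFrame(s.tolist(), index=s.index)
print(wide)

# Move the list positions into a second index level. dropna() removes
# the padding explicitly, independent of the stack() default.
stacked = wide.stack().dropna()
print(stacked)
```

The second index level of stacked is the position inside the original list, which later serves as the day offset.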

Now, let's take a frame:

a = list(range(14))
b = list(range(20, 34))

df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

df
Out[108]: 
                  A                 B     ID  Year
0         [0, 1, 2]      [20, 21, 22]  11111  2011
1      [3, 4, 5, 6]  [23, 24, 25, 26]  11111  2012
2         [7, 8, 9]      [27, 28, 29]  11112  2011
3  [10, 11, 12, 13]  [30, 31, 32, 33]  11112  2012

Then we can run:

# set an index (each column will inherit it)
df2 = df.set_index(['ID', 'Year'])
# the trick
unnested_lst = []
for col in df2.columns:
    unnested_lst.append(df2[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df2.columns)

and obtain:

result
Out[115]: 
                 A     B
ID    Year              
11111 2011 0   0.0  20.0
           1   1.0  21.0
           2   2.0  22.0
      2012 0   3.0  23.0
           1   4.0  24.0
           2   5.0  25.0
           3   6.0  26.0
11112 2011 0   7.0  27.0
           1   8.0  28.0
           2   9.0  29.0
      2012 0  10.0  30.0
           1  11.0  31.0
           2  12.0  32.0
           3  13.0  33.0
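
As a possible alternative, and not part of the answer above, newer pandas versions (1.3 or later, which is an assumption about your environment) can unnest several list columns at once with DataFrame.explode. The sketch below rebuilds the same example frame; the names tidy and Day are illustrative. It relies on the lists in each row having the same length across the exploded columns.

```python
import pandas as pd

a = list(range(14))
b = list(range(20, 34))
df = pd.DataFrame({'ID': [11111, 11111, 11112, 11112],
                   'Year': [2011, 2012, 2011, 2012],
                   'A': [a[:3], a[3:7], a[7:10], a[10:14]],
                   'B': [b[:3], b[3:7], b[7:10], b[10:14]]})

# Multi-column explode: one output row per list element.
tidy = df.explode(['A', 'B'], ignore_index=True)

# Position within each (ID, Year) group is the zero-based day offset.
tidy['Day'] = tidy.groupby(['ID', 'Year']).cumcount()

# explode leaves the value columns as object dtype; make them numeric.
tidy = tidy.astype({'A': float, 'B': float})
print(tidy)
```

This yields a flat frame with ID, Year and Day as ordinary columns, which can be turned into an index afterwards if needed.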

The rest (building the datetime index) is more or less straightforward. For example:

# DatetimeIndex
years = pd.to_datetime(result.index.get_level_values(1).astype(str), format='%Y')
# TimedeltaIndex
days = pd.to_timedelta(result.index.get_level_values(2), unit='D')
# If the above line doesn't work (a bug in pandas), try this:
# days = result.index.get_level_values(2).astype('timedelta64[D]')

# the sum is again a DatetimeIndex
dates = years + days
dates.name = 'Date'

new_index = pd.MultiIndex.from_arrays([result.index.get_level_values(0), dates])

result.index = new_index

result
Out[130]: 
                     A     B
ID    Date                  
11111 2011-01-01   0.0  20.0
      2011-01-02   1.0  21.0
      2011-01-03   2.0  22.0
      2012-01-01   3.0  23.0
      2012-01-02   4.0  24.0
      2012-01-03   5.0  25.0
      2012-01-04   6.0  26.0
11112 2011-01-01   7.0  27.0
      2011-01-02   8.0  28.0
      2011-01-03   9.0  29.0
      2012-01-01  10.0  30.0
      2012-01-02  11.0  31.0
      2012-01-03  12.0  32.0
      2012-01-04  13.0  33.0
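
With real data the offsets run up to 364 (or 365 in a leap year), so it is worth checking that the same year-plus-offset arithmetic lands on the expected calendar days. The pairs below are hypothetical and chosen only to exercise the year boundaries and February 29th.

```python
import pandas as pd

# Hypothetical (year, zero-based day offset) pairs covering the last
# day of a common year, a leap day, and the last day of a leap year.
years = pd.Index([2011, 2011, 2012, 2012], name='Year')
offsets = pd.Index([0, 364, 59, 365], name='Day')

# January 1st of each year plus the day offset.
dates = (pd.to_datetime(years.astype(str), format='%Y')
         + pd.to_timedelta(offsets, unit='D'))
print(dates)
```

Because the offset is added to January 1st of the row's own year, leap years need no special handling as long as the 2012 arrays really contain 366 values.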
