为什么我的Pandas联接了联接数据的移位行? [英] Why does my Pandas join shift rows of the joined data?

查看:62
本文介绍了为什么我的Pandas联接了联接数据的移位行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Pandas中,当我join时,合并的数据相对于原始DataFrame未对齐:

In Pandas, when I join, the joined data is misaligned with respect to the original DataFrame:

import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt

flu_train = pd.read_csv('FluTrain.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_trend_1 = sm.ols('log(ILI) ~ Queries', flu_train).fit()

flu_test = pd.read_csv('FluTest.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(2)

flu_trend_2 = sm.ols('log(ILI) ~ Queries + log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(2)

# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test[:2].ILIShift2 = list(flu_train[-2:].ILI)

# This SHIFTS the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2']))

最后一条语句将连接的列上移"两行,从而使ILIPred2的前两个值丢失,而最后2个为"NaN".我希望加入的列与其他所有列对齐.

The final statement shifts the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'. I expected the joined column to align with all of the others.

为什么会这样?我该如何预防?

Why is this happening and how do I prevent it?

推荐答案

此联接的数据框(pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2']))的索引从0到49.

This joined dataframe (pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'])) has an index from 0 to 49.

您正在将其加入到flu_test,该索引的索引范围是0到51.

You're joining it to flu_test, which has an index of 0 to 51.

因此,那些索引不匹配(分别为50和51)的地方,就会得到我希望的NaN.

So where those indices don't match up (50, and 51), you get NaN as I would hope.

如果要强制连接的列位于主数据框的底部,则可以执行以下操作(注意使用ilocrow_shift变量):

If you want to force the joined column to sit at the bottom of your main dataframe, you can do the following (note the use of iloc and the row_shift variable):

import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt

row_shift = 2

flu_train = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv')
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_trend_1 = sm.ols('np.log(ILI) ~ Queries', flu_train).fit()

flu_test = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv')
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_test = flu_test.join(pd.DataFrame(np.exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(row_shift)

flu_trend_2 = sm.ols('np.log(ILI) ~ Queries + np.log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(row_shift)

# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test.iloc[:2].ILIShift2 = list(flu_train.iloc[-row_shift:].ILI)

joiner = pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'], index=flu_test.index[row_shift:])
flu_test.join(joiner)

哪个给我:

                       Week       ILI   Queries  Ystart  Mstart  Dstart  Yend  Mend  Dend  ILIPred1  ILIShift2  ILIPred2
0   2012-01-01 - 2012-01-07  1.766707  0.593625    2012       1       1  2012     1     7  3.520332        NaN       NaN
1   2012-01-08 - 2012-01-14  1.543401  0.499336    2012       1       8  2012     1    14  2.662689        NaN       NaN
2   2012-01-15 - 2012-01-21  1.647615  0.500664    2012       1      15  2012     1    21  2.673181   1.766707  2.140941
3   2012-01-22 - 2012-01-28  1.684297  0.479416    2012       1      22  2012     1    28  2.510160   1.543401  1.907817
4   2012-01-29 - 2012-02-04  1.863542  0.471448    2012       1      29  2012     2     4  2.451624   1.647615  1.971504
5   2012-02-05 - 2012-02-11  1.864079  0.503320    2012       2       5  2012     2    11  
...
46  2012-11-18 - 2012-11-24  2.304625  0.511288    2012      11      18  2012    11    24  2.758619   1.610915  2.042260
47  2012-11-25 - 2012-12-01  2.225997  0.609562    2012      11      25  2012    12     1  3.690445   1.733293  2.424141
48  2012-12-02 - 2012-12-08  2.978047  0.671979    2012      12       2  2012    12     8  4.439679   2.304625  3.160283
49  2012-12-09 - 2012-12-15  3.600230  0.705179    2012      12       9  2012    12    15  4.898351   2.225997  3.220680
50  2012-12-16 - 2012-12-22  4.547268  0.787517    2012      12      16  2012    12    22  6.250888   2.978047  4.322513
51  2012-12-23 - 2012-12-29  6.033614  0.805421    2012      12      23  2012    12    29  6.591252   3.600230  5.006438

这篇关于为什么我的Pandas联接了联接数据的移位行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆