使用 pandas 的从宽到长的数据集 [英] Wide to long dataset using pandas
问题描述
有很多标题相似的问题,但是我无法解决我的数据集所遇到的问题.
There are a lot of questions out there with similar titles but I'm unable to solve the issues that I'm having with my dataset.
数据集:
ID Country Type Region Gender IA01_Raw IA01_Class1 IA01_Class2 IA02_Raw IA02_Class1 IA02_Class2 QA_Include QA_Comments
SC1 France A Europe Male 4 8 1 J 4 1 yes N/A
SC2 France A Europe Female 2 7 2 Q 6 4 yes N/A
SC3 France B Europe Male 3 7 2 K 8 2 yes N/A
SC4 France A Europe Male 4 8 2 A 2 1 yes N/A
SC5 France B Europe Male 1 7 1 F 1 3 yes N/A
ID6 France A Europe Male 2 8 1 R 3 7 yes N/A
ID7 France B Europe Male 2 8 1 Q 4 6 yes N/A
UC8 France B Europe Male 4 8 2 P 4 2 yes N/A
必需的输出:
ID Country Type Region Gender IA Raw Class1 Class2 QA_Include QA_Comments
SC1 France A Europe Male 01 K 8 1 yes N/A
SC1 France A Europe Male 01 L 8 1 yes N/A
SC1 France A Europe Male 01 P 8 1 yes N/A
SC1 France A Europe Male 02 Q 8 1 yes N/A
SC1 France A Europe Male 02 R 8 1 yes N/A
SC1 France A Europe Male 02 T 8 1 yes N/A
SC1 France A Europe Male 03 G 8 1 yes N/A
SC1 France A Europe Male 03 R 8 1 yes N/A
SC1 France A Europe Male 03 G 8 1 yes N/A
SC1 France A Europe Male 04 K 8 1 yes N/A
SC1 France A Europe Male 04 A 8 1 yes N/A
SC1 France A Europe Male 04 P 8 1 yes N/A
SC1 France A Europe Male 05 R 8 1 yes N/A
....
在数据集"中,我有名为 IA [X] _NAME 的列,其中 X = 1..9 和 NAME = Raw,Class1 和 Class2 .
In the Dataset I've columns which are names as IA[X]_NAME where X = 1..9 and NAME = Raw, Class1 and Class2.
我要做的是只是转置这些列,以使它看起来像必需"输出中所示的表,即 IA 将显示 X 值,就像原始和类这样,它们将显示其透视值.
What I am trying to do is to just transpose these columns so that it looks like the table shown in Required output i.e. IA will show X value and just like this raw and classes will show their perspective values.
因此,为了实现它,我将列切片为:
So in order to achieve it I sliced the columns as:
idVars = list(excel_df_final.columns[0:40]) + list(excel_df_final.columns[472:527]) #These contain columns like ID, Country, Type etc
valueVars = excel_df_final.columns[41:472].tolist() #All the IA_ columns
我不知道此步骤是否必要,但这为我提供了完美的列切片,但是当我将其放入melt
时,它无法正常工作.我已经尝试了其他问题中几乎所有可用的方法.
I don't know if this step was necessary but this gave me the perfect slices of columns but when I put it in melt
it is not working properly. I have tried almost every method that is available in other questions.
pd.melt(excel_df_final, id_vars=idVars,value_vars=valueVars)
我也尝试过:
excel_df_final.set_index(idVars)[41:472].unstack()
但是没有用,这是长期的广泛实施,但也没有用:
but didn't work and here is Wide to long implementation which also didn't work:
pd.wide_to_long(excel_df_final, stubnames = ['IA', 'Raw', 'Class1', 'Class2'], i=idVars, j=valueVars)
我得到的错误很长一段时间是:
The error I got for wide to long is:
ValueError:操作数不能与形状一起广播(95,) (431,)
ValueError: operands could not be broadcast together with shapes (95,) (431,)
由于我的数据集实际有526列,因此我将它们分为两个列表,其中一个包含95个列名,这将是i
,其余431个是我需要在列表中显示的列行,如示例数据集中所示.
As my dataset has 526 columns in real, so that is why I've divided them into two lists one contains 95 column names which will be the i
and the rest 431 are the one that I need to show in the row as shown in the sample data set.
推荐答案
这将使您入门.本质是使用set_index
,将列转换为MultiIndex,然后转换为stack
.可能存在更好的解决方案,但是我会这样做,因为这是您输出的一个简单步骤.
This will get you started. The essence is using set_index
, column conversion to MultiIndex, then stack
. Better solutions exist, possibly, but I would do it this way because it is an easy step to your output.
# Set the index with columns that we don't want to "transpose"
df2 = df.set_index([
'ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include', 'QA_Comments'])
# Convert headers to MultiIndex -- this is so we can melt IA values
df2.columns = pd.MultiIndex.from_tuples(map(tuple, df2.columns.str.split('_')))
# Call stack to replicate data, then reset the index
out = df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)
out
ID Country Type Region Gender QA_Include QA_Comments IA Class1 Class2 Raw
0 SC1 France A Europe Male yes NaN IA01 8 1 4
1 SC1 France A Europe Male yes NaN IA02 4 1 J
2 SC2 France A Europe Female yes NaN IA01 7 2 2
3 SC2 France A Europe Female yes NaN IA02 6 4 Q
4 SC3 France B Europe Male yes NaN IA01 7 2 3
5 SC3 France B Europe Male yes NaN IA02 8 2 K
6 SC4 France A Europe Male yes NaN IA01 8 2 4
7 SC4 France A Europe Male yes NaN IA02 2 1 A
8 SC5 France B Europe Male yes NaN IA01 7 1 1
9 SC5 France B Europe Male yes NaN IA02 1 3 F
10 ID6 France A Europe Male yes NaN IA01 8 1 2
11 ID6 France A Europe Male yes NaN IA02 3 7 R
12 ID7 France B Europe Male yes NaN IA01 8 1 2
13 ID7 France B Europe Male yes NaN IA02 4 6 Q
14 UC8 France B Europe Male yes NaN IA01 8 2 4
15 UC8 France B Europe Male yes NaN IA02 4 2 P
这篇关于使用 pandas 的从宽到长的数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!