将字段重新格式化为列,将其他字段(与这些字段配对以在以前的结构中变为列)重新格式化为新列中的字段 [英] Reformatting fields as columns, other fields (paired with the fields to become columns in the previous structure) to become fields in the new columns

查看:55
本文介绍了将字段重新格式化为列,将其他字段(与这些字段配对以在以前的结构中变为列)重新格式化为新列中的字段的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我负责从慈善机构设计的移动应用程序中清除数据

Ive been tasked with cleaning data from a mobile application designed by a charity

在一个部分中,用户Q/A应用程序使用会话用一行表示.本节由重复的问题答案字段对组成,其中一个字段代表提出的问题,然后在其旁边的字段代表相应的答案.每个问题/字段和答案列对一起代表一个带有答案的唯一问题.

In one section, a users Q/A app use session is represented by a row. This section consists of repeated question answer field pairs, where a field represents the question asked and then the field next to it represents the corresponding answer. Together, each question/field and answer column pair represent a unique question with the answer.

开始数据

    answers.0.answer    answers.0.fieldName answers.1.answer    answers.1.fieldName
5   0                    avoidexercise             0.0            vomitflag
6   156                  height                    54.0         weight
7   1                    affectedkneeside           3.0       painlocationknee

有人要求我重新格式化该部分,以便每个问题组成一列,相应的答案在该列中为一个字段

I have been asked to reformat the section so each question forms a column, the corresponding answer a field in that column

理想的输出

_id                    avoidexercise    enjoyment   fatigue2weeks   height
        

5f27f29c362a380d3f9a9e46    1.0           yes            20.0       120.0
5f27f2ac362a380d3f9a9e4b    0.0           no             40.0       180.0
5f27f4d4362a380d3f9a9e52    1.0           yes            50.0       150.0

我的计划是从彼此Q/A对列创建许多数据透视表,然后进行合并(外部联接),然后进行内部联接以删除重复项

My plan is to create many pivot tables, from each other Q/A pairs of columns, then concaternate (outer join) then inner join to remove duplications

但是,原始数据框包含数字和对象数据类型的混合

However, the original dataframe contains a mixture of numeric and object datatypes

因此,似乎只有某些问题/答案列对正在转换为数据透视表.我尝试过使用各种聚合函数

Therefore, only some question/answer column pairs appear to be converting to pivot tables. I have tried using various aggregate functions

p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(x))
p2 = ur.pivot_table(index=['_id'],columns= ['answers.1.fieldName'],values=['answers.1.answer'],aggfunc=lambda x: ' '.join(x))
p3 = ur.pivot_table(index=['_id'],columns= ['answers.2.fieldName'],values=['answers.2.answer'],aggfunc=lambda x: ' '.join(x))
I have also tried another lambda function

p1 = ur.pivot_table(index=['_id'],columns= ['answers.0.fieldName'],values=['answers.0.answer'],aggfunc=lambda x: ' '.join(str(v) for v in x)
The furthest I have got so far is to run pivots with standard mean aggfunc

p1 = ur.pivot_table(index=['_id'],columns=['answers.0.fieldName'],values=['answers.0.answer'])
ps = [p1,p2,p3]
c = pd.concat(ps)

然后尝试删除合并的行和列

Then attempting to remove merge rows and columns

df = c.sum(axis=1, level=1, skipna=False)

g = df.groupby('_id').agg(np.sum)

这将返回形状正确的数据框

This returns a dataframe with the right shape

但是,它会丢失对象列中的值,而且我不确定所有数字列的准确性如何

However, it looses the values in the object columns, and I'm not sure how accurate all the numeric columns are

为克服此问题,我正在考虑将尽可能多的数据转换为数字

To overcome this problem, I was considering converting as much data as I can into numeric

c4 = c.apply(pd.to_numeric, errors='ignore').info()

然后将组合的数据透视表数据框拆分为数字和对象类型

Then splitting the combined pivot table dataframe into numeric and object type

nu = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
cndf = c4.select_dtypes(include=nu)
o = ['object', 'bool', 'datetime64', 'category']
codf = c4.select_dtypes(include=o)

并在数字数据帧上执行与上述相同的.sum和groupby操作

And running through the same .sum and groupby operations as above on the numeric dataframe

n1 = cndf.sum(axis=1, level=1, skipna=False)
n2 = n1.groupby('_id').agg(np.sum)

但是,这仍然留下了处理对象列的挑战

However, this still leaves the challenge of dealing with the object columns

推荐答案

下面是一种获得所需结果的方法:

Here's an approach to obtain your desired result:

首先定义您拥有的问题-答案对的数量

First define the number of question-answer pairs you have:

num_answers = 2 #Following your 'Starting data' in the question

然后使用以下几行代码来获取所需的数据框:

Then use the following couple of lines to obtain a dataframe as required:

import pandas as pd
df2 = pd.concat([pd.pivot_table(df1, index=['_id'], columns= ['answers.{}.fieldName'.format(i)], values=['answers.{}.answer'.format(i)]) for i in range(num_answers)], axis = 1).fillna('N/A')
df2.columns = [col[1] for col in df2.columns]

假定 df1 是您的数据帧,其中包含起始数据.

Here df1 is assumed to be your dataframe with the starting data.

您可能已经注意到,在特定ID没有记录该特定字段答案的单元格中出现"N/A".

As you might have noticed, 'N/A' is present in cells where the particular id has no recorded answer for that particular field.

假设三行的ID分别为[1,2,3],则起始数据"的输出 df2 如下所示:

Assuming an ID of [1,2,3] for the three rows respectively, the output df2 for your 'Starting data' would look like this:

      affectedkneeside  avoidexercise   height  painlocationknee    vomitflag   weight
_id                     
0          N/A                0           N/A         N/A              0         N/A
1          N/A               N/A          156         N/A              N/A       54
2           1                N/A          N/A          3               N/A       N/A

这篇关于将字段重新格式化为列,将其他字段(与这些字段配对以在以前的结构中变为列)重新格式化为新列中的字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆