Process multiple-answer questionnaire (from Google Forms) results with pandas


Question

I have a Google Form that I am using to collect survey data (for this question I'll be using an example form). It has questions that can have multiple answers, selected using a set of checkboxes.

When I get the data from the form and import it into pandas, I get this:

             Timestamp    What sweets do you like?
0  23/11/2013 13:22:30  Chocolate, Toffee, Popcorn
1  23/11/2013 13:22:34                   Chocolate
2  23/11/2013 13:22:39      Toffee, Popcorn, Fruit
3  23/11/2013 13:22:45               Fudge, Toffee
4  23/11/2013 13:22:48                     Popcorn

I'd like to do statistics on the results of the question (how many people liked Chocolate, what proportion of people liked Toffee, etc.). The problem is that all of the answers are within one column, so grouping by that column and asking for counts doesn't work.

Is there a simple way within pandas to convert this sort of data frame into one with multiple columns called Chocolate, Toffee, Popcorn, Fudge and Fruit, each of them boolean (1 for yes, 0 for no)? I can't think of a sensible way to do this, and I'm not sure whether it would really help (the aggregations I want to do might be harder that way).

Recommended answer

Read in as a fixed-width table, dropping the first column:

In [30]: df = pd.read_fwf(StringIO(data),widths=[3,20,27]).drop(['Unnamed: 0'],axis=1)

In [31]: df
Out[31]: 
             Timestamp What sweets do you like0
0  23/11/2013 13:22:34                Chocolate
1  23/11/2013 13:22:39   Toffee, Popcorn, Fruit
2  23/11/2013 13:22:45            Fudge, Toffee
3  23/11/2013 13:22:48                  Popcorn
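
The answer does not show how `data` was defined. As a minimal sketch, assuming the responses were instead exported as a CSV file, the same frame could be loaded with `pd.read_csv`. The inline string `csv_text` and the name `raw` are hypothetical stand-ins that reproduce the question's sample rows; quoting keeps the commas inside each answer intact.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for an exported CSV file; the rows mirror the
# sample data shown in the question.
csv_text = '''Timestamp,What sweets do you like?
23/11/2013 13:22:30,"Chocolate, Toffee, Popcorn"
23/11/2013 13:22:34,Chocolate
23/11/2013 13:22:39,"Toffee, Popcorn, Fruit"
23/11/2013 13:22:45,"Fudge, Toffee"
23/11/2013 13:22:48,Popcorn
'''

# Quoted fields are parsed as a single cell, commas included.
raw = pd.read_csv(StringIO(csv_text))
print(raw.shape)  # (5, 2)
```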

Make the timestamp into a proper datetime64 dtype (not necessary for this exercise, but almost always what you want):

In [32]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
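
The sample timestamps are written day first (23/11/2013). A small sketch, assuming every timestamp follows the `DD/MM/YYYY HH:MM:SS` layout, passes an explicit `format` so that an ambiguous date is not read month first. The second value below is a made-up example, not from the survey.

```python
import pandas as pd

# First value is from the sample data; the second is a hypothetical
# ambiguous date (3 April, not 4 March).
stamps = pd.Series(['23/11/2013 13:22:34', '03/04/2013 09:00:00'])

# An explicit format removes any guessing about day/month order.
parsed = pd.to_datetime(stamps, format='%d/%m/%Y %H:%M:%S')
print(parsed.dt.month.tolist())  # [11, 4]
```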

New column names:

In [33]: df.columns = ['date','sweets']

In [34]: df
Out[34]: 
                 date                  sweets
0 2013-11-23 13:22:34               Chocolate
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit
2 2013-11-23 13:22:45           Fudge, Toffee
3 2013-11-23 13:22:48                 Popcorn

In [35]: df.dtypes
Out[35]: 
date      datetime64[ns]
sweets            object
dtype: object

Split the sweets column from a string into a list:

In [37]: df['sweets'].str.split(r',\s*')
Out[37]: 
0                 [Chocolate]
1    [Toffee, Popcorn, Fruit]
2             [Fudge, Toffee]
3                   [Popcorn]
Name: sweets, dtype: object
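
If only the per-answer counts are needed, the split lists can be counted directly. This is a different route from the one the answer takes: a sketch assuming a pandas version that provides `Series.explode` (0.25 or later), with the `sweets` series rebuilt by hand from the four rows shown above.

```python
import pandas as pd

# The four rows of the 'sweets' column from the answer's frame.
sweets = pd.Series([
    'Chocolate',
    'Toffee, Popcorn, Fruit',
    'Fudge, Toffee',
    'Popcorn',
], name='sweets')

# One row per individual answer, then count how often each one occurs.
counts = sweets.str.split(r',\s*').explode().value_counts()
print(counts['Toffee'], counts['Popcorn'], counts['Chocolate'])  # 2 2 1
```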

The key step: this creates a dummy matrix marking where each value is present.

In [38]: df['sweets'].str.split(r',\s*').apply(lambda x: pd.Series(1, index=x))
Out[38]: 
   Chocolate  Fruit  Fudge  Popcorn  Toffee
0          1    NaN    NaN      NaN     NaN
1        NaN      1    NaN        1       1
2        NaN    NaN      1      NaN       1
3        NaN    NaN    NaN        1     NaN
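
pandas also ships `Series.str.get_dummies`, which builds a comparable 0/1 matrix in one call. This is an alternative to the `apply` approach above, not the answer's method, and the sketch assumes the separator is always exactly a comma followed by one space, because `sep` is matched literally rather than as a regular expression.

```python
import pandas as pd

# The four rows of the 'sweets' column from the answer's frame.
sweets = pd.Series([
    'Chocolate',
    'Toffee, Popcorn, Fruit',
    'Fudge, Toffee',
    'Popcorn',
], name='sweets')

# Each distinct answer becomes a column; cells are 1 where the row has it.
dummies = sweets.str.get_dummies(sep=', ')
print(dummies.columns.tolist())
```

Because absent answers come back as 0 rather than NaN, no `fillna` step is needed with this variant.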

Final result, where we fill the NaNs with 0, then use astype(bool) to get True/False, and concatenate the result to the original frame:

In [40]: pd.concat([df, df['sweets'].str.split(r',\s*').apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(bool)], axis=1)
Out[40]: 
                 date                  sweets Chocolate  Fruit  Fudge Popcorn Toffee
0 2013-11-23 13:22:34               Chocolate      True  False  False   False  False
1 2013-11-23 13:22:39  Toffee, Popcorn, Fruit     False   True  False    True   True
2 2013-11-23 13:22:45           Fudge, Toffee     False  False   True   False   True
3 2013-11-23 13:22:48                 Popcorn     False  False  False    True  False
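
With boolean columns in place, the statistics asked for in the question reduce to column sums and means. A minimal sketch, with the hypothetical frame `flags` rebuilt by hand from the True/False columns shown above:

```python
import pandas as pd

# Boolean columns copied from the final result above.
flags = pd.DataFrame({
    'Chocolate': [True, False, False, False],
    'Fruit': [False, True, False, False],
    'Fudge': [False, False, True, False],
    'Popcorn': [False, True, False, True],
    'Toffee': [False, True, True, False],
})

counts = flags.sum()   # how many respondents chose each sweet
shares = flags.mean()  # proportion of respondents who chose each sweet
print(counts['Toffee'], shares['Toffee'])  # 2 0.5
```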
