Pandas column of lists to separate columns

Views: 47 Published: 2021/7/16 20:09:39 Tags: python numpy encoding scikit-learn

Problem description

The incoming data is a column of lists, each holding zero or more categories:

import pandas as pd

# input data frame
df = pd.DataFrame({'categories': (list('ABC'), list('BC'), list('A'))})

  categories
0  [A, B, C]
1     [B, C]
2        [A]

I would like to convert this to a DataFrame with one column per category and a 0/1 in each cell:

#desired output

   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0

Attempt

OneHotEncoder combined with LabelEncoder gets stuck because neither handles lists in cells. I currently get the desired result with nested for loops:

import numpy as np

# get unique categories ['A', 'B', 'C']
categories = np.unique(np.concatenate(df['categories'].tolist()))

# make empty data frame
binary_df = pd.DataFrame(columns=categories, index=df.index)

print(binary_df)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN


# fill data frame
for i in binary_df.index:
    for c in categories:
        # .loc[i, c] writes into binary_df itself; chained .loc[i][c] may write to a copy
        binary_df.loc[i, c] = 1 if c in df.loc[i, 'categories'] else 0

My concern is that the nested loops make this an extremely inefficient way to handle a large data set (tens of categories, tens of thousands of rows or more).

Is there a way to achieve this result with built-in NumPy/scikit-learn functions?

Solution:

pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack()).groupby(level=0).sum()
Out[98]: 
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0
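
Since the question allows lists with zero categories, it may be worth checking how an empty list is treated. Depending on the pandas version, `stack()` can drop a row that contains only `None`, so that row may be missing from the result. The following is a minimal sketch of an alternative, not the answer's method: it assumes pandas 0.25 or later (for `Series.explode`), and the names `exploded` and `indicators` are illustrative.

```python
import pandas as pd

# Example input with an empty list in the middle row (an assumption for illustration).
df = pd.DataFrame({"categories": [list("ABC"), [], list("A")]})

# explode() gives one row per list element and repeats the original index;
# an empty list becomes a single NaN entry, so its index label is kept.
exploded = df["categories"].explode()

# get_dummies ignores NaN by default, so the empty row contributes only zeros.
# dtype=int requests 0/1 integers instead of booleans.
indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

print(indicators)
```

Grouping by the index level and summing merges the repeated index labels back into one row per original row.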

How it works:

pd.DataFrame(df['categories'].tolist())
Out[100]: 
   0     1     2
0  A     B     C
1  B     C  None
2  A  None  None

This turns the Series of lists into a DataFrame, padding shorter lists with None.

pd.DataFrame(df['categories'].tolist()).stack()
Out[101]: 
0  0    A
   1    B
   2    C
1  0    B
   1    C
2  0    A
dtype: object

This prepares the data for get_dummies while preserving the original row index for later.

pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack(), dtype=int)
Out[102]: 
     A  B  C
0 0  1  0  0
  1  0  1  0
  2  0  0  1
1 0  0  1  0
  1  0  0  1
2 0  1  0  0

This is almost there, but the inner index level still carries unwanted information: the position of each value within its original list.

So the solution above groups by the outer index level and sums, which collapses that inner level of the MultiIndex.
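
The steps above can be sketched as one script with named intermediates. This is an illustrative restatement rather than the answer's exact code: the variable names (`wide`, `stacked`, `dummies`, `result`) are made up here, `dtype=int` is added so the indicators are 0/1 integers on pandas versions that default to booleans, and the final step uses `groupby(level=0).sum()`.

```python
import pandas as pd

df = pd.DataFrame({"categories": [list("ABC"), list("BC"), list("A")]})

# Step 1: one column per list position; shorter lists are padded with None.
wide = pd.DataFrame(df["categories"].tolist(), index=df.index)

# Step 2: move the list positions into an inner index level,
# giving a Series indexed by (original row, position in list).
stacked = wide.stack()

# Step 3: one indicator column per distinct category.
dummies = pd.get_dummies(stacked, dtype=int)

# Step 4: sum within each original row, discarding the list position.
result = dummies.groupby(level=0).sum()

print(result)
```

Passing `index=df.index` in step 1 is a small design choice: it keeps the original row labels even when the input frame does not use the default 0..n-1 index.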

%timeit results:

On the original data frame

df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})

Solution in the question: 100 loops, best of 3: 3.24 ms per loop

This solution: 100 loops, best of 3: 2.29 ms per loop

On 300 rows

df = pd.concat(100*[df]).reset_index(drop=True)

Solution in the question: 1 loop, best of 3: 252 ms per loop

This solution: 100 loops, best of 3: 2.45 ms per loop
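
To repeat this comparison outside IPython, something like the following could be used. It is a rough sketch using the standard-library `timeit` module instead of the `%timeit` magic; the function names `loop_version` and `vectorised_version` are hypothetical, and the timings it prints will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def loop_version(df):
    """Nested-loop approach from the question."""
    categories = np.unique(np.concatenate(df["categories"].tolist()))
    out = pd.DataFrame(0, columns=categories, index=df.index)
    for i in out.index:
        for c in categories:
            out.loc[i, c] = 1 if c in df.loc[i, "categories"] else 0
    return out


def vectorised_version(df):
    """stack + get_dummies approach from the answer."""
    wide = pd.DataFrame(df["categories"].tolist(), index=df.index)
    return pd.get_dummies(wide.stack(), dtype=int).groupby(level=0).sum()


small = pd.DataFrame({"categories": [list("ABC"), list("BC"), list("A")]})
big = pd.concat(100 * [small]).reset_index(drop=True)  # 300 rows

# Check that both approaches agree before timing them.
same = loop_version(big).values.tolist() == vectorised_version(big).values.tolist()
print("identical output:", same)

# Average over a few calls; increase `number` for steadier figures.
loop_seconds = timeit.timeit(lambda: loop_version(big), number=3) / 3
vec_seconds = timeit.timeit(lambda: vectorised_version(big), number=3) / 3
print(f"loop:       {loop_seconds * 1e3:.2f} ms per call")
print(f"vectorised: {vec_seconds * 1e3:.2f} ms per call")
```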
