Update Pandas Cells based on Column Values and Other Columns


Problem description

I am looking to update many columns based on the values in one column. This is easy with a loop, but it takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?

Desired output:

   Things         count_A     count_B    count_C     count_D
['A','B','C']         1            1         1          0
['A','A','A']         3            0         0          0
['B','A']             1            1         0          0
['D','D']             0            0         0          2

Recommended answer

The most elegant is definitely the CountVectorizer from sklearn.

I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.

Let's create some data:

raw = ['ABC', 'AAA', 'BA', 'DD']

things = [list(s) for s in raw]

Then import the packages and initialize the count vectorizer. The identity tokenizer treats each list as already tokenized, and lowercase=False keeps the letters as they are:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

Next we generate a matrix of counts:

matrix = cv.fit_transform(things)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
names = ["count_"+n for n in cv.get_feature_names_out()]
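To make it clearer what the count matrix holds, here is a minimal pure-Python sketch of the same computation using collections.Counter. It is only an illustration of the idea, not how CountVectorizer is implemented; the names vocab and rows are mine.

```python
from collections import Counter

raw = ['ABC', 'AAA', 'BA', 'DD']
things = [list(s) for s in raw]

# Vocabulary: every distinct token, sorted (CountVectorizer also sorts its vocabulary)
vocab = sorted({tok for doc in things for tok in doc})

# One row per document, one column per vocabulary entry
rows = []
for doc in things:
    counts = Counter(doc)                      # missing tokens count as 0
    rows.append([counts[tok] for tok in vocab])

names = ["count_" + tok for tok in vocab]
print(names)
for label, row in zip(raw, rows):
    print(label, row)
```

For many rows the vectorizer is likely the better choice, since it builds a sparse matrix instead of Python lists.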

And save it as a data frame:

df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)

This yields a data frame like this:

     count_A  count_B  count_C  count_D
ABC        1        1        1        0
AAA        3        0        0        0
BA         1        1        0        0
DD         0        0        0        2

Elegant version:

Everything above in one line:

df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)

Timing:

You mentioned that you're working with a rather large dataset, so I used the %%timeit magic to give a time estimate.

Previous response by @piRSquared (which otherwise looks very good!):

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

100 loops, best of 3: 3.27 ms per loop

My answer:

pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names_out()], index=raw)

1000 loops, best of 3: 1.08 ms per loop

According to my testing on this small example, CountVectorizer is about 3x faster (1.08 ms vs. 3.27 ms per loop).
