Find value counts within a pandas dataframe of strings
Question
I want to get the frequency count of the strings within each column. On one hand, this is similar to collapsing a dataframe to a set of rows that only reflects the strings in the column. I was able to solve this with a loop, but I know there is a better solution.
Example df:
    2017-08-09  2017-08-10
id
0          pre         pre
2   active_1-3    active_1
3     active_1    active_1
4   active_3-7  active_3-7
5     active_1    active_1
Desired output:
            2017-08-09  2017-08-10
pre                  1           1
active_1             2           3
active_1-3           1           0
active_3-7           1           1
I searched a lot of forums but couldn't find a good answer.
I'm assuming a pivot_table approach is the right one, but I couldn't find the right arguments to collapse a table that doesn't have an obvious index for the output df.
I was able to get this to work by iterating over each column, calling value_counts(), and appending each value-count series to a new dataframe, but I know there is a better solution.
output_df = pd.DataFrame()
for col in date_cols:
    # Count the strings in this column and add the counts as a new column
    new_values = df[col].value_counts()
    output_df = pd.concat([output_df, new_values], axis=1)
Thanks!
Recommended answer
You can use value_counts with pd.Series (thanks to Jon for the improvement), i.e.:
ndf = df.apply(pd.Series.value_counts).fillna(0)
            2017-08-09  2017-08-10
active_1             2         3.0
active_1-3           1         0.0
active_3-7           1         1.0
pre                  1         1.0
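For reference, here is a minimal, self-contained sketch of the one-liner above, run on the example data from the question. The frame construction is reconstructed from the question's printout, and the final `astype(int)` is an optional addition (not part of the original answer) that turns the float counts left by `fillna(0)` back into integers.

```python
import pandas as pd

# Sample data reconstructed from the question's example df.
df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    },
    index=pd.Index([0, 2, 3, 4, 5], name="id"),
)

# value_counts is applied to each column separately; a string that never
# occurs in a column comes back as NaN, so fill with 0 and cast to int.
ndf = df.apply(pd.Series.value_counts).fillna(0).astype(int)
print(ndf)
```

Each column is counted independently and the per-column results are aligned on the union of the strings, which is why a string missing from one column needs the `fillna(0)`.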
Timings:
import collections

k = pd.concat([df]*1000)

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
pd.get_dummies(k.T).groupby(by=lambda x: x.split('_', 1)[1], axis=1).sum().T
1 loop, best of 3: 5.68 s per loop

# @cᴏʟᴅsᴘᴇᴇᴅ's method
%%timeit
k.stack().str.get_dummies().sum(level=1).T
10 loops, best of 3: 84.1 ms per loop

# My method
%%timeit
k.apply(pd.Series.value_counts).fillna(0)
100 loops, best of 3: 7.57 ms per loop

# FabienP's method
%%timeit
k.unstack().groupby(level=0).value_counts().unstack().T.fillna(0)
100 loops, best of 3: 7.35 ms per loop

# @Wen's method (fastest for now)
%%timeit
pd.concat([pd.Series(collections.Counter(k[x])) for x in df.columns], axis=1)
100 loops, best of 3: 4 ms per loop
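The timings above were taken on an older pandas release. To my knowledge, the `level=` argument of `sum` was removed in pandas 2.0, so the stack-based variant needs `groupby(level=1).sum()` there. The sketch below is not from the original answer: it reruns the Counter and stack variants on the question's small example frame (reconstructed here), and adds a melt plus `pd.crosstab` variant as one way to get the pivot-style result the question was reaching for. The variable names are illustrative.

```python
import collections

import pandas as pd

# Sample data reconstructed from the question's example df.
df = pd.DataFrame(
    {
        "2017-08-09": ["pre", "active_1-3", "active_1", "active_3-7", "active_1"],
        "2017-08-10": ["pre", "active_1", "active_1", "active_3-7", "active_1"],
    },
    index=pd.Index([0, 2, 3, 4, 5], name="id"),
)

# Counter-based variant: one Counter per column, aligned side by side.
counter_counts = (
    pd.concat(
        [pd.Series(collections.Counter(df[col]), name=col) for col in df.columns],
        axis=1,
    )
    .fillna(0)
    .astype(int)
)

# Stack-based variant, with groupby(level=1).sum() in place of sum(level=1).
stacked_counts = df.stack().str.get_dummies().groupby(level=1).sum().T

# Pivot-style variant: reshape to long form, then cross-tabulate.
long_df = df.melt(var_name="date", value_name="status")
crosstab_counts = pd.crosstab(long_df["status"], long_df["date"])

print(crosstab_counts)
```

All three produce the same counts as `df.apply(pd.Series.value_counts)`; the crosstab variant already returns integers with zeros filled in, so it needs no `fillna`.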