pandas 计数列表列中包含的列表的出现 [英] Pandas counting occurrence of list contained in column of lists

查看:113
本文介绍了 pandas 计数列表列中包含的列表的出现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这个Pandas DataFrame,其中有一个带有列表的列:

I have this Pandas DataFrame that has a column with lists:

>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]

我想计算在DataFrame的列表中包含列表v = [2, 3]的次数.因此,在此示例中,正确答案为3.现在这只是一个例子,在我的实际数据中,df['m']可以包含超过900万行,而列表实际上是最多包含约20个元素的字符串列表.如果需要的话,还有更多详细信息:v的元素不包含重复项,m的列表也不包含任何内容,因此可以设置它们而不是列表.

I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example, in my actual data the df['m'] can contain more than 9 million rows and the lists are actually lists of strings with up to about 20 elements. Some more details if it matters: The elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.

我的程序的第一次迭代遍历每行并检查all(e in data['m'][i] for e in v),如果是True,则增加一个计数器.但是,正如许多SO问题和博客文章所述,在DataFrame的行上进行迭代很慢,并且可以更快地完成.

The first iteration of my program iterated over each row and checked all(e in data['m'][i] for e in v) and if that's True, I increment a counter. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and can be done much faster.

因此,在下一次迭代中,我在DataFrame中添加了一个列,其中包含列表v的副本:

So for my next iteration I added a column to the DataFrame that contains a copy of the list v:

>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]

和一个帮助程序函数,它像我之前所做的那样简单地返回包含布尔值:

and a helper function that simply returns the containment boolean like I did before:

def all_helper(l1, l2):
    return all(v in l1 for v in l2)

然后我可以将其与np.vectorize一起使用,以添加具有布尔值的列:

which I can then use with np.vectorize to add a column with the boolean value:

df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])

最后,用一个简单的df['bool'].sum()

我也尝试使用.apply():

df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()

但这比矢量化要慢.

现在这些方法都可以使用,矢量化比初始方法要快得多,但是感觉有点笨拙(使用辅助函数以这种方式创建具有相同值的列).所以我的问题是,性能是关键,是否有更好/更快的方法来计算一个列表包含在一个列表列中的次数?由于列表中没有重复项,也许检查是否len(union(df['m'], df['V'])) == len(df['m'])之类的东西,但是我不知道如何以及是否是最佳解决方案.

Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column with identical values, using a helper function in such a way). So my question, performance is key, is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps the check if len(union(df['m'], df['V'])) == len(df['m']) or something, but I don't know how and if that's the best solution.

由于有人问;这是一个使用字符串而不是整数的示例:

Since somebody asked; here's an example with strings instead of integers:

>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]

>>> count_occurrence(df, v)
3

但是,如果您想要更广泛的DataFrame,可以使用以下命令生成它:

But if you want a more extensive DataFrame, you can generate it with this:

import string

n = 10000
df = pd.DataFrame({'m': [list(set([''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4))) for _ in range(np.random.randint(1, 10))])) for _ in range(n)]})
v = ["abc", 'cde']
print(count_occurrence(df, v))

Divakar的解决方案或Vaishali的解决方案都没有比使用np.vectorize的解决方案快.想知道是否有人可以击败它.

Neither Divakar's or Vaishali's solution was faster than the one that uses np.vectorize. Wonder if anyone can beat it.

Jon Clements随附的解决方案速度提高了约30%,而且更加清洁:df.m.apply(set(v).issubset).sum().我一直在寻找更快的实现方式,但这是朝着正确方向迈出的一步.

Jon Clements came with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I continue looking for faster implementations, but this is a step in the right direction.

推荐答案

您可以使用DataFrame.apply以及内置的set.issubset方法,然后使用.sum(),它们均在比Python更低的级别(通常为C级别)上运行等价物.

You can utilise DataFrame.apply along with the builtin set.issubset method and then .sum() which all operate at a lower level (normally C level) than Python equivalents do.

subset_wanted = {2, 3}
count = df.m.apply(subset_wanted.issubset).sum()

与编写自定义C级函数(这等同于自定义sum并带有检查是否有子集来确定0/)相比,我省下了更多的时间逐行1个.到那时,无论如何,您可能已经运行了成千上万次.

I can't see shaving more time off that than writing a custom C-level function which'd be the equivalent of a custom sum with a check there's a subset to determine 0/1 on a row by row basis. At which point, you could have run this thousands upon thousands of times anyway.

这篇关于 pandas 计数列表列中包含的列表的出现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆