pandas 计数列表列中包含的列表的出现 [英] Pandas counting occurrence of list contained in column of lists

查看：113 发布时间：2020/5/24 0:45:23 python pandas vectorization

本文介绍了 pandas 计数列表列中包含的列表的出现的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有这个Pandas DataFrame，其中有一个带有列表的列:

I have this Pandas DataFrame that has a column with lists:

>>> df = pd.DataFrame({'m': [[1,2,3], [5,3,2], [2,5], [3,8,1], [9], [2,6,3]]})
>>> df
           m
0  [1, 2, 3]
1  [5, 3, 2]
2     [2, 5]
3  [3, 8, 1]
4        [9]
5  [2, 6, 3]

我想计算在DataFrame的列表中包含列表v = [2, 3]的次数.因此，在此示例中，正确答案为3.现在这只是一个例子，在我的实际数据中，df['m']可以包含超过900万行，而列表实际上是最多包含约20个元素的字符串列表.如果需要的话，还有更多详细信息:v的元素不包含重复项，m的列表也不包含任何内容，因此可以设置它们而不是列表.

I want to count the number of times a list v = [2, 3] is contained in the lists of the DataFrame. So in this example the correct answer would be 3. Now this is just an example, in my actual data the df['m'] can contain more than 9 million rows and the lists are actually lists of strings with up to about 20 elements. Some more details if it matters: The elements of v contain no duplicates and neither do the lists of m, so they can be sets instead of lists.

我的程序的第一次迭代遍历每行并检查all(e in data['m'][i] for e in v)，如果是True，则增加一个计数器.但是，正如许多SO问题和博客文章所述，在DataFrame的行上进行迭代很慢，并且可以更快地完成.

The first iteration of my program iterated over each row and checked all(e in data['m'][i] for e in v) and if that's True, I increment a counter. But as addressed in many SO questions and blog posts, iterating over the rows of a DataFrame is slow and can be done much faster.

因此，在下一次迭代中，我在DataFrame中添加了一个列，其中包含列表v的副本:

So for my next iteration I added a column to the DataFrame that contains a copy of the list v:

>>> df['V'] = [[2, 3]] * len(df)
>>> df
        V          m
0  [2, 3]  [1, 2, 3]
1  [2, 3]  [5, 3, 2]
2  [2, 3]     [2, 5]
3  [2, 3]  [3, 8, 1]
4  [2, 3]        [9]
5  [2, 3]  [2, 6, 3]

和一个帮助程序函数，它像我之前所做的那样简单地返回包含布尔值:

and a helper function that simply returns the containment boolean like I did before:

def all_helper(l1, l2):
    return all(v in l1 for v in l2)

然后我可以将其与np.vectorize一起使用，以添加具有布尔值的列:

which I can then use with np.vectorize to add a column with the boolean value:

df['bool'] = np.vectorize(all_helper)(df['m'], df['V'])

最后，用一个简单的df['bool'].sum()

我也尝试使用.apply():

df['bool'] = df.apply(lambda row: all(w in row['m'] for w in v), axis=1)
count = df['bool'].sum()

但这比矢量化要慢.

现在这些方法都可以使用，矢量化比初始方法要快得多，但是感觉有点笨拙(使用辅助函数以这种方式创建具有相同值的列).所以我的问题是，性能是关键，是否有更好/更快的方法来计算一个列表包含在一个列表列中的次数?由于列表中没有重复项，也许检查是否len(union(df['m'], df['V'])) == len(df['m'])之类的东西，但是我不知道如何以及是否是最佳解决方案.

Now these methods work, and the vectorisation is much faster than the initial approach, but it feels a bit clunky (creating a column with identical values, using a helper function in such a way). So my question, performance is key, is there a better/faster way to count the number of times a list is contained in a column of lists? Since the lists contain no duplicates, perhaps the check if len(union(df['m'], df['V'])) == len(df['m']) or something, but I don't know how and if that's the best solution.

由于有人问；这是一个使用字符串而不是整数的示例:

Since somebody asked; here's an example with strings instead of integers:

>>> df = pd.DataFrame({'m': [["aa","ab","ac"], ["aa","ac","ad"], ["ba","bb"], ["ac","ca","cc"], ["aa"], ["ac","da","aa"]]})
>>> v = ["aa", "ac"]
>>> df
                    m
0  ["aa", "ab", "ac"]
1  ["aa", "ac", "ad"]
2        ["ba", "bb"]
3  ["ac", "ca", "cc"]
4              ["aa"]
5  ["ac", "da", "aa"]

>>> count_occurrence(df, v)
3

但是，如果您想要更广泛的DataFrame，可以使用以下命令生成它:

But if you want a more extensive DataFrame, you can generate it with this:

import string

n = 10000
df = pd.DataFrame({'m': [list(set([''.join(np.random.choice(list(string.ascii_lowercase)[:5], np.random.randint(3, 4))) for _ in range(np.random.randint(1, 10))])) for _ in range(n)]})
v = ["abc", 'cde']
print(count_occurrence(df, v))

Divakar的解决方案或Vaishali的解决方案都没有比使用np.vectorize的解决方案快.想知道是否有人可以击败它.

Neither Divakar's or Vaishali's solution was faster than the one that uses np.vectorize. Wonder if anyone can beat it.

Jon Clements随附的解决方案速度提高了约30％，而且更加清洁:df.m.apply(set(v).issubset).sum().我一直在寻找更快的实现方式，但这是朝着正确方向迈出的一步.

Jon Clements came with a solution that is roughly 30% faster and much cleaner: df.m.apply(set(v).issubset).sum(). I continue looking for faster implementations, but this is a step in the right direction.

pandas 计数列表列中包含的列表的出现 [英] Pandas counting occurrence of list contained in column of lists

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

pandas 计数列表列中包含的列表的出现 [英] Pandas counting occurrence of list contained in column of lists

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭