比较两个数据框列以匹配字符串或者是子字符串然后在 pandas 中计数 [英] Compare two dataframe columns for matching strings or are substrings then count in pandas

查看:51
本文介绍了比较两个数据框列以匹配字符串或者是子字符串然后在 pandas 中计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框(A 和 B).我想比较 A 中的字符串并找到匹配项或包含在 B 中的另一个字符串中.然后计算 A 匹配或包含在 B 中的次数.

I have two dataframes (A and B). I want to compare strings in A and find a match or is contained in another string in B. Then count the amount of times A was matched or contained in B.

    Dataframe A
 
0   "4012, 4065, 4682"
1   "4712, 2339, 5652, 10007"
2   "4618, 8987"
3   "7447, 4615, 4012"
4   "6515"
5   "4065, 2339, 4012"


    Dataframe B

0   "6515, 4012, 4618, 8987"         <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1   "4065, 5116, 2339, 8757, 4012"   <- matches (DF A, Index 5) (4065, 2339, 4012)
2   "1101"
3   "6515"                           <- matches (DF A, Index 4) (6515)
4   "4012, 4615, 7447"               <- matches (DF A, Index 3) (7447, 4615, 4012)
5   "7447, 6515, 4012, 4615"         <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)


    Desired Output:

    Itemset             Count

2   4618, 8987            1
3   7447, 4165, 4012      2
4   6515                  3
5   4065, 2339, 4012      1

基本上,我想计算 A 与 B 的直接匹配(按顺序或不匹配),或者 A 是否部分包含在 B 中(按顺序或不匹配).我的目标是计算 A 被 B 验证了多少次.顺便说一下,这些都是字符串.

Basically, I want to count when there is a direct match of A in B (either in order or not) or if A is partially contained in B (in order or not). My goal is to count how many times A is being validated by B. These are all strings by the way.

推荐答案

我正在根据我们在评论中的讨论重写此答案.

您可以使用列表理解来提供相同的效果,而不是使用 apply;下面为每一行创建一个包含所需计算的列表

Rather than use apply, you can use a list comprehension to provide the same effect; the following creates a list with the desired calculation for each row

[sum(all(val in cell for val in row) for cell in dfB['values_list']) for row in dfA['values_list']]

虽然我最初发现这比 apply 函数更难解析(并且更难编写),但在速度上有巨大的优势.这是您的数据,最后两行将条目拆分为列表:

While I originally found this significantly harder to parse than an apply function (and much harder to write), there is a tremendous advantage in speed. Here is your data, with the final two lines to split entries into lists:

import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012",],
                    columns=['values'])

dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                    columns=['values'])

dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

这是使用粗糙列表组件的速度测试:

Here is a speed test using the gnarly list comp:

In[0]
%%timeit -n 1000
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                    for cell in dfB['values_list']) 
                    for row in dfA['values_list']]

Out[0]
186 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

这里使用 apply 函数是一样的,类似于 MrNobody33 的答案和我的原始(派生)答案中使用的函数.请注意,此函数已经使用了一些推导式,大概将事情移到 for 循环会使事情变慢:

And here is the same using an apply function, similar to that used in MrNobody33 's answer, and in my original (derivative) answer. Note that this function already uses some comprehensions, and presumably moving things to for loops would make things slower:

def check_overlap(row):
    return sum(all(val in cell for val in row['values_list']) for cell in dfB['values_list'])

In[1]:
%%timeit -n 1000
dfA['overlap_B'] = dfA.apply(check_overlap, axis=1)

Out[1]:
1.4 ms ± 61.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

如果不使用 apply,速度大约快 7 倍!请注意,结果输出是相同的:

That's about 7x faster when not using apply! Note that the resulting output is the same:

                    values                values_list  overlap_A  overlap_B
0         4012, 4065, 4682         [4012, 4065, 4682]          0          0
1  4712, 2339, 5652, 10007  [4712, 2339, 5652, 10007]          0          0
2               4618, 8987               [4618, 8987]          1          1
3         7447, 4615, 4012         [7447, 4615, 4012]          2          2
4                     6515                     [6515]          3          3
5         4065, 2339, 4012         [4065, 2339, 4012]          1          1

这篇关于比较两个数据框列以匹配字符串或者是子字符串然后在 pandas 中计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆