Sequences' Similarity Matching in Pandas

Question

I tried searching for an answer on SO but didn't find anything helpful.

Here is what I'm trying to do:
I have a dataframe (here is a small example of it):

import pandas as pd

df = pd.DataFrame(
    [[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'],
     [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'],
     [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'],
     [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'],
     [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'],
     [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])

My goal is to find, for each PartnerId, the other PartnerIds whose Name is similar to its own, meaning the similarity ratio exceeds a given threshold.
Additionally, I only want to compare PartnerIds that have the same CountryId. The matching PartnerIds should be appended to a list and finally written to a new column in the dataframe.

Here is my attempt:

from difflib import SequenceMatcher

# Look-up table: PartnerId -> {'CountryId': ..., 'Name': ...}
itemDict = {item[0]: {'CountryId': item[1], 'Name': item[2]} for item in df.values}

def similar(a, b):
    # Similarity ratio between 0 and 1
    return SequenceMatcher(None, a, b).ratio()

def calculate_similarity(x, itemDict):
    own_name = x['Name']
    country_id = x['CountryId']
    matching_ids = []
    for k, v in itemDict.items():
        # Skip the row itself and partners from other countries
        if k != x['PartnerId'] and v['CountryId'] == country_id:
            ratio = similar(own_name, v['Name'])
            if ratio > 0.7:
                matching_ids.append(k)
    return matching_ids

df['Similar_IDs'] = df.apply(lambda x: calculate_similarity(x, itemDict), axis=1)
print(df)

The output is:

    PartnerId  CountryId                          Name Similar_IDs
0           1          5            AADDEEEEIILMNORRTU          []
1           2          5                AACEEEEGMMNNTT          []
2           3          5  AAACCCCEFHIILMNNOPRRRSSTTUUY        [15]
3           4          5               DEEEGINOOPRRSTY        [10]
4           5          5          AACCDEEHHIIKMNNNNTTW          []
5           6          5              ACEEHHIKMMNSSTUV          []
6           7          5                ACELMNOOPPRRTU          []
7           8          5                           BIT          []
8           9          5                           APR          []
9          10          5              CDEEEGHILLLNOOST         [4]
10         11          5                        ACCMNO          []
11         12          5                           AIK          []
12         13          5              CCHHLLOORSSSTTUZ          []
13         14          5                       ANNOSXY          []
14         15          5  AABBCEEEEHIILMNNOPRRRSSTUUVY         [3]

My questions now are:
1.) Is there a more efficient way to compute it? I have about 20,000 rows now and will have a lot more in the near future.
2.) Is it possible to get rid of the itemDict and do it directly from the dataframe?
3.) Would another distance measure be better to use?

Thanks a lot for your help!

Recommended answer

You can use the module difflib. First, you need to build the Cartesian product of all strings that share a CountryId by joining the table to itself with an outer join, and then drop the rows that pair a partner with itself:

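cols = ['Name', 'CountryId', 'PartnerId']
# Self-join on CountryId: the overlapping columns get the suffixes _x and _y
df = df[cols].merge(df[cols], on='CountryId', how='outer')
# Drop the pairs that compare a partner with itself
df = df.query('PartnerId_x != PartnerId_y')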
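
To make the shape of the joined table concrete, here is a minimal sketch on toy data. The frame `toy` and its three rows are made up for illustration and are not part of the question's data. It shows that the self-join only pairs rows within the same CountryId:

```python
import pandas as pd

# Toy data (an assumption for illustration): two partners in country 5, one in country 7.
toy = pd.DataFrame({'PartnerId': [1, 2, 3],
                    'CountryId': [5, 5, 7],
                    'Name': ['ABC', 'ABD', 'XYZ']})

cols = ['Name', 'CountryId', 'PartnerId']
pairs = toy[cols].merge(toy[cols], on='CountryId', how='outer')
n_all = len(pairs)  # 2*2 rows for country 5 plus 1*1 row for country 7

# Remove the self-pairs; only (1, 2) and (2, 1) are left.
pairs = pairs.query('PartnerId_x != PartnerId_y')
print(pairs[['PartnerId_x', 'PartnerId_y']])
```

The number of joined rows grows with the square of the size of each country group, which is worth keeping in mind for the roughly 20,000 rows mentioned in the question.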

In the next step you can apply the function from this answer to each pair and keep only the pairs whose ratio exceeds the threshold:

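from difflib import SequenceMatcher

def match(x):
    # Similarity ratio of the two names in one merged row
    return SequenceMatcher(None, x['Name_x'], x['Name_y']).ratio()

is_match = df.apply(match, axis=1) > 0.7
df.loc[is_match, ['PartnerId_x', 'Name_x', 'PartnerId_y']]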

Output:

     PartnerId_x                        Name_x  PartnerId_y
44             3  AAACCCCEFHIILMNNOPRRRSSTTUUY           15
54             4               DEEEGINOOPRRSTY           10
138           10              CDEEEGHILLLNOOST            4
212           15  AABBCEEEEHIILMNNOPRRRSSTUUVY            3
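
The question asked for the matches as a list in a new column of the original dataframe. One possible way to get there from the pair table is sketched below. It uses five rows from the question's sample so that it runs on its own; the names `pairs`, `ratios`, `matches` and `similar` are chosen here for illustration and do not come from the answer.

```python
from difflib import SequenceMatcher
import pandas as pd

# Five rows taken from the question's sample data.
df = pd.DataFrame(
    [[3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [8, 5, 'BIT'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])

# Same self-join as above, but kept in a separate variable so df stays intact.
cols = ['Name', 'CountryId', 'PartnerId']
pairs = df[cols].merge(df[cols], on='CountryId', how='outer')
pairs = pairs[pairs['PartnerId_x'] != pairs['PartnerId_y']]

ratios = pairs.apply(
    lambda row: SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio(),
    axis=1)
matches = pairs[ratios > 0.7]

# One list of matching ids per partner; partners without a match get [].
similar = matches.groupby('PartnerId_x')['PartnerId_y'].agg(list)
df['Similar_IDs'] = df['PartnerId'].map(similar).apply(
    lambda ids: ids if isinstance(ids, list) else [])
print(df[['PartnerId', 'Similar_IDs']])
```

Partners that never appear in `matches` come back from `map` as missing values, which is why the last step replaces anything that is not a list with an empty list.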

If you don't have enough memory, you can try to iterate over the rows of the data frame instead:

lst = []
for idx, row in df.iterrows():
    if SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio() > 0.7:
        lst.append(row[['PartnerId_x', 'Name_x', 'PartnerId_y']])

pd.concat(lst, axis=1).T
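
If even the joined table is too large to hold in memory, a different approach is to skip the merge and walk over the pairs of each country group directly. The following is only a sketch under stated assumptions, not the answer's method: the helper `similar_ids` is hypothetical, and it compares each unordered pair once, which assumes the ratio is the same in both directions (the difflib documentation cautions that `ratio()` may depend on the order of the arguments). It also uses `quick_ratio()`, which difflib documents as an upper bound on `ratio()`, as a cheap pre-filter.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import pandas as pd

def similar_ids(frame, threshold=0.7):
    """Hypothetical helper: map each PartnerId to the ids with a similar Name."""
    result = defaultdict(list)
    for _, group in frame.groupby('CountryId'):
        records = zip(group['PartnerId'], group['Name'])
        # Each unordered pair is compared once (assumes a symmetric ratio).
        for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
            matcher = SequenceMatcher(None, name_a, name_b)
            # quick_ratio() is an upper bound on ratio(), so it rules pairs out cheaply.
            if matcher.quick_ratio() > threshold and matcher.ratio() > threshold:
                result[id_a].append(id_b)
                result[id_b].append(id_a)
    return result

# Five rows taken from the question's sample data.
df = pd.DataFrame(
    [[3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [8, 5, 'BIT'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])

found = similar_ids(df)
df['Similar_IDs'] = df['PartnerId'].map(lambda pid: found.get(pid, []))
print(df[['PartnerId', 'Similar_IDs']])
```

This still performs a quadratic number of comparisons per country, but it only ever holds one pair at a time plus the resulting id lists.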
