NLP classification labels have many similarities; replace them so only one remains


Question

I have been trying to use the fuzzywuzzy library in Python to find the percentage similarity between strings in my labels. The problem is that many strings are still very similar even after I try a find and replace.

I am wondering whether anyone here has used a method to clean up labels like this. For example, I have these labels that look nearly identical:

 'Cable replaced',
 'Cable replaced.',
 'Camera is up and recording',
 'Chat closed due to inactivity.',
 'Closing as duplicate',
 'Closing as duplicate.',
 'Closing duplicate ticket.',
 'Closing ticket.',

Ideally, I want to find and replace by a common string so that we have, say, only one instance of 'closing as duplicate'. Any thoughts or suggestions are greatly appreciated.
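Several of the near-duplicates listed above differ only by case or a trailing period, so a normalization pass before any fuzzy matching may already remove them. A minimal sketch, assuming that lowercasing and dropping trailing punctuation is acceptable for these labels; `normalize_label` is a hypothetical helper, not part of fuzzywuzzy:

```python
import re


def normalize_label(label):
    """Lowercase, collapse whitespace, and drop trailing punctuation (hypothetical helper)."""
    collapsed = re.sub(r"\s+", " ", label).strip().lower()
    # Remove trailing punctuation, then any space that was left in front of it.
    return collapsed.rstrip(".!?,;:").rstrip()


labels = ["Cable replaced", "Cable replaced.", "Closing as duplicate", "Closing as duplicate."]

# A set keeps one copy of each normalized label.
unique_labels = sorted({normalize_label(label) for label in labels})
print(unique_labels)  # ['cable replaced', 'closing as duplicate']
```

Labels that differ in wording, such as 'Closing duplicate ticket.', survive this step and still need a similarity-based approach.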

To provide a more thorough example, here is what I am trying to do:

import fuzzywuzzy
from fuzzywuzzy import fuzz, process

# h is the DataFrame that holds the data; 'resolution' is the label column
res = h['resolution'].unique()
res.sort()
res

'All APs are up and stable hence resoling TT  Logs are updated in WL',
'Asset returned to IT hub closing ticket.',
'Auto Resolved - No reply from requester', 'Cable replaced',
'Cable replaced.', 'Camera is up and recording',
'Chat closed due to inactivity.', 'Closing as duplicate',
'Closing as duplicate.', 'Closing duplicate ticket.',
'Closing ticket.', 'Completed', 'Connection to IDF restored',

Oh, look at that. Let's see if we can find strings like 'cable replaced'.

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100),
 ('cable replaced.', 100),
 ('replaced cable', 100),
 ('replaced scanner cable', 78),
 ('replaced scanner cable.', 78),
 ('scanner cable replaced', 78),
 ('battery replaced', 73),
 ('replaced', 73),
 ('replaced battery', 73),
 ('replaced battery.', 73)]
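The scores above make sense once you consider what a token-sort comparison does: both strings are cleaned, their words are sorted, and only then are they compared, so word order and a trailing period stop mattering. The sketch below illustrates that idea with the standard library's `difflib` as a stand-in; `token_sort_ratio_sketch` is a hypothetical name, and its scores are not guaranteed to match fuzzywuzzy's in every case:

```python
import re
from difflib import SequenceMatcher


def token_sort_ratio_sketch(first, second):
    """Approximate a token-sort comparison: clean, sort the words, then compare."""
    def clean_and_sort(text):
        # Lowercase and turn anything that is not a letter or digit into a space.
        cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
        # Sort the words so that word order no longer matters.
        return " ".join(sorted(cleaned.split()))

    matcher = SequenceMatcher(None, clean_and_sort(first), clean_and_sort(second))
    return round(100 * matcher.ratio())


# Same words in a different order, plus a trailing period: identical after sorting.
print(token_sort_ratio_sketch("cable replaced", "replaced cable."))

# Only one shared word: a noticeably lower score.
print(token_sort_ratio_sketch("cable replaced", "battery replaced"))
```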

Hmm, perhaps I should create a function to replace strings that have a similarity score of at least, say, 90.

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # keep only the matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

# use the function we just wrote to replace close matches to "cable replaced" with "cable replaced"
replace_matches_in_column(df=h, column='resolution', string_to_match="cable replaced")

# get all the unique values in the 'resolution' column
res = h['resolution'].unique()

# sort them alphabetically and then take a closer look
res.sort()
res

'auto resolved - no reply from requester', 'battery replaced',
       'cable replaced', 'camera is up and recording',
       'chat closed due to inactivity.', 'check ok',

Great! Now I only have one instance of 'cable replaced'. Let's verify that:

# get the top 10 closest matches to "cable replaced"
matches = fuzzywuzzy.process.extract("cable replaced", res, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('cable replaced', 100),
 ('replaced scanner cable', 78),
 ('replaced scanner cable.', 78),
 ('scanner cable replaced', 78),
 ('battery replaced', 73),
 ('replaced', 73),
 ('replaced battery', 73),
 ('replaced battery.', 73),
 ('replaced.', 73),
 ('hardware replaced', 71)]

Yep! Looking good. This example works well, but as you can see it is rather manual. Ideally, I would like to automate this for all the strings in my resolution column. Any ideas?

Recommended answer

Using the function in this link, you can find a mapping as follows:

from fuzzywuzzy import fuzz


def replace_similars(input_list):
    # Replace, in place, strings that are 90% or more similar to an earlier string
    for i in range(len(input_list)):
        for j in range(i + 1, len(input_list)):
            if fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]


def generate_mapping(input_list):
    new_list = list(input_list)  # real copy; slicing a NumPy array would only give a view
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Let's see how to use it:

# Let's assume items in labels are unique.
# If they are not unique, it will work anyway but will be slower.
labels = [
    "Cable replaced",
    "Cable replaced.",
    "Camera is up and recording",
    "Chat closed due to inactivity.",
    "Closing as duplicate",
    "Closing as duplicate.",
    "Closing duplicate ticket.",
    "Closing ticket.",
    "Completed",
    "Connection to IDF restored",
]

mapping = generate_mapping(labels)


# Print to see mapping
print("\n".join(["{:<50}: {}".format(k, v) for k, v in mapping.items()]))

Output:

Cable replaced                                    : Cable replaced
Cable replaced.                                   : Cable replaced
Camera is up and recording                        : Camera is up and recording
Chat closed due to inactivity.                    : Chat closed due to inactivity.
Closing as duplicate                              : Closing as duplicate
Closing as duplicate.                             : Closing as duplicate
Closing duplicate ticket.                         : Closing duplicate ticket.
Closing ticket.                                   : Closing ticket.
Completed                                         : Completed
Connection to IDF restored                        : Connection to IDF restored

So, you can find a mapping for h['resolution'].unique(), then update the h['resolution'] column using this mapping. Since I don't have your dataframe, I can't try it. Based on this, I guess you can use the following:

for k, v in mapping.items():
    if k != v:
        h.loc[h['resolution'] == k, 'resolution'] = v
