Edit distance between two pandas columns
Question
I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the edit distance between the two columns.
from nltk.metrics import edit_distance
df['edit'] = edit_distance(df['column1'], df['column2'])
For some reason this seems to go into some sort of infinite loop: it remains unresponsive for quite some time, and then I have to terminate it manually.
Any suggestions are welcome.
Recommended answer
nltk's edit_distance function compares a single pair of strings. To compute the edit distance between corresponding pairs of strings, apply it separately to each row's strings, like this:
results = df.apply(lambda x: edit_distance(x["column1"], x["column2"]), axis=1)
Or like this (probably a little more efficient), which avoids passing the irrelevant columns of the dataframe to the function:
results = df.loc[:, ["column1", "column2"]].apply(lambda x: edit_distance(*x), axis=1)
To add the results to your dataframe, assign them to a new column like this:
df["distance"] = df.loc[:, ["column1","column2"]].apply(lambda x: edit_distance(*x), axis=1)
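To illustrate the difference between comparing two whole columns and comparing them row by row, here is a minimal sketch that needs neither nltk nor pandas. The levenshtein function is a hypothetical stand-in written for this example, not nltk's implementation, and the plain lists column1 and column2 stand in for the DataFrame columns. It assumes the usual dynamic-programming algorithm, whose cost grows with the product of the two input lengths. If nltk's edit_distance treats two long columns as two sequences in the same way, that would likely explain why the original call appeared to hang.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete and substitute cost 1 each."""
    # previous[j] holds the distance between the first i-1 items of a and the first j items of b
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical sample data standing in for df["column1"] and df["column2"]
column1 = ["kitten", "flaw", "same"]
column2 = ["sitting", "lawn", "same"]

# Passing whole columns treats every string as a single symbol and
# yields one number for the pair of columns, not one number per row.
whole = levenshtein(column1, column2)

# Pairing the columns up row by row yields one distance per row.
per_row = [levenshtein(a, b) for a, b in zip(column1, column2)]

print(whole)
print(per_row)
```

With a real DataFrame, the same row-by-row pairing can be written as a list comprehension over zip(df["column1"], df["column2"]) that calls edit_distance, and the resulting list can be assigned to df["distance"].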