Comparing two columns of a CSV and outputting the string similarity ratio to another CSV
Question
I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the strings in the two columns, row by row. Then I want to write those ratios to another file.
The csv may look like this:
Column 1|Column 2
tomato|tomatoe
potato|potatao
apple|appel
I want the output file to show, for each row, how similar the string in Column 1 is to the string in Column 2. I am using difflib to compute the ratio score.
This is the code I have so far:
import csv
import difflib
f = open('test.csv')
csf_f = csv.reader(f)
row_a = []
row_b = []
for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])
a = row_a
b = row_b
def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()
match_ratio = similar(a, b)
match_list = []
for row in match_ratio:
    match_list.append(row)
with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)
f.close()
I get this error:
Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable
I suspect I am not reading the columns into lists correctly, or not passing them to SequenceMatcher correctly.
Recommended answer
Here is another way to get this done, using pandas:
Suppose your CSV data looks like this:
Column 1,Column 2
tomato,tomatoe
potato,potatao
apple,appel
Code
import pandas as pd
import difflib as diff
# Read the CSV
df = pd.read_csv('datac.csv')
# Create a new column 'diff' holding the similarity ratio of the first two columns.
# .iloc selects by position; plain x[0] on a label-indexed row is deprecated in recent pandas.
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
# Save the DataFrame to CSV; it could also be saved in other formats such as Excel or HTML
df.to_csv('outdata.csv', index=False)
Result
Column 1,Column 2 ,diff
tomato,tomatoe ,0.923076923077
potato,potatao ,0.923076923077
apple,appel ,0.8
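For comparison, the same per-row calculation can be sketched with only the standard library, which also shows where the question's TypeError comes from. The first positional argument of difflib.SequenceMatcher is isjunk, so SequenceMatcher(a, b) does not compare a with b, and ratio() returns a single float rather than a list, which is why looping over it fails. The sketch below assumes Python 3 and comma-separated input with a header row; the names similarity and add_ratios are hypothetical helpers, and in-memory buffers stand in for test.csv and output.csv so it runs as-is.

```python
import csv
import difflib
import io


def similarity(a, b):
    # SequenceMatcher's first positional argument is `isjunk`,
    # so the two strings go in the second and third positions.
    return difflib.SequenceMatcher(None, a.strip(), b.strip()).ratio()


def add_ratios(src, dst):
    """Copy rows from src to dst, appending one similarity ratio per row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)  # assumes the first row is a header
    writer.writerow(header + ["diff"])
    for row in reader:
        # Compare the two strings of this row, not the two whole columns.
        writer.writerow(row + [similarity(row[0], row[1])])


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "Column 1,Column 2\ntomato,tomatoe\npotato,potatao\napple,appel\n"
)
target = io.StringIO()
add_ratios(source, target)
print(target.getvalue())
```

With real files, the buffers would be replaced by open('test.csv', newline='') and open('output.csv', 'w', newline=''). In Python 3 the csv module expects text mode, so the "wb" mode used in the question would need to change as well.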