Comparing the first columns in two csv files using python and printing matches
Problem description
I have two csv files, each of which contains ngrams that look like this:
drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8
It's a three word phrase followed by a frequency number followed by a relative frequency number.
I want to write a script that finds the ngrams that are in both csv files, divides their relative frequencies, and prints them to a new csv file. I want it to find a match whenever the three word phrase matches a three word phrase in the other file and then divide the relative frequency of the phrase in the first csv file by the relative frequency of that same phrase in the second csv file. Then I want to print the phrase and the division of the two relative frequencies to a new csv file.
Below is as far as I've gotten. My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently. Please forgive me; I'm new to coding. Any help you can give me to get a little closer would be such a big help.
import csv
import io

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)

with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)
Recommended answer
This version does not dump res to a new file (that part is tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Using a dict instead of a set does the matching and the mapping together.
import csv
import io

alist, blist = [], []

# Python 2: the csv module expects files opened in binary mode
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)

with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

# key: the phrase; value: [frequency, relative frequency]
f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        # relative frequency in the first file divided by that in the second
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
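The answer stops short of writing res to a file. A minimal sketch of that last step, written for Python 3 rather than the Python 2 used above, might look like the following. The helper names load_relative_freqs and write_ratios are hypothetical, and skipping rows with fewer than three columns or a zero relative frequency in the second file is an assumption, not something the question specifies.

```python
import csv


def load_relative_freqs(path):
    """Map each phrase (first column) to its relative frequency (third column)."""
    freqs = {}
    # Python 3: open csv files in text mode with newline=""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 3:
                continue  # assumption: silently skip blank or malformed rows
            freqs[row[0]] = float(row[2])
    return freqs


def write_ratios(path_a, path_b, out_path):
    """Write "phrase, freq_a / freq_b" for every phrase found in both files.

    Returns the number of rows written.
    """
    freqs_a = load_relative_freqs(path_a)
    freqs_b = load_relative_freqs(path_b)
    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for phrase, freq_a in freqs_a.items():
            freq_b = freqs_b.get(phrase)
            if not freq_b:
                # phrase absent from the second file, or a zero frequency
                # that would make the division fail
                continue
            writer.writerow([phrase, freq_a / freq_b])
            written += 1
    return written
```

Calling write_ratios("ngrams.csv", "ngramstest.csv", "matchedngrams.csv") would produce one row per shared phrase. Note that if a phrase appears more than once in a file, the dict keeps only its last row.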