Comparing the first columns in two CSV files using Python and printing matches

Question

I have two CSV files, each of which contains n-grams that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

Each line is a three-word phrase, followed by a frequency count, followed by a relative frequency.

I want to write a script that finds the n-grams that appear in both CSV files, divides their relative frequencies, and writes the results to a new CSV file. A match should occur whenever a three-word phrase in one file is the same as a three-word phrase in the other file. For each match, the relative frequency of the phrase in the first CSV file should be divided by the relative frequency of that same phrase in the second CSV file. Then I want to write the phrase and the quotient of the two relative frequencies to a new CSV file.

Below is as far as I've gotten. My script compares lines, but it only finds a match when the entire line (including the frequency and relative frequency) matches exactly. I realize that this is because I'm taking the intersection of two sets of whole rows, but I have no idea how to do it differently. Please forgive me; I'm new to coding. Any help that gets me a little closer would be greatly appreciated.

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)


Recommended answer

This version does not write res to a new file (that part is tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Use a dict instead of a set, so that matching and mapping happen together.
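A small illustration of why the set intersection in the question only matches whole rows, while keying on the phrase compares the first column alone. This is a sketch assuming Python 3; the first row reuses a sample line from the question, and the numbers in the second row are invented for the example.

```python
# Two rows with the same phrase but different counts.
# (The values in row_b are invented for illustration.)
row_a = ("and since that", "6", "4.306458032651349480660899195E-8")
row_b = ("and since that", "2", "1.0E-8")

# Sets of whole rows: the tuples differ, so the intersection is empty.
whole_row_matches = {row_a} & {row_b}

# Dicts keyed on the phrase: only the first column is compared.
a_by_phrase = {row_a[0]: row_a[1:]}
b_by_phrase = {row_b[0]: row_b[1:]}
phrase_matches = a_by_phrase.keys() & b_by_phrase.keys()

print(len(whole_row_matches))  # 0
print(phrase_matches)          # {'and since that'}
```

The dict lookup also keeps the frequencies attached to each phrase, which is what the division step needs.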

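import csv

alist, blist = [], []

# Python 2 style: the csv module wants files opened in binary mode ("rb").
# On Python 3, open them with open(path, newline="") instead.
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

# Key each row on its phrase; the value is [frequency, relative frequency].
f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

# For phrases present in both files, divide the relative frequencies
# (index 1 of the value list): first file over second file.
res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)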
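The answer stops short of writing the result to a file. One possible way to finish that step is sketched below, assuming Python 3 and the three-column layout shown in the question. The helper names load_relative_freqs and write_ratios are hypothetical and not part of the original answer. Skipping zero denominators is also an assumption about the desired behavior.

```python
import csv


def load_relative_freqs(path):
    """Map each phrase (column 1) to its relative frequency (column 3).

    If a phrase appears more than once, the last occurrence wins.
    """
    with open(path, newline="") as f:
        return {row[0]: float(row[2]) for row in csv.reader(f) if row}


def write_ratios(path_a, path_b, out_path):
    """Write 'phrase, freq_a / freq_b' for phrases present in both files."""
    freqs_a = load_relative_freqs(path_a)
    freqs_b = load_relative_freqs(path_b)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for phrase, freq_a in freqs_a.items():
            freq_b = freqs_b.get(phrase)
            # Skip phrases missing from the second file, and zero
            # frequencies that would cause a division by zero.
            if freq_b:
                writer.writerow([phrase, freq_a / freq_b])
```

With the file names from the question, the call would be write_ratios("ngrams.csv", "ngramstest.csv", "matchedngrams.csv"). Opening the output with "w" rather than "a" means each run replaces the previous results instead of appending to them.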
