How to merge two CSV files by value in a column using pandas (Python)


Problem description

I have two CSV files, price and performance.

Here is the data layout of each:

Price:

Performance:

I import them into Python using:

import pandas as pd

price = pd.read_csv("cpu.csv")
performance = pd.read_csv("geekbench.csv")

This works as intended, but I am unsure how to create a new CSV file from the rows where Price[brand + model] matches Performance[name].

I want to take:

  • cores, tdp and price from Price
  • score, multicore_score and name from Performance

I want to create a new CSV file from the fields above. The problem I've been having is finding a good way to match that ignores minor differences such as capitalization. I looked into algorithms such as fuzzy string matching, but I'm not sure what the best option is.

This is my current attempt, which throws errors:

for i in range(len(price.index)):
    brand = (price.iloc[i, 0])
    model = (price.iloc[i, 1])
    print(model)
    print(performance)
    print(performance.query('name == brand+model'))

Thanks

Recommended answer

I suggest the following:

import nltk
import pandas as pd
tokenizer = nltk.RegexpTokenizer(r'\w+')
price = pd.DataFrame({"brand": ["AMD", "AMD", "AMD", "AMD"],
                      "model" : ["2650", "3800", "5150", "4200"],
                      "cores" : [2,4,4,4],
                      "tdp" : [25,25,25,25]})
performance = pd.DataFrame({"name": ["AMD Athlon 64 3200+",
                                     "AMD Athlon 64 X2 3800+",
                                     "AMD Athlon 64 X2 4000+",
                                     "AMD Athlon 64 X2 4200+"],
                            "score" : [6,5,6,18]})
# I break down the name in performance and suppress capital letters
performance["tokens"] = (performance["name"].str.lower()
                         .apply(tokenizer.tokenize))
# And the same for price
price["tokens"] = price.loc[:,"brand"].values + " " + \
                   price.loc[:,"model"].values
price["tokens"] = (price["tokens"].str.lower()
                         .apply(tokenizer.tokenize))
# cartesian product

price["key"] = 1
performance["key"] = 1
df = pd.merge(price,performance, on = "key")
# define my criteria for match
n_match = 2

df['intersection'] =\
    [len(list(set(a).intersection(set(b))))
     for a, b in zip(df.tokens_x,
                     df.tokens_y)]
df = df.loc[df["intersection"]>=n_match,:]

I redefined your datasets so that this example would produce some matches. Here is the result:

   brand model  cores  ...  score                     tokens_y  intersection
5    AMD  3800      4  ...      5  [amd, athlon, 64, x2, 3800]             2
15   AMD  4200      4  ...     18  [amd, athlon, 64, x2, 4200]             2
[2 rows x 10 columns]

You can change the n_match threshold. I set it to two because that seemed to be what this dataset requires. Hope it helps.
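To get from the matched rows to the new CSV the question asks for, something like the following may work. It is only a sketch under several assumptions: it swaps nltk for pandas' own str.findall with the same \w+ pattern, it uses how="cross" (available in pandas 1.2 and later) instead of the dummy key column, and the price and multicore_score values are made up for illustration because the original data is not shown.

```python
import pandas as pd

# Illustrative data; the "price" and "multicore_score" values are made up.
price = pd.DataFrame({
    "brand": ["AMD", "AMD", "AMD", "AMD"],
    "model": ["2650", "3800", "5150", "4200"],
    "cores": [2, 4, 4, 4],
    "tdp": [25, 25, 25, 25],
    "price": [30.0, 45.0, 50.0, 60.0],
})
performance = pd.DataFrame({
    "name": ["AMD Athlon 64 3200+", "AMD Athlon 64 X2 3800+",
             "AMD Athlon 64 X2 4000+", "AMD Athlon 64 X2 4200+"],
    "score": [6, 5, 6, 18],
    "multicore_score": [6, 9, 11, 30],
})


def tokenize(series):
    """Lower-case each string and split it into a set of word tokens."""
    return series.str.lower().str.findall(r"\w+").apply(set)


price["tokens"] = tokenize(price["brand"] + " " + price["model"])
performance["tokens"] = tokenize(performance["name"])

# Cartesian product of the two tables (requires pandas >= 1.2).
pairs = price.merge(performance, how="cross", suffixes=("_price", "_perf"))

# Keep the pairs that share at least n_match tokens.
n_match = 2
pairs["overlap"] = [len(a & b)
                    for a, b in zip(pairs["tokens_price"], pairs["tokens_perf"])]
wanted = ["name", "cores", "tdp", "price", "score", "multicore_score"]
matched = pairs.loc[pairs["overlap"] >= n_match, wanted]

# With no path, to_csv returns the CSV text; pass a file name
# (for example a hypothetical "merged.csv") to write a file instead.
csv_text = matched.to_csv(index=False)
print(csv_text)
```

The cross join grows as the product of the two row counts, so for large files it may be worth narrowing the candidates first, for example by joining on the lower-cased brand before comparing tokens.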
