What is the most efficient way with Python to merge rows in a CSV which have a single duplicate field?


Problem description


I have found somewhat similar questions, but the answers that I think could work are too complex for me to adapt to what I need. I could use some help figuring out how to accomplish the following in Python:

I have a CSV file which contains three columns of data. The first column contains duplicate values (that is, values repeated in other rows), and I need to combine those rows into a single row along with specific data from columns two and three. The result should be another CSV.

In addition, for each set of rows that share the same column one value, there are a number of situations for the data in columns two and three which need to be combined. In other words, for the first instance of a column one value, if the value in column two is not empty, grab it and place it in column two of a "final" row; otherwise, if column two is empty, grab the value in column three and place it in column three of the "final" row. The rule I need to implement is: the first and last instances of a column one value need to be combined using whatever column two and column three data exists, while keeping column two data in column two and column three data in column three. Also, there are never three full values in any given row of the source CSV.
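As a minimal sketch of the rule just described, assuming the rows for one column one value have already been collected in file order: the helper below (the name `merge_group` is hypothetical, not something from the original post) looks only at the first and last row and keeps whichever of the two has a non-empty value in each column. If both happened to be non-empty, it keeps the first row's value; that tie-break is an assumption, since the rule above does not say what should happen in that case.

```python
def merge_group(rows):
    """Merge the rows sharing one column-one value into a single row.

    Only the first and last rows are consulted. For columns two and
    three, the first non-empty value among (first row, last row) is kept.
    """
    first, last = rows[0], rows[-1]
    merged = [first[0]]
    for col in (1, 2):
        # Assumption: prefer the first row's value when both are non-empty.
        merged.append(first[col] or last[col])
    return merged


example = [
    ["wp.xyz03.def02.01195.1", "wp03.xyz03-c01_lc08_m00", ""],
    ["wp.xyz03.def02.01195.1", "wp02.xyz03", ""],
    ["wp.xyz03.def02.01195.1", "", "wp01.def02"],
    ["wp.xyz03.def02.01195.1", "", "wp02.def02-c02_lc14_m00"],
]
print(merge_group(example))
```

The middle rows are ignored on purpose, which matches the "first and last instance" wording of the rule.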

To better explain, here is the data as it appears in the source CSV. These are examples of sets of rows that need to be combined:

Example 1: Here I have four rows with matching column one data. As with all the examples, I need the result to be one row containing the column one value followed by the values found in the first and last instances of that column one value.

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00

So the desired result for this group would be:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00

Example 2: Here I have three rows with matching column one data. Again, I need the result to be one row containing the column one value followed by the values found in the first and last instances of that column one value.

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01

So the desired result for this group would be:

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01

Example 3: Here I have three rows with matching column one data. Again, I need the result to be one row containing the column one value followed by the values found in the first and last instances of that column one value. Note that in this example the first row contains no value in column two; the desired value is in column three instead.

tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

So the desired result for this group would be:

tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

Putting it all together:

This:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01
tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

Needs to turn into this:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

I've tried a number of things to accomplish this, but I cannot achieve the desired result without quickly getting into very unfamiliar territory.

This is my original attempt. It cuts off some of the necessary values, because it writes the row out as soon as it has three values and never checks whether there might be another:

import csv

reader = csv.reader(open('parse_lur_luraz_clean_temp.csv', 'r'), delimiter=',')
final = ['-','-','-']
parselur = ['-']
lur_a = ""
lur_z = ""
for row in reader:
    if row[0] != parselur[0]:
        final = ['-','-','-']
        if row[1] != '': lur_a = row[1]
        if row[2] != '': lur_z = row[2]
        parselur[0] = row[0]
    elif row[0] == parselur[0]:
        if row[1] == '':
            lur_a = row[1]
        elif row[1] != '':
            lur_a = row[1]
        if row[2] == '':
            lur_z = row[2]
        elif row[2] != '':
            lur_z = row[2]
        parselur[0] = row[0]
    if parselur[0] != '' and parselur[0] not in final: final[0] = parselur[0]
    if lur_a != '': 
        if final[1] == '-' or '_lc' not in final[1]: final[1] = lur_a
        lur_a = ''
    if lur_z != '': 
        if final[2] == '-' or '_lc' not in final[2]: final[2] = lur_z
        lur_z = ''
    if len(final) == 3 and '-' not in final:
        fd = open('final_alu_nsn_temp.csv','a')
        writer = csv.writer(fd)
        writer.writerow((final))
        fd.close()
        final = ['-','-','-']
    else:
        parselur[0] = row[0]

Solution

Now's as good a time as any to learn about itertools.groupby:

import csv
from itertools import groupby

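# assuming Python 2 (on Python 3, open the files in text mode with newline="" instead of "rb"/"wb")
with open("source.csv", "rb") as fp_in, open("final.csv", "wb") as fp_out:
    reader = csv.reader(fp_in)
    writer = csv.writer(fp_out)
    # groupby only groups adjacent rows, so rows sharing a column-one value must be contiguous
    grouped = groupby(reader, lambda x: x[0])
    for key, group in grouped:
        rows = list(group)
        # keep only the first and last row of each group
        rows = [rows[0], rows[-1]]
        # pair up the two column-two values and the two column-three values
        columns = zip(*(r[1:] for r in rows))
        # max() picks the non-empty string, because '' sorts before any other string
        use_values = [max(c) for c in columns]
        new_row = [key] + use_values
        writer.writerow(new_row)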

produces

$ cat final.csv 
wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1
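
For readers on Python 3, the same idea might look roughly like the sketch below. It is an adaptation, not part of the original answer: the function name `merge_csv` and its path parameters are hypothetical, blank lines are skipped as an added precaution, and the files are opened in text mode with `newline=""` as the `csv` module documentation recommends. Like the code above, it relies on rows with the same column one value being adjacent in the input.

```python
import csv
from itertools import groupby
from operator import itemgetter


def merge_csv(src_path, dst_path):
    """Collapse consecutive rows sharing column one into a single row."""
    with open(src_path, newline="") as fp_in, \
         open(dst_path, "w", newline="") as fp_out:
        # Skip blank lines, which csv.reader yields as empty lists.
        rows = (row for row in csv.reader(fp_in) if row)
        writer = csv.writer(fp_out)
        # groupby only groups adjacent rows, so duplicate keys must
        # already sit next to each other, as in the sample data.
        for key, group in groupby(rows, key=itemgetter(0)):
            group_rows = list(group)
            first, last = group_rows[0], group_rows[-1]
            # max() picks the non-empty string of each pair, because ""
            # sorts before any non-empty string.
            merged = [max(pair) for pair in zip(first[1:], last[1:])]
            writer.writerow([key] + merged)
```

Calling `merge_csv("source.csv", "final.csv")` on the sample input should give the same three rows shown above. If the duplicate keys were not contiguous, the input would need to be sorted or grouped in a dictionary first, which this sketch does not attempt.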
