What is the most efficient way with Python to merge rows in a CSV which have a single duplicate field?

Views: 127. Posted: 2017/7/21 1:44:08. Tags: python, csv, merge, duplicates


Problem description

I have found somewhat similar questions; however, the answers that I think could work are too complex for me to morph into what I need. I could use some help figuring out how to accomplish the following in Python:

I have a CSV file which contains three columns of data. In the first column I have duplicate values (that is, values repeated in other rows), which I need to combine into a single row along with specific data from columns two and three. The result should be another CSV.

In addition, for each set of rows that have duplicate column one data, there are a number of situations for the data in columns two and three which need to be combined. In other words, for any first instance of a column one value, if the value in column two is not empty, grab it and place it in column two of a "final" row; otherwise, if column two is empty, grab the value in column three and place it in column three of the "final" row. The rule I need to implement is: the first and last instances of a column one value need to combine whatever column two and three data exists, while keeping column two data in column two and column three data in column three. Also, there are never three full values in a given row of the source CSV.
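The rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `merge_group` is a hypothetical name, each row is a three-item list as `csv.reader` would yield it, and when both the first and last row carry a value in the same column, the first row's value wins (the description does not cover that case).

```python
def merge_group(rows):
    """Collapse rows that share one column-one value into a single row.

    Only the first and last rows of the group are consulted. For column
    two and column three, the first non-empty value among those two rows
    is kept (assumption: the first row wins if both are filled).
    """
    key = rows[0][0]
    ends = (rows[0], rows[-1])
    col2 = next((r[1] for r in ends if r[1]), "")
    col3 = next((r[2] for r in ends if r[2]), "")
    return [key, col2, col3]


# Rows taken from Example 1 of the question.
group = [
    ["wp.xyz03.def02.01195.1", "wp03.xyz03-c01_lc08_m00", ""],
    ["wp.xyz03.def02.01195.1", "wp02.xyz03", ""],
    ["wp.xyz03.def02.01195.1", "", "wp01.def02"],
    ["wp.xyz03.def02.01195.1", "", "wp02.def02-c02_lc14_m00"],
]
print(merge_group(group))
```

The middle rows are ignored on purpose, which matches the "first and last instance" wording rather than collecting every value in the group.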

To better explain, here is the data as it is laid out in the source CSV. These are examples of sets of rows that need to be combined:

Example 1: Here I have four rows with matching column one data. As for all the examples, I need the result to be a row containing the column one value followed by the values found in the first and last instances of that column one value.

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00

So the desired result for this group would be:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00

Example 2: Here I have three rows with matching column one data. Again, I need the result to be a row containing the column one value followed by the values found in the first and last instances of that column one value.

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01

So the desired result for this group would be:

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01

Example 3: Here I have three rows with matching column one data. Again, I need the result to be a row containing the column one value followed by the values found in the first and last instances of that column one value. Note that in this example the first row contains no value in column two; the desired value is in column three instead.

tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

So the desired result for this group would be:

tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

Putting it all together, this:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01
tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

needs to turn into this:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

I've tried a number of things to accomplish this, but I cannot achieve the desired result without quickly getting into very unfamiliar territory.

This is my original attempt, which cut off some of the necessary values: once I reach three values it writes the row out, and it never notices that there might be another:

import csv

reader = csv.reader(open('parse_lur_luraz_clean_temp.csv', 'r'), delimiter=',')
final = ['-','-','-']
parselur = ['-']
lur_a = ""
lur_z = ""
for row in reader:
    if row[0] != parselur[0]:
        final = ['-','-','-']
        if row[1] != '': lur_a = row[1]
        if row[2] != '': lur_z = row[2]
        parselur[0] = row[0]
    elif row[0] == parselur[0]:
        if row[1] == '':
            lur_a = row[1]
        elif row[1] != '':
            lur_a = row[1]
        if row[2] == '':
            lur_z = row[2]
        elif row[2] != '':
            lur_z = row[2]
        parselur[0] = row[0]
    if parselur[0] != '' and parselur[0] not in final: final[0] = parselur[0]
    if lur_a != '': 
        if final[1] == '-' or '_lc' not in final[1]: final[1] = lur_a
        lur_a = ''
    if lur_z != '': 
        if final[2] == '-' or '_lc' not in final[2]: final[2] = lur_z
        lur_z = ''
    if len(final) == 3 and '-' not in final:
        fd = open('final_alu_nsn_temp.csv','a')
        writer = csv.writer(fd)
        writer.writerow((final))
        fd.close()
        final = ['-','-','-']
    else:
        parselur[0] = row[0]


Solution

Now's as good a time as any to learn about itertools.groupby:
import csv
from itertools import groupby

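# assuming Python 2 (binary file modes are what its csv module expects)
with open("source.csv", "rb") as fp_in, open("final.csv", "wb") as fp_out:
    reader = csv.reader(fp_in)
    writer = csv.writer(fp_out)
    # groupby merges only adjacent rows, so rows sharing a key must be consecutive
    grouped = groupby(reader, lambda x: x[0])
    for key, group in grouped:
        rows = list(group)
        # keep just the first and last row of each group
        rows = [rows[0], rows[-1]]
        # transpose so each tuple holds one column's values from both rows
        columns = zip(*(r[1:] for r in rows))
        # "" sorts before any non-empty string, so max() picks the filled value
        use_values = [max(c) for c in columns]
        new_row = [key] + use_values
        writer.writerow(new_row)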
produces
$ cat final.csv 
wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1
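
For Python 3, the same idea might look like the sketch below. It is an adaptation, not part of the original answer: `merge_rows` is a hypothetical helper name, it takes already-open text-mode file objects, and `lineterminator="\n"` is set only to keep the in-memory demo output simple (the `csv` default is `"\r\n"`).

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def merge_rows(fp_in, fp_out):
    """Write one merged row per run of consecutive rows sharing column one."""
    reader = csv.reader(fp_in)
    writer = csv.writer(fp_out, lineterminator="\n")
    for key, group in groupby(reader, key=itemgetter(0)):
        rows = list(group)
        first, last = rows[0], rows[-1]
        # "" sorts before every non-empty string, so max() keeps the filled value.
        merged = [max(pair) for pair in zip(first[1:], last[1:])]
        writer.writerow([key] + merged)


# In-memory demo using the rows from Example 3 of the question.
source = io.StringIO(
    "tp.ghi03.ghi05.02194.65,,tp05.ghi05:1\n"
    "tp.ghi03.ghi05.02194.65,tp05.ghi03:2,\n"
    "tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,\n"
)
result = io.StringIO()
merge_rows(source, result)
print(result.getvalue(), end="")
```

With real files in Python 3, open both in text mode with `newline=""`, as the `csv` module documentation recommends, instead of `"rb"`/`"wb"`. Because `groupby` only merges adjacent rows, input whose duplicate keys are scattered would need to be sorted on column one first, which changes the output order and which rows count as "first" and "last".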


                        