Adjusting CSV data: appending cells to previous row, merge cells containing certain string

Question

I have a file, data.csv, that looks like this (two columns; A and B):

A       B
01      a
        'b'
0101    a
        b
010101  a
        'b'
        'c'
        d
        'e'
        f
010102  a
        b
        'd'
        'e'
010201  a
        b
        'c'
        d

02      a
        b
0201    a
        b

020101  a
        b
        'd'
        'e'
020102  a
        'b'
        c
020201  a
        b
        c
        d
        'e'
020301  a
        'b'
        c
        d

I want it to look like this (five columns; A, B, C, D and E):

A       B   C   D   E
01      a   b       
0101    a   b       
010101  a   b   c   d, e, f
010102  a   b       d, e
010201  a   b   c   d
02      a           
0201    a   b       
020101  a   b       d, e
020102  a   b   c   
020201  a   b   c   d, e
020301  a   b   c   d

Here is what I know about data.csv:

  • UTF-8 encoding
  • UNIX style end-of-line
  • Tabulator delimiter
  • Some rows are blank (empty cells)
  • Some rows begin with an empty cell (a tabulator)
  • Some rows begin with two, four or six digits
  • Some cells contain text strings, here represented by a single character
  • Some of the text strings are surrounded by ' signs
  • The 'a', 'b' and 'c' values can not be assumed to always be present
  • There is no pattern for 'a', 'b' or 'c'
  • There is a pattern for 'd', 'e' and 'f' - the word foo is part of their strings

Treating data.csv as a text file I put together a script to:

  • Remove blank rows
  • Append rows that begin with a tab (an empty cell) to the previous row
  • Remove the ' signs

Code:

#!/usr/bin/python3
f = open('data.csv')
c = f.read()
f.close()
c = c.replace('\n\n', '\n')
c = c.replace('\n\t', '\t')
c = c.replace("'", "")
f = open('output.csv', 'w')
f.write(c)
f.close()

...and then I got stuck. Perhaps there's a more uniform way to do this along with the other adjustments, using the csv module. How can I solve this with Python 3.3 (I assume any 3.x solution will be compatible)?

Update

Based on Martijn Pieters' answer I came up with this, and it seems to be working, although I'm not sure that the 'a', 'b' and 'c' text values are always put in the appropriate columns. Also, the last row is skipped/left blank.

#!/usr/bin/python3

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    write_this_row = None
    for row in reader:
        # If there is a row with content...
        if row:
            # If the first cell has content...
            if row[0]:
                if write_this_row != None:
                    writer.writerow(write_this_row)
                write_this_row = row
            elif 'foo' in row[1]:
                if len(write_this_row) < 5:
                    write_this_row.extend([''] * (5 - len(row)))
                if write_this_row[4]:
                    write_this_row[4] += ';' + row[1]
                else:
                    write_this_row[4] = row[1]
            else:
                write_this_row.insert(3, row[1])

Recommended answer

Just use the csv module to read the data, massage this per row, and write it out again.

You can create 'empty' columns by using None or an empty string '' as the value for that column. Vice-versa, reading empty columns (so between consecutive tabs) gives you empty strings.
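As a quick check of that behaviour, the sketch below round-trips one row through csv.writer and csv.reader. The in-memory io.StringIO buffers and the sample values are assumptions made for illustration; they stand in for the real input and output files.

```python
import csv
import io

# Write one row where column C is None and column D is an empty string.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['01', 'a', None, '', 'e'])

# Both None and '' end up as empty fields between consecutive tabs.
written = buffer.getvalue()

# Reading the same text back gives '' for every empty field.
reader = csv.reader(io.StringIO(written, newline=''), delimiter='\t')
read_back = next(reader)

print(repr(written))
print(read_back)
```

Note that None does not survive the round trip: it is written as an empty field and comes back as ''.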

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')

    for row in reader:
        if len(row) > 3:
            # detect if `c` is missing (insert your own test here)
            # sample test looks for 3 consecutive columns with values f, o and o
            if row[3:6] == ['f', 'o', 'o']:
                # insert an empty `c`
                row.insert(3, '')

        if len(row) < 5:
            # make row at least 5 columns long
            row.extend([''] * (5 - len(row)))
        if len(row) > 5:
            # merge any excess columns into the 5th column
            row[4] = ','.join(row[4:])
            del row[5:]

        writer.writerow(row)
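The padding and merging steps can be tried in isolation on plain lists. This is a minimal sketch; the helper name `normalize_row`, its `width` and `joiner` parameters, and the sample rows are hypothetical and only mirror the two `if` blocks above.

```python
def normalize_row(row, width=5, joiner=','):
    """Return a copy of row that is exactly `width` columns long."""
    row = list(row)
    if len(row) < width:
        # pad short rows with empty cells
        row.extend([''] * (width - len(row)))
    if len(row) > width:
        # merge any excess columns into the last column
        row[width - 1] = joiner.join(row[width - 1:])
        del row[width:]
    return row


short_row = normalize_row(['01', 'a', 'b'])
long_row = normalize_row(['010101', 'a', 'b', 'c', 'd', 'e', 'f'])

print(short_row)
print(long_row)
```

Keeping this step in one function also means it only has to be written once, whichever loop structure reads the rows.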

Update:

Instead of using a flag, use the reader as an iterator (calling next() on it to get the next row instead of using a for loop):

import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')

    row = None

    try:
        next(reader)  # skip the `A   B` headers.

        line = next(reader)  # prime our loop
        while True:
            while not line or not line[0]:
                # skip blank rows and advance to the first line with a column 0 value
                line = next(reader)

            row = line  # start off with the first number and column
            line = next(reader)  # prime the subsequent lines loop

            while line and not line[0]:
                # process subsequent lines until we find one with a value in col 0 again
                cell = line[1]
                if 'foo' in cell:    # detect a d/e/f value
                    # pad so the first such value lands in column E (index 4)
                    row.extend([''] * (4 - len(row)))
                row.append(cell)
                line = next(reader)

            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]

            writer.writerow(row)
            row = None
    except StopIteration:
        # reader is done, no more lines to come
        # process the last row if there was one
        if row is not None:
            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]

            writer.writerow(row)
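A plain for loop can also flush the last row without repeating the consolidation code. The following is only a sketch under stated assumptions, not the method used above: the names `merge_rows`, `finish` and `SAMPLE` are hypothetical, the ', ' separator is taken from the desired output in the question, and every 'd', 'e' or 'f' value is assumed to contain the substring foo.

```python
import csv
import io

# Hypothetical sample in the shape of data.csv (tab-delimited).
SAMPLE = (
    "A\tB\n"
    "01\ta\n"
    "\t'b'\n"
    "010101\ta\n"
    "\t'b'\n"
    "\t'c'\n"
    "\tfoo d\n"
    "\t'foo e'\n"
    "\n"
    "010102\ta\n"
    "\tb\n"
    "\t'foo d'\n"
)


def finish(cells, foo_cells):
    """Build the final 5-column row: A-D from cells, E from the foo cells."""
    head = cells[:4] + [''] * (4 - len(cells))   # pad A-D if values are missing
    tail = cells[4:] + foo_cells                 # overflow plus foo values go to E
    return head + [', '.join(tail)]


def merge_rows(lines):
    """Yield one merged row per numbered row in an iterable of csv rows."""
    current = None   # cells collected for columns A-D
    foo_cells = []   # cells containing 'foo', destined for column E
    for line in lines:
        if not any(line):
            continue                              # skip blank rows
        cells = [cell.strip("'") for cell in line]
        if cells[0]:
            # a value in column A starts a new output row
            if current is not None:
                yield finish(current, foo_cells)
            current, foo_cells = cells, []
        elif current is not None:
            # continuation row: sort its cells into A-D or E
            for cell in cells[1:]:
                if cell:
                    target = foo_cells if 'foo' in cell else current
                    target.append(cell)
    if current is not None:
        yield finish(current, foo_cells)          # flush the last row


reader = csv.reader(io.StringIO(SAMPLE), delimiter='\t')
next(reader)                                      # skip the `A   B` header
rows = list(merge_rows(reader))

out = io.StringIO(newline='')
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(['A', 'B', 'C', 'D', 'E'])
writer.writerows(rows)
print(out.getvalue())
```

Because the generator yields the pending row once more after the loop ends, the final numbered row is written without an exception handler. To use real files, the two io.StringIO objects would be replaced by the opened input.csv and output.csv.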
