Adjusting CSV data: appending cells to the previous row and merging cells that contain a certain string
Question
I have a file, data.csv, that looks like this (two columns: A and B):
A B
01 a
'b'
0101 a
b
010101 a
'b'
'c'
d
'e'
f
010102 a
b
'd'
'e'
010201 a
b
'c'
d
02 a
b
0201 a
b
020101 a
b
'd'
'e'
020102 a
'b'
c
020201 a
b
c
d
'e'
020301 a
'b'
c
d
I want it to look like this (five columns: A, B, C, D and E):
A B C D E
01 a b
0101 a b
010101 a b c d, e, f
010102 a b d, e
010201 a b c d
02 a
0201 a b
020101 a b d, e
020102 a b c
020201 a b c d, e
020301 a b c d
Here is what I know about data.csv:
- UTF-8 encoding
- UNIX-style line endings
- Tab delimiter
- Some rows are blank (empty cells)
- Some rows begin with an empty cell (a tab)
- Some rows begin with two, four or six digits
- Some cells contain text strings, here represented by a single character
- Some of the text strings are surrounded by ' signs
- The 'a', 'b' and 'c' values cannot be assumed to always be present
- There is no pattern for 'a', 'b' or 'c'
- There is a pattern for 'd', 'e' and 'f': the word foo is part of their strings
Treating data.csv as a text file, I put together a script to:
- Remove blank rows
- Append rows that begin with a tab (an empty cell) to the previous row
- Remove the ' signs
Code:
#!/usr/bin/python3
f = open('data.csv')
c = f.read()
f.close()
c = c.replace('\n\n', '\n')
c = c.replace('\n\t', '\t')
c = c.replace("'", "")
f = open('output.csv', 'w')
f.write(c)
f.close()
...and then I got stuck. Perhaps there is a more uniform way to do this, along with the other adjustments, using the csv module. How can I solve this with Python 3.3 (I assume any 3.x solution will be compatible)?
Update
Based on Martijn Pieters' answer I came up with this, and it seems to be working, although I'm not sure that the 'a', 'b' and 'c' text values are always put in the appropriate columns. Also, the last row is skipped/left blank.
#!/usr/bin/python3
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    write_this_row = None
    for row in reader:
        # If there is a row with content...
        if row:
            # If the first cell has content...
            if row[0]:
                if write_this_row != None:
                    writer.writerow(write_this_row)
                write_this_row = row
            elif 'foo' in row[1]:
                if len(write_this_row) < 5:
                    write_this_row.extend([''] * (5 - len(row)))
                if write_this_row[4]:
                    write_this_row[4] += ';' + row[1]
                else:
                    write_this_row[4] = row[1]
            else:
                write_this_row.insert(3, row[1])
Recommended answer
Just use the csv module to read the data, massage it row by row, and write it out again.
You can create 'empty' columns by using None or an empty string '' as the value for that column. Conversely, reading empty columns (that is, between consecutive tabs) gives you empty strings.
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for row in reader:
        if len(row) > 3:
            # detect if `c` is missing (insert your own test here)
            # sample test looks for 3 consecutive columns with values f, o and o
            if row[3:6] == ['f', 'o', 'o']:
                # insert an empty `c`
                row.insert(3, '')
        if len(row) < 5:
            # make row at least 5 columns long
            row.extend([''] * (5 - len(row)))
        if len(row) > 5:
            # merge any excess columns into the 5th column
            row[4] = ','.join(row[4:])
            del row[5:]
        writer.writerow(row)
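The two points above (how empty columns round-trip through the csv module, and how a row is padded or merged to a fixed width) can be checked in isolation. What follows is a minimal sketch rather than part of the original answer: the helper name fit_to_width and the in-memory io.StringIO buffers are illustrative assumptions, used here so that no files are needed.

```python
import csv
import io


def fit_to_width(row, width=5, sep=','):
    """Return a copy of `row` with exactly `width` columns.

    Short rows are padded with empty strings; excess columns are
    joined into the last column with `sep`.
    """
    row = list(row)
    if len(row) < width:
        row.extend([''] * (width - len(row)))
    if len(row) > width:
        row[width - 1] = sep.join(row[width - 1:])
        del row[width:]
    return row


# Writing None or '' both produce an empty column.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['01', None, '', 'x'])
written = buffer.getvalue()

# Reading consecutive tabs gives empty strings back.
read_back = next(csv.reader(io.StringIO(written), delimiter='\t'))

print(repr(written))
print(read_back)
print(fit_to_width(['010101', 'a', 'b', 'c', 'd', 'e', 'f']))
```

Keeping the pad-and-merge step in one function also avoids repeating it wherever a finished row is written.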
Update:
Instead of using a flag, use the reader as an iterator (calling next() on it to get the next row instead of using a for loop):
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    row = None
    try:
        next(reader)  # skip the `A B` headers.
        line = next(reader)  # prime our loop
        while True:
            while not line or not line[0]:
                # advance to the first line with a column 0 value
                # (`not line` skips blank rows, which the reader returns as [])
                line = next(reader)
            row = line  # start off with the first number and column
            line = next(reader)  # prime the subsequent lines loop
            while not line or not line[0]:
                # process subsequent lines until we find one with a value in col 0 again
                if len(line) > 1:
                    cell = line[1]
                    if 'foo' in cell and len(row) < 4:  # detect column d
                        # pad any missing b/c columns so d lands in the 5th column
                        row.extend([''] * (4 - len(row)))
                    row.append(cell)
                line = next(reader)
            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]
            writer.writerow(row)
            row = None
    except StopIteration:
        # reader is done, no more lines to come
        # process the last row if there was one
        if row is not None:
            # consolidate, write
            if len(row) < 5:
                # make row at least 5 columns long
                row.extend([''] * (5 - len(row)))
            if len(row) > 5:
                # merge any excess columns into the 5th column
                row[4] = ','.join(row[4:])
                del row[5:]
            writer.writerow(row)
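The same grouping can also be written as a generator, which removes the duplicated "consolidate, write" block and flushes the final record after the loop ends. This is a sketch under stated assumptions, not the answer's own code. The names merge_continuation_rows and _finish are hypothetical, the sample data is made up to match the file description in the question, and cells containing the marker word are assumed to belong in the last column. Surrounding ' signs are stripped with str.strip, which is a choice made here rather than something the answer prescribes.

```python
import csv
import io


def _finish(plain, marked, width, sep):
    # Pad the ordinary cells up to width - 1 columns, then add the merged
    # column. If there are more ordinary cells than that, nothing is
    # dropped; the row simply comes out wider.
    padding = [''] * (width - 1 - len(plain))
    return plain + padding + [sep.join(marked)]


def merge_continuation_rows(rows, marker='foo', sep=', ', width=5):
    """Yield one row per record.

    A record starts at a row whose first cell is non-empty; rows with an
    empty first cell are continuations of the current record.
    """
    plain, marked = None, []
    for row in rows:
        if not any(row):
            continue  # blank line, or a row of empty cells only
        if row[0]:
            if plain is not None:
                yield _finish(plain, marked, width, sep)
            plain, marked = [row[0]], []
        elif plain is None:
            continue  # continuation before any record has started
        for cell in row[1:]:
            cell = cell.strip("'")  # drop surrounding ' signs
            if not cell:
                continue
            if marker in cell:
                marked.append(cell)
            else:
                plain.append(cell)
    if plain is not None:
        # flush the last record, which no later row would trigger
        yield _finish(plain, marked, width, sep)


# Made-up sample in the shape described in the question.
sample = (
    "A\tB\n"
    "01\ta\n"
    "\t'b'\n"
    "\n"
    "010101\ta\n"
    "\t'b'\n"
    "\t'c'\n"
    "\tfoo d\n"
    "\t'foo e'\n"
    "010102\ta\n"
    "\tfoo d\n"
)
reader = csv.reader(io.StringIO(sample), delimiter='\t')
header = next(reader)  # skip the `A B` headers
result = list(merge_continuation_rows(reader))
for record in result:
    print(record)
```

To use it on files, open them as in the answer, skip the header with next(reader), and pass each yielded row to writer.writerow.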