Splitting Rows in csv on several header rows
Problem description
I am very new to Python, so please be gentle.
I have a .csv file that is reported to me in this format, so I cannot do much about its layout:
ClientAccountID AccountAlias CurrencyPrimary FromDate
SomeID SomeAlias SomeCurr SomeDate
OtherID OtherAlias OtherCurr OtherDate
ClientAccountID AccountAlias CurrencyPrimary AssetClass
SomeID SomeAlias SomeCurr SomeClass
OtherID OtherAlias OtherCurr OtherDate
AnotherID AnotherAlias AnotherCurr AnotherDate
I am using the csv package in Python, so I have:
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = ',')
Which, as I understand it, creates the dictionary 'theReader'. How do I split this into several dictionaries, one per header row in the original csv file? Is there a simple, elegant, non-loop way to create a list of dictionaries (or even a dictionary of dictionaries, with account IDs as keys)? Does that make sense?
Oh. Please note that the header rows are not identical, but they will always begin with 'ClientAccountID'.
Thanks to @codie, I am now using the following to split the csv into several dicts, using the '\t' delimiter.
with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = '\t')
However, I now get the entire header row as a key, and each other row as a value. How do I split this up further?
Thanks to @Benjamin Hodgson below, I have the following:
from csv import DictReader
from io import BytesIO

stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringios.append(stringio)
            stringio = BytesIO()
        stringio.write(line)
        stringio.write("\n")
    stringios.append(stringio)

data = [list(DictReader(x.getvalue(), delimiter=',')) for x in stringios]
If I print the first item in stringios, I get what I would expect. It looks like a single csv. However, if I print the first item in data, using the code below, I get something odd:
for row in data[0]:
    print row
It returns:
{'C':'U'}
{'C':'S'}
{'C':'D'}
...
So it appears to be splitting on every character, instead of using the comma delimiter.
Recommended answer
If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".
So the job is to read the CSV file into a list of lists of dictionaries. Each entry in the outer list corresponds to one of the tables in your CSV file.
Here's how I'd go about it:
- Break up the single CSV file with multiple tables into multiple files, each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
- Read each of these files into a list of dictionaries using a DictReader.
Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up in a file-like interface.)
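from csv import DictReader
from io import StringIO

stringios = []
with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        # A header row marks the start of a new table.
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                # Finish the previous table: rewind it and store it.
                stringio.seek(0)
                stringios.append(stringio)
            stringio = StringIO()
        # Each line read from the file already ends with its newline.
        stringio.write(line)
    # Store the last table as well.
    stringio.seek(0)
    stringios.append(stringio)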
If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind each StringIO with stringio.seek(0) after you've written to it.
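To see why the rewind matters, here is a minimal sketch (the buffer contents are made up for illustration). A StringIO keeps a single position for both writing and reading, so straight after writing, a read starts at the end and finds nothing:

```python
from io import StringIO

buf = StringIO()
buf.write("ClientAccountID,AccountAlias\n")
buf.write("SomeID,SomeAlias\n")

# After writing, the position sits at the end, so reading yields nothing.
unrewound = buf.read()

# seek(0) moves the position back to the start of the buffer.
buf.seek(0)
rewound = buf.read()

print(repr(unrewound))
print(repr(rewound))
```

A DictReader handed an un-rewound StringIO would therefore see an empty file and produce no rows.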
Now it's straightforward to loop over the StringIOs to get each table as a list of dictionaries.
data = [list(DictReader(x, delimiter='\t')) for x in stringios]
For each file-like object in the list stringios, create a DictReader and read it into a list.
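Note that DictReader should be given the file-like object itself, not the string returned by getvalue(). A DictReader iterates over whatever it is handed: iterating a file-like object yields lines, while iterating a plain string yields single characters, which would explain the one-character keys seen in the question. A small sketch with made-up data, assuming Python 3:

```python
from csv import DictReader
from io import StringIO

text = "ClientAccountID,AccountAlias\nSomeID,SomeAlias\n"

# File-like object: DictReader iterates over lines, as intended.
rows_from_file = list(DictReader(StringIO(text), delimiter=","))

# Plain string: DictReader iterates over characters, so the first
# character ("C") becomes the only field name.
rows_from_string = list(DictReader(text, delimiter=","))

print(rows_from_file[0])
print(rows_from_string[0])
```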
It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line by line.
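One way the generator idea could look is sketched below. The function name iter_tables and its parameters are hypothetical, not part of the csv module, and the sketch assumes the input starts with a header row and that each individual table fits in memory. It buffers only the lines of the current table and yields that table as soon as the next header row appears:

```python
import csv
from io import StringIO


def iter_tables(lines, header_prefix="ClientAccountID", delimiter="\t"):
    """Yield one list of row dicts per table, one table at a time."""
    block = []
    for line in lines:
        # A header row closes the table collected so far.
        if line.startswith(header_prefix) and block:
            yield list(csv.DictReader(block, delimiter=delimiter))
            block = []
        block.append(line)
    # Emit the final table, if any lines were read at all.
    if block:
        yield list(csv.DictReader(block, delimiter=delimiter))


# Made-up, tab-separated sample standing in for the real file.
sample = StringIO(
    "ClientAccountID\tAccountAlias\tCurrencyPrimary\tFromDate\n"
    "SomeID\tSomeAlias\tSomeCurr\tSomeDate\n"
    "OtherID\tOtherAlias\tOtherCurr\tOtherDate\n"
    "ClientAccountID\tAccountAlias\tCurrencyPrimary\tAssetClass\n"
    "SomeID\tSomeAlias\tSomeCurr\tSomeClass\n"
)

tables = list(iter_tables(sample))
print(len(tables))
```

With a real file, an open file object can be passed in place of sample, and the tables can be consumed in a for loop instead of being collected with list(). DictReader accepts a list of lines here because it takes any iterable of strings.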