Splitting Rows in csv on several header rows


Question

I am very new to Python, so please be gentle.

I have a .csv file, reported to me in this format, so I cannot do much about it:

ClientAccountID   AccountAlias   CurrencyPrimary    FromDate
         SomeID      SomeAlias          SomeCurr    SomeDate
        OtherID     OtherAlias         OtherCurr   OtherDate
ClientAccountID   AccountAlias   CurrencyPrimary    AssetClass
         SomeID      SomeAlias          SomeCurr     SomeClass
        OtherID     OtherAlias         OtherCurr     OtherDate
      AnotherID   AnotherAlias       AnotherCurr   AnotherDate

I am using the csv package in Python, so I have:

with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = ',')

Which, as I understand it, creates the dictionary 'theReader'. How do I subset this dictionary, into several dictionaries, splitting them by the header rows in the original csv file? Is there a simple, elegant, non-loop way to create a list of dictionaries (or even a dictionary of dictionaries, with account IDs as keys)? Does that make sense?

Oh. Please note the header rows are not equivalent, but the header rows will always begin with 'ClientAccountID'.

Thanks to @codie, I am now using the following to split the csv into several dicts, using the '\t' delimiter.

with open(theFile, 'rb') as csvfile:
    theReader = csv.DictReader(csvfile, delimiter = '\t')

However, I now get the entire header row as a key, and each other row as a value. How do I further split this up?

Thanks to @Benjamin Hodgson below, I have the following:

from csv import DictReader
from io import BytesIO

stringios = []

with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringios.append(stringio)
            stringio = BytesIO()
        stringio.write(line)
        stringio.write("\n")
    stringios.append(stringio)

data = [list(DictReader(x.getvalue(), delimiter=',')) for x in stringios]

If I print the first item in stringios, I get what I would expect: it looks like a single csv. However, if I print the first item in data using the code below, I get something odd:

for row in data[0]:
    print row

It returns:

{'C':'U'}
{'C':'S'}
{'C':'D'}
...

So it appears it is splitting every character, instead of using the comma delimiter.

Recommended answer

If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".

So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.

Here's how I'd approach it:

  1. Break up the single CSV file with multiple tables into multiple files each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
  2. Read each of these files into a list of dictionaries using a DictReader.

Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).

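from csv import DictReader
from io import StringIO

stringios = []

with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            # A header row starts a new table: rewind and store the previous one
            if stringio is not None:
                stringio.seek(0)
                stringios.append(stringio)
            stringio = StringIO()
        if stringio is not None:
            # Each line already ends with its own newline, so write it unchanged
            stringio.write(line)
    # Store the last table, if the file contained any header row at all
    if stringio is not None:
        stringio.seek(0)
        stringios.append(stringio)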

If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too. Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0).
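
To see why the rewind matters, here is a minimal sketch using a hypothetical two-line buffer (the sample contents are made up for illustration). After writing, the position of a StringIO sits at the end of the buffer, so a reader starting there finds nothing until seek(0) is called.

```python
from io import StringIO

buf = StringIO()
buf.write("ClientAccountID,AccountAlias\n")
buf.write("SomeID,SomeAlias\n")

# After writing, the position is at the end, so reading yields nothing.
unrewound = buf.read()

# seek(0) moves the position back to the start of the buffer.
buf.seek(0)
rewound = buf.read()

print(repr(unrewound))
print(repr(rewound))
```

A DictReader handed an un-rewound StringIO would behave the same way as the first read here and produce no rows.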

Now it's straightforward to loop over the StringIOs to get a table of dictionaries.

data = [list(DictReader(x, delimiter='\t')) for x in stringios]

For each file-like object in the list stringios, create a DictReader and read it into a list.
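
Note that DictReader is given the file-like object itself, not the string from getvalue(). The csv module iterates over whatever it receives, and iterating over a plain string yields one character at a time, which would explain the one-character dictionaries shown in the question. The sketch below illustrates the difference on a made-up comma-separated sample; the sample text and the comma delimiter are assumptions for illustration.

```python
from csv import DictReader
from io import StringIO

# Hypothetical sample data, comma-separated for illustration
text = "ClientAccountID,AccountAlias\nSomeID,SomeAlias\n"

# Passing the string itself: csv treats every character as a "line",
# so the header becomes ['C'] and each later character becomes a row.
char_rows = list(DictReader(text, delimiter=','))

# Passing a file-like object: csv iterates it line by line.
line_rows = [dict(row) for row in DictReader(StringIO(text), delimiter=',')]

print(char_rows[0])
print(line_rows)
```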

It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
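
One possible generator-based variant is sketched below. It is only an illustration, not part of the original answer: the function name iter_rows, its parameters, and the comma-separated sample data are all hypothetical. It keeps only the current header in memory and yields each data row as soon as it is read, tagged with the index of the table it belongs to.

```python
import csv
from io import StringIO

def iter_rows(lines, header_prefix='ClientAccountID', delimiter=','):
    """Yield (table_index, row_dict) pairs one row at a time.

    `lines` is any iterable of text lines, such as an open file.
    Hypothetical helper: name and parameters are illustrative only.
    """
    header = None
    table_index = -1
    for fields in csv.reader(lines, delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith(header_prefix):
            # A new header row starts the next table
            header = fields
            table_index += 1
        elif header is not None:
            # zip stops at the shorter sequence, so extra fields are dropped
            yield table_index, dict(zip(header, fields))

# Made-up sample modelled on the question's layout, assumed comma-separated
sample = StringIO(
    "ClientAccountID,AccountAlias,CurrencyPrimary,FromDate\n"
    "SomeID,SomeAlias,SomeCurr,SomeDate\n"
    "OtherID,OtherAlias,OtherCurr,OtherDate\n"
    "ClientAccountID,AccountAlias,CurrencyPrimary,AssetClass\n"
    "SomeID,SomeAlias,SomeCurr,SomeClass\n"
)
results = list(iter_rows(sample))
for table_index, row in results:
    print(table_index, row)
```

With a real file, passing the open file object as `lines` means no more than one row is held at a time; collecting into a list, as done here for display, is optional.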
