Create Pandas Dataframe from List of Generators


Question


I have the following question: is there a way to build a DataFrame from a list of Python generator objects? I used a list comprehension to create the list with the data for the DataFrame:

data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

I did it this way because a normal list append in a for loop takes about 20 times longer:

for record in records:
    # append takes a single argument, so the four fields are wrapped in a list
    data_list.append([record.Timestamp, record.Value, record.Name, record.desc])

I tried to create the DataFrame, but it doesn't work. This:

dataframe = pd.DataFrame(data_list, columns=['timestamp', 'value', 'name', 'desc'])

throws an exception:

ValueError: 4 columns passed, passed data had 142538 columns.

I also tried to use itertools like this:

dataframe = pd.DataFrame(data=([list(elem) for elem in itt.chain.from_iterable(data_list)]), columns=['timestamp', 'value', 'name', 'desc'])

This results in an empty DataFrame:

Empty DataFrame
Columns: [timestamp, value, name, desc]
Index: []

data_list looks like this:

[<generator object St...51DB0>, <generator object St...56EB8>,<generator object St...51F10>, <generator object St...51F68>]

The code for generating the list looks like this:

for events in events_list:
    for record in events:
        data_list.append([record.Timestamp,record.Value, record.Name, record.desc] for record in records)

This is required because of the structure of the events list.

Is there a way for me to create a DataFrame out of a list of generators? If there is, will it be time efficient? What I mean is that I save a lot of time by replacing the normal for loop with a list comprehension, but if creating the DataFrame then takes more time, the change will be pointless.

Solution

Just turn your data_list into a generator expression as well. For example:

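from collections import namedtuple

import pandas as pd

MyData = namedtuple("MyData", ["a"])
# outer generator expression pulls field a out of each MyData produced by the inner one
data = (d.a for d in (MyData(i) for i in range(100)))
df = pd.DataFrame(data)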

will work just fine. So what you should have is:

data = ((record.Timestamp,record.Value, record.Name, record.desc) for record in records)
df = pd.DataFrame(data, columns=["Timestamp", "Value", "Name", "Desc"])

The actual reason why your approach does not work is that an entry in your data_list is a generator over (I suppose) 142538 records. Pandas tries to cram that single entry into a single row, so all 142538 elements, each a list of four values, become columns. It then fails, since only 4 column names were passed.
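If the data already sits in a list of generators, as in the question, one option is to flatten that list into a single stream of rows before handing it to pandas. The following is only a minimal sketch: the `Record` namedtuple and the sample values are hypothetical stand-ins for the real records, and `itertools.chain.from_iterable` is used here as one possible way to do the flattening.

```python
from collections import namedtuple
from itertools import chain

import pandas as pd

# Hypothetical stand-in for the record objects in the question.
Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])

events_list = [
    [Record(1, 10.0, "a", "first"), Record(2, 20.0, "b", "second")],
    [Record(3, 30.0, "c", "third")],
]

# One generator per event, each yielding 4-element rows
# (the same shape as data_list in the question).
data_list = [
    ([r.Timestamp, r.Value, r.Name, r.desc] for r in events)
    for events in events_list
]

# chain.from_iterable lazily concatenates the generators into one stream of rows.
rows = chain.from_iterable(data_list)
df = pd.DataFrame(rows, columns=["timestamp", "value", "name", "desc"])
print(df.shape)  # (3, 4)
```

Each row now holds exactly four values, so the four column names match.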

Edit: you can of course make the generator expression more complex. Here's an example along the lines of your additional loop over events:

from collections import namedtuple

import pandas as pd

MyData = namedtuple("MyData", ["a", "b"])
# outer loop over j, inner generator over i: 100 * 100 rows in total
data = ((d.a, d.b) for j in range(100) for d in (MyData(j, j+i) for i in range(100)))
pd.DataFrame(data, columns=["a", "b"])

Edit: here's also an example using data structures like the ones you are using:

Record = namedtuple("Record", ["Timestamp", "Value", "Name", "desc"])

event_list = [[Record(Timestamp=1, Value=1, Name=1, desc=1),
               Record(Timestamp=2, Value=2, Name=2, desc=2)],
              [Record(Timestamp=3, Value=3, Name=3, desc=3)]]

data = ((r.Timestamp, r.Value, r.Name, r.desc) for events in event_list for r in events)
pd.DataFrame(data, columns=["timestamp", "value", "name", "desc"])

Output:

   timestamp  value  name  desc
0          1      1     1     1
1          2      2     2     2
2          3      3     3     3
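One caveat with any generator-based approach: a generator can be consumed only once. This may also explain the empty DataFrame in the question, if the itertools attempt ran on the same data_list after the failed constructor call had already iterated the generators. That is an assumption, since the question does not say. The sketch below uses two small hypothetical generators and only the standard library:

```python
from itertools import chain

# Two small generators standing in for the entries of data_list.
data_list = [(n for n in range(2)), (n for n in range(2, 4))]

first_pass = list(chain.from_iterable(data_list))
# The generators are already exhausted, so a second pass yields nothing.
second_pass = list(chain.from_iterable(data_list))

print(first_pass)   # [0, 1, 2, 3]
print(second_pass)  # []
```

If the rows are needed more than once, rebuild the generators or materialize them into a list first.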
