How do I add a 'RowNumber' field to a structured numpy array?

Problem description

I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original ordering. My plan is to add another field to the data and put the row number into this field before the first sort is applied. It can then be used to revert the order at the end. I thought there might be an elegant way of adding this field of record numbers but after hours of trying and searching for ideas I have nothing particularly slick.

import numpy
import numpy.lib.recfunctions as rfn

def main():
    csvDataFile = 'C:\\File1.csv'
    csvData = numpy.genfromtxt(csvDataFile, delimiter=',', names=True, dtype='f8')
    rowNums = numpy.zeros(len(csvData), dtype=[('RowID', 'f8')])
    # populate and add column for RowID
    for i in range(0, len(csvData)):
        rowNums['RowID'][i] = i
    csvDataWithID = rfn.merge_arrays((csvData, rowNums), asrecarray=True, flatten=True)

The recfunctions.merge_arrays call in particular is very slow, and adding the row numbers one by one seems so old school. Your ideas would be gratefully received.

Solution

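import numpy as np

# fill the RowID column in one vectorized assignment instead of a Python loop
rowNums = np.zeros(len(csvData), dtype=[('RowID', 'f8')])
rowNums['RowID'] = np.arange(len(csvData))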

The above saves approx half a second per file with the csv files I am using. Very good so far.
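If the RowID column is still wanted, one possible way to attach it without merge_arrays is to allocate a new array with an extended dtype and copy the fields across. This is only a sketch and not something from the original answer: the small csvData array below is a made-up stand-in for the genfromtxt result, and whether it is actually faster on real files would need to be timed.

```python
import numpy as np

# Hypothetical stand-in for the structured array returned by genfromtxt.
csvData = np.array([(3.0, 1.0), (1.0, 2.0), (2.0, 0.5)],
                   dtype=[('col_1', 'f8'), ('col_2', 'f8')])

# Build a dtype that has the original fields plus a trailing RowID field.
newDtype = csvData.dtype.descr + [('RowID', 'f8')]
csvDataWithID = np.empty(len(csvData), dtype=newDtype)

# Copy each original column, then fill RowID in one vectorized assignment.
for name in csvData.dtype.names:
    csvDataWithID[name] = csvData[name]
csvDataWithID['RowID'] = np.arange(len(csvData))
```

Sorting csvDataWithID by 'RowID' at the end would then bring the rows back to their original order.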

However, the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:

sortorder = np.argsort(csvData, order=['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])

giving an array that lists the order of items in csvData when sorted by cols 1 through 5. This negates the need to make, populate and merge a RowID column, saving me around 15 s per CSV file (over 6 hours across my entire dataset).
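To make the round trip concrete, here is a minimal sketch of sorting with the index array and then putting the rows back in their original order. The tiny csvData array, the 'value' field and the names sortedData and restored are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Hypothetical stand-in for the structured array returned by genfromtxt.
csvData = np.array([(2.0, 1.0, 10.0), (1.0, 5.0, 20.0),
                    (2.0, 0.0, 30.0), (1.0, 3.0, 40.0)],
                   dtype=[('col_1', 'f8'), ('col_2', 'f8'), ('value', 'f8')])

# Indices that would sort the records by col_1, then col_2.
sortorder = np.argsort(csvData, order=['col_1', 'col_2'])
sortedData = csvData[sortorder]

# ... do the work that needs sorted data here ...

# Undo the sort: row i of sortedData goes back to position sortorder[i].
restored = np.empty_like(sortedData)
restored[sortorder] = sortedData
```

Assigning through sortorder scatters each sorted row back to the position it came from, so no extra RowID field is required.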

Thank you very much @hpaulj
