Numpy genfromtxt doesn't seem to work when names=True for Python 3


Question

I am using the Google Colab environment.

The file I am using can be found here. It is a CSV file:

https://drive.google.com/open?id=1v7Mm6S8BVtou1iIfobY43LRF8MgGdjfU

Warning: it has several million rows.

This code runs within a minute in a Google Colab Python 3 notebook. I tried it several times with no problem.

from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',', dtype=int)

print(my_data[0:50])

The code below, on the other hand, runs for several minutes before disconnecting from Google Colab's server. I tried multiple times. Eventually Colab gives me a 'running out of memory' warning.

from numpy import genfromtxt
my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',', dtype=int, names=True)

print(my_data[0:50])

It seems that there used to be an issue with names=True in Python 3, but that issue was fixed: https://github.com/numpy/numpy/issues/5411

I checked which version I was using in Colab, and it was up to date:

import numpy as np

print(np.version.version)

>1.14.3

Recommended answer

With

my_data = genfromtxt('DlRefinedRatings.csv', delimiter=',', dtype=int, max_rows=100)

I got a (100,4) int array.

With names=True it took a long time, and then issued a long list of errors, all the same except for the line number (even with max_rows):

Line #4121986 (got 4 columns instead of 3)

The header line is screwy, with an initial blank name:

In [753]: !head ../Downloads/refinedRatings.csv
,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
5,2,26,4
7,2,33,4
8,2,301,5
9,2,2686,5
10,2,3753,5

So based on the names it thinks there are 3 columns, but all data lines have 4. Hence the error. I don't know why it ignores max_rows in this case.
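The mismatch can be checked without loading the whole file. The following is a minimal sketch using only the standard library, assuming the file starts with the same lines as the head output above; the inline sample stands in for the real CSV, and the variable names are illustrative only.

```python
import csv
import io

# Sample copied from the first lines of the file; for the real data,
# open the CSV file instead of using io.StringIO.
sample = """,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)      # raw header fields, including the blank one
first_row = next(reader)   # first data row

# Keep only the header fields that actually contain a name.
named_fields = [name for name in header if name.strip()]

print("raw header fields:", header)
print("non-blank names:  ", named_fields)
print("data fields:      ", len(first_row))

# True when the header names fewer columns than the data rows contain.
mismatch = len(named_fields) != len(first_row)
print("mismatch:", mismatch)
```

If the count of non-blank names differs from the number of fields in a data row, the header is a likely cause of "got 4 columns instead of 3" errors like the one above.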

But with my own names:

In [755]: np.genfromtxt('../Downloads/refinedRatings.csv', delimiter=',', dtype=int,
     ...:               max_rows=10, names='foo,bar,dat,me')
Out[755]: 
array([(-1, -1,   -1, -1), ( 0,  1,  258,  5), ( 1,  2, 4081,  4),
       ( 2,  2,  260,  5), ( 3,  2, 9296,  5), ( 5,  2,   26,  4),
       ( 7,  2,   33,  4), ( 8,  2,  301,  5), ( 9,  2, 2686,  5),
       (10,  2, 3753,  5)],
      dtype=[('foo', '<i8'), ('bar', '<i8'), ('dat', '<i8'), ('me', '<i8')])

The first record (-1,-1,-1,-1) is the bad header line itself, with -1 in place of the strings it couldn't turn into ints. So we should use skip_header, as done below.

Or:

In [756]: np.genfromtxt('../Downloads/refinedRatings.csv', delimiter=',', dtype=int,
     ...:               max_rows=10, skip_header=1)
Out[756]: 
array([[   0,    1,  258,    5],
       [   1,    2, 4081,    4],
       [   2,    2,  260,    5],
       [   3,    2, 9296,    5],
       [   5,    2,   26,    4],
       [   7,    2,   33,    4],
       [   8,    2,  301,    5],
       [   9,    2, 2686,    5],
       [  10,    2, 3753,    5],
       [  11,    2, 8519,    5]])

In sum, skip the header, and use your own names if you want a structured array.
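Combining the two suggestions might look like the sketch below. It runs on a small inline sample rather than the real file, and the name 'index' for the unnamed first column is my own assumption, not something taken from the data.

```python
import io

import numpy as np

# Small stand-in for the real CSV, with the same blank first header field.
sample = """,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
"""

data = np.genfromtxt(
    io.StringIO(sample),   # replace with the path to the real file
    delimiter=',',
    dtype=int,
    skip_header=1,         # drop the malformed header line
    names='index,user_id,book_id,rating',  # 'index' is an assumed name
)

print(data.dtype.names)
print(data['rating'])
```

The result is a 1-D structured array whose fields can be accessed by name, for example data['book_id'].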
