Read string representation of 2D array from CSV column into a 2D numpy array

Problem description

I have a pandas dataframe, for which one of the columns holds 2D numpy arrays corresponding to pixel data from grayscale images. These 2D numpy arrays have the shape (480, 640) or (490, 640). The dataframe has other columns containing other information. I then generate a csv file out of it through pandas' to_csv() function. Now my issue is: my 2D numpy arrays all appear as strings in my CSV, so how can I read them back and convert them into 2D numpy arrays again?

I know there are similar questions on StackOverflow, but I couldn't find any that really focuses on 2D numpy arrays. They seem to be mostly about 1D numpy arrays, and the solutions provided don't seem to work.

Thanks a lot for your help.

Update:

As requested, I am adding some code below to clarify what my problem is.

import sys
import cv2
import numpy as np

# Function to switch images to grayscale format
def grayscale(img):
  return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Iterating through my dataframe (called data), reading all image files, making them grayscale and then adding them to my collection.
grayscale_images = []
for index, row in data.iterrows():
  img_path = row['Image path']
  cv_image = cv2.imread(img_path)
  gray = grayscale(cv_image)
  grayscale_images.append(gray)

# Make numpy array elements show without truncation
np.set_printoptions(threshold=sys.maxsize)

# Adding a new column to the dataframe containing each image's numpy array corresponding to pixels
data['Image data'] = grayscale_images

So when I'm done doing that and other operations on other columns, I export my dataframe to CSV like this:

data.to_csv('new_dataset.csv', index=False)

In a different Jupyter notebook, I try to read my CSV file and then extract my image's numpy arrays to feed them to a convolutional neural network as input, as part of supervised training.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import re

data = pd.read_csv('new_dataset.csv')
# data.head() -- It looks fine here

# Config to make numpy arrays display in their entirety without truncation
np.set_printoptions(threshold=sys.maxsize)

# Checking if I can extract a 2D numpy array for conversion from a cell.
# That's where I notice it's a string, and I'm having trouble turning it back to a 2D numpy array
image_arr = data.iloc[0,0]

But I'm stuck converting the string representation from my CSV file back into a 2D numpy array, especially one with the shape (490, 640), as it was before I exported the dataframe to CSV.

Recommended answer

Construct a csv with array strings:

In [385]: arr = np.empty(1, object)                                             
In [386]: arr[0]=np.arange(12).reshape(3,4)                                     
In [387]: S = pd.Series(arr,name='x')                                           
In [388]: S                                                                     
Out[388]: 
0    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Name: x, dtype: object
In [389]: S.to_csv('series.csv')                                                
/usr/local/bin/ipython3:1: FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
  #!/usr/bin/python3
In [390]: cat series.csv                                                        
0,"[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]"

Load it:

In [391]: df = pd.read_csv('series.csv',header=None)                            
In [392]: df                                                                    
Out[392]: 
   0                                                1
0  0  [[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]

In [394]: astr=df[1][0]                                                         
In [395]: astr                                                                  
Out[395]: '[[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]'

Parse the string representation of the array:

In [396]: astr.split('\n')                                                      
Out[396]: ['[[ 0  1  2  3]', ' [ 4  5  6  7]', ' [ 8  9 10 11]]']

In [398]: astr.replace('[','').replace(']','').split('\n')                      
Out[398]: [' 0  1  2  3', '  4  5  6  7', '  8  9 10 11']
In [399]: [i.split() for i in _]                                                
Out[399]: [['0', '1', '2', '3'], ['4', '5', '6', '7'], ['8', '9', '10', '11']]
In [400]: np.array(_, int)                                                      
Out[400]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
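
The steps above can be folded into a small helper. This is only a sketch: the name parse_array_string is made up for illustration, and it assumes integer data that was written without truncation (as with threshold=sys.maxsize in the question). One difference from the session above: numpy wraps long rows over several lines when printing, so for 640-column rows splitting on '\n' may cut a single row into pieces. Splitting on the closing bracket sidesteps that.

```python
import numpy as np

def parse_array_string(text, dtype=int):
    """Parse the str() form of a 2D integer array, e.g. '[[0 1]\n [2 3]]'.

    Hypothetical helper, not part of numpy or pandas.
    """
    # Every row ends with ']', so split there rather than on newlines;
    # long rows are wrapped over several lines by numpy's printer.
    chunks = text.replace('[', ' ').split(']')
    rows = [chunk.split() for chunk in chunks if chunk.strip()]
    return np.array(rows, dtype=dtype)

astr = '[[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]'
parsed = parse_array_string(astr)
print(parsed.shape)  # (3, 4)

# A row of 100 values is wrapped when printed, but still parses as one row.
wide = np.arange(200).reshape(2, 100)
print(np.array_equal(parse_array_string(str(wide)), wide))  # True
```

On a dataframe read back from the CSV, something like data['Image data'].apply(parse_array_string) would convert the whole column, and .astype(np.uint8) on each result would restore an 8-bit pixel type if that is what the images used.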

No guarantee that this is the prettiest or cleanest parsing, but it gives an idea of the work you have to do. I'm reinventing the wheel, but searching for a duplicate was taking too long.

If possible, try to avoid saving such a dataframe as csv. The csv format is meant for a clean 2d table: simple, consistent columns separated by a delimiter.
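
One possible way to follow that advice, sketched under assumptions: write each image array to its own .npy file with np.save and keep only the file path in the CSV. The column names 'label' and 'Array path', the file names, and the temporary directory are all hypothetical, and the zero/one arrays stand in for real grayscale images.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for two grayscale images with the shapes from the question.
images = [np.zeros((480, 640), np.uint8), np.ones((490, 640), np.uint8)]

out_dir = tempfile.mkdtemp()  # hypothetical output location

# Save each array in numpy's binary format and remember where it went.
paths = []
for i, img in enumerate(images):
    path = os.path.join(out_dir, f'image_{i}.npy')
    np.save(path, img)
    paths.append(path)

# The CSV now holds only simple columns: a label and a path per row.
data = pd.DataFrame({'label': ['a', 'b'], 'Array path': paths})
csv_path = os.path.join(out_dir, 'new_dataset.csv')
data.to_csv(csv_path, index=False)

# In the other notebook: read the CSV, then load the arrays from disk.
loaded = pd.read_csv(csv_path)
restored = [np.load(p) for p in loaded['Array path']]
print(restored[1].shape, restored[1].dtype)  # (490, 640) uint8
```

Shape and dtype survive the round trip this way, and no string parsing is needed.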

And for the most part, avoid dataframes/series like this. A Series can have object dtype, and each object element can be complex, such as a list, dictionary, or array. But I don't think pandas has special functions to handle those cases.

numpy also has object dtypes (as with my arr), but a list is often just as good, if not better. Constructing such an array can be tricky. Math on such an array is hit or miss. Iteration on an object array is slower than iteration on a list.
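
A small illustration of why construction can be tricky (the variable names are made up for this sketch): when the inner arrays share a shape, np.array with dtype=object does not produce a 1D array of arrays but one multidimensional object array. Preallocating with np.empty and assigning element by element, as with arr above, gives the 1D container.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

# Same-shaped inputs are merged into a single 3D object array.
stacked = np.array([a, b], dtype=object)
print(stacked.shape)  # (2, 2, 3)

# Preallocate, then assign: each slot holds one whole 2D array.
holder = np.empty(2, dtype=object)
holder[0] = a
holder[1] = b
print(holder.shape)     # (2,)
print(holder[1].shape)  # (2, 3)

# A plain list holds the same arrays with less ceremony.
as_list = [a, b]
```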

===

The re module might work as well, for example by replacing whitespace with commas:

In [408]: re.sub(r'\s+',',',astr)
Out[408]: '[[,0,1,2,3],[,4,5,6,7],[,8,9,10,11]]'

Still not quite right. There are leading commas that will choke eval.
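
One way around the leading commas, as a sketch rather than a tested general solution: only insert a comma where the whitespace sits between two values or between two rows, using lookarounds, and then hand the result to ast.literal_eval instead of eval. The pattern below is my own and assumes integer data (it would not handle floats printed like '1.').

```python
import ast
import re

import numpy as np

astr = '[[ 0  1  2  3]\n [ 4  5  6  7]\n [ 8  9 10 11]]'

# Replace whitespace with a comma only when it follows a digit or ']'
# and precedes a digit, '[' or a minus sign. Whitespace right after
# an opening '[' is left alone, so no leading commas appear.
fixed = re.sub(r'(?<=[\d\]])\s+(?=[\d\[-])', ',', astr)
print(fixed)  # [[ 0,1,2,3],[ 4,5,6,7],[ 8,9,10,11]]

# literal_eval only accepts Python literals, which is safer than eval.
result = np.array(ast.literal_eval(fixed))
print(result.shape)  # (3, 4)
```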
