Python numpy: split a CSV file by the values of a string column


Problem description


I have 5000 rows of data in a CSV file that look like the following. I would like to group them by the last column (index 6, with values such as A and B) using numpy arrays, because I will be plotting the data in each group afterwards.

Title
Date, Time, Value1, Value2, Value3, Value4, Value5
,, Unit1, Unit2, Unit3,,
2012-04-02,00:00, 85.5333333333333, 4.87666666666667,    8.96,  323.27,A
2012-04-02,00:30, 196.5, 5.49,    8.42,  323.15,B
2012-04-02,01:00, 68.2, 4.47,    7.83,  325.30,A
2012-04-02,01:30, 320.9, 6.77333333333333,    8.05,  326.63,B

I had to specify dtype=None when loading the data with np.genfromtxt, otherwise the A values become NaN (see "How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?").

I am trying to use itertools.groupby to return all the values based on the last column, as described in "How do I use Python's itertools.groupby()?". But first, I would need to sort the numpy array.

I attempted to use advanced indexing, slicing out the sixth column and sorting on it as in "Python (Numpy) array sorting", i.e. v[v[:,0].argsort()].

However, another question ("Numpy Array Column Slicing Produces IndexError: invalid index Exception") mentions that numpy will treat my records as a 1D array of my dtype (which was set to None), and I ran into the same index error when trying to sort.
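The index error follows from the shape of the result: with dtype=None and mixed column types, np.genfromtxt returns a one-dimensional structured array, so there is no second axis to slice. The following is a minimal sketch of sorting and grouping such an array by field name instead. It assumes the default field names f0 to f6, a NumPy version that accepts the skip_header and encoding arguments, and an in-memory sample standing in for file.csv.

```python
import io
from itertools import groupby

import numpy as np

# In-memory stand-in for file.csv, using the sample rows from the question.
sample = io.StringIO(
    "Title\n"
    "Date, Time, Value1, Value2, Value3, Value4, Value5\n"
    ",, Unit1, Unit2, Unit3,,\n"
    "2012-04-02,00:00, 85.5333333333333, 4.87666666666667,    8.96,  323.27,A\n"
    "2012-04-02,00:30, 196.5, 5.49,    8.42,  323.15,B\n"
    "2012-04-02,01:00, 68.2, 4.47,    7.83,  325.30,A\n"
    "2012-04-02,01:30, 320.9, 6.77333333333333,    8.05,  326.63,B\n"
)

data = np.genfromtxt(sample, delimiter=",", skip_header=3,
                     dtype=None, encoding="utf-8")

print(data.shape)        # a single axis, which is why data[:, 6] fails
print(data.dtype.names)  # default field names assigned by genfromtxt

# Select the last column by field name and sort the records by it.
order = data["f6"].argsort(kind="stable")
data_sorted = data[order]

# groupby only merges adjacent rows with equal keys, hence the sort above.
groups = {
    str(key): list(rows)
    for key, rows in groupby(data_sorted, key=lambda row: row["f6"])
}
```

Each entry of groups then holds the records for one key, for example groups["A"], and can be plotted or written out separately.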

Questions:

1) How can I split the numpy array up using groupby based on column 6’s string values in order to plot them separately?

2) It would also be nice to be able to skip rows selectively, so that I can skip the first row (title) and the third row (units) and keep the second row (column headings) and the data. Does anyone know how to do that easily with the options available?

This is the script I have so far:

import numpy as np
from matplotlib import pyplot as plt
from itertools import groupby
import csv

regression_data_dp1 = np.genfromtxt("file.csv", delimiter=',', skiprows=3, dtype=None)

sortindex = regression_data_dp1[:,6]

#Error is hit at this step:
#    sortindex = regression_data_dp1[:,6]
#IndexError: invalid index

regression_data_dp1_sorted = regression_data_dp1[ regression_data_dp1(:,column_WRF_wind_direction).argsort()]

for key, group in groupby(regression_data_dp1, lambda x: x[0]):
    print key

    with open("file_" + key.strip() + ".csv", 'w') as data_file:
        wr=csv.writer(data_file, quoting=csv.QUOTE_ALL)
        for item in (group):            
            wr.writerow(item)

Solution

Instead of sorting the rows of the array and using itertools.groupby, you could use group = arr[arr['f6'] == key] to select the rows with the same key (f6 is the default name of the last field; the code below renames it to Value5):

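import csv

import numpy as np

def load_csv(filename):
    # Line 1 is the title, line 2 holds the column names, line 3 the units.
    with open(filename) as f:
        next(f)
        header = [item.strip() for item in next(f).split(',')]
    # skip_header replaces skiprows, which newer NumPy releases no longer accept.
    arr = np.genfromtxt(filename, delimiter=',', skip_header=3,
                        dtype=None, encoding='utf-8')
    arr.dtype.names = tuple(header)
    return arr

arr = load_csv("file.csv")
keys = np.unique(arr['Value5'])

for key in keys:
    # A boolean mask selects every row whose last column equals this key.
    group = arr[arr['Value5'] == key]
    filename = 'file_{}.csv'.format(key.strip())
    # newline='' lets the csv module control line endings (Python 3).
    with open(filename, 'w', newline='') as data_file:
        wr = csv.writer(data_file, quoting=csv.QUOTE_ALL)
        wr.writerows(group.tolist())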

There is no direct facility to tell np.genfromtxt to use the second line as a header. The simplest approach would probably be to open the file, slurp the second line into a list of headers, close the file, then use genfromtxt to load the array and use arr.dtype.names = header to give the structured array the desired column names.
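As a variation on that approach, the header can be read from the same open handle that is then passed to genfromtxt, so the file is opened only once. This is a sketch under assumptions rather than the answer's exact method: load_with_header is a hypothetical helper name, the input is an in-memory stream standing in for file.csv, and the NumPy version is assumed to accept the encoding argument.

```python
import io

import numpy as np

SAMPLE_TEXT = (
    "Title\n"
    "Date, Time, Value1, Value2, Value3, Value4, Value5\n"
    ",, Unit1, Unit2, Unit3,,\n"
    "2012-04-02,00:00, 85.5333333333333, 4.87666666666667,    8.96,  323.27,A\n"
    "2012-04-02,00:30, 196.5, 5.49,    8.42,  323.15,B\n"
    "2012-04-02,01:00, 68.2, 4.47,    7.83,  325.30,A\n"
    "2012-04-02,01:30, 320.9, 6.77333333333333,    8.05,  326.63,B\n"
)


def load_with_header(stream):
    """Hypothetical helper: line 2 holds the names, data starts on line 4."""
    next(stream)  # skip the title line
    header = [name.strip() for name in next(stream).split(",")]
    next(stream)  # skip the units line
    # genfromtxt reads whatever lines remain in the stream.
    arr = np.genfromtxt(stream, delimiter=",", dtype=None, encoding="utf-8")
    arr.dtype.names = tuple(header)
    return arr


arr = load_with_header(io.StringIO(SAMPLE_TEXT))

# One sub-array per distinct value in the last column.
groups = {
    str(key): arr[arr["Value5"] == key]
    for key in np.unique(arr["Value5"])
}
```

With a real file, the same function could be called as load_with_header(open("file.csv")) inside a with block; each groups[key] is a structured array whose columns can be plotted by name, such as groups["A"]["Value1"].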
