Python numpy split a csv file by the values of a string column
I have 5000 rows of data in a csv file that look like the following. I would like to group them by the last column, column 6 (i.e. A, B), using numpy arrays, as I will be plotting the data in each group afterwards.
Title
Date, Time, Value1, Value2, Value3, Value4, Value5
,, Unit1, Unit2, Unit3,,
2012-04-02,00:00, 85.5333333333333, 4.87666666666667, 8.96, 323.27,A
2012-04-02,00:30, 196.5, 5.49, 8.42, 323.15,B
2012-04-02,01:00, 68.2, 4.47, 7.83, 325.30,A
2012-04-02,01:30, 320.9, 6.77333333333333, 8.05, 326.63,B
I had to specify dtype=None when loading the data with np.genfromtxt, or else the A term becomes NaN (see "How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?").
I am trying to use itertools.groupby to return all the values based on the last column, as described in "How do I use Python's itertools.groupby()?". But first, I would need to sort the numpy array.
I attempted to use advanced indexing, by slicing the sixth column and sorting on it (see "Python (Numpy) array sorting"), i.e. v[v[:,0].argsort()]
However, another question mentions that numpy will treat my records as a 1D array of my dtype (which was set to None), and I ran into the same index error when trying to sort: "Numpy Array Column Slicing Produces IndexError: invalid index Exception".
Questions:
1) How can I split the numpy array up using groupby based on column 6’s string values, in order to plot the groups separately?
2) It would also be nice to be able to use skiprows so that I can skip the first row (title) and the third row (units) and keep the second row (column headings) and the data. Does anyone know how to do that easily with the options available?
This is the script I have so far:
import numpy as np
from matplotlib import pyplot as plt
from itertools import groupby
import csv
regression_data_dp1 = np.genfromtxt("file.csv", delimiter=',', skiprows=3, dtype=None)
sortindex = regression_data_dp1[:,6]
#Error is hit at this step:
# sortindex = regression_data_dp1[:,6]
#IndexError: invalid index
regression_data_dp1_sorted = regression_data_dp1[ regression_data_dp1(:,column_WRF_wind_direction).argsort()]
for key, group in groupby(regression_data_dp1, lambda x: x[0]):
    print key
    with open("file_" + key.strip() + ".csv", 'w') as data_file:
        wr = csv.writer(data_file, quoting=csv.QUOTE_ALL)
        for item in (group):
            wr.writerow(item)
Instead of sorting the rows of the array and using itertools.groupby, you could use group = arr[arr['f6']==key] to select the rows with the same key ('f6' is the default name genfromtxt gives the seventh field when no names are set):
import numpy as np
import csv

def load_csv(filename):
    # Read the column names from the second line of the file.
    with open(filename) as f:
        next(f)  # skip the title line
        header = [item.strip() for item in next(f).split(',')]
    # Skip the title, heading and units lines (skip_header replaces the older skiprows argument).
    arr = np.genfromtxt(filename, delimiter=',', skip_header=3, dtype=None)
    arr.dtype.names = header
    return arr

arr = load_csv("file.csv")
keys = np.unique(arr['Value5'])
for key in keys:
    # Boolean mask: keep only the rows whose Value5 equals this key.
    group = arr[arr['Value5'] == key]
    filename = 'file_{}.csv'.format(key.strip())
    with open(filename, 'w') as data_file:
        wr = csv.writer(data_file, quoting=csv.QUOTE_ALL)
        wr.writerows(group)
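The IndexError in the question follows from how genfromtxt handles mixed column types: with dtype=None it returns a 1D structured array (one record per row), so 2D-style slicing such as arr[:,6] has one index too many. A minimal sketch of this, assuming an in-memory sample in place of file.csv and NumPy 1.14 or later for the encoding argument:

```python
import io
import numpy as np

# Hypothetical in-memory sample standing in for the data rows of file.csv.
sample = io.StringIO(
    "2012-04-02,00:00,85.53,4.88,8.96,323.27,A\n"
    "2012-04-02,00:30,196.5,5.49,8.42,323.15,B\n"
    "2012-04-02,01:00,68.2,4.47,7.83,325.30,A\n"
)

# encoding="utf-8" makes the string fields come back as str rather than bytes.
arr = np.genfromtxt(sample, delimiter=",", dtype=None, encoding="utf-8")

print(arr.shape)        # a single dimension: one record per row
print(arr.dtype.names)  # default field names f0 ... f6

try:
    arr[:, 6]           # 2D-style indexing fails on a 1D array
except IndexError as exc:
    print("IndexError:", exc)

last_column = arr["f6"]                     # select the column by field name
order = last_column.argsort(kind="stable")  # indices that sort records by that column
sorted_arr = arr[order]
print(sorted_arr["f6"])
```

Sorting therefore goes through the field name, arr[arr['f6'].argsort()], rather than through a column index.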
There is no direct facility to tell np.genfromtxt to use the second line as a header. The simplest approach would probably be to open the file, slurp the second line into a list of headers, close the file, then use genfromtxt to load the array and use arr.dtype.names = header to give the structured array the desired column names.
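If the sort-then-groupby route from the question is still preferred, it should also work once the column is addressed by field name. The following is only a sketch under stated assumptions: the sample text stands in for file.csv, the lines are passed to genfromtxt as a list of strings, and the groups dictionary is a name introduced here for illustration.

```python
import io
from itertools import groupby
import numpy as np

# Hypothetical sample with the same layout as the file in the question.
sample = io.StringIO(
    "Title\n"
    "Date,Time,Value1,Value2,Value3,Value4,Value5\n"
    ",,Unit1,Unit2,Unit3,,\n"
    "2012-04-02,00:00,85.53,4.88,8.96,323.27,A\n"
    "2012-04-02,00:30,196.5,5.49,8.42,323.15,B\n"
    "2012-04-02,01:00,68.2,4.47,7.83,325.30,A\n"
    "2012-04-02,01:30,320.9,6.77,8.05,326.63,B\n"
)

lines = sample.read().splitlines()
header = [name.strip() for name in lines[1].split(",")]  # second line: column names
data_lines = lines[3:]                                   # drop title, header and units

arr = np.genfromtxt(data_lines, delimiter=",", dtype=None, encoding="utf-8")
arr.dtype.names = tuple(header)

# groupby only merges adjacent equal keys, so sort the row indices by Value5 first.
order = arr["Value5"].argsort(kind="stable")
groups = {}
for key, indices in groupby(order, key=lambda i: arr["Value5"][i]):
    groups[str(key)] = arr[list(indices)]

for key, group in groups.items():
    print(key, group["Time"], group["Value1"])
```

Each value in groups is itself a structured array, so a column such as group['Value1'] can be handed directly to a plotting call.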