Dataframe into numpy array with values comma separated

Question

I've read a CSV (tab-separated) into a DataFrame, which now needs to be converted to a NumPy array for clustering, without changing the types.

So far, following the references I tried (below), I've failed to get the required output. The two columns I'm trying to fetch are int64 / float64, as below:

         uid   iid       rat
0        196   242  3.000000
1        186   302  3.000000
2         22   377  1.000000

I'm interested only in iid and rat for the moment, and want to pass them to KMeans.fit(), without exponent notation (the e+02 form) in the values. I need it in the following format:

Expected format

[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]

Failed attempt

X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print(someArray)

which doesn't come out as needed when run:

[[[  2.42000000e+02]
  [  3.02000000e+02]
  [  3.77000000e+02]
  ..., 
  [  1.35200000e+03]
  [  1.62600000e+03]
  [  1.65900000e+03]]
 [[  3.00000000e+00]
  [  3.00000000e+00]
  [  1.00000000e+00]
  ..., 
  [  1.00000000e+00]
  [  1.00000000e+00]
  [  1.00000000e+00]]]
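
A note on the output above: `values[:, 1:2]` and `values[:, 2:3]` are each two-dimensional slices of shape (n, 1), so wrapping them in `np.array([X, Y])` stacks them into a (2, n, 1) array rather than the (n, 2) layout wanted. The exponent notation is only how NumPy prints floats; the stored values are unchanged. A minimal sketch, assuming `values` is the array taken from the DataFrame (rebuilt here by hand from the three sample rows):

```python
import numpy as np

# Assumption: `values` stands in for the DataFrame's underlying array,
# rebuilt here from the three sample rows (uid, iid, rat).
values = np.array([[196, 242, 3.0],
                   [186, 302, 3.0],
                   [22, 377, 1.0]])

X = values[:, 1:2]            # shape (3, 1)
Y = values[:, 2:3]            # shape (3, 1)
stacked = np.array([X, Y])    # shape (2, 3, 1): the nested layout printed above

# Slicing both columns at once keeps one row per sample.
pairs = values[:, 1:3]        # shape (3, 2)

# Display only: suppress exponent notation without touching the data.
np.set_printoptions(suppress=True)
print(pairs)
```

`np.set_printoptions` changes how arrays are displayed for the rest of the session; it does not alter what is passed to a clustering routine.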

References that haven't helped so far

  1. This one
  2. This two
  3. This three
  4. This four

Edit 1

Tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02
    4.79000000e+02   4.79000000e+02]
 [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03
    1.39400000e+03   1.65200000e+03]
 [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00
    1.92803605e+00   1.00000000e+00]]

Recommended answer

It seems you need read_csv to build the DataFrame first, keeping only the second and third columns, and then convert it to a NumPy array with values:

import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO

temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace StringIO(temp) with 'filename.csv' (add sep='\t' for a tab-separated file)
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
   iid  rat
0    1    0
1    2    4
2    3    3
3    4    1

X = df.values 
print (X)
[[1 0]
 [2 4]
 [3 3]
 [4 1]]

kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
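
Applied to the tab-separated data from the question, the same approach might look like the sketch below. The in-memory text stands in for the real file, and both `sep="\t"` and selecting columns by name through `usecols` are assumptions about how that file is laid out. Because `iid` is int64 and `rat` is float64, the combined array is upcast to float64.

```python
from io import StringIO

import pandas as pd

# Assumption: stand-in for the tab-separated file, using the sample rows.
raw = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# Read only the two columns of interest.
df = pd.read_csv(StringIO(raw), sep="\t", usecols=["iid", "rat"])

# One row per sample, one column per feature: shape (n_samples, 2).
X = df.to_numpy()
print(X.shape, X.dtype)

# X can then be passed to KMeans(n_clusters=...).fit(X) as in the answer.
```

`DataFrame.to_numpy()` returns the same data as the `values` attribute used in the answer.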
