Dataframe into numpy array with values comma separated
Problem description
I've read a csv (which is \t separated) into a Dataframe, which now needs to be converted to a numpy array for clustering, without changing the types.
So far, following the references I tried (listed below), I've failed to get the required output. The two columns whose values I'm trying to fetch are int64 / float64, as below:
uid iid rat
0 196 242 3.000000
1 186 302 3.000000
2 22 377 1.000000
For the moment I'm interested only in iid and rat, which I want to pass to the KMeans.fit() method, and without EPSILON in it. I need them in the following format:
Expected format
[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]
Failed attempt
X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print someArray
And it does not produce the desired result when executed:
[[[ 2.42000000e+02]
[ 3.02000000e+02]
[ 3.77000000e+02]
...,
[ 1.35200000e+03]
[ 1.62600000e+03]
[ 1.65900000e+03]]
[[ 3.00000000e+00]
[ 3.00000000e+00]
[ 1.00000000e+00]
...,
[ 1.00000000e+00]
[ 1.00000000e+00]
[ 1.00000000e+00]]]
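The output above is three-dimensional because np.array([X, Y]) stacks two (n, 1) column slices on a new leading axis instead of placing them side by side. A minimal sketch of the difference follows; the small `values` array is a hypothetical stand-in for the asker's data, and the call to np.set_printoptions assumes that "EPSILON" refers to the e+02 style of display, which affects printing only and not the stored numbers.

```python
import numpy as np

# Hypothetical stand-in for the asker's `values` array (columns: uid, iid, rat).
values = np.array([[196, 242, 3.0],
                   [186, 302, 3.0],
                   [22, 377, 1.0]])

X = values[:, 1:2]  # iid as a column, shape (3, 1)
Y = values[:, 2:3]  # rat as a column, shape (3, 1)

# Stacks the two columns on a new first axis: shape (2, 3, 1).
stacked = np.array([X, Y])

# Places the two columns next to each other: shape (3, 2).
side_by_side = np.hstack([X, Y])

# Slicing both columns at once gives the same (3, 2) result.
sliced = values[:, 1:3]

# Display-only setting: print floats without scientific notation.
np.set_printoptions(suppress=True)
print(stacked.shape, side_by_side.shape)
print(side_by_side)
```

Either the (3, 2) result of np.hstack or the direct slice has the samples-by-features layout that a fit() method of a scikit-learn estimator expects.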
References that have not helped so far
- This one
- This two
- This three
- This four
Edit 1
Tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True)
and got
[[ nan 1.96000000e+02 1.86000000e+02 ..., 4.79000000e+02
4.79000000e+02 4.79000000e+02]
[ nan 2.42000000e+02 3.02000000e+02 ..., 1.36000000e+03
1.39400000e+03 1.65200000e+03]
[ nan 3.00000000e+00 3.00000000e+00 ..., 2.00000000e+00
1.92803605e+00 1.00000000e+00]]
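In the genfromtxt output above, the leading nan in each row most likely comes from the header line being parsed as a number, and unpack=True transposes the result so that each column becomes a row. A sketch of a call that skips the header and keeps only the second and third columns is shown here; the in-memory sample text is a hypothetical substitute for AllData.csv, whose full contents are not shown in the question.

```python
from io import StringIO

import numpy as np

# Hypothetical tab-separated sample mimicking the first rows of the asker's file.
sample = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# skip_header=1 drops the column names; usecols picks iid and rat (0-based 1 and 2).
# Without unpack=True the result keeps one row per sample.
arr = np.genfromtxt(StringIO(sample), delimiter="\t", skip_header=1, usecols=(1, 2))

np.set_printoptions(suppress=True)  # display-only: no scientific notation
print(arr.shape)
print(arr)
```

With a real file, the StringIO object would be replaced by the file name.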
Recommended answer
It seems you need read_csv to create the DataFrame first, keeping only the second and third columns, and then convert it to a numpy array with values:
import pandas as pd
from sklearn.cluster import KMeans
from io import StringIO  # pandas.compat.StringIO is no longer available in recent pandas versions
temp=u"""col,iid,rat
4,1,0
5,2,4
6,3,3
7,4,1"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), usecols = [1,2])
print (df)
iid rat
0 1 0
1 2 4
2 3 3
3 4 1
X = df.values
print (X)
[[1 0]
[2 4]
[3 3]
[4 1]]
kmeans = KMeans(n_clusters=2)
a = kmeans.fit(X)
print (a)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
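The answer's sample is comma-separated, while the file in the question is tab-separated. The following sketch adapts the same idea under that assumption: it passes sep="\t", selects the columns by name, and uses DataFrame.to_numpy(). The sample text is hypothetical. Note that a 2-D numpy array has a single dtype, so an int64 column combined with a float64 column comes out as float64; the values themselves are unchanged.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample; with a real file, pass its name instead of StringIO(sample).
sample = "uid\tiid\trat\n196\t242\t3.0\n186\t302\t3.0\n22\t377\t1.0\n"

# Read only the two columns of interest from the tab-separated data.
df = pd.read_csv(StringIO(sample), sep="\t", usecols=["iid", "rat"])

# Convert to a (n_samples, 2) array; the int and float columns share one float64 dtype.
X = df[["iid", "rat"]].to_numpy()

np.set_printoptions(suppress=True)  # display-only: no scientific notation
print(X.dtype)
print(X)
```

The resulting X has one row per sample and one column per feature, so it can be passed to kmeans.fit(X) in the same way as in the answer.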