Strange error during conversion from pandas DataFrame to NumPy array
Question
I have a pandas DataFrame with two columns: "review" (text) and "sentiment" (1/0).
X_train = df.loc[0:25000, 'review'].values
y_train = df.loc[0:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
But after converting to NumPy arrays with the .values attribute, I get arrays of the following shapes:
print(df.shape) #(50000, 2)
print(X_train.shape) #(25001,)
print(y_train.shape) #(25001,)
print(X_test.shape) # (25000,)
print(y_test.shape) # (25000,)
So, as you can see, taking .values seems to have added one extra row to the training arrays. This is really strange, and I can't find the error.
Recommended answer
df.loc is label based, so a slice includes its upper bound. Use iloc:
df.iloc[:25000, 1].values # here 1 is the column of 'review' for example
if you want NumPy-like slicing.
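The difference can be reproduced with a small sketch. The DataFrame below is a hypothetical stand-in for the question's data (50000 rows, default integer index, 'review' placed at column position 0), not the asker's actual file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: 50000 rows, default RangeIndex.
df = pd.DataFrame({
    "review": ["some text"] * 50000,
    "sentiment": np.arange(50000) % 2,
})

# Label-based slicing: both end labels are included,
# so the row labelled 25000 lands in both halves.
loc_train = df.loc[0:25000, "review"].values
loc_test = df.loc[25000:, "review"].values

# Position-based slicing: the stop position is excluded, as in NumPy.
# Column 0 is 'review' in this stand-in frame.
iloc_train = df.iloc[:25000, 0].values
iloc_test = df.iloc[25000:, 0].values

print(loc_train.shape, loc_test.shape)    # (25001,) (25000,)
print(iloc_train.shape, iloc_test.shape)  # (25000,) (25000,)
```

Note that with loc the two halves overlap by one row (label 25000), which for a train/test split means one sample appears in both sets.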
With iloc you need to supply both rows and columns as integers or integer slices.
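If you only know the column by name, one way to get its integer position for iloc is df.columns.get_loc. This is a minimal sketch on a made-up three-row frame, and it assumes the column label is unique (so that get_loc returns a single integer):

```python
import pandas as pd

# Made-up example frame with the same column names as in the question.
df = pd.DataFrame({"review": ["good", "bad", "fine"], "sentiment": [1, 0, 1]})

# Look up the integer position of the column label (assumes a unique label).
review_pos = df.columns.get_loc("review")

# Rows 0 and 1 (stop position 2 is excluded), column selected by position.
first_two = df.iloc[:2, review_pos].values
print(review_pos, first_two)
```

This avoids hard-coding the column position, which can change if the column order changes.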
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
This is label based, i.e. upper bound inclusive:
>>> df.loc[:1, 'a']
0    1
1    2
Name: a, dtype: int64
This works like slicing in NumPy, i.e. upper bound exclusive:
>>> df.iloc[:2, 0]
0    1
1    2
Name: a, dtype: int64
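With the default index, labels and positions happen to coincide, which hides the distinction. A small sketch with a non-default index (the labels 10, 20, 30 are chosen here purely for illustration) makes it visible:

```python
import pandas as pd

# Same data as above, but with index labels that differ from the positions.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 20, 30])

# loc slices by label: everything up to and including label 20.
by_label = df.loc[:20, "a"]

# iloc slices by position: positions 0 and 1, stop position 2 excluded.
by_position = df.iloc[:2, 0]

print(by_label)
print(by_position)
```

Both selections return the rows labelled 10 and 20, but they get there by different rules: loc compares against the index labels, iloc counts positions.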