How to convert a pandas DataFrame to a NumPy array?
Following the suggestions I got from my previous question, I'm converting a pandas DataFrame to a numeric NumPy array. To do this I used numpy.asarray.
My data frame:
DataFrame
----------
label vector
0 0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1 0 1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2 0 1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3 0 1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4 1 1:0.003118149 2:-0.015105667 3:0.040879637 4:...
... ... ...
19779 0 1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780 1 1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781 0 1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782 0 1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783 1 1:0.0016790542 2:-0.028346825 3:0.03908631 4:...
[19784 rows x 2 columns]
DataFrame datatypes :
label object
vector object
dtype: object
To convert it into a NumPy array I'm using this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns=['label', 'vector'])
print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)
arr = np.asarray(df, dtype=np.float64)
print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)
I get this error at line 22, arr = np.asarray(df, dtype=np.float64):
ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...
How can I solve this issue?
Regards and thanks for your time
Build the DataFrame with a list comprehension that turns each row's index:value pairs into a dict:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(df)
              1              2            3    4
0  0.0033524514   -0.021896651   0.05087798    0
1    0.02134219   -0.007388343   0.06835007    0
2   0.030515702  -0.0037591448     0.066626    0
3  0.0069114454  -0.0149497045  0.020777626    0
4   0.003118149   -0.015105667  0.040879637  0.4
Then convert the values to floats and the result to a NumPy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
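The snippet above works when each string holds only index:value pairs. If the label is still the first token on each line, as the question's file appears to be, dict() would raise on that token because it has no colon. A minimal sketch of one way to handle this follows, assuming the "label index:value ..." layout shown in the question. The in-memory sample and the names raw, parts, features, X and y are illustrative, not taken from the original file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample in the same "label index:value ..." layout as the
# question; the real file (TSV/A19784.tsv) is not reproduced here.
sample = io.StringIO(
    "0 1:0.0033524514 2:-0.021896651 3:0.05087798\n"
    "0 1:0.02134219 2:-0.007388343 3:0.06835007\n"
    "1 1:0.003118149 2:-0.015105667 3:0.040879637\n"
)

# Each line has no tab, so it lands whole in the single "vector" column.
raw = pd.read_csv(sample, sep="\t", names=["vector"])

# Split off the label at the first space only; the rest stays one string.
parts = raw["vector"].str.split(" ", n=1, expand=True)
parts.columns = ["label", "vector"]

# One dict per row, e.g. {"1": "0.0033524514", "2": "-0.021896651", ...}.
features = pd.DataFrame(
    [dict(pair.split(":") for pair in row.split()) for row in parts["vector"]]
)

# Strings become floats here; the labels are converted separately.
X = features.astype(float).to_numpy()
y = parts["label"].astype(int).to_numpy()

print(X.shape, X.dtype)
print(y)
```

If some rows omit an index, the corresponding cells come out as NaN, so a fillna(0) before to_numpy() may be wanted. Columns also appear in the order the keys are first seen, which is worth checking when the indices are not already sorted.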