How to populate an array with multiple rows from a CSV file using Python pandas
Question
I am importing a CSV file using pandas.
CSV column headers: Year, Model, Trim, Result
The values coming in from the CSV file are as follows:
Year | Model | Trim | Result
2012 | Camry | SR5 | 1
2014 | Tacoma | SR5 | 1
2014 | Camry | XLE | 0
etc..
There are 2500+ rows in the data set, containing over 200 unique models.
All values are then converted to numerical values for analysis purposes.
The inputs are the first 3 columns of the CSV file, and the output is the fourth column, Result.
Here is my script:
import pandas as pd
import numpy as np
c1 = []
c2 = []
c3 = []
input = []
output = []
# read in the csv file containing 4 columns
df = pd.read_csv('success.csv')
df.convert_objects(convert_numeric=True)
df.fillna(0, inplace=True)
# convert string values to numerical values
def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x+=1
            df[column] = list(map(convert_to_int, df[column]))
    return df
df = handle_non_numerical_data(df)
# extract each column to insert into input array later
c1.append(df['Year'])
c2.append(df['Model'])
c3.append(df['Trim'])
# create input array containing the first 3 columns of the csv file
input = np.column_stack((c1, c2, c3))
output.append(df['Result'])
This works fine, except that append only accepts one value. Should I use extend instead, since it seems to attach values to the end of the array?
Update
Essentially all of this works great; my problem is creating the input array. I would like each row of the array to consist of 3 columns: Year, Model, Trim.
input = ([['Year'], ['Model'], ['Trim']],[['Year'], ['Model'], ['Trim']]...)
I can only seem to stack one column's values on top of another, rather than having the three values in sequence within each row.
What I am getting now:
input = ([['Year'], ['Year'], ['Year']].., [['Model'], ['Model'], ['Model']]..[['Trim'], ['Trim'], ['Trim']]...)
Recommended answer
To elaborate on my comment, suppose you have some DataFrame consisting of non-integer values:
>>> df = pd.DataFrame([[np.random.choice(list('abcdefghijklmnop')) for _ in range(3)] for _ in range(10)])
>>> df
0 1 2
0 j p j
1 d g b
2 n m f
3 o b j
4 h c a
5 p m n
6 c c l
7 o d e
8 b g h
9 h o k
And an output column:
>>> df['output'] = np.random.randint(0,2,10)
>>> df
0 1 2 output
0 j p j 0
1 d g b 0
2 n m f 1
3 o b j 1
4 h c a 1
5 p m n 0
6 c c l 1
7 o d e 0
8 b g h 1
9 h o k 0
To convert all the string values to integers, use np.unique with return_inverse=True. The inverse is the array you need; just keep in mind that you need to reshape it, because np.unique will have flattened it:
>>> unique, inverse = np.unique(df.iloc[:,:3].values, return_inverse=True)
>>> unique
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n',
'o', 'p'], dtype=object)
>>> inverse
array([ 8, 14, 8, 3, 6, 1, 12, 11, 5, 13, 1, 8, 7, 2, 0, 14, 11,
12, 2, 2, 10, 13, 3, 4, 1, 6, 7, 7, 13, 9])
>>> input = inverse.reshape(df.shape[0], df.shape[1] - 1)
>>> input
array([[ 8, 14, 8],
[ 3, 6, 1],
[12, 11, 5],
[13, 1, 8],
[ 7, 2, 0],
[14, 11, 12],
[ 2, 2, 10],
[13, 3, 4],
[ 1, 6, 7],
[ 7, 13, 9]])
You can always go back to the original values:
>>> unique[input]
array([['j', 'p', 'j'],
['d', 'g', 'b'],
['n', 'm', 'f'],
['o', 'b', 'j'],
['h', 'c', 'a'],
['p', 'm', 'n'],
['c', 'c', 'l'],
['o', 'd', 'e'],
['b', 'g', 'h'],
['h', 'o', 'k']], dtype=object)
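Applied to the column names from the question, the same idea might look like the sketch below. The three sample rows are copied from the question; encoding only the string columns and leaving Year numeric is an assumption made for illustration, not something the answer above specifies.

```python
import numpy as np
import pandas as pd

# Three sample rows copied from the question (the real file has 2500+ rows).
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": ["Camry", "Tacoma", "Camry"],
    "Trim": ["SR5", "SR5", "XLE"],
    "Result": [1, 1, 0],
})

# Encode the string columns together with np.unique, as in the answer above.
text = df[["Model", "Trim"]].to_numpy(dtype=str)
unique, inverse = np.unique(text, return_inverse=True)
codes = inverse.reshape(text.shape)  # one row of codes per CSV row

# Assumption: Year is already numeric, so it is kept as-is.
features = np.column_stack((df["Year"].to_numpy(), codes))
print(features)
```

Each row of features then holds the Year, Model and Trim values for one record, and unique[codes] recovers the original strings.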
To get an array for the output, again, simply use the .values of the df, taking the appropriate column, since these are already numpy arrays!
>>> output = df['output'].values
>>> output
array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
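The same attribute also works on a multi-column selection, which gives the one-row-per-record layout the question asks for. This is a minimal sketch, assuming the column names from the question and a small hand-made frame standing in for the already-encoded data; the list-based variant is only a guess at what the question's script effectively does.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-numeric stand-in for the encoded CSV data.
df = pd.DataFrame({
    "Year": [2012, 2014, 2014],
    "Model": [0, 1, 0],
    "Trim": [0, 0, 1],
    "Result": [1, 1, 0],
})

# Selecting several columns yields a 2-D array: one row per record.
features = df[["Year", "Model", "Trim"]].values
target = df["Result"].values

# Appending a whole column to a list gives a list holding one long array;
# stacking such lists keeps each column contiguous, which appears to be
# the layout described in the question.
c1 = [df["Year"].to_numpy()]
c2 = [df["Model"].to_numpy()]
c3 = [df["Trim"].to_numpy()]
stacked = np.column_stack((c1, c2, c3))

print(features.shape)  # (3, 3)
print(stacked.shape)   # (1, 9)
```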
You might want to reshape it, depending on which libraries you are going to use for analysis (sklearn, scipy, etc.):
>>> output.reshape(output.size, 1)
array([[0],
[0],
[1],
[1],
[1],
[0],
[1],
[0],
[1],
[0]])
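A common shorthand for the same reshape is to pass -1 for the dimension NumPy should infer. A small sketch using the output values shown above:

```python
import numpy as np

output = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

# -1 asks NumPy to infer the number of rows from the array's size.
column = output.reshape(-1, 1)
print(column.shape)  # (10, 1)

# ravel() flattens the column vector back to one dimension.
flat = column.ravel()
print(flat.shape)  # (10,)
```

Whether the column form or the flat form is needed depends on the library, so it is worth checking what shape the function you call expects.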