How to keep the index after normalizing columns of a pandas DataFrame to different ranges


Question

I normalized multiple columns of a pandas DataFrame in a for-loop, under the following conditions:

Columns A and B: normalized to [-1, +1]

Column C: normalized to [-40, +150]

The results go into a separate DataFrame, call it norm_data, which is then stored as a CSV file.

My data is a dataset stored in a text file.

# Import and call the needed libraries
import numpy as np
import pandas as pd

#Normalizing Formula

def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm)*((value - min_value)/(max_value - min_value))) + min_norm
    return new_value

#Split data in three different lists A, B and C

df1 = pd.read_csv(r'D:\me4.TXT', header=None)  # raw string so the backslash is taken literally
id_set = df1[df1.index % 4 == 0].astype('int').values
A = df1[df1.index % 4 == 1].values
B = df1[df1.index % 4 == 2].values
C = df1[df1.index % 4 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]} # arrays
#df contains all the data
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0]) 
df2 = pd.DataFrame(data, index= id_set[0:])
print(df)

#--------------------------------
cycles = int(len(df)/480)
print(cycles)

#next iteration create all plots, change the numer of cycles
for i in df:
    min_val = df[i].min()
    max_val = df[i].max()
    if i=='C':
        #Applying normalization for C between [-40,+150]
        data['C'] = normalize(df[i].values, min_val, max_val, -40, 150)
    elif i=='A':
        #Applying normalization for A , B between [-1,+1]
        data['A'] = normalize(df[i].values, min_val, max_val, -1, 1)
    else:
        data['B'] = normalize(df[i].values, min_val, max_val, -1, 1)


norm_data = pd.DataFrame(data)
print(norm_data)
norm_data.to_csv('norm.csv')
df2.to_csv('my_file.csv')
print(df2)

The problem is that after normalizing (with help from @Lucas) I lost my index, which was labeled id_set.

So far I get the output below in my_file.csv, along with this error: TypeError unsupported format string passed to numpy.ndarray.__format__:

id_set         A         B           C
['0']      2.291171  -2.689658  -344.047912
['10']     2.176816  -4.381186  -335.936524
['20']     2.291171  -2.589725  -342.544885
['30']     2.176597  -6.360999     0.000000
['40']     2.577268  -1.993412  -344.326376
['50']     9.844076  -2.690917  -346.125859
['60']     2.061782  -2.889378  -346.378859
['70']     2.348300  -2.789547  -347.980986
['80']     6.973350  -1.893454  -337.884738
['90']     2.520040  -3.087004  -349.209006

The [''] wrappers around the index values are unwanted. After normalization, my desired output should look like this:

id_set     A         B           C
000   -0.716746  0.158663  112.403310
010   -0.726023  0.037448  113.289702
020   -0.716746  0.165824  112.567557
030   -0.726040 -0.104426  150.000000
040   -0.693538  0.208556  112.372881
050   -0.104061  0.158573  112.176238
060   -0.735354  0.144351  112.148590
070   -0.712112  0.151505  111.973514
080   -0.336932  0.215719  113.076807
090   -0.698181  0.130189  111.839319
010    0.068357 -0.019388  114.346421
011    0.022007  0.165824  112.381444

Any ideas would be welcome, since this is important data for me.

Recommended answer

If I understand you correctly, my_file.csv (that is, df2) should look like the second output in your question. In that case I believe you just have a typo in how you construct df2. You want it to have the same index as df, which is built with id_set[:,0], so use:

df2 = pd.DataFrame(data, index = id_set[:,0])

instead of

df2 = pd.DataFrame(data, index= id_set[0:])

(Note the contents of the square brackets.) This makes the output file my_file.csv look like this:

,A,B,C
0,2.19117130798,-2.5897247305,-342.54488522400004
10,2.19117130798,-4.3811855641,-335.936524309
20,2.19117130798,-2.5897247305,-342.54488522400004
...
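
The difference between the two slices is one of shape: id_set[0:] keeps every row of a two-dimensional array, while id_set[:,0] picks out its first column as a flat array. The following is a minimal sketch, assuming id_set is a one-column 2-D NumPy array (as .values returns for a single-column DataFrame); the sample ids and data values are made up for illustration, and naming the index is optional.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for id_set: one column, two dimensions.
id_set = np.array([[0], [10], [20]])

# id_set[0:] slices rows only, so the result is still two-dimensional.
print(id_set[0:].shape)    # (3, 1)

# id_set[:, 0] selects the first column and yields a flat 1-D array.
print(id_set[:, 0].shape)  # (3,)

# Made-up data values; a 1-D index gives plain labels 0, 10, 20.
data = {'A': [2.29, 2.18, 2.29], 'B': [-2.69, -4.38, -2.59]}
df2 = pd.DataFrame(data, index=id_set[:, 0])

# Optional: name the index so the CSV header reads id_set instead of being blank.
df2.index.name = 'id_set'
print(df2)
```

With the index named, to_csv writes id_set as the header of the first column.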

The output file norm.csv looks like this:

,A,B,C
0,-1.0,0.16582420581574775,145.05394742081884
1,-1.0,0.037447604422215175,145.9298596578588
2,-1.0,0.16582420581574775,145.05394742081884
...

If you want the output file norm.csv to have the same index (0, 10, 20 instead of 0, 1, 2, ...), you need to define norm_data as

norm_data = pd.DataFrame(data, index = id_set[:,0])

instead of

norm_data = pd.DataFrame(data)
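
Another way to keep the index is to normalize the columns of df itself, working on a copy, rather than rebuilding a frame from raw arrays; the index then carries over without being passed again. This is only a sketch under assumptions, not the original poster's code: the normalize_columns helper and the ranges dictionary are hypothetical names, and the sample values are made up.

```python
import pandas as pd

def normalize(value, min_value, max_value, min_norm, max_norm):
    """Linearly rescale value from [min_value, max_value] to [min_norm, max_norm]."""
    return (max_norm - min_norm) * ((value - min_value) / (max_value - min_value)) + min_norm

def normalize_columns(df, ranges):
    """Return a copy of df with each listed column rescaled; the index is untouched."""
    out = df.copy()
    for col, (lo, hi) in ranges.items():
        out[col] = normalize(df[col], df[col].min(), df[col].max(), lo, hi)
    return out

# Made-up sample data with the 0, 10, 20 style index from the question.
df = pd.DataFrame(
    {'A': [2.0, 4.0, 6.0], 'B': [-3.0, -1.0, -2.0], 'C': [-350.0, 0.0, -175.0]},
    index=[0, 10, 20],
)

# Target range per column: A and B to [-1, +1], C to [-40, +150].
ranges = {'A': (-1, 1), 'B': (-1, 1), 'C': (-40, 150)}
norm_data = normalize_columns(df, ranges)
print(norm_data)
```

Because the arithmetic is applied to pandas Series rather than to .values, each result stays aligned with the original index labels.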

I should also note that your data contains a couple of NaN/inf entries, which distort the normalization.

You can deal with them using

df = df.replace(np.inf, np.nan)
df = df.fillna(0)

(credit to this question/answer), and do the same for df2. You can also replace the NaN/inf entries with other values using the same functions.
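
An infinite value matters because it becomes the column's maximum or minimum: max_value - min_value is then infinite, and every finite entry collapses onto one end of the target range. Note that replace(np.inf, np.nan) only catches positive infinity. A slightly broader sketch follows, assuming negative infinity should be handled the same way and that 0 is an acceptable fill value (both are assumptions, and the sample values are made up).

```python
import numpy as np
import pandas as pd

# Made-up sample containing +inf, -inf and NaN.
df = pd.DataFrame(
    {'A': [1.0, np.inf, 3.0], 'C': [-np.inf, np.nan, 5.0]},
    index=[0, 10, 20],
)

# Turn both infinities into NaN, then fill every NaN with 0.
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned)
```

Filling with 0 can itself shift a column's minimum or maximum, so a different fill value, or dropping the affected rows, may suit some datasets better.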
