ValueError: cannot reindex from a duplicate axis after concatenating dataframes


Problem description

I implemented an experimental environment in my project.

This component is based on scikit-learn.

In this component I read a given CSV file into a pandas dataframe. I then select the best features, reducing the dataframe from 100 dimensions to 5. Finally, I add the ID column back to the reduced dataframe for future use, because the dimension-reduction step drops it.

Everything worked fine until I changed my code to read all the CSV files and return a single combined dataframe:

Please see the following code, which reads all the CSV files:

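import pandas as pd
from os import listdir
from os.path import isfile, join

dataframes = []

# Keep only regular files from the directory listing
files_names = [f for f in listdir(full_path_directory_files) if isfile(join(full_path_directory_files, f))]
for file_name in files_names:
    # join() adds the path separator if the directory path lacks one
    full_path_file = join(full_path_directory_files, file_name)

    data_frame = pd.read_csv(full_path_file, index_col=None, compression="infer")
    # Append inside the loop so that every file is collected
    dataframes.append(data_frame)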

After that I concatenated the dataframes:

features_dataframe = pd.concat(dataframes, axis=0)

I also checked the result: I created two different dataframes, each with shape (200, 100), and after concatenation the result had shape (400, 100).

The concatenated dataframe was then passed to the following method:

def _reduce_dimensions_by_num_of_features(self, features_dataframe, truth_dataframe, num_of_features):
    print("Create dataframe with the {0} best features".format(num_of_features))

    # These helper methods return the ids and their class labels

    ids, id_series = self._create_ids_by_dataframe(features_dataframe)
    features_dataframe_truth_class = self._extract_truth_class_by_truth_dataframe(truth_dataframe, ids)


    k_best_classifier = SelectKBest(score_func=f_classif, k=num_of_features)
    k_best_features = k_best_classifier.fit_transform(features_dataframe, features_dataframe_truth_class)

    reduced_dataframe_column_names = self._get_k_best_feature_names(k_best_classifier, features_dataframe)


    reduced_dataframe = pd.DataFrame(k_best_features, columns=reduced_dataframe_column_names)

Then I added the ID column back:

    reduced_dataframe["Id"] = id_series

It fails with the message:

ValueError: cannot reindex from a duplicate axis

This occurs only after the dataframes are concatenated.

How can I add the ID column to the dataframe without getting this error?

Recommended answer

I found the problem:

After the dataframes are concatenated, the index of the result is no longer unique: pd.concat with axis=0 keeps each input's own row labels, so the labels repeat. As a result, when we run the line:

reduced_dataframe["Id"] = id_series

the error occurs.
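
The failure can be reproduced on a small scale. The following is only a sketch: the two toy frames (part_a, part_b) and the single feature column f1 are made up for illustration and stand in for the CSV files and the SelectKBest output from the question. The exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy frames standing in for two CSV files.
# Each one gets its own default index 0..2 when it is created.
part_a = pd.DataFrame({"Id": [10, 11, 12], "f1": [0.1, 0.2, 0.3]})
part_b = pd.DataFrame({"Id": [20, 21, 22], "f1": [0.4, 0.5, 0.6]})

# concat along axis=0 keeps the original row labels, so they repeat.
features_dataframe = pd.concat([part_a, part_b], axis=0)
index_labels = features_dataframe.index.tolist()
index_is_unique = features_dataframe.index.is_unique

# id_series inherits the duplicated index.
id_series = features_dataframe["Id"]

# A frame built from a plain array (like reduced_dataframe in the
# question) gets a fresh, unique index 0..5.
reduced_dataframe = pd.DataFrame({"f1": features_dataframe["f1"].to_numpy()})

# Column assignment aligns id_series to reduced_dataframe's index,
# which is not possible when id_series has duplicate labels.
try:
    reduced_dataframe["Id"] = id_series
    error_message = None
except ValueError as err:
    error_message = str(err)

print(index_labels)
print(error_message)
```

The shape check from the question passes because the number of rows is correct; only the row labels are the problem.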

The solution is to reset the index:

features_dataframe = pd.concat(dataframes, axis=0)
features_dataframe.reset_index(drop=True, inplace=True)
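
Two other ways to reach the same result are sketched here, using the same made-up toy frames as a stand-in for the real data. The first asks pd.concat to build a fresh index through its ignore_index parameter, and the second assigns the IDs by position instead of by label. Both are assumptions about what fits the surrounding code, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy frames standing in for two CSV files.
part_a = pd.DataFrame({"Id": [10, 11, 12], "f1": [0.1, 0.2, 0.3]})
part_b = pd.DataFrame({"Id": [20, 21, 22], "f1": [0.4, 0.5, 0.6]})

# Option 1: ignore_index=True discards the input labels and
# gives the result a new index 0..n-1 in one step.
features_dataframe = pd.concat([part_a, part_b], axis=0, ignore_index=True)
id_series = features_dataframe["Id"]

reduced_dataframe = pd.DataFrame({"f1": features_dataframe["f1"].to_numpy()})
reduced_dataframe["Id"] = id_series  # both indexes are 0..5, so alignment works

# Option 2: keep the duplicated index but assign a plain array,
# which is matched by position and skips index alignment entirely.
duplicated = pd.concat([part_a, part_b], axis=0)
reduced_by_position = pd.DataFrame({"f1": duplicated["f1"].to_numpy()})
reduced_by_position["Id"] = duplicated["Id"].to_numpy()

print(reduced_dataframe)
print(reduced_by_position)
```

Positional assignment is only safe when the rows of both objects are known to be in the same order, as they are here.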
