Replace value with the average of its column - many columns

Problem description

I have an Excel sheet with over 1,000 columns and 11,000 rows, all containing numeric data. Missing values in the data are represented with '*'.

I would like to replace every '*' with the average of the column it is in.

Doing this manually would take a long time, so is there a formula that would achieve this?

Thanks so much in advance for any help.

Solution

Since you mentioned machine learning, here is how you could do this with Azure Machine Learning Studio (AML) using a free account.

AML gives you access to several fast methods for replacing missing values. Its Clean Missing Data module offers replacement methods such as Multivariate Imputation using Chained Equations (MICE), mean, median, and several others. You can visualize the dataset columns by right-clicking the dataset to see which columns are skewed, and then choose a replacement method column by column. For heavily skewed columns, for instance, you might use the median instead of the mean. AML also offers options for data normalization (scale and reduce), and you can run Python and R against your dataset.
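To make the mean-versus-median choice concrete, here is a minimal Python sketch using only the standard library. It is not what the Clean Missing Data module does internally; the function name `suggest_fill` and the 0.5 threshold are assumptions picked purely for illustration.

```python
import statistics

def suggest_fill(values, threshold=0.5):
    """Return ("mean" or "median", fill_value) for a list of known numbers.

    Uses a Pearson-style skew indicator, 3 * (mean - median) / stdev,
    computed with the population standard deviation. The threshold is
    an arbitrary value for this illustration, not a recommended setting.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return "mean", mean  # constant column: mean and median coincide
    skew = 3 * (mean - median) / spread
    if abs(skew) > threshold:
        return "median", median
    return "mean", mean

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
skewed = [1.0, 1.0, 2.0, 2.0, 100.0]  # one large outlier drags the mean up
print(suggest_fill(symmetric))
print(suggest_fill(skewed))
```

In the skewed list the single outlier pulls the mean to 21.2 while the median stays at 2.0, which is why a mean fill would look unrepresentative for most rows of that column.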

I don't know whether AML can treat "*" as a missing value directly (I am trying to find out), but a little preprocessing before the upload takes care of it. Before loading the data:

  1. Export the sheet as a CSV file and save it.
  2. Press Ctrl+F to open the Find and Replace dialog, enter "~*" in Find (the tilde makes Excel match a literal asterisk instead of treating it as a wildcard), leave Replace blank, replace all occurrences, and save the file.
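If you would rather script this preprocessing step than use Find and Replace, the following Python sketch does the same blanking with the standard `csv` module. The function name `blank_asterisks` is hypothetical, and the sketch assumes a missing cell contains nothing but the marker (optionally surrounded by spaces).

```python
import csv
import io

def blank_asterisks(src, dst, marker="*"):
    """Copy CSV rows from src to dst, replacing cells equal to marker with ""."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # Only whole-cell matches are blanked; "1*2" would be left untouched.
        writer.writerow(["" if cell.strip() == marker else cell for cell in row])

# In-memory demo. With real files, open both with newline="" as the csv docs advise.
sample = io.StringIO("a,b\n1,*\n*,4\n")
cleaned = io.StringIO()
blank_asterisks(sample, cleaned)
print(cleaned.getvalue())
```

The output keeps the same rows and columns, with empty cells where the asterisks were, which is the shape the upload step expects for missing values.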

Then log in to AML and click + New at the bottom of the screen.

Select New > DATASET > FROM LOCAL FILE and select your file.

When selecting the type, choose CSV with no header if your data has no header row, or CSV with a header if it does.

Your dataset will start uploading, as shown by the progress bar at the bottom of the screen, and will then appear in the SAVED DATASETS collection.

Click the + New button again and select EXPERIMENT > BLANK EXPERIMENT.

Drag and drop your saved dataset onto the canvas on the right.

In the Search experiment items box on the right, type Clean Missing Data, then drag the module that appears onto the canvas.

Join the two boxes by clicking the dot at the bottom of the top box and dragging to the other box.

Select the bottom box and set its parameters in the panel on the right. This is where you choose the method applied to missing values, e.g. replace missing values with the mean, or perhaps the median if your column data is skewed.
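For reference, this is roughly what a "replace with mean" setting computes for each column. The Python sketch below is an illustration under stated assumptions, not AML's implementation: `fill_with_column_mean` is a hypothetical helper, the table is assumed to be rectangular with no header row, and both "*" and empty cells count as missing.

```python
import statistics

def fill_with_column_mean(rows, missing=("*", "")):
    """Return rows as floats, with missing cells replaced by their column mean.

    rows is a list of equal-length lists of strings. The mean of each
    column is taken over its known values only. A column with no known
    values has no mean, so its cells become None.
    """
    columns = list(zip(*rows))  # transpose: one tuple per column
    means = []
    for col in columns:
        known = [float(c) for c in col if c not in missing]
        means.append(statistics.mean(known) if known else None)
    return [
        [means[i] if cell in missing else float(cell) for i, cell in enumerate(row)]
        for row in rows
    ]

data = [["1", "10"],
        ["*", "20"],
        ["3", "*"]]
print(fill_with_column_mean(data))
```

Because each column is handled independently, the same call works whether the sheet has two columns or more than a thousand.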

Right-click the bottom module and select Run selected.

Right-click again and select Cleaned dataset > Save as Dataset.

The progress bar at the bottom will tell you when it is complete.

In the Search experiment items box, type convert to csv, drag the module onto the canvas, and connect the bottom-left output of the second module to the top of the newly added third module.

Select the bottom module, right-click, and choose Run selected.

Wait for the progress bar to complete.

Right-click the bottom module and hit Download. Done.
