Fill missing values in selected columns with filtered values from another column


Problem description

I have an oddly named column, null, in a dataframe; it holds values that are missing from other columns. One of those columns, location, holds lat-lon coordinates; the other, level, is an integer target variable. In some but not all of the rows where location or level is missing, the value that should be there sits in the null column. Here's an example df:

import numpy as np
import pandas as pd

pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)

I need to filter the null column according to whether each value is an integer or a string and then, based on that, fill the missing value in the appropriate column. I've tried .apply() with a lambda function, as well as .match() and .contains() inside a for loop, with no luck so far.

Recommended answer

The most straightforward approach is to fill all the missing values in df.location and df.level with the values in df.null, then use a regex-based boolean filter to set the inappropriate/misassigned values in df.location and df.level back to np.nan.

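import numpy as np
import pandas as pd

df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)

# Fill every missing value in both columns from the `null` column.
# Assigning the result back avoids relying on inplace=True on a column selection.
for col in ['location', 'level']:
     df[col] = df[col].fillna(df['null'])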

Now we'll use string methods with regular expressions to correct the misassigned values.

# Converting columns to type str so string methods work
df = df.astype(str)

# Using regex to change values that don't belong in a column to NaN.
# Anything containing a comma is a lat-lon pair, so it is not a level.
regex = '[,]'
df.loc[df.level.str.contains(regex), 'level'] = np.nan

# A single digit, optionally followed by ".0", is a level, not a location.
# The raw string keeps \d from being treated as a string escape.
regex = r'^\d\.?0?$'
df.loc[df.location.str.contains(regex), 'location'] = np.nan

# Returning `df.level` to float datatype (str is the correct
# datatype for `df.location`). The result must be assigned back.
df['level'] = df['level'].astype(float)

The output is as follows:

pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: '43.70477575,-72.28844073', 1: np.nan, 2: '43.70637091,-72.28704334', 3: np.nan, 4: np.nan},
      'level': {0: np.nan, 1: 2.0, 2: np.nan, 3: 4.0, 4: 3.0}
     }
)
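
As an alternative to filling first and cleaning up afterwards, the values can be routed to the right column in one pass. This is only a minimal sketch, not part of the original answer. It assumes that every location contains a comma and that every level parses as a number; the names is_location and as_number are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'null': ['43.70477575,-72.28844073', '2', '43.70637091,-72.28704334', '4', '3'],
    'location': [np.nan] * 5,
    'level': [np.nan] * 5,
})

# Assumption: a comma marks a "lat,lon" pair.
is_location = df['null'].str.contains(',', na=False)

# Values that cannot be parsed as a number (the lat-lon pairs) become NaN.
as_number = pd.to_numeric(df['null'], errors='coerce')

# Fill only the cells that are currently missing, and only from rows
# whose `null` value is of the matching kind.
df['location'] = df['location'].astype(object).fillna(df['null'].where(is_location))
df['level'] = df['level'].fillna(as_number.where(~is_location))

print(df)
```

Because fillna only touches missing cells, any location or level that is already present is left alone, and level stays numeric without a round trip through str.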
