Fill missing values in selected columns with filtered values in other column
Question
I have a weird column named `null` in a dataframe that contains some missing values from other columns. One column is lat-lon coordinates named `location`, the other is an integer representing a target variable named `level`. In some but not all of the cases where `location` or `level` have missing values, the values that should be there are in this `null` column. Here's an example df:
import numpy as np
import pandas as pd

pd.DataFrame(
    {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
    }
)
I need to be able to filter the `null` column according to whether the value is an integer or a string, and then based on that fill the missing value in the appropriate column with the appropriate value. I've tried using `.apply()` with a lambda function, as well as `.match()`, `.contains()` and `in` inside of a for loop, with no luck so far.
Recommended answer
The easiest, if not the simplest, approach is to fill all the missing values in `df.location` and `df.level` with the values in `df.null`, then build a boolean filter with regex and reset any inappropriate or misassigned values in `df.location` and `df.level` to `np.nan`.
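import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
     'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
    }
)
for col in ['location', 'level']:
    # Cast to object so the string values fit, and assign the result back
    # instead of relying on inplace=True on a column selection
    df[col] = df[col].astype(object).fillna(df['null'])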
Now we'll use string expressions to correct the mis-assigned values.
# Convert the columns to type str so the string methods work
df = df.astype(str)
# Use regex to reset values that don't belong in a column to NaN
regex = '[,]'
df.loc[df['level'].str.contains(regex), 'level'] = np.nan
# Note: this pattern only matches single-digit levels such as '2', '2.' or '2.0'
regex = r'^\d\.?0?$'
df.loc[df['location'].str.contains(regex), 'location'] = np.nan
# Return `df.level` to float datatype (str is the correct
# datatype for `df.location`); assign the result back so it takes effect
df['level'] = df['level'].astype(float)
The output is as follows, with `level` shown before it is converted back to float:
pd.DataFrame(
{'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
'location': {0: '43.70477575,-72.28844073', 1: nan, 2: '43.70637091,-72.28704334', 3: nan, 4: nan},
'level': {0: nan, 1: '2', 2: nan, 3: '4', 4: '3'}
}
)
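As a possible alternative to filling everything and then cleaning up, the values in `null` can be classified first and only the matching ones used for the fill. The following is a minimal sketch, not the answer's own method. It assumes that a comma reliably marks a lat-lon pair and that every other entry is meant to be a numeric level. The names `is_location`, `location_candidates` and `level_candidates` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
        "location": [np.nan] * 5,
        "level": [np.nan] * 5,
    }
)

# Assumption: a comma marks a "lat,lon" pair; anything else is a level candidate.
is_location = df["null"].str.contains(",", regex=False, na=False)

# Keep only the comma-containing strings as location candidates (others become NaN).
location_candidates = df["null"].where(is_location)

# Parse the remaining values as numbers; entries that cannot be parsed become NaN.
level_candidates = pd.to_numeric(df["null"].where(~is_location), errors="coerce")

# Fill only the cells that are missing in the target columns.
df["location"] = df["location"].astype(object).fillna(location_candidates)
df["level"] = df["level"].fillna(level_candidates)

print(df)
```

Because `pd.to_numeric` does the parsing, this variant should also cope with multi-digit levels, which the single-digit regex would not match.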