Replacing NaN value with a word when NaN is not repeated in two consecutive rows
Problem description
For the following dataframe:
index  Sent  col_1  col_2  col_3
1      AB    NaN    DD     CC
1            0      1      0
2      SA    FA     FB     NaN
2            1      1      NaN
3      FF    Sha    NaN    PA
3            1      0      1
I need to replace the NaN values in col_1, col_2 and col_3 with "F" when the NaN is not repeated in two consecutive rows. The output should look like this:
index  Sent  col_1  col_2  col_3
1      AB    F      DD     CC
1            0      1      0
2      SA    FA     FB     NaN
2            1      1      NaN
3      FF    Sha    F      PA
3            1      0      1
This is my code:
for col in ['col_1', 'col_2', 'col_3']:
    data = np.reshape(df[col].values, (-1, 2))
    need_fill = np.logical_and(data[:, 0] == '', data[:, 1] != '')
    data[np.where(need_fill),1] = 'F'
But it replaces the 0 under the NaN value with F. How can I fix the code so that the NaN itself is replaced with F?
Recommended answer
There may be something better, but one way is to use shift to look at the row above and the row below. However, this causes a problem for the first and last rows, which have only one neighbour. So, if it is not a problem to add extra rows and remove them later, you can try the following:
# Appending row to the top: https://stackoverflow.com/a/24284680/5916727
df.loc[-1] = [0 for n in range(len(df.columns))]
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
# Append row to below it
df.loc[df.shape[0]] = [0 for n in range(len(df.columns))]
print(df)
   index  Sent  col_1  col_2  col_3
0  0      0     0      0      0
1  1      AB    NaN    DD     CC
2  1            0      1      0
3  2      SA    FA     FB     NaN
4  2            1      1      NaN
5  3      FF    Sha    NaN    PA
6  3            1      0      1
7  0      0     0      0      0
Now, check consecutive rows by building a mask with shift(-1) (the row below) and shift(1) (the row above):
columns = ["col_1", "col_2", "col_3"]
for column in columns:
    df.loc[df[column].isnull() & df[column].shift(-1).notnull() & df[column].shift(1).notnull(), column] = "F"
df = df[1:-1]  # remove extra rows
print(df)
Output:
   index  Sent  col_1  col_2  col_3
1  1      AB    F      DD     CC
2  1            0      1      0
3  2      SA    FA     FB     NaN
4  2            1      1      NaN
5  3      FF    Sha    F      PA
6  3            1      0      1
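The padding rows exist only so that the first and last rows have neighbours to compare against. A minimal sketch of an alternative, assuming a pandas version in which shift accepts fill_value: shift the boolean isna() mask rather than the data, so a missing neighbour counts as "not NaN" and no extra rows are needed. The sample frame is typed in by hand to mirror the question's data, and the names prev_missing, next_missing and isolated are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed sample mirroring the question's frame (an assumption, not the asker's real data).
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 3, 3],
    "Sent": ["AB", "", "SA", "", "FF", ""],
    "col_1": [np.nan, "0", "FA", "1", "Sha", "1"],
    "col_2": ["DD", "1", "FB", "1", np.nan, "0"],
    "col_3": ["CC", "0", np.nan, np.nan, "PA", "1"],
})

for column in ["col_1", "col_2", "col_3"]:
    missing = df[column].isna()
    # Rows that fall off either end are treated as "not missing".
    prev_missing = missing.shift(1, fill_value=False)
    next_missing = missing.shift(-1, fill_value=False)
    # A NaN is isolated when neither the row above nor the row below is NaN.
    isolated = missing & ~prev_missing & ~next_missing
    df.loc[isolated, column] = "F"

print(df)
```

Because the mask is boolean, this version does not depend on what value is used to pad the frame, and the original row labels are left unchanged.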
If you want, you can also remove the extra index column, which seems to contain duplicates.
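As for the code in the question, it appears to go wrong in two places: comparing against '' matches only if missing values are stored as empty strings (a real NaN never equals ''), and the assignment targets position 1 of each reshaped pair, which is the second row, rather than position 0. A hedged sketch of a repaired pair-based version follows, assuming the rows always come in pairs and only the first row of a pair should receive "F". The sample frame and the names pairs and missing are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-typed sample mirroring the question's frame (an assumption).
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 3, 3],
    "Sent": ["AB", "", "SA", "", "FF", ""],
    "col_1": [np.nan, "0", "FA", "1", "Sha", "1"],
    "col_2": ["DD", "1", "FB", "1", np.nan, "0"],
    "col_3": ["CC", "0", np.nan, np.nan, "PA", "1"],
})

for col in ["col_1", "col_2", "col_3"]:
    # Work on an explicit copy, one pair of rows per line of the array.
    pairs = df[col].to_numpy(dtype=object, copy=True).reshape(-1, 2)
    missing = pd.isna(pairs)
    # Fill only when the first row of the pair is NaN and the second is not.
    need_fill = missing[:, 0] & ~missing[:, 1]
    pairs[need_fill, 0] = "F"
    # Write the result back explicitly instead of relying on a view.
    df[col] = pairs.ravel()

print(df)
```

Unlike the shift-based mask, this version only compares rows within the same pair, so it would not notice a NaN that sits in the last row of one pair and the first row of the next.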
I had the following in the test CSV file:
index,Sent,col_1,col_2,col_3
1,AB,,DD,CC
1, ,0,1,0
2,SA,FA,FB,NA
2, ,1,1,NA
3,FF,Sha,,PA
3, ,1,0,1
Then, you can use the following to create the input dataframe:
import pandas as pd
df = pd.read_csv("FILENAME.csv")
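To try this without creating a file, the same CSV text can be read from an in-memory string. This is only a sketch: the variable name csv_text is made up, and it relies on read_csv's default behaviour of treating empty fields and the literal NA as missing values.

```python
import io

import pandas as pd

# The test CSV from above, held in a string instead of a file.
csv_text = """index,Sent,col_1,col_2,col_3
1,AB,,DD,CC
1, ,0,1,0
2,SA,FA,FB,NA
2, ,1,1,NA
3,FF,Sha,,PA
3, ,1,0,1
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
df = pd.read_csv(io.StringIO(csv_text))

# Show where pandas found missing values in the three data columns.
print(df[["col_1", "col_2", "col_3"]].isna())
```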