Cryptic warning pops up when doing pandas assignment with loc and iloc
There is a statement in my code that goes:
df.loc[i] = [df.iloc[0][0], i, np.nan]
where i is the iteration variable of the for loop that contains this statement, np is my imported numpy module, and df is a DataFrame that looks something like:
build_number name cycles
0 390 adpcm 21598
1 390 aes 5441
2 390 dfadd 463
3 390 dfdiv 1323
4 390 dfmul 167
5 390 dfsin 39589
6 390 gsm 6417
7 390 mips 4205
8 390 mpeg2 1993
9 390 sha 348417
As you can see, the statement inserts a new row into my DataFrame df and fills the last column of that new row, cycles, with a NaN value.
However, in so doing, I get the following warning message:
/usr/local/bin/ipython:28: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Looking at the docs, I still don't understand what problem or risk I'm incurring here. I thought that using loc and iloc already follows the recommendation?
Thank you.
EDIT: At the request of @EdChum, I have added the function that uses the above statement below:
def patch_missing_benchmarks(refined_dataframe):
    '''
    Patches up a given DataFrame, ensuring that all build_numbers have the complete
    set of benchmark names, inserting NaN values at the column where the data is
    supposed to be residing in.

    Accepts:
    --------
    * refined_dataframe
      DataFrame that was returned from the remove_early_retries() function and that
      contains no duplicates of benchmarks within a given build number and also has been
      sorted nicely to ensure that build numbers are in alphabetical order.
      However, this function can also accept the DataFrame that has not been sorted, so
      long as it has no repetition of benchmark names within a given build number.

    Returns:
    -------
    * patched_benchmark_df
      DataFrame with all Build numbers filled with the complete set of benchmark data,
      with those previously missing benchmarks now having NaN values for their data.
    '''
    patched_df_list = []
    benchmark_list = ['adpcm', 'aes', 'blowfish', 'dfadd', 'dfdiv', 'dfmul',
                      'dfsin', 'gsm', 'jpeg', 'mips', 'mpeg2', 'sha']
    benchmark_series = pd.Series(data = benchmark_list)
    for number in refined_dataframe['build_number'].drop_duplicates().values:
        # df must be a DataFrame whose data has been sorted according to build_number
        # followed by benchmark name
        df = refined_dataframe.query('build_number == %d' % number)
        # Now we compare the benchmark names present in our section of the DataFrame
        # with the Series containing the complete collection of benchmark names and
        # get back a boolean Series telling us precisely what benchmark names
        # are missing
        boolean_bench = benchmark_series.isin(df['name'])
        list_names = []
        for i in range(0, len(boolean_bench)):
            if boolean_bench[i] == False:
                name_to_insert = benchmark_series[i]
                list_names.append(name_to_insert)
            else:
                continue
        print 'These are the missing benchmarks for build number',number,':'
        print list_names
        for i in list_names:
            # create a new row with index that is benchmark name itself to avoid overwriting
            # any existing data, then insert the right values into that row, filling in the
            # space name with the right benchmark name, and missing data with NaN
            df.loc[i] = [df.iloc[0][0], i, np.nan]
        patched_for_benchmarks_df = df.sort_index(by=['build_number',
                                                      'name']).reset_index(drop = True)
        patched_df_list.append(patched_for_benchmarks_df)
    # we make sure we call a dropna method at threshold 2 to drop those rows whose benchmark
    # names as well as cycles names are NaN, leaving behind the newly inserted rows with
    # benchmark names but that now have the data as NaN values
    patched_benchmark_df = pd.concat(objs = patched_df_list, ignore_index =
                                     True).sort_index(by= ['build_number',
                                     'name']).dropna(thresh = 2).reset_index(drop = True)
    return patched_benchmark_df
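A possible reading of the warning, given this function: df is produced by refined_dataframe.query(...), so pandas regards it as derived from another DataFrame, and assigning into it with loc is what gets flagged, even though loc itself is the recommended accessor. A minimal sketch of one common way around it, taking an explicit copy of the subset before modifying it, is below. The small `refined` frame is a made-up stand-in for refined_dataframe, and whether this is the right fix for the original data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for refined_dataframe from the question.
refined = pd.DataFrame({
    "build_number": [390, 390, 391],
    "name": ["adpcm", "aes", "adpcm"],
    "cycles": [21598, 5441, 21600],
})

# The filtered result is derived from `refined`; .copy() makes it an
# independent DataFrame before anything is assigned into it.
df = refined.query("build_number == 390").copy()

# Insert a new row labelled with the benchmark name, as in the question.
df.loc["sha"] = [df["build_number"].iloc[0], "sha", np.nan]

print(df)
```

The original `refined` frame is left untouched; only the copied subset gains the new row.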
Without seeing how you are doing this: if you just want to set the 'cycles' column, then the following works without raising any warning:
In [344]:
for i in range(len(df)):
    df.loc[i,'cycles'] = np.nan
df
Out[344]:
build_number name cycles
0 390 adpcm NaN
1 390 aes NaN
2 390 dfadd NaN
3 390 dfdiv NaN
4 390 dfmul NaN
5 390 dfsin NaN
6 390 gsm NaN
7 390 mips NaN
8 390 mpeg2 NaN
9 390 sha NaN
If you just want to set the entire column, there's no need to loop; just do this: df['cycles'] = np.nan
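In the same loop-free spirit, the patching that the question's function performs (every build number gets the complete set of benchmark names, with NaN for missing cycles) could perhaps be done without inserting rows one at a time. The sketch below is not from the original answer: it reindexes on a full build_number/name grid, uses a shortened benchmark list and invented sample data, and assumes each (build_number, name) pair occurs at most once, as the question's docstring states.

```python
import pandas as pd

# Shortened, hypothetical benchmark list and sample data for illustration.
benchmarks = ["adpcm", "aes", "sha"]
refined = pd.DataFrame({
    "build_number": [390, 390, 391],
    "name": ["adpcm", "sha", "aes"],
    "cycles": [21598, 348417, 5441],
})

# Every combination of build number and benchmark name that should exist.
full_index = pd.MultiIndex.from_product(
    [refined["build_number"].unique(), benchmarks],
    names=["build_number", "name"],
)

# Reindexing onto the full grid adds the missing rows with NaN in cycles.
patched = (
    refined.set_index(["build_number", "name"])
    .reindex(full_index)
    .reset_index()
)

print(patched)
```

Because nothing is assigned into a filtered subset here, the view-versus-copy question does not come up at all.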