Python variable scope approach
Question
I currently have this Python code (I'm using Apache Spark, but I'm fairly sure that doesn't matter for this question):
import numpy as np
import pandas as pd
from sklearn import feature_extraction
from sklearn import tree
from pyspark import SparkConf, SparkContext

## Module Constants
APP_NAME = "My Spark Application"

df = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def train_tree():
    # Do more stuff with the data, call other functions
    pass

def main(sc):
    cat_columns = ["Sex", "Pclass"]
    # PROBLEM IS HERE
    cat_dict = df[cat_columns].to_dict(orient='records')
    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()
    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index
    # train data
    df = df.drop(cat_columns, axis=1)
    df = df.join(df_vector)
    train_tree()

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Execute Main functionality
    main(sc)
When I run it, I get this error:

cat_dict = df[cat_columns].to_dict(orient='records')
UnboundLocalError: local variable 'df' referenced before assignment
I find this puzzling because I define the variable df outside of the main function's scope, at the top of the file. Why would using this variable inside the function trigger this error? I have also tried putting the df definition inside the if __name__ == "__main__": block (before main is called).
Now, obviously there are lots of ways I could solve this, but this is more about helping me understand Python better. So I want to ask:
a) Why does this error occur at all?
b) How best to solve it, given that:
- I don't want to put the df definition inside the main function, because I want to access it in other functions.
- I don't want to use a class.
- I don't want to use a global variable.
- I don't want to pass df around in function parameters.
Answer
I think it's worth summarizing the comments into a detailed answer for future readers of this question.
The reason the UnboundLocalError is thrown here is the way Python function scope works. Although my df variable is defined outside the main function at the uppermost (module) scope, attempting to re-assign it inside main causes the error. This excellent answer puts it nicely; to paraphrase:
Now we get to df = df.drop(cat_columns, axis=1). When Python scans that line, it says "aha, there's a variable named df, I'll put it into my local scope dictionary." Then, when it goes looking for a value for the df on the right-hand side of the assignment, it finds its local variable named df, which has no value yet, and so throws the error.
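The rule is easy to see in isolation with a toy snippet (the names here are illustrative, not from the original code): any assignment to a name anywhere in a function body makes that name local for the entire function, even on lines before the assignment.

```python
x = 10  # module-level name

def read_only():
    # x is only read here, so Python resolves it in the module scope.
    return x + 1

def rebind():
    # Because x is assigned later in this function, Python treats x as
    # local for the WHOLE body, so this read fails before any assignment.
    y = x + 1  # UnboundLocalError: local variable 'x' referenced before assignment
    x = y

print(read_only())  # 11

try:
    rebind()
except UnboundLocalError as exc:
    print("caught:", exc)
```

Note that read_only works fine: reading a module-level name from inside a function is allowed; it is the presence of an assignment anywhere in rebind that changes how x is looked up throughout that function.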
To fix my code, I made the following change:
def main(sc):
    cat_columns = ["Sex", "Pclass", "SibSp"]
    cat_dict = df[cat_columns].to_dict(orient='records')
    vec = feature_extraction.DictVectorizer()
    cat_vector = vec.fit_transform(cat_dict).toarray()
    df_vector = pd.DataFrame(cat_vector)
    vector_columns = vec.get_feature_names()
    df_vector.columns = vector_columns
    df_vector.index = df.index
    # train data
    df_updated = df.drop(cat_columns, axis=1)  # This used to be df = df.drop(cat_columns, axis=1)
    df_updated = df_updated.join(df_vector)
    train_tree(df_updated)  # passing df_updated to the function
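For that call to work, train_tree itself now has to accept the DataFrame as a parameter. A minimal sketch of what that might look like; the "Survived" target column and the fitting logic are my assumptions (a Titanic-style train.csv), not from the original post:

```python
import pandas as pd
from sklearn import tree

def train_tree(df_updated):
    # Assumed: the vectorized DataFrame still carries a "Survived"
    # target column; everything else is treated as a feature.
    X = df_updated.drop("Survived", axis=1)
    y = df_updated["Survived"]
    clf = tree.DecisionTreeClassifier()
    clf.fit(X, y)
    return clf
```

The same pattern extends to any other helper: each function that needs the data takes it as an explicit argument instead of reaching for the module-level name.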
This removes the UnboundLocalError. To keep using the df variable in other functions, I pass it in as a parameter (albeit under a different name). This could get confusing, so, as suggested by @Padraic Cunningham, you could instead load df inside the if __name__ == "__main__": block and pass it into main:
if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)

    df = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    # df.Age = df.Age.astype(int)
    # test.Age = test.Age.astype(int)

    # Execute Main functionality
    main(sc, df)
Other options would be to use a class, or to use a global variable. I felt that these two options were overkill (a class) or inelegant (a global). However, this is purely my personal taste.
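For completeness, the global route rejected above would look like this: a global declaration tells Python that rebinding df inside the function should target the module-level name. The tiny DataFrame here is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female"], "Age": [30, 25]})

def main():
    global df  # rebindings of df now update the module-level name
    df = df.drop("Sex", axis=1)  # no UnboundLocalError this time

main()
print(list(df.columns))  # ['Age']
```

It works, but it makes main's effect on module state invisible at the call site, which is exactly why explicit parameters read better here.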