Google Cloud ML Engine + Tensorflow: perform preprocessing/tokenization in input_fn()
Problem description
I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket location (gs://) that I cannot modify. Further, I need to perform any modifications on the input text within my ml-engine package so that the behavior can be replicated at serving time.
My input function follows the basic structure below:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])
# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text)  # splits based on whitespace
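The two commented steps can be prototyped outside the graph first. Below is a plain-Python sketch of the intended transformations (the function name and the regex are illustrative choices, not from the original); in the graph itself these would map to TensorFlow string ops.

```python
import re

def preprocess(text):
    """Plain-Python reference for the in-graph preprocessing:
    filter special characters, lowercase, split on whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)  # drop special characters
    return cleaned.lower().split()                 # lowercase + whitespace split

print(preprocess("Hello, World! It's TF."))  # ['hello', 'world', 'its', 'tf']
```

Prototyping in plain Python makes it easy to verify the tokenization rules before committing them to graph ops.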
Are there any options that avoid performing this preprocessing on the entire data set in advance? This post suggests that tf.py_func() can be used to make these transformations; however, it notes that "The drawback is that as it is not saved in the graph, I cannot restore my saved model", so I am not convinced that this will be useful at serving time. If I am defining my own tf.py_func() to do preprocessing, and it is defined in the trainer package that I am uploading to the cloud, will I run into any issues? Are there any alternative options that I am not considering?
Solution

Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.
For example:
def add_engineered(features):
    text = features['text']
    features['words'] = tf.string_split(text)
    return features
Then, in your input_fn, wrap the features you return with a call to add_engineered:
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
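The point of this pattern is that training and serving share one code path for feature engineering. A minimal plain-Python stand-in (dict features and str.split in place of tf.string_split, both illustrative) shows that the two paths produce identical engineered features:

```python
def add_engineered(features):
    # stand-in for the TF version: derive 'words' from the raw 'text' feature
    features['words'] = features['text'].split()
    return features

# training/eval path: features come from the parsed CSV rows
train_features = add_engineered({'text': 'the quick brown fox'})

# serving path: features come from the JSON request
serve_features = add_engineered({'text': 'the quick brown fox'})

assert train_features == serve_features  # same engineering in both paths
```

Because the derivation lives in one function, there is no risk of training/serving skew from two diverging implementations.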
and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:
def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text' i.e. the raw values.
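For example, a prediction request body would carry only the raw field; the exact instance shape below is illustrative, but the key point is that 'words' never appears in the request because add_engineered derives it in-graph:

```python
import json

# Only 'text' is sent; 'words' is derived in-graph by add_engineered.
request = {"instances": [{"text": "the quick brown fox"}]}
body = json.dumps(request)
print(body)
```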
Here's a complete working example: