Google Cloud ML Engine + Tensorflow perform preprocessing/tokenization in input_fn()


Problem description



I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket (gs://) that I cannot modify. Further, I want any modifications to the input text performed within my ml-engine package so that the behavior can be replicated at serving time.

My input function follows the basic structure below:

filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults = [[""],[""]])

# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text) # splits based on white space
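The two placeholder comments correspond to standard string cleanup; in the graph they map onto TensorFlow string ops (tf.strings.lower and tf.strings.regex_replace in newer releases), but the intended transformation is easiest to see in plain Python. A sketch of the same logic (the function name and regex are illustrative, not part of any API):

```python
import re

def tokenize(line):
    """Mirror of the intended graph ops: lowercase, drop special
    characters, then split on whitespace."""
    line = line.lower()                       # make all words lowercase
    line = re.sub(r"[^a-z0-9\s]", "", line)   # filter special characters
    return line.split()                       # split on whitespace

print(tokenize("Hello, World! It's 2017."))
# ['hello', 'world', 'its', '2017']
```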

Are there any options that avoid performing this preprocessing on the entire data set in advance? This post suggests that tf.py_func() can be used to make these transformations, however they suggest that "The drawback is that as it is not saved in the graph, I cannot restore my saved model" so I am not convinced that this will be useful at serving time. If I am defining my own tf.py_func() to do preprocessing and it is defined in the trainer package that I am uploading to the cloud will I run into any issues? Are there any alternative options that I am not considering?

Solution

Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.

For example:

def add_engineered(features):
  text = features['text']
  features['words'] = tf.string_split(text)
  return features

Then, in your input_fn, wrap the features you return with a call to add_engineered:

def input_fn():
  features = ...
  label = ...
  return add_engineered(features), label

and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:

def serving_input_fn():
    # tflearn here is the usual alias: import tensorflow.contrib.learn as tflearn
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
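The wrapping in the two functions above amounts to one pattern: every input path funnels its raw features through the same add_engineered before they reach the model. A framework-agnostic sketch of that pattern, with plain dicts standing in for tensors (all names and values are illustrative):

```python
def add_engineered(features):
    # Derived feature computed in exactly one place, so training-time
    # and serving-time preprocessing cannot drift apart.
    features["words"] = features["text"].split()
    return features

def train_input():
    # Training path: raw features as read from CSV, plus a label.
    return add_engineered({"text": "the quick brown fox"}), "some_label"

def serving_input():
    # Serving path: raw features as parsed from the prediction request.
    return add_engineered({"text": "hello world"})

features, label = train_input()
print(features["words"])           # ['the', 'quick', 'brown', 'fox']
print(serving_input()["words"])    # ['hello', 'world']
```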

Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text' i.e. the raw values.
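Concretely, an online prediction request body would then only need to carry the raw field (the 'text' key from the code above; the value is illustrative):

```json
{"instances": [{"text": "the quick brown fox"}]}
```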

Here's a complete working example:

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107
