AWS SageMaker | How to train text data | For ticket classification


Question

I am new to SageMaker and am not sure how to classify text input in AWS SageMaker.

Suppose I have a DataFrame with two fields, 'Ticket' and 'Category', both containing text. I want to split it into training and test sets and upload it to SageMaker to train a model.

X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])

Next, I want to perform TF-IDF feature extraction to convert the text to numeric values, so I run:

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf =  tfidf_vect.transform(X_train)
xvalid_tfidf =  tfidf_vect.transform(X_test)

When I try to upload this to SageMaker so that I can perform the next step, for example:

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
      1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
      3 buf.seek(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
     98             raise ValueError("Label shape {} not compatible with array shape {}".format(
     99                              labels.shape, array.shape))
--> 100         resolved_label_type = _resolve_type(labels.dtype)
    101     resolved_type = _resolve_type(array.dtype)
    102 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    205     elif dtype == np.dtype('float32'):
    206         return 'Float32'
--> 207     raise ValueError('Unsupported dtype {} on array'.format(dtype))

ValueError: Unsupported dtype object on array

Apart from this exception, I am not sure whether this is the right approach, since TfidfVectorizer converts the Series to a matrix.

The code predicts fine on my local machine, but I am not sure how to do the same on SageMaker. All the examples provided there are too lengthy and are not aimed at someone who has only just gotten as far as scikit-learn.

Recommended answer

The output of TfidfVectorizer is a SciPy sparse matrix, not a plain NumPy array.
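To make the diagnosis concrete, here is a minimal sketch that assumes a handful of made-up tickets in place of the question's fewRecords data. It checks what TfidfVectorizer returns and shows that text labels carry dtype object, which is the dtype named in the ValueError. The `tickets` and `labels` names and their contents are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for fewRecords['Ticket'] and fewRecords['Category'].
tickets = [
    "cannot log in to portal",
    "invoice amount is wrong",
    "password reset link expired",
    "charged twice this month",
]
# An object array of strings, mimicking how pandas has traditionally stored a text column.
labels = np.array(["access", "billing", "access", "billing"], dtype=object)

tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
features = tfidf_vect.fit_transform(tickets)

# The features are a SciPy sparse matrix of floats, not a dense NumPy array.
print(type(features).__name__, sparse.issparse(features), features.dtype)

# The labels are still text, so their dtype is object rather than a numeric type.
print(labels.dtype)
```

Note that the traceback in the question points at the line that resolves `labels.dtype`, which suggests the text labels also need to be converted to numbers before they are written.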

So either use a different function, such as:

write_spmatrix_to_sparse_tensor

"""Writes a scipy sparse matrix to a sparse tensor"""

See this issue for more details.
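A minimal sketch of this first option follows, assuming toy data and that `write_spmatrix_to_sparse_tensor` is importable from `sagemaker.amazon.common`, the module shown in the traceback. Encoding the text categories with scikit-learn's LabelEncoder and casting both arrays to float32 are assumptions added here, not part of the original answer. The import is guarded so the preparation steps still run where the SageMaker SDK is not installed.

```python
import io

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

try:
    from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor
except ImportError:  # SDK not installed: only the preparation steps run
    write_spmatrix_to_sparse_tensor = None

# Hypothetical toy data standing in for the question's training split.
tickets = [
    "cannot log in to portal",
    "invoice amount is wrong",
    "password reset link expired",
    "charged twice this month",
]
categories = ["access", "billing", "access", "billing"]

# Keep the TF-IDF features sparse; float32 is a supported dtype.
vectorizer = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
x_sparse = vectorizer.fit_transform(tickets).astype(np.float32)

# Map each text category to an integer id, then cast to a numeric dtype.
encoder = LabelEncoder()
y_numeric = encoder.fit_transform(categories).astype(np.float32)

buf = io.BytesIO()
if write_spmatrix_to_sparse_tensor is not None:
    write_spmatrix_to_sparse_tensor(buf, x_sparse, y_numeric)
    buf.seek(0)
```

Keeping the matrix sparse avoids materializing up to 5000 mostly-zero columns per ticket, which matters once the dataset grows. `encoder.inverse_transform` can map predicted ids back to category names.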

Or first convert the output of TfidfVectorizer to a dense NumPy array and then use your code above:

xtrain_tfidf =  tfidf_vect.transform(X_train).toarray()   
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
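A fuller sketch of the dense route follows, under several assumptions that are not in the original answer. The data is made up. The vectorizer is fitted on the training tickets rather than on the Category column, which is a guess at what the question intended. The text labels are encoded with LabelEncoder, because the traceback fails while resolving `labels.dtype`. The SageMaker import is guarded so the sketch runs without the SDK.

```python
import io

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

try:
    from sagemaker.amazon.common import write_numpy_to_dense_tensor
except ImportError:  # SDK not installed: only the preparation steps run
    write_numpy_to_dense_tensor = None

# Hypothetical toy data standing in for fewRecords['Ticket'] and fewRecords['Category'].
tickets = [
    "cannot log in to portal", "invoice amount is wrong",
    "password reset link expired", "charged twice this month",
    "account locked after failed logins", "refund has not arrived",
    "two factor code never arrives", "wrong tax on invoice",
]
categories = ["access", "billing", "access", "billing",
              "access", "billing", "access", "billing"]

X_train, X_test, y_train, y_test = train_test_split(
    tickets, categories, test_size=0.25, random_state=0, stratify=categories
)

# Learn the vocabulary from the training tickets only, then densify.
tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
tfidf_vect.fit(X_train)
xtrain_tfidf = tfidf_vect.transform(X_train).toarray().astype(np.float32)
xvalid_tfidf = tfidf_vect.transform(X_test).toarray().astype(np.float32)

# Encode the text labels as numbers so their dtype is no longer object.
encoder = LabelEncoder().fit(categories)
ytrain_numeric = encoder.transform(y_train).astype(np.float32)

buf = io.BytesIO()
if write_numpy_to_dense_tensor is not None:
    write_numpy_to_dense_tensor(buf, xtrain_tfidf, ytrain_numeric)
    buf.seek(0)
```

The dense array is simple to work with but uses memory proportional to rows times vocabulary size, so this route is best suited to small datasets.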
