AWS SageMaker | How to train on text data | Ticket classification
Question
I am new to SageMaker and not sure how to classify text input in AWS SageMaker.
Suppose I have a DataFrame with two fields, 'Ticket' and 'Category'. Both are text. I want to split it into training and test sets and upload them for SageMaker model training.
X_train, X_test, y_train, y_test = model_selection.train_test_split(fewRecords['Ticket'],fewRecords['Category'])
Next I want to perform TF-IDF feature extraction to convert the text to numeric values, so I do this:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(fewRecords['Category'])
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
Then, to upload the data to SageMaker, I perform the next operation:
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
buf.seek(0)
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-8055e6cdbf34> in <module>()
1 buf = io.BytesIO()
----> 2 smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
3 buf.seek(0)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
98 raise ValueError("Label shape {} not compatible with array shape {}".format(
99 labels.shape, array.shape))
--> 100 resolved_label_type = _resolve_type(labels.dtype)
101 resolved_type = _resolve_type(array.dtype)
102
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
205 elif dtype == np.dtype('float32'):
206 return 'Float32'
--> 207 raise ValueError('Unsupported dtype {} on array'.format(dtype))
ValueError: Unsupported dtype object on array
Apart from this exception, I am not sure whether this is the right approach, since TfidfVectorizer converts the Series to a matrix.
The code predicts fine on my local machine, but I am not sure how to do the same on SageMaker. All the examples I have found are too lengthy and not aimed at someone who has only gotten as far as scikit-learn.
Recommended answer
The output of TfidfVectorizer is a scipy sparse matrix, not a simple numpy array.
So either use a different function like:
def write_spmatrix_to_sparse_tensor(file, array, labels=None):
    """Writes a scipy sparse matrix to a sparse tensor"""
See this issue for more details.
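As a rough sketch of the sparse route, assuming the function meant here is write_spmatrix_to_sparse_tensor from the same sagemaker.amazon.common module that appears in the traceback: the matrix can stay sparse, but the labels still have to be a numeric 1-D numpy array. The tiny matrix, the ytrain_num labels and the to_sparse_recordio helper below are hypothetical stand-ins, not part of the original post.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix, issparse

# Hypothetical stand-in for the TfidfVectorizer output (3 documents, 4 terms).
xtrain_tfidf = csr_matrix(
    np.array(
        [[0.0, 0.7, 0.0, 0.7], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.6, 0.8]],
        dtype="float32",
    )
)
# Labels must be numeric; these ids are made up for illustration.
ytrain_num = np.array([0, 1, 0], dtype="float32")


def to_sparse_recordio(features, labels):
    """Serialize a scipy sparse matrix to an in-memory buffer.

    Hypothetical helper; requires the SageMaker Python SDK to be installed.
    """
    import sagemaker.amazon.common as smac  # imported lazily on purpose

    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, features, labels)
    buf.seek(0)
    return buf
```

Calling to_sparse_recordio(xtrain_tfidf, ytrain_num) inside a SageMaker notebook should return a buffer ready for upload. Keeping the data sparse avoids materializing a full documents-by-5000 dense array.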
OR first convert the output of TfidfVectorizer to a dense numpy array and then use your code above:
xtrain_tfidf = tfidf_vect.transform(X_train).toarray()
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, xtrain_tfidf, y_train)
...
...
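One further point may matter: the traceback fails on the line that resolves labels.dtype, which suggests the string 'Category' labels (dtype object) are also rejected, so converting the features alone might not be enough. The sketch below is one possible end-to-end version, not a definitive fix. It assumes scikit-learn's LabelEncoder is an acceptable way to map categories to numbers, it fits the vectorizer on the training ticket text rather than on 'Category' (presumably what was intended), and the sample tickets, ytrain_num and the to_dense_recordio helper are hypothetical.

```python
import io

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'Ticket' and 'Category' columns of fewRecords.
tickets = [
    "cannot log in to my account",
    "password reset link expired",
    "invoice shows the wrong amount",
    "charged twice for one order",
    "login page keeps timing out",
    "refund has not arrived yet",
]
categories = ["access", "access", "billing", "billing", "access", "billing"]

X_train, X_test, y_train, y_test = train_test_split(
    tickets, categories, test_size=2, random_state=0, stratify=categories
)

# Fit on the training ticket text (not on the labels), then densify as float32.
tfidf_vect = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}", max_features=5000)
xtrain_tfidf = tfidf_vect.fit_transform(X_train).toarray().astype("float32")
xvalid_tfidf = tfidf_vect.transform(X_test).toarray().astype("float32")

# Map category strings to numeric ids so the label array is no longer dtype object.
label_enc = LabelEncoder()
ytrain_num = label_enc.fit_transform(y_train).astype("float32")
yvalid_num = label_enc.transform(y_test).astype("float32")


def to_dense_recordio(features, labels):
    """Serialize dense features and numeric labels to an in-memory buffer.

    Hypothetical helper; requires the SageMaker Python SDK to be installed.
    """
    import sagemaker.amazon.common as smac  # imported lazily on purpose

    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, features, labels)
    buf.seek(0)
    return buf
```

label_enc.inverse_transform can map predicted ids back to category names later. For a large vocabulary or many tickets, the dense array can get big, in which case the sparse writer may be the better fit.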