How to one-hot-encode sentences at the character level?
Question
I would like to convert a sentence to an array of one-hot vectors. Each vector would be the one-hot representation of one letter, indexed by its position in the alphabet. It would look like the following:
"hello" # h=7, e=4, l=11, o=14
would become
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Unfortunately, OneHotEncoder from sklearn does not take strings as input.
Recommended answer
Just compare each letter of the string you pass in against a given alphabet:
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per character of strng, one column per character of alphabet.
    vector = [[0 if char != letter else 1 for char in alphabet]
              for letter in strng]
    return vector
Note that with a custom alphabet (e.g. "defbcazk"), the columns are ordered as the characters appear in that alphabet.
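To make the column ordering concrete, here is a minimal sketch that calls the function with the custom alphabet "defbcazk". The function is repeated so the snippet runs on its own, and the input word "face" is only an illustrative choice, not something from the original answer.

```python
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per character of strng, one column per character of alphabet.
    return [[1 if char == letter else 0 for char in alphabet]
            for letter in strng]

custom = "defbcazk"  # columns: d=0, e=1, f=2, b=3, c=4, a=5, z=6, k=7
rows = string_vectorizer("face", alphabet=custom)

for letter, row in zip("face", rows):
    print(letter, row)
# f [0, 0, 1, 0, 0, 0, 0, 0]
# a [0, 0, 0, 0, 0, 1, 0, 0]
# c [0, 0, 0, 0, 1, 0, 0, 0]
# e [0, 1, 0, 0, 0, 0, 0, 0]
```

Each row has as many columns as the custom alphabet has characters (8 here), so the width of the encoding follows the alphabet rather than the usual 26 letters.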
string_vectorizer('hello')
Output:
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
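Since the question is about whole sentences, it may help to see what happens with characters outside the alphabet: with the default alphabet, a space or an uppercase letter matches no column, so its row is all zeros. The sketch below shows one possible way around this, lowercasing the input and appending a space to the alphabet. The name `sentence_alphabet` and the sample text "Hi you" are assumptions made for illustration and are not part of the original answer.

```python
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per character of strng, one column per character of alphabet.
    return [[1 if char == letter else 0 for char in alphabet]
            for letter in strng]

# Hypothetical alphabet for sentences: lowercase letters plus a space.
sentence_alphabet = string.ascii_lowercase + " "

text = "Hi you"
plain = string_vectorizer(text)
extended = string_vectorizer(text.lower(), alphabet=sentence_alphabet)

print(sum(plain[0]))         # 0: "H" is uppercase, so its row is all zeros
print(sum(plain[2]))         # 0: the space is not in the default alphabet
print(extended[0].index(1))  # 7: "h" after lowercasing
print(extended[2].index(1))  # 26: the space is the last of 27 columns
```

Whether all-zero rows are acceptable depends on the downstream model. If they are not, every character expected in the input needs its own column in the alphabet.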