N-grams for letters in sklearn

Question

I want to use the n-gram method, but letter by letter rather than word by word.

Normal n-grams:

sentence : He want to watch football match

result:
he, he want, want, want to, to, to watch, watch, watch football, football, football match, match

I want to do the same thing, but letter by letter:

word : Angela 

result:
a, an, n, ng, g, ge, e, el, l, la, a

This is my code using sklearn, but it still works word by word, not letter by letter:

from sklearn.feature_extraction.text import CountVectorizer

# Word-level n-grams: the default analyzer is 'word'
vectorizer = CountVectorizer(ngram_range=(1, 100), token_pattern=r"(?u)\b\w+\b")

corpus = ['Angel', 'Angelica', 'John', 'Johnson']

X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())
print(vectorizer.transform(['Angela']).toarray())

Recommended answer

There is an 'analyzer' parameter that does what you want.

According to the documentation:

analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
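To make the difference between 'char' and 'char_wb' concrete, here is a minimal sketch. The two-word input "to be" and the bigram-only ngram_range=(2, 2) are assumptions chosen purely for illustration; build_analyzer() is the same method the question's code already calls.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "to be"  # hypothetical two-word input, chosen only for illustration

# analyzer='char': bigrams slide over the whole string, including the
# space between the two words.
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()
char_ngrams = char_analyzer(text)

# analyzer='char_wb': each word is padded with one space on both sides and
# n-grams are taken word by word, so the word edges get their own n-grams.
wb_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2)).build_analyzer()
wb_ngrams = wb_analyzer(text)

print(char_ngrams)
print(wb_ngrams)
```

For a single word with no spaces, such as the names in the question, the only difference should be the extra space-padded n-grams that 'char_wb' adds at the start and end of the word.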

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
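The callable option can be sketched as follows. The helper name letter_ngrams and its min_n/max_n parameters are hypothetical, not part of sklearn; the helper is written so that the n-grams come out in the same order as the example in the question. Since the callable receives the raw input, it has to lowercase the text itself, and ngram_range is not applied.

```python
from sklearn.feature_extraction.text import CountVectorizer


def letter_ngrams(text, min_n=1, max_n=2):
    """Hypothetical helper: letter n-grams of length min_n..max_n, by start position."""
    text = text.lower()  # a callable analyzer gets raw text, so lowercase here
    grams = []
    for start in range(len(text)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


corpus = ["Angel", "Angelica", "John", "Johnson"]

# Passing the function as the analyzer replaces sklearn's own tokenization.
vectorizer = CountVectorizer(analyzer=letter_ngrams)
X = vectorizer.fit_transform(corpus)

print(letter_ngrams("Angela"))
print(sorted(vectorizer.vocabulary_))
```

A custom callable is only needed when the built-in 'char' and 'char_wb' analyzers do not fit, for example when a particular n-gram order or filtering rule matters.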

By default it is set to 'word', but you can change it.

Just do:

# token_pattern is only used when analyzer='word', so it has no effect here
vectorizer = CountVectorizer(ngram_range=(1, 100),
                             token_pattern=r"(?u)\b\w+\b",
                             analyzer='char')
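As a quick check of this suggestion, here is a self-contained sketch on the question's corpus. It assumes ngram_range=(1, 2), which matches the single letters and letter pairs in the question's expected output; the original (1, 100) would also emit every longer substring. The token_pattern argument is left out because it only applies to the 'word' analyzer.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Angel", "Angelica", "John", "Johnson"]

# Assumption: (1, 2) reproduces the letters and letter pairs from the question.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
vectorizer.fit(corpus)

# Inspect the character n-grams produced for a new word (lowercased by default).
analyze = vectorizer.build_analyzer()
angela_ngrams = analyze("Angela")
print(angela_ngrams)

# Count vector for the new word; n-grams not seen during fit are ignored.
counts = vectorizer.transform(["Angela"])
vocab = vectorizer.vocabulary_
print(counts[0, vocab["a"]])  # how often the letter 'a' occurs in 'Angela'
```

Note that the built-in analyzer lists all single letters first and then the pairs, so the order differs from the interleaved list in the question, although the n-grams themselves are the same.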
