How to treat numbers with decimals or with commas as one word in CountVectorizer

Problem description

I am cleaning text and then passing it to the CountVectorizer function to give me a count of how many times each word appears in the text. The problem is that it is treating 10,000x as two words (10 and 000x). Similarly for 5.00 it is treating 5 and 00 as two different words.

I tried the following:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["userna lightning strike megawaysnew release there's many "
          "ways win lightning strike megaways. start epic adventure today, seek "
          "mystery symbols, re-spins wild multipliers, mega spins gamble lead wins "
          "10,000x bet!"]
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer()

result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()

res_df45 = pd.DataFrame(result, columns=cols)

In the data frame, both "10" and "000x" are given a count of 1 but I need them to be treated as one word (10,000x). How can I do this?

Answer

The default regex pattern the tokenizer is using for the token_pattern parameter is:

token_pattern='(?u)\\b\\w\\w+\\b'

So a word is defined by a \b word boundary at the beginning and the end, with \w\w+ in between: one alphanumeric character followed by one or more alphanumeric characters. For the regex to be interpreted correctly, the backslashes have to be escaped as \\.
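
You can see this directly by running the default analyzer on a sample string (the string below is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that CountVectorizer applies to each document
analyzer = CountVectorizer().build_analyzer()
print(analyzer("wins 10,000x bet 5.00"))
# ['wins', '10', '000x', 'bet', '00'] -- the comma and the dot split the numbers,
# and the lone digit 5 is dropped because \w\w+ needs at least two characters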

So you could change the token pattern to:

token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'

Explanation: [\\.,]? allows for the optional appearance of a . or a ,. The regex for the first alphanumeric character \w has to be extended to \w+ to match numbers with more than one digit before the punctuation.
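
As a quick sanity check (sample strings of my own), the pattern can be tried with re.findall, which mirrors how the tokenizer applies token_pattern:

import re

# the proposed token_pattern, written as a raw string instead of doubled backslashes
pattern = r"\b(\w+[\.,]?\w+)\b"
print(re.findall(pattern, "wins 10,000x bet"))  # ['wins', '10,000x', 'bet']
print(re.findall(pattern, "price of 5.00"))     # ['price', 'of', '5.00']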

For your slightly adjusted example:

corpus=["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()
print(pd.DataFrame(result, columns = cols))

Output:

   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna  
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1  
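
A side note, assuming a newer scikit-learn: get_feature_names() was deprecated in 1.0 and removed in 1.2, so if the calls above raise an AttributeError, swap in its replacement:

cols = vectorizer.get_feature_names_out()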

Alternatively, you could modify your input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.

import re

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    # replace the decimal point between digits with an underscore
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    # remove commas standing between digits
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])
analyzer = CountVectorizer().build_analyzer()
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).todense()
cols = vectorizer.get_feature_names()
print(pd.DataFrame(result, columns=cols))

Output:

   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1   
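
If you would rather keep that substitution inside the vectorizer instead of mutating the corpus up front, a minimal sketch using CountVectorizer's preprocessor parameter (the helper name clean_numbers is my own):

import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_numbers(doc):
    # hypothetical helper applying the same substitutions per document;
    # a custom preprocessor replaces the default one, so lowercase explicitly
    doc = doc.lower()
    doc = re.sub(r"(\d+)\.(\d+)", r"\1_\2", doc)  # 2.5 -> 2_5
    return re.sub(r"(\d+),(\d+)", r"\1\2", doc)   # 10,000x -> 10000x

vectorizer = CountVectorizer(preprocessor=clean_numbers)
result = vectorizer.fit_transform(corpus).todense()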
