Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words


Problem Description


I am making a chatbot using Python. Code:

import nltk
import numpy as np
import random
import string 
f=open('/home/hostbooks/ML/stewy/speech/chatbot.txt','r',errors = 'ignore')
raw=f.read()
raw=raw.lower()# converts to lowercase

sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences 
word_tokens = nltk.word_tokenize(raw)# converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()    

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey","hii")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]


def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robo_response=''
    sent_tokens.append(user_response)    

    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx=vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]    

    if(req_tfidf==0):
        robo_response=robo_response+"I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response

flag=True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response)!=None):
                print("ROBO: "+greeting(user_response))
            else:
                print("ROBO: ",end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("ROBO: Bye! take care..")

It is running well, but with every conversation it gives this warning:

/home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. 

Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.

These are some conversations from CMD:

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

what is india

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

ROBO: india's wildlife, which has traditionally been viewed with tolerance in india's culture, is supported among these forests, and elsewhere, in protected habitats.

what is chatbot

    /home/hostbooks/django1/myproject/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words. 'stop_words.' % sorted(inconsistent))

ROBO: a chatbot is a piece of software that conducts a conversation via auditory or textual methods.

Solution

The reason is that you have used a custom tokenizer together with the default stop_words='english', so while extracting features a check is made to see whether there is any inconsistency between stop_words and the tokenizer.
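
To see the inconsistency concretely, the same check can be reproduced by hand with the LemNormalize tokenizer from the question (a minimal sketch, assuming scikit-learn plus nltk with the punkt and wordnet data installed):

# Reproduction sketch: tokenize each built-in English stop word with the
# question's LemNormalize and collect tokens that fall outside the stop-word
# set -- these are the tokens the warning complains about.
import string
import nltk
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

lemmer = nltk.stem.WordNetLemmatizer()
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return [lemmer.lemmatize(token)
            for token in nltk.word_tokenize(text.lower().translate(remove_punct_dict))]

inconsistent = set()
for w in ENGLISH_STOP_WORDS:
    for token in LemNormalize(w):
        if token not in ENGLISH_STOP_WORDS:
            inconsistent.add(token)

print(sorted(inconsistent))   # expected to match the warning: ['ha', 'le', 'u', 'wa']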

If you dig deeper into the code of sklearn/feature_extraction/text.py you will find this snippet performing the consistency check:

def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words are were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))

As you can see, it raises a warning if an inconsistency is found.
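
The warning is harmless here, but if you want to get rid of it, one option (not from the original answer, just a sketch that assumes the LemNormalize function from the question is in scope) is to pass a stop-word list that has already been run through the same tokenizer, so that tokens such as 'ha' and 'wa' are themselves stop words:

# Workaround sketch: build a stop-word list consistent with the custom tokenizer
# by adding the lemmatized form of every built-in English stop word.
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

lemmatized_stop_words = set(ENGLISH_STOP_WORDS)
for w in ENGLISH_STOP_WORDS:
    lemmatized_stop_words.update(LemNormalize(w))

TfidfVec = TfidfVectorizer(tokenizer=LemNormalize,
                           stop_words=list(lemmatized_stop_words))

# Alternatively, since it is only a UserWarning, it can simply be silenced:
# import warnings
# warnings.filterwarnings('ignore', message='Your stop_words may be inconsistent')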

Hope it helps.
