Outlier prediction with categorical data in Python's Scikit-Learn library


Question

I'm trying to make a prediction with my own output. I'm using Python's Scikit-learn library with Isolation Forest as the algorithm. I don't know what I'm doing wrong, but whenever I try to transform my input data I get an error. The error occurs on this line:

    input_par = encoder.transform(val)#ERROR

This is the error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I have also tried this, but I still get an error:

    input_par = encoder.transform([val])#ERROR

This is the error: ValueError: Specifying the columns using strings is only supported for pandas DataFrames

What am I doing wrong, and how can I fix this error? Also, should I use OneHotEncoder, LabelEncoder or CountVectorizer?

Here is my code:

import pandas as pd

from sklearn.ensemble import IsolationForest
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
num_data = [4, 1, 3, 2, 65, 3,3]

df = pd.DataFrame({'my text': textual_data,
                   'num data': num_data})
x = df

# Transform the features
encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
#encoder = ColumnTransformer(transformers=[('lab', LabelEncoder(), ['my text'])])

x = encoder.fit_transform(x)

isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
model = isolation_forest.fit(x)

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(val)#ERROR

    outlier = model.predict(input_par)
    #print(outlier)

    if outlier[0] == -1:
        print('Values', val, 'are outliers')

    else:
        print('Values', val, 'are not outliers')

I have also tried this:

list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

for val in list_of_val:

    input_par = encoder.transform(pd.DataFrame({'my text': val[0],
                                               'num data': val[1]}))

But I get this error:

ValueError: If using all scalar values, you must pass an index

Recommended answer

I will try to list some observations that you may find useful:

  • LabelEncoder can be used, for example, to transform non-numerical data into numerical labels. OneHotEncoder usually takes numerical or non-numerical data and converts it into, well, one-hot encodings. Both are usually used for preprocessing the "labels" (classes of a supervised learning problem).
  • As I understand it, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is only hardcoded or whether you want to generate this kind of connection somehow. If the latter is what you want, you cannot achieve it with the previously mentioned encoders, because you are fitting them on some data (which, in general, should be labels) and then trying to transform new, unrelated data (ValueError: y contains previously unseen labels). However, this can be fixed by setting the handle_unknown parameter of OneHotEncoder to 'ignore' (from the documentation: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you can achieve what you want with one of these encoders, keep in mind that this is not their main purpose.
  • I assume you are giving a high value to "negative" utterances (even though "wrong" doesn't correspond to 65 in your training data) and a small value to "positive" ones. If you assume you already know the integer for every utterance, you can train the model on what is considered "positive" examples and give "negative" examples (outliers) only in testing. You don't train an IsolationForest on both "positive" and "negative" examples; that would just be basic binary classification, which can be modelled with a Decision Tree, for example. An intuitive example of IsolationForest can be seen here. Below is the code for your problem:

import numpy as np
from sklearn.ensemble import IsolationForest

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
integer_connection = np.array([[n] for n in integer_connection])

# 'behaviour' is omitted because recent scikit-learn releases no longer accept it
isolation_forest = IsolationForest(contamination = 'auto')
# Fit on the integer column only (the array built above)
isolation_forest.fit(integer_connection)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]

text_vals = [d[0] for d in list_of_val]
numeric_vals = np.array([[d[1]] for d in list_of_val])

print(integer_connection, numeric_vals)

outliers = isolation_forest.predict(numeric_vals)
print(outliers)
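For the transform error in the question itself, a minimal sketch of one possible fix is shown here. It assumes the original ColumnTransformer setup is kept, passes the new rows as a 2-D DataFrame with the same column names used for fitting, and sets handle_unknown='ignore' as discussed above. The random_state and sparse_threshold=0 arguments are additions made only to keep the sketch reproducible and the output dense; they are not part of the question's code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OneHotEncoder

columns = ['my text', 'num data']
train = pd.DataFrame({
    'my text': ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good'],
    'num data': [4, 1, 3, 2, 65, 3, 3],
})

# handle_unknown='ignore' encodes unseen strings as an all-zero one-hot block
encoder = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['my text'])],
    remainder='passthrough',
    sparse_threshold=0,  # assumption: force a dense array for easy inspection
)
x_train = encoder.fit_transform(train)

# random_state is an assumption added for reproducibility
model = IsolationForest(contamination='auto', random_state=0).fit(x_train)

list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]

# Build a 2-D DataFrame whose columns match the ones used when fitting
new_rows = pd.DataFrame(list_of_val, columns=columns)
x_new = encoder.transform(new_rows)
predictions = model.predict(x_new)

for val, outlier in zip(list_of_val, predictions):
    status = 'are outliers' if outlier == -1 else 'are not outliers'
    print('Values', val, status)
```

Because none of the new utterances appear in the training data, their one-hot columns are all zeros, so the prediction here effectively depends on the numeric column alone.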

  • In general, I don't think your approach is right for outlier prediction on natural language utterances. For what you are trying to do in this specific example, I can recommend using word vector similarity from, for example, spaCy, or maybe a simple bag-of-words approach.
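As a rough illustration of the bag-of-words idea, the sketch below uses scikit-learn's CountVectorizer (one of the options raised in the question) to turn each utterance into word counts before fitting an IsolationForest. This is only one possible reading of "a simple bag of words approach"; the random_state value is an assumption added for reproducibility, and the training set is far too small for the scores to be meaningful.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good',
                'amazing', 'wrong', 'hi, how are you, are you doing good']

# Each utterance becomes a vector of word counts over the training vocabulary
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(textual_data)

# random_state is an assumption added for reproducibility
model = IsolationForest(contamination='auto', random_state=0).fit(x_train)

new_texts = ['good work', 'you are wrong', 'this was amazing']

# Words not seen during fitting are simply dropped by transform()
x_new = vectorizer.transform(new_texts)

scores = model.decision_function(x_new)  # lower scores are more anomalous
labels = model.predict(x_new)            # -1 for outliers, 1 for inliers

for text, score, label in zip(new_texts, scores, labels):
    print(text, round(float(score), 3), label)
```

Unlike one-hot encoding whole sentences, this representation lets new utterances share features with the training data whenever they share words.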

    If you don't care about any of these points and only want working code, here is my version of what you are trying to do:

    import numpy as np
    
    from sklearn.ensemble import IsolationForest
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    
    
    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
    
    
    encodings = {}
    
    num_data = [4, 1, 3, 2, 65, 3, 3]
    
    
    onehot_encoder = OneHotEncoder(handle_unknown='ignore')
    onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))
    
    for i, l in enumerate(onehots):
        original_label = (textual_data[i], num_data[i])
        encodings[original_label] = l
    
    print(encodings)
    
    # 'behaviour' is omitted because recent scikit-learn releases no longer accept it
    isolation_forest = IsolationForest(contamination = 'auto')
    model = isolation_forest.fit(onehots)
    
    list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
    
    
    test_encoded = onehot_encoder.transform(np.array(list_of_val))
    print(test_encoded)
    
    outliers = isolation_forest.predict(test_encoded)
    print(outliers)
    
    for i, outlier in enumerate(outliers):
        if outlier == -1:
            print('Values', list_of_val[i], 'are outliers')
    
        else:
            print('Values', list_of_val[i], 'are not outliers')
    
