Look up BernoulliNB Probability in Dataframe


Problem description

I have some training data (TRAIN) and some test data (TEST). Each row of each dataframe contains an observed class (X) and some binary columns (Y). BernoulliNB, fitted on the training data, predicts the probability of each class X given Y for the rows of the test data. I am trying to look up, for each row of the test data, the predicted probability of that row's observed class (Pr).

I used Antoine Zambelli's advice to fix the code:

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()

# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})

# Test Data
TEST  = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                      'Y1': [1,1,0,1,0,1,0,0,0],
                      'Y2': [1,0,1,0,1,0,1,0,1],
                      'Y3': [1,1,0,1,1,0,0,0,0],
                      'Y4': [1,1,0,1,1,0,0,0,0]})

# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0

# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
# Use the same column order as the training features
df_Te_Y = TEST.drop('X', axis=1)[df_Tr_Y.columns]

# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)

# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)

# Rename the columns after the classes of X
df_R.columns = BNB.classes_

df_S = df_R.join(TEST)

# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        # Column j is the class label, row i is the observation;
        # NaN if class j has no column (it was not in the training data).
        return df.get(j, {}).get(i, np.nan)
    return lu

df_S['Pr'] = [*map(get_lu(df_R), df_S.index, df_S.X)]

This seemed to work, giving me the result (df_S):

This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.

Recommended answer

OK, there are a couple of issues here. I have a full working example below, but first the issues. The main one is the assertion that "This correctly gives a "NaN" for the first 2 rows".

This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and that 5 is not in the training data, the algorithm doesn't know that. It only looks at the feature data and tries to predict the label from it. So it can't return nan (or 5, or anything else not in the training set); that nan comes from your own work going from df_R to df_S.
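
As a rough illustration of this point, the sketch below fits a BernoulliNB on a tiny made-up frame and checks that the probability columns and the predicted labels are limited to the classes seen during fitting. The frame contents and the column names `label`, `f1` and `f2` are assumptions chosen for the example, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 'label' is the class, 'f1' and 'f2' are binary features.
train = pd.DataFrame({'label': [1, 2, 3, 9], 'f1': [1, 1, 0, 0], 'f2': [1, 0, 0, 0]})
test = pd.DataFrame({'label': [5, 0, 1], 'f1': [1, 1, 0], 'f2': [1, 1, 0]})

features = ['f1', 'f2']
model = BernoulliNB().fit(train[features], train['label'])

# One probability column per class seen in training, and nothing else.
proba = pd.DataFrame(model.predict_proba(test[features]), columns=model.classes_)
known_classes = [int(c) for c in model.classes_]

# Predictions are always drawn from the known classes, whatever the test labels say.
predicted = [int(p) for p in model.predict(test[features])]

print(known_classes)
print(proba.sum(axis=1).tolist())  # each row sums to 1 over the known classes
print(predicted)
```

The test labels 5 and 0 never reach the model here, so any NaN associated with them can only be introduced by a later lookup step.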

This leads to the second issue, which is the line df_Te_Y = TEST.iloc[:, 1:]. That line should be df_Te_Y = TEST.iloc[:, 2:], so that it does not include the label data. The model is given label data only for the training set, and the predicted labels will only ever be drawn from the set of labels that appear in the training data.
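
One way to avoid this kind of positional slicing mistake is to select the features by name. The sketch below is only an illustration: `make_features` is a hypothetical helper, and filling columns that are absent from a frame with 0 is an assumption that mirrors what the question does for TRAIN.

```python
import pandas as pd

def make_features(df, label_col, feature_names):
    """Drop the label column and return the features in a fixed order.

    Columns listed in feature_names but absent from df are filled with 0
    (an assumption matching the question's handling of missing items).
    """
    features = df.drop(columns=label_col)
    return features.reindex(columns=feature_names, fill_value=0)

# Small frames in the question's naming: 'X' is the label, 'Y*' are features.
train = pd.DataFrame({'X': [1, 2, 3, 9], 'Y1': [1, 1, 0, 0], 'Y4': [1, 0, 0, 0]})
test = pd.DataFrame({'X': [5, 0], 'Y1': [1, 1], 'Y2': [1, 0], 'Y4': [1, 1]})

feature_names = ['Y1', 'Y2', 'Y4']
train_features = make_features(train, 'X', feature_names)
test_features = make_features(test, 'X', feature_names)

print(train_features)
print(test_features)
```

Because both frames pass through the same list of names, the label can never leak into the features and the training and test columns stay in the same order.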

Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.

from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd

BNB = BernoulliNB()

# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})

# Test Data
test_df  = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                      'X1': [1,1,0,1,0,1,0,0,0],
                      'X2': [1,0,1,0,1,0,1,0,1],
                      'X3': [1,1,0,1,1,0,0,0,0],
                      'X4': [1,1,0,1,1,0,0,0,0]})


X = train_df.drop('Y', axis=1)  # Known training data - all but 'Y' column.
Y = train_df['Y']  # Known training labels - just the 'Y' column.

X_te = test_df.drop('Y', axis=1)  # Test data.
Y_te = test_df['Y']  # Only used to measure accuracy of prediction - if desired.

Ar_R = BNB.fit(X, Y).predict_proba(X_te)  # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_  # Rename as per class labels.

# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)

predicted_labels = df_R.idxmax(axis=1).values  # For each row, take the column with the highest prob in that row.
print(predicted_labels)  # [1 1 3 1 3 2 3 3 3]

print(accuracy_score(Y_te, predicted_labels))  # Fraction of test labels predicted correctly.

print(BNB.fit(X, Y).predict(X_te))  # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_labels is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
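
If the probability of each row's observed label is still wanted (the Pr column from the question), one possible sketch is shown here. It reuses the data and the X/Y naming of the example above; `observed_class_proba` is a hypothetical helper, and returning NaN for labels that were never seen in training is a choice made in the lookup, not something the classifier produces.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

def observed_class_proba(proba, observed):
    """For each row, return the predicted probability of the observed label.

    proba: DataFrame with one column per known class (rows are observations).
    observed: labels in the same row order as proba.
    Labels with no column in proba (unseen in training) give NaN.
    """
    values = [proba.at[i, label] if label in proba.columns else np.nan
              for i, label in zip(proba.index, observed)]
    return pd.Series(values, index=proba.index, name='Pr')

train_df = pd.DataFrame({'Y': [1, 2, 3, 9], 'X1': [1, 1, 0, 0], 'X2': [0, 0, 0, 0],
                         'X3': [0, 0, 0, 0], 'X4': [1, 0, 0, 0]})
test_df = pd.DataFrame({'Y': [5, 0, 1, 1, 1, 2, 2, 2, 2],
                        'X1': [1, 1, 0, 1, 0, 1, 0, 0, 0],
                        'X2': [1, 0, 1, 0, 1, 0, 1, 0, 1],
                        'X3': [1, 1, 0, 1, 1, 0, 0, 0, 0],
                        'X4': [1, 1, 0, 1, 1, 0, 0, 0, 0]})

X, Y = train_df.drop(columns='Y'), train_df['Y']
X_te, Y_te = test_df.drop(columns='Y'), test_df['Y']

model = BernoulliNB().fit(X, Y)
df_R = pd.DataFrame(model.predict_proba(X_te), columns=model.classes_, index=X_te.index)

pr = observed_class_proba(df_R, Y_te)
print(test_df.assign(Pr=pr))
```

The first two rows come out as NaN only because their labels (5 and 0) have no column in df_R; the model itself still assigns those rows probabilities over the classes 1, 2, 3 and 9.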

I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.
