R Lime package for text data


Question

I was exploring the use of the R lime package on text datasets to explain black-box model predictions and came across an example: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

I was testing it on a restaurant review dataset but found that, for some observations, plot_features does not print all the features. I was wondering if anyone could offer advice or insight on why this happens, or recommend a different package to use. Any help is greatly appreciated, since not much about R lime can be found online. Thanks!

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

library(caret)
model <- train(Liked ~ ., data = training_set, method = "xgbTree")

######
#LIME#
######
library(lime)
explainer <- lime(training_set, model)
explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)

The output I do not want: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0

What I want (different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0

Answer

I could not open the links you provided for the dataset and the output. However, I used the same vignette you linked: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html. I use text2vec, as in the vignette, and the xgboost package for classification, and it works for me. To display more features, you may need to increase the value of n_features in the explain function; see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain.

library(lime)
library(xgboost)  # the classifier
library(text2vec) # used to build the BoW matrix

# load data
data(train_sentences, package = "lime")  # from lime 
data(test_sentences, package = "lime")   # from lime

# Tokenize data
get_matrix <- function(text) {
  it <- text2vec::itoken(text, progressbar = FALSE)

  # use the following lines instead if you want to prune the vocabulary
  # vocab <- create_vocabulary(it, ngram = c(1L, 1L)) %>%
  #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
  # vectorizer <- vocab_vectorizer(vocab)

  # hash_vectorizer() has no option to prune the vocabulary,
  # but it is very fast for big data
  vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 1L))
  text2vec::create_dtm(it, vectorizer = vectorizer)
}

# BoW matrix generation
# features should be the same for both dtm_train and dtm_test 
dtm_train <- get_matrix(train_sentences$text)
dtm_test  <- get_matrix(test_sentences$text) 

# xgboost for classification
param <- list(max_depth = 7,
              eta = 0.1,
              objective = "binary:logistic",
              eval_metric = "error",
              nthread = 1)

xgb_model <- xgboost::xgb.train(
  param,
  xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
  nrounds = 100
)

# prediction
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"

# Accuracy
print(mean(predictions == test_labels))

# what are the most important words for the predictions.
n_features <- 5 # number of features to display
sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
explainer <- lime::lime(sentence_to_explain, model = xgb_model,
                        preprocess = get_matrix)
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                             n_features = n_features)

# inspect the explanation (columns 2 to 9)
explanation[, 2:9]

# plot
lime::plot_features(explanation)

In your code, NAs are created in the following line when it is applied to the train_sentences dataset. Please check your code for this:

dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

Removing the levels argument, or changing levels to labels, works for me.
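As a minimal base-R illustration of why this matters (the values below are hypothetical; the point is that character values such as the class labels in train_sentences do not match numeric levels):

```r
# Character values that do not appear in `levels` are silently turned into NA:
x <- c("OWNX", "MISC", "OWNX")

bad  <- factor(x, levels = c(0, 1))  # every element becomes NA
good <- factor(x)                    # levels inferred from the data

# Or recode explicitly with labels, which produces no NAs:
also_good <- factor(x == "OWNX",
                    levels = c(FALSE, TRUE),
                    labels = c(0, 1))

print(bad)
print(good)
print(also_good)
```

If your target column really is 0/1 integers, factor(dataset$Liked, levels = c(0, 1)) is harmless; the NAs appear only when the values and the declared levels disagree.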

Please check your data structure and make sure your data is not a zero matrix because of those NAs, and that it is not too sparse. That can also cause the problem, since lime cannot find the top n features.
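A quick sanity check along these lines can be sketched on a toy matrix (in the question's code, the object to inspect would be the data frame built from the document-term matrix; the column names here are made up):

```r
# Toy document-term matrix: row 2 lost all of its terms during cleaning
m <- matrix(c(1, 0, 2,
              0, 0, 0,
              0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("food", "great", "service")))

sum(is.na(m))                  # NAs that would break training
empty_docs <- rowSums(m) == 0  # documents with no features left
which(empty_docs)              # lime has nothing to perturb for these rows
```

Rows that end up all-zero (for example, reviews whose every word was removed as a stop word or pruned by removeSparseTerms) give lime nothing to select features from.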

