Is there a limit on the number of classes in mllib NaiveBayes? Error calling model.save()

Problem Description

I am trying to train a model to predict the category of text input data. I am running into what seems to be numerical instability using the pyspark.ml.classification.NaiveBayes classifier on a bag-of-words when the number of classes is above a certain amount.

In my real world project, I have on the order of ~1bn records and ~50 classes. I am able to train my model and make predictions but I get an error when I try to save it using model.save(). Operationally, this is annoying since I have to retrain my model each time from scratch.

In trying to debug, I scaled my data down to ~10k rows and had the same issue trying to save. However, saving works fine if I reduce the number of class labels.

This leads me to believe that there is a limit to the number of labels. I am not able to reproduce my exact issues, but the code below is related. If I set num_labels to anything greater than 31, model.fit() throws an error.

My questions:

  1. Is there a limit to the number of classes in the mllib implementation of NaiveBayes?
  2. What could be some reasons that I am not able to save my model if I can successfully use it to make predictions?
  3. If there is indeed a limit, would it be possible to split my data into groups of smaller classes, train separate models, and combine?

Full Working Example

Create some dummy data.

I'm going to use nltk.corpus.comparative_sentences and nltk.corpus.sentence_polarity. Keep in mind that this is just an illustrative example with nonsense data - I'm not concerned with the performance of the fitted model.

import pandas as pd
from pyspark.sql.types import StringType

# create some dummy data
from nltk.corpus import comparative_sentences as cs, sentence_polarity as sp
df = pd.DataFrame(
    {
        'sentence': [" ".join(s) for s in cs.sents() + sp.sents()]
    }
)

# assign a 'category' to each row
num_labels = 31  # seems to be the upper limit
df['category'] = (df.index%num_labels).astype(str)

# make it into a spark dataframe
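# (sqlCtx is assumed to be the SQLContext/SparkSession already available in the PySpark shell)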
spark_df = sqlCtx.createDataFrame(df)

Data Preparation Pipeline

from pyspark.ml.feature import NGram, Tokenizer, StopWordsRemover
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vector

indexer = StringIndexer(inputCol='category', outputCol='label')
tokenizer = Tokenizer(inputCol="sentence", outputCol="sentence_tokens")
remove_stop_words = StopWordsRemover(inputCol="sentence_tokens", outputCol="filtered")
unigrammer = NGram(n=1, inputCol="filtered", outputCol="tokens") 
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens")
idf = IDF(inputCol="hashed_tokens", outputCol="tf_idf_tokens")

clean_up = VectorAssembler(inputCols=['tf_idf_tokens'], outputCol='features')

data_prep_pipe = Pipeline(
    stages=[indexer, tokenizer, remove_stop_words, unigrammer, hashingTF, idf, clean_up]
)
transformed = data_prep_pipe.fit(spark_df).transform(spark_df)
clean_data = transformed.select(['label','features'])

Train the model

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=12345)
model = nb.fit(training)
test_results = model.transform(testing)

Evaluate Model

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting label was: {}".format(acc))

On my machine, this prints:

Accuracy of model at predicting label was: 0.0305764788269

Error Message

If I change num_labels to 32 or higher, this is the error I get when I call model.fit():

Py4JJavaError: An error occurred while calling o1336.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 86.0 failed 4 times, most recent failure: Lost task 0.3 in stage 86.0 (TID 1984, someserver.somecompany.net, executor 22): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 7, required: 8 Serialization trace: values (org.apache.spark.ml.linalg.DenseVector). To avoid this, increase spark.kryoserializer.buffer.max value. ... ... blah blah blah more java stuff that goes on forever

Notes

  • In this example, if I add a feature for bigrams, the error happens if num_labels > 15. I wonder if it is coincidence that this is also 1 less than a power of 2.
  • In my real-world project, I also get an error when trying to call model.theta. (I don't think the errors themselves are meaningful - they are just the exceptions passed back from the java/scala methods.)

Solution

Hard limitations:

Number of features * Number of classes has to be lower than Integer.MAX_VALUE (2^31 - 1). You are nowhere near this value.

Soft limitations:

Theta matrix (conditional probabilities) is of size Number of features * Number of classes. Theta is stored both locally on the driver (as a part of the model) and serialized and sent to the workers. This means that all machines require at least enough memory to serialize or deserialize and store the result.
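To make this concrete, here is a back-of-the-envelope estimate (a sketch only, assuming dense double-precision storage, the default HashingTF.numFeatures, and the ~50 classes mentioned in the question; Spark's exact in-memory layout may differ):

# rough size of the theta matrix that has to be serialized
num_features = 2 ** 20      # default HashingTF.numFeatures
num_classes = 50            # approximate class count from the question
bytes_per_double = 8

theta_bytes = num_features * num_classes * bytes_per_double
print("theta is roughly {:.0f} MB".format(theta_bytes / 1024 ** 2))
# prints: theta is roughly 400 MB

A matrix of that size comfortably exceeds the default Kryo buffer limit, which is what the traceback complains about.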

Since you use the default setting for HashingTF.numFeatures (2^20), each additional class adds 262,144 values - not that much on its own, but it quickly adds up. Based on the partial traceback you've posted, it looks like the failing component is the Kryo serializer. The same traceback also suggests the solution, which is increasing spark.kryoserializer.buffer.max.
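A minimal sketch of that change when building the session (the 512m value is illustrative; size it to comfortably exceed the serialized model, and the same key can equally be passed via spark-submit --conf or spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nb-many-classes")                          # illustrative name
    .config("spark.kryoserializer.buffer.max", "512m")   # default is 64m
    .getOrCreate()
)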

You can also try using standard Java serialization by setting:

 spark.serializer org.apache.spark.serializer.JavaSerializer 

Since you use PySpark with pyspark.ml and pyspark.sql, it might be acceptable without a significant performance loss.
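As a sketch, the same setting in session-builder form (the config line above is the spark-defaults.conf form):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    .getOrCreate()
)

Note that serializer settings only take effect when the SparkContext is created, so they need to be set before the session starts rather than changed on a running application.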

Configuration aside, I would focus on the feature engineering component. Using a binary CountVectorizer (see the note about HashingTF below) with ChiSqSelector might provide one way to both increase interpretability and effectively reduce the number of features. You may also consider more sophisticated approaches: determining feature importances and applying Naive Bayes only to a subset of the data, more advanced text processing like lemmatization / stemming, or using some variant of an autoencoder to get a more compact vector representation.
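A minimal sketch of that idea, replacing the HashingTF / IDF stages from the question with a binary CountVectorizer followed by ChiSqSelector (vocabSize and numTopFeatures are illustrative placeholders, not tuned recommendations):

from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    Tokenizer, StopWordsRemover, StringIndexer, CountVectorizer, ChiSqSelector
)
from pyspark.ml.classification import NaiveBayes

indexer = StringIndexer(inputCol="category", outputCol="label")
tokenizer = Tokenizer(inputCol="sentence", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
# binary=True emits 0/1 indicators instead of raw term counts
vectorizer = CountVectorizer(inputCol="filtered", outputCol="counts",
                             binary=True, vocabSize=50000)
# keep only the terms most strongly associated with the label
selector = ChiSqSelector(numTopFeatures=5000, featuresCol="counts",
                         labelCol="label", outputCol="features")
nb = NaiveBayes()

pipe = Pipeline(stages=[indexer, tokenizer, remover, vectorizer, selector, nb])
# model = pipe.fit(spark_df)

With a selector in place, the theta matrix shrinks from numFeatures * numClasses to numTopFeatures * numClasses, which directly reduces the serialization pressure described above.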

Notes:

  • Please keep in mind that multinomial Naive Bayes considers only binary features. NaiveBayes will handle this internally, but I would still recommend using setBinary for clarity (see the sketch after this list).
  • Arguably, HashingTF is rather useless here. Hash collisions aside, the highly sparse and essentially meaningless features it produces make it a poor choice as a preprocessing step for NaiveBayes.
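For the setBinary point, a one-line sketch against the pipeline in the question (keeping HashingTF purely to illustrate the flag):

from pyspark.ml.feature import HashingTF

# binary term presence instead of raw term frequencies
hashingTF = HashingTF(inputCol="tokens", outputCol="hashed_tokens", binary=True)
# equivalently: HashingTF(inputCol="tokens", outputCol="hashed_tokens").setBinary(True)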
