How to use a created model with new data in Weka


Question

I'm running some tests in Weka. I hope someone can help me and that I can make myself clear.

Step 1: Label my data

@attribute text string
@attribute @@class@@ {derrota,empate,vitoria}

@data
'O Grêmio perdeu para o Cruzeiro por 1 a 0',derrota
'O Grêmio venceu o Palmeiras em um grande jogo de futebol, nesta quarta-feira na Arena',vitoria

Step 2: Build my model based on the tokenized data

After loading this file, I apply a StringToWordVector filter and save a new ARFF file with the tokenized words. Something like:

@attribute @@class@@ {derrota,empate,vitoria}
@attribute o numeric
@attribute grêmio numeric
@attribute perdeu numeric
@attribute venceu numeric
@ and so on .....

@data
{0 derrota, 1 1, 2 1, 3 1, 4 0, ...}
{0 vitoria, 1 1, 2 1, 3 0, 4 1, ...}

OK! Based on this ARFF file, I build my classifier model and save it.

Step 3: Test with "simulated new data"

If I want to test my model with "simulated new data", what I actually do is edit this last ARFF file and add a line like

{0 ?, 1 1, 2 1, 3 1, 4 0, ...}

Step 4 (my problem): How to test with really new data

So far so good. My problem is when I need to use this model with "really" new data. For example, suppose I have the string "O Grêmio caiu diante do Palmeiras". It has 4 new words that don't exist in my model and 2 that do.

How can I create an ARFF file with this new data that fits my model? (OK, I know that the 4 new words will not be present, but how can I work with this?)

After supplying a different test set, the following message appears: [screenshot not preserved]

Recommended answer

If you use Weka programmatically, you can do this fairly easily.

  • Create your training file (e.g. training.arff)
  • Create Instances from the training file: Instances trainingData = ..
  • Use StringToWordVector to transform your string attributes into a numeric representation:

Sample code:

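    import java.io.File;
    import weka.core.SelectedTag;
    import weka.core.stemmers.LovinsStemmer;
    import weka.core.stemmers.Stemmer;
    import weka.core.stopwords.WordsFromFile;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // useIdf, minTermFreq, minGrams, maxGrams and useStemmer are settings defined elsewhere.
    StringToWordVector filter = new StringToWordVector();
    filter.setWordsToKeep(1000000);       // high limit so that no word is dropped
    if (useIdf) {
        filter.setIDFTransform(true);
    }
    filter.setTFTransform(true);
    filter.setLowerCaseTokens(true);
    filter.setOutputWordCounts(true);     // word counts instead of 0/1 presence
    filter.setMinTermFreq(minTermFreq);
    filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));
    NGramTokenizer t = new NGramTokenizer();
    t.setNGramMaxSize(maxGrams);
    t.setNGramMinSize(minGrams);
    filter.setTokenizer(t);
    WordsFromFile stopwords = new WordsFromFile();
    stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
    filter.setStopwordsHandler(stopwords);
    if (useStemmer) {
        Stemmer s = new /*Iterated*/LovinsStemmer();
        filter.setStemmer(s);
    }
    // Declare the input structure; the dictionary is built from the first batch filtered.
    filter.setInputFormat(trainingData);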

  • Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
  • Select a classifier to create your model

Sample code for the LibLINEAR classifier:

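    // Requires the LibLINEAR package for Weka (weka.classifiers.functions.LibLINEAR).
    // trainingData must have its class index set before training.
    LibLINEAR liblinear = new LibLINEAR();
    // SVM type 0 is L2-regularized logistic regression.
    liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
    liblinear.setProbabilityEstimates(true);
    // liblinear.setBias(1); // default value
    Classifier cls = liblinear;
    cls.buildClassifier(trainingData);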
    

  • Save the model

Sample code:

    System.out.println("Saving the model...");
    // try-with-resources flushes and closes the stream even if writing fails
    try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path + "mymodel.model"))) {
        oos.writeObject(cls);
    }
      

  • Create a testing file (e.g. testing.arff)
  • Create Instances from the testing file: Instances testingData = ...
  • Load the classifier

Sample code:

    Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path + "mymodel.model");
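If prediction runs in a separate process, the preprocessing state has to travel with the model, because the classifier only understands the attribute layout it was trained on. The following is a minimal plain-Java sketch of that round trip using standard object serialization. The ModelBundle class and its fields are hypothetical stand-ins for a Weka classifier and its fitted filter (both of which are Serializable); it is not Weka code.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: keeps the model and its preprocessing state together.
final class ModelBundle implements Serializable {
    private static final long serialVersionUID = 1L;

    final ArrayList<String> vocabulary; // stand-in for the fitted filter
    final double[] weights;             // stand-in for the trained classifier

    ModelBundle(List<String> vocabulary, double[] weights) {
        this.vocabulary = new ArrayList<>(vocabulary);
        this.weights = weights.clone();
    }

    // Writes the bundle to a temporary file, reads it back, and returns the copy.
    static ModelBundle roundTrip(ModelBundle bundle) {
        Path file = null;
        try {
            file = Files.createTempFile("mymodel", ".bundle");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(bundle);
            }
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                return (ModelBundle) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        } finally {
            if (file != null) {
                file.toFile().delete(); // best-effort cleanup of the temporary file
            }
        }
    }

    public static void main(String[] args) {
        ModelBundle original = new ModelBundle(
                Arrays.asList("o", "perdeu", "venceu"), new double[] {0.0, -1.5, 1.5});
        ModelBundle restored = roundTrip(original);
        System.out.println(restored.vocabulary + " " + restored.weights.length);
    }
}
```

With Weka itself, the same idea would mean saving the fitted StringToWordVector alongside the classifier rather than the classifier alone.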
          

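  • Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to pass trainingData in this call: filter.setInputFormat(trainingData); This keeps the format of the training set, so words that are not in the training set are not added. Reusing the same filter object is the simplest option, because it has already built its dictionary from the training data.
  • Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);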
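To see why keeping the training format takes care of unseen words, here is a minimal plain-Java sketch of the idea (not Weka code). The vocabulary is frozen after training, and at prediction time any word outside it is simply dropped. The FixedVocabulary class and its methods are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace, which is cruder than what StringToWordVector does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: a bag-of-words encoder with a frozen vocabulary.
final class FixedVocabulary {
    // Maps each training word to its column index, in first-seen order.
    private final Map<String, Integer> index = new LinkedHashMap<>();

    // Learns the vocabulary from the training sentences only.
    FixedVocabulary(List<String> trainingSentences) {
        for (String sentence : trainingSentences) {
            for (String token : tokenize(sentence)) {
                index.putIfAbsent(token, index.size());
            }
        }
    }

    // Simplified tokenizer: lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Encodes a text as word counts over the training vocabulary.
    // Words that were never seen in training have no column and are skipped.
    int[] vectorize(String text) {
        int[] counts = new int[index.size()];
        for (String token : tokenize(text)) {
            Integer column = index.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "Gr\u00eamio" is the club name from the question, escaped to keep the source ASCII-only.
        FixedVocabulary vocab = new FixedVocabulary(Arrays.asList(
                "O Gr\u00eamio perdeu para o Cruzeiro",
                "O Gr\u00eamio venceu o Palmeiras"));
        int[] vector = vocab.vectorize("O Gr\u00eamio caiu diante do Palmeiras");
        System.out.println(Arrays.toString(vector)); // prints [1, 1, 0, 0, 0, 0, 1]
    }
}
```

The new sentence always produces a vector with exactly as many columns as the training vocabulary, which is what lets a previously trained model accept it.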

  • Classify!

Sample code:

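    for (int j = 0; j < testingData.numInstances(); j++) {
        // classifyInstance returns the index of the predicted class value
        double res = myCls.classifyInstance(testingData.get(j));
        // assumes the class index is set on testingData
        System.out.println(testingData.classAttribute().value((int) res));
    }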

1. I'm not sure whether this can be done through the GUI.
2. The save and load steps are optional.

After some digging in the Weka GUI, I think it is possible. In the Classify tab, set your testing set in the Supplied test set field. After that, your sets will normally be reported as incompatible. To fix this, click Yes in the dialog that follows, and you are good to go.
