How to "update" an existing Named Entity Recognition model - rather than creating from scratch?


Question

Please see the tutorial steps for OpenNLP Named Entity Recognition (link to tutorial). I am using the "en-ner-person.bin" model found here. The tutorial has instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data?

Say I have a list of 500 additional person names that are otherwise not recognized as persons. How do I generate a new model?

Recommended answer

Sorry it took me a while to put together a decent code example... The code below reads in your sentences and runs the default en-ner-person model over them to do the best it can. It then writes those results to two files: one for the good hits and one for the bad hits. Finally, I feed those files into the "modelbuilder-addon" call at the bottom.

To get the best results, run the class as is... then go into the known entities file and the blacklist file, and add and remove names. In other words, put names that the model did not find at all, but that you are aware of, into the knowns file, and remove bad names from the knowns file. Remove good names from the blacklist file and add them to the knowns file. Then run the model builder part again, without the first part that reads in all your data. It's OK to have duplicates in the knowns and blacklist files. If you have questions let me know... it's a bit complicated.

import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntityLinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ModelBuilderAddonUse {
  // Fill this method in with however you are going to get your data into a list
  // of sentences. For me, I am hitting a MySQL database through DocProvider
  // (not shown here).
  private static List<String> getSentencesFromSomewhere() throws Exception {
    List<String> sentences = new ArrayList<>();
    int counter = 0;
    DocProvider dp = new DocProvider();
    String modelPath = "c:\\apache\\entitylinker\\";
    EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
    Map<Long, List<String>> docs = dp.getDocs(properties);
    for (Long key : docs.keySet()) {
      System.out.println("\t\tDOC: " + key + "\n\n");
      sentences.addAll(docs.get(key));
      counter++;
      // stop once more than 1000 documents have been read
      if (counter > 1000) {
        break;
      }
    }
    return sentences;
  }

  public static void main(String[] args) throws Exception {
    /**
     * establish a file to put sentences in
     */
    File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");

    /**
     * establish a file to put your NER hits in (the ones you want to keep based
     * on prob)
     */
    File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");

    /**
     * establish a BLACKLIST file to put your bad NER hits in (also can be based
     * on prob)
     */
    File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");

    /**
     * establish a file to write your annotated sentences to
     */
    File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");

    /**
     * establish a file to write your model to
     */
    File theModel = new File("C:\\temp\\modelbuilder\\theModel");


//------------create a bunch of file writers to write your results and sentences to a file

    FileWriter sentenceWriter = new FileWriter(sentences, true);
    FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
    FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

//set some thresholds to decide where to write hits, you don't have to use these at all...
    double keeperThresh = .95;
    double blacklistThresh = .7;


    /**
     * Load your model as normal
     */
    TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
    NameFinderME personFinder = new NameFinderME(personModel);
    /**
     * do your normal NER on the sentences you have
     */
    for (String s : getSentencesFromSomewhere()) {
      sentenceWriter.write(s.trim() + "\n");
      sentenceWriter.flush();

      String[] tokens = s.split(" "); // better to use a tokenizer really
      Span[] find = personFinder.find(tokens);
      // probs(Span[]) gives one probability per span, so it lines up with names[];
      // the no-arg probs() is per token and would not match the index below
      double[] probs = personFinder.probs(find);
      String[] names = Span.spansToStrings(find, tokens);
      for (int i = 0; i < names.length; i++) {
        // YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
        if (probs[i] > keeperThresh) {
          knownEntityWriter.write(names[i].trim() + "\n");
        }
        if (probs[i] < blacklistThresh) {
          blacklistWriter.write(names[i].trim() + "\n");
        }
      }
      personFinder.clearAdaptiveData();
      blacklistWriter.flush();
      knownEntityWriter.flush();
    }
    //flush and close all the writers
    knownEntityWriter.flush();
    knownEntityWriter.close();
    sentenceWriter.flush();
    sentenceWriter.close();
    blacklistWriter.flush();
    blacklistWriter.close();

    /**
     * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
     * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
     */
    DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
            theModel, annotatedSentences, "person", 3);


  }
}

This is what the console should look like (I removed some lines for brevity here):

ITERATION: 0
    Perfoming Known Entity Annotation
        knowns: 625
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 7343 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 561755 events
    Indexing...  done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 127362
        Number of Outcomes: 3
      Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-617150.9462211537  0.015709695507828147
  2:  ... loglikelihood=-90520.86903515142  0.9771288195031642
  3:  ... loglikelihood=-56901.86905339755  0.9771288195031642
  4:  ... loglikelihood=-44231.80460317638  0.9773086131854634
  5:  ... loglikelihood=-37222.56576767385  0.9787985865724381
  6:  ... loglikelihood=-32900.5623814595   0.9801924326441243
  7:  ... loglikelihood=-29992.881445391187 0.9829747843810914
  8:  ... loglikelihood=-27893.341149419102 0.9836423351817073
  9:  ... loglikelihood=-26296.107313900917 0.9845092611547739
 10:  ... loglikelihood=-25033.501573153182 0.9850682236918229
 11:  ... loglikelihood=-24006.060636903556 0.9856182855515305
 12:  ... loglikelihood=-23150.856525607975 0.9859084476328649
 13:  ... loglikelihood=-22425.987337392176 0.9861897090368577
 14:  ... loglikelihood=-21802.386362016423 0.9864211266477378
 15:  ... loglikelihood=-21259.20580401235  0.9865208142339632
 16:  ... loglikelihood=-20781.0716762281   0.9867362106256287
 17:  ... loglikelihood=-20356.37732369309  0.986905323495118
 18:  ... loglikelihood=-19976.18228587008  0.9870673158227341
 19:  ... loglikelihood=-19633.47877575036  0.9872097266601988
 20:  ... loglikelihood=-19322.689448146353 0.9873165347882974
 21:  ... loglikelihood=-19039.31522510173  0.9874073216971812
 22:  ... loglikelihood=-18779.683112448918 0.9875176900962164
 23:  ... loglikelihood=-18540.76222439295  0.9876316187661881
 24:  ... loglikelihood=-18320.027315327916 0.9877081645913254
 25:  ... loglikelihood=-18115.35602743375  0.9877918309583359
 26:  ... loglikelihood=-17924.95047403401  0.9878612562416
 27:  ... loglikelihood=-17747.27665623459  0.9879378020667373
 28:  ... loglikelihood=-17581.01712643139  0.9879947664017231
 29:  ... loglikelihood=-17425.03361369085  0.9880784327687337
 30:  ... loglikelihood=-17278.3372262906   0.9881282765618463
 31:  ... loglikelihood=-17140.06447937828  0.9882012621160471
 32:  ... loglikelihood=-17009.45784626013  0.9882546661800963
 33:  ... loglikelihood=-16885.84985637711  0.9883187510569554
 34:  ... loglikelihood=-16768.64999916476  0.9883703749855364
 35:  ... loglikelihood=-16657.3338665414   0.9884166585077124
 36:  ... loglikelihood=-16551.434095577726 0.9884558214880153
 37:  ... loglikelihood=-16450.532769374073 0.9885074454165962
 38:  ... loglikelihood=-16354.255007222264 0.9885448282614306
 39:  ... loglikelihood=-16262.263530858221 0.9885733104289236
 40:  ... loglikelihood=-16174.254036589966 0.9886391754412511
 41:  ... loglikelihood=-16089.951236435176 0.9886765582860856
 42:  ... loglikelihood=-16009.105457548561 0.9887281822146665
 43:  ... loglikelihood=-15931.489709807445 0.988747763704818
 44:  ... loglikelihood=-15856.897147780543 0.9887798061432475
 45:  ... loglikelihood=-15785.138866385483 0.9888065081752722
 46:  ... loglikelihood=-15716.041980029182 0.9888349903427651
 47:  ... loglikelihood=-15649.447943527766 0.9888581321038531
 48:  ... loglikelihood=-15585.211079986258 0.9888901745422827
 49:  ... loglikelihood=-15523.19728647256  0.9889328977935221
 50:  ... loglikelihood=-15463.282892914636 0.9889595998255467
 51:  ... loglikelihood=-15405.353653492159 0.9889685005028883
 52:  ... loglikelihood=-15349.303852923775 0.9889809614511664
 53:  ... loglikelihood=-15295.035512678789 0.9889934223994445
 54:  ... loglikelihood=-15242.457684348112 0.989013003889596
 55:  ... loglikelihood=-15191.485819217298 0.9890236847024059
 56:  ... loglikelihood=-15142.041204645499 0.9890397059216206
 57:  ... loglikelihood=-15094.050459152337 0.9890539470053671
 58:  ... loglikelihood=-15047.445079207273 0.9890592874117721
 59:  ... loglikelihood=-15002.161031666768 0.9890753086309868
 60:  ... loglikelihood=-14958.13838658306  0.9890966702566065
 61:  ... loglikelihood=-14915.320985817205 0.9891180318822262
 62:  ... loglikelihood=-14873.656143433394 0.9891269325595677
 63:  ... loglikelihood=-14833.094374397517 0.9891500743206558
 64:  ... loglikelihood=-14793.589148498404 0.9891589749979973
 65:  ... loglikelihood=-14755.096666806796 0.9891785564881488
 66:  ... loglikelihood=-14717.5756582924   0.9891892373009586
 67:  ... loglikelihood=-14680.98719451864  0.9891892373009586
 68:  ... loglikelihood=-14645.294520562966 0.9891945777073635
 69:  ... loglikelihood=-14610.462900520715 0.9891999181137685
 70:  ... loglikelihood=-14576.45947616036  0.989214159197515
 71:  ... loglikelihood=-14543.25313742511  0.9892212797393881
 72:  ... loglikelihood=-14510.814403643026 0.9892230598748565
 73:  ... loglikelihood=-14479.115314429962 0.9892230598748565
 74:  ... loglikelihood=-14448.129329357815 0.9892426413650078
 75:  ... loglikelihood=-14417.831235594616 0.9892515420423494
 76:  ... loglikelihood=-14388.19706276905  0.9892622228551593
 77:  ... loglikelihood=-14359.204004414    0.9892711235325008
 78:  ... loglikelihood=-14330.8303454032   0.9892764639389058
 79:  ... loglikelihood=-14303.055394843146 0.9892764639389058
 80:  ... loglikelihood=-14275.859423957678 0.9892924851581205
 81:  ... loglikelihood=-14249.223608524193 0.9893013858354621
 82:  ... loglikelihood=-14223.129975482772 0.9893209673256135
 83:  ... loglikelihood=-14197.561353359844 0.9893263077320185
 84:  ... loglikelihood=-14172.50132620183  0.9893280878674867
 85:  ... loglikelihood=-14147.934190713178 0.9893263077320185
 86:  ... loglikelihood=-14123.84491635766  0.9893316481384233
 87:  ... loglikelihood=-14100.21910816809  0.9894313357246487
 88:  ... loglikelihood=-14077.042972066316 0.989433115860117
 89:  ... loglikelihood=-14054.303282478262 0.9894437966729268
 90:  ... loglikelihood=-14031.987352086799 0.9894580377566733
 91:  ... loglikelihood=-14010.083003539214 0.9894615980276099
 92:  ... loglikelihood=-13988.578542971209 0.9894776192468246
 93:  ... loglikelihood=-13967.46273521311  0.9894811795177613
 94:  ... loglikelihood=-13946.724780546094 0.9894829596532296
 95:  ... loglikelihood=-13926.354292898612 0.9894829596532296
 96:  ... loglikelihood=-13906.341279379953 0.9894900801951029
 97:  ... loglikelihood=-13886.676121050288 0.9894936404660395
 98:  ... loglikelihood=-13867.34955484593  0.9894954206015077
 99:  ... loglikelihood=-13848.35265657199  0.9894954206015077
100:  ... loglikelihood=-13829.676824889664 0.9894972007369761
    model generated
        model building complete.... 
        annotated sentences: 7343
    Performing NER with new model
        Printing NER Results. Add undesired results to the blacklist file and start over

//prints some names

    annotated sentences: 7369
        knowns: 651
ITERATION: 1
    Perfoming Known Entity Annotation
        knowns: 651
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 20370 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 1116781 events
    Indexing...  done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 288251
        Number of Outcomes: 3
      Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1226909.3303549637 0.03418485808766446
  2:  ... loglikelihood=-196688.7107544095  0.9622047653031346
  3:  ... loglikelihood=-138615.22912914792 0.9651462551744702
  4:  ... loglikelihood=-114777.09879832959 0.9697075791941303
  5:  ... loglikelihood=-101055.0229949508  0.9716443958126079
  6:  ... loglikelihood=-92253.8923255943   0.973049326591337
  7:  ... loglikelihood=-86146.35307405592  0.9750121107003074
  8:  ... loglikelihood=-81641.85792288609  0.975682788299586
  9:  ... loglikelihood=-78164.62963136223  0.9762594456746667
 10:  ... loglikelihood=-75386.40867917785  0.9767044747358703
 11:  ... loglikelihood=-73106.85371375803  0.9770590652957025
 12:  ... loglikelihood=-71196.60721959372  0.9774718588514668
 13:  ... loglikelihood=-69568.23683712543  0.9777279520335679
 14:  ... loglikelihood=-68160.39924327709  0.9779374828189233
 15:  ... loglikelihood=-66928.70260893498  0.9780914969004666
 16:  ... loglikelihood=-65840.17418566217  0.9782661058882628
 17:  ... loglikelihood=-64869.77222395241  0.9784040022170865
 18:  ... loglikelihood=-63998.109674075415 0.9785159310554173
 19:  ... loglikelihood=-63209.92394252923  0.9786475593692944
 20:  ... loglikelihood=-62493.02131098982  0.9787505339005589
 21:  ... loglikelihood=-61837.53211219312  0.9788597764467698
 22:  ... loglikelihood=-61235.37451190329  0.9789457377946079
 23:  ... loglikelihood=-60679.86146007204  0.9790003590677133
 24:  ... loglikelihood=-60165.407875448924 0.979062143786472
 25:  ... loglikelihood=-59687.30928567587  0.9791346736737104
 26:  ... loglikelihood=-59241.572255584455 0.979201830976709
 27:  ... loglikelihood=-58824.78291785096  0.9792698837104141
 28:  ... loglikelihood=-58434.00392167818  0.979333459290586
 29:  ... loglikelihood=-58066.69284046825  0.979381812548745
 30:  ... loglikelihood=-57720.63696783972  0.9794355383911438
 31:  ... loglikelihood=-57393.9007602091   0.9795089637090889
 32:  ... loglikelihood=-57084.78313293037  0.9795483626601814
 33:  ... loglikelihood=-56791.78250307578  0.9795743301506741
 34:  ... loglikelihood=-56513.567973701254 0.9796298468544863
 35:  ... loglikelihood=-56248.955425711436 0.9796808864047651
 36:  ... loglikelihood=-55996.887560355084 0.9797202853558576
 37:  ... loglikelihood=-55756.41714443519  0.9797543117227102
 38:  ... loglikelihood=-55526.69286884015  0.9797963969659226
 39:  ... loglikelihood=-55306.94735282102  0.9798152010107621
 40:  ... loglikelihood=-55096.48692031122  0.9798563908232679
 41:  ... loglikelihood=-54894.68284780714  0.9799029532200136
 42:  ... loglikelihood=-54700.963840494    0.9799378750175728
 43:  ... loglikelihood=-54514.80953871555  0.9799656333694788
 44:  ... loglikelihood=-54335.744892614406 0.9800005551670381
 45:  ... loglikelihood=-54163.33527156895  0.9800301043803574
 46:  ... loglikelihood=-53997.182198154995 0.9800551764401436
 47:  ... loglikelihood=-53836.91961491415  0.980082039361343
 48:  ... loglikelihood=-53682.210607423985 0.980112484005369
 49:  ... loglikelihood=-53532.74451955152  0.980140242357275
 50:  ... loglikelihood=-53388.23440690913  0.9801688961398878
 51:  ... loglikelihood=-53248.41478285541  0.9801921773382606
 52:  ... loglikelihood=-53113.03961847529  0.9802109813831001
 53:  ... loglikelihood=-52981.880563479055 0.9802351580121796
 54:  ... loglikelihood=-52854.7253600851   0.9802584392105524
 55:  ... loglikelihood=-52731.37642565477  0.9802727661018589
 56:  ... loglikelihood=-52611.64958353087  0.9803005244537649
 57:  ... loglikelihood=-52495.37292415569  0.9803148513450712
 58:  ... loglikelihood=-52382.38578113555  0.9803470868505105
 59:  ... loglikelihood=-52272.53780883427  0.9803748452024166
 60:  ... loglikelihood=-52165.68814994865  0.9803891720937229
 61:  ... loglikelihood=-52061.7046829472   0.9804043944157359
 62:  ... loglikelihood=-51960.46334051503  0.9804151395842157
 63:  ... loglikelihood=-51861.84749132724  0.9804393162132952
 64:  ... loglikelihood=-51765.74737831825  0.9804491659510683
 65:  ... loglikelihood=-51672.05960757943  0.9804634928423747
 66:  ... loglikelihood=-51580.686682513515 0.9804876694714542
 67:  ... loglikelihood=-51491.53657871175  0.9805046826548804
 68:  ... loglikelihood=-51404.52235540815  0.9805172186847735
 69:  ... loglikelihood=-51319.56179989248  0.9805315455760798
 70:  ... loglikelihood=-51236.577101627925 0.9805440816059728
 71:  ... loglikelihood=-51155.494553260556 0.9805584084972793
 72:  ... loglikelihood=-51076.24427590388  0.980569153665759
 73:  ... loglikelihood=-50998.75996642977  0.9805825851263587
 74:  ... loglikelihood=-50922.97866477339  0.9805951211562518
 75:  ... loglikelihood=-50848.84053937224  0.9806112389089714
 76:  ... loglikelihood=-50776.28868909037  0.9806264612309844
 77:  ... loglikelihood=-50705.2689602481   0.9806389972608774
 78:  ... loglikelihood=-50635.729777298875 0.9806470561372372
 79:  ... loglikelihood=-50567.62198610024  0.9806658601820769
 80:  ... loglikelihood=-50500.8987085974   0.9806685464741968
 81:  ... loglikelihood=-50435.51520800019  0.9806775007812633
 82:  ... loglikelihood=-50371.42876358994  0.9806837687962098
 83:  ... loglikelihood=-50308.59855431275  0.9806918276725697
 84:  ... loglikelihood=-50246.98555046764  0.9806989911182228
 85:  ... loglikelihood=-50186.55241287111  0.980703468271756
 86:  ... loglikelihood=-50127.26339882067  0.9807195860244757
 87:  ... loglikelihood=-50069.08427441567  0.9807312266236621
 88:  ... loglikelihood=-50011.9822326526   0.9807357037771953
 89:  ... loglikelihood=-49955.92581691934  0.9807446580842618
 90:  ... loglikelihood=-49900.88484943885  0.9807527169606216
 91:  ... loglikelihood=-49846.83036430355  0.9807634621291014
 92:  ... loglikelihood=-49793.734544757914 0.9807724164361679
 93:  ... loglikelihood=-49741.57066440427  0.9807786844511144
 94:  ... loglikelihood=-49690.31303207665  0.9807840570353543
 95:  ... loglikelihood=-49639.93694007888  0.9807948022038341
 96:  ... loglikelihood=-49590.418615580194 0.9808001747880739
 97:  ... loglikelihood=-49541.73517492774  0.9808073382337271
 98:  ... loglikelihood=-49493.86458067577  0.9808145016793803
 99:  ... loglikelihood=-49446.785601155134 0.9808234559864467
100:  ... loglikelihood=-49400.477772387036 0.9808359920163399
    model generated
        model building complete.... 
        annotated sentences: 20370
    Performing NER with new model


It will do this for each iteration until you see:
......
 97:  ... loglikelihood=-49140.50129715517  0.9808462362240823
 98:  ... loglikelihood=-49095.42289306763  0.9808641444693966
 99:  ... loglikelihood=-49051.095083380205 0.9808713077675223
100:  ... loglikelihood=-49007.49834809576  0.9808748894165852
    model generated

You can change the number of iterations if you see that the annotated sentences stop changing, and that the knowns stop changing on subsequent runs as you refine the lists.
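
The list editing described earlier (adding names the model missed to the knowns file, pulling bad names out of it, and moving good names out of the blacklist) could also be scripted rather than done by hand. Below is a minimal sketch of that idea, not something that ships with OpenNLP or the modelbuilder-addon. The class name EntityListCurator and the two extra input files (one of good names, such as a list of 500 additional person names, and one of bad names) are hypothetical. It assumes every file holds one name per line and that names are matched exactly, case included.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of OpenNLP or the modelbuilder-addon.
public class EntityListCurator {

  // Trim every line, drop blanks and duplicates, keep first-seen order.
  public static Set<String> normalize(List<String> lines) {
    Set<String> names = new LinkedHashSet<>();
    for (String line : lines) {
      String name = line.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  // knowns = (current knowns + good names) - bad names
  public static List<String> curateKnowns(List<String> knowns, List<String> good, List<String> bad) {
    Set<String> result = normalize(knowns);
    result.addAll(normalize(good));
    result.removeAll(normalize(bad));
    return new ArrayList<>(result);
  }

  // blacklist = (current blacklist + bad names) - good names
  public static List<String> curateBlacklist(List<String> blacklist, List<String> good, List<String> bad) {
    Set<String> result = normalize(blacklist);
    result.addAll(normalize(bad));
    result.removeAll(normalize(good));
    return new ArrayList<>(result);
  }

  private static List<String> read(String file) throws IOException {
    return Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 4) {
      System.err.println("usage: EntityListCurator <knowns> <blacklist> <goodNames> <badNames>");
      return;
    }
    List<String> good = read(args[2]);
    List<String> bad = read(args[3]);
    List<String> knowns = curateKnowns(read(args[0]), good, bad);
    List<String> blacklist = curateBlacklist(read(args[1]), good, bad);
    // Overwrite the knowns and blacklist files in place with the curated lists.
    Files.write(Paths.get(args[0]), knowns, StandardCharsets.UTF_8);
    Files.write(Paths.get(args[1]), blacklist, StandardCharsets.UTF_8);
  }
}
```

You would run this against the knowns and blacklist files after the first pass, then run only the model builder call again. Duplicates are removed here only to keep the files easy to read; as noted above, leaving them in is fine too.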

HTH
