How to create a custom model using OpenNLP?


Question

I am trying to extract entities such as names and skills from documents using the OpenNLP Java API, but it is not extracting proper names. I am using the models available at the OpenNLP SourceForge link.

Here is a piece of Java code:

import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class tikaOpenIntro {

    public String Tokens[];

    public static void main(String[] args) throws IOException, SAXException,
            TikaException {

        tikaOpenIntro toi = new tikaOpenIntro();
        toi.filest("");               // clear the output file
        String cnt = toi.contentEx(); // PDF -> plain text via Tika
        toi.sentenceD(cnt);           // sentence boundary detection
        toi.tokenization(cnt);        // tokenize the raw text

        String names = toi.namefind(toi.Tokens);
        toi.files(names);             // append the found names to the output file
    }

    // Extracts plain text from the PDF using Apache Tika.
    public String contentEx() throws IOException, SAXException, TikaException {
        InputStream is = new BufferedInputStream(new FileInputStream(new File(
                "/home/rahul/Downloads/rahul.pdf")));
        // URL url = new URL("http://in.linkedin.com/in/rahulkulhari");
        // InputStream is = url.openStream();
        Parser ps = new AutoDetectParser(); // auto-detects a parser for the input
        BodyContentHandler bch = new BodyContentHandler();
        ps.parse(is, bch, new Metadata(), new ParseContext());
        is.close();
        return bch.toString();
    }

    // Appends a line to the output file.
    public void files(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                true);
        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st + "\n");
        bufferWritter.close();
    }

    // Overwrites the output file (used to clear it at startup).
    public void filest(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                false);
        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st);
        bufferWritter.close();
    }

    // Runs the pre-built person-name model over the tokens.
    public String namefind(String cnt[]) {
        String sd = "";
        try {
            InputStream is = new FileInputStream(
                    "/home/rahul/opennlp/model/en-ner-person.bin");
            TokenNameFinderModel tnf = new TokenNameFinderModel(is);
            NameFinderME nf = new NameFinderME(tnf);

            Span sp[] = nf.find(cnt);
            String a[] = Span.spansToStrings(sp, cnt);

            StringBuilder fd = new StringBuilder();
            for (int j = 0; j < a.length; j++) {
                fd.append(a[j] + "\n");
            }
            sd = fd.toString();
            is.close();
        } catch (InvalidFormatException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sd;
    }

    // Detects sentence boundaries; note that the result is currently discarded.
    public void sentenceD(String content) {
        try {
            InputStream om = new FileInputStream(
                    "/home/rahul/opennlp/model/en-sent.bin");
            SentenceModel sm = new SentenceModel(om);
            SentenceDetectorME sdm = new SentenceDetectorME(sm);
            String cnt[] = sdm.sentDetect(content);
            om.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Tokenizes the raw text and stores the tokens in the Tokens field.
    public void tokenization(String tokens) {
        try {
            InputStream is = new FileInputStream(
                    "/home/rahul/opennlp/model/en-token.bin");
            TokenizerModel tm = new TokenizerModel(is);
            Tokenizer tz = new TokenizerME(tm);
            Tokens = tz.tokenize(tokens);
            is.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

What I want to do is:

  • Use Apache Tika to convert the PDF document into plain text.
  • Run sentence boundary detection on the plain text.
  • Tokenize the text.
  • Extract named entities from the tokens.

But it is extracting other words along with names; it does not reliably extract proper names. Also, how can I create a custom model to extract skills such as Swimming or Programming from a document?

Any suggestions or help would be greatly appreciated!

Answer

It sounds like you're not happy with the performance of the pre-built name model for OpenNLP. But (a) models are never perfect, and even the best model will miss some things it should have caught and catch some things it should have missed; and (b) the model will perform best if the documents the model was trained on match the documents you're trying to tag, in genre and text style (so a model trained on mixed case text won't work very well on all-caps text, and a model trained on news articles won't work well on, say, tweets). You can try other publicly available tools, like the Stanford NE toolkit, or LingPipe; they may have better-performing models. But none of them are going to be perfect.

To create a custom model, you'll need to produce some training data. For OpenNLP, it would look something like

I have a Ph.D. in <START:skill> operations research <END>

For something as specific as this, you'd probably need to come up with that data yourself. And you'll need a lot of it; the OpenNLP documentation recommends about 15,000 example sentences. Consult the OpenNLP docs for more details.
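Once you have a training file in that format (one annotated sentence per line), the training itself can be sketched with the OpenNLP API. This is a minimal sketch, not tested against your data: the file name skills.train, the type string "skill", the output path, and the iteration/cutoff values (100 and 5, the library defaults) are assumptions, and the train signature shown is from the OpenNLP 1.5.x line, which the SourceForge models target; newer releases changed it to take a TrainingParameters and a TokenNameFinderFactory instead.

```java
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class SkillModelTrainer {

    public static void main(String[] args) throws IOException {
        // Training file (hypothetical path): one sentence per line, e.g.
        // I have a Ph.D. in <START:skill> operations research <END>
        // I enjoy <START:skill> swimming <END> and <START:skill> programming <END>
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("skills.train"), "UTF-8");
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // null feature generator = default features; 100 iterations, cutoff 5.
        TokenNameFinderModel model = NameFinderME.train("en", "skill", samples,
                null, Collections.<String, Object> emptyMap(), 100, 5);
        samples.close();

        // Serialize the model so it can be loaded later,
        // exactly like en-ner-person.bin is loaded in namefind().
        OutputStream out = new BufferedOutputStream(new FileOutputStream(
                "en-ner-skill.bin"));
        model.serialize(out);
        out.close();
    }
}
```

The OpenNLP manual also documents a command-line trainer that reads the same file format, along the lines of `opennlp TokenNameFinderTrainer -lang en -encoding UTF-8 -data skills.train -model en-ner-skill.bin`, which avoids writing the training code yourself.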
