How to create a custom model using OpenNLP?

Question

I am trying to extract entities such as names and skills from documents using the OpenNLP Java API, but it is not extracting the proper names. I am using the model available at the opennlp sourceforge link.

Here is a piece of Java code:

import java.io.*;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class tikaOpenIntro {

    public static void main(String[] args) throws IOException, SAXException,
            TikaException {

        tikaOpenIntro toi = new tikaOpenIntro();
        toi.filest("");
        String cnt = toi.contentEx();
        toi.sentenceD(cnt);
        toi.tokenization(cnt);

        String names = toi.namefind(toi.Tokens);
        toi.files(names);

    }

    public String Tokens[];

    public String contentEx() throws IOException, SAXException, TikaException {
        InputStream is = new BufferedInputStream(new FileInputStream(new File(
                "/home/rahul/Downloads/rahul.pdf")));
        // URL url=new URL("http://in.linkedin.com/in/rahulkulhari");
        // InputStream is=url.openStream();
        Parser ps = new AutoDetectParser(); // auto-detects a parser for the input type

        BodyContentHandler bch = new BodyContentHandler();

        ps.parse(is, bch, new Metadata(), new ParseContext());

        return bch.toString();

    }

    public void files(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                true);
        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st + "\n");
        bufferWritter.close();
    }

    public void filest(String st) throws IOException {
        FileWriter fw = new FileWriter("/home/rahul/Documents/extrdata.txt",
                false);

        BufferedWriter bufferWritter = new BufferedWriter(fw);
        bufferWritter.write(st);
        bufferWritter.close();
    }

    public String namefind(String cnt[]) {
        InputStream is;
        TokenNameFinderModel tnf;
        NameFinderME nf;
        String sd = "";
        try {
            is = new FileInputStream(
                    "/home/rahul/opennlp/model/en-ner-person.bin");
            tnf = new TokenNameFinderModel(is);
            nf = new NameFinderME(tnf);

            Span sp[] = nf.find(cnt);

            String a[] = Span.spansToStrings(sp, cnt);
            StringBuilder fd = new StringBuilder();
            int l = a.length;

            for (int j = 0; j < l; j++) {
                fd = fd.append(a[j] + "\n");

            }
            sd = fd.toString();

        } catch (FileNotFoundException e) {

            e.printStackTrace();
        } catch (InvalidFormatException e) {

            e.printStackTrace();
        } catch (IOException e) {

            e.printStackTrace();
        }
        return sd;
    }


    public void sentenceD(String content) {
        String cnt[] = null;
        InputStream om;
        SentenceModel sm;
        SentenceDetectorME sdm;
        try {
            om = new FileInputStream("/home/rahul/opennlp/model/en-sent.bin");
            sm = new SentenceModel(om);
            sdm = new SentenceDetectorME(sm);
            cnt = sdm.sentDetect(content);

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public void tokenization(String tokens) {

        InputStream is;
        TokenizerModel tm;

        try {
            is = new FileInputStream("/home/rahul/opennlp/model/en-token.bin");
            tm = new TokenizerModel(is);
            Tokenizer tz = new TokenizerME(tm);
            Tokens = tz.tokenize(tokens);
            // System.out.println(Tokens[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

What I am trying to do:

  • I am using Apache Tika to convert the PDF document into a plain-text document.
  • I pass the plain-text document through sentence boundary detection.
  • After this, tokenization.
  • After this, named-entity extraction.

But it is extracting names along with other words; it does not extract only the proper names. Also, how do I create a custom model to extract skills from a document, such as Swimming, Programming, etc.?

Give me some ideas!

Any help would be appreciated!

Answer

It sounds like you're not happy with the performance of the pre-built name model for OpenNLP. But (a) models are never perfect, and even the best model will miss some things it should have caught and catch some things it should have missed; and (b) the model will perform best if the documents the model was trained on match the documents you're trying to tag, in genre and text style (so a model trained on mixed case text won't work very well on all-caps text, and a model trained on news articles won't work well on, say, tweets). You can try other publicly available tools, like the Stanford NE toolkit, or LingPipe; they may have better-performing models. But none of them are going to be perfect.

To create a custom model, you'll need to produce some training data. For OpenNLP, it would look something like

I have a Ph.D. in <START:skill> operations research <END>

For something as specific as this, you'd probably need to come up with that data yourself. And you'll need a lot of it; the OpenNLP documentation recommends about 15,000 example sentences. Consult the OpenNLP docs for more details.
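For reference, OpenNLP's name-finder training data is plain text with one whitespace-tokenized sentence per line, and each entity span wrapped in `<START:type>` / `<END>` tags. The file name and sample sentences below are made up for illustration; this is a small dependency-free sketch that writes a few lines in that format and sanity-checks that every `<START:skill>` tag has a matching `<END>` before you hand the file to OpenNLP's trainer:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SkillTrainingDataCheck {

    /** Returns true if START/END tags in a training line are balanced and never nested. */
    static boolean tagsBalanced(String sentence) {
        int open = 0;
        for (String tok : sentence.split("\\s+")) {
            if (tok.startsWith("<START:")) {
                if (open > 0) return false;   // nested span: not allowed
                open++;
            } else if (tok.equals("<END>")) {
                if (open == 0) return false;  // <END> without a matching <START:...>
                open--;
            }
        }
        return open == 0;                     // false if a span was left unclosed
    }

    public static void main(String[] args) throws IOException {
        // Hand-written sample lines in OpenNLP's training format:
        // one sentence per line, tokens separated by whitespace.
        List<String> samples = List.of(
                "I have a Ph.D. in <START:skill> operations research <END> .",
                "Rahul enjoys <START:skill> swimming <END> and <START:skill> programming <END> .",
                "He worked as a sales manager for two years .");

        Path train = Paths.get("skills.train"); // hypothetical file name
        Files.write(train, samples, StandardCharsets.UTF_8);

        for (String line : Files.readAllLines(train, StandardCharsets.UTF_8)) {
            System.out.println((tagsBalanced(line) ? "OK  " : "BAD ") + line);
        }
    }
}
```

Once you have a file like this (and, per the documentation, on the order of 15,000 such sentences), OpenNLP's command-line `TokenNameFinderTrainer` tool can build the model from it; check the manual for your OpenNLP version for the exact flags.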
