How do I classify the words of a text as names, numbers, money, dates, etc.?


Problem description

I asked some questions about text mining a week ago. I was a bit confused then, and still am, but now I know what I want to do.

The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites.

What I want to do: I will split all the words on whitespace, and I want to classify each word, or group of words, into pre-defined categories such as names, numbers, phone, email, URL, date, money, temperature, etc.

What I know: I know about, or have at least heard of, Natural Language Processing, Named Entity Recognizer, POS tagging, Naive Bayes, HMM, training, and many other things used for classification. But there are several NLP libraries with different classifiers and ways to do this, and I don't know which to use or what to do.

What I need: a code example using a classifier, an NLP library, or whatever, that can classify each word of a text separately, rather than the text as a whole. Something like this:

// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Can somebody help me? I'm confused by the various APIs, classifiers and algorithms.

Recommended answer

You should try Apache OpenNLP. It is easy to use and customize.

If you are doing this for Portuguese, the project documentation explains how to do it using the Amazonia Corpus. The supported types are:

Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

  1. Download OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad into the apache-opennlp-1.5.1-incubating folder.

Run the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:

bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt

  • Train your model (change the -encoding value to match the encoding of the corpus.txt file, which should be your system default encoding; this command can take several minutes):

    bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
    

  • Run it from the command line (enter only one sentence at a time, with the tokens separated by spaces):

    $ bin/opennlp TokenNameFinder pt-ner.bin 
    Loading Token Name Finder model ... done (1,112s)
    Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
    Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
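
The tagged line printed by the tool can be turned into the per-word labels the question asks for. The following is a minimal sketch, assuming the output keeps the `<START:type> ... <END>` layout shown above. `TaggedOutputParser` and the `other` label for untagged tokens are hypothetical names chosen for this illustration and are not part of OpenNLP. The input is shortened from the sample sentence.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): turns tagged name-finder output into per-token labels. */
final class TaggedOutputParser {

    /** Label given to tokens outside any entity; an arbitrary choice for this sketch. */
    static final String OTHER = "other";

    /** Returns one {token, label} pair per whitespace-separated token of the tagged line. */
    static List<String[]> parse(String taggedLine) {
        List<String[]> result = new ArrayList<>();
        String current = OTHER;
        for (String piece : taggedLine.trim().split("\\s+")) {
            if (piece.startsWith("<START:") && piece.endsWith(">")) {
                // opening marker: remember the entity type for the tokens that follow
                current = piece.substring("<START:".length(), piece.length() - 1);
            } else if (piece.equals("<END>")) {
                // closing marker: following tokens are outside any entity again
                current = OTHER;
            } else if (!piece.isEmpty()) {
                result.add(new String[] {piece, current});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // shortened from the sample output above
        String tagged = "moro no <START:place> Brasil <END> . "
                + "tenho <START:numeric> 50 anos <END> .";
        for (String[] pair : parse(tagged)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Each token is printed with its label on one line, so tokens inside an entity carry its type and all other tokens carry `other`.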
    

  • Run it through the API:

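    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    // The statements below belong inside a method that declares "throws IOException".

    // load the trained model; it is declared outside the try block so it stays in scope afterwards
    InputStream modelIn = new FileInputStream("pt-ner.bin");
    TokenNameFinderModel model;
    try {
      model = new TokenNameFinderModel(modelIn);
    }
    finally {
      // always release the file handle, even if loading fails
      modelIn.close();
    }

    // load the name finder
    NameFinderME nameFinder = new NameFinderME(model);

    // pass the token array to the name finder
    String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};

    // the Span objects will show the start and end of each name, also the type
    Span[] nameSpans = nameFinder.find(toks);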
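
The spans cover token ranges, while the question asks for one label per word. A sketch of that conversion follows. To stay runnable without the library it uses `EntitySpan`, a hypothetical stand-in that mirrors the start index, exclusive end index and type that OpenNLP's `Span` exposes through `getStart()`, `getEnd()` and `getType()`. The `other` label and the example spans are likewise assumptions made for illustration.

```java
import java.util.Arrays;

/** Sketch only: expands entity spans into one label per token. */
final class SpanLabeler {

    /** Hypothetical stand-in for opennlp.tools.util.Span: token range [start, end) plus a type. */
    static final class EntitySpan {
        final int start;
        final int end;
        final String type;

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Returns an array with one label per token; tokens outside every span get "other". */
    static String[] labelTokens(int tokenCount, EntitySpan[] spans) {
        String[] labels = new String[tokenCount];
        Arrays.fill(labels, "other");
        for (EntitySpan span : spans) {
            // the end index is exclusive, so stop before it
            for (int i = span.start; i < span.end; i++) {
                labels[i] = span.type;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] toks = {"moro", "no", "Brasil", ".", "tenho", "50", "anos", "."};
        // illustrative spans, written by hand rather than produced by a model
        EntitySpan[] spans = {
            new EntitySpan(2, 3, "place"),
            new EntitySpan(5, 7, "numeric")
        };
        String[] labels = labelTokens(toks.length, spans);
        for (int i = 0; i < toks.length; i++) {
            System.out.println(toks[i] + "\t" + labels[i]);
        }
    }
}
```

With the real library, the same loop would read the bounds and type from each `Span` in `nameSpans` instead of from `EntitySpan`.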
    

  • To evaluate your model you can use 10-fold cross-validation (only available in 1.5.2-INCUBATOR; to use it today you need the SVN trunk; it can take several hours):

    bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
    

  • Improve precision and recall by using custom feature generation (see the documentation), for example by adding a name dictionary.
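
To make the precision and recall figures concrete, here is a small sketch of how they can be computed from exact span matches. It does not reproduce the evaluator bundled with OpenNLP. `SpanScores`, the `start-end:type` string encoding and the gold annotations are assumptions made for this illustration; the predicted spans follow the sample output above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch only: precision, recall and F1 over exactly matching entity spans. */
final class SpanScores {

    /** Returns {precision, recall, f1}; a span counts as correct only if range and type both match. */
    static double[] score(Set<String> gold, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        double precision = predicted.isEmpty() ? 0.0 : (double) correct.size() / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) correct.size() / gold.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // hypothetical gold annotation for the sample sentence, encoded as "start-end:type"
        Set<String> gold = new HashSet<>(Arrays.asList(
                "3-6:person", "9-10:place", "13-14:organization", "16-18:numeric"));
        // spans taken from the sample output: "Trabalho" and "Petrobras" were tagged abstract
        Set<String> predicted = new HashSet<>(Arrays.asList(
                "3-6:person", "9-10:place", "11-12:abstract", "13-14:abstract", "16-18:numeric"));
        double[] s = score(gold, predicted);
        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n", s[0], s[1], s[2]);
    }
}
```

Under these assumed gold labels, three of the five predicted spans are correct and three of the four gold spans are found, so a name dictionary that fixes the two `abstract` errors would raise both numbers.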
