NLP to classify/label the content of a sentence (Ruby binding necessary)


Question

I am analysing a few million emails. My aim is to be able to classify them into groups. Groups could be, for example:

  • Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
  • Customer service problems (slow email response time, impolite response, etc.)
  • Return issues (slow handling of return requests, lack of helpfulness from customer service, etc.)
  • Pricing complaints (hidden fees discovered, etc.)

In order to perform this classification, I need an NLP tool that can recognize combinations of word groups like:

  • "[they|the company|the firm|the website|the merchant]"
  • "[did not|didn't|no]"
  • "[response|respond|answer|reply]"
  • "[before the next day|fast enough|at all]"
  • etc.

A few of these example groups in combination should then match sentences like:

  • "They did not respond"
  • "They did not respond at all"
  • "There was no response at all"
  • "I received no response from the website"

The sentence should then be classified as a customer service problem.
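
To make the word-group idea concrete, here is a minimal sketch that uses plain regular expressions rather than an NLP library. It checks that a sentence contains at least one phrase from every group it is given. The class name `WordGroupMatcher` and its methods are hypothetical, and the group contents are taken from the lists above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: plain regex matching of word groups, not an NLP library.
class WordGroupMatcher {

    // Builds a case-insensitive pattern that matches any phrase of the group as whole words.
    static Pattern groupPattern(List<String> phrases) {
        String alternatives = phrases.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // True if every group has at least one phrase somewhere in the sentence.
    static boolean matchesAll(String sentence, List<List<String>> groups) {
        return groups.stream()
                .allMatch(group -> groupPattern(group).matcher(sentence).find());
    }

    public static void main(String[] args) {
        // Two of the groups from the question: a negation group and a response group.
        List<List<String>> noResponse = List.of(
                List.of("did not", "didn't", "no"),
                List.of("response", "respond", "answer", "reply"));

        System.out.println(matchesAll("They did not respond", noResponse));                    // true
        System.out.println(matchesAll("I received no response from the website", noResponse)); // true
        System.out.println(matchesAll("They replied the same day", noResponse));               // false
    }
}
```

A matcher like this ignores word order and inflection ("replied" does not match "reply"), which is one reason a trained classifier may scale better to millions of emails.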

Which NLP tool would be able to handle such a task? From what I have read, these are the most relevant:

  • Stanford CoreNLP
  • OpenNLP

Also check these suggested NLP tools.

Recommended answer

Using the OpenNLP doccat API, you can create training data and then train a model from it. The advantage of this over something like a naive Bayes classifier is that it returns a probability distribution over your set of categories.

So if you create a file with this format:

customerserviceproblems They did not respond
customerserviceproblems They didn't respond 
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website

and so on. Provide as many samples as possible, and make sure each line ends with a \n newline.
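
As a rough sketch of how such a file could be generated from labeled sentences, the helper below emits one line per sample in the format shown above: the category, a space, the text, and a trailing \n. The class name `TrainingFileWriter`, the `deliveryproblems` label, and its sample sentence are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that produces doccat-style training lines: "<category> <text>\n".
class TrainingFileWriter {

    static String formatSamples(Map<String, List<String>> samplesByCategory) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : samplesByCategory.entrySet()) {
            String category = entry.getKey();
            // The category is the first whitespace-delimited token, so it cannot contain whitespace.
            if (category.isEmpty() || category.chars().anyMatch(Character::isWhitespace)) {
                throw new IllegalArgumentException("Invalid category: '" + category + "'");
            }
            for (String text : entry.getValue()) {
                // Collapse line breaks and repeated spaces so each sample stays on one line.
                String oneLine = text.trim().replaceAll("\\s+", " ");
                if (oneLine.isEmpty()) {
                    continue; // skip blank samples
                }
                out.append(category).append(' ').append(oneLine).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> samples = new LinkedHashMap<>();
        samples.put("customerserviceproblems",
                List.of("They did not respond", "I received no response from the website"));
        samples.put("deliveryproblems", List.of("The parcel arrived late")); // made-up sample
        System.out.print(formatSamples(samples));
    }
}
```

The returned string can then be written to the training file, for example with `java.nio.file.Files.writeString` on Java 11 or later.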

Using this approach, you can add anything you want that means "customer service problems", and you can add any other categories as well, so you don't have to be too deterministic about what data falls into which category.

Here is what the Java looks like to build the model:

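// Uses the OpenNLP API as it was when this answer was written;
// newer releases may use different constructor and train() signatures.
import java.io.*;
import opennlp.tools.doccat.*;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

DoccatModel model = null;
// try-with-resources closes the training file even if training fails
try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove)) {
  // One training sample per line: "<category> <text>"
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
  model = DocumentCategorizerME.train("en", sampleStream);

  // Write the trained model to disk so it can be loaded later
  try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
    model.serialize(modelOut);
  }
  System.out.println("Model complete!");
} catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}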

Once you have the model, you can use it something like this:

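import java.io.File;
import java.util.HashMap;
import java.util.Map;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

// Load the serialized model once and reuse the categorizer for every text
DoccatModel doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
DocumentCategorizerME documentCategorizerME = new DocumentCategorizerME(doccatModel);

/**
 * Returns a map of each category to its score for the given text.
 *
 * @param text the text to categorize
 * @return category name mapped to its score
 * @throws Exception if categorization fails
 */
private Map<String, Double> getScore(String text) throws Exception {
  Map<String, Double> scoreMap = new HashMap<>();
  // One score per category, indexed by the categorizer's category index
  double[] categorize = documentCategorizerME.categorize(text);
  int catSize = documentCategorizerME.getNumberOfCategories();
  for (int i = 0; i < catSize; i++) {
    String category = documentCategorizerME.getCategory(i);
    scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
  }
  return scoreMap;
}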

The returned hash map then contains each category you modeled along with a score. You can use the scores to decide which category the input text belongs to.
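
One possible way to turn that map into a decision is to take the highest-scoring category and reject it when the score is too low. The sketch below assumes a map shaped like the one `getScore` returns. The class name `CategoryPicker`, the 0.5 threshold, and the example scores are illustrative only.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for choosing a category from a category-to-score map.
class CategoryPicker {

    // Returns the highest-scoring category, or empty if the map is empty
    // or the best score is below minScore.
    static Optional<String> bestCategory(Map<String, Double> scores, double minScore) {
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .filter(best -> best.getValue() >= minScore)
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Made-up scores standing in for the output of getScore(text)
        Map<String, Double> scores = Map.of(
                "customerserviceproblems", 0.81,
                "deliveryproblems", 0.12,
                "pricingcomplaint", 0.07);

        System.out.println(bestCategory(scores, 0.5)); // Optional[customerserviceproblems]
        System.out.println(bestCategory(scores, 0.9)); // Optional.empty
    }
}
```

Emails that fall below the threshold could be routed to an "unclassified" bucket and reviewed by hand, then added to the training file under the right category.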
