How to determine the (natural) language of a document?

Problem description

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in.

Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is

  1. Java
  2. not free of charge for "semi-commercial" use

This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it, only one guess was correct. The other guesses were Swedish, English, Danish and French...

A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language, the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
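
A minimal sketch of that stop-word tally (in Python for illustration; the stop-word sets below are tiny samples I made up, not the full lists Lucene.Net ships with):

```python
import re

# Tiny illustrative stop-word samples; a real list would be much larger.
EN_STOPS = {"the", "and", "of", "to", "in", "is", "that", "it", "was", "for"}
DE_STOPS = {"der", "die", "das", "und", "ist", "ein", "eine", "zu", "mit", "nicht"}

def guess_language(text):
    """Return whichever language has more stop-word hits; 'unknown' on a tie."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    en_hits = sum(w in EN_STOPS for w in words)
    de_hits = sum(w in DE_STOPS for w in words)
    if en_hits == de_hits:
        return "unknown"   # the tie case the answer below points out
    return "en" if en_hits > de_hits else "de"

print(guess_language("Die Katze ist nicht im Haus."))   # -> "de"
print(guess_language("The cat is not in the house."))   # -> "en"
```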

Recommended answer

The problem with using a list of stop words is one of robustness. Stop word lists are basically a set of rules, one rule per word. Rule-based methods tend to be less robust to unseen data than statistical methods. Some problems you will encounter are documents that contain equal counts of stop words from each language, documents that have no stop words, documents that have stop words from the wrong language, etc. Rule-based methods can't do anything their rules don't specify.

One approach that doesn't require you to implement Naive Bayes or any other complicated math or machine-learning algorithm yourself is to count character bigrams and trigrams (depending on whether you have a lot or a little data to start with -- bigrams will work with less training data). Run the counts on a handful of documents (the more the better) of known source language and then construct an ordered list for each language by the number of counts. For example, English would have "th" as the most common bigram. With your ordered lists in hand, count the bigrams in a document you wish to classify and put them in order. Then go through each one and compare its location in the sorted unknown-document list to its rank in each of the training lists. Give each bigram a score for each language as

1 / ABS(RankInUnknown - RankInLanguage + 1).

Whichever language ends up with the highest score is the winner. It's simple, doesn't require a lot of coding, and doesn't require a lot of training data. Even better, you can keep adding data to it as you go on and it will improve. Plus, you don't have to hand-create a list of stop words and it won't fail just because there are no stop words in a document.
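
A concrete sketch of that scoring scheme, in Python with toy training strings. One caveat of mine: the formula as written can divide by zero when two ranks differ by exactly one, so this sketch moves the + 1 outside the abs(); the profile cap of 300 n-grams and the choice to score missing n-grams as 0 are also my assumptions, not part of the answer:

```python
from collections import Counter

def profile(text, n=2, top=300):
    """Rank the most frequent character n-grams (rank 0 = most common)."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

def score(unknown_text, training_profile, n=2):
    """Sum 1 / (|RankInUnknown - RankInLanguage| + 1) over shared n-grams."""
    unknown = profile(unknown_text, n)
    return sum(1.0 / (abs(rank - training_profile[gram]) + 1)
               for gram, rank in unknown.items()
               if gram in training_profile)

# Toy training data; real profiles would come from whole document collections.
en = profile("the quick brown fox jumps over the lazy dog " * 50)
de = profile("der schnelle braune fuchs springt ueber den faulen hund " * 50)

doc = "Der Hund schläft unter dem Tisch."
print("de" if score(doc, de) > score(doc, en) else "en")  # -> "de" here
```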

It will still be confused by documents that contain equal symmetrical bigram counts. If you can get enough training data, using trigrams will make this less likely. But using trigrams means you also need the unknown document to be longer. Really short documents may require you to drop down to single character (unigram) counts.
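
Because the sketch above parametrizes the n-gram length, switching to trigrams or down to unigrams is just a different n (reusing the toy helpers and doc from that sketch):

```python
# Trigrams need more training data but confuse less;
# unigrams (n=1) are the fallback for very short documents.
en3 = profile("the quick brown fox jumps over the lazy dog " * 50, n=3)
de3 = profile("der schnelle braune fuchs springt ueber den faulen hund " * 50, n=3)
print("de" if score(doc, de3, n=3) > score(doc, en3, n=3) else "en")
```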

All this said, you're going to have errors. There's no silver bullet. Combining methods and choosing the language that maximizes your confidence in each method may be the smartest thing to do.
