Detecting foreign words


Problem description


I am writing a script to detect words from language B in language A. The two languages are very similar and may have instances of the same words.


The code is here if you are interested in what I have so far: https://github.com/arashsa/language-detection.git


I will explain my method here: I create a list of bigrams in language B and a list of bigrams in language A (a small corpus in language B, a large corpus in language A). Then I remove all bigrams common to both. Then I go through the text in language A and, using the remaining bigrams, detect matches and store them in a file. However, this method finds many words that are common to both languages, and it also finds strange bigrams, such as the names of two countries adjacent to each other, and other anomalies.
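For concreteness, a minimal sketch of that bigram-difference approach might look like this, assuming NLTK; the file names, lowercasing, and tokenizer are illustrative choices, not taken from the linked repository:

```python
# Sketch of the bigram-difference method described above.
# Corpus file names are placeholders.
from nltk import bigrams
from nltk.tokenize import word_tokenize

def bigram_set(path):
    """Return the set of word bigrams found in a corpus file."""
    with open(path, encoding="utf-8") as f:
        return set(bigrams(word_tokenize(f.read().lower())))

bigrams_a = bigram_set("corpus_a.txt")  # large corpus, language A
bigrams_b = bigram_set("corpus_b.txt")  # small corpus, language B

# Bigrams unique to language B; scanning a language-A text for these
# is what produces the false positives described above.
b_only = bigrams_b - bigrams_a

with open("text_a.txt", encoding="utf-8") as f:
    text_bigrams = bigrams(word_tokenize(f.read().lower()))
hits = [bg for bg in text_bigrams if bg in b_only]
```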


Do any of you have suggestions, reading material, or NLP methods that I might use?

Recommended answer


If your method is returning words present in both languages and you only want words that exist in one language, you might want to create a list of one-grams in language A and one-grams in language B, and then remove the words that appear in both. After that, if you like, you can proceed with the bigram analysis.
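A sketch of that one-gram filtering step, again assuming NLTK; the file names are placeholders:

```python
# Sketch of the suggested one-gram (unigram) filtering step.
from nltk.tokenize import word_tokenize

def vocab(path):
    """Return the set of word types (one-grams) in a corpus file."""
    with open(path, encoding="utf-8") as f:
        return set(word_tokenize(f.read().lower()))

vocab_a = vocab("corpus_a.txt")
vocab_b = vocab("corpus_b.txt")

b_exclusive = vocab_b - vocab_a  # words that occur only in language B

# Restricting the later bigram analysis to bigrams built from
# b_exclusive keeps shared vocabulary from ever triggering a match.
```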


That said, there are some good tools in Python for language identification. I've found langid.py to be one of the best. It comes pre-trained with language classifiers for over 90 languages, and it is fairly easy to train for additional languages if you like. Here are the docs. There is also guess-language, but in my estimation it doesn't perform as well. Depending on how localized the bits of foreign language are, you could try chunking your texts at an appropriate level of granularity and running those chunks through (e.g.) langid's classifier.
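For example, a sentence-level pass with langid might look like the following; the sentence splitter, language codes, and file name are illustrative assumptions, not part of the answer:

```python
# Sketch of chunking a text by sentence and classifying each chunk
# with langid (pip install langid).
import langid
from nltk.tokenize import sent_tokenize

# Restricting the model to the two candidate languages sharpens its
# decisions; "no" and "da" stand in for languages A and B here.
langid.set_languages(["no", "da"])

with open("text_a.txt", encoding="utf-8") as f:
    for sentence in sent_tokenize(f.read()):
        lang, score = langid.classify(sentence)
        if lang == "da":  # flag chunks classified as language B
            print(score, sentence)
```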
