Are there any NLP tools for semantic parsing of languages other than English?

Question

I want to parse Malayalam (an Indian language) text corpora to develop a question answering system. Are there any NLP tools for semantic parsing of languages other than English?

Recommended answer

This might sound scary.

As far as I know, there is no free software question answering system you can study, not even a documented one.

Question answering has two parts:

  • understanding the question
  • looking up the answer in some preprocessed dataset (for instance wikidata.org)

Both steps require similar algorithms.

Vertical question answering pipeline

To implement a vertical question answering system, you will need to be able to parse Malayalam (and other Indian languages) at a high level, which means doing at least the following:

  • Split text into paragraphs, then into sentences, then into words. You must be able to tell where a sentence ends. For instance, depending on the language, sentences might not end with the same character, and abbreviations like "i.e." are not the end of a sentence; "I.B.M." is not three sentences, etc. You must also know how a sentence starts: in English it starts with an upper-case letter, but not every upper-case letter starts a sentence (proper nouns, for example, as in "Is Chomsky alive?").

  • Part-of-speech tagging: tell nouns from proper nouns, from verbs, etc.

  • Create a named entity recognizer: identify names of persons, organizations and locations, and expressions of time, quantities, monetary values, percentages, etc.

  • Build semantic dependency trees: e.g. which named entity does "she" or "he" refer to? Who is the subject of the sentence, what is the complement, etc.

  • Create a text generation tool. Once your program has understood the question and found a possible answer, it must format that answer in natural language.
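As a rough illustration of the first step, here is a minimal sentence splitter in Python that refuses to split after a known abbreviation. The terminator characters and the abbreviation list are assumptions made for this sketch; each language needs its own (some Indian scripts, for instance, use the danda "।" rather than a period), and a real system would need far more than this.

```python
import re

# Assumed sentence terminators; adjust per language and script.
TERMINATORS = ".?!।"

# Hypothetical abbreviation list; a real system needs one per language.
ABBREVIATIONS = {"i.e.", "e.g.", "etc.", "i.b.m."}

# A terminator counts only when followed by whitespace or end of text.
BOUNDARY = re.compile("[" + re.escape(TERMINATORS) + r"](?=\s|$)")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, skipping terminators that close an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        tokens = text[start:end].split()
        # If the last token is an abbreviation, the period belongs to it.
        if tokens and tokens[-1].lower() in ABBREVIATIONS:
            continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "I.B.M. is a company, i.e. a firm. Is Chomsky alive? Yes."
    for sentence in split_sentences(sample):
        print(sentence)
```

One known limitation of this sketch: an abbreviation that really does end a sentence ("... and so on, etc. Next sentence") is not split, which is exactly the kind of ambiguity that makes this step harder than it looks.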

Luckily, there is no shortage of examples of doing this for English that you can take inspiration from. If you want to work in Python, you will want to study spaCy (which aims to be a fast and current NLP library) and NLTK, which comes with a book.

Algorithms can be shared between languages.

A narrower approach

If you don't want to do all of these steps and only want to solve the sub-problem of answering questions, you need to simplify the problem and eliminate variables and unknowns:

You must build a database of facts that are already split and tagged, so that you can answer questions simply by running a SQL query. For instance, given the following fact tuple:

WHO:India WHAT:win WHAT:Cricket Championship WHEN:2015

Here, I simplify tagging to WHO, WHEN and WHAT.

This question is then easy to answer:

WHO:? WHAT:win WHAT:Cricket Championship WHEN:2015

that is:

Who won the Cricket Championship in 2015?
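The fact tuple and the WHO:? query can be sketched with Python's built-in sqlite3 module. The table layout and the column names (who, action, what, year) are assumptions made for illustration: the two WHAT slots of the tuple are split into an action column and a what column so that they can be matched separately. The second row is a made-up placeholder, not a real fact.

```python
import sqlite3

# In-memory database holding already split and tagged facts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (who TEXT, action TEXT, what TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        # The example fact tuple from the text.
        ("India", "win", "Cricket Championship", 2015),
        # Placeholder row, invented only to show that filtering works.
        ("Team A", "win", "Example Cup", 2014),
    ],
)


def answer_who(conn, action, what, year):
    """Fill the WHO:? slot: return every WHO matching the other slots."""
    rows = conn.execute(
        "SELECT who FROM facts WHERE action = ? AND what = ? AND year = ?",
        (action, what, year),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # WHO:? WHAT:win WHAT:Cricket Championship WHEN:2015
    print(answer_who(conn, "win", "Cricket Championship", 2015))
```

The query uses exact string matching, so the hard part stays where the answer puts it: turning free text into slots that match the stored values.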

Again, questions must be "predictable" and easy to parse. Other examples:

WHO did WHAT

WHAT is WHAT

WHEN WHAT WHO

This can work if you can recognize or parse a WHO and a WHEN, and guess what is a WHAT, in a sentence provided by the user. You can also simplify further and say that WHENs can only be four digits, i.e. years, and further constrain the kinds of question allowed in order to simplify the parsing part.
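To make the idea of a constrained, "predictable" question concrete, here is a minimal sketch that accepts exactly one English question shape and fills the slots, with WHEN restricted to a four-digit year. The regular expression, the slot name ACTION (used here for the first WHAT, since a dictionary cannot hold two WHAT keys) and the tiny lemma table are all hypothetical; a Malayalam version would need its own patterns.

```python
import re

# Hypothetical pattern for a single question shape:
#   "who <verb> [the] <what> in <year>?"
QUESTION = re.compile(
    r"^who\s+(?P<verb>\w+)\s+(?:the\s+)?(?P<what>.+?)\s+in\s+(?P<when>\d{4})\s*\?$",
    re.IGNORECASE,
)

# Hypothetical lookup from inflected verbs to the form stored with the facts.
LEMMAS = {"won": "win"}


def parse_question(question: str):
    """Return the slots of a recognized question, or None if it does not fit."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None
    verb = match.group("verb").lower()
    return {
        "WHO": None,  # the unknown slot to be answered
        "ACTION": LEMMAS.get(verb, verb),
        "WHAT": match.group("what"),
        "WHEN": int(match.group("when")),  # constrained to a 4-digit year
    }


if __name__ == "__main__":
    print(parse_question("who won the Cricket championship in 2015?"))
    print(parse_question("How are you?"))
```

Anything that does not fit the pattern is rejected rather than guessed at, which is the trade-off this approach makes: narrow coverage in exchange for reliable parsing.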

This will lead you to a program that can actually answer questions in a way that is more natural and more correct than an information retrieval (IR) system like raw ElasticSearch or PostgreSQL would.

Fact database

You will probably need to work with a semantic network. Look at the free ConceptNet (and send a message to the mailing list if you need help or want to contribute an Indian Wiktionary) or BabelNet. There is also WordNet.

Courses

I liked the Jurafsky course a lot; it has a specific chapter about QA. Jurafsky also wrote a full book introducing NLP.

Search tips

Search the World Wide Web for information about NLP algorithms in the language you are looking for information about. For a French lemmatizer, say, I would search on a French research portal, or through a search engine using native (i.e. French) wording. American search engines are not really as good in other languages as they are in English, so be prepared to page through results (and use the search tools as well).

Culture

Build a good culture in natural language processing and artificial intelligence. Look at summarization or information retrieval (it's easy): you will learn methods that can be reused for another problem. For instance, if you look at rule-based machine translation, you learn that industry used simplified, unambiguous natural language grammars to be able to translate documentation accurately. That documentation is written in a simple English (e.g. SUBJECT VERB NOUN) for which a computer grammar can easily be created (like the grammars of computer languages) and which can easily be translated, mostly word for word. This is an instance of solving a sub-problem to achieve higher quality, and it is how I came up with the narrow approach above.

Algorithms

Last but not least, most sub-problem solutions fall into one of the following three categories of algorithms:

  • Algebraic and graph theory: these try to make sense of the data and can explain their results, e.g. PageRank, SimRank, CoSimRank, logic programming.

  • Statistical: I compare this to thermodynamics, where basically "you get the problem solved but don't know why". This is what is called "machine learning", and it is mainly used in industry to solve problems that are actually narrow compared to NLP. Still, machine learning algorithms exist to solve natural language problems, e.g. topic modeling, and that is not the only example. Statistical programming is popular.

  • Hybrid: a mix of both methods.
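As one small example from the graph-theory category, here is a sketch of PageRank computed by power iteration in plain Python. The graph format (a dictionary mapping each node to the list of nodes it links to), the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration; every link target is assumed to also appear as a key.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    graph maps each node to the list of nodes it links to.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets the same base share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: "a" and "b" both link to "c", and "c" links back to "a".
    toy = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy).items()):
        print(node, round(score, 3))
```

The result stays explainable in the sense used above: a node's score can be traced back to which nodes link to it and how much rank each of them passes on.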

Read "On Chomsky and the Two Cultures of Statistical Learning" for more insight into the dichotomy and the research/engineering background.

General tips

You don't need to know and understand every algorithm and its scientific grounds, as long as you understand their limitations and how to use them.

Something I have figured out: even though I read mostly English, reading in my "native" language, French, widens my understanding.

Save the papers and resources you find; things come and go.
