Searching Natural Language Sentence Structure

Question

What's the best way to store and search a database of natural language sentence structure trees?

Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parses for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store the trees and their associated function names in a database, and then allow a user to search the database using natural language queries.

So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:

(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))
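Before a tree like this can be stored or searched, the bracketed text has to be turned into a data structure. The following is a minimal Python sketch, not OpenNLP's own API: the helper names `parse_tree` and `leaves` are made up for illustration, and the input is assumed to be well-formed Penn Treebank style bracket notation as shown above.

```python
import re

def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Children are either nested tuples (sub-constituents) or plain strings (words).
    Assumes balanced, well-formed input.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words of the sentence in order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

TREE = """(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))"""

tree = parse_tree(TREE)
print(tree[0], leaves(tree))
```

Nested tuples are only one possible in-memory form; they are used here because they are easy to walk recursively when inserting nodes into a database.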

If someone entered the query "How can I upload files?", equating to the tree:

(TOP
  (SBARQ
    (WHADVP (WRB How))
    (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
    (. ?)))

how would I store and query these trees in a SQL database?
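To make the question concrete, here is one possible layout, sketched with Python's built-in `sqlite3` module: an adjacency-list table with one row per tree node and a `parent_id` column. The table name `node`, its columns, the helper names `store` and `find_functions`, and the prefix matching on words (a crude stand-in for proper lemmatization) are all assumptions made for illustration, not an established design.

```python
import sqlite3

# Hand-built nested (label, children) tuples mirroring the parse of
# "This uploads files to a remote machine." shown above.
TREE = ("TOP", [("S", [
    ("NP", [("DT", ["This"])]),
    ("VP", [
        ("VBZ", ["uploads"]),
        ("NP", [("NNS", ["files"])]),
        ("PP", [("TO", ["to"]),
                ("NP", [("DT", ["a"]), ("JJ", ["remote"]), ("NN", ["machine"])])]),
    ]),
    (".", ["."]),
])])

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        func      TEXT NOT NULL,               -- function the doc string belongs to
        parent_id INTEGER REFERENCES node(id), -- NULL for the root
        label     TEXT NOT NULL,               -- constituent or part-of-speech tag
        word      TEXT                         -- lowercased word, only on leaf tags
    )""")

def store(func, node, parent_id=None):
    """Insert a tree recursively, one row per node."""
    label, children = node
    is_leaf_tag = len(children) == 1 and isinstance(children[0], str)
    word = children[0].lower() if is_leaf_tag else None
    cur = conn.execute(
        "INSERT INTO node (func, parent_id, label, word) VALUES (?, ?, ?, ?)",
        (func, parent_id, label, word))
    node_id = cur.lastrowid
    for child in children:
        if not isinstance(child, str):
            store(func, child, node_id)
    return node_id

# Find verb phrases whose direct verb child and direct noun-phrase child
# match the given word prefixes.
QUERY = """
    SELECT DISTINCT vp.func
    FROM node AS vp
    JOIN node AS verb ON verb.parent_id = vp.id
    JOIN node AS obj  ON obj.parent_id  = vp.id
    JOIN node AS noun ON noun.parent_id = obj.id
    WHERE vp.label = 'VP'
      AND verb.label LIKE 'VB%' AND verb.word LIKE ?
      AND obj.label = 'NP'      AND noun.word LIKE ?
"""

def find_functions(verb_prefix, noun_prefix):
    rows = conn.execute(QUERY, (verb_prefix + "%", noun_prefix + "%")).fetchall()
    return [row[0] for row in rows]

store("upload_files()", TREE)
print(find_functions("upload", "file"))
```

The self-joins only look one level down from the verb phrase, which is what makes the match structural rather than keyword-based. Deeper or variable-depth patterns would need recursive queries or a different encoding such as nested sets or materialized paths.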

I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.

And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so that I can weed out entries with similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but obviously describes completely different behavior.

Recommended answer

Relational databases cannot store knowledge in a natural way; what you actually need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data as triples <subject, predicate, object>, so your phrase would be stored as <upload_files(), upload, file>. There are many tools and methods for searching such KBs (Prolog, for example, is a language designed for this kind of querying). So all you have to do is translate sentences from natural language into KB triples or an ontology graph, translate the user's query into incomplete triples (your question would look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help with the translation; the rest depends on the specific techniques and technologies you decide to use.
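As a rough illustration of the triple idea, here is a small in-memory sketch in Python. It is not a real knowledge-base engine: the `Triple` type, the `match` and `subjects_matching_all` helpers, and every triple other than the one for `upload_files()` are hypothetical. `None` plays the role of the `?` wildcard in an incomplete triple.

```python
from typing import NamedTuple, Optional, Sequence

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

Pattern = Sequence[Optional[str]]  # None means "any value"

KB = [
    Triple("upload_files()", "upload", "file"),
    Triple("upload_files()", "target", "machine"),    # hypothetical extra fact
    Triple("check_uploaders()", "check", "machine"),  # hypothetical function
]

def match(pattern: Pattern, kb=KB):
    """Return all triples that agree with the pattern on every non-wildcard slot."""
    return [t for t in kb
            if all(p is None or p == v for p, v in zip(pattern, t))]

def subjects_matching_all(patterns: Sequence[Pattern], kb=KB):
    """Conjunctive query: subjects that satisfy every pattern."""
    subject_sets = [{t.subject for t in match(p, kb)} for p in patterns]
    return set.intersection(*subject_sets) if subject_sets else set()

# "How can I upload files?" as the incomplete triple <?, upload, file>
results = match((None, "upload", "file"))
print([t.subject for t in results])

# Conjunctive form: something that uploads files and targets a machine
print(subjects_matching_all([(None, "upload", "file"),
                             (None, "target", "machine")]))
```

A linear scan like this is only workable for small collections. A real triple store or a Prolog system would index the triples, but the query shape stays the same.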
