Best way to compare meaning of text documents?

Question

I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used TF-IDF with cosine similarity and other similarity measures, but these compare the documents at the word (or n-gram) level.

I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?

Recommended answer

This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization, and libraries like gensim, spaCy or pattern.

In my opinion, the most readily usable libraries out there, i.e. those with the highest return on investment (ROI) if you are a newbie, are the tools built around chatbots, which aim to extract structured data from natural language. That is the closest thing to "meaning". One example of a free software tool that does this is Rasa NLU (natural language understanding).

The drawback of such tools is that they work reasonably well, but only in the domain for which they were trained and prepared. In particular, they do not aim at comparing documents the way you want.

"I'm trying to find the best way to compare two text documents using AI"

You must come up with a more precise task and, from there, find out which technique applies best to your use case. Do you want to classify documents into predefined categories? Do you want to compute some similarity between two documents? Given an input document, do you want to find the most similar documents in a database? Do you want to extract important topics or keywords from the document? Do you want to summarize the document? If so, is it abstractive summarization or key phrase extraction?
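To make one of the tasks listed above concrete (given an input document, find the most similar document in a database), here is a minimal sketch in plain Python. It uses TF-IDF and cosine similarity, so it still compares documents at the word level and is only a baseline, not a way to capture meaning. The toy tokenizer, the smoothed IDF formula, and the function names (`tfidf_vectors`, `most_similar`) are assumptions made for this illustration.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    # Toy tokenizer (an assumption): lowercase, split on whitespace, strip punctuation.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so that very common terms keep a small positive weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, database):
    """Return (index of the best match in database, list of all scores)."""
    vectors = tfidf_vectors(database + [query])
    query_vector = vectors[-1]
    scores = [cosine(query_vector, v) for v in vectors[:-1]]
    best = max(range(len(database)), key=scores.__getitem__)
    return best, scores


database = [
    "The cat sat on the mat.",
    "Stock markets fell sharply today.",
    "A kitten rests on a rug.",
]
best, scores = most_similar("The cat is on the mat", database)
print(best, scores)
```

Note that the third document, which a human would call close in meaning to the query, only scores through the shared word "on". That is the word-level limitation described in the question.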

In particular, there is no software that can extract some kind of semantic fingerprint from an arbitrary document. Depending on the end goal, the way to achieve it might be completely different.

You must narrow down the precise goal you are trying to achieve. From there, you will be able to ask another question (or improve this one) that describes your goal precisely.

Text understanding is AI-complete. So just saying to the computer "tell me something about these two documents" doesn't work.

As others have said, word2vec and other word embeddings are tools for achieving many goals in NLP, but they are only a means to an end. You must define the input and output of the system you are trying to design before you can start working on the implementation.
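As an illustration of fixing the input and output first, the sketch below defines a system whose input is two strings and whose output is a single similarity score, and implements it by averaging word vectors. The 3-dimensional vectors in `TOY_EMBEDDINGS` are hand-written, hypothetical values used only for this example; real vectors would come from a trained model such as word2vec. Averaging is one simple choice among many, not a recommended or definitive method.

```python
import math

# Hypothetical, hand-written 3-dimensional embeddings (illustration only).
# Real vectors would come from a trained model such as word2vec.
TOY_EMBEDDINGS = {
    "cat": (0.9, 0.1, 0.0),
    "kitten": (0.85, 0.15, 0.0),
    "mat": (0.1, 0.9, 0.0),
    "rug": (0.15, 0.85, 0.0),
    "stock": (0.0, 0.1, 0.9),
    "market": (0.0, 0.05, 0.95),
}


def embed_document(text, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of the known words; return None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(doc_a, doc_b):
    """Input: two strings. Output: a float, 0.0 when either text has no known word."""
    vec_a, vec_b = embed_document(doc_a), embed_document(doc_b)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


print(document_similarity("cat mat", "kitten rug"))
print(document_similarity("cat mat", "stock market"))
```

With these toy vectors, "cat mat" and "kitten rug" score high even though they share no words, while "cat mat" and "stock market" score low. Whether such a score is useful depends entirely on the task you have defined.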

There are two other Stack Exchange communities that you might want to dig into:

  • Linguistics
  • Data Science
