What does "document" mean in an NLP context?


Question

As I was reading about tf–idf on Wikipedia, I was confused by what the word "document" means. Does it mean a paragraph?

"The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient."

Recommended answer

A document in the tf-idf context can typically be thought of as a bag of words. In a vector space model, each word is a dimension in a very high-dimensional space, and a document is a vector whose magnitude along each word's dimension is the number of occurrences of that word (term) in the document. A document-term matrix is a matrix whose rows represent documents and whose columns represent terms, with each cell holding the number of occurrences of the term in the document. Hope that's clear.
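
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The toy corpus and the helper names (tokenize, document_term_matrix, inverse_document_frequency) are made up for illustration and are not from any library. Each string in the corpus is treated as one document; in practice a document is whatever unit of text the corpus is split into, which could be a paragraph, an article, or a single sentence. The IDF part follows the definition quoted in the question: the total number of documents divided by the number of documents containing the term, then the logarithm of that quotient.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string is one "document".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def document_term_matrix(docs):
    """Return (vocab, matrix): rows are documents, columns are terms,
    and each cell is the number of times the term occurs in the document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def inverse_document_frequency(vocab, matrix):
    """idf(term) = log(total documents / documents containing the term)."""
    n_docs = len(matrix)
    idf = {}
    for j, term in enumerate(vocab):
        # Document frequency: how many rows have a non-zero count in column j.
        df = sum(1 for row in matrix if row[j] > 0)
        idf[term] = math.log(n_docs / df)
    return idf


vocab, matrix = document_term_matrix(corpus)
idf = inverse_document_frequency(vocab, matrix)

for term in vocab:
    print(f"{term:5s} counts={[row[vocab.index(term)] for row in matrix]} idf={idf[term]:.3f}")
```

In this sketch a term such as "the", which appears in two of the three documents, gets a lower IDF than a term such as "cat", which appears in only one. Word order inside each document is discarded, which is what "bag of words" refers to.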
