What is the difference between labeled and unlabeled data?

Question

In this video, Sebastian Thrun says that supervised learning works with "labeled" data and unsupervised learning works with "unlabeled" data. What does he mean by this? Googling "labeled vs unlabeled data" returns a bunch of scholarly papers on the topic. I just want to know the basic difference.

Recommended answer

Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.
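
To make the distinction concrete, here is a minimal Python sketch. The `Example` class, the sample tweets, and the sentiment labels are all invented for illustration; the only point is that a labeled example is the same raw data with one extra piece of information attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One data point. `label` is None while the example is unlabeled."""
    data: str                     # the raw content (here, the text of a tweet)
    label: Optional[str] = None   # the tag/class, if someone has supplied one


# Unlabeled data: just the raw content, nothing else.
unlabeled = [
    Example("I love this phone"),
    Example("Worst service ever"),
]

# Hypothetical judgments supplied by a human annotator, one per tweet.
human_labels = ["positive", "negative"]

# Labeled data: the same raw content, augmented with a label.
labeled = [Example(e.data, lab) for e, lab in zip(unlabeled, human_labels)]

for e in labeled:
    print(f"{e.data!r} -> {e.label}")
```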

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, a machine learning model can be trained on it. New unlabeled data can then be presented to the trained model, which guesses or predicts a likely label for each piece of that unlabeled data.
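
As a rough sketch of that train-then-predict loop, the snippet below uses a nearest-centroid classifier, chosen only because it fits in a few lines (it is one simple model among many, not a recommendation). The feature vectors (height in meters, weight in kilograms) and the function names `fit_centroids` and `predict` are made up for this example.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """'Training': compute the mean feature vector of each label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


def predict(centroids, point):
    """Guess the label whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Invented labeled dataset: (height_m, weight_kg) with a human-supplied label.
labeled_points = [(1.6, 500.0), (1.7, 550.0), (1.4, 700.0), (1.5, 750.0)]
labels = ["horse", "horse", "cow", "cow"]

centroids = fit_centroids(labeled_points, labels)

# A new, unlabeled data point: the model predicts a likely label for it.
new_point = (1.65, 520.0)
guess = predict(centroids, new_point)
print(guess)
```

The prediction is only a guess based on the labeled examples the model has seen, which is why the quality and quantity of labels matter so much.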

There are many active areas of research in machine learning that are aimed at integrating unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, sets of unlabeled data where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.
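
One simple way to picture the semi-supervised idea is self-training: start from a few labeled points, label the unlabeled points the model is confident about, and repeat. The sketch below is only an illustration of that one technique on made-up one-dimensional data. The function name `self_train`, the `margin` confidence threshold, and the numbers are all assumptions, and it expects at least two distinct labels.

```python
def centroid(values):
    return sum(values) / len(values)


def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """labeled: dict mapping label -> list of floats; unlabeled: list of floats.

    Returns the enlarged labeled dict and the points left unlabeled.
    """
    labeled = {label: list(values) for label, values in labeled.items()}
    remaining = list(unlabeled)
    for _ in range(rounds):
        # Recompute one centroid per label from everything labeled so far.
        cents = {label: centroid(values) for label, values in labeled.items()}
        still_unsure = []
        for x in remaining:
            ranked = sorted(cents, key=lambda label: abs(cents[label] - x))
            best, second = ranked[0], ranked[1]
            # Accept the guess only if x is clearly closer to the best centroid.
            if abs(cents[second] - x) - abs(cents[best] - x) >= margin:
                labeled[best].append(x)
            else:
                still_unsure.append(x)
        if len(still_unsure) == len(remaining):
            break  # no progress this round, stop early
        remaining = still_unsure
    return labeled, remaining


# Two labeled points and four unlabeled ones (all values invented).
result, leftover = self_train({"low": [1.0], "high": [9.0]}, [2.0, 3.0, 8.0, 5.2])
print(result, leftover)
```

Points that sit between the two groups stay unlabeled rather than being forced into a class, which is one common design choice for keeping wrong guesses from contaminating the labeled set.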
