How are Structured and Unstructured Data Distinguished?


Question

What are the differences between structured data and unstructured data? How does that difference affect the respective data mining approaches?

Recommended answer

The terms I am familiar with are structured and unstructured data (the same as in your question, except for the suffix).

I work with both types of data in machine learning, and I am not aware of any formal definition. However, I suspect that nearly everyone whose work requires a distinction between these two types of data has no trouble telling them apart.

Examples of structured data: the date/time at which an email was sent, whether it has an attachment, or the email sender. Unstructured data: the body of the email.

Is there a stable rule or set of rules to distinguish these two types of data? I think so. First, if you can build a parser for the data element, then it is structured.

Another rule of thumb is to look at the data type of the database field required to store the data. If it is a text type (for MySQL: TINYTEXT, TEXT, MEDIUMTEXT, or LONGTEXT; or, less likely, VARCHAR(255)), then the data is probably unstructured.
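As a rough illustration of this rule of thumb, the sketch below flags a column as probably unstructured based only on its declared MySQL type. The helper name `probably_unstructured` and the sample column names are hypothetical, and VARCHAR is deliberately left out of the set because the answer calls it the less likely case. It is a heuristic, not a definition.

```python
# Hypothetical helper illustrating the column-type rule of thumb.
# MySQL free-text types named in the answer; VARCHAR is excluded here
# because it is described as the less likely case (an assumption of this sketch).
TEXT_TYPES = {"TINYTEXT", "TEXT", "MEDIUMTEXT", "LONGTEXT"}


def probably_unstructured(column_type: str) -> bool:
    """Return True if the declared column type suggests free text."""
    # Drop any length/precision suffix such as "(255)" and normalise case.
    base = column_type.strip().upper().split("(")[0].strip()
    return base in TEXT_TYPES


# Illustrative schema for an email table (column names are made up).
columns = {
    "sent_at": "DATETIME",
    "has_attachment": "TINYINT(1)",
    "sender": "VARCHAR(255)",
    "body": "MEDIUMTEXT",
}

for name, col_type in columns.items():
    label = "probably unstructured" if probably_unstructured(col_type) else "structured"
    print(f"{name}: {label}")
```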

The principal significance of this distinction for data mining is probably this: structured data, once extracted from the document and parsed, can be used as variables in a statistical/machine learning model. Unstructured data, however, requires further parsing. That is, before you can use it in modeling, you first have to decompose it into a set of structured data elements, such as the number of words.

For instance, suppose you want to build a knowledge management (KM) system for a server group within a company that makes MMORPGs. You might begin with the massive collection of email messages exchanged between the members of this group.

So you create a data model for this source, composed of fields such as 'sender', 'recipient', 'date/time sent', whether the recipient and sender were both employees of the server group, whether the message was copied to others, and so on. The rows of the database are the individual emails.
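One way to picture such a data model is as a record type with one attribute per field. The sketch below is only an illustration: the class name `EmailRecord`, the attribute names, and the sample values are all assumptions, not something the answer specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EmailRecord:
    """One row of the hypothetical email table (field names are illustrative)."""
    sender: str
    recipient: str
    sent_at: datetime
    both_in_server_group: bool  # were sender and recipient both in the server group?
    was_copied: bool            # was anyone on the cc: line?
    body: str                   # the unstructured part, stored as-is for now


# A made-up example row.
row = EmailRecord(
    sender="alice@example.com",
    recipient="bob@example.com",
    sent_at=datetime(2011, 3, 16, 18, 45, 39, tzinfo=timezone.utc),
    both_in_server_group=True,
    was_copied=False,
    body="The restart finished without errors.",
)
print(row)
```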

Then you write a script composed of a set of parsers to extract each field from each email message. For many fields this is simple. For the 'cc:' field, for example, you write a parser to scan that portion of the email message and check whether it is empty: if it is, that field in your database for that row might be filled with 'False' (to indicate that no one was copied), otherwise with 'True'. The same goes for the date/time, which is probably in some form like 16 Mar 2011 18:45:39.0319 (UTC). Extracting and parsing this data is likewise straightforward; in fact, your scripting language almost certainly has a module to do it.
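In Python, for example, the standard library's `email` package can do most of this work. The following is a minimal sketch, assuming the messages are available as raw RFC 5322 text with a standard `Date:` header; the sample message, the function name `parse_structured_fields`, and the output keys are made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message; real input would come from a mailbox or mail archive.
RAW = """\
From: alice@example.com
To: bob@example.com
Date: Wed, 16 Mar 2011 18:45:39 +0000
Subject: Shard 7 restart

The restart finished without errors.
"""


def parse_structured_fields(raw: str) -> dict:
    """Extract the easily parsed (structured) fields from one raw email."""
    msg = message_from_string(raw)
    # A missing Cc header yields None; treat missing and empty the same way.
    cc = (msg.get("Cc") or "").strip()
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "was_copied": bool(cc),  # False when nobody was copied
        "sent_at": parsedate_to_datetime(msg["Date"]),
    }


fields = parse_structured_fields(RAW)
print(fields)
```

Dates with fractional seconds, like the example in the text, are not in the standard header format, so they would likely need their own parsing step.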

But when you get to the body of the email, things change: extracting it from the rest of the message is not difficult, but parsing it is not straightforward. Your data model might have fields for "NumberOfWords", "Keywords", etc., and it is simple to build a parser to populate those fields. The most useful information is harder to get at, though. Was the email message helpful to the recipient? What was the subject? Is it authoritative?
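The easy half of that step, filling "NumberOfWords" and "Keywords", might look like the sketch below. The tokenisation rule, the tiny stopword list, and the "most frequent non-stopword" notion of a keyword are all simplifying assumptions. The harder questions (helpfulness, topic, authority) are not addressed by anything this simple.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "was", "on", "for"}


def body_features(body: str, top_n: int = 3) -> dict:
    """Decompose a free-text body into a few structured fields."""
    # Crude tokenisation: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", body.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {
        "NumberOfWords": len(words),
        # "Keywords" here simply means the most frequent non-stopwords.
        "Keywords": [word for word, _ in counts.most_common(top_n)],
    }


features = body_features("The patch fixed the login queue. The queue drained after the patch.")
print(features)
```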
