How to automatically classify data in a large database


Problem Description


Hello,

I am supposed to take data from a Wikipedia dump, a Freebase dump, or DBpedia.
I am then supposed to write code that identifies what every datum in that database is, e.g. the name of a person or a business, an address, etc. It does not matter what language I write the code in, but I am only familiar with C, C++, Java and Python. Java is my preferred language.

Those databases have all types of data: titles, person names, addresses, social security numbers, phone numbers...

I have three questions:

1) Since I have used machine learning a lot, I have decided to use a machine learning approach.
I have started looking into WEKA, a Java machine learning toolbox. However, it has only a GPL license. Is there another toolbox that I can use in a commercial product?

2) The problem I am facing with a machine learning approach is that I don't know what features to use. All I can think of right now is: the length of the datum, the number of string characters it has, and the number of integer characters it has.
This is very little given all the types of data those databases have. Regular expressions do not seem to be a solution for this type of project.
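Simple surface features like the ones mentioned can at least be computed directly; here is a minimal sketch in Java. The feature set (length, letter count, digit count, token count) is illustrative only and is not tied to WEKA or any particular toolkit:

```java
import java.util.Arrays;

public class DatumFeatures {
    // Extract simple surface features from a raw datum string:
    // total length, letter count, digit count, whitespace-separated token count.
    public static double[] extract(String datum) {
        int digits = 0, letters = 0;
        for (char c : datum.toCharArray()) {
            if (Character.isDigit(c)) digits++;
            else if (Character.isLetter(c)) letters++;
        }
        String trimmed = datum.trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return new double[] { datum.length(), letters, digits, tokens };
    }

    public static void main(String[] args) {
        // → [17.0, 12.0, 3.0, 3.0]
        System.out.println(Arrays.toString(extract("221B Baker Street")));
    }
}
```

Feature vectors like this could then be fed to whatever classifier you settle on; richer features (digit/letter ratio, presence of "@", capitalization pattern) would likely be needed to separate addresses from phone numbers and names.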

3) Is there another approach I can use? I mean, is machine learning the only approach?

Thank you for your help.

Regards,

Herve

Recommended Answer

This stuff's way beyond me, but to get a discussion started, here's how I'd approach it...

* Build up a dictionary of basic words (you can pull a list from Project Gutenberg to get you started). Classify these as verbs, nouns, adjectives, etc.

* Read up on the syntax of sentences (e.g. sentence diagramming).

* Use this knowledge along with your dictionary to create a classification routine which can take a sentence and guess at a classification (verb, noun, etc.) based on a word's position within the sentence.

* The nouns are the bits you're interested in (i.e. names, addresses, etc.). Have another routine which you pass the sentence to if it contains an unknown noun and a keyword (named, called, he, she, lives at, etc.). This can then add it to your list of likely candidates if the position of the keywords relative to the new noun suggests that the noun is a name/address.

* Break the data from your source down into sentences, pass them to the routine, and pull back the results.

This will still be a very rough approach, but with a bit of tweaking I reckon it'll be OK for starters.
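The keyword-proximity step above can be sketched in a few lines of Java. The dictionary and keyword sets here are tiny hand-picked stand-ins for the real word lists you'd build from Project Gutenberg, and "adjacent to a keyword" stands in for a proper positional rule:

```java
import java.util.*;

public class NounCandidateFinder {
    // Toy dictionary of known function/common words; a real one would be
    // built from a large word list (e.g. Project Gutenberg texts).
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
        "the", "a", "is", "and", "in", "he", "she",
        "named", "called", "lives", "at"));
    // Keywords whose neighbours are likely names/addresses.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "named", "called", "he", "she", "lives", "at"));

    // Return unknown words adjacent to a keyword, as name/address candidates.
    public static List<String> candidates(String sentence) {
        String[] words = sentence.toLowerCase()
                                 .replaceAll("[^a-z0-9 ]", "")
                                 .split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            boolean unknown = !KNOWN.contains(words[i]);
            boolean nearKeyword =
                (i > 0 && KEYWORDS.contains(words[i - 1])) ||
                (i + 1 < words.length && KEYWORDS.contains(words[i + 1]));
            if (unknown && nearKeyword) out.add(words[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // → [mary, 10]
        System.out.println(candidates("She is called Mary and lives at 10 Downing Street"));
    }
}
```

Note how crude this is: "Downing Street" is missed because only the word right after "at" is flagged, which is exactly the kind of gap the tweaking mentioned above would have to address (e.g. extending a candidate across consecutive unknown words).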

Alternatively, check the web for videos and docs about the Wanderlust Natural Language project - I think they attempted something similar, but more advanced.

