Preparing XML data for the Apriori algorithm


Problem description

I want to apply the Apriori algorithm to XML documents. To prepare the input, I need to convert the XML data into the transaction/matrix form that the algorithm accepts (the implementations are written in C# and Java). So far I've tried converting the XML to a relational format and even to Excel, but that did not solve the problem. What is the best way to do this? Any suggestions?

Update: Sample record from data set

<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>

Recommended answer

You did not state your objective, but it involves text mining. Technically, text mining is the task of transforming unstructured text data into structured numerical data so that machine learning algorithms can be applied to large document databases. Converting text to numbers requires techniques for handling text at the level of individual words or characters, depending on the objective of the mining task.
Briefly, the process to prepare textual data for analysis involves:
1. Tokenization
2. Stemming
3. Stop-word removal
4. Indexing: represent the documents as a term-document matrix (TDM) using the "bag-of-words" approach.
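The first three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `crude_stem` is a naive suffix stripper standing in for a real stemming algorithm. The function names are hypothetical and do not come from any library.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes.

    This is a stand-in for a real stemmer, not an implementation of one.
    A suffix is removed only if at least four characters remain.
    """
    for suffix in ("ments", "ment", "ings", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


title = "An Evaluation of Object-Oriented DBMS Developments: 1994 Edition."
print(preprocess(title))
```

Applied to the title from the sample record, this yields a short list of normalized terms that can serve as items for the indexing step.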
The details are too long to explain here; for more, look up term frequency and inverse document frequency (TF-IDF).
Finally, the whole document corpus is turned into a TDM, to which the usual data mining techniques can then be applied to meet the mining objective.
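As a rough sketch of that last step for the XML in the question, the Python below turns each `<article>` into one transaction (a set of items) and then into a row of a binary matrix. Several things here are assumptions made only for illustration: the `<records>` wrapper element, the second (made-up) article, the choice of `author`, `journal`, `year` and title words as items, and the `tag=value` / `title:word` item naming. Rows are documents and columns are items, which is the transpose of a TDM.

```python
import re
import xml.etree.ElementTree as ET

# The first article is the sample record from the question. The <records>
# wrapper and the second article are invented here purely for illustration.
XML_DATA = """<records>
<article key="tr/gte/TR-0263-08-94-165">
<author>Frank Manola</author>
<title>An Evaluation of Object-Oriented DBMS Developments: 1994 Edition.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0263-08-94-165</volume>
<month>August</month>
<year>1994</year>
</article>
<article key="placeholder/1">
<author>Jane Doe</author>
<title>A Placeholder Title on DBMS</title>
<year>1994</year>
</article>
</records>"""

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"a", "an", "and", "of", "on", "the"}


def record_to_transaction(article):
    """Turn one <article> element into a set of items."""
    items = set()
    for child in article:
        value = (child.text or "").strip()
        if not value:
            continue
        if child.tag == "title":
            # Title words become individual items (bag-of-words).
            for token in re.findall(r"[a-z0-9]+", value.lower()):
                if token not in STOP_WORDS:
                    items.add("title:" + token)
        elif child.tag in ("author", "journal", "year"):
            # Categorical fields become a single tag=value item.
            items.add(child.tag + "=" + value)
    return items


def build_matrix(transactions):
    """Return the sorted item vocabulary and one 0/1 row per transaction."""
    vocabulary = sorted(set().union(*transactions))
    matrix = [[1 if item in t else 0 for item in vocabulary] for t in transactions]
    return vocabulary, matrix


root = ET.fromstring(XML_DATA)
transactions = [record_to_transaction(a) for a in root.iter("article")]
vocabulary, matrix = build_matrix(transactions)

for t in transactions:
    print(sorted(t))
print(vocabulary)
for row in matrix:
    print(row)
```

Either the item sets or the 0/1 rows can then be written out in whatever input format a particular Apriori implementation expects; that format depends on the implementation and is not covered here.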

