Learning decision trees on huge datasets

Problem description

I'm trying to build a binary classification decision tree out of huge datasets (i.e. ones that cannot be stored in memory) using MATLAB. Essentially, what I'm doing is:

  1. Collect all the data
  2. Try out n decision functions on the data
  3. Pick out the best decision function to separate the classes within the data
  4. Split the original dataset into 2
  5. Recurse on the splits

The data has k attributes and a classification, so it is stored as a matrix with a huge number of rows and k+1 columns. The decision functions are boolean and act on the attributes, assigning each row to the left or right subtree.
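A boolean decision function of this kind is typically just a threshold test on a single attribute. A minimal Python sketch (the question uses MATLAB, but the idea carries over; the names and threshold form here are illustrative, not from the question):

```python
def make_threshold_decision(attr_index, threshold):
    """Build a boolean decision function over one attribute:
    True sends a row to the right subtree, False to the left."""
    def decide(attrs):
        return attrs[attr_index] > threshold
    return decide

# A row holds k attribute values; the class label is the extra column.
attrs = [0.2, 3.7, 1.5]               # k = 3 attributes
decide = make_threshold_decision(1, 2.0)
decide(attrs)                         # attrs[1] = 3.7 > 2.0 -> right subtree
```

A real learner would generate many such candidates (one per attribute and per candidate threshold) and score them all in a single pass over the data.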

Right now I'm considering storing the data in files, in chunks that can be held in memory, and assigning an ID to each row, so the decision to split is made by reading all the files sequentially and future splits are identified by the ID numbers.
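A sequential scoring pass of that kind could look like the following Python sketch (the question uses MATLAB; the `(attributes, label)` row format, the Gini-impurity scoring, and all function names are assumptions, not the asker's code):

```python
from collections import Counter

def score_decisions(rows, decision_fns):
    """One sequential pass over rows of (attributes, label) pairs,
    e.g. streamed chunk by chunk from disk. For every candidate
    decision function, accumulate per-class counts on the left and
    right side of its split."""
    counts = [(Counter(), Counter()) for _ in decision_fns]
    for attrs, label in rows:
        for (left, right), fn in zip(counts, decision_fns):
            (right if fn(attrs) else left)[label] += 1
    return counts

def gini(counter):
    """Gini impurity of a per-class count table."""
    n = sum(counter.values())
    return 1.0 - sum((c / n) ** 2 for c in counter.values()) if n else 0.0

def best_decision(counts):
    """Index of the decision function with the lowest weighted impurity."""
    def weighted(pair):
        left, right = pair
        nl, nr = sum(left.values()), sum(right.values())
        return (nl * gini(left) + nr * gini(right)) / (nl + nr)
    return min(range(len(counts)), key=lambda i: weighted(counts[i]))
```

Streaming the rows from disk then amounts to wrapping a file reader in a generator that yields one `(attributes, label)` pair at a time, so only one row (or one chunk) is ever in memory.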

Does anyone know how to do this in a better fashion?

The number of rows m is around 5e8 and k is around 500.

Answer

At each split, you are breaking the dataset into smaller and smaller subsets. Start with the single data file. Open it as a stream and just process one row at a time to figure out which attribute you want to split on. Once you have your first decision function, split the original data file into 2 smaller data files that each hold one branch of the split data. Recurse. The data files should become smaller and smaller until you can load them in memory. That way, you don't have to tag rows and keep jumping around in a huge data file.
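The streaming split described above can be sketched in Python (the answer is about MATLAB; the CSV row format with the label in the last column, and the function name, are assumptions):

```python
import csv

def split_file(src, left_path, right_path, decide):
    """Stream the parent data file one row at a time, writing each row
    to the left or right child file according to the chosen boolean
    decision function. Only one row is ever held in memory."""
    with open(src, newline="") as f, \
         open(left_path, "w", newline="") as lf, \
         open(right_path, "w", newline="") as rf:
        left, right = csv.writer(lf), csv.writer(rf)
        for row in csv.reader(f):
            attrs = [float(v) for v in row[:-1]]   # last column = label
            (right if decide(attrs) else left).writerow(row)
```

Recursing then just calls the same routine on each child file, switching to an ordinary in-memory tree builder once a child file is small enough to load.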
