查找名义属性的所有二进制拆分 [英] Find All Binary Splits of a Nominal Attribute

查看:76
本文介绍了查找名义属性的所有二进制拆分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图基于仅具有名义属性的数据集从头开始在Python中构建一个二进制决策树分类器.

I'm trying to build a binary decision tree classifier in Python from scratch based on a data set that has only nominal attributes.

我坚持的第一步是找到所有可能的方法来计算名义属性的二进制拆分.例如,对于具有可能值[a,b,c,d]的属性,我正在寻找一种将这些值分成两个数组的方式,以便获得:

One step I'm stuck on is finding all possible ways to compute a binary split of a nominal attribute. For example, for an attribute with possible values [a, b, c, d], I am looking for a way to split these in two arrays such that we obtain:

  left     right
  ----     -----
  a        bcd
  b        acd
  c        abd
  d        abc
  ab       cd
  ac       bd
  ad       bc

没有重复的分割(例如,我们不需要left中的"bc"和right中的"ad",因为这将产生与left中的"ad"和left中的"bc"相同的二进制分割c1>).每个分割中的顺序也不相关(例如,"ad"与分割中一侧的"da"相同).

without duplicate splits (e.g. we don't need "bc" in left and "ad" in right since this would yield the same binary split as "ad" in left and "bc" in right). Order within each split is also irrelevant (e.g. "ad" is the same as "da" in one side of a split).

确切的术语在逃避我,但我认为这是某种形式的组合/排列问题.我知道这不是我想要的功能.我可以找到与我最相似的最接近的问题是在此处链接.

The exact terminology is escaping me, but I think this is some form of combination/permutation problem. I know its not quite a powerset I'm after. The closest question I could find similar to mine is linked here.

到目前为止,我已经开始了一个迭代过程:

So far I've started an iterative procedure:

for i in range(1, array.length / 2):
  for j in range(1, i):
     # stuck here

仅遍历属性可能值一半长度的底限(array)的原因是,如果我们在left中最多存储array.length / 2个值,则右边具有1 - (array.length / 2)个值,覆盖了所有可能的分裂.

The reason for looping only through the floor of half the length of the attribute's possible values (array) is because if we store up to array.length / 2 values in left, right has 1 - (array.length / 2) values, covering all possible splits.

此外,我听说过itertools库..也许有一种方法可以实现我的目标?

Also, I've heard of the itertools library .. so perhaps there's a way to achieve what I'm after with that?

推荐答案

我将使用itertools.product编写一个函数,该函数将序列分为两半的所有可能的除法.我会遍历整个过程,并使用一组删除重复项.

I would use itertools.product to write a function that splits a sequence into all possible divisions of two halves. I'd iterate through that and remove duplicates using a set.

import itertools

def binary_splits(seq):
    for result_indices in itertools.product((0,1), repeat=len(seq)):
        result = ([], [])
        for seq_index, result_index in enumerate(result_indices):
            result[result_index].append(seq[seq_index])
        #skip results where one of the sides is empty
        if not result[0] or not result[1]: continue
        #convert from list to tuple so we can hash it later
        yield map(tuple, result)

def binary_splits_no_dupes(seq):
    seen = set()
    for item in binary_splits(seq):
        key = tuple(sorted(item))
        if key in seen: continue
        yield key
        seen.add(key)

for left, right in binary_splits_no_dupes("abcd"):
    print left, right

结果:

('a', 'b', 'c') ('d',)
('a', 'b', 'd') ('c',)
('a', 'b') ('c', 'd')
('a', 'c', 'd') ('b',)
('a', 'c') ('b', 'd')
('a', 'd') ('b', 'c')
('a',) ('b', 'c', 'd')

这篇关于查找名义属性的所有二进制拆分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆