查找名义属性的所有二进制拆分 [英] Find All Binary Splits of a Nominal Attribute
问题描述
我试图基于仅具有名义属性的数据集从头开始在Python中构建一个二进制决策树分类器.
I'm trying to build a binary decision tree classifier in Python from scratch based on a data set that has only nominal attributes.
我坚持的第一步是找到所有可能的方法来计算名义属性的二进制拆分.例如,对于具有可能值[a,b,c,d]的属性,我正在寻找一种将这些值分成两个数组的方式,以便获得:
One step I'm stuck on is finding all possible ways to compute a binary split of a nominal attribute. For example, for an attribute with possible values [a, b, c, d], I am looking for a way to split these in two arrays such that we obtain:
left right
---- -----
a bcd
b acd
c abd
d abc
ab cd
ac bd
ad bc
没有重复的分割(例如,我们不需要left
中的"bc"和right
中的"ad",因为这将产生与left
中的"ad"和left
中的"bc"相同的二进制分割c1>).每个分割中的顺序也不相关(例如,"ad"与分割中一侧的"da"相同).
without duplicate splits (e.g. we don't need "bc" in left
and "ad" in right
since this would yield the same binary split as "ad" in left
and "bc" in right
). Order within each split is also irrelevant (e.g. "ad" is the same as "da" in one side of a split).
确切的术语在逃避我,但我认为这是某种形式的组合/排列问题.我知道这不是我想要的功能.我可以找到与我最相似的最接近的问题是在此处链接.
The exact terminology is escaping me, but I think this is some form of combination/permutation problem. I know its not quite a powerset I'm after. The closest question I could find similar to mine is linked here.
到目前为止,我已经开始了一个迭代过程:
So far I've started an iterative procedure:
for i in range(1, array.length / 2):
for j in range(1, i):
# stuck here
仅遍历属性可能值一半长度的底限(array
)的原因是,如果我们在left
中最多存储array.length / 2
个值,则右边具有1 - (array.length / 2)
个值,覆盖了所有可能的分裂.
The reason for looping only through the floor of half the length of the attribute's possible values (array
) is because if we store up to array.length / 2
values in left
, right has 1 - (array.length / 2)
values, covering all possible splits.
此外,我听说过itertools
库..也许有一种方法可以实现我的目标?
Also, I've heard of the itertools
library .. so perhaps there's a way to achieve what I'm after with that?
推荐答案
我将使用itertools.product
编写一个函数,该函数将序列分为两半的所有可能的除法.我会遍历整个过程,并使用一组删除重复项.
I would use itertools.product
to write a function that splits a sequence into all possible divisions of two halves. I'd iterate through that and remove duplicates using a set.
import itertools
def binary_splits(seq):
for result_indices in itertools.product((0,1), repeat=len(seq)):
result = ([], [])
for seq_index, result_index in enumerate(result_indices):
result[result_index].append(seq[seq_index])
#skip results where one of the sides is empty
if not result[0] or not result[1]: continue
#convert from list to tuple so we can hash it later
yield map(tuple, result)
def binary_splits_no_dupes(seq):
seen = set()
for item in binary_splits(seq):
key = tuple(sorted(item))
if key in seen: continue
yield key
seen.add(key)
for left, right in binary_splits_no_dupes("abcd"):
print left, right
结果:
('a', 'b', 'c') ('d',)
('a', 'b', 'd') ('c',)
('a', 'b') ('c', 'd')
('a', 'c', 'd') ('b',)
('a', 'c') ('b', 'd')
('a', 'd') ('b', 'c')
('a',) ('b', 'c', 'd')
这篇关于查找名义属性的所有二进制拆分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!