Fast Information Gain computation

Question

I need to compute Information Gain scores for >100k features in >10k documents for text classification. The code below works fine, but it is very slow on the full dataset - it takes more than an hour on a laptop. The dataset is 20newsgroups and I am using scikit-learn; the chi2 function provided in scikit-learn runs extremely fast.

Any idea how to compute Information Gain faster for such a dataset?

import numpy as np


def information_gain(x, y):

    def _entropy(values):
        # entropy of a label vector: -sum p * log(p) over the observed classes
        counts = np.bincount(values)
        probs = counts[np.nonzero(counts)] / float(len(values))
        return - np.sum(probs * np.log(probs))

    def _information_gain(feature, y):
        # column indices of documents in which this feature is nonzero
        feature_set_indices = np.nonzero(feature)[1]
        # complement set, built by a pure-Python scan over all documents
        feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]
        entropy_x_set = _entropy(y[feature_set_indices])
        entropy_x_not_set = _entropy(y[feature_not_set_indices])

        return entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                                 + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))

    feature_size = x.shape[0]  # number of documents
    feature_range = range(0, feature_size)
    entropy_before = _entropy(y)
    information_gain_scores = []

    for feature in x.T:
        information_gain_scores.append(_information_gain(feature, y))
    return information_gain_scores, []
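
For context, a minimal driver for this function might look as follows (a sketch assuming the standard scikit-learn 20newsgroups loader and a CountVectorizer document-term matrix):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# build the sparse document-term matrix the function iterates over
newsgroups = fetch_20newsgroups(subset='train')
x = CountVectorizer().fit_transform(newsgroups.data)
scores, _ = information_gain(x, newsgroups.target)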

I merged the internal functions and ran cProfile as below (on a dataset limited to ~15k features and ~1k documents):

cProfile.runctx(
    """for feature in x.T:
    feature_set_indices = np.nonzero(feature)[1]
    feature_not_set_indices = [i for i in feature_range if i not in feature_set_indices]

    values = y[feature_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_set = - np.sum(probs * np.log(probs))

    values = y[feature_not_set_indices]
    counts = np.bincount(values)
    probs = counts[np.nonzero(counts)] / float(len(values))
    entropy_x_not_set = - np.sum(probs * np.log(probs))

    result = entropy_before - (((len(feature_set_indices) / float(feature_size)) * entropy_x_set)
                             + ((len(feature_not_set_indices) / float(feature_size)) * entropy_x_not_set))
    information_gain_scores.append(result)""",
    globals(), locals())

Top 20 results by tottime:

ncalls  tottime percall cumtime percall filename:lineno(function)
1       60.27   60.27   65.48   65.48   <string>:1(<module>)
16171   1.362   0   2.801   0   csr.py:313(_get_row_slice)
16171   0.523   0   0.892   0   coo.py:201(_check)
16173   0.394   0   0.89    0   compressed.py:101(check_format)
210235  0.297   0   0.297   0   {numpy.core.multiarray.array}
16173   0.287   0   0.331   0   compressed.py:631(prune)
16171   0.197   0   1.529   0   compressed.py:534(tocoo)
16173   0.165   0   1.263   0   compressed.py:20(__init__)
16171   0.139   0   1.669   0   base.py:415(nonzero)
16171   0.124   0   1.201   0   coo.py:111(__init__)
32342   0.123   0   0.123   0   {method 'max' of 'numpy.ndarray' objects}
48513   0.117   0   0.218   0   sputils.py:93(isintlike)
32342   0.114   0   0.114   0   {method 'sum' of 'numpy.ndarray' objects}
16171   0.106   0   3.081   0   csr.py:186(__getitem__)
32342   0.105   0   0.105   0   {numpy.lib._compiled_base.bincount}
32344   0.09    0   0.094   0   base.py:59(set_shape)
210227  0.088   0   0.088   0   {isinstance}
48513   0.081   0   1.777   0   fromnumeric.py:1129(nonzero)
32342   0.078   0   0.078   0   {method 'min' of 'numpy.ndarray' objects}
97032   0.066   0   0.153   0   numeric.py:167(asarray)

It looks like most of the time is spent in _get_row_slice. I am not entirely sure about the first row; it seems to cover the whole block I passed to cProfile.runctx, but I don't know why there is such a big gap between the first line's tottime=60.27 and the second one's tottime=1.362. Where was the difference spent? Is it possible to check this in cProfile?
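
A note on the gap: tottime excludes time spent in sub-calls, so the 60.27 s attributed to <string>:1(<module>) is time spent in the loop body itself rather than in any profiled function. A plausible culprit is the list comprehension building feature_not_set_indices, whose per-element not in test scans a NumPy array without entering any Python-level function the profiler would see. To dig further, the saved stats can be inspected with pstats (a sketch; the file name is arbitrary and `block` stands for the code string above):

import cProfile, pstats

cProfile.runctx(block, globals(), locals(), filename='ig.prof')
stats = pstats.Stats('ig.prof')
stats.sort_stats('cumulative').print_stats(20)  # cumtime view complements the tottime view
stats.print_callers('_get_row_slice')           # which calls trigger the expensive row slicing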

Basically, it looks like the problem lies in sparse matrix operations (slicing, getting elements). The solution would probably be to compute Information Gain using matrix algebra (the way chi2 is implemented in scikit-learn), but I have no idea how to express this calculation in terms of matrix operations... Does anyone have an idea?
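
For illustration, here is roughly what such a matrix-algebra formulation could look like (a sketch only; information_gain_matrix is a made-up name, and features are treated as binary present/absent):

import numpy as np
from scipy import sparse

def information_gain_matrix(X, y):
    # binarize: only presence/absence of a feature matters for this IG definition
    Xb = (X > 0).astype(np.float64)
    n = Xb.shape[0]
    classes = np.unique(y)
    # one-hot class indicator matrix Y (documents x classes)
    Y = sparse.csr_matrix((np.ones(n), (np.arange(n), np.searchsorted(classes, y))),
                          shape=(n, len(classes)))
    c1 = Y.T.dot(Xb).toarray()                   # class-by-feature counts where feature is set
    n1 = np.asarray(Xb.sum(axis=0)).ravel()      # documents containing each feature
    ctot = np.asarray(Y.sum(axis=0)).ravel()     # documents per class
    c0 = ctot[:, None] - c1                      # counts where the feature is absent
    n0 = n - n1

    def cond_entropy(counts, totals):
        # H(y | X=v) per feature, with 0 * log(0) treated as 0
        with np.errstate(divide='ignore', invalid='ignore'):
            p = counts / totals
            h = -np.where(counts > 0, p * np.log(p), 0.0).sum(axis=0)
        return np.nan_to_num(h)

    p_class = ctot / float(n)
    entropy_before = -np.sum(p_class * np.log(p_class))
    return entropy_before - (n1 / float(n)) * cond_entropy(c1, n1) \
                          - (n0 / float(n)) * cond_entropy(c0, n0)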

Answer

I don't know whether this still helps, since a year has passed. But now I happen to be facing the same task for text classification. I've rewritten your code using the nonzero() function provided for sparse matrices. Then I just scan nz, count the corresponding y values and compute the entropy.

The following code needs only seconds to run on the news20 dataset (loaded using the libsvm sparse matrix format).

import numpy as np


def information_gain(X, y):

    def _calIg():
        # conditional entropies H(y | feature present) and H(y | feature absent)
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            if classTotCnt[c] > classCnt[c]:  # guard against 0 * log(0)
                probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                             +  ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    # class counts over all documents and the entropy of y before any split
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # nz[0]: feature indices, nz[1]: document indices; tocsr() guarantees the
    # entries come back grouped by feature, whatever format X started in
    nz = X.T.tocsr().nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            # close out feature `pre`, then emit 0 for features that never occur
            ig = _calIg()
            information_gain.append(ig)
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)
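
For completeness, a usage sketch matching the setup described above (news20 in libsvm format, loaded with scikit-learn's load_svmlight_file; the file path is a placeholder):

from sklearn.datasets import load_svmlight_file

# X comes back as a CSR matrix, y as a dense label array
X, y = load_svmlight_file('news20.scale')
scores = information_gain(X, y)
print(scores.argsort()[-10:])  # indices of the ten highest-scoring features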
