How to perform logistic regression using vowpal wabbit on a very imbalanced dataset
Question
I am trying to use vowpal wabbit for logistic regression, but I am not sure whether this is the right syntax.
For training, I do:
./vw -d ~/Desktop/new_data.txt --passes 20 --binary --cache_file cache.txt -f lr.vw --loss_function logistic --l1 0.05
For testing, I do:
./vw -d ~/libsvm-3.18_test/matlab/new_data_test.txt --binary -t -i lr.vw -p predictions.txt -r raw_score.txt
Here is a snippet from my training data:
-1:1.00038 | 110:0.30103 262:0.90309 689:1.20412 1103:0.477121 1286:1.5563 2663:0.30103 2667:0.30103 2715:4.63112 3012:0.30103 3113:8.38411 3119:4.62325 3382:1.07918 3666:1.20412 3728:5.14959 4029:0.30103 4596:0.30103
1:2601.25 | 32:2.03342 135:3.77379 146:3.19535 284:2.5563 408:0.30103 542:3.80618 669:1.07918 689:2.25527 880:0.30103 915:1.98227 1169:5.35371 1270:0.90309 1425:0.30103 1621:0.30103 1682:0.30103 1736:3.98227 1770:0.60206 1861:4.34341 1900:3.43136 1905:7.54141 1991:5.33791 2437:0.954243 2532:2.68664 3370:2.90309 3497:0.30103 3546:0.30103 3733:0.30103 3963:0.90309 4152:3.23754 4205:1.68124 4228:0.90309 4257:1.07918 4456:0.954243 4483:0.30103 4766:0.30103
Here is a snippet from my test data:
-1 | 110:0.90309 146:1.64345 543:0.30103 689:0.30103 1103:0.477121 1203:0.30103 1286:2.82737 1892:0.30103 2271:0.30103 2715:4.30449 3012:0.30103 3113:7.99039 3119:4.08814 3382:1.68124 3666:0.60206 3728:5.154 3960:0.778151 4309:0.30103 4596:0.30103 4648:0.477121
However, when I look at the results, the predictions are all -1 and the raw scores are all 0. I have around 200,000 examples, of which 100 are +1 and the rest are -1. To handle this imbalance, I gave the positive examples a weight of 200,000/100 and the negative examples a weight of 200,000/(200,000-100). Is this happening because my data is so highly imbalanced, even though I adjusted the weights?
I was expecting the raw score file to contain P(y|x), but I get all zeros. I just need the probability outputs. Any suggestions as to what is going on?
Recommended answer
Summarizing the detailed answer by arielf:
It is important to know what the intended final cost (loss) function is: logistic loss, 0/1 loss (i.e. accuracy), F1 score, area under the ROC curve, or something else?
Here is Bash code for part of arielf's answer. Note that we should first delete the strange attempts at importance weighting from train.txt (that is, the ":1.00038" and ":2601.25" in the question).
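One way to do that clean-up is sketched below. It assumes every line starts with a label of 1 or -1, optionally followed by a colon and a number, as in the snippets from the question. The function name strip_label_weights is made up for this illustration. As far as I know, vw expects an importance weight as a separate whitespace-delimited token after the label (for example "1 2000 | ..."), not glued to the label with a colon.

```shell
# Hypothetical helper (not part of vw): remove a ":<number>" suffix that is
# glued to the leading 1 / -1 label. Reads lines on stdin, writes to stdout.
# Feature pairs such as 110:0.30103 are untouched because the pattern is
# anchored to the start of the line.
strip_label_weights() {
    sed -E 's/^(-?1):[0-9.]+/\1/'
}
```

Assuming the raw file from the question is called new_data.txt, running strip_label_weights < new_data.txt > train.txt would produce the train.txt used in the steps that follow.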
A. Prepare the training data
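# Split the examples by label and shuffle each class.
grep '^-1' train.txt | shuf > neg.txt
grep '^1' train.txt | shuf > p.txt
# Repeat the positives 2000 times to roughly match the number of negatives.
for i in $(seq 2000); do cat p.txt; done > pos.txt
# Interleave negatives and positives line by line.
paste -d '\n' neg.txt pos.txt > newtrain.txt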
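One detail to watch in step A: paste treats the shorter input as if it contained empty lines once it runs out. Going by the counts in the question (about 199,900 negatives against 100 x 2000 = 200,000 repeated positives), newtrain.txt would therefore contain some empty lines near the end. I have not checked how vw treats them, so the sketch below simply drops them and then counts the labels as a sanity check. The function names interleave and label_counts are made up for this illustration.

```shell
# Hypothetical helper: interleave two files line by line and drop the empty
# lines that paste emits once the shorter file is exhausted.
interleave() {
    paste -d '\n' "$1" "$2" | grep -v '^$'
}

# Hypothetical helper: print "<count> <label>" for each distinct leading
# label in a VW-format file (the label is the first space-separated field).
label_counts() {
    cut -d ' ' -f 1 "$1" | sort | uniq -c | awk '{print $1, $2}'
}
```

For example, interleave neg.txt pos.txt > newtrain.txt followed by label_counts newtrain.txt should report roughly equal counts for -1 and 1.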
B. Train model.vw
# Note that passes=1 is the default.
# With one pass, holdout_off is the default.
vw -d newtrain.txt --loss_function=logistic -f model.vw
#average loss = 0.0953586
C. Compute test loss using vw
vw -d test.txt -t -i model.vw --loss_function=logistic -r raw_predictions.txt
#average loss = 0.0649306
D. Compute AUROC using http://osmot.cs.cornell.edu/kddcup/software.html
cut -d ' ' -f 1 test.txt | sed -e 's/^-1/0/' > gold.txt
$VW_HOME/utl/logistic -0 raw_predictions.txt > probabilities.txt
perf -ROC -files gold.txt probabilities.txt
#ROC 0.83484
perf -ROC -plot roc -files gold.txt probabilities.txt | head -n -2 > graph
echo 'plot "graph"' | gnuplot -persist
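The question asked for P(y|x). With logistic loss, the raw score can be read as the log-odds of the positive class, so the probability is 1 / (1 + exp(-score)). My understanding is that the utl/logistic script in step D applies this same mapping. If that script is not at hand, a minimal awk sketch follows; it assumes the raw score is the first field on each line of the raw predictions file, and the name raw_to_prob is made up for this illustration.

```shell
# Hypothetical helper: map raw logistic scores (first field of each input
# line) to probabilities with the sigmoid 1 / (1 + exp(-score)).
# LC_ALL=C keeps "." as the decimal separator regardless of locale.
raw_to_prob() {
    LC_ALL=C awk '{ printf "%.6f\n", 1 / (1 + exp(-$1)) }'
}
```

Used as raw_to_prob < raw_predictions.txt > probabilities.txt, a raw score of 0 maps to 0.500000, negative scores fall below 0.5 and positive scores above it.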