Using ranking data in Logistic Regression


Question

I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.

I have put my data in a .csv as follows:

URL WebsiteText AlexaRank GooglePageRank

In my test CSV we have:

URL WebsiteText AlexaRank GooglePageRank Label

Label is a binary classification indicating "good" with 1 or "bad" with 0.

I currently have my LR running using only the website text, on which I run TF-IDF.

I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so that I, and others, may learn.

  • How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however, they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?
  • I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve myself would also be appreciated here.

# Imports implied by the snippet below (numpy, pandas, and scikit-learn).
import numpy as np
import pandas as p
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model as lm
from sklearn import cross_validation

loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')

print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])#Reading WebsiteText column for TF-IDF.
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1] #reading label

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)

rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,
                           intercept_scaling=1.0, class_weight=None, random_state=None)

X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

#Add Two Integer Columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])#Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:,3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData

#Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1) #Think I have done this incorrectly.

print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
    print "submission file created.."`

Thank you very much for all feedback - please post if you need any further information!

Answer

I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into mean-0, std-1 features.

  • This definitely gets rid of your first problem. AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating point numbers.) Of course, the results will not be integers between 1 and 10000, but they will maintain the same order as the original ranks. And in this case, keeping the rank bounded and normalized will help solve your second problem, as follows.
  • In order to understand why normalization would help in LR, let's revisit the logit formulation of LR, written out after this list.
    In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank related features. Now, the linear form of the equation suggests that the coefficients represent the change in the logit of y for one unit of change in a variable. Think about what happens when your X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa rank variable dominates your LR fit, and a small change in a TF-IDF value has almost no effect on the fit. Now one might think that the coefficients should be able to adjust to small/large values to account for differences between these features. Not in this case: it is not only the magnitude of the variables that matters but also their range. Alexa rank definitely has a large range and will dominate your LR fit in this case. Therefore, I guess normalizing all variables using StandardScaler to adjust their range will improve the fit.
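
For reference, here is the logit formulation referred to above, written in its standard linear form over the five features named in the second bullet:

\mathrm{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5

Each coefficient \beta_i is the change in the log-odds of y for a one-unit change in X_i, so a feature ranging up to values on the order of 10^13 can move the logit enormously compared to TF-IDF features, which (with the default l2 norm) typically lie in [0, 1].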

Here is how you can scale the X matrix.

sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
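
One caveat: if X is still the sparse matrix produced by TfidfVectorizer, StandardScaler cannot mean-center it and will raise an error. A common workaround is with_mean=False, which applies only the scaling by standard deviation; a minimal sketch, assuming X is sparse:

from sklearn import preprocessing

sc = preprocessing.StandardScaler(with_mean=False).fit(X)  # no centering on sparse input
X = sc.transform(X)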

Don't forget to use the same scaler to transform X_test:

X_test = sc.transform(X_test)

Now you can use the fitting procedure etc.

rd.fit(X, y)
rd.predict_proba(X_test)

Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html

Edit: The parsing and column-merging part can be done easily using pandas, i.e., there is no need to convert the matrices into lists and then append them. Moreover, pandas DataFrames can be directly indexed by their column names.

AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData.append(AlexaAndGoogleTrainData)

Note that we are passing the header=0 argument to read_table to keep the original header names from the tsv file. Also note how we can index using the entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.

X = np.hstack((X, AllAlexaAndGoogleInfo))

hstack horizontally combines two multi-dimensional array-like structures, provided their lengths are the same.
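
As an aside, if X is still a scipy sparse matrix at this point (the raw TF-IDF output), numpy.hstack will not combine it with a dense DataFrame the way you would expect; scipy.sparse.hstack handles the mixed case. A minimal sketch under that assumption:

from scipy import sparse

# Stacks a sparse matrix with dense columns; .tocsr() restores efficient row slicing.
X = sparse.hstack((X, AllAlexaAndGoogleInfo.values)).tocsr()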
