除去分布不均组项目 [英] Removing items from unevenly distributed set
问题描述
我有一个网站,用户提交问题(零,一台或多台每天),投他们,回答每天一个问题(详细信息<一个href="http://stackoverflow.com/questions/8600909/distribution-among-users-for-collaborative-voting-algorithm">here).用户可以通过提交,投票或回答它看到的问题只有一次要么。
I have a website where users submit questions (zero, one or multiple per day), vote on them and answer one question per day (more details here). A user can see the question only once either by submitting, voting or answering it.
我有问题,球员已经看到了一个游泳池。我需要从池中每月除去30个问题。我需要选择问题,在我最大限度地留在池中球员至少可用问题提供问题的数量等方法来消除。
I have a pool of questions that players have already seen. I need to remove 30 questions from the pool each month. I need to pick questions to remove in such way that I maximize the number of available questions left in the pool for player with least available questions.
例如具有5个问题(和需要删除3)池:
Example with pool of 5 questions (and need to remove 3):
- 玩家A已经看到的问题1,3,5
- 玩家B已经看到问题1和4
- 玩家C已经看到问题2和4
不过,我觉得要消除这些问题的最佳球员已经看到,但位置将会改变。按照上面的例子中,玩家A也只拿到了2个问题左打(2,4)。但是,如果我删除1,3,5,情况将是:
I though about removing the questions that top player has seen, but the position would change. Following the above example, player A has only got 2 questions left to play (2 and 4). However, if I remove 1, 3 and 5, the situation would be:
- 玩家A可以播放问题2和4
- 玩家B能够发挥的问题2
- 玩家C不能因为1,3,5被删除,他已经看过2,4玩什么。
比分为这个解决方案是零,也就是玩家与最少的可用问题具有零可用问题发挥。
The score for this solution is zero, i.e. the player with least amount of available questions has zero available questions to play.
在这种情况下,这将是较好的去除1,3和4,得到:
In this case it would be better to remove 1, 3 and 4, giving:
- 玩家A可以玩的问题2
- 玩家B能够发挥的问题2和5
- 玩家C能起到的问题5
比分为这个解决方案是一个,因为这两个球员最少的可用问题发挥有一个可用的问题。
The score for this solution is one, because the two players with least amount of available questions to play have one available question.
如果数据量很小,我就可以穷举的解决方案。不过,我有数百名球员和疑问,所以我找了一些算法来解决这个问题。
If the data size was small, I would be able to brute-force the solution. However, I have hundreds of players and questions, so I'm looking for some algorithm to solve this.
推荐答案
线性规划模型。
变种1
Sum(Uij * Qj) - Sum(Dij * Xj) + 0 = 0 (for each i)
0 + Sum(Dij * Xj) - Score >= 0 (for each i)
Sum(Qj) = (Number of questions - 30)
Maximize(Score)
的Uij
是 1
如果用户我
还没有看到问题Ĵ
,否则是 0
Uij
is 1
if user i
has not seen question j
, otherwise it is 0
DIJ
为单位矩阵(DIJ = 1 <$ C C $>如果 I = j的<元素/ code>,否则是
0
)
Dij
is element of identity matrix (Dij=1
if i=j
, otherwise it is 0
)
XJ
是辅助变量(每个用户)
Xj
is auxiliary variable (one for each user)
变式2。
Sum(Uij * Qj) >= Score (for each i)
Sum(Qj) = (Number of questions - 30)
No objective function, just check feasibility
在这种情况下,唱片问题是简单的,但分数
应该由二元和线性搜索来确定。设置电流范围为[0 ..数最少的看不见的问题,为用户]中,设置分数
的范围的中间,采用整数LP算法(与小的时间限制)。如果没有找到解决办法,设置范围为[开始.. 分数
],否则将其设置为[分数
..结束],并继续二进制搜索。
In this case, LP problem is simpler, but Score
should be determined by binary and linear search. Set current range to [0 .. the least number of unseen questions for a user], set Score
to the middle of the range, apply integer LP algorithm (with small time limit). If no solution found, set range to [begin .. Score
], otherwise set it to [Score
.. end] and continue binary search.
(可选)使用二进制搜索来确定上限的精确解的分数
。
(Optionally) use binary search to determine upper bound for exact solution's Score
.
这是最好的分数
,由二进制搜索发现出发,应用整数LP算法分数
,同比增长1 ,2,...
(以及限制计算时间视需要)。最后,你会得到任何确切的解决方案,或者一些好的近似。
Starting from the best Score
, found by binary search, apply integer LP algorithm with Score
, increased by 1, 2, ...
(and limiting computation time as necessary). At the end, you get either exact solution, or some good approximation.
下面是一个示例code的下GNU GLPK(为变体1):
Here is sample code in C for GNU GLPK (for variant 1):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>
int main(void)
{
int ind[3000];
double val[3000];
int row;
int col;
glp_prob *lp;
// Parameters
int users = 120;
int questions = 10000;
int questions2 = questions - 30;
int time = 30; // sec.
// Create GLPK problem
lp = glp_create_prob();
glp_set_prob_name(lp, "questions");
glp_set_obj_dir(lp, GLP_MAX);
// Configure rows
glp_add_rows(lp, users*2 + 1);
for (row = 1; row <= users; ++row)
{
glp_set_row_bnds(lp, row, GLP_FX, 0.0, 0.0);
glp_set_row_bnds(lp, row + users, GLP_LO, 0.0, 0.0);
}
glp_set_row_bnds(lp, users*2 + 1, GLP_FX, questions2, questions2);
// Configure columns
glp_add_cols(lp, questions + users + 1);
for (col = 1; col <= questions; ++col)
{
glp_set_obj_coef(lp, col, 0.0);
glp_set_col_kind(lp, col, GLP_BV);
}
for (col = 1; col <= users; ++col)
{
glp_set_obj_coef(lp, questions + col, 0.0);
glp_set_col_kind(lp, questions + col, GLP_IV);
glp_set_col_bnds(lp, questions + col, GLP_FR, 0.0, 0.0);
}
glp_set_obj_coef(lp, questions+users+1, 1.0);
glp_set_col_kind(lp, questions+users+1, GLP_IV);
glp_set_col_bnds(lp, questions+users+1, GLP_FR, 0.0, 0.0);
// Configure matrix (question columns)
for(col = 1; col <= questions; ++col)
{
for (row = 1; row <= users*2; ++row)
{
ind[row] = row;
val[row] = ((row <= users) && (rand() % 2))? 1.0: 0.0;
}
ind[users*2 + 1] = users*2 + 1;
val[users*2 + 1] = 1.0;
glp_set_mat_col(lp, col, users*2 + 1, ind, val);
}
// Configure matrix (user columns)
for(col = 1; col <= users; ++col)
{
for (row = 1; row <= users*2; ++row)
{
ind[row] = row;
val[row] = (row == col)? -1.0: ((row == col + users)? 1.0: 0.0);
}
ind[users*2 + 1] = users*2 + 1;
val[users*2 + 1] = 0.0;
glp_set_mat_col(lp, questions + col, users*2 + 1, ind, val);
}
// Configure matrix (score column)
for (row = 1; row <= users*2; ++row)
{
ind[row] = row;
val[row] = (row > users)? -1.0: 0.0;
}
ind[users*2 + 1] = users*2 + 1;
val[users*2 + 1] = 0.0;
glp_set_mat_col(lp, questions + users + 1, users*2 + 1, ind, val);
// Solve integer GLPK problem
glp_iocp param;
glp_init_iocp(¶m);
param.presolve = GLP_ON;
param.tm_lim = time * 1000;
glp_intopt(lp, ¶m);
printf("Score = %g\n", glp_mip_obj_val(lp));
glp_delete_prob(lp);
return 0;
}
期限不工作可靠地在我的测试。貌似在GLPK一些bug ......
Time limit is not working reliably in my tests. Looks like some bug in GLPK...
样品code的变体2(只LP算法,没有自动搜索分数
):
Sample code for variant 2 (only LP algorithm, no automatic search for Score
):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>
int main(void)
{
int ind[3000];
double val[3000];
int row;
int col;
glp_prob *lp;
// Parameters
int users = 120;
int questions = 10000;
int questions2 = questions - 30;
double score = 4869.0 + 7;
// Create GLPK problem
lp = glp_create_prob();
glp_set_prob_name(lp, "questions");
glp_set_obj_dir(lp, GLP_MAX);
// Configure rows
glp_add_rows(lp, users + 1);
for (row = 1; row <= users; ++row)
{
glp_set_row_bnds(lp, row, GLP_LO, score, score);
}
glp_set_row_bnds(lp, users + 1, GLP_FX, questions2, questions2);
// Configure columns
glp_add_cols(lp, questions);
for (col = 1; col <= questions; ++col)
{
glp_set_obj_coef(lp, col, 0.0);
glp_set_col_kind(lp, col, GLP_BV);
}
// Configure matrix (question columns)
for(col = 1; col <= questions; ++col)
{
for (row = 1; row <= users; ++row)
{
ind[row] = row;
val[row] = (rand() % 2)? 1.0: 0.0;
}
ind[users + 1] = users + 1;
val[users + 1] = 1.0;
glp_set_mat_col(lp, col, users + 1, ind, val);
}
// Solve integer GLPK problem
glp_iocp param;
glp_init_iocp(¶m);
param.presolve = GLP_ON;
glp_intopt(lp, ¶m);
glp_delete_prob(lp);
return 0;
}
看来,变种2可以找到pretty的良好近似相当快。 而近似比变种1更好。
It appears that variant 2 allows to find pretty good approximation quite fast. And approximation is better than for variant 1.
这篇关于除去分布不均组项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!