How to improve my recommendation result? I am using Spark ALS implicit

Problem Description

First, I have some usage history of users' apps.

For example:
user1, app1, 3(launch times)
user2, app2, 2(launch times)
user3, app1, 1(launch times)

Basically, I have two requirements:

  1. Recommend some apps for each user.
  2. Recommend similar apps for each app.

So I use the implicit ALS from Spark MLlib to implement it. At first I just used the original data to train the model, and the result was terrible. I think it may be caused by the range of the launch times, which runs from 1 to several thousand. So I processed the original data into a score that I think reflects the real situation better and is more normalized.

score = lt / uMlt + lt / aMlt

score is the processed value used to train the model.
lt is the launch times in the original data.
uMlt is the user's mean launch times in the original data: (all launch times of a user) / (number of apps this user ever launched).
aMlt is the app's mean launch times in the original data: (all launch times of an app) / (number of users who ever launched this app).
Here are some examples of the data after processing.

Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
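
To make the scoring step above concrete, here is a minimal sketch of how it could be computed in Spark, assuming the raw history is an RDD[(Int, Int, Double)] of (userId, appId, launchTimes); the names are illustrative and not the code actually used here.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

def toScores(raw: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: per-user mean = (all launch times of a user) / (number of apps the user launched)
  val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }
  // aMlt: per-app mean = (all launch times of an app) / (number of users who launched it)
  val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }
  // score = lt / uMlt + lt / aMlt
  raw.map { case (u, a, lt) => (u, (a, lt)) }
    .join(uMlt)
    .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }
    .join(aMlt)
    .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) }
}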

After doing this, and aggregating the apps that have different package names, the result seems better, but still not good enough.
I also find that the feature values of users and products are very small, and most of them are negative.

Here are 3 example lines of product features, 10 dimensions per line:

((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))

Here are 3 example lines of user features, 10 dimensions per line:

(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))

So you can imagine how small the values are when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
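
For context, the predicted preference for a (user, app) pair in such a model is just the dot product of the corresponding feature vectors, which is why tiny feature values produce tiny predictions. A minimal sketch, assuming model is the MatrixFactorizationModel returned by ALS.trainImplicit in the code below, and where someUserId and someAppId are hypothetical IDs, not values from the actual data:

val userVec = model.userFeatures.lookup(someUserId).head      // Array[Double] of length rank
val prodVec = model.productFeatures.lookup(someAppId).head    // Array[Double] of length rank
val manualScore = userVec.zip(prodVec).map { case (u, p) => u * p }.sum
val modelScore  = model.predict(someUserId, someAppId)        // should equal manualScore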

My questions are:

  1. Is there any other way to improve the recommendation result?
  2. Do my features look right, or is something going wrong?
  3. Is my way of processing the original launch times (converting them to a score) right?

I put some code here. This is definitely a programming question, but it probably can't be solved with just a few lines of code.

// Train an implicit-feedback ALS model on the processed scores
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
print("recommendForAllUser")
// Per-user top-N recommendations, joined with userData to recover each user's MAC, with app IDs mapped to package names
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
  case (uid, (appArray, mac)) => {
    (mac, appArray.map {
      case (appId, rating) => {
        val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
        (packageName, rating)
      }
    })
  }
}
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val mac = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})

print("recommendSimilarApp")
println("productFeatures ******")
// Print the first 1000 product feature vectors (with their package names) for inspection
model.productFeatures.take(1000).map{
  case (appId, features) => {
    val packageNameList = appIdPackageNameListDict.value.get(appId)
    val packageNameListStr = if (packageNameList.isDefined) {
      packageNameList.mkString("(", ",", ")")
    } else {
      "Unknow List"
    }
    (packageNameListStr, features.mkString("(", ",", ")"))
  }
}.foreach(println)
println("productFeatures ******")
model.userFeatures.take(1000).map{
  case (userId, features) => {
    (userId, features.mkString("(", ",", ")"))
  }
}.foreach(println)
// Similar-app recommendations: for each app, the top-N most similar apps, with IDs mapped back to package names
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
  case (appId, similarAppArray) => {
    val groupedAppList = appIdPackageNameListDict.value.get(appId)
    if (groupedAppList.isDefined) {
      val similarPackageList = similarAppArray.map {
        case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
      }
      groupedAppList.get.map(packageName => {
        (packageName, similarPackageList)
      })
    } else {
      None
    }
  }
}
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val packageName = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})  

UPDATE:
I found something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets". My data is much too sparse compared to the IPTV data set described in the paper.

Paper: 300,000 (users), 17,000 (products), 32,000,000 (data points)
Mine:  300,000 (users), 31,000 (products), 700,000 (data points)

So the user-item matrix in the paper's data set is filled to a density of about 0.00627 = 32,000,000 / (300,000 * 17,000). My data set's ratio is about 0.000075 = 700,000 / (300,000 * 31,000), which means my user-item matrix is roughly 80 times sparser than the paper's.
Could this lead to a bad result? And is there any way to improve it?
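
For reference, the two densities can be recomputed directly from the counts above (a quick sanity check, not part of the original program):

val paperDensity = 32000000.0 / (300000.0 * 17000.0)   // ≈ 0.00627
val myDensity    = 700000.0 / (300000.0 * 31000.0)     // ≈ 0.000075
val sparsityGap  = paperDensity / myDensity             // ≈ 83, i.e. roughly 80 times sparser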

Answer

You should try two things:

  1. Standardize your data so that each user vector has zero mean and unit variance. This is a common step in much of machine learning, and it helps reduce the effect of outliers, which are what cause the near-zero values you are seeing.
  2. Remove all users who have only a single app. The only thing you will learn from those users is a slightly better "mean" value for that app's score; they won't help you learn any meaningful relationships, which is what you really want. A sketch of both steps follows this list.
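
Here is a minimal sketch of both suggestions in Spark, assuming ratings is the RDD[Rating] that is fed to ALS.trainImplicit in the question's code; the names are illustrative:

import org.apache.spark.mllib.recommendation.Rating

val cleaned = ratings
  .groupBy(_.user)
  .filter { case (_, rs) => rs.size > 1 }            // drop users with only a single app
  .flatMap { case (_, rs) =>
    val scores = rs.map(_.rating)
    val mean = scores.sum / scores.size
    val std = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.size)
    rs.map { r =>
      // zero mean and unit variance per user vector
      val z = if (std > 0) (r.rating - mean) / std else 0.0
      Rating(r.user, r.product, z)
    }
  }

val retrainedModel = ALS.trainImplicit(cleaned, rank, iterations, lambda, alpha)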

Having removed such a user from the model, you lose the ability to get recommendations for that user directly from the model by providing the user ID. However, they only have a single app rating anyway, so you can instead run a KNN search over the product feature matrix to find the apps most similar to that user's app, and use those as the recommendations.
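
A minimal sketch of that KNN search, assuming model is the trained MatrixFactorizationModel and singleAppId is the one app such a user had (both names are placeholders), using cosine similarity over the product features:

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

val targetVec = model.productFeatures.lookup(singleAppId).head

val similarApps = model.productFeatures
  .filter { case (appId, _) => appId != singleAppId }
  .map { case (appId, vec) => (appId, cosine(targetVec, vec)) }
  .sortBy(_._2, ascending = false)   // most similar first
  .take(10)                          // top-10 similar apps = recommendations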
