How to improve my recommendation result? I am using spark ALS implicit

Problem description

First, I have some usage history of users' apps.

For example:
user1, app1, 3 (launch count)
user2, app2, 2 (launch count)
user3, app1, 1 (launch count)

Basically, I have two requirements:

  1. Recommend some apps to each user.
  2. Recommend similar apps for each app.

So I use the implicit-feedback ALS from Spark MLlib to implement it. At first I just used the original data to train the model, and the result was terrible. I think it may be caused by the range of the launch counts, which runs from 1 into the thousands. So I processed the original data into a score that I think reflects the true situation better and is better normalized:

score = lt / uMlt + lt / aMlt

score is the processed value used to train the model.
lt is the launch count in the original data.
uMlt is the user's mean launch count in the original data: (sum of all launch counts of a user) / (number of apps this user has ever launched).
aMlt is the app's mean launch count in the original data: (sum of all launch counts of an app) / (number of users who have ever launched this app).

Here are some examples of the data after processing.

Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
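
For reference, here is a minimal sketch of how such scores could be computed with the Spark RDD API, assuming the input is deduplicated (userId, appId, launchCount) triples; the raw RDD and the helper name are made up for illustration, not the code that actually produced the data above.

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// raw: deduplicated (userId, appId, launchCount) triples -- hypothetical input
def toScores(raw: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: (sum of a user's launch counts) / (number of apps that user launched)
  val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1)) }
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    .mapValues(t => t._1 / t._2)
  // aMlt: (sum of an app's launch counts) / (number of users who launched that app)
  val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1)) }
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    .mapValues(t => t._1 / t._2)

  raw.map { case (u, a, lt) => (u, (a, lt)) }
    .join(uMlt)                                           // attach uMlt per user
    .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }
    .join(aMlt)                                           // attach aMlt per app
    .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) }  // score = lt/uMlt + lt/aMlt
}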

After I did this and aggregated the apps that have different package names, the result seems better, but still not good enough.
I also find that the user and product features are very small, and most of them are negative.

Here are 3 example lines of product features, 10 dimensions per line:

((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))

Here are 3 example lines of user features, 10 dimensions per line:

(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))

So you can imagine how small the values are when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
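
To make that concrete, here is a tiny standalone sketch; the two vectors are made-up values with the same magnitudes as the factors printed above, not real model output. The predicted preference for a user-item pair is just the dot product of the two factor vectors, so it comes out vanishingly small.

// user factors here are ~1e-3 and product factors ~1e-7, mimicking the printed examples
val userVec    = Array(-1.0e-3, -1.9e-3, 1.4e-3, 6.3e-4, -9.0e-4, -4.2e-5, 1.4e-3, -9.8e-5, 1.0e-3, -1.8e-3)
val productVec = Array(-4.8e-7, -7.6e-7, 6.0e-7, 2.8e-7, -4.3e-7, 1.8e-7, 5.0e-7, 2.0e-7, 6.4e-7, -4.3e-7)
// predicted preference = dot product; with these magnitudes it is on the order of 1e-9
val prediction = userVec.zip(productVec).map { case (u, p) => u * p }.sum
println(prediction)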

My questions are:

  1. Is there any other way to improve the recommendation result?
  2. Do my features look right, or is something wrong with them?
  3. Is my way of processing the original launch counts (converting them into scores) right?

I put some code here. This is definitely a programming question, but it probably cannot be solved with just a few lines of code.

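// Train the implicit-feedback ALS model on the processed scores, then map each user's
// top-N app IDs to package names (keyed by MAC) and store them in HBase.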
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
  case (uid, (appArray, mac)) => {
    (mac, appArray.map {
      case (appId, rating) => {
        val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
        (packageName, rating)
      }
    })
  }
}
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val mac = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})

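// Dump the learned product and user factor vectors for inspection, then compute
// each app's top-N most similar apps and store them in HBase.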
print("recommendSimilarApp")
println("productFeatures ******")
model.productFeatures.take(1000).map{
  case (appId, features) => {
    val packageNameList = appIdPackageNameListDict.value.get(appId)
    val packageNameListStr = if (packageNameList.isDefined) {
      packageNameList.mkString("(", ",", ")")
    } else {
      "Unknow List"
    }
    (packageNameListStr, features.mkString("(", ",", ")"))
  }
}.foreach(println)
println("productFeatures ******")
model.userFeatures.take(1000).map{
  case (userId, features) => {
    (userId, features.mkString("(", ",", ")"))
  }
}.foreach(println)
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
  case (appId, similarAppArray) => {
    val groupedAppList = appIdPackageNameListDict.value.get(appId)
    if (groupedAppList.isDefined) {
      val similarPackageList = similarAppArray.map {
        case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
      }
      groupedAppList.get.map(packageName => {
        (packageName, similarPackageList)
      })
    } else {
      None
    }
  }
}
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val packageName = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})  

UPDATE:
I found something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets": my data is too sparse compared to the IPTV data set described in the paper.

Paper: 300,000 (users), 17,000 (products), 32,000,000 (observations)
Mine:  300,000 (users), 31,000 (products), 700,000 (observations)

So the user-item matrix in the paper's data set has a fill ratio of about 0.00627 (= 32,000,000 / (300,000 × 17,000)). My data set's ratio is about 0.000075 (= 700,000 / (300,000 × 31,000)), which means my user-item matrix is roughly 80 times sparser than the paper's.
Should this lead to a bad result? And is there any way to improve it?

Answer

You should try two things:

  1. Standardize your data so that every user vector has zero mean and unit variance. This is a common step in a lot of machine learning. It helps reduce the effect of the outliers that are causing the near-zero values you are seeing (see the sketch after this list).
  2. Remove all users that have only a single app. The only thing you will learn from these users is a slightly better "average" value for the app's score. They won't help you learn any meaningful relationships, though, and that is what you really want.
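
As a minimal sketch, here is one way both points could be applied to the ratings RDD from the question; the per-user standardization below is one straightforward reading of point 1, not the answerer's exact code.

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

def preprocess(ratings: RDD[Rating]): RDD[Rating] = {
  ratings
    .groupBy(_.user)
    // point 2: drop users that only ever launched a single app
    .filter { case (_, rs) => rs.size > 1 }
    // point 1: standardize each user's scores to zero mean and unit variance
    .flatMap { case (_, rs) =>
      val scores = rs.map(_.rating)
      val mean   = scores.sum / scores.size
      val std    = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.size)
      rs.map(r => Rating(r.user, r.product, if (std > 0) (r.rating - mean) / std else 0.0))
    }
}

The model would then be trained on preprocess(ratings) instead of the raw ratings.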

Having removed a user from the model, you lose the ability to get recommendations for that user directly from the model by providing the user ID. However, they only have a single app rating anyway. So you can instead run a KNN search over the product matrix to find the apps most similar to that user's app; those are the recommendations.
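
Here is a brute-force sketch of that KNN search over model.productFeatures; the helper names are hypothetical and this is not a built-in MLlib call. Cosine similarity is used so that only the direction of the factor vectors matters, which sidesteps their tiny magnitudes.

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0) 0.0 else dot / norms
}

// rank every other app by cosine similarity of its factor vector to the given app's vector
def similarApps(model: MatrixFactorizationModel, appId: Int, k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(appId).head
  model.productFeatures
    .filter { case (id, _) => id != appId }
    .map { case (id, features) => (id, cosine(target, features)) }
    .top(k)(Ordering.by[(Int, Double), Double](_._2))
}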
