How to improve my recommendation result? I am using spark ALS implicit

Problem description

First, I have some usage history of users' apps.

For example:
user1, app1, 3 (launch count)
user2, app2, 2 (launch count)
user3, app1, 1 (launch count)

Basically, I have two requirements:

  1. Recommend some apps to each user.
  2. Recommend similar apps for each app.

So I use the implicit-feedback ALS from Spark MLlib to implement it. At first I just used the original data to train the model, and the result was terrible. I think it may be caused by the range of the launch counts, which runs from 1 into the thousands. So I processed the original data into a score that I think reflects the true situation better and is better normalized:

score = lt / uMlt + lt / aMlt

score is the processed value used to train the model.
lt is the launch count in the original data.
uMlt is the user's mean launch count in the original data: (sum of all launch counts of a user) / (number of apps this user has ever launched).
aMlt is the app's mean launch count in the original data: (sum of all launch counts of an app) / (number of users who have ever launched this app).

Here are some examples of the data after processing.

Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
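
For reference, here is a minimal sketch of how such scores could be computed with the Spark RDD API, assuming the input is deduplicated (userId, appId, launchCount) triples; the raw RDD and the helper name are made up for illustration, not the code that actually produced the data above.

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// raw: deduplicated (userId, appId, launchCount) triples -- hypothetical input
def toScores(raw: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: (sum of a user's launch counts) / (number of apps that user launched)
  val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1)) }
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    .mapValues(t => t._1 / t._2)
  // aMlt: (sum of an app's launch counts) / (number of users who launched that app)
  val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1)) }
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    .mapValues(t => t._1 / t._2)

  raw.map { case (u, a, lt) => (u, (a, lt)) }
    .join(uMlt)                                           // attach uMlt per user
    .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }
    .join(aMlt)                                           // attach aMlt per app
    .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) }  // score = lt/uMlt + lt/aMlt
}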

After I did this and aggregated the apps that have different package names, the result seems better, but still not good enough.
I also find that the user and product features are very small, and most of them are negative.

Here are 3 example lines of product features, 10 dimensions per line:

((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))

Here are 3 example lines of user features, 10 dimensions per line:

(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))

So you can imagine how small the values are when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
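
To make that concrete, here is a tiny standalone sketch; the two vectors are made-up values with the same magnitudes as the factors printed above, not real model output. The predicted preference for a user-item pair is just the dot product of the two factor vectors, so it comes out vanishingly small.

// user factors here are ~1e-3 and product factors ~1e-7, mimicking the printed examples
val userVec    = Array(-1.0e-3, -1.9e-3, 1.4e-3, 6.3e-4, -9.0e-4, -4.2e-5, 1.4e-3, -9.8e-5, 1.0e-3, -1.8e-3)
val productVec = Array(-4.8e-7, -7.6e-7, 6.0e-7, 2.8e-7, -4.3e-7, 1.8e-7, 5.0e-7, 2.0e-7, 6.4e-7, -4.3e-7)
// predicted preference = dot product; with these magnitudes it is on the order of 1e-9
val prediction = userVec.zip(productVec).map { case (u, p) => u * p }.sum
println(prediction)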

My questions are:

  1. Is there any other way to improve the recommendation result?
  2. Do my features look right, or is something wrong with them?
  3. Is my way of processing the original launch counts (converting them into scores) right?

I put some code here. This is definitely a programming question, but it probably cannot be solved with just a few lines of code.

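// Train the implicit-feedback ALS model on the processed scores, then map each user's
// top-N app IDs to package names (keyed by MAC) and store them in HBase.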
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
  case (uid, (appArray, mac)) => {
    (mac, appArray.map {
      case (appId, rating) => {
        val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
        (packageName, rating)
      }
    })
  }
}
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val mac = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})

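// Dump the learned product and user factor vectors for inspection, then compute
// each app's top-N most similar apps and store them in HBase.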
print("recommendSimilarApp")
println("productFeatures ******")
model.productFeatures.take(1000).map{
  case (appId, features) => {
    val packageNameList = appIdPackageNameListDict.value.get(appId)
    val packageNameListStr = if (packageNameList.isDefined) {
      packageNameList.mkString("(", ",", ")")
    } else {
      "Unknow List"
    }
    (packageNameListStr, features.mkString("(", ",", ")"))
  }
}.foreach(println)
println("productFeatures ******")
model.userFeatures.take(1000).map{
  case (userId, features) => {
    (userId, features.mkString("(", ",", ")"))
  }
}.foreach(println)
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
  case (appId, similarAppArray) => {
    val groupedAppList = appIdPackageNameListDict.value.get(appId)
    if (groupedAppList.isDefined) {
      val similarPackageList = similarAppArray.map {
        case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
      }
      groupedAppList.get.map(packageName => {
        (packageName, similarPackageList)
      })
    } else {
      None
    }
  }
}
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val packageName = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})  

UPDATE:
I found something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets": my data is too sparse compared to the IPTV data set described in the paper.

Paper: 300,000 (users), 17,000 (products), 32,000,000 (observations)
Mine:  300,000 (users), 31,000 (products), 700,000 (observations)

So the user-item matrix in the paper's data set has a fill ratio of about 0.00627 (= 32,000,000 / (300,000 × 17,000)). My data set's ratio is about 0.000075 (= 700,000 / (300,000 × 31,000)), which means my user-item matrix is roughly 80 times sparser than the paper's.
Should this lead to a bad result? And is there any way to improve it?

Answer

You should try two things:

  1. Standardize your data so that every user vector has zero mean and unit variance. This is a common step in a lot of machine learning. It helps reduce the effect of the outliers that are causing the near-zero values you are seeing (see the sketch after this list).
  2. Remove all users that have only a single app. The only thing you will learn from these users is a slightly better "average" value for the app's score. They won't help you learn any meaningful relationships, though, and that is what you really want.
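
As a minimal sketch, here is one way both points could be applied to the ratings RDD from the question; the per-user standardization below is one straightforward reading of point 1, not the answerer's exact code.

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

def preprocess(ratings: RDD[Rating]): RDD[Rating] = {
  ratings
    .groupBy(_.user)
    // point 2: drop users that only ever launched a single app
    .filter { case (_, rs) => rs.size > 1 }
    // point 1: standardize each user's scores to zero mean and unit variance
    .flatMap { case (_, rs) =>
      val scores = rs.map(_.rating)
      val mean   = scores.sum / scores.size
      val std    = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.size)
      rs.map(r => Rating(r.user, r.product, if (std > 0) (r.rating - mean) / std else 0.0))
    }
}

The model would then be trained on preprocess(ratings) instead of the raw ratings.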

Having removed a user from the model, you lose the ability to get recommendations for that user directly from the model by providing the user ID. However, they only have a single app rating anyway. So you can instead run a KNN search over the product matrix to find the apps most similar to that user's app; those are the recommendations.
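
Here is a brute-force sketch of that KNN search over model.productFeatures; the helper names are hypothetical and this is not a built-in MLlib call. Cosine similarity is used so that only the direction of the factor vectors matters, which sidesteps their tiny magnitudes.

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0) 0.0 else dot / norms
}

// rank every other app by cosine similarity of its factor vector to the given app's vector
def similarApps(model: MatrixFactorizationModel, appId: Int, k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(appId).head
  model.productFeatures
    .filter { case (id, _) => id != appId }
    .map { case (id, features) => (id, cosine(target, features)) }
    .top(k)(Ordering.by[(Int, Double), Double](_._2))
}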
