pyspark add new column field with the data frame row number
Question
Hey, I'm trying to build a recommendation system with Spark.
I have a data frame with users' emails and movie ratings.
import numpy as np
import pandas as pd

# Note: going through a single np.array coerces every value to string,
# so movie and rating will need casting back to numbers later.
df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]), columns=['user','movie','rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
user movie rating
aa@gmail.com 2 3
aa@gmail.com 5 5
bb@gmail.com 8 2
cc@gmail.com 9 3
My first doubt is: pySpark MLlib doesn't accept emails as ids, correct? Because of this I need to replace each email with a numeric primary key.
My approach was to create a temporary table, select the distinct users, and now I want to add a new column with a row number (this number will be the primary key for each user).
sparkdf.registerTempTable("sparkdf")
DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")
What I have:
+------------+
| user|
+------------+
|bb@gmail.com|
|aa@gmail.com|
|cc@gmail.com|
+------------+
What I want:
+------------+---+
|        user| PK|
+------------+---+
|bb@gmail.com|  1|
|aa@gmail.com|  2|
|cc@gmail.com|  3|
+------------+---+
Next I will do a join and obtain my final data frame to use in MLlib:
user movie rating
1 2 3
1 5 5
2 8 2
3 9 3
Regards, and thanks for your time.
Answer
Primary keys with Apache Spark practically answers your question, but in this particular case using StringIndexer could be a better choice:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf).transform(sparkdf)