pyspark add new column field with the data frame row number
Question
Hey, I'm trying to build a recommendation system with Spark.
I have a data frame with users' emails and movie ratings.
import numpy as np
import pandas as pd

# Note: going through a single np.array coerces every value to string,
# so movie and rating will need casting back to numbers later.
df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]), columns=['user','movie','rating'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
user movie rating
aa@gmail.com 2 3
aa@gmail.com 5 5
bb@gmail.com 8 2
cc@gmail.com 9 3
My first doubt is: pySpark MLlib doesn't accept emails as ids, correct? Because of this I need to replace each email with a numeric primary key.
My approach was to create a temporary table, select the distinct users, and now I want to add a new column with a row number (this number will be the primary key for each user).
sparkdf.registerTempTable("sparkdf")
DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")
What I have:
+------------+
| user|
+------------+
|bb@gmail.com|
|aa@gmail.com|
|cc@gmail.com|
+------------+
What I want:
+------------+---+
|        user| PK|
+------------+---+
|bb@gmail.com|  1|
|aa@gmail.com|  2|
|cc@gmail.com|  3|
+------------+---+
Next I will do a join and obtain my final data frame to use in MLlib:
user movie rating
1 2 3
1 5 5
2 8 2
3 9 3
Regards, and thanks for your time.
Answer
Primary keys with Apache Spark practically answers your question, but in this particular case using StringIndexer could be a better choice:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf).transform(sparkdf)