Split Contents of String column in PySpark Dataframe
Question
I have a PySpark data frame which has a column containing strings. I want to split this column into words.
Code:
>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+
Expected Output
---------------
>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc                                 |
+---+-------------------------------------+
|1  |[Virat,is,good,batsman]              |
|2  |[sachin,was,good]                    |
|3  |....                                 |
|4  |...                                  |
+---+-------------------------------------+
How can I achieve this?
Recommended answer
Use the split function:
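from pyspark.sql.functions import split
# Split "desc" on runs of whitespace; withColumn returns a new DataFrame, so keep the result
sentenceData = sentenceData.withColumn("desc", split("desc", r"\s+"))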
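To see what the `\s+` pattern in the answer buys over splitting on a single space, here is a minimal sketch in plain Python that needs no Spark session. It assumes Python's `re.split` treats this simple pattern the same way Spark's regex-based `split` does. The first two rows are copied from the question; the third has extra spaces added for illustration, and `split_words` is a hypothetical helper name.

```python
import re

# First two rows come from the question; the third has extra spaces
# added on purpose to show how repeated whitespace is handled.
rows = ["Virat is good batsman", "sachin was good", "I  love   the formulas"]

def split_words(text):
    # Same pattern as in the answer: one or more whitespace characters
    return re.split(r"\s+", text)

tokens = [split_words(row) for row in rows]
for words in tokens:
    print(words)

# Splitting on exactly one space leaves empty strings between repeated spaces
single_space = rows[2].split(" ")
print(single_space)
```

With `\s+`, a run of spaces or tabs counts as one separator, so no empty tokens appear between words. Leading or trailing whitespace would still produce an empty first or last element, so trimming the column first may be worthwhile if the data can contain it.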