Number of lines with fewer than 5 words
Question
Using pyspark, I would like to find the number of lines that have fewer than 5 words.
I wrote this code, but I couldn't figure out what is wrong with it:
from pyspark import SparkConf
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.master("spark://master:7077").appName('test').config(conf=SparkConf()).getOrCreate()
df = spark.read.text('text.txt')
rdd = df.rdd
print(df.count())
rdd1=rdd.filter(lambda line: len((line.split(" "))<5)).collect()
print(rdd1.count())
This is a small part of the error:
-----------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-48-27233afa0b82> in <module>()
9 rdd = df.rdd
10 print(df.count())
---> 11 rdd1=rdd.filter(lambda line: len((line.split(" "))<5)).collect()
12 print(rdd1.count())
13
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 144.0 failed 1 times, most recent failure: Lost task 0.0 in stage 144.0 (TID 144, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/ff/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1497, in __getattr__
idx = self.__fields__.index(item)
ValueError: 'split' is not in list
Recommended answer
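I solved it. The problem was that I was trying to split a Row (a list-like record) rather than the string inside it. Each element of df.rdd is a Row, so line.split is looked up as a field name, which is what produces "'split' is not in list". Indexing with line[0] returns the text of the line. The closing parenthesis of len(...) was also misplaced, so the list was compared with 5 before its length was taken. This is the new line: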
rdd=rdd.filter(lambda line: len(line[0].split(" "))<5).collect()
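For reference, here is a minimal sketch of how the whole count could be put together. It is not the asker's exact code: the helper names has_fewer_words and count_short_lines are hypothetical, the master URL and SparkConf from the question are left out, and the helper uses split() with no argument (which treats any run of whitespace as one separator) instead of split(" "). It also calls count() on the filtered RDD, because collect() returns a plain Python list, on which count() without an argument would fail.

```python
def has_fewer_words(text, limit=5):
    """Return True if `text` has fewer than `limit` whitespace-separated words."""
    # Assumption: split() with no argument is used instead of split(" "),
    # so repeated spaces do not produce empty "words".
    return len(text.split()) < limit


def count_short_lines(path, limit=5):
    """Count the lines of the text file at `path` with fewer than `limit` words.

    Hypothetical helper for illustration; it needs a working Spark installation.
    """
    # Imported here so has_fewer_words can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("test").getOrCreate()
    # spark.read.text yields one Row per line, with a single column "value".
    df = spark.read.text(path)
    # Each RDD element is a Row, so take the string out with row[0] first.
    short = df.rdd.filter(lambda row: has_fewer_words(row[0], limit))
    # count() is an RDD action; collect() would return a plain list instead.
    return short.count()
```

With Spark available, print(count_short_lines('text.txt')) would print the number of short lines in the file from the question.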