Remove non-English words from a column in PySpark

Views: 95 | Published: 2021/4/28 20:45:43 | Tags: python apache-spark pyspark data-cleaning non-english

Problem description

I am working on a pyspark dataframe as shown below:

+-------+--------------------------------------------------+
|     id|                                             words|
+-------+--------------------------------------------------+
|1475569|[pt, m, reporting, delivery, scam, thank, 0a, 0...|
|1475568|[, , delivered, trblake, yahoo, com, received, ...|
|1475566|[,  marco, v, washin, gton, thursday, de, cembe...|
|1475565|[, marco, v, washin, gton, wednesday, de, cembe...|
|1475563|[joyce, 20, begin, forwarded, message, 20, memo...|
+-------+--------------------------------------------------+

dtypes of the df:

id: 'bigint'
words: 'array<string>'

I want to remove non-English words (including numeric values and words containing digits, e.g. Bun20) from the 'words' column. I have already removed the stop words, but how can I remove the remaining non-English words from the column?

Please help.

Recommended answer

You can check if each word in the array is in the nltk corpus using a UDF:

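import nltk
import pyspark.sql.functions as F
from nltk.corpus import words as nltk_words
from nltk.stem import WordNetLemmatizer

# The corpora must be available wherever the UDF runs (driver and executors).
nltk.download('words')
nltk.download('wordnet')

wnl = WordNetLemmatizer()
# Build the vocabulary once as a set. Calling words.words() for every token
# rebuilds a large list each time and makes each lookup a linear scan.
english_vocab = set(nltk_words.words())

@F.udf('array<string>')
def remove_words(words):
    # Null arrays pass through unchanged.
    if words is None:
        return None
    # Keep only tokens whose lemma appears in the NLTK 'words' corpus.
    return [word for word in words if wnl.lemmatize(word) in english_vocab]

df2 = df.withColumn('words', remove_words('words'))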
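
The filtering rule itself can be tried without Spark or the NLTK downloads. The following is only a minimal sketch: `SAMPLE_VOCAB` is a tiny hypothetical stand-in for the NLTK 'words' corpus, and `keep_english` is an illustrative helper name that does not come from the answer. It also adds an explicit alphabetic check, which is an assumption about how to handle tokens with digits such as Bun20, rather than relying on them being absent from the vocabulary.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.words.words()); use the real corpus in practice.
SAMPLE_VOCAB = {"reporting", "delivery", "scam", "thank", "message", "thursday"}

# Matches tokens made up of ASCII letters only (rejects "", "0a", "Bun20").
ALPHA_ONLY = re.compile(r"[A-Za-z]+")

def keep_english(tokens, vocab):
    """Return the tokens that are purely alphabetic and found in vocab (case-insensitive)."""
    if tokens is None:
        return None
    return [t for t in tokens if ALPHA_ONLY.fullmatch(t) and t.lower() in vocab]

sample = ["pt", "m", "reporting", "delivery", "scam", "thank", "0a", "Bun20", ""]
filtered = keep_english(sample, SAMPLE_VOCAB)
print(filtered)  # ['reporting', 'delivery', 'scam', 'thank']
```

A function like this could be wrapped with F.udf('array<string>') in the same way as remove_words, with the vocabulary set built once outside the function.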
