How to Compare Strings Without Case Sensitivity in Spark RDD?


Problem Description

I have the following dataset:

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

Wanted to find out the number of A's and B's from the above dataset, along with the header. I am using the following code to find the number of A's and the number of B's:

from pyspark import SparkContext
from pyspark.sql import SparkSession

logFile = 'Sample.txt'
spark = SparkSession.builder.appName('GD App').getOrCreate()
logData = spark.read.text(logFile).cache()  # one row per line of text, in a column named 'value'

numAs = logData.filter(logData.value.contains('a')).count()  # rows containing a lowercase 'a'
numBs = logData.filter(logData.value.contains('b')).count()  # rows containing a lowercase 'b'
print('{0} {1}'.format(numAs, numBs))

It returned the output as 1 1. I wanted to compare without case sensitivity. I have tried the following, but it returns the error 'Column' object is not callable:

numAs = logData.filter((logData.value).tolower().contains('a')).count()
numBs = logData.filter((logData.value).tolower().contains('b')).count()

Please help me.

Recommended Answer

To convert to lower case, you should use the lower() function from pyspark.sql.functions (see its documentation). So you could try:

import pyspark.sql.functions as F

# Sample DataFrame with mixed-case values in the 'value' column
logData = spark.createDataFrame(
    [
     (0, 'aB'),
     (1, 'AaA'),
     (2, 'bA'),
     (3, 'bB')
    ],
    ('id', 'value')
)
# Lower-case the column first, so contains() matches case-insensitively
numAs = logData.filter(F.lower(logData.value).contains('a')).count()
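
A minimal follow-up sketch, mirroring the line above for the B count (on the four sample rows, both counts come out to 3):

numBs = logData.filter(F.lower(logData.value).contains('b')).count()
print('{0} {1}'.format(numAs, numBs))  # 3 3 for the sample DataFrame above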


You mention 'I am using the following code to find out num of A's and number of B's.' Note that if you want to count the actual occurrences of a character instead of the number of rows that contain the character, you could do something like:

def count_char_in_col(col: str, char: str):
    # Lower-case the column, strip every character except `char`, then count what remains
    return F.length(F.regexp_replace(F.lower(F.col(col)), "[^" + char + "]", ""))

# Sum the per-row counts over the whole DataFrame
logData.select(count_char_in_col('value', 'a')).groupBy().sum().collect()[0][0]

which in the above example will return 5.
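
As a design note, the empty groupBy().sum() can also be written as a direct aggregation; a minimal sketch, assuming the count_char_in_col helper above is in scope:

total_a = logData.agg(F.sum(count_char_in_col('value', 'a'))).collect()[0][0]
print(total_a)  # 5, same result as the groupBy().sum() version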

Hope this helps!

