How to compare strings case-insensitively in a Spark RDD?
Question
I have the following dataset:
drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000
I want to find the number of A's and B's in the dataset above, including the header row. I am using the following code to find the number of A's and the number of B's.
from pyspark import SparkContext
from pyspark.sql import SparkSession
logFile = 'Sample.txt'
spark = SparkSession.builder.appName('GD App').getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print('{0} {1}'.format(numAs,numBs))
It returned the output 1 1. I want to compare without case sensitivity. I tried the following, but it fails with the error 'Column' object is not callable:
numAs = logData.filter((logData.value).tolower().contains('a')).count()
numBs = logData.filter((logData.value).tolower().contains('b')).count()
Please help.
Recommended answer
To convert to lower case, you should use the lower() function from pyspark.sql.functions. So you could try:
import pyspark.sql.functions as F
logData = spark.createDataFrame(
[
(0,'aB'),
(1,'AaA'),
(2,'bA'),
(3,'bB')
],
('id', "value")
)
numAs = logData.filter(F.lower((logData.value)).contains('a')).count()
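To see what the lowercased filter changes for the data in the question, here is a minimal plain-Python sketch (no Spark involved) that mirrors the row-counting logic on the four lines of the sample file. The helper name count_rows_containing is hypothetical and only illustrates the semantics of contains() with and without lower(); it is not part of PySpark.

```python
# Plain-Python illustration of the row counts; this does not use Spark.
lines = [
    "drug_name,num_prescriber,total_cost",
    "AMBIEN,2,300",
    "BENZTROPINE MESYLATE,1,1500",
    "CHLORPROMAZINE,2,3000",
]


def count_rows_containing(rows, char, ignore_case=False):
    """Count the rows that contain `char` at least once (hypothetical helper)."""
    if ignore_case:
        # Lowercase both sides, analogous to F.lower(col).contains(char).
        return sum(char.lower() in row.lower() for row in rows)
    return sum(char in row for row in rows)


# Case-sensitive: only the header contains a lowercase 'a' or 'b'.
sensitive = (count_rows_containing(lines, "a"), count_rows_containing(lines, "b"))

# Case-insensitive: the data rows now match as well.
insensitive = (
    count_rows_containing(lines, "a", ignore_case=True),
    count_rows_containing(lines, "b", ignore_case=True),
)

print(sensitive, insensitive)  # (1, 1) (4, 3)
```

The case-sensitive pair matches the 1 1 reported in the question; with lowercasing, every line counts for 'a' and all but the CHLORPROMAZINE line count for 'b'.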
You mention 'I am using the following code to find out num of A's and number of B's.' Note that if you want to count the actual occurrences of a character rather than the number of rows that contain the character, you could do something like:
def count_char_in_col(col: str, char: str):
    return F.length(F.regexp_replace(F.lower(F.col(col)), "[^" + char + "]", ""))

logData.select(count_char_in_col('value','a')).groupBy().sum().collect()[0][0]
which in the above example will return 5.
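The figure of 5 can be cross-checked without Spark. The sketch below applies the same idea (lowercase, strip everything that is not the target character, measure the remaining length) to the four sample values using Python's re module. The helper count_char is hypothetical, and the re.escape call is an addition that the Spark version above does not have; it only matters if the target character is special inside a regex character class.

```python
import re

# The `value` column from the example DataFrame in the answer.
values = ["aB", "AaA", "bA", "bB"]


def count_char(text, char):
    """Count case-insensitive occurrences of `char` in `text` (hypothetical helper)."""
    # Lowercase, delete every character that is not `char`, then measure what is left.
    return len(re.sub("[^" + re.escape(char) + "]", "", text.lower()))


total_a = sum(count_char(v, "a") for v in values)
total_b = sum(count_char(v, "b") for v in values)

print(total_a, total_b)  # 5 4
```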
Hope this helps!