Counting all possible word pairs using pyspark
Problem description
I have a text document. I need to find the possible counts of repeating word pairs in the overall document. For example, I have the word document below. The document has two lines, each line terminated by ';'. Document:
My name is Sam My name is Sam My name is Sam;
My name is Sam;
I am working on pair-word counts. The expected output is:
[(('my', 'my'), 3), (('name', 'is'), 7), (('is', 'name'), 3), (('sam', 'sam'), 3), (('my', 'name'), 7), (('name', 'sam'), 7), (('is', 'my'), 3), (('sam', 'is'), 3), (('my', 'sam'), 7), (('name', 'name'), 3), (('is', 'is'), 3), (('sam', 'my'), 3), (('my', 'is'), 7), (('name', 'my'), 3), (('is', 'sam'), 7), (('sam', 'name'), 3)]
If I use:
wordPairCount = (
    rddData.map(lambda line: line.split())
    .flatMap(lambda x: [((x[i], x[i + 1]), 1) for i in range(len(x) - 1)])
    .reduceByKey(lambda a, b: a + b)
)
I get pairs of consecutive words and their counts of occurrence.
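The consecutive-pair logic above can be reproduced in plain Python, without Spark, to see exactly what it yields on the sample document (a minimal sketch; the lowercasing is an assumption added to match the casing of the expected output):

```python
from collections import Counter

document = "My name is Sam My name is Sam My name is Sam; My name is Sam;"

# Split on ';' like the RDD, then count consecutive word pairs per line
counts = Counter()
for line in document.split(";"):
    words = line.lower().split()
    for i in range(len(words) - 1):
        counts[(words[i], words[i + 1])] += 1

print(counts)
```

Note that this only ever pairs adjacent words, so e.g. ('my', 'name') is counted 4 times here, not the 7 the expected output calls for.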
How can I pair each word with every other word in the line and then search for the same pair across all lines?
Can someone take a look? Thanks.
Answer
Your input string:
# spark is SparkSession object
s1 = 'The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle; The Adventure of the Blue Carbuncle;'
# Split the string on ';' and parallelize it to make an RDD
rddData = spark.sparkContext.parallelize(s1.split(";"))
rddData.collect()
# ['The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle The Adventure of the Blue Carbuncle', ' The Adventure of the Blue Carbuncle', '']
import itertools
final = (
    rddData.filter(lambda x: x != "")  # drop the empty string left after the trailing ';'
    .map(lambda x: x.split(" "))  # split each line on spaces
    .flatMap(lambda x: itertools.combinations(x, 2))  # every in-order word pair in the line
    .filter(lambda x: x[0] != "")  # drop pairs starting with the empty token from a leading space
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: x + y)
    .collect()
)
# [(('The', 'of'), 7), (('The', 'Blue'), 7), (('The', 'Carbuncle'), 7), (('Adventure', 'the'), 7), (('Adventure', 'Adventure'), 3), (('of', 'The'), 3), (('the', 'Adventure'), 3), (('the', 'the'), 3), (('Blue', 'The'), 3), (('Carbuncle', 'The'), 3), (('Adventure', 'The'), 3), (('of', 'the'), 7), (('of', 'Adventure'), 3), (('the', 'The'), 3), (('Blue', 'Adventure'), 3), (('Blue', 'the'), 3), (('Carbuncle', 'Adventure'), 3), (('Carbuncle', 'the'), 3), (('The', 'The'), 3), (('of', 'Blue'), 7), (('of', 'Carbuncle'), 7), (('of', 'of'), 3), (('Blue', 'Carbuncle'), 7), (('Blue', 'of'), 3), (('Blue', 'Blue'), 3), (('Carbuncle', 'of'), 3), (('Carbuncle', 'Blue'), 3), (('Carbuncle', 'Carbuncle'), 3), (('The', 'Adventure'), 7), (('The', 'the'), 7), (('Adventure', 'of'), 7), (('Adventure', 'Blue'), 7), (('Adventure', 'Carbuncle'), 7), (('the', 'Blue'), 7), (('the', 'Carbuncle'), 7), (('the', 'of'), 3)]
- Remove any blank strings left over from the first split
- Split x, which is a space-separated string, on spaces
- Create combinations of 2 elements each using itertools.combinations (flatMap pairs each word with every other word in the line)
- Map and reduce as you would for a word count
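The same combinations-based approach can be checked in plain Python against the original question's document (a minimal sketch; the lowercasing matches the casing of the expected output in the question):

```python
import itertools
from collections import Counter

document = "My name is Sam My name is Sam My name is Sam; My name is Sam;"

counts = Counter()
for line in document.split(";"):
    words = line.lower().split()  # split() with no argument drops leading/trailing blanks
    # combinations(words, 2) yields every pair (w1, w2) with w1 before w2 in the line
    counts.update(itertools.combinations(words, 2))

print(counts[("my", "name")])  # matches (('my', 'name'), 7) from the expected output
```

Because combinations preserves order of position, ('my', 'name') and ('name', 'my') are counted separately, exactly as in the expected output above.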