How to get values from RDD dynamically with Python?

Views: 45. Published: 2021/6/24 20:39:59. Tags: python, apache-spark, pyspark

Problem description

Below is a sample record for a book in our campus system. Each book record is a text file. I have loaded the records with:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

This gives me an RDD. One record in the RDD looks like this:

[['Call No: 56CB',
  'Title:  Global Warming',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,',
  'Access:  Public ,'
]]

I am trying to extract the values in tuple positions 4 through N of the RDD. Tuples 0 through 4 are always there, but the RDD may be missing the 5th and later tuples, like this:

[['Call No: 56CB',
  'Title:  Science 101',
  'Type: Serial,',
  'Database:  AWS898,',
  'Microfilm:  Y,'
]]

So the code has to be flexible enough to handle the variable length of the RDD. I have the following code, which gets me tuples 4 and 5, but it is not flexible when the RDD has tuples 4 through 15:

Summary1 = books.map(lambda x: (x[4]))
Summary2 = books.map(lambda x: (x[5]))

I can get the length of the RDD with:

LenRDD = books.map(lambda x: len(x)).collect()

Can you help me write Python code that dynamically gets the tuples from the 4th up to LenRDD?

Here is an example of one of the files:

Call No: 56CB
Title:  Global Warming
Type: Serial
Database:  AWS894
Microfilm:  Y
Access:  Public
Location: Oxford
Size:  987 MB
Key:  677867IPOIO

Recommended answer

As I understand your question, you are trying to filter out the first 4 lines of each text file and keep the remaining lines of each file in the RDD. If that is correct, read the files as you are already doing:

books = sc.wholeTextFiles("file:///data/dir/*/*/*/")

Then write a function that deletes the first four records from a list:

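def delete(x):
    # Remove the first four records in place, but only when the list
    # holds more than four; shorter lists are returned unchanged.
    if len(x) > 4:
        del x[:4]
    return x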
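
As a side note, the same idea can be sketched with list slicing instead of in-place deletion. The helper name `drop_header` and its `n` parameter are made up for illustration and are not part of the original answer. Unlike `delete`, this variant returns an empty list when a record has four lines or fewer, and it does not modify its input:

```python
def drop_header(lines, n=4):
    """Return the lines after the first n, leaving the input list untouched."""
    # Slicing past the end of a list is safe: it simply yields an empty list.
    return lines[n:]


# Lines taken from the sample file in the question (shortened).
record = [
    "Call No: 56CB",
    "Title:  Global Warming",
    "Type: Serial",
    "Database:  AWS894",
    "Microfilm:  Y",
    "Access:  Public",
]
rest = drop_header(record)
print(rest)  # ['Microfilm:  Y', 'Access:  Public']
```

Because the slice adapts to whatever length the list has, no separate length check or per-index `map` is needed.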

Then use the function above to delete the first four lines from each text file and get the remaining lines as an RDD:

summary1 = books.map(lambda x: delete(x[1].split("\n"))).map(lambda x: "\n".join(x))

You should get what you want.
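
If the goal is the field values rather than the raw lines, one possible extension is sketched below. It assumes every line has the `Name: value` shape shown in the sample file. The function name `tail_fields`, its `skip` parameter, and the example path are hypothetical. `wholeTextFiles` yields `(path, content)` pairs, so the function takes one such pair; it is exercised here on a plain tuple so it runs without a Spark cluster:

```python
def tail_fields(pair, skip=4):
    """Turn one (path, content) pair into (field, value) tuples,
    ignoring the first `skip` non-empty lines."""
    _path, content = pair
    # splitlines() avoids the empty trailing element that split("\n")
    # produces when the file ends with a newline.
    lines = [line for line in content.splitlines() if line.strip()]
    fields = []
    for line in lines[skip:]:
        # Split on the first colon only, so a value may itself contain colons.
        name, _, value = line.partition(":")
        fields.append((name.strip(), value.strip()))
    return fields


sample = (
    "file:/data/dir/a/b/c/book.txt",  # hypothetical path, for illustration only
    "Call No: 56CB\nTitle:  Global Warming\nType: Serial\n"
    "Database:  AWS894\nMicrofilm:  Y\nAccess:  Public\n",
)
result = tail_fields(sample)
print(result)  # [('Microfilm', 'Y'), ('Access', 'Public')]

# With Spark, this would be applied as: summary = books.map(tail_fields)
```

Each record then yields as many `(field, value)` tuples as it has lines beyond the fourth, whether that is one or eleven.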
