R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe


Question

I have 2 dataframes, which I need to join using the fuzzyjoin function. I've tried performing the function on the whole dataframes but do not have enough memory to do so. One of the dataframes, [UPRN], acts as source data holding a unique identifier for addresses; the other, [Address], holds addresses that need to be matched to the unique identifier.

I'm aware there's a fair few questions relating to the below but none I've found seem to be answering my query.

I'm looking to split [Address], which is roughly 45000 rows, into manageable chunks (read: smaller dataframes) of, say, 5000 rows, just by row position. I then want to fuzzyjoin each of these small dataframes to the [UPRN] dataframe. For example, [Address1] reads the first 5000 rows, applies the fuzzyjoin and outputs [Join1]; then [Address2] reads rows 5001 to 10000, applies the fuzzyjoin and outputs [Join2], and so on.

A small example of the splitting I'm after is below:

> Address
Street                   Town            PostCode
742 Evergreen Terrace    Springfield     SP12 HS1
84 Evergreen Terrace     Springfield     SP14 DH9
...3 to 5000 skipped
23 Evergreen Terrace     Springfield     SP19 IA18
3230 Evergreen Terrace   Springfield     SP2 K43


**Function to split [Address]**
> Address1
Street                   Town            PostCode
742 Evergreen Terrace    Springfield     SP12 HS1
84 Evergreen Terrace     Springfield     SP14 DH9
...3 to 5000 skipped

> Address2
Street                   Town            PostCode
23 Evergreen Terrace     Springfield     SP19 IA18
3230 Evergreen Terrace   Springfield     SP2 K43
...5003 to 10000 skipped

I then want to sequentially join the Address1 to UPRN, and then Address2 to UPRN, outputting to either individual files (which I can then append) or outputting to the same file. The join function I have already, just need a way to call each separate dataframe. How would I go about doing such a thing? Which functions should I be looking for?

Recommended answer

If you split (e.g. with base::split or dplyr::group_split) your Address data frame into a list of data frames, then you can call purrr::map on the list.
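As a minimal sketch of the splitting step, the following uses base::split with a chunk id computed from the row position. The toy Address data frame and the names chunk_size and chunk_id are assumptions made for illustration; with the real data, chunk_size would be 5000.

```r
# Toy stand-in for the real Address data frame (12 rows instead of ~45000).
Address <- data.frame(
  Street = paste(1:12, "Evergreen Terrace"),
  Town = "Springfield",
  stringsAsFactors = FALSE
)

chunk_size <- 5  # hypothetical; use 5000 for the real data

# One chunk id per row, by position: rows 1-5 get 1, rows 6-10 get 2, and so on.
chunk_id <- ceiling(seq_len(nrow(Address)) / chunk_size)

# split() returns a named list of data frames, one per chunk id.
# The last chunk simply holds whatever rows are left over.
list_of_dfs <- split(Address, chunk_id)

length(list_of_dfs)        # number of chunks
sapply(list_of_dfs, nrow)  # rows in each chunk
```

With list_of_dfs in hand, the join can be mapped over it: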

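# my_match_fun is a placeholder for the matching function your existing join already uses;
# fuzzy_join() needs a match_fun (or multi_match_fun) argument to run.
purrr::map(list_of_dfs, ~ fuzzyjoin::fuzzy_join(x = .x, y = UPRN, by = "Street", match_fun = my_match_fun))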

Your result will be a list of data frames each fuzzyjoined with UPRN. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
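The whole split, join and recombine pattern can be sketched end to end in base R. Everything here is an assumption for illustration: the toy data, the UPRN values, and join_chunk, which is a hypothetical stand-in that matches on edit distance via base adist() rather than the fuzzyjoin package. Swap its body for your own join call.

```r
# Toy data: two Address rows contain typos that an exact join would miss.
Address <- data.frame(
  Street = c("742 Evergreen Terace", "84 Evergreen Terrace",
             "23 Evergren Terrace", "3230 Evergreen Terrace"),
  Town = "Springfield",
  stringsAsFactors = FALSE
)
UPRN <- data.frame(
  Street = c("742 Evergreen Terrace", "84 Evergreen Terrace",
             "23 Evergreen Terrace", "3230 Evergreen Terrace"),
  UPRN = c("U001", "U002", "U003", "U004"),  # made-up identifiers
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real fuzzy join: keep every Address/UPRN pair
# whose Street values are within max_dist edits of each other.
join_chunk <- function(chunk, uprn, max_dist = 1) {
  d <- adist(chunk$Street, uprn$Street)          # rows: chunk, cols: uprn
  hits <- which(d <= max_dist, arr.ind = TRUE)
  hits <- hits[order(hits[, "row"]), , drop = FALSE]  # keep Address order
  data.frame(chunk[hits[, "row"], , drop = FALSE],
             UPRN = uprn$UPRN[hits[, "col"]],
             stringsAsFactors = FALSE)
}

# Split by row position (2 rows per chunk here; 5000 for the real data).
chunk_id <- ceiling(seq_len(nrow(Address)) / 2)
list_of_dfs <- split(Address, chunk_id)

# Join each chunk to UPRN, then stack the per-chunk results into one data frame.
joined_list <- lapply(list_of_dfs, join_chunk, uprn = UPRN)
joined <- do.call(rbind, joined_list)
rownames(joined) <- NULL
```

Only one chunk is compared against UPRN at a time, which is what keeps the intermediate comparison small. With purrr and dplyr loaded, lapply and do.call(rbind, ...) can be replaced by purrr::map followed by dplyr::bind_rows, as described above.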
