R data frames joined by matching value inequality to a range defined by 2 columns


Problem description

In R, I know there are many different ways of joining/merging data frames based on an equals-condition between two or several columns.

However, I need to join two data frames by matching a value against a range defined by 2 columns, using greater-than-or-equal-to for the lower bound and less-than-or-equal-to for the upper bound. If I were using SQL, the query could be:

SELECT * FROM Table1
LEFT JOIN Table2
ON Table1.Value >= Table2.LowLimit AND Table1.Value <= Table2.HighLimit

I know about the sqldf package, but I would like to avoid using that if possible.

The data I am working with is one data frame with ip-addresses, like so:

ipaddresses <- data.frame(IPAddress=c("1.1.1.1","2.2.2.2","3.3.3.3","4.4.4.4"))

The other data frame is the MaxMind geolite2 database, containing an ip-address range start, an ip-address range end, and a geographic location ID:

ip_range_start <- c("1.1.1.0","3.3.3.0")
ip_range_end <- c("1.1.1.255","3.3.3.100")
geolocationid <- c("12345","67890")
ipranges <- data.frame(ip_range_start,ip_range_end,geolocationid)

So, what I need to achieve is a join of ipranges$geolocationid onto ipaddresses, in each case where

ipaddresses$IPAddress >= ipranges$ip_range_start 
AND 
ipaddresses$IPAddress <= ipranges$ip_range_end

With the example data above, that means I need to correctly find that 1.1.1.1 is in the range of 1.1.1.0-1.1.1.255, and 3.3.3.3 is in the range of 3.3.3.0-3.3.3.100.

Solution

Finally, I have found the solution for the general problem, in addition to the above solution to the specific problem of geolocating IP-addresses using the MaxMind database.

This is the general solution for joining two data frames of equal or unequal length, where a value must be compared with an inequality condition (less-than or greater-than) to one or more columns.

The solution uses sapply, which is part of base R.

With the two data frames defined in the question, ipranges and ipaddresses, we have:

ipaddresses$geolocationid <- sapply(ipaddresses$IPAddress, 
    function(x) 
    ipranges$geolocationid[ipranges$ip_range_start <= x & ipranges$ip_range_end >= x])

sapply takes each element of the vector ipaddresses$IPAddress, one at a time, and passes it to the function supplied as its second argument. The results of those calls are collected into a vector, which is the return value of sapply. That vector is what we assign to the new column ipaddresses$geolocationid.
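One detail worth illustrating: when a value falls in no range, the subsetting inside the function returns a zero-length vector, and sapply then returns a list rather than a vector. The following is a minimal sketch of one way around that, not part of the original answer. It assumes toy numeric data, and the helper name `lookup_range` is hypothetical; it returns NA for unmatched values and uses vapply to guarantee one result per input.

```r
# Toy data (assumed for illustration): numeric values and numeric ranges
values <- c(5, 15, 25)
ranges <- data.frame(low = c(0, 20), high = c(10, 30), id = c("a", "b"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: id of the first range containing x, or NA if none does
lookup_range <- function(x, ranges) {
  hit <- ranges$id[ranges$low <= x & ranges$high >= x]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply enforces exactly one character result per input value
ids <- vapply(values, lookup_range, character(1), ranges = ranges)
print(ids)
```

Taking only the first hit is a design choice for this sketch; if ranges can overlap, you may prefer to keep all matches in a list column instead.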

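In this case, the IP addresses should be converted to integers first. Compared as character strings, "3.3.3.100" sorts before "3.3.3.3", so the range test can give wrong results; comparing numbers avoids that, and the sapply operation probably gets faster as well. Here are a few lines that extend the ipaddresses data frame with a column containing the integer version of each IP address: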

# Calculate the integer version of each IP address
octet <- read.table(text = as.character(ipaddresses$IPAddress), sep = ".")
ipaddresses$IPint <- 256^3*octet[,1] + 256^2*octet[,2] + 256*octet[,3] + octet[,4]
# Remove "octet" from the workspace
rm(octet)

You would have to apply the same conversion to the IP addresses in your ipranges data frame, and then compare the integer columns instead of the character ones.
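Putting the pieces together, here is a sketch of the full lookup on the question's data. It is only one possible arrangement: the helper `ip_to_int` and the column names `start_int` and `end_int` are hypothetical, the conversion uses strsplit rather than read.table, and unmatched addresses are given NA.

```r
# Data from the question
ipaddresses <- data.frame(IPAddress = c("1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"),
                          stringsAsFactors = FALSE)
ipranges <- data.frame(ip_range_start = c("1.1.1.0", "3.3.3.0"),
                       ip_range_end   = c("1.1.1.255", "3.3.3.100"),
                       geolocationid  = c("12345", "67890"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: convert dotted-quad strings to numeric values
ip_to_int <- function(ip) {
  parts <- strsplit(as.character(ip), ".", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * 256^(3:0)), numeric(1))
}

# Convert the addresses and both range bounds (assumed column names)
ipaddresses$IPint  <- ip_to_int(ipaddresses$IPAddress)
ipranges$start_int <- ip_to_int(ipranges$ip_range_start)
ipranges$end_int   <- ip_to_int(ipranges$ip_range_end)

# Range lookup on the numeric columns; NA when no range contains the address
ipaddresses$geolocationid <- vapply(ipaddresses$IPint, function(x) {
  hit <- ipranges$geolocationid[ipranges$start_int <= x & ipranges$end_int >= x]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(ipaddresses)
```

With this data, 1.1.1.1 and 3.3.3.3 receive a geolocation ID and the other two addresses stay NA, which matches the left-join behaviour of the SQL query in the question.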
