R data frames joined by matching value inequality to a range defined by 2 columns

Posted: 2017/3/26 2:17:34. Tags: r, join, dataframe, range, ip-address

Problem description

In R, I know there are many different ways of joining/merging data frames based on an equals-condition between two or several columns.

However, I need to join two data frames based on matching a value to a value-range, defined by 2 columns, using greater-than-or-equal-to in one case and less-than-or-equal-to in the other. If I was using SQL, the query could be:
SELECT * FROM Table1
LEFT JOIN Table2
ON Table1.Value >= Table2.LowLimit AND Table1.Value <= Table2.HighLimit
I know about the sqldf package, but I would like to avoid using that if possible.

The data I am working with is one data frame with ip-addresses, like so:
ipaddresses <- data.frame(IPAddress=c("1.1.1.1","2.2.2.2","3.3.3.3","4.4.4.4"))
The other data frame is the MaxMind geolite2 database, containing an ip-address range start, an ip-address range end, and a geographic location ID:
ip_range_start <- c("1.1.1.0","3.3.3.0")
ip_range_end <- c("1.1.1.255","3.3.3.100")
geolocationid <- c("12345","67890")
ipranges <- data.frame(ip_range_start,ip_range_end,geolocationid)
So, what I need to achieve is a join of ipranges$geolocationid onto ipaddresses, in each case where 
ipaddresses$IPAddress >= ipranges$ip_range_start 
AND 
ipaddresses$IPAddress <= ipranges$ip_range_end
With the example data above, that means I need to correctly find that 1.1.1.1 is in the range of 1.1.1.0-1.1.1.255, and 3.3.3.3 is in the range of 3.3.3.0-3.3.3.100.

Solution

Finally, I have found the solution for the general problem, in addition to the above solution to the specific problem of geolocating IP-addresses using the MaxMind database.

This is the general solution for joining two data frames of equal or unequal length, where a value must be compared with an inequality condition (less-than or greater-than) to one or more columns.

The solution is using sapply, which is base R.

With the two data frames defined in the question, ipranges and ipaddresses, we have:
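ipaddresses$geolocationid <- sapply(ipaddresses$IPAddress, function(x) {
    # geolocation IDs of every range that contains x
    hit <- ipranges$geolocationid[ipranges$ip_range_start <= x & ipranges$ip_range_end >= x]
    # NA when no range matches (first match otherwise), so sapply returns a vector, not a list
    if (length(hit) == 0) NA else hit[1]
})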
What sapply does is take each element, one at a time, from the vector ipaddresses$IPAddress and pass it to the function supplied as an argument to sapply. The result for each element is collected into a vector, which is the return value of sapply. That vector is what we insert as the new column ipaddresses$geolocationid.
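
To see this element-by-element behaviour in isolation, here is a small sketch with made-up numeric ranges. The names `limits`, `values`, `raw` and `matched` are illustrative only and do not come from the question. It also shows what happens when a value falls in no range: the function then returns a zero-length result, and sapply can no longer simplify its output to a plain vector.

```r
# Illustrative data (not from the question): numeric ranges and values to look up.
limits <- data.frame(low = c(0, 10), high = c(5, 20), id = c("a", "b"),
                     stringsAsFactors = FALSE)
values <- c(3, 7, 12)

# 7 lies in no range, so the function returns character(0) for it.
# sapply cannot simplify results of unequal length and returns a list.
raw <- sapply(values, function(x) limits$id[limits$low <= x & limits$high >= x])

# Returning NA for "no match" yields exactly one value per element,
# so the result simplifies to a character vector.
matched <- sapply(values, function(x) {
    hit <- limits$id[limits$low <= x & limits$high >= x]
    if (length(hit) == 0) NA_character_ else hit[1]
})
```

Taking `hit[1]` is a design choice for this sketch: if ranges can overlap, you may prefer to keep all matches and work with the list instead.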

In this case, if the IP-addresses are converted to integers first, the sapply operation probably gets faster. Here are a few lines that will extend the ipaddresses data frame with a column containing the integer version of each ip-address:
# calculate the integer version of each IP address
octet <- read.table(text=as.character(ipaddresses$IPAddress), sep=".")
octet$IPint <- 256^3*octet[,1] + 256^2*octet[,2] + 256*octet[,3] + octet[,4]
ipaddresses$IPint <- octet$IPint
# remove "octet" from memory
rm(octet)
You would obviously have to do the same kind of conversion to the IP-addresses in your ipranges dataframe.
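
Putting the pieces together, the following is a minimal sketch of the whole range join on numeric values, using the example data from the question. The helper `ip_to_int` and the column names `start_int` and `end_int` are assumptions introduced here for illustration; they are not part of the original answer. Comparing numbers also avoids relying on string comparison, which works character by character rather than octet by octet.

```r
# Example data from the question, with strings kept as character (not factor).
ipaddresses <- data.frame(IPAddress = c("1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"),
                          stringsAsFactors = FALSE)
ipranges <- data.frame(ip_range_start = c("1.1.1.0", "3.3.3.0"),
                       ip_range_end   = c("1.1.1.255", "3.3.3.100"),
                       geolocationid  = c("12345", "67890"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: convert dotted IPv4 strings to numeric values.
# The result is a double, which can hold every value up to 256^4 - 1 exactly.
ip_to_int <- function(ip) {
    octet <- read.table(text = as.character(ip), sep = ".")
    256^3 * octet[, 1] + 256^2 * octet[, 2] + 256 * octet[, 3] + octet[, 4]
}

ipaddresses$IPint  <- ip_to_int(ipaddresses$IPAddress)
ipranges$start_int <- ip_to_int(ipranges$ip_range_start)
ipranges$end_int   <- ip_to_int(ipranges$ip_range_end)

# Range join on the numeric columns; NA marks addresses outside every range.
ipaddresses$geolocationid <- sapply(ipaddresses$IPint, function(x) {
    hit <- ipranges$geolocationid[ipranges$start_int <= x & ipranges$end_int >= x]
    if (length(hit) == 0) NA_character_ else hit[1]
})
```

With this data, 1.1.1.1 and 3.3.3.3 receive the geolocation IDs of their ranges, while 2.2.2.2 and 4.4.4.4 get NA, which mirrors the unmatched rows of the SQL LEFT JOIN in the question.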
