R data frames joined by matching value inequality to a range defined by 2 columns
Problem description
In R, I know there are many different ways of joining/merging data frames based on an equals-condition between two or several columns.
However, I need to join two data frames by matching a value to a value range defined by 2 columns, using greater-than-or-equal-to against one column and less-than-or-equal-to against the other. If I were using SQL, the query could be:
SELECT * FROM Table1
LEFT JOIN Table2
ON Table1.Value >= Table2.LowLimit AND Table1.Value <= Table2.HighLimit
I know about the sqldf package, but I would like to avoid using it if possible.
The data I am working with is one data frame with ip-addresses, like so:
ipaddresses <- data.frame(IPAddress=c("1.1.1.1","2.2.2.2","3.3.3.3","4.4.4.4"))
The other data frame is the MaxMind geolite2 database, containing an IP-address range start, an IP-address range end, and a geographic location ID:
ip_range_start <- c("1.1.1.0","3.3.3.0")
ip_range_end <- c("1.1.1.255","3.3.3.100")
geolocationid <- c("12345","67890")
ipranges <- data.frame(ip_range_start,ip_range_end,geolocationid)
So, what I need to achieve is a join of ipranges$geolocationid onto ipaddresses, in each case where
ipaddresses$IPAddress >= ipranges$ip_range_start
AND
ipaddresses$IPAddress <= ipranges$ip_range_end
With the example data above, that means I need to correctly find that 1.1.1.1 is in the range of 1.1.1.0-1.1.1.255, and 3.3.3.3 is in the range of 3.3.3.0-3.3.3.100.
Solution
Finally, I have found the solution for the general problem, in addition to the above solution to the specific problem of geolocating IP addresses using the MaxMind database.
This is the general solution for joining two data frames of equal or unequal length, where a value must be compared with an inequality condition (less-than or greater-than) to one or more columns.
The solution is to use sapply, which is base R.
With the two data frames defined in the question, ipranges and ipaddresses, we have:
ipaddresses$geolocationid <- sapply(ipaddresses$IPAddress,
  function(x)
    ipranges$geolocationid[ipranges$ip_range_start <= x & ipranges$ip_range_end >= x])
What sapply does is take each element, one at a time, from the vector ipaddresses$IPAddress and pass it to the function provided as an argument to sapply. The results of applying the function to each element are collected into a vector, which is the output of sapply. That is what we insert as the new column ipaddresses$geolocationid.
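One caveat with this pattern: sapply only simplifies its output to a vector when every call returns a result of the same length. If a value matches no range, the function returns a zero-length result and sapply falls back to a list. The following is a minimal sketch of a variant that always yields exactly one value per row, which is closer to the behaviour of the SQL LEFT JOIN in the question. It uses numeric toy data with the column names from the SQL example; the data frames values and ranges and the helper lookup_id are hypothetical, and taking the first match when ranges overlap is an assumption.

```r
# Toy data (hypothetical): numeric values and numeric ranges
values <- data.frame(Value = c(5, 15, 25, 40))
ranges <- data.frame(LowLimit  = c(0, 20),
                     HighLimit = c(10, 30),
                     id        = c("A", "B"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: return the id of the first range containing x,
# or NA when no range matches (like an unmatched row in a LEFT JOIN)
lookup_id <- function(x) {
  hit <- ranges$id[ranges$LowLimit <= x & ranges$HighLimit >= x]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply enforces one character value per element, so the result
# is always a plain vector that fits into a data frame column
values$id <- vapply(values$Value, lookup_id, character(1))
print(values)
```

vapply is used here instead of sapply only because it fails loudly if the function ever returns something other than a single string; the matching logic itself is the same as in the answer.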
In this case, if the IP addresses are converted to integers first, the sapply operation will probably be faster. Here are a few lines that extend the ipaddresses data frame with a column containing the integer version of each IP address:
#calculating the integer version of each IP-address
octet <- data.frame(read.table(text=as.character(ipaddresses$IPAddress), sep="."))
octet$IPint <- 256^3*octet[,1] + 256^2*octet[,2] + 256*octet[,3] + octet[,4]
ipaddresses$IPint <- octet$IPint
# cleaning "octet" from memory
octet <- NULL
You would obviously have to do the same kind of conversion to the IP addresses in your ipranges data frame.
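Putting the pieces together, here is a sketch of the whole lookup on the example data, with the conversion applied to both data frames. Comparing the numeric values also sidesteps character comparison, which works character by character and would not be expected to place "3.3.3.3" below "3.3.3.100". The helper ip_to_int and the columns start_int and end_int are hypothetical names introduced for illustration; the data frames are created with stringsAsFactors = FALSE as an assumption, and unmatched addresses get NA.

```r
ipaddresses <- data.frame(IPAddress = c("1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"),
                          stringsAsFactors = FALSE)
ipranges <- data.frame(ip_range_start = c("1.1.1.0", "3.3.3.0"),
                       ip_range_end   = c("1.1.1.255", "3.3.3.100"),
                       geolocationid  = c("12345", "67890"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: dotted-quad strings -> numeric value,
# using the same read.table approach as above
ip_to_int <- function(ip) {
  octet <- read.table(text = as.character(ip), sep = ".")
  256^3 * octet[, 1] + 256^2 * octet[, 2] + 256 * octet[, 3] + octet[, 4]
}

# Convert the addresses and both ends of every range
ipaddresses$IPint  <- ip_to_int(ipaddresses$IPAddress)
ipranges$start_int <- ip_to_int(ipranges$ip_range_start)
ipranges$end_int   <- ip_to_int(ipranges$ip_range_end)

# Range lookup on the numeric columns; NA when no range contains the address
ipaddresses$geolocationid <- vapply(ipaddresses$IPint, function(x) {
  hit <- ipranges$geolocationid[ipranges$start_int <= x & ipranges$end_int >= x]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(ipaddresses)
```

On this data, 1.1.1.1 and 3.3.3.3 pick up the IDs of their ranges, while 2.2.2.2 and 4.4.4.4 are left as NA.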