Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date
Question
I am facing the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. Quite often I have to do this over a very large number of records (more than 300 million), so performance is a key issue here.
After a quick search on SO and Google I found 3 alternatives:
- A common arithmetic procedure (/365.25) (link)
- Using the functions new_interval() and duration() from the package lubridate (link)
- The function age_calc() from the package eeptools (link, link, link)
So, here's my toy code:
# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
"1962-12-30", "1962-12-31", "1963-01-01",
"2000-06-16", "2000-06-17", "2000-06-18",
"2007-03-18", "2007-03-19", "2007-03-20",
"1968-02-29", "1968-02-29", "1968-02-29"))
# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
"2015-12-31", "2015-12-31", "2015-12-31",
"2050-06-17", "2050-06-17", "2050-06-17",
"2008-03-19", "2008-03-19", "2008-03-19",
"2015-02-28", "2015-03-01", "2015-03-02"))
# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25
# Use the package lubridate
# (new_interval() was later renamed interval(), which the answer below uses)
require(lubridate)
new_interval(start = birthdate, end = givendate) /
  duration(num = 1, units = "years")
# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")
Let's talk about accuracy later and focus first on performance. Here's the code:
# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
times = 1000
)
# And examine the results
mbm
library(ggplot2)  # provides the autoplot() generic used below
autoplot(mbm)
Here are the results:
Bottom line: the performance of the lubridate and eeptools functions is much worse than that of the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes that it will make.
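To make the inaccuracy concrete, here is a minimal sketch using one of the toy birthdates from above (the variable names birth, given, and days are only for illustration). On the day of the 37th birthday, dividing the elapsed days by 365.25 still yields a value just below 37:

```r
# One toy case: the given date is exactly the 37th birthday.
birth <- as.Date("1978-12-31")
given <- as.Date("2015-12-31")

# Elapsed days between the two dates (Date subtraction is always in days).
days <- as.numeric(given - birth)

days / 365.25          # slightly below 37
floor(days / 365.25)   # 36, although the person turned 37 on that day
```

The error comes from 365.25 being only an average year length, so the quotient can fall on the wrong side of an integer on or near a birthday.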
"Because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person's age, stated according to common usage, common usage meaning that a person's age should always be an integer that increases exactly on a birthday". (link)
As I read in some posts, lubridate and eeptools make no such mistakes (though I haven't looked at the code or read more about those functions to know which method they use), and that's why I wanted to use them. However, their performance is not good enough for my real application.
Any ideas on an efficient and accurate method to calculate the age?
Oops, it seems lubridate also makes mistakes. Based on this toy example, it apparently makes more mistakes than the arithmetic method (see rows 3, 6, 9, and 12 of the output below). Am I doing something wrong?
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df
birthdate givendate arithmetic lubridate eeptools
1 1978-12-30 2015-12-31 37 37 37
2 1978-12-31 2015-12-31 36 37 37
3 1979-01-01 2015-12-31 36 37 36
4 1962-12-30 2015-12-31 53 53 53
5 1962-12-31 2015-12-31 52 53 53
6 1963-01-01 2015-12-31 52 53 52
7 2000-06-16 2050-06-17 50 50 50
8 2000-06-17 2050-06-17 49 50 50
9 2000-06-18 2050-06-17 49 50 49
10 2007-03-18 2008-03-19 1 1 1
11 2007-03-19 2008-03-19 1 1 1
12 2007-03-20 2008-03-19 0 1 0
13 1968-02-29 2015-02-28 46 47 46
14 1968-02-29 2015-03-01 47 47 47
15 1968-02-29 2015-03-02 47 47 47
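The lubridate column above is consistent with one year being treated as a fixed 365 days. That is an assumption about the lubridate version used here, not something the post confirms. The base R sketch below (variable names are illustrative) reproduces the pattern for rows 3 and 13 without calling lubridate at all:

```r
# Rows 3 and 13 of the toy data.
birth <- as.Date(c("1979-01-01", "1968-02-29"))
given <- as.Date(c("2015-12-31", "2015-02-28"))

# Elapsed days for each pair of dates.
days <- as.numeric(given - birth)

# Assumed fixed 365-day year: ignores leap days, so it overcounts.
floor(days / 365)      # 37 47

# Average 365.25-day year, as in the arithmetic column.
floor(days / 365.25)   # 36 46
```

In both rows the birthday has not yet been reached on the given date, so 36 and 46 are the ages in common usage.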
Recommended answer
OK, so I found this function in another post. It was posted by @Jim, who said: "The following function takes a vector of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers".
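The function body itself is not reproduced in this text. What follows is a minimal sketch that matches the description (Date vectors in, whole years out, built around ifelse); it is an assumed reconstruction and not necessarily @Jim's exact code. The argument names from and to are taken from the benchmark call below.

```r
# Sketch of a vectorised age-in-years function (assumed reconstruction).
age <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt   <- as.POSIXlt(to)

  # Difference in calendar years, before checking whether the birthday
  # has already been reached in the final year.
  years <- to_lt$year - from_lt$year

  # TRUE where the month/day of `to` is earlier than the month/day of `from`.
  before_birthday <- to_lt$mon < from_lt$mon |
    (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)

  # Subtract one year where the birthday has not been reached yet.
  ifelse(before_birthday, years - 1, years)
}

# Leap-day birthdate from the toy data: 46 on Feb 28, 47 on Mar 1.
age(as.Date("1968-02-29"), as.Date(c("2015-02-28", "2015-03-01")))
```

Because it only compares integer year, month, and day fields, the sketch needs no division by a year length, which is why it avoids the rounding problems of the /365.25 approach.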
It is indeed simpler, and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).
mbm <- microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
age = age(from = birthdate, to = givendate),
times = 1000
)
mbm
autoplot(mbm)
And at least in my examples it does not make any mistakes (and it should not in any example; it is a pretty straightforward function built on ifelse calls).
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df
birthdate givendate arithmetic lubridate eeptools age
1 1978-12-30 2015-12-31 37 37 37 37
2 1978-12-31 2015-12-31 36 37 37 37
3 1979-01-01 2015-12-31 36 37 36 36
4 1962-12-30 2015-12-31 53 53 53 53
5 1962-12-31 2015-12-31 52 53 53 53
6 1963-01-01 2015-12-31 52 53 52 52
7 2000-06-16 2050-06-17 50 50 50 50
8 2000-06-17 2050-06-17 49 50 50 50
9 2000-06-18 2050-06-17 49 50 49 49
10 2007-03-18 2008-03-19 1 1 1 1
11 2007-03-19 2008-03-19 1 1 1 1
12 2007-03-20 2008-03-19 0 1 0 0
13 1968-02-29 2015-02-28 46 47 46 46
14 1968-02-29 2015-03-01 47 47 47 47
15 1968-02-29 2015-03-02 47 47 47 47
I do not consider it a complete solution because I also wanted age in months and weeks, and this function is specific to years. I am posting it here anyway because it solves the problem for age in years. I will not accept it because:
- I will wait for @Jim to post it as an answer.
- I will wait to see whether someone else comes up with a complete solution (efficient, accurate, and producing age in years, months, or weeks as desired).
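For the months and weeks cases, the same field-comparison idea can be extended. The sketch below is only one possible approach and is not part of the original answer. The helpers age_months() and age_weeks() are hypothetical names, and the conventions they use (a month is completed once the day of month is reached; a week is any full block of 7 days) are assumptions that may need adjusting, for example for birthdates at the end of a month.

```r
# Hypothetical helper: completed calendar months between two Date vectors.
age_months <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt   <- as.POSIXlt(to)

  # Month difference from the year and month fields alone.
  months <- 12 * (to_lt$year - from_lt$year) + (to_lt$mon - from_lt$mon)

  # One month fewer where the day of month has not been reached yet.
  months - as.integer(to_lt$mday < from_lt$mday)
}

# Hypothetical helper: completed weeks, counted as full 7-day blocks.
age_weeks <- function(from, to) {
  floor(as.numeric(to - from) / 7)
}

# Toy case: one day short of the first birthday.
age_months(as.Date("2007-03-20"), as.Date("2008-03-19"))  # 11
age_weeks(as.Date("2007-03-20"), as.Date("2008-03-19"))   # 52
```

Both helpers are vectorised and use only base R, so they should scale in the same way as the years-only function, although that has not been benchmarked here.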