Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date
I am facing the common task of calculating the age (in years, months, or weeks) given the date of birth and an arbitrary date. The thing is that quite often I have to do this over many many records (>300 millions), so performance is a key issue here.
After a quick search in SO and Google I found 3 alternatives:
- A common arithmetic procedure (/365.25) (link)
- Using functions
new_interval()
andduration()
from packagelubridate
(link) - Function
age_calc()
from packageeeptools
(link, link, link)
So, here's my toy code:
# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01",
"1962-12-30", "1962-12-31", "1963-01-01",
"2000-06-16", "2000-06-17", "2000-06-18",
"2007-03-18", "2007-03-19", "2007-03-20",
"1968-02-29", "1968-02-29", "1968-02-29"))
# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31",
"2015-12-31", "2015-12-31", "2015-12-31",
"2050-06-17", "2050-06-17", "2050-06-17",
"2008-03-19", "2008-03-19", "2008-03-19",
"2015-02-28", "2015-03-01", "2015-03-02"))
# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25
# Use the package lubridate
require(lubridate)
new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years")
# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")
Let's talk later about accuracy and focus first on performance. Here's the code:
# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
times = 1000
)
# And examine the results
mbm
autoplot(mbm)
Here the results:
Bottom line: performance of lubridate
and eeptools
functions is much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough and I cannot afford the few mistakes that this method will make.
"because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person’s age, stated according to common usage — common usage meaning that a person’s age should always be an integer that increases exactly on a birthday". (link)
As I read on some posts, lubridate
and eeptools
make no such mistakes (though, I haven't looked at the code/read more about those functions to know which method they use) and that's why I wanted to use them, but their performance does not work for my real application.
Any ideas on an efficient and accurate method to calculate the age?
EDIT
Ops, it seems lubridate
also makes mistakes. And apparently based on this toy example, it makes more mistakes than the arithmetic method (see lines 3, 6, 9, 12). (am I doing something wrong?)
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = new_interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df
birthdate givendate arithmetic lubridate eeptools
1 1978-12-30 2015-12-31 37 37 37
2 1978-12-31 2015-12-31 36 37 37
3 1979-01-01 2015-12-31 36 37 36
4 1962-12-30 2015-12-31 53 53 53
5 1962-12-31 2015-12-31 52 53 53
6 1963-01-01 2015-12-31 52 53 52
7 2000-06-16 2050-06-17 50 50 50
8 2000-06-17 2050-06-17 49 50 50
9 2000-06-18 2050-06-17 49 50 49
10 2007-03-18 2008-03-19 1 1 1
11 2007-03-19 2008-03-19 1 1 1
12 2007-03-20 2008-03-19 0 1 0
13 1968-02-29 2015-02-28 46 47 46
14 1968-02-29 2015-03-01 47 47 47
15 1968-02-29 2015-03-02 47 47 47
Solution 1:
The reason lubridate appears to be making mistakes above is that you are calculating duration (the exact amount of time that occurs between two instants, where 1 year = 31536000s), rather than periods (the change in clock time that occurs between two instants).
To get the change in clock time (in years, months, days, etc) you need to use
as.period(interval(start = birthdate, end = givendate))
which gives the following output
"37y 0m 1d 0H 0M 0S"
"37y 0m 0d 0H 0M 0S"
"36y 11m 30d 0H 0M 0S"
...
"46y 11m 30d 1H 0M 0S"
"47y 0m 0d 1H 0M 0S"
"47y 0m 1d 1H 0M 0S"
To just extract years, you can use the following
as.period(interval(start = birthdate, end = givendate))$year
[1] 37 37 36 53 53 52 50 50 49 1 1 0 46 47 47
Note sadly appears even slower than the methods above!
> mbm
Unit: microseconds
expr min lq mean median uq max neval cld
arithmetic 116.595 138.149 181.7547 184.335 196.8565 5556.306 1000 a
lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935 1000 b
Solution 2:
Ok, so I found this function in another post:
age <- function(from, to) {
from_lt = as.POSIXlt(from)
to_lt = as.POSIXlt(to)
age = to_lt$year - from_lt$year
ifelse(to_lt$mon < from_lt$mon |
(to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
age - 1, age)
}
It was posted by @Jim saying "The following function takes a vectors of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers".
It is indeed simpler and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).
mbm <- microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
age = age(from = birthdate, to = givendate),
times = 1000
)
mbm
autoplot(mbm)
And at least in my examples it does not make any mistake (and it should not in any example; it's a pretty straightforward function using ifelse
s).
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df
birthdate givendate arithmetic lubridate eeptools age
1 1978-12-30 2015-12-31 37 37 37 37
2 1978-12-31 2015-12-31 36 37 37 37
3 1979-01-01 2015-12-31 36 37 36 36
4 1962-12-30 2015-12-31 53 53 53 53
5 1962-12-31 2015-12-31 52 53 53 53
6 1963-01-01 2015-12-31 52 53 52 52
7 2000-06-16 2050-06-17 50 50 50 50
8 2000-06-17 2050-06-17 49 50 50 50
9 2000-06-18 2050-06-17 49 50 49 49
10 2007-03-18 2008-03-19 1 1 1 1
11 2007-03-19 2008-03-19 1 1 1 1
12 2007-03-20 2008-03-19 0 1 0 0
13 1968-02-29 2015-02-28 46 47 46 46
14 1968-02-29 2015-03-01 47 47 47 47
15 1968-02-29 2015-03-02 47 47 47 47
I do not consider it as a complete solution because I also wanted to have age in months and weeks, and this function is specific for years. I post it here anyway because it solves the problem for the age in years. I will not accept it because:
- I would wait for @Jim to post it as an answer.
- I will wait to see if someone else come up with a complete solution (efficient, accurate and producing age in years, months or weeks as desired).
Solution 3:
I was going to leave this in the comments, but I think it's worthy of a separate answer. As @Molx points out, your "arithmetic" method is not as simple as it seems -- take a look at the code for -.Date
, most importantly:
return(difftime(e1, e2, units = "days"))
Thus, the "arithmetic" method on objects of class Date
is really a wrapper for the difftime
function. What about difftime
? This too has a bunch of overhead if what you're after is raw speed.
The key is that Date
objects are stored as an integer number of days since/until Jan. 1, 1970 (though they're not actually stored as integer
, hence the birth of the IDate
class in data.table
), so we can just subtract these and be done with it, but to avoid the -.Date
method being called, we have to unclass
our inputs:
(unclass(birthdate) - unclass(givendate)) / 365.25
As far as bang for your buck goes, this approach is another several orders of magnitude faster than even @Jim's age
method.
Here's some more scaled-up test data:
set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
sample(1901:2030, NN, TRUE),
sample(12, NN, TRUE),
sample(28, NN, TRUE)))
#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))
(excluding eeptools
because it is almost impossibly slower--a glance at the code for age_calc
suggests the code goes as far as to create a sequence of dates for each pair of dates (O(n^2)
-ish), not to mention a peppering of ifelse
s)
microbenchmark(
arithmetic = (givendate - birthdate) / 365.25,
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
age = age(from = birthdate, to = givendate),
fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
overlaps = get_age(birthdate, givendate),
times = 50)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# arithmetic 28.153465 30.384639 62.96118 31.492764 34.052991 180.9556 50 b
# lubridate 94.327968 97.233009 157.30420 102.751351 240.717065 265.0283 50 c
# age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149 50 d
# fastar 7.740098 7.831528 11.02521 7.913146 8.090902 153.3645 50 a
# overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929 50 d
Thus we also highlight the folly of benchmarking on small-scale data.
The big cost of @Jim's method is that as.POSIXlt
is increasingly expensive as your vectors grow.
The issue of inaccuracy remains, but unless this accuracy is paramount, it seems the unclass
method is unparalleled.
Solution 4:
I have been hammering away at this and finally have something which is a) perfectly accurate* (in contrast to all of the other options presented thus far) and b) reasonably fast (see my benchmarks in the other answer). It relies on a bunch of arithmetic I did by hand and the wonderful foverlaps
function from the data.table
package.
The essence of the approach is to work from the integer representation of Date
s, as well as to recognize that all birth dates fall in one of four 1461 (= 365 * 4 + 1)-day cycles, depending on when the next year is when it will take 366 days for your birthday to come.
Here's the function:
library(data.table)
get_age <- function(birthdays, ref_dates){
x <- data.table(bday <- unclass(birthdays),
#rem: how many days has it been since the lapse of the
# most recent quadrennium since your birth?
rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
#cycle_type: which of the four years following your birthday
# was the one that had 366 days?
x[ , cycle_type :=
foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
#these intervals were calculated by hand;
# e.g., 59 is Feb. 28, 1970. I made the judgment
# call to say that those born on Feb. 29 don't
# have their "birthday" until the following March 1st.
data.table(start = c(0L, 59L, 424L, 790L, 1155L),
end = c(58L, 423L, 789L, 1154L, 1460L),
val = c(3L, 2L, 1L, 4L, 3L),
key = "start,end"))$val]
I4 <- diag(4L)[ , -4L] #for conciseness below
#The `by` approach might seem a little abstruse for those
# not familiar with `data.table`; see the edit history
# for a more palatable version (which is also slightly slower)
x[ , extra :=
foverlaps(data.table(start = rem, end = rem),
data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
I4[.BY[[1L]],])),
end = c(st[-1L] - 1L, 1461L),
int_yrs = 0:3, key = "start,end")
)[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
#grand finale -- 4 years for every quadrennium, plus the fraction:
4L * ((ref - bday) %/% 1461L) + x$extra
}
Comparing on your main example:
toy_df <- data.frame(
birthdate = birthdate,
givendate = givendate,
arithmetic = as.numeric((givendate - birthdate) / 365.25),
lubridate = interval(start = birthdate, end = givendate) /
duration(num = 1, units = "years"),
eeptools = age_calc(dob = birthdate, enddate = givendate,
units = "years"),
mine = get_age(birthdate, givendate)
)
toy_df
# birthdate givendate arithmetic lubridate eeptools mine
# 1 1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
# 2 1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
# 3 1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
# 4 1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
# 5 1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
# 6 1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
# 7 2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
# 8 2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
# 9 2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
# 10 2007-03-18 2008-03-19 1.0047912 1.005479 1.0027322 1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
# 11 2007-03-19 2008-03-19 1.0020534 1.002740 1.0000000 1.0000000
# 12 2007-03-20 2008-03-19 0.9993155 1.000000 0.9966839 0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
# 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
# 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
# 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322
This style of approach could be extended to handle months/weeks pretty easily. Months will be a bit long-winded (have to specify 4 years' worth of month lengths), so I didn't bother; weeks is easy (weeks are unaffected by leap year considerations, so we can just divide by 7).
I also made a lot of progress on doing this with base
functionalities, but a) it was quite ugly (needs a non-linear transformation of 0-1460 to avoid doing nested ifelse
statements, etc.) and b) in the end a for loop (in the form of apply
over the whole list of dates) was unavoidable, so I decided that would slow things down too much. (the transformation is x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1
, for posterity)
I've added this function to my package.
*(for dates ranges when non-leap centuries are not a concern; I believe the extension to handle such dates shouldn't be too burdensome, however)