How can I fix corrupted dates in R?
You need none of the packages you've loaded, nor do you need to use Reduce
, as functions we're using here are naturally "vectorized".
Here's a sample of your data. (A good question includes data in an easily copied format such as this.)
x <- c("19/9/1997", "22/9/1997", "23/9/1997", "24/9/1997", "25/9/1997",
"26/9/1997", "29/9/1997", "30/9/1997",
"35440", "35471", "35499", "35591", "35621",
"35652", "35683", "35713")
dates <- as.Date(x, format="%d/%m/%Y")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA
Not surprisingly, the second-half of the dates are not recognized given format="%d/%m/%Y"
. You mentioned the use of "%m/%d/%Y"
in your question, so we can (1) do a literal second-pass for this format (un-utilized with this example, but still relevant for your work?):
dates[is.na(dates)] <- as.Date(x[is.na(dates)], format="%m/%d/%Y")
where [is.na(dates)]
only works on the un-converted elements.
(2) If we have more than one other format, you can always use a vector of them and loop over them. (For this, I'll start over, since this loop would replace/augment the first steps above.)
formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
nas <- is.na(dates)
dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" NA NA
# [11] NA NA NA NA NA
# [16] NA
This still leaves us with NA
s for the integer-looking ones. For these you need to specify the origin=
to be able to figure it out (as well as converting to an integer). R typically works with an origin of "1970-01-01"
, which you can confirm with
as.integer(Sys.Date())
# [1] 17787
Sys.Date() - 17787
# [1] "1970-01-01"
but it appears that your dates have an origin of "1900-01-01"
, I think that's Excel's default storage of dates (but it doesn't matter here):
x[9] # the first integer-looking element
# [1] "35440"
dates[1] - as.integer(x[9])
# [1] "1900-09-08"
(I'm assuming that your dates are from the same relative period of time.)
From here:
nas <- is.na(dates)
dates[nas] <- as.Date(as.integer(x[nas]), origin="1900-01-01")
dates
# [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
# [6] "1997-09-26" "1997-09-29" "1997-09-30" "1997-01-12" "1997-02-12"
# [11] "1997-03-12" "1997-06-12" "1997-07-12" "1997-08-12" "1997-09-12"
# [16] "1997-10-12"
(Working on the indices of only NA
elements is relatively efficient in that it only works on and replaces the not-yet-matched entries. If there is nothing left when it gets to another call to as.Date
, it does still call it but with an argument of length 0, with which the function works rather efficiently. I don't think adding a conditional of if (any(nas)) ...
would help, but if there are further methods you need that might be more "expensive", you can consider it.)