Filling in missing values (blanks) in a data table, per category, backwards and forwards

Solution 1:

A more concise example would have been easier to answer. For example, you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?

Since you already have NAs in the data that you wish to fill, this isn't really a job for roll in data.table. A rolling join is for when your data has no NAs but you have another time series (for example) whose observations fall in between the rows of your data. (One efficiency advantage there is that you never create the NAs in the first place, so there is no second step to fill them.) In other words, in your question you have just one dataset; you aren't joining two.

So you do need na.locf as @Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.

In data.table, to use na.locf by group it's just:

require(data.table)
require(zoo)
# na.rm=FALSE keeps leading NAs so the result has one value per row in each group
DT[,doctor:=na.locf(doctor,na.rm=FALSE),by=patient]

which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non-NA value backwards.
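A minimal sketch of such a small function, assuming zoo's na.locf with its fromLast argument does the backward pass. The name fill_both and the toy data are made up for illustration; they are not from the question.

```r
library(data.table)
library(zoo)

# Hypothetical helper: carry the last observation forward, then fill any
# remaining leading NAs by carrying the next observation backward.
fill_both <- function(x) {
  x <- na.locf(x, na.rm = FALSE)               # forward pass, leading NAs kept
  na.locf(x, na.rm = FALSE, fromLast = TRUE)   # backward pass for leading NAs
}

# Toy data (made up), already sorted by patient and visit
DT <- data.table(
  patient = c(1L, 1L, 1L, 2L, 2L, 2L),
  visit   = c(1L, 2L, 3L, 1L, 2L, 3L),
  doctor  = c(NA, "Smith", NA, "Jones", NA, "Lee")
)

# Fill within each patient, updating the column by reference
DT[, doctor := fill_both(doctor), by = patient]
```

A group whose values are all NA stays all NA, since there is nothing to carry in either direction.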

First, ensure the data is sorted by patient, then date. The above will then cope with changes in doctor over time, since by maintains the order of rows within each group.
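As a rough illustration of the sorting step, setkey sorts a data.table by reference (setorder would also do if you don't want a key). The column names and values here are invented for the example.

```r
library(data.table)
library(zoo)

# Toy data (made up), deliberately out of order
DT <- data.table(
  patient = c(2L, 1L, 1L, 2L, 1L),
  date    = as.Date(c("2012-03-01", "2012-01-10", "2012-02-15",
                      "2012-01-05", "2012-03-20")),
  doctor  = c(NA, "Smith", NA, "Jones", "Brown")
)

# Sort by patient, then date, by reference
setkey(DT, patient, date)

# Forward fill within each patient; the doctor can change over time
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]
```

Without the sort, patient 2's missing doctor on 2012-03-01 would sit before the 2012-01-05 row and have nothing to carry forward from.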

Hope that gives you some hints.

Solution 2:

@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.

In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.

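library(data.table)
library(zoo)
setDT(bill)  # convert the data.frame to a data.table by reference
# pass 1: carry the last observation forward within each patient
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE),
     by=list(patient.last.name, patient.first.name, medical.record.nr)]
# pass 2: carry the next observation backward to fill any leading NAs
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE,fromLast=TRUE),
     by=list(patient.last.name, patient.first.name, medical.record.nr)]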

Then do the same two passes for referring.doctor.first.
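One possible way to handle both doctor columns in a single call is the lapply(.SD, ...) idiom with .SDcols. This is only a sketch: the helper fill_both and the toy bill table are assumptions made up for the example, with column names borrowed from the code above.

```r
library(data.table)
library(zoo)

# Hypothetical helper: forward fill, then backward fill the leading NAs
fill_both <- function(x) {
  na.locf(na.locf(x, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE)
}

# Toy data (made up) with the same column names as above
bill <- data.table(
  patient.last.name      = c("Doe", "Doe", "Doe", "Roe", "Roe"),
  patient.first.name     = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  medical.record.nr      = c(1L, 1L, 1L, 2L, 2L),
  referring.doctor.last  = c(NA, "Smith", NA, "Jones", NA),
  referring.doctor.first = c(NA, "Sam", NA, "Jo", NA)
)

cols <- c("referring.doctor.last", "referring.doctor.first")

# Apply the helper to every column in cols, per patient, by reference
bill[, (cols) := lapply(.SD, fill_both),
     by = list(patient.last.name, patient.first.name, medical.record.nr),
     .SDcols = cols]
```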

A few pointers:

  1. The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.

  2. One must use the na.rm=FALSE argument. Otherwise, a patient whose referring physician is missing on their very first visit will have that leading NA removed, and the vector of new values (existing + carried forward) will be one element shorter than the number of rows. The shortened vector is then recycled: everything shifts up one row and the last row gets the first element of the vector. In other words, a big mess. Worst of all, you will only see it sometimes.

  3. Use fromLast=TRUE to run through the column again. That fills in the NAs that preceded any data: instead of last observation carried forward (LOCF), zoo then does next observation carried backward (NOCB). Happiness: you have now filled in the missing data in a way that is correct for most circumstances.

  4. You can make multiple assignments in one call with the functional form of :=, e.g. DT[,`:=`(new=1L,new2=2L,...)]
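Pointers 2 and 3 can be checked on a plain vector, outside data.table. The vector below is made up, with one leading NA to mimic a patient whose first visit has no referring doctor.

```r
library(zoo)

x <- c(NA, "Smith", NA, "Jones")   # made-up values, leading NA on purpose

# Default na.rm = TRUE drops the leading NA, so the result is one element short
short <- na.locf(x)
length(short)                      # 3, while length(x) is 4

# na.rm = FALSE keeps the leading NA, so the length matches the input
forward <- na.locf(x, na.rm = FALSE)

# A second pass with fromLast = TRUE fills the leading NA from the next value
both <- na.locf(forward, na.rm = FALSE, fromLast = TRUE)
```

Here forward still starts with NA, and both has no NAs left.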