Can Haskell's garbage collector prematurely collect "inner" objects?

You talk about question 2 in terms of a "risk". This is a curious choice of phrasing: how would such GC behavior affect you negatively? In practice, GHC will probably not collect your sub-object. In principle, a sufficiently smart GC might do so if it has enough insight into how your program behaves, e.g. through inlining. The point of garbage collection is that it only collects objects you won't use again. If you're not using them again, why is it a "risk" that they might be collected? Rather, most people would see it as a benefit! The obsolete object can be collected early, freeing up memory for objects that are presently more important. So again, I suggest thinking more about why you consider this a risk: if you consider it dangerous, probably either your assumptions or your program are broken.


The simple answer to (2) is "no". GHC's garbage collector is designed so that if a multi-field value is marked as "used", all of its fields (its "inner objects") are also marked as "used". As @amalloy has pointed out, this is not necessarily a good thing. It would probably be better if an unused object could be garbage collected even if it was a field of some used object. However, designing a garbage collector to do this is difficult and mostly pointless (as I'll explain in a minute), which also helps answer (3): yes, it's pretty much guaranteed that the GHC garbage collector will never be redesigned to collect "inner objects" from in-use "outer objects".

The reason designing a garbage collector to collect unused fields of used objects is pointless is that this is something that can better be accomplished during compilation, by applying appropriate optimizations. Consider the following program:

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

data Car = Car { carName :: String, carFunctionality :: ByteString }

main :: IO ()
main = do
  complexFunctionality <- BS.readFile "/etc/passwd"
  let myCar = Car {
        carName = "myCar",
        carFunctionality = complexFunctionality }
  print $ carName myCar
  print "done with myCar"

If you compile this with optimizations and dump the optimized code, using:

ghc -O2 Example1.hs -fforce-recomp -ddump-simpl -dsuppress-all -dsuppress-uniques

the resulting optimized code (which is admittedly hard to read) is equivalent to:

main = do
    BS.readFile "/etc/passwd"
    print "myCar"
    print "done with myCar"

(In fact, if you compile this version instead, the resulting optimized code is exactly the same as the optimized code for the original main.)

GHC does not optimize away the readFile itself, because it's an I/O operation, and GHC does not optimize away I/O operations, even if they "obviously" have no effect. But the result of the readFile is ignored, which means that, effectively, the carFunctionality ByteString can be garbage collected immediately after the readFile statement. For that matter, the Car object is optimized away entirely, so it is never created, much less garbage collected.

So, the bottom line is that compile-time optimizations can usually realize most of the benefits of "inner object garbage collection" by eliminating the dependency of an "outer object" on an "inner object" in the first place, which means that the garbage collector itself doesn't need any special functionality to identify and collect "inner objects", as they'll naturally be collected as unused objects in the optimized code.


There are some cases where an optimization in GHC's GC algorithm can cause it to collect unused fields in a data type. The heap object documentation describes a type of heap object, the selector thunk, which is created from certain code specifically to enable this optimization.

Selector thunks are generated for any function of the form \x -> case x of Pat n1 n2 n3 ... -> nk, i.e., a function that matches a single constructor and returns exactly one field from it. When the garbage collector encounters a selector thunk applied to a value, it looks ahead. If the value is already evaluated and matches the expected constructor, the collector rewrites the thunk into a direct reference to that field, removing the thunk's reference to the containing data value. If the containing value is then no longer reachable from the GC roots, it is collected along with anything it contained that is no longer reachable.
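To make the shape concrete, here is a minimal sketch contrasting a selector-shaped function with one that is not. The carPayload field and the functions nameOf and nameLength are hypothetical names introduced only for this illustration, and the Car type is a simplified stand-in for the one above. Whether GHC actually emits a selector thunk for a given binding depends on the optimization level and on how the code is compiled, so treat the comments as a description of the pattern, not a guarantee.

```haskell
module Main where

-- Simplified stand-in for the Car record; carPayload is a hypothetical
-- field standing in for some large inner object.
data Car = Car { carName :: String, carPayload :: [Int] }

-- Selector-shaped: matches a single constructor and returns exactly one
-- of its fields, with no further work on that field.
nameOf :: Car -> String
nameOf c = case c of
  Car n _ -> n

-- Not selector-shaped: it computes something from the field, so a lazy
-- application is an ordinary thunk that keeps the whole Car reachable
-- until the thunk is forced.
nameLength :: Car -> Int
nameLength c = case c of
  Car n _ -> length n

main :: IO ()
main = do
  let car  = Car "myCar" [1 .. 1000000]
      name = nameOf car  -- lazy binding: a candidate for a selector thunk
  putStrLn name
  print (nameLength car)
```

If a binding like name is the only thing still referring to car when a collection runs, and car has already been evaluated, the rewrite described above is what would let the payload be reclaimed before name is ever forced.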

This process can result in unused record fields being collected even while you might still think of the whole record as being closed over by an unevaluated expression. But the situation is rare, and you need a fairly sophisticated mental model of laziness to even recognize when it might be keeping more in memory than necessary.
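As a rough illustration of an unevaluated expression closing over a whole record, consider the sketch below. The names carPayload, summarizeLazy and summarizeStrict are hypothetical and exist only for this example. It assumes the lazy tuple component is left as an ordinary thunk, which is not selector-shaped because it calls length, so the rewrite described above would not apply to it.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- Simplified stand-in record; carPayload is a hypothetical large field.
data Car = Car { carName :: String, carPayload :: [Int] }

-- The first tuple component is a lazy thunk for `length (carName car)`.
-- Until it is forced, that thunk refers to `car`, and through it to
-- every field of the record, including the payload.
summarizeLazy :: Car -> (Int, String)
summarizeLazy car = (length (carName car), "summary")

-- The bang pattern forces the length when the tuple itself is evaluated,
-- so the tuple holds a plain Int and no longer refers to `car`.
summarizeStrict :: Car -> (Int, String)
summarizeStrict car =
  let !n = length (carName car)
  in (n, "summary")

main :: IO ()
main = do
  let car = Car "myCar" [1 .. 1000000]
  print (fst (summarizeLazy car))
  print (fst (summarizeStrict car))
```

Both versions compute the same result. The only difference is how long the record, and therefore its unused payload, stays reachable through the returned tuple.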