What's so bad about Lazy I/O?

Solution 1:

Lazy IO has the problem that releasing whatever resource you have acquired is somewhat unpredictable, as it depends on how your program consumes the data -- its "demand pattern". Once your program drops the last reference to the resource, the GC will eventually run and release that resource.
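To see the "demand pattern" at work, consider this sketch (hypothetical code, using only System.IO): the handle below is never closed explicitly, so when it is released depends entirely on how much of the returned string the caller later forces.

import System.IO

-- Nothing is read here yet: hGetContents returns at once, and the actual
-- reads happen only as the caller forces the string. The handle stays
-- open until the contents are fully consumed or the GC finalizes it.
firstLine :: FilePath -> IO String
firstLine path = do
    h <- openFile path ReadMode
    contents <- hGetContents h
    return (takeWhile (/= '\n') contents)

If the caller forces only a prefix of the result and then drops it, the file stays open until some later GC run happens to finalize the handle.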

Lazy streams are a very convenient style to program in. This is why shell pipes are so fun and popular.

However, if resources are constrained (as in high-performance scenarios, or production environments that expect to scale to the limits of the machine) relying on the GC to clean up can be an insufficient guarantee.

Sometimes you have to release resources eagerly, in order to improve scalability.

So what are the alternatives to lazy IO that don't mean giving up incremental processing (since processing everything in one huge batch would consume too many resources)? Well, we have foldl-based processing, a.k.a. iteratees or enumerators, introduced by Oleg Kiselyov in the late 2000s and since popularized by a number of networking-based projects.

Instead of processing data as lazy streams, or in one huge batch, we instead abstract over chunk-based strict processing, with guaranteed finalization of the resource once the last chunk is read. That's the essence of iteratee-based programming, and it offers very nice resource guarantees.
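To make that concrete, here is a minimal sketch of the pattern with no library at all; foldChunks is a hypothetical stand-in for what iteratee/enumerator libraries generalize. It folds a step function over fixed-size strict chunks, and bracket guarantees the handle is closed the moment the fold finishes, whatever the consumer does.

{-# LANGUAGE BangPatterns #-}
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.IO

-- Fold a step function over strict 32K chunks of a file. hClose runs as
-- soon as the fold ends (or an exception escapes), not at some later GC.
foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks step z path =
    bracket (openFile path ReadMode) hClose (go z)
  where
    go !acc h = do
        chunk <- B.hGetSome h 32768      -- read one strict chunk
        if B.null chunk
            then return acc              -- EOF: bracket closes the handle now
            else go (step acc chunk) h

For example, foldChunks (\n c -> n + B.length c) 0 "big.log" counts bytes in constant space and holds the file descriptor only for the duration of the traversal.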

The downside of iteratee-based IO is that it has a somewhat awkward programming model (roughly analogous to event-based programming, versus nice thread-based control). It is definitely an advanced technique, in any programming language. And for the vast majority of programming problems, lazy IO is entirely satisfactory. However, if you will be opening many files, or talking on many sockets, or otherwise using many simultaneous resources, an iteratee (or enumerator) approach might make sense.

Solution 2:

Dons has provided a very good answer, but he's left out what is (for me) one of the most compelling features of iteratees: they make it easier to reason about space management because old data must be explicitly retained. Consider:

average :: [Float] -> Float
average xs = sum xs / fromIntegral (length xs)

This is a well-known space leak, because the entire list xs must be retained in memory to calculate both sum and length. It's possible to make an efficient consumer by creating a fold:

average2 :: [Float] -> Float
average2 xs = uncurry (/) $ foldl (\(sumT, n) x -> (sumT + x, n + 1)) (0, 0) xs
-- N.B. this will build up thunks as written; use a strict pair and foldl'
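For completeness, here is one way to apply that N.B. (a sketch; foldl' with bang patterns is one of several ways to keep the accumulator strict):

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average2' :: [Float] -> Float
average2' = uncurry (/) . foldl' step (0, 0)
  where
    -- the bangs force both components at each step, so no chain of
    -- (+) thunks builds up inside the pair
    step (!sumT, !n) x = (sumT + x, n + 1)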

But it's somewhat inconvenient to have to write such a fold for every stream processor. There are some generalizations (Conal Elliott's Beautiful Fold Zipping), but they don't seem to have caught on. However, iteratees can get you a similar level of expressivity.

-- assuming: import qualified Data.Iteratee as I  (from the iteratee package)
aveIter = uncurry (/) <$> I.zip I.sum I.length

This isn't as efficient as a single fold, because the input is still traversed multiple times; however, it's consumed in chunks, so old data can be garbage collected efficiently. To break that property, you have to explicitly retain the entire input, for example with stream2list:

badAveIter = (\xs -> sum xs / fromIntegral (length xs)) <$> I.stream2list

The state of iteratees as a programming model is a work in progress; still, it's much better than it was even a year ago. We're learning which combinators are useful (e.g. zip, breakE, enumWith) and which are less so, with the result that the built-in iteratees and combinators keep gaining expressivity.

That said, Dons is correct that they're an advanced technique; I certainly wouldn't use them for every I/O problem.

Solution 3:

I use lazy I/O in production code all the time. It's only a problem in certain circumstances, as Don mentioned. But for just reading a few files it works fine.
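For instance, something like this (a hypothetical but typical case) is harmless, because the file is small and fully consumed immediately, so the demand pattern is trivial:

main :: IO ()
main = do
    contents <- readFile "notes.txt"  -- lazily read...
    putStr contents                   -- ...but fully forced right away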

Solution 4:

Update: Recently on haskell-cafe, Oleg Kiselyov showed that unsafeInterleaveST (which is used for implementing lazy IO within the ST monad) is very unsafe: it breaks equational reasoning. He shows that it allows one to construct bad_ctx :: ((Bool,Bool) -> Bool) -> Bool such that

> bad_ctx (\(x,y) -> x == y)
True
> bad_ctx (\(x,y) -> y == x)
False

even though == is commutative.
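The construction (adapted from Oleg's haskell-cafe post; exact module locations vary a little across base versions) looks roughly like this: an interleaved write races against a lazy read, so whichever component of the pair is forced first decides what the other sees.

import Control.Monad.ST.Lazy (runST)
import Control.Monad.ST.Lazy.Unsafe (unsafeInterleaveST)
import Data.STRef.Lazy

bad_ctx :: ((Bool, Bool) -> Bool) -> Bool
bad_ctx body = body (runST (do
    r <- newSTRef False
    -- deferred: the write runs only when x is forced
    x <- unsafeInterleaveST (writeSTRef r True >> return True)
    -- lazy ST defers this read, too, until y is forced
    y <- readSTRef r
    return (x, y)))

Since (==) on Bool evaluates its left argument first, \(x,y) -> x == y forces the deferred write before the read and yields (True, True), while \(x,y) -> y == x performs the read first and sees (True, False).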


Another problem with lazy IO: The actual IO operation can be deferred until it's too late, for example after the file is closed. Quoting from Haskell Wiki - Problems with lazy IO:

For example, a common beginner mistake is to close a file before one has finished reading it:

import System.IO

wrong :: IO ()
wrong = do
    fileData <- withFile "test.txt" ReadMode hGetContents
    putStr fileData

The problem is that withFile closes the handle before fileData is forced. The correct way is to pass all the code that consumes the data to withFile:

right :: IO ()
right = withFile "test.txt" ReadMode $ \handle -> do
    fileData <- hGetContents handle
    putStr fileData

Here, the data is consumed before withFile finishes.

This is often unexpected, and it's an easy mistake to make.


See also: Three examples of problems with Lazy I/O.