MATLAB's Garbage Collector?
What is your mental model of it? How is it implemented? Which strengths and weaknesses does it have? MATLAB GC vs. Python GC?
I sometimes see strange performance bottlenecks when using MATLAB nested functions in otherwise innocuous-looking code, and I am sure the GC is the cause. The garbage collector is an important part of the VM, yet MathWorks does not document it publicly.
My question is about MATLAB's own heap and GC! Not about handling of Java/COM objects / preventing "out of memory" errors / allocation of stack variables.
EDIT: the first response is actually the meta-answer "Why should I care?". I do care because the GC manifests itself when implementing a linked list or the MVC pattern.
This is the list of facts I collected. Instead of GC the term memory (de)allocation seems to be more appropriate in this context.
My principal information source is the blog of Loren (especially its comments) and this article from MATLAB Digest.
Because it is oriented toward numeric computing with potentially large data sets, MATLAB does a really good job of optimizing the performance of stack objects, for example by using in-place operations on data and call-by-reference for function arguments. For the same reason, its memory model is fundamentally different from that of OO languages such as Java.
MATLAB officially had no user-defined heap memory until version 7 (in version 6 there was undocumented reference functionality in schema.m files). MATLAB 7 has a heap in the form of both nested functions (closures) and handle objects; their implementations share the same underpinnings. As a side note, OO could be emulated with closures in MATLAB (interesting for pre-2008a).
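To make the side note about emulating OO with closures concrete, here is a minimal sketch. The name make_account and its fields are made up for illustration; the only assumption is that nested functions share the variables of the function that encloses them.

```matlab
% File: make_account.m (hypothetical example, not from the original posts)
function obj = make_account(balance)
    % 'balance' lives in this function's workspace and stays alive
    % for as long as one of the handles below exists.
    obj.deposit = @deposit;
    obj.get     = @get_balance;

    function deposit(amount)
        balance = balance + amount;   % mutates the shared variable
    end

    function b = get_balance()
        b = balance;
    end
end
```

Usage would look like acct = make_account(100); acct.deposit(50); acct.get() which should then return 150. The state is heap-like: it outlives the call to make_account and is released only when the last handle goes away.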
Surprisingly, it is possible to examine the entire workspace of the enclosing function captured by a function handle (closure); see the function functions(fhandle) in MATLAB Help. This means that the enclosing workspace is frozen in memory. This is why cellfun/arrayfun are sometimes very slow when used inside nested functions.
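A small sketch of that inspection, assuming only the documented functions call. The file and variable names are made up, and the exact shape of the workspace field (struct or cell containing a struct) may differ between releases, so the code checks for both.

```matlab
% File: show_closure_workspace.m (hypothetical name)
function names = show_closure_workspace()
    payload = zeros(1, 1000);   % variable of the enclosing function
    scale = 3;
    fh = @scaled;               % handle to the nested function below

    info = functions(fh);       % introspection data for the handle
    ws = info.workspace;
    if iscell(ws)               % some releases wrap the workspace in a cell
        ws = ws{1};
    end
    names = fieldnames(ws);     % names of the variables the handle holds on to
    disp(names);

    function y = scaled(x)
        y = scale * x;
    end
end
```

Whatever shows up in names is kept alive by fh, which is the "frozen workspace" effect described above.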
There are also interesting posts by Loren and Brad Phelan on object cleanup.
The most interesting fact about heap deallocation in MATLAB is, in my opinion, that MATLAB tries to do it each time the stack is being deallocated, i.e. on leaving every function. This has advantages but is also a huge CPU penalty if heap deallocation is slow. And it is actually very slow in MATLAB in some scenarios!
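The cleanup-on-function-exit behaviour can be observed directly with onCleanup (available since R2008a). This is only a sketch to show the timing of the cleanup, with a made-up function name; it says nothing about how the deallocation is implemented internally.

```matlab
% File: demo_cleanup.m (hypothetical name)
function demo_cleanup()
    % The onCleanup object runs its callback when it is destroyed,
    % which happens when this function's workspace is torn down.
    c = onCleanup(@() disp('cleanup ran'));
    disp('leaving function');
end
```

Calling demo_cleanup prints 'leaving function' first and 'cleanup ran' second, so the cleanup happens right at function exit rather than at some later collection point.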
The performance problems of MATLAB memory deallocation can hit code pretty badly. I always notice that I have unintentionally introduced a cyclic reference in my code when it suddenly runs 20x slower and sometimes needs several seconds between leaving a function and returning to its caller (time spent on cleanup). It is a known problem; see Dave Foti and this older forum post, whose code was used to make this picture visualizing performance (the tests were run on different machines, so comparing absolute timings of different MATLAB versions is meaningless):
A linear increase in pool size for reference objects means a polynomial (or exponential) decrease in MATLAB performance! For value objects the performance is, as expected, linear.
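To check the scaling on your own machine and release, something like the following could be used. It is not the code from the forum post mentioned above, just a rough sketch with made-up names that builds pools of closures of increasing size and times how long it takes to release each pool.

```matlab
% File: bench_closure_cleanup.m (hypothetical benchmark)
function times = bench_closure_cleanup()
    sizes = [1000 2000 4000];           % pool sizes to compare
    times = zeros(size(sizes));
    for k = 1:numel(sizes)
        pool = cell(1, sizes(k));
        for i = 1:sizes(k)
            pool{i} = make_counter();   % each handle keeps its own workspace alive
        end
        tic;
        pool = [];                      %#ok<NASGU> drop the only references
        times(k) = toc;
        fprintf('%5d closures: release took %.4f s\n', sizes(k), times(k));
    end
end

function fh = make_counter()
    count = 0;
    fh = @increment;                    % closure over 'count'
    function c = increment()
        count = count + 1;
        c = count;
    end
end
```

If the release time grows faster than the pool size, you are seeing the effect described above. The absolute numbers depend heavily on the MATLAB version and hardware.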
Considering these facts, I can only speculate that MATLAB uses a not yet very efficient form of reference counting for heap deallocation.
EDIT: I had always encountered performance problems with many small nested functions, but recently I noticed that, at least with 2006a, the cleanup of a single nested scope with a few megabytes of data is also terrible: it takes 1.5 seconds just to set a nested-scope variable to empty!
EDIT 2: finally I got the answer - by Dave Foti himself. He acknowledges the flaws but says that MATLAB is going to retain its present deterministic cleanup approach.
Legend: Shorter execution time is better
MATLAB makes the workspace very clear in the Workspace browser or with the "whos" command. This shows you all the objects created by your commands and how much memory they take up.
feature('memstats')
will show you the largest contiguous block of memory available to MATLAB, which means that is the largest matrix you can create. Using the "clear" command will synchronously remove those objects from memory and free up the space to be used again.
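A short sketch of the whos/clear workflow described above; the variable name A is arbitrary.

```matlab
A = zeros(1000);                  % 1000-by-1000 doubles, 8 bytes each
s = whos('A');                    % same information the Workspace browser shows
fprintf('%s uses %d bytes\n', s.name, s.bytes);

clear A                           % frees the variable immediately
stillThere = exist('A', 'var');   % 0 once A has been cleared
```

Here s.bytes is 8000000, and stillThere is 0 after the clear.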
The JVM handles the garbage collection only of Java items. So if you open a file in the editor and close it, Java takes care of removing the window, text, etc. from memory. If you create a Java object in the MATLAB workspace, it first has to be cleared, and then it can be cleaned up by the JVM.
There's lots of information about managing program memory in our technote: http://www.mathworks.com/support/tech-notes/1100/1106.html
And I recently wrote about handling Java memory on the MATLAB Desktop blog: http://blogs.mathworks.com/desktop/2009/08/17/calling-java-from-matlab-memory-issues/
If you're academically interested in what happens to memory allocated when a function exits or when you resize a variable... I'm pretty sure that's a trade secret, and it changes every release. You should never notice it, and if you run into performance problems that you suspect are related to object management, please file a help ticket with technical support: http://www.mathworks.com/support
It seems like you're trying to construct some sort of Python vs MATLAB argument. I'm not that interested in that argument.
A meta-answer to your meta-question.
It's actually fairly important that you don't care. When I say that, I don't mean to limit it to MATLAB memory management. This extends to Python, Java, .NET and any other language that does dynamic memory allocation and is still under active development.
The more you know about the current mechanism of memory management, the more likely you are to code defensively against that specific implementation, and the more likely it becomes that you won't benefit from future performance improvements. A number of good examples of this can be found in Java's GC, capably written up by Brian Goetz over at developerworks.com:
http://www.ibm.com/developerworks/library/j-jtp01274.html
You can say it's important to know. I counter that it's all about the requirements. The more appropriate question is, do the languages I am considering for my project meet my needs in terms of performance, development effort, maintainability, portability, expertise of my developers, etc, etc?
I've never seen a project with a requirement for using a generational GC over mark-and-sweep over reference counting. I don't expect to see one soon.