Saturday, December 30, 2006

Memory Issues using Java ThreadPoolExecutors

Java ThreadPoolExecutors are a very convenient way to make your applications concurrent. We recently began a drive to refactor most of our code to make use of these thread pools. One particular process read XML files from the filesystem, applied some transformations to the XML, and wrote the transformed XML into the DB. When we started testing the initial code, we ran into serious OutOfMemoryErrors with even a few hundred XML files.

This was a serious drawback. I looked at our code and found that we had set the pool size to 15 and used a blocking input queue of 500, which meant that at most 515 XML files should be in memory at any given point in time. This was puzzling, since that should not come close to exhausting a 1.5 GB heap.

Roughly, our process looked like this:

XML File --> Callable --> Thread Pool (insert into Db) --> Return boolean success

The pool took a Callable that held a reference to the XML and wrote it into the DB.

On analysing the code further, the only suspicious thing was an innocuous-looking ArrayList. This List held all the Future objects so that we could iterate through it and wait for all the inputs to be processed before terminating the process. Why would a List of Future objects cause issues?
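The pattern described so far can be sketched roughly as follows. This is a minimal reconstruction, not our production code: the names FutureListSketch, InsertTask and processAll are made up for illustration, a String stands in for the DOM object, and the CallerRunsPolicy is an assumption about how the bounded queue applied back pressure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class FutureListSketch {

    // Hypothetical stand-in for the real task: it keeps the whole document as a field.
    static class InsertTask implements Callable<Boolean> {
        private final String xml; // reachable for as long as the task itself is reachable

        InsertTask(String xml) {
            this.xml = xml;
        }

        public Boolean call() {
            // Placeholder for "transform the XML and insert it into the DB".
            return xml.length() > 0;
        }
    }

    static int processAll(List<String> documents) {
        // 15 worker threads and a bounded queue of 500, as in the post.
        // CallerRunsPolicy is an assumption: when the queue is full, the
        // submitting thread runs the task itself instead of failing.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                15, 15, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(500),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // The innocuous-looking list: one Future per input file, kept until the end.
        List<Future<Boolean>> futures = new ArrayList<Future<Boolean>>();
        try {
            for (String xml : documents) {
                futures.add(pool.submit(new InsertTask(xml)));
            }
            int succeeded = 0;
            for (Future<Boolean> future : futures) {
                if (future.get()) {
                    succeeded++;
                }
            }
            return succeeded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the list grows with the number of input files, not with the pool or queue size, so the 515 limit says nothing about how many Futures are alive at once.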

To identify the root cause, I looked into the JDK ThreadPoolExecutor implementation and found the following:

1) When we create a Callable task and submit it to the ThreadPoolExecutor using any of the submit methods, a FutureTask object wrapping the Callable is created. This object is the return value of the submit method.

2) FutureTask is a concrete class implementing both Future and Runnable. This object is the one that is actually handed to the ThreadPoolExecutor. The ThreadPoolExecutor never takes a Callable task directly.

3) When the ThreadPoolExecutor is ready to execute a new task, it picks up the task (a FutureTask or a plain Runnable) and calls the run method on it.

4) The FutureTask object stores the Callable as an instance variable. The run method calls the call method, and the returned result is stored in another instance variable on the FutureTask. At this point, the Callable and the value it returned are both instance variables of the Future object.

5) In the JDK implementation I looked at, the Callable and its result are never set to null in the Future.
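To make steps 1 to 5 concrete, here is a stripped-down model of that behaviour. It is not the JDK source: SimplifiedFutureTask and stillHoldsCallable are hypothetical names, and cancellation, timeouts and the real synchronisation are left out. It only illustrates why a completed task keeps its Callable reachable.

```java
import java.util.concurrent.Callable;

// NOT the JDK class: a simplified model of the behaviour described in steps 1-5.
class SimplifiedFutureTask<V> implements Runnable {
    private final Callable<V> callable; // step 4: the Callable is an instance variable
    private V result;                   // step 4: so is the value it returns
    private Exception failure;
    private boolean done;

    SimplifiedFutureTask(Callable<V> callable) {
        this.callable = callable;
    }

    // Step 3: the pool only ever sees a Runnable and calls run() on it.
    public void run() {
        V value = null;
        Exception error = null;
        try {
            value = callable.call();
        } catch (Exception e) {
            error = e;
        }
        synchronized (this) {
            result = value;
            failure = error;
            done = true;
            notifyAll();
        }
        // Step 5: nothing here clears the callable field, so everything the
        // Callable references stays reachable through this object.
    }

    // Blocks until run() has finished, then returns the stored result.
    public synchronized V get() {
        try {
            while (!done) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (failure != null) {
            throw new IllegalStateException(failure);
        }
        return result;
    }

    // Hypothetical helper for illustration: is the Callable still referenced?
    boolean stillHoldsCallable() {
        return callable != null;
    }
}
```

In this model, anyone holding the task after completion also holds the Callable and, through it, whatever state the Callable kept in its fields.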

So, since all the Callable objects stayed in memory and each Callable held a reference to a DOM object, we ran out of memory. We came up with a set of rules for using Callables to make life simpler.

Rules when using Callables and Futures:

A) Never keep a large amount of state in a Callable. Its fields cannot be garbage collected as long as a reference to the Future is held.
B) If you do need a lot of state in a Callable, make sure you clear it at the end of the call method.
C) Never hang on to a Future indiscriminately. Doing so prevents the Callable and its return value from being garbage collected.
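One way to apply rules B and C is sketched below. The names LeanCallableSketch, InsertTask and holdsDocument are hypothetical, a String again stands in for the DOM object, and counting successes with an AtomicInteger is just one possible replacement for the list of Futures. The one-hour timeout and the CallerRunsPolicy are arbitrary assumptions.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class LeanCallableSketch {

    static class InsertTask implements Callable<Boolean> {
        private String xml; // deliberately not final, so it can be released
        private final AtomicInteger successes;

        InsertTask(String xml, AtomicInteger successes) {
            this.xml = xml;
            this.successes = successes;
        }

        public Boolean call() {
            try {
                // Placeholder for "transform the XML and insert it into the DB".
                boolean ok = xml.length() > 0;
                if (ok) {
                    successes.incrementAndGet();
                }
                return ok;
            } finally {
                xml = null; // rule B: drop the heavy state once call() is finished
            }
        }

        boolean holdsDocument() {
            return xml != null;
        }
    }

    static int processAll(List<String> documents) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                15, 15, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(500),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger successes = new AtomicInteger();
        for (String xml : documents) {
            // Rule C: the returned Future is not stored anywhere.
            pool.submit(new InsertTask(xml, successes));
        }
        // Wait for completion through the pool itself instead of through Futures.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("timed out waiting for tasks");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return successes.get();
    }
}
```

The trade-off is that discarding the Future also discards any exception thrown by call, so a real version would need to log or count failures inside the task.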

I don't understand why FutureTask needs to hold on to the Callable forever. Why can't the executing thread, on completion, store the result in the Future and null out the reference to the Callable? I don't have an answer yet, but this sounds logical to me. Can someone please educate me?
