Java Fork/Join Framework
Split work recursively and parallelize it with the Fork/Join framework — ForkJoinPool, RecursiveTask, work-stealing.
Java Fork/Join Framework
A regular thread pool is great for "many independent tasks." It's not great for "one big task that can be split recursively into smaller versions of itself." For that second shape — divide-and-conquer work — Java has a specialised executor: the ForkJoinPool. It's the pool behind parallelStream, CompletableFuture.supplyAsync (when no executor is given), and any code you write with RecursiveTask/RecursiveAction.
The trick that makes ForkJoinPool special is work-stealing: every worker has its own deque, and when its own deque is empty it steals a task from the bottom of another worker's deque. The result is automatic load-balancing — fast workers help slow workers without any coordination.
When to reach for it
Fork/join is the right tool for:
- Recursive divide-and-conquer. Quicksort, mergesort, tree traversal, recursive numeric algorithms, matrix multiply by halving.
- CPU-bound work that's parallelisable into roughly equal pieces.
- Work whose granularity adapts: split if the chunk is big, do directly if it's small.
It's the wrong tool for:
- I/O-bound work. A blocked worker doesn't steal — and the pool's default size is your CPU count. Block one worker and you've lost a core.
- Independent unrelated tasks. A regular
ThreadPoolExecutoris simpler and just as fast for that shape. - Tasks that depend on a fixed external schedule. Use
ScheduledExecutorService.
The Java 7 streams team's mental model: "if you'd write a parallelStream to do this, fork/join is the right shape directly."
The three classes
ForkJoinPool pool; // the executor
RecursiveTask<V>; // an abstract task returning V
RecursiveAction; // an abstract task returning nothingYou extend RecursiveTask or RecursiveAction, override compute(), decide inside compute() whether to split or to do the work directly, and call fork()/join() on the sub-tasks.
class Sum extends RecursiveTask<Long> {
private static final int THRESHOLD = 1000;
private final long[] data;
private final int lo, hi;
Sum(long[] data, int lo, int hi) {
this.data = data; this.lo = lo; this.hi = hi;
}
@Override
protected Long compute() {
int len = hi - lo;
if (len <= THRESHOLD) {
long s = 0;
for (int i = lo; i < hi; i++) s += data[i];
return s;
}
int mid = lo + len / 2;
Sum left = new Sum(data, lo, mid);
Sum right = new Sum(data, mid, hi);
left.fork(); // schedule left to run on another worker
long rightResult = right.compute(); // run right on this worker (avoid extra task)
long leftResult = left.join(); // wait for left
return leftResult + rightResult;
}
}
ForkJoinPool pool = new ForkJoinPool();
long total = pool.invoke(new Sum(data, 0, data.length));The shape — check threshold → split → fork one half → compute the other → join — is the canonical fork/join idiom. The "compute one half here instead of fork-ing both" trick avoids creating an unnecessary task and is a small but real win.
The threshold matters
The single most important decision: when to stop splitting. Too small a threshold and you create thousands of tasks for trivial pieces — the overhead dominates the work. Too large and you can't fully use the cores — many workers stand idle while one chews through a big chunk.
Rules of thumb:
- The task body should take at least 10 microseconds. Below that, task-management overhead is comparable to the work.
- Make the threshold a constant you can tune.
100,1000,10000are typical for primitive arrays; the right number depends on the per-element cost. - For very small inputs, fall back to a purely serial implementation. The split-and-fork overhead is wasted on inputs that fit in cache.
fork(), join(), invoke()
The three operations on a RecursiveTask:
| Method | Behaviour |
|---|---|
fork() | Schedule the task on the current pool; return immediately |
join() | Wait for the task and return its result (or throw what it threw) |
invoke() | Combination of fork + join for this thread — synchronous |
compute() | Run the body directly on the calling thread (no fork) |
In the pattern above, left.fork(); right.compute(); left.join(); does the right thing — fork one half to another worker, run the other half here, then wait for the fork to finish.
You should not write left.fork(); right.fork(); left.join(); right.join();. The right side is forked and the current worker waits — there's no thread of execution available to actually run right until the worker reaches join. The combination wastes the current worker's time slot.
The common pool
ForkJoinPool.commonPool() is a JVM-wide shared pool, sized to Runtime.getRuntime().availableProcessors() - 1 by default. It powers:
Stream.parallelStream()CompletableFuture.supplyAsync(supplier)(the no-executor overload)Arrays.parallelSort()
You can configure the common pool's size via the system property java.util.concurrent.ForkJoinPool.common.parallelism at JVM start. You should not use the common pool for I/O — a single blocking call ties up a worker that the whole JVM shares.
Work-stealing in pictures
worker-1 deque: [t1 t2 t3 t4] (it forked these; t4 just got pushed)
worker-2 deque: [] (empty — workers steal)
worker-3 deque: [t10 t11] (still has its own)
worker-2 finds its deque empty; steals t1 from the BOTTOM of worker-1's deque
worker-1 keeps pulling its own tasks from the TOPThe double-ended queue is the heart of the design: workers push and pop from one end (LIFO — locality of reference for cache hits), thieves take from the other end (FIFO — minimal contention with the owner). It's why fork/join performs so well: workers rarely contend on each other's data structures even under heavy load.
A worked example: parallel sum vs. serial
The program below sums a 10-million-element array two ways — serial loop and fork/join recursion — and prints the wall-clock for each.
What to take from the run:
- The fork/join version was several times faster than the serial loop. On an
N-core machine the upper bound is roughlyN×— the actual number was lower because the JVM, the GC, and other JVM threads also wanted CPU, and the threshold work isn't perfectly balanced. Still, a substantial speedup for a few lines of recursive code. - Both sums were equal. That's the partition-and-merge correctness check: each leaf summed its non-overlapping slice; the combine step (
l + r) added them; no double-counting or data races because each leaf wrote to its own local variable. - The
SumTinyvariant with threshold10was slower than the serial loop. With 10M elements split down to chunks of 10, you create about 2M tasks — and the task-management overhead dwarfs the actual addition work. The threshold is a real knob; bench it on representative input. - The
left.fork(); long r = right.compute(); long l = left.join();pattern used one fewer task thanfork(); fork(); join(); join();would. The current worker has free time during thecompute()— using it for one of the halves saves an entire task allocation. That's the small but cumulative win on many real workloads. ForkJoinPool.commonPool()was the executor used in this demo. For a one-off run, the common pool is fine. For a long-running program that mixes fork/join work with streamparallelcalls and async futures, give the heavy fork/join workload its own pool — the common pool is meant for small bursts, not steady-state heavy compute.
What's next
The next chapter, Java Concurrent Collections, covers the data structures designed to be hammered by multiple threads — ConcurrentHashMap, CopyOnWriteArrayList, BlockingQueue, and the rest of java.util.concurrent.
Practice
Inside `compute()` of a `RecursiveTask`, you have two sub-tasks `left` and `right`. Which call pattern is the canonical fork/join idiom?