Java Fork/Join Framework | W3Docs Learn Java

A regular thread pool is great for "many independent tasks." It's not great for "one big task that can be split recursively into smaller versions of itself." For that second shape — divide-and-conquer work — Java has a specialised executor: the ForkJoinPool. It's the pool behind parallelStream, CompletableFuture.supplyAsync (when no executor is given), and any code you write with RecursiveTask/RecursiveAction.

The trick that makes ForkJoinPool special is work-stealing: every worker has its own deque, and when its own deque is empty it steals a task from the bottom of another worker's deque. The result is automatic load-balancing — fast workers help slow workers without any coordination.

When to reach for it

Fork/join is the right tool for:

Recursive divide-and-conquer. Quicksort, mergesort, tree traversal, recursive numeric algorithms, matrix multiply by halving.
CPU-bound work that's parallelisable into roughly equal pieces.
Work whose granularity adapts: split if the chunk is big, do directly if it's small.

It's the wrong tool for:

I/O-bound work. A blocked worker doesn't steal — and the pool's default size is your CPU count. Block one worker and you've lost a core.
Independent unrelated tasks. A regular ThreadPoolExecutor is simpler and just as fast for that shape.
Tasks that depend on a fixed external schedule. Use ScheduledExecutorService.

A handy mental model: if you'd reach for a parallelStream to do this, fork/join is the same shape expressed directly. (Fork/join arrived in Java 7; parallelStream in Java 8 was built on top of it.)

The three classes

ForkJoinPool pool;                                    // the executor
RecursiveTask<V>;                                     // an abstract task returning V
RecursiveAction;                                      // an abstract task returning nothing

You extend RecursiveTask or RecursiveAction, override compute(), decide inside compute() whether to split or to do the work directly, and call fork()/join() on the sub-tasks.

class Sum extends RecursiveTask<Long> {
  private static final int THRESHOLD = 1000;
  private final long[] data;
  private final int lo, hi;

  Sum(long[] data, int lo, int hi) {
    this.data = data; this.lo = lo; this.hi = hi;
  }

  @Override
  protected Long compute() {
    int len = hi - lo;
    if (len <= THRESHOLD) {
      long s = 0;
      for (int i = lo; i < hi; i++) s += data[i];
      return s;
    }
    int mid = lo + len / 2;
    Sum left  = new Sum(data, lo, mid);
    Sum right = new Sum(data, mid, hi);
    left.fork();                                      // schedule left to run on another worker
    long rightResult = right.compute();               // run right on this worker (avoid extra task)
    long leftResult  = left.join();                   // wait for left
    return leftResult + rightResult;
  }
}

ForkJoinPool pool = new ForkJoinPool();
long total = pool.invoke(new Sum(data, 0, data.length));

The shape — check threshold → split → fork one half → compute the other → join — is the canonical fork/join idiom. The "compute one half here instead of fork-ing both" trick avoids creating an unnecessary task and is a small but real win.

The threshold matters

The single most important decision: when to stop splitting. Too small a threshold and you create thousands of tasks for trivial pieces — the overhead dominates the work. Too large and you can't fully use the cores — many workers stand idle while one chews through a big chunk.

Rules of thumb:

The task body should take at least 10 microseconds. Below that, task-management overhead is comparable to the work.
Make the threshold a constant you can tune. 100, 1000, 10000 are typical for primitive arrays; the right number depends on the per-element cost.
For very small inputs, fall back to a purely serial implementation. The split-and-fork overhead is wasted on inputs that fit in cache.

`fork()`, `join()`, `invoke()`

The three operations on a RecursiveTask:

Method	Behaviour
`fork()`	Schedule the task on the current pool; return immediately
`join()`	Wait for the task and return its result (or throw what it threw)
`invoke()`	Combination of `fork` + `join` for this thread — synchronous
`compute()`	Run the body directly on the calling thread (no fork)

In the pattern above, left.fork(); right.compute(); left.join(); does the right thing — fork one half to another worker, run the other half here, then wait for the fork to finish.

You should not write left.fork(); right.fork(); left.join(); right.join();. The right side is forked and the current worker waits — there's no thread of execution available to actually run right until the worker reaches join. The combination wastes the current worker's time slot.

The common pool

ForkJoinPool.commonPool() is a JVM-wide shared pool, sized to Runtime.getRuntime().availableProcessors() - 1 by default. It powers:

Stream.parallelStream()
CompletableFuture.supplyAsync(supplier) (the no-executor overload)
Arrays.parallelSort()

You can configure the common pool's size via the system property java.util.concurrent.ForkJoinPool.common.parallelism at JVM start. You should not use the common pool for I/O — a single blocking call ties up a worker that the whole JVM shares.

Work-stealing in pictures

worker-1 deque:  [t1 t2 t3 t4]            (it forked these; t4 just got pushed)
worker-2 deque:  []                       (empty — workers steal)
worker-3 deque:  [t10 t11]                (still has its own)

worker-2 finds its deque empty; steals t1 from the BOTTOM of worker-1's deque
worker-1 keeps pulling its own tasks from the TOP

The double-ended queue is the heart of the design: workers push and pop from one end (LIFO — locality of reference for cache hits), thieves take from the other end (FIFO — minimal contention with the owner). It's why fork/join performs so well: workers rarely contend on each other's data structures even under heavy load.

A worked example: parallel sum vs. serial

The program below sums a 10-million-element array two ways — serial loop and fork/join recursion — and prints the wall-clock for each.

java— editable, runs on the server

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinDemo {
  public static void main(String[] args) {
    int n = 10_000_000;
    long[] data = new long[n];
    for (int i = 0; i < n; i++) data[i] = (i * 2654435761L) & 0xFFFF;     // some non-trivial values

int cores = Runtime.getRuntime().availableProcessors();
    System.out.println("cores: " + cores);
    System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());

// 1. Serial baseline
    long t1 = System.nanoTime();
    long serialSum = 0;
    for (int i = 0; i < n; i++) serialSum += data[i];
    long serialMs = (System.nanoTime() - t1) / 1_000_000;
    System.out.printf("%nserial:    sum=%d   %d ms%n", serialSum, serialMs);

// 2. Fork/Join
    long t2 = System.nanoTime();
    long pjSum = ForkJoinPool.commonPool().invoke(new Sum(data, 0, n));
    long pjMs = (System.nanoTime() - t2) / 1_000_000;
    System.out.printf("fork/join: sum=%d   %d ms   (about %.1fx faster)%n",
        pjSum, pjMs, Math.max(1.0, (double) serialMs / Math.max(1, pjMs)));

System.out.println("\nsums equal? " + (serialSum == pjSum));

// 3. Threshold matters: too-small THRESHOLD makes it slower
    long t3 = System.nanoTime();
    long tinySum = ForkJoinPool.commonPool().invoke(new SumTiny(data, 0, n));
    long tinyMs = (System.nanoTime() - t3) / 1_000_000;
    System.out.printf("fork/join (threshold=10): sum=%d   %d ms  -- excessive task overhead%n",
        tinySum, tinyMs);
  }

// Sensible threshold
  static class Sum extends RecursiveTask<Long> {
    static final int THRESHOLD = 100_000;
    private final long[] data;
    private final int lo, hi;
    Sum(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }
    @Override protected Long compute() {
      int len = hi - lo;
      if (len <= THRESHOLD) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += data[i];
        return s;
      }
      int mid = lo + len / 2;
      Sum left  = new Sum(data, lo, mid);
      Sum right = new Sum(data, mid, hi);
      left.fork();
      long r = right.compute();
      long l = left.join();
      return l + r;
    }
  }

// Threshold WAY too small — millions of micro-tasks; overhead dominates
  static class SumTiny extends RecursiveTask<Long> {
    static final int THRESHOLD = 10;
    private final long[] data;
    private final int lo, hi;
    SumTiny(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }
    @Override protected Long compute() {
      int len = hi - lo;
      if (len <= THRESHOLD) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += data[i];
        return s;
      }
      int mid = lo + len / 2;
      SumTiny left  = new SumTiny(data, lo, mid);
      SumTiny right = new SumTiny(data, mid, hi);
      left.fork();
      long r = right.compute();
      long l = left.join();
      return l + r;
    }
  }
}

What to take from the run:

The fork/join version was several times faster than the serial loop. On an N-core machine the upper bound is roughly N× — the actual number was lower because the JVM, the GC, and other JVM threads also wanted CPU, and the threshold work isn't perfectly balanced. Still, a substantial speedup for a few lines of recursive code.
Both sums were equal. That's the partition-and-merge correctness check: each leaf summed its non-overlapping slice; the combine step (l + r) added them; no double-counting or data races because each leaf wrote to its own local variable.
The SumTiny variant with threshold 10 was slower than the serial loop. With 10M elements split down to chunks of 10, you create about 2M tasks — and the task-management overhead dwarfs the actual addition work. The threshold is a real knob; bench it on representative input.
The left.fork(); long r = right.compute(); long l = left.join(); pattern used one fewer task than fork(); fork(); join(); join(); would. The current worker has free time during the compute() — using it for one of the halves saves an entire task allocation. That's the small but cumulative win on many real workloads.
ForkJoinPool.commonPool() was the executor used in this demo. For a one-off run, the common pool is fine. For a long-running program that mixes fork/join work with stream parallel calls and async futures, give the heavy fork/join workload its own pool — the common pool is meant for small bursts, not steady-state heavy compute.

What's next

The next chapter, Java Concurrent Collections, covers the data structures designed to be hammered by multiple threads — ConcurrentHashMap, CopyOnWriteArrayList, BlockingQueue, and the rest of java.util.concurrent.

Practice

Inside `compute()` of a `RecursiveTask`, you have two sub-tasks `left` and `right`. Which call pattern is the canonical fork/join idiom?

`left.fork(); long r = right.compute(); long l = left.join();` — fork one half, compute the other on the current worker, join the fork`left.fork(); right.fork(); long l = left.join(); long r = right.join();` — fork both and join both`left.invoke(); right.invoke();` — invokes run synchronously in parallel automatically`left.start(); right.start(); left.join(); right.join();` — same as plain `Thread` API