Java Parallel Streams
Process Java streams in parallel for speed — when parallelStream helps, and when it makes things worse.
A parallel stream is the same pipeline you've been writing, but the JVM is allowed to split the source into chunks and process them on multiple threads. The change at the call site is tiny:
long total = nums.parallelStream().mapToLong(n -> heavy(n)).sum();
// or, equivalently:
long total = nums.stream().parallel().mapToLong(n -> heavy(n)).sum();
The pipeline shape, the operations, the result: all unchanged. What changes is who runs it. Instead of one thread walking the source, several workers from the common ForkJoinPool (one per CPU core, minus one, with the calling thread pitching in as well) divide the work, and their partial results are merged. When the work per element is heavy enough and the source splits cleanly, the pipeline finishes in roughly the sequential wall-clock time divided by the number of cores. When those conditions don't hold, parallel is slower than sequential, and sometimes incorrect. This chapter is about telling the difference.
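To see "who runs it" for yourself, the following minimal sketch records the names of the threads that touch the elements. The class name WhoRunsIt and the helper threadNames are made up for illustration; the exact thread names and how many of them appear depend on your JDK and machine.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class WhoRunsIt {
    // Names of the threads that processed the elements of a stream over 0..n-1.
    static Set<String> threadNames(int n, boolean parallel) {
        IntStream source = IntStream.range(0, n);
        if (parallel) {
            source = source.parallel();
        }
        return source
                .mapToObj(i -> Thread.currentThread().getName())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: "
                + ForkJoinPool.getCommonPoolParallelism());
        System.out.println("sequential ran on: " + threadNames(100_000, false));
        System.out.println("parallel ran on:   " + threadNames(100_000, true));
    }
}
```

The sequential run reports a single thread (the caller). The parallel run usually reports the caller plus some common-pool workers, though on a small or busy machine it may report fewer threads than you expect.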
What "parallel" actually does
A sequential stream pulls one element through the pipeline, then the next. A parallel stream:
- Splits the source into sub-streams via the source's Spliterator. Arrays, ArrayList, IntStream.range, and similar sources split cleanly in O(1). LinkedList, Files.lines, Stream.iterate, and Stream.generate either split badly or refuse to split.
- Runs each sub-stream's intermediate chain on a worker thread from the common pool.
- Merges the partial results; for reduce and collect, this is what the combiner is for.
forEach in a parallel stream calls your Consumer from multiple threads concurrently and in unspecified order. forEachOrdered preserves encounter order at the cost of synchronisation. findFirst in parallel is more expensive than findAny for the same reason — it has to coordinate to identify the first match.
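The splitting step can be observed directly by asking a collection's Spliterator to split once. This is a small sketch, not a benchmark: SplitDemo and splitOnce are illustrative names, and the sizes a LinkedList reports are an implementation detail that may vary between JDKs.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SplitDemo {
    // Splits the list's spliterator once and reports {prefix size, remaining size}.
    // Reports a prefix size of 0 if the spliterator declines to split.
    static long[] splitOnce(List<Integer> list) {
        Spliterator<Integer> rest = list.spliterator();
        Spliterator<Integer> prefix = rest.trySplit(); // null means "cannot split"
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] { prefixSize, rest.estimateSize() };
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 1_000_000)
                .boxed()
                .collect(Collectors.toList());
        long[] array = splitOnce(new ArrayList<>(data));
        long[] linked = splitOnce(new LinkedList<>(data));
        System.out.println("ArrayList:  " + array[0] + " / " + array[1]);
        System.out.println("LinkedList: " + linked[0] + " / " + linked[1]);
    }
}
```

An ArrayList hands back two equal halves, which is what lets the framework keep every worker busy. A LinkedList has to walk its nodes to split, so it typically peels off a comparatively small batch and leaves most of the work in one piece.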
The contract — what your pipeline must satisfy
Parallel only gives a correct answer when the pipeline obeys three rules. Sequential code that happens to ignore them still works; parallel code that does so silently produces nonsense.
- The reducer must be associative: f(f(a, b), c) == f(a, f(b, c)). +, *, max, min, set-union, and list-concat all qualify. Subtraction and division do not: (a - b) - c is not a - (b - c). If you pass a non-associative BinaryOperator to reduce or Collectors.reducing, the answer depends on how the JVM happens to split.
- The pipeline must be stateless. Your lambdas must not read or write shared mutable state. A lambda that captures and mutates an outer ArrayList, increments an outer int[], or uses any non-atomic counter will race in parallel.
- The pipeline must be side-effect free. Logging is okay; persisting through a thread-safe sink is okay; everything else is a bug waiting for a worker to interleave it differently.
The collectors built into Collectors satisfy 1–3 by construction (when used as documented). Your own lambdas inside map, filter, reduce, and peek are the ones to watch.
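The associativity rule is easy to check empirically. The sketch below (AssociativityDemo and its helpers are illustrative names) reduces the numbers 1..100 with an associative operator and with subtraction, sequentially and in parallel.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class AssociativityDemo {
    static List<Integer> oneTo(int n) {
        return IntStream.rangeClosed(1, n).boxed().collect(Collectors.toList());
    }

    static Stream<Integer> streamOf(List<Integer> nums, boolean parallel) {
        return parallel ? nums.parallelStream() : nums.stream();
    }

    // Associative reducer: any grouping of the additions gives the same total.
    static int sum(List<Integer> nums, boolean parallel) {
        return streamOf(nums, parallel).reduce(0, Integer::sum);
    }

    // Non-associative reducer: the result depends on how the work is grouped.
    static int minus(List<Integer> nums, boolean parallel) {
        return streamOf(nums, parallel).reduce(0, (a, b) -> a - b);
    }

    public static void main(String[] args) {
        List<Integer> nums = oneTo(100);
        System.out.println("sum   sequential: " + sum(nums, false));
        System.out.println("sum   parallel:   " + sum(nums, true));
        System.out.println("minus sequential: " + minus(nums, false));
        System.out.println("minus parallel:   " + minus(nums, true));
    }
}
```

Both sum lines print 5050. The sequential minus is ((0 - 1) - 2) - ... = -5050. The parallel minus reduces each chunk starting from 0 and then subtracts the chunk results from one another, so its value depends on where the splits fall and is not something to rely on.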
When parallel helps (and when it doesn't)
A parallel stream wins only when the per-element work is large enough to dwarf the coordination cost — splitting, scheduling, merging, and the framework's bookkeeping. A rough mental model:
- Large source + CPU-bound per-element work + cheap merge + splittable source = parallel often wins. Image processing per pixel, parsing per record, hashing per file: classic cases.
- Tiny source = sequential wins. The pool wake-up is more expensive than the whole computation.
- Cheap per-element work = sequential wins. nums.stream().mapToInt(Integer::intValue).sum() is faster than its parallelStream() cousin until nums is in the millions; at small sizes the framework overhead dominates.
- Blocking I/O per element = parallel streams are the wrong tool. The common ForkJoinPool is sized for CPU work; a blocking I/O call ties up a worker and starves every other parallel stream in the JVM (including those from libraries). Use CompletableFuture with a bounded executor for I/O fan-out.
- Non-splittable source = parallel either falls back to sequential or splits badly. Files.lines, Stream.iterate, Stream.generate, and LinkedList.stream() are the canonical poor splitters; arrays, ArrayList, and IntStream.range are the canonical good ones.
The honest advice: default to sequential; switch to parallel only when you have a measured reason to, with jmh or wall-clock numbers in hand.
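For the blocking-I/O case, here is one possible shape of the CompletableFuture fan-out mentioned above. It is a sketch under assumptions: fetch is a hypothetical stand-in that merely sleeps, and the pool size passed to fetchAll is an arbitrary bound you would tune for the real resource.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class IoFanOut {
    // Hypothetical blocking call; stands in for an HTTP request or a database query.
    static String fetch(String key) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "value-for-" + key;
    }

    // Runs at most maxConcurrent fetches at a time on a dedicated pool,
    // leaving the common ForkJoinPool free for CPU-bound streams.
    static List<String> fetchAll(List<String> keys, int maxConcurrent) {
        ExecutorService io = Executors.newFixedThreadPool(maxConcurrent);
        try {
            // First start every task, then wait; joining inside the first
            // pipeline would serialise the calls.
            List<CompletableFuture<String>> pending = keys.stream()
                    .map(k -> CompletableFuture.supplyAsync(() -> fetch(k), io))
                    .collect(Collectors.toList());
            return pending.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(Arrays.asList("a", "b", "c", "d"), 2));
    }
}
```

The results come back in the order of the input keys because the futures are joined in that order, regardless of which call finishes first.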
Operations that get weird in parallel
A few operations whose meaning shifts when the pipeline goes parallel:
- forEach: runs from multiple threads, in unspecified order. If order matters, use forEachOrdered (which costs synchronisation).
- findFirst: has to coordinate across workers to identify the first match in encounter order. Use findAny if you don't care which match wins.
- limit / skip: well-defined on ordered streams, but more expensive in parallel because the JVM must respect encounter order. On a parallel stream where order doesn't matter, stream.parallel().unordered().limit(n) is cheaper.
- distinct / sorted: stateful operations that must coordinate across workers and buffer elements to do so.
- reduce: the 3-arg overload uses the combiner to merge worker outputs. With the 2-arg overload, the accumulator doubles as the combiner and the identity seeds every chunk, so the identity must be a true identity for the accumulator (0 for +, 1 for *). Same contract, same associativity rule.
- collect: Collectors are designed to be safe in parallel. The result container may be a regular HashMap or ArrayList; that is safe because each worker fills its own container and the collector's combiner merges them. Your downstream collectors must obey the contract.
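The ordering-sensitive operations above can be compared side by side. In this sketch the class and method names are illustrative, and the filter (numbers that leave remainder 6 when divided by 7) is an arbitrary choice to give findFirst something to look for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OrderingDemo {
    // forEachOrdered invokes the action one element at a time, in encounter
    // order, so appending to a plain ArrayList is safe here.
    static List<Integer> visitInOrder(int n) {
        List<Integer> seen = new ArrayList<>();
        IntStream.range(0, n).parallel().forEachOrdered(seen::add);
        return seen;
    }

    // Always the earliest match in encounter order, even in parallel.
    static int firstMatch(int n) {
        return IntStream.range(0, n).parallel()
                .filter(i -> i % 7 == 6)
                .findFirst()
                .orElse(-1);
    }

    // Some match; which one is up to the scheduler.
    static int anyMatch(int n) {
        return IntStream.range(0, n).parallel()
                .filter(i -> i % 7 == 6)
                .findAny()
                .orElse(-1);
    }

    // Dropping the ordering constraint lets limit keep any k elements.
    static List<Integer> anyK(int n, int k) {
        return IntStream.range(0, n).parallel()
                .unordered()
                .limit(k)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("forEachOrdered: " + visitInOrder(16));
        System.out.println("findFirst:      " + firstMatch(1_000));
        System.out.println("findAny:        " + anyMatch(1_000));
        System.out.println("unordered limit: " + anyK(1_000, 3));
    }
}
```

findFirst always reports 6 here. findAny reports some number with remainder 6, and the unordered limit returns three elements that need not be 0, 1, 2.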
The shared-state trap, in concrete form
The most common bug in beginner parallel code:
// WRONG -- looks fine, races in parallel
List<String> shouts = new ArrayList<>();
words.parallelStream().forEach(w -> shouts.add(w.toUpperCase()));
ArrayList.add is not thread-safe; concurrent workers either lose elements, double-add, throw ArrayIndexOutOfBoundsException, or corrupt the list silently. The right version expresses the result as the output of the pipeline, not a side effect of it:
List<String> shouts = words.parallelStream().map(String::toUpperCase).toList();
Stream.toList() (Java 16+), like the built-in collectors and the other terminals that produce a value, is designed for parallel use. The minute you reach for a forEach that mutates an outer variable, you've left the safe road.
If you genuinely need a thread-safe sink for forEach, use a ConcurrentLinkedQueue, AtomicLong, LongAdder, or Collections.synchronizedList(...). But almost always, the right answer is "don't use forEach for accumulation; let the pipeline build the result."
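Both safe options can be sketched in a few lines. The names SafeAccumulation, shout, and countAtLeast are illustrative; Collectors.toList() is used instead of Stream.toList() only so the sketch also compiles on JDKs older than 16.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

class SafeAccumulation {
    // Preferred: the list is the result of the pipeline, so no state is shared.
    static List<String> shout(List<String> words) {
        return words.parallelStream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // If a side effect is unavoidable, use a sink built for concurrent updates.
    static long countAtLeast(List<String> words, int minLength) {
        LongAdder hits = new LongAdder();
        words.parallelStream().forEach(w -> {
            if (w.length() >= minLength) {
                hits.increment();
            }
        });
        return hits.sum();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("alpha", "be", "gamma", "pi");
        System.out.println(shout(words));
        System.out.println(countAtLeast(words, 3));
    }
}
```

Even the LongAdder version is better written as filter(...).count(); it is shown only to illustrate what a thread-safe sink looks like.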
ForkJoinPool and why it matters
By default, every parallel stream in your JVM shares the common pool, sized to Runtime.getRuntime().availableProcessors() - 1 worker threads. That has two consequences:
- A long-running parallel stream monopolises the pool. Any other parallel stream — including ones inside libraries — will queue behind it.
- A parallel stream that blocks (I/O, locks, Thread.sleep) ties up a worker thread without doing any work, shrinking the pool's effective size by one for as long as it waits.
You can dedicate a private pool for a one-off pipeline:
// try-with-resources on a ForkJoinPool needs Java 19+; on older JDKs, call pool.shutdown() in a finally block
try (var pool = new java.util.concurrent.ForkJoinPool(4)) {
    long total = pool.submit(() ->
        nums.parallelStream().mapToLong(n -> heavy(n)).sum()
    ).get(); // get() throws the checked InterruptedException and ExecutionException
}
This is the right move for long-running compute that you don't want to share with the rest of the JVM. It is still the wrong move for blocking I/O; for that, switch to virtual threads or an explicit CompletableFuture chain on a bounded I/O executor.
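A self-contained variant of the private-pool pattern that also compiles on JDKs before 19 might look like the sketch below. Note the assumptions: heavy is a hypothetical CPU-bound function, and the pattern relies on a parallel stream running in the pool whose thread starts the terminal operation, which is long-standing JDK behaviour rather than, as far as the Stream Javadoc goes, a stated guarantee.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

class PrivatePoolDemo {
    // Hypothetical CPU-bound work per element.
    static long heavy(long n) {
        long acc = n;
        for (int i = 0; i < 1_000; i++) {
            acc = (acc * 31 + i) % 1_000_003;
        }
        return acc;
    }

    static long sumSequential(long count) {
        return LongStream.range(0, count).map(PrivatePoolDemo::heavy).sum();
    }

    // Runs the parallel pipeline from inside a dedicated pool and shuts it down afterwards.
    static long sumInPool(long count, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.submit(() ->
                    LongStream.range(0, count).parallel().map(PrivatePoolDemo::heavy).sum()
            ).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential:   " + sumSequential(50_000));
        System.out.println("private pool: " + sumInPool(50_000, 2));
    }
}
```

Both lines print the same total; only the threads doing the work differ.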
A worked example: parallel speed-up, the shared-state trap, and an associativity bug
The program below times sequential vs. parallel for a CPU-bound IntStream sum, demonstrates the shared-state race with forEach, shows the correct collector-based version, and contrasts associative (Integer::sum) with non-associative ((a, b) -> a - b) reducers under parallel.
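A minimal sketch of the timing portion of such a program follows. It is an illustration under assumptions, not a reproduction of the chapter's listing: heavy is a hypothetical CPU-bound function, the source is an int[], and the System.nanoTime measurement is deliberately crude (use JMH for numbers you intend to act on).

```java
import java.util.stream.IntStream;

class TimingSketch {
    // Hypothetical CPU-bound work per element.
    static long heavy(int n) {
        long acc = n;
        for (int i = 0; i < 1_000; i++) {
            acc = (acc * 31 + i) % 1_000_003;
        }
        return acc;
    }

    static long sum(int[] data, boolean parallel) {
        IntStream source = IntStream.of(data);
        if (parallel) {
            source = source.parallel();
        }
        return source.mapToLong(TimingSketch::heavy).sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.range(0, 100_000).toArray();

        // Warm-up runs so the JIT has compiled heavy() before we measure.
        sum(data, false);
        sum(data, true);

        long t0 = System.nanoTime();
        long sequential = sum(data, false);
        long t1 = System.nanoTime();
        long parallel = sum(data, true);
        long t2 = System.nanoTime();

        System.out.println("sequential: " + sequential + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("parallel:   " + parallel + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The two totals are always equal. The timings vary from run to run and from machine to machine, so treat a single run as a hint rather than a measurement.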
What to take from the run:
- The parallel sum produced the same result as the sequential one and (on any multi-core machine) finished in a fraction of the wall-clock time. The per-element heavy call is CPU-bound and the source (an int[]) splits cleanly: the two ingredients parallel needs.
- The forEach that mutated badSink either lost elements or blew up. There is no fix that adds a synchronized here without making the parallel version slower than the sequential one. The fix is to not write forEach for accumulation: use a collector or a terminal that produces the result.
- Integer::sum is associative; the parallel reduction produced the same answer as the sequential one. The non-associative (a, b) -> a - b produced different answers in sequential vs. parallel because the JVM is free to split and merge in any associative-equivalent order. Same code, two answers: the symptom every parallel-streams bug eventually produces.
- parallel().forEach(...) printed 0..15 in some non-monotonic order; parallel().forEachOrdered(...) printed them in order at the cost of cross-worker synchronisation. If your forEach cares about order, you're paying for it.
- The private ForkJoinPool(2) ran the pipeline against a dedicated pool. Use this when you have a long-running compute job and don't want it sharing the common pool with the rest of the JVM. Don't use it as a band-aid for blocking I/O; that's a different problem with a different tool.
What's next
You can now reason about every stream pipeline: when to write one, how to build it, what's lazy, what short-circuits, what runs in parallel safely, and what doesn't. One central abstraction is still on the table — the one that lets a pipeline express "this value might be absent" without a single null. The next chapter, Java Optional, covers Optional<T> — what it is, where the stream API leaves its loose ends, and how to use map, flatMap, orElse, and ifPresent to write code that is null-safe by construction.
Practice
`nums.parallelStream().reduce(0, (a, b) -> a - b)` returns a different answer than its `stream()` counterpart. Why?