Java Parallel Streams | W3Docs Learn Java

A parallel stream is the same stream pipeline you've been writing, but the JVM is allowed to split the source into chunks and process them on multiple threads. The change at the call site is tiny:

long total = nums.parallelStream().mapToLong(n -> heavy(n)).sum();
//                ^^^^^^^^^^^^^^^^

or:

long total = nums.stream().parallel().mapToLong(n -> heavy(n)).sum();

The pipeline shape, the operations, the result: all unchanged. What changes is who runs it. Instead of one thread walking the source, several workers from the common ForkJoinPool (by default one per CPU core, minus one, with the calling thread pitching in as well) divide the work, and their partial results are merged. When the work per element is heavy enough and the source splits cleanly, the pipeline finishes in roughly the sequential time divided by the number of cores. When it doesn't, parallel is slower than sequential, and sometimes incorrect. This chapter is about telling the difference.

What "parallel" actually does

A sequential stream pulls one element through the pipeline, then the next. A parallel stream:

Splits the source into sub-streams via the source's Spliterator. Arrays, ArrayList, IntStream.range, and similar sources split cleanly in O(1). LinkedList, Files.lines, Stream.iterate, and Stream.generate either split badly or refuse to split.
Runs each sub-stream's intermediate chain on a worker thread from the common pool.
Merges the partial results — for reduce and collect, this is what the combiner is for.
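The splitting step can be observed directly by asking a source for its Spliterator and calling trySplit once. The sketch below is illustrative only: SplitDemo and splitOnce are hypothetical names, and the exact sizes of the two halves are an implementation detail of the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

class SplitDemo {
    // Splits once and returns {prefix size, remaining size}, or null if the source refused to split.
    static long[] splitOnce(Spliterator<?> rest) {
        Spliterator<?> prefix = rest.trySplit();
        if (prefix == null) return null;
        return new long[] { prefix.estimateSize(), rest.estimateSize() };
    }

    public static void main(String[] args) {
        // An ArrayList knows its size up front and can hand out index ranges.
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6, 7, 8));
        Spliterator<Integer> fromList = list.spliterator();
        System.out.println("ArrayList SIZED?      " + fromList.hasCharacteristics(Spliterator.SIZED));
        long[] parts = splitOnce(fromList);
        System.out.println("ArrayList split into  " + parts[0] + " + " + parts[1] + " elements");

        // Stream.iterate has no known size, so the framework cannot plan balanced chunks.
        Spliterator<Integer> fromIterate = Stream.iterate(1, x -> x + 1).spliterator();
        System.out.println("Stream.iterate SIZED? " + fromIterate.hasCharacteristics(Spliterator.SIZED));
    }
}
```

A source that reports SIZED and splits into balanced ranges is what the parallel machinery is built for; a source without a known size forces it to guess.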

forEach in a parallel stream calls your Consumer from multiple threads concurrently and in unspecified order. forEachOrdered preserves encounter order at the cost of synchronisation. findFirst in parallel is more expensive than findAny for the same reason — it has to coordinate to identify the first match.

The contract — what your pipeline must satisfy

Parallel only gives a correct answer when the pipeline obeys three rules. Sequential code that happens to ignore them still works; parallel code that does so silently produces nonsense.

1. The reducer must be associative: f(f(a, b), c) == f(a, f(b, c)). +, *, max, min, set-union, and list-concat all qualify. Subtraction and division do not. Associativity is not the same as commutativity: on an ordered stream the partial results are merged in encounter order, so an associative but non-commutative operation such as list-concat is still safe. If you pass a non-associative BinaryOperator to reduce or Collectors.reducing, the answer depends on how the JVM happens to split.
2. The pipeline must be stateless. Your lambdas must not read or write shared mutable state. A lambda that captures and mutates an outer ArrayList, increments an outer int[], or uses any non-atomic counter will race in parallel.
3. The pipeline must be side-effect free. Logging is okay; persisting through a thread-safe sink is okay; everything else is a bug waiting for a worker to interleave it differently.

The collectors built into Collectors satisfy 1–3 by construction (when used as documented). Your own lambdas inside map, filter, reduce, and peek are the ones to watch.
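The associativity rule is easy to spot-check by hand before trusting an operator in a parallel reduce. The sketch below is a minimal illustration, not a proof: AssociativityCheck and associativeOn are hypothetical names, and a single passing triple does not establish associativity, while a single failing triple does refute it.

```java
import java.util.function.IntBinaryOperator;

class AssociativityCheck {
    // True if f(f(a, b), c) == f(a, f(b, c)) for this particular triple.
    static boolean associativeOn(IntBinaryOperator f, int a, int b, int c) {
        int leftGrouped = f.applyAsInt(f.applyAsInt(a, b), c);
        int rightGrouped = f.applyAsInt(a, f.applyAsInt(b, c));
        return leftGrouped == rightGrouped;
    }

    public static void main(String[] args) {
        // (7 + 3) + 2 = 12 and 7 + (3 + 2) = 12
        System.out.println("sum:      " + associativeOn(Integer::sum, 7, 3, 2));
        // max(max(7, 3), 2) = 7 and max(7, max(3, 2)) = 7
        System.out.println("max:      " + associativeOn(Math::max, 7, 3, 2));
        // (7 - 3) - 2 = 2 but 7 - (3 - 2) = 6
        System.out.println("subtract: " + associativeOn((a, b) -> a - b, 7, 3, 2));
    }
}
```

The subtraction case is the one to remember: the grouping changes the answer, and grouping is exactly what a parallel split changes.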

When parallel helps (and when it doesn't)

A parallel stream wins only when the per-element work is large enough to dwarf the coordination cost — splitting, scheduling, merging, and the framework's bookkeeping. A rough mental model:

Large source + CPU-bound per-element work + cheap merge + splittable source = parallel often wins. Image processing per pixel, parsing per record, hashing per file — classic cases.
Tiny source = sequential wins. The pool wake-up is more expensive than the whole computation.
Cheap per-element work = sequential wins. nums.stream().mapToInt(Integer::intValue).sum() is faster than its parallelStream() cousin until nums is in the millions; at small sizes the framework overhead dominates.
Blocking I/O per element = parallel streams are the wrong tool. The common ForkJoinPool is sized for CPU work; a blocking I/O call ties up a worker and starves every other parallel stream in the JVM (including those from libraries). Use CompletableFuture with a bounded executor for I/O fan-out.
Non-splittable source = parallel either falls back to sequential or splits badly. Files.lines, Stream.iterate, Stream.generate, and LinkedList.stream() are the canonical poor splitters; arrays, ArrayList, and IntStream.range are the canonical good ones.
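For the blocking I/O case, one way to fan out without touching the common pool is a fixed-size executor plus CompletableFuture. This is a sketch under stated assumptions: IoFanOut, fetch, and fetchAll are hypothetical names, fetch merely sleeps to stand in for a real blocking call, and Stream.toList() needs Java 16 or later.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class IoFanOut {
    // Hypothetical stand-in for a blocking call (HTTP request, JDBC query, file read).
    static String fetch(String key) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return key.toUpperCase();
    }

    // Runs fetch for every key on a dedicated pool capped at maxConcurrent threads.
    static List<String> fetchAll(List<String> keys, int maxConcurrent) {
        ExecutorService io = Executors.newFixedThreadPool(maxConcurrent);
        try {
            // Start every request first; joining inside the same pipeline would serialise them.
            List<CompletableFuture<String>> inFlight = keys.stream()
                .map(k -> CompletableFuture.supplyAsync(() -> fetch(k), io))
                .toList();
            // Join in list order, so results line up with the input keys.
            return inFlight.stream().map(CompletableFuture::join).toList();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(List.of("a", "b", "c"), 2));
    }
}
```

The pool size is the concurrency limit, so blocked threads here never take a worker away from parallel streams elsewhere in the JVM.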

The honest advice: default to sequential; switch to parallel only when you have a measured reason to, with jmh or wall-clock numbers in hand.

Operations that get weird in parallel

A few operations whose meaning shifts when the pipeline goes parallel:

forEach — runs from multiple threads, in unspecified order. If order matters, use forEachOrdered (which costs synchronisation).
findFirst — has to coordinate across workers to identify the first match in encounter order. Use findAny if you don't care which match wins.
limit / skip — well-defined on ordered streams, but more expensive in parallel because the JVM must respect order. On a parallel stream where order doesn't matter, stream.parallel().unordered().limit(n) is cheaper.
distinct / sorted are stateful operations. Each may need to see the whole input before it can emit anything, so in parallel they buffer heavily and act as a barrier between the stages around them.
reduce with the 3-arg overload uses the combiner to merge worker outputs. With the 2-arg overload, the accumulator doubles as the combiner and the identity may seed every chunk, so it must be a true identity for the operation. Same contract, same associativity rule.
collect is designed to be safe in parallel even though the result container is usually a plain HashMap or ArrayList. Each worker fills its own container from the collector's supplier and the combiner merges them, so no container is shared while it is being mutated. Your downstream collectors must obey the contract.
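A few of these differences can be seen in a short sketch. The class and method names below (OrderSensitiveOps and friends) are hypothetical, and the inputs are chosen only so that the ordered results are easy to predict.

```java
import java.util.List;
import java.util.stream.IntStream;

class OrderSensitiveOps {
    // findFirst honours encounter order, so this is always the smallest match.
    static int firstMultipleOf7() {
        return IntStream.rangeClosed(1, 1_000).parallel()
            .filter(i -> i % 7 == 0).findFirst().orElseThrow();
    }

    // findAny may return whichever match a worker reaches first.
    static int anyMultipleOf7() {
        return IntStream.rangeClosed(1, 1_000).parallel()
            .filter(i -> i % 7 == 0).findAny().orElseThrow();
    }

    // 3-arg reduce: the accumulator folds a String into an Integer,
    // the combiner merges two partial Integer totals.
    static int totalLength(List<String> words) {
        return words.parallelStream()
            .reduce(0, (len, w) -> len + w.length(), Integer::sum);
    }

    // unordered() lets limit take any 10 elements instead of the first 10.
    static long unorderedLimitCount() {
        return IntStream.range(0, 1_000).parallel().unordered().limit(10).count();
    }

    public static void main(String[] args) {
        System.out.println("findFirst: " + firstMultipleOf7());
        System.out.println("findAny:   " + anyMultipleOf7() + " (some multiple of 7)");
        System.out.println("lengths:   " + totalLength(List.of("alpha", "beta", "gamma")));
        System.out.println("limit:     " + unorderedLimitCount());
    }
}
```

Only the findAny line can vary between runs; the others are fixed by the stream contract.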

The shared-state trap, in concrete form

The most common bug in beginner parallel code:

// WRONG -- looks fine, races in parallel
List<String> shouts = new ArrayList<>();
words.parallelStream().forEach(w -> shouts.add(w.toUpperCase()));

ArrayList.add is not thread-safe; concurrent workers either lose elements, double-add, throw ArrayIndexOutOfBoundsException, or corrupt the list silently. The right version expresses the result as the output of the pipeline, not a side effect of it:

List<String> shouts = words.parallelStream().map(String::toUpperCase).toList();

toList() (Java 16+), like the built-in collectors and the other terminal operations that produce a value, is designed for parallel use. The minute you reach for a forEach that mutates an outer variable, you've left the safe road.

If you genuinely need a thread-safe sink for forEach, use a ConcurrentLinkedQueue, AtomicLong, LongAdder, or Collections.synchronizedList(...). But almost always, the right answer is "don't use forEach for accumulation; let the pipeline build the result."
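For the rare case where a forEach sink really is wanted, the sketch below shows two of the thread-safe options named above. SafeSinks and its method names are hypothetical, chosen for illustration.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

class SafeSinks {
    // LongAdder tolerates concurrent increments from many workers.
    static long countWithAdder(int n) {
        LongAdder hits = new LongAdder();
        IntStream.range(0, n).parallel().forEach(i -> hits.increment());
        return hits.sum();
    }

    // ConcurrentLinkedQueue accepts concurrent adds; the element order is unspecified.
    static Queue<String> collectWithQueue(int n) {
        Queue<String> sink = new ConcurrentLinkedQueue<>();
        IntStream.range(0, n).parallel().forEach(i -> sink.add("w" + i));
        return sink;
    }

    public static void main(String[] args) {
        System.out.println("adder count = " + countWithAdder(50_000));
        System.out.println("queue size  = " + collectWithQueue(50_000).size());
    }
}
```

Both sinks keep every element, but neither preserves encounter order, which is one more reason to prefer a collector when the result is a list.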

`ForkJoinPool` and why it matters

By default, every parallel stream in your JVM shares the common pool, sized to Runtime.getRuntime().availableProcessors() - 1 worker threads. That has two consequences:

A long-running parallel stream monopolises the pool. Any other parallel stream — including ones inside libraries — will queue behind it.
A parallel stream that blocks (I/O, locks, Thread.sleep) ties up a worker thread without doing any work, shrinking the pool's effective size by one for as long as it waits.

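You can dedicate a private pool for a one-off pipeline. Two caveats: try-with-resources on a ForkJoinPool needs Java 19 or later, and the fact that a parallel stream started inside a pool task runs on that pool is long-standing implementation behaviour rather than a documented guarantee: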

try (var pool = new java.util.concurrent.ForkJoinPool(4)) {
    long total = pool.submit(() ->
        nums.parallelStream().mapToLong(n -> heavy(n)).sum()
    ).get();
}

This is the right move for long-running compute that you don't want to share with the rest of the JVM. It is still the wrong move for blocking I/O — switch to virtual threads or an explicit CompletableFuture chain on a bounded I/O executor.
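On JDKs older than 19, where ForkJoinPool is not AutoCloseable, the same idea can be written with an explicit shutdown. This is a sketch, assuming the usual (undocumented) behaviour that a parallel stream started from a pool task runs on that pool; PrivatePoolSketch and sumOfSquares are hypothetical names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

class PrivatePoolSketch {
    // Sums the squares of 1..n on a dedicated pool instead of the common one.
    static long sumOfSquares(int parallelism, long n)
            throws InterruptedException, ExecutionException {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.submit(
                () -> LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum()
            ).get();
        } finally {
            pool.shutdown(); // release the worker threads even if the task failed
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("common pool parallelism: "
            + ForkJoinPool.commonPool().getParallelism());
        System.out.println("sum of squares 1..1000:  " + sumOfSquares(2, 1_000));
    }
}
```

The common pool's size can also be changed for the whole JVM with the java.util.concurrent.ForkJoinPool.common.parallelism system property, which is a blunter tool than a private pool.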

A worked example: parallel speed-up, the shared-state trap, and an associativity bug

The program below times sequential vs. parallel for a CPU-bound IntStream sum, demonstrates the shared-state race with forEach, shows the correct collector-based version, and contrasts associative (Integer::sum) with non-associative ((a, b) -> a - b) reducers under parallel.


import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class ParallelStreamsShowcase {
  // Pretend per-element work -- a small CPU-bound loop.
  static long heavy(int n) {
    long acc = 0;
    for (int i = 1; i <= 200; i++) acc += (long) n * i;
    return acc;
  }

  public static void main(String[] args) {
    int N = 200_000;
    int[] arr = IntStream.range(0, N).toArray();
    System.out.println("available CPU cores: " + Runtime.getRuntime().availableProcessors());

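    // ---- Sequential vs parallel timing (CPU-bound, splittable source) ----
    // Single-shot timing with no JIT warm-up: treat the numbers as indicative; use JMH for real measurements.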
    long t0 = System.nanoTime();
    long seq = Arrays.stream(arr).mapToLong(ParallelStreamsShowcase::heavy).sum();
    long t1 = System.nanoTime();
    long par = Arrays.stream(arr).parallel().mapToLong(ParallelStreamsShowcase::heavy).sum();
    long t2 = System.nanoTime();

    System.out.printf("sequential sum = %d in %.1f ms%n", seq, (t1 - t0) / 1e6);
    System.out.printf("parallel sum   = %d in %.1f ms%n", par, (t2 - t1) / 1e6);
    System.out.println("equal results? " + (seq == par));

    // ---- The shared-state trap ----
    System.out.println("\n--- shared-state race: forEach mutating an outer ArrayList ---");
    List<String> words = IntStream.range(0, 50_000).mapToObj(i -> "w" + i).toList();
    List<String> badSink = new ArrayList<>();
    try {
      words.parallelStream().forEach(w -> badSink.add(w.toUpperCase()));
      System.out.println("  bad sink size = " + badSink.size() + " (expected " + words.size() + ")");
    } catch (Throwable t) {
      System.out.println("  threw " + t.getClass().getSimpleName() + " -- ArrayList is not thread-safe");
    }

    // ---- The right version: let the pipeline build the result ----
    List<String> goodSink = words.parallelStream().map(String::toUpperCase).toList();
    System.out.println("  goodSink size = " + goodSink.size() + " (matches input)");

    // ---- Associativity matters in parallel ----
    System.out.println("\n--- associativity: + vs - under reduce ---");
    List<Integer> ints = IntStream.rangeClosed(1, 100).boxed().toList();
    int seqPlus  = ints.stream().reduce(0, Integer::sum);
    int parPlus  = ints.parallelStream().reduce(0, Integer::sum);
    int seqMinus = ints.stream().reduce(0, (a, b) -> a - b);
    int parMinus = ints.parallelStream().reduce(0, (a, b) -> a - b);
    System.out.println("  sequential sum  = " + seqPlus);
    System.out.println("  parallel   sum  = " + parPlus + "   (matches -- + is associative)");
    System.out.println("  sequential -    = " + seqMinus);
    System.out.println("  parallel   -    = " + parMinus + "   (almost certainly DIFFERS -- - is NOT associative)");

    // ---- forEach unordered vs forEachOrdered ----
    System.out.println("\n--- forEach order in parallel ---");
    System.out.print("  parallel forEach (unordered): ");
    IntStream.range(0, 16).parallel().forEach(i -> System.out.print(i + " "));
    System.out.println();
    System.out.print("  parallel forEachOrdered:      ");
    IntStream.range(0, 16).parallel().forEachOrdered(i -> System.out.print(i + " "));
    System.out.println();

    // ---- private pool for an isolated parallel pipeline (try-with-resources needs Java 19+) ----
    System.out.println("\n--- private ForkJoinPool ---");
    try (ForkJoinPool pool = new ForkJoinPool(2)) {
      long privSum = pool.submit(() -> Arrays.stream(arr).parallel().mapToLong(ParallelStreamsShowcase::heavy).sum()).get();
      System.out.println("  private-pool sum = " + privSum + " (matches sequential? " + (privSum == seq) + ")");
    } catch (Exception e) {
      System.out.println("  private pool failed: " + e);
    }
  }
}

What to take from the run:

The parallel sum produced the same result as the sequential one and, on a multi-core machine, typically finished in a fraction of the wall-clock time. The timing is a single run with no warm-up, so treat the ratio as indicative rather than exact. The per-element heavy call is CPU-bound and the source (an int[]) splits cleanly: the two ingredients parallel needs.
The forEach that mutated badSink either lost elements or blew up. Wrapping the add in synchronized would make it correct, but every worker would then queue on the same lock, which throws away most of the parallelism. The fix is to not write forEach for accumulation: use a collector or a terminal that produces the result.
Integer::sum is associative; the parallel reduction produced the same answer as the sequential one. The non-associative (a, b) -> a - b produced different answers in sequential vs. parallel because a parallel reduce groups the operands differently from a left-to-right fold, and only an associative operator gives the same answer under every grouping. Same code, two answers: the symptom every parallel-streams bug eventually produces.
parallel().forEach(...) printed 0..15 in some non-monotonic order; parallel().forEachOrdered(...) printed them in order at the cost of cross-worker synchronisation. If your forEach cares about order, you're paying for it.
The private ForkJoinPool(2) ran the pipeline against a dedicated pool. Use this when you have a long-running compute job and don't want it sharing the common pool with the rest of the JVM. Don't use it as a band-aid for blocking I/O — that's a different problem with a different tool.

What's next

You can now reason about every stream pipeline: when to write one, how to build it, what's lazy, what short-circuits, what runs in parallel safely, and what doesn't. One central abstraction is still on the table — the one that lets a pipeline express "this value might be absent" without a single null. The next chapter, Java Optional, covers Optional<T> — what it is, where the stream API leaves its loose ends, and how to use map, flatMap, orElse, and ifPresent to write code that is null-safe by construction.

Practice

`nums.parallelStream().reduce(0, (a, b) -> a - b)` returns a different answer than its `stream()` counterpart. Why?

Subtraction isn't associative (`(a - b) - c != a - (b - c)`) and parallel `reduce` is free to split and merge in any associative-equivalent order, so the answer depends on the split
Parallel streams skip elements at chunk boundaries; the missing elements explain the difference
The accumulator's identity `0` is wrong for subtraction; using `Integer.MAX_VALUE` would give the same answer in both
`reduce` cannot be called on a parallel stream; the API silently falls back to sequential, but only after the first split