W3docs

Java Character Streams

Read and write text in Java with Reader, Writer, FileReader, FileWriter, and character encoding considerations.

The previous chapter covered byte streams, the raw layer where everything is a byte. That layer is right for binary data and wrong for text. A UTF-8 character can be one, two, three, or four bytes; UTF-16 uses two-byte code units with surrogate pairs for anything beyond the basic multilingual plane; even ASCII text needs a "this is ASCII" decision somewhere. Calling InputStream.read() on text and casting the result to char works only if you're lucky and the file is single-byte-per-character (in practice, ASCII or ISO-8859-1). The moment someone writes "é" or "日" or "🎉", the lucky version corrupts the data.

The character stream hierarchy exists to keep that decoding out of your code. Reader and Writer deal in char, not byte. The bridge classes — InputStreamReader and OutputStreamWriter — take a Charset and do the conversion. Get the charset right at the bridge, and every layer above it works with decoded text.
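To make the difference concrete, here is a minimal sketch that reads the same UTF-8 bytes twice: once by casting each byte to char, once through the InputStreamReader bridge. The class and method names are illustrative, not part of any library, and the text is held in memory rather than in a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CastVersusDecode {

    // The "lucky" approach: treat every byte as one char.
    static String castEachByte(byte[] raw) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(raw)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b); // each byte 0..255 becomes its own char
            }
        }
        return sb.toString();
    }

    // The bridge approach: InputStreamReader decodes UTF-8 into chars.
    static String decodeUtf8(byte[] raw) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(raw), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" written with an escape so the source compiles under any encoding.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // 5 bytes: é takes two
        System.out.println(castEachByte(raw).length()); // 5: the é became two wrong chars
        System.out.println(decodeUtf8(raw).length());   // 4: decoded correctly
    }
}
```

The cast version does not fail loudly; it returns a plausible-looking string with two junk characters where é should be, which is why this bug tends to surface late.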

The Reader contract

Reader mirrors InputStream. Two methods are abstract, read(char[], int, int) and close(), and the rest are conveniences built on top:

int read();                            // next char as int 0..65535, or -1 at end
int read(char[] buf);                  // read up to buf.length chars; return count or -1
int read(char[] buf, int off, int len); // into a slice
String readLine();                       // only on BufferedReader — not on Reader itself
long transferTo(Writer out);             // Java 10+: pipe straight to a sink

Two subtle differences from the byte side. First, the unit is char (a 16-bit UTF-16 code unit), not byte. Second, read() returns 0..65535 for a code unit and -1 at end of stream — the same sentinel trick as InputStream, but the legal range is wider.
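A short sketch of the two read styles, using StringReader as the source so it runs without a file. The class and method names are made up for illustration. The last line of main shows why the return type is int: 0xFFFF is a legal code unit and must stay distinguishable from the -1 sentinel.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadLoops {

    // Drain a Reader with the bulk read(char[], int, int) method.
    static String drain(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4]; // deliberately tiny, to force several iterations
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n); // only the first n chars of buf are valid
        }
        return sb.toString();
    }

    // Count code units with the single-char read().
    static int countChars(Reader in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(new StringReader("hello, reader")));
        System.out.println(countChars(new StringReader("abc")));
        // The widest legal value is 65535, still different from -1.
        System.out.println(new StringReader("\uFFFF").read());
    }
}
```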

A char is not always one "character" — characters outside the basic multilingual plane (U+10000 and up: most emoji, ancient scripts) use two UTF-16 code units (a surrogate pair). If you split on char boundaries (e.g. read 100 chars at a time and process them in chunks) you can split a surrogate pair across two reads. For line-oriented text this rarely matters; for character-level processing of arbitrary Unicode, work in code points (String.codePoints()).
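The gap between code units and code points can be checked directly on a String. This is a small sketch with illustrative names; the emoji is built from its code point (U+1F389) so the source does not depend on file encoding.

```java
class CodeUnitsVersusCodePoints {

    // Number of UTF-16 code units: what String.length() and a Reader count.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: closer to "characters" as users see them.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String party = new String(Character.toChars(0x1F389)); // the party-popper emoji
        System.out.println(codeUnits(party));  // 2: a surrogate pair
        System.out.println(codePoints(party)); // 1
        // A chunk boundary between these two chars would split the pair.
        System.out.println(Character.isHighSurrogate(party.charAt(0)));
        System.out.println(Character.isLowSurrogate(party.charAt(1)));
        // Iterating by code point keeps the pair together.
        party.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```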

The Writer contract

Writer mirrors OutputStream:

void write(int c);                          // low 16 bits
void write(char[] buf);
void write(char[] buf, int off, int len);
void write(String s);                        // convenience: writes a whole String
void write(String s, int off, int len);
Writer append(CharSequence csq);             // chainable: w.append("a").append("b")
void flush();
void close();                                // calls flush() first

write(String) is the convenience you'll use most: most text I/O is a small number of large writes (a JSON body, a generated report) rather than character-by-character output.

append comes from the Appendable interface, which Writer and StringBuilder both implement. It accepts any CharSequence (a String, a StringBuilder, a CharBuffer), so code written against Appendable can target a Writer or a StringBuilder, chosen by a flag for example, without changing.
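Both Writer and StringBuilder implement java.lang.Appendable, so one method can serve either target. The sketch below uses a hypothetical helper, renderGreeting, to show the idea; StringWriter stands in for any Writer.

```java
import java.io.IOException;
import java.io.StringWriter;

class AppendTargets {

    // Hypothetical helper: writes to whatever Appendable it is given.
    static void renderGreeting(Appendable out, String name) throws IOException {
        out.append("Hello, ").append(name).append('!');
    }

    public static void main(String[] args) throws IOException {
        StringBuilder inMemory = new StringBuilder();
        renderGreeting(inMemory, "Ada");
        System.out.println(inMemory);

        StringWriter writer = new StringWriter(); // any Writer works the same way
        renderGreeting(writer, "Ada");
        System.out.println(writer);
    }
}
```

The helper declares IOException because Appendable.append does, even though StringBuilder never throws it.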

Concrete character streams

| Class | What it wraps |
| --- | --- |
| FileReader / FileWriter | A file on disk, decoded as text. |
| CharArrayReader / CharArrayWriter | An in-memory char[]. |
| StringReader / StringWriter | An in-memory String (reader) or StringBuffer (writer). |
| BufferedReader / BufferedWriter | A buffered view of another Reader/Writer. |
| InputStreamReader / OutputStreamWriter | Bridge classes: a Reader/Writer over an underlying byte stream, with a Charset. |
| PrintWriter | A Writer decorator that adds print, println, and printf. |

The bridge classes are the structural point of the whole hierarchy. Every character stream that talks to a file, socket, or pipe is — underneath — a byte stream plus a charset. FileReader is a thin wrapper around InputStreamReader(new FileInputStream(...)); FileWriter likewise around OutputStreamWriter(new FileOutputStream(...)).
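The in-memory classes in the table are the exception to that rule: they hold chars directly, so no bytes and no charset are involved. A minimal sketch, with illustrative class and method names, that stacks PrintWriter on a StringWriter and BufferedReader on a StringReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class InMemoryStreams {

    // Build text with PrintWriter's print/printf/println on top of a StringWriter.
    static String render() {
        StringWriter sink = new StringWriter();
        try (PrintWriter out = new PrintWriter(sink)) {
            out.printf(Locale.ROOT, "%s=%d%n", "answer", 42);
            out.println("done");
        }
        return sink.toString();
    }

    // Split text into lines with BufferedReader on top of a StringReader.
    static List<String> lines(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = in.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lines(render())); // [answer=42, done]
    }
}
```

This pairing is handy in tests: code that takes a Reader or Writer can be exercised without touching the file system.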

The charset trap

The classic Java I/O bug:

// WRONG in any code that might run on more than one machine
try (FileReader in = new FileReader("data.txt")) { ... }
try (FileWriter out = new FileWriter("data.txt")) { ... }

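The no-charset constructors use the JVM's default charset. Before Java 18 that default was determined at startup from the OS locale: on a developer Mac it's almost always UTF-8, on a Linux server with a C locale it can be US-ASCII, and on Windows with an English install it's Cp1252. Since Java 18 (JEP 400) the default is UTF-8 unless it is overridden with -Dfile.encoding, but any code that may run on an older JVM still depends on the locale. The "works on my Mac, broken on the production box" bug is exactly this constructor.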
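A quick way to see what a given JVM would pick is to print the default charset. This is a small diagnostic sketch (the class name is illustrative); the output depends on the Java version, the OS locale, and any -Dfile.encoding flag, so no expected output is shown.

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {

    // Describe the charset the no-charset constructors would use on this JVM.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it on the developer machine and on the server is often enough to confirm or rule out this class of bug.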

Pass a charset explicitly:

// Right
try (FileReader in = new FileReader("data.txt", StandardCharsets.UTF_8)) { ... }
try (FileWriter out = new FileWriter("data.txt", StandardCharsets.UTF_8)) { ... }

(The two-argument forms taking a Charset were added in Java 11. Before that, you had to drop down to the bridge classes, new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8), and that chained-decorator line is one of the reasons the Files.newBufferedReader factories exist: the (path, charset) form dates from Java 7, and the one-argument form, added in Java 8, has always defaulted to UTF-8.)

The modern Files API made this default safer:

String text = Files.readString(path);                // Java 11+; always UTF-8 by default
BufferedReader r = Files.newBufferedReader(path);    // UTF-8 by default (always was)

If you're starting fresh, use the Files factories. If you're touching legacy FileReader/FileWriter code, the cheapest fix is adding the StandardCharsets.UTF_8 second argument.
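As a sketch of the Files-based style, the round trip below writes and reads a temporary file with the charset spelled out. Passing StandardCharsets.UTF_8 is redundant for these methods but documents the intent. The class and method names are illustrative, and Files.writeString and Files.readString require Java 11 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FilesRoundTrip {

    // Write text to a temp file and read it back, both as UTF-8.
    static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("chars", ".txt");
        try {
            Files.writeString(tmp, text, StandardCharsets.UTF_8);
            return Files.readString(tmp, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // Read only the first line through the BufferedReader factory.
    static String firstLine(Path path) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("caf\u00e9\n").equals("caf\u00e9\n"));
    }
}
```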

The bridge classes directly

You need InputStreamReader and OutputStreamWriter whenever the source isn't a file — a ZipEntry, a socket, an HTTP response body, System.in, an Inflater-wrapped stream — and you want text out of it:

// Read text from System.in as UTF-8
try (BufferedReader stdin = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
  String line = stdin.readLine();
}

// Read the response body of an HttpURLConnection as text and print it
try (BufferedReader resp = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
  resp.lines().forEach(System.out::println);
}

The shape is always the same: byte stream → InputStreamReader(stream, charset) → optional BufferedReader → your code.
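That shape can be captured in a pair of small helpers that accept any byte stream. The sketch below is one possible way to write them, with illustrative names; the ByteArray streams in main stand in for a socket or HTTP body.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

class BridgeShape {

    // byte stream -> InputStreamReader(stream, charset) -> BufferedReader -> lines
    static List<String> readLines(InputStream source, Charset charset) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(source, charset))) {
            return in.lines().collect(Collectors.toList());
        }
    }

    // The mirror image: chars -> OutputStreamWriter(stream, charset) -> bytes
    static void writeText(OutputStream sink, Charset charset, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } // close() flushes the encoded bytes into sink
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeText(sink, StandardCharsets.UTF_8, "hello\ncaf\u00e9\n");
        System.out.println(sink.size()); // 12 bytes for 11 chars: é takes two
        System.out.println(readLines(
                new ByteArrayInputStream(sink.toByteArray()), StandardCharsets.UTF_8).size());
    }
}
```

Both helpers close the stream they are given, which suits one-shot use; a caller that needs to keep a socket open would flush instead of close.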

A worked example: text in three shapes

The program below writes a small UTF-8 text file containing ASCII, accented characters, and a multi-byte emoji, then reads it back four ways: as a String, character by character, line by line through a BufferedReader, and through the legacy FileReader(charset) constructor. The example also shows the bridge-class shape working over a ByteArrayInputStream so you can see where Reader and InputStream meet.
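The interactive listing is not reproduced in this text, so what follows is a sketch of a program with that shape, not the original. It assumes the content is the three lines hello, café, and 🎉 party, each ending in \n, and uses a temporary file; all class, method, and file names are illustrative. It needs Java 11 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

class TextInThreeShapes {

    // ASCII, an accented letter, and an emoji (U+1F389), written with escapes
    // so the source compiles the same under any source-file encoding.
    static final String CONTENT = "hello\ncaf\u00e9\n\uD83C\uDF89 party\n";

    // Drain any Reader one char at a time.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Bridge shape: bytes inside, charset at the bridge, chars outside.
    static String decode(byte[] raw, Charset charset) throws IOException {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("chars-demo", ".txt");
        try {
            Files.writeString(path, CONTENT, StandardCharsets.UTF_8);
            byte[] raw = Files.readAllBytes(path);
            System.out.println("chars=" + CONTENT.length()
                    + " bytes=" + raw.length
                    + " codePoints=" + CONTENT.codePointCount(0, CONTENT.length()));
            // prints: chars=20 bytes=23 codePoints=19

            // 1. Whole file as a String.
            System.out.println(Files.readString(path, StandardCharsets.UTF_8).equals(CONTENT));

            // 2. Character by character.
            try (Reader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
                System.out.println(readAll(in).equals(CONTENT));
            }

            // 3. Line by line through a BufferedReader.
            try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
                List<String> lines = in.lines().collect(Collectors.toList());
                System.out.println(lines.size() + " lines");
            }

            // 4. The legacy FileReader, with the charset pinned.
            try (Reader in = new FileReader(path.toFile(), StandardCharsets.UTF_8)) {
                System.out.println(readAll(in).equals(CONTENT));
            }

            // Bridge class over in-memory bytes: same bytes, same charset, same text.
            System.out.println(decode(raw, StandardCharsets.UTF_8).equals(CONTENT));

            // Same bytes, wrong charset: every byte becomes one char, so the
            // accented letter and the emoji come out as mojibake.
            System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(CONTENT));
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```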

What to take from the run:

  • The file on disk was larger than content.length(). Assuming the content is the three lines hello, café, and 🎉 party, each terminated by \n, the String has length() == 20 (counting each \n and counting the 🎉 emoji as two UTF-16 code units, which is what a Java char measures); UTF-8 encodes the emoji as four bytes and é as two, so the file holds 23 bytes, and the same text is 19 code points. The same logical text is one number in chars, another in bytes, another in code points. Knowing which one you mean is half of charset bugs.
  • The char-by-char loop reassembled the exact same string. The Reader API handled UTF-8 decoding for you: a single emoji shows up as two (char) read() calls because of UTF-16 surrogates, but you never had to think about byte boundaries.
  • BufferedReader.readLine() returned three lines: hello, café, 🎉 party. That's the text-oriented vocabulary — line-by-line, terminator-aware (handles \n, \r, and \r\n), and built on top of the bridge class. Every API call this chapter and the next make ultimately reduces to "decode bytes through a charset and serve characters."
  • The direct InputStreamReader(new ByteArrayInputStream(raw), UTF_8) block shows the structural shape: byte source on the inside, charset at the bridge, character API on the outside. Swap ByteArrayInputStream for socket.getInputStream() and the rest is identical — that's why HTTP and JDBC clients all converge on the same idiom.
  • The final block decoded the same bytes with the wrong charset. The accented é and the emoji both came out as garbage — the textbook mojibake bug. The bytes on disk were fine; the charset at the bridge was wrong. That's why pinning the charset explicitly is the single most useful habit in Java text I/O.

What's next

Both byte and character streams default to one-at-a-time I/O, and on a raw file stream every call is a syscall. The next chapter, Java Buffered Streams, covers the Buffered* decorators — an in-memory buffer between your code and the OS — and the readLine() API that lives there.

Practice

Why do `new FileReader(path)` and `new FileWriter(path)` (no charset argument) cause 'works on my machine, broken on the server' bugs?