Describe the bug
When a buffer has writerIndex == 0 (e.g., a string column where all values
are empty string ""), AbstractCompressionCodec.compress() writes an 8-byte
buffer with uncompressed_length = 0 as a "shortcut for empty buffer":
https://github.com/apache/arrow-java/blob/main/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L32-L39
if (uncompressedBuffer.writerIndex() == 0L) {
  // shortcut for empty buffer
  compressedBuffer.setLong(0, 0); // prefix = 0
  ...
}
This has been present since the initial implementation (ARROW-11899, 2021).
The Java decompress() handles this correctly (if size==0 return empty),
but C++ and Python Arrow readers do not recognize prefix=0. They attempt
to decompress 0 bytes of data, which fails:
- C++ (Arrow 1.0.0 through latest):
  IOError: Lz4 compressed input contains less than one frame
- PyArrow 21.0: same error
Reproduction
Write an Arrow IPC stream with LZ4_FRAME (or ZSTD) compression where one
string column has all values = "" (empty string, not null). The string data
buffer has writerIndex = 0, triggering the empty buffer path.
// Writer
ArrowStreamWriter writer = new ArrowStreamWriter(root, null, channel,
IpcOption.DEFAULT, CommonsCompressionFactory.INSTANCE, CodecType.LZ4_FRAME);
// All rows: stringVector.setSafe(i, "".getBytes());
Reading with C++ or Python fails at the first RecordBatch.
Root cause
The Arrow IPC compression format defines:
- prefix > 0: compressed data follows, decompress to prefix bytes
- prefix = -1: buffer stored uncompressed (sentinel)
- prefix = 0: undefined (not in spec, not handled by C++/Python)
Java writes prefix=0 for empty buffers, but only Java itself knows how to
read it back. C++/Python treat it as "0 bytes to decompress" → fail.
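To make the three cases concrete, here is a minimal sketch of how a strict reader might interpret the 8-byte little-endian prefix. It uses only java.nio and does not call any Arrow API; the class and method names (PrefixSketch, interpret, bufferWithPrefix) are hypothetical, and the returned strings only model the outcome rather than reproduce the real C++ error.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical model of a strict reader; not the Arrow C++ or Java implementation.
final class PrefixSketch {
  static final int PREFIX_BYTES = 8;

  // Returns a short description of what the reader would do with this buffer.
  static String interpret(ByteBuffer ipcBuffer) {
    // duplicate() resets the byte order, so set little-endian on the view.
    ByteBuffer view = ipcBuffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    long prefix = view.getLong(0);
    int bodyLength = view.limit() - PREFIX_BYTES;
    if (prefix == -1L) {
      // Sentinel: the body is used as-is, even when it is empty.
      return "raw:" + bodyLength;
    }
    if (bodyLength == 0) {
      // Any other prefix sends the body to the codec, which has no frame to decode.
      return "error: no compressed frame for prefix " + prefix;
    }
    return "decompress:" + prefix;
  }

  // Builds an 8-byte little-endian prefix followed by the given body bytes.
  static ByteBuffer bufferWithPrefix(long prefix, byte[] body) {
    ByteBuffer buf =
        ByteBuffer.allocate(PREFIX_BYTES + body.length).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(0, prefix);
    for (int i = 0; i < body.length; i++) {
      buf.put(PREFIX_BYTES + i, body[i]);
    }
    return buf;
  }

  public static void main(String[] args) {
    System.out.println(interpret(bufferWithPrefix(0L, new byte[0])));  // what Java writes today
    System.out.println(interpret(bufferWithPrefix(-1L, new byte[0]))); // the proposed encoding
  }
}
```

In this model, prefix=0 with an empty body is the only combination that reaches the codec with nothing to decode, which matches the failure reported above.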
Suggested fix
Change the empty buffer path to write the -1 sentinel (which all readers support):
if (uncompressedBuffer.writerIndex() == 0L) {
  ArrowBuf compressedBuffer = allocator.buffer(SIZE_OF_UNCOMPRESSED_LENGTH);
  compressedBuffer.setLong(0, -1L); // Use -1 instead of 0
  compressedBuffer.writerIndex(SIZE_OF_UNCOMPRESSED_LENGTH);
  uncompressedBuffer.close();
  return compressedBuffer;
}
When a reader sees prefix=-1, it treats the bytes after the prefix as
uncompressed data. Here there are none, so it returns an empty/zero-length
slice, which is correct for an originally empty buffer.
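As a byte-level illustration of the proposed encoding, the sketch below builds the 8-byte buffer and reads it back using only java.nio. It is a stand-in, not the patched Arrow code: EmptyBufferFixSketch, encodeEmpty and readBody are hypothetical names, and both constants are local to this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical stand-in for the fixed empty-buffer path; it does not use ArrowBuf.
final class EmptyBufferFixSketch {
  static final int SIZE_OF_UNCOMPRESSED_LENGTH = 8; // local constant for this sketch
  static final long NO_COMPRESSION_LENGTH = -1L;    // local constant for this sketch

  // Writer side: an empty buffer becomes just the -1 prefix with no body.
  static byte[] encodeEmpty() {
    ByteBuffer out =
        ByteBuffer.allocate(SIZE_OF_UNCOMPRESSED_LENGTH).order(ByteOrder.LITTLE_ENDIAN);
    out.putLong(0, NO_COMPRESSION_LENGTH);
    return out.array();
  }

  // Reader side: with the -1 sentinel, the body after the prefix is returned unchanged.
  static byte[] readBody(byte[] encoded) {
    long prefix = ByteBuffer.wrap(encoded).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    if (prefix != NO_COMPRESSION_LENGTH) {
      throw new IllegalStateException("this sketch only handles the uncompressed sentinel");
    }
    return Arrays.copyOfRange(encoded, SIZE_OF_UNCOMPRESSED_LENGTH, encoded.length);
  }
}
```

Calling EmptyBufferFixSketch.readBody(EmptyBufferFixSketch.encodeEmpty()) yields a zero-length array, so no codec is ever invoked for the empty buffer. On the wire the prefix is eight 0xFF bytes, the two's-complement form of -1.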
Environment
- Affected: All Arrow Java versions with IPC compression (1.0.0+)
- Readers that fail: Arrow C++ (all versions), PyArrow (all versions)
- Codec: Both LZ4_FRAME and ZSTD
Related
in vector reuse, not the intentional empty buffer path)
handle prefix=0)