[Java][IPC] AbstractCompressionCodec.compress() writes prefix=0 for empty buffers, incompatible with C++/Python readers #1196

Description

@zgdgod

Describe the bug

When a buffer has writerIndex == 0 (e.g., a string column where all values
are empty string ""), AbstractCompressionCodec.compress() writes an 8-byte
buffer with uncompressed_length = 0 as a "shortcut for empty buffer":

https://github.com/apache/arrow-java/blob/main/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L32-L39

if (uncompressedBuffer.writerIndex() == 0L) {
    // shortcut for empty buffer
    compressedBuffer.setLong(0, 0);  // prefix = 0
    ...
}

This has been present since the initial implementation (ARROW-11899, 2021).

The Java decompress() handles this correctly (if size==0 return empty),
but C++ and Python Arrow readers do not recognize prefix=0. They attempt
to decompress 0 bytes of data, which fails:

  • C++ (Arrow 1.0.0 ~ latest): IOError: Lz4 compressed input contains less than one frame
  • PyArrow 21.0: same error

Reproduction

Write an Arrow IPC stream with LZ4_FRAME (or ZSTD) compression where one
string column has all values = "" (empty string, not null). The string data
buffer has writerIndex = 0, triggering the empty buffer path.

// Writer
ArrowStreamWriter writer = new ArrowStreamWriter(root, null, channel,
    IpcOption.DEFAULT, CommonsCompressionFactory.INSTANCE, CodecType.LZ4_FRAME);

// All rows: stringVector.setSafe(i, "".getBytes());

Reading with C++ or Python fails at the first RecordBatch.

Root cause

The Arrow IPC compression format defines:

  • prefix > 0: compressed data follows, decompress to prefix bytes
  • prefix = -1: buffer stored uncompressed (sentinel)
  • prefix = 0: undefined — not in spec, not handled by C++/Python

Java writes prefix=0 for empty buffers, but only the Java reader knows how to
read it back. C++ and Python treat it as a compressed buffer that should
decompress to 0 bytes, pass the empty payload to the codec, and fail.
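To make the three cases concrete, here is a minimal sketch of how a reader could classify the 8-byte prefix. It uses only `java.nio`, not the Arrow API. `PrefixSketch`, `Kind`, `prefixOnly`, and `classify` are hypothetical names introduced for illustration. Treating short bodies and other negative prefixes as invalid is an assumption of this sketch, not something stated above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of Arrow: classifies a buffer body by its length prefix. */
final class PrefixSketch {
  /** Size of the uncompressed-length prefix in bytes. */
  static final int PREFIX_BYTES = 8;

  enum Kind { COMPRESSED, STORED_UNCOMPRESSED, ZERO_PREFIX, INVALID }

  /** Builds a body that holds only the prefix, as the empty-buffer shortcut does. */
  static byte[] prefixOnly(long prefix) {
    return ByteBuffer.allocate(PREFIX_BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(prefix)
        .array();
  }

  /** Reads the little-endian signed 64-bit prefix and maps it to one of the cases. */
  static Kind classify(byte[] body) {
    if (body.length < PREFIX_BYTES) {
      return Kind.INVALID; // assumption of this sketch: a body must hold a full prefix
    }
    long prefix = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    if (prefix > 0) {
      return Kind.COMPRESSED; // compressed data follows
    }
    if (prefix == -1L) {
      return Kind.STORED_UNCOMPRESSED; // sentinel: raw bytes follow
    }
    if (prefix == 0) {
      return Kind.ZERO_PREFIX; // what Java currently writes for empty buffers
    }
    return Kind.INVALID; // assumption: any other negative value is rejected
  }
}
```

A reader that has no branch for `ZERO_PREFIX` falls through to its codec path with an empty payload, which matches the failure reported above.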

Suggested fix

Change the empty buffer path to use -1 sentinel (which all readers support):

if (uncompressedBuffer.writerIndex() == 0L) {
    ArrowBuf compressedBuffer = allocator.buffer(SIZE_OF_UNCOMPRESSED_LENGTH);
    compressedBuffer.setLong(0, -1L);  // Use -1 instead of 0
    compressedBuffer.writerIndex(SIZE_OF_UNCOMPRESSED_LENGTH);
    uncompressedBuffer.close();
    return compressedBuffer;
}

When a reader sees prefix=-1, it treats the bytes after the prefix as the raw
buffer contents. Here there are none, so it returns a zero-length slice, which
is correct for an originally empty buffer.
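The -1 round trip can be sketched without Arrow on the classpath. The code below is a stand-in for the reader logic, not the actual C++/Python implementation. `SentinelSketch`, `encodeEmpty`, and `readBody` are hypothetical names, and the codec path is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical stand-in for a reader's handling of the -1 sentinel; not Arrow code. */
final class SentinelSketch {
  static final int PREFIX_BYTES = 8;
  static final long NO_COMPRESSION = -1L;

  /** What the suggested fix would emit for an empty buffer: the prefix alone, set to -1. */
  static byte[] encodeEmpty() {
    return ByteBuffer.allocate(PREFIX_BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(NO_COMPRESSION)
        .array();
  }

  /** Reader side: on -1, return the bytes after the prefix without invoking any codec. */
  static byte[] readBody(byte[] body) {
    long prefix = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    if (prefix != NO_COMPRESSION) {
      // Decompression is out of scope here; only the sentinel branch is illustrated.
      throw new UnsupportedOperationException("codec path omitted from this sketch");
    }
    return Arrays.copyOfRange(body, PREFIX_BYTES, body.length);
  }
}
```

With these helpers, `SentinelSketch.readBody(SentinelSketch.encodeEmpty())` yields a zero-length array, so the codec never sees the empty input.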

Environment

  • Affected: All Arrow Java versions with IPC compression (since the initial implementation, ARROW-11899, 2021)
  • Readers that fail: Arrow C++ (all versions), PyArrow (all versions)
  • Codec: Both LZ4_FRAME and ZSTD
