Parquet Java ALP Implementation #3397

Open
vinooganesh wants to merge 26 commits into apache:master from vinooganesh:vinooganesh/alp-java-implementation
@vinooganesh vinooganesh commented Feb 17, 2026

cc @julienledem @alamb @emkornfield @prtkgaur

Rationale for this change

Reworks the ALP encoding implementation to address emkornfield's architectural feedback on PR #3390. The original buffered all values in memory and decoded eagerly. This makes the writer incremental (encode per-vector as values arrive) and the reader lazy (decode on demand), matching how other Parquet encodings work.

Builds on Julien Le Dem's original implementation (#3390). File structure, integration points, core math, and interop test infrastructure all come from his work. The rework focused on the internal writer/reader plumbing.

What changes are included in this PR?

Architecture (addressing review feedback):

  • Incremental writer. Values buffer in a fixed-size vector, each full vector encodes and flushes immediately.
  • Lazy reader. Vectors decode on first access via offset array, skip() is O(1).
  • Interleaved page layout so each vector is self-contained.
  • Extracted AlpValuesReader abstract base class for shared logic.
  • Preset caching. Full parameter search for first 8 vectors, top 5 combos cached for the rest.
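
For reviewers who want a mental model of the incremental writer, here is a minimal sketch. It is not the PR's AlpValuesWriter: the class name IncrementalVectorBuffer and all of its methods are hypothetical, and "encoding" is replaced by copying the vector into a list so the flush behaviour stays visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: buffer values in a fixed-size vector, flush each full vector. */
final class IncrementalVectorBuffer {
  private final double[] vector;
  // Stand-in for the encoded output; a real writer would encode and emit bytes here.
  private final List<double[]> flushed = new ArrayList<>();
  private int count = 0;

  IncrementalVectorBuffer(int vectorSize) {
    this.vector = new double[vectorSize];
  }

  /** Buffers one value; a full vector is flushed immediately. */
  void write(double value) {
    vector[count++] = value;
    if (count == vector.length) {
      flush();
    }
  }

  /** Flushes the partial trailing vector, for example at the end of a page. */
  void finish() {
    if (count > 0) {
      flush();
    }
  }

  private void flush() {
    flushed.add(Arrays.copyOf(vector, count));
    count = 0;
  }

  int flushedVectors() {
    return flushed.size();
  }

  int buffered() {
    return count;
  }
}
```

With a vector size of 4, writing nine values leaves two flushed vectors and one buffered value until finish() is called, so memory use is bounded by one vector rather than by the page.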

Spec compliance:

  • Fixed packed data size formula to ceil(n * bitWidth / 8)
  • Fixed unsigned delta comparison in float writer
  • Explicit little-endian byte reads instead of relying on ByteBuffer order
  • Using parquet-encoding's BytePacker instead of custom bit-packing
  • Capped max vector size at 32768 to prevent uint16 overflow in num_exceptions
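
The packed-size formula and the explicit little-endian reads can be illustrated as below. This is a sketch under assumptions: AlpSpecHelpers and its method names are hypothetical and are not the helpers in the PR; only the formula ceil(n * bitWidth / 8) and the little-endian byte order come from the description above.

```java
/** Hypothetical helpers illustrating two of the spec-compliance fixes. */
final class AlpSpecHelpers {
  private AlpSpecHelpers() {}

  /** ceil(n * bitWidth / 8), computed in long arithmetic to avoid int overflow. */
  static int packedSizeBytes(int n, int bitWidth) {
    return (int) (((long) n * bitWidth + 7) / 8);
  }

  /** Reads a 32-bit little-endian int without depending on any ByteBuffer order. */
  static int getIntLE(byte[] buf, int pos) {
    return (buf[pos] & 0xFF)
        | ((buf[pos + 1] & 0xFF) << 8)
        | ((buf[pos + 2] & 0xFF) << 16)
        | ((buf[pos + 3] & 0xFF) << 24);
  }

  /** Reads a 16-bit little-endian value as an unsigned int in [0, 65535]. */
  static int getUnsignedShortLE(byte[] buf, int pos) {
    return (buf[pos] & 0xFF) | ((buf[pos + 1] & 0xFF) << 8);
  }
}
```

For example, 5 values at 3 bits each occupy 15 bits, which rounds up to 2 bytes, while a bit width of 0 (constant vector) occupies 0 bytes.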

Configurable vector size (new):

  • withAlpVectorSize(int) and withAlpVectorSize(String columnPath, int) on ParquetProperties.Builder and the Hadoop ParquetWriter.Builder
  • Threaded through DefaultV1ValuesWriterFactory and DefaultV2ValuesWriterFactory so per-column overrides work
  • Defaults to 1024; validated against AlpConstants min/max bounds eagerly at builder time
  • Enables generating fixture data at multiple vector sizes for cross-language compatibility testing
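
Eager validation at builder time might look roughly like the sketch below. The 32768 cap and the 1024 default come from this description; the lower bound of 8, the power-of-two requirement, and the names VectorSizeCheck, validate, and log2 are assumptions made for illustration and may not match AlpConstants.

```java
/** Hypothetical sketch of eager vector-size validation. */
final class VectorSizeCheck {
  static final int MIN_VECTOR_SIZE = 8;       // assumed lower bound, not taken from the PR
  static final int MAX_VECTOR_SIZE = 32768;   // cap stated in the PR description
  static final int DEFAULT_VECTOR_SIZE = 1024;

  private VectorSizeCheck() {}

  /** Returns the size unchanged if valid, otherwise fails immediately. */
  static int validate(int size) {
    boolean powerOfTwo = Integer.bitCount(size) == 1; // assumption: sizes are powers of two
    if (size < MIN_VECTOR_SIZE || size > MAX_VECTOR_SIZE || !powerOfTwo) {
      throw new IllegalArgumentException("Invalid ALP vector size: " + size);
    }
    return size;
  }

  /** log2 of a validated power-of-two size, for example 1024 -> 10. */
  static int log2(int size) {
    return Integer.numberOfTrailingZeros(validate(size));
  }
}
```

Failing in the builder rather than in the writer means a bad size is reported at configuration time instead of surfacing on the first page flush.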

Reader null tolerance (bug fix):

  • AlpValuesReader was asserting num_elements == page.valuesCount, which fails on optional columns with nulls (num_elements is the encoded non-null count, valuesCount is the page row count including nulls). Relaxed to num_elements <= valuesCount. Surfaced by the corner-case fixture below.

Test fixtures (parquet-testing #100 prep):

  • generateAlpFixturesAtMultipleVectorSizes reads the source parquets from parquet-testing PR #100 and re-encodes them as Java ALP across 3 axes (page version V1/V2, vector size 1024/4096, and dataset/type), producing 16 files, each verified bit-exact against its source.
  • generateAndVerifyCornerCaseFixture writes a small (~60 KB) synthetic file with 14 columns, each engineered to hit a specific corner case from parquet-testing #105: no/one/all exceptions, NaN/Inf/-0.0, constant (bit_width=0), differing exponents per vector, optional with nulls. Both f32 and f64.
  • writeCornerCaseCsvTruth emits a sidecar _expect.csv from the construction recipe (not from reading the parquet back), so the corner-case file is independently verifiable by any future tooling.
  • readAllFixtureFilesIndependently opens every generated fixture and asserts ALP appears in column encodings + every row decodes through the standard reader path.

Integration:

  • Wired ALP into both DefaultV1ValuesWriterFactory and DefaultV2ValuesWriterFactory.

Are these changes tested?

Yes. 125 tests across 5 test classes in parquet-column, plus 12 tests in TestInterOpReadAlp in parquet-hadoop — all passing. Full parquet-column suite also passes.

Key tests construct ALP page bytes directly according to the spec and feed them to the reader without going through the writer. This verifies the reader works independently and catches any bugs where writer and reader agree with each other but disagree with the spec. Also covers NaN bit pattern preservation, negative zero roundtrip, extreme values, every partial vector remainder mod 8, skip across vector boundaries, and preset caching under distribution change.

Cross-language verification

Arrow C++ ALP decoder (apache/arrow#48345) reads every Java-written fixture bit-exact against the canonical _expect.csv truth tables from parquet-testing PR #100. Local verification covers all 17 fixtures across the full {V1, V2} × {vs1024, vs4096} matrix: 1,587,306 values, 0 mismatches. Six representative fixtures submitted as a stacked PR (prtkgaur/parquet-testing#1) toward apache/parquet-testing#100 account for 342K of those.

Are there any user-facing changes?

  • Users can enable ALP encoding for FLOAT and DOUBLE columns via ParquetProperties.withAlpEncoding(), globally or per-column.
  • Users can configure the ALP vector size via withAlpVectorSize(int) (new in this PR), also globally or per-column. Default is 1024.

Note: Likely me missing something — but ALP is not yet in the parquet-format Thrift spec (apache/parquet-format#533), so writing ALP files through the full Hadoop pipeline will fail at metadata serialization until parquet.thrift is updated (parquet-format PR #548).

julienledem and others added 7 commits January 22, 2026 08:44
Implements ALP encoding for FLOAT and DOUBLE types, which converts
floating-point values to integers using decimal scaling, then applies
Frame of Reference (FOR) encoding and bit-packing for compression.

New files:
- AlpConstants.java: Constants for ALP encoding
- AlpEncoderDecoder.java: Core encoding/decoding logic
- AlpValuesWriter.java: Writer implementation
- AlpValuesReaderForFloat/Double.java: Reader implementations

Includes comprehensive unit tests and interop test infrastructure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Restore original comment indentation that was accidentally changed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Escape <= characters as &lt;= in javadoc comments to avoid
malformed HTML errors during documentation generation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ALP encoding is not yet part of the parquet-format Thrift specification,
so it cannot be converted to org.apache.parquet.format.Encoding. Skip it
in the testEnumEquivalence test and add a clear error message in the
converter for when ALP conversion is attempted.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
  size and add independent reader/writer
  verification tests
Switch encode/decode from division-based formula to multiply-by-reciprocal
using separate POW10_NEGATIVE arrays, matching C++ Arrow's approach:
- Encode: fastRound(value * POW10[e] * POW10_NEGATIVE[f])
- Decode: encoded * POW10[f] * POW10_NEGATIVE[e]

Add fastRound helpers with sign branching for correct negative value
rounding. Remove version byte from page header (8 -> 7 bytes). Empty
pages now emit a 7-byte header with numElements=0.

Update all hand-crafted binary tests to match the new header format
and add comprehensive end-to-end tests for overflow boundaries,
large-scale data, preset caching, and NaN bit-pattern preservation.
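
A minimal sketch of the multiply-by-reciprocal formulas above, for readers unfamiliar with ALP. The class name AlpScalingSketch, the truncated power tables (exponents 0 to 10 only), and the +0.5/-0.5 form of fastRound are assumptions for illustration; the PR's helper may round differently.

```java
/** Hypothetical sketch of ALP decimal scaling with separate power tables. */
final class AlpScalingSketch {
  // Truncated tables for illustration only.
  static final double[] POW10 = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10};
  static final double[] POW10_NEGATIVE =
      {1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10};

  private AlpScalingSketch() {}

  /** Rounds half away from zero; the sign branch keeps negative values correct. */
  static long fastRound(double v) {
    return v >= 0 ? (long) (v + 0.5) : (long) (v - 0.5);
  }

  /** Encode: fastRound(value * POW10[e] * POW10_NEGATIVE[f]). */
  static long encode(double value, int e, int f) {
    return fastRound(value * POW10[e] * POW10_NEGATIVE[f]);
  }

  /** Decode: encoded * POW10[f] * POW10_NEGATIVE[e]. */
  static double decode(long encoded, int e, int f) {
    return encoded * POW10[f] * POW10_NEGATIVE[e];
  }

  /** A value is an exception when the round trip does not reproduce its exact bits. */
  static boolean isException(double value, int e, int f) {
    double roundTrip = decode(encode(value, e, f), e, f);
    return Double.doubleToRawLongBits(roundTrip) != Double.doubleToRawLongBits(value);
  }
}
```

In this sketch NaN and -0.0 always fail the bit-exact round trip, so they would be stored as exceptions rather than as scaled integers.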
- Rewrite TestInterOpReadAlp to use LocalInputFile instead of Hadoop
  FileSystem, fixing failures on Java 24+ where Subject.getSubject is
  removed. Tests now read C++ ALP parquet files directly without going
  through Hadoop security/UGI.

- Add AlpExceptionCountTest with per-column exception rate reporting
  against the real Spotify and Arade floating-point datasets from the
  parquet-testing repository. Useful for comparing Java vs C++ ALP
  compression ratios.
- Switch findBestFloatParams/findBestDoubleParams from minimizing
  exception count to minimizing estimated compressed size
  (length * bitWidth + exceptions * (typeSize + 2 bytes)), matching
  the C++ ALP cost model. This closes the ~4-5% compression gap vs C++.

- Rewrite sampler to collect evenly-spaced sample vectors and run
  findBestParams on each, then rank by win count. Matches C++ AlpSampler
  behavior more closely than the previous HashMap-based approach.

- Minor fixes: IOExceptionUtils null check, MemoryManager volatile scale,
  Files utility cleanup, parquet-cli dependency update.
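
To show why minimizing estimated size can pick a different (e, f) pair than minimizing exception count, here is a small sketch of the cost formula quoted above. The name AlpCostSketch is hypothetical, and expressing both terms in bits (multiplying the exception term by 8) is an assumption about units that the commit message leaves implicit.

```java
/** Hypothetical sketch of the size-based cost model. */
final class AlpCostSketch {
  private AlpCostSketch() {}

  /**
   * Estimated size in bits: packed integers plus exceptions, where each exception
   * stores the raw value (typeSizeBytes) and a 2-byte position.
   */
  static long estimatedBits(int length, int bitWidth, int exceptions, int typeSizeBytes) {
    long packedBits = (long) length * bitWidth;
    long exceptionBits = (long) exceptions * (typeSizeBytes + 2) * 8L;
    return packedBits + exceptionBits;
  }
}
```

For a 1024-value double vector, a candidate with bit width 20 and no exceptions costs 20480 bits, while a candidate with bit width 8 and 5 exceptions costs 8592 bits. An exception-count criterion would choose the first; the size criterion chooses the second.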
@vinooganesh vinooganesh force-pushed the vinooganesh/alp-java-implementation branch from 15bc06d to 24c23e5 Compare March 22, 2026 23:56
- Move shared LE helper methods (getShortLE/getIntLE/getLongLE) to
  AlpValuesReader base class; remove duplicates from subclasses
- Make EncodingParams fields package-private (remove public modifier)
- Replace fully-qualified java.util.Arrays.fill calls with imported Arrays.fill
  in both float and double readers; add missing import to double reader
- Add explanatory comments to getBufferedSize() magic numbers (3 for float,
  5 for double) explaining the overhead breakdown
- Add ALP enabled state to ParquetProperties.toString()
- Add ALP support to DefaultV1ValuesWriterFactory for float and double columns
- Revert Files.java, IOExceptionUtils.java, MemoryManager.java, and
  parquet-cli/pom.xml to master state; these changes are unrelated to ALP
  and should be submitted in separate PRs
- Clarify ParquetMetadataConverter error message: ALP encoding is defined
  in the ALP paper (enum value 26) but is not yet in the parquet-format
  Thrift spec, so ALP cannot be written through the Hadoop write path;
  the error message now explains what needs to happen to remove the block

@prtkgaur prtkgaur left a comment

The code organization looks good to me and the code follows the spec. I looked for areas of any extra buffer allocations which might impact performance and I think it is optimally written.

I think we should add a few benchmarks and publish numbers from them.

Thanks for working on this Vinoo!


@prtkgaur prtkgaur left a comment

Wanted to make sure we have the following testing.

For the cross-compatibility testing, are we making sure that we write both V1 and V2 pages and that the implementations in other languages are able to read them?

- Add build-time Perl script to patch generated Encoding.java with ALP(10)
  after Thrift codegen (process-sources phase), since parquet-format 2.12.0
  does not yet include ALP in its Thrift spec
- Remove guard in ParquetMetadataConverter.getEncoding() that blocked ALP
  writes; Encoding.ALP now exists in the patched Thrift enum
- Add withAlpEncoding() builder methods to ParquetWriter
- Add TestInterOpReadAlp: Java V1/V2 write+read round-trip tests and C++
  Arrow interop tests (reads alp_spotify1.parquet, alp_arade.parquet, etc.)
- Add AlpEncodingBenchmarks JMH benchmark
…d pyarrow interop test

- AlpValuesWriter: stop clearing cachedPresets in reset() so preset (e,f)
  pairs survive page flushes; eliminates redundant full parameter search on
  every page after the first, cutting write time ~60%
- AlpEncodingBenchmarks: clarify Javadoc that comparison is PLAIN+UNCOMPRESSED
  (no codec), not plain+ZSTD
- parquet-benchmarks pom: add explicit annotationProcessorPaths and proc=full
  for jmh-generator-annprocess so BenchmarkList is generated under Java 23+
- TestInterOpReadAlp: add pyarrow cross-language compatibility test (skips if
  pyarrow unavailable or does not yet support ALP encoding)
- ParquetProperties: add withAlpVectorSize(int) and withAlpVectorSize(String, int)
  builder methods plus getAlpVectorSize(ColumnDescriptor) accessor, defaulting to
  AlpConstants.DEFAULT_VECTOR_SIZE (1024).
- AlpConstants: promote validateVectorSize to public so the builder can validate
  eagerly across packages.
- DefaultV1/V2 ValuesWriterFactory: pass the configured vector size to the
  4-arg AlpValuesWriter constructors.
- ParquetWriter.Builder: expose withAlpVectorSize facades mirroring withAlpEncoding.
- TestInterOpReadAlp: add testJavaWriteAlpCustomVectorSize covering 4500 rows at
  vectorSize=4096 so we cross a full vector boundary and verify round-trip equality.
  A wrong log_vector_size byte would surface as decode garbage, so round-trip
  equality is sufficient proof the configured size took effect on the wire.

Enables generating ALP test fixtures at different vector sizes (e.g. 4096) for
cross-language compatibility testing against the C++/Rust/Go implementations.
Logging and debug output was missing the new alpVectorSize field
alongside the existing 'ALP enabled' line. Cosmetic only — no
behavior change.
Adds generateAlpFixturesAtMultipleVectorSizes to TestInterOpReadAlp.
For each of the four source files in parquet-testing PR apache#100
(alp_spotify1, alp_arade, alp_float_spotify1, alp_float_arade), reads
every row, then re-encodes as Java ALP at both vectorSize=1024 and
vectorSize=4096. Output goes to ALP_OUTPUT_DIR (default
${user.dir}/alp-java-generated/), producing 8 files total named
alp_java_<stem>_vs{1024,4096}.parquet.

Each output is verified by reading back through the standard reader
path and bit-comparing every value via doubleToRawLongBits /
floatToRawIntBits — catches NaN payload and signed-zero divergence,
not just numerical equality.
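
The difference between bit comparison and numerical equality can be seen in a short sketch. The class and method names (BitExact, sameBits) are hypothetical; Double.doubleToRawLongBits and Float.floatToRawIntBits are the standard library calls named above.

```java
/** Hypothetical sketch of bit-exact comparison for doubles and floats. */
final class BitExact {
  private BitExact() {}

  /** True only when both doubles have identical 64-bit patterns. */
  static boolean sameBits(double a, double b) {
    return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
  }

  /** True only when both floats have identical 32-bit patterns. */
  static boolean sameBits(float a, float b) {
    return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
  }

  /** A quiet NaN carrying a non-zero payload, distinct from the canonical Double.NaN. */
  static double nanWithPayload() {
    return Double.longBitsToDouble(0x7ff8000000000001L);
  }
}
```

Here 0.0 == -0.0 is true under the == operator but sameBits(0.0, -0.0) is false, and a NaN never equals itself under == yet compares equal to itself by bits, which is why the raw-bits comparison is the stricter check.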

Skips when ALP_TEST_DATA_DIR isn't set, so it stays inert in CI on
machines without the source datasets.

To run:
  git clone --branch alpFloatingPointDataset \
    https://github.com/prtkgaur/parquet-testing.git
  ALP_TEST_DATA_DIR=path/to/parquet-testing/data \
    mvn -pl parquet-hadoop \
    -Dtest=TestInterOpReadAlp#generateAlpFixturesAtMultipleVectorSizes \
    test
Extends generateAlpFixturesAtMultipleVectorSizes to vary writer page
version (PARQUET_1_0, PARQUET_2_0) as a third axis alongside dataset
and ALP vector size. Output grows from 8 → 16 files per run:

  alp_java_<stem>_v{1,2}_vs{1024,4096}.parquet

Page version is orthogonal to ALP encoding — the page version
difference lives in the parquet protocol layer, not in the ALP
payload — but covering both axes makes the fixture set fully
symmetric for cross-language compatibility verification. C++/Rust/Go
readers can use the V1 and V2 variants to prove their decoders
handle Java-written ALP regardless of how the surrounding pages are
framed. Avoids an asymmetry where the existing PR apache#100 set has C++
at V1 and Java at V2 with no overlap.

All 16 outputs independently verified against the canonical
_expect.csv truth files from parquet-testing PR apache#100 (1.56M values,
0 mismatches).
The reader was asserting that the ALP header's num_elements equals
the data page's valuesCount, but those values differ whenever a
column has nulls: num_elements is the count of non-null values that
went through ALP encoding, while valuesCount is the total row count
of the page (which includes null positions tracked by definition
levels). The strict equality check made the reader reject every
optional float/double column with at least one null value.

Relaxes the check to numElements <= valuesCount: the header can
never legitimately claim more encoded values than the page has rows,
but it can claim fewer when nulls are present. The downstream code
already uses numElements (not valuesCount) to drive vector
allocation and decoding, so the rest of the read path is unchanged.
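
A sketch of the relaxed check, assuming a hypothetical NumElementsCheck helper and using IllegalArgumentException as a stand-in for whatever exception type the reader actually throws:

```java
/** Hypothetical sketch of the relaxed num_elements validation. */
final class NumElementsCheck {
  private NumElementsCheck() {}

  /**
   * Accepts any header count from 0 up to the page's valuesCount. Fewer encoded
   * values than rows is legitimate when the column contains nulls.
   */
  static void check(int numElements, int valuesCount) {
    if (numElements < 0 || numElements > valuesCount) {
      throw new IllegalArgumentException(
          "num_elements " + numElements + " is outside [0, " + valuesCount + "]");
    }
  }
}
```

A page with 100 rows and 10 nulls carries num_elements = 90 and now passes, while a header claiming 101 values for 100 rows is still rejected.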

This was surfaced by the corner-case fixture per parquet-testing
issue apache#105, which exercises optional columns with null values.
Two new tests in TestInterOpReadAlp:

readAllFixtureFilesIndependently
  Opens every alp_java_*.parquet in ALP_OUTPUT_DIR and asserts each
  column chunk declares Encoding.ALP and decodes through the
  standard reader path without error. Separate from the generator's
  own round-trip verification so reader correctness surfaces as a
  distinct signal in CI when the fixtures are present. Skips
  cleanly when ALP_OUTPUT_DIR is empty so it stays inert in default
  CI environments.

generateAndVerifyCornerCaseFixture
  Writes a single small fixture file (alp_java_cornercases.parquet,
  ~60 KB) targeting the corner cases enumerated in parquet-testing
  issue apache#105: vectors with no exceptions, one exception per vector,
  all exceptions, NaN/Inf/-0.0, constant values (bit_width=0),
  multi-vector with differing exponents, and optional columns with
  nulls. Both f32 and f64 variants — 14 columns × 2048 rows total.
  Reads each column back and bit-exactly verifies every value
  against the expected pattern via doubleToRawLongBits /
  floatToRawIntBits.

The corner-case fixture is intended as a candidate file for
parquet-testing PR apache#100 once naming/design is confirmed. Generating
it also surfaced (and verified the fix for) a pre-existing reader
bug where optional columns with nulls couldn't be decoded — see the
preceding commit.
The corner-case fixture (alp_java_cornercases.parquet) is synthetic
— it isn't derived from any raw dataset in parquet-testing PR apache#100,
so the existing alp_*_expect.csv files don't cover it. That left
cross-language verifiers with no independent ground truth to check
the parquet file against; they had to either trust the Java reader
or duplicate the construction recipe in their own code.

writeCornerCaseCsvTruth now dumps the expected values straight from
the construction recipe into alp_java_cornercases_expect.csv next
to the parquet, every time the generator runs. The CSV uses the
same format conventions as the existing _expect.csv files (comma-
separated, header row, no quoting) plus two extensions:

  • Empty field = null cell (for optional columns)
  • Special values printed via Java's standard toString: "NaN",
    "Infinity", "-Infinity", "-0.0". These all parse via C++
    std::stod / std::stof per the standard (case-insensitive, "inf"
    and "infinity" both accepted).
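
The two CSV extensions can be sketched in Java as follows. CsvCell, format, and parse are hypothetical names, not the PR's writeCornerCaseCsvTruth; the sketch relies only on the standard Double.toString and Double.valueOf behaviour.

```java
/** Hypothetical sketch of the cell conventions: empty field = null, toString for values. */
final class CsvCell {
  private CsvCell() {}

  /** Null becomes an empty field; everything else uses Java's standard toString. */
  static String format(Double value) {
    return value == null ? "" : Double.toString(value);
  }

  /** Inverse of format: an empty field is read back as null. */
  static Double parse(String cell) {
    return cell.isEmpty() ? null : Double.valueOf(cell);
  }
}
```

Note that the text form "NaN" carries no payload bits, so a reader parsing the CSV gets a canonical NaN; signed zero does survive the round trip because "-0.0" parses back to negative zero.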

The Arrow C++ ALP decoder reads the parquet and compares against
this CSV bit-exactly: 27306 non-null cells + 1366 null cells across
14 columns × 2048 rows, 0 mismatches.

This makes the corner-case fixture self-documenting and verifiable
by any future cross-language tooling without rerunning the Java
generator to discover what the expected values are.
18 tests that feed malformed ALP page bytes to the reader and assert
it fails cleanly rather than crashing, returning silent garbage, or
hanging. Two test groups:

Already-validated paths (10 tests) — pin the explicit error messages
the reader emits for known-bad inputs: wrong compression mode, wrong
integer encoding, log_vector_size out of range, negative num_elements,
num_elements exceeding page valuesCount, exponent too high for f32
and f64, factor > exponent, num_exceptions > vectorLen, out-of-bounds
exception positions.

Currently-unvalidated paths (4 tests) — assert *some* Throwable is
raised when the page is truncated mid-offset-array, truncated mid-
vector-data, header-only, or has a corrupted offset pointing past
end-of-data. The reader currently relies on the underlying ByteBuffer
to surface IndexOutOfBoundsException or BufferUnderflowException for
these; the tests pin the loud-failure invariant so a regression that
swallows the corruption would fail.
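
The loud-failure invariant for truncated input can be illustrated with a plain ByteBuffer. This is a sketch, not the PR's test helper: TruncationSketch and its methods are hypothetical, and it reads a single 4-byte little-endian field rather than a real ALP header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: a truncated page must raise, never return garbage. */
final class TruncationSketch {
  private TruncationSketch() {}

  /** Reads one little-endian int; throws BufferUnderflowException if under 4 bytes remain. */
  static int readHeaderInt(byte[] page) {
    return ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }

  /** Returns true when reading the page raises instead of silently succeeding. */
  static boolean failsLoudly(byte[] page) {
    try {
      readHeaderInt(page);
      return false;
    } catch (RuntimeException e) {
      return true;
    }
  }
}
```

A test in this style takes a known-good byte array, truncates a copy with Arrays.copyOf, and asserts that failsLoudly returns true for the truncated copy and false for the original.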

Plus skip()/read() bounds (3 tests) and a sanity baseline that
confirms the helper produces a decodable page.

Helper builds a known-good ALP-encoded page once per test then
mutates a copy, so the tests target specific byte offsets in the
header and per-vector metadata sections without duplicating the
encoding logic.

All 18 pass. Full ALP suite now 135 tests, 0 regressions.
@vinooganesh vinooganesh changed the title [WIP] Parquet Java ALP Implementation Parquet Java ALP Implementation May 26, 2026
Writes a small ALP-encoded file with eight known float and double
values spanning negatives, zero, and positives, then reads the
column chunk metadata and bit-exactly asserts statistics.min /
statistics.max match the input.

Statistics are populated by the column writer wrapper rather than
the encoder, so they should work the same as any other encoding —
this test pins that assumption. Without correct statistics, parquet
predicate pushdown would either skip valid row groups or scan
unnecessary ones for ALP-encoded columns.

Confirmed working. No bug found.
getOutputDir() previously fell back silently to ${user.dir}/alp-java-generated/
when ALP_OUTPUT_DIR wasn't configured. Running mvn test from a module root
then created parquet-hadoop/alp-java-generated/ with an un-licensed CSV
inside, which made subsequent test runs in that module fail the
apache-rat-plugin license check until the directory was manually removed.
Hit this twice in one session.

Now getOutputDir() returns null when the variable is unset and each of the
three callers (generateAlpFixturesAtMultipleVectorSizes, the corner-case
generator, and readAllFixtureFilesIndependently) skips cleanly via
assumeTrue. The tests run normally when the env var is set, skip cleanly
when it isn't, and never create files outside the configured directory.

Verified:
  - Without ALP_OUTPUT_DIR: 13 tests run, 9 skipped, BUILD SUCCESS, no
    stray files
  - With ALP_OUTPUT_DIR set: 13 tests run, 2 skipped (pyarrow / other
    env-dependent), BUILD SUCCESS

3 participants