Add BQN by alexey-milovidov · Pull Request #939 · ClickHouse/ClickBench

alexey-milovidov · 2026-05-25T04:37:37Z

Summary

Adds the bqn/ directory: a ClickBench entry for BQN using CBQN (built from source by install).
BQN is a pure array language with no SQL or parquet reader, so the 43 ClickBench queries are hand-translated into BQN in bqn/queries.bqn.
prep.py (Python + pyarrow) reads hits.parquet into per-column binary files under cols/: numeric columns as raw little-endian f64; string columns as concatenated UTF-8 bytes plus an offset table. BQN reads these via •FBytes + 8‿64 •bit._cast in bqn/util.bqn. Queries themselves are pure BQN.
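The on-disk layout described above can be sketched with the Python standard library alone. This is a minimal illustration, not the PR's prep.py: the helper names are hypothetical, and the leading 0 in the offset table and the choice to store offsets as f64 are assumptions made for the example.

```python
import struct

def encode_numeric(values):
    """Pack numbers as raw little-endian f64, 8 bytes per value."""
    return struct.pack(f"<{len(values)}d", *values)

def encode_strings(strings):
    """Return (data, offsets_bytes) for a string column.

    data is the concatenated UTF-8 bytes; offsets_bytes is a table of
    n+1 cumulative byte offsets (assumed to start at 0), stored as f64.
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(encoded), encode_numeric(offsets)

def decode_strings(data, offsets_bytes):
    """Inverse of encode_strings: slice data at consecutive offsets."""
    count = len(offsets_bytes) // 8
    offsets = [int(x) for x in struct.unpack(f"<{count}d", offsets_bytes)]
    return [data[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]
```

Reading a numeric column back is a plain reinterpretation of every 8 bytes as one f64, which is the role `8‿64 •bit._cast` plays on the BQN side.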
The driver runs benchmark.sh against queries.idx (a 43-line file containing 1..43), which dispatches via stdin to the matching Qn function. The standard queries.sql stays as reference documentation.

Query adaptations

Most translations are direct. Two diverge from the SQL spirit:

Q24 (SELECT * ... ORDER BY EventTime LIMIT 10): reconstructing 100+ columns just to discard them isn't useful in BQN, so the query returns the 10 row indices in EventTime order instead.
Q29 (REGEXP_REPLACE): BQN has no regex engine. The hostname is approximated by stripping http:// / https:// / www. prefixes and taking everything before the next /.
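The Q29 approximation can be expressed in a few lines of Python for comparison. This is a sketch of the rule as described above, not a translation of the BQN code; `approx_host` is a hypothetical name.

```python
def approx_host(url: str) -> str:
    """Approximate the hostname without a regex engine.

    Strips one leading http:// or https://, then one leading www.,
    and keeps everything before the next '/'.
    """
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    # split with maxsplit=1 returns the whole string when there is no '/'
    return url.split("/", 1)[0]
```

Unlike a regex replacement that leaves non-matching input untouched, this version always truncates at the first `/`, so results can differ for inputs that are not http(s) URLs.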

EventDate literals in Q37–Q43 are encoded as days-since-epoch integers because prep.py stores EventDate that way.
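As an illustration of that encoding, the conversion between a calendar date and a days-since-epoch integer looks like this in Python. The function names are hypothetical and the date used is only an example value.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_since_epoch(d: date) -> int:
    """Number of whole days between 1970-01-01 and d."""
    return (d - EPOCH).days

def from_days(n: int) -> date:
    """Inverse of days_since_epoch."""
    return date.fromordinal(EPOCH.toordinal() + n)

print(days_since_epoch(date(2013, 7, 1)))  # 15887
```

A date-range filter then becomes an ordinary numeric comparison on the stored column.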

Test plan

All 43 queries parse and execute against a 10k-row mock dataset built with python3 + struct.pack. Every Qn returns without error, in under a millisecond per query at that scale.
End-to-end benchmark.sh run against the full 100 M-row hits dataset on a benchmark VM.
install builds CBQN with make o3 -j$(nproc) and symlinks BQN into /usr/local/bin.

🤖 Generated with Claude Code

BQN is a pure array language in the APL/J/k family with no SQL or parquet reader; the 43 ClickBench queries are hand-translated into BQN in queries.bqn, and a Python preprocessor (prep.py) reads hits.parquet into per-column binary files (numeric as little-endian f64, strings as concatenated bytes + offset table). The queries themselves are pure BQN, reading those files via `•FBytes` plus `8‿64 •bit._cast` in util.bqn. The translations stay close to the SQL semantics with two adaptations: Q24 returns the matching row indices instead of reconstructing every column for `SELECT *`, and Q29 approximates the regex hostname extraction by stripping the scheme and `www.` prefix since BQN has no regex engine. Uses CBQN (https://github.com/dzaima/CBQN) — built from source by `install`. The driver runs `benchmark.sh` against `queries.idx` (literal 1..43), which dispatches into queries.bqn. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

queries.sql now contains 43 one-line, self-contained BQN expressions (one per query) instead of SQL plus a separate queries.bqn / queries.idx pair. ./query.bqn prepends `util ← •Import "util.bqn" ⋄ ` and evaluates the line with •BQN; util.bqn re-exports Pair, TopN, LexMin, ContainsAny and _groupBy so each query line can stand on its own. prep.py is rewritten on top of pyarrow + numpy: numeric columns go out via a single `astype("<f8").tobytes()` per row group, string columns via raw `chunk.buffers()` + bulk offset arithmetic. The 14 GB parquet preprocesses in ~3 minutes versus ~3.5 hours for the per-row struct.pack version. 42 of 43 queries return on the full 100 M-row dataset; Q29 (regex approximation) hits a 120-second per-query budget. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Each queries.sql line now begins with `util←•Import •wdpath∾"/util.bqn"⋄`, so the line is a complete BQN program: it can be pasted into any CBQN instance running in the bqn directory and run end-to-end. `query.bqn` no longer injects a prefix; it just slurps the program from stdin and hands it to •BQN. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

BQN has no SQL DDL; the per-column binary files written by prep.py are the schema. create.sql was carried over from the duckdb template. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Cloud-init runs against the stock Ubuntu 24.04 AMI hit an immediate compile error in src/builtins/ffi.c — CBQN's makefile compiles the FFI module unconditionally and the AMI ships without libffi-dev. The o3 target aborts at "ffi.c: failed to build" before BQN ever runs, so every machine in the EC2 fleet reported a 36–150 s 'Total time:' with no Load time / Data size / runtimes in sink.data. Pulling in libffi-dev alongside build-essential is the smallest fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

prep.py's apt-get install of python3-pyarrow was silenced and fell through to a `pip3 install`, but pip3 doesn't exist on stock Ubuntu cloud AMIs (no python3-pip pre-installed). Once the libffi-dev fix unblocked CBQN's build, every machine in the fleet failed at './load: line 10: pip3: command not found'. Pull the parquet / numpy deps into ./install alongside the build toolchain, and keep a pip fallback in ./load for the rare callers that invoke load without install. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Ubuntu 24.04 doesn't ship a python3-pyarrow package; the previous install tried apt and failed with 'E: Unable to locate package python3-pyarrow' before BQN ever ran. Drop to python3-pip and pip-install pyarrow + numpy with --break-system-packages (PEP 668). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cloud-init c6a.metal run produced corrupt .off files for every string column: query.bqn errored with 'Expected integer, got 79.96875' (fractional f64) inside LoadS, and every query that touches a string column (Q6, Q11–15, Q17–19, Q21–29, Q31–32, Q34–35, Q37–40) returned null. Numeric-only queries (Q1–5, Q7–10, Q16, Q20, Q30, Q33, Q36, Q41–43) all completed. Root cause: prep.py branched on `chunk.type in (pa.string(), pa.binary())` for the int32-offset layout and otherwise treated it as large_string (int64). When the pip-installed pyarrow returned a variant that didn't compare equal (e.g. string_view, or a dictionary-encoded chunk surfaced after the parquet reader's post-decoding rewrite), the int64 reader read garbage out of a non-int64 buffer. Fix: cast each string column to `pa.large_string()` first — zero-copy when already in that layout, copy otherwise — so the buffer layout is always the known 3-buffer (validity, int64 offsets, bytes) shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
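The width-mismatch mechanism described in that message can be reproduced in isolation. The sketch below uses made-up offsets and the standard library only; it shows what happens when a buffer of int32 offsets is reinterpreted as int64, and does not claim to reproduce the actual failure on the benchmark VM.

```python
import struct

offsets = [0, 5, 9, 12]                # correct cumulative offsets
buf = struct.pack("<4i", *offsets)     # stored as four little-endian int32

as_i32 = struct.unpack("<4i", buf)     # correct width: values come back intact
as_i64 = struct.unpack("<2q", buf)     # wrong width: pairs of int32 fuse together

print(as_i32)  # (0, 5, 9, 12)
print(as_i64)  # (21474836480, 51539607561)
```

Each misread value is `high * 2**32 + low` of two adjacent int32 offsets, so both the count and the magnitudes are wrong.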

c6a.metal benchmark run @ 2026-05-26 22:01:15 produced 25 null queries — every query that calls util.LoadS errored with "Expected integer, got 79.96875" / "got 3.96875" inside LoadS's ↓/↑ on the .off file. The earlier `arr.cast(pa.large_string())` fix didn't help: on the pip-installed pyarrow the cast returned a chunk whose nominal type is large_string but whose offsets buffer is still i32, so `np.frombuffer(buf[1], dtype="<i8")` read garbage that round-tripped to fractional f64 in the .off file. Switch to a public-API path that doesn't peek inside buffers: combine_chunks → fill_null → binary_length → cumsum for the offsets, and concatenate the value buffer for the bytes. Works regardless of whether the parquet reader hands back string, large_string, string_view, or a dictionary-encoded variant. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
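The idea behind the buffer-free path (fill nulls, take byte lengths, cumulative-sum them) can be shown as a pure-Python analogue. This is not the pyarrow code from prep.py; the function name and the leading 0 in the offset list are assumptions for illustration.

```python
from itertools import accumulate

def offsets_from_values(values):
    """Derive offsets from per-value byte lengths instead of raw buffers.

    values is a list of bytes objects or None (null).
    Returns (offsets, data), where offsets has len(values) + 1 entries.
    """
    filled = [b"" if v is None else v for v in values]  # analogue of fill_null
    lengths = [len(v) for v in filled]                   # analogue of binary_length
    offsets = [0] + list(accumulate(lengths))            # analogue of cumsum
    return offsets, b"".join(filled)
```

Because the offsets are computed from lengths, the result does not depend on whether the underlying array stores int32 or int64 offsets internally.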

The latest c6a.metal run produced the same fractional-offset ("Expected integer, got 79.96875") errors despite the public-API prep rewrite. Locally the new prep produces correct files, but the same code on the benchmark VM does not, so we need more diagnostics. Add three things to prep.py:

* Print `pyarrow:` and `numpy:` versions at startup so future sink.data captures actually show what got installed.
* Build the string offsets in int64 and only promote to f64 at the last step. Keeps the cumulative offsets exact regardless of what dtype `pc.binary_length` returns.
* `assert offs_i64[-1] == str_offsets[col]` after each row group: the cumulative end offset must equal the total bytes written, and if they diverge the parquet reader is handing us a buffer layout we don't actually understand, which is exactly the cloud failure mode. Better to die loudly at load than to ship a corrupt .off file the query layer reads as fractional floats.

While we're here, replace the brittle `typ in STRING_TYPES` tuple test in the schema scan with the proper `pa.types.is_*` predicates so new string variants (string_view, etc.) take the string path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
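A stdlib-only sketch of the invariant and the late f64 promotion might look like the following. The `state` dictionary and both function names are hypothetical, and Python integers stand in for the int64 array.

```python
import struct
from itertools import accumulate

def write_row_group(values, state):
    """Append one row group of bytes values to a string column.

    state holds 'data' (bytearray of all bytes written so far) and
    'offsets' (list of exact integer end offsets, one per value).
    """
    base = len(state["data"])
    offs = [base + o for o in accumulate(len(v) for v in values)]
    for v in values:
        state["data"] += v
    # Invariant: the last end offset equals the total bytes written.
    if offs:
        assert offs[-1] == len(state["data"]), "offsets and bytes diverged"
    state["offsets"].extend(offs)

def offsets_as_f64(offsets):
    """Promote to f64 only at the end; integers below 2**53 stay exact."""
    assert all(0 <= o < 2**53 for o in offsets)
    return struct.pack(f"<{len(offsets)}d", *offsets)
```

Keeping the running offsets as integers until the final pack means any fractional value in the output file cannot come from the accumulation step itself.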

The prep.py asserts didn't fire on the cloud — prep's view of (cumulative offsets == bytes written) held — yet BQN still reads fractional values like 79.96875 out of the .off files. To distinguish "prep wrote bad bytes" from "something between prep and query corrupted the file", dump the first 8 f64 values of each .off file at the end of prep. If those print as integers, the prep is fine and the corruption is downstream; if they print as fractional, prep itself is producing wrong bytes despite the cumsum_end matching total bytes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
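A diagnostic of this kind can be sketched as below, assuming the .off file is a flat sequence of little-endian f64 values as described above. The function names are hypothetical.

```python
import struct

def head_f64(path, n=8):
    """Return up to the first n little-endian f64 values stored in a file."""
    with open(path, "rb") as f:
        raw = f.read(8 * n)
    count = len(raw) // 8          # ignore a trailing partial value, if any
    return list(struct.unpack(f"<{count}d", raw[:count * 8]))

def looks_integral(values):
    """True when every value is a finite whole number."""
    return all(v.is_integer() for v in values)
```

If `looks_integral(head_f64(path))` holds right after prep finishes but the query layer still reads fractional values, the bytes were altered, or are being interpreted differently, somewhere after prep.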

alexey-milovidov and others added 11 commits May 25, 2026 01:44

ClickBench/bqn: drop create.sql

a663637

alexey-milovidov wants to merge 11 commits into main from add-bqn
