Add BQN #939
Open
alexey-milovidov wants to merge 11 commits into
BQN is a pure array language in the APL/J/k family with no SQL or parquet reader; the 43 ClickBench queries are hand-translated into BQN in queries.bqn, and a Python preprocessor (prep.py) reads hits.parquet into per-column binary files (numeric as little-endian f64, strings as concatenated bytes + offset table). The queries themselves are pure BQN, reading those files via `•FBytes` plus `8‿64 •bit._cast` in util.bqn. The translations stay close to the SQL semantics with two adaptations: Q24 returns the matching row indices instead of reconstructing every column for `SELECT *`, and Q29 approximates the regex hostname extraction by stripping the scheme and `www.` prefix since BQN has no regex engine. Uses CBQN (https://github.com/dzaima/CBQN) — built from source by `install`. The driver runs `benchmark.sh` against `queries.idx` (literal 1..43), which dispatches into queries.bqn. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
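The on-disk layout described above can be illustrated with a minimal stdlib-only sketch. The helper names are hypothetical, and the exact shape of the offset table (here `n + 1` cumulative byte offsets starting at 0, stored as f64) is an assumption; the commit message only says "concatenated bytes + offset table".

```python
import struct

def encode_numeric(values):
    """Pack numbers as consecutive little-endian f64 values (8 bytes each)."""
    return struct.pack(f"<{len(values)}d", *values)

def encode_strings(strings):
    """Return (data, offsets): concatenated UTF-8 bytes plus an offset table.

    Assumption: the table holds len(strings) + 1 cumulative byte offsets,
    starting at 0, themselves stored as little-endian f64.
    """
    blobs = [s.encode("utf-8") for s in strings]
    ends = [0]
    for b in blobs:
        ends.append(ends[-1] + len(b))
    return b"".join(blobs), encode_numeric(ends)

def decode_strings(data, offsets):
    """Inverse of encode_strings: slice the byte blob at each offset pair."""
    n = len(offsets) // 8
    ends = [int(x) for x in struct.unpack(f"<{n}d", offsets)]
    return [data[a:b].decode("utf-8") for a, b in zip(ends, ends[1:])]

data, offs = encode_strings(["GET", "", "héllo"])
assert decode_strings(data, offs) == ["GET", "", "héllo"]
```

Storing the offsets as f64 keeps a single numeric file format for the reader, at the cost of requiring every offset to be an exactly representable whole number.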
queries.sql now contains 43 one-line, self-contained BQN expressions
(one per query) instead of SQL plus a separate queries.bqn /
queries.idx pair. ./query.bqn prepends `util ← •Import "util.bqn" ⋄ `
and evaluates the line with •BQN; util.bqn re-exports Pair, TopN,
LexMin, ContainsAny and _groupBy so each query line can stand on its
own.
prep.py is rewritten on top of pyarrow + numpy: numeric columns go out
via a single `astype("<f8").tobytes()` per row group, string columns
via raw `chunk.buffers()` + bulk offset arithmetic. The 14 GB parquet
preprocesses in ~3 minutes versus ~3.5 hours for the per-row
struct.pack version.
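The difference between the per-row and bulk write paths can be sketched without pyarrow or numpy. This is a stand-in, not the code in prep.py: the stdlib `array` module plays the role of `astype("<f8").tobytes()`, and both function names are hypothetical.

```python
import struct
import sys
from array import array

def pack_per_row(values):
    """One struct.pack call per value: the slow shape of the old prep."""
    out = bytearray()
    for v in values:
        out += struct.pack("<d", v)
    return bytes(out)

def pack_bulk(values):
    """One conversion for the whole batch, analogous to a single
    astype("<f8").tobytes() per row group."""
    buf = array("d", values)
    if sys.byteorder == "big":
        buf.byteswap()          # keep the on-disk format little-endian
    return buf.tobytes()

row_group = [float(i) for i in range(1000)]
assert pack_per_row(row_group) == pack_bulk(row_group)
```

Both paths produce identical bytes; the bulk path simply moves the per-value loop out of the Python interpreter.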
42 of 43 queries return on the full 100 M-row dataset; Q29 (regex
approximation) hits a 120-second per-query budget.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each queries.sql line now begins with `util←•Import •wdpath∾"/util.bqn"⋄`, so the line is a complete BQN program that can be pasted into any CBQN instance running in the bqn directory and run end-to-end. `query.bqn` no longer injects a prefix; it just slurps the program from stdin and hands it to •BQN. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
BQN has no SQL DDL; the per-column binary files written by prep.py are the schema. The file was carried over from the duckdb template. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cloud-init runs against the stock Ubuntu 24.04 AMI hit an immediate compile error in src/builtins/ffi.c — CBQN's makefile compiles the FFI module unconditionally and the AMI ships without libffi-dev. The o3 target aborts at "ffi.c: failed to build" before BQN ever runs, so every machine in the EC2 fleet reported a 36–150 s 'Total time:' with no Load time / Data size / runtimes in sink.data. Pulling in libffi-dev alongside build-essential is the smallest fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prep.py's apt-get install of python3-pyarrow was silenced and fell through to a `pip3 install`, but pip3 doesn't exist on stock Ubuntu cloud AMIs (no python3-pip pre-installed), so every machine in the fleet failed at './load: line 10: pip3: command not found' after the libffi-dev fix unblocked CBQN's build. Pull the parquet / numpy deps into ./install alongside the build toolchain, and keep a pip fallback in ./load for the rare callers that invoke load without install. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ubuntu 24.04 doesn't ship a python3-pyarrow package; the previous install went through apt and failed with 'E: Unable to locate package python3-pyarrow' before BQN ever ran. Drop to python3-pip and pip-install pyarrow + numpy with --break-system-packages (PEP 668). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cloud-init c6a.metal run produced corrupt .off files for every string column: query.bqn errored with 'Expected integer, got 79.96875' (fractional f64) inside LoadS, and every query that touches a string column (Q6, Q11–15, Q17–19, Q21–29, Q31–32, Q34–35, Q37–40) returned null. Numeric-only queries (Q1–5, Q7–10, Q16, Q20, Q30, Q33, Q36, Q41–43) all completed. Root cause: prep.py branched on `chunk.type in (pa.string(), pa.binary())` for the int32-offset layout and otherwise treated it as large_string (int64). When the pip-installed pyarrow returned a variant that didn't compare equal (e.g. string_view, or a dictionary-encoded chunk surfaced after the parquet reader's post-decoding rewrite), the int64 reader read garbage out of a non-int64 buffer. Fix: cast each string column to `pa.large_string()` first — zero-copy when already in that layout, copy otherwise — so the buffer layout is always the known 3-buffer (validity, int64 offsets, bytes) shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
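To make the suspected failure mode concrete, here is a small stdlib sketch of what happens when a buffer of int32 offsets is reinterpreted as int64. The sample offsets and the `looks_like_offsets` check are illustrative assumptions, not code from prep.py.

```python
import struct

# Four true offsets, stored the way an int32-offset string layout stores them.
true_offsets = [0, 3, 3, 8]
buf = struct.pack("<4i", *true_offsets)        # 16 bytes

# Reading the same bytes as int64 fuses neighbouring pairs of int32 values.
misread = struct.unpack("<2q", buf)
assert misread == (3 << 32, (8 << 32) + 3)     # (12884901888, 34359738371)

def looks_like_offsets(offsets, data_len):
    """Cheap sanity check: starts at 0, never decreases, ends at data_len."""
    return (
        len(offsets) > 0
        and offsets[0] == 0
        and all(a <= b for a, b in zip(offsets, offsets[1:]))
        and offsets[-1] == data_len
    )

assert looks_like_offsets(true_offsets, 8)
assert not looks_like_offsets(list(misread), 8)
```

A check of this kind fails at load time, which is cheaper than discovering the problem from null query results.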
c6a.metal benchmark run @ 2026-05-26 22:01:15 produced 25 null queries — every query that calls util.LoadS errored with "Expected integer, got 79.96875" / "got 3.96875" inside LoadS's ↓/↑ on the .off file. The earlier `arr.cast(pa.large_string())` fix didn't help: on the pip-installed pyarrow the cast returned a chunk whose nominal type is large_string but whose offsets buffer is still i32, so `np.frombuffer(buf[1], dtype="<i8")` read garbage that round-tripped to fractional f64 in the .off file. Switch to a public-API path that doesn't peek inside buffers: combine_chunks → fill_null → binary_length → cumsum for the offsets, and concatenate the value buffer for the bytes. Works regardless of whether the parquet reader hands back string, large_string, string_view, or a dictionary-encoded variant. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
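The fill_null → binary_length → cumsum pipeline can be sketched in plain Python, independent of any Arrow buffer layout. This is an illustration of the idea rather than the pyarrow code itself; the function name is hypothetical, and the leading 0 in the offset table is an assumption.

```python
from itertools import accumulate

def offsets_and_bytes(values):
    """values: iterable of bytes objects, with None standing for SQL NULL."""
    filled = [v if v is not None else b"" for v in values]   # fill_null
    lengths = [len(v) for v in filled]                       # binary_length
    offsets = [0, *accumulate(lengths)]                      # cumsum
    return offsets, b"".join(filled)

offsets, data = offsets_and_bytes([b"ab", None, b"cde"])
assert offsets == [0, 2, 2, 5]
assert data == b"abcde"
```

Because the offsets are derived from per-value lengths, the result does not depend on whether the underlying array used 32-bit offsets, 64-bit offsets, or a view layout.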
The latest c6a.metal run produced the same fractional-offset
("Expected integer, got 79.96875") errors despite the public-API
prep rewrite — locally the new prep produces correct files but the
same code on the benchmark VM does not, so we need more diagnostics.
Add three things to prep.py:
* Print `pyarrow:` and `numpy:` versions at startup so future
sink.data captures actually show what got installed.
* Build the string offsets in int64 and only promote to f64 at the
last step. Keeps the cumulative offsets exact regardless of what
dtype `pc.binary_length` returns.
* `assert offs_i64[-1] == str_offsets[col]` after each row group:
the cumulative end offset must equal the total bytes written, and
if they diverge the parquet reader is handing us a buffer layout
we don't actually understand, which is exactly the cloud failure
mode. Better to die loudly at load than to ship a corrupt .off
file the query layer reads as fractional floats.
While we're here, replace the brittle `typ in STRING_TYPES` tuple
test in the schema scan with the proper `pa.types.is_*` predicates so
new string variants (string_view, etc.) take the string path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
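A minimal sketch of the "integers first, f64 last" idea and the end-offset assertion follows. Python ints stand in for int64 here, and the function name and arguments are hypothetical; the 2**53 bound is the largest range in which every whole number is exactly representable as an f64.

```python
import struct

F64_EXACT_LIMIT = 2 ** 53   # whole numbers up to here round-trip through f64

def encode_row_group_offsets(lengths, base, bytes_after):
    """Encode one row group's cumulative end offsets as little-endian f64.

    lengths:     per-row byte lengths in this row group (ints)
    base:        bytes written to the data file before this row group
    bytes_after: bytes written to the data file after this row group
    """
    offs = []
    total = base
    for n in lengths:
        total += n              # exact integer arithmetic
        offs.append(total)
    end = offs[-1] if offs else base
    # Die loudly if the offsets and the bytes actually written disagree.
    assert end == bytes_after, f"offset {end} != bytes written {bytes_after}"
    assert end <= F64_EXACT_LIMIT, "offset too large to store exactly as f64"
    return struct.pack(f"<{len(offs)}d", *offs)   # promote to f64 last

encoded = encode_row_group_offsets([3, 0, 5], base=10, bytes_after=18)
assert struct.unpack("<3d", encoded) == (13.0, 13.0, 18.0)
```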
The prep.py asserts didn't fire on the cloud — prep's view of (cumulative offsets == bytes written) held — yet BQN still reads fractional values like 79.96875 out of the .off files. To distinguish "prep wrote bad bytes" from "something between prep and query corrupted the file", dump the first 8 f64 values of each .off file at the end of prep. If those print as integers, the prep is fine and the corruption is downstream; if they print as fractional, prep itself is producing wrong bytes despite the cumsum_end matching total bytes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
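The diagnostic described above could look roughly like the following stdlib sketch. The helper names and the `URL.off` file name are hypothetical; the demo writes a throwaway file rather than touching real output.

```python
import os
import struct
import tempfile

def head_f64(path, count=8):
    """Return up to `count` leading little-endian f64 values from a file."""
    with open(path, "rb") as f:
        raw = f.read(8 * count)
    n = len(raw) // 8                      # ignore a trailing partial value
    return struct.unpack(f"<{n}d", raw[: 8 * n])

def all_integral(values):
    """True when every value is a whole number, as an offset must be."""
    return all(v.is_integer() for v in values)

# Demo on a throwaway file; the name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "URL.off")
    with open(path, "wb") as f:
        f.write(struct.pack("<4d", 0, 17, 17, 42))
    head = head_f64(path)
    print(head, "integral" if all_integral(head) else "FRACTIONAL")
```

If the dump is integral at the end of prep but the query layer still sees fractional values, the same helper can be rerun on the file just before the queries start to narrow down where the bytes change.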
Summary
* `bqn/` directory: a ClickBench entry for BQN using CBQN (built from source by `install`).
* `bqn/queries.bqn`: the 43 ClickBench queries, hand-translated into BQN.
* `prep.py` (Python + pyarrow) reads `hits.parquet` into per-column binary files under `cols/`: numeric columns as raw little-endian f64; string columns as concatenated UTF-8 bytes plus an offset table. BQN reads these via `•FBytes` + `8‿64 •bit._cast` in `bqn/util.bqn`. Queries themselves are pure BQN.
* The driver runs `benchmark.sh` against `queries.idx` (a 43-line file containing `1..43`), which dispatches via stdin to the matching `Qn` function. The standard `queries.sql` stays as reference documentation.

Query adaptations
Most translations are direct. Two diverge from the SQL spirit:
* Q24 (`SELECT * ... ORDER BY EventTime LIMIT 10`): reconstructing 100+ columns just to discard them isn't useful in BQN, so the query returns the 10 row indices in EventTime order instead.
* Q29 (`REGEXP_REPLACE`): BQN has no regex engine. The hostname is approximated by stripping `http://` / `https://` / `www.` prefixes and taking everything before the next `/`.

`EventDate` literals in Q37–Q43 are encoded as days-since-epoch integers because `prep.py` stores `EventDate` that way.

Test plan
* python3 + struct.pack. Every `Qn` returns without error; sub-millisecond per query at that scale.
* `benchmark.sh` run against the full 100 M-row hits dataset on a benchmark VM.
* `install` builds CBQN with `make o3 -j$(nproc)` and symlinks `BQN` into `/usr/local/bin`.

🤖 Generated with Claude Code
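For readers who don't read BQN, the Q29 hostname approximation described under Query adaptations can be sketched in Python. This mirrors the prose description only (strip the scheme, strip `www.`, cut at the next `/`); the function name is hypothetical and the BQN code may treat edge cases differently.

```python
def approx_host(url):
    """Approximate the hostname without a regex engine."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    if url.startswith("www."):
        url = url[len("www."):]
    return url.split("/", 1)[0]      # everything before the next "/"

assert approx_host("https://www.example.com/path?x=1") == "example.com"
assert approx_host("http://example.org/") == "example.org"
```

Unlike a regex replace, which leaves non-matching inputs untouched, this approach always truncates at the first `/`, so results can differ on malformed or scheme-less URLs.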