Add BQN #939

Open
alexey-milovidov wants to merge 11 commits into main from add-bqn

Conversation

@alexey-milovidov
Member

Summary

  • Adds the bqn/ directory: a ClickBench entry for BQN using CBQN (built from source by install).
  • BQN is a pure array language with no SQL or parquet reader, so the 43 ClickBench queries are hand-translated into BQN in bqn/queries.bqn.
  • prep.py (Python + pyarrow) reads hits.parquet into per-column binary files under cols/: numeric columns as raw little-endian f64; string columns as concatenated UTF-8 bytes plus an offset table. BQN reads these via •FBytes + 8‿64 •bit._cast in bqn/util.bqn. Queries themselves are pure BQN.
  • The driver runs benchmark.sh against queries.idx (a 43-line file containing 1..43), which dispatches via stdin to the matching Qn function. The standard queries.sql stays as reference documentation.
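
The on-disk column layout described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the code in prep.py: the function names are hypothetical, and the offset table is assumed to hold n+1 entries with a leading 0, stored as little-endian f64 like the numeric columns.

```python
import struct

def encode_numeric(values):
    """Pack numbers as raw little-endian f64, 8 bytes per value."""
    return struct.pack(f"<{len(values)}d", *values)

def encode_strings(strings):
    """Return (data, offset_bytes): concatenated UTF-8 plus an f64 offset table.

    Assumption: the table has len(strings) + 1 entries and starts at 0.
    """
    chunks = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in chunks:
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(chunks), encode_numeric(offsets)

def decode_strings(data, offset_bytes):
    """Invert encode_strings by slicing the blob at consecutive offsets."""
    count = len(offset_bytes) // 8
    offsets = [int(x) for x in struct.unpack(f"<{count}d", offset_bytes)]
    return [data[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]

data, offs = encode_strings(["clickhouse", "", "BQN"])
```

Storing offsets as f64 keeps a single reader path on the BQN side, at the cost of requiring every offset to be an exactly representable integer.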

Query adaptations

Most translations are direct. Two diverge from the SQL spirit:

  • Q24 (SELECT * ... ORDER BY EventTime LIMIT 10): reconstructing 100+ columns just to discard them isn't useful in BQN, so the query returns the 10 row indices in EventTime order instead.
  • Q29 (REGEXP_REPLACE): BQN has no regex engine. The hostname is approximated by stripping http:// / https:// / www. prefixes and taking everything before the next /.
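
For readers who do not read BQN, the Q29 hostname approximation can be sketched in Python roughly as follows. The function name `approx_host` is hypothetical, and the handling of edge cases (no scheme, no trailing slash) is an assumption rather than a description of what queries.bqn does.

```python
def approx_host(url: str) -> str:
    """Approximate the hostname without a regex engine."""
    # Strip at most one scheme prefix.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # Strip an optional "www." prefix.
    if url.startswith("www."):
        url = url[len("www."):]
    # Keep everything before the next "/" (the whole string if there is none).
    return url.split("/", 1)[0]
```

For example, `approx_host("https://www.example.com/a/b")` yields `"example.com"`.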

EventDate literals in Q37–Q43 are encoded as days-since-epoch integers because prep.py stores EventDate that way.
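
As a sketch of that encoding, a date literal maps to its integer form as shown below. The helper name is hypothetical and the date is only an example value.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_since_epoch(d: date) -> int:
    """Number of whole days between the Unix epoch and d."""
    return (d - EPOCH).days

example = days_since_epoch(date(2013, 7, 1))  # 15887
```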

Test plan

  • All 43 queries parse and execute against a 10 k-row mock dataset built with python3 + struct.pack. Every Qn returns without error; sub-millisecond per query at that scale.
  • End-to-end benchmark.sh run against the full 100 M-row hits dataset on a benchmark VM.
  • install builds CBQN with make o3 -j$(nproc) and symlinks BQN into /usr/local/bin.

🤖 Generated with Claude Code

alexey-milovidov and others added 11 commits May 25, 2026 01:44
BQN is a pure array language in the APL/J/k family with no SQL or
parquet reader; the 43 ClickBench queries are hand-translated into BQN
in queries.bqn, and a Python preprocessor (prep.py) reads
hits.parquet into per-column binary files (numeric as little-endian
f64, strings as concatenated bytes + offset table). The queries
themselves are pure BQN, reading those files via `•FBytes` plus
`8‿64 •bit._cast` in util.bqn.

The translations stay close to the SQL semantics with two
adaptations: Q24 returns the matching row indices instead of
reconstructing every column for `SELECT *`, and Q29 approximates the
regex hostname extraction by stripping the scheme and `www.` prefix
since BQN has no regex engine.

Uses CBQN (https://github.com/dzaima/CBQN) — built from source by
`install`. The driver runs `benchmark.sh` against `queries.idx`
(literal 1..43), which dispatches into queries.bqn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
queries.sql now contains 43 one-line, self-contained BQN expressions
(one per query) instead of SQL plus a separate queries.bqn /
queries.idx pair. ./query.bqn prepends `util ← •Import "util.bqn" ⋄ `
and evaluates the line with •BQN; util.bqn re-exports Pair, TopN,
LexMin, ContainsAny and _groupBy so each query line can stand on its
own.

prep.py is rewritten on top of pyarrow + numpy: numeric columns go out
via a single `astype("<f8").tobytes()` per row group, string columns
via raw `chunk.buffers()` + bulk offset arithmetic. The 14 GB parquet
preprocesses in ~3 minutes versus ~3.5 hours for the per-row
struct.pack version.
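
The difference between the two approaches can be illustrated without numpy. The sketch below uses the standard-library `array` module as a stand-in for `astype("<f8").tobytes()`; it is not the code in prep.py, and both function names are hypothetical.

```python
import struct
import sys
from array import array

def pack_per_row(values):
    """One struct.pack call per value: simple, but slow at 100 M rows."""
    return b"".join(struct.pack("<d", v) for v in values)

def pack_bulk(values):
    """One bulk conversion for the whole batch."""
    buf = array("d", values)
    if sys.byteorder == "big":
        buf.byteswap()  # array uses native order; force little-endian
    return buf.tobytes()

sample = [0.0, 1.5, -2.0, 1e9]
```

Both functions produce identical bytes; only the number of Python-level calls differs.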

42 of 43 queries return on the full 100 M-row dataset; Q29 (regex
approximation) hits a 120-second per-query budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each queries.sql line now begins with `util←•Import •wdpath∾"/util.bqn"⋄`
so the line is a complete BQN program: it can be pasted into any CBQN
instance running in the bqn directory and runs end-to-end. `query.bqn`
no longer injects a prefix; it just slurps the program from stdin and
hands it to •BQN.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
BQN has no SQL DDL; the per-column binary files written by prep.py are
the schema. The file was carried over from the duckdb template.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cloud-init runs against the stock Ubuntu 24.04 AMI hit an immediate
compile error in src/builtins/ffi.c — CBQN's makefile compiles the FFI
module unconditionally and the AMI ships without libffi-dev. The o3
target aborts at "ffi.c: failed to build" before BQN ever runs, so
every machine in the EC2 fleet reported a 36–150 s 'Total time:' with
no Load time / Data size / runtimes in sink.data.

Pulling in libffi-dev alongside build-essential is the smallest fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prep.py's apt-get install of python3-pyarrow was silenced and fell
through to a `pip3 install`, but pip3 does not exist on stock Ubuntu
cloud AMIs (no python3-pip pre-installed), so every machine in the
fleet failed at './load: line 10: pip3: command not found' after the
libffi-dev fix unblocked CBQN's build.

Pull the parquet / numpy deps into ./install alongside the build
toolchain, and keep a pip fallback in ./load for the rare callers that
invoke load without install.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ubuntu 24.04 doesn't ship a python3-pyarrow package; the previous
install tried to pull it from apt and failed with 'E: Unable to locate
package python3-pyarrow' before BQN ever ran. Drop to python3-pip and
pip-install pyarrow + numpy with --break-system-packages (PEP 668).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cloud-init c6a.metal run produced corrupt .off files for every
string column: query.bqn errored with 'Expected integer, got 79.96875'
(fractional f64) inside LoadS, and every query that touches a string
column (Q6, Q11–15, Q17–19, Q21–29, Q31–32, Q34–35, Q37–40) returned
null. Numeric-only queries (Q1–5, Q7–10, Q16, Q20, Q30, Q33, Q36,
Q41–43) all completed.

Root cause: prep.py branched on `chunk.type in (pa.string(),
pa.binary())` for the int32-offset layout and otherwise treated it as
large_string (int64). When the pip-installed pyarrow returned a
variant that didn't compare equal (e.g. string_view, or a
dictionary-encoded chunk surfaced after the parquet reader's
post-decoding rewrite), the int64 reader read garbage out of a
non-int64 buffer.

Fix: cast each string column to `pa.large_string()` first — zero-copy
when already in that layout, copy otherwise — so the buffer layout is
always the known 3-buffer (validity, int64 offsets, bytes) shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c6a.metal benchmark run @ 2026-05-26 22:01:15 produced 25 null
queries — every query that calls util.LoadS errored with
"Expected integer, got 79.96875" / "got 3.96875" inside LoadS's
↓/↑ on the .off file. The earlier `arr.cast(pa.large_string())`
fix didn't help: on the pip-installed pyarrow the cast returned
a chunk whose nominal type is large_string but whose offsets
buffer is still i32, so `np.frombuffer(buf[1], dtype="<i8")`
read garbage that round-tripped to fractional f64 in the .off
file.
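
The width mismatch described above can be demonstrated in isolation. This sketch packs a small, made-up offset table as int32 and reinterprets the same bytes as int64. It shows only that the misread yields wrong integers; it does not try to reproduce the specific fractional values quoted in the error messages.

```python
import struct

offsets = [0, 5, 9, 12]                     # made-up i32 offsets
raw_i32 = struct.pack("<4i", *offsets)      # 16 bytes on disk / in the buffer

as_i32 = struct.unpack("<4i", raw_i32)      # correct reading: (0, 5, 9, 12)
as_i64 = struct.unpack("<2q", raw_i32)      # wrong width: two fused pairs

# Each int64 is low_word | (high_word << 32), so adjacent offsets merge.
```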

Switch to a public-API path that doesn't peek inside buffers:
combine_chunks → fill_null → binary_length → cumsum for the
offsets, and concatenate the value buffer for the bytes. Works
regardless of whether the parquet reader hands back string,
large_string, string_view, or a dictionary-encoded variant.
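
The idea behind that pipeline (derive offsets from per-value lengths rather than from the Arrow buffers) can be sketched in plain Python. This is a stand-in for the pyarrow calls, not the prep.py code: nulls are assumed to be filled with the empty string, and the offset table is assumed to start with 0.

```python
from itertools import accumulate

def offsets_and_bytes(values):
    """Build (offsets, data) from values, treating None as an empty string."""
    filled = [b"" if v is None else v.encode("utf-8") for v in values]
    lengths = [len(b) for b in filled]        # analogue of binary_length
    offsets = [0, *accumulate(lengths)]       # analogue of cumsum
    return offsets, b"".join(filled)
```

Because the offsets come from the values themselves, the result does not depend on whether the underlying layout uses 32-bit or 64-bit offsets.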

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The latest c6a.metal run produced the same fractional-offset
("Expected integer, got 79.96875") errors despite the public-API
prep rewrite — locally the new prep produces correct files but the
same code on the benchmark VM does not, so we need more diagnostics.

Add three things to prep.py:

  * Print `pyarrow:` and `numpy:` versions at startup so future
    sink.data captures actually show what got installed.
  * Build the string offsets in int64 and only promote to f64 at the
    last step. Keeps the cumulative offsets exact regardless of what
    dtype `pc.binary_length` returns.
  * `assert offs_i64[-1] == str_offsets[col]` after each row group:
    the cumulative end offset must equal the total bytes written, and
    if they diverge the parquet reader is handing us a buffer layout
    we don't actually understand, which is exactly the cloud failure
    mode. Better to die loudly at load than to ship a corrupt .off
    file the query layer reads as fractional floats.

While we're here, replace the brittle `typ in STRING_TYPES` tuple
test in the schema scan with the proper `pa.types.is_*` predicates so
new string variants (string_view, etc.) take the string path.
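
A minimal sketch of the exact-integer offsets and the end-offset check, with hypothetical helper names: offsets stay as integers until the final pack, and the last offset is compared against an independently tracked byte count before anything is written.

```python
import struct

def offsets_consistent(offsets, bytes_written):
    """True when the final cumulative offset equals the bytes written so far."""
    return not offsets or offsets[-1] == bytes_written

def pack_offsets(offsets, bytes_written):
    """Validate integer offsets, then promote to little-endian f64 at the end."""
    assert offsets_consistent(offsets, bytes_written), "offsets != bytes written"
    # f64 represents every integer below 2**53 exactly.
    assert all(0 <= o < 2**53 for o in offsets), "offset not exact in f64"
    return struct.pack(f"<{len(offsets)}d", *offsets)

packed = pack_offsets([3, 7, 7, 12], bytes_written=12)
```

Failing at load time keeps a bad offset table from reaching the query layer.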

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The prep.py asserts didn't fire on the cloud — prep's view of
(cumulative offsets == bytes written) held — yet BQN still reads
fractional values like 79.96875 out of the .off files. To
distinguish "prep wrote bad bytes" from "something between prep and
query corrupted the file", dump the first 8 f64 values of each .off
file at the end of prep. If those print as integers, the prep is
fine and the corruption is downstream; if they print as fractional,
prep itself is producing wrong bytes despite the cumsum_end matching
total bytes.
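
A diagnostic of that kind might look like the sketch below. The helper names and the `URL.off` file name are hypothetical, and the demo writes its own small file rather than touching real output.

```python
import os
import struct
import tempfile

def head_f64(path, count=8):
    """Read up to `count` little-endian f64 values from the start of a file."""
    with open(path, "rb") as f:
        raw = f.read(8 * count)
    n = len(raw) // 8
    return list(struct.unpack(f"<{n}d", raw[: 8 * n]))

def looks_integral(values):
    """True when every value is a whole number, as valid offsets must be."""
    return all(v.is_integer() for v in values)

# Demo on a throwaway file holding three valid offsets.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "URL.off")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<3d", 0.0, 5.0, 9.0))
    demo_head = head_f64(demo_path)
```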

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>