Chapter 11

The Rendering Pipeline

From one shape at a time to a hundred thousand per frame

A beginner sketch draws a handful of shapes. A generative-art or particle sketch can draw tens or hundreds of thousands per frame. Bloom is built so the same circle(x, y, r) you learn on day one scales all the way up. This page walks through the four layers that make that possible, from least to most aggressive.

Canvas2D (state-diffing) → WebGL2 (instanced batching) → VM (fused draw opcodes) → WASM SIMD (particle kernel)

Each layer attacks a different bottleneck. Canvas2D state-diffing removes redundant work the browser would otherwise repeat per shape. The WebGL renderer collapses thousands of draw calls into one. The bytecode VM removes per-call allocation and indirection from the dispatch itself. And the SIMD kernel moves the per-shape math out of the VM's one-value-at-a-time path entirely.

How this compares to p5.js: For a side-by-side benchmark and a fair accounting of where Bloom wins and where the comparison is narrow, see Why Bloom over p5.js.

Layer 1 — Canvas2D state-diffing

The default Canvas2D path draws each shape with the browser's native fill/stroke. The trap is the state writes around those draws. Assigning ctx.fillStyle = "rgb(224, 112, 64)" forces the browser to re-parse and validate that CSS color string every single time — even if the value is identical to what it already is.

In a palette-reuse sketch (think a field of same-colored dots) almost all of those writes are no-ops. Bloom tracks the last value it applied and skips the write when it hasn't changed:

src/lang/interpreter.ts — applyFill

private appliedFillStyle: string | null = null;
public ctxStateWritesElided = 0;
public ctxStateWritesIssued = 0;

// Set ctx.fillStyle only if it differs from what we last applied.
private applyFill(ctx: CanvasRenderingContext2D, color: string): void {
  if (this.appliedFillStyle === color) {
    this.ctxStateWritesElided++;
    return;                       // skip the re-parse entirely
  }
  ctx.fillStyle = color;
  this.appliedFillStyle = color;
  this.ctxStateWritesIssued++;
}

The same applies to strokeStyle and lineWidth (applyStroke). A small rgbStringCache also memoizes the integer-to-"rgb(...)" string conversion, so the string is not rebuilt for each shape either.
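The memoization idea can be sketched in a few lines. This is an illustration only: the Map-based cache, the 24-bit key, and the rgbString helper are assumptions made for the example, not Bloom's actual rgbStringCache.

```typescript
// Hypothetical stand-in for an rgb-string cache: each distinct color
// builds its "rgb(r, g, b)" string once and reuses it afterwards.
const rgbStringCache = new Map<number, string>();
let rgbStringsBuilt = 0; // counts real string constructions

function rgbString(r: number, g: number, b: number): string {
  // Pack the three 8-bit channels into one integer key.
  const key = ((r & 0xff) << 16) | ((g & 0xff) << 8) | (b & 0xff);
  let cached = rgbStringCache.get(key);
  if (cached === undefined) {
    cached = `rgb(${r & 0xff}, ${g & 0xff}, ${b & 0xff})`;
    rgbStringCache.set(key, cached);
    rgbStringsBuilt++;
  }
  return cached;
}
```

Because a repeated color returns the identical string, the === comparison in applyFill stays cheap and keeps hitting the elided branch.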

On a palette-reuse sketch this brings the issued writes down to 156k versus 360k on the naive p5-style unconditional path — about 57% fewer, a 2.3× reduction.

Correctness invariant: The writes Bloom reports issuing must exactly equal the writes the context actually observes; no shape may silently inherit a stale color. A regression test pins ctxStateWritesIssued to the real number of context assignments. Diffing is only allowed to remove work, never to change output.
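A regression check of this kind might look like the sketch below. CountingContext and FillDiffer are hypothetical test doubles written for this example; the real test runs against Bloom's interpreter and a real or mocked canvas context.

```typescript
// Stand-in for a 2D context: counts every fillStyle assignment it observes.
class CountingContext {
  observedWrites = 0;
  private current = "#000000";
  get fillStyle(): string { return this.current; }
  set fillStyle(value: string) {
    this.current = value;
    this.observedWrites++;
  }
}

// Same diffing logic as applyFill, isolated from the interpreter.
class FillDiffer {
  private applied: string | null = null;
  issued = 0;
  elided = 0;

  applyFill(ctx: { fillStyle: string }, color: string): void {
    if (this.applied === color) {
      this.elided++;
      return;
    }
    ctx.fillStyle = color;
    this.applied = color;
    this.issued++;
  }
}

// Apply a sequence of fills and return both sides of the invariant.
function runPalette(colors: string[]): { ctx: CountingContext; differ: FillDiffer } {
  const ctx = new CountingContext();
  const differ = new FillDiffer();
  for (const color of colors) differ.applyFill(ctx, color);
  return { ctx, differ };
}
```

The invariant holds when differ.issued equals ctx.observedWrites and the context's final fillStyle is the last color requested.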

Layer 2 — Auto-batched WebGL2 instanced renderer

Canvas2D, however lean, still issues one native draw call per shape. That ceiling is the reason p5.js and the Canvas2D path both fall off at large counts: p5 has no auto-batching — each shape is its own native call.

Bloom's WebGL2 renderer (src/lang/webgl-renderer.ts) instead accumulates every shape of a given primitive type into a typed array, then issues one gl.drawArraysInstanced call per primitive type per frame over a single unit quad.

The instance buffer

Each shape is six float32 slots — no per-shape Color object, no per-shape geometry. Colors are packed into a single 32-bit integer the GPU unpacks for free in the shader:

6 float32 per instance (24 bytes)

offset:   0      4      8      12     16        20
        +------+------+------+------+----------+-----------+
slot:   |  x   |  y   |  w   |  h   | rotation | packedRGBA|
        +------+------+------+------+----------+-----------+
        \_____ vec2 __/\__ vec2 ___/\_ float _/\_ uint32 _/
         aPos          aSize         aRotation   aColor

instance 0: [ x0 y0 w0 h0 rot0 col0 ]  ─┐
instance 1: [ x1 y1 w1 h1 rot1 col1 ]   ├─ one contiguous Float32Array,
instance 2: [ x2 y2 w2 h2 rot2 col2 ]   │  uploaded once per frame
   ...                                  ─┘
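The layout above can be sketched as a small append-only buffer. This is a sketch under assumptions: the InstanceBuffer class and its method names are hypothetical, and writing the color slot through a Uint32Array view over the same bytes is one plausible way to store an integer in a float-typed buffer, not necessarily what webgl-renderer.ts does.

```typescript
const FLOATS_PER_INSTANCE = 6; // x, y, w, h, rotation, packedRGBA

class InstanceBuffer {
  private floats: Float32Array;
  private uints: Uint32Array; // same bytes, integer view for the color slot
  count = 0;

  constructor(private capacity: number) {
    const bytes = new ArrayBuffer(capacity * FLOATS_PER_INSTANCE * 4);
    this.floats = new Float32Array(bytes);
    this.uints = new Uint32Array(bytes);
  }

  // Append one shape: six 4-byte slots, no per-shape objects.
  push(x: number, y: number, w: number, h: number, rotation: number, packedRGBA: number): void {
    if (this.count >= this.capacity) throw new RangeError("instance buffer full");
    const base = this.count * FLOATS_PER_INSTANCE;
    this.floats[base] = x;
    this.floats[base + 1] = y;
    this.floats[base + 2] = w;
    this.floats[base + 3] = h;
    this.floats[base + 4] = rotation;
    this.uints[base + 5] = packedRGBA >>> 0; // raw bits, not a float conversion
    this.count++;
  }

  // The slice that would be uploaded before a single instanced draw.
  used(): Float32Array {
    return this.floats.subarray(0, this.count * FLOATS_PER_INSTANCE);
  }

  colorAt(index: number): number {
    return this.uints[index * FLOATS_PER_INSTANCE + 5];
  }

  reset(): void {
    this.count = 0; // reuse the same memory next frame
  }
}
```

Storing the color via the integer view keeps the exact bit pattern, which matters because converting a large integer to float32 would lose low-order bits.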

The quad geometry is static (4 corners). Per-instance attributes use gl.vertexAttribDivisor(..., 1) so each quad instance reads its own row. A circle is just the quad with a fragment-shader radial alpha mask; a rect leaves the mask off:

src/lang/webgl-renderer.ts — fragment shader (circle mask)

if (uIsCircle > 0.5) {
  // distance from center; circle edge at radius 0.5
  float dist = length(vLocal);
  // antialias the edge using the screen-space derivative
  float aa = fwidth(dist);
  float alpha = 1.0 - smoothstep(0.5 - aa, 0.5, dist);
  if (alpha <= 0.0) discard;
  fragColor = vec4(vColor.rgb, vColor.a * alpha);
} else {
  fragColor = vColor;  // rect: full quad
}
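To see what the mask computes, the same edge math can be mirrored on the CPU. The sketch below is a TypeScript port for illustration only; in the shader aa comes from fwidth(dist), whereas here it is passed in as a plain number, and the function names are made up for the example.

```typescript
// GLSL-style smoothstep: 0 below edge0, 1 above edge1, cubic blend between.
// Like GLSL, it assumes edge0 < edge1.
function smoothstep(edge0: number, edge1: number, x: number): number {
  const t = Math.min(1, Math.max(0, (x - edge0) / (edge1 - edge0)));
  return t * t * (3 - 2 * t);
}

// Alpha of the circle mask at distance `dist` from the quad center,
// with an antialiasing band of width `aa` just inside radius 0.5.
function circleMaskAlpha(dist: number, aa: number): number {
  return 1 - smoothstep(0.5 - aa, 0.5, dist);
}
```

Inside the band the alpha falls smoothly from 1 to 0, so the edge is softened over roughly one pixel when aa is the screen-space derivative.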

The packed color travels as an integer vertex attribute and is unpacked in the vertex shader with bit ops — a free GPU tint, with zero allocation on the JS side:

src/lang/webgl-renderer.ts — packColor

// Pack 8-bit R,G,B,A into one 32-bit int (R is the low byte).
return ((ai << 24) | (bi << 16) | (gi << 8) | ri) >>> 0;
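A round trip makes the byte order concrete. In this sketch the packColor signature is an assumption (the source shows only the return expression), and unpackColor is a hypothetical CPU-side mirror of the bit operations the vertex shader would perform.

```typescript
// Pack 8-bit channels into one unsigned 32-bit integer, R in the low byte.
// The trailing >>> 0 converts the signed result of << into an unsigned value.
function packColor(ri: number, gi: number, bi: number, ai: number): number {
  return ((ai << 24) | (bi << 16) | (gi << 8) | ri) >>> 0;
}

// Inverse operation: shift each channel down and mask to 8 bits.
function unpackColor(packed: number): [number, number, number, number] {
  return [
    packed & 0xff,          // R
    (packed >>> 8) & 0xff,  // G
    (packed >>> 16) & 0xff, // B
    (packed >>> 24) & 0xff, // A
  ];
}
```

For rgb(224, 112, 64) at full alpha, the packed value is 0xff4070e0: alpha in the top byte, red in the bottom.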

p5.js / Canvas2D: N native calls (color object + path setup)

Bloom WebGL2: 1 instanced call (6 floats into a shared buffer)

Renderer in isolation: Driven directly (no language layer), the instanced renderer sustains 60fps at 200k circles.

Layer 3 — VM fused draw opcodes

The renderer removes the GPU-side ceiling, but the dispatch still has to get each shape's arguments out of the running program and into the batcher. The generic native-call path allocates an arguments array per call and hops through the native bridge. For the hottest primitives that overhead dominates.

So the bytecode compiler emits dedicated superinstructions for the fixed-arity drawing calls — CALL_DRAW_CIRCLE and CALL_DRAW_RECT (opcodes 0x64/0x65 in src/lang/bytecode.ts). They read operands straight off the VM stack, with no args-array allocation and no bridge indirection:

src/lang/bytecode.ts — compiler emits the fused opcode

// circle(x,y,r) and rect(x,y,w,h) are fixed-arity, push nil, and
// skip the args-array allocation + native bridge indirection.
if (funcName === 'circle' && call.arguments.length === 3) {
  for (const arg of call.arguments) this.compileExpr(arg);
  this.emit(OpCode.CALL_DRAW_CIRCLE);
  return;
}

src/lang/bytecode.ts — the VM handler appends straight to the batcher

case OpCode.CALL_DRAW_CIRCLE: {
  // Stack (top-down): r, y, x.
  const r = this.stack[--this.stackTop] as number;
  const y = this.stack[--this.stackTop] as number;
  const x = this.stack[--this.stackTop] as number;
  if (this.webglRenderer) {
    if (this.webglFillOn) {
      this.webglRenderer.circle(x, y, r, this.webglFillPacked);
    }
  } else {
    this.drawCircleFallback(x, y, r);   // Canvas2D, identical output
  }
  this.stack[this.stackTop++] = null;
  break;
}

Per-shape dispatch on this path is about 0.12µs — roughly 8 million shapes per second. Higher-arity forms (for example rect(x,y,w,h,radius) for a rounded rect) fall through to the generic native call, so nothing is lost.
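The difference between the two dispatch paths can be shown with a toy stack machine. Everything here is hypothetical scaffolding for the example (MiniVM, its native table, the flat instances array standing in for the batcher); it only illustrates where the generic path allocates and the fused path does not.

```typescript
type Value = number | null;

class MiniVM {
  stack: Value[] = [];
  stackTop = 0;
  instances: number[] = [];   // stand-in for the batcher: x, y, r triples
  argsArraysAllocated = 0;    // counts generic-path allocations

  private natives: Record<string, (args: Value[]) => Value> = {
    circle: (args) => {
      this.instances.push(args[0] as number, args[1] as number, args[2] as number);
      return null;
    },
  };

  push(value: Value): void {
    this.stack[this.stackTop++] = value;
  }

  // Generic path: copy operands into a fresh array, look up the native, call it.
  callNativeGeneric(name: string, argc: number): void {
    const args: Value[] = new Array(argc);
    this.argsArraysAllocated++;
    for (let i = argc - 1; i >= 0; i--) args[i] = this.stack[--this.stackTop];
    const fn = this.natives[name];
    if (!fn) throw new Error(`unknown native: ${name}`);
    this.push(fn(args));
  }

  // Fused path: read operands straight off the stack, append, push nil.
  callDrawCircle(): void {
    const r = this.stack[--this.stackTop] as number;
    const y = this.stack[--this.stackTop] as number;
    const x = this.stack[--this.stackTop] as number;
    this.instances.push(x, y, r);
    this.stack[this.stackTop++] = null;
  }
}
```

Both paths leave the stack in the same state (one nil result) and record the same shape, which is what lets the compiler pick the fused opcode without changing program behavior.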

The context gotcha that shaped the design: A canvas hosts exactly one context type. Calling canvas.getContext('2d') permanently blocks a later getContext('webgl2') on the same canvas; the second call returns null forever. So in WebGL mode Bloom must never grab a 2D context. createBytecodeRuntime skips the 2D context when WebGL is requested, and enableWebGL() sets this.ctx = null before creating the WebGL2 context.
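The rule is easy to model without a browser. In the sketch below, CanvasLike, FakeCanvas, and acquireContext are all hypothetical names introduced for illustration; FakeCanvas imitates the one-context-type behavior of a real canvas so the selection logic can be exercised anywhere.

```typescript
type ContextKind = "2d" | "webgl2";

interface CanvasLike {
  getContext(kind: ContextKind): object | null;
}

// Test double: the first successful getContext binds the canvas to that
// kind; any later request for a different kind returns null.
class FakeCanvas implements CanvasLike {
  private boundKind: ContextKind | null = null;
  private context: object | null = null;
  requests: ContextKind[] = [];

  getContext(kind: ContextKind): object | null {
    this.requests.push(kind);
    if (this.boundKind === null) {
      this.boundKind = kind;
      this.context = { kind };
    }
    return this.boundKind === kind ? this.context : null;
  }
}

// Request exactly one context type, decided up front. In WebGL mode
// the 2D context is never touched.
function acquireContext(canvas: CanvasLike, wantWebGL: boolean): { kind: ContextKind; ctx: object } {
  const kind: ContextKind = wantWebGL ? "webgl2" : "2d";
  const ctx = canvas.getContext(kind);
  if (ctx === null) {
    throw new Error(`canvas cannot provide a ${kind} context`);
  }
  return { kind, ctx };
}
```

The key design point is that the decision happens before the first getContext call; there is no safe way to probe for 2D first and upgrade later on the same canvas.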

Backend and context are independent: The execution backend (VM vs. tree-walking interpreter) is orthogonal to the canvas context type (2D vs. WebGL2). Earlier code wrongly conflated the two. A program can run on the VM and draw to either context; the draw opcode picks the renderer at runtime.

Layer 4 — WASM SIMD particle kernel

Above ~50k shapes the bottleneck stops being draw dispatch and becomes the per-shape math — the cos/sin/arithmetic the program runs one value at a time in the VM. The fix is to move that math into WebAssembly using SIMD128, running four lanes (f32x4) at a time, called once per frame instead of once per shape (src/lang/simd-kernel.ts).

Struct-of-Arrays in linear memory

All particle state lives in one WASM linear-memory block, laid out as parallel regions (Struct-of-Arrays) so each region is contiguous and SIMD-loadable:

SoA layout (P = pad4(N), all f32)

[ px[0..P) | py[0..P) | vx[0..P) | vy[0..P) | phase[0..P) ]
  \__ positions __/   \__ velocities __/    \_ angles _/

Velocities, phases and bounds are written once at setup. Per frame, only the scalars (t, forceScale, bound, count) cross the JS↔WASM boundary. The kernel updates every particle and the px/py regions feed straight into the batcher via WebGLRenderer.circlesFromArrays.
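The region arithmetic can be sketched with typed-array views. Two assumptions are worth flagging: pad4 is taken to mean "round up to a multiple of 4" (the name suggests it, but the source does not define it), and a plain ArrayBuffer stands in for the WASM linear memory. The createSoA helper is hypothetical.

```typescript
// Assumed meaning of pad4: round n up to the next multiple of 4,
// so every region holds a whole number of f32x4 lanes.
function pad4(n: number): number {
  return (n + 3) & ~3;
}

interface SoAViews {
  padded: number;
  px: Float32Array;
  py: Float32Array;
  vx: Float32Array;
  vy: Float32Array;
  phase: Float32Array;
}

// Carve five parallel f32 regions out of one contiguous block.
function createSoA(count: number): SoAViews {
  const padded = pad4(count);
  const memory = new ArrayBuffer(5 * padded * 4); // stand-in for WASM memory
  const region = (index: number): Float32Array =>
    new Float32Array(memory, index * padded * 4, padded);
  return {
    padded,
    px: region(0),
    py: region(1),
    vx: region(2),
    vy: region(3),
    phase: region(4),
  };
}
```

Because each region's length is a multiple of 4 floats, every region starts on a 16-byte boundary, and a loop stepping four particles at a time never straddles two regions.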

Never call WASM per shape: The whole win is paying the boundary cost once per frame. Per-shape WASM calls are a net loss, because the boundary overhead dwarfs the arithmetic you save at that granularity. Batch, always.

The sin/cos approximation

There is no SIMD sin/cos instruction, so the kernel range-reduces into [-pi, pi] and evaluates a degree-9 odd least-squares polynomial; cos(x) is computed as sin(x + pi/2). Max absolute error is about 1.6e-5 in f32 — far sub-pixel at any sane radius, with measured drift under 0.05px over 60 frames.

src/lang/simd-kernel.ts — sinV128 (Horner form)

// sin(x) ~= x * (C0 + C1 x^2 + C2 x^4 + C3 x^6 + C4 x^8)
const C0 = 0.9999843532;
const C1 = -0.1666321722;
const C2 = 0.008312196284;
const C3 = -0.0001931312360;
const C4 = 0.000002171576238;
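A scalar version of the same evaluation is handy for checking the coefficients. The sketch below is not the SIMD kernel: it runs in double precision on one value at a time, and the round-to-nearest range reduction and the sinApprox, cosApprox, and maxAbsError helpers are assumptions made for the example.

```typescript
// Coefficients copied from the listing above.
const C0 = 0.9999843532;
const C1 = -0.1666321722;
const C2 = 0.008312196284;
const C3 = -0.0001931312360;
const C4 = 0.000002171576238;

const TWO_PI = 2 * Math.PI;

// Reduce x into [-pi, pi] by subtracting the nearest multiple of 2*pi.
function reduceToPi(x: number): number {
  return x - TWO_PI * Math.round(x / TWO_PI);
}

// Horner evaluation of x * (C0 + C1 x^2 + C2 x^4 + C3 x^6 + C4 x^8).
function sinApprox(x: number): number {
  const r = reduceToPi(x);
  const r2 = r * r;
  return r * (C0 + r2 * (C1 + r2 * (C2 + r2 * (C3 + r2 * C4))));
}

// cos(x) = sin(x + pi/2); the reduction happens inside sinApprox.
function cosApprox(x: number): number {
  return sinApprox(x + Math.PI / 2);
}

// Largest |sinApprox - Math.sin| over evenly spaced samples in [-pi, pi].
function maxAbsError(samples: number): number {
  let worst = 0;
  for (let i = 0; i <= samples; i++) {
    const x = -Math.PI + (TWO_PI * i) / samples;
    worst = Math.max(worst, Math.abs(sinApprox(x) - Math.sin(x)));
  }
  return worst;
}
```

Horner form needs only five multiply-add steps after r2 is computed, which maps directly onto lane-wise f32x4 multiplies and adds in the real kernel.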

The kernel module is emitted as raw WASM bytes and instantiated at runtime via WebAssembly.Instance — no build step, exactly like Bloom's existing WASM compiler. SIMD is feature-detected (isSimdSupported) and the field gracefully falls back to the VM draw path when SIMD is unavailable.

100k particles, end-to-end: 0.6ms/frame (~27× under the 16.7ms / 60fps budget)

SIMD math vs. VM per-shape math @ 100k: ~130× faster (57ms → 0.43ms)

100k particles @ 60fps end-to-end: Reachable

Engineering notes: The honest tradeoffs and the gotchas found building these layers (the trig approximation's limits, the getContext gotcha, the SoA interleave copy, packed-int color, and a couple of WASM opcode-encoding bugs) are written up engineer-facing in HACKS.md at the repository root.

For the measured progression of all these optimizations — the before/after numbers for each layer, charted from a single dataset — see Performance — How Bloom Got Fast.
