Chapter 11
The Rendering Pipeline
From one shape at a time to a hundred thousand per frame
A beginner sketch draws a handful of shapes. A generative-art or particle sketch can draw tens or hundreds of thousands per frame. Bloom is built so the same circle(x, y, r) you learn on day one scales all the way up. This page walks the four layers that make that possible, from least to most aggressive.
Each layer attacks a different bottleneck. Canvas2D state-diffing removes redundant work the browser would otherwise repeat per shape. The WebGL renderer collapses thousands of draw calls into one. The bytecode VM removes per-call allocation and indirection from the dispatch itself. And the SIMD kernel moves the per-shape math off the main per-value path entirely.
Layer 1 — Canvas2D state-diffing
The default Canvas2D path draws each shape with the browser's native fill/stroke. The trap is the state writes around those draws. Assigning ctx.fillStyle = "rgb(224, 112, 64)" forces the browser to re-parse and validate that CSS color string every single time — even if the value is identical to what it already is.
In a palette-reuse sketch (think a field of same-colored dots) almost all of those writes are no-ops. Bloom tracks the last value it applied and skips the write when it hasn't changed:
private appliedFillStyle: string | null = null;
public ctxStateWritesElided = 0;
public ctxStateWritesIssued = 0;
// Set ctx.fillStyle only if it differs from what we last applied.
private applyFill(ctx: CanvasRenderingContext2D, color: string): void {
if (this.appliedFillStyle === color) {
this.ctxStateWritesElided++;
return; // skip the re-parse entirely
}
ctx.fillStyle = color;
this.appliedFillStyle = color;
this.ctxStateWritesIssued++;
}
The same applies to strokeStyle and lineWidth (applyStroke). A small rgbStringCache also memoizes the integer-to-"rgb(...)" string conversion so we don't rebuild the string each shape either.
On a palette-reuse sketch this brings the issued writes down to 156k versus 360k on the naive p5-style unconditional path — about 57% fewer, a 2.3× reduction.
ctxStateWritesIssued to the real number of context assignments. Diffing is only allowed to remove work, never to change output.
Layer 2 — Auto-batched WebGL2 instanced renderer
Canvas2D, however lean, still issues one native draw call per shape. That ceiling is the reason p5.js and the Canvas2D path both fall off at large counts: p5 has no auto-batching — each shape is its own native call.
Bloom's WebGL2 renderer (src/lang/webgl-renderer.ts) instead accumulates every shape of a given primitive type into a typed array, then issues one gl.drawArraysInstanced call per primitive type per frame over a single unit quad.
The instance buffer
Each shape is six float32 slots — no per-shape Color object, no per-shape geometry. Colors are packed into a single 32-bit integer the GPU unpacks for free in the shader:
offset: 0 4 8 12 16 20
+------+------+------+------+----------+-----------+
slot: | x | y | w | h | rotation | packedRGBA|
+------+------+------+------+----------+-----------+
\_____ vec2 __/\__ vec2 ___/\_ float _/\_ uint32 _/
aPos aSize aRotation aColor
instance 0: [ x0 y0 w0 h0 rot0 col0 ] ─┐
instance 1: [ x1 y1 w1 h1 rot1 col1 ] ├─ one contiguous Float32Array,
instance 2: [ x2 y2 w2 h2 rot2 col2 ] │ uploaded once per frame
... ─┘
The quad geometry is static (4 corners). Per-instance attributes use gl.vertexAttribDivisor(..., 1) so each quad instance reads its own row. A circle is just the quad with a fragment-shader radial alpha mask; a rect leaves the mask off:
if (uIsCircle > 0.5) {
// distance from center; circle edge at radius 0.5
float dist = length(vLocal);
// antialias the edge using the screen-space derivative
float aa = fwidth(dist);
float alpha = 1.0 - smoothstep(0.5 - aa, 0.5, dist);
if (alpha <= 0.0) discard;
fragColor = vec4(vColor.rgb, vColor.a * alpha);
} else {
fragColor = vColor; // rect: full quad
}
The packed color travels as an integer vertex attribute and is unpacked in the vertex shader with bit ops — a free GPU tint, with zero allocation on the JS side:
// Pack 8-bit R,G,B,A into one 32-bit int (R is the low byte).
return ((ai << 24) | (bi << 16) | (gi << 8) | ri) >>> 0;
Layer 3 — VM fused draw opcodes
The renderer removes the GPU-side ceiling, but the dispatch still has to get each shape's arguments out of the running program and into the batcher. The generic native-call path allocates an arguments array per call and hops through the native bridge. For the hottest primitives that overhead dominates.
So the bytecode compiler emits dedicated superinstructions for the fixed-arity drawing calls — CALL_DRAW_CIRCLE and CALL_DRAW_RECT (opcodes 0x64/0x65 in src/lang/bytecode.ts). They read operands straight off the VM stack, with no args-array allocation and no bridge indirection:
// circle(x,y,r) and rect(x,y,w,h) are fixed-arity, push nil, and
// skip the args-array allocation + native bridge indirection.
if (funcName === 'circle' && call.arguments.length === 3) {
for (const arg of call.arguments) this.compileExpr(arg);
this.emit(OpCode.CALL_DRAW_CIRCLE);
return;
}
case OpCode.CALL_DRAW_CIRCLE: {
// Stack (top-down): r, y, x.
const r = this.stack[--this.stackTop] as number;
const y = this.stack[--this.stackTop] as number;
const x = this.stack[--this.stackTop] as number;
if (this.webglRenderer) {
if (this.webglFillOn) {
this.webglRenderer.circle(x, y, r, this.webglFillPacked);
}
} else {
this.drawCircleFallback(x, y, r); // Canvas2D, identical output
}
this.stack[this.stackTop++] = null;
break;
}
Per-shape dispatch on this path is about 0.12µs — roughly 8 million shapes per second. Higher-arity forms (for example rect(x,y,w,h,radius) for a rounded rect) fall through to the generic native call, so nothing is lost.
canvas.getContext('2d') permanently blocks a later getContext('webgl2') on the same canvas — the second call returns null forever. So in WebGL mode Bloom must never grab a 2D context. createBytecodeRuntime skips the 2D context when WebGL is requested, and enableWebGL() sets this.ctx = null before creating the WebGL2 context.
Layer 4 — WASM SIMD particle kernel
Above ~50k shapes the bottleneck stops being draw dispatch and becomes the per-shape math — the cos/sin/arithmetic the program runs one value at a time in the VM. The fix is to move that math into WebAssembly using SIMD128, running four lanes (f32x4) at a time, called once per frame instead of once per shape (src/lang/simd-kernel.ts).
Struct-of-Arrays in linear memory
All particle state lives in one WASM linear-memory block, laid out as parallel regions (Struct-of-Arrays) so each region is contiguous and SIMD-loadable:
[ px[0..P) | py[0..P) | vx[0..P) | vy[0..P) | phase[0..P) ]
\__ positions __/ \__ velocities __/ \_ angles _/
Velocities, phases and bounds are written once at setup. Per frame, only the scalars (t, forceScale, bound, count) cross the JS↔WASM boundary. The kernel updates every particle and the px/py regions feed straight into the batcher via WebGLRenderer.circlesFromArrays.
The sin/cos approximation
There is no SIMD sin/cos instruction, so the kernel range-reduces into [-pi, pi] and evaluates a degree-9 odd least-squares polynomial; cos(x) is computed as sin(x + pi/2). Max absolute error is about 1.6e-5 in f32 — far sub-pixel at any sane radius, with measured drift under 0.05px over 60 frames.
// sin(x) ~= x * (C0 + C1 x^2 + C2 x^4 + C3 x^6 + C4 x^8)
const C0 = 0.9999843532;
const C1 = -0.1666321722;
const C2 = 0.008312196284;
const C3 = -0.0001931312360;
const C4 = 0.000002171576238;
The kernel module is emitted as raw WASM bytes and instantiated at runtime via WebAssembly.Instance — no build step, exactly like Bloom's existing WASM compiler. SIMD is feature-detected (isSimdSupported) and the field gracefully falls back to the VM draw path when SIMD is unavailable.
getContext gotcha, the SoA interleave copy, packed-int color, and a couple of WASM opcode-encoding bugs — are written up engineer-facing in HACKS.md at the repository root.
For the measured progression of all these optimizations — the before/after numbers for each layer, charted from a single dataset — see Performance — How Bloom Got Fast.