The Lexer
Breaking code into tokens
The lexer (also called a tokenizer or scanner) is the first stage of understanding your code. It reads your source code character by character and groups them into meaningful chunks called tokens.
What is a Token?
A token is the smallest meaningful unit in a programming language. Think of it like words in a sentence — just as "The cat sat" has three words, code has tokens.
Let's see what happens when the lexer processes this simple line:
circle(100, 50, 30)
The lexer produces this stream of tokens:
IDENTIFIER(circle), LPAREN, NUMBER(100), COMMA, NUMBER(50), COMMA, NUMBER(30), RPAREN
Types of Tokens in Bloom
Bloom recognizes several categories of tokens:
Keywords
Reserved words with special meaning, such as let.
Identifiers
Names you give to things (variables and functions), such as x or circle.
Literals
Values written directly in code, such as the number 42 or a quoted string.
Operators
Symbols for operations, such as +, = and ==.
How the Lexer Works
The lexer reads your code one character at a time. Here's the mental model:
When the lexer sees l, it doesn't immediately create a token. It keeps reading (e, t) until it hits a space. Then it checks: is "let" a keyword? Yes! So it creates a LET token.
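The "keep reading until the word ends" step can be sketched like this (a toy example; KEYWORDS and scanWord are illustrative names, not Bloom's actual API):

```typescript
// Only "let" is confirmed by this chapter; the other keywords are placeholders.
const KEYWORDS = new Set(["let"]);

function isLetter(ch: string): boolean {
  return /[A-Za-z_]/.test(ch);
}

// Reads a full word starting at `start`, then decides: keyword or identifier?
function scanWord(source: string, start: number): { type: string; lexeme: string } {
  let end = start;
  while (end < source.length && isLetter(source[end])) {
    end++; // keep consuming letters until we hit a space or other character
  }
  const lexeme = source.slice(start, end);
  return { type: KEYWORDS.has(lexeme) ? lexeme.toUpperCase() : "IDENTIFIER", lexeme };
}

scanWord("let x = 10", 0);  // → { type: "LET", lexeme: "let" }
scanWord("circle(100)", 0); // → { type: "IDENTIFIER", lexeme: "circle" }
```

Note that the decision between keyword and identifier happens only after the whole word has been read; the lexer never commits on the first character.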
Step-by-Step Example
Let's trace through tokenizing this code:
let x = 10 + 5
Final result: LET, IDENTIFIER(x), EQUALS, NUMBER(10), PLUS, NUMBER(5)
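The hand trace can be reproduced with a toy tokenizer. This is a sketch written for this example only, not Bloom's real lexer; the token names follow the conventions used in this chapter:

```typescript
// Toy tokenizer for a tiny subset: let, identifiers, integers, =, +.
function tokenize(source: string): string[] {
  const tokens: string[] = [];
  let pos = 0;
  while (pos < source.length) {
    const ch = source[pos];
    if (ch === " ") { pos++; continue; }                  // skip whitespace
    if (ch === "=") { tokens.push("EQUALS"); pos++; continue; }
    if (ch === "+") { tokens.push("PLUS"); pos++; continue; }
    if (/[0-9]/.test(ch)) {                               // number literal
      let end = pos;
      while (end < source.length && /[0-9]/.test(source[end])) end++;
      tokens.push(`NUMBER(${source.slice(pos, end)})`);
      pos = end;
      continue;
    }
    if (/[A-Za-z_]/.test(ch)) {                           // keyword or identifier
      let end = pos;
      while (end < source.length && /[A-Za-z_]/.test(source[end])) end++;
      const word = source.slice(pos, end);
      tokens.push(word === "let" ? "LET" : `IDENTIFIER(${word})`);
      pos = end;
      continue;
    }
    throw new Error(`Unexpected character: ${ch}`);
  }
  return tokens;
}

tokenize("let x = 10 + 5");
// → ["LET", "IDENTIFIER(x)", "EQUALS", "NUMBER(10)", "PLUS", "NUMBER(5)"]
```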
Handling Multi-Character Operators
Some operators are more than one character. When the lexer sees =, it peeks ahead:
- Is the next character =? → Emit == (equals comparison)
- Otherwise → Emit = (assignment)
This is called lookahead. Bloom uses one character of lookahead to distinguish such operators, for example = versus ==.
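One-character lookahead can be sketched in the style of a hand-written scanner. The names here are illustrative, not Bloom's actual implementation; this chapter only confirms EQUALS as the name of the single = token:

```typescript
// Decide between "=" and "==" by peeking one character ahead.
function scanOperator(source: string, pos: number): { type: string; next: number } {
  // peek at the character after the current one without consuming it
  const peek = source[pos + 1];
  if (source[pos] === "=") {
    return peek === "="
      ? { type: "EQUALS_EQUALS", next: pos + 2 } // "==" comparison
      : { type: "EQUALS", next: pos + 1 };       // "=" assignment
  }
  throw new Error(`Unexpected character: ${source[pos]}`);
}

scanOperator("==", 0);  // → { type: "EQUALS_EQUALS", next: 2 }
scanOperator("= 5", 0); // → { type: "EQUALS", next: 1 }
```

The key point is that the lexer only consumes the second character when it completes a longer operator; otherwise the next scan starts right after the =.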
Error Tracking
Each token remembers where it came from in your source code:
{
type: "NUMBER",
lexeme: "42",
literal: 42,
line: 3,
column: 10
}
This is crucial for error messages. When something goes wrong later (during parsing or execution), Bloom can point you to the exact location in your code.
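One way to recover a token's line and column is to count newlines up to its offset; a minimal sketch (illustrative, not Bloom's exact code, which tracks position incrementally while scanning):

```typescript
// Map a character offset to a 1-based line and column.
function locate(source: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset; i++) {
    if (source[i] === "\n") {
      line++;      // a newline moves us to the next line...
      column = 1;  // ...and resets the column
    } else {
      column++;
    }
  }
  return { line, column };
}

locate("let x\nlet y", 6); // offset of the second "let" → { line: 2, column: 1 }
```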
What the Lexer Ignores
- Whitespace — spaces, tabs, newlines (except to separate tokens)
- Comments — everything after // until the end of the line
// This is a comment
let x = 5 // Another comment
Output: Just LET, IDENTIFIER(x), EQUALS, NUMBER(5)
Common Lexer Errors
The lexer catches problems like these early, before parsing even begins: an unterminated string literal, or a character the language does not recognize.
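A classic example of an error caught at this stage is an unterminated string: if the lexer reaches the end of the input before finding the closing quote, it can report the problem immediately. A minimal sketch (illustrative names, not Bloom's actual string() method):

```typescript
// Scan a double-quoted string starting at `start`; throw if it never closes.
function scanString(source: string, start: number): { value: string } {
  let pos = start + 1; // skip the opening quote
  while (pos < source.length && source[pos] !== '"') {
    pos++;
  }
  if (pos >= source.length) {
    throw new Error(`Unterminated string starting at offset ${start}`);
  }
  return { value: source.slice(start + 1, pos) };
}

scanString('"hello"', 0); // → { value: "hello" }
```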
In the Source Code
The lexer implementation lives in src/lang/lexer.ts. Key methods:
- scanToken() — Main loop, decides what kind of token to make
- string() — Handles string literals
- number() — Handles number literals
- identifier() — Handles keywords and identifiers