Architecture
sgrep combines several key technologies to deliver fast, offline semantic code search. This document explains the technical foundations.
Model2Vec
Static Embeddings vs Transformers
Traditional transformer models like BERT and CodeBERT run inference through deep neural networks at search time. This is computationally expensive and slow.
Model2Vec takes a different approach:
- Pre-computation: During model creation, the transformer runs on a massive vocabulary once, computing embeddings for every possible token.
- Lookup at runtime: At search time, we simply look up pre-computed embeddings from a static table.
This trades a small amount of quality for massive speed gains and smaller model size.
Transformer at search time:
query → [attention layers] → embedding (~100ms per query)
Model2Vec at search time:
query → [tokenize] → [lookup table] → embedding (~1ms per query)
Embedding Pipeline
Given a code snippet or query, sgrep produces a 256-dimensional embedding:
"function authenticate user"
Step 1: Tokenization (WordPiece)
["function", "authenticate", "user"]
Step 2: Embedding Lookup
Each token maps to a 256-dimensional vector:
"function" → [0.12, -0.34, 0.56, ...] (256 values)
"authenticate" → [-0.23, 0.45, -0.67, ...]
"user" → [0.78, -0.12, 0.34, ...]
Step 3: Mean Pooling
Average all token embeddings:
embedding = mean(token_embeddings)
= [0.22, -0.003, 0.08, ...]
Step 4: L2 Normalization
Normalize to unit length for consistent similarity scores:
embedding = embedding / ||embedding||
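
The following Rust sketch shows steps 2-4 under simplifying assumptions: the input is already tokenized, and the lookup table is a plain HashMap from token string to f32 vector (sgrep's real table is int8-quantized and keyed by token id). The function name embed is illustrative, not sgrep's actual API.

```rust
use std::collections::HashMap;

const DIM: usize = 256;

/// Illustrative embedding pipeline: lookup, mean pooling, L2 normalization.
/// Assumes pre-tokenized input and an f32 lookup table; not sgrep's real internals.
fn embed(tokens: &[&str], table: &HashMap<String, [f32; DIM]>) -> Vec<f32> {
    // Step 2: look up each token's pre-computed vector (unknown tokens are skipped).
    let vectors: Vec<&[f32; DIM]> = tokens.iter().filter_map(|t| table.get(*t)).collect();

    // Step 3: mean pooling across token vectors.
    let mut pooled = vec![0.0f32; DIM];
    for v in &vectors {
        for (p, x) in pooled.iter_mut().zip(v.iter()) {
            *p += *x;
        }
    }
    let count = vectors.len().max(1) as f32;
    pooled.iter_mut().for_each(|p| *p /= count);

    // Step 4: L2 normalization so cosine similarity becomes a plain dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        pooled.iter_mut().for_each(|p| *p /= norm);
    }
    pooled
}
```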
Cosine Similarity
The Math
To find similar code, we compute cosine similarity between query and document embeddings:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Since embeddings are L2-normalized, this simplifies to:
similarity(A, B) = A · B
A single dot product operation: extremely fast on modern CPUs.
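
In Rust terms, the whole scoring step reduces to a dot product over two normalized slices (a sketch, not sgrep's actual function):

```rust
/// Cosine similarity of two L2-normalized embeddings is just their dot product.
fn similarity(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```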
Score Interpretation
- 1.0: Identical meaning (same direction in vector space)
- 0.8-0.9: Very similar (likely relevant)
- 0.6-0.8: Somewhat related (may be relevant)
- 0.5-0.6: Weakly related (edge cases)
- below 0.5: Likely unrelated
Note: These thresholds vary by domain. Experiment with --threshold to find what works for your codebase.
Int8 Quantization
Why Quantize?
Floating point embeddings (f32) use 4 bytes per dimension:
256 dimensions × 4 bytes = 1,024 bytes per vector
With 50k vocabulary tokens, that’s about 50 MB just for embeddings.
Int8 Quantization
We convert f32 vectors to int8 using a scale factor:
int8_value = round(float32_value / scale)
The scale factor is computed per embedding vector:
scale = max(abs(vector)) / 127
Size Reduction
256 dimensions × 1 byte = 256 bytes per vector
(4x smaller than f32)
Dequantization
When computing similarity, we convert back to f32:
float32_value = int8_value × scale
This maintains most of the quality while dramatically reducing memory usage and improving cache locality.
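
A minimal sketch of the per-vector quantization and dequantization described above (function names are illustrative; sgrep's actual implementation may differ):

```rust
/// Per-vector int8 quantization: scale = max(|x|) / 127, value = round(x / scale).
fn quantize(vector: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = vector.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = vector
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Dequantization: multiply each int8 value back by the stored scale factor.
fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```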
Trade-offs
| Aspect | f32 | int8 |
|---|---|---|
| Size | 1KB per vector | 256B per vector |
| Precision | Full | Slight loss |
| Speed | Good | Better (cache) |
| Quality | Baseline | 2-5% MRR loss |
For code search, this trade-off is well worth it.
WASM Compilation
sgrep can be compiled to WebAssembly for browser-based search:
Rust source → wasm32-wasi target → .wasm file
Benefits
- No server: Search runs entirely in the browser
- Privacy: Code never leaves your machine
- Speed: Native performance via Wasm SIMD
- Offline: Works without internet connection
Wasm Size
sgrep.wasm: ~500KB compressed
embeddings.bin: ~12MB compressed (int8 quantized)
vocab.json: ~500KB
tokenizer.json: ~1MB
The total download is roughly 14 MB and is cached permanently after the first load.
SIMD Acceleration
When available, sgrep uses Wasm SIMD for faster dot products:
// Pseudo-code
f32x4.mul(a, b) -> one 128-bit instruction processes 4 values at once
This provides a 2-4x speedup on supported browsers.
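
A portable Rust sketch of a SIMD-friendly dot product: it accumulates four lanes per iteration, which the compiler can lower to 128-bit f32x4 instructions when built with the simd128 target feature enabled. This is an illustration, not sgrep's exact kernel.

```rust
/// Dot product written four lanes at a time. Compiled for wasm32 with
/// `-C target-feature=+simd128`, each iteration can map to f32x4 operations.
fn dot_simd_friendly(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let idx = i * 4 + lane;
            acc[lane] += a[idx] * b[idx];
        }
    }
    // Tail elements (none for 256-dimensional vectors, which divide evenly by 4).
    let mut total: f32 = acc.iter().sum();
    for idx in chunks * 4..a.len() {
        total += a[idx] * b[idx];
    }
    total
}
```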
Token-Level Coloring
Score Gradient
sgrep highlights tokens based on their contribution to the match. The color gradient uses a quadratic curve:
score ∈ [0, 1]
t = score
t² = interpolation factor for the gradient
rgb(r, g, b) where:
r = 140 + (255 - 140) × t²
g = 140 + (50 - 140) × t²
b = 140 + (50 - 140) × t²
At t=0: gray (140, 140, 140)
At t=1: red (255, 50, 50)
The quadratic curve means:
- Low similarity tokens stay gray
- Medium tokens show slight red
- High similarity tokens become bright red
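
As a sketch, the gradient above maps directly onto a small helper (the function name token_color is illustrative):

```rust
/// Map a token score in [0, 1] to an (r, g, b) color using the quadratic curve above.
fn token_color(score: f32) -> (u8, u8, u8) {
    let t2 = score.clamp(0.0, 1.0).powi(2);
    let lerp = |from: f32, to: f32| (from + (to - from) * t2).round() as u8;
    (lerp(140.0, 255.0), lerp(140.0, 50.0), lerp(140.0, 50.0))
}
```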
Computing Token Scores
Each token’s contribution is computed as:
token_score = cosine_similarity(query_embedding, token_embedding)
Tokens are then normalized:
normalized = (score - min_score) / (max_score - min_score)
This ensures the gradient always spans the full range.
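
A minimal sketch of that min-max normalization (again illustrative, not the exact implementation):

```rust
/// Min-max normalize per-token scores so the color gradient spans the full [0, 1] range.
fn normalize_scores(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    if range <= f32::EPSILON {
        // All tokens scored the same; paint everything gray rather than divide by zero.
        return vec![0.0; scores.len()];
    }
    scores.iter().map(|s| (s - min) / range).collect()
}
```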
Caching Architecture
SQLite-Vec Index
sgrep caches embeddings in a local SQLite database using the sqlite-vec extension:
CREATE TABLE embeddings (
  path TEXT,
  line INTEGER,
  hash TEXT UNIQUE,
  embedding BLOB,  -- int8 quantized
  mtime REAL
);
CREATE VIRTUAL TABLE vec_index USING vec0(
  embedding float[256]
);
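
As an illustration only, here is how the plain embeddings table could be created with the rusqlite crate. Whether sgrep actually uses rusqlite is an assumption, and the vec0 virtual table additionally requires the sqlite-vec extension to be loaded, which is omitted here.

```rust
use rusqlite::Connection;

/// Open (or create) the cache database and the plain embeddings table.
/// Creating the vec0 virtual table would require loading sqlite-vec first.
fn open_cache(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS embeddings (
            path TEXT,
            line INTEGER,
            hash TEXT UNIQUE,
            embedding BLOB,  -- int8 quantized
            mtime REAL
        );",
    )?;
    Ok(conn)
}
```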
Cache Invalidation
Files are re-indexed when:
- mtime changes: File modification time differs from cached value
- content hash differs: SHA-256 of file content changed
- manual rebuild: user runs sgrep --rebuild-index
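
A minimal sketch of the first two checks, assuming the sha2 crate for hashing; the function name and cache fields are illustrative, not sgrep's real cache layer:

```rust
use sha2::{Digest, Sha256};
use std::{fs, time::UNIX_EPOCH};

/// Re-index decision: cheap mtime comparison first, content hash only if it changed.
fn needs_reindex(path: &str, cached_mtime: f64, cached_hash: &str) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs_f64();
    if (mtime - cached_mtime).abs() < f64::EPSILON {
        return Ok(false); // mtime unchanged: trust the cached embedding
    }
    let digest = Sha256::digest(fs::read(path)?);
    let hash: String = digest.iter().map(|b| format!("{:02x}", b)).collect();
    Ok(hash != cached_hash)
}
```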
Cache Location
~/.cache/sgrep/
├── embeddings.db # SQLite index
├── model/
│ ├── vocab.json
│ ├── embeddings.bin
│ └── tokenizer.json
└── registry/
└── sgrep-code-v1/
Performance Characteristics
Typical Latencies
Tokenization: ~0.1ms per line
Embedding lookup: ~0.05ms per token
Similarity computation: ~0.001ms per comparison
Total per query: ~5-50ms depending on codebase size
Scalability
| Lines of Code | Index Time | Search Time |
|---|---|---|
| 1K | ~1s | ~5ms |
| 10K | ~5s | ~10ms |
| 100K | ~30s | ~20ms |
| 1M | ~5min | ~50ms |
Indexing is a one-time cost; subsequent searches use cached embeddings.
Memory Usage
Base binary: ~2MB
Model (int8): ~7.5MB
Per-file overhead: ~line_count × 256 bytes
Typical usage: 50-100MB for large projects
Memory usage scales linearly with codebase size.
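
As a rough worked example using the figures above (the constants are approximate):

```rust
/// Rough memory estimate: base binary + int8 model + 256 bytes per indexed line.
fn estimated_memory_bytes(indexed_lines: usize) -> usize {
    const BINARY: usize = 2 * 1024 * 1024; // ~2 MB base binary
    const MODEL: usize = 7_500_000;        // ~7.5 MB int8 model
    BINARY + MODEL + indexed_lines * 256   // ~256 bytes per indexed line
}

// Example: ~300k indexed lines -> 2 MB + 7.5 MB + ~77 MB ≈ 86 MB,
// which lands in the 50-100 MB range quoted above.
```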