How It Works
sgrep transforms your search query into a semantic journey through your codebase. This document explains the complete pipeline from query to colored results.
Query Pipeline
When you run sgrep "authenticate user" src/, several steps occur:
Step 1: Query Tokenization
Your natural language query is tokenized using WordPiece:
"authenticate user"
↓
["authenticate", "user"]
WordPiece breaks words into subword units, handling unknown words by splitting them:
"authentication" → ["auth", "##ent", "##ication"]
Step 2: Query Embedding
Each token is looked up in the pre-computed embedding table:
"authenticate" → [0.12, -0.34, 0.56, ...] (256 dimensions)
"user" → [0.78, -0.12, 0.34, ...]
These embeddings are averaged (mean pooling) and normalized:
query_embedding = mean(token_embeddings) / ||mean(token_embeddings)||
Result: A single 256-dimensional vector representing your query’s meaning.
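A minimal sketch of the pooling step, assuming 256-dimensional f32 embeddings (the constant and function names are illustrative, not sgrep's internals):

const DIM: usize = 256; // embedding width from this doc

// Mean-pool the token embeddings, then L2-normalize the result.
// Assumes at least one token and a non-zero mean vector.
fn pool_and_normalize(token_embeddings: &[[f32; DIM]]) -> [f32; DIM] {
    let mut mean = [0.0f32; DIM];
    for emb in token_embeddings {
        for (m, v) in mean.iter_mut().zip(emb.iter()) {
            *m += *v;
        }
    }
    let n = token_embeddings.len() as f32;
    for m in mean.iter_mut() {
        *m /= n;
    }
    let norm = mean.iter().map(|v| v * v).sum::<f32>().sqrt();
    for m in mean.iter_mut() {
        *m /= norm;
    }
    mean
}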
Step 3: File Discovery
sgrep walks the directory tree to find files:
src/
├── main.rs
├── auth/
│ ├── login.rs
│ └── oauth.rs
└── db/
└── connection.rs
Files are filtered by:
- Extension (configured language support)
- .gitignore and .ignore files
- Binary file detection
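A discovery walk with these filters might use the ignore crate (the same library ripgrep uses for .gitignore handling; whether sgrep uses it is an assumption):

use ignore::WalkBuilder; // respects .gitignore and .ignore by default
use std::path::PathBuf;

// Illustrative helper: collect files under `root` with a wanted extension.
fn discover_files(root: &str, extensions: &[&str]) -> Vec<PathBuf> {
    WalkBuilder::new(root)
        .build()
        .filter_map(Result::ok)
        .filter(|entry| entry.file_type().map_or(false, |t| t.is_file()))
        .map(|entry| entry.into_path())
        .filter(|path| {
            path.extension()
                .and_then(|ext| ext.to_str())
                .map_or(false, |ext| extensions.contains(&ext))
        })
        .collect()
    // A real implementation would also sniff contents (e.g. NUL bytes)
    // to skip binary files.
}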
Step 4: Line Embedding
For each line in discovered files, sgrep:
- Tokenizes the line: fn login(user: User) { → ["fn", "login", "(", "user", ":", "User", ")", "{"]
- Looks up each token’s embedding
- Mean pools the embeddings
- Normalizes the result
Cached embeddings are used when available (see Caching section).
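Putting the four sub-steps together, the per-file loop might look like this sketch; tokenize and lookup_embeddings are hypothetical stand-ins for Steps 1 and 2, and pool_and_normalize is the sketch from Step 2:

// Illustrative per-file loop producing (line_number, embedding) pairs.
fn embed_lines(source: &str) -> Vec<(usize, [f32; DIM])> {
    source
        .lines()
        .enumerate()
        .map(|(i, line)| {
            let tokens = tokenize(line);               // hypothetical helper
            let embeddings = lookup_embeddings(&tokens); // hypothetical helper
            (i + 1, pool_and_normalize(&embeddings))   // 1-based line numbers
        })
        .collect()
}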
Step 5: Similarity Computation
Cosine similarity is computed between query and each line:
score = query_embedding · line_embedding
Because both embeddings were already L2-normalized, the cosine reduces to this single dot product, which is extremely fast.
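In Rust, that is essentially one line (DIM as in the Step 2 sketch):

fn similarity(query: &[f32; DIM], line: &[f32; DIM]) -> f32 {
    // Unit-length inputs make this dot product equal to cosine similarity.
    query.iter().zip(line.iter()).map(|(q, l)| q * l).sum()
}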
Step 6: Ranking and Filtering
Results are sorted by score and filtered:
let threshold = 0.5; // default
let top_k = 10;      // default

// Match is the illustrative result type used throughout this doc.
let mut results: Vec<Match> = all_matches
    .into_iter()
    .filter(|m| m.score >= threshold)
    .collect();
results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
results.truncate(top_k);
Step 7: Display with Token Coloring
Each matching line is displayed with colored tokens. The color gradient represents each token’s contribution to the match (see Token-Level Coloring below).
Chunking Strategies
sgrep supports different granularity for search:
Line-Level (Default)
Each line is embedded separately:
fn authenticate_user(username: &str, password: &str) -> Result<User, Error> {
let user = db.find_user(username)?;
if verify_password(password, &user.hash) {
Ok(user)
} else {
Err(Error::AuthenticationFailed)
}
}
Becomes 8 separate embeddings, one per line.
Pros: Precise location, easier to scan
Cons: May miss multi-line concepts
Paragraph-Level
Multiple lines grouped together (planned feature):
fn authenticate_user(...) {
let user = db.find_user(username)?;
if verify_password(password, &user.hash) {
One embedding for the whole function body.
Pros: Captures multi-line semantics
Cons: Harder to locate exact match
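Since this feature is planned, here is only a hypothetical sketch of how lines might be grouped into paragraph chunks at blank lines:

// Illustrative only: split source into blank-line-delimited chunks,
// each of which would get one embedding.
fn paragraphs(source: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in source.lines() {
        if line.trim().is_empty() {
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
        } else {
            current.push_str(line);
            current.push('\n');
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}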
Token-Level Coloring
The colored output shows which tokens contributed most to the match.
Score Calculation
For each token in the matching line:
token_score = cosine_similarity(query_embedding, token_embedding)
Normalization
Scores are normalized across all tokens in the match:
normalized = (score - min_score) / (max_score - min_score)
Gradient Formula
A quadratic curve maps scores to colors:
t = normalized_score
t² = t * t
rgb(r, g, b):
r = 140 + (255 - 140) × t²
g = 140 + (50 - 140) × t²
b = 140 + (50 - 140) × t²
Result:
- t=0: gray (140, 140, 140) — irrelevant
- t=0.5: slight red tint — somewhat relevant
- t=1.0: bright red (255, 50, 50) — highly relevant
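Combining the normalization and gradient formulas into one function (a direct transcription of the math above, not necessarily sgrep's code):

// Map a raw token score to an RGB color via min-max normalization
// and the quadratic gray-to-red gradient.
fn token_color(score: f32, min_score: f32, max_score: f32) -> (u8, u8, u8) {
    let t = if max_score > min_score {
        (score - min_score) / (max_score - min_score)
    } else {
        0.0
    };
    let t2 = t * t;
    let r = 140.0 + (255.0 - 140.0) * t2; // 140 → 255
    let g = 140.0 + (50.0 - 140.0) * t2;  // 140 → 50
    let b = 140.0 + (50.0 - 140.0) * t2;  // 140 → 50
    (r as u8, g as u8, b as u8)
}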
Example
Query: "database connection"
Match: let pool = PgPool::connect(&db_url).await?;
Token colors:
- let — gray (irrelevant)
- pool — gray (variable name)
- PgPool — gray (type name)
- connect — bright red (highly relevant!)
- &db_url — light red (somewhat relevant)
- .await? — gray (syntax)
Caching Architecture
SQLite Storage
Embeddings are cached in ~/.cache/sgrep/embeddings.db:
CREATE TABLE embeddings (
path TEXT NOT NULL,
line_number INTEGER NOT NULL,
content_hash TEXT NOT NULL,
embedding BLOB NOT NULL, -- 256 × int8
modified_time REAL NOT NULL,
PRIMARY KEY (path, line_number)
);
CREATE INDEX idx_path ON embeddings(path);
CREATE INDEX idx_hash ON embeddings(content_hash);
Cache Lookup
When searching a file:
- Check cache for existing embeddings
- Verify file modification time matches
- Compute SHA-256 of file content for integrity check
- Use cached embeddings if valid
- Re-compute only if file changed
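A sketch of that lookup using the rusqlite crate against the schema above (whether sgrep uses rusqlite is an assumption; error handling and the hash check are elided):

use rusqlite::{Connection, OptionalExtension};

// Fetch a cached embedding blob if the stored mtime still matches.
fn cached_embedding(
    conn: &Connection,
    path: &str,
    line: i64,
    mtime: f64,
) -> rusqlite::Result<Option<Vec<u8>>> {
    conn.query_row(
        "SELECT embedding FROM embeddings
         WHERE path = ?1 AND line_number = ?2 AND modified_time = ?3",
        (path, line, mtime),
        |row| row.get(0),
    )
    .optional() // None means a cache miss: re-embed the line
}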
Cache Invalidation
Files are re-indexed when:
if cached_mtime != file_mtime || cached_hash != compute_hash(file) {
recompute_embeddings(file);
update_cache(file);
}
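The content hash could be computed with the sha2 crate, for example (the exact hashing scheme is an assumption):

use sha2::{Digest, Sha256};

// SHA-256 of the file contents, hex-encoded to match content_hash TEXT.
fn compute_hash(bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(bytes);
    format!("{:x}", hasher.finalize())
}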
Hybrid Mode Cache
In --hybrid mode, BM25 keyword results are also cached:
CREATE TABLE bm25_cache (
path TEXT,
pattern TEXT,
results TEXT, -- JSON array
timestamp REAL
);
Model Loading
First Run
On first execution, sgrep downloads the model:
$ sgrep "test" src/
Downloading model: sgrep-code-v1 (~7.5MB)
Extracting to: ~/.cache/sgrep/model/sgrep-code-v1/
Model ready. Searching...
Model Files
~/.cache/sgrep/model/sgrep-code-v1/
├── vocab.json # Tokenizer vocabulary (500KB)
├── embeddings.bin # Pre-computed token embeddings (7.5MB)
└── tokenizer.json # WordPiece configuration (1MB)
Lazy Loading
Embeddings are memory-mapped, not fully loaded:
let embeddings = MmappedEmbeddings::new("embeddings.bin")?;
// Only accessed pages are loaded into RAM
This keeps memory usage low even with large models.
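With the memmap2 crate, MmappedEmbeddings could wrap something like this (an assumed crate choice; the doc does not name one):

use memmap2::Mmap;
use std::fs::File;

// Memory-map the embedding table; pages fault in only when touched.
fn map_embeddings(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified while mapped.
    unsafe { Mmap::map(&file) }
}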
Performance Optimizations
Parallel Processing
Files are processed in parallel using Rayon:
use rayon::prelude::*; // brings par_iter() into scope

let results: Vec<Match> = files
    .par_iter()
    .flat_map(|file| search_file(file, query))
    .collect();
SIMD Dot Product
When available, SIMD instructions accelerate similarity computation:
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;
unsafe fn cosine_simd(a: &[f32], b: &[f32]) -> f32 {
// AVX2 processes 8 floats per instruction
}
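A filled-in version of such a kernel might look like the following sketch. Callers would gate it behind is_x86_feature_detected!("avx2") and is_x86_feature_detected!("fma"); this is an illustration, not sgrep's exact code:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    debug_assert_eq!(a.len(), b.len());
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        // Load 8 floats from each slice and fuse multiply-add into acc.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8 (256 divides evenly).
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}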
Early Termination
Low-score matches are discarded early:
if partial_score < threshold_lower_bound {
continue; // Skip remaining tokens
}
Result Batching
Results are streamed, not stored all at once:
search_results
.take(top_k)
.for_each(|result| print_result(result));
End-to-End Example
$ sgrep "database query error" src/
[1. Computing query embedding]
"database query error"
→ tokenize → ["database", "query", "error"]
→ embed → mean([emb_db, emb_query, emb_error])
→ normalize → [0.23, -0.45, 0.67, ...]
[2. Scanning files]
src/main.rs: skip (not matching pattern)
src/db/postgres.rs: process...
[3. Processing src/db/postgres.rs]
Line 1: "use postgres::Client;"
→ embed → [0.12, 0.34, ...]
→ similarity = 0.32 → below threshold
Line 42: "conn.query(\"SELECT * FROM users\").await?"
→ embed → [0.34, 0.56, ...]
→ similarity = 0.89 → match!
[4. Displaying results]
src/db/postgres.rs:42
conn.query("SELECT * FROM users").await?
^^^^ bright red (query!)
The entire process typically completes in 10-50ms for medium codebases.
Summary
- Query → embedding via tokenization + lookup + mean pool
- Files → discovered via filesystem walk
- Lines → embedded and cached
- Similarities → computed via dot product
- Results → ranked and filtered
- Output → colored by token contribution
All offline, all local, all fast.