How It Works

sgrep transforms your search query into a semantic journey through your codebase. This document explains the complete pipeline from query to colored results.

Query Pipeline

When you run sgrep "authenticate user" src/, several steps occur:

Step 1: Query Tokenization

Your natural language query is tokenized using WordPiece:

"authenticate user"

["authenticate", "user"]

WordPiece breaks words into subword units, handling unknown words by splitting them:

"authentication" → ["auth", "##ent", "##ication"]

Step 2: Query Embedding

Each token is looked up in the pre-computed embedding table:

"authenticate" → [0.12, -0.34, 0.56, ...]  (256 dimensions)
"user"        → [0.78, -0.12, 0.34, ...]

These embeddings are averaged (mean pooling) and normalized:

query_embedding = mean(token_embeddings) / ||mean(token_embeddings)||

Result: A single 256-dimensional vector representing your query’s meaning.
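
A minimal sketch of the pooling and normalization step (illustrative; sgrep's real lookup reads from the memory-mapped table described under Model Loading):

const DIM: usize = 256;

// Mean-pool token embeddings, then L2-normalize the result (sketch).
fn pool(token_embeddings: &[[f32; DIM]]) -> [f32; DIM] {
    let mut mean = [0.0f32; DIM];
    for emb in token_embeddings {
        for (m, v) in mean.iter_mut().zip(emb) {
            *m += v;
        }
    }
    let n = token_embeddings.len() as f32;
    mean.iter_mut().for_each(|m| *m /= n);

    // Normalizing here makes the later cosine similarity a plain dot product.
    let norm = mean.iter().map(|v| v * v).sum::<f32>().sqrt();
    mean.iter_mut().for_each(|m| *m /= norm);
    mean
}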

Step 3: File Discovery

sgrep walks the directory tree to find files:

src/
├── main.rs
├── auth/
│   ├── login.rs
│   └── oauth.rs
└── db/
    └── connection.rs

Files are filtered by three checks (sketched after the list):

  • Extension (configured language support)
  • .gitignore and .ignore files
  • Binary file detection
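
A sketch of the walk using the ignore crate, which honors .gitignore and .ignore files by default; is_supported_extension, is_binary, and process_file are hypothetical helpers standing in for sgrep's own filters:

use ignore::WalkBuilder;

// WalkBuilder applies .gitignore and .ignore rules automatically.
for entry in WalkBuilder::new("src/").build().flatten() {
    let path = entry.path();
    if path.is_file() && is_supported_extension(path) && !is_binary(path) {
        process_file(path); // hand the file to the embedding pipeline
    }
}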

Step 4: Line Embedding

For each line in discovered files, sgrep:

  1. Tokenizes the line: fn login(user: User) { → ["fn", "login", "(", "user", ":", "User", ")", "{"]
  2. Looks up each token’s embedding
  3. Mean pools the embeddings
  4. Normalizes the result

Cached embeddings are used when available (see Caching section).
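
Putting the first four steps together for a single file might look like the sketch below, where tokenize, lookup, and pool are the hypothetical helpers from the earlier steps and the cache check is elided:

const DIM: usize = 256;

// Embed every line of a file (sketch; caching elided).
fn embed_lines(source: &str) -> Vec<[f32; DIM]> {
    source
        .lines()
        .map(|line| {
            let embs: Vec<[f32; DIM]> = tokenize(line) // step 1
                .iter()
                .map(|token| lookup(token))            // step 2
                .collect();
            pool(&embs)                                // steps 3 and 4
        })
        .collect()
}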

Step 5: Similarity Computation

Cosine similarity is computed between query and each line:

score = query_embedding · line_embedding

Because both vectors are already normalized, this reduces to a single dot product per line: extremely fast.
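
In scalar form, that score is just the following (see Performance Optimizations for the SIMD variant):

// Dot product of two already-normalized vectors equals their cosine similarity.
fn score(query: &[f32], line: &[f32]) -> f32 {
    query.iter().zip(line).map(|(q, l)| q * l).sum()
}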

Step 6: Ranking and Filtering

Results are sorted by score and filtered:

let threshold = 0.5; // default
let top_k = 10;      // default

let mut results: Vec<Match> = all_matches
    .into_iter()
    .filter(|m| m.score >= threshold)
    .collect();
results.sort_by(|a, b| b.score.total_cmp(&a.score)); // highest score first
results.truncate(top_k);

Step 7: Display with Token Coloring

Each matching line is displayed with colored tokens. The color gradient represents each token’s contribution to the match (see Token-Level Coloring below).

Chunking Strategies

sgrep supports different granularity for search:

Line-Level (Default)

Each line is embedded separately:

fn authenticate_user(username: &str, password: &str) -> Result<User, Error> {
    let user = db.find_user(username)?;
    if verify_password(password, &user.hash) {
        Ok(user)
    } else {
        Err(Error::AuthenticationFailed)
    }
}

Becomes eight separate embeddings, one per line.

Pros: Precise location, easier to scan
Cons: May miss multi-line concepts

Paragraph-Level

Multiple lines grouped together (planned feature):

fn authenticate_user(...) {
    let user = db.find_user(username)?;
    if verify_password(password, &user.hash) {

One embedding for the whole function body.

Pros: Captures multi-line semantics
Cons: Harder to locate exact match

Token-Level Coloring

The colored output shows which tokens contributed most to the match.

Score Calculation

For each token in the matching line:

token_score = cosine_similarity(query_embedding, token_embedding)

Normalization

Scores are normalized across all tokens in the match:

normalized = (score - min_score) / (max_score - min_score)
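
In code, the min-max step is a short one-pass sketch:

// Min-max normalize the token scores within one matching line (sketch).
fn normalize(scores: &mut [f32]) {
    let (min, max) = scores
        .iter()
        .fold((f32::MAX, f32::MIN), |(lo, hi), &s| (lo.min(s), hi.max(s)));
    let range = (max - min).max(f32::EPSILON); // guard against identical scores
    scores.iter_mut().for_each(|s| *s = (*s - min) / range);
}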

Gradient Formula

A quadratic curve maps scores to colors:

t = normalized_score
t² = t * t

rgb(r, g, b):
  r = 140 + (255 - 140) × t²
  g = 140 + (50 - 140) × t²
  b = 140 + (50 - 140) × t²

Result:

  • t=0: gray (140, 140, 140) — irrelevant
  • t=0.5: slight red tint — somewhat relevant
  • t=1.0: bright red (255, 50, 50) — highly relevant
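
A sketch of the mapping in code; the quadratic curve keeps mid-range tokens near gray so only strong matches stand out:

// Map a normalized token score in [0, 1] to an RGB color (sketch).
fn token_color(normalized: f32) -> (u8, u8, u8) {
    let t2 = normalized * normalized; // quadratic emphasis on the top end
    let lerp = |from: f32, to: f32| (from + (to - from) * t2) as u8;
    (lerp(140.0, 255.0), lerp(140.0, 50.0), lerp(140.0, 50.0))
}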

Example

Query: "database connection"

Match: let pool = PgPool::connect(&db_url).await?;

Token colors:

  • let — gray (irrelevant)
  • pool — gray (variable name)
  • PgPool — gray (type name)
  • connect — bright red (highly relevant!)
  • &db_url — light red (somewhat relevant)
  • .await? — gray (syntax)

Caching Architecture

SQLite Storage

Embeddings are cached in ~/.cache/sgrep/embeddings.db:

CREATE TABLE embeddings (
  path TEXT NOT NULL,
  line_number INTEGER NOT NULL,
  content_hash TEXT NOT NULL,
  embedding BLOB NOT NULL,  -- 256 × int8
  modified_time REAL NOT NULL,
  PRIMARY KEY (path, line_number)
);

CREATE INDEX idx_path ON embeddings(path);
CREATE INDEX idx_hash ON embeddings(content_hash);

Cache Lookup

When searching a file:

  1. Check cache for existing embeddings
  2. Verify file modification time matches
  3. Compute SHA-256 of file content for integrity check
  4. Use cached embeddings if valid
  5. Re-compute only if file changed

Cache Invalidation

Files are re-indexed when:

if cached_mtime != file_mtime || cached_hash != compute_hash(file) {
    recompute_embeddings(file);
    update_cache(file);
}
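
A sketch of the cache lookup with the rusqlite crate, matching the table above (illustrative, not necessarily sgrep's exact code):

use rusqlite::{params, Connection, OptionalExtension};

// Fetch a cached line embedding if the stored content hash still matches (sketch).
fn cached_embedding(
    conn: &Connection,
    path: &str,
    line_number: i64,
    content_hash: &str,
) -> rusqlite::Result<Option<Vec<u8>>> {
    conn.query_row(
        "SELECT embedding FROM embeddings
         WHERE path = ?1 AND line_number = ?2 AND content_hash = ?3",
        params![path, line_number, content_hash],
        |row| row.get(0),
    )
    .optional() // None means a cache miss: recompute and update
}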

Hybrid Mode Cache

In --hybrid mode, BM25 keyword results are also cached:

CREATE TABLE bm25_cache (
  path TEXT,
  pattern TEXT,
  results TEXT,  -- JSON array
  timestamp REAL
);

Model Loading

First Run

On first execution, sgrep downloads the model:

$ sgrep "test" src/
Downloading model: sgrep-code-v1 (~7.5MB)
Extracting to: ~/.cache/sgrep/model/sgrep-code-v1/
Model ready. Searching...

Model Files

~/.cache/sgrep/model/sgrep-code-v1/
├── vocab.json        # Tokenizer vocabulary (500KB)
├── embeddings.bin    # Pre-computed token embeddings (7.5MB)
└── tokenizer.json    # WordPiece configuration (1MB)

Lazy Loading

Embeddings are memory-mapped, not fully loaded:

let embeddings = MmappedEmbeddings::new("embeddings.bin")?;
// Only accessed pages are loaded into RAM

This keeps memory usage low even with large models.
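
One way such a wrapper could be built is with the memmap2 crate; this is a sketch under that assumption, not necessarily sgrep's actual implementation:

use memmap2::Mmap;
use std::fs::File;

// Memory-map the embedding table; pages fault in only when first touched.
struct MmappedEmbeddings {
    map: Mmap,
}

impl MmappedEmbeddings {
    fn new(path: &str) -> std::io::Result<Self> {
        let file = File::open(path)?;
        // Safety: the file must not be truncated while the map is alive.
        let map = unsafe { Mmap::map(&file)? };
        Ok(Self { map })
    }
}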

Performance Optimizations

Parallel Processing

Files are processed in parallel using Rayon:

use rayon::prelude::*;

let matches: Vec<Match> = files
    .par_iter()
    .flat_map(|file| search_file(file, query))
    .collect();

SIMD Dot Product

When available, SIMD instructions accelerate similarity computation:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[target_feature(enable = "avx2,fma")]
unsafe fn cosine_simd(a: &[f32], b: &[f32]) -> f32 {
    // AVX2 FMA processes 8 floats per instruction; assumes len is a multiple of 8.
    let mut acc = _mm256_setzero_ps();
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(ca.as_ptr()), _mm256_loadu_ps(cb.as_ptr()), acc);
    }
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc); // horizontal sum of the 8 lanes
    lanes.iter().sum()
}

Early Termination

Low-score matches are discarded early:

if partial_score < threshold_lower_bound {
    continue; // Skip remaining tokens
}

Result Batching

Results are streamed, not stored all at once:

search_results
    .take(top_k)
    .for_each(|result| print_result(result));

End-to-End Example

$ sgrep "database query error" src/

[1. Computing query embedding]
  "database query error"
  → tokenize → ["database", "query", "error"]
  → embed → mean([emb_db, emb_query, emb_error])
  → normalize → [0.23, -0.45, 0.67, ...]

[2. Scanning files]
  src/main.rs: skip (not matching pattern)
  src/db/postgres.rs: process...

[3. Processing src/db/postgres.rs]
  Line 1: "use postgres::Client;"
  → embed → [0.12, 0.34, ...]
  → similarity = 0.32 → below threshold
  Line 42: "conn.query(\"SELECT * FROM users\").await?"
  → embed → [0.34, 0.56, ...]
  → similarity = 0.89 → match!

[4. Displaying results]
  src/db/postgres.rs:42
  conn.query("SELECT * FROM users").await?
       ^^^^ bright red (query!)

The entire process typically completes in 10-50ms for medium codebases.

Summary

  1. Query → embedding via tokenization + lookup + mean pool
  2. Files → discovered via filesystem walk
  3. Lines → embedded and cached
  4. Similarities → computed via dot product
  5. Results → ranked and filtered
  6. Output → colored by token contribution

All offline, all local, all fast.
