Methodology · Open source · Q1 2025 — Q1 2026

How we measure engineering performance.

A two-stage scoring engine: an LLM classifies the work, deterministic algorithms compute the weight. Reproducible, file-by-file, no black box.

Overview

What ETV measures and why it exists.

Lines of code. Commit count. Story points. DORA. Each answers a different question. None answer the one that matters: did the output get more valuable, or just more numerous?

ETV (Engineering Throughput Value) is a unit of performance, produced by a measurement engine that reads code the way a senior engineer does — not just what changed, but what it meant, where it landed, and whether it touched the architecture.

The Engineering Throughput Value is computed per file, per merged commit, and combines five factors (a code sketch follows the list):

  • Complexity — structural weight of the change itself.
  • Engagement — ratio of surrounding code complexity to change complexity, so targeted edits in dense areas score higher than equivalent edits in trivial files.
  • Architecture — where the change lands in the feature graph; deeply connected features carry more weight than peripheral ones.
  • Decay — reduces credit when a change isn't real cognitive work: mechanical refactors, self‑rewrites of yesterday's code, copy‑paste from elsewhere in the repo.
  • Multiplier — amplifies fixes when the bug was costly: old code, unfamiliar code, high‑churn areas.
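
For intuition only, a per-file record might look something like the following sketch. The field names and the multiplicative combination are assumptions; the methodology specifies the factors, not their exact functional form.

```python
from dataclasses import dataclass

@dataclass
class FileFactors:
    """Hypothetical per-file factor record; names are illustrative, not the real schema."""
    complexity: float       # structural weight of the change itself
    engagement: float       # surrounding-context complexity relative to change complexity
    architecture: float     # weight of the feature the change lands in
    decay: float            # 0..1, lowers credit for mechanical, repeated, or copied work
    fix_multiplier: float = 1.0  # >= 1 when a fix addressed a costly bug, otherwise 1

    def value(self) -> float:
        # Assumed multiplicative combination of the five published factors.
        return (self.complexity * self.engagement * self.architecture
                * self.decay * self.fix_multiplier)

# A small but targeted edit inside a dense, central feature:
print(FileFactors(complexity=3.0, engagement=1.8, architecture=1.4, decay=1.0).value())
```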

Each file contributes to one of three buckets: Growth, Maintenance, or Fixes. The buckets stay separate. Two engineers with identical totals can be doing very different work, and the score shows it.

ETV is computable from commit history alone. No PM tools, no surveys, no self‑report.

The Three Buckets

Growth, Maintenance, Fixes — additive within, never across.

Every file change is classified into one of three buckets. Scores are additive inside a bucket (you can sum Growth ETV across a quarter for an org), but deliberately not additive across buckets. Two engineers with identical total ETV can be doing very different work.

Growth

New functionality and net-new capabilities. Added endpoints, new modules, new product surface area.

Maintenance

Upkeep, refactors, cleanup, performance tuning, tests, dependency updates, docs, style, build, CI.

Fixes

Work that corrects previous output — bug fixes, regressions, hotfixes. Each fix is traced back to the commit that introduced it.

Scoring Engine

Per-commit, per-file, deterministic structure with ML inside.

For each merged commit, the engine produces three sub-scores — one for Growth, one for Maintenance, one for Fixes. Each sub-score is assembled per file and summed across the commit. The structure of the score itself is deterministic; machine-learning components are used to tune thresholds and coefficients inside it, and a large-language-model classifier resolves ambiguous work classifications where pattern-based signals are insufficient.
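
A minimal sketch of the per-commit assembly, assuming the per-file sub-scores have already been computed; the bucket names come from the methodology, everything else is illustrative.

```python
from collections import defaultdict
from typing import Iterable

def commit_subscores(files: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Sum per-file sub-scores into the commit's Growth / Maintenance / Fixes sub-scores.

    `files` is an iterable of (bucket, per_file_subscore) pairs; each per-file value
    is assumed to already include complexity, engagement, decay, and amplification.
    """
    totals: dict[str, float] = defaultdict(float)
    for bucket, subscore in files:
        totals[bucket] += subscore
    return dict(totals)

# One commit touching three files:
print(commit_subscores([("Growth", 4.0), ("Growth", 1.5), ("Maintenance", 0.6)]))
# -> {'Growth': 5.5, 'Maintenance': 0.6}
```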

Per-file sub-score

Each per-file sub-score begins with a context complexity signal derived from the structural properties of the change. That signal is then scaled by an engagement multiplier that captures the ratio of surrounding context complexity to the complexity of the change itself — targeted modifications in complex areas score higher than equivalent changes in trivial code. Several decay and amplification factors are then applied (see below).

Where ML enters — and where it doesn't

ML tunes thresholds and coefficients within the deterministic structure. An LLM classifies ambiguous changes into Growth / Maintenance / Fixes when pattern-based signals are insufficient, and traces bug-fix commits back to the commit that introduced the issue. The structure of the score itself is deterministic. The same diff produces the same score on every run.
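
To picture the division of labor, here is a toy pre-classifier with entirely made-up patterns and a stubbed LLM fallback; none of these heuristics are the actual signals used.

```python
import re

def classify_file(path: str, diff: str) -> str:
    """Toy pattern-based pre-classifier; the real signals and thresholds are not published.

    Returns 'Growth', 'Maintenance', or 'Fixes', falling back to a language model
    for ambiguous cases (stubbed below).
    """
    if re.search(r"(^|/)(tests?|docs?)/|\.md$", path):
        return "Maintenance"                      # tests, docs, and similar upkeep
    if re.search(r"\b(fix|bug|regression|hotfix)\b", diff, re.IGNORECASE):
        return "Fixes"                            # corrective language in the change
    if diff.count("\n+") > 3 * diff.count("\n-"):
        return "Growth"                           # mostly added lines: likely net-new capability
    return llm_classify(path, diff)               # hypothetical LLM fallback for ambiguous work

def llm_classify(path: str, diff: str) -> str:
    # Placeholder: in practice this would send the diff and its context to a model.
    return "Maintenance"
```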

Architecture

A feature graph inferred from code, used to weight changes by where they land.

Before any decay or amplification factors are applied, Navigara builds a structural model of each repository — a feature graph. The graph is derived from code organization alone (no external metadata, no PM tooling) and informs how per-file scores are weighted.

Feature graph

An AI analysis discovers distinct named features (e.g. auth, billing, checkout) and assigns each to a vertical layer — frontend, backend, or data. Edges between feature nodes capture inter-feature dependencies.
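
One way to represent such a graph, using the example feature names and layers from the text; the fields and storage format are assumptions, not the real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str                                     # e.g. "auth", "billing", "checkout"
    layer: str                                    # "frontend", "backend", or "data"
    paths: list[str] = field(default_factory=list)  # directories the feature owns

@dataclass
class FeatureGraph:
    features: dict[str, Feature] = field(default_factory=dict)
    edges: set[tuple[str, str]] = field(default_factory=set)  # inter-feature dependencies

    def degree(self, name: str) -> int:
        """Number of cross-feature dependencies touching this feature."""
        return sum(1 for a, b in self.edges if name in (a, b))
```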

Commit → feature mapping

Each commit is mapped to one or more features via weighted path scoring (exact path match > directory containment > filename affinity). Files are no longer treated atomically; they're located inside a product surface.
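
A toy version of that precedence, where the ordering (exact path match, then directory containment, then filename affinity) comes from the text and the numeric weights are invented.

```python
from pathlib import PurePosixPath

def feature_affinity(changed_path: str, feature_paths: list[str]) -> float:
    """Toy weighted path score for mapping a changed file to a feature.

    Precedence follows the text; the weights (3 / 2 / 1) are assumptions.
    """
    changed = PurePosixPath(changed_path)
    score = 0.0
    for fp in feature_paths:
        owned = PurePosixPath(fp)
        if changed == owned:
            score = max(score, 3.0)               # exact path match
        elif owned in changed.parents:
            score = max(score, 2.0)               # directory containment
        elif owned.name and owned.name in changed.name:
            score = max(score, 1.0)               # filename affinity
    return score

print(feature_affinity("services/billing/invoice.go", ["services/billing", "web/checkout"]))
# -> 2.0 (directory containment)
```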

Architecture multiplier

The structural complexity of the surrounding feature contributes a multiplier to each per-file sub-score. A change inside a deeply connected feature with many cross-feature dependencies carries more weight than the same change inside a peripheral one.
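
As a sketch of the direction only (the real curve is not published), the multiplier could grow with the feature's cross-feature dependency count, for instance using the degree from the graph sketch above, with diminishing returns and a floor of 1.0 for peripheral features.

```python
import math

def architecture_multiplier(cross_feature_dependencies: int) -> float:
    """Hypothetical weighting: more cross-feature dependencies, more weight."""
    return 1.0 + math.log1p(cross_feature_dependencies) / 4.0

print(architecture_multiplier(0))    # 1.0   (peripheral feature)
print(architecture_multiplier(20))   # ~1.76 (deeply connected feature)
```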

Inputs: code structure only. No connected ticketing system, no external metadata. When multiple repositories are connected, the graph extends across them via shared libraries and API contracts.

Decay & Amplification

Where credit is reduced — and where it's amplified.

Several factors run before the per-file score is finalized. Three dampeners reduce credit when a change exists but doesn't represent genuine cognitive work; one multiplier amplifies fixes when the surrounding signals say the bug was costly.

Similarity dampener

Reduces credit for changes that are structurally similar to existing code — mechanical refactors and copy-paste patterns.

Blame decay

Discounts changes that overwrite very recent work by the same author. The signal fades over a short business-day window — rewriting your own code from yesterday is partial credit; revisiting it weeks later is scored normally.

Copy decay

Reduces credit when a high proportion of added lines are duplicated from elsewhere in the codebase.
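
Taken together, the three dampeners might be applied along these lines; the curves, cutoffs, and the length of the business-day window are all assumptions for illustration.

```python
from typing import Optional

def apply_dampeners(score: float,
                    similarity: float,
                    hours_since_own_edit: Optional[float],
                    copied_fraction: float) -> float:
    """Toy application of the three dampeners.

    similarity: 0..1 structural similarity to existing code.
    hours_since_own_edit: working hours since the same author last wrote these lines, or None.
    copied_fraction: share of added lines duplicated from elsewhere in the repo.
    """
    score *= 1.0 - 0.8 * similarity               # similarity dampener
    if hours_since_own_edit is not None:
        window = 3 * 8.0                          # assumed: roughly three business days
        fade = min(hours_since_own_edit / window, 1.0)
        score *= 0.5 + 0.5 * fade                 # yesterday's self-rewrite earns partial credit
    score *= 1.0 - copied_fraction                # copy decay
    return score
```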

Waste multiplier

For Fixes only. The score is amplified based on three signals: how long the original code existed, whether the fix targets another author's code, and how frequently the affected area has been modified recently. A trivial self-fix on code written the same day barely moves the score; a fix in a high-churn area on code the fixer has never touched before is amplified substantially.
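
A toy version with the three signals named in the text and invented weights; the bounds and the additive form are assumptions.

```python
def waste_multiplier(code_age_days: float,
                     other_authors_code: bool,
                     recent_churn: int) -> float:
    """Toy amplifier for Fixes: old, unfamiliar, high-churn code amplifies the score."""
    age = min(code_age_days / 365.0, 1.0)         # how long the original code existed
    unfamiliar = 1.0 if other_authors_code else 0.0
    churn = min(recent_churn / 20.0, 1.0)         # how often the area changed recently
    return 1.0 + age + unfamiliar + churn         # up to ~4x for the costliest bugs

print(waste_multiplier(code_age_days=0.2, other_authors_code=False, recent_churn=1))   # ~1.05
print(waste_multiplier(code_age_days=500, other_authors_code=True, recent_churn=30))   # 4.0
```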

What's Excluded

Generated code, lockfiles, binaries — out before scoring.

The filter list runs before any scoring and is identical across organizations.

  • Generated code. Protocol Buffers, GraphQL schemas, OpenAPI clients, anything matching *_generated or *.gen.* patterns.
  • Dependency lockfiles. go.sum, package-lock.json, yarn.lock, Cargo.lock, poetry.lock, etc.
  • Build artifacts. dist/, node_modules/, hashed outputs, anything checked in by mistake.
  • Minified files. Detected by average line length over 300 characters.
  • Binary and media files. Images, video, archives, fonts, compiled binaries.
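
A sketch of what the pre-scoring predicate could look like; the patterns shown are examples drawn from the list above, not the complete filter.

```python
import re

EXCLUDED_NAMES = {"go.sum", "package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock"}
EXCLUDED_DIRS = ("dist/", "node_modules/")
BINARY_SUFFIXES = (".png", ".jpg", ".mp4", ".zip", ".woff2", ".exe", ".so")

def is_excluded(path: str, content: str) -> bool:
    """Sketch of the pre-scoring filter; the published list is broader than this."""
    name = path.rsplit("/", 1)[-1]
    if name in EXCLUDED_NAMES or any(d in path for d in EXCLUDED_DIRS):
        return True
    if re.search(r"_generated\.|\.gen\.", name):            # generated-code patterns
        return True
    if name.lower().endswith(BINARY_SUFFIXES):               # binary and media files
        return True
    lines = [l for l in content.splitlines() if l]
    if lines and sum(map(len, lines)) / len(lines) > 300:    # minified: avg line length > 300
        return True
    return False
```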

Languages

Full structural analysis for 14 languages; partial for the rest.

Full analysis

Go · Java · JavaScript · TypeScript · C · C++ · C# · Kotlin · Python · PHP · Ruby · Rust · Scala · Swift

Function-scope context complexity, within-file engagement, cross-repo data-flow analysis.

Partial analysis

HTML · CSS · SQL · Terraform · shell · YAML · Markdown

Classification runs as normal; mechanical fidelity is reduced because structural parsing is shallower.

Report-Layer Aggregation

How per-commit scores become the headline number.

The scoring engine produces three sub-scores per commit. The report layer collapses them into a single scalar — Engineering Throughput Value (ETV) — and aggregates upward through three levels.

Per commit

ETV per commit = sum of the three sub-scores (Growth + Maintenance + Fixes) for that commit.

Per SWE, per quarter

Sum of ETV across all qualifying commits authored by that engineer in the quarter.

Per org, per quarter

Mean ETV across the organization's qualifying SWEs that quarter.

Cross-org aggregate

Developer-weighted mean. Every qualifying SWE contributes one observation per quarter, weighted equally regardless of organization size. A 30-person org and a 200-person org are pooled engineer-by-engineer — not org-by-org. This avoids the headline being dominated by the largest org's mean.
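
The four aggregation levels are simple enough to state directly in code; the numbers below are hypothetical.

```python
from statistics import mean

def commit_etv(growth: float, maintenance: float, fixes: float) -> float:
    """Per commit: sum of the three sub-scores."""
    return growth + maintenance + fixes

def swe_quarter_etv(commit_etvs: list[float]) -> float:
    """Per SWE, per quarter: sum of ETV across that engineer's qualifying commits."""
    return sum(commit_etvs)

def org_quarter_etv(swe_etvs: list[float]) -> float:
    """Per org, per quarter: mean ETV across the org's qualifying SWEs."""
    return mean(swe_etvs)

def cross_org_etv(orgs: list[list[float]]) -> float:
    """Cross-org: developer-weighted mean, pooling every SWE observation before averaging."""
    pooled = [etv for org in orgs for etv in org]
    return mean(pooled)

small_org = [120.0, 95.0, 110.0]          # 3 SWEs (hypothetical quarterly ETV values)
large_org = [60.0] * 20                   # 20 SWEs
print(org_quarter_etv(small_org), org_quarter_etv(large_org))  # ~108.3, 60.0
print(cross_org_etv([small_org, large_org]))  # ~66.3: weighted by headcount, not by org means
```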

Figures in the report label the scalar as "performance" for readability. The formal definition is ETV.

Attribution

Who gets credit for a commit.

Primary credit goes to the git author of the merged commit, after email-alias resolution. Co-authors are tracked but do not receive score credit. Attribution uses the merge date, not the commit date — so out-of-order merges land in the quarter they actually shipped.

Automation accounts (dependabot, renovate, github-actions) are excluded by default. Organizations can flag additional bots or service accounts.
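
A minimal sketch of that attribution step, with a hypothetical alias table and the default bot list from above.

```python
from datetime import date

BOT_ACCOUNTS = {"dependabot[bot]", "renovate[bot]", "github-actions[bot]"}
ALIASES = {"jane@personal.example": "jane@acme.example"}   # hypothetical alias table

def attribute(author: str, author_email: str, merge_date: date):
    """Toy attribution: resolve aliases, drop bots, and bucket by merge-date quarter."""
    if author in BOT_ACCOUNTS:
        return None                                     # automation accounts are excluded
    identity = ALIASES.get(author_email, author_email)  # email-alias resolution
    quarter = f"{merge_date.year}-Q{(merge_date.month - 1) // 3 + 1}"
    return identity, quarter                            # credit lands in the quarter it merged

print(attribute("jane", "jane@personal.example", date(2025, 11, 3)))
# -> ('jane@acme.example', '2025-Q4')
```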

Limitations

What ETV does not measure.

ETV is descriptive, not normative. It tells you what shipped, not whether the right thing shipped.

  • Cross-repository comparisons are not straightforward. Repositories of very different size, language mix, or team composition produce ETV scores on different effective scales. The repository is the natural boundary of the engagement calculation.
  • Unconnected work is invisible. A contributor who does most of their work in repositories Navigara has not connected to will appear low.
  • Supporting roles are not directly captured. Code review, mentorship, on-call, incident response, and design work do not carry ETV unless they show up as commits on the default branch.
  • Co-authors are tracked but uncredited. The merge commit's git author receives the score; pair- and mob-programming need to be reconstructed externally.
  • Causal claims require care. ETV moving up after a tooling change is consistent with the tooling helping. It is also consistent with several other explanations. The Q1 2026 report deliberately avoids causal language.
  • Business-goal alignment is out of scope here. ETV measures what shipped, not whether what shipped advanced a particular business objective. Navigara's Alignment concept extends the model toward that question by mapping engineering output to quarterly goals and OKRs, but it requires either a connected goal-tracking tool or manually entered goals. It is not part of this report.

Display Modes

The three buckets, or KTLO + Growth.

The three-bucket view (Growth, Maintenance, Fixes) is the canonical one. For executive reporting, some teams prefer a two-bucket view: Maintenance and Fixes combined into Keep The Lights On (KTLO), with Growth on its own. The underlying three-category data is unchanged — KTLO is a visualization choice, not a different metric.

The full appendix is in the white paper.

Sample floor, fixed-panel sanity check, the 418-engineer constant-population result, OpenAI's four-quarter window, and the complete list of 66 repositories analyzed.

© 2026 Navigara · Methodology · No. 01 · Open source