perf: skip ensureWasmTrees when native engine provides complete data#344
perf: skip ensureWasmTrees when native engine provides complete data#344carlos-alm wants to merge 26 commits intomainfrom
Conversation
Three optimizations to recover build speed after CFG/dataflow default-on:
1. Eliminate redundant WASM parsing (biggest win):
- Remove complexity.js clearing _tree after each file (builder already
clears all trees after all phases complete)
- Add ensureWasmTrees() in parser.js for a single WASM pre-parse pass
before CFG/dataflow, replacing two independent parse passes
- Memoize createParsers() so repeated calls return cached parsers
2. Filter CFG/dataflow to changed files only:
- Build changedSymbols map excluding reverse-dep-only files
- Pass changedSymbols (not allSymbols) to buildCFGData/buildDataflowEdges
- Reverse-dep files only need edge rebuilding, not AST re-analysis
3. Report wasmPreMs in phase timing output for visibility
Impact: 4 functions changed, 28 affected
Impact: 4 functions changed, 10 affected
…325) Three optimizations to recover build speed after CFG/dataflow default-on: 1. Eliminate redundant WASM parsing (biggest win): - Remove complexity.js clearing _tree after each file (builder already clears all trees after all phases complete) - Add ensureWasmTrees() in parser.js for a single WASM pre-parse pass before CFG/dataflow, replacing two independent parse passes - Memoize createParsers() so repeated calls return cached parsers 2. Filter CFG/dataflow to changed files only: - Build changedSymbols map excluding reverse-dep-only files - Pass changedSymbols (not allSymbols) to buildCFGData/buildDataflowEdges - Reverse-dep files only need edge rebuilding, not AST re-analysis 3. Report wasmPreMs in phase timing output for visibility Impact: 4 functions changed, 28 affected Impact: 4 functions changed, 10 affected
…ames (#324) collectIdentifiers hardcoded node.type === 'identifier', missing PHP's variable_name and other language-specific identifier types in return statement referencedNames. Now uses the existing isIdent(nodeType, rules) helper with rules passed from the call site. Also adds null guard. Updated PHP dataflow test to assert referencedNames contains $x, $a, $b — previously only checked funcName without validating identifier capture. Impact: 3 functions changed, 2 affected
…ty tests Update parity tests to compare children (parameters, properties, constants) and call receivers now that the native binary supports them. Document known native gaps: implicit self parameters (Python/Rust) and local constant scope (extracts function-local const as top-level).
The native engine's JS extractor walked the entire tree recursively and extracted `const x = <literal>` as top-level constants regardless of scope. The WASM extractor correctly only walks program-level children. Add a `find_parent_of_types` check to skip constants nested inside function_declaration, arrow_function, function_expression, or method_definition nodes. Impact: 2 functions changed, 1 affected
* perf: fix v3.0.1 build performance regression (14.1 → ~5.8 ms/file)
Three optimizations to recover build speed after CFG/dataflow default-on:
1. Eliminate redundant WASM parsing (biggest win):
- Remove complexity.js clearing _tree after each file (builder already
clears all trees after all phases complete)
- Add ensureWasmTrees() in parser.js for a single WASM pre-parse pass
before CFG/dataflow, replacing two independent parse passes
- Memoize createParsers() so repeated calls return cached parsers
2. Filter CFG/dataflow to changed files only:
- Build changedSymbols map excluding reverse-dep-only files
- Pass changedSymbols (not allSymbols) to buildCFGData/buildDataflowEdges
- Reverse-dep files only need edge rebuilding, not AST re-analysis
3. Report wasmPreMs in phase timing output for visibility
Impact: 4 functions changed, 28 affected
Impact: 4 functions changed, 10 affected
* fix: update native binary to v3.0.1 and enable extended kinds in parity tests
Update parity tests to compare children (parameters, properties,
constants) and call receivers now that the native binary supports them.
Document known native gaps: implicit self parameters (Python/Rust) and
local constant scope (extracts function-local const as top-level).
* fix(native): skip local constants inside function bodies
The native engine's JS extractor walked the entire tree recursively and
extracted `const x = <literal>` as top-level constants regardless of
scope. The WASM extractor correctly only walks program-level children.
Add a `find_parent_of_types` check to skip constants nested inside
function_declaration, arrow_function, function_expression, or
method_definition nodes.
Impact: 2 functions changed, 1 affected
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
buildAstNodes was calling tx(rows) per-file inside the loop, creating 172 separate SQLite transactions each with a BEGIN/COMMIT/fsync cycle. All other phases (CFG, dataflow, complexity) correctly used a single wrapping transaction. Collect all rows across files first, then insert in one tx(allRows) call. Benchmarked: astMs drops from ~3600ms to ~350ms (native) and ~547ms (WASM). Native per-file build time: 24.9 → 8.5 ms/file. Impact: 1 functions changed, 12 affected
) * perf: use single transaction for AST node insertion (3.6s → 350ms) buildAstNodes was calling tx(rows) per-file inside the loop, creating 172 separate SQLite transactions each with a BEGIN/COMMIT/fsync cycle. All other phases (CFG, dataflow, complexity) correctly used a single wrapping transaction. Collect all rows across files first, then insert in one tx(allRows) call. Benchmarked: astMs drops from ~3600ms to ~350ms (native) and ~547ms (WASM). Native per-file build time: 24.9 → 8.5 ms/file. Impact: 1 functions changed, 12 affected * chore: bump version to 3.0.3
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…-58) (#338) Add 19 backlog items across 6 new tiers (1b-1f) for features that leverage existing build data (ast_nodes, cfg_blocks, dataflow, co_changes, embeddings) which is currently computed on every build but consumed by only 1-2 commands each. Highest-value items: data-dependent impact analysis (ID 49, 52), co-change coupling in triage (ID 53), missing co-change partner warning (ID 55).
Before calling ensureWasmTrees, check whether native engine already supplies CFG and dataflow data for all files. When it does, skip the WASM pre-parse entirely — avoiding a full WASM parse of every file on native builds where the data is already available. Impact: 1 functions changed, 14 affected
#339) The phase breakdown table in BUILD-BENCHMARKS.md was missing astMs, cfgMs, dataflowMs, and wasmPreMs — phases that account for ~75% of native build time since v3.0.1. Add them to phaseKeys/phaseLabels in update-benchmark-report.js so future regenerations include them. Also add a notes section explaining the v3.0.0→v3.0.3 regression: the four new default-on phases total ~1,575ms, and the WASM pre-parse double-parses all files on native builds because CFG/dataflow/complexity are JS-only and need tree-sitter ASTs the native engine doesn't expose.
Greptile SummaryThis PR lifts the Key changes:
Confidence Score: 4/5
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[buildGraph] --> B{opts.cfg != false\nOR opts.dataflow != false?}
B -- No --> G[Skip CFG + dataflow entirely]
B -- Yes --> C[Loop over astComplexitySymbols]
C --> D{symbols._tree\nalready set?}
D -- Yes --> C
D -- No --> E{needsCfg?\nCheck fnDefs}
E -- fnDefs.length > 0\nAND some lack\nnative CFG --> F[needsWasmTrees = true\nbreak]
E -- All have native CFG\nOR no fn defs --> H{needsDataflow?\nCheck symbols.dataflow}
H -- dataflow missing --> F
H -- native dataflow present --> C
F --> I[ensureWasmTrees\nastComplexitySymbols]
I --> J[Parse each file\nwithout _tree via WASM]
J --> K
C -- All files checked,\nneedsWasmTrees = false --> L[_t.wasmPreMs = 0\nskip WASM entirely]
L --> K[buildCFGData\nastComplexitySymbols]
K --> M[buildDataflowEdges\nastComplexitySymbols]
M --> N[Release _tree refs\nfor GC]
Last reviewed commit: b6a7699 |
| // Exclude constant nodes — the native engine has a known scope bug where it | ||
| // extracts local `const` variables inside functions as top-level constants, | ||
| // while WASM correctly limits constant extraction to program-level declarations. | ||
| const nodes = db | ||
| .prepare('SELECT name, kind, file, line FROM nodes ORDER BY name, kind, file, line') | ||
| .prepare( | ||
| "SELECT name, kind, file, line FROM nodes WHERE kind != 'constant' ORDER BY name, kind, file, line", | ||
| ) | ||
| .all(); | ||
| const edges = db | ||
| .prepare(` | ||
| SELECT n1.name AS source_name, n2.name AS target_name, e.kind | ||
| FROM edges e | ||
| JOIN nodes n1 ON e.source_id = n1.id | ||
| JOIN nodes n2 ON e.target_id = n2.id | ||
| WHERE n1.kind != 'constant' AND n2.kind != 'constant' | ||
| ORDER BY n1.name, n2.name, e.kind | ||
| `) |
There was a problem hiding this comment.
The WHERE kind != 'constant' exclusion handles the native engine scope bug where function-scoped const declarations are wrongly extracted as top-level constants. That bug is fixed in this PR's javascript.rs change. However, the test exclusion has no mechanism to be re-enabled once the 3.0.3 native binary is published — constant-node parity will never be verified even after the fix is live.
Consider adding a TODO comment referencing the binary version at which this exclusion should be removed:
// TODO: Remove this exclusion once the native binary >= 3.0.3 is published.
// The fix is in crates/codegraph-core/src/extractors/javascript.rs (find_parent_of_types guard).There was a problem hiding this comment.
Fixed in bc3b58c — added TODO comment referencing native binary >= 3.0.4 as the threshold for removing the kind != 'constant' exclusion.
CHANGELOG.md
Outdated
| ## [3.0.3](https://github.com/optave/codegraph/compare/v3.0.1...v3.0.3) (2026-03-04) | ||
|
|
||
| ### Performance | ||
|
|
||
| * **ast:** use single transaction for AST node insertion — `astMs` drops from ~3600ms to ~350ms (native) and ~547ms (WASM), reducing overall native build from 24.9 to 8.5 ms/file ([#333](https://github.com/optave/codegraph/pull/333)) | ||
|
|
||
| ## [3.0.1](https://github.com/optave/codegraph/compare/v3.0.0...v3.0.1) (2026-03-03) |
There was a problem hiding this comment.
The 3.0.3 entry only documents the AST single-transaction optimization. Two other significant changes shipped in this release are missing:
src/builder.js: skipensureWasmTreeswhen native engine provides complete CFG + dataflow datacrates/codegraph-core/src/extractors/javascript.rs: fix function-scopedconstdeclarations being incorrectly extracted as top-level constants
Additionally, the version jumps from 3.0.1 directly to 3.0.3 with no 3.0.2 entry. A note explaining why 3.0.2 was skipped would help future maintainers.
Consider updating the entry to include all three changes and clarifying the version gap.
There was a problem hiding this comment.
Fixed in bc3b58c — added the two missing changelog entries (ensureWasmTrees skip and const scope fix) and a note about 3.0.2 being internal/unpublished.
…omplete changelog
|
Addressed both review comments in bc3b58c:
|
src/builder.js
Outdated
| const fnDefs = symbols.definitions.filter( | ||
| (d) => (d.kind === 'function' || d.kind === 'method') && d.line, | ||
| ); |
There was a problem hiding this comment.
Missing null guard on symbols.definitions
symbols.definitions.filter(...) will throw a TypeError if a symbols object has definitions: undefined or definitions: null. This can happen for binary or unsupported files whose extractor returns a partial object, and for language-specific extractors that may omit definitions under error conditions.
Every other place in the codebase that accesses symbols.definitions uses a || [] guard (e.g. src/ast.js:171 — const defs = symbols.definitions || []). This new code path is the only exception, which means a single bad file can abort the entire build at the pre-parse check rather than silently skipping.
| const fnDefs = symbols.definitions.filter( | |
| (d) => (d.kind === 'function' || d.kind === 'method') && d.line, | |
| ); | |
| const fnDefs = (symbols.definitions || []).filter( | |
| (d) => (d.kind === 'function' || d.kind === 'method') && d.line, | |
| ); |
There was a problem hiding this comment.
Fixed in cb4a6c3 — changed to (symbols.definitions || []).filter(...), matching the guard pattern used everywhere else in the codebase.
Impact: 3 functions changed, 10 affected
|
Claude finished @carlos-alm's task —— View job PR Review AnalysisReviewing PR #344: Performance optimization to skip WASM parsing when native engine provides complete data. Todo List:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization to avoid redundant WASM parsing when the native engine already provides complete CFG and dataflow data. This addresses unnecessary work in builds that use the native engine. Why This Approach: The solution adds a pre-check loop that examines Risk Assessment: Low risk - the optimization is conservative (only skips when data is definitely complete) and maintains backward compatibility. Phase 2: Configuration Change Detection ✅No problematic configuration changes detected. The changes are purely algorithmic optimizations in Phase 3: Test Integrity ✅Test integrity is maintained. The author addressed previous review feedback by:
Phase 4: Code Quality Assessment ✅CRITICAL BUG CONFIRMED: The review comment from Greptile is correct. At Evidence of Inconsistent Pattern:
This inconsistency suggests the new code path doesn't follow the established safety pattern used elsewhere in the codebase. Other Code Quality Observations:
Root Cause AnalysisProblem Being Solved: Performance bottleneck where Why This Approach: The pre-check heuristic is logical and conservative - it only skips WASM parsing when it can definitively confirm that native data is complete for all required functionality (CFG blocks for functions/methods and dataflow data). Risk Assessment: Low risk optimization with proper fallback behavior and accurate timing metrics. Critical Concerns❌ NULL SAFETY BUG: Line Required Fix: const fnDefs = (symbols.definitions || []).filter(
(d) => (d.kind === 'function' || d.kind === 'method') && d.line,
);Final Recommendation
The fix is trivial but essential - all other aspects of the PR are acceptable. |
Impact: 1 functions changed, 0 affected
|
Claude finished @carlos-alm's task —— View job Claude is reviewing PR #344 Todo List:
Phase 1: Root Cause Analysis ✅Problem Being Solved: Performance optimization to avoid redundant WASM parsing when the native engine already provides complete CFG and dataflow data. This addresses unnecessary work in builds that use the native engine. Why This Approach: The solution adds a pre-check loop that examines Risk Assessment: Low risk - the optimization is conservative (only skips when data is definitely complete) and maintains backward compatibility. Phase 2: Configuration Change Detection ✅No problematic configuration changes detected. The changes are purely algorithmic optimizations in Phase 3: Test Integrity ✅Test integrity is maintained. The author addressed previous review feedback by:
Phase 4: Code Quality Assessment ✅CRITICAL BUG CONFIRMED: The review comment from Greptile is correct. At Evidence of Inconsistent Pattern:
This inconsistency suggests the new code path doesn't follow the established safety pattern used elsewhere in the codebase. Other Code Quality Observations:
Root Cause AnalysisProblem Being Solved: Performance bottleneck where Why This Approach: The pre-check heuristic is logical and conservative - it only skips WASM parsing when it can definitively confirm that native data is complete for all required functionality (CFG blocks for functions/methods and dataflow data). Risk Assessment: Low risk optimization with proper fallback behavior and accurate timing metrics. Critical Concerns❌ NULL SAFETY BUG: Line Required Fix: const fnDefs = (symbols.definitions || []).filter(
(d) => (d.kind === 'function' || d.kind === 'method') && d.line,
);Final Recommendation
The fix is trivial but essential - all other aspects of the PR are acceptable. |
|
Addressed the null guard comment in cb4a6c3 — added (symbols.definitions || []) guard. |
src/builder.js
Outdated
| const fnDefs = (symbols.definitions || []).filter( | ||
| (d) => (d.kind === 'function' || d.kind === 'method') && d.line, | ||
| ); | ||
| if (fnDefs.length > 0 && !fnDefs.every((d) => d.cfg?.blocks?.length)) { |
There was a problem hiding this comment.
Empty blocks array treated as missing native CFG
d.cfg?.blocks?.length evaluates to 0 (falsy) when the native engine returns cfg: { blocks: [] } for trivial functions (e.g., empty body or single return). This causes !fnDefs.every(d => d.cfg?.blocks?.length) to be true, spuriously triggering needsWasmTrees = true even though native CFG was provided.
In practice, any file containing a trivially simple function prevents the skip from firing, silently negating the optimization. cfg.js line 1151 also skips such functions, but that doesn't help here—the builder check should recognize empty-blocks as "native CFG was computed, just returned empty."
Consider using Array.isArray(d.cfg?.blocks) instead, which correctly treats an explicitly-set empty-blocks array as "native CFG was provided":
| if (fnDefs.length > 0 && !fnDefs.every((d) => d.cfg?.blocks?.length)) { | |
| if (fnDefs.length > 0 && !fnDefs.every((d) => Array.isArray(d.cfg?.blocks))) { |
There was a problem hiding this comment.
Fixed in b6a7699 — changed the check from d.cfg?.blocks?.length (falsy for empty arrays) to d.cfg === null || Array.isArray(d.cfg?.blocks). This correctly treats cfg: { blocks: [] } as valid native CFG while preserving cfg: null for nodes that intentionally have no CFG (e.g. interface members).
cb4a6c3 to
bf8de1f
Compare
d.cfg?.blocks?.length evaluates to 0 (falsy) when the native engine
returns cfg: { blocks: [] } for trivial functions, spuriously triggering
needsWasmTrees. Use Array.isArray(d.cfg?.blocks) instead, and preserve
d.cfg === null for nodes that intentionally have no CFG (e.g. interface
members).
Impact: 1 functions changed, 1 affected
Impact: 1 functions changed, 0 affected

Summary
ensureWasmTrees, checks whether native engine already provides CFG and dataflow data for all files_t.wasmPreMs = 0when skipped so timings remain accurateBuilds on PRs #342 (native CFG) and #343 (native dataflow) which add per-module bypass logic. This change lifts the optimization to the shared pre-parse step in
builder.js, eliminating the redundantensureWasmTreescall when native data is already sufficient.Test plan
node src/cli.js build . --engine native --timings— verifywasmPreMsdrops to ~0node src/cli.js build . --engine wasm --timings— verify WASM path still works