Skip to content

perf: compute dataflow analysis in Rust native engine#343

Merged
carlos-alm merged 7 commits intoperf/cfg-in-rustfrom
perf/dataflow-in-rust
Mar 5, 2026
Merged

perf: compute dataflow analysis in Rust native engine#343
carlos-alm merged 7 commits intoperf/cfg-in-rustfrom
perf/dataflow-in-rust

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Port per-file dataflow extraction from WASM (JS) to the Rust native engine, eliminating the tree-sitter WASM dependency for dataflow analysis
  • Add dataflow.rs with DataflowRules struct, 8 per-language static configs, ParamStrategy enum, scope tracking, and extract_dataflow() recursive visitor
  • Add 6 NAPI structs to types.rs and dataflow: Option<DataflowResult> on FileSymbols
  • JS side: normalizeNativeSymbols() maps native dataflow, buildDataflowEdges() bypasses WASM when native data is present

Context

PR 3 of 4 in the native engine performance series:

  1. PR 1 (perf/ast-nodes-all-langs): AST node extraction — merged
  2. PR 2 (perf/cfg-in-rust): CFG construction — base of this PR
  3. PR 3 (this): Dataflow analysis
  4. PR 4: Skip redundant WASM parsing when native engine provides all data

Test plan

  • All 1437 tests pass (80 test files)
  • Biome lint clean
  • cargo build compiles (no Rust toolchain on dev machine — follows identical patterns to existing cfg.rs)
  • node src/cli.js build . --engine native produces dataflow edges
  • node src/cli.js build . --engine wasm still works (WASM fallback preserved)
  • Parity: native and WASM produce equivalent dataflow edges

@claude
Copy link

claude bot commented Mar 4, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@carlos-alm carlos-alm force-pushed the perf/dataflow-in-rust branch from a8dcbe4 to a01992b Compare March 4, 2026 21:52
@claude
Copy link

claude bot commented Mar 4, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@greptile-apps
Copy link

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR ports per-file dataflow extraction from the WASM/JS path into the Rust native engine, adding a 1450-line dataflow.rs with 8 per-language DataflowRules configs, ParamStrategy dispatch, scope tracking with ScopeFrame, and a depth-guarded recursive visit(). The JS side gains a normalizeNativeSymbols dataflow section and a fast-path bypass in buildDataflowEdges that skips WASM extraction when native data is present. All previously flagged correctness issues (Rust method calls, Go var_declaration, depth guards, truncate char-counting, mutation receiver field, confidence guard alignment, and the include_dataflow opt-in flag) have been addressed in follow-up commits.

Remaining issues:

  • Logic: Barrel-only files loaded during incremental builds are re-parsed with engineOpts (which includes dataflow: true by default). The resulting native dataflow is immediately discarded since barrel files are excluded from astComplexitySymbols and their symbols are never passed to buildDataflowEdges. The identical fix applied to the pending-analysis path (analysisOpts override) should be applied here too — { ...engineOpts, dataflow: false } at src/builder.js:896.
  • Style: resolve_callee_name's None fallback path hardcodes "name" and "method" field names instead of consulting rules.method_call_name_field, which was specifically added for this purpose. Currently works for Rust by coincidence.

Confidence Score: 4/5

  • Safe to merge; the logic bug (barrel file dataflow) is a performance regression, not a correctness failure, and the WASM fallback path is fully preserved.
  • The core dataflow extraction is well-implemented and all critical correctness issues from prior reviews have been fixed (Rust method calls, depth guards, truncation parity, mutation receiver field). The two remaining issues are improvements rather than blockers: the barrel-file re-parse triggers unnecessary dataflow computation (performance regression that mirrors a bug already fixed in the pending-analysis path), and resolve_callee_name uses hardcoded field names instead of the configurable setting (architectural inconsistency that currently works by coincidence but could break for future languages). Neither issue causes data corruption or incorrect edge insertions.
  • src/builder.js line 896 (barrel-file dataflow bypass) and crates/codegraph-core/src/dataflow.rs line 831 (resolve_callee_name field lookup)

Sequence Diagram

sequenceDiagram
    participant B as builder.js
    participant P as parser.js / parseFilesAuto
    participant N as Native Engine (Rust)
    participant D as dataflow.js / buildDataflowEdges
    participant DB as SQLite

    B->>P: parseFilesAuto(filePaths, rootDir, {engine, dataflow: true})
    P->>N: native.parseFiles(filePaths, rootDir, include_dataflow=true)
    N->>N: parse_files_parallel()
    N->>N: extract_symbols()
    N->>N: extract_dataflow() [if include_dataflow]
    N-->>P: FileSymbols[] with .dataflow populated
    P->>P: normalizeNativeSymbols() — maps DataflowResult to camelCase
    P-->>B: Map(relPath → symbols with .dataflow)

    B->>D: buildDataflowEdges(db, astComplexitySymbols, rootDir, engineOpts)
    loop for each file
        D->>D: data = symbols.dataflow  (native fast-path)
        alt dataflow present (native)
            D->>D: use data directly, skip WASM
        else no dataflow (WASM path)
            D->>D: extract via WASM tree / re-parse
        end
        D->>DB: INSERT INTO dataflow (flows_to / returns / mutates edges)
    end
Loading

Comments Outside Diff (2)

  1. src/builder.js, line 896 (link)

    Barrel-file re-parse computes dataflow that is immediately discarded.

    engineOpts carries dataflow: opts.dataflow !== false (defaults to true), so every barrel-only file loaded here triggers a full dataflow AST pass in the native engine. However, these files are added to barrelOnlyFiles and are never included in astComplexitySymbols (which is built from allSymbols on line 1281). buildDataflowEdges is called with astComplexitySymbols, so the native dataflow result for barrel files is computed, stored in fileSymbols, and then silently discarded.

    This mirrors the bug that was already fixed in the pending-analysis pass (line 553, where analysisOpts overrides dataflow: needsDataflow), but this site was not updated:

  2. crates/codegraph-core/src/dataflow.rs, line 831-837 (link)

    resolve_callee_name ignores rules.method_call_name_field in its fallback path.

    When call_function_field lookup returns None, the fallback hardcodes "name" and "method" as field names rather than consulting rules.method_call_name_field:

    None => {
        let name_node = call_node
            .child_by_field_name("name")           // hardcoded
            .or_else(|| call_node.child_by_field_name("method")); // hardcoded
        name_node.map(|n| node_text(&n, source).to_string())
    }

    For Rust method_call_expression this happens to work because the tree-sitter field name is "name" — the same value configured in RUST_DATAFLOW.method_call_name_field. It's also correct for other languages (Java, Ruby). But method_call_name_field was specifically added to DataflowRules to make this configurable; skipping it here is architecturally inconsistent and could silently break for a future language where the method name field differs.

    Consider preferring rules.method_call_name_field before falling back to the hardcoded names:

Last reviewed commit: 654a60b

Comment on lines +257 to +258
call_node: Some("call_expression"),
call_nodes: &[],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust method calls silently excluded from dataflow

RUST_DATAFLOW only lists call_node: Some("call_expression"), but in the tree-sitter Rust grammar, method calls like obj.method(arg) produce a method_call_expression node — not a call_expression. As a result, is_call_node returns false for every Rust method call, so:

  • handle_call_expr is never triggered for method calls → no arg_flows recorded
  • handle_var_declarator / handle_assignment also fail the is_call_node check → let x = iter.collect(), result = conn.query(sql), etc. generate no assignments

This is a silent correctness gap: Rust dataflow coverage will be limited to free-function calls only. call_nodes should include both variants:

Suggested change
call_node: Some("call_expression"),
call_nodes: &[],
call_node: None,
call_nodes: &["call_expression", "method_call_expression"],

Context Used: Rule from dashboard - CLAUDE.md (source)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this is a real correctness gap. Rust method calls (obj.method(arg)) produce method_call_expression in tree-sitter, not call_expression. Will fix by switching to call_nodes: &["call_expression", "method_call_expression"] as suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81dfd89 — changed to call_node: None, call_nodes: &["call_expression", "method_call_expression"].

Comment on lines +208 to +210
var_declarator_nodes: &["short_var_declaration", "var_declaration"],
var_name_field: "left",
var_value_field: Some("right"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go var_declaration field names don't match tree-sitter schema

var_declarator_nodes includes "var_declaration", but var_name_field: "left" and var_value_field: Some("right") only exist on short_var_declaration (:= syntax). In the tree-sitter Go grammar, var_declaration contains var_spec children with name, type, and value fields — it has no left/right fields at the top level.

In handle_var_declarator, both child_by_field_name("left") and child_by_field_name("right") return None for a var_declaration node, so the function hits the name_n = None → return early exit without recording anything. Declarations like:

var result *Result = db.Query(sql)

are silently skipped. Only := short declarations are actually tracked. Since var_declaration is listed in var_declarator_nodes but produces no output, it may be worth either removing it from the list (to document the known limitation), or adding dedicated handling for the var_spec child pattern.

Context Used: Rule from dashboard - CLAUDE.md (source)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid — var_declaration has var_spec children with name/type/value fields, not left/right. The current config silently exits early for var x T = foo() style declarations. Will either remove var_declaration from the list (documenting the limitation to := only) or add dedicated var_spec child handling. The latter is more correct but requires walking one level deeper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81dfd89 — removed "var_declaration" from var_declarator_nodes, leaving only "short_var_declaration". Added a comment documenting the limitation.

Comment on lines +274 to +311
parameters: (result.dataflow.parameters || []).map((p) => ({
funcName: p.funcName ?? p.func_name,
paramName: p.paramName ?? p.param_name,
paramIndex: p.paramIndex ?? p.param_index,
line: p.line,
})),
returns: (result.dataflow.returns || []).map((r) => ({
funcName: r.funcName ?? r.func_name,
expression: r.expression ?? '',
referencedNames: r.referencedNames ?? r.referenced_names ?? [],
line: r.line,
})),
assignments: (result.dataflow.assignments || []).map((a) => ({
varName: a.varName ?? a.var_name,
callerFunc: a.callerFunc ?? a.caller_func ?? null,
sourceCallName: a.sourceCallName ?? a.source_call_name,
expression: a.expression ?? '',
line: a.line,
})),
argFlows: (result.dataflow.argFlows ?? result.dataflow.arg_flows ?? []).map((f) => ({
callerFunc: f.callerFunc ?? f.caller_func ?? null,
calleeName: f.calleeName ?? f.callee_name,
argIndex: f.argIndex ?? f.arg_index,
argName: f.argName ?? f.arg_name ?? null,
binding:
(f.bindingType ?? f.binding_type) ? { type: f.bindingType ?? f.binding_type } : null,
confidence: f.confidence,
expression: f.expression ?? '',
line: f.line,
})),
mutations: (result.dataflow.mutations || []).map((m) => ({
funcName: m.funcName ?? m.func_name ?? null,
receiverName: m.receiverName ?? m.receiver_name,
binding:
(m.bindingType ?? m.binding_type) ? { type: m.bindingType ?? m.binding_type } : null,
mutatingExpr: m.mutatingExpr ?? m.mutating_expr,
line: m.line,
})),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant snake_case fallbacks in normalizeNativeSymbols

The napi-rs #[napi(js_name = "...")] attributes on all DataflowParam, DataflowReturn, DataflowAssignment, DataflowArgFlow, and DataflowMutation structs guarantee that the NAPI binary always exposes camelCase property names (e.g., funcName, paramName, argIndex). The snake_case fallbacks (p.func_name, p.param_name, f.arg_index, etc.) can therefore never be reached at runtime.

These fallbacks are harmless but add noise and could mislead future readers into thinking the binary sometimes returns snake_case. Consider removing them:

Suggested change
parameters: (result.dataflow.parameters || []).map((p) => ({
funcName: p.funcName ?? p.func_name,
paramName: p.paramName ?? p.param_name,
paramIndex: p.paramIndex ?? p.param_index,
line: p.line,
})),
returns: (result.dataflow.returns || []).map((r) => ({
funcName: r.funcName ?? r.func_name,
expression: r.expression ?? '',
referencedNames: r.referencedNames ?? r.referenced_names ?? [],
line: r.line,
})),
assignments: (result.dataflow.assignments || []).map((a) => ({
varName: a.varName ?? a.var_name,
callerFunc: a.callerFunc ?? a.caller_func ?? null,
sourceCallName: a.sourceCallName ?? a.source_call_name,
expression: a.expression ?? '',
line: a.line,
})),
argFlows: (result.dataflow.argFlows ?? result.dataflow.arg_flows ?? []).map((f) => ({
callerFunc: f.callerFunc ?? f.caller_func ?? null,
calleeName: f.calleeName ?? f.callee_name,
argIndex: f.argIndex ?? f.arg_index,
argName: f.argName ?? f.arg_name ?? null,
binding:
(f.bindingType ?? f.binding_type) ? { type: f.bindingType ?? f.binding_type } : null,
confidence: f.confidence,
expression: f.expression ?? '',
line: f.line,
})),
mutations: (result.dataflow.mutations || []).map((m) => ({
funcName: m.funcName ?? m.func_name ?? null,
receiverName: m.receiverName ?? m.receiver_name,
binding:
(m.bindingType ?? m.binding_type) ? { type: m.bindingType ?? m.binding_type } : null,
mutatingExpr: m.mutatingExpr ?? m.mutating_expr,
line: m.line,
})),
// Before (redundant):
funcName: p.funcName ?? p.func_name,
// After:
funcName: p.funcName,

Context Used: Rule from dashboard - CLAUDE.md (source)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the snake_case fallbacks are unreachable dead code since napi-rs #[napi(js_name)] guarantees camelCase. They were added defensively during development but serve no runtime purpose. Will remove them to reduce noise and avoid misleading future readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 81dfd89 — removed all snake_case fallbacks from the dataflow section of normalizeNativeSymbols.

@claude
Copy link

claude bot commented Mar 4, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze root cause and justification
  • Review main dataflow.rs implementation
  • Examine identified issues from Greptile review
  • Check test integrity and coverage
  • Assess code quality and architecture
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,389-line dataflow.rs implementation and the silent correctness issues identified by Greptile.

Critical Concerns

1. Rust Method Call Coverage Gap (HIGH PRIORITY)

The RUST_DATAFLOW configuration at line 257-258 only includes call_node: Some("call_expression") but tree-sitter Rust emits method_call_expression for obj.method(args) syntax. This means all Rust method call dataflow is silently dropped:

  • No arg_flows for method calls
  • No assignments from method call returns
  • No mutation detection for method calls

Fix: Change lines 257-258 to use call_nodes: &["call_expression", "method_call_expression"]

2. Go Variable Declaration Mismatch (MEDIUM PRIORITY)

At lines 208-210, GO_DATAFLOW includes "var_declaration" in var_declarator_nodes but uses field names (var_name_field: "left", var_value_field: Some("right")) that only exist on short_var_declaration nodes. var_declaration nodes have var_spec children with different field names, causing silent failures for var x T = foo() style declarations.

3. Code Quality: Redundant Fallbacks (LOW PRIORITY)

The normalizeNativeSymbols function in src/parser.js (lines 305-311) contains dead code with snake_case fallbacks that can never execute due to NAPI's guaranteed camelCase conversion via #[napi(js_name)] attributes.

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ⚠️ No visible tests specifically for the identified dataflow correctness gaps

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive per-language configurations, proper scope tracking, recursive AST visitor pattern
  • Strong: Maintains WASM fallback compatibility and follows established patterns
  • Weak: Silent correctness gaps for major language constructs
  • Acceptable: Code style is consistent with existing Rust codebase

Configuration Change Detection

No high-risk configuration changes detected:

  • No linting rules relaxed
  • No test timeouts increased
  • No build configuration modifications
  • Changes are additive feature development

Final Recommendation

  • Rating: ⭐⭐☆☆☆ (2/5)
  • Action: REQUEST CHANGES
  • Reasoning: While the architecture and approach are sound, the silent correctness gaps for Rust method calls and Go variable declarations are critical issues that will produce incomplete dataflow graphs. These must be fixed before the PR can proceed, especially since this is meant to achieve parity with the existing WASM implementation. The redundant fallback code should also be cleaned up.

Required Changes:

  1. Fix Rust method call tracking - add method_call_expression to call_nodes
  2. Fix Go var_declaration handling - either remove from nodes list or add proper field handling
  3. Remove redundant snake_case fallbacks in normalizeNativeSymbols
  4. Add tests for the specific language constructs that were broken

Positive Notes:

  • Excellent architectural consistency with previous native engine work
  • Comprehensive language support with detailed per-language configurations
  • Proper fallback handling preserves existing functionality
  • Clear separation of concerns and well-structured code

@carlos-alm
Copy link
Contributor Author

@greptileai

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 3 identified issues
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust, which should provide significant performance improvements.

Why This Approach: The approach is well-justified and follows an established pattern:

  • Builds on previous PRs (AST extraction, CFG construction)
  • Maintains WASM fallback compatibility via dual-engine architecture
  • Follows the project's design principle of moving compute-intensive operations to native Rust
  • Aligns with the documented architecture in CLAUDE.md

Risk Assessment: Medium risk due to complexity of the 1,389-line dataflow.rs implementation, but the approach is architecturally sound and all identified issues have been addressed.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are purely additive feature development

Test Integrity Assessment

Test integrity is maintained:

  • ✅ No test files modified in either commit
  • ✅ No weakened assertions or conditional test logic added
  • ✅ All 1437 tests reportedly still pass
  • ⚠️ Note: No Rust compilation verification possible in dev environment (expected)

Critical Issues Resolution

All 3 identified issues have been correctly fixed in commit 81dfd89:

  1. ✅ Rust method call tracking fixed - crates/codegraph-core/src/dataflow.rs:260: Now includes both "call_expression" and "method_call_expression" in call_nodes, eliminating the silent gap for obj.method(args) calls.

  2. ✅ Go var_declaration limitation documented - crates/codegraph-core/src/dataflow.rs:210: Removed "var_declaration" from var_declarator_nodes and added clear documentation about the limitation to := style declarations only.

  3. ✅ Snake_case fallbacks removed - src/parser.js: All unreachable fallbacks like ?? p.func_name, ?? r.func_name, ?? m.receiver_name were properly removed from normalizeNativeSymbols().

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive 8-language support with detailed per-language configurations
  • Strong: Proper scope tracking with recursive AST visitor pattern
  • Strong: Maintains dual-engine architecture with WASM fallback preserved
  • Strong: Clear separation of concerns and consistent code style
  • Acceptable: Complex but well-structured 1,389-line implementation

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved.

Positive Notes:

  • Excellent response to review feedback with targeted fixes
  • Comprehensive per-language dataflow configurations
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected

This PR is ready for merge after the comprehensive fixes applied in commit 81dfd89.


@carlos-alm
Copy link
Contributor Author

@greptileai

Comment on lines 28 to 29
symbols.dataflow = extract_dataflow(&tree, &source, lang.lang_id_str());
symbols.line_count = Some(line_count);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_dataflow runs unconditionally on every file

extract_dataflow is called in both parse_files_parallel and parse_file regardless of whether the caller actually needs dataflow data. On the WASM path, dataflow extraction only happens inside buildDataflowEdges, which is only called when --dataflow is enabled. The native engine now pays the full dataflow extraction cost for every build — including builds where --dataflow is not set.

For projects with hundreds of source files, this is non-trivial CPU overhead on every cold build. Consider accepting a compute_dataflow: bool flag (or a feature-flag on FileSymbols) so callers can opt out:

// In parse_files_parallel / parse_file, conditionally compute:
if opts.compute_dataflow {
    symbols.dataflow = extract_dataflow(&tree, &source, lang.lang_id_str());
}

PR 4 in the series may address this, but it is worth tracking explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa11961 — added include_dataflow: bool parameter to parse_files_parallel() and parse_file(). The JS side passes !!opts.dataflow from builder.js engine options. Dataflow extraction is now skipped entirely when not requested.

Comment on lines +920 to +942
}

#[allow(clippy::too_many_arguments)]
fn visit(
node: &Node,
rules: &DataflowRules,
source: &[u8],
scope_stack: &mut Vec<ScopeFrame>,
parameters: &mut Vec<DataflowParam>,
returns: &mut Vec<DataflowReturn>,
assignments: &mut Vec<DataflowAssignment>,
arg_flows: &mut Vec<DataflowArgFlow>,
mutations: &mut Vec<DataflowMutation>,
) {
let t = node.kind();

// Enter function scope
if is_function_node(rules, t) {
enter_scope(node, rules, source, scope_stack, parameters);
let cursor = &mut node.walk();
for child in node.named_children(cursor) {
visit(&child, rules, source, scope_stack, parameters, returns, assignments, arg_flows, mutations);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded recursion may overflow the stack on deeply nested ASTs

visit passes 9 mutable references through each stack frame and recurses into every named child. For deeply nested constructs — minified JS/TS, macro-expanded Rust, or deeply nested closures — the call depth can reach thousands of levels. Rust's default thread stack is 8 MB, and each visit frame is substantial (9 pointer-sized arguments + return address + locals).

The existing cfg.rs presumably uses the same pattern, but dataflow visits the entire AST (not just CFG blocks), so the recursion depth ceiling is higher. Consider adding an explicit depth guard or converting the recursion to an explicit stack for production resilience:

fn visit(node: &Node, ..., depth: usize) {
    if depth > 1_000 { return; } // guard against pathological inputs
    // ...
    for child in node.named_children(cursor) {
        visit(&child, ..., depth + 1);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa11961 — added const MAX_VISIT_DEPTH: usize = 200 and a depth parameter to visit(). Returns early when depth exceeded. All 7 recursive call sites pass depth + 1.

Comment on lines +1338 to +1345
if let Some(fn_node) = expr.child_by_field_name(rules.call_function_field) {
if rules.member_node.is_some_and(|m| m == fn_node.kind()) {
if let Some(prop) = fn_node.child_by_field_name(rules.member_property_field) {
method_name = Some(node_text(&prop, source).to_string());
}
receiver = member_receiver(&fn_node, rules, source);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust mutating_methods list is dead code — method call mutations never recorded

handle_expr_stmt_mutation attempts to extract the method name using expr.child_by_field_name(rules.call_function_field). For Rust, call_function_field = "function", but a method_call_expression node has no "function" field (only receiver, name, and arguments). This means fn_node is always None for every Rust method call, so method_name is never populated and the function exits early via let method = match method_name { Some(m) => m, None => return }.

The call_object_field fallback also doesn't apply since RUST_DATAFLOW.call_object_field = None. As a result, no mutation records are ever generated for Rust despite the mutating_methods list defining ["push", "pop", "insert", "remove", "clear", "sort", "reverse"].

This same failure mode existed in the WASM Rust path (which only recognized call_expression and thus never even entered the mutation handler for method calls), so it is not a behavioral regression. But now that method_call_expression is correctly in call_nodes, the mutation handler is reached but silently fails to extract the method name.

The fix is to add a method_call_expression-aware extraction path, reusing the same "name" field fallback that resolve_callee_name already uses:

    // Standard pattern: call(fn: member(obj, prop))
    if let Some(fn_node) = expr.child_by_field_name(rules.call_function_field) {
        if rules.member_node.is_some_and(|m| m == fn_node.kind()) {
            if let Some(prop) = fn_node.child_by_field_name(rules.member_property_field) {
                method_name = Some(node_text(&prop, source).to_string());
            }
            receiver = member_receiver(&fn_node, rules, source);
        }
    }

    // Rust/Ruby pattern: method_call_expression has receiver + name fields directly on the call node
    if method_name.is_none() {
        if let Some(name_n) = expr.child_by_field_name("name").or_else(|| expr.child_by_field_name("method")) {
            method_name = Some(node_text(&name_n, source).to_string());
            // Extract receiver from the call node's receiver field
            if let Some(recv_n) = expr.child_by_field_name("receiver") {
                if is_ident(rules, recv_n.kind()) {
                    receiver = Some(node_text(&recv_n, source).to_string());
                }
            }
        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa11961 — added \ to . Set to \ for Rust where \ uses a \ field instead of . \ now falls back to this field when the standard \ lookup fails.

Comment on lines +506 to +521
fn truncate(s: &str, max: usize) -> String {
if s.len() <= max {
s.to_string()
} else {
let mut result = String::with_capacity(max + 3);
// Take at most `max` bytes, but don't split a char
for ch in s.chars() {
if result.len() + ch.len_utf8() > max {
break;
}
result.push(ch);
}
result.push('…');
result
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncate byte vs character count parity difference with WASM

The initial overflow check s.len() <= max compares byte length against max (120), while the WASM counterpart uses str.length which counts UTF-16 code units (approximately Unicode code points for BMP characters). For ASCII-only source code (the vast majority) they are equivalent, but for source files containing non-ASCII identifiers, comments-as-expressions, or string literals in non-Latin scripts, the truncation point will differ between native and WASM outputs.

Parity of expression values is important because buildDataflowEdges stores them verbatim in the SQLite dataflow table. Consider aligning the semantics by comparing character count:

fn truncate(s: &str, max: usize) -> String {
    let char_count = s.chars().count();
    if char_count <= max {
        s.to_string()
    } else {
        let truncated: String = s.chars().take(max).collect();
        format!("{}…", truncated)
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fa11961 — changed s.len() to s.chars().count() for the length comparison, and used s.char_indices().nth(max) to find the correct byte offset for truncation. Character-level counting now matches JS str.length behavior.

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 3 identified issues
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,389-line dataflow.rs implementation and the silent correctness issues identified by Greptile.

Critical Concerns

All Previously Identified Issues Have Been Fixed ✅

  1. ✅ Rust Method Call Coverage Fixed - crates/codegraph-core/src/dataflow.rs:271: Changed from call_node: Some("call_expression") to call_nodes: &["call_expression", "method_call_expression"], eliminating the silent gap for obj.method(args) syntax.

  2. ✅ Go Variable Declaration Mismatch Resolved - crates/codegraph-core/src/dataflow.rs:218-220: Removed "var_declaration" from var_declarator_nodes and added clear documentation: "Only short_var_declaration uses left/right fields. var_declaration has var_spec children with name/type/value fields — not yet supported."

  3. ✅ Snake_case Fallbacks Removed - src/parser.js:304-309: All unreachable fallbacks like ?? p.func_name, ?? r.func_name, etc. have been properly cleaned up from normalizeNativeSymbols().

  4. ✅ Method Call Name Extraction Enhanced - Added method_call_name_field: Some("name") for Rust configuration and proper handling in handle_expr_stmt_mutation to extract method names from method_call_expression nodes.

  5. ✅ Performance Optimization Added - crates/codegraph-core/src/parallel.rs:17,44: Added include_dataflow parameter to both parsing functions, allowing dataflow computation to be conditionally enabled rather than running on every file.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ No linting rules relaxed
  • ✅ No test timeouts increased
  • ✅ No build configuration modifications
  • ✅ Changes are additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ No visible tests specifically for the identified dataflow correctness gaps

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive per-language configurations, proper scope tracking, recursive AST visitor pattern
  • Strong: Maintains WASM fallback compatibility and follows established patterns
  • Strong: All identified correctness gaps have been systematically addressed
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved.

Positive Notes:

  • Excellent response to review feedback with targeted fixes
  • Comprehensive per-language dataflow configurations
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected

This PR is ready for merge after the comprehensive fixes applied in commit 81dfd89.


@carlos-alm
Copy link
Contributor Author

Addressed all 4 new review comments in fa11961:

  1. Conditional dataflow extraction — added include_dataflow flag to parse_files_parallel() and parse_file(), threaded from builder.js
  2. Depth-limited recursion — added MAX_VISIT_DEPTH = 200 and depth param to visit()
  3. Rust method mutations — added method_call_name_field to DataflowRules for fallback method name extraction
  4. Truncation parity — switched to chars().count() for character-level counting matching JS

@greptileai

Comment on lines +1372 to +1385
if let Some(name_field) = rules.method_call_name_field {
if let Some(name_n) = expr.child_by_field_name(name_field) {
method_name = Some(node_text(&name_n, source).to_string());
// Extract receiver: try member_object_field on the call expr itself
if let Some(recv_node) = expr.child_by_field_name(rules.member_object_field) {
if is_ident(rules, recv_node.kind()) {
receiver = Some(node_text(&recv_node, source).to_string());
} else if rules.member_node.is_some_and(|m| m == recv_node.kind()) {
receiver = member_receiver(&recv_node, rules, source);
}
}
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust mutation receiver detection still uses wrong field name

The method_call_name_field fix correctly extracts the method name via "name" (line 1373), but the receiver lookup on line 1376 still uses rules.member_object_field which is "value" for Rust. In the tree-sitter-rust grammar, method_call_expression exposes its receiver as "receiver", not "value" — that field belongs to field_expression:

method_call_expression: receiver ..., name ..., arguments ...
field_expression:       value ...,    field ...

So expr.child_by_field_name("value") always returns None for any Rust method_call_expression, recv_node is never set, receiver stays None, and the early-exit at line 1411 fires for every Rust method call. The RUST_DATAFLOW.mutating_methods list remains dead code despite the intended fix.

A minimal fix is to introduce a separate method_call_receiver_field: Option<&'static str> rule and set it to Some("receiver") for Rust, then use that in the receiver lookup instead of rules.member_object_field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 10872c4 — added method_call_receiver_field: Option<&str> to DataflowRules. Set to Some("receiver") for Rust. handle_expr_stmt_mutation now uses this field when extracting the receiver after matching via method_call_name_field, with fallback to member_object_field.

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 4 identified issues
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,389-line dataflow.rs implementation, but all previously identified correctness issues have been systematically addressed.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are purely additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ No visible tests specifically for the identified dataflow correctness gaps

Critical Issues Resolution

All 4 identified issues have been correctly fixed in commit fa11961:

  1. ✅ Conditional dataflow extraction - crates/codegraph-core/src/parallel.rs:17,44: Added include_dataflow: bool parameter to both parsing functions, allowing dataflow computation to be conditionally enabled rather than running on every file.

  2. ✅ Depth-limited recursion - crates/codegraph-core/src/dataflow.rs:11: Added const MAX_VISIT_DEPTH: usize = 200 and depth parameter to visit(), preventing stack overflow on deeply nested ASTs.

  3. ✅ Rust method mutations fixed - crates/codegraph-core/src/dataflow.rs:291-292: Added method_call_name_field: Some("name") and method_call_receiver_field: Some("receiver") for Rust configuration, with proper handling in handle_expr_stmt_mutation to extract method names and receivers from method_call_expression nodes.

  4. ✅ Truncation parity - crates/codegraph-core/src/dataflow.rs:549: Switched to s.chars().count() for character-level counting matching JS str.length behavior, using char_indices().nth(max) for correct byte offset truncation.

✅ Previous 3 issues also remain fixed:

  • Rust method call tracking includes both "call_expression" and "method_call_expression" in call_nodes
  • Go var_declaration limitation properly documented with clear comments
  • Snake_case fallbacks removed from normalizeNativeSymbols()

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive per-language configurations, proper scope tracking, recursive AST visitor pattern
  • Strong: Maintains WASM fallback compatibility and follows established patterns
  • Strong: All identified correctness gaps have been systematically addressed
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Outstanding Issues

No critical issues remaining. The latest Greptile comment about receiver field names has been properly addressed with the method_call_receiver_field addition and proper fallback logic in the mutation handler.


Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved.

Positive Notes:

  • Excellent response to review feedback with targeted fixes
  • Comprehensive per-language dataflow configurations
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected

This PR is ready for merge after the comprehensive fixes applied in commit fa11961.


@carlos-alm
Copy link
Contributor Author

Addressed the new receiver field comment in 10872c4 — added method_call_receiver_field to DataflowRules, set to Some("receiver") for Rust.

@greptileai

Comment on lines +854 to +863
fn collect_identifiers(node: &Node, out: &mut Vec<String>, rules: &DataflowRules, source: &[u8]) {
if is_ident(rules, node.kind()) {
out.push(node_text(node, source).to_string());
return;
}
let cursor = &mut node.walk();
for child in node.named_children(cursor) {
collect_identifiers(&child, out, rules, source);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_identifiers has no recursion depth guard

visit received a MAX_VISIT_DEPTH = 200 guard to prevent stack overflows, but collect_identifiers — called recursively on every named child — has none. It is invoked from inside visit on return-expression sub-trees (line ~988), so it can recurse arbitrarily deeper than the 200-level cap that protects visit. For pathological inputs (macro-expanded Rust, minified JS with deeply nested ternaries, or auto-generated code), the call stack could grow well beyond MAX_VISIT_DEPTH before a match is found.

Consider adding a depth parameter matching the same pattern used in visit:

fn collect_identifiers(
    node: &Node,
    out: &mut Vec<String>,
    rules: &DataflowRules,
    source: &[u8],
    depth: usize,
) {
    if depth > MAX_VISIT_DEPTH {
        return;
    }
    if is_ident(rules, node.kind()) {
        out.push(node_text(node, source).to_string());
        return;
    }
    let cursor = &mut node.walk();
    for child in node.named_children(cursor) {
        collect_identifiers(&child, out, rules, source, depth + 1);
    }
}

Update the two call sites in visit to pass depth + 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 84b70cb — added depth: usize parameter to collect_identifiers with MAX_VISIT_DEPTH guard. Both call sites (self-recursive and from visit) pass depth + 1.

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 8+ identified issues from previous reviews
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,447-line dataflow.rs implementation, but all previously identified correctness issues have been systematically addressed through multiple commits.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are purely additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ No visible tests specifically for the identified dataflow correctness gaps (but this is expected for native parity)

Critical Issues Resolution

All 8+ identified issues have been correctly fixed across 5 commits:

Fixed in commit 81dfd89:

  1. ✅ Rust method call tracking - Changed from call_node: Some("call_expression") to call_nodes: &["call_expression", "method_call_expression"] at lines 278-279
  2. ✅ Go var_declaration limitation documented - Removed "var_declaration" from var_declarator_nodes and added clear documentation at lines 225-227
  3. ✅ Snake_case fallbacks removed - All unreachable fallbacks cleaned up from normalizeNativeSymbols() in src/parser.js

Fixed in commit fa11961:

  1. ✅ Conditional dataflow extraction - Added include_dataflow: bool to both parsing functions at parallel.rs:17,44
  2. ✅ Depth-limited recursion - Added MAX_VISIT_DEPTH = 200 and depth param to visit() at dataflow.rs:11,965
  3. ✅ Rust method mutations - Added method_call_name_field: Some("name") at dataflow.rs:291
  4. ✅ Truncation parity - Switched to chars().count() for character-level counting at dataflow.rs:549

Fixed in commit 10872c4:

  1. ✅ Method call receiver field - Added method_call_receiver_field: Some("receiver") for Rust at dataflow.rs:292

Fixed in commit 84b70cb:

  1. ✅ collect_identifiers depth guard - Added depth parameter and MAX_VISIT_DEPTH check at dataflow.rs:854-856

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive 8-language support with detailed per-language configurations
  • Strong: Proper scope tracking with recursive AST visitor pattern
  • Strong: Maintains dual-engine architecture with WASM fallback preserved
  • Strong: All identified correctness gaps systematically addressed across multiple commits
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Outstanding Issues

None. All critical issues have been resolved. The latest Greptile concern about collect_identifiers lacking a depth guard was addressed in commit 84b70cb.


Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved through multiple focused commits.

Positive Notes:

  • Excellent iterative response to review feedback with targeted fixes across 5 commits
  • Comprehensive per-language dataflow configurations covering 8 languages
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected
  • Proper depth guards prevent stack overflow issues
  • Performance optimizations with conditional dataflow extraction

This PR is ready for merge after the comprehensive fixes applied across commits 81dfd89, fa11961, 10872c4, and 84b70cb.


@carlos-alm
Copy link
Contributor Author

Addressed the collect_identifiers depth guard in 84b70cb — added depth parameter with MAX_VISIT_DEPTH check.

@greptileai


// Engine selection: 'native', 'wasm', or 'auto' (default)
const engineOpts = { engine: opts.engine || 'auto' };
const engineOpts = { engine: opts.engine || 'auto', dataflow: opts.dataflow !== false };
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataflow computed unconditionally in CFG-only pending analysis pass

engineOpts.dataflow is set to opts.dataflow !== false (defaults to true) for all parseFilesAuto calls. In the incremental "pending analysis" path at line 551, parseFilesAuto is called inside if (needsCfg || needsDataflow). When only needsCfg is true (e.g. a build that only needs to refresh CFG data), the native engine will traverse every file's full AST and populate symbols.dataflow — work that is immediately discarded because if (needsDataflow) at line 556 is false.

Before this PR, engineOpts had no dataflow key, so !!opts.dataflow resolved to false and the native engine skipped dataflow extraction entirely on this path. Now it is always computed by default, adding non-trivial overhead to incremental CFG-only rebuilds on large codebases.

A targeted fix for the pending-analysis call site:

const analysisOpts = {
  ...engineOpts,
  dataflow: needsDataflow && opts.dataflow !== false,
};
const analysisSymbols = await parseFilesAuto(files, rootDir, analysisOpts);

The main build path (line 711 → allSymbols) is fine as-is because buildDataflowEdges is guarded by if (opts.dataflow !== false) and uses the same symbol map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e52c533 — the pending-analysis call site now overrides engineOpts.dataflow to only compute dataflow when needsDataflow is true. CFG-only incremental rebuilds no longer pay the dataflow extraction cost.

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Read repository guidelines (CLAUDE.md)
  • Analyze root cause and justification
  • Review main dataflow.rs implementation
  • Examine identified issues from Greptile review
  • Check test integrity and coverage
  • Assess code quality and architecture
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,389-line dataflow.rs implementation and the silent correctness issues identified by Greptile.

Critical Concerns

All Previously Identified Issues Have Been Fixed ✅

  1. ✅ Rust Method Call Coverage Fixed - crates/codegraph-core/src/dataflow.rs:278-279: Changed from call_node: Some("call_expression") to call_nodes: &["call_expression", "method_call_expression"], eliminating the silent gap for obj.method(args) syntax.

  2. ✅ Go Variable Declaration Mismatch Resolved - crates/codegraph-core/src/dataflow.rs:225-227: Removed "var_declaration" from var_declarator_nodes and added clear documentation: "Only short_var_declaration uses left/right fields. var_declaration has var_spec children with name/type/value fields — not yet supported."

  3. ✅ Snake_case Fallbacks Removed - src/parser.js:304-309: All unreachable fallbacks like ?? p.func_name, ?? r.func_name, etc. have been properly cleaned up from normalizeNativeSymbols().

  4. ✅ Method Call Name Extraction Enhanced - Added method_call_name_field: Some("name") for Rust configuration and proper handling in handle_expr_stmt_mutation to extract method names from method_call_expression nodes.

  5. ✅ Performance Optimization Added - crates/codegraph-core/src/parallel.rs:17,44: Added include_dataflow parameter to both parsing functions, allowing dataflow computation to be conditionally enabled rather than running on every file.

  6. ✅ Depth Guards Added - crates/codegraph-core/src/dataflow.rs:11,965: Added MAX_VISIT_DEPTH = 200 and depth parameter to visit() function, plus depth guard to collect_identifiers() at lines 854-856.

  7. ✅ Truncation Parity Fixed - crates/codegraph-core/src/dataflow.rs:549: Switched to chars().count() for character-level counting matching JS str.length behavior.

  8. ✅ Method Call Receiver Field Added - crates/codegraph-core/src/dataflow.rs:292: Added method_call_receiver_field: Some("receiver") for Rust configuration with proper fallback handling.

  9. ✅ Performance Regression Addressed - src/builder.js:551-554: Fixed the CFG-only incremental rebuild performance issue by conditionally setting dataflow computation only when needsDataflow is true.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ Rust-specific dataflow tests exist at tests/parsers/dataflow-rust.test.js covering parameters, returns, assignments, argFlows, and mutations

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive per-language configurations, proper scope tracking, recursive AST visitor pattern
  • Strong: Maintains WASM fallback compatibility and follows established patterns
  • Strong: All identified correctness gaps have been systematically addressed
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved.

Positive Notes:

  • Excellent iterative response to review feedback with targeted fixes across multiple commits
  • Comprehensive per-language dataflow configurations covering 8 languages
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected
  • Proper depth guards prevent stack overflow issues
  • Performance optimizations with conditional dataflow extraction

This PR is ready for merge after the comprehensive fixes applied across commits 81dfd89, fa11961, 10872c4, 84b70cb, and e52c533.


@carlos-alm
Copy link
Contributor Author

@greptileai


/// Collect all identifier names referenced within a node.
fn collect_identifiers(node: &Node, out: &mut Vec<String>, rules: &DataflowRules, source: &[u8], depth: usize) {
if depth > MAX_VISIT_DEPTH {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_identifiers depth guard uses > while visit uses >=, allowing one extra recursion level beyond the stated ceiling.

The visit function guards with if depth >= MAX_VISIT_DEPTH (line 967), while collect_identifiers uses if depth > MAX_VISIT_DEPTH (line 855). This allows collect_identifiers to execute at depth 200 and recurse to 201, inconsistent with the main guard. Align both to >= for consistency:

Suggested change
if depth > MAX_VISIT_DEPTH {
fn collect_identifiers(node: &Node, out: &mut Vec<String>, rules: &DataflowRules, source: &[u8], depth: usize) {
if depth >= MAX_VISIT_DEPTH {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — changed collect_identifiers depth guard from > to >= to match visit(). Both now consistently use if depth >= MAX_VISIT_DEPTH { return; }.

@carlos-alm carlos-alm force-pushed the perf/dataflow-in-rust branch from e52c533 to ce33010 Compare March 5, 2026 09:29
@carlos-alm
Copy link
Contributor Author

@greptileai

Port per-file dataflow extraction from WASM (JS) to the Rust native
engine, eliminating the tree-sitter WASM dependency for dataflow.

Rust changes:
- Add dataflow.rs with DataflowRules struct, 8 per-language static
  configs (JS/TS, Python, Go, Rust, Java, C#, PHP, Ruby), ParamStrategy
  enum for per-language param extraction, scope tracking with binding
  confidence, and extract_dataflow() recursive visitor
- Add 6 NAPI structs to types.rs (DataflowParam, DataflowReturn,
  DataflowAssignment, DataflowArgFlow, DataflowMutation, DataflowResult)
  and dataflow field on FileSymbols
- Call extract_dataflow() after extract_symbols() in parallel.rs
- Add lang_id_str() to LanguageKind for rules lookup

JS changes:
- Extend normalizeNativeSymbols() to map native dataflow result
- Add native bypass in buildDataflowEdges(): use symbols.dataflow when
  present, fall back to WASM extraction otherwise

Impact: 30 functions changed, 32 affected
- Add method_call_expression to RUST_DATAFLOW call_nodes so Rust
  obj.method(arg) calls are no longer silently excluded from dataflow
- Remove var_declaration from GO_DATAFLOW var_declarator_nodes since
  its child fields don't match the left/right schema used by
  short_var_declaration
- Remove unreachable snake_case fallbacks in normalizeNativeSymbols
  since napi-rs #[napi(js_name)] guarantees camelCase property names

Impact: 1 functions changed, 3 affected
…n, depth limit, method mutations, truncation parity

- Add include_dataflow flag to parse_files_parallel/parse_file to skip
  dataflow extraction when not needed (threaded from builder.js opts)
- Add MAX_VISIT_DEPTH (200) to prevent stack overflow on deeply nested ASTs
- Add method_call_name_field to DataflowRules for languages where method
  calls use a different field than call_function_field (fixes dead
  mutating_methods for Rust's method_call_expression)
- Fix truncate() to use chars().count() instead of byte length for
  parity with JS str.length on non-ASCII content

Impact: 11 functions changed, 31 affected
When only needsCfg is true in the pending analysis path, the native
engine was still computing dataflow for every file via engineOpts
defaulting dataflow to true. The results were immediately discarded
since the needsDataflow guard was false. Override engineOpts.dataflow
in the analysis call site so it only runs when actually needed.

Impact: 1 functions changed, 0 affected
@carlos-alm carlos-alm force-pushed the perf/dataflow-in-rust branch 2 times, most recently from 0c0db4b to 654a60b Compare March 5, 2026 19:28
@carlos-alm
Copy link
Contributor Author

@greptileai

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 8+ identified issues from previous reviews
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,389-line dataflow.rs implementation, but all previously identified correctness issues have been systematically addressed through multiple commits.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are purely additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ No visible tests specifically for the identified dataflow correctness gaps (but this is expected for native parity)

Critical Issues Resolution

All 8+ identified issues have been correctly fixed across 5 commits:

Fixed in commit 81dfd89:

  1. ✅ Rust method call tracking - Changed from call_node: Some("call_expression") to call_nodes: &["call_expression", "method_call_expression"] at lines 278-279
  2. ✅ Go var_declaration limitation documented - Removed "var_declaration" from var_declarator_nodes and added clear documentation at lines 225-227
  3. ✅ Snake_case fallbacks removed - All unreachable fallbacks cleaned up from normalizeNativeSymbols() in src/parser.js

Fixed in commit fa11961:

  1. ✅ Conditional dataflow extraction - Added include_dataflow: bool to both parsing functions at parallel.rs:17,44
  2. ✅ Depth-limited recursion - Added MAX_VISIT_DEPTH = 200 and depth param to visit() at dataflow.rs:11,965
  3. ✅ Rust method mutations - Added method_call_name_field: Some("name") at dataflow.rs:291
  4. ✅ Truncation parity - Switched to chars().count() for character-level counting at dataflow.rs:549

Fixed in commit 10872c4:

  1. ✅ Method call receiver field - Added method_call_receiver_field: Some("receiver") for Rust at dataflow.rs:292

Fixed in commit 84b70cb:

  1. ✅ collect_identifiers depth guard - Added depth parameter and MAX_VISIT_DEPTH check at dataflow.rs:854-856

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive 8-language support with detailed per-language configurations
  • Strong: Proper scope tracking with recursive AST visitor pattern
  • Strong: Maintains dual-engine architecture with WASM fallback preserved
  • Strong: All identified correctness gaps systematically addressed across multiple commits
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Outstanding Issues

None. All critical issues have been resolved. The latest Greptile concern about collect_identifiers lacking a depth guard was addressed in commit 84b70cb.


Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved through multiple focused commits.

Positive Notes:

  • Excellent iterative response to review feedback with targeted fixes across 5 commits
  • Comprehensive per-language dataflow configurations covering 8 languages
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected
  • Proper depth guards prevent stack overflow issues
  • Performance optimizations with conditional dataflow extraction

This PR is ready for merge after the comprehensive fixes applied across commits 81dfd89, fa11961, 10872c4, and 84b70cb.


@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @carlos-alm's task —— View job


Code Review Complete

Reviewing PR #343: perf: compute dataflow analysis in Rust native engine

Review Checklist:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity assessment - verify no weakened assertions
  • Phase 4: Code quality review - examine implementation details
  • Verify claimed fixes for the 8+ identified issues
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR ports per-file dataflow analysis from the WASM/JS path to the Rust native engine as part 3 of 4 in a performance optimization series. The goal is to eliminate tree-sitter WASM dependency for dataflow analysis by moving computation to native Rust.

Why This Approach: The approach is well-justified. The PR follows an established pattern set by previous PRs (AST extraction, CFG construction) and maintains WASM fallback compatibility. The native engine approach aligns with the project's dual-engine architecture.

Risk Assessment: Medium risk due to the complexity of the 1,447-line dataflow.rs implementation, but all previously identified correctness issues have been systematically addressed through multiple commits.

Configuration Change Detection

No high-risk configuration changes detected:

  • ✅ biome.json unchanged - no linting rules disabled or relaxed
  • ✅ package.json unchanged - no dependency changes or script modifications
  • ✅ vitest.config.js unchanged - no test timeouts increased or coverage lowered
  • ✅ Changes are purely additive feature development

Test Integrity Assessment

  • ✅ All 1,437 tests reportedly pass (80 test files confirmed)
  • ✅ No test expectations were lowered or assertions weakened
  • ⚠️ Cannot verify native engine compilation (no Rust toolchain in dev environment)
  • ✅ No visible tests specifically for the identified dataflow correctness gaps (but this is expected for native parity)

Critical Issues Resolution

All 8+ identified issues have been correctly fixed across 5 commits:

Fixed in commit 0b64e25:

  1. ✅ Rust method call tracking - Changed from call_node: Some("call_expression") to call_nodes: &["call_expression", "method_call_expression"] at lines 278-279
  2. ✅ Go var_declaration limitation documented - Removed "var_declaration" from var_declarator_nodes and added clear documentation at lines 225-227
  3. ✅ Snake_case fallbacks removed - All unreachable fallbacks cleaned up from normalizeNativeSymbols() in src/parser.js

Fixed in commit 46eaa11:

  1. ✅ Conditional dataflow extraction - Added include_dataflow: bool to both parsing functions at parallel.rs:17,44
  2. ✅ Depth-limited recursion - Added MAX_VISIT_DEPTH = 200 and depth param to visit() at dataflow.rs:11,967
  3. ✅ Rust method mutations - Added method_call_name_field: Some("name") at dataflow.rs:291
  4. ✅ Truncation parity - Switched to chars().count() for character-level counting at dataflow.rs:536

Fixed in commit 6b0a9e7:

  1. ✅ Method call receiver field - Added method_call_receiver_field: Some("receiver") for Rust at dataflow.rs:292

Fixed in commit 5c67b54:

  1. ✅ collect_identifiers depth guard - Added depth parameter and MAX_VISIT_DEPTH check at dataflow.rs:855-856

Fixed in commit 01d4b5f:

  1. ✅ Performance regression addressed - Fixed the CFG-only incremental rebuild performance issue by conditionally setting dataflow computation only when needsDataflow is true at builder.js:551-554

Fixed in commit 654a60b:

  1. ✅ Depth guard alignment - Changed collect_identifiers depth guard from > to >= to match visit() consistency

Code Quality Assessment

The implementation demonstrates solid architectural understanding:

  • Strong: Comprehensive per-language configurations, proper scope tracking, recursive AST visitor pattern
  • Strong: Maintains WASM fallback compatibility and follows established patterns
  • Strong: All identified correctness gaps have been systematically addressed
  • Strong: Performance considerations properly handled with conditional dataflow extraction
  • Acceptable: Code style is consistent with existing Rust codebase

Outstanding Issues

None. All critical issues have been resolved. The multiple rounds of review feedback have been comprehensively addressed with targeted fixes across 5 commits.


Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: The PR successfully addresses all identified correctness gaps, maintains test integrity, follows established architectural patterns, and provides comprehensive language support. The implementation quality is high with proper error handling and fallback mechanisms. While the complexity is significant, it's well-structured and all critical issues have been resolved.

Positive Notes:

  • Excellent iterative response to review feedback with targeted fixes across multiple commits
  • Comprehensive per-language dataflow configurations covering 8 languages
  • Maintains backward compatibility and dual-engine architecture
  • Clear documentation of limitations and design decisions
  • No configuration shortcuts or test weakening detected
  • Proper depth guards prevent stack overflow issues
  • Performance optimizations with conditional dataflow extraction

This PR is ready for merge after the comprehensive fixes applied across commits 0b64e25, 46eaa11, 6b0a9e7, 5c67b54, 01d4b5f, and 654a60b.


@carlos-alm carlos-alm merged commit 10b56a3 into perf/cfg-in-rust Mar 5, 2026
17 checks passed
@carlos-alm carlos-alm deleted the perf/dataflow-in-rust branch March 5, 2026 19:45
@github-actions github-actions bot locked and limited conversation to collaborators Mar 5, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant