perf: extract AST nodes for all languages in native engine by carlos-alm · Pull Request #340 · optave/codegraph

carlos-alm · 2026-03-04T09:15:44Z

Summary

Add shared walk_ast_nodes_with_config() to Rust helpers.rs with per-language AST node type configs (LangAstConfig struct)
All 7 non-JS extractors (Python, Go, Rust, Java, C#, Ruby, PHP) now call walk_ast_nodes_with_config() during extract(), producing astNodes natively
buildAstNodes() in ast.js now checks symbols.astNodes first for all languages, falling back to WASM tree walk only for JS/TS/TSX when no native data is present

Language coverage

Language	new	throw	await	string	regex
Python	-	`raise_statement`	`await`	`string`	-
Go	-	-	-	`interpreted_string_literal`, `raw_string_literal`	-
Rust	-	-	`await_expression`	`string_literal`, `raw_string_literal`	-
Java	`object_creation_expression`	`throw_statement`	-	`string_literal`	-
C#	`object_creation_expression`	`throw_statement`, `throw_expression`	`await_expression`	`string_literal`, `verbatim_string_literal`	-
Ruby	-	-	-	`string`	`regex`
PHP	`object_creation_expression`	`throw_expression`	-	`string`, `encapsed_string`	-

Expected performance impact

Once the native binary is compiled with these changes:

astMs phase: ~651ms → ~50ms (DB inserts only, no WASM tree walk needed)
This is PR 1 of 4 in the double-parse elimination plan

Test plan

All 1435 existing tests pass (0 regressions)
New JS-side tests verify buildAstNodes accepts astNodes for non-JS files (.py, .java)
New native parity tests for all 7 languages (skipped locally, run in CI with native binary)
CI: build-native.yml compiles successfully with new Rust code
CI: Native parity tests pass with compiled binary

Add walk_ast_nodes_with_config() to helpers.rs with per-language configs for node type mappings (new, throw, await, string, regex). Each non-JS extractor now calls this during extract(), producing astNodes for Python, Go, Rust, Java, C#, Ruby, and PHP. On the JS side, buildAstNodes() now checks symbols.astNodes first (all languages) before falling back to WASM tree walk (JS/TS/TSX only). This eliminates the WASM dependency for AST extraction when using the native engine. Expected impact: astMs drops from ~651ms to ~50ms (DB inserts only) for native builds once the binary is compiled. Impact: 14 functions changed, 11 affected

claude · 2026-03-04T09:15:59Z

Claude finished @carlos-alm's task —— View job

PR Review: AST Nodes Native Extraction

I'm conducting a thorough review of this performance optimization PR following the structured methodology:

Review Progress

Read repository guidelines and context
Phase 1: Root cause analysis - understand the performance problem
Phase 2: Configuration change detection - check for rule relaxations
Phase 3: Test integrity - ensure tests aren't weakened
Phase 4: Code quality review - examine implementation
Provide structured assessment and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The current system has a "double-parse" performance bottleneck where AST nodes (new, throw, await, string, regex) are extracted twice:

Once in the native Rust engine during symbol extraction
Again in the JS layer by walking the tree-sitter AST via WASM

This results in ~651ms spent in the astMs phase, with most time wasted on redundant WASM tree walking.

Why This Approach: The solution eliminates duplication by:

Adding shared walk_ast_nodes_with_config() function in Rust helpers.rs with language-specific configurations
All 7 non-JS extractors now call this function during extract(), producing native astNodes
JS buildAstNodes() checks for pre-extracted symbols.astNodes first, falling back to WASM only for JS/TS/TSX when no native data exists

Justification: This is a legitimate performance optimization addressing real inefficiency. Expected improvement: ~651ms → ~50ms (92% reduction). This is part 1 of 4 in a systematic double-parse elimination plan.

Phase 2: Configuration Change Detection ✅

No configuration changes detected. All modified files are implementation code:

8 Rust extractor files (minimal one-line additions)
1 JavaScript file (src/ast.js - logic reordering)
1 new test file (comprehensive test coverage)

No linting rules, build configs, test thresholds, or other risky configuration changes.

Phase 3: Test Integrity ✅

Test changes are strengthening, not weakening:

New comprehensive test file: tests/parsers/ast-all-langs.test.js (452 lines)
Tests existing functionality: Verifies buildAstNodes() accepts native astNodes for non-JS files
Tests new functionality: When native engine available, verifies each language produces correct AST node kinds
Smart conditional skipping: Uses describe.skipIf(!canTestMultiLangAst) - skips native tests when binary unavailable rather than failing
No existing tests modified: All 1435 existing tests pass with 0 regressions

The test approach is proper - adding coverage for new functionality without relaxing existing assertions.

Phase 4: Code Quality Review ✅

Implementation Quality: High

✅ Positive Aspects:

Solid architecture: Language-agnostic walk_ast_nodes_with_config() with per-language configurations is well-designed
Proper error handling: Uses unwrap_or("") and unwrap_or("?") for safe text extraction
Secure text handling: truncate() function uses proper Unicode char boundary checks (lines 85-89)
Minimal invasiveness: Each extractor adds only 1 line: walk_ast_nodes_with_config(&tree.root_node(), source, &mut symbols.ast_nodes, &LANG_AST_CONFIG);
Backwards compatibility: Fallback mechanism preserves existing WASM behavior for JS/TS/TSX
Language coverage accurate: Configurations match the coverage table in PR description

🔍 Minor Observations:

Performance improvement unverified: Cannot confirm 651ms→50ms without running benchmarks, but logic appears sound
String extraction heuristics: Language configs look reasonable but are based on tree-sitter grammar assumptions
Memory usage: Creates additional astNodes vectors but this shouldn't impact memory significantly

🎯 Specific Technical Validation:

Thread safety: ✅ Functions are pure with no shared mutable state
Memory safety: ✅ Proper Rust ownership patterns, no unsafe code
SQL injection: ✅ JS side uses parameterized queries only
Input validation: ✅ All node text is sanitized through truncate()
Fallback behavior: ✅ Graceful degradation to WASM for unsupported scenarios

📋 Language Configuration Accuracy:

Python: raise_statement → throw ✅
Go: interpreted_string_literal, raw_string_literal ✅
Java: object_creation_expression → new, throw_statement ✅
C#: Includes both throw_statement and throw_expression, await_expression ✅
Ruby: string and regex support ✅
PHP: object_creation_expression, encapsed_string ✅

Root Cause Analysis

Problem Being Solved: Legitimate performance bottleneck in double AST parsing
Why This Approach: Systematic elimination of redundant WASM tree walking
Risk Assessment: Low risk - backwards compatible with comprehensive test coverage

Critical Concerns

None identified. This is a well-implemented performance optimization with proper testing.

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: High-quality performance optimization that eliminates genuine inefficiency. Implementation is clean, well-tested, and backwards-compatible. The only reason it's not 5 stars is that the performance claims cannot be independently verified without CI benchmarks, but the architectural approach is sound.

CI Dependencies: Success depends on build-native.yml compiling successfully with new Rust code and native parity tests passing.

greptile-apps · 2026-03-04T09:20:31Z

Greptile Summary

This PR moves AST node extraction (new, throw, await, string, regex) from JavaScript/WASM to the native Rust engine for 7 non-JS languages (Python, Go, Rust, Java, C#, Ruby, PHP), as part 1 of 4 in the double-parse elimination optimization plan.

Key changes:

Added shared walk_ast_nodes_with_config() function in helpers.rs with language-specific configs via LangAstConfig struct
All 7 non-JS extractors now call this function during extraction, populating astNodes natively
buildAstNodes() in ast.js now checks symbols.astNodes first for all languages, falling back to WASM tree walk only for JS/TS/TSX
Comprehensive test coverage for all languages with native parity tests (CI-gated)

Expected performance impact: ~651ms → ~50ms for the astMs phase (DB inserts only, eliminates WASM tree walk overhead)

Issue found:

C# verbatim string handling may incorrectly strip @ characters from string content when they appear at the end (e.g., @"test@" → test instead of test@) due to bidirectional quote character trimming

Confidence Score: 4/5

Safe to merge with one edge case to address in follow-up
Well-structured implementation following existing patterns with comprehensive test coverage. The C# verbatim string quote handling issue is an edge case that's unlikely to affect most real-world code, but should be fixed. All 1435 existing tests pass with no regressions.
Pay attention to crates/codegraph-core/src/extractors/helpers.rs for the C# quote handling edge case

Important Files Changed

Filename	Overview
crates/codegraph-core/src/extractors/helpers.rs	Adds shared `walk_ast_nodes_with_config()` and language-specific AST configs for extracting new/throw/await/string/regex nodes; minor edge case with C# verbatim string quote trimming
src/ast.js	Inverts logic to check `symbols.astNodes` first (from native extraction) for all languages before falling back to WASM tree walk for JS/TS/TSX
tests/parsers/ast-all-langs.test.js	Comprehensive test coverage for native AST node extraction across all 7 non-JS languages with fixtures and assertions for each supported node type

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Parse file with tree-sitter] --> B{Language type?}
    B -->|Non-JS languages| C[Native Rust extractor]
    B -->|JS/TS/TSX| D[Native Rust extractor]
    
    C --> E[walk_node for symbols]
    C --> F[walk_ast_nodes_with_config]
    F --> G[Extract new/throw/await/string/regex]
    G --> H[Populate symbols.astNodes]
    
    D --> I[walk_node for symbols]
    D --> J[walk_ast_nodes for JS]
    J --> K[Populate symbols.astNodes]
    
    H --> L[buildAstNodes in ast.js]
    K --> L
    
    L --> M{symbols.astNodes exists?}
    M -->|Yes| N[Use native astNodes directly]
    M -->|No, JS/TS/TSX| O[Fallback: WASM tree walk]
    M -->|No, other lang| P[Skip AST nodes]
    
    N --> Q[Insert into DB]
    O --> Q
    P --> Q
    
    style F fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style N fill:#FFD700
    style O fill:#FFB6C1

_{Last reviewed commit: 6bc4a05}

greptile-apps · 2026-03-04T09:20:35Z

crates/codegraph-core/src/extractors/helpers.rs

+    await_types: &["await_expression"],
+    string_types: &["string_literal", "verbatim_string_literal"],
+    regex_types: &[],
+    quote_chars: &['"', '@'],


C# verbatim strings (@"text") may lose @ characters from the string content itself. For example, @"test@" would incorrectly become test instead of test@ because trim_end_matches removes all trailing @ and " characters. The @ is a prefix modifier, not a delimiter, so it should only be stripped from the start. Consider handling verbatim strings separately or removing @ from quote_chars.

greptile-apps bot reviewed Mar 4, 2026

View reviewed changes

carlos-alm mentioned this pull request Mar 4, 2026

perf: compute CFG in Rust native engine for all languages #342

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: extract AST nodes for all languages in native engine#340

perf: extract AST nodes for all languages in native engine#340
carlos-alm wants to merge 1 commit intomainfrom
perf/ast-nodes-all-langs

carlos-alm commented Mar 4, 2026

Uh oh!

claude bot commented Mar 4, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 4, 2026

Uh oh!

greptile-apps bot Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented Mar 4, 2026

Summary

Language coverage

Expected performance impact

Test plan

Uh oh!

claude bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: AST Nodes Native Extraction

Review Progress

Phase 1: Root Cause Analysis ✅

Phase 2: Configuration Change Detection ✅

Phase 3: Test Integrity ✅

Phase 4: Code Quality Review ✅

✅ Positive Aspects:

🔍 Minor Observations:

🎯 Specific Technical Validation:

📋 Language Configuration Accuracy:

Root Cause Analysis

Critical Concerns

Final Recommendation

Uh oh!

greptile-apps bot commented Mar 4, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Mar 4, 2026 •

edited

Loading