Skip to content

perf: extract AST nodes for all languages in native engine#340

Open
carlos-alm wants to merge 1 commit intomainfrom
perf/ast-nodes-all-langs
Open

perf: extract AST nodes for all languages in native engine#340
carlos-alm wants to merge 1 commit intomainfrom
perf/ast-nodes-all-langs

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Add shared walk_ast_nodes_with_config() to Rust helpers.rs with per-language AST node type configs (LangAstConfig struct)
  • All 7 non-JS extractors (Python, Go, Rust, Java, C#, Ruby, PHP) now call walk_ast_nodes_with_config() during extract(), producing astNodes natively
  • buildAstNodes() in ast.js now checks symbols.astNodes first for all languages, falling back to WASM tree walk only for JS/TS/TSX when no native data is present

Language coverage

Language new throw await string regex
Python - raise_statement await string -
Go - - - interpreted_string_literal, raw_string_literal -
Rust - - await_expression string_literal, raw_string_literal -
Java object_creation_expression throw_statement - string_literal -
C# object_creation_expression throw_statement, throw_expression await_expression string_literal, verbatim_string_literal -
Ruby - - - string regex
PHP object_creation_expression throw_expression - string, encapsed_string -

Expected performance impact

Once the native binary is compiled with these changes:

  • astMs phase: ~651ms → ~50ms (DB inserts only, no WASM tree walk needed)
  • This is PR 1 of 4 in the double-parse elimination plan

Test plan

  • All 1435 existing tests pass (0 regressions)
  • New JS-side tests verify buildAstNodes accepts astNodes for non-JS files (.py, .java)
  • New native parity tests for all 7 languages (skipped locally, run in CI with native binary)
  • CI: build-native.yml compiles successfully with new Rust code
  • CI: Native parity tests pass with compiled binary

Add walk_ast_nodes_with_config() to helpers.rs with per-language
configs for node type mappings (new, throw, await, string, regex).
Each non-JS extractor now calls this during extract(), producing
astNodes for Python, Go, Rust, Java, C#, Ruby, and PHP.

On the JS side, buildAstNodes() now checks symbols.astNodes first
(all languages) before falling back to WASM tree walk (JS/TS/TSX
only). This eliminates the WASM dependency for AST extraction when
using the native engine.

Expected impact: astMs drops from ~651ms to ~50ms (DB inserts only)
for native builds once the binary is compiled.

Impact: 14 functions changed, 11 affected
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: AST Nodes Native Extraction

I'm conducting a thorough review of this performance optimization PR following the structured methodology:

Review Progress

  • Read repository guidelines and context
  • Phase 1: Root cause analysis - understand the performance problem
  • Phase 2: Configuration change detection - check for rule relaxations
  • Phase 3: Test integrity - ensure tests aren't weakened
  • Phase 4: Code quality review - examine implementation
  • Provide structured assessment and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The current system has a "double-parse" performance bottleneck where AST nodes (new, throw, await, string, regex) are extracted twice:

  1. Once in the native Rust engine during symbol extraction
  2. Again in the JS layer by walking the tree-sitter AST via WASM

This results in ~651ms spent in the astMs phase, with most time wasted on redundant WASM tree walking.

Why This Approach: The solution eliminates duplication by:

  • Adding shared walk_ast_nodes_with_config() function in Rust helpers.rs with language-specific configurations
  • All 7 non-JS extractors now call this function during extract(), producing native astNodes
  • JS buildAstNodes() checks for pre-extracted symbols.astNodes first, falling back to WASM only for JS/TS/TSX when no native data exists

Justification: This is a legitimate performance optimization addressing real inefficiency. Expected improvement: ~651ms → ~50ms (92% reduction). This is part 1 of 4 in a systematic double-parse elimination plan.

Phase 2: Configuration Change Detection ✅

No configuration changes detected. All modified files are implementation code:

  • 8 Rust extractor files (minimal one-line additions)
  • 1 JavaScript file (src/ast.js - logic reordering)
  • 1 new test file (comprehensive test coverage)

No linting rules, build configs, test thresholds, or other risky configuration changes.

Phase 3: Test Integrity ✅

Test changes are strengthening, not weakening:

  • New comprehensive test file: tests/parsers/ast-all-langs.test.js (452 lines)
  • Tests existing functionality: Verifies buildAstNodes() accepts native astNodes for non-JS files
  • Tests new functionality: When native engine available, verifies each language produces correct AST node kinds
  • Smart conditional skipping: Uses describe.skipIf(!canTestMultiLangAst) - skips native tests when binary unavailable rather than failing
  • No existing tests modified: All 1435 existing tests pass with 0 regressions

The test approach is proper - adding coverage for new functionality without relaxing existing assertions.

Phase 4: Code Quality Review ✅

Implementation Quality: High

Positive Aspects:

  1. Solid architecture: Language-agnostic walk_ast_nodes_with_config() with per-language configurations is well-designed
  2. Proper error handling: Uses unwrap_or("") and unwrap_or("?") for safe text extraction
  3. Secure text handling: truncate() function uses proper Unicode char boundary checks (lines 85-89)
  4. Minimal invasiveness: Each extractor adds only 1 line: walk_ast_nodes_with_config(&tree.root_node(), source, &mut symbols.ast_nodes, &LANG_AST_CONFIG);
  5. Backwards compatibility: Fallback mechanism preserves existing WASM behavior for JS/TS/TSX
  6. Language coverage accurate: Configurations match the coverage table in PR description

🔍 Minor Observations:

  1. Performance improvement unverified: Cannot confirm 651ms→50ms without running benchmarks, but logic appears sound
  2. String extraction heuristics: Language configs look reasonable but are based on tree-sitter grammar assumptions
  3. Memory usage: Creates additional astNodes vectors but this shouldn't impact memory significantly

🎯 Specific Technical Validation:

  • Thread safety: ✅ Functions are pure with no shared mutable state
  • Memory safety: ✅ Proper Rust ownership patterns, no unsafe code
  • SQL injection: ✅ JS side uses parameterized queries only
  • Input validation: ✅ All node text is sanitized through truncate()
  • Fallback behavior: ✅ Graceful degradation to WASM for unsupported scenarios

📋 Language Configuration Accuracy:

  • Python: raise_statementthrow
  • Go: interpreted_string_literal, raw_string_literal
  • Java: object_creation_expressionnew, throw_statement
  • C#: Includes both throw_statement and throw_expression, await_expression
  • Ruby: string and regex support ✅
  • PHP: object_creation_expression, encapsed_string

Root Cause Analysis

Problem Being Solved: Legitimate performance bottleneck in double AST parsing
Why This Approach: Systematic elimination of redundant WASM tree walking
Risk Assessment: Low risk - backwards compatible with comprehensive test coverage

Critical Concerns

None identified. This is a well-implemented performance optimization with proper testing.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality performance optimization that eliminates genuine inefficiency. Implementation is clean, well-tested, and backwards-compatible. The only reason it's not 5 stars is that the performance claims cannot be independently verified without CI benchmarks, but the architectural approach is sound.

CI Dependencies: Success depends on build-native.yml compiling successfully with new Rust code and native parity tests passing.


@greptile-apps
Copy link

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR moves AST node extraction (new, throw, await, string, regex) from JavaScript/WASM to the native Rust engine for 7 non-JS languages (Python, Go, Rust, Java, C#, Ruby, PHP), as part 1 of 4 in the double-parse elimination optimization plan.

Key changes:

  • Added shared walk_ast_nodes_with_config() function in helpers.rs with language-specific configs via LangAstConfig struct
  • All 7 non-JS extractors now call this function during extraction, populating astNodes natively
  • buildAstNodes() in ast.js now checks symbols.astNodes first for all languages, falling back to WASM tree walk only for JS/TS/TSX
  • Comprehensive test coverage for all languages with native parity tests (CI-gated)

Expected performance impact: ~651ms → ~50ms for the astMs phase (DB inserts only, eliminates WASM tree walk overhead)

Issue found:

  • C# verbatim string handling may incorrectly strip @ characters from string content when they appear at the end (e.g., @"test@"test instead of test@) due to bidirectional quote character trimming

Confidence Score: 4/5

  • Safe to merge with one edge case to address in follow-up
  • Well-structured implementation following existing patterns with comprehensive test coverage. The C# verbatim string quote handling issue is an edge case that's unlikely to affect most real-world code, but should be fixed. All 1435 existing tests pass with no regressions.
  • Pay attention to crates/codegraph-core/src/extractors/helpers.rs for the C# quote handling edge case

Important Files Changed

Filename Overview
crates/codegraph-core/src/extractors/helpers.rs Adds shared walk_ast_nodes_with_config() and language-specific AST configs for extracting new/throw/await/string/regex nodes; minor edge case with C# verbatim string quote trimming
src/ast.js Inverts logic to check symbols.astNodes first (from native extraction) for all languages before falling back to WASM tree walk for JS/TS/TSX
tests/parsers/ast-all-langs.test.js Comprehensive test coverage for native AST node extraction across all 7 non-JS languages with fixtures and assertions for each supported node type

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Parse file with tree-sitter] --> B{Language type?}
    B -->|Non-JS languages| C[Native Rust extractor]
    B -->|JS/TS/TSX| D[Native Rust extractor]
    
    C --> E[walk_node for symbols]
    C --> F[walk_ast_nodes_with_config]
    F --> G[Extract new/throw/await/string/regex]
    G --> H[Populate symbols.astNodes]
    
    D --> I[walk_node for symbols]
    D --> J[walk_ast_nodes for JS]
    J --> K[Populate symbols.astNodes]
    
    H --> L[buildAstNodes in ast.js]
    K --> L
    
    L --> M{symbols.astNodes exists?}
    M -->|Yes| N[Use native astNodes directly]
    M -->|No, JS/TS/TSX| O[Fallback: WASM tree walk]
    M -->|No, other lang| P[Skip AST nodes]
    
    N --> Q[Insert into DB]
    O --> Q
    P --> Q
    
    style F fill:#90EE90
    style G fill:#90EE90
    style H fill:#90EE90
    style N fill:#FFD700
    style O fill:#FFB6C1
Loading

Last reviewed commit: 6bc4a05

await_types: &["await_expression"],
string_types: &["string_literal", "verbatim_string_literal"],
regex_types: &[],
quote_chars: &['"', '@'],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C# verbatim strings (@"text") may lose @ characters from the string content itself. For example, @"test@" would incorrectly become test instead of test@ because trim_end_matches removes all trailing @ and " characters. The @ is a prefix modifier, not a delimiter, so it should only be stripped from the start. Consider handling verbatim strings separately or removing @ from quote_chars.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant