Add subinterpreter parallelism with OWN_GIL support #11

Open
benoitc wants to merge 30 commits into main from feature/py-subinterpreter-parallelism

@benoitc benoitc commented Mar 5, 2026

Summary

  • Add Python 3.12+ subinterpreter support with per-interpreter GIL (OWN_GIL)
  • Implement process-per-context architecture for true N-way parallelism
  • Add event loop isolation per subinterpreter
  • Fix GIL handling for OWN_GIL subinterpreter creation/destruction

Performance

With Python 3.14 and subinterpreter mode:

| Benchmark             | Latency  | Throughput |
|-----------------------|----------|------------|
| Sync py:call          | 0.003 ms | 394K/s     |
| Sync py:eval          | 0.007 ms | 136K/s     |
| Cast single           | 0.004 ms | 265K/s     |
| Concurrent (50 procs) | 9.5 ms   | 105K/s     |

Key Changes

  • py_context.erl: Process-per-context gen_server
  • py_nif.c: Subinterpreter creation with OWN_GIL, proper GIL release
  • py_event_loop.c: Per-interpreter event loop isolation
  • Renamed call_async to cast for clarity

Implement a general-purpose worker thread pool that eliminates per-request
GIL acquisition overhead. Each worker holds the GIL (or has its own
subinterpreter with OWN_GIL on Python 3.12+) and processes requests from
a shared MPSC queue.

Key features:
- Sync API: call, apply, eval, exec, asgi_run, wsgi_run
- Async API: all *_async variants returning request_id for non-blocking calls
- await/1,2 for waiting on async results
- Per-worker module caching to avoid reimport overhead
- Support for FREE_THREADED (3.13+), SUBINTERP (3.12+), and FALLBACK modes
- Fix potential crash when locals_term is uninitialized (check for 0)
- Add benchmark results directory with baseline comparisons

Known issue: ~0.5-1% of concurrent sync calls may time out under high
load (100+ concurrent callers). Async API unaffected.

1. Use-after-free on request_id: Save request_id BEFORE enqueueing
   the request to the worker pool. Once enqueued, a worker can
   process and free the request at any time. Accessing req->request_id
   after py_pool_enqueue() is undefined behavior.

2. Double-free of msg_env: After a successful enif_send(), the message
   environment is consumed/invalidated by the Erlang runtime. We must
   set req->msg_env = NULL to prevent py_pool_request_free() from
   calling enif_free_env() on an already-freed environment.

These bugs caused ~0.5-1% of concurrent calls to time out under high load
because request IDs could be corrupted, leading to message/response
mismatch.

Also adds debug counters (responses_sent, responses_failed) to pool stats
for monitoring send success rate.

Changed py_pool_process_asgi to call run_asgi(module_name, callable_name,
scope, body) instead of run(app, scope, body), matching hornbeam's
hornbeam_asgi_runner interface.

Also updated extract_asgi_response to handle both dict and tuple return
formats, supporting hornbeam's dict-based response.

- Add compile-time detection of PyInterpreterConfig_OWN_GIL (Python 3.12+)
- Add mutex to py_subinterp_worker_t for thread-safe parallel access
- Add nif_subinterp_asgi_run for ASGI on subinterpreters
- Add py_resource_pool module with lock-free round-robin scheduling
- Benchmark shows 8-10x improvement with subinterpreters enabled

Replace worker pool with process-per-context model where each Python context
is owned by a dedicated Erlang process. Enables reentrant callbacks via
suspension-based mechanism without deadlock.

- Add py_context.erl with recursive receive pattern for inline callback handling
- Add py_context_router.erl for scheduler-affinity based routing
- Add nif_context_resume for Python replay with cached callback results
- Support sequential callbacks via callback_results array accumulation
- Remove old pool modules (py_pool, py_worker, py_worker_pool, etc.)
- Pass timeout parameter through py:eval/3 and do_call/5
- Add py:contexts_started/0 and py_context_router:is_started/0
- Fix test_timeout to use time.sleep for reliable delay
- Fix thread callback suite to check existing contexts
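
The suspension-and-replay idea behind nif_context_resume and the callback_results array can be modelled in plain Python. This is only an illustrative sketch, not the NIF implementation: `Suspend`, `run_with_replay`, and the `handlers` dict are hypothetical names, and the dict stands in for the Erlang process that services each callback. The function is re-run from the start after every serviced callback, with earlier results served from the accumulated list, so it assumes the Python code is deterministic up to each callback.

```python
class Suspend(Exception):
    """Raised when Python needs a callback result that is not cached yet."""

    def __init__(self, name, call_args):
        super().__init__(name)
        self.name = name
        self.call_args = call_args


def run_with_replay(fn, handlers):
    callback_results = []  # grows by one entry per serviced callback
    while True:
        cursor = 0  # position in the cached results for this replay

        def callback(name, *args):
            nonlocal cursor
            if cursor < len(callback_results):
                # Replay: answer from the cache instead of suspending again.
                value = callback_results[cursor]
                cursor += 1
                return value
            raise Suspend(name, args)

        try:
            return fn(callback)
        except Suspend as pending:
            # The owner (Erlang, in the real system) services the callback,
            # stores the result, and the function is replayed from the top.
            result = handlers[pending.name](*pending.call_args)
            callback_results.append(result)


def handler(callback):
    a = callback("add", 1, 2)   # first callback
    b = callback("add", a, 10)  # second, sequential callback
    return b * 2


result = run_with_replay(handler, {"add": lambda x, y: x + y})
```

Here `handler` is executed three times in total: twice it suspends on an uncached callback, and the third run completes using both cached results.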
When the application restarts, py_thread_handler registers as the new
coordinator, but existing thread workers in the NIF-level pool still
had has_handler=true from the previous run. This caused them to skip
spawning new handler processes and write to dead pipes.

Reset has_handler=false on all existing workers when a new coordinator
is registered.

Two fixes:

1. suspended_context_state_destructor: For subinterpreters with OWN_GIL,
   use PyThreadState_Swap to switch to the correct interpreter before
   releasing Python objects. PyGILState_Ensure only works for the main
   interpreter and causes memory corruption with subinterpreter objects.

2. thread_worker_set_coordinator: Reset has_handler=false on all existing
   workers when a new coordinator registers (e.g., after app restart).
   Old workers kept has_handler=true but their handler processes were dead.

- Rename priv/erlang/ to priv/_erlang_impl/ to avoid C module shadowing
- Add _extend_erlang_module() helper in py_callback.c to re-export
  Python package functions (run, new_event_loop, EventLoopPolicy, etc.)
- Update py_event_loop.erl to call extension during initialization
- Delete buggy erlang_asyncio.py (blocking sleep replaced by proper
  asyncio.sleep backed by Erlang timers via call_later)
- Add test infrastructure in priv/tests/ for event loop integration

The unified erlang module now provides a uvloop-compatible API:
- erlang.run(coro) - run async code with Erlang event loop
- erlang.new_event_loop() - create ErlangEventLoop instance
- erlang.install() - install ErlangEventLoopPolicy (deprecated 3.12+)
- erlang.call() / erlang.async_call() - call Erlang functions
- asyncio.sleep() works via Erlang timers
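
A minimal usage sketch of the API listed above, assuming erlang.run() accepts a coroutine and returns its result the way uvloop.run() does. The fallback to asyncio.run is an addition for illustration so the snippet also runs outside the Erlang-hosted interpreter, where the erlang module is not importable.

```python
import asyncio

try:
    import erlang  # expected to exist only inside the Erlang-hosted interpreter
except ImportError:
    erlang = None

# Assumption: fall back to the stock runner when erlang.run is unavailable.
run = getattr(erlang, "run", None) or asyncio.run


async def main():
    # Under erlang.run this sleep would be backed by Erlang timers.
    await asyncio.sleep(0.01)
    return "done"


result = run(main())
```
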
- Update py_erlang_sleep_SUITE to use erlang.run() with standard asyncio
  instead of the removed erlang_asyncio module
- Skip py_asyncio_compat_SUITE: tests create standalone ErlangEventLoop
  instances via erlang.new_event_loop() and call loop.run_forever().
  Timer scheduling for standalone loops needs work - timers fire
  immediately instead of after the scheduled delay.

- Add isolated parameter to ErlangEventLoop.__init__() that creates
  a per-loop capsule via _loop_new() for proper event routing
- Update all loop methods (call_at, _run_once, stop, close, add_reader,
  remove_reader, add_writer, remove_writer) to use per-loop capsule APIs
  when running as isolated instance
- new_event_loop() now passes isolated=True by default
- Fix run_forever() to honor stop() called before run_forever() by not
  resetting _stopping flag at start
- Simplify async_test_runner to run tests synchronously without
  erlang.run() wrapper, avoiding nested event loop issues
- Add timeout fallback to test_add_remove_writer to prevent hanging
- Remove skip from py_asyncio_compat_SUITE to enable tests

Test results: 46 tests run, 42 passed, 4 failures (edge cases)
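
The run_forever() fix above matches the behavior of the stock asyncio loop, where a stop() issued before run_forever() makes the loop run a single iteration and return. The sketch below shows that reference behavior using only the standard library; it does not exercise ErlangEventLoop itself.

```python
import asyncio

loop = asyncio.new_event_loop()
ran = []

loop.call_soon(ran.append, "callback")  # already scheduled before the loop starts
loop.stop()                             # stop requested before run_forever()
loop.run_forever()                      # runs one iteration, then returns

still_running = loop.is_running()
loop.close()
```

Callbacks that were scheduled before the call still run once, which is why resetting the stopping flag at the start of run_forever() would turn this into an endless loop.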
The pthread+usleep polling async workers have been replaced with an
event-driven model using py_event_loop and enif_select:

- Add _run_and_send wrapper in Python for result delivery via erlang.send()
- Add nif_event_loop_run_async NIF for direct coroutine submission
- Add py_event_loop:run_async/2 Erlang API
- Add py_event_loop_pool.erl for managing event loop-based async execution
- Rewrite py_async_pool.erl to delegate to event_loop_pool
- Update supervisor tree to include py_event_loop_pool
- Remove py_async_worker.erl and py_async_worker_sup.erl
- Stub deprecated async_worker NIFs to return errors
- Remove async_event_loop_thread and async_future_callback C code

Performance improvements:
- Latency: ~10-20ms polling -> <1ms (enif_select)
- CPU idle: 100 wakeups/sec -> Zero
- Threads: N pthreads -> 0 extra threads

API unchanged: py:async_call/3,4 and py:await/1,2 work the same.
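
The result-delivery wrapper can be sketched as follows. This is a guess at the shape of _run_and_send, not its actual code: the `send` parameter stands in for erlang.send(), and the `("async_result", ref, ...)` message layout, `fake_send`, and `mailbox` are hypothetical names used only for this illustration.

```python
import asyncio


async def run_and_send(coro, send, caller, ref):
    """Await coro and push its outcome to caller instead of being polled."""
    try:
        value = await coro
    except Exception as exc:
        send(caller, ("async_result", ref, ("error", repr(exc))))
    else:
        send(caller, ("async_result", ref, ("ok", value)))


mailbox = []


def fake_send(pid, message):
    # Stand-in for erlang.send(pid, message).
    mailbox.append((pid, message))


async def add(a, b):
    return a + b


async def boom():
    raise ValueError("bad input")


async def demo():
    await asyncio.gather(
        run_and_send(add(1, 2), fake_send, "caller_pid", 1),
        run_and_send(boom(), fake_send, "caller_pid", 2),
    )


asyncio.run(demo())
```

Because the coroutine itself sends the result when it finishes, nothing has to wake up periodically to check for completion, which is the point of replacing the usleep polling threads.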
Replace global variables with module state structure stored in the
Python module, enabling proper per-interpreter/per-context event
loop isolation.

Changes:
- Add py_event_loop_module_state_t struct containing event_loop,
  shared_router, shared_router_valid, and isolation_mode
- Update PyModuleDef to allocate module state (m_size)
- Update get_interpreter_event_loop() to read from module state
- Update set_interpreter_event_loop() to write to module state
- Update nif_set_python_event_loop() to use module state
- Update nif_set_isolation_mode() to use module state
- Update nif_set_shared_router() to use module state
- Update py_get_isolation_mode() to read from module state
- Update py_loop_new() to read shared_router from module state
- Update event_loop_destructor() to clear module state
- Update create_default_event_loop() to use module state
- Remove g_python_event_loop, g_shared_router, g_shared_router_valid,
  and g_isolation_mode global variables

- Remove erlang_loop.py, use _erlang_impl as the single implementation
- Add get_event_loop_policy() export to _erlang_impl and erlang module
- Fix signal tests: ErlangEventLoop has limited signal support (SIGINT,
  SIGTERM, SIGHUP only), other signals raise ValueError
- Skip subprocess tests for Erlang (not yet implemented)
- Update all imports to use erlang module (public API) with _erlang_impl
  as internal fallback
- Update docs and examples to use erlang module imports
- test_run_until_complete_nested_raises: Use asyncio.sleep(0.1) to ensure
  timer path (not fast path), properly close coroutine in finally block
- test_run_until_complete_on_closed_raises: Store coroutine in variable
  and close it in finally block
- tearDown: Cancel pending tasks and shutdown async generators before
  closing loop to prevent resource leaks
- Add test_asyncio_sleep_zero_fast_path: Verify sleep(0) uses fast path
- test_add_remove_writer: Use socketpair for reliable write readiness
- Share fd_resource per fd to prevent enif_select stealing errors
- Add NIF functions for fd resource management
- Use send() instead of sendto() for connected UDP sockets
- Fix TCP EOF handling to call connection_lost properly

await coro() runs in shared context (changes visible to caller),
while create_task(coro()) runs in copied context (changes isolated).
Updated test_context_in_task and test_multiple_context_vars to
reflect correct Python behavior.
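
The difference can be reproduced with the standard library alone. The variable and function names below are illustrative and are not taken from the test suite.

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id", default="unset")


async def set_request_id(value):
    request_id.set(value)


async def main():
    # Awaited directly: runs in the caller's context, so the change is visible.
    await set_request_id("from-await")
    after_await = request_id.get()

    # Wrapped in a task: runs in a copy of the context, so the change is isolated.
    await asyncio.create_task(set_request_id("from-task"))
    after_task = request_id.get()

    return after_await, after_task


after_await, after_task = asyncio.run(main())
```

Both values end up as "from-await": the task's assignment never reaches the caller.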
Subprocess is not supported because Python's subprocess module uses
fork(), which corrupts the Erlang VM when called from within the NIF.

Users should use Erlang ports directly via erlang.call() instead,
which provides superior subprocess management with built-in
supervision, monitoring, and fault tolerance.

Changes:
- Replace _subprocess.py with NotImplementedError stub and docs
- Remove subprocess event handling from _loop.py
- Remove subprocess functions from py_event_loop.c
- Update tests to verify NotImplementedError is raised
- Set HAS_SUBPROCESS_SUPPORT = False in test base

ETF encoding for pids and references:
- Add decode_etf_string() helper in py_callback.c to convert
  __etf__:base64 encoded strings back to Erlang terms
- Add ETF encoding in term_to_python_repr for pids and refs
  in py_context.erl and py_thread_handler.erl
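
To make the string format concrete, here is a small Python sketch of the `__etf__:base64` convention. It assumes the text after the prefix is the base64 of a term_to_binary/1 payload; the helper names are hypothetical, and unlike decode_etf_string() in py_callback.c this sketch only recovers the raw bytes rather than rebuilding an Erlang term.

```python
import base64

ETF_PREFIX = "__etf__:"
ETF_VERSION = 131  # first byte of every external term format payload


def tag_etf_payload(payload: bytes) -> str:
    """Wrap raw ETF bytes in the prefixed, base64-encoded string form."""
    return ETF_PREFIX + base64.b64encode(payload).decode("ascii")


def extract_etf_payload(value: str):
    """Return the raw ETF bytes, or None if value is an ordinary string."""
    if not value.startswith(ETF_PREFIX):
        return None
    payload = base64.b64decode(value[len(ETF_PREFIX):], validate=True)
    if not payload or payload[0] != ETF_VERSION:
        raise ValueError("not an external term format payload")
    return payload


# One valid encoding of the atom ok: version byte, SMALL_ATOM_UTF8_EXT (119),
# length 2, then the atom text. Pids and refs use other tags but the same
# version byte.
atom_ok = bytes([131, 119, 2]) + b"ok"
tagged = tag_etf_payload(atom_ok)
```
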

Test fixes:
- Skip ProcessPoolExecutor test inside Erlang NIF (fork issues)
- Use 'spawn' multiprocessing context instead of 'fork'
- Accept OSError in addition to TimeoutError for connect timeout test

Cleanup:
- Remove obsolete multi_loop test files

Implement low-level fd-based API where Erlang handles I/O scheduling
via enif_select and Python handles protocol logic.

- Add priv/_erlang_impl/_reactor.py with Protocol base class and registry
- Add src/py_reactor_context.erl for Erlang reactor context process
- Expose erlang.reactor via sys.modules for 'import erlang.reactor' syntax
- Add test suite (py_reactor_SUITE.erl) with 6 tests
- Add Python tests (py_test_reactor.py) with 3 tests
- Add examples/reactor_echo.erl as usage example

Works with any fd - TCP, UDP, Unix sockets, pipes, etc.

- Add _sandbox.py with Python audit hooks (PEP 578) to block dangerous
  operations: fork, exec, spawn, subprocess, os.system, os.popen
- Install sandbox automatically when running inside Erlang VM
- Remove signal handling support (not applicable in Erlang context)
- Update policy to always return ErlangEventLoop
- Fix ExecutionMode test to check correct enum values
- Remove signal tests and AIO subprocess tests from test suite
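
A PEP 578 audit hook of the kind described above could look roughly like this. It is a sketch under assumptions, not the contents of _sandbox.py: the event set, the exception type, and the function names are illustrative. The event names themselves are standard CPython audit events (os.popen is covered through subprocess.Popen).

```python
import sys

# Audit events raised by CPython for process-creating operations.
BLOCKED_EVENTS = frozenset({
    "os.fork",
    "os.forkpty",
    "os.exec",
    "os.posix_spawn",
    "os.spawn",
    "os.system",
    "subprocess.Popen",
})


def sandbox_hook(event, args):
    """Raise for blocked events; every other audit event passes through."""
    if event in BLOCKED_EVENTS:
        raise RuntimeError(f"{event} is blocked inside the Erlang VM")


def install_sandbox():
    # Audit hooks cannot be removed once added, so call this only in the
    # embedded interpreter: install_sandbox()
    sys.addaudithook(sandbox_hook)
```

After install_sandbox() has run, a call such as os.system("ls") raises before any process is created, because the interpreter fires the audit event first.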
New documentation:
- docs/security.md: Document audit hook sandbox, blocked operations
  (fork, exec, subprocess), and Erlang port alternatives
- docs/reactor.md: Document erlang.reactor module for FD-based
  protocol handling with Protocol base class and examples

Updated documentation:
- docs/asyncio.md: Update for unified erlang module, mark
  erlang.install() as deprecated in 3.12+, add Limitations section
  for subprocess/signal handling, add ExecutionMode documentation
- docs/getting-started.md: Add Security Considerations section,
  update asyncio section to use erlang.run()
- README.md: Add security sandbox to features, add doc links

Also fixed edoc errors in source files:
- src/py_nif.erl: Fix angle bracket syntax in reactor function docs
- src/py_context_router.erl: Replace markdown code blocks with <pre>

API change: py:call_async/3,4 renamed to py:cast/3,4 following
gen_server convention (call=sync, cast=async).

Add benchmark_compare.erl for comparing performance between versions.
Current version shows ~2-3x improvement over v1.8.1:
- Sync calls: 0.011ms -> 0.004ms (2.9x faster)
- Cast single: 0.011ms -> 0.004ms (2.8x faster)
- Throughput: ~90K -> ~250K calls/sec

Covers:
- py:call_async -> py:cast rename
- py:bind/unbind removal (use py_context_router)
- py:ctx_* removal (use py_context directly)
- erlang_asyncio -> erlang module consolidation
- Subprocess removal (use Erlang ports)
- Signal handler removal (use Erlang level)
- New features: context router, reactor, erlang.send()
- Performance comparison table

Each subinterpreter context now gets its own event worker for asyncio
support. This ensures asyncio.sleep() and timers work correctly in
subinterpreter contexts.

Changes:
- Add nif_context_get_event_loop/1 NIF to retrieve event loop reference
- Create dedicated event worker per subinterpreter context in py_context
- Extend erlang module with run/new_event_loop in each subinterpreter
- Handle EXIT signals properly (shutdown from supervisor vs normal exits)
- Initialize event loop for worker pool subinterpreters

Worker mode contexts (Python < 3.12) continue to use the shared router.

test_memory_stats and test_reload use modules (tracemalloc) that don't
support Python subinterpreters. Skip these tests when running with
subinterpreter support enabled (Python 3.12+).

- Add atomic runtime state machine (UNINIT->INITING->RUNNING->SHUTTING_DOWN->STOPPED)
- Convert volatile flags to _Atomic for thread safety
- Add NIF guards to reject work when not RUNNING
- Fix destructor memory corruption for OWN_GIL subinterpreters
- Add enif_keep_resource/release for ctx in suspended states
- Add debug counters NIF for runtime diagnostics
- Add CI sanitizer builds (ASan, TSan, UBSan)
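
The state machine and the "reject work when not RUNNING" guard can be modelled in a few lines of Python. This is an analogue for illustration only: the real code uses C `_Atomic` values, whereas here a lock emulates the compare-and-swap, and the class name, method names, and error tuple are hypothetical.

```python
import enum
import threading


class RuntimeState(enum.Enum):
    UNINIT = 0
    INITING = 1
    RUNNING = 2
    SHUTTING_DOWN = 3
    STOPPED = 4


class Runtime:
    def __init__(self):
        self._state = RuntimeState.UNINIT
        self._lock = threading.Lock()  # stands in for C11 atomics

    def transition(self, expected, new):
        """Compare-and-swap: move to new only if currently in expected."""
        with self._lock:
            if self._state is not expected:
                return False
            self._state = new
            return True

    def submit(self, work):
        """Guard: run work only while the runtime is RUNNING."""
        with self._lock:
            running = self._state is RuntimeState.RUNNING
        if not running:
            return ("error", "not_running")
        return ("ok", work())


rt = Runtime()
before = rt.submit(lambda: 1)  # rejected: still UNINIT
rt.transition(RuntimeState.UNINIT, RuntimeState.INITING)
rt.transition(RuntimeState.INITING, RuntimeState.RUNNING)
during = rt.submit(lambda: 1)  # accepted
second_init = rt.transition(RuntimeState.UNINIT, RuntimeState.INITING)  # loses the race
rt.transition(RuntimeState.RUNNING, RuntimeState.SHUTTING_DOWN)
after = rt.submit(lambda: 1)   # rejected again
```

The compare-and-swap form means that two threads racing to initialize or shut down cannot both succeed, which plain volatile flags do not guarantee.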
benoitc force-pushed the feature/py-subinterpreter-parallelism branch from 6bf70a1 to dd577e0 on March 5, 2026 10:22