Skip to content

Fix crash from asio callback during Client::close with pending operations#551

Open
BewareMyPower wants to merge 8 commits intoapache:mainfrom
BewareMyPower:bewaremypower/fix-connect-crash
Open

Fix crash from asio callback during Client::close with pending operations#551
BewareMyPower wants to merge 8 commits intoapache:mainfrom
BewareMyPower:bewaremypower/fix-connect-crash

Conversation

@BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Mar 9, 2026

Fixes: #550

Motivation

ClientTest.testCloseClient is flaky that sometimes it could cause segmentation fault. This is because client.close() could be called quickly after createProducerAsync. In this case, the async_connect operation might not complete. In the close method, socket_->close() is called, which could complete the async_connect with operation_aborted. However, after closing the ConnectionPool, the ClientConnection object, as well as its internal socket_ object, will be destroyed. In asio, when the completion handler (callback) is called while the caller (socket) is destroyed, it will crash because the callback refers the caller internally.

Capturing socket_ in the callback to extend socket's lifetime is not enough, because after closing the pool, the io_context object will be stopped by its stop method in ExecutorService. Then the io_context's internal queue will be destroyed, if the callback is triggered after that, it could access an destroyed object and cause

Modifications

  • In the close method, when the state is Pending, i.e. the TCP connection is not done, don't close the socket. Instead, delay the socket close to the callback (handleTcpConnected).
  • Track the number of pending operations except for timers. Then wait all pending operations to be completed when closing the connection pool. Use 5 seconds as the timeout in case of any deadlock.
  • Fix the thread safety when accessing isClosed because state_ could be modified by close() method in a different thread.

… crash

ASIO's async_connect (range_connect_op) holds a raw reference to the socket
and calls socket.close() internally between endpoint retry attempts. If
ClientConnection is destroyed before those internal calls complete (i.e. the
socket_ shared_ptr drops to zero), that raw reference becomes dangling and
causes a segfault.

Fix by capturing socket_ as a shared_ptr in the async_connect completion
handler. Since ASIO stores the completion handler as part of the operation
object (range_connect_op), this keeps the socket alive for the entire
lifetime of the async_connect operation, including all internal retry logic.

Fixes: apache#550

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BewareMyPower BewareMyPower added this to the 4.1.0 milestone Mar 9, 2026
@BewareMyPower BewareMyPower self-assigned this Mar 9, 2026
@BewareMyPower BewareMyPower added the bug Something isn't working label Mar 9, 2026
@BewareMyPower BewareMyPower marked this pull request as draft March 9, 2026 03:58
@BewareMyPower BewareMyPower changed the title Fix: keep socket alive during async_connect to prevent use-after-free crash Fix crash from asio completion handler during Client::close with pending operations Mar 9, 2026
@BewareMyPower BewareMyPower changed the title Fix crash from asio completion handler during Client::close with pending operations Fix crash from asio callback during Client::close with pending operations Mar 9, 2026
@BewareMyPower BewareMyPower marked this pull request as ready for review March 9, 2026 13:11
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a crash caused by ASIO completion handlers running during/after Client::close() when there are still in-flight network operations (notably pending async_connect), which can lead to use-after-free of socket / io_context internals during shutdown.

Changes:

  • Adds tracking of in-flight (non-timer) async operations in ClientConnection and exposes a pendingOperations() counter.
  • Adjusts connection shutdown behavior to avoid closing the socket while the TCP connect is still pending, deferring close handling to handleTcpConnected.
  • Makes ConnectionPool::close() wait (up to 5s) for each connection’s pending operations to complete before clearing the pool.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/PulsarFriend.h Uses C++17 lock-guard deduction when locking ClientConnection internals in tests.
lib/MockServer.h Updates mutex locking to use C++17 lock-guard deduction for consistency.
lib/ConnectionPool.cc Waits for pending async ops per connection during pool shutdown to reduce shutdown-time crashes.
lib/ClientConnection.h Introduces pendingOperations_ counter, exposes accessor, and changes mutex type to recursive_mutex.
lib/ClientConnection.cc Instruments connect/handshake/read/write paths for pending-op tracking and modifies shutdown/connect completion handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Client.close might crash

2 participants