fix: refresh upstream tokens transparently instead of forcing re-auth#4036
fix: refresh upstream tokens transparently instead of forcing re-auth#4036aron-muon wants to merge 12 commits intostacklok:mainfrom
Conversation
efdbe42 to
cd44bbf
Compare
cd44bbf to
0dd2c1c
Compare
The upstreamswap middleware returned 401 when upstream access tokens expired, forcing users through full re-authentication even though valid refresh tokens existed in storage. This happened because: 1. Redis/memory storage TTL was set to access token expiry, deleting the entry (and refresh token) when the access token expired 2. Storage returned nil on ErrExpired, discarding the refresh token 3. The middleware had no refresh path — only 401 Fix all three layers: - Add DefaultRefreshTokenTTL (30 days) to storage entry TTL so refresh tokens survive past access token expiry - Return token data alongside ErrExpired from storage so callers can use the refresh token - Add UpstreamTokenRefresher interface and implementation that wraps the upstream OAuth2Provider and storage - Plumb the refresher through Server → EmbeddedAuthServer → Runner → MiddlewareRunner - Update upstreamswap middleware to attempt refresh before returning 401, only requiring re-auth when the refresh token itself fails Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0dd2c1c to
ab9807b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4036 +/- ##
==========================================
+ Coverage 68.63% 68.70% +0.06%
==========================================
Files 446 447 +1
Lines 45435 45525 +90
==========================================
+ Hits 31185 31276 +91
+ Misses 11842 11835 -7
- Partials 2408 2414 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add comprehensive tests for RefreshAndStore (6 cases) and middleware refresh paths (4 cases: successful refresh, failed refresh, no refresh token, defense-in-depth expired-without-error). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix step numbering: renumber step 6 → 5 after step 5 removal - Update redis integration test: assert returned token data is non-nil on ErrExpired, consistent with the unit test contract - Fix test closures: pass subtest t to setupStorage/setupProvider to ensure assertion failures are attributed to the correct subtest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
16bddba to
d1415a7
Compare
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Wrap upstream token refresh in singleflight.Group keyed on sessionID to collapse concurrent refreshes into one call. Prevents providers with single-use refresh tokens from failing all but the first concurrent caller. Added TestSingleFlightRefresh_ConcurrentRequests to verify the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Thanks for the review! I included a fix for each nit and also added a fix for the multiple concurrent requests issue. |
jhrozek
left a comment
There was a problem hiding this comment.
Third pass: reviewing the singleflight addition. Nice implementation overall — two findings from the new code.
|
two more nits and then we'll be good to go 🚀 |
Co-authored-by: Jakub Hrozek <jakub.hrozek@posteo.se>
3a74255 to
0a6fb9f
Compare
Nice comments! Thank you! |
Summary
Users are forced to fully re-authenticate with upstream OAuth providers every time the upstream access token expires (controlled by
accessTokenLifespan), even though valid refresh tokens exist in storage.The root cause is that upstream access tokens and refresh tokens are stored together in a single storage entry, with the entry's TTL/expiry set to the access token's expiry. When the access token expires, the entry is deleted (Redis) or marked expired (memory) — losing the refresh token, which is typically valid for 30-90 days or longer depending on the provider. The
upstreamswapmiddleware has no refresh path and returns 401, forcing full re-authentication.This affects both Redis and in-memory storage backends — Redis deletes the key via TTL, and memory storage's cleanup goroutine removes the expired entry.
Type of change
Root cause
The storage model bundles access + refresh tokens in a single entry (
UpstreamTokensstruct). The entry TTL is derived from the access token'sExpiresAt, so when the access token expires:GetUpstreamTokensreturnsnil, ErrExpired— discarding the token dataIdeally, access and refresh tokens would be stored separately with independent TTLs matching their actual lifetimes. This fix extends the bundled entry's TTL as a pragmatic solution; a future refactor could separate them for cleaner lifecycle management.
Changes
Storage layer (
pkg/authserver/storage/)DefaultRefreshTokenTTL(30 days) in both Redis and memory storage, so refresh tokens survive past access token expiryGetUpstreamTokensto return token data alongsideErrExpired(instead ofnil) so callers can use the refresh tokenExpiresAt(access token expiry) rather than the entry'sexpiresAt(storage TTL) for the expired checkToken refresher (
pkg/authserver/refresher.go)UpstreamTokenRefresherinterface instorage/types.goupstream.OAuth2Provider.RefreshTokens()+UpstreamTokenStorage.StoreUpstreamTokens()Plumbing
Server→EmbeddedAuthServer→Runner→MiddlewareRunnerusing the same lazy accessor pattern asGetUpstreamTokenStorageMiddleware (
pkg/auth/upstreamswap/)getOrRefreshUpstreamTokenshelper to keep cyclomatic complexity under lint thresholdProduction validation
Deployed to a production cluster with Redis (AWS Valkey) storage (we use a sentinel emulator which basically just returns the Valkey URL in each case. One of the little clever ways we could use Valkey). All four upstream providers successfully refreshed tokens transparently — no user re-authentication required:
cf.mcp.atlassian.com/v1/tokenapp.asana.com/-/oauth_tokenslack-gov.com/api/oauth.v2.accessoauth2.googleapis.com/tokenRedis TTLs confirmed updated to ~30 days (previously ~1 hour). GitHub has not yet expired its 8-hour access token but uses the same code path.
Test plan
ErrExpiredDoes this introduce a user-facing change?
Yes — upstream OAuth sessions now persist beyond the access token lifetime. Users will no longer be forced to re-authenticate as long as their upstream refresh token is valid (typically 30 days to indefinite depending on the provider).
Large PR Justification
Generated with Claude Code