Add Northflank launcher and runner for GPU job execution #456

Draft
Champ-Goblem wants to merge 1 commit into gpu-mode:main from nf-testing:feature/northflank-runner

Conversation

@Champ-Goblem

Implement Northflank integration for running kernel benchmarks on managed GPU infrastructure with object storage result delivery.

Files:

  • northflank-runner.py: Container entrypoint that parses compressed config from env vars, executes benchmarks, and uploads results to object storage for retrieval
  • northflank.py: NorthflankLauncher that triggers jobs via REST API, polls for completion, and downloads results from storage
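
To illustrate the trigger-then-poll flow described for northflank.py, here is a minimal sketch of a polling loop with a deadline. It is not the code in this PR: `poll_until_done`, the `get_status` callable, and the status names in `TERMINAL_STATES` are all hypothetical stand-ins, and the real launcher would wrap an actual Northflank REST call where `get_status` is used.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real API's status names may differ.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(
    get_status: Callable[[], str],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it returns a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting, which may help with the test coverage a new launcher needs.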

Features:

  • Configurable repo URL and branch for testing
  • Timeout management based on submission mode
  • Compressed payload encoding for config transfer
  • Environment-based storage configuration
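
As a rough illustration of the compressed payload encoding, the sketch below round-trips a config dict through JSON, zlib, and base64 so that it fits in a single environment variable. The exact scheme is an assumption made for the example, and `encode_payload` and `decode_payload` are hypothetical names, not necessarily what the PR implements.

```python
import base64
import json
import zlib


def encode_payload(config: dict) -> str:
    """Serialize config to compact JSON, compress it, and base64-encode the result."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decode_payload(payload: str) -> dict:
    """Reverse encode_payload: base64-decode, decompress, and parse the JSON."""
    raw = zlib.decompress(base64.b64decode(payload))
    return json.loads(raw.decode("utf-8"))
```

Base64 keeps the value ASCII-only, which is the safe choice for an environment variable; the launcher would call `encode_payload` and the runner `decode_payload`.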

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Champ-Goblem <cameron@northflank.com>
@Champ-Goblem Champ-Goblem marked this pull request as draft March 4, 2026 21:48
@msaroufim
Member

msaroufim commented Mar 4, 2026

Thanks @Champ-Goblem! This is a good first pass, but a few things are missing before it is ready.

On the launcher itself

  1. In src/kernelbot/main.py and src/libkernelbot/consts.py you still need to add the Northflank GPUs and the Northflank backend, respectively; otherwise I'm not sure this actually tests a Northflank launcher end to end.
  2. In src/kernelbot/env.py we want to add the Northflank environment variables as well.
  3. We are also leaking information about the benchmark infra, so we want to call del os.environ["PAYLOAD"] immediately after the payload is read.
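
For the third point, a minimal sketch of what I mean: read the payload and drop it from the environment in one step, so that subprocesses spawned afterwards do not inherit it. `take_payload` is a hypothetical helper name, and the `env` parameter exists only to make the function easy to test.

```python
import os
from typing import MutableMapping


def take_payload(env: MutableMapping[str, str] = os.environ, key: str = "PAYLOAD") -> str:
    """Return the payload and remove it from env in the same step."""
    try:
        # pop() reads and deletes together, so there is no window where
        # the value has been read but is still present in the environment.
        return env.pop(key)
    except KeyError:
        raise RuntimeError(f"{key} is not set") from None
```

Calling this first thing in the runner entrypoint, before any user code is imported or executed, is the important part.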

Once the new app is up, you should be able to send requests to it via our API; Claude Code has all the right skills in the repo to figure this out. As is, this code isn't testing our launcher and is more of a smoke test, so we need to do something like this

Gaps

  1. A big omission, and one I think is key to showcasing the platform, is strong resource isolation guarantees: if a node has 8 GPUs, then we need the ability to queue 8 concurrent jobs where each job gets 1/8 of the total CPU cores and RAM. This might be implicit in the machine setup, but it's important enough to discuss here. When I clicked on the runner in the UI, I only saw it mention 1 GPU.
  2. Ideally, I'd really like to see some self-serve instructions for how we expect to onboard new machines. GitHub runners, for instance, make you wget a script and then run ./run.sh, so ideally this should be just as simple.
  3. We're not making it clear what dependencies run on the target machine. For instance, with both the AMD and NVIDIA GitHub workflows, we specify a requirements.txt and often a Dockerfile.
  4. It's not clear to me what workflow Northflank would be running. For instance, say we run 2 concurrent competitions: in the GitHub world we have a new workflow file per hardware target, whereas with the current integration we lose some flexibility.
  5. I believe profiler data is currently ignored.
  6. Some tests, especially since this is new and I worry we'll break it for you.
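
To make the resource isolation point in item 1 concrete, here is a small sketch of the arithmetic I have in mind. The node sizes are made up for illustration, and `per_job_share` and `JobResources` are hypothetical names, not anything in the repo or in Northflank's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResources:
    gpus: int
    cpu_cores: int
    memory_gib: int


def per_job_share(
    node_gpus: int, node_cpu_cores: int, node_memory_gib: int, gpus_per_job: int = 1
) -> JobResources:
    """Split a node evenly so every concurrent job gets the same CPU and RAM slice."""
    if node_gpus <= 0 or gpus_per_job <= 0:
        raise ValueError("GPU counts must be positive")
    if node_gpus % gpus_per_job != 0:
        raise ValueError("gpus_per_job must divide node_gpus evenly")
    slots = node_gpus // gpus_per_job  # number of jobs that can run concurrently
    return JobResources(
        gpus=gpus_per_job,
        cpu_cores=node_cpu_cores // slots,
        memory_gib=node_memory_gib // slots,
    )


# Hypothetical 8-GPU node with 96 cores and 1024 GiB of RAM.
print(per_job_share(8, 96, 1024))
# JobResources(gpus=1, cpu_cores=12, memory_gib=128)
```

Whatever the actual numbers are, the point is that the limits should be enforced per job rather than left to whichever submission grabs the most cores.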

@msaroufim msaroufim requested review from S1ro1, msaroufim and ngc92 and removed request for S1ro1 March 4, 2026 22:19

2 participants