Skip to content

sandboxset: thread-safe sandbox pool (issue #217)#425

Draft
AmiBuch wants to merge 8 commits intoopen-lambda:mainfrom
AmiBuch:main
Draft

sandboxset: thread-safe sandbox pool (issue #217)#425
AmiBuch wants to merge 8 commits intoopen-lambda:mainfrom
AmiBuch:main

Conversation

@AmiBuch
Copy link

@AmiBuch AmiBuch commented Mar 3, 2026

sandboxset: thread-safe sandbox pool (issue #217)

Problem

LambdaInstance uses one goroutine per sandbox, sitting idle on channels
between requests. A SandboxSet replaces that with a mutex-protected pool —
callers just ask for a sandbox and don't worry about whether it's new or
reused.

API

GetOrCreateUnpaused returns a *SandboxRef — a handle that wraps the
sandbox with a health state and back-pointer to its parent set. The caller
uses ref.Sandbox() to access the container, then calls ref.Put() or
ref.Destroy() when done.

set, _ := sandboxset.New(&sandboxset.Config{
    Pool:        myPool,
    CodeDir:     "/path/to/lambda",
    ScratchDirs: myScratchDirs,
})

ref, _ := set.GetOrCreateUnpaused()   // create or reuse a sandbox
sb := ref.Sandbox()                    // access the underlying container
// ... handle request using sb.Client() ...

if broken {
    ref.Destroy("reason")             // kill a broken sandbox
} else {
    ref.Put()                          // return it to the pool
}

set.Close()                            // tear down everything

Set-level Put(sb) and Destroy(sb, reason) methods are also available
for callers that have a raw sandbox.Sandbox instead of a SandboxRef.

SandboxRef

type SandboxRef struct {
    State SandboxState  // StateReady or StateBroken
}

func (r *SandboxRef) Sandbox() sandbox.Sandbox  // access the container
func (r *SandboxRef) Put() error                 // return to pool
func (r *SandboxRef) Destroy(reason string) error // destroy and remove

Flow

     [created]
         |
         v
     [paused]  <---+
         |         |
         v         |
     [in-use]  ----+  (Put)
         |
         v
   [destroyed]     (Destroy / Close / error)
  • GetOrCreateUnpaused claims an idle sandbox and unpauses it. If the
    container died while paused, it is destroyed and the next idle one is
    tried. If no idle sandbox exists, a new one is created via
    SandboxPool.Create. Returns a *SandboxRef with State: StateReady.
  • Put pauses the sandbox and returns it to the pool. If Pause fails,
    the sandbox is destroyed — a bad sandbox never re-enters the pool.
  • Destroy removes a sandbox from the pool and frees its resources.
    The sandbox is always destroyed even if it wasn't in the pool.
  • Close destroys all sandboxes. Callers still holding references
    will find them already dead, which is safe per the Sandbox contract.

Edge cases handled

  • Mass unpause failure: GetOrCreateUnpaused loops (not recursion)
    over idle sandboxes, destroying dead ones until a live one is found or
    a new one is created. No stack-growth risk.
  • DirMaker panic: DirMaker.Make panics on disk-full.
    makeScratchDir() wraps it in defer/recover and returns an error.
  • Put after Close: Returns a clear error message indicating the set
    was closed, rather than a confusing "Pause failed" error.

File structure

go/worker/sandboxset/
    api.go           — SandboxSet interface, Config, New()
    sandboxset.go    — All implementation: types + methods
    tests/
        sandboxset_test.go              — Unit tests (MockSandboxPool)
        sandboxset_integration_test.go  — Integration tests (real DockerPool)

Dependencies

sandboxset is a thin layer on top of sandbox — no new abstractions:

sandboxset
    │
    ├── sandbox.Sandbox      (4 of 11 methods used: ID, Pause, Unpause, Destroy)
    ├── sandbox.SandboxPool  (1 method used: Create)
    ├── sandbox.SandboxMeta  (passed through to Create, never read)
    └── common.DirMaker      (1 method used: Make)

The dependency is one-way: sandboxset imports sandbox, never the reverse.

Testing

  • Unit tests (22 tests): Use MockSandboxPool from sandbox/mock.go.
    Fast, no Docker needed, test all pool logic and concurrency.

    cd go && go test ./worker/sandboxset/tests/ -v -race -count=1
    
  • Integration tests (4 tests): Use real DockerPool with ol-min
    image. Gated with //go:build integration. Verify real containers are
    created, paused, unpaused, and destroyed.

    cd go && sudo env "PATH=$PATH" go test ./worker/sandboxset/tests/ -v -tags=integration -count=1
    

What this PR does NOT do

  • LambdaInstance goroutines are unchanged — that is Step 2
  • No capacity management (Warm/Shrink) or metrics — later PRs

Next steps

  • Step 2: Replace LambdaInstance goroutines with a SandboxSet
  • Step 3: Use SandboxSet as the node in the zygote tree

@@ -0,0 +1,52 @@
package sandboxset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move all the sandbox set funcs to here, not many different files

// A fresh scratch directory is created for each new sandbox
// via Config.ScratchDirs. Reused sandboxes keep their
// existing scratch directory from when they were first created.
Get() (sandbox.Sandbox, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the cleanup will be more foolproof if we return a wrapper instead of a sandbox itself?

sset = ...
sb_ref = sset.Get()
...use sb_ref.sb...
sb_ref.Put() // don't need to say which sandbox set

// A fresh scratch directory is created for each new sandbox
// via Config.ScratchDirs. Reused sandboxes keep their
// existing scratch directory from when they were first created.
Get() (sandbox.Sandbox, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call GetOrCreateUnpaused?

Copy link
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot to click submit on the feedback, sorry!

// If the sandbox is not in the pool it is still destroyed —
// resources are always freed. The returned error is
// informational only.
Destroy(sb sandbox.Sandbox, reason string) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need

@@ -0,0 +1,449 @@
package tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do 2-3 tests initially

@@ -0,0 +1,122 @@
package sandbox
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of just "docker" or "sock", user should be able to start OL with "mock" for sandbox implementation and get these dummy replies

Copy link
Member

@tylerharter tylerharter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Getting closer.

// Package sandboxset provides a thread-safe pool of sandboxes for a single
// Lambda function.
//
// A SandboxSet replaces per-instance goroutines with a simple pool.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't talk about replaces, just what it is

// Callers just ask for a sandbox and don't worry about whether it is
// freshly created or recycled from a previous request.
//
// Sandbox lifecycle inside a SandboxSet:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the API, they don't need to know about internals.

A SandboxSet manages a pool of sandboxes for one Lambda function.
All methods are safe to call from multiple goroutines.

The design mirrors the C process API: GetOrCreateUnpaused (create),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what C process API this is, but probably drop this paragraph

// Parent sandbox to fork from (may be nil). When nil, new
// sandboxes are created from scratch. Not all SandboxPool
// implementations support forking.
Parent sandbox.Sandbox
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the parent be another sandbox set?

// but is otherwise harmless.
//
// Prefer ref.Put() when you have a SandboxRef.
Put(sb sandbox.Sandbox) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have a Put here if the SandboxRef will have a method to release for us?

// here and return an error instead of crashing the worker.
func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) {
defer func() {
if r := recover(); r != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never recover from panics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we use panics to indicate there is a bug or the state is something that should never happen.

We use errors for recoverable issues.

// makeScratchDir creates a scratch directory for a new sandbox.
// DirMaker.Make panics on failure (e.g., disk full), so we recover
// here and return an error instead of crashing the worker.
func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't return an error because ScratchDirs.Make never returns an error

// Loop over idle sandboxes until one unpauses successfully,
// or the pool has no idle sandboxes left.
for {
s.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have defer s.mu.Unlock() and hold for the whole function. Your idea of not holding the lock during Unpause is a good one, but this is a sign you should refactor the code, into one that has a lock (the whole time), and one that calls the first method, before unpausing.

}
s.mu.Unlock()

if err := sb.Pause(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not Pause first, before locking or accessing the data structs?

//
// All sandboxes are snapshot under the lock, then destroyed outside it.
func (s *sandboxSetImpl) Close() error {
s.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we close often, or just at shutdown? Maybe not worth optimizing with early lock release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants