`sandboxset`: thread-safe sandbox pool (issue #217) by AmiBuch · Pull Request #425 · open-lambda/open-lambda

AmiBuch · 2026-03-03T19:20:49Z

`sandboxset`: thread-safe sandbox pool (issue #217)

Problem

LambdaInstance uses one goroutine per sandbox, sitting idle on channels
between requests. A SandboxSet replaces that with a mutex-protected pool —
callers just ask for a sandbox and don't worry about whether it's new or
reused.

API

GetOrCreateUnpaused returns a *SandboxRef — a handle that wraps the
sandbox with a health state and back-pointer to its parent set. The caller
uses ref.Sandbox() to access the container, then calls ref.Put() or
ref.Destroy() when done.

```go
set, _ := sandboxset.New(&sandboxset.Config{
    Pool:        myPool,
    CodeDir:     "/path/to/lambda",
    ScratchDirs: myScratchDirs,
})

ref, _ := set.GetOrCreateUnpaused()   // create or reuse a sandbox
sb := ref.Sandbox()                    // access the underlying container
// ... handle request using sb.Client() ...

if broken {
    ref.Destroy("reason")             // kill a broken sandbox
} else {
    ref.Put()                          // return it to the pool
}

set.Close()                            // tear down everything
```

Set-level Put(sb) and Destroy(sb, reason) methods are also available
for callers that have a raw sandbox.Sandbox instead of a SandboxRef.

SandboxRef

```go
type SandboxRef struct {
    State SandboxState  // StateReady or StateBroken
}

func (r *SandboxRef) Sandbox() sandbox.Sandbox  // access the container
func (r *SandboxRef) Put() error                 // return to pool
func (r *SandboxRef) Destroy(reason string) error // destroy and remove
```

Flow

     [created]
         |
         v
     [paused]  <---+
         |         |
         v         |
     [in-use]  ----+  (Put)
         |
         v
   [destroyed]     (Destroy / Close / error)

- GetOrCreateUnpaused claims an idle sandbox and unpauses it. If the
  container died while paused, it is destroyed and the next idle one is
  tried. If no idle sandbox exists, a new one is created via
  SandboxPool.Create. Returns a *SandboxRef with State: StateReady.
- Put pauses the sandbox and returns it to the pool. If Pause fails,
  the sandbox is destroyed, so a bad sandbox never re-enters the pool.
- Destroy removes a sandbox from the pool and frees its resources.
  The sandbox is always destroyed even if it wasn't in the pool.
- Close destroys all sandboxes. Callers still holding references
  will find them already dead, which is safe per the Sandbox contract.
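
As a reading aid for the diagram and the four operations above, the lifecycle can be written down as a small transition table. This is only a sketch: the PR does not say that the set tracks an explicit per-sandbox state, and the names `state`, `transitions`, and `canMove` are hypothetical.

```go
package main

// state is a hypothetical label for the boxes in the flow diagram.
type state int

const (
	created state = iota
	paused
	inUse
	destroyed
)

// transitions lists the moves described in the PR text:
//   created -> paused       (diagram)
//   paused  -> inUse        (GetOrCreateUnpaused)
//   paused  -> destroyed    (container died while paused, or Close)
//   inUse   -> paused       (Put)
//   inUse   -> destroyed    (Destroy, Close, or a failed Pause)
var transitions = map[state][]state{
	created:   {paused},
	paused:    {inUse, destroyed},
	inUse:     {paused, destroyed},
	destroyed: {},
}

// canMove reports whether the table allows going from one state to another.
func canMove(from, to state) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

Nothing leaves `destroyed`, which matches the rule that a bad sandbox never re-enters the pool.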

Edge cases handled

- Mass unpause failure: GetOrCreateUnpaused loops (not recursion)
  over idle sandboxes, destroying dead ones until a live one is found or
  a new one is created. No stack-growth risk.
- DirMaker panic: DirMaker.Make panics on disk-full.
  makeScratchDir() wraps it in defer/recover and returns an error.
- Put after Close: Returns a clear error message indicating the set
  was closed, rather than a confusing "Pause failed" error.
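
One way to get the "Put after Close" behaviour is a sentinel error checked before anything else happens. The sketch below assumes a `closed` flag on the set; `ErrSetClosed`, `item`, and `set` are hypothetical names, and the description does not say what happens to the sandbox itself in this case, so the sketch leaves it untouched.

```go
package main

import (
	"errors"
	"sync"
)

// ErrSetClosed is a hypothetical sentinel; the PR only promises a clear message.
var ErrSetClosed = errors.New("sandboxset: set is closed")

// item is a stand-in for a sandbox.
type item struct {
	paused    bool
	destroyed bool
}

type set struct {
	mu     sync.Mutex
	closed bool
	idle   []*item
}

// Close marks the set closed and destroys every idle item.
func (s *set) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
	for _, it := range s.idle {
		it.destroyed = true
	}
	s.idle = nil
}

// Put checks the closed flag first, so the caller sees ErrSetClosed
// instead of whatever a later Pause call would have reported.
func (s *set) Put(it *item) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return ErrSetClosed
	}
	it.paused = true
	s.idle = append(s.idle, it)
	return nil
}
```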

File structure

go/worker/sandboxset/
    api.go           — SandboxSet interface, Config, New()
    sandboxset.go    — All implementation: types + methods
    tests/
        sandboxset_test.go              — Unit tests (MockSandboxPool)
        sandboxset_integration_test.go  — Integration tests (real DockerPool)

Dependencies

sandboxset is a thin layer on top of sandbox — no new abstractions:

sandboxset
    │
    ├── sandbox.Sandbox      (4 of 11 methods used: ID, Pause, Unpause, Destroy)
    ├── sandbox.SandboxPool  (1 method used: Create)
    ├── sandbox.SandboxMeta  (passed through to Create, never read)
    └── common.DirMaker      (1 method used: Make)

The dependency is one-way: sandboxset imports sandbox, never the reverse.
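
Because only four sandbox methods are used, a test double needs very little. The sketch below shows the idea with a narrow, consumer-side interface; `poolSandbox`, `stub`, and `recycle` are hypothetical, and the method signatures are assumptions rather than copies of `sandbox.Sandbox`.

```go
package main

// poolSandbox is a hypothetical interface naming only the four
// methods the dependency list says sandboxset calls.
// The signatures are assumed for illustration.
type poolSandbox interface {
	ID() string
	Pause() error
	Unpause() error
	Destroy(reason string)
}

// stub is a minimal in-memory implementation for tests.
type stub struct {
	id        string
	paused    bool
	destroyed bool
}

func (s *stub) ID() string { return s.id }

func (s *stub) Pause() error {
	s.paused = true
	return nil
}

func (s *stub) Unpause() error {
	s.paused = false
	return nil
}

func (s *stub) Destroy(reason string) { s.destroyed = true }

// Compile-time check that stub satisfies the narrow interface.
var _ poolSandbox = (*stub)(nil)

// recycle is what pool logic might do when a sandbox is returned.
func recycle(sb poolSandbox) error {
	return sb.Pause()
}
```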

Testing

Unit tests (22 tests): Use MockSandboxPool from sandbox/mock.go.
Fast, no Docker needed, test all pool logic and concurrency.
```
cd go && go test ./worker/sandboxset/tests/ -v -race -count=1
```
Integration tests (4 tests): Use real DockerPool with ol-min
image. Gated with //go:build integration. Verify real containers are
created, paused, unpaused, and destroyed.
```
cd go && sudo env "PATH=$PATH" go test ./worker/sandboxset/tests/ -v -tags=integration -count=1
```

What this PR does NOT do

- LambdaInstance goroutines are unchanged; that is Step 2
- No capacity management (Warm/Shrink) or metrics; those come in later PRs

Next steps

- Step 2: Replace LambdaInstance goroutines with a SandboxSet
- Step 3: Use SandboxSet as the node in the zygote tree

tylerharter · 2026-03-05T21:33:53Z

go/worker/sandboxset/sandboxset.go

@@ -0,0 +1,52 @@
+package sandboxset


move all the sandbox set funcs to here, not many different files

tylerharter · 2026-03-05T21:40:39Z

go/worker/sandboxset/api.go

+	// A fresh scratch directory is created for each new sandbox
+	// via Config.ScratchDirs. Reused sandboxes keep their
+	// existing scratch directory from when they were first created.
+	Get() (sandbox.Sandbox, error)


Maybe the cleanup will be more foolproof if we return a wrapper instead of a sandbox itself?

sset = ...
sb_ref = sset.Get()
...use sb_ref.sb...
sb_ref.Put() // don't need to say which sandbox set
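
The wrapper idea in the pseudocode above could look roughly like this in Go. It is a sketch under assumptions: `container`, `set`, and `ref` are hypothetical stand-ins, and the double-release guard is an addition for illustration, not something the review asks for.

```go
package main

import (
	"errors"
	"sync"
)

// container is a hypothetical stand-in for sandbox.Sandbox.
type container struct{ id int }

// set is a toy sandbox set that only tracks idle containers.
type set struct {
	mu   sync.Mutex
	idle []*container
	next int
}

// ref pairs a container with the set it came from, so the caller
// never has to say which set to return it to.
type ref struct {
	sb       *container
	parent   *set
	released bool // assumption: guard against a second Put
}

// Get reuses an idle container if there is one, else makes a new one.
func (s *set) Get() *ref {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n := len(s.idle); n > 0 {
		sb := s.idle[n-1]
		s.idle = s.idle[:n-1]
		return &ref{sb: sb, parent: s}
	}
	s.next++
	return &ref{sb: &container{id: s.next}, parent: s}
}

// Put returns the container to the parent set recorded in the ref.
func (r *ref) Put() error {
	r.parent.mu.Lock()
	defer r.parent.mu.Unlock()
	if r.released {
		return errors.New("sandbox ref already released")
	}
	r.released = true
	r.parent.idle = append(r.parent.idle, r.sb)
	return nil
}
```

Since the handle carries its own back-pointer, cleanup is a single call on the handle, which is the "more foolproof" property the comment is after.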

tylerharter · 2026-03-05T21:41:13Z

go/worker/sandboxset/api.go

+	// A fresh scratch directory is created for each new sandbox
+	// via Config.ScratchDirs. Reused sandboxes keep their
+	// existing scratch directory from when they were first created.
+	Get() (sandbox.Sandbox, error)


Can we call GetOrCreateUnpaused?

tylerharter

I think I forgot to click submit on the feedback, sorry!

tylerharter · 2026-03-05T22:23:19Z

go/worker/sandboxset/api.go

+	// If the sandbox is not in the pool it is still destroyed —
+	// resources are always freed. The returned error is
+	// informational only.
+	Destroy(sb sandbox.Sandbox, reason string) error


tylerharter · 2026-03-05T22:23:56Z

go/worker/sandboxset/tests/sandboxset_test.go

@@ -0,0 +1,449 @@
+package tests


let's do 2-3 tests initially

tylerharter · 2026-03-05T22:24:44Z

go/worker/sandbox/mock.go

@@ -0,0 +1,122 @@
+package sandbox


instead of just "docker" or "sock", user should be able to start OL with "mock" for sandbox implementation and get these dummy replies

tylerharter

Good work! Getting closer.

tylerharter · 2026-03-13T18:31:05Z

go/worker/sandboxset/api.go

+// Package sandboxset provides a thread-safe pool of sandboxes for a single
+// Lambda function.
+//
+// A SandboxSet replaces per-instance goroutines with a simple pool.


Don't talk about replaces, just what it is

tylerharter · 2026-03-13T18:31:28Z

go/worker/sandboxset/api.go

+// Callers just ask for a sandbox and don't worry about whether it is
+// freshly created or recycled from a previous request.
+//
+// Sandbox lifecycle inside a SandboxSet:


This is the API, they don't need to know about internals.

tylerharter · 2026-03-13T18:32:33Z

go/worker/sandboxset/api.go

+A SandboxSet manages a pool of sandboxes for one Lambda function.
+All methods are safe to call from multiple goroutines.
+
+The design mirrors the C process API: GetOrCreateUnpaused (create),


Not sure what C process API this is, but probably drop this paragraph

tylerharter · 2026-03-13T18:34:08Z

go/worker/sandboxset/api.go

+	// Parent sandbox to fork from (may be nil). When nil, new
+	// sandboxes are created from scratch. Not all SandboxPool
+	// implementations support forking.
+	Parent sandbox.Sandbox


Shouldn't the parent be another sandbox set?

tylerharter · 2026-03-13T18:35:33Z

go/worker/sandboxset/api.go

+	// but is otherwise harmless.
+	//
+	// Prefer ref.Put() when you have a SandboxRef.
+	Put(sb sandbox.Sandbox) error


Why have a Put here if the SandboxRef will have a method to release for us?

tylerharter · 2026-03-13T18:39:29Z

go/worker/sandboxset/sandboxset.go

+// here and return an error instead of crashing the worker.
+func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) {
+	defer func() {
+		if r := recover(); r != nil {


Never recover from panics. We use panics to indicate that there is a bug or that the state is something that should never happen.

We use errors for recoverable issues.

tylerharter · 2026-03-13T18:40:43Z

go/worker/sandboxset/sandboxset.go

+// makeScratchDir creates a scratch directory for a new sandbox.
+// DirMaker.Make panics on failure (e.g., disk full), so we recover
+// here and return an error instead of crashing the worker.
+func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) {


don't return an error because ScratchDirs.Make never returns an error
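
Taken together, the two comments on makeScratchDir suggest calling the directory maker directly, with no defer/recover and no error return from that step, while keeping errors for conditions the caller can handle. A minimal sketch of that convention follows; `dirMaker`, its `Make` signature, `errNoCapacity`, and `createSandbox` are hypothetical stand-ins, not the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// dirMaker is a hypothetical stand-in for a helper whose Make
// never returns an error and panics when something is badly wrong.
type dirMaker struct{ n int }

func (d *dirMaker) Make(name string) string {
	if name == "" {
		// A caller bug: signal it with a panic and let it propagate.
		panic("dirMaker.Make: empty name")
	}
	d.n++
	return fmt.Sprintf("/tmp/scratch/%s-%d", name, d.n)
}

// errNoCapacity is a recoverable condition, so it is an error value.
var errNoCapacity = errors.New("no capacity for a new sandbox")

// createSandbox returns errors for recoverable problems and calls
// Make directly, without wrapping it in defer/recover.
func createSandbox(d *dirMaker, capacityLeft int) (string, error) {
	if capacityLeft <= 0 {
		return "", errNoCapacity
	}
	return d.Make("sb"), nil
}
```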

tylerharter · 2026-03-13T18:42:40Z

go/worker/sandboxset/sandboxset.go

+	// Loop over idle sandboxes until one unpauses successfully,
+	// or the pool has no idle sandboxes left.
+	for {
+		s.mu.Lock()


Better to use defer s.mu.Unlock() and hold the lock for the whole function. Your idea of not holding the lock during Unpause is a good one, but it is a sign you should refactor the code into two functions: one that holds the lock the whole time, and one that calls the first and then unpauses.
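
A rough sketch of the split described in this comment: a helper that holds the mutex for its entire body via defer, and an outer function that calls the helper and then unpauses with no lock held. The names `sb`, `set`, `popIdle`, and `getIdleUnpaused` are hypothetical, and destroying the dead sandbox is reduced to dropping it.

```go
package main

import (
	"errors"
	"sync"
)

// sb is a stand-in sandbox; dead simulates a container that died while paused.
type sb struct {
	id   string
	dead bool
}

func (c *sb) Unpause() error {
	if c.dead {
		return errors.New("container died while paused")
	}
	return nil
}

type set struct {
	mu   sync.Mutex
	idle []*sb
}

// popIdle holds the lock for its whole body and only touches the data structure.
func (s *set) popIdle() *sb {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := len(s.idle)
	if n == 0 {
		return nil
	}
	c := s.idle[n-1]
	s.idle = s.idle[:n-1]
	return c
}

// getIdleUnpaused calls popIdle, then unpauses outside the lock.
// It returns nil when no idle sandbox could be revived.
func (s *set) getIdleUnpaused() *sb {
	for {
		c := s.popIdle()
		if c == nil {
			return nil
		}
		if err := c.Unpause(); err != nil {
			continue // dead sandbox: drop it and try the next one
		}
		return c
	}
}
```

The slow Unpause call never runs under the mutex, yet no function contains a manual Lock/Unlock pair that an early return could unbalance.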

tylerharter · 2026-03-13T18:44:44Z

go/worker/sandboxset/sandboxset.go

+	}
+	s.mu.Unlock()
+
+	if err := sb.Pause(); err != nil {


why not Pause first, before locking or accessing the data structs?
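
The ordering this question points at might look like the following: pause first, and take the lock only for the short append. This is a simplified sketch with hypothetical `sb` and `set` types; it ignores any in-use bookkeeping the real Put may need.

```go
package main

import (
	"errors"
	"sync"
)

// sb is a stand-in sandbox; failPause simulates a Pause error.
type sb struct {
	failPause bool
	paused    bool
	destroyed bool
}

func (c *sb) Pause() error {
	if c.failPause {
		return errors.New("pause failed")
	}
	c.paused = true
	return nil
}

func (c *sb) Destroy() { c.destroyed = true }

type set struct {
	mu   sync.Mutex
	idle []*sb
}

// Put pauses before locking, so the mutex only guards the append.
func (s *set) Put(c *sb) error {
	if err := c.Pause(); err != nil {
		c.Destroy() // a sandbox that cannot pause never enters the pool
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.idle = append(s.idle, c)
	return nil
}
```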

tylerharter · 2026-03-13T18:46:43Z

go/worker/sandboxset/sandboxset.go

+//
+// All sandboxes are snapshot under the lock, then destroyed outside it.
+func (s *sandboxSetImpl) Close() error {
+	s.mu.Lock()


will we close often, or just at shutdown? Maybe not worth optimizing with early lock release.

AmiBuch and others added 7 commits February 19, 2026 14:09

refactor: separation of package, abstraction included, simpler interf…

4d351bc

…ace and microservice like architecture

Merge branch 'open-lambda:main' into main

47d5d10

refactor: simpler, much much simpler

8037432

fix: simpler diagram

a357a25

fix: commenting, some edge cases

e7fda1f

Merge branch 'open-lambda:main' into main

13ba7a0

fix: recursive Get, DirMaker panic, put after close documented

f85697d

tylerharter requested changes Mar 5, 2026

refactor: changed the function distribution and it return type

ae2ac69

tylerharter reviewed Mar 13, 2026
