sandboxset: thread-safe sandbox pool (issue #217)#425
sandboxset: thread-safe sandbox pool (issue #217)#425AmiBuch wants to merge 8 commits intoopen-lambda:mainfrom
sandboxset: thread-safe sandbox pool (issue #217)#425Conversation
…ace and microservice like architecture
| @@ -0,0 +1,52 @@ | |||
| package sandboxset | |||
There was a problem hiding this comment.
move all the sandbox set funcs to here, not many different files
go/worker/sandboxset/api.go
Outdated
| // A fresh scratch directory is created for each new sandbox | ||
| // via Config.ScratchDirs. Reused sandboxes keep their | ||
| // existing scratch directory from when they were first created. | ||
| Get() (sandbox.Sandbox, error) |
There was a problem hiding this comment.
Maybe the cleanup will be more foolproof if we return a wrapper instead of a sandbox itself?
sset = ...
sb_ref = sset.Get()
...use sb_ref.sb...
sb_ref.Put() // don't need to say which sandbox set
go/worker/sandboxset/api.go
Outdated
| // A fresh scratch directory is created for each new sandbox | ||
| // via Config.ScratchDirs. Reused sandboxes keep their | ||
| // existing scratch directory from when they were first created. | ||
| Get() (sandbox.Sandbox, error) |
There was a problem hiding this comment.
Can we call GetOrCreateUnpaused?
tylerharter
left a comment
There was a problem hiding this comment.
I think I forgot to click submit on the feedback, sorry!
| // If the sandbox is not in the pool it is still destroyed — | ||
| // resources are always freed. The returned error is | ||
| // informational only. | ||
| Destroy(sb sandbox.Sandbox, reason string) error |
| @@ -0,0 +1,449 @@ | |||
| package tests | |||
There was a problem hiding this comment.
let's do 2-3 tests initially
| @@ -0,0 +1,122 @@ | |||
| package sandbox | |||
There was a problem hiding this comment.
instead of just "docker" or "sock", user should be able to start OL with "mock" for sandbox implementation and get these dummy replies
tylerharter
left a comment
There was a problem hiding this comment.
Good work! Getting closer.
| // Package sandboxset provides a thread-safe pool of sandboxes for a single | ||
| // Lambda function. | ||
| // | ||
| // A SandboxSet replaces per-instance goroutines with a simple pool. |
There was a problem hiding this comment.
Don't talk about replaces, just what it is
| // Callers just ask for a sandbox and don't worry about whether it is | ||
| // freshly created or recycled from a previous request. | ||
| // | ||
| // Sandbox lifecycle inside a SandboxSet: |
There was a problem hiding this comment.
This is the API, they don't need to know about internals.
| A SandboxSet manages a pool of sandboxes for one Lambda function. | ||
| All methods are safe to call from multiple goroutines. | ||
|
|
||
| The design mirrors the C process API: GetOrCreateUnpaused (create), |
There was a problem hiding this comment.
Not sure what C process API this is, but probably drop this paragraph
| // Parent sandbox to fork from (may be nil). When nil, new | ||
| // sandboxes are created from scratch. Not all SandboxPool | ||
| // implementations support forking. | ||
| Parent sandbox.Sandbox |
There was a problem hiding this comment.
Shouldn't the parent be another sandbox set?
| // but is otherwise harmless. | ||
| // | ||
| // Prefer ref.Put() when you have a SandboxRef. | ||
| Put(sb sandbox.Sandbox) error |
There was a problem hiding this comment.
Why have a Put here if the SandboxRef will have a method to release for us?
| // here and return an error instead of crashing the worker. | ||
| func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) { | ||
| defer func() { | ||
| if r := recover(); r != nil { |
There was a problem hiding this comment.
because we use panics to indicate there is a bug or the state is something that should never happen.
We use errors for recoverable issues.
| // makeScratchDir creates a scratch directory for a new sandbox. | ||
| // DirMaker.Make panics on failure (e.g., disk full), so we recover | ||
| // here and return an error instead of crashing the worker. | ||
| func (s *sandboxSetImpl) makeScratchDir() (dir string, err error) { |
There was a problem hiding this comment.
don't return an error because ScratchDirs.Make never returns an error
| // Loop over idle sandboxes until one unpauses successfully, | ||
| // or the pool has no idle sandboxes left. | ||
| for { | ||
| s.mu.Lock() |
There was a problem hiding this comment.
Better to have defer s.mu.Unlock() and hold for the whole function. Your idea of not holding the lock during Unpause is a good one, but this is a sign you should refactor the code, into one that has a lock (the whole time), and one that calls the first method, before unpausing.
| } | ||
| s.mu.Unlock() | ||
|
|
||
| if err := sb.Pause(); err != nil { |
There was a problem hiding this comment.
why not Pause first, before locking or accessing the data structs?
| // | ||
| // All sandboxes are snapshot under the lock, then destroyed outside it. | ||
| func (s *sandboxSetImpl) Close() error { | ||
| s.mu.Lock() |
There was a problem hiding this comment.
will we close often, or just at shutdown? Maybe not worth optimizing with early lock release.
sandboxset: thread-safe sandbox pool (issue #217)Problem
LambdaInstanceuses one goroutine per sandbox, sitting idle on channelsbetween requests. A SandboxSet replaces that with a mutex-protected pool —
callers just ask for a sandbox and don't worry about whether it's new or
reused.
API
GetOrCreateUnpausedreturns a*SandboxRef— a handle that wraps thesandbox with a health state and back-pointer to its parent set. The caller
uses
ref.Sandbox()to access the container, then callsref.Put()orref.Destroy()when done.Set-level
Put(sb)andDestroy(sb, reason)methods are also availablefor callers that have a raw
sandbox.Sandboxinstead of aSandboxRef.SandboxRef
Flow
container died while paused, it is destroyed and the next idle one is
tried. If no idle sandbox exists, a new one is created via
SandboxPool.Create. Returns a*SandboxRefwithState: StateReady.the sandbox is destroyed — a bad sandbox never re-enters the pool.
The sandbox is always destroyed even if it wasn't in the pool.
will find them already dead, which is safe per the Sandbox contract.
Edge cases handled
GetOrCreateUnpausedloops (not recursion)over idle sandboxes, destroying dead ones until a live one is found or
a new one is created. No stack-growth risk.
DirMaker.Makepanics on disk-full.makeScratchDir()wraps it indefer/recoverand returns an error.was closed, rather than a confusing "Pause failed" error.
File structure
Dependencies
sandboxset is a thin layer on top of sandbox — no new abstractions:
The dependency is one-way:
sandboxsetimportssandbox, never the reverse.Testing
Unit tests (22 tests): Use
MockSandboxPoolfromsandbox/mock.go.Fast, no Docker needed, test all pool logic and concurrency.
Integration tests (4 tests): Use real
DockerPoolwithol-minimage. Gated with
//go:build integration. Verify real containers arecreated, paused, unpaused, and destroyed.
What this PR does NOT do
LambdaInstancegoroutines are unchanged — that is Step 2Warm/Shrink) or metrics — later PRsNext steps
LambdaInstancegoroutines with aSandboxSetSandboxSetas the node in the zygote tree