WiseChef platform (paying clients, names withheld)
Multi-tenant AI agent SaaS — live production
Multiple paying tenants, each on an isolated agent stack with dedicated infrastructure. Per-tenant VPS, per-tenant tunnel, per-tenant API key. Running in production since March 2026.
The situation
We needed to prove the Framework product could run more than one tenant without the stacks interfering with each other — the common SaaS failure mode where an outage for one customer cascades into all of them. The internal test was: can multiple paying tenants run simultaneously, each on their own infrastructure, with one person maintaining the fleet?
What we did
- Built an automated provisioning pipeline: checkout → cloud VPS creation → cloud-init bootstrap → edge tunnel + DNS records → per-tenant LLM key → welcome email
- Each tenant gets a dedicated subdomain, a dedicated container, a dedicated tunnel, and a dedicated LLM budget
- Wrote the fleet-management tooling so adding the next tenant is a single webhook away
- Enforced budget caps on per-tenant LLM keys so a runaway cost on one tenant cannot affect the others
- Set a hard tenant cap for this generation of the platform, with new capacity tiers planned before the cap is reached
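The provisioning steps above can be sketched as a single ordered pipeline. This is an illustrative sketch only: the function names, the `Tenant` record, and the stub bodies are hypothetical, not the actual WiseChef code.

```python
# Hypothetical sketch of the tenant provisioning pipeline.
# Every name here is illustrative; the real pipeline calls cloud,
# tunnel, and LLM-provider APIs at each step.

from dataclasses import dataclass, field

@dataclass
class Tenant:
    name: str
    subdomain: str = ""
    vps_id: str = ""
    tunnel_id: str = ""
    llm_key: str = ""
    log: list = field(default_factory=list)

# Stub step implementations (placeholders for real API calls).
def create_vps(t):      t.vps_id = f"vps-{t.name}"           # cloud VPS creation
def bootstrap(t):       pass                                  # cloud-init bootstrap
def create_tunnel(t):                                         # edge tunnel + DNS records
    t.subdomain = f"{t.name}.example.com"
    t.tunnel_id = f"tun-{t.name}"
def issue_llm_key(t):   t.llm_key = f"key-{t.name}"          # per-tenant LLM key, budget-capped
def send_welcome(t):    pass                                  # welcome email

def provision(tenant: Tenant) -> Tenant:
    """Run each provisioning step in order; any failure aborts the rest."""
    for step in (create_vps, bootstrap, create_tunnel, issue_llm_key, send_welcome):
        step(tenant)
        tenant.log.append(step.__name__)
    return tenant
```

Because each step runs only after the previous one succeeds, a failed step aborts the pipeline and a half-provisioned tenant never receives a welcome email.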
Timeline: the pipeline was built in two weeks; the first paying tenant was onboarded within 24 hours of going live.
What changed
The platform has been running paying tenants in production since March. Each tenant is isolated at the container, tunnel, and budget layer. When one has an incident it does not affect the others. The fleet is operated by one person.
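The budget-layer isolation amounts to giving every tenant its own independently capped spend counter. A minimal sketch of that idea, assuming an application-level tracker (class and names are hypothetical; the source describes caps enforced on the per-tenant LLM keys themselves):

```python
# Illustrative sketch of per-tenant budget caps, not production code.
# Each tenant has its own cap; exhausting one tenant's budget never
# draws down another's.

class BudgetExceeded(Exception):
    pass

class TenantBudget:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record spend, refusing any charge that would exceed the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(f"cap of ${self.cap_usd} would be exceeded")
        self.spent_usd += cost_usd

# One independent budget per tenant: a runaway on one key is contained.
budgets = {"tenant_a": TenantBudget(50.0), "tenant_b": TenantBudget(50.0)}
```

A runaway agent on `tenant_a` hits `BudgetExceeded` at its own cap while `tenant_b`'s budget remains untouched, which is the containment property described above.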
The architecture survived an enterprise on-prem engagement and concurrent SaaS tenants without structural changes. The same provisioning pipeline is now sold as the Framework product.
Relevant context
The architecture that runs this platform is what the Framework product installs for new customers. We dogfood what we sell. Infrastructure cost stays well inside the tenant subscription price at current scale.