I Built an MMORPG on AWS Lambda. Here's What It Forced Me to Throw Away.
· updated Apr 19, 2026 · 8 min · hackerquest.online
Every tutorial on building a multiplayer game starts the same way: pick a persistent server. Node with WebSockets, maybe Colyseus, maybe a dedicated Go or Rust process. Spin up EC2. Put it behind a load balancer. Welcome to the club.
I did not do any of that. HackerQuest is a browser-based hacking MMORPG with PvP, player trading, a dark-web marketplace, 21 NPCs with branching dialogue, and a persistent world. The whole backend now runs on 28 Lambda functions, two DynamoDB tables (game + marketing), Cognito, an HTTP API, and a WebSocket API. There is still no server sitting idle. When no one is playing, the compute bill is zero.
This works. It also forced me to redesign a handful of things I assumed a game "just has." Here is what the architecture looks like, and where it stops making sense.
Why Lambda Is Fine for This Genre (and Terrible for Others)
The games people use to argue against serverless are the wrong games. Nobody is suggesting you host Counter-Strike on Lambda. Twitch-reaction PvP needs a persistent process holding state in memory at sub-100ms tick rates. Lambda is not that.
HackerQuest is not that either. It is turn-based at heart. A player types scan, the server runs a probability roll, writes the result to DynamoDB, returns JSON. The next action might come 10 seconds later or 10 minutes later. Between actions, there is nothing to keep warm. The game genre I grew up on -- RuneScape, Torn City, EVE Online's market game loop -- is almost perfectly shaped for pay-per-request compute.
The test I use: can the state of the game be reconstructed from scratch by reading the database, with no in-memory context from the last tick? If yes, Lambda works. If no, you need a process.
Server Authority When the Client Is a Sandbox
A browser game is a JavaScript bundle the player can open the devtools on. If there is a variable called balance and the server trusts the client's copy of it, that variable is already 9,999,999.
The rule in this codebase is blunt: the client renders, the server decides. Every action that changes state goes through a Lambda. The client sends "I want to hack 10.0.41.41" -- the Lambda checks if the target exists, pulls the player's hacking skill, rolls the success probability server-side, writes the outcome, and returns the result. The client animates it. If someone edits the JS to claim they have hacking 999, the Lambda reads their real level from DynamoDB and ignores the claim.
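A minimal sketch of that flow, with hypothetical names (`resolveHack`, `hackingSkill`, `difficulty` are mine; the article only describes the shape). The key property is that the client's claimed skill never enters the roll:

```javascript
// Server-authoritative resolution: the skill used in the roll comes
// from the player record the Lambda read out of DynamoDB, never from
// anything the client sent. The success curve here is invented.
function resolveHack(playerRecord, target, rng = Math.random) {
  const p = Math.min(0.95, Math.max(0.05,
    0.5 + (playerRecord.hackingSkill - target.difficulty) * 0.05));
  const success = rng() < p;
  return { success, probability: p };
}

// A handler would look roughly like:
//   const player = await getPlayer(db, userId);        // server truth
//   const result = resolveHack(player, target);        // server roll
//   await writeOutcome(db, userId, target.ip, result); // server write
// Whatever the client claims about its own skill is simply never read.
```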
The wrinkle is offline mode. HackerQuest has a BACKEND_ENABLED flag that lets the game run entirely in localStorage for solo play. That is a different universe. Offline, the client is the authority and the player is cheating themselves, which is fine. The moment the flag flips and a player is on the leaderboard, the server is the only source of truth and the client's copy of state is a cache that exists to make the UI feel fast.
Building both modes from day one was more work, but it meant the game was playable while the backend was still being written. I could ship the frontend to S3 and let people mess with the mechanics before I had a single Lambda deployed.
One DynamoDB Table for the Entire Game
The instinct when you have a player entity, an item entity, a chat message entity, a leaderboard, and a marketplace listing is to make five tables. Don't. You end up paying for provisioned throughput on five tables, writing join logic in application code, and dealing with transactions across them.
HackerQuest uses a single DynamoDB table in PAY_PER_REQUEST mode with three global secondary indexes. The partition key is a generic PK string, the sort key is SK. A player is PK=PLAYER#abc123, SK=PROFILE. That player's inventory items are PK=PLAYER#abc123, SK=ITEM#xyz, which means fetching all their items is one query. A chat message is PK=CHAT#global, SK=TS#1745020301, so fetching the last 50 messages is a range query on SK in reverse order with Limit=50.
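The key patterns above, wrapped in small builders so the shapes are concrete. The builder names and the `HackerQuest` table name are my placeholders; the key formats are the ones from the article:

```javascript
// Single-table key shapes. One generic PK/SK pair covers every entity.
const keys = {
  playerProfile: (id)        => ({ PK: `PLAYER#${id}`, SK: 'PROFILE' }),
  playerItem:    (id, item)  => ({ PK: `PLAYER#${id}`, SK: `ITEM#${item}` }),
  chatMessage:   (room, ts)  => ({ PK: `CHAT#${room}`, SK: `TS#${ts}` }),
};

// "All of a player's items" is one Query: fixed PK, SK prefix match.
function itemsQueryParams(playerId) {
  return {
    TableName: 'HackerQuest', // placeholder
    KeyConditionExpression: 'PK = :pk AND begins_with(SK, :sk)',
    ExpressionAttributeValues: { ':pk': `PLAYER#${playerId}`, ':sk': 'ITEM#' },
  };
}

// "Last 50 chat messages" is a reverse range query with a limit.
function recentChatParams(room) {
  return {
    TableName: 'HackerQuest',
    KeyConditionExpression: 'PK = :pk',
    ExpressionAttributeValues: { ':pk': `CHAT#${room}` },
    ScanIndexForward: false, // newest first
    Limit: 50,
  };
}
```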
Leaderboards are where the GSIs earn their keep. GSI1 flips the schema so PK=LEADERBOARD#hacking and SK is a zero-padded skill score. The top 100 ranked hackers is a single query with no application sorting. The marketplace listings use a second GSI keyed on item category. The third handles active PvP bounties.
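The zero-padding detail matters more than it looks: DynamoDB compares sort keys lexicographically, so an unpadded "9" sorts after "100". A sketch, with a padding width I picked arbitrarily:

```javascript
// Zero-padded sort key for the leaderboard GSI. Width of 12 is my
// assumption; anything wider than the max possible score works.
function leaderboardSortKey(score) {
  return `SCORE#${String(score).padStart(12, '0')}`;
}

// Top 100 then falls out of a single GSI query:
//   PK = LEADERBOARD#hacking, ScanIndexForward: false, Limit: 100
// with no sorting in application code.
```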
Total monthly spend on storage plus per-request at the current player count: a dollar or two. If the game gets 10x bigger, the bill goes up roughly 10x, and I do not have to rewrite anything.
Energy Regen Without a Cron Job
Every RPG has a stamina bar that refills over time. The obvious way to build it is a scheduled job that runs every minute and adds a point to everyone's energy. Do not do this. It scales badly, it wakes up a Lambda for players who are not online, and it is going to drift.
Lazy regen is the move. The player record stores energy and energy_updated_at. When a request comes in, the Lambda computes elapsed = now - energy_updated_at, adds floor(elapsed / 180) points (capped at max), and writes the result with the new timestamp. A player who logs back in 6 hours later gets their full refill on the next request. A player who never logs in costs zero.
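The whole mechanic fits in one pure function. This is a sketch under the article's 180-second rate, with one detail worth keeping: advance the stored timestamp by whole regen intervals rather than resetting it to now, so partial progress toward the next point survives between requests:

```javascript
const REGEN_SECONDS = 180; // one energy point per 3 minutes

// Derive current energy from the stored timestamp on each request.
// Returns the player unchanged when there is nothing to write.
function applyLazyRegen(player, nowMs) {
  const elapsed = Math.floor((nowMs - player.energy_updated_at) / 1000);
  const gained = Math.floor(elapsed / REGEN_SECONDS);
  if (gained <= 0) return player;
  return {
    ...player,
    energy: Math.min(player.max_energy, player.energy + gained),
    // Advance by whole intervals so the remainder carries over.
    energy_updated_at: player.energy_updated_at + gained * REGEN_SECONDS * 1000,
  };
}
```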
The math lives in one function in one file. No cron, no queue, no drift. Scheduled jobs still exist in the stack -- mining payouts every 15 minutes, botnet decay every hour, leaderboard snapshots every hour -- but those are actual scheduled mutations that cannot be computed lazily. Anything that can be derived from a timestamp should be.
The Cognito Trigger That Cost Me an Afternoon
New users sign up through Cognito. When they confirm their email, I want a Lambda to fire and create their initial DynamoDB player record. Cognito calls this a post-confirmation trigger. The SAM template should wire it up.
It did not. The first time I tried to deploy the full stack with the trigger defined in the same template as the user pool, SAM complained about a circular dependency. The Lambda needs permission to write to DynamoDB, the user pool needs the Lambda ARN as a trigger, and somewhere in CloudFormation's dependency graph, those two chase their own tail.
The fix is not elegant but it works: deploy the Lambda and the user pool separately in SAM, then attach the trigger in a second step with a one-line AWS CLI call.
aws cognito-idp update-user-pool \
--user-pool-id us-east-1_GgMLLnvXB \
--lambda-config PostConfirmation=arn:aws:lambda:us-east-1:692859945539:function:hq-auth-trigger
The Lambda had the cognito-idp.amazonaws.com invoke permission from the original deploy, so it just worked. I wrapped the command in the deploy script with a guard that only runs it when the trigger is not already attached. If I had done this in the template from the start, I would still be fighting CloudFormation.
When This Architecture Stops Working
Be honest about the limits. I would not build any of the following on Lambda:
- Real-time combat with sub-second tick rates. Lambda cold starts are not your friend.
- Any mechanic where multiple players need to share ephemeral state between actions. Lambdas do not keep state between invocations, and forcing it through DynamoDB adds latency on every read.
- Voice or video chat. Obviously. WebRTC needs an SFU, not a function.
- Physics simulation, pathfinding on a shared map, any game where the world has inertia between player actions.
HackerQuest avoids all of these by being a terminal game. The world is a set of database rows. Actions are discrete. Nobody cares if there is a 200ms latency before their hack command resolves -- it reads more cinematic when there is.
The whole backend -- 28 Lambdas, two tables, one user pool, five scheduled jobs, one WebSocket API -- fits in a single ~1100-line SAM template. It deploys in under three minutes. I can tear the entire stack down and redeploy it from scratch while making coffee. For the kind of game this is, that is worth more than any persistent-server feature I gave up.
Six Weeks Later: What Changed
Update Apr 19, 2026. Six weeks of live play taught me where the original "MVP" architecture aged out and where it held. Here is the short list.
WebSockets after all -- but only the smallest kind
Chat, party invites, org-war declarations, and PvP attack notifications all needed real-time push. I held off as long as I could. The breaking point was party formation: polling for "did your party leader just invite you" reads like garbage. I added an API Gateway WebSocket layer with three Lambda routes ($connect, $disconnect, $default action-router). On $connect I verify the Cognito ID token signature using aws-jwt-verify against the user pool's JWKS, then write a CONN#{connectionId} row in the same DynamoDB table with an 8-hour rolling TTL. Push is just a DDB query (find connections for a target handle via a GSI3 lookup) plus PostToConnection calls. Per-action sliding-window rate limits live in the same DDB pattern as chat. A scheduled sweep deletes stale CONN# rows older than 2 minutes so the addressable-push targets stay honest. The state-reconstruction property still holds for HTTP -- it's just that the WS layer now maintains its own ephemeral state in the same table, with TTLs.
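The CONN# row itself looks roughly like this. The key pattern and the 8-hour TTL are from the stack as described; the attribute names are my guesses:

```javascript
const CONN_TTL_SECONDS = 8 * 60 * 60; // 8-hour rolling TTL

// Row written on $connect, after the token has been verified.
// `claims` here means the *verified* JWT claims -- never fields the
// client put in the frame. DynamoDB TTL expects epoch seconds.
function buildConnectionItem(connectionId, claims, nowMs) {
  return {
    PK: `CONN#${connectionId}`,
    SK: 'META',
    userId: claims.sub,
    handle: claims.handle,
    connectedAt: nowMs,
    ttl: Math.floor(nowMs / 1000) + CONN_TTL_SECONDS,
  };
}
```

Verification happens before this write, e.g. with aws-jwt-verify's `CognitoJwtVerifier` checking the signature against the pool's JWKS; decode-only token reading is exactly the MVP mistake described later in this post.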
Server-side party state. Conditional writes for handle uniqueness.
I shipped party formation client-mirrored at first -- each player held their own roster locally, server only brokered the invite delivery. A motivated client could lie and inflate the shared-XP multiplier by adding fake party members. Now there is a PARTY#{partyId} row plus a PARTY_MEMBER#{userId} reverse index; mutations go through a small server-side helper module and broadcast a full party:state_sync snapshot to every member's connections (clients adopt wholesale, drift-proof).
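On the client side, "adopt wholesale" is a one-liner, and the point is what it does not do. A sketch with invented names:

```javascript
// Handle a party:state_sync push: replace local party state with the
// server's snapshot entirely. Replace, never merge -- merging local
// and server copies is how drift creeps back in.
function onPartyStateSync(gameState, snapshot) {
  return { ...gameState, party: snapshot };
}
```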
Same trick solves handle uniqueness without a global "handles" table. A new save writes to HANDLE_INUSE#{handleLower} with a conditional attribute_not_exists(PK) OR currentOwner = :sub. Two clients racing for the same fresh handle: one wins, the other gets ConditionalCheckFailedException -> 409. After a delete, the handle goes into a HANDLE_RESERVED#{handleLower} row with 30-day TTL (and a 6-month escalation if the same owner cycles register-delete-reclaim more than 3 times, defeating the perpetual-squat exploit). A read-only handle-check Lambda lets the client pre-flight the name while typing; the conditional write is the authoritative gate.
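The race-safe claim is one conditional PutItem. The key pattern and condition expression are the ones quoted above; the table name and surrounding attribute names are mine:

```javascript
// Conditional put that makes handle claims race-safe. Succeeds if no
// row exists for the handle yet, or if the same Cognito user is
// re-claiming their own handle. Anyone else gets a
// ConditionalCheckFailedException, which the API maps to a 409.
function claimHandleParams(handle, cognitoSub) {
  return {
    TableName: 'HackerQuest', // placeholder
    Item: {
      PK: `HANDLE_INUSE#${handle.toLowerCase()}`,
      SK: 'CLAIM',
      currentOwner: cognitoSub,
    },
    ConditionExpression: 'attribute_not_exists(PK) OR currentOwner = :sub',
    ExpressionAttributeValues: { ':sub': cognitoSub },
  };
}
```

Two clients racing for the same fresh handle both issue this put; DynamoDB serializes them, one condition passes, the other fails. No lock, no global handles table.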
The trust-boundary lesson nobody warned me about
The biggest non-obvious thing six weeks of operation taught me: client-authoritative state-save is a footgun. The original state-save.js spread the incoming JSON into 13 DynamoDB rows. Anything the client sent, I wrote. A pre-launch audit (run by a five-agent parallel sweep against the codebase) found that a single DevTools paste could set profile.donatorStatus = 'handler' and grant free Handler-tier benefits, bypassing Stripe entirely. Same vector forged max skills, every legendary item, the galactic ending achievement, and put the cheater at #1 on every leaderboard.
The fix wasn't subtle: per-slice allow-lists, a PROTECTED_FIELDS set that strips subscription/billing fields on every save (only the Stripe webhook and Cognito post-confirmation trigger may write them), numeric clamping on bounded values, and a one-way commit pattern for ending achievements. The same audit found I was silently dropping ~12 state slices added in the months after launch (relationships, pythia, rested, activeCraft, worldState, etc.). Every player save was hemorrhaging progress to a save handler that didn't know about the slices yet. Schema-aware persistence is now table stakes.
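The shape of that fix, in miniature. The slice and field names below are illustrative, not the game's real schema; the pattern is the allow-list plus protected-field strip described above:

```javascript
// Per-slice allow-lists: only fields named here survive a save.
const ALLOWED_FIELDS = {
  profile: ['handle', 'bio', 'theme'],
  skills: ['hacking', 'stealth', 'crypto'],
};

// Billing/subscription fields are stripped on every save regardless of
// slice; only the Stripe webhook and the Cognito post-confirmation
// trigger may write them. Redundant with the allow-list, by design.
const PROTECTED_FIELDS = new Set(['donatorStatus', 'subscriptionTier', 'stripeCustomerId']);

function sanitizeSlice(sliceName, incoming) {
  const allowed = ALLOWED_FIELDS[sliceName];
  if (!allowed) return null; // unknown slice: reject loudly, never drop silently
  const out = {};
  for (const field of allowed) {
    if (field in incoming && !PROTECTED_FIELDS.has(field)) {
      out[field] = incoming[field];
    }
  }
  return out;
}
```

Returning `null` for an unknown slice (instead of quietly ignoring it) is what surfaces the second bug class: new state slices that the save handler has not been taught about yet fail visibly instead of hemorrhaging progress.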
Two adjacent gotchas worth knowing if you go this route: (1) WebSocket payloads are also client-authoritative. Trust nothing -- read the attacker's combat stats from DynamoDB on every pvp:attack push, not from the WS frame. (2) Stripe webhooks need event-ID idempotency or replays will re-apply benefits past the cancel date; resolve plan from subscription.items.data[0].price.id, not from metadata.plan (customers can mutate metadata via the Customer Portal).
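Both Stripe defenses are small. A sketch, assuming a table keyed the same way as everything else; the price-to-plan map and table name are hypothetical, the `subscription.items.data[0].price.id` path is the one named above:

```javascript
// (1) Event-ID idempotency: a conditional put keyed on the Stripe
// event ID. A replayed webhook fails the condition and is ignored.
function webhookDedupeParams(eventId, nowSeconds) {
  return {
    TableName: 'HackerQuest', // placeholder
    Item: {
      PK: `STRIPE_EVENT#${eventId}`,
      SK: 'SEEN',
      ttl: nowSeconds + 30 * 24 * 3600, // keep dedupe rows a month
    },
    ConditionExpression: 'attribute_not_exists(PK)',
  };
}

// (2) Resolve the plan from the price ID, which only Stripe controls,
// never from metadata the customer can edit in the Customer Portal.
const PRICE_TO_PLAN = { // hypothetical price IDs
  price_handler_123: 'handler',
  price_operator_456: 'operator',
};

function resolvePlan(subscription) {
  return PRICE_TO_PLAN[subscription.items.data[0].price.id] ?? null;
}
```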
What the cost actually looks like
Compute is still pay-per-request and still drops to zero between players. DynamoDB on-demand reads/writes scale with concurrent activity. The marketing forms got a separate table (HackerQuestMarketing -- mailing list, support tickets, GDPR requests) so support submissions never co-mingle with player state. SES adds a few cents for double-opt-in confirmations and admin-forward emails. CloudFront serves both the game (single-page Vite bundle) and the marketing site. WebSocket connection-minutes are billed per minute connected, but at the current player count the entire monthly bill is still under $10. The cost cliff isn't compute -- it's CloudFront bandwidth if a campaign goes viral, which is a problem I would love to have.
What I would do differently if starting over
Two things. First: write the schema-aware state-save handler on day one, not six months in. The cost of validating + clamping every field at the boundary is small; the cost of retrofitting it once the game has shipped half a dozen new state slices is annoying. Second: WebSocket signature verification on connect is non-negotiable. The MVP shipped with decode-only token reading, and CONN# rows stored whatever the client claimed. That's fine until the moment you start addressable-pushing handle-targeted events, at which point identity has to be authoritative. aws-jwt-verify is the easy answer.
Everything else from the original architecture held up. Single-table DynamoDB scaled cleanly. Lambda cold starts have not been a player-noticeable issue. SAM templates remained the right unit of deployment even past 1000 lines. The genre fit is the load-bearing decision -- turn-based hacking MMORPG mapped onto pay-per-request compute. If you are building anything with similar pacing, the same pattern works.