Systems Pilots
2026-02-28
I used to work at RapidSOS, which routes GPS coordinates from smartphones to 9-1-1 dispatch centers for roughly 165 million emergencies a year. The FCC estimated that more than 10,000 people die annually because dispatchers cannot locate wireless callers fast enough [1]; that was the problem the system was built to fix. Someone stranded on a highway median at 2 AM, unable to describe where they are; someone having a cardiac event in a high-rise where cell-tower triangulation puts them on the wrong floor. The system fused GPS, WiFi, barometric pressure, and Bluetooth into a location accurate to about ten meters, then delivered it to the dispatcher before the caller finished saying “I need help.” Five nines. The bugs that scared me were never the ones that crashed the system. They were the ones that degraded it quietly, where the error rate crept high enough to matter but stayed low enough to hide inside normal variance, and no one noticed the coordinates had been drifting for hours.
Aviation has a name for this: automation complacency. When pilots over-trust automation, skills atrophy and reaction times slow, and the one anomaly that matters slips by unnoticed until there is no longer room to recover from it. Air France 447 fell out of the sky because the crew had so little practice flying without automation that they couldn’t interpret what was happening when the autopilot disconnected. The stall warning blared 75 times. They never recognized it. A pilot earns the salary in the moments when the system does something it shouldn’t, not in the eight hours of holding altitude.
Software engineering is entering its autopilot era. LLMs scaffold projects and agents refactor modules; copilots write the tests, the boilerplate, the migration nobody wanted to touch. If AI generates 99% of code correctly, the scarce thing shifts from writing code to knowing when the AI is plausibly wrong and designing architectures that degrade gracefully rather than shatter. Modern systems fail quietly, in patterns that look like noise until someone is hurt.
Calm-Weather Hours
For years, the industry trained keyboard operators. Entry-level engineering optimized for CRUD apps, framework familiarity, LeetCode patterns, and shipping features fast, all of which made sense when writing code was the bottleneck.
The marginal cost of generating code is approaching zero, and a senior engineer with good AI tooling can outproduce several juniors on routine implementation. Juniors used to gain judgment by grinding through routine builds, the way new pilots accumulate flight hours in calm weather before anyone trusts them in a storm. If the calm-weather hours are automated away, the judgment has to come from somewhere, and no one has a good answer for where.
Engine Failures on Purpose
Aviation trains pilots in simulators: engine failures, sensor malfunctions, crosswinds, instrument blackouts. Pilots practice losing control so they can learn to recover it.
The curriculum should match. Instead of “build a to-do app,” the assignment should be: “build a distributed key-value store, inject latency, corrupt packets, observe how it breaks, write the postmortem.” Create race conditions on purpose, simulate partial network partitions, and sit with the failure long enough to explain why it broke.
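One way that assignment might start, as a minimal sketch rather than a reference design: wrap a store in a layer that injects faults, then check what the client can and cannot see. Everything here is an assumption made for illustration. `InMemoryStore` is a hypothetical stand-in for a real distributed node, and `FaultyStore` and `run_experiment` are made-up names. The faults are seeded so a run can be replayed while writing the postmortem.

```python
import random
import time


class InMemoryStore:
    """Hypothetical stand-in for one node of a real key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class FaultyStore:
    """Wraps a store and injects latency, silently dropped writes, and corrupted reads."""

    def __init__(self, inner, seed=0, max_delay_s=0.0, drop_rate=0.0, corrupt_rate=0.0):
        self._inner = inner
        self._rng = random.Random(seed)  # seeded so a failing run is reproducible
        self.max_delay_s = max_delay_s
        self.drop_rate = drop_rate
        self.corrupt_rate = corrupt_rate
        self.injected = {"dropped": 0, "corrupted": 0}

    def _delay(self):
        if self.max_delay_s > 0:
            time.sleep(self._rng.uniform(0, self.max_delay_s))

    def put(self, key, value):
        self._delay()
        if self._rng.random() < self.drop_rate:
            self.injected["dropped"] += 1
            return  # the write is lost and the caller sees no error
        self._inner.put(key, value)

    def get(self, key):
        self._delay()
        value = self._inner.get(key)
        if value is not None and len(value) > 0 and self._rng.random() < self.corrupt_rate:
            self.injected["corrupted"] += 1
            i = self._rng.randrange(len(value))
            # Flip one bit of one byte, the way a damaged packet might.
            value = value[:i] + bytes([value[i] ^ 0x01]) + value[i + 1:]
        return value


def run_experiment(n=1000, seed=42):
    """Write n unique keys through the faulty layer, read them back, count the damage."""
    store = FaultyStore(InMemoryStore(), seed=seed, drop_rate=0.02, corrupt_rate=0.02)
    expected = {}
    for i in range(n):
        key = f"k{i}"
        value = f"value-{i}".encode()
        store.put(key, value)
        expected[key] = value

    missing = wrong = 0
    for key, value in expected.items():
        got = store.get(key)
        if got is None:
            missing += 1
        elif got != value:
            wrong += 1
    return {"missing": missing, "wrong": wrong, "injected": store.injected}


if __name__ == "__main__":
    print(run_experiment())
```

None of the injected faults raises an exception. The client only learns about them by comparing what it reads against what it believes it wrote, which is the habit the exercise is meant to build.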
LLMs already know syntax better than most humans. They lack durable mental models of tradeoffs: why abstractions leak, how consistency models constrain architecture, the way latency multiplies across service boundaries until a request that felt instant in testing takes two seconds in production. You build those intuitions by touching the hot stove, not by reading the warning label the generator printed for you.
The Checkride
The characteristic failure mode of AI-generated code is plausible incorrectness. It reads well, passes a casual review, and is slightly wrong in a way that surfaces three months later when a rarely exercised path finally runs in production. Most software interviews select for the skill being automated fastest (producing correct code from a known specification) rather than the skill that matters most (catching the plausible failure before it ships).
In aviation, a checkride is a practical exam where an FAA examiner sits in the cockpit and deliberately introduces failures: engine out on takeoff, a panel of instruments that suddenly disagree, hydraulics bleeding pressure on final approach. You pass by handling degradation. Software interviews should work the same way: show a candidate a service throwing intermittent 502s and ask them to walk through the diagnosis; hand them AI-generated code with a concurrency bug and see if they catch it; present a system architecture and ask where it would fail first under ten times the traffic.
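A concurrency exercise of that kind might look like the sketch below. It is an invented example, not code from any real system: `Inventory`, `reserve_unsafe`, `reserve_safe`, and `run_two` are hypothetical names, and the `pause` argument is a test hook added so the bad interleaving can be forced on demand instead of waiting for it to happen in production.

```python
import threading
import time


class Inventory:
    """reserve_unsafe reads cleanly and passes any single-threaded test."""

    def __init__(self, stock):
        self.stock = stock
        self._lock = threading.Lock()

    def reserve_unsafe(self, qty, pause=None):
        available = self.stock            # read
        if pause is not None:
            pause()                       # test hook: hold the race window open
        if available >= qty:
            self.stock = available - qty  # write based on a possibly stale read
            return True
        return False

    def reserve_safe(self, qty, pause=None):
        with self._lock:                  # read, check, and write as one step
            available = self.stock
            if pause is not None:
                pause()
            if available >= qty:
                self.stock = available - qty
                return True
            return False


def run_two(reserve, pause):
    """Call reserve(1, pause) from two threads and collect both results."""
    results = []

    def worker():
        results.append(reserve(1, pause))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # The barrier releases only after both threads have read stock == 1,
    # so both reservations succeed against a single unit.
    barrier = threading.Barrier(2)
    unsafe = Inventory(1)
    print("unsafe:", run_two(unsafe.reserve_unsafe, barrier.wait), "stock:", unsafe.stock)

    # With the lock held across the whole sequence, the second caller sees stock == 0.
    safe = Inventory(1)
    print("safe:", run_two(safe.reserve_safe, lambda: time.sleep(0.01)), "stock:", safe.stock)
```

The unsafe version leaves the stock at zero, so a dashboard that tracks inventory shows nothing wrong. The only evidence is that two customers were promised the same unit.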
The deeper thing to test for is taste: the instinct for which tradeoff to make and when to choose simplicity over cleverness. An AI can generate five valid architectures for a problem; knowing which one to pick, which abstraction earns its keep and which is ceremony, when two services should be one, when the elegant solution is the fragile one, is judgment that accrues through years of living with the consequences of your own designs. Taste in tradeoffs separates an engineer who ships from an engineer who ships something you can still maintain in two years.
Companies default to LeetCode because better assessments are hard to build, but the same tools that generate code can generate realistic broken systems: a service with a subtle memory leak, an API with a race condition under concurrent writes, a deployment pipeline with a misconfigured rollback. The same tool that makes coding easier makes testing judgment cheaper.
The bugs that scared me at RapidSOS were the quiet ones, and the quiet ones are the only kind LLMs produce. Clean numbers and correct numbers will increasingly look the same on the dashboard, and telling them apart is the part of the job that does not automate, at least not by the same machine that generated the numbers in the first place.
[1] FCC, Notice of Proposed Rulemaking, PS Docket No. 07-114 (February 2014), estimating that inadequate wireless caller location contributes to more than 10,000 deaths per year.