How We Reduced Median Memory Estimation Error by 99%, With the Help of AI
When you’re running a system that processes hundreds of thousands of compaction jobs, even small inaccuracies in memory usage estimates compound into real operational pain. Overestimate, and you waste resources. Underestimate, and you get OOMs: pods crash, work gets retried, and on-call engineers get paged.
At Mixpanel, our people compaction pipeline merges and rewrites user profile data across millions of datashards. Each compaction pod processes many requests concurrently, and it uses a memory estimate for each request to decide how many it can safely run at once. For years, we estimated with a multiplier: the input file size times a fixed constant (2.5). It was simple, it was fast, and it was wrong, often by hundreds of megabytes.
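To make the role of the estimate concrete, here is a minimal sketch of estimate-based admission, assuming a pod simply admits requests in order until their summed estimates would exceed a memory budget. The names `multiplier_estimate` and `admit` and the budget are hypothetical; only the 2.5 multiplier comes from the description above, and the real scheduler is likely more involved.

```python
MULTIPLIER = 2.5  # fixed constant from the original estimation scheme


def multiplier_estimate(input_size_bytes):
    """Estimate memory as input file size times a fixed constant."""
    return input_size_bytes * MULTIPLIER


def admit(pending_input_sizes, memory_budget_bytes):
    """Admit requests in order while the summed estimates fit the budget.

    Hypothetical policy for illustration: stop at the first request
    whose estimate would push the reservation over the budget.
    """
    admitted = []
    reserved = 0.0
    for size in pending_input_sizes:
        estimate = multiplier_estimate(size)
        if reserved + estimate > memory_budget_bytes:
            break
        admitted.append(size)
        reserved += estimate
    return admitted
```

Under a policy like this, every underestimate lets the pod admit more work than it can actually hold, which is how estimation error turns into OOMs.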
We recently fixed this with an approach that reduced median estimation error by 99%. The fix itself turned out to be surprisingly simple, but gaining confidence in it required real analysis: sampling hundreds of thousands of production requests, parsing messy CSVs, building scatter plots and histograms for multiple estimation strategies. I’m an infrastructure engineer, not a data scientist, and this kind of data wrangling would normally have taken me a week just to learn the tooling. AI-assisted analysis compressed that to hours, which meant we had time to explore the entire solution space before committing to a direction.
We Knew Estimates Were Bad
We didn’t need a study to know memory estimation was a problem. OOMKills were a regular occurrence, our dashboards showed estimation error spread across gigabytes, and on-call engineers were spending real time dealing with compaction failures. But “it’s bad” isn’t actionable. We needed to understand exactly how bad, and where the errors were coming from.
So we added sampled logging to the compaction pipeline, recording the estimated and actual memory usage for each request. Over a three-day window we collected 146,000 data points.
Analyzing the Data
Now we had data. The problem was what to do with it. We used an AI agent to generate the analysis script. The workflow: describe the analysis we wanted, have the agent modify the script, audit it for correctness, run it against the data, iterate. When we wanted to explore a new estimation approach, we described the idea and had a working comparison in minutes.
This turned out to matter a lot, not just for speed, but for what we were willing to try.
How Bad Is the Multiplier?
First question: how well does the current 2.5x multiplier actually perform?
The mean absolute error (MAE) was 0.48 GB. The multiplier consistently underestimates, and the worst outliers are off by 10+ GB. Those outliers are what cause OOMs.
The most obvious fix was to tune the multiplier. We used least-squares regression to find the optimal constant, which turned out to be 3.9x.
The MAE decreased only from 0.48 GB to 0.47 GB, a 2% improvement, and the worst outliers were still off by 10+ GB. The multiplier wasn’t the problem; the approach itself was.
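For readers who want to reproduce this kind of refit, here is a minimal sketch. It assumes the fit is a single constant with no intercept (the multiplier form), in which case least squares has the closed form k = Σxy / Σx². The function names are hypothetical, and this is not the actual analysis script.

```python
def fit_multiplier(input_sizes, actual_memory):
    """Least-squares constant k minimizing sum((k * x - y) ** 2).

    With no intercept term, the minimizer is sum(x * y) / sum(x * x).
    """
    sxy = sum(x * y for x, y in zip(input_sizes, actual_memory))
    sxx = sum(x * x for x in input_sizes)
    if sxx == 0:
        raise ValueError("need at least one non-zero input size")
    return sxy / sxx


def mean_absolute_error(estimates, actuals):
    """Average of |estimate - actual| over all sampled requests."""
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(actuals)
```

Note that least squares minimizes squared error, not absolute error, so a refit constant is not guaranteed to move the MAE by much even when it differs a lot from the original.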
The situation was even worse for our most problematic requests. We separately looked at requests that had more than 2 GB of estimation error, pulling 70k of them over a week-long window.
With an MAE of 5.9 GB, these requests were being underestimated by enormous amounts and put us at the highest risk of OOMKills. The current approach had almost no predictive value for them (R² = -0.80).
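A negative R² can look like a typo, so it is worth spelling out. Assuming the figure is the standard coefficient of determination computed directly on the estimates (the exact formula used in the analysis is not stated here), a sketch looks like this:

```python
def r_squared(estimates, actuals):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    SS_res is the squared error of the estimates; SS_tot is the squared
    error of always predicting the mean of the actual values.
    Assumes the actual values are not all identical (SS_tot > 0).
    """
    mean_actual = sum(actuals) / len(actuals)
    ss_res = sum((a - e) ** 2 for e, a in zip(estimates, actuals))
    ss_tot = sum((a - mean_actual) ** 2 for a in actuals)
    return 1 - ss_res / ss_tot
```

Under this definition, any value below zero means the estimates have more squared error than a constant guess of the mean actual usage would.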
What If We Just Remember Last Time?
Looking at the data, one thing stood out: a given datashard’s memory footprint doesn’t change much between compaction runs. The set of users in a shard, the number of properties they have, the shape of the data: these things evolve slowly. If a shard used 1.2 GB last time, it’ll probably use something close to 1.2 GB next time.
This suggested an almost trivially simple approach: after each compaction run, record how much memory was actually used, and use that as the estimate for the next run.
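The idea fits in a few lines. This is a minimal sketch, not the production implementation: `LastValueEstimator` is a hypothetical name, an in-memory dict stands in for whatever store holds the per-shard values, and falling back to the old multiplier for shards with no history is our assumption for illustration.

```python
class LastValueEstimator:
    """Estimate a shard's memory use as whatever it used on its last run."""

    def __init__(self, fallback_multiplier=2.5):
        self._last_used = {}  # shard_id -> memory used on the last run
        # Assumption: shards with no history fall back to the multiplier.
        self._fallback_multiplier = fallback_multiplier

    def estimate(self, shard_id, input_size_bytes):
        last = self._last_used.get(shard_id)
        if last is not None:
            return last
        return input_size_bytes * self._fallback_multiplier

    def record(self, shard_id, actual_bytes):
        """Call after each compaction run with the observed memory use."""
        self._last_used[shard_id] = actual_bytes
```

A run would call `estimate` before scheduling the request and `record` once it finishes, so each shard's estimate is never more than one run out of date.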
We described this to the agent, and within minutes had the analysis running against the same dataset.
Mean absolute error dropped from 0.48 GB to 0.12 GB, a 4x improvement.
The largest impact came on the requests that were causing us the most trouble: the outliers.
MAE dropped from 5.9 GB to 0.14 GB for the cases that mattered most.
Making Sure We Weren’t Fooling Ourselves
A 4x improvement from a one-line idea sounds too good to be true. Before shipping, we wanted to make sure we weren’t missing a better approach or a hidden failure mode. Because LLMs made iteration so cheap, we could afford to be thorough.
We explored four alternatives, each motivated by a specific hypothesis:
Including request source in the grouping. Early investigation of one particularly problematic project revealed bimodal memory patterns correlated with the source of the compaction request. We tested whether grouping by shard and source would improve estimates. On the worst outliers, it helped modestly (MAE from 0.14 to 0.11 GB), but for standard requests it made no difference. Not worth the added complexity.
Grouping by project instead of shard. Using the last memory value for any shard in the same project would mean fewer cold starts for new shards. But project-level grouping is a worse predictor overall (shards within the same project can have very different memory profiles) and the cold-start benefit only matters for the first couple of compaction runs.
Exponential moving average. Instead of using last run’s value directly, we could smooth the estimate with an EMA to reduce volatility. With α=0.1, this actually showed a slight improvement on standard data (0.11 GB vs 0.12 GB MAE). But on the outlier set, exactly the cases where accurate estimation matters most, it was significantly worse (0.27 GB vs 0.14 GB). The smoothing prevented the estimate from keeping up with legitimate changes in memory usage.
Per-shard file size ratios. We could combine both approaches: track a per-shard multiplier (memory used ÷ input size) and apply it to the next run’s input size. In theory, this captures both shard-specific behavior and input size changes. In practice, it was worse (0.15 GB vs 0.12 GB MAE) because when inputs are small, the ratio blows up and produces absurd estimates for the next run.
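The failure modes of the last two alternatives are easy to see on toy numbers. The sketch below uses made-up values, not production data, and the function names are hypothetical. It shows an EMA with α=0.1 lagging behind a genuine jump in memory usage, and a per-shard ratio blowing up after a run with a very small input.

```python
def ema_update(previous, observed, alpha=0.1):
    """Blend the newest observation into the running estimate."""
    if previous is None:
        return observed
    return alpha * observed + (1 - alpha) * previous


def ratio_estimate(prev_memory_gb, prev_input_gb, next_input_gb):
    """Apply the shard's last memory/input ratio to the next input size."""
    return (prev_memory_gb / prev_input_gb) * next_input_gb


# Hypothetical shard whose usage jumps from 1 GB to 4 GB.
observed_gb = [1.0, 1.0, 1.0, 4.0]
ema = None
for value in observed_gb:
    ema = ema_update(ema, value)
last_value = observed_gb[-1]
# The EMA has only moved to about 1.3 GB; the last value is already 4.0 GB.

# Hypothetical shard: a 0.001 GB input needed 0.5 GB (mostly fixed overhead),
# so the ratio is about 500 and a 0.1 GB input is estimated at about 50 GB.
blown_up = ratio_estimate(0.5, 0.001, 0.1)
```

In both cases the simple last-value estimate avoids the problem: it tracks the jump after a single run and never divides by an input size.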
Each alternative had a plausible reason to be better, and each fell short for a specific, understandable reason. Four approaches, two datasets each, scatter plots and error metrics for all of them. By the time we shipped, we had the data to prove the simple approach was best.
Production Results
After rolling out the change over the course of a week, the numbers spoke for themselves:
99% reduction in median memory estimation error (-570 MiB to -4.6 MiB)
The middle 50% of estimates went from undershooting by 283 to 839 MiB to landing within ±47 MiB of actual usage
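These are statistics of the signed error (estimate minus actual, so negative means undershooting). Below is a minimal sketch of how such a summary can be computed, assuming the middle 50% refers to the range between the first and third quartiles. The function names are hypothetical, and dashboard tooling may compute quartiles slightly differently from Python's `statistics.quantiles`.

```python
import statistics


def signed_errors(estimates, actuals):
    """Estimate minus actual: negative values are underestimates."""
    return [e - a for e, a in zip(estimates, actuals)]


def summarize(errors):
    """Median plus the quartiles that bound the middle 50% of errors."""
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return {"median": statistics.median(errors), "q1": q1, "q3": q3}


def relative_reduction(before, after):
    """Fractional shrink in error magnitude, e.g. 0.99 for a 99% reduction."""
    return 1 - abs(after) / abs(before)


# Median error figures reported above, in MiB.
reduction = relative_reduction(-570.0, -4.6)
```

With the reported medians, `reduction` comes out just above 0.99, which is where the headline figure comes from.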
The solution also has a nice durability property: because it’s based on observed behavior rather than assumptions about the binary format, it automatically adapts when the compaction code changes. If a future optimization alters memory usage, the estimates self-correct on the next run.
What We Took Away
The lesson is about where AI fits in engineering work. There’s a legitimate concern around AI-generated code in high-blast-radius environments like production databases, but there’s a large category of engineering work (analysis, investigation, prototyping, data exploration) where the stakes are fundamentally different. Bugs in an analysis script produce wrong charts, not production outages. Match the AI’s role to the blast radius: we got AI’s speed advantage exactly where errors are cheap and caught fast, and kept human authorship where errors are expensive and hard to detect. AI for investigation, humans for implementation.
AI didn’t write our solution. It compressed the research cycle enough that an infrastructure engineer with no data science background could explore the solution space, rule out the alternatives, and ship with real confidence. The whole project, from initial data collection through production rollout, took about two weeks.