
Fresh Data is Fairer Data

Sep 2025

My mother’s solution to sibling cake disputes was elegant: one child cuts, the other chooses. My sister and I would cut as evenly as humanly possible; we couldn’t allow the other to get one over on us.

Fairness emerged not from rules about equal slices, but from consequences. Cut unfairly and you got the smaller piece. The cost of your own unfairness landed on you, instantly.

Between siblings, equality was fair; no ‘feature’ of either of us warranted a bigger slice. Most decisions are messier. When a model decides who gets a loan, some features should matter and others shouldn’t. A fair decision, at minimum, is one that doesn’t lean on irrelevant characteristics like race or gender.¹

Giving models fair exemplars to learn from is tricky, and the two obvious approaches to constructing them both fail.

Remove sensitive attributes like race or gender, and the model finds proxies; postcode, name, and browsing history all correlate with the thing you hid. This is “proxy discrimination”, and it’s near-universal.
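A contrived sketch makes this concrete. Everything below is synthetic and invented for illustration (the ‘postcode’ feature, the size of the gap): a scoring rule that never sees the sensitive attribute still treats the two groups very differently, because the proxy smuggles the signal back in.

    # Minimal sketch of proxy discrimination on synthetic data (NumPy only).
    # The sensitive attribute is never given to the "model", but a correlated
    # feature carries it in anyway.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50_000

    group = rng.integers(0, 2, size=n)                 # hidden sensitive attribute
    postcode = 1.5 * group + rng.normal(0, 1, size=n)  # proxy: correlates with group

    # A scoring rule that looks only at the postcode feature.
    approved = postcode > np.median(postcode)

    # Approval rates still differ sharply by group, even though `group`
    # never appears as an input.
    for g in (0, 1):
        print(f"group {g}: approval rate {approved[group == g].mean():.2f}")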

Remove correlated attributes too? Uncorrelated features can combine to reconstruct what you removed. The information is encoded redundantly across the data. You can’t scrub it out.
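A toy example of that redundancy, again on synthetic data and deliberately extreme: each feature is uncorrelated with the hidden attribute on its own, yet a simple classifier reconstructs it from the pair almost perfectly.

    # Minimal sketch of redundant encoding on synthetic data (scikit-learn).
    # Two features, each uncorrelated with the hidden attribute, jointly
    # determine it.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n = 20_000

    a = rng.integers(0, 2, size=n)   # "innocent" binary feature
    b = rng.integers(0, 2, size=n)   # another, independent of the first
    group = a ^ b                    # hidden attribute: the XOR of the two

    # Each feature alone tells you essentially nothing about the group...
    print("corr(a, group):", round(float(np.corrcoef(a, group)[0, 1]), 3))  # ~0
    print("corr(b, group):", round(float(np.corrcoef(b, group)[0, 1]), 3))  # ~0

    # ...but together they reconstruct it almost exactly.
    X = np.column_stack([a, b])
    X_tr, X_te, g_tr, g_te = train_test_split(X, group, random_state=0)
    clf = DecisionTreeClassifier().fit(X_tr, g_tr)
    print("reconstruction accuracy:", round(clf.score(X_te, g_te), 3))      # ~1.0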

Even deciding which features count as “fair” gets tangled. Take loan repayment. Income correlates with gender, but income is also predictive on its own. Is it fair to use? Most of the predictive power comes from having money, not from being male. But “most” is doing a lot of work in that sentence.

Cut and choose

These problems share a root: we’re trying to retrofit fairness onto data produced under unfair conditions. The data reflects the world that generated it, bias² and all.

But moral standards shift. Google’s language models show measurable decreases in gender bias when trained on more recent text.

Google PAIR’s Fill in the Blank explorable, showing how gender bias in language models decreases with more recent training data.

The world that generates today’s data is imperfect, but it is certainly fairer than the world that generated yesterday’s.

This is the cake-cutting principle at scale. A society still reckoning with its own biases produces data that reflects that reckoning; people feel the consequences of unfairness and push back. Stale data carries no such correction.

Today’s consensus has blind spots that future generations will find obvious, much as we look back and wonder how certain things were ever acceptable.

Fresh data won’t make models fair. But stale data almost guarantees they won’t be.

  1. A decision is fair, at minimum, if it does not rely on protected characteristics such as race or gender. The UK Government lists: age, gender reassignment, being married or in a civil partnership, being pregnant or on maternity leave, disability, race, religion or belief, sex, and sexual orientation. This baseline isn’t foolproof; we could discriminate on an unprotected characteristic (say, blue eyes) in a way that would certainly strike us as unfair. But we must restrict the definition somewhere; otherwise we’re specifying the problem to such a degree that we no longer need to learn a decision.

  2. I have a thing about the word bias; you can read about it if you’re interested.