Nguyen Le Phong

Celebrating Failure

A grounded look at what teams should really mean by celebrating failure: not applauding damage, but making honest learning, early signal reporting, repair, and changed systems safer to practice.

The calendar invite appears the morning after a small incident: "Failure celebration." Everyone knows the intention is kind, but the room still feels careful. Support spent the evening answering customers. Engineers are tired. A manager wants the team to learn without fear. The phrase is trying to help, but it can sound like the team is being asked to smile at damage that real people had to absorb.

That is why I think celebrating failure needs a calmer definition. A healthy team does not celebrate the failure itself. It celebrates the honesty that made the failure visible, the care used to repair it, and the system changes that make the next failure less likely. The celebration is not applause for a broken release. It is respect for the learning loop that starts after something goes wrong.

Failure is useful only when it becomes information. A failed experiment can teach the team that an assumption was wrong. A production incident can reveal a missing alert, a confusing runbook, or a risky deployment habit. A missed deadline can show that planning ignored review time or dependency risk. The value is not in the pain. The value is in the signal that was hidden until reality pushed back.

This distinction matters because slogans can become careless. If leaders say "fail fast" but punish the first person who names a risk, people learn to hide the risk. If a postmortem claims to be blameless but quietly searches for the one person who clicked the wrong button, people learn to write safer, less honest stories. Culture is not what the team says about failure when everything is fine. It is what happens to the person who brings an uncomfortable truth into the room.

A team worth trusting celebrates early signals more than dramatic recoveries. The engineer who says "this migration is not ready yet" may prevent an incident nobody will ever see. The QA engineer who holds back a release because one edge case feels wrong is protecting trust in a quiet way. The support teammate who reports a pattern before it becomes a dashboard spike is doing engineering work in another form. These moments deserve attention because they make failure smaller.

When failure does happen, repair should come before reflection. Customers need clear communication. Data may need correction. The team may need rest. Only then can learning become honest. A postmortem held too early can become a performance of maturity while people are still trying to clean up. Calm learning usually begins after the immediate harm has been contained.

A useful postmortem asks simple questions. What happened? How did we detect it? Why was this path reasonable to the people involved at the time? Where did our system make the wrong action easy or the right action hard? What will we change, who owns it, and how will we know the change worked? These questions move the team away from personal drama and toward better defaults.

The word "blameless" does not mean nobody was responsible. It means responsibility is used to improve the system, not to stage a trial. Someone may own a follow-up action. Someone may need more support or clearer review. A decision may need to be corrected. But the goal is to understand the conditions that produced the mistake, because those conditions are usually waiting for the next person too.

There is also a point where repeated failure is no longer brave. If the same incident happens three times and the team keeps calling it learning, the ritual has lost its honesty. Learning should leave evidence: a safer deploy process, a better alert, a smaller blast radius, a clearer owner, a changed checklist, a retired risky shortcut. Without changed behavior, celebration becomes a way to avoid accountability.

Leaders set the temperature here. They can lower fear by sharing their own missed calls, funding prevention work, protecting time for follow-up, and refusing to turn incidents into gossip. They can also raise fear quickly with one sarcastic comment, one private blame campaign, or one public lesson that names the person more than the system. People remember what happened after the last mistake.

The healthiest teams I have seen do not romanticize failure. They would rather avoid it. But when it arrives, they treat it as shared material for improvement. They thank the person who surfaced it, repair what was hurt, study the path with patience, and change something real. That is quieter than a celebration, but much more useful.

If your team has a story about a failure that genuinely made the work safer, the interesting part is probably not the failure itself. It is the small change that stayed afterward: the alert that now fires earlier, the review question people now ask, the release habit that became less rushed. Those are the things worth remembering, because they are how a team turns one difficult day into better ordinary days.
