Nguyen Le Phong

Software Measurement and Estimation: Fewer Gut Feelings, Better Engineering Decisions

A practical reflection on Software Measurement and Estimation: A Practical Approach by Linda M. Laird and M. Carol Brennan. The article focuses on the ideas engineers can use at work: measure from decisions, estimate as uncertainty instead of promise, track quality before release, and keep dashboards useful instead of decorative.

The spreadsheet was already open before the planning meeting started. A few rows had feature names, a few cells had optimistic dates, and one column was waiting for effort estimates. Everyone in the room knew the ritual: someone would ask for a number, someone else would hesitate, and eventually the team would write down something that looked precise enough to move the meeting forward.

That small office moment is where software measurement becomes real. Not in a dashboard, not in a methodology poster, but in the quiet gap between what we want to know and what we can honestly say. Is this release on track? Is the defect trend safe? Is the estimate a promise or a range? Are we measuring the thing that helps a decision, or just collecting numbers because grown-up projects are supposed to have numbers?

Book source

This article is a personal reflection from Software Measurement and Estimation: A Practical Approach by Linda M. Laird and M. Carol Brennan, published by John Wiley & Sons with the IEEE Computer Society in 2006. I am not trying to summarize every chapter. I am keeping the ideas that still feel useful when a real software team has to plan, ship, explain risk, and improve.

The most useful lesson I took from the book is simple: measurement is not about worshiping numbers. It is about reducing avoidable confusion. A number should help someone make a better decision. If it does not change a decision, explain a risk, or reveal a trend, it may only be decoration.

A very ordinary example is a team preparing to launch a new payment flow. The sentence “80% of tasks are done” sounds comforting, but it may not help anyone decide whether to release. More useful questions are: are payment callbacks still timing out, how many orders get stuck after the user has been charged, does the problem cluster around one bank or device type, and does support have enough information to trace a failed transaction? Those numbers are less decorative, but they sit closer to the real decision.

Measure from decisions, not from dashboard hunger

One of the strongest ideas in the book is the spirit of GQM: Goal, Question, Metric. Start with the goal, ask the question that tells you whether the goal is being met, then choose the metric that can answer that question. This sounds obvious until you watch how many teams do it backward. They start with available data, create a chart, and only later try to invent a reason the chart matters.

A healthier flow is more humble. If the goal is to know whether a release is stable enough to ship, the question may be: are severe defects still arriving faster than we can close them? A useful metric may combine defect arrival, defect closure, and backlog. If the goal is to understand whether an outsourcing vendor is healthy, the question may be: are they delivering usable work, with acceptable quality, and responding quickly when something is wrong? The metric set may need progress, quality, and responsiveness, not only hours spent.

Imagine a team building a new checkout flow. If the goal is to reduce drop-off at payment, the question should not stop at “is the coding done?” Better questions are: where do users drop, how much of the drop is caused by technical errors, how much comes from confusing validation copy, and whether one user segment is affected more than others. The metric set may include funnel drop-off, payment error rate, support tickets by root cause, and the time needed to resolve stuck orders. A small dashboard like that lets product, QA, backend, and support look at the same picture.

Weak measurement | Better measurement question | More useful signal
Count all bugs | Which phase is leaking defects into later work? | Defect removal by phase, escaped defects, backlog trend.
Count lines of code | Is size growing in a way that changes effort and risk? | Size trend, functional size, complexity, review effort.
Track utilization | Is the team making progress toward valuable outcomes? | Milestones passed, accepted work, cycle blockers, rework.
Show many charts | Which few facts does leadership need this week? | A small dashboard with target, trend, and drill-down path.

The book also adds a practical warning through GQM²: a metric needs a mechanism. Who collects it? How often? From which source? With what definition? A metric without a collection mechanism becomes stale quickly. Worse, different people may collect the same metric using different rules, and the team ends up arguing about the number instead of learning from it.

Tiny metric spec

A practical release-readiness metric can be written in one line: track P0 and P1 defects opened, closed, and still ownerless in the last 48 hours, from the issue tracker, reviewed every Monday and Thursday during stabilization. The decision rule is just as important as the number: if the critical backlog grows in two consecutive reviews, optional scope moves out before the team adds pressure or overtime.
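
The decision rule is small enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the `Review` record, its field names, the starting backlog of 10, and the sample counts are hypothetical, and in practice the numbers would come from whatever the issue tracker exports.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One stabilization review (hypothetical record filled from the issue tracker)."""
    label: str
    opened: int     # P0/P1 defects opened since the previous review
    closed: int     # P0/P1 defects closed since the previous review
    ownerless: int  # P0/P1 defects still without an owner


def critical_backlog(start: int, reviews: list[Review]) -> list[int]:
    """Running count of open P0/P1 defects after each review."""
    backlog, current = [], start
    for review in reviews:
        current += review.opened - review.closed
        backlog.append(current)
    return backlog


def should_cut_scope(start: int, reviews: list[Review]) -> bool:
    """Decision rule: the critical backlog grew in each of the last two reviews."""
    levels = [start] + critical_backlog(start, reviews)
    if len(levels) < 3:
        return False  # fewer than two reviews, so the rule cannot apply yet
    return levels[-1] > levels[-2] > levels[-3]


# Illustrative numbers only.
reviews = [
    Review("Mon", opened=4, closed=5, ownerless=1),
    Review("Thu", opened=6, closed=3, ownerless=2),
    Review("Mon", opened=5, closed=4, ownerless=2),
]
print(critical_backlog(10, reviews))  # [9, 12, 13]
print(should_cut_scope(10, reviews))  # True
```

The code matters less than the fact that the rule is written down before the review happens, so nobody has to negotiate the threshold while looking at a red number.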

Estimation is not a promise carved into stone

The second lesson is about estimation. An estimate is not a moral commitment. It is a model of uncertainty based on what the team currently knows. That distinction matters because software work contains discovery. Requirements shift, dependencies surprise us, hidden complexity appears, and people learn while doing the work.

The book walks through several estimation families: expert estimation, decomposition, Wideband Delphi, Use Case Points, Function Points, and COCOMO-style algorithmic models. I read those less as a menu of formulas to memorize and more as a reminder that every estimate has a worldview. Expert estimation carries lived context but can be biased. Parametric models look more defensible but depend heavily on the input definition. Proxy methods like Use Case Points or Function Points can help, but only if the team counts consistently.

The useful professional habit is triangulation. If expert judgment, historical data, size-based estimation, and risk analysis all point in roughly the same direction, confidence rises. If they disagree, the disagreement itself is information. It may reveal unclear scope, unfamiliar technology, weak requirements, or a team assumption nobody has said out loud yet.

For example, a senior engineer may estimate a third-party eKYC integration at five days because “the API is simple.” Historical data may tell a different story: previous third-party integrations took 10 to 15 days because the sandbox was unstable, error cases were poorly documented, and compliance review arrived late. Decomposition may then show that the happy path is only three days, while retries, reconciliation, support logs, and monitoring add the real cost. The disagreement between estimates is not noise to hide. It is the signal that the team needs to surface its assumptions and add buffer.
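
A rough way to make that disagreement visible is to put the estimates side by side and flag when their midpoints differ by more than some agreed ratio. The sketch below is not a method from the book: the function name `triangulate` and the 1.5 threshold are assumptions chosen for illustration, and the two sample ranges are the expert and historical figures from the eKYC example above.

```python
def triangulate(estimates, max_ratio=1.5):
    """Compare (low, high) day ranges produced by different estimation methods.

    Returns the overall range and a flag that is True when the methods
    disagree enough to justify a conversation about assumptions.
    max_ratio is an arbitrary illustrative threshold, not a standard value.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    midpoints = [(low + high) / 2 for low, high in estimates.values()]
    smallest, largest = min(midpoints), max(midpoints)
    if smallest <= 0:
        raise ValueError("estimates must be positive")
    overall = (
        min(low for low, _ in estimates.values()),
        max(high for _, high in estimates.values()),
    )
    return overall, largest / smallest > max_ratio


estimates = {
    "expert judgment": (5, 5),    # "the API is simple"
    "historical data": (10, 15),  # previous third-party integrations
}
overall, disagree = triangulate(estimates)
print(overall, disagree)  # (5, 15) True
```

When the flag is raised, the useful output is not the merged range. It is the list of assumptions each method was quietly making.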

The dangerous sentence

“Just add more people and compress the schedule” is often not a plan. The book discusses schedule compression limits: after a point, pushing calendar time down forces effort and coordination cost up. In real teams, the added people also need context, review, communication, and integration. The date may look shorter on paper while the system becomes harder to finish.

This is why a good estimate should travel with assumptions. What scope is included? Which dependencies are trusted? What historical data supports the number? What risks have a cost reserve? What would make the estimate invalid? A number without its assumptions looks clean, but it is fragile. A range with assumptions may look less confident, but it is usually more honest.

A more useful planning answer may sound like this: “If we count only the happy path and the partner API is stable, this is around five to seven days. If we handle timeout, retry, reconciliation, and operational dashboard properly, the range is closer to nine to fourteen days. If sandbox access is not ready by Wednesday, this estimate needs to be reopened.” It is longer than one number, but it helps the PM, PO, and engineer see what is being traded.

Quality should become visible before release day

The book treats defects as more than unpleasant surprises. Defects are traces of the engineering process. Where they are injected, where they are found, how fast they arrive, and how many escape into production all say something about how the team is working.

I found the idea of Defect Removal Efficiency especially useful. The formula is simple: divide the defects found before release by all the defects found, before and after release. But the deeper value appears when you look by phase. If requirements defects keep escaping into design and testing, the answer is not only more testing. The answer may be better requirement review, clearer examples, earlier stakeholder alignment, or a smaller batch of work.

Take a simple profile update screen. If QA repeatedly finds bugs because required fields are unclear, date formats differ between screens, or error messages do not tell users what to fix, the defect is not only a testing problem. It may have been injected when requirements lacked examples, Figma did not include error states, or the API contract did not define validation rules. Looking at where defects are created and where they are found helps the team improve the place that actually produced the problem.
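
Both views, the overall Defect Removal Efficiency and the leak by phase, can be computed from the same defect log once each defect records where it was injected and where it was found. The following is a minimal sketch with invented data: the phase names, the `(injected, found)` tuple layout, and the eight sample defects are assumptions made for illustration.

```python
from collections import Counter

# Phase order is an assumption; use whatever phases the team actually tracks.
PHASES = ["requirements", "design", "coding", "testing", "production"]


def dre(defects):
    """Overall DRE: share of all known defects that were found before release."""
    found_before_release = sum(1 for _, found in defects if found != "production")
    return found_before_release / len(defects)


def leaked_by_origin(defects):
    """Count defects found in a later phase than the one that injected them."""
    order = {phase: index for index, phase in enumerate(PHASES)}
    return Counter(
        injected for injected, found in defects if order[found] > order[injected]
    )


# (injected phase, found phase) pairs, invented for the example.
defects = [
    ("requirements", "requirements"),
    ("requirements", "testing"),
    ("requirements", "testing"),
    ("requirements", "production"),
    ("design", "design"),
    ("coding", "coding"),
    ("coding", "testing"),
    ("coding", "production"),
]
print(dre(defects))                     # 0.75
print(dict(leaked_by_origin(defects)))  # {'requirements': 3, 'coding': 2}
```

In this made-up log the overall figure looks acceptable, yet most of the leaked defects started in requirements, which points at review and examples rather than at more testing.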

This connects well with in-process metrics. A team should not wait until the final week to learn that quality is in trouble. Code integration trend, test execution progress, pass rate, defect arrival, defect closure, and backlog movement are all early signals. They are not perfect, but they help the team ask better questions while there is still time to act.

Another example is an integration branch that turns red every afternoon because many small merges collide. If the team only looks at the end of the sprint, the story becomes “the release slipped.” If the team watches earlier signals, a pattern may appear: failures spike when three modules touch pricing rules, automated tests miss discount-plus-voucher cases, and bug closure slows because nobody clearly owns that rule. The metric helps the team look for the real bottleneck instead of only putting more pressure on the people fixing bugs.

A practical release question

Before asking whether all test cases have been run, ask whether the system is still producing new important defects faster than the team can understand and close them. A release can look busy and still not be converging.

Metrics can go stale, and people can learn to game them

The book is careful about a problem every organization eventually meets: once a metric becomes a target, people adapt to it. If lines of code are rewarded, code grows. If number of tickets closed is rewarded, tickets become smaller or easier. If test count is rewarded, weak tests multiply. The team may improve the visible number while the underlying system does not improve.

Velocity can behave the same way. If a team is asked every week why story points are not increasing, people may learn to split tickets smaller, inflate points, or avoid work that is hard to count, such as refactoring, reading production logs, or writing a safe migration. The chart may look smoother while the codebase quietly becomes harder to change. At that point the metric is no longer a lens on reality. It is a game people play to protect themselves.

That does not mean metrics are useless. It means metrics need review. A metric that helped the team last year may be stale this year. It may have solved the original problem, lost decision value, or started creating unhealthy behavior. Good measurement programs are living systems. They retire old metrics, introduce better ones, and keep asking whether the number still reflects the reality it claims to represent.

The same applies to dashboards. The book describes the 4 Ds: decide the metric, draw it clearly, place a small set on a dashboard, and allow drill-down when something turns red. I like that because it keeps dashboards from becoming wall art. A useful dashboard should show target, trend, and enough context to ask the next question. If it cannot lead to a decision or a drill-down, it is probably only a reporting habit.

How I would use this book inside a team

I would not start by launching a full measurement program. That would be too heavy for most teams. I would start with one recurring decision that currently feels too emotional or too vague. Release readiness is a good candidate. Estimation confidence is another. Vendor delivery, defect leakage, production reliability, or project financial health can also work.

Then I would write the chain in plain language: goal, question, metric, mechanism, review rhythm, and decision rule. For example: our goal is to know whether the release is converging. The question is whether important defects are arriving slower than they are being closed. The metrics are arrival, closure, and backlog by severity since the previous review. The mechanism is the issue tracker with one shared definition of severity. The review rhythm is twice a week during stabilization. The decision rule is that release scope must be reduced or the date revisited if critical backlog keeps growing for two consecutive reviews.

In practice, I would start with a very small board: open critical defects, new defects minus closed defects in the last 48 hours, defects older than five days without a clear owner, and untested scope by risk level. Four rows can make a review meeting less vague. If everything is green, the team has a calmer reason to keep moving. If one row is red, the team knows where to drill down first instead of opening 12 charts and making no decision.
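
The four rows can be derived from a plain export of the issue tracker. Here is a hedged sketch: the `Defect` record, its field names, the choice of P0 and P1 as "critical", and the sample data are all hypothetical, and the untested-scope row is simply passed in because it usually lives in a test plan rather than in the tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Defect:
    """Hypothetical shape of one exported issue."""
    severity: str
    opened_at: datetime
    closed_at: Optional[datetime] = None
    owner: Optional[str] = None


def board(defects, untested_scope, now):
    """Build the four-row review board as a dictionary."""
    window_start = now - timedelta(hours=48)
    stale_before = now - timedelta(days=5)
    still_open = [d for d in defects if d.closed_at is None]
    opened_recent = sum(1 for d in defects if d.opened_at >= window_start)
    closed_recent = sum(
        1 for d in defects if d.closed_at is not None and d.closed_at >= window_start
    )
    return {
        "open critical defects": sum(
            1 for d in still_open if d.severity in ("P0", "P1")
        ),
        "new minus closed (48h)": opened_recent - closed_recent,
        "ownerless for more than 5 days": sum(
            1 for d in still_open if d.owner is None and d.opened_at < stale_before
        ),
        "untested scope by risk": untested_scope,
    }


now = datetime(2024, 5, 16, 10, 0)  # fixed timestamp so the example is repeatable
defects = [
    Defect("P0", now - timedelta(days=6)),
    Defect("P1", now - timedelta(days=1), owner="an"),
    Defect("P2", now - timedelta(days=3), closed_at=now - timedelta(hours=12), owner="binh"),
    Defect("P1", now - timedelta(hours=10)),
]
rows = board(defects, {"high": 2, "medium": 5}, now)
for name, value in rows.items():
    print(f"{name}: {value}")
```

With this sample data the board shows three open critical defects, a net of one new defect in the last 48 hours, and one critical defect that has sat ownerless for more than five days, which is the row I would drill into first.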

That kind of measurement will not make software predictable like factory work. Software still contains judgment, creativity, uncertainty, and people. But it gives the team a calmer shared surface. Instead of arguing from memory, anxiety, or seniority, the team can point to a trend and ask what it means.

Key takeaways

  • Measurement should start from a decision. Goal, question, metric, and mechanism keep the team from collecting numbers that nobody uses.
  • Estimates are models of uncertainty, not promises. Keep assumptions, ranges, risks, and historical context attached to the number.
  • Use several estimation lenses. Expert judgment, decomposition, historical data, and parametric models each see different parts of the work.
  • Defects are process signals. Where defects are injected and found can reveal weak reviews, unclear requirements, or late integration.
  • Dashboards should be small and actionable. A good chart has a definition, a target, a trend, a status, and a drill-down path.
  • Metrics go stale. Review them before people optimize the number while the real system stops improving.

The lasting value of Software Measurement and Estimation, for me, is not that it gives software teams a perfect formula. It gives a quieter discipline: say what you are trying to decide, define the number carefully, collect it consistently, and keep enough humility to change the metric when reality changes. That is not glamorous work. But many healthier teams are built from exactly that kind of quiet accumulation: one clearer estimate, one better review signal, one less decorative dashboard, and one more decision made with evidence instead of only feeling.

Frequently asked questions

What is software measurement?
Software measurement is the disciplined practice of defining, collecting, and interpreting numbers about software products, processes, teams, and outcomes so a team can make better decisions.

What is GQM in software measurement?
GQM means Goal, Question, Metric. You start with a goal, define the question that shows whether the goal is being met, then choose the metric that can answer that question.

Why are software estimates often wrong?
Software estimates are difficult because requirements, dependencies, technical complexity, team familiarity, and hidden risks change while the work is happening. A good estimate should include assumptions, ranges, and risks rather than pretending to be a fixed promise.

How can teams avoid bad metrics?
Start from a real decision, define the metric clearly, decide who collects it and how often, review whether it is still useful, and watch for gaming behavior when people optimize the number instead of the outcome.