Managers at software companies often turn to the more mature construction and manufacturing industries for inspiration on how to improve the reliability of the software development process. While at Ericsson, early in my career, I had a mentor who managed a large software engineering unit. He recommended that before I read anything else, I start with The Goal, a novel about a young executive who turns around a failing manufacturing plant. Years later, The Phoenix Project appeared, an adaptation of The Goal to the IT industry. It was a huge success and soon gained a cult following among software engineering managers.
Both The Goal and The Phoenix Project are written as parables: novels accessible to laymen, but designed to impart wisdom on how to detect and remedy issues in manufacturing and software development. What are some of their key takeaways?
When things go wrong, everyone has their own opinion on what needs to be fixed first. To reach consensus, it's useful to view the organization as a system (be it a manufacturing plant, a support organization or a software development team) producing some kind of output (cars built, tickets closed, user stories delivered). By defining key metrics (the fewer the better) which measure the output of the system, it becomes possible to identify which processes drive these metrics. Key metrics become natural targets for optimization, a common goal for the whole team.
By measuring first, we know how the system performs today. Only changes which improve these metrics are of interest, which makes it easier to prioritize the team's work.
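To make this concrete, here is a minimal sketch of establishing such a baseline. The data and the choice of metric (weekly ticket throughput) are my own invented examples; neither book prescribes any particular tooling.

```python
from collections import Counter
from datetime import date

# Hypothetical data: closure dates for support tickets, standing in for
# whatever the system's key output is. Names and numbers are mine, not
# from either book.
closed_on = [
    date(2024, 5, 6), date(2024, 5, 7), date(2024, 5, 7),
    date(2024, 5, 14), date(2024, 5, 15), date(2024, 5, 16),
]

# Weekly throughput: the baseline any proposed change must improve on.
weekly_throughput = Counter(d.isocalendar().week for d in closed_on)
for week, count in sorted(weekly_throughput.items()):
    print(f"week {week}: {count} tickets closed")
```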
When there is pressure to deliver more than what the system is capable of, it's tempting to try to ramp up production by "giving it 110%". A notorious example of this is pre-release crunch time at game studios. Sometimes there really is no alternative to temporary hustling, but the reward which goes along with successfully delivering on elevated expectations provides a strong incentive to normalize the new rate of production. Soon yesterday's 110% becomes tomorrow's minimum expectation.
Now if the system is adapted to deliver higher performance without strain, this is fine. Think of someone who recently started weight training: in 3-4 months they can substantially increase how much they can lift without injury, whereas an injury is likely if they try to achieve the same gains in half that time.
Increasing the rate of production without eliminating inherent bottlenecks puts strain on the system, which increases the risk of catastrophic failure: production lines break down, engineers burn out and technical debt accumulates.
Removing a bottleneck from the system yields increased production when the bottleneck is on the critical path. But as soon as one bottleneck is eliminated, another becomes apparent in a different part of the system, leading to a game of whack-a-mole.
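A toy model shows this dynamic in miniature. The stage names and daily capacities below are invented for illustration: throughput is capped by the slowest stage, and "fixing" the current bottleneck simply promotes the next-slowest stage.

```python
# Toy pipeline: each stage has a daily capacity; the system's output is
# capped by the slowest stage. Stage names and numbers are illustrative.
stages = {"code review": 12, "CI": 30, "staging deploy": 8, "QA sign-off": 15}

def bottleneck(stages):
    # The stage with the lowest capacity limits the whole pipeline.
    return min(stages, key=stages.get)

for _ in range(3):
    slow = bottleneck(stages)
    print(f"throughput {stages[slow]}/day, limited by {slow!r}")
    stages[slow] *= 2  # "fix" the current bottleneck; the next one surfaces
```

Each iteration prints a different limiting stage: whack-a-mole, one mole at a time.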
It's tempting to try to fix all the bottlenecks at once. I know I'm particularly prone to trying to "save the world" in one fell swoop. The disadvantage of trying to fix many things at once is that it paradoxically often takes longer. There is a Hungarian folk tale about the devil challenging a tailor to a sewing contest. Satan, wanting to finish first, threads a mile-long thread into his needle so that he wastes no time re-threading. As a result, he must jump in and out of the tailor shop's window to give the thread enough room so that it doesn't tangle. In the end, the tailor wins the contest by using a short thread (and re-threading the needle when necessary).
By effecting a single change at a time, it's easier to measure the impact of the change, make additional adjustments if necessary and move on. Parallel changes require more calendar time, and once they are complete, it's hard to attribute the results to any one of them. Why did things improve (or deteriorate)?
Inventory is anything created by the system that cannot yet be delivered. Think half-finished engines, code which has not been deployed or support tickets put on hold. Inventory tends to accumulate in front of bottlenecks, tying up the organization's resources (money, customer trust, code ownership, etc.). While trapped in inventory, those resources cannot produce value, starving the system if left unaddressed.
Inventory also has a tendency to depreciate. Even non-physical inventory like undeployed code is at risk of "rotting".
Of course, some inventory is necessary. Methods like kanban aim to control - but not necessarily minimize - inventory.
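A small simulation (with invented rates of my own, not from either book) shows both effects: without a limit, inventory in front of a bottleneck grows without bound, while a kanban-style WIP limit caps it at a controlled, non-zero level.

```python
# Illustrative sketch: an upstream stage produces faster than a
# downstream bottleneck consumes, so work piles up in between. A
# kanban-style WIP limit controls - but does not eliminate - that
# inventory. All rates are invented.
UPSTREAM_RATE, BOTTLENECK_RATE = 10, 6

def simulate(days, wip_limit=None):
    queue = 0
    for _ in range(days):
        # Pull new work only while the WIP limit allows it.
        if wip_limit is None:
            started = UPSTREAM_RATE
        else:
            started = min(UPSTREAM_RATE, max(wip_limit - queue, 0))
        queue += started
        queue -= min(BOTTLENECK_RATE, queue)  # bottleneck drains what it can
    return queue

print("inventory after 10 days, no limit:   ", simulate(10))
print("inventory after 10 days, WIP limit 8:", simulate(10, wip_limit=8))
```

Both pipelines deliver at the bottleneck's rate of 6 per day; the limit just keeps resources from being trapped in the queue.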
Both books have a lot more to say about manufacturing and software development / operations (devops), but these were the most important lessons for me.