I. Putting On My McNulty Hat

While I was interviewing a candidate for an engineering role recently, they asked me what my favorite part of the job was. “The detective work,” I told them.

There’s a quote from David Simon, creator of HBO’s The Wire, that has stuck with me for several years:

…and I think there's a frightening aspect to McNulty, which is this: He cares about making the case, clearly. But does he care about the people he's making it for? Does he care about West Baltimore? Is he connected to these people in any empathetic way? And I'm not going near that until viewers are ready to accept the absolute truth of all the cops I've known, which is, the best you can hope for from a really good cop is that he cares about the game. To a good homicide detective, the murder is an affront to his intellectual vanity, and I mean that in the best possible way. "This fucker did this murder, I caught it, and he thinks he's fucking better than me. Fuck him. He's about to find out." That's a good cop.

The detective work that I was talking about involves investigating impersonal things. I’ll leave the other type of detective work, practiced by other fun TV detectives like Dale Cooper, to a future post called (Metaphysical) Detective Work.

We catch some process issue at a measurement step and things don’t add up. I stick close to my personal maxim that “there is no voodoo”: there is always an explanation as to why things turned out the way they did; there is always a way to make sense of data that isn’t making sense.

Given that there is an explanation out there, I always feel a little provoked, the way McNulty does. It’s not personal; there’s no demon causing machines to malfunction in the factory just to piss me off. But there is a right answer, and I’m going to find it, because I know I’m smart and I know the systems and it should be a solvable problem.

II. Being Systematically Stupid

The hours go by quickly when I’m in investigation mode. There are always enough nuggets to keep my motivation up. The first thing to do is to ensure no obvious mistakes happened. I typically review the processing history of whatever material I am investigating and double-check any experiments to ensure they were done as planned. Many times, there was a mistake in the plan that the team didn’t catch at the time, and the plan was followed as written, with predictably bad results. Other times, we forgot to do something that was in the plan, again with bad results.

But things generally don’t turn into an investigation requiring detective work unless the root cause is hidden. The root cause is hidden because it doesn’t conform to our first-order understanding of the process. Typically this means that the root cause is an unforeseen process interaction, human error, or a design change causing unintended effects.

When the simple items are eliminated, you have to get systematically stupid to identify the likely culprits. “Systematically stupid” is my affectionate way of describing many quality management processes.

In one of them (chosen because it’s particularly stupid), called is/is not analysis, you take certain aspects of your problem and, for each one, identify what the problem is and what it is not. For example, let’s say we have seen the problem after steps 7, 9, and 11, and never before step 7. It’s then logical to assume that the problem is caused by something that happened at step 7, combined with something that happened before step 7. By collecting a list of facts like this about the problem - and putting that list in one place - you can quickly see what the problem is, or at least think of experiments that will allow you to isolate it.
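The step example above can be sketched in a few lines of Python. This is only an illustration, not a standard tool: the Observation record, the is_is_not function, and the clean inspections at steps 2, 4, and 6 are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    step: int           # process step after which the material was inspected
    problem_seen: bool  # True if the problem was present at that inspection


def is_is_not(observations):
    """Split inspections into 'is' and 'is not' lists and bracket the suspect step."""
    is_steps = sorted({o.step for o in observations if o.problem_seen})
    is_not_steps = sorted({o.step for o in observations if not o.problem_seen})
    first_seen = is_steps[0] if is_steps else None
    # Only clean inspections that happened before the first sighting help bracket it.
    clean_before = [s for s in is_not_steps if first_seen is None or s < first_seen]
    last_clean = clean_before[-1] if clean_before else None
    return {
        "is": is_steps,
        "is_not": is_not_steps,
        "last_clean": last_clean,
        "first_seen": first_seen,
    }


# Hypothetical inspection data: clean after steps 2, 4, and 6; problem after 7, 9, and 11.
observations = [
    Observation(2, False), Observation(4, False), Observation(6, False),
    Observation(7, True), Observation(9, True), Observation(11, True),
]
result = is_is_not(observations)
print(f"Problem IS seen after steps: {result['is']}")
print(f"Problem IS NOT seen after steps: {result['is_not']}")
print(f"Introduced after step {result['last_clean']}, by step {result['first_seen']}")
```

With these made-up inspections, the last clean check is at step 6 and the first sighting is at step 7, so step 7 is the place to start digging. If the last clean check had been at step 4 instead, the window would widen to steps 5 through 7, which is exactly the kind of gap the list makes visible.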

Another helpful systematically stupid tool is value-stream mapping. This is a technique used in Lean operations to identify sources of waste and remove them, but it is also useful in detective work for enumerating exactly what happened and in what order.

A value-stream map lays out all of the steps of whatever process is under investigation. If the process has 10 steps in it, and the problem could occur anywhere within those 10 steps, it would just be a list of the 10 steps, which on its own isn’t very powerful. What’s important is what you do with it. Mainly, you would ask, for each step X: can we verify that step X occurred as intended? What data do we have to support that conclusion? Once we have an answer, we move on to step X+1 and repeat until we have accounted for all steps.
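As a rough sketch of that step-by-step walk, here is one way to write it down in Python. The StepRecord fields, the unaccounted_steps helper, and the step names are all invented for illustration; a spreadsheet with the same columns would do the same job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    number: int
    name: str
    verified: bool = False          # did we confirm the step occurred as intended?
    evidence: Optional[str] = None  # what data supports that conclusion?


def unaccounted_steps(steps):
    """Return, in process order, the steps that are not verified or have no evidence."""
    ordered = sorted(steps, key=lambda s: s.number)
    return [s for s in ordered if not (s.verified and s.evidence)]


# Hypothetical three-step process, partway through an investigation.
steps = [
    StepRecord(1, "clean", verified=True, evidence="tool log shows recipe ran to completion"),
    StepRecord(2, "coat", verified=True),  # someone said it was fine, but no data
    StepRecord(3, "cure"),                 # not looked at yet
]

gaps = unaccounted_steps(steps)
for step in gaps:
    print(f"Step {step.number} ({step.name}) is not yet accounted for")
```

Requiring both a verified flag and a piece of evidence is a deliberate choice here: a step that someone merely remembers going fine still shows up as a gap.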

III. Your Limited Working Memory

The reason why value-stream mapping works is that your working memory is very limited. At least, mine is in practice.

Growing up, I used to think that my memory was special. I had a real talent for rote memorization. Now, I no longer think that. I’ve been burned too many times by misremembering something, or getting some critical detail wrong. Now I rely on written notes wherever possible to generate an artifact to reference in the future.

Writing down all of the steps in a process is something stupid, to be sure. If we were all smart, we would immediately understand the cause of a problem and how to fix it as soon as it happened. But we are human beings, and it is very easy for us to overlook small details while we are thinking out loud. Generating some sort of artifact - such as a value-stream map or an is/is not analysis or some other process - is an easy way to exercise the discipline needed to solve problems in a systematic manner. Discipline is the virtue that enables you to record the truth accurately and completely. By committing you to solving the problem within a defined framework, the piece of documentation helps you put your mind in order as you think about the problem.

I’ll likely write another post on this in the future, but my golden rule for producing these types of artifacts is: document them for your future self.

I arrived at this golden rule after frustrating myself by producing documents that didn’t capture their context completely or explain their takeaway messages well. There are untold numbers of documents that I’ve made that have proven useful as quick references because I made them that way at the time. It turns out that the context that you need to give to present new information to other people is the same context that you need to give to your future self.

The feeling of trying to solve a technical problem entirely in my mind is like juggling. I am trying to keep all the facts suspended in midair while simultaneously thinking about the solution to the problem. If I forget one of the facts, they all start tumbling to the floor and I have to spend energy getting them all back in the air again. It is not easy to juggle while you’re distracted. By externalizing your memory to a piece of documentation, you no longer have to juggle; you can clearly sort through the facts while trying to find a solution.

IV. The Inevitability of a Solution

One interesting thing I’ve observed, working in manufacturing, is that a solution is inevitable. There are rarely “cold cases” that we just ignore and never come back to. Rather, solutions are inevitable because rare events always end up recurring on long enough time scales. If rare issues cause serious problems, a team will be assembled to investigate and drive to a conclusion.

Having a cold case come back is a very painful and humbling experience. The times I have been bitten this way, we couldn’t close on a root cause during the first investigation because the sample size was too small to draw meaningful conclusions. When you deal with probabilistic systems, it is difficult to tell why certain process interactions happen; the temptation is always to call a problem with a sample size of 1 “bad luck”.
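The idea that rare events recur on long enough time scales can be made concrete with a little arithmetic. The sketch below assumes something the real factory may not honor: that every run is independent and has the same fixed chance p of showing the issue. The 1-in-1,000 rate is a made-up number for illustration.

```python
def prob_at_least_once(p, n):
    """Chance of seeing the issue at least once in n independent runs,
    assuming each run has the same probability p of showing it."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical rate: the issue shows up in 1 out of every 1,000 runs.
p = 0.001
for n in (100, 1_000, 10_000):
    print(f"{n:>6} runs: {prob_at_least_once(p, n):.1%} chance of at least one occurrence")
```

Under those assumptions, the issue has roughly a 10% chance of showing up in the first 100 runs, about 63% by 1,000 runs, and is all but certain by 10,000. A single occurrence looks like bad luck mostly because not enough runs have gone by yet.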

But that’s part of what makes detective work so satisfying: you know that, given enough time and a systematic enough approach, you will converge on a solution. And there is a lot of glory at the end of the tunnel for the detective who pieces everything together.

The thing that makes the job different from detective work is that your job isn’t done when you put the puzzle pieces together and find the root cause. The job isn’t even done when you prove it. The real job is putting the fix in place. And I’ll have to write about that - the struggles of implementation - another time.