Big Data can be used for many things, from ad targeting to improving medicine to, well, these days, you name it. One area that seems ripe to apply it to is itself: using Big Data to find, categorize, and pattern-match fault signatures within software. Allow me to explain.

People have a hard time understanding software because you can’t touch it. In human-readable form, it’s a bunch of loops, branches, and kinda math-like functions. But for the sake of discussion, for now, think of software as a little solid-state motor whose moving parts are tiny wave-functions…aka electrons. By analogy with this article about motor faults, it seems possible to apply the same approach to “software motors.” If we were able to gather enough data about such beasts, and take this from science project to actual product, we’d have something interesting and useful to people.

An acquaintance of mine thought that a cascading set of log-gathering calls, implemented as callbacks into their respective sub-systems, could be triggered as fault patterns were recognized. Assuming the patterns continued to match, and once a confidence threshold was passed that this is indeed the *exact* problem we have a run-book for, we could kick that run-book in and avoid an undesirable (and previously unavoidable) event. Effectively, this is the same outcome as a human getting paged, rebooting the right systems, and going back to sleep (the human, that is ;).
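To make the idea concrete, here’s a minimal sketch of that loop: each candidate fault signature watches incoming telemetry, a sustained match bumps its confidence, and crossing the threshold fires that signature’s run-book callback. Every name here (`FaultSignature`, `run_book`, the event shape) is illustrative, not a real API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FaultSignature:
    name: str
    matches: Callable[[dict], bool]   # does this telemetry event fit the pattern?
    run_book: Callable[[], None]      # remediation to kick in once we're confident
    threshold: int = 3                # consecutive matches required to act
    confidence: int = 0               # current match streak

class FaultWatcher:
    def __init__(self, signatures):
        self.signatures = list(signatures)

    def observe(self, event: dict) -> list[str]:
        """Feed one telemetry event; return names of any run-books triggered."""
        triggered = []
        for sig in self.signatures:
            if sig.matches(event):
                sig.confidence += 1
                if sig.confidence >= sig.threshold:
                    sig.run_book()          # e.g., restart the right sub-system
                    triggered.append(sig.name)
                    sig.confidence = 0      # reset after acting
            else:
                sig.confidence = 0          # pattern broke; start the streak over
        return triggered
```

In real life the `matches` predicates would be learned pattern-matchers rather than hand-written lambdas, and the confidence model would be statistical rather than a simple streak counter, but the shape of the control flow is the same.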

If you’re still with me, I hear you asking, “But *how* do you know you truly avoided something, rather than just perturbed the system enough that it didn’t happen? Hmm?!” It’s a fair and intelligent question, and answering it requires data we don’t have yet. But for now, let’s say:

  1. we could compare KPIs (uptime, latency, calls/sec, etc.) across one system with the automated maintenance applied and another, similar system without it, and see if there were measurable benefits;
  2. does it matter?
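For #1, one naive way to do the comparison is to collect the same KPI from the treated and control systems and ask whether the difference in means is large relative to the noise. This sketch uses a crude Cohen’s-d-style effect size; the sample numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def kpi_effect(a: list[float], b: list[float]) -> float:
    """Difference in means over the pooled standard deviation.
    Near 0 means no measurable difference; a large magnitude
    suggests a real effect worth skeptical follow-up."""
    pooled = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(a) - mean(b)) / pooled

# e.g., p99 latency (ms) on two similar systems over the same week
with_maintenance = [101, 98, 103, 99, 100]
without_maintenance = [120, 117, 125, 119, 122]

effect = kpi_effect(without_maintenance, with_maintenance)
```

This is deliberately simplistic: a real evaluation would want more samples, matched workloads, and a proper significance test, but even this level of arithmetic beats eyeballing dashboards.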

#1 needs some careful, skeptical measurement. #2 is kinda a joke, but it’s also kinda not, given that a restart is the default remedy for many software problems today, whether you realize it or not. You see, we know our software works well on the happy paths it’s usually on, but it’s easy for our data to wander off one and fall off a cliff, get eaten by a bear, or enter a goto portal that jumps it from where it’s expected to Timbukfoo. In short, I’m admitting that for now, a restart with some strategery behind it is better than (and kinda the same as, just less rigid than) a cron’d reboot, which we know people do these days. I speak from experience here, because I’ve recommended that exact thing, though we professionally called it a “scheduled therapeutic reboot.” 😀
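The difference between a cron’d reboot and a restart with strategery can be as small as a guard condition: restart when a health signal actually degrades, not on a fixed schedule. The signals and limits below are placeholders, not a recommendation.

```python
def should_restart(rss_mb: float, err_rate: float,
                   rss_limit_mb: float = 4096.0,
                   err_limit: float = 0.05) -> bool:
    """Restart only when a health signal degrades: resident memory
    creeping past a limit (a classic leak symptom) or an elevated
    error rate. A blind cron reboot fires regardless of either."""
    return rss_mb > rss_limit_mb or err_rate > err_limit
```

Even this trivial gate records *why* each restart happened, which is exactly the labeled fault data the pattern-matching approach above would need to get smarter over time.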

Anyway, that brings us to the end of this episode. The alternate title was “On Taking a Big Data Selfie…” Hope you found it instructive and that it inspires you to build something new, amazing, and wonderful.