Let’s consider the idea that the whole Data Science thing might be a bit overhyped. After all, any time there’s this much attention given to any subject, you have to wonder how much of it is just flimflammery.
With this mindset, we can attack the main question, which is…is data science? In other words, are those of us engaged in this practice actual scientists?
What’s the Experiment?
We certainly can’t call ourselves scientists unless we’re experimenting on something. However, as we consider the experimentation part, we see a glimmer of hope that maybe there’s some science going on. After all, any programmer who has managed to get somebody to pay for her work knows that the actual job has a lot more to do with finding and fixing bugs than with writing code.
And here we see that there is some experimentation going on. A bug is really best understood as an unplanned event. And in particular, in my experience, it is rare for a bug to be unsurprising.
So what we have here is the need to classify a surprising event, form a concept of what is wrong with our assumptions that would lead to the event, gather evidence to support that concept (or disprove it), and finally reproduce our work (in the form of a program change).
Now, I’m not claiming that bugfixing is actually science. But it is interesting to see how it contains some basic scientific structure.
Congratulations, Data Scientist. Here’s your shovel
From what I’ve experienced, the “real” Data Scientist work has a lot more to do with doing data calls, figuring out the semantics of the data (now, just what does that column mean exactly?), merging data that originates from different sources (and usually different data modelers), dealing with unexpected nulls and duplications, and generally creating order out of chaos.
This has to happen before you can even begin to do the fun stuff, like generating labels or running neural nets on the result.
Apologies to those of you who thought this work was glamorous.
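To give a feel for the shovel work, here is a minimal sketch in plain Python, assuming two made-up sources (`customers` and `orders`) that share a `customer_id` key. The field names, the sample rows, and the `merge_sources` helper are all hypothetical, invented purely for illustration; real cleanup rules depend entirely on what the columns actually mean.

```python
def merge_sources(customers, orders):
    """Join orders onto customers by customer_id, with basic cleanup.

    Hypothetical rules for this sketch: drop customers with a null key,
    keep the first copy of a duplicated customer, and set aside any order
    that has a null amount or no matching customer.
    """
    unique_customers = {}
    for row in customers:
        cid = row["customer_id"]
        if cid is None:
            continue  # unexpected null key: nothing to join on
        if cid not in unique_customers:
            unique_customers[cid] = row  # first occurrence wins

    merged, rejected = [], []
    for order in orders:
        customer = unique_customers.get(order["customer_id"])
        if customer is None or order["amount"] is None:
            rejected.append(order)  # keep it around for a human to inspect
            continue
        merged.append({**customer, **order})
    return merged, rejected


# Made-up rows from two imaginary source systems.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 1, "name": "Ada"},        # duplicate row
    {"customer_id": 2, "name": None},         # null in a non-key column
    {"customer_id": None, "name": "Ghost"},   # null key
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": None},   # null amount
    {"order_id": 12, "customer_id": 3, "amount": 40.0},   # unknown customer
]

merged, rejected = merge_sources(customers, orders)
print(len(merged), "clean rows;", len(rejected), "rows set aside")
```

Note that the rejected rows are kept rather than thrown away. Deciding what they mean is usually where the real digging starts.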
Are We Sciencing Yet?
So what exactly do those messy sets of data have in them?
Real world events, that’s what. Or at least references to them.
And what are we doing when we merge ’em and clean ’em?
We’re creating a new way of looking at the events that makes sense to us, so we can answer some previously unknown question about them.
So…is this really different from theorizing and experimenting in the real world?
No, it isn’t.
In fact, the similarities to “real” science go beyond that. In “real” science, nothing is ever proven; all “scientific facts” are contingent, and we have to be humble enough to question the facts when any counterexample presents itself.
For data science (and for the downstream AI we might run on the data sets we create), the same thing is true. We’ve got some notion that there might be a useful, previously unknown order in the data. And we might run an AI experiment on the data, and it might give us a positive result. And the positive result might still be wrong if we made an incorrect assumption somewhere.
Or maybe there was never anything useful there to begin with, and we were completely fooling ourselves all along.
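To see how easy it is to fool ourselves, here is a small sketch using only Python’s standard library. Everything in it is an assumption made for illustration: the "data" is pure random noise, the sizes (`n_rows`, `n_features`) are arbitrary, and `pearson` is a hand-rolled correlation helper. The point is that if we look through enough meaningless columns, one of them will usually appear to "predict" the target.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_features = 20, 200

# A target and 200 candidate features, all independent noise:
# by construction there is nothing real to find here.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

# "Discovery": pick the feature that correlates best with the target.
scores = [abs(pearson(f, target)) for f in features]
best_index = max(range(n_features), key=lambda i: scores[i])
best_in_sample = scores[best_index]

# "Replication": draw fresh noise for that same feature and a fresh target.
fresh_target = [rng.gauss(0, 1) for _ in range(n_rows)]
fresh_feature = [rng.gauss(0, 1) for _ in range(n_rows)]
replication = abs(pearson(fresh_feature, fresh_target))

print(f"best in-sample |r| = {best_in_sample:.2f}")
print(f"same feature on fresh data |r| = {replication:.2f}")
```

The best in-sample correlation typically looks impressive, while the rerun on fresh data has no reason to agree with it. That gap is the incorrect assumption at work: we treated the winner of a 200-way search as if it were the only thing we had tested.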
Yep, sounds like science to me.