Embracing Uncertainty – My Journey to a Data Science Career continued
In my last post, where I described my journey into the world of data science, I kind of skipped the part where I learned how to actually be a data scientist.
I mentioned that I completed an online data science course with Coursera, and then managed to wrangle my way onto the data science team at my consultancy. This was all very well, but neither achieving certification in the field nor convincing those who must be obeyed that I could do the job actually prepared me for life in the fast lane. Nor did it make me a data scientist.
So what did?
“A data scientist is forged in the fires of uncertainty”
Kind of melodramatic, I know (been reading too much Dan Brown), but “fires of uncertainty” neatly sums up my experience in the field.
At first I felt that the uncertainty I was dealing with was my own. How was I going to tackle the problem? There are so many approaches one could take for each type of problem. How was I to know which one was correct? Initially this conundrum worried me. How was I going to know that I was on the right path? Would I just be wasting the client’s time and money working out how to answer the question without actually getting to the nub of the issue and answering it? How much time should I spend deciding which way to go?
Then I began to realise that the search for an approach to the problem was actually intrinsic to the whole process. The process of exploratory analysis and discovery was relevant not just to the data itself, but also to how the data was to be analysed.
“Methodological exploration is essential to intrinsic understanding of the subject matter”
I found that the exercise of trying out various methods and approaches tended to refine the question, as well as deepen my understanding of both the subject and the components of a possible solution. Indeed, this was emphasised during the course I did, but until I came face-to-face with it in the wild, it hadn’t really sunk in.
I realised that exploration of the approach and of the data itself were both essential to initiating a data science project. Almost incidentally, the process of trying out different methodologies has forced me to become familiar with a number of completely different statistical and analytic programming tools, which have now been added to my repertoire.
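To make "trying out different methodologies" a little more concrete, here is a minimal sketch in Python (standard library only) of comparing two approaches on data held back from fitting. Everything in it is an assumption for illustration: the numbers are made up, and the helper names `fit_mean`, `fit_line` and `mean_absolute_error` are hypothetical, not from any particular project or library.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Approach 1: ignore x and always predict the average of y."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Approach 2: ordinary least-squares straight line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, xs, ys):
    """Average size of the prediction error on the given data."""
    return mean([abs(model(x) - y) for x, y in zip(xs, ys)])


# Made-up illustrative data: ten periods of some measured quantity.
periods = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [10.2, 11.9, 14.1, 15.8, 18.3, 19.7, 22.1, 24.0, 25.8, 28.1]

# Fit on the first seven periods, then score on the three held back.
train_x, test_x = periods[:7], periods[7:]
train_y, test_y = values[:7], values[7:]

approaches = {"mean only": fit_mean, "straight line": fit_line}
scores = {
    name: mean_absolute_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in approaches.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: mean absolute error {score:.2f}")
```

The point is less the two toy models than the habit: each candidate approach is fitted the same way and judged on the same held-back data, so the comparison itself teaches you something about the problem.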
Although they are core to a data science undertaking, coming to grips with the data and deciding on an approach through methodological exploration do not mean that a resolution to the question at hand has been found. Exploration is just the beginning, and the uncertainty of delivering an outcome still remains.
Data science seeks to tame this uncertainty.
“Data science is a science – remember that!”
As in any science, you start with a question, convert it into a hypothesis and then test it.
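As a minimal sketch of that question, hypothesis, test loop, suppose the question is "did the change make a difference?" and the hypothesis to test is "the two groups have the same average". The Python below (standard library only) uses a permutation test, which is just one of many tests that could be chosen. The data are made up for illustration, and the function name `permutation_p_value` and the variable names are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Null hypothesis: the group labels do not matter, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up illustrative measurements before and after some change.
before = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
after = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0, 13.3, 12.6]

p_value = permutation_p_value(before, after)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value says the observed difference would be unusual if the labels really did not matter. It does not say how big or how useful the difference is, which is why the question has to be pinned down first.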
So having a clearly defined question is paramount. Yet sometimes this is not so straightforward, especially when the client may not have a coherent idea of what it is they wish to ask. This uncertainty must be addressed, and until it is resolved, pinning down the question is the data scientist’s priority. For without knowing what the question is, there is no project, no analysis, no science.
I have found that, as well as clarifying the question, I have also (on one occasion at least) been forced to explain what I could bring to the table that was so different from conventional business intelligence. It requires a switch in thinking on the part of the client to understand that their data is not merely a repository of what was, but that it also contains within it the seeds of future behaviour that can facilitate a deeper understanding of the current state, and a prediction of what is yet to come. I explained that a data scientist (hopefully me) is able to use both conventional business intelligence tools and more sophisticated statistical analysis and machine learning to bring about this deeper insight into where the business came from, where it is sitting currently, and where it is likely headed.
My attempts to demystify the role of data science for business have forced me to think not just about the array of tools at my disposal, but also about what the end product of a data science project should be. The client wants to be left with something concrete and useful after spending all that money on you. So what should you give them?
“Just answer the question!”
Well, most clients simply expect an answer to their question.
I’ve found that the simplest way to present the answer is in a scientific paper format. Like all scientific papers, it should have a title, an introduction, the key findings from data exploration (including charts and tables), a section on methodology (including how the data were treated, cleansed, enhanced and so on), the results (including a comparison of approaches), and a conclusion containing the findings (whatever they may be).
The first time I wrote such a paper, I felt I had arrived – that I had actually become a data scientist. It’s a beautiful thing to hold a completed paper in your hands: a distillation of your hard work, often months of it, condensed into a kind of story that is easy and straightforward for the client to read. It should be something they can use, for example, to present to the board or key stakeholders, to provide support, or otherwise, for a business case.
Because I have mostly used R, I have tended to create these papers in R Markdown, which holds both the code and the wordage. This makes the whole paper self-documenting, and it can be re-run whenever required. The same can be done with other languages used in data science, such as Python or Scala.
Of course, some data science projects are ongoing and require results at incremental stages. These will require different types of artefacts to be delivered at agreed intervals. Other projects are not about providing a particular answer. Rather, they are concerned with delivering a methodology, a predictive algorithm the client can reuse, or even a series of visualisations to be used within the organisation. All of these, however, should be backed up with a storyboard or method section outlining ‘how you got there’, to support your work and enable the client to feel comfortable with it.
So there we have it, the makings of a data scientist and the components of a typical data science project. By using the scientific method and a drive to answer the big (and not so big) questions, we aim to banish (or at least reduce) the uncertainty that our clients throw at us!