The Data Science Consultant Learning Challenge

8 May

The challenges of being a Data Scientist may be many, but when you are also a consultant, they are compounded by the mere fact that you are never allowed to become comfortable with what you already know. There is always a new project coming up, likely at a new client, and it almost certainly involves learning a new technology of some sort.

Yes, you think you know how to design that analytic process, or how to create that predictive model, or even which visualisation technique to use. But as the challenges keep coming it can seem that there is no end to the new things you need to learn.

The need to keep learning can really weigh on you. At any one time, I am halfway through an online training course or working my way through a data science text. It could be on anything from financial analysis in Python, to comparisons of data analytics tools such as RapidMiner and Orange, to the impact on Hive queries of choosing between versions 1 and 2 of MapReduce. Whatever I’m learning, it is usually related to what I am currently doing. This week I’ve been revisiting my Python skills on DataCamp.

It’s enough to make one’s head spin!

The challenge for me is not so much learning the material as knowing when to learn a particular topic. Sometimes being a Data Science consultant means taking something you already know how to do on one platform and working out how to do it on a completely different technology. When this happens, things can become confusing, to say the least.

For instance, one project had me working on a large amount of data within a banking environment where technology had limited funding. This meant I had to use as much freeware as possible, and that I could not work on a cloud platform, only on a laptop with 8 GB of RAM. As such, to run the predictive model, I had to devise a reliable and valid method to process the large data set in an environment that did not rely on Hadoop and its massively parallel processing paradigm.
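To give a flavour of the kind of approach that can work in that situation, here is a minimal sketch of streaming a file row by row so that only running totals are ever held in memory. It is an illustration only, not the method used on that project: the function name `streaming_group_means`, the column names `segment` and `balance`, and the sample data are all made up for the example.

```python
import csv
import io
from collections import defaultdict


def streaming_group_means(rows, key_field, value_field):
    """Compute per-group means in a single pass over an iterable of rows.

    Only the running totals and counts are kept in memory, so the input
    can be far larger than the available RAM.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_field]] += float(row[value_field])
        counts[row[key_field]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Tiny in-memory stand-in for a file too large to load at once
# (hypothetical data; in practice this would be an open file handle).
sample = io.StringIO(
    "segment,balance\n"
    "retail,100.0\n"
    "business,400.0\n"
    "retail,300.0\n"
)

# csv.DictReader yields one row at a time rather than loading the whole file.
means = streaming_group_means(csv.DictReader(sample), "segment", "balance")
print(means)
```

The same pattern extends to anything that can be computed from running summaries, such as counts, sums and means. Models that need several passes over the data call for a different approach.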

Unless you are the sort of person who likes learning under pressure, this sort of game is probably not for you. Luckily, I like a challenge, so I think I’m where I’m supposed to be. That being said, there must be a way to manage how one goes about learning and keeping up with it all.

Someone recently said to me that breaking a task down into its components is the first step to gaining mastery over it. This is never truer than when designing an ongoing learning program for oneself.

Whilst it is true that, in consulting at least, you never really know what you will need to know, or when, there are some basics that need to be understood first. The following list is an overview of the topics I think are essential to know wherever you go. These are off the top of my head and in no particular order, so I am probably missing some.

1. A statistical programming language such as Python, R, Scala, SAS, MATLAB, etc.

2. Basic visualisation techniques

3. Basic statistics

4. Overview of machine learning models

5. Being able to implement items 2 to 4 in your chosen language

6. A Big Data platform such as Cloudera or HDInsight, including Hadoop and its associated ecosystem of applications and paradigms

7. Soft skills such as business communication, how to run a meeting and how to give a presentation

8. How to write a project overview, technical specification and scientific report document

More advanced topics that you really do need to know as you progress in your career would include:

1. A second or third statistical programming language

2. How to identify follow-on investigations that could be pursued following the end of the current project

3. More advanced statistics topics

4. Expanding one’s repertoire of analytics tools to include algorithms and models that are not as commonly used but could be of value, such as deep learning and neural networks

5. The ability to use multiple statistical applications such as RapidMiner or Orange

Of course, what you learn and when you learn it may be dictated to a large extent by the upcoming pipeline of work.

One thing I’ve learnt about online learning is that if you start a course and it feels like the material is not really on the mark, then drop it. Don’t waste your time with it and find another provider who knows what they are talking about. On the other hand, when you’ve started a course and it is something you really should know, don’t let it go. Make set times in your week to get through it and just get it done.

I know that not all training is super interesting, but if you want to stay in the game, you need to ensure you stay on top of your learning.

Well, for me, it’s back to Python scripting… Enjoy!

Daniel Karp
