My friends and family don’t know what I do.

And I’m not talking about some hypothetical grandmother who has never owned a smartphone, or friends that operate in sweet, blissful ignorance of tech. I’m talking about people who work in the industry. Who have very good ideas about what product managers do, about what software engineers do.

Despite my best efforts - ok, maybe not my absolute best - they would be unable to describe what it is that I, or any other data scientist, actually do.

Why?

Defined by the ecosystem

There have been two attempts at describing data science by where it sits within the data ecosystem.

Data science within the data stack

The first is by imagining data work as a sort of stack, with each layer being dependent on the preceding layer.

[Figure: Data Work Stack]

This makes sense. You can’t do data science without data. And you probably shouldn’t do it without some basic monitoring and analysis first, either.

But this view is biased towards larger organisations, in which people perform specific roles. You can’t apply this description across organisations of different sizes. If you’re at a small company, or in a small team, you’re going to end up doing all manner of engineering and other data tasks.

Data science defined by skills

It can also be helpful to define data science by the skills it requires.

This is useful because different nominal roles often overlap in terms of the work that they do, and hence the skills required to do that work.

But this fails, too, because the skills required for the role vary a great deal depending on the business needs, the team, the project, even the individual task. A data scientist might conceivably be researching a certain algorithm one day and building a data pipeline the next.

Some things we can define

There are some roles that do have good definitions. When people refer to a (generalist) data scientist, they may actually be referring to one of these. Here are my own definitions, which I don’t think are very controversial:

  • ML Engineer: write the code and create the infrastructure to train, serve, and monitor ML models in production.
  • Product Data Scientist: create experiments and perform analysis to guide product direction and answer product questions.

There are also the adjacent roles, which have good definitions:

  • Data Analyst: visualise and analyse data to help the business make better decisions.
  • Data Engineer: write the code and create the infrastructure to collect and organise the data that the business wants.

I can’t do this for a (again, generalist) data scientist.

Is it what you need?

This lack of a clear definition sometimes results in the hiring of a data scientist when a different role is required, typically because more basic data requirements haven’t been satisfied.

So should we just hire fewer data scientists?

Yes, if the data scientist is just going to spend all their time building data pipelines or dashboards, it’s better to hire data engineers and data analysts.

However, these may only be the activities that people think are the most valuable use of the data scientist’s time. That is because they are far more understandable, tangible, and familiar: everyone knows what a dashboard is and what data is. Everyone wants more data, and more representations of that data. There are nice metrics and KPIs you can attach to these areas.

But there seems to be something redundant about some of this type of work. Value comes from action, not from the availability of data or how well it’s presented.

It’s hard to do

The best answer I can come up with for what a data scientist does:

Figure out the best way to add value to the business using data, then go and do it.

I know, I know - this is distressingly open-ended and vague.

This is a feature, not a bug. More precise definitions, either in skills or areas or tasks, restrict the data scientist to activities that may not necessarily be the best use of their time.

This is why one often needs a broad array of skills and expertise, to cope with diverse business requirements, as well as knowledge about the business itself.

And you need the freedom to operate, via either influence or autonomy.

