Essential skills of a Data Scientist [closed]

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?

A few ideas germane to this discussion:

  • Knowing SQL and the use of a DB such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
  • Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
  • Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with their manipulation.
  • What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
  • OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.

Thoughts?


To quote from the intro to Hadley Wickham's PhD thesis:

First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future

Step 1 almost certainly involves data munging, and may involve database access or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)

Step 2 means visualisation/plotting skills.

Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.

The final step is mostly about soft skills like introspection and management-type skills.

Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.


Just to throw in some ideas for others to expound upon:

At some ridiculously high level of abstraction, all data work involves the following steps:

  • Data Collection
  • Data Storage/Retrieval
  • Data Manipulation/Synthesis/Modeling
  • Result Reporting
  • Story Telling

At a minimum, a data scientist should have some skill in each of these areas. Depending on specialty, though, one might spend far more time in a limited range.
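To make the steps listed above a bit more concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the toy data in `RAW_CSV`, the table name `obs`, and the function names are all hypothetical, and a real project would likely use dedicated tools at each stage. Story telling is left out, since that part is a human activity rather than code.

```python
import csv
import io
import sqlite3
import statistics

# Hypothetical raw data standing in for whatever was actually collected.
RAW_CSV = """x,y
1,2.1
2,3.9
3,6.2
4,7.8
5,10.1
"""


def collect(text):
    """Data collection: parse CSV text into (x, y) float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def store(rows):
    """Data storage: load the rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    conn.commit()
    return conn


def retrieve(conn):
    """Data retrieval: pull the two columns back out with SQL."""
    rows = conn.execute("SELECT x, y FROM obs ORDER BY x").fetchall()
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return xs, ys


def model(xs, ys):
    """Modeling: ordinary least squares fit of y on x, done by hand."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def report(slope, intercept):
    """Result reporting: a one-line summary for a human reader."""
    return f"y is approximately {slope:.2f} * x + {intercept:.2f}"


if __name__ == "__main__":
    connection = store(collect(RAW_CSV))
    xs, ys = retrieve(connection)
    print(report(*model(xs, ys)))
```

Each function maps to one bullet, which is also roughly where the specialties diverge: some people spend most of their time in `collect` and `store`, others in `model`.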


JD's ideas are great; for a bit more depth on them, read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Story telling)

At dataist the question is addressed in a general way with a nice Venn diagram:

[Image: Venn diagram]