I gained my data science skills through formal training, and I improved and extended them at various workplaces, as an employee or as an entrepreneur. I also did some good old-fashioned homework to pick up data engineering skills. Most helpful were technical books, internet forums, online computer courses and the open source community. Follow my tech timeline below if you would like to know more about my complete data science and data engineering profile...
My first computer was a Commodore 64, which I bought with the money I earned from a holiday job. I loved playing computer games, but I was hooked for good once I found out how to create my own programs in BASIC and machine language.
A few years later, during my bachelor's in econometrics, I bought a Windows machine with an Intel 80386 processor and started coding in Turbo Pascal. I encountered a whole bunch of other computer languages down the line, each useful in one way or another. Never a dull moment!
As a student assistant at the department of Mathematical Economics at Erasmus University Rotterdam, I worked on a European project. The goal was to determine which poverty index performs best at measuring poverty among general populations. Back to basics: we coded in FORTRAN.
I obtained a master's degree in econometrics with a thesis on fractionally integrated time series modeling. I never used it later in my career, but what the heck. Coding in Matlab was fun. My thesis was written in ChiWrite.
I started working at an academic research company and coded my projects in Gauss. Creating reports on one of the first Macintosh machines was a new (though not life-changing) experience for me. And yes, those machines sometimes needed a hard reset too; nothing new there.
The company had implemented a Unix server, which was a great opportunity to get acquainted with the Unix environment. It was during this time that I learned the Vim commands I needed to write my code. This would come in very handy later on, when I dove into the world of building web pages and web servers.
I thought about doing a PhD, but chose to leave academia. Surely, publishing in scientific journals to show your colleagues how good you are couldn't be the meaning of life. If I cannot explain to my family what I am doing, how could it be worth doing?
So, I started to work at an insurance company. Huhh??? Yes, a health insurance company. Why? Big data. Yes!!! Coding in SPSS.
A few years later, in order to satisfy my scientific curiosity, I started writing a PhD thesis on the subject of risk equalization among health insurers. Coding in SAS, again. OpenOffice, movie, subtitling
Along the way I started a website development company. I coded my own web scraper and aggregator in PHP and MySQL, simply because many RSS feeds were lacking in those days. Go back in time with the time machine of the Internet Archive for a snapshot of the homepage.
Of course, I could have run the web scraper and aggregator on an external web server. But why go the easy way? I decided to buy a hard drive, motherboard, memory modules, DVD drive etc. and built a web server myself. I installed Linux and an Apache web server, and did some more configuring. Finally, I was up and running.
At the end of 2008, together with two others, I started a strategy consulting firm. Facts and figures formed the center of our proposition. Destined to become the CIO and CTO of this startup, I decided to acquire a big data server and had Stata installed on it.
Stata is widely used for statistics in health economics at universities, in particular for cross-sectional and panel analysis. But it was not very well suited to solving standard quadratic programming problems. We also found out that the Stata community was not very supportive (to put it mildly) when discussing how to code a non-standard quadratic programming problem. This is where R came to the rescue. It was the starting point of my preference for coding in R.
Collaboration when writing code is a big thing. Versioning is not only there to keep track of older versions of your files. In a project team, you also want to work on code files at the same time without losing track of who did what and when. Although I had previously used Subversion as a version control system on my NAS, I switched to Git as soon as I had read the online forums on Git and the Git manual. Embracing GitHub followed along the way.
Standard linear regression is great for solving lots of prediction problems, even those for which better estimation techniques apply theoretically. Machine learning, on the other hand, can produce even better predictions, although at the risk of overfitting your model. A combination of the worlds of statistics and machine learning would be optimal. This combination is clearly explained by Trevor Hastie and Robert Tibshirani (both Professors of Biomedical Data Science and of Statistics) in the thorough Stanford University online course called Statistical Learning, which I highly recommend.
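To make the overfitting risk a bit more tangible, here is a minimal Python sketch. It is my own toy illustration, not material from the course: the data are made up (a straight line plus alternating noise), a 1-nearest-neighbour rule stands in for a very flexible machine learning model, and the test targets are assumed to be the noise-free signal.

```python
# Toy illustration of overfitting. The data, the noise pattern and the choice
# of models are all assumptions made for this sketch.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest_neighbour(xs, ys):
    """1-nearest-neighbour: memorises the training set, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def signal(x):
    """The assumed true relationship behind the data."""
    return 2.0 * x


# Training data: signal plus noise that alternates between +1 and -1.
train_x = [float(i) for i in range(10)]
train_y = [signal(x) + (1.0 if i % 2 == 0 else -1.0)
           for i, x in enumerate(train_x)]

# Test data: new x values, scored against the noise-free signal.
test_x = [i + 0.4 for i in range(9)]
test_y = [signal(x) for x in test_x]

line = fit_line(train_x, train_y)
neighbour = fit_nearest_neighbour(train_x, train_y)

for name, model in [("linear regression", line), ("1-nearest-neighbour", neighbour)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The flexible model reproduces the training data perfectly, because it has memorised the noise, yet it predicts the new points worse than the plain regression line. That is overfitting in a nutshell, at least in this constructed example.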
Now it was time to find out how to create my own web apps such that they can be published remotely or locally. The Microsoft Azure ML suite offers this kind of facility out of the box. Major advantage: you can write your own code in R and combine it with the out-of-the-box tools. But most of all: an API is automatically created after estimation of your model, and a demo script for calling this API externally (even from good old Excel) is also provided.
We started 2017 by repositioning the strategy of my consulting firm. Repositioning one's business creates new opportunities :-) On the tech front, developments are accelerating now. We created a web app that first sends the client input from an HTML form to the Azure ML API. Next, it receives the API output, which is handled in PHP to produce a jQuery graphic and publish it on a webpage. See the demo result for calculating the risk equalization subsidies for individual enrollees of Dutch health insurers in 2017.
I installed Ubuntu on an external VPS and set up an RStudio and Shiny server for test purposes. Shiny applications for clients can be offered via private access, and Shiny showcases via public access. As an example, see the demo app for calculating the risk equalization subsidies for individual enrollees of Dutch health insurers in 2017.
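The round trip from form input to API and back can be sketched roughly as follows. The real app handled this in PHP; the sketch below uses Python instead, and the endpoint URL, the API key, the field names and the JSON layout are all hypothetical placeholders, not the actual Azure ML request format. The sample output is dummy data, included only to show the parsing step.

```python
import json
import urllib.request

# Hypothetical placeholders: the real endpoint, key and field names
# are not part of this sketch.
API_URL = "https://example.invalid/score"
API_KEY = "YOUR-API-KEY"


def build_request(form_input, url=API_URL, api_key=API_KEY):
    """Wrap the form input in a JSON body and prepare a POST request."""
    body = json.dumps({"input": form_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def extract_points(api_output_text):
    """Turn the (assumed) JSON output into (label, value) pairs for a chart."""
    payload = json.loads(api_output_text)
    return [(item["label"], float(item["value"])) for item in payload["output"]]


# Step 1: the client input as it might arrive from the HTML form (made up).
form_input = {"age": 45, "sex": "F"}
request = build_request(form_input)

# Step 2: sending is left out on purpose, since the URL is a placeholder.
# With a real endpoint it would be: urllib.request.urlopen(request)

# Step 3: parse a dummy response into points that a chart could plot.
sample_output = (
    '{"output": [{"label": "component_a", "value": "1.0"},'
    ' {"label": "component_b", "value": "2.5"}]}'
)
print(extract_points(sample_output))
```

Keeping the request building and the output parsing in separate functions makes it easy to test both without touching the network.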
My first steps on the blockchain track led to a publication in a Dutch magazine for economists and policy makers. We explained the blockchain phenomenon and argued that it can be used to reduce information requirements in Dutch health care and health insurance markets substantially. Download the pdf from the Equalis company website here.
As a data scientist by training, one starts coding in R rather easily. But to get to know the finer points of a computer language, taking courses can never hurt (at least, if you ignore the time that you have to invest in it...). I took the highly recommended Coursera course R Programming and received my certificate from Johns Hopkins University.
You never know what the future brings :-) I will add new stuff to this timeline as it arrives. Stay tuned!