Data Science Skills
This article is a getting-started-level overview of the skills needed to ship a data science product to production from scratch. We will take the case study of building a recommendation engine on an e-commerce platform for illustration. I will break this essay into 3 chapters: (1) Building a data pipeline, (2) Modelling, and (3) Shipping code to production. Each chapter comes with its own set of skills.
The primary aim of a recommendation engine is a better user experience and increased overall engagement on the platform. There are many ways in which better UX and engagement lead to more revenue: more time spent on the platform, more ad impressions, or, in our e-commerce example, more sales. This essay will not go into the algorithmic or mathematical details; there are much better technical essays and books for that, complete with matrices and linear algebra. The intention is to give a sneak peek at the engineering behind it, and at the behind-the-scenes decision making in the quick world of startups, especially the skills required to roll out a production-ready data science and big data product.
Chapter 1: Building a data pipeline
For trade to flourish, a train track has to be laid, a shipping route has to be charted, an airport built. Infrastructure is just as key to sustaining and scaling algorithms running on data. It can mean the difference between running, say, a decision tree classifier and a regression. There are many factors to consider before running models, and the most important one for a low-latency, high-concurrency system is a sound understanding of the running times of various algorithms on various infrastructure stacks at various data sizes. Before you make infrastructure decisions, though, you have to understand what the data is, how big it is, where it comes from, and so on.
Skill 1: Knowing the dataset and running times
It is difficult to proceed without understanding three things: (1) How much data arrives per day? (2) What fields are available? (3) What is the size per record? These are what I'd like to call the "three fundamental questions on datasets". There are other questions, of course, but those come up during the EDA and feature engineering phase, later in Chapter 2 of this essay.
The second part of this skill is having an idea of the running times of various algorithms against your data size. This is a crucial skill that will come in handy forever, and you need to be an absolute expert at it. Even if you know that O(n log n) is faster than O(n²), you need an intuitive feel for how much time is actually saved by running an O(n log n) algorithm on, say, a 500 GB dataset on a two-node HDFS cluster with 8 Xeon CPUs. We'll see how to build such a skill shortly.
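A quick back-of-envelope estimate is often enough to build that feel. Here is a minimal sketch in Python; the per-operation cost and the record size are assumptions you would calibrate with a small benchmark on your own cluster, not measured numbers.

import math

def estimate_hours(n_records, seconds_per_op, complexity):
    # Rough wall-clock estimate; seconds_per_op is an assumed constant
    # you would calibrate by timing a small sample on your own hardware.
    if complexity == "n log n":
        ops = n_records * math.log2(n_records)
    elif complexity == "n^2":
        ops = n_records ** 2
    else:
        raise ValueError("unknown complexity")
    return ops * seconds_per_op / 3600

# 500 GB of 5 KB records is roughly 100 million records.
n = 100_000_000
for c in ("n log n", "n^2"):
    print(c, "->", round(estimate_hours(n, 1e-8, c), 2), "hours")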
Skill 2: Making infrastructure decisions
Like I re-learned from a rather good friend of mine recently: the world is fast and ruthless, and people increasingly don't care if you're the kind of person who replies "I need to look more closely at the data and get back to you". You have to learn to make quick decisions and cost estimates with as little information as possible.
Learn to answer: "If we have 100 GB/day of data and a 5,000 USD/month budget, then given the size of the team we're likely going to need such-and-such retention policy, and our best bet would be to rely on SQL and use a managed service like Redshift, which allows fast aggregations."
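Here is a minimal sketch of that kind of back-of-envelope arithmetic; the storage price and compression ratio are illustrative assumptions, not quotes.

# Assumed inputs: all prices and the compression ratio are illustrative.
DAILY_GB = 100
MONTHLY_BUDGET_USD = 5000
USD_PER_TB_MONTH = 250     # assumed managed-warehouse storage price
COMPRESSION_RATIO = 3      # assumed columnar compression

monthly_tb = DAILY_GB * 30 / 1024 / COMPRESSION_RATIO
affordable_tb = MONTHLY_BUDGET_USD / USD_PER_TB_MONTH
print(f"~{monthly_tb:.2f} TB of compressed data per month; "
      f"the budget covers roughly {affordable_tb / monthly_tb:.0f} months of retention")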
That was just one such example; there are many other business use cases. Often a railway line will solve the problem, but sometimes you also need an airport and a ferry route. The trick is to know what you need and when you need it. The good thing is that this skill can be acquired, which leads us to the next section: benchmarking.
Skill 3: Benchmarking
As a data scientist or engineer, you need to stay up to date with benchmarks. A good grasp of how popular classification or clustering algorithms perform on different data sizes and different hardware is powerful knowledge. A couple of interesting benchmark resources to start with are (1) Stanford DAWN and (2) tech.marksblogg.com.
Over time, familiarity will wire up some nerves in your brain, and you'll be able to make lightning-fast decisions about what data architecture to use for building a recommendation engine from scratch, assuming you're first mathematically aware of the operations involved in the approach: perhaps it is ALS, or some other such strategy.
Before we look at the next 2 skills of building a data pipeline, let's quickly pause and review.
My sample pipeline notes
In our e-commerce example, we decide to host our data on Amazon Redshift. We'll rely on fast SQL summaries from its columnar store and pipe data into Redshift from an external logging service. These decisions were based on the knowledge that a logging service would give us the usual suspects of data: a cookie-based browser ID or a user ID, an email if logged in, time of access, IP, location, page visits, referral, and so on, at a rate of roughly 10 GB/day with each record averaging 5 KB (about two million records a day).
The actual piping into Redshift could be done in many ways. One solution would be to use Apache Kafka; alternatively, if the logging service writes dumps to S3, even a simple COPY would work. Keep in mind that the bandwidth for ingesting data into the warehouse should exceed the rate at which data is generated. Do not rely on real-time analysis if this condition is not met.
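As a sketch of the S3-dump route, a scheduled job could issue Redshift's COPY command through a Python client such as psycopg2. The cluster endpoint, table, bucket and IAM role below are made up for illustration.

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="loader", password="secret",
)

copy_sql = """
    COPY page_views
    FROM 's3://my-log-bucket/2024/01/15/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto'
    GZIP;
"""

# Redshift ingests the whole S3 prefix in one parallel load.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)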
A popular counter-argument here would be: why not Spark, with Spark Streaming and MLlib? That's a fair question. Although Spark is fast, it needs powerful machines and maintenance, and I have a personal bias towards Redshift's pricing versus EC2 + EMR pricing. Pricing aside, Redshift is also remarkably powerful; it's designed for massive aggregations.
Consider this problem statement: we need to find the similarity between two vectors a and b using cosine similarity. It turns out to be really simple to write a quick SQL query for it, and Redshift is fast.
-- Assumes massive_table stores the two vectors as paired components,
-- one component per row, in columns a and b.
SELECT
  SUM(a * b) / (SQRT(SUM(a * a)) * SQRT(SUM(b * b))) AS similarity
FROM
  massive_table;
Skill 4: Deciding schemas, indexing and tuning
The final frontier in building a data pipeline is deciding schemas. If it's a document store, like Couchbase or Aerospike or a cool new storage engine, the pain of deciding schemas should be limited. However, if it is a columnar store, you have to be more careful. Excess normalisation hurts any massive retrieval: joins can be costly, and choosing indices is an important exercise.
Spend time mocking schemas and retrieval scenarios. The tricky part is planning for the future. Often a new column becomes necessary, or specific views have to be created, so a general awareness of the scope of the solution is highly recommended. On managed services like Redshift, data replication and distribution across nodes are made simple, which is why it gets my personal bias. Having to deal with data sharding issues, maintaining multiple nodes, clusters and IP addresses, and doing the network engineering for them, is a massive digression from our end goal of building a data science product. Top tip: use hosted data infrastructure unless you really know what you're doing.
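For Redshift specifically, much of the tuning lives in the table definition itself. Here is a hedged sketch for the logging data described earlier; the column names and sizes are assumptions, not a recommendation.

# Assumed Redshift schema for the logging feed. DISTKEY(user_id) keeps all
# rows for a user on the same node, so per-user joins and aggregations stay
# local; SORTKEY(event_time) lets date-range scans skip irrelevant blocks.
PAGE_VIEWS_DDL = """
CREATE TABLE page_views (
    user_id     VARCHAR(64),
    cookie_id   VARCHAR(64),
    email       VARCHAR(256),
    event_time  TIMESTAMP,
    ip_address  VARCHAR(45),
    page_url    VARCHAR(2048),
    referrer    VARCHAR(2048)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_time);
"""
# Execute it with the same psycopg2 connection used for the COPY job above.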
Skill 5: ETLing like crazy
Your data sources are diverse and bizarre. They promise they'll send you a CSV with 7 columns, and I'll bet you there will be a row with 26 columns. They'll tell you to subscribe to their topic and you'll receive JSON formatted in UTF-8 only, and I'll bet you there will be a JSON document in Latin-1 encoding. They'll say "I have three tables with uniform schemas", but no, they're lying. Here's where you'll need some non-linearity in your decision making. Think ahead, convince yourself which algorithm is best, and format your data accordingly. If you need to clean and compute and store everything in one table, do it. If you need to store it as two files in HDFS, do it.
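Here is a minimal sketch of that defensive attitude for a hypothetical events.csv feed that is supposed to have 7 columns and be UTF-8.

import csv

EXPECTED_COLUMNS = 7

def read_events(path):
    # Fall back to Latin-1 when the promised UTF-8 turns out to be a lie.
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, newline="", encoding=encoding) as f:
                rows = list(csv.reader(f))
            break
        except UnicodeDecodeError:
            continue
    # Quarantine malformed rows instead of letting them poison the model.
    good = [r for r in rows if len(r) == EXPECTED_COLUMNS]
    bad = [r for r in rows if len(r) != EXPECTED_COLUMNS]
    return good, bad

good_rows, bad_rows = read_events("events.csv")   # hypothetical feed
print(f"{len(good_rows)} clean rows, {len(bad_rows)} quarantined for inspection")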
If your algorithm fails due to bad data, unexpected columns, or noise, remember that it is your fault, because you haven't ETL-ed it right. Sometimes you might need a cron job, or a daemon that constantly aggregates new data and appends it to your master analytical table in Redshift or your files in HDFS. If this becomes a big problem, eats up a lot of time, and keeps your model from recomputing (remember Skills 1 and 2?), you could try Spark and a Lambda architecture. But there's a cost to those as well. Nothing is a perfect solution.
Chapter 2: Modelling
The first step in modelling is EDA, or exploratory data analysis. Often this needs to be done before making infrastructure decisions, which makes this entire essay non-linear, but there's no perfect path to take. Sometimes, with experience, you just know what a dataset will look like. Exploratory data analysis is a key step in feature engineering, and good features lead to great models.
Skill 6: Feature engineering and choosing models
Feature engineering is an integral part of choosing the right model for your task. There are plenty of tutorials available on choosing the right features for training. Avoid piling on features if the model is overfitting, and make sure you have a diverse set of training rows.
A limited set of features can be good for some use cases. For a recommendation system, we can actually get by with just user_id, page_viewed, and time_of_view for a basic model. A slightly more intelligent approach would be to build an intent tree for the user; an intent tree could be a derived column. Here's where knowledge of SQL joins helps, and we may as well write the entire model in SQL too. Be agnostic of tooling. Due diligence is of course necessary on fitting and cross-validation, which are important practices for improving the accuracy of your model.
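As a sketch of how far those three columns can go, here is a derived intent-like feature built with pandas; the events and the page-to-category mapping are made up, and in the warehouse the merge would simply be a SQL join.

import pandas as pd

# Minimal raw events: just the three columns mentioned above.
events = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u2", "u2"],
    "page_viewed": ["/shirts/red-sport", "/shorts/black-sport",
                    "/bands/fitness", "/bands/fitness", "/sippers/black"],
    "time_of_view": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:05",
        "2024-01-02 09:00", "2024-01-02 09:10", "2024-01-02 09:20"]),
})

# Hypothetical lookup table mapping pages to a coarse intent category.
categories = pd.DataFrame({
    "page_viewed": ["/shirts/red-sport", "/shorts/black-sport",
                    "/bands/fitness", "/sippers/black"],
    "category": ["sportswear", "sportswear", "fitness-gear", "fitness-gear"],
})

# Derived feature: how often each user viewed each category.
intent = (events.merge(categories, on="page_viewed")
                .groupby(["user_id", "category"]).size()
                .unstack(fill_value=0))
print(intent)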
Skill 7: Coverage and mathematical intuition
A model might seem like a really intelligent choice. For example, let's say we have a side problem of clustering types of users for a particular recommendation, and we end up using k-means with n_clusters=6; it feels like a great choice and performs brilliantly on the training set.
However, even if the features were chosen after careful deliberation, if 90% of future users map to one particular cluster, there's a huge loss of diversity in the model. Although good feature engineering avoids this problem as an intrinsic must-do, I had to state coverage as a separate skill because of its importance in building a good model for a personalised UX. So perhaps it is time to check out DBSCAN or a variant?
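A quick way to catch that failure mode is to check coverage right after fitting. Here is a sketch with scikit-learn; the feature matrix is random noise, only the check itself matters.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))     # stand-in for real user feature vectors

labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Coverage check: what fraction of users lands in each cluster?
shares = np.bincount(labels, minlength=6) / len(labels)
print({f"cluster_{i}": round(s, 3) for i, s in enumerate(shares)})
if shares.max() > 0.9:
    print("One cluster swallows over 90% of users: rethink the features, or try DBSCAN.")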
Skill 8: Model tuning
Even if you've chosen the perfect model, it might not perform as desired. This is where data scientists have to go the extra mile to boost the performance of their models. Perhaps you could use some form of ensemble learning, combining multiple under-performing models into a better one. The standard industry practice is to tune the parameters of the learning model. There are various resources on parameter tuning; a good place to start is https://www.coursera.org/learn/deep-neural-network.
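As a sketch of that practice with scikit-learn, here is a small grid search; the dataset is synthetic and the model and grid are placeholders, not a recommendation for this problem.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Placeholder grid; in practice the grid comes from knowing the model.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))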
Skill 9: Validation
Validation is not a one-time activity. Market forces change, world economies take turns, and global warming might just affect your business irreversibly. Whatever it may be, the accuracy of data models should be validated as frequently as possible, even in real time if you can manage it. A lot of data scientists would agree that a successful validation involves agreement between the model's predictions and a reasonable real-world outcome. This is done by measuring metrics like the F-score against the output variables, usually with a 70%/30% split between training and testing data.
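That baseline check looks something like the sketch below; the data is synthetic and the model is a placeholder, since in production the fresh labels would come from recent user behaviour.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=15, random_state=1)

# The usual 70/30 split mentioned above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1 on held-out data:", round(f1_score(y_test, model.predict(X_test)), 3))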
But is that enough? Taking the recommendation engine as an example: for a user who has browsed red sport shirts in the 100 to 500 price range and black sport shorts in the 100 to 500 price range, reasonable recommendations could be black sport water sippers worth 50, black strapped fitness bands worth 100, and so on.
But do items that seem like good recommendations pass other business tests? Suppose the black strapped fitness bands are sold by a seller with a bad rating and lots of returns and disputes. If the user gives in to our recommendation and receives a defective product because of a bad seller, they are far less likely to ever trust a recommendation again.
Such business tests should be part of the model itself. However, some insights are hard to see until the model is actually in production, so constant validation is a must. In e-commerce there are a lot of metrics that can help with such validation; a couple of important ones are click-through rate (CTR) and conversion %. If we juxtapose these with seller rating, dispute %, demographics, time of view, purchase history and other data points, we have powerful validation data.
Chapter 3: Shipping code to production
Skill 10: Software engineering
The most important skill is knowing that you shouldn't start writing code until you've done all the steps before this. Writing code is the final way of telling the computer, "Hey, I've decided to do this. Here are your instructions." You shouldn't start telling computers what to do unless you really know what you want yourself. As with all software engineering projects, the code you write will probably be a future maintainer's first way into your mind. (You should have written detailed documentation, with UML and data flow diagrams, but who are we kidding?)
Favour simplicity whenever in doubt. Don't use fancy monads in an esoteric functional language; use Python if possible. Read Joel's tests at https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/. Top tip: when using Python, keep in mind that it is not statically typed, which can be a great source of data errors. Designing elegant-looking APIs is also an art. Then come the important bits: version control and testing. Version your APIs. Write unit tests.
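Since Python will not catch type mistakes for you, annotate the data-facing boundaries and put them under test. A small sketch; the function and types are made-up examples.

from dataclasses import dataclass

@dataclass
class Recommendation:
    item_id: str
    score: float

def top_k(recs: list[Recommendation], k: int) -> list[Recommendation]:
    # Type hints document the contract; a checker like mypy can enforce it.
    return sorted(recs, key=lambda r: r.score, reverse=True)[:k]

# A unit test (run with pytest) documents and guards the behaviour.
def test_top_k_orders_by_score():
    recs = [Recommendation("a", 0.2), Recommendation("b", 0.9)]
    assert [r.item_id for r in top_k(recs, 1)] == ["b"]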
Skill 11: Latency and caching
Once you have an API ready, you'd want its response time to be fast: typically less than 10 ms in an e-commerce setting, and even less if possible. For this you'll need caching. Design your API so that all it has to do is look up a JSON blob in Redis for a given key. RAM is getting cheaper; with AWS ElastiCache you can get 100 GB RAM instances for less than $1,000 a month, and you won't even need that much. Use a cache replacement strategy like LRU, which can cut your RAM needs by perhaps 10x (see the Wikipedia article on cache replacement policies and the ElastiCache documentation for tips). Pro tip: don't cache everything!
Compute the JSON response that needs to be served for every user, maybe as a long-running streaming job, a Hive summarisation query, or something similar. Pick the bare essential response and cache it. If your response is 10 KB per user (which is a lot of data per user!) and you have a million users, that's 10 GB of data.
So /recommendation?user_id=abd17rcb should just be a lookup of the key abd17rcb. The network call between where your API is hosted and where your Redis is hosted is the biggest bottleneck, so ensure speedy network I/O in your infrastructure. How did our entire recommendation API fold into a Redis lookup? I'll leave it to you to put the pieces together!
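Here is a minimal sketch of that design with Flask and redis-py; the key scheme and the precomputed payloads are assumptions, because the heavy lifting already happened offline.

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.route("/recommendation")
def recommendation():
    user_id = request.args.get("user_id", "")
    # An offline job has already stored a precomputed JSON blob per user
    # under the assumed key scheme "rec:<user_id>".
    payload = cache.get(f"rec:{user_id}")
    if payload is None:
        return jsonify({"items": []})   # cold user: fall back to a default list
    return app.response_class(payload, mimetype="application/json")

# GET /recommendation?user_id=abd17rcb is now a single Redis GET plus
# network time; no model code runs on the request path.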
Conclusion
Building things is fun, and often there's more going on behind the scenes than just the code and the algorithm. These are the 11 essential skills for building a production-ready data science product!