This article is a getting-started guide to the skills needed to ship a data science product to production from scratch. We will take the case study of building a recommendation engine on an e-commerce platform for illustration. I will break this essay into 3 chapters: (1) Building a data pipeline, (2) Modelling, and (3) Shipping code to production. Each chapter has its own set of necessary skills.
The primary aim of a recommendation engine is a better user experience and an increase in overall engagement on the platform. There are many ways in which better UX and engagement lead to more revenue. For instance, they could increase time spent on the platform, increase ad impressions, or, in our example of an e-commerce store, bring more sales. This essay will not go into the algorithmic or mathematical details, for there are much better technical essays and books for that, complete with matrices and linear algebra. The intention, rather, is to get a sneak peek at the engineering of it and the behind-the-scenes decision making in the fast-moving world of startups, especially the skills required to roll out a production-ready Data Science + Big Data product.
For trade to flourish, a train track has to be laid, a shipping route set and an airport built. Likewise, infrastructure is key to sustaining and scaling up the running of algorithms on data. It can decide whether you end up running, say, a decision tree classifier or a regression. There are many factors to consider before running models, and the most important one for a low-latency, high-concurrency system is a sound understanding of the running times of various algorithms on various infrastructure stacks with various data sizes. Before you make infrastructure decisions, though, you have to understand what the data is, how big it is, where it comes from, etc.
It is difficult to proceed without understanding these three things: (1) How much data arrives per day? (2) What fields are available? and (3) What is the size of each record? These are what I’d like to call the “three fundamental questions on datasets”. There are other questions of course, but those come at the EDA and feature engineering phase, later in Chapter 2 of this essay.
The second part of this skill is having an idea of the running times of various algorithms versus your data size. This is a crucial skill that will come in handy forever, and you need to be an absolute expert in it. It is not enough to know that O(n log n) is faster than O(n²); you need an intuitive feeling for how much time is actually saved by running an O(n log n) algorithm on, say, a 500 GB data set on a two-node HDFS cluster with 8 Xeon CPUs. We’ll briefly see how to build such a skill.
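One rough way to start building that intuition is to turn the big-O expressions into operation counts and divide by an assumed throughput. The sketch below is a back-of-envelope estimate only: the 5 KB record size, the reading of "8 Xeon CPUs" as 8 cores, and the figure of 10^8 simple operations per second per core are illustrative assumptions, not measurements, and real jobs are often dominated by disk and network I/O.

```python
import math

def estimated_seconds(operations: float, ops_per_second: float) -> float:
    """Convert an abstract operation count into wall-clock seconds."""
    return operations / ops_per_second

# Assumptions for illustration only (not benchmarks):
DATASET_BYTES = 500 * 10**9   # the 500 GB data set from the text
RECORD_BYTES = 5 * 10**3      # assumed 5 KB per record
OPS_PER_SECOND = 8 * 10**8    # assumed 10^8 ops/s on each of 8 cores

n = DATASET_BYTES // RECORD_BYTES   # number of records
n_log_n = n * math.log2(n)          # operation count for O(n log n)
n_squared = n ** 2                  # operation count for O(n^2)

fast = estimated_seconds(n_log_n, OPS_PER_SECOND)
slow_days = estimated_seconds(n_squared, OPS_PER_SECOND) / 86400

print(f"records: {n:,}")
print(f"O(n log n): about {fast:.1f} seconds")
print(f"O(n^2):     about {slow_days:.0f} days")
```

Under these assumptions the gap is seconds versus months, which is the kind of order-of-magnitude feeling worth having before picking an algorithm or a stack.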
Like I re-learned from a rather good friend of mine recently — the world is fast and ruthless, and people increasingly don’t care if you’re the kind of person who replies “I need to look more closely at the data and get back to you”. You have to learn to make quick decisions and cost estimates with as little information as possible.
Learn to answer “If we have 100 GB/day of data, and 5000 USD/month budget, based on work force, we’re likely going to need such and such retention policy and our best bet would be to rely on SQL and use a managed service like Redshift, which allows fast aggregations.”
That was just one such example; there are many other business use cases. Often a railway line would solve the issue, but sometimes you might also need an airport and a ferry way. The trick is to know what you need and when you need it. The good thing is that this skill is acquired — which leads us to the next section — benchmarking.
As a Data scientist or an engineer, you need to be up to date with benchmarks. A good grasp of benchmarks of popular classification or clustering algorithms on different data sizes with different hardware is powerful knowledge. A few interesting benchmark resources to start with are (1) Stanford DAWN, and (2) tech.marksblogg.com.
Over time, familiarity will wire some nerves in your brain, and you’ll be able to take lightning fast decisions on what data architecture to use for building a recommendation engine from scratch — that is, assuming you’re first mathematically aware of the operations involved in the approach: perhaps it is ALS, or some other such strategy.
Before we look at the next 2 skills of building a data pipeline, let’s quickly pause and review.
In our e-commerce example, we decide to host our data on Amazon Redshift. We’ll rely on fast SQL summaries on a columnar store and pipe data into Redshift from an external logging service. These decisions were taken based on the knowledge that a logging service would give us the usual suspects of data (a cookie-based ID from the browser, or a user ID, an email if logged in, time of access, IP, location, page visits, referral, etc.) at a rate of 10 GB/day, with each record averaging 5 KB.
The actual piping into Redshift could be done in many ways. One solution would be to use Apache Kafka, or alternatively, if the logging service writes dumps to S3, even a simple COPY command would work. Keep in mind that the rate at which you can ingest data into the warehouse should be higher than the rate at which data is generated. Do not rely on real-time analysis if this condition is not met.
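The numbers from the logging service and the ingestion condition can be turned into a quick check. This is a minimal sketch: the 10 GB/day and 5 KB figures come from the text, while the sustained load rate of 1 MB/s is a hypothetical value you would replace with your own measurement.

```python
GB = 10**9
KB = 10**3
SECONDS_PER_DAY = 24 * 60 * 60

def records_per_day(bytes_per_day: int, bytes_per_record: int) -> int:
    """Approximate number of log records generated per day."""
    return bytes_per_day // bytes_per_record

def generation_rate(bytes_per_day: int) -> float:
    """Average generation rate in bytes per second."""
    return bytes_per_day / SECONDS_PER_DAY

def ingestion_keeps_up(ingest_bytes_per_second: float, bytes_per_day: int) -> bool:
    """True if the warehouse can load data faster than it is generated."""
    return ingest_bytes_per_second > generation_rate(bytes_per_day)

daily_volume = 10 * GB        # from the text
record_size = 5 * KB          # from the text
measured_ingest = 1_000_000   # hypothetical: 1 MB/s sustained load rate

print(records_per_day(daily_volume, record_size))
print(round(generation_rate(daily_volume)))
print(ingestion_keeps_up(measured_ingest, daily_volume))
```

Note that this compares against the daily average; traffic peaks can be several times higher, so some headroom is worth planning for.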
A popular counter argument here would be — why not Spark with Spark Streaming and MLlib? That’s a fair question. Although Spark is fast, it needs powerful machines and maintenance, and I have a personal bias to Redshift’s pricing versus EC2 + EMR pricing. Pricing aside, Redshift is also remarkably powerful. It’s designed for massive aggregations.
Consider this problem statement: we need to find the similarity between two vectors a and b using cosine similarity. It turns out to be really simple to write a quick SQL query for it, and Redshift is fast. Assuming massive_table holds one row per dimension, with the components of the two vectors in columns a and b:
SELECT SUM(a * b) / (SQRT(SUM(a * a)) * SQRT(SUM(b * b))) AS similarity FROM massive_table;
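For sanity-checking a warehouse result on a small sample, the same cosine similarity formula can be written in a few lines of Python. This is a sketch for verification purposes, not a replacement for the aggregation in the warehouse; the function name is my own.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Parallel vectors are maximally similar, orthogonal ones are not similar.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```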
The final frontier in building a data pipeline is deciding schemas. If it’s a document store, like perhaps Couchbase or Aerospike or a cool new storage engine, the pain of deciding schemas should be limited. However, if it is a columnar store, you have to be more careful. Excessive normalisation hurts any massive retrieval. Joins can be costly, and choosing indices is an important exercise.
Spend time mocking schemas and retrieval scenarios. The tricky part is planning for the future. Often, a new column would be necessary, or some specific views have to be created. A general awareness of the scope of the solution is highly recommended. On managed services like Redshift, data replication and distribution across nodes are made simple, which is why they get my personal bias. Having to deal with data sharding issues such as maintaining multiple nodes, clusters and IP addresses, network engineering them, etc. is a massive digression from our end goal of building a data science product. Top tip: Use hosted data infrastructure unless you really know what you’re doing.
Your data sources are diverse and bizarre. They promise they’ll send you a CSV with 7 columns, and I’ll bet you there will be a row with 26 columns. They’ll tell you to subscribe to their topic and you’ll receive JSONs formatted in UTF-8 only, but I’ll bet you there will be a JSON with Latin-1 encoding. They’ll say they have three tables with uniform schemas, but no, they’re lying. Here’s where you’ll need some non-linearity in your decision making. Think ahead, convince yourself which algorithm is the best, and format your data accordingly. If you need to clean and compute and store everything in one table, do it. If you need to store it as two files in HDFS, do it.
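A defensive loader for the scenarios above might look like the following sketch. It assumes the promised 7-column CSV, tries UTF-8 first and falls back to Latin-1, and sets malformed rows aside instead of letting them reach the model. The function names and the keep-or-reject policy are my own choices for illustration.

```python
import csv
import io
from typing import List, Tuple

EXPECTED_COLUMNS = 7  # the column count promised in the example

def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def partition_rows(
    raw: bytes, expected: int = EXPECTED_COLUMNS
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split raw CSV bytes into (good_rows, rejected_rows) by column count."""
    good: List[List[str]] = []
    rejected: List[List[str]] = []
    reader = csv.reader(io.StringIO(decode_bytes(raw), newline=""))
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == expected:
            good.append(row)
        else:
            rejected.append(row)
    return good, rejected

sample = b"a,b,c,d,e,f,g\n1,2,3\n"
good_rows, rejected_rows = partition_rows(sample)
print(len(good_rows), len(rejected_rows))
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to report back to the data source and to measure how dirty each feed really is.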
If your algorithm fails due to bad data, or unexpected columns, or noise, remember that it is your fault, because you haven’t ETL-ed it right. Sometimes you might need a cron, or a daemon that constantly aggregates new data and appends to your master analytical table in Redshift or files in HDFS. If this is a big problem, and takes up a lot of time, and doesn’t allow for your model to recompute (remember Skill 1 and 2?), you could perhaps try Spark and Lambda architecture. But there’s a cost to those as well. Nothing is a perfect solution.
The first step in modelling is EDA, or exploratory data analysis. Often this needs to be done before making infrastructure decisions, which makes this entire essay non-linear. But there’s no perfect path to take. Sometimes, with experience, you just know what a dataset will look like. Exploratory data analysis is a key step in feature engineering, and good features lead to great models.
Feature engineering is an integral part of choosing the right model for your task. There are plenty of tutorials available on choosing the right features for training. We have to avoid too many features if the model is overfitting, and have a diverse set of training rows.
A limited set of features can be good for some use cases. For a recommendation system, we can actually get by with just user_id, page_viewed, and time_of_view for a basic model. A slightly more intelligent approach would be to build an intent tree of the user. An intent tree could be a derived column. Here’s where knowledge of SQL joins would help, and we may as well write an entire model in SQL too. Be agnostic of tooling. Due diligence is of course necessary on fitting and cross validations, which are important practices for improving accuracy of your model.
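To make the "basic model" concrete, here is a minimal sketch of one possible approach using only those three fields: count how often two pages are viewed by the same user and recommend the most frequent co-viewed pages. This co-occurrence technique is my own illustrative choice, not necessarily what the essay has in mind, and it ignores time_of_view, which a fuller version could use for recency weighting.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

View = Tuple[str, str, int]  # (user_id, page_viewed, time_of_view)

def co_view_counts(views: Iterable[View]) -> Dict[str, Counter]:
    """For each page, count how many users also viewed each other page."""
    pages_by_user: Dict[str, Set[str]] = defaultdict(set)
    for user_id, page, _time in views:
        pages_by_user[user_id].add(page)
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pages in pages_by_user.values():
        for p, q in combinations(sorted(pages), 2):
            counts[p][q] += 1
            counts[q][p] += 1
    return counts

def recommend(page: str, counts: Dict[str, Counter], k: int = 3) -> List[str]:
    """Return up to k pages most often co-viewed with the given page."""
    return [other for other, _ in counts.get(page, Counter()).most_common(k)]

# Hypothetical sample data
views = [
    ("u1", "shirt", 1), ("u1", "shorts", 2),
    ("u2", "shirt", 3), ("u2", "shorts", 4), ("u2", "band", 5),
    ("u3", "shirt", 6), ("u3", "shorts", 7),
]
print(recommend("shirt", co_view_counts(views)))
```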
A model might seem like a really intelligent choice. For example, let’s say we have a side problem of clustering types of users for a particular recommendation, and we end up using k-means with n_clusters=6; it feels like a great choice and performs brilliantly on the training set.
However, even if the features were chosen after careful deliberation, if 90% of future users map to one particular cluster, there’s a huge loss of diversity in the model. Although this problem is avoided (as an intrinsic must-do) with good feature engineering, I had to re-state “coverage” as a separate skill for its importance in building a good model for personalised UX. So perhaps, it is time to check out DBSCAN or a variant?
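A coverage check like the one described can be as simple as measuring the share of users that land in the largest cluster. The sketch below assumes you already have one cluster label per user from whatever clustering you ran; the 0.9 threshold mirrors the 90% figure in the text and is otherwise arbitrary.

```python
from collections import Counter
from typing import Hashable, Sequence

def largest_cluster_share(labels: Sequence[Hashable]) -> float:
    """Fraction of points assigned to the most populated cluster."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def has_poor_coverage(labels: Sequence[Hashable], threshold: float = 0.9) -> bool:
    """True if a single cluster holds at least `threshold` of all points."""
    return largest_cluster_share(labels) >= threshold

# Hypothetical labels: 90 of 100 new users fall into cluster 0
skewed = [0] * 90 + [1, 2, 3, 4, 5] * 2
print(largest_cluster_share(skewed), has_poor_coverage(skewed))
```

Running this on fresh users at regular intervals, not only on the training set, is what catches the loss of diversity described above.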
Even if you’ve chosen the perfect model, it might not perform as desired. This is where data scientists have to go that extra mile in boosting the performance of their models. Perhaps you could use some form of ensemble learning, combining multiple weak models to produce a better one. The standard industry practice is to tune the parameters of the learning model. There are various resources for parameter tuning. A good place to start is https://www.coursera.org/learn/deep-neural-network.
Validation is not a one-time activity. Market forces change, world economies take turns, and global warming might just affect your business irreversibly. Whatever it may be, the accuracy of data models should be validated as frequently as possible, if possible even in real time. A lot of data scientists would agree that a successful validation involves agreement of the model’s predictions with a reasonable real-world outcome. This is done by measuring metrics like F-score against the output variables. Usually the data is split 70%/30% into training and testing sets.
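As a minimal sketch of those two ideas, the following splits rows 70/30 and computes an F1 score (the F-score with equal weight on precision and recall) for binary labels. It uses only the standard library; in practice a library such as scikit-learn provides equivalents, and the helper names here are my own.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_train_test(
    rows: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def f1_score(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary (0/1) labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train, test = split_train_test(list(range(10)))
print(len(train), len(test))
print(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```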
But is that enough? Taking the recommendation engine as an example, for a user who has browsed red sport shirts in the price range 100 to 500 and black sport shorts in the price range 100 to 500, reasonable recommendations could be black sport water sippers worth 50, black strapped fitness bands worth 100, etc.
However, do the items that seem like good recommendations pass other business tests? Consider the case where the black strapped fitness bands are sold by a seller with a bad rating and lots of returns and disputes. If the user follows our recommendation and receives a defective product from a bad seller, it reduces their likelihood of ever trusting the recommendation again.
Such business tests should be a part of the model itself. However, there are some insights that are hard to see until the model is actually in production. Therefore, constant validation is a must. In e-commerce, there are a lot of metrics that can help with such validation; a couple of important ones are Click Through Rate (CTR) and Conversion %. If we juxtapose these with seller rating, dispute %, demographics, time of view, purchase history and other data points, we can have powerful validation data.
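The two metrics can be computed from simple event counts. In this sketch, CTR is clicks per impression and conversion is purchases per click; definitions vary between teams (conversion is sometimes measured per session or per impression), so treat these as assumptions. The sample numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class RecommendationStats:
    impressions: int  # times a recommendation was shown
    clicks: int       # times it was clicked
    purchases: int    # clicks that ended in a purchase

def click_through_rate(stats: RecommendationStats) -> float:
    """Clicks divided by impressions (0.0 when nothing was shown)."""
    return stats.clicks / stats.impressions if stats.impressions else 0.0

def conversion_rate(stats: RecommendationStats) -> float:
    """Purchases divided by clicks (0.0 when nothing was clicked)."""
    return stats.purchases / stats.clicks if stats.clicks else 0.0

# Hypothetical counts, segmented by seller rating for comparison
by_seller_rating = {
    "good": RecommendationStats(impressions=1000, clicks=50, purchases=5),
    "bad": RecommendationStats(impressions=1000, clicks=40, purchases=1),
}
for rating, stats in by_seller_rating.items():
    print(rating, click_through_rate(stats), conversion_rate(stats))
```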
The most important skill is to know that you shouldn’t start writing code until you’ve done all the steps before this. Writing code is the final way of telling the computer “Hey, I’ve decided to do this. Here are your instructions”. You shouldn’t really start telling computers what to do unless you really know it yourself. As with all software engineering projects, the code you write will perhaps be the first point of entry into your mind for a future maintainer. (You should have written detailed documentation, with UMLs complete with Data Flow Diagrams, but who are we kidding?)
Choose simplicity whenever in doubt. Don’t use fancy monads in an esoteric functional language. Use Python if possible. Read the Joel Test at https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/. Top tip: When using Python, keep in mind that it is not statically typed. This can be a great source of data errors. Designing elegant-looking APIs is also an art. Then come the important bits: version control and testing. Version your APIs. Write unit tests.
Once you have an API ready, you’d want its response time to be fast, typically under 10 ms in an e-commerce setting, and even less if possible. For this, you’d need caching. Design your API such that all it has to do is look up a JSON from Redis for a given key. RAM is getting increasingly cheap. With AWS ElastiCache you can get away with 100 GB RAM instances for less than $1000 a month. You wouldn’t even need that much! Have a cache replacement strategy like LRU, which could reduce your RAM needs by perhaps as much as 10X. (See the Wikipedia article on cache replacement and the ElastiCache tips.) Pro tip: Don’t cache everything!
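To show what an LRU replacement strategy does, here is a small in-process sketch built on OrderedDict. It illustrates the policy only; with Redis you would not write this yourself but rely on its built-in eviction (for example the maxmemory-policy setting with allkeys-lru, which approximates LRU).

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value and mark the key as recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or update a key, evicting the oldest entry if full."""
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used key
cache.put("c", 3)  # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```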
Compute the JSON responses that need to be served for every user, maybe as a long-running streaming job or a Hive summarisation query, etc. Pick the bare essential response and cache it. If your response is 10 KB per user (which is a lot of data per user!) and you have a million users, that’s 10 GB of data.
/recommendation?user_id=abd17rcb should just be a lookup of the key abd17rcb. The time it takes for the network call between where your API is hosted and where your Redis is hosted is the biggest bottleneck. Ensure speedy network I/O in your infrastructure. How did our entire recommendation API fold into a Redis lookup? I’ll leave it to you to put the pieces together!
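The lookup-only design can be sketched as follows. To keep the example self-contained, a plain dict stands in for Redis; with a real client such as redis-py the lookup would be a single get call on the client instead. The payload shape and item names are hypothetical, and the size helper just repeats the 10 KB per user arithmetic.

```python
import json
from typing import Mapping, Optional

def recommendation_for(cache: Mapping[str, str], user_id: str) -> Optional[dict]:
    """Serve a precomputed JSON response; do no computation at request time."""
    raw = cache.get(user_id)
    return None if raw is None else json.loads(raw)

def cache_size_gb(bytes_per_user: int, users: int) -> float:
    """Total cache footprint in GB for precomputed per-user responses."""
    return bytes_per_user * users / 10**9

# Stand-in for Redis: key is the user ID, value is the precomputed JSON
precomputed = {
    "abd17rcb": json.dumps({"items": ["water-sipper", "fitness-band"]}),
}

print(recommendation_for(precomputed, "abd17rcb"))
print(cache_size_gb(10 * 10**3, 10**6))
```

Returning None for an unknown user leaves room for a fallback, such as a list of generally popular items, without adding work to the hot path.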
Building things is fun, and often there’s more behind the scenes than just the code and the algorithm. These are the 11 essential skills for building a production-ready data science product!