10++ Commandments of Data Science Modeling

Our analytics team at ATD comprises a wide variety of data scientists, data engineers, and software developers. Although we come from different backgrounds, we all touch the data science modeling component at some point or another in our daily work. To avoid situations where model building turns into a multi-month-long complicated extravaganza that ends in an unscalable model, we try our best to stick to a simple set of what we call our “10++ Commandments” (C++ and JavaScript pun intended). These were learned from hard experience, so you don’t have to learn them the hard way!

ONE — Thou shalt always check if teammates have built existing tools for what you want to do.

“I remember someone mentioning a similar product clustering project; they may have some code in the monorepo that can already do what I’m about to do.”

  • Does a component of the potential solution (e.g. clustering) lend itself to something that has already been done?

TWO — Thou shalt always know what is good enough for the business customer — what’s “80%” in the 80/20 rule?

“Our current third-party forecast is at 60% MAPE (mean absolute percentage error) at the warehouse-item level; the business wants us to be at least one percentage point more accurate at the SKU level, but not spend a lot of time trying to squeeze out more.”

For the business, “good enough” usually ties back to one of two goals:

  1. Decrease expenses
  2. Decrease the time needed for a specific process
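For reference, the MAPE mentioned in the quote above is simply the average absolute error expressed as a percentage of the actuals. Here is a minimal sketch in Python (it assumes no actuals are zero, which zero-sales items would violate):

    import numpy as np

    def mape(actual, forecast):
        """Mean absolute percentage error, in percent."""
        actual = np.asarray(actual, dtype=float)
        forecast = np.asarray(forecast, dtype=float)
        return np.mean(np.abs((actual - forecast) / actual)) * 100

    print(mape([100, 80, 120], [90, 100, 120]))  # ~11.7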

THREE — Thou shalt always start with simple models that can serve as a base comparison for other models.

“If I use lag1 in a linear regression, I get 55% MAPE, which is almost what I need.”
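A baseline like that can be only a handful of lines. The sketch below uses made-up numbers and an assumed time-ordered train/test split, so the exact MAPE is illustrative only:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_percentage_error

    # Made-up daily sales, purely for illustration
    df = pd.DataFrame({"sales": [120, 135, 128, 150, 142, 160, 155, 170, 165, 180]})
    df["lag1"] = df["sales"].shift(1)  # yesterday's sales as the only feature
    df = df.dropna()

    # Simple time-ordered split: hold out the last three observations
    train, test = df.iloc[:-3], df.iloc[-3:]

    model = LinearRegression().fit(train[["lag1"]], train["sales"])
    pred = model.predict(test[["lag1"]])
    print(mean_absolute_percentage_error(test["sales"], pred) * 100)  # MAPE in %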

FOUR — Thou shalt always start with a simple feature set and add incrementally.

“I’ll just start with tables that are easily available so I can get a feel for what a small feature set can do. Maybe I don’t need a whole new ETL process for a feature that isn’t necessary to hit the goal of the business.”

  1. Computation time increases with a larger feature set; utilizing every table just to hit the 80% mark may not be necessary and can waste valuable time.
  2. We’ll bet that most of the data you want isn’t available via simple SQL queries against tables you already have access to. Including a plethora of features at the beginning may cause your data engineers to run large ETL processes that ultimately aren’t needed if the features in your existing sales table turn out to be good enough for the business.
  3. Explaining and generalizing from complex models can be difficult.
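In practice, “start simple and add incrementally” can be as mechanical as evaluating feature groups in order of how much ETL work they require and stopping at “good enough.” The sketch below uses synthetic data; the feature groups and the 5% threshold are assumptions, not our actual pipeline:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_percentage_error

    # Synthetic daily sales with a weekly cycle and a promo lift, purely for illustration
    rng = np.random.default_rng(0)
    t = np.arange(300)
    promo = (t % 28 < 3).astype(int)
    sales = 100 + 10 * np.sin(2 * np.pi * t / 7) + 25 * promo + rng.normal(0, 4, t.size)

    df = pd.DataFrame({"sales": sales, "promo": promo})
    df["lag1"] = df["sales"].shift(1)
    df["lag7"] = df["sales"].shift(7)
    df = df.dropna()

    train, test = df.iloc[:-60], df.iloc[-60:]
    target_mape = 5.0  # pretend this is what the business called "good enough"

    # Feature groups ordered by how much ETL work they would require
    groups = [["lag1"], ["lag7"], ["promo"]]  # "promo" stands in for a brand-new pipeline
    features = []
    for group in groups:
        features += group
        model = LinearRegression().fit(train[features], train["sales"])
        score = mean_absolute_percentage_error(test["sales"], model.predict(test[features])) * 100
        print(features, round(score, 2))
        if score <= target_mape:
            break  # good enough: stop adding features (and ETL work) here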

FIVE — Thou shalt always start by utilizing open-source packages before “rolling your own.”

“I would like to design my own Bayesian forecasting package, but Facebook already has a package that does 90% of what I need to do.”
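The Facebook package in question here is presumably Prophet. A minimal sketch of leaning on it instead of rolling our own (the series is made up, and ds/y are the column names Prophet expects):

    import pandas as pd
    from prophet import Prophet  # pip install prophet (formerly fbprophet)

    # Made-up daily history; Prophet wants a 'ds' date column and a 'y' value column
    history = pd.DataFrame({
        "ds": pd.date_range("2020-01-01", periods=90, freq="D"),
        "y": range(90),
    })

    m = Prophet()
    m.fit(history)

    future = m.make_future_dataframe(periods=30)  # extend 30 days past the history
    forecast = m.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())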

SIX — Thou shalt always time your modeling exercises and extrapolate out to a full dataset: Big-O is not just a Wikipedia article.

“It takes 10 minutes to train my model on 6,000 samples, but I have 2 million in the full dataset. How is this going to scale when it’s time?”
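A simple habit that keeps this commandment honest is to time training on growing subsets and watch how the curve bends before committing to the full dataset. A minimal sketch with synthetic data and an arbitrary model:

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for the real dataset, purely for illustration
    X = np.random.rand(60_000, 20)
    y = np.random.rand(60_000)

    for n in (2_000, 6_000, 20_000, 60_000):
        start = time.perf_counter()
        RandomForestRegressor(n_estimators=50, n_jobs=-1).fit(X[:n], y[:n])
        print(f"{n:>7,} samples: {time.perf_counter() - start:6.1f} s")

    # If time roughly doubles when the sample size doubles, a linear extrapolation
    # to 2 million rows is reasonable; if it grows faster, rethink the approach.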

SEVEN — Thou shalt always consult with colleagues when stuck.

“I may be embarrassed about being stuck on how I’m going to scale my forecast to all DC-SKUs, but our solutions architect is one heck of a guy who can spot a problem before it happens — maybe I should ask him.”

EIGHT — Thou shalt never agree to deadlines without consulting the technical leads.

“Jacob is asking me to get this done by tomorrow to present to a business unit (BU). I think I can do it, but I should probably run it by the tech lead of my team before agreeing to it.”

NINE — Thou shalt optimize to a reasonable degree before resorting to big compute.

“It takes a VM with 1,024 GB of RAM for my script to run right now. Maybe I should think about using NumPy arrays, downcasting my variables, using Dask, and removing unused columns instead of increasing the VM size.”
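A few of those cheaper fixes are shown below before anyone asks for a bigger VM. The column names and sizes are made up; the point is downcasting, categoricals, and keeping only the columns you need:

    import numpy as np
    import pandas as pd

    # Illustrative frame; in practice this would come from your sales table
    n = 1_000_000
    df = pd.DataFrame({
        "sku": np.random.randint(0, 50_000, n),
        "warehouse": np.random.choice(["ATL", "DFW", "ORD"], n),
        "units_sold": np.random.rand(n) * 100,
    })
    print("before:", round(df.memory_usage(deep=True).sum() / 1e6), "MB")

    # Downcast numerics (int64 -> int32/16, float64 -> float32) where it is safe
    df["sku"] = pd.to_numeric(df["sku"], downcast="integer")
    df["units_sold"] = pd.to_numeric(df["units_sold"], downcast="float")

    # Low-cardinality strings shrink dramatically as categoricals
    df["warehouse"] = df["warehouse"].astype("category")
    print("after:", round(df.memory_usage(deep=True).sum() / 1e6), "MB")

    # If it still doesn't fit in memory, Dask can process the same data in partitions:
    #     import dask.dataframe as dd
    #     ddf = dd.read_parquet("sales/")  # hypothetical path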

TEN — Thou shalt know why you’re doing what you’re doing.

“There’s a really cool method I could apply to this data; not really sure what the purpose of it is, but it’s a lot of fun to play around with it.”

Conclusion

Following these commandments, even to a cursory degree, can help you avoid model-building processes that seem to never end. It gives your data scientists the flexibility to scratch that itch of curiosity while keeping the larger business context in view. We learned through experience that there needs to be some standardized way a data science project should flow, and we hope you’re able to learn from our experience to keep your analytics team running in tip-top condition.
