What Does DataOps Mean to Me? — Data Scientist Edition
By Jin Huang
This is the first of a series of articles on DataOps practices and frameworks, where we will discuss what DataOps could mean to different roles in an organization.
In this article, we look at the role of a Data Scientist and discuss a capability that every enterprise data product or team needs: scaling work from development samples to production data.
A year and a half ago, I was doing ‘typical’ data science work: building regression models to generate predictive statistics. The outputs of my work were usually delivered as technical reports and/or local applications that were not built to scale for production workflows.
When I joined Ascend as a data scientist, DataOps entered my work as an entirely new concept and framework. DataOps spans the entire data lifecycle and involves many aspects, but the one I felt most strongly about, and the one with the biggest direct impact on my work, was scaling data for live production workflows.
After I built and validated a data pipeline using data samples, we needed to backfill the pipeline to bring in all the historical data. Only then did I realize that some of the logic and models that worked fine on the sample dataset (DEV data) would fail on the PROD dataset. For example, when I implemented a left join, a right table with duplicate keys or unnecessary dimensions could increase the processing time dramatically, or even cause the job to fail outright on a very large dataset. Another example: a predictive model's behavior may regress once it is deployed to production.
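To make the left-join pitfall concrete, here is a minimal sketch in PySpark (the article names no specific engine, so the framework, table names, and columns are all hypothetical illustrations). Duplicate join keys in the right table multiply the matching rows on the left, which is exactly the kind of blowup that stays invisible on a small DEV sample and explodes on PROD-scale data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-blowup-demo").getOrCreate()

# Left table: one row per order (hypothetical schema).
orders = spark.createDataFrame(
    [(1, "2021-01-01"), (2, "2021-01-02")],
    ["customer_id", "order_date"],
)

# Right table: customer_id 1 appears three times. Each duplicate
# multiplies the matching left-hand rows in the join output.
customers = spark.createDataFrame(
    [(1, "gold"), (1, "gold"), (1, "gold"), (2, "silver")],
    ["customer_id", "tier"],
)

joined = orders.join(customers, on="customer_id", how="left")
print(orders.count(), "left rows ->", joined.count(), "joined rows")
# 2 left rows -> 4 joined rows: customer 1 matched all 3 duplicates.
```

On a sample of a few thousand rows this duplication is barely noticeable; backfilled against years of historical data, the same join can balloon the shuffle and take the pipeline down.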
These are typical DataOps issues of scale and speed. To address them, a modeling framework that bakes in a few best practices can be designed up front to support an efficient, agile process (e.g., dedupe the right table before a left join; develop automated regression tests to verify model behavior in production). Both practices are sketched below.
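Continuing the hypothetical sketch above, the dedupe fix is a one-line change before the join: keep only the columns the downstream logic actually needs, and collapse the right table to one row per join key.

```python
# Dedupe the right table on the join key, keeping only needed dimensions.
customers_deduped = (
    customers
    .select("customer_id", "tier")      # drop unnecessary dimensions
    .dropDuplicates(["customer_id"])    # one row per join key
)

joined = orders.join(customers_deduped, on="customer_id", how="left")
# joined.count() now equals orders.count(): no row multiplication.
```

Doing this inside the pipeline, rather than trusting the source to be clean, means the guarantee still holds when a backfill pulls in messier historical data.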
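For the second practice, one common shape for an automated regression test is to score the candidate model on a pinned validation set and compare its error against a recorded baseline before promoting it. This is a minimal sketch under assumptions of my own (a scikit-learn-style `predict` interface, RMSE as the metric, and a 5% tolerance), none of which come from the original article:

```python
import numpy as np

def check_model_regression(model, X_val, y_val, baseline_rmse, tolerance=0.05):
    """Raise if the candidate model's RMSE on the pinned validation set
    is worse than the recorded baseline by more than the tolerance."""
    preds = np.asarray(model.predict(X_val), dtype=float)
    y_true = np.asarray(y_val, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - preds) ** 2)))
    if rmse > baseline_rmse * (1.0 + tolerance):
        raise AssertionError(
            f"RMSE regressed: {rmse:.4f} vs baseline {baseline_rmse:.4f}"
        )
    return rmse
```

Wired into the deployment pipeline, a check like this turns "the model quietly got worse in production" into a failed build that someone has to look at.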
Be on the lookout for future articles, where we will cover other aspects of DataOps from a Data Scientist's perspective, including ingestion of a variety of big data formats, data quality monitoring, team collaboration, and more.