The research and development (R&D) phase of building an AI model to address a business problem is characterized by rapid exploration and iteration. Everything is on the table and experimentation is encouraged, from understanding how to frame the problem, to determining how to most effectively use the data on hand, to discovering the model architecture with the best performance.

In stark contrast to this, the operationalization phase of AI model development requires that the model be completely characterized, produce reproducible results, and be stable for integration in automation processes. Model versioning best practices and version control tools are essential to successfully navigating and overcoming this gap between R&D and production engineering.

Version Control Fundamentals

The practice of version control is nothing new. What developer isn’t interested in tracking and managing modifications to their software applications? Due to the large number of benefits associated with adopting these practices, ecosystems of software tools and applications exist in order to facilitate frictionless version control. The industry standard for this is Git, with a number of hosted software products based on Git being available including GitHub, GitLab, and Bitbucket.

While these solutions are necessary for managing complex software projects, they aren’t sufficient by themselves. In the data science ecosystem, there are emerging challenges that go beyond versioning software code and require innovation to address. For example:

Challenge: AI model development often produces very large model artifacts that exceed the capacity of traditional version control software.

Solution: Technologies such as Git Large File Storage (Git LFS) allow Git to be extended in order to accommodate very large non-source code files.

Challenge: The fuel for an AI model is data, however, datasets are massive, and storing multiple processed versions of the same dataset can be prohibitive.

Solution: Technologies such Data Version Control provide Git-like primitives for efficiently storing and versioning raw and processed datasets.

Key Takeaway: Developing a team that is well versed in version control principles will position you for success in adopting emerging data science version control tools.

Tracking Experimentation with Code and Data Version Control 

During the R&D phase, fast-paced development has many advantages to allow for rapid exploration of ideas without much overhead. If this process becomes unprincipled, however, it can lead to loss of knowledge, duplicated effort, and expensive technical debt.

Fortunately, rigorous model code and data versioning (even during the experimentation phase) can serve as a fossil record for the iterative process needed to produce high-impact, AI models. In this way, developers can:

  1. Preserve and disseminate institutional knowledge about an AI model, especially when there are multiple developers working on a model simultaneously
  2. Revisit and reproduce the performance of different versions of a model after being unattended after some period of time
  3. Ensure the model is in a state that can easily be transitioned into a production environment

Key Takeaway: Taking a proactive approach to experiment tracking and model versioning creates a streamlined process for achieving sustained value from AI models. In the domain of experiment tracking, tools such as MLFlow  and SageMaker are good tools to evaluate.

AI Deployment and Model Versioning

To build off of version control best practices, the Modzy approach embraces versioning practices at all levels. Not only do we use source code and data versioning during the process of model development, but we also provide a universal template for AI models that incorporates unique identifiers (UIDs) and versions for each model on the Modzy platform.

This provides a single source code repository to develop and release multiple versions of an AI model. These model versions each have their own full description and code state recorded. This is a huge win for development velocity. It also makes for a straightforward process to crank out updates and variations on models in a reproducible manner. 

Conducting model versioning in this way is not just useful for accelerating internal model development, but it is of critical concern when it comes to AI governance and auditability. Since all models deployed via the Modzy platform are packaged and set up for versioning in this way, every model prediction can be associated with a model and its history of development, data source, training procedure, and source code.

Key Takeaway: Modern version control practices are a foundational component for building the right ModelOps pipeline for your organization.

Version control practices consistently applied all the way from R&D to production deployment of an AI model are core to building a strong ModelOps pipeline. Dedicated technologies that are widely available make adopting these best-practices easier than ever. With a dedication to these principles, bringing model versioning to the domain of data science ensures quality, stability, and auditability of models. 

Be sure to check out the other blogs in this series:

Maximizing Data Use and Reducing Pitfalls: Putting the Right Data Preparation Process in Place for AI

Model Training: Our Favorite Tools in the Shed