Full stack open source MLOps

garrett · March 31, 2021, 10:07pm

Hello and welcome! Thanks for driving that thread on r/MachineLearning. Really impressive rollup of helpful data across a huge array of products.

Re Kedro, Guild is quite different. Guild runs stuff, captures results in a particular way, and lets you compare those results - even using them to drive further experiments. Kedro is one of the many pipeline/orchestration Python frameworks available. Others that come to mind are Dask, Tune, and Airflow. You’re free to implement an operation any way you like. Guild will happily run it. In some cases, you might want to just let Guild roll up a series of operations for you - but for complex DAG-style problems, you might want to use a DAG-style tool (Dask is a great example).

In other words, Guild lets you implement your operations however you want. When you’re interested in measuring results so that you can compare them and make better decisions, Guild steps in.
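To make "implement your operations however you want" concrete, here is a minimal sketch of a plain script with no Guild imports at all. The file name, variable names, and toy objective are hypothetical. It assumes the run tool can pick up module-level settings as tunable parameters and read results printed as "key: value" lines; check the Guild docs for the exact rules before relying on that.

```python
# train.py: a hypothetical stand-in for a real training script.
# Module-level values like these are the kind of settings a run tool
# can expose as tunable parameters (names are illustrative only).
learning_rate = 0.1
epochs = 20


def train(learning_rate: float, epochs: int) -> float:
    """Minimize f(w) = (w - 3)^2 by gradient descent; return the final loss."""
    w = 0.0
    for _ in range(epochs):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * grad
    return (w - 3.0) ** 2


if __name__ == "__main__":
    loss = train(learning_rate, epochs)
    # Results printed as "key: value" lines are easy to capture and compare.
    print(f"loss: {loss:.6f}")
```

Nothing here is specific to any one tool, which is the point: the script stays ordinary Python, and measuring and comparing runs is layered on from the outside.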

Re MLOps, what a great question! It’s important and tricky to get right.

Guild is sort of a stealth platform. There are a LOT of features in this toolset that put it squarely in the “MLOps platform” category. But they’re subtle.

Guild is quite carefully designed to let you start very, very small and evolve to complex systems. So guild run train.py can evolve in steps to a full-blown collaborative production system. I’ll break those steps into categories:

Pipeline development
Distributed computation
Collaborative workflow

As your tasks evolve (Guild calls these operations) you can create pipelines through operation dependencies. So test requires train, which requires data prep, and so on. These dependencies are declared in the Guild file.
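The idea behind "test requires train requires data prep" can be sketched in a few lines of Python. This is not Guild's implementation or its file format; the operation names and the REQUIRES mapping are hypothetical, and the ordering comes from the standard library's graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical operations, each mapped to the operations it requires.
REQUIRES = {
    "prepare-data": set(),
    "train": {"prepare-data"},
    "test": {"train"},
}


def run_order(requires: dict) -> list:
    """Return the operations in an order that satisfies every requirement."""
    return list(TopologicalSorter(requires).static_order())


if __name__ == "__main__":
    for op in run_order(REQUIRES):
        print(f"run {op}")
```

Because each operation only names what it needs, adding a new upstream step means adding one entry rather than rewriting the pipeline.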

When you want to move from local runs to remote, Guild provides solid support for distributing compute via remotes. E.g. you can spin up cloud-based services and run what you need remotely. You can then move those runs around using commands like push and pull. This flexibility is very much inspired by git and the various workflows its commands enable.

Finally, Guild enables collaborative workflow through various tagging, labeling, and commenting primitives, which apply to experiments. When you generate runs, you can annotate them for ongoing comparison, discussion, and ultimately deploy/release decisions. As with the other features, these features are based on federation: users run their experiments, annotate them, and consolidate them into a common pool for more analysis or deployment. This approach leverages the excellent patterns we all know and enjoy with git. No central databases or complex central controls.
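The federated pattern described above can be illustrated with a small, tool-agnostic sketch: each user keeps their own annotated runs, and consolidation is a merge with no central database. The Run class, its fields, and merge_pools are hypothetical names made up for this example; they are not part of Guild.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A hypothetical record of one experiment run plus its annotations."""
    run_id: str
    loss: float
    tags: set = field(default_factory=set)
    comments: list = field(default_factory=list)


def merge_pools(*pools):
    """Consolidate several users' runs into one pool, keyed by run id.

    When the same run appears in more than one pool, its tags and
    comments are combined rather than overwritten.
    """
    merged = {}
    for pool in pools:
        for run in pool:
            existing = merged.get(run.run_id)
            if existing is None:
                # Copy so the caller's objects are not mutated later.
                merged[run.run_id] = Run(
                    run.run_id, run.loss, set(run.tags), list(run.comments)
                )
            else:
                existing.tags |= run.tags
                for comment in run.comments:
                    if comment not in existing.comments:
                        existing.comments.append(comment)
    return merged


if __name__ == "__main__":
    alice = [Run("a1", 0.30, {"baseline"})]
    bob = [
        Run("b1", 0.25, {"candidate"}, ["lower loss than baseline"]),
        Run("a1", 0.30, {"reviewed"}),
    ]
    pool = merge_pools(alice, bob)
    best = min(pool.values(), key=lambda r: r.loss)
    print(f"best run: {best.run_id} tags={sorted(best.tags)}")
```

The merged pool is what a team would look at when making a deploy/release decision: every run, with everyone's annotations attached.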

What Guild does not address: anything language specific, complex orchestration, runtime environments.

Guild, for example, says nothing about Python project layout, linting, preferred libraries, etc. You do what you want in code. Guild runs that. Guild does not have an opinion about what libraries you use or how you use them. I don’t think it should.

Guild does not attempt to be a complex scheduler. There are many, many great schedulers. Guild instead wants to integrate with schedulers. The recently released Dask scheduler is a good example. Over time we’ll implement more support—the next target is likely Kubeflow.

Finally, Guild makes no claims to runtime environments. Run wherever you want. Run in a container or not. Run on prem or in the cloud. Guild doesn’t care. Guild wants to run something, capture the results of that operation, and then let you use those results to make smart decisions about what to do next. That’s the scoping.

This is a tough question you ask - but a very important one. As ML teams work to streamline their workflows they need to parse all of this, and your r/MachineLearning thread will help a lot, I think. Underlying all of this is the basic problem: operations are complex. That’s just the way it is. There’s no simple, single answer for anyone. It takes work, the sort that you’re doing, to make sense of the options and consider them for your application(s) and use cases. Guild takes a strong position on two things. One is platform and language independence (it’s good and important to maintain). The other is separation of concerns (narrowly focused tools that can be assembled to solve complex problems are better than monolithic frameworks that purport to be a one-stop shop for all your needs). That’s philosophy. Still, it’s important to note.

Please feel free to ask more specific questions—I’m sure I’ve left out many important details here but I wanted to share my initial thoughts.
