Full stack open source MLOps


I wrote a post on Reddit asking about MLOps tools and solutions:

There I mentioned Guild, which seems like a really lightweight and interesting solution that might fit our team size and expertise. I wanted to ask two things here:

  • How is Guild compatible with Kedro?
  • If we used Guild, how many other tools should we seek? In other words, which parts of the “MLOps stack” will be covered?

Thanks in advance, and congratulations on making such an interesting tool available and free.

Hello and welcome! Thanks for driving that thread on r/MachineLearning. Really impressive rollup of helpful data across a huge array of products.

Re Kedro, Guild is quite different. Guild runs stuff, captures results in a particular way, and lets you compare those results - even using them to drive further experiments. Kedro is one of the many pipeline/orchestration Python frameworks available. Others that come to mind are Dask, Tune, and Airflow. You’re free to implement an operation any way you like. Guild will happily run it. In some cases, you might want to just let Guild roll up a series of operations for you - but for complex DAG-style problems, you might want to use a DAG-style tool (Dask is a great example).

In other words, Guild lets you implement your operations however you want. When you’re interested in measuring results so that you can compare them and make better decisions, Guild steps in.
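To make "implement your operations however you want" concrete, here is a minimal sketch of the kind of plain script that could be handed to guild run train.py. The script, its variable names, and the toy objective are all hypothetical. It also assumes Guild's default behavior of picking up module-level variables as flags and lines of the form key: value as results, which is worth checking against the Guild docs for your version.

```python
# train.py - a hypothetical, framework-free training script.
# Nothing here imports or depends on Guild.

# Module-level settings. Assumption: a runner such as Guild can
# expose plain globals like these as tunable flags.
learning_rate = 0.1
steps = 50


def train(learning_rate, steps):
    """Minimize f(x) = (x - 3)^2 with plain gradient descent."""
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)        # derivative of (x - 3)^2
        x -= learning_rate * grad
    return x, (x - 3) ** 2


if __name__ == "__main__":
    x, loss = train(learning_rate, steps)
    # Assumption: results printed as "key: value" lines can be
    # captured by the runner and compared across runs.
    print(f"x: {x:.6f}")
    print(f"loss: {loss:.6f}")
```

The point of the sketch is that the script stays ordinary Python: it can be run directly with the interpreter, and the measuring and comparing happens outside it.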

Re MLOps, what a great question! It’s important and tricky to get right.

Guild is sort of a stealth platform. There are a LOT of features in this toolset that put it squarely in the “MLOps platform” category. But they’re subtle.

Guild is quite carefully designed to let you start very, very small and evolve to complex systems. So guild run train.py can evolve in steps to a full-blown collaborative production system. I’ll break those steps into categories:

  • Pipeline development
  • Distributed computation
  • Collaborative workflow

As your tasks evolve (Guild calls these operations) you can create pipelines through operation dependencies. So test requires train requires data prep, etc. This is implemented through dependencies in the Guild file.
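The "test requires train requires data prep" chain can be sketched as a small dependency graph. This is only an illustration of the idea in plain Python, not Guild's implementation and not Guild file syntax; the operation names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each hypothetical operation maps to the set of operations it requires.
requires = {
    "test": {"train"},
    "train": {"prepare-data"},
    "prepare-data": set(),
}


def run_order(requires):
    """Return the operations in an order that satisfies every requirement."""
    return list(TopologicalSorter(requires).static_order())


order = run_order(requires)
print(order)
```

For a simple chain like this, the order is fully determined: each operation appears only after everything it requires.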

When you want to move from local runs to remote, Guild provides solid support for distributing compute via remotes. E.g. you can spin up cloud-based services and run what you need remotely. You can then move those runs around using commands like push and pull. This flexibility is very much inspired by git and the various workflows its commands enable.

Finally, Guild enables collaborative workflow through various tagging, labeling, and commenting primitives, which apply to experiments. When you generate runs, you can annotate them for ongoing comparison, discussion, and ultimately deploy/release decisions. As with the other features, these features are based on federation: users run their experiments, annotate them, and consolidate them into a common pool for more analysis or deployment. This approach leverages the excellent patterns we all know and enjoy with git. No central databases or complex central controls.
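The federation idea (everyone annotates their own runs, then the runs are consolidated into a shared pool) can be sketched without any central database. The Run class, the consolidate function, and the sample data below are hypothetical illustrations, not Guild's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical stand-in for an experiment run and its annotations."""
    run_id: str
    loss: float
    tags: set = field(default_factory=set)
    comments: list = field(default_factory=list)


def consolidate(*pools):
    """Merge per-user run pools into one dict keyed by run ID.

    Runs sharing an ID are treated as the same run: tags are unioned
    and comments are appended in order, skipping exact duplicates.
    """
    merged = {}
    for pool in pools:
        for run in pool:
            existing = merged.get(run.run_id)
            if existing is None:
                # Copy so the caller's objects are never mutated.
                merged[run.run_id] = Run(
                    run.run_id, run.loss, set(run.tags), list(run.comments)
                )
            else:
                existing.tags |= run.tags
                for comment in run.comments:
                    if comment not in existing.comments:
                        existing.comments.append(comment)
    return merged


# Two users work independently, then pool their runs.
alice = [Run("a1", 0.42, {"baseline"}, ["first attempt"])]
bob = [
    Run("b7", 0.31, {"candidate"}),
    Run("a1", 0.42, {"reviewed"}, ["looks stable"]),
]

pool = consolidate(alice, bob)
best = min(pool.values(), key=lambda r: r.loss)
print(best.run_id, sorted(pool["a1"].tags))
```

Merging by run ID is what makes the workflow git-like in this sketch: each user's annotations travel with the run, and consolidation is a pure merge rather than a write to a shared service.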

What Guild does not address: anything language specific, complex orchestration, runtime environments.

Guild, for example, says nothing about Python project layout, linting, preferred libraries, etc. You do what you want in code. Guild runs that. Guild does not have an opinion about what libraries you use or how you use them. I don’t think it should.

Guild does not attempt to be a complex scheduler. There are many, many great schedulers. Guild instead wants to integrate with schedulers. The recently released Dask scheduler is a good example. Over time we’ll implement more support—the next target is likely Kubeflow.

Finally, Guild makes no claims to runtime environments. Run wherever you want. Run in a container or not. Run on prem or in the cloud. Guild doesn’t care. Guild wants to run something, capture the results of that operation, and then let you use those results to make smart decisions about what to do next. That’s the scoping.

This is a tough question you ask - but a very important one. As ML teams work to streamline their workflows they need to parse all of this and your r/MachineLearning thread will help a lot I think. Underlying all of this is the basic problem: operations are complex. That’s just the way it is. There’s no simple, single answer for anyone. It takes work—the sort that you’re doing—to make sense of the options and consider them for your application(s) and use cases. Guild takes a strong position on platform and language independence (it’s good and important to maintain) and of separating concerns (narrowly focused tools that can be assembled to solve complex problems are better than monolithic frameworks that purport to be a one-stop shop for all your needs). That’s philosophy. Still it’s important to note.

Please feel free to ask more specific questions—I’m sure I’ve left out many important details here but I wanted to share my initial thoughts.

Wow, thank you so much, Garrett. You gave me a lot of useful information. I need some time to study all the tools and see which is best suited to our team. I really like Guild’s approach!
