Open source SkyPilot aims to optimize cloud costs for ML and Data Science

A team of researchers at RISELab at UC Berkeley recently released SkyPilot, an open source framework for running machine learning workloads on major cloud providers through a unified interface. The project focuses on cost optimization, automatically finding the cheapest zones, regions, and service providers for the requested resources.

Based on the job requirements, the framework automatically determines which AWS, Azure, and Google Cloud locations have the resources (CPU/GPU/TPU) required to run the job and which of them is the cheapest. SkyPilot then performs three main tasks: it automatically fails over to another location when a cluster cannot be provisioned because of capacity or quota errors, synchronizes user code and files to the cluster, and manages job queueing and execution.
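The selection and failover behavior described above can be illustrated with a minimal Python sketch. This is not SkyPilot's actual optimizer or API: the `Offer` class, the `CATALOG` entries, the prices, and the `try_launch` callback are all hypothetical and exist only to show the idea of ranking candidate locations by price and moving to the next one when provisioning fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One hypothetical catalog entry: where an accelerator can be rented."""
    cloud: str
    region: str
    accelerator: str
    hourly_usd: float  # illustrative number, not a real price


# Hypothetical catalog; the prices are made up for illustration only.
CATALOG = [
    Offer("aws", "us-east-1", "V100", 3.06),
    Offer("gcp", "us-central1", "V100", 2.48),
    Offer("azure", "eastus", "V100", 3.10),
    Offer("gcp", "us-central1", "A100", 3.67),
]


def rank_offers(catalog, accelerator):
    """Return the offers for the requested accelerator, cheapest first."""
    matching = [o for o in catalog if o.accelerator == accelerator]
    return sorted(matching, key=lambda o: o.hourly_usd)


def provision(catalog, accelerator, try_launch):
    """Try offers in price order and fall through to the next on failure.

    `try_launch` is a caller-supplied callable that returns True on success
    and False on a capacity or quota failure (an assumption of this sketch).
    Returns the offer that succeeded, or None if every offer failed.
    """
    for offer in rank_offers(catalog, accelerator):
        if try_launch(offer):
            return offer
    return None


if __name__ == "__main__":
    # Pretend the cheapest cloud is out of capacity to exercise the failover.
    chosen = provision(CATALOG, "V100", lambda offer: offer.cloud != "gcp")
    print(chosen.cloud, chosen.region)
```

In this toy setup the cheapest V100 offer is rejected, so the loop moves on to the next cheapest one. A real system would also have to account for quotas, data egress costs, and zone-level availability, which the sketch leaves out.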

Zongheng Yang, postdoctoral fellow at UC Berkeley, and Ion Stoica, UC Berkeley professor and co-founder of Anyscale, explain:

Cloud computing for ML and Data Science is already challenging, but when you start using cost-cutting techniques, the overhead adds up. Want to stop leaving machines idle? You’ll have to spin them up and down over and over and redo the environment and data setup. Want to use spot-instance pricing? It could take a week of work to handle preemptions. What about exploiting price differences between regions or between clouds?

SkyPilot is not the first open source project from RISELab aimed at optimizing cloud costs. As previously reported on InfoQ, the research center also released Skyplane, which optimizes the transfer of large datasets between cloud service providers, reducing transfer time and cost.


Training machine learning models in the cloud can be costly and inefficient, and some companies have recently moved data and models back to their own data centers to reduce costs and improve performance. Yang and Stoica write:

SkyPilot has been in active development at UC Berkeley’s Sky Computing Lab for over a year. More than 10 organizations are using it for a variety of use cases, including GPU/TPU workloads (3x cost savings), simulation training on 100+ CPU spot instances (3x savings), distributed hyperparameter tuning, and bioinformatics batch jobs (6.5x savings).

Among the other benefits of SkyPilot, the authors cite the ability to build multi-cloud applications, use the best available hardware, and improve access to scarce resources such as high-end NVIDIA V100 or A100 GPUs.


The framework includes Managed Spot, an option to use cheaper spot instances with automatic recovery from preemptions, and Autostop, a feature that automatically cleans up idle clusters. The team also released a collection of Jupyter notebooks to help developers understand how the project works.

SkyPilot currently supports AWS, Google Cloud, and Azure, and provides a CLI and a Python API. According to a Reddit thread, the project plans to add support for other, smaller cloud providers in the future.

SkyPilot is available on GitHub under the Apache-2.0 license.
