Seattle, WA
December 10–13, 2018
Click Here for More Information & Registration
View Venue Map
Tuesday, December 11 • 11:40am - 12:15pm
Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud - Kai Zhang & Yang Che, Alibaba

Sign up or log in to save this to your schedule and see who's attending!

Feedback form is now closed.
Running deep learning (DL) jobs requires end to end workflow to accelerate model training iteratively. It must be scalable on massive data and computational resources, and be framework agnostic to relieve the pain of managing diverse dependencies.
In Alibaba cloud, we use Kubernetes to build elastic DL platform for continuous model training and optimization. It manages heterogeneous cluster including CPU/GPU/FPGA. Jobs are automatically scheduled to the best-fit resources. Kubeflow, which is a great machine learning scaffold on Kubernetes, is used to setup training pipeline. Project Arena is created to manage and instrument jobs with friendly user experience.
In this talk, we will discuss how the platform is designed, and how it facilitates users to focus on DL tasks instead of managing underlying complexity. A demo shows how to run distributed neural network training in a minute.

avatar for Yang Che

Yang Che

Senior Engineer, Alibaba
Yang Che, is a senior engineer of Alibaba Cloud. He works in Alibaba cloud container service team, and focuses on Kubernetes and container related product development. Yang also works on building elastic machine learning platform on those technologies. He is an active contributor... Read More →
avatar for Kai Zhang

Kai Zhang

Staff Engineer, Alibaba
Kai Zhang, is now a staff engineer of Alibaba Cloud. He's worked on container service product and enterprise solution development for 3 years. Before that, he worked in deep learning platform, cloud computing, distributed system and SOA area over 10 years. Recently, he is exploring... Read More →

Tuesday December 11, 2018 11:40am - 12:15pm