Seattle, WA
December 10–13, 2018


Machine Learning & Data
Tuesday, December 11


Real-time Vision Processing on Kubernetes: Working with Data Locality - Yisui Hu, Google
Kubernetes, as a general-purpose cloud native application platform, simplifies the operation of applications. Data processing is one of the common types of applications that benefit from the power of Kubernetes. However, attempting real-time computer vision processing on Kubernetes isn't that straightforward, as the maximum processing time for a single frame is expected to be less than 30ms to reach 30fps, or no more than 60ms (15fps, where it starts to feel laggy). This presentation demonstrates the actual issues and feature gaps encountered, and the workarounds applied, while running a real-time vision pipeline on Kubernetes for the presenter's hobby robotics project. The pipeline involves a few image pre-processing tasks and an ML model (SSD MobileNet) for object detection and depth estimation with a single camera. The biggest challenge is how to distribute the tasks across nodes with data locality in consideration.
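The per-frame time limits quoted in the abstract follow from simple arithmetic on the frame rate. The sketch below is illustrative only and is not taken from the talk: the function names and the example stage timings are hypothetical. It shows that the exact budgets are about 33.3ms at 30fps and 66.7ms at 15fps, so the abstract's 30ms and 60ms figures leave some headroom.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds, at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def sequential_fps(stage_ms: list[float]) -> float:
    """Frame rate of a pipeline that runs its stages one after another for each frame."""
    total = sum(stage_ms)
    if total <= 0:
        raise ValueError("total stage time must be positive")
    return 1000.0 / total


if __name__ == "__main__":
    print(round(frame_budget_ms(30), 1))  # budget per frame at 30fps
    print(round(frame_budget_ms(15), 1))  # budget per frame at 15fps
    # Hypothetical stage timings (ms): pre-processing, detection, depth estimation.
    print(round(sequential_fps([5.0, 18.0, 12.0]), 1))
```

Any time spent moving a frame between nodes counts against the same budget, which is one reason data locality matters for this kind of pipeline.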


Yisui Hu

Software Engineer, Google
Yisui Hu is a Software Engineer currently working on Kubernetes Engine at Google. He has over 14 years of industry experience focused on software engineering across multiple areas, including operating systems, networking, embedded systems, and cloud infrastructure.

Tuesday December 11, 2018 10:50am - 11:25am


Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud - Kai Zhang & Yang Che, Alibaba
Running deep learning (DL) jobs requires an end-to-end workflow to accelerate iterative model training. It must scale across massive data and computational resources, and be framework-agnostic to relieve the pain of managing diverse dependencies.
At Alibaba Cloud, we use Kubernetes to build an elastic DL platform for continuous model training and optimization. It manages heterogeneous clusters including CPUs, GPUs, and FPGAs. Jobs are automatically scheduled to the best-fit resources. Kubeflow, which is a great machine learning scaffold on Kubernetes, is used to set up training pipelines. Project Arena was created to manage and instrument jobs with a friendly user experience.
In this talk, we will discuss how the platform is designed, and how it lets users focus on DL tasks instead of managing the underlying complexity. A demo shows how to run distributed neural network training in a minute.


Yang Che

Senior Engineer, Alibaba
Yang Che is a senior engineer at Alibaba Cloud. He works in the Alibaba Cloud container service team, and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...

Kai Zhang

Staff Engineer, Alibaba
Kai Zhang is a staff engineer at Alibaba Cloud. He has worked on container service products and enterprise solution development for 3 years. Before that, he worked on deep learning platforms, cloud computing, distributed systems, and SOA for over 10 years. Recently, he is exploring...

Tuesday December 11, 2018 11:40am - 12:15pm


Scaling AI Inference Workloads with GPUs and Kubernetes - Renaud Gaubert & Ryan Olson, NVIDIA
Deep Learning (DL) is a computationally intensive form of machine learning that has revolutionized many fields, including computer vision, automated speech recognition, natural language processing, and artificial intelligence (AI).

DL impacts every vertical market from automotive to healthcare to cloud. As a result, the training and deployment of Deep Neural Networks (DNNs) have shifted datacenter workloads from traditional CPUs to AI-specific accelerators like NVIDIA GPUs.

Leveraging several popular CNCF projects such as Prometheus, Envoy, and gRPC, we will demonstrate an implementation of NVIDIA’s reference scale-out inference architecture, capable of delivering petaops per second of performance.

This is a new and challenging problem in the datacenter, and we will discuss these challenges and ways to optimize for service delivery metrics (latency/throughput), cost, and redundancy.


Renaud Gaubert

Software Engineer, NVIDIA
Renaud Gaubert has been working since 2017 at NVIDIA on making GPU applications easier to deploy and manage in data centers. He focuses on supporting GPU-accelerated machine learning frameworks in container orchestration systems such as Kubernetes and Docker Swarm. He is an active...

Ryan Olson

Solutions Architect, NVIDIA
Ryan Olson is a Solutions Architect in the Worldwide Field Organization at NVIDIA. His primary responsibilities involve supporting deep learning and high performance computing applications. Ryan is particularly interested in scalable software design that leverages the unique capabilities...

Tuesday December 11, 2018 1:45pm - 2:20pm


Machine Learning Model Serving and Pipeline Using KNative - Animesh Singh & Tommy Li, IBM
Lifecycle support for machine learning models, including continuous development, training, testing, and deployment, along with continuous integration (CI) for AI applications, is still in its infancy. We need a solution that enables end-to-end automation of data preparation and model deployment pipelines.

In this talk we are going to show how to leverage KNative components to create an event-driven AI pipeline. We will leverage OpenWhisk and Kubernetes to provide an event-driven platform, and Istio for traffic management and observability, to construct a pipeline that provides interfaces to various open source tools: model training, validation, and serving platforms on Kubernetes.

We will show how we can leverage this AI pipeline to train using advanced batch scheduling in Kubernetes, automate A/B tests and canary testing of models, and monitor concept drift and accuracy loss.


Tommy Li

Software Developer, IBM
Tommy Li is a software developer at IBM focusing on Cloud, Kubernetes, and Machine Learning. He is one of the main contributors to Fabric for Deep Learning and has worked on various developer code patterns on Kubernetes, microservices, and deep learning applications to provide use cases...

Animesh Singh

Chief Architect and Program Director, IBM
Animesh Singh is a Program Director and Chief Architect for the IBM Watson and Cloud Open Source Platform, where he leads machine learning and deep learning initiatives on IBM Cloud and works with communities and customers to design and implement deep learning, machine learning, and...

Tuesday December 11, 2018 2:35pm - 3:10pm


Machine Learning as Code: and Kubernetes with Kubeflow - Jason "Jay" Smith, Google & David Aronchick, Microsoft
Machine learning has become an increasingly popular topic in the world of data. At the same time, the concept of microservices through containerization has increased in popularity, as it allows developers to create and package applications for easy export and distribution through various clouds.

Kubeflow is an open source project led by Google to merge both concepts, allowing users to leverage the power of Kubernetes to run the training and serving of their ML models.

This convergence of technologies results in a new way to think about machine learning: we can now think of machine learning as code bundles. My session will show how, with Kubeflow and GitOps tools, you can go beyond simply deploying and training TensorFlow models to bundle the entire infrastructure into a code package and treat the entire machine learning process as a pipeline.


David Aronchick

Head of OSS Machine Learning, Microsoft
David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, he led product management for Kubernetes, launched Google Kubernetes Engine and...

Jay Smith

Cloud Customer Engineer, Google
Jason ‘Jay’ Smith is a Cloud Customer Engineer at Google. He is passionate about containerization and machine learning, so Kubeflow was a natural fit, as it allows users to run machine learning models on Kubernetes. He has spent much time learning about Kubernetes and has become interesting...

Tuesday December 11, 2018 3:40pm - 4:15pm


Why Data Scientists Love Kubernetes - Sophie Watson & William Benton, Red Hat
This talk will introduce the workflows and concerns of data scientists and machine learning engineers and demonstrate how to make Kubernetes a powerhouse for intelligent applications.

We’ll show how community projects like Kubeflow and radanalytics.io support the entire intelligent application development lifecycle. We’ll cover several key benefits of Kubernetes for a data scientist’s workflow, from experiment design to publishing results. You’ll see how well scale-out data processing frameworks like Apache Spark work in Kubernetes.

System operators will learn how Kubernetes can support data science and machine learning workflows. Application developers will learn how Kubernetes can enable intelligent applications and cross-functional collaboration. Data scientists will leave with concrete suggestions for how to use Kubernetes and open-source tools to make their work more productive.


William Benton

Manager, Software Engineering and Sr. Principal Engineer, Red Hat, Inc
William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy...

Sophie Watson

Software Engineer, Red Hat, Inc
Sophie is a software engineer at Red Hat, where she works in an emerging technology group. She has a background in Mathematics and has recently completed a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately. Since...

Tuesday December 11, 2018 4:30pm - 5:05pm
Wednesday, December 12


Natural Language Code Search for GitHub Using Kubeflow - Jeremy Lewi, Google & Hamel Husain, GitHub
Turning ML into magical products often requires complex, distributed systems that bring with them a unique, ML-specific set of infrastructure problems. A year ago, we started building Kubeflow to leverage Kubernetes to solve these problems. In this talk, we will use the example of a search engine for code using natural language (http://bit.ly/gh-kf-search) to illustrate how Kubeflow and Kubernetes can be used to deploy and manage ML products.


Hamel Husain

Senior Data Scientist, GitHub
Hamel Husain is a Senior Data Scientist at GitHub who is focused on creating the next generation of developer tools powered by machine learning. His work involves extensive use of natural language and deep learning techniques to extract features from code and text. Prior to GitHub...

Jeremy Lewi

Senior Software Developer, Google
Jeremy Lewi is a co-founder and lead engineer at Google for the Kubeflow project, an effort to help developers and enterprises deploy and use ML cloud-natively everywhere. He's been building on Kubernetes since its inception, starting with Dataflow and then moving on to Cloud ML Engine...

Wednesday December 12, 2018 10:50am - 11:25am
Ballroom 6A


Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning - Huamin Chen, Red Hat & Yuan Zhou, Intel
Large training datasets used by machine learning frameworks, such as Kubeflow, are usually stored in low-cost, high-capacity object stores such as S3 or Google Cloud Storage. However, S3’s rate limiting and slow data downloads significantly challenge training performance and limit compute scalability.

We introduce Nezha and explain how it improves Kubeflow’s training. Nezha is an open source, community-driven, and highly collaborative project, contributed by storage and big data engineers. Nezha is based on Kubernetes Initializers: it rewrites the Pod spec, adds a sidecar S3 cache, and redirects the Pod to use the local cache to accelerate data access.

Nezha is self-contained and easy to use. It does not require modification to existing applications or user-visible Pod changes.

Nezha improves big data application performance. Our initial Kubeflow benchmark results using the MNIST dataset show that Nezha achieves a ~2x speedup.
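
The Pod rewrite described in this abstract (add a sidecar cache, then point the workload at it) can be pictured with a small sketch. This is not Nezha's code and does not use the Kubernetes API: it operates on a plain dictionary shaped like a Pod manifest, and the sidecar name, image, port, and the S3_ENDPOINT environment variable are all assumptions chosen for illustration.

```python
import copy

# Hypothetical sidecar definition; the name, image, and port are illustrative only.
CACHE_SIDECAR = {
    "name": "s3-cache",
    "image": "example.com/s3-cache:latest",
    "ports": [{"containerPort": 8080}],
}
# Assumed convention: the workload reads its S3 endpoint from this variable.
ENDPOINT_VAR = "S3_ENDPOINT"
LOCAL_CACHE_ENDPOINT = "http://localhost:8080"


def inject_cache_sidecar(pod: dict) -> dict:
    """Return a copy of a Pod manifest with a cache sidecar added and
    every original container redirected to the local cache endpoint."""
    mutated = copy.deepcopy(pod)  # never modify the caller's manifest
    containers = mutated.setdefault("spec", {}).setdefault("containers", [])

    # Idempotence: if the sidecar is already present, leave the Pod alone.
    if any(c.get("name") == CACHE_SIDECAR["name"] for c in containers):
        return mutated

    for container in containers:
        # Drop any existing endpoint setting, then point at the local cache.
        env = [e for e in container.get("env", []) if e.get("name") != ENDPOINT_VAR]
        env.append({"name": ENDPOINT_VAR, "value": LOCAL_CACHE_ENDPOINT})
        container["env"] = env

    containers.append(copy.deepcopy(CACHE_SIDECAR))
    return mutated
```

Because containers in one Pod share a network namespace, a sidecar listening on localhost is reachable from the training container, which is what lets a rewrite of this kind stay invisible to the application.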


Huamin Chen

Principal Software Engineer, Red Hat
Dr. Huamin Chen is a passionate developer at Red Hat's CTO office. He is one of the founding members of Kubernetes SIG Storage and a member of Ceph, Knative, and Rook. He previously spoke at KubeCon, OpenStack Summits, and other technical conferences.

Yuan Zhou

Senior Software Development Engineer, Intel
Yuan Zhou is a Senior Software Development Engineer in the Software and Service Group at Intel Corporation, working in the Open Source Technology Center, primarily focused on big data analytics storage. He has been working on Databases, Virtualization and Cloud computing for most of...

Wednesday December 12, 2018 11:40am - 12:15pm
Ballroom 6A


Predictive Application Scaling with Prometheus and ML - Chris Dutra, Schireson
Schireson, a New York City-based data science and engineering firm, overcame a challenge in how to properly scale Kubernetes services for their application stack.

Consuming data from several tools in the CNCF portfolio (such as Envoy and Prometheus), Schireson developed home-grown machine learning to actively predict the resource requirements for its services at any given time. The models act in a semi-supervised state to ensure the overall stability of their data science platform.

This talk will illustrate the steps taken to construct the models, and offer suggestions to the larger DevOps community on how to implement Predictive Application Scaling in their organizations.


Chris Dutra

Director, Site Reliability Engineering, Schireson
Chris is the Director of Site Reliability Engineering at Schireson, a data science consulting and engineering firm. Prior to this role, Chris has held several positions in artificial intelligence and cloud organizations. Outside the office, Chris loves traveling the world with his...

Wednesday December 12, 2018 1:45pm - 2:20pm
Ballroom 6A


Enterprise Machine Learning on K8s: Lessons Learned and the Road Ahead - Timothy Chen & Tristan Zajonc, Cloudera
Kubernetes as a platform is being asked to support an ever-increasing range of workloads, including machine learning and big data processing. These new workloads introduce challenges for both end users and cluster administrators. Data scientists want the flexibility to run any workload and library they require, data engineers want to ensure the scalability and reliability of production workloads, and cluster administrators want to maintain governance and control over cluster resources. At Cloudera, we’ve built a machine learning platform on Kubernetes that seeks to balance these competing objectives. In this talk, we will share some of the key design choices we made, lessons learned supporting large enterprise customers, and our vision of the road ahead for machine learning and AI on Kubernetes.


Timothy Chen

Software Engineer, Cloudera
Timothy Chen is a Software Engineer at Cloudera, working on their enterprise machine learning platform. He was also CEO and cofounder of Hyperpilot and an Apache PMC member and committer on Apache Drill and Apache Mesos.

Tristan Zajonc

CTO of Machine Learning, Cloudera
Tristan Zajonc is CTO of Machine Learning at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development...

Wednesday December 12, 2018 2:35pm - 3:10pm
Ballroom 6A


Demystifying Data-Intensive Systems On Kubernetes - Alena Hall, Microsoft
Distributed databases, stateful stream processing workloads, caches, and machine learning frameworks often require persistence for storing data, operation progress, and more. Managing state while running systems like Cassandra, Kafka, Spark, Redis, or TensorFlow on Kubernetes is different from doing so with VMs or physical servers.

Let’s examine why we might want to run these systems on Kubernetes, and look at foundational Kubernetes concepts (e.g. StatefulSets) that help us get those systems up and running. But up and running isn’t always equal to operating correctly. We will go over best practices for managing data-intensive systems on Kubernetes, existing challenges, as well as solutions (e.g. CRDs, custom controllers, operators) and a possible future.

You will learn about operational considerations to take into account even if you haven't worked with data systems on Kubernetes before.


Lena Hall

Senior Cloud Developer Advocate, Microsoft
Lena Hall is a senior software engineer and a developer advocate at Microsoft working on Azure, where she focuses on large-scale systems for distributed data processing and storage. Previously, she was a senior software engineer at Microsoft Research. Lena has more than 10 years of...

Wednesday December 12, 2018 3:40pm - 4:15pm
Ballroom 6A


Kafka on Kubernetes - From Evaluation to Production at Intuit - Shrinand Javadekar, Intuit
Kubernetes is fast becoming the platform of choice for running distributed, containerized applications in the cloud. However, running stateful applications such as Kafka on Kubernetes is not a common practice today. At Intuit, we took an experimental, data-driven approach to evaluating Kafka on Kubernetes in AWS. In this talk, we will provide details of our requirements, the configuration, and the evaluation. The evaluation process included functional tests, network isolation tests, cross-region tests, as well as performance and stress tests. We will focus on the problems we ran into and how we addressed them.

This talk will demonstrate a K8s cluster running Kafka and its configuration using Kubernetes native components (StatefulSets, ConfigMaps, etc.). We will also show Splunk and Wavefront integrations for logging and monitoring respectively.


Shrinand Javadekar

Principal Engineer, Intuit
Shrinand Javadekar is a software engineer in the Modern SaaS team with a focus on Kubernetes. They’re on a mission to make Kubernetes the de facto standard for developing, deploying and running apps at Intuit. In the past he has been part of large-scale file system and virtualization...

Wednesday December 12, 2018 4:30pm - 5:05pm
Ballroom 6A