Seattle, WA
December 10–13, 2018

Observability
Tuesday, December 11


On the OSS Path to Full Observability with Grafana - David Kaltschmidt, Grafana Labs
Grafana is coming “off the wall”. To make it more useful for interactive debugging, David and his team have already integrated two pillars of observability - metrics and logs. They are currently adding tracing to complete the incident response experience. All to minimise the cost of context switching during those crucial minutes after getting paged.

This talk will demonstrate the various methods we’ve used to link the data together. Prometheus is providing the metrics. Via its histograms, request latencies can be extracted to inform each tracing span from Jaeger. Grafana also ensures that lines from your log aggregation system are annotated with span and trace IDs, as well as the other way around: associating logged values with spans.

David will show how these OSS parts should be deployed to achieve full observability in an engaging user experience that saves valuable minutes.

David Kaltschmidt

Director of UX, Grafana Labs
David is the Director of UX at Grafana Labs and focuses mainly on workflows around monitoring with Prometheus. Previously he worked at companies such as Kausal, Weaveworks, Oracle, and Siemens Research. David has been building UIs for the networking industry for over a decade. He’s...

Tuesday December 11, 2018 10:50am - 11:25am
4C 3/4


Linkerd 2.0, Now with Extra Prometheus - Andrew Seigner, Buoyant & Frederic Branczyk, CoreOS
In order to provide zero-config observability for service owners running their application on Kubernetes, Linkerd 2.0 needs to scale to any size Kubernetes cluster. In this talk, Andrew Seigner from the Linkerd team and Frederic Branczyk from the Prometheus team will demonstrate Linkerd 2.0 features made possible by Prometheus, including a user-friendly dashboard, a simple yet powerful CLI, and Grafana integration. They’ll outline how Linkerd evolved its Prometheus integration to leverage Prometheus’s ability to handle massive sets of metrics, and demonstrate the scale this integration enables.

Frederic Branczyk

Principal Software Engineer, Red Hat
Frederic is an engineer at Red Hat (previously CoreOS) contributing to Prometheus and Kubernetes to build state of the art modern infrastructure and monitoring tools. He discovered his interest in monitoring tools and distributed systems in his previous jobs, where he used machine...

Andrew Seigner

Software Engineer, Buoyant
Andrew Seigner is a Software Engineer at Buoyant, the makers of Linkerd. Andrew's recent focus has been rebuilding Linkerd's CI pipeline. Previously he worked on Linkerd's observability and telemetry system. Andrew previously gave talks on observability with Thanos and Linkerd at...

Tuesday December 11, 2018 11:40am - 12:15pm
4C 3/4


Water, Water, Everywhere: Making Kubernetes Audit Logs Potable - Kate Kuchin, Heptio
Audit logging in Kubernetes is a powerful tool that grants Kubernetes operators more insight into their clusters. Audit logs can tell us what happened in our clusters, when it happened, who did it, what resources were affected, and more. The problem is, even a fairly stagnant Kubernetes cluster generates millions of audit logs per week. And it's up to us to distill value out of what is largely just noise.

So, how can we isolate important audit events to better understand what's going on in our clusters?

In this session, we'll first go over what Kubernetes audit logs are and what information they provide. We'll then do a live demo of setting up auditing on a cluster and inspect the raw logs that are generated. Finally, we'll talk about strategies for pulling useful information out of the deluge, so we can make sense of these millions of audit logs and glean actionable insights.
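
One way to picture the "pulling useful information out of the deluge" step is a small filter over the JSON-lines output of the audit log backend. The sketch below is illustrative only and is not taken from the talk: the field names (`verb`, `user.username`, `objectRef`, `requestReceivedTimestamp`) come from the Kubernetes audit event schema, while the choice of "interesting" verbs and resources, the function names, and the sample events are assumptions.

```python
import json

# Assumption: what counts as "interesting" is a local policy decision.
# Here we keep only write requests against a few sensitive resources.
WRITE_VERBS = {"create", "update", "patch", "delete"}
SENSITIVE_RESOURCES = {"secrets", "rolebindings", "clusterrolebindings"}


def is_interesting(event):
    """Return True for write requests that touch a sensitive resource."""
    ref = event.get("objectRef") or {}
    return (
        event.get("verb") in WRITE_VERBS
        and ref.get("resource") in SENSITIVE_RESOURCES
    )


def filter_audit_lines(lines):
    """Parse JSON-lines audit events and summarize the interesting ones."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if not is_interesting(event):
            continue
        ref = event.get("objectRef") or {}
        kept.append({
            "time": event.get("requestReceivedTimestamp"),
            "user": (event.get("user") or {}).get("username"),
            "verb": event.get("verb"),
            "resource": ref.get("resource"),
            "namespace": ref.get("namespace"),
            "name": ref.get("name"),
        })
    return kept


# Made-up sample events, for illustration only.
SAMPLE_LINES = [
    json.dumps({
        "verb": "get",
        "user": {"username": "system:kube-scheduler"},
        "objectRef": {"resource": "pods", "namespace": "default", "name": "web-0"},
        "requestReceivedTimestamp": "2018-12-11T13:45:00.000000Z",
    }),
    json.dumps({
        "verb": "delete",
        "user": {"username": "alice"},
        "objectRef": {"resource": "secrets", "namespace": "prod", "name": "db-password"},
        "requestReceivedTimestamp": "2018-12-11T13:46:10.000000Z",
    }),
]

if __name__ == "__main__":
    for summary in filter_audit_lines(SAMPLE_LINES):
        print(summary)
```

In practice much of this noise reduction can also be done earlier, in the audit policy itself, so that uninteresting events are never written.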

Kate Kuchin

Senior Systems Software Engineer, Heptio
Kate is a Senior Software Engineer at Heptio, where she works on the UX team to build tools to make Kubernetes easier to adopt and use. She started her career at Google, where she worked on the Consumer Operations team of (the now sadly defunct) orkut. She has since worked at startups...

Tuesday December 11, 2018 1:45pm - 2:20pm
4C 3/4


Observability is the Key Tenet of Running a Multi-Tenant K8s Environment - Thom McCann & Amreth Chandrasehar, T-Mobile
How do you achieve 200 million requests per day and no downtime for 2 years? T-Mobile has been driving containerized workloads for many internal application teams. Running an internal multi-tenant environment can be challenging but has significant benefits.

In this session we'll dive deep into observability components such as a large scale telemetry system built on Prometheus serving 4000+ requests per second and millions of metrics over 6 clusters. This open source system is built across 3 AZs in each region (US West and US East) and is federated across multiple Prometheus clusters enabling distributed queries and limitless scale.

We'll dive deep into how our operational teams can view integrated monitoring for infrastructure, hosts, VMs, containers, and applications, and integrate alerts with Slack, PagerDuty, and other real-time systems.

Amreth Chandrasehar

Principal Architect, Cloud, T-Mobile
Amreth Chandrasehar is a Principal Architect at T-Mobile responsible for on-boarding company-wide workloads on to the cloud. He is a co-creator of T-Mobile's internal system known as Conducktor which accelerates application team development for container based applications. Amreth...

Thom McCann

Sr. Manager, Sr Software Engineer, T-Mobile
For decades, Thom McCann has been an innovator in the technology industry across several different software projects. As the creator of Outlook Web Access and key member of the teams that delivered the first versions of Windows NT and Exchange Server, he has built products reaching...

Tuesday December 11, 2018 3:40pm - 4:15pm
4C 3/4


Three Pillars, Zero Answers: We Need to Rethink Observability - Ben Sigelman, LightStep
Observability has never been more important: the complexity of microservices makes it harder and harder to answer basic questions about system behavior. The conventional wisdom claims that Metrics, Logging and Tracing are “the three pillars” of observability… yet software organizations check these three boxes and are still grasping at straws during emergencies.

In this session we’ll illustrate the problem with the three pillars: metrics, logs, and traces are just data – they are the fuel, not the car. We then reframe observability around two fundamental activities: (1) measurement and (2) explanation.

“Measurement” must be both robust and precise; “explanation” must present the larger context of distributed transactions and the role that contention plays under peak load. We present both via demos and examples to ground the session in real-world storytelling.

Ben Sigelman

Co-founder and CEO, LightStep
Ben Sigelman is a co-founder and the CEO at LightStep, a co-creator of Dapper (Google’s distributed tracing system), and co-creator of the OpenTracing and OpenTelemetry projects (both part of the CNCF). Ben's work and interests gravitate towards observability, especially where microservices...

Tuesday December 11, 2018 4:30pm - 5:05pm
4C 3/4
Wednesday, December 12


Cortex - Infinitely Scalable Prometheus - Bryan Boreham, Weaveworks
Prometheus is an excellent monitoring solution; however, it is designed to store all metrics on one host in one set of files, which limits scalability and availability.

This talk provides an overview of Cortex, an Open Source project proposed for donation to the CNCF, which takes metrics from multiple Prometheus instances and stores them in a NoSQL database.

- Microservices architecture of Cortex
- NoSQL back-end (DynamoDB, Bigtable or Cassandra)
- Fault-tolerance via DHT-based ingestion tier and stateless query tier
- Deployment via Kubernetes
- Architectural comparison to projects in the same area, such as Thanos and M3
- Lessons learned from operating “Prometheus as a service” for two years
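
The "DHT-based ingestion tier" bullet above refers to the general idea of spreading incoming series across a set of ingesters with a distributed hash table. The sketch below shows that idea as a generic consistent-hash ring with replication. It is not Cortex's actual code: the class, the token count, the replication factor, and the ingester names are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """A generic consistent-hash ring (illustrative, not Cortex's implementation)."""

    def __init__(self, nodes, tokens_per_node=64):
        # Each node owns several pseudo-random tokens on the ring so that
        # load stays reasonably even when nodes join or leave.
        self._ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owners(self, series_key, replicas=3):
        """Walk clockwise from the key's hash and collect distinct nodes."""
        start = bisect.bisect_right(self._tokens, self._hash(series_key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found


if __name__ == "__main__":
    # Hypothetical ingester names and series key.
    ring = HashRing(["ingester-1", "ingester-2", "ingester-3", "ingester-4"])
    print(ring.owners('http_requests_total{job="api",instance="a"}'))
```

Because each series is written to several distinct nodes, losing one ingester does not lose the series, and removing a node only moves the keys that node owned.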

Bryan Boreham

Distinguished Engineer, Weaveworks
Bryan is Director of Engineering at Weaveworks, delivering deployment, observability and monitoring for containers and microservices. After first getting into programming as a kid, creating a video game called "Splat", Bryan's career has ranged from charting pie sales at a bakery...

Wednesday December 12, 2018 10:50am - 11:25am


Audit in Kubernetes, the Future is Here - Stefan Schimanski & Maciej Szulik, Red Hat
Advanced Auditing is now available, and you should learn how to use it. Last year, we proposed a list of improvements to Kubernetes auditing, and this year we are happy to report that we have delivered them! Join us for this session to learn what Advanced Auditing does, how to adapt it to your own requirements, and how to integrate it into your infrastructure.

Topics to be covered include:
1. From basic audit to advanced audit
2. Defining a policy (examples)
* Deep object logging
* Excluding secrets
* Logging different objects at different levels
3. Integrating with your infrastructure
4. Performance impact vs. consistency
5. Writing an audit webhook server
6. Which future improvements are planned

We are hoping to demystify the complexity of the policy configuration by showing a handful of example rules that will allow attendees to better understand the mechanism.

Stefan Schimanski

Senior Software Engineer, Red Hat
Stefan is a Principal Software Developer at Red Hat working on Kubernetes and OpenShift, with a focus on API machinery, extension points and developer tools as part of SIG API Machinery. He is one of the top-3 reviewers in the project (number of reviews). Stefan is a 2nd time GoogleSummer...

Maciej Szulik

Software Engineer, Red Hat
Maciej is a passionate developer with over 10 years of experience in many languages. Currently he's working on OpenShift and Kubernetes for Red Hat. At night he hacks on side projects and CPython's IMAP library. In his spare time he organizes

Wednesday December 12, 2018 11:40am - 12:15pm


Do it Live: Measuring your Applications in Production - Jason Keene, Pivotal
Ever had a bug that only shows its head in production? Ever wanted to ask arbitrary questions about your software while it is running with minimal impact on performance? In this talk, we go beyond pre-defined instrumentation and look at various techniques of gathering information about your application to help you solve your production woes. Topics include:

- How to ask good questions, and get good answers.
- Securely running a container in your cluster for probing your application.
- How to use eBPF/BCC tools to gather data without impacting performance.
- Techniques for using ptrace/debuggers when all else fails.
- Knowing when to *not* use these tools and how to better instrument your code in advance of a production issue.

With these tools and techniques you will be better equipped to understand your software and solve the problems you'll face in production.

Jason Keene

Senior Software Engineer, Pivotal
Jason Keene is an engineer at Pivotal working on observability for Kubernetes and Cloud Foundry. He is currently the tech lead for a team building tools for making egressing observability data from clusters simpler and self-service. He co-organizes the Boulder Golang meetup and frequently...

Wednesday December 12, 2018 1:45pm - 2:20pm


Monitor the World: Meaningful Metrics for Containerized Apps and Clusters - Nicholas Turner, Amazon & Nic Cope, Planet Labs
In this talk, Nick and Nic dive into key metrics and tooling that you should be using to monitor your Kubernetes control plane components and your containerized applications.

Nick will dissect the metrics that have proven critical for Kubernetes cluster operations, drawing on experience gained while helping to build and operate Amazon's EKS platform. Topics covered will include monitoring etcd and the Kubernetes control plane components.

Nic will discuss Planet Labs' Kubernetes deployment and the application metrics that have proven to be most critical for operating large scale clusters, including how they utilize Linkerd and Prometheus for visibility.

Two themes that will persist throughout the talk are how to use the minimum number of metrics necessary to still have a complete picture while avoiding noise, and how to do it with open source technology.

Nic Cope

Staff Software Engineer, Planet
Nic is the technical lead of the Kubernetes infrastructure that powers Planet Labs' analytics platform (i.e. all the boring terrestrial bits). Before that he was an SRE at Spotify and Google. Nic likes to apply Go and sensible processes to infrastructure, reliability, and scalability...

Nicholas Turner

Software Development Engineer, Amazon
Nick works at Amazon Web Services as a software development engineer for EKS where he works on building and operating a Kubernetes platform for customers who run their infrastructure on AWS. He is also a maintainer of the aws-iam-authenticator, the de facto open source authentication...

Wednesday December 12, 2018 2:35pm - 3:10pm


Adopting Prometheus the Hard Way - Tim Simmons, DigitalOcean
Deploying Prometheus is simple. Successfully adopting Prometheus as part of your company’s Observability story is not.

In this talk you will learn how to scale the adoption of Prometheus, how to train your organization to know and love Prometheus, and the importance of fostering an Observability culture. If you are interested in building a great Observability culture at your company, this talk is for you!

An extended outline of this talk is:
- How DigitalOcean uses Prometheus successfully
- The subtle art of scaling Prometheus and tradeoffs to consider
- How to teach people to use Prometheus correctly
- How to integrate Prometheus with Kubernetes and other software you use
- How to utilize Prometheus to measure everything you do
- What to do when you find Prometheus’ limits and how to push past them
- Prometheus’ friends that you’ll also love
- The value of your new Observability culture

Tim Simmons

Senior Engineer, DigitalOcean
Tim Simmons is a Senior Engineer on the Observability Platforms team at DigitalOcean. He primarily cares for DigitalOcean's internal Prometheus infrastructure. On a normal day, he helps his colleagues with PromQL queries, writes custom Prometheus exporters, and builds tools around...

Wednesday December 12, 2018 3:40pm - 4:15pm


Securing Application Telemetry & Tracing with SPIFFE and Envoy - Sabree Blackmon, Docker
Application telemetry, such as Prometheus metrics, distributed logs, and tracing, offers a surplus of information on how an app works, how it's performing, and what it’s communicating with and how. However, even when these streams do not contain PII, this information can be invaluable to malicious actors in the days of highly distributed systems. In this talk, we will demonstrate how to use Envoy proxy and SPIRE to protect your telemetry endpoints, in both push and pull use cases, utilizing fluentd, Prometheus, and OpenTracing & Jaeger.

Sabree Blackmon

Senior Security Engineer, Docker
Sabree is a Security Engineer at Docker. While he cut his technical teeth in military & hardened embedded systems, Sabree now enjoys solving complex security problems in distributed systems and the cloud. He is passionate about open-source and helping engineers develop & deploy secure...

Wednesday December 12, 2018 4:30pm - 5:05pm
Thursday, December 13


Understanding Microservices with Distributed Tracing - Lita Cho, Lyft
With microservices, one request can go through hundreds of nodes. No single engineer can know all the possible paths of a request. How can engineers infer how the system behaves? Metrics? Logging? These tools have their place, but neither inherently reconstructs the journey of an entire request. What if we want to optimize the overall request latency, or figure out how many additional hops the system will make when we add a new API call? I am here to talk about how distributed tracing tells a story about your system. I will go over how you can see the entire picture of what your system looks like and, with this data, investigate and triage systemic issues and make impactful, data-driven performance optimizations. I will go over what tracing does well and what it does not. I will also go over how we leverage Envoy and OpenTracing for instrumentation.

Lita Cho

Senior Software Engineer, Lyft
Lita is a senior software engineer on the Networking team, building out the service mesh to handle both Kubernetes and legacy systems at Lyft. She also maintains the tracing infrastructure at Lyft. Before that, she worked on building out the API infrastructure using Protocol Buffers...

Thursday December 13, 2018 10:50am - 11:25am
Ballroom 6B


Visualise Cluster Connectivity with Goldpinger; Smash it with PowerfulSeal - Mikolaj Pawlikowski, Bloomberg
Let's talk about two open source tools to test and troubleshoot your Kubernetes setup: Goldpinger and PowerfulSeal.

Goldpinger is a Kubernetes networking troubleshooting tool. Deploy it on your clusters and get a pretty graph of connectivity between all your nodes, Prometheus metrics to generate alerts from, and more. We will be introducing the tool and sharing it with the community during the conference!

PowerfulSeal brings Chaos Engineering to your Kubernetes clusters. It will destroy things following your policies, so that you can discover and fix problems before real outages happen. We published PowerfulSeal at KubeCon + CloudNativeCon North America 2017 in Austin, and we'll be giving you an update on the project's momentum. Learn more about it at https://github.com/bloomberg/powerfulseal.

Come to our talk and gain two new Kubernetes superpowers!


Mikolaj Pawlikowski

Software Engineer, Bloomberg
Mikolaj Pawlikowski, a software engineer with Bloomberg, is building a microservices platform based on Kubernetes, and evangelising Cloud Native and Chaos Engineering. He previously built two startups, worked as a freelance consultant and collaborated on open source projects like...

Thursday December 13, 2018 11:40am - 12:15pm
Ballroom 6B


Distributed Tracing in Serverless Systems - Nitzan Shapira, Epsagon
Serverless and FaaS naturally fit microservices architectures. Observability of such systems is very complicated, since each microservice is separated and working asynchronously from the others. Distributed tracing is a key approach to understand such systems. In serverless, there are new challenges and opportunities which make distributed tracing a very interesting and useful technique for high observability.

I will go quickly over the history of tracing and the popular tools. I will focus on the key differences between observability using distributed tracing in generic microservices environments compared to serverless. Examples from the popular cloud vendors will be shown, including full visualization of asynchronous transactions in a highly distributed serverless system, and detecting business flows across multiple asynchronous communication resources (e.g. SNS, Kinesis, and more).

Nitzan Shapira

CEO, Epsagon
Co-founder and the CEO of Epsagon. I am a software engineer with over 12 years of experience in programming, machine learning, cyber-security, and reverse engineering. Epsagon is focused on bringing observability to serverless cloud applications, using distributed tracing and AI technologies...

Thursday December 13, 2018 1:45pm - 2:20pm
Ballroom 6B


Unified Observability in Kubernetes and Beyond with OpenCensus - Morgan McLean & Jaana Burcu Dogan, Google
Platforms like Kubernetes have allowed developers around the world to develop highly performant and easily configurable distributed systems. However, the techniques that developers have used in the past are no longer as effective in these environments.

OpenCensus is designed with this situation in mind: it provides libraries that automatically capture application metrics and distributed traces, attach the correct service and workload metadata, and send them to your backend of choice for analysis.

In this talk we will:
- Instrument a set of microservices, including custom metrics and tie-ins to OpenMetrics
- Analyze the telemetry captured from the app with common open-source tools like Prometheus and Jaeger
- Demonstrate advanced features like exporting data to multiple backends and changing configuration in realtime without requiring redeployment

See opencensus.io for more details.

Jaana Dogan

Engineer, Google
Jaana works on Google Compute Engine and is a familiar figure in the software development community via her previous work on Go and OpenCensus, and from her blog and Twitter presence (@rakyll).

Morgan McLean

Product Manager, Google
Morgan is a co-founder of OpenCensus and OpenTelemetry, and has spent much of his career as an engineer and product manager working on distributed systems and developer tools. Morgan is responsible for Google's distributed tracing, profiling, and debugging tools, including Stackdriver...

Thursday December 13, 2018 2:35pm - 3:10pm
Ballroom 6B


Sharded and Federated Prometheus Servers to Monitor Distributed Databases - Jun Li & Viswa Vutharkar, eBay
At eBay we have developed a geo-distributed transactional document store called NuData. It is deployed on Kubernetes. The current deployment has thousands of pods across three datacenters, and is monitored by Prometheus. For scalability, our Prometheus cluster has sharded servers to monitor individual infrastructure components and federation servers to retrieve aggregated metrics from sharded servers. For high availability, each sharded/federated server is configured with an active/standby pair over its load-balancer. A routing map (a time series) is automatically constructed by each shard server and assembled by the federation server, to direct Prometheus queries to the right servers. Today we have over 90 Prometheus servers in two datacenters, collecting over 11 M metrics every 60 seconds on 400 metrics (and 900 rules), to support health monitoring and performance debugging of NuData.
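
The routing map described above can be pictured as a lookup from metric name to the shard servers that hold it. The sketch below is a guess at the general shape only. The abstract says the map is itself stored as a time series, whereas this example uses plain dictionaries, and every server name, metric name, and function here is hypothetical.

```python
def assemble_routing_map(shard_reports):
    """Merge per-shard reports into one metric-name -> shards map.

    shard_reports maps a shard server name to the metric names it scrapes.
    This stands in for the assembly step done by the federation server.
    """
    routing = {}
    for shard, metric_names in shard_reports.items():
        for name in metric_names:
            routing.setdefault(name, set()).add(shard)
    return routing


def route(routing, metric_name):
    """Return the shard servers that can answer a query for metric_name."""
    return sorted(routing.get(metric_name, ()))


if __name__ == "__main__":
    # Hypothetical shard servers and metric names.
    reports = {
        "shard-storage-0": ["disk_io_seconds", "store_latency_seconds"],
        "shard-router-0": ["route_requests_total", "store_latency_seconds"],
    }
    routing = assemble_routing_map(reports)
    print(route(routing, "store_latency_seconds"))
    print(route(routing, "unknown_metric"))
```

A query for a metric held by several shards would need to be fanned out and merged, which this sketch leaves out.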

Jun Li

Principal Architect, eBay
Jun Li is currently a Principal Architect at eBay. Since he joined eBay in January 2017, he has been leading the effort on monitoring and self-management of NuData, a geo-distributed transactional document store developed in eBay. The monitoring framework is to provide comprehensive...

Viswa Vutharkar

Sr. MTS, eBay
Technologist, Mentor and Architect in product design & development, product research and analysis, sourcing from and contributing to open source eco systems. Currently at eBay working on a distributed document store on the cloud deployed on internal eBay Kubernetes infrastructure...

Thursday December 13, 2018 3:40pm - 4:15pm
Ballroom 6B


Collecting Operational Metrics for a Cluster with 5,000 Namespaces - Rob Szumski & Chance Zibolski, Red Hat
Kubernetes is a popular way to provide a multi-tenant, shared infrastructure layer for many eng teams within an org. This is great for the teams, as they have a stable, scalable cluster to build upon. However, the Ops team needs insight into resource usage, availability & more, especially once an app is running in multiple regions/clouds.

This talk will introduce the open source Operator Metering project, which collects metrics from Prometheus about Pods/Namespaces & combines them with cloud resource usage and other metrics to compute cost per Pod and other reports on a periodic interval. This capability unlocks billing for internal biz units, capacity planning, and operational knowledge of Operators running within a cluster.
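
As a rough illustration of what "cost per Pod" can mean, the sketch below splits a node's cost across its pods in proportion to their CPU requests. This is an assumed allocation rule chosen for simplicity, not the Operator Metering project's actual report logic, and the function name, prices, and pod names are made up.

```python
def cost_per_pod(node_hourly_cost, pod_cpu_requests, hours):
    """Apportion a node's cost across pods by their share of CPU requests.

    node_hourly_cost: price of the node per hour (any currency).
    pod_cpu_requests: mapping of pod name -> requested CPU cores.
    hours: length of the reporting period in hours.
    """
    total_cpu = sum(pod_cpu_requests.values())
    if total_cpu == 0:
        # Nothing requested: no basis for splitting the cost.
        return {pod: 0.0 for pod in pod_cpu_requests}
    period_cost = node_hourly_cost * hours
    return {
        pod: period_cost * cpu / total_cpu
        for pod, cpu in pod_cpu_requests.items()
    }


if __name__ == "__main__":
    # Hypothetical node price and pods.
    report = cost_per_pod(1.0, {"checkout": 1.0, "search": 3.0}, hours=2)
    print(report)
```

Summing these per-pod figures by namespace gives the kind of number an internal billing or capacity report would start from.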

At the end, we’ll show the project running against a real cluster with over 5,000 namespaces, proving the scale and insight the reports can provide to an Ops team.

Rob Szumski

Principal Product Manager for OpenShift, Red Hat
Rob Szumski is Principal Product Manager for OpenShift at Red Hat. Focused on product design from open source to commercial offerings, Rob has deep experience in UX, web interfaces, and CLI. Rob likes to experiment with the technology he works on to understand it from both the developer's...

Chance Zibolski

Software Engineer, Red Hat
Chance is a software engineer at Red Hat. Chance is the lead engineer for the Operator Metering project. Chance joined Red Hat through the acquisition of CoreOS. Prior to CoreOS, he was active in the open-source lab at Oregon State, where he graduated with a degree in Computer Sc...

Thursday December 13, 2018 4:30pm - 5:05pm
Ballroom 6B