Monitoring Done Wrong

We've all heard numerous "awesome monitoring @ X" talks. Boring! Join me in exploring monitoring design principles through various fails - because we can learn sooo much more by analyzing cases where monitoring was done wrong :-)

The Pets the Cattle and the Germs

10 years ago, we promoted the move from pet systems to faceless hordes of electronic cattle grazing on commodity infrastructure. But as the cloud evolves, we find that the cattle methodology is no longer sufficient and that cloud-native systems resemble some other biological entity…

Can I tell you a secret? I see dead systems

We live in a world of shiny new tech introduced all the time. Heck, we even made cars that drive themselves. Yet all around us, unseen and hidden, lurk ancient, forgotten systems. They're in our kernels, our terminals and our CPUs... They are everywhere.

To err is human: Introduction to modern safety thinking

In the last 40 years, the philosophy of safety and reliability has changed dramatically in the world of high-risk industries. This has prompted many organizations in various risk-prone fields to adopt new methods and processes, and sometimes even to undergo radical cultural and managerial change. However, the software industry has remained largely oblivious to these advancements, despite the similarities in failures and systems. After all, most systems today are software-managed, whether they run a nuclear reactor or a website builder. This talk introduces the major concepts of new-era safety thinking, such as Safety-II, work-as-done vs. work-as-imagined, and normal accident theory.

Data: You keep using that word...

Structured data, dynamic data, big data, data-driven... we hear about data all the time. But what is "data" exactly? The term is used frequently, yet it is rarely defined or even thought about - and it turns out the answer to "what is data" is not simple at all.

Linux System Metrics

While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. This hands-on workshop covers the basics of Linux metric collection for monitoring, performance tuning and capacity planning. (Co-Author: @nocoot)

Actionable Exceptions

So that exception I see in the logs 3000 times is a "normal exception"? Sounds legit. Repeat after me: A Normal Exception is Not. Raising and handling exceptions is a popular and ingrained mechanism for dealing with faults. Unfortunately, it's also one of the most abused...