I’m not surprised to see Prometheus and Grafana with top votes in CNCF’s recently published End User Technology Radar on Observability.
However, I would have liked to see Litmus or another chaos engineering tool on that radar, since these tools are listed in the Observability and Analysis category of the CNCF landscape.
Curiosity led me to watch a recording of one of the meetings hosted by the CNCF SIG for Observability.
But these questions still linger…
If I learn how to use these cool observability tools really well, will I have a firm handle on the state of any cloud-native system?
Are chaos engineering tools, such as Litmus, ChaosKube, and PowerfulSeal, sufficient to help me determine the resiliency of such a system?
Is observability the sum of monitoring, logging, and tracing?
Is this a case of the tail wagging the dog?
Maybe, but who’s to say that’s not a feasible approach? What if the sum of the parts leads me to the whole? How would I recognize the whole?
As an observability noob, many such questions continue to baffle me.
What should be the building blocks of my learning journey?
How should I design my learning pathway?
Why should I go down a certain path?
Last year, Charity Majors of honeycomb.io wrote a thought-provoking piece that debunked the myth of the three pillars of observability.
She spoke about instrumenting code and capturing details in a way that enables us to answer any question.
Here’s her tweet thread, which would bewilder even an observability veteran, let alone a newbie.
Do I really want to store data in three different ways? Probably not, but what exactly does the alternative she’s suggesting mean?
My first stumbling block
Arbitrarily-wide structured data blobs at the origin (WTH?!)
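Here’s my best guess at what that phrase means, sketched in Python. This is purely my interpretation, not Charity’s reference implementation: the service emits one wide, structured event per request at the point of origin, carrying every field that might later help answer a question, instead of scattering the same facts across separate metric, log, and trace stores. All the field names and the process() stub below are hypothetical.

```python
import json
import time

def process(request):
    """Stand-in for real business logic (hypothetical)."""
    pass

def handle_request(request):
    # One arbitrarily-wide structured event per request: every field we
    # might ever want to query gets attached to the same record.
    event = {
        "timestamp": time.time(),
        "service": "checkout",                   # hypothetical service name
        "request_id": request["id"],
        "user_id": request["user_id"],
        "endpoint": request["path"],
        "cart_items": len(request["items"]),
        "feature_flags": request.get("flags", []),
    }
    start = time.monotonic()
    try:
        process(request)
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        # Emit the whole blob once, at the origin, rather than splitting it
        # into separate metrics, logs, and traces.
        print(json.dumps(event))

handle_request({"id": "r-1", "user_id": "u-42", "path": "/checkout", "items": [1, 2]})
```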
If monitoring helps us figure out only known knowns, how can it help us answer unknown or unanticipated questions?
If I am unaware of what I am unaware of, how can I even begin to know what and where I need to debug?
With so many moving parts in a distributed system, how can I even create a mental model of the system?
She goes on to explain it in another tweet here.
And Liz Fong-Jones chimes in with an interesting tweet.
Okay, I promise that’s the last tweet for this post!
Based on what I’ve been reading and watching so far, tracing is something I can kinda grasp, at least through the lens of understanding the flow of events or the journey of a service request.
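To check my own understanding, here’s a minimal sketch of a trace using the OpenTelemetry Python SDK (my choice for illustration; none of the sources above prescribe it): a root span for the request and child spans for each step along its journey, printed to the console. The span names and attributes are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout so the request's journey is visible.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")

# One trace = the journey of a single service request; each span is one hop.
with tracer.start_as_current_span("checkout-request") as root:
    root.set_attribute("request.id", "r-1")   # hypothetical request ID
    with tracer.start_as_current_span("validate-cart"):
        pass                                   # stand-in for real work
    with tracer.start_as_current_span("charge-card"):
        pass
```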
Monitoring and logging seem inapplicable to me when I don’t yet know what I don’t know. Also, how do I account for the silos created by decoupled metrics and lost context?
What about process variations? Think common and special causes.
Getting comfortable with VUCA
I’m ending this blog with many unanswered questions, and that’s something I’ve always been comfortable with.
The primary need in research is to be comfortable with volatility, uncertainty, complexity, and ambiguity (VUCA).
Upcoming Blog Alert: Stay tuned for a long-winded blog series on VUCA.
I haven’t formed an opinion about monitoring and logging, and I might not do so anytime soon.
So, I’m sharing the two blogs that piqued my interest and got me to write this first blog on AWE.
Banner Image by Free-Photos on Pixabay