MLTP¶
Metrics¶
- What is it?
- Read World Similarity
- Sample
metric_name{label="value", ..} measurement ... histogram_name{label="value", ..} bucket
Logs¶
- Who doesn't know it?
- Collect it to centralised storage, so we can link to our Central Monitoring-UI like
Grafana
- Sample Data Model
log_id: "unique identifier for the log" timestamp: "timestamp when the log was generated" message: "content of the log" severity: "level of severity of the log" tags: "key-value pairs for additional information"
Traces¶
- Tracing is a method used to monitor and understand the flow of a request through a distributed system.
- Read World Example of a Trace
- Microservice Example of a Trace -
@graph.wealthy
- Sample Data Model
trace_id: "unique identifier for the trace" span_id: "unique identifier for the span" parent_span_id: "id of the parent span" start_time: "timestamp when the span started" end_time: "timestamp when the span ended" operation_name: "name of the operation" tags: "key-value pairs for additional information" logs: "events that occurred during the span"
Continuous Profiling¶
- Continuous Profiling is like a doctor checking your health regularly, not just when you're sick.
- It helps us understand how our system is performing over time, not just when there's a problem.
- Sample Data Model
profile_id: "unique identifier for the profile" start_time: "timestamp when the profiling started" end_time: "timestamp when the profiling ended" duration: "duration of the profiling" cpu_time: "total CPU time used during the profiling" memory_usage: "total memory used during the profiling" disk_io: "total disk I/O during the profiling" network_io: "total network I/O during the profiling"
Flamegraph 🔥¶
- Flamegraph is a visualization tool that presents a graphical representation of the execution of a program.
- It helps us understand the flow of execution and identify performance bottlenecks.
- Google Chrome as example & Demo flame
Golden Signals 🚦¶
From SRE handbook¶
- Golden Signals are like the health indicators of a system.
- They help us understand if the system is working well or not.
- There are four main golden signals:
- Latency: 🕒 How long it takes for the system to respond.
- Traffic: 🚦 How much data the system is handling.
- Errors: ❌ How many mistakes the system is making.
- Saturation: 🔄 How full the system is.