MLTP (Metrics, Logs, Traces, Profiles)

Metrics

  • What is it?
  • Real-World Analogy
  • Sample
    metric_name{label="value", ..} measurement
    ...
    histogram_name{label="value", ..} bucket
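
The sample above resembles the Prometheus text exposition format. As a minimal sketch under that assumption, the helpers below render one sample line and turn raw observations into cumulative histogram buckets. The names `format_sample`, `histogram_buckets`, and the example metric names are hypothetical and only for illustration.

```python
def format_sample(name, labels, value):
    """Render one sample as: name{key="value",...} value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def histogram_buckets(observations, bounds):
    """Return (upper_bound, cumulative_count) pairs, ending with +Inf.

    Each bucket counts every observation less than or equal to its bound,
    so counts never decrease from one bucket to the next.
    """
    buckets = []
    for bound in bounds:
        count = sum(1 for o in observations if o <= bound)
        buckets.append((str(bound), count))
    buckets.append(("+Inf", len(observations)))
    return buckets


if __name__ == "__main__":
    # A counter-style sample with two labels (hypothetical metric name).
    print(format_sample("http_requests_total", {"method": "GET", "status": "200"}, 42))

    # Request durations in seconds, bucketed at 0.1 s, 0.5 s and 1.0 s.
    for bound, count in histogram_buckets([0.05, 0.2, 0.7, 3.0], [0.1, 0.5, 1.0]):
        print(format_sample("request_seconds_bucket", {"le": bound}, count))
```

Keeping buckets cumulative means a query can estimate any percentile from the bucket counts alone, without storing every observation.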
    

Logs

  • Who doesn't know it?
  • Collect logs into centralised storage so we can link to them from a central monitoring UI such as Grafana
  • Sample Data Model
    log_id: "unique identifier for the log"
    timestamp: "timestamp when the log was generated"
    message: "content of the log"
    severity: "level of severity of the log"
    tags: "key-value pairs for additional information"
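
One way to make the data model above concrete is a small record type that serialises to one JSON object per line, a common shape for shipping logs to central storage. This is a sketch only: the `LogRecord` class, the default severity, and the hex `log_id` format are assumptions, not part of the original model.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LogRecord:
    message: str
    severity: str = "INFO"
    tags: dict = field(default_factory=dict)
    # Generated automatically so every record is uniquely addressable.
    log_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # ISO 8601 timestamp in UTC, taken when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise the record as a single JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = LogRecord(
        "payment failed", severity="ERROR", tags={"service": "checkout"}
    )
    print(record.to_json())
```

Structured fields such as `severity` and `tags` are what let a log store filter and correlate records, rather than relying on free-text search of `message`.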
    

Traces

  • Tracing is a method used to monitor and understand the flow of a request through a distributed system.
  • Real-World Example of a Trace
  • Microservice Example of a Trace - @graph.wealthy
  • Sample Data Model
    trace_id: "unique identifier for the trace"
    span_id: "unique identifier for the span"
    parent_span_id: "id of the parent span"
    start_time: "timestamp when the span started"
    end_time: "timestamp when the span ended"
    operation_name: "name of the operation"
    tags: "key-value pairs for additional information"
    logs: "events that occurred during the span"
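
The data model above can be sketched as a span type plus a helper that rebuilds the request flow from `parent_span_id`. This is an illustration under assumptions: times are integer milliseconds since the start of the trace, and the `Span` class, `render_trace` function, and operation names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    start_time: int  # milliseconds since trace start (assumption)
    end_time: int
    operation_name: str
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    @property
    def duration(self) -> int:
        return self.end_time - self.start_time


def render_trace(spans):
    """Return indented lines showing each span under its parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in the order they started.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_time):
            lines.append(f"{'  ' * depth}{span.operation_name} ({span.duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    spans = [
        Span("t1", "a", None, 0, 120, "GET /checkout"),
        Span("t1", "c", "a", 30, 100, "db.query"),
        Span("t1", "b", "a", 5, 25, "auth.verify"),
    ]
    print("\n".join(render_trace(spans)))
    # GET /checkout (120 ms)
    #   auth.verify (20 ms)
    #   db.query (70 ms)
```

All spans share one `trace_id`, and the parent links alone are enough to recover the call tree, which is why spans can be reported independently by each service.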
    

Continuous Profiling

  • Continuous Profiling is like a doctor checking your health regularly, not just when you're sick.
  • It helps us understand how our system is performing over time, not just when there's a problem.
  • Sample Data Model
    profile_id: "unique identifier for the profile"
    start_time: "timestamp when the profiling started"
    end_time: "timestamp when the profiling ended"
    duration: "duration of the profiling"
    cpu_time: "total CPU time used during the profiling"
    memory_usage: "total memory used during the profiling"
    disk_io: "total disk I/O during the profiling"
    network_io: "total network I/O during the profiling"
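
As a rough sketch of how one profile record in the model above could be captured, the helper below measures a single function call using only the Python standard library. The `Profile` class and `profile_call` function are hypothetical. `disk_io` and `network_io` are left out because they would need OS-specific counters or a third-party library, and `memory_usage` here is the peak traced allocation in bytes, which is an assumption about what that field means.

```python
import time
import tracemalloc
import uuid
from dataclasses import dataclass


@dataclass
class Profile:
    profile_id: str
    start_time: float  # Unix timestamp, seconds
    end_time: float
    duration: float  # wall-clock seconds
    cpu_time: float  # CPU seconds used by this process
    memory_usage: int  # peak bytes allocated while profiling


def profile_call(fn, *args, **kwargs):
    """Run fn once and return (result, Profile)."""
    tracemalloc.start()
    start_wall = time.time()
    start_perf = time.perf_counter()
    start_cpu = time.process_time()
    try:
        result = fn(*args, **kwargs)
    finally:
        cpu_time = time.process_time() - start_cpu
        duration = time.perf_counter() - start_perf
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    profile = Profile(
        profile_id=uuid.uuid4().hex,
        start_time=start_wall,
        end_time=start_wall + duration,
        duration=duration,
        cpu_time=cpu_time,
        memory_usage=peak_bytes,
    )
    return result, profile


if __name__ == "__main__":
    total, profile = profile_call(sum, range(100_000))
    print(total, profile)
```

A continuous profiler would take measurements like this on a schedule, at low overhead, and store the records so they can be compared over time.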
    

Flamegraph 🔥

  • A flamegraph is a visualization of sampled stack traces: each box is a function, and its width shows how much of the profile was spent in that function and everything it called.
  • It helps us understand the flow of execution and identify performance bottlenecks.
  • Google Chrome as an example, plus a demo flamegraph
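
To show where the box widths in a flamegraph come from, here is a minimal sketch that aggregates stack samples. It assumes each sample is a list of function names from the outermost caller to the innermost frame. `fold_stacks` emits the semicolon-separated "folded stacks" text that flame graph tools commonly accept, and `box_widths` counts samples per call path. Both function names and the sample data are hypothetical.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def box_widths(samples):
    """Count how many samples pass through each call path.

    The count for a path is the width of its box: a frame is as wide
    as all the samples in which it was on the stack at that position.
    """
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    return widths


if __name__ == "__main__":
    samples = [
        ["main", "handle", "parse"],
        ["main", "handle", "parse"],
        ["main", "handle", "query"],
        ["main", "render"],
    ]
    print("\n".join(fold_stacks(samples)))
    # main;handle;parse 2
    # main;handle;query 1
    # main;render 1
    print(box_widths(samples)[("main", "handle")])  # 3
```

In this example `handle` spans three of the four samples, so it would be drawn at 75% of the width of `main`, which makes it the first place to look for a bottleneck.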

Golden Signals 🚦

From the Google SRE book
  • Golden Signals are like the health indicators of a system.
  • They help us understand if the system is working well or not.
  • There are four main golden signals:
    • Latency: 🕒 How long the system takes to serve a request.
    • Traffic: 🚦 How much demand is placed on the system, e.g. requests per second.
    • Errors: ❌ The rate of requests that fail.
    • Saturation: 🔄 How full the system is, i.e. how much of its most constrained resource (CPU, memory, I/O) is in use.
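
The four signals can be computed from a window of request records. The sketch below is illustrative only: the `Request` type, the choice of the 95th percentile for latency, counting HTTP 5xx responses as errors, and expressing saturation as used capacity over total capacity are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(r.status >= 500 for r in requests) / len(requests),
        "saturation": used_capacity / total_capacity,
    }


if __name__ == "__main__":
    # Ten requests over five seconds; two of them fail with a 500.
    requests = [
        Request(latency_ms=10 * i, status=500 if i in (3, 7) else 200)
        for i in range(1, 11)
    ]
    # Six of eight worker slots are busy (hypothetical capacity measure).
    print(golden_signals(requests, window_seconds=5, used_capacity=6, total_capacity=8))
```

A percentile is used for latency instead of the mean because a few very slow requests can hide behind a healthy-looking average.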