Monitoring
If you can't see what your deployment is doing, you're flying blind. And flying blind with a production system is... not great. This chapter covers the observability features built into Adama: health check endpoints, logging configuration, debugging techniques, and how to think about performance monitoring.
Health Check Endpoints
Adama exposes HTTP endpoints for health monitoring. These are meant for load balancers, orchestrators, and whatever monitoring system you're running.
Basic Health Check
The basic health check confirms the server is up and accepting connections:
GET /~health_check_lb
Response when healthy:
200 OK
It's lightweight -- suitable for frequent polling by load balancers.
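As a sketch of what that polling looks like from the monitoring side, a small Python probe (the base URL and timeout here are assumptions; the endpoint path is the one above) can wrap the check:

```python
import urllib.request
import urllib.error

def check_health(base_url, path="/~health_check_lb", timeout=2.0):
    """Poll the lightweight health endpoint; True means a 200 came back in time."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, and timeout all count as unhealthy.
        return False

# An unreachable port reads as unhealthy rather than raising.
print(check_health("http://127.0.0.1:1", timeout=0.5))
```

Returning False on any failure (rather than raising) keeps the probe safe to run from a cron job or sidecar.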
Deep Health Check
The deep health check does internal verification:
GET /~deep_health_check_status_page
This endpoint verifies:
- Core services are running
- Document runtime is operational
- Internal components are healthy
Use this for more thorough monitoring, but at lower frequency. It does actual work internally, so don't hammer it.
Custom Health Check Paths
You can configure custom paths in your web configuration:
{
  "http-health-check-path": "/health",
  "http-deep-health-check-path": "/health/deep"
}
Handy when integrating with existing monitoring infrastructure that expects specific paths.
Load Balancer Configuration
When using a load balancer, configure it to poll the health endpoint:
# Example HAProxy configuration
backend adama_servers
    option httpchk GET /~health_check_lb
    http-check expect status 200
    server adama1 10.0.0.1:8080 check inter 5s
    server adama2 10.0.0.2:8080 check inter 5s
For Kubernetes:
livenessProbe:
  httpGet:
    path: /~health_check_lb
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /~deep_health_check_status_page
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
Built-in Metrics
Adama's runtime collects metrics about document operations, connections, and resource usage. These feed into the platform's monitoring infrastructure.
Core Metrics
The runtime tracks:
| Metric Category | What It Measures |
|---|---|
| Document Operations | Creates, loads, saves, deletes |
| Connection Metrics | Active connections, connection rate, disconnections |
| Message Processing | Messages received, processed, errors |
| State Machine Events | State transitions, blocked states |
| Memory Usage | Documents loaded, memory per document |
Metering
Adama tracks resource consumption for billing and capacity planning:
- CPU time: Processing time per document
- Storage: Data size per document
- Bandwidth: Network transfer per connection
- Operations: API calls per document
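Those per-document records roll up naturally to per-space totals for billing or capacity reports. A minimal sketch (the record shape here is an assumption for illustration, not Adama's actual metering format):

```python
from collections import defaultdict

def summarize_metering(records):
    """Roll per-document metering records up to per-space totals."""
    totals = defaultdict(
        lambda: {"cpu_ms": 0, "storage_bytes": 0, "bandwidth_bytes": 0, "operations": 0}
    )
    for r in records:
        bucket = totals[r["space"]]
        for key in ("cpu_ms", "storage_bytes", "bandwidth_bytes", "operations"):
            bucket[key] += r.get(key, 0)
    return dict(totals)

records = [
    {"space": "myapp", "cpu_ms": 120, "storage_bytes": 4096, "bandwidth_bytes": 1500, "operations": 7},
    {"space": "myapp", "cpu_ms": 80, "storage_bytes": 2048, "bandwidth_bytes": 500, "operations": 3},
]
print(summarize_metering(records)["myapp"]["cpu_ms"])  # 200
```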
Accessing Metrics
Metrics are available through the platform API. For self-hosted deployments, integrate with your monitoring stack using the metrics factory pattern:
// Custom metrics integration
MetricsFactory factory = new PrometheusMetricsFactory();
CoreMetrics coreMetrics = new CoreMetrics(factory);
Logging
Good logging is the difference between "I know what went wrong" and "I have no idea what went wrong." Adama supports configurable logging at multiple levels.
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Unrecoverable errors, system failures |
| WARN | Recoverable issues, degraded performance |
| INFO | Normal operations, significant events |
| DEBUG | Detailed operational information |
| TRACE | Very detailed debugging information |
Logging Configuration
Configure logging through your Java logging framework. For Logback:
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/adama.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/adama.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{ISO8601} [%thread] %-5level %logger - %msg%n</pattern>
    </encoder>
  </appender>

  <root level="INFO">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="FILE" />
  </root>

  <!-- Verbose logging for Adama internals during debugging -->
  <logger name="ape" level="DEBUG" />
</configuration>
Structured Logging
For production, use structured logging so your log aggregation system can actually parse things:
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
  <customFields>{"service":"adama","environment":"production"}</customFields>
</encoder>
This outputs JSON-formatted logs suitable for Elasticsearch, Splunk, CloudWatch, or whatever you're using.
Log Aggregation
Centralize your logs. Pick your approach:
- File-based: Use Filebeat or Fluentd to ship logs
- Network-based: Configure a network appender to send directly
- Container-based: Log to stdout and let the orchestrator handle aggregation
What to Log
Focus on actionable information. Logging everything is just as bad as logging nothing -- you end up drowning in noise.
| Event Type | Log Level | Information to Include |
|---|---|---|
| Document created | INFO | Space, key, creator |
| Connection established | INFO | Client identifier, document |
| Connection failed | WARN | Reason, client identifier |
| Message processed | DEBUG | Channel, message type |
| State transition | DEBUG | From state, to state, trigger |
| Error in document | ERROR | Space, key, error details, stack trace |
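A small helper keeps the fields from the table above consistent across log sites. This is an illustrative Python sketch of the pattern, not part of Adama; the function and field names are assumptions:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("adama.ops")

def log_document_event(level, event, space, key, **fields):
    """Emit one structured log line carrying the space/key context every entry needs."""
    payload = {"event": event, "space": space, "key": key, **fields}
    logger.log(level, json.dumps(payload, sort_keys=True))
    return payload  # returned so callers (and tests) can inspect what was logged

entry = log_document_event(
    logging.INFO, "document_created", "myapp", "doc-123", creator="user:42"
)
```

Funneling every event through one function makes it hard to forget the space and key that you'll need when grepping later.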
Debugging Techniques
When things go sideways, here's how to figure out why.
Document State Inspection
Examine the current state of a document:
java -jar adama.jar document read --space myapp --key doc-123
This shows the document's current data. Often the bug is obvious once you can actually see the state.
Connection Debugging
Monitor active connections and their state:
- Enable DEBUG logging for connection handlers
- Watch for connection lifecycle events
- Check for connection leaks (disconnects without cleanup)
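One way to spot leaks is to pair connect and disconnect events pulled from the DEBUG logs; any id that connects but never disconnects is a candidate. A sketch, assuming each event carries a connection id:

```python
def find_leaked_connections(events):
    """Return connection ids that connected but never disconnected.

    `events` is a list of (action, connection_id) pairs in log order.
    """
    open_conns = set()
    for action, conn_id in events:
        if action == "connect":
            open_conns.add(conn_id)
        elif action == "disconnect":
            open_conns.discard(conn_id)
    return open_conns

events = [("connect", "c1"), ("connect", "c2"), ("disconnect", "c1")]
print(find_leaked_connections(events))  # {'c2'}
```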
Message Tracing
Trace message flow through the system:
channel myChannel(MyMessage msg) {
  // Log incoming messages for debugging
  // Remove or gate behind a flag in production
  @debug("Received message: " + msg.type);
  // ... handle message
}
The @debug directive outputs information during development but can be disabled in production.
State Machine Debugging
Track state machine transitions:
#waiting {
  @debug("Entered waiting state");
  // state logic
}

#processing {
  @debug("Entered processing state");
  // state logic
}
Testing in Development
Use the built-in testing framework to validate behavior before you ship:
public int count;

message MyMessage {
  int value;
}

channel myChannel(MyMessage msg) {
  count++;
}

test scenario {
  @send myChannel(@no_one, { value: 42 });
  assert count == 1;
}
Run tests before deployment. This sounds obvious, but I'm saying it anyway.
Performance Monitoring
Monitor performance to find bottlenecks and plan capacity.
Key Performance Indicators
Track these to understand system health:
| Metric | Healthy Range | Warning Signs |
|---|---|---|
| Message latency | < 10ms p99 | Increasing latency |
| Connection rate | Stable | Sudden spikes |
| Document load time | < 100ms | Increasing over time |
| Memory per document | Stable | Unbounded growth |
| Error rate | < 0.1% | Any increase |
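Percentile targets like the p99 above come from raw latency samples. A minimal sketch using the nearest-rank method (real monitoring stacks usually compute this for you; this just shows what the number means):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p percent of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [2, 3, 3, 4, 5, 5, 6, 7, 8, 120]  # one slow outlier
print(percentile(latencies_ms, 99))  # 120
print(percentile(latencies_ms, 50))  # 5
```

Note how a single outlier dominates the p99 while leaving the median untouched, which is exactly why the table tracks tail latency rather than the average.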
Latency Analysis
Measure end-to-end latency at multiple points:
- Client to server: Network latency
- Message processing: Document execution time
- Response delivery: Delta computation and transmission
High latency at each point means something different:
- Client to server: Network or load balancer issues
- Message processing: Complex document logic or resource contention
- Response delivery: Large delta payloads or network congestion
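If each hop records a timestamp, the breakdown falls out by subtraction. A sketch assuming millisecond timestamps captured at each of the points above (the timestamp capture itself is left to your instrumentation):

```python
def latency_breakdown(t_client_send, t_server_receive, t_processing_done, t_client_receive):
    """Split end-to-end latency into the three segments described above."""
    return {
        "client_to_server_ms": t_server_receive - t_client_send,
        "message_processing_ms": t_processing_done - t_server_receive,
        "response_delivery_ms": t_client_receive - t_processing_done,
        "total_ms": t_client_receive - t_client_send,
    }

# Processing took 6 ms of the 13 ms total in this example.
breakdown = latency_breakdown(1000, 1004, 1010, 1013)
print(breakdown)
```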
Memory Analysis
Monitor memory usage patterns:
# JVM memory statistics
jstat -gc <pid> 1000
Watch for:
- Steadily increasing heap usage (memory leak -- bad)
- Frequent full garbage collections (insufficient heap)
- Large survivor spaces (objects living too long)
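A steadily climbing heap can be flagged automatically from periodic samples (e.g. scraped from the jstat output above). An illustrative sketch fitting a least-squares slope to evenly spaced samples:

```python
def heap_trend_slope(samples_mb):
    """Least-squares slope of heap usage over evenly spaced samples (MB per interval)."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Heap growing ~50 MB per sample interval: a leak candidate worth investigating.
print(heap_trend_slope([1000, 1050, 1100, 1150]))  # 50.0
```

A persistently positive slope after full GCs is the signal; a sawtooth that returns to the same floor is just normal collection.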
Profiling
For detailed performance analysis:
# CPU profiling
java -agentpath:/path/to/async-profiler/libasyncProfiler.so=start,file=profile.html -jar solo.jar ...
# Heap analysis
jmap -dump:format=b,file=heap.hprof <pid>
Analyze profiles to find:
- Hot code paths consuming CPU
- Memory allocation patterns
- Lock contention
Capacity Planning
Use historical metrics to plan ahead:
- Track document count growth over time
- Monitor peak concurrent connections
- Measure message throughput during peak hours
- Project future needs based on user growth
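The same historical data supports simple projections. A sketch extrapolating linear growth (real planning should sanity-check this against seasonality and product launches; the numbers are made up):

```python
def project_linear(history, periods_ahead):
    """Extrapolate a metric forward assuming its recent growth rate holds."""
    if len(history) < 2:
        raise ValueError("need at least two data points")
    per_period_growth = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + per_period_growth * periods_ahead

# Monthly document counts; project three months out.
monthly_docs = [10_000, 12_000, 14_000, 16_000]
print(project_linear(monthly_docs, 3))  # 22000.0
```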
Alerting
Set up alerts for conditions that need immediate attention. The goal is to wake up for real problems, not false alarms.
Critical Alerts
| Condition | Action |
|---|---|
| Health check failing | Investigate immediately, possible outage |
| Error rate > 1% | Review logs for error patterns |
| Memory > 90% | Scale up or investigate leak |
| Latency > 1s p99 | Investigate bottleneck |
Warning Alerts
| Condition | Action |
|---|---|
| Memory > 70% | Plan scaling |
| Connection rate spike | Investigate source |
| Disk usage > 80% | Clean up or expand |
| Certificate expiring | Renew before expiration |
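The memory thresholds from the two tables above can be encoded as a single evaluation function, so the alerting logic lives in one place. A trivial sketch; the band boundaries come from the tables, everything else is illustrative:

```python
def classify_memory_alert(memory_pct):
    """Map memory usage to the severity bands from the alert tables."""
    if memory_pct > 90:
        return "critical"  # scale up or investigate a leak now
    if memory_pct > 70:
        return "warning"   # plan scaling
    return "ok"

print(classify_memory_alert(95))  # critical
print(classify_memory_alert(75))  # warning
print(classify_memory_alert(50))  # ok
```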
Alert Configuration Example
Using Prometheus Alertmanager:
groups:
  - name: adama
    rules:
      - alert: AdamaHealthCheckFailing
        expr: probe_success{job="adama_health"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Adama health check failing"
      - alert: AdamaHighErrorRate
        expr: rate(adama_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated error rate in Adama"
Operational Runbooks
Have procedures ready for common scenarios. When things break at 2am, you don't want to be figuring this out from scratch.
High Memory Usage
- Check active document count
- Identify documents with large state
- Review for memory leaks in document code
- Consider scaling horizontally
- If critical, restart with larger heap
Elevated Error Rate
- Check recent deployments (it's almost always a recent deployment)
- Review error logs for patterns
- Identify affected documents or channels
- Rollback if deployment-related
- Fix and redeploy if code issue
Connection Storms
- Identify source of connections
- Check for client retry loops (this is the usual suspect)
- Enable rate limiting if available
- Scale capacity if legitimate traffic
- Block malicious sources
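Storms can be flagged automatically by comparing the current connection rate to a rolling baseline. A hedged sketch (the spike factor is a tunable assumption, not a recommendation):

```python
def is_connection_storm(recent_rates, current_rate, spike_factor=3.0):
    """True when the current rate exceeds spike_factor times the recent average."""
    if not recent_rates:
        return False  # no baseline yet; don't alert on the first sample
    baseline = sum(recent_rates) / len(recent_rates)
    return current_rate > spike_factor * baseline

history = [100, 110, 95, 105]  # connections/sec over recent intervals
print(is_connection_storm(history, 120))  # False: normal fluctuation
print(is_connection_storm(history, 600))  # True: likely a retry loop or attack
```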
Invest in monitoring before you need it. When something breaks, good observability is the difference between a five-minute fix and a five-hour outage.