Monitoring
If you can't see what your deployment is doing, you're flying blind. And flying blind with a production system is... not great. This chapter covers the observability features built into Adama: health check endpoints, logging configuration, debugging techniques, and how to think about performance monitoring.
Health Check Endpoints
Adama exposes HTTP endpoints for health monitoring. These are meant for load balancers, orchestrators, and whatever monitoring system you're running.
Basic Health Check
The basic health check confirms the server is up and accepting connections:
GET /~health_check_lb
Response when healthy:
200 OK
It's lightweight -- suitable for frequent polling by load balancers.
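As a sketch of what that polling looks like from the monitoring side, a small Python probe (the base URL and timeout here are assumptions; the endpoint path is the one above) can wrap the check:

```python
import urllib.request
import urllib.error

def check_health(base_url, path="/~health_check_lb", timeout=2.0):
    """Poll the lightweight health endpoint; True means a 200 came back in time."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, and timeout all count as unhealthy.
        return False

# An unreachable port reads as unhealthy rather than raising.
print(check_health("http://127.0.0.1:1", timeout=0.5))
```

Returning False on any failure (rather than raising) keeps the probe safe to run from a cron job or sidecar.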
Deep Health Check
The deep health check does internal verification:
GET /~deep_health_check_status_page
This endpoint verifies:
- Core services are running
- Document runtime is operational
- Internal components are healthy
Use this for more thorough monitoring, but at lower frequency. It does actual work internally, so don't hammer it.
Custom Health Check Paths
You can configure custom paths in your web configuration:
{
  "http-health-check-path": "/health",
  "http-deep-health-check-path": "/health/deep"
}
Handy when integrating with existing monitoring infrastructure that expects specific paths.
Load Balancer Configuration
When using a load balancer, configure it to poll the health endpoint:
# Example HAProxy configuration
backend adama_servers
    option httpchk GET /~health_check_lb
    http-check expect status 200
    server adama1 10.0.0.1:8080 check inter 5s
    server adama2 10.0.0.2:8080 check inter 5s
For Kubernetes:
livenessProbe:
  httpGet:
    path: /~health_check_lb
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /~deep_health_check_status_page
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
Built-in Metrics
Adama's runtime collects metrics about document operations, connections, and resource usage. These feed into the platform's monitoring infrastructure.
Core Metrics
The runtime tracks:
| Metric Category | What It Measures |
|---|---|
| Document Operations | Creates, loads, saves, deletes |
| Connection Metrics | Active connections, connection rate, disconnections |
| Message Processing | Messages received, processed, errors |
| State Machine Events | State transitions, blocked states |
| Memory Usage | Documents loaded, memory per document |
Metering
Adama tracks resource consumption for billing and capacity planning:
- CPU time: Processing time per document
- Storage: Data size per document
- Bandwidth: Network transfer per connection
- Operations: API calls per document
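Those per-document records roll up naturally to per-space totals for billing or capacity reports. A minimal sketch (the record shape here is an assumption for illustration, not Adama's actual metering format):

```python
from collections import defaultdict

def summarize_metering(records):
    """Roll per-document metering records up to per-space totals."""
    totals = defaultdict(
        lambda: {"cpu_ms": 0, "storage_bytes": 0, "bandwidth_bytes": 0, "operations": 0}
    )
    for r in records:
        bucket = totals[r["space"]]
        for key in ("cpu_ms", "storage_bytes", "bandwidth_bytes", "operations"):
            bucket[key] += r.get(key, 0)
    return dict(totals)

records = [
    {"space": "myapp", "cpu_ms": 120, "storage_bytes": 4096, "bandwidth_bytes": 1500, "operations": 7},
    {"space": "myapp", "cpu_ms": 80, "storage_bytes": 2048, "bandwidth_bytes": 500, "operations": 3},
]
print(summarize_metering(records)["myapp"]["cpu_ms"])  # 200
```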
Accessing Metrics
Metrics are available through the platform API. For self-hosted deployments, integrate with your monitoring stack using the metrics factory pattern:
// Custom metrics integration
MetricsFactory factory = new PrometheusMetricsFactory();
CoreMetrics coreMetrics = new CoreMetrics(factory);
Logging
Good logging is the difference between "I know what went wrong" and "I have no idea what went wrong." Adama supports configurable logging at multiple levels.
Log Levels
| Level | Use Case |
|---|---|
| ERROR | Unrecoverable errors, system failures |
| WARN | Recoverable issues, degraded performance |
| INFO | Normal operations, significant events |
| DEBUG | Detailed operational information |
| TRACE | Very detailed debugging information |
Logging Configuration
Configure logging through your Java logging framework. For Logback:
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/adama.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/adama.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{ISO8601} [%thread] %-5level %logger - %msg%n</pattern>
    </encoder>
  </appender>

  <root level="INFO">
    <appender-ref ref="STDOUT" />
    <appender-ref ref="FILE" />
  </root>

  <!-- Verbose logging for Adama internals during debugging -->
  <logger name="ape" level="DEBUG" />
</configuration>
Structured Logging
For production, use structured logging so your log aggregation system can actually parse things:
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
  <customFields>{"service":"adama","environment":"production"}</customFields>
</encoder>
This outputs JSON-formatted logs suitable for Elasticsearch, Splunk, CloudWatch, or whatever you're using.
Log Aggregation
Centralize your logs. Pick your approach:
- File-based: Use Filebeat or Fluentd to ship logs
- Network-based: Configure a network appender to send directly
- Container-based: Log to stdout and let the orchestrator handle aggregation
What to Log
Focus on actionable information. Logging everything is just as bad as logging nothing -- you end up drowning in noise.
| Event Type | Log Level | Information to Include |
|---|---|---|
| Document created | INFO | Space, key, creator |
| Connection established | INFO | Client identifier, document |
| Connection failed | WARN | Reason, client identifier |
| Message processed | DEBUG | Channel, message type |
| State transition | DEBUG | From state, to state, trigger |
| Error in document | ERROR | Space, key, error details, stack trace |
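A small helper keeps the fields from the table above consistent across log sites. This is an illustrative Python sketch of the pattern, not part of Adama; the function and field names are assumptions:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("adama.ops")

def log_document_event(level, event, space, key, **fields):
    """Emit one structured log line carrying the space/key context every entry needs."""
    payload = {"event": event, "space": space, "key": key, **fields}
    logger.log(level, json.dumps(payload, sort_keys=True))
    return payload  # returned so callers (and tests) can inspect what was logged

entry = log_document_event(
    logging.INFO, "document_created", "myapp", "doc-123", creator="user:42"
)
```

Funneling every event through one function makes it hard to forget the space and key that you'll need when grepping later.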
Debugging Techniques
When things go sideways, here's how to figure out why.
Document State Inspection
Examine the current state of a document:
java -jar adama.jar document read --space myapp --key doc-123
This shows the document's current data. Often the bug is obvious once you can actually see the state.
Connection Debugging
Monitor active connections and their state:
- Enable DEBUG logging for connection handlers
- Watch for connection lifecycle events
- Check for connection leaks (disconnects without cleanup)
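One way to spot leaks is to pair connect and disconnect events pulled from the DEBUG logs; any id that connects but never disconnects is a candidate. A sketch, assuming each event carries a connection id:

```python
def find_leaked_connections(events):
    """Return connection ids that connected but never disconnected.

    `events` is a list of (action, connection_id) pairs in log order.
    """
    open_conns = set()
    for action, conn_id in events:
        if action == "connect":
            open_conns.add(conn_id)
        elif action == "disconnect":
            open_conns.discard(conn_id)
    return open_conns

events = [("connect", "c1"), ("connect", "c2"), ("disconnect", "c1")]
print(find_leaked_connections(events))  # {'c2'}
```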
Message Tracing
Trace message flow through the system:
channel myChannel(MyMessage msg) {
  // Log incoming messages for debugging
  // Remove or gate behind a flag in production
  @debug("Received message: " + msg.type);
  // ... handle message
}
The @debug directive outputs information during development but can be disabled in production.
State Machine Debugging
Track state machine transitions:
#waiting {
  @debug("Entered waiting state");
  // state logic
}

#processing {
  @debug("Entered processing state");
  // state logic
}
Testing in Development
Use the built-in testing framework to validate behavior before you ship:
public int count;

message MyMessage {
  int value;
}

channel myChannel(MyMessage msg) {
  count++;
}

test scenario {
  @send myChannel(@no_one, { value: 42 });
  assert count == 1;
}
Run tests before deployment. This sounds obvious, but I'm saying it anyway.
Performance Monitoring
Monitor performance to find bottlenecks and plan capacity.
Key Performance Indicators
Track these to understand system health:
| Metric | Healthy Range | Warning Signs |
|---|---|---|
| Message latency | < 10ms p99 | Increasing latency |
| Connection rate | Stable | Sudden spikes |
| Document load time | < 100ms | Increasing over time |
| Memory per document | Stable | Unbounded growth |
| Error rate | < 0.1% | Any increase |
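Percentile targets like the p99 above come from raw latency samples. A minimal sketch using the nearest-rank method (real monitoring stacks usually compute this for you; this just shows what the number means):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p percent of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [2, 3, 3, 4, 5, 5, 6, 7, 8, 120]  # one slow outlier
print(percentile(latencies_ms, 99))  # 120
print(percentile(latencies_ms, 50))  # 5
```

Note how a single outlier dominates the p99 while leaving the median untouched, which is exactly why the table tracks tail latency rather than the average.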
Latency Analysis
Measure end-to-end latency at multiple points:
- Client to server: Network latency
- Message processing: Document execution time
- Response delivery: Delta computation and transmission
High latency at each point means something different:
- Client to server: Network or load balancer issues
- Message processing: Complex document logic or resource contention
- Response delivery: Large delta payloads or network congestion
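If each hop records a timestamp, the breakdown falls out by subtraction. A sketch assuming millisecond timestamps captured at each of the points above (the timestamp capture itself is left to your instrumentation):

```python
def latency_breakdown(t_client_send, t_server_receive, t_processing_done, t_client_receive):
    """Split end-to-end latency into the three segments described above."""
    return {
        "client_to_server_ms": t_server_receive - t_client_send,
        "message_processing_ms": t_processing_done - t_server_receive,
        "response_delivery_ms": t_client_receive - t_processing_done,
        "total_ms": t_client_receive - t_client_send,
    }

# Processing took 6 ms of the 13 ms total in this example.
breakdown = latency_breakdown(1000, 1004, 1010, 1013)
print(breakdown)
```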
Memory Analysis
Monitor memory usage patterns:
# JVM memory statistics
jstat -gc <pid> 1000
Watch for:
- Steadily increasing heap usage (memory leak -- bad)
- Frequent full garbage collections (insufficient heap)
- Large survivor spaces (objects living too long)
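A steadily climbing heap can be flagged automatically from periodic samples (e.g. scraped from the jstat output above). An illustrative sketch fitting a least-squares slope to evenly spaced samples:

```python
def heap_trend_slope(samples_mb):
    """Least-squares slope of heap usage over evenly spaced samples (MB per interval)."""
    n = len(samples_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_mb))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Heap growing ~50 MB per sample interval: a leak candidate worth investigating.
print(heap_trend_slope([1000, 1050, 1100, 1150]))  # 50.0
```

A persistently positive slope after full GCs is the signal; a sawtooth that returns to the same floor is just normal collection.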
Profiling
For detailed performance analysis:
# CPU profiling
java -agentpath:/path/to/async-profiler/libasyncProfiler.so=start,file=profile.html -jar solo.jar ...
# Heap analysis
jmap -dump:format=b,file=heap.hprof <pid>
Analyze profiles to find:
- Hot code paths consuming CPU
- Memory allocation patterns
- Lock contention
Capacity Planning
Use historical metrics to plan ahead:
- Track document count growth over time
- Monitor peak concurrent connections
- Measure message throughput during peak hours
- Project future needs based on user growth
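The same historical data supports simple projections. A sketch extrapolating linear growth (real planning should sanity-check this against seasonality and product launches; the numbers are made up):

```python
def project_linear(history, periods_ahead):
    """Extrapolate a metric forward assuming its recent growth rate holds."""
    if len(history) < 2:
        raise ValueError("need at least two data points")
    per_period_growth = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + per_period_growth * periods_ahead

# Monthly document counts; project three months out.
monthly_docs = [10_000, 12_000, 14_000, 16_000]
print(project_linear(monthly_docs, 3))  # 22000.0
```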
Alerting
Set up alerts for conditions that need immediate attention. The goal is to wake up for real problems, not false alarms.
Critical Alerts
| Condition | Action |
|---|---|
| Health check failing | Investigate immediately, possible outage |
| Error rate > 1% | Review logs for error patterns |
| Memory > 90% | Scale up or investigate leak |
| Latency > 1s p99 | Investigate bottleneck |
Warning Alerts
| Condition | Action |
|---|---|
| Memory > 70% | Plan scaling |
| Connection rate spike | Investigate source |
| Disk usage > 80% | Clean up or expand |
| Certificate expiring | Renew before expiration |
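The memory thresholds from the two tables above can be encoded as a single evaluation function, so the alerting logic lives in one place. A trivial sketch; the band boundaries come from the tables, everything else is illustrative:

```python
def classify_memory_alert(memory_pct):
    """Map memory usage to the severity bands from the alert tables."""
    if memory_pct > 90:
        return "critical"  # scale up or investigate a leak now
    if memory_pct > 70:
        return "warning"   # plan scaling
    return "ok"

print(classify_memory_alert(95))  # critical
print(classify_memory_alert(75))  # warning
print(classify_memory_alert(50))  # ok
```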
Alert Configuration Example
Using Prometheus Alertmanager:
groups:
  - name: adama
    rules:
      - alert: AdamaHealthCheckFailing
        expr: probe_success{job="adama_health"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Adama health check failing"
      - alert: AdamaHighErrorRate
        expr: rate(adama_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated error rate in Adama"
Operational Runbooks
Have procedures ready for common scenarios. When things break at 2am, you don't want to be figuring this out from scratch.
High Memory Usage
- Check active document count
- Identify documents with large state
- Review for memory leaks in document code
- Consider scaling horizontally
- If critical, restart with larger heap
Elevated Error Rate
- Check recent deployments (it's almost always a recent deployment)
- Review error logs for patterns
- Identify affected documents or channels
- Rollback if deployment-related
- Fix and redeploy if code issue
Connection Storms
- Identify source of connections
- Check for client retry loops (this is the usual suspect)
- Enable rate limiting if available
- Scale capacity if legitimate traffic
- Block malicious sources
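Storms can be flagged automatically by comparing the current connection rate to a rolling baseline. A hedged sketch (the spike factor is a tunable assumption, not a recommendation):

```python
def is_connection_storm(recent_rates, current_rate, spike_factor=3.0):
    """True when the current rate exceeds spike_factor times the recent average."""
    if not recent_rates:
        return False  # no baseline yet; don't alert on the first sample
    baseline = sum(recent_rates) / len(recent_rates)
    return current_rate > spike_factor * baseline

history = [100, 110, 95, 105]  # connections/sec over recent intervals
print(is_connection_storm(history, 120))  # False: normal fluctuation
print(is_connection_storm(history, 600))  # True: likely a retry loop or attack
```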
Invest in monitoring before you need it. When something breaks, good observability is the difference between a five-minute fix and a five-hour outage.