dev0root6/baadhal

🌩️ Baadal — Server Log Collector and Alerter

A lightweight, self-contained observability agent for monitoring Ubuntu servers with disk usage tracking, kernel monitoring, Docker container health, and webhook-based alerting.

✨ Features

  • 📊 Disk Usage Monitoring — Track largest directories with configurable thresholds
  • 🐳 Docker Log Monitoring — Filter container logs for errors, panics, OOM events
  • 🔍 Kernel Monitoring — Monitor dmesg for critical kernel events
  • 💓 Heartbeat — Periodic alive signals for health tracking
  • 🔄 Event Deduplication — Reduce noise with smart event hashing
  • 📈 Performance Stats — Track collector performance metrics
  • 🔔 Webhook Alerts — Discord, Slack, and custom API notifications
  • 🔒 Dead Man's Switch — Detect silent failures with receiver-side monitoring
  • 📝 Log Rotation — Automatic log rotation with lumberjack
  • 🔄 Hot Reload — Update config without restart (SIGHUP)
  • 🏗️ Dual Mode — Run as collector (agent) or receiver (central server)

🚀 Quick Start

Download Pre-built Binary

Download the latest release for your platform from GitHub Releases:

# Linux amd64
wget https://github.com/YOUR_USERNAME/baadhal/releases/latest/download/baadal-linux-amd64.tar.gz

# Verify checksum before extracting
# (assumes the matching .sha256 file was downloaded next to the archive)
sha256sum -c baadal-linux-amd64.tar.gz.sha256

tar -xzf baadal-linux-amd64.tar.gz
chmod +x baadal-linux-amd64

Build from Source

git clone https://github.com/YOUR_USERNAME/baadhal.git
cd baadhal
go mod download
go build -o baadal main.go

📋 Detailed Configuration

Baadal uses a YAML configuration file (config.yml) with comprehensive options for each monitoring component.

1. App Configuration

app:
  name: "baadal"              # Application name (used in logs)
  enabled: true                # Master switch - set to false to disable
  log_level: "info"           # Logging level: debug | info | warn | error
  
  heartbeat:
    enabled: true              # Enable periodic heartbeat events
    interval: "*/5 * * * *"   # Cron expression for heartbeat frequency
                               # Examples:
                               #   "*/5 * * * *"  - Every 5 minutes
                               #   "*/10 * * * *" - Every 10 minutes
                               #   "0 * * * *"    - Every hour
  
  deduplication:
    enabled: true              # Enable event deduplication
    window_seconds: 60         # Time window to check for duplicates
                               # Events with identical type+host+data 
                               # within this window are discarded

Heartbeat Details:

  • Sends periodic "alive" signals to receiver
  • Includes uptime in seconds
  • Helps detect collector crashes or network issues
  • Recommended: 5-10 minutes for production

Deduplication Details:

  • Uses MD5 hash of (type + host + data) for comparison
  • Prevents alert spam from repeated errors
  • Memory-efficient with automatic cleanup every 5 minutes
  • Recommended: 60-300 seconds depending on alert frequency

2. Transport Configuration

transport:
  mode: "remote"              # "remote" or "local"
  
  remote:                     # Used when mode = "remote"
    endpoint: "http://receiver-ip:5170/ingest"  # Receiver URL
                               # Use Tailscale/VPN IP for security
    
    auth:
      enabled: false           # Enable bearer token authentication
      token: "your-secret-token"  # Must match receiver token
    
    batch_size: 10            # Send after N events accumulate
    flush_interval: "5s"      # Or send after this time (whichever first)
    retry_attempts: 3         # Number of retries on failure
    retry_delay: "2s"         # Delay between retry attempts
  
  local:                      # Used when mode = "local"
    log_output: "/var/log/baadal/events.log"  # Local file path

Mode Selection:

  • remote: Recommended for production - sends to central receiver
  • local: For testing or standalone logging

Remote Transport Tips:

  • Use batch_size: 1 and flush_interval: "1s" for real-time alerting
  • Use batch_size: 50 and flush_interval: "30s" to reduce network traffic
  • Enable auth.enabled: true for production deployments

Endpoint Examples:

endpoint: "http://192.168.1.100:5170/ingest"      # Local network
endpoint: "http://100.x.x.x:5170/ingest"          # Tailscale
endpoint: "https://monitor.example.com/ingest"    # Public (use HTTPS + auth!)

3. Disk Monitoring

disk:
  enabled: true               # Enable disk usage monitoring
  paths:                      # Directories to scan
    - /                       # Root filesystem
    - /var                    # Common log location
    - /var/lib/docker         # Docker data
    - /home                   # User directories
    - /tmp                    # Temporary files
    - /opt                    # Optional software
  
  top_n: 5                    # Report top N largest directories
  max_depth: 3                # Maximum directory depth to scan
                               # Higher = more detail but slower
  
  schedule: "*/5 * * * *"     # Cron schedule for scans
                               # Examples:
                               #   "*/15 * * * *"  - Every 15 minutes
                               #   "0 */2 * * *"   - Every 2 hours
                               #   "0 3 * * *"     - Daily at 3 AM
  
  alert:
    enabled: true              # Enable alerting for disk usage
    condition: "size_gb > 20"  # Alert when directory exceeds 20 GB
                               # Can adjust threshold as needed
    webhooks:                  # Webhook names to fire (from webhooks section)
      - discord
      - custom-api

Path Selection Tips:

  • Start with root paths (/, /var, /home)
  • Add Docker path if running containers: /var/lib/docker
  • Add application-specific paths: /opt/myapp, /data
  • Avoid network mounts (slow scans)

Performance Tuning:

# Fast scan (less detail, every 30 min)
max_depth: 2
schedule: "*/30 * * * *"

# Detailed scan (more detail, hourly)
max_depth: 4
schedule: "0 * * * *"

# Daily deep scan
max_depth: 5
schedule: "0 2 * * *"  # 2 AM daily

Alert Threshold Examples:

condition: "size_gb > 10"    # Alert at 10 GB
condition: "size_gb > 50"    # Alert at 50 GB
condition: "size_gb > 100"   # Alert at 100 GB
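
As an illustration of how a threshold such as "size_gb > 20" might be evaluated against a scanned directory, the Go sketch below parses the three-token form shown above. The function name `exceeds` is hypothetical, and the sketch deliberately supports only the `size_gb > N` shape that appears in this README; the real condition grammar may accept more.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// exceeds (hypothetical helper) evaluates a condition of the form
// "size_gb > N" against a directory size in GB.
func exceeds(condition string, sizeGB float64) (bool, error) {
	parts := strings.Fields(condition)
	if len(parts) != 3 || parts[0] != "size_gb" || parts[1] != ">" {
		return false, fmt.Errorf("unsupported condition %q", condition)
	}
	limit, err := strconv.ParseFloat(parts[2], 64)
	if err != nil {
		return false, fmt.Errorf("bad threshold in %q: %w", condition, err)
	}
	return sizeGB > limit, nil
}

func main() {
	// 45.3 GB is the /var/lib/docker size used in the example event later on.
	for _, cond := range []string{"size_gb > 20", "size_gb > 50"} {
		hit, err := exceeds(cond, 45.3)
		fmt.Println(cond, hit, err)
	}
}
```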

4. Dmesg/Kernel Monitoring

dmesg:
  enabled: true               # Enable kernel message monitoring
  schedule: "*/1 * * * *"     # Check every minute (recommended)
  
  filter_levels:              # Kernel log levels to monitor
    - err                     # Error conditions
    - crit                    # Critical conditions
    - alert                   # Action must be taken immediately
    - emerg                   # System is unusable
  
  # Available but not recommended for production:
  #   - warn     # Warning conditions (very noisy)
  #   - notice   # Normal but significant
  #   - info     # Informational
  #   - debug    # Debug-level messages
  
  alert:
    enabled: true              # Enable alerting
    condition: "level == crit" # Alert on critical messages
                               # Options: emerg, alert, crit, err
    webhooks:
      - discord

Filter Level Guide:

| Level  | Severity  | Typical Issues                       | Recommended         |
|--------|-----------|--------------------------------------|---------------------|
| emerg  | Emergency | Kernel panic, system crash           | ✅ Yes              |
| alert  | Alert     | Hardware failure, critical bug       | ✅ Yes              |
| crit   | Critical  | Hard disk errors, memory issues      | ✅ Yes              |
| err    | Error     | Driver errors, non-critical failures | ✅ Yes              |
| warn   | Warning   | Deprecation notices, soft errors     | ⚠️ Noisy            |
| notice | Notice    | Normal but significant               | ❌ Too noisy        |
| info   | Info      | General information                  | ❌ Too noisy        |
| debug  | Debug     | Debug messages                       | ❌ Development only |

Alert Condition Examples:

condition: "level == crit"   # Only critical errors
condition: "level == alert"  # Only alert-level messages (== is an exact match)
condition: "level == emerg"  # Only emergency (kernel panic)

Best Practices:

  • Run every 1-2 minutes for timely kernel error detection
  • Filter to err, crit, alert, emerg only
  • Don't include warn - generates too many false positives
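
The two stages described in this section, filtering by filter_levels and then alerting on an exact level match, could look roughly like this in Go. The `KernelMsg` type and both function names are hypothetical, and the sample messages are invented for illustration; how Baadal actually reads and parses dmesg output is not shown here.

```go
package main

import "fmt"

// KernelMsg (hypothetical) is one parsed kernel log record.
type KernelMsg struct {
	Level string // emerg, alert, crit, err, warn, notice, info, debug
	Text  string
}

// filterLevels keeps only messages whose level is listed in levels,
// mirroring the filter_levels setting.
func filterLevels(msgs []KernelMsg, levels []string) []KernelMsg {
	allowed := make(map[string]bool, len(levels))
	for _, l := range levels {
		allowed[l] = true
	}
	var kept []KernelMsg
	for _, m := range msgs {
		if allowed[m.Level] {
			kept = append(kept, m)
		}
	}
	return kept
}

// shouldAlert implements an exact-match condition such as "level == crit".
func shouldAlert(m KernelMsg, alertLevel string) bool {
	return m.Level == alertLevel
}

func main() {
	// Invented sample messages.
	msgs := []KernelMsg{
		{"info", "eth0: link up"},
		{"err", "usb 1-1: device descriptor read error"},
		{"crit", "EXT4-fs error: unable to read inode block"},
	}
	for _, m := range filterLevels(msgs, []string{"err", "crit", "alert", "emerg"}) {
		fmt.Printf("%-5s alert=%-5v %s\n", m.Level, shouldAlert(m, "crit"), m.Text)
	}
}
```

With an exact-match condition, the `err` message is still collected and logged but only the `crit` message fires a webhook.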

5. Docker Monitoring

docker:
  enabled: true               # Enable Docker log monitoring
  containers:                 # Which containers to monitor
    - "all"                   # Monitor all running containers
    # OR specify by name:
    # - "nginx"
    # - "postgres"
    # - "redis"
  
  tail_lines: 100             # Number of recent log lines to check
                               # Higher = more coverage but slower
  
  schedule: "*/2 * * * *"     # Check every 2 minutes
  
  filter_keywords:            # Keywords to search for in logs
    - "error"                 # Generic errors
    - "fatal"                 # Fatal errors
    - "panic"                 # Runtime panics (e.g. Go, Rust) and kernel panics
    - "OOM"                   # Out of memory
    - "killed"                # Process killed
    - "segfault"              # Segmentation fault
    - "exception"             # Exceptions (add if using Python/Java)
  
  alert:
    enabled: true
    condition: "keyword == fatal"  # Alert on "fatal" keyword matches
    webhooks:
      - discord
      - custom-api

Container Selection:

# Monitor all containers
containers: ["all"]

# Monitor specific containers
containers:
  - "nginx"
  - "postgres-primary"
  - "redis"

# Select by log content rather than container name (use "all" plus filter_keywords)
containers: ["all"]
filter_keywords: ["error", "fatal"]

Keyword Selection by Stack:

Node.js/JavaScript:

filter_keywords:
  - "error"
  - "fatal"
  - "uncaughtException"
  - "unhandledRejection"
  - "ECONNREFUSED"
  - "ETIMEDOUT"

Python:

filter_keywords:
  - "error"
  - "fatal"
  - "exception"
  - "traceback"
  - "critical"

Java/Spring:

filter_keywords:
  - "error"
  - "exception"
  - "OutOfMemoryError"
  - "StackOverflowError"
  - "SQLException"

Go:

filter_keywords:
  - "error"
  - "fatal"
  - "panic"
  - "deadlock"

Database (Postgres/MySQL):

filter_keywords:
  - "error"
  - "fatal"
  - "panic"
  - "deadlock"
  - "connection refused"

Alert Condition Examples:

condition: "keyword == fatal"     # Only fatal errors
condition: "keyword == panic"     # Only panics
condition: "keyword == OOM"       # Only out of memory
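
Keyword filtering and the `keyword == ...` condition can be sketched in Go as below. The helper names are hypothetical, and case-insensitive substring matching is an assumption, chosen because the example event in this README matches the line "Fatal error: Cannot connect to database" against the keyword "fatal".

```go
package main

import (
	"fmt"
	"strings"
)

// matchKeywords (hypothetical) returns every configured keyword found in the
// log line, using case-insensitive substring matching (an assumption).
func matchKeywords(line string, keywords []string) []string {
	lower := strings.ToLower(line)
	var matched []string
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			matched = append(matched, kw)
		}
	}
	return matched
}

// shouldAlert implements a condition such as "keyword == fatal".
func shouldAlert(matched []string, alertKeyword string) bool {
	for _, kw := range matched {
		if kw == alertKeyword {
			return true
		}
	}
	return false
}

func main() {
	keywords := []string{"error", "fatal", "panic", "OOM"}
	line := "Fatal error: Cannot connect to database"

	matched := matchKeywords(line, keywords)
	fmt.Println(matched)                       // [error fatal]
	fmt.Println(shouldAlert(matched, "fatal")) // true
	fmt.Println(shouldAlert(matched, "panic")) // false
}
```

Substring matching is broad: a short keyword such as "OOM" would also match inside words like "room" when case is ignored, so narrow keywords are worth testing against real logs.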

Performance Tips:

# Frequent checks (every minute)
tail_lines: 50
schedule: "*/1 * * * *"

# Balanced (every 2 minutes)
tail_lines: 100
schedule: "*/2 * * * *"

# Less frequent but thorough (every 5 minutes)
tail_lines: 500
schedule: "*/5 * * * *"

6. Webhooks

Baadal supports multiple webhook destinations for alerts. Each webhook can have custom templates.

Discord Webhook

webhooks:
  - name: "discord"
    enabled: true
    url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
    method: POST
    headers:
      Content-Type: "application/json"
    payload_template: |
      {
        "content": "🚨 {{.Title}}",
        "embeds": [{
          "description": "{{.Message}}",
          "fields": [
            {"name": "Host",     "value": "{{.Hostname}}",  "inline": true},
            {"name": "Type",     "value": "{{.Type}}",      "inline": true},
            {"name": "Severity", "value": "{{.Severity}}",  "inline": true},
            {"name": "Time",     "value": "{{.Timestamp}}", "inline": true}
          ],
          "color": 16711680
        }]
      }

How to Get Discord Webhook URL:

  1. Open Discord server → Server Settings → Integrations
  2. Click "Webhooks" → "New Webhook"
  3. Choose channel, copy webhook URL
  4. Paste in config.yml

Discord Color Codes:

"color": 16711680   # Red (critical)
"color": 16776960   # Yellow (warning)
"color": 65280      # Green (info)
"color": 3447003    # Blue (info)

Slack Webhook

  - name: "slack"
    enabled: true
    url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    method: POST
    headers:
      Content-Type: "application/json"
    payload_template: |
      {
        "text": "🚨 *{{.Title}}*",
        "blocks": [
          {
            "type": "section",
            "text": {
              "type": "mrkdwn",
              "text": "*{{.Title}}*\n{{.Message}}"
            }
          },
          {
            "type": "section",
            "fields": [
              {"type": "mrkdwn", "text": "*Host:*\n{{.Hostname}}"},
              {"type": "mrkdwn", "text": "*Type:*\n{{.Type}}"},
              {"type": "mrkdwn", "text": "*Severity:*\n{{.Severity}}"},
              {"type": "mrkdwn", "text": "*Time:*\n{{.Timestamp}}"}
            ]
          }
        ]
      }

How to Get Slack Webhook URL:

  1. Visit https://api.slack.com/apps
  2. Create New App → "From scratch"
  3. Enable "Incoming Webhooks"
  4. Add New Webhook to Workspace
  5. Copy webhook URL

Custom API Webhook

  - name: "custom-api"
    enabled: true
    url: "https://your-api.com/alerts"
    method: POST
    headers:
      Authorization: "Bearer YOUR_API_TOKEN"
      Content-Type: "application/json"
      X-Custom-Header: "baadal-alerts"
    payload_template: |
      {
        "alert": "{{.Title}}",
        "host": "{{.Hostname}}",
        "timestamp": "{{.Timestamp}}",
        "type": "{{.Type}}",
        "severity": "{{.Severity}}",
        "message": "{{.Message}}",
        "data": {{.Data}}
      }

PagerDuty Webhook

  - name: "pagerduty"
    enabled: true
    url: "https://events.pagerduty.com/v2/enqueue"
    method: POST
    headers:
      Content-Type: "application/json"
    payload_template: |
      {
        "routing_key": "YOUR_INTEGRATION_KEY",
        "event_action": "trigger",
        "payload": {
          "summary": "{{.Title}}",
          "severity": "critical",
          "source": "{{.Hostname}}",
          "custom_details": {
            "message": "{{.Message}}",
            "type": "{{.Type}}",
            "timestamp": "{{.Timestamp}}"
          }
        }
      }

Email via SendGrid/Mailgun

  - name: "email"
    enabled: true
    url: "https://api.sendgrid.com/v3/mail/send"
    method: POST
    headers:
      Authorization: "Bearer YOUR_SENDGRID_API_KEY"
      Content-Type: "application/json"
    payload_template: |
      {
        "personalizations": [{
          "to": [{"email": "alerts@example.com"}]
        }],
        "from": {"email": "baadal@example.com"},
        "subject": "{{.Title}}",
        "content": [{
          "type": "text/plain",
          "value": "{{.Message}}\n\nHost: {{.Hostname}}\nTime: {{.Timestamp}}"
        }]
      }

Template Variables:

  • {{.Title}} - Alert title
  • {{.Message}} - Alert message
  • {{.Hostname}} - Source hostname
  • {{.Host}} - Same as Hostname
  • {{.Timestamp}} - ISO 8601 timestamp (IST, UTC+05:30)
  • {{.Type}} - Event type (disk_usage, dmesg, docker_log, etc.)
  • {{.Severity}} - Alert severity (critical, warning, info)
  • {{.Data}} - Raw JSON event data

7. Receiver Configuration

The receiver runs on your central monitoring server and accepts events from all collectors.

receiver:
  enabled: true               # Enable receiver mode
  port: 5170                  # Port to listen on
  
  auth:
    enabled: false            # Enable authentication
    token: "your-secret-token"  # Must match collector tokens
                               # IMPORTANT: Enable in production!
  
  log_output: "/var/log/baadal/events.log"  # Where to write events
  
  log_rotation:               # Automatic log rotation (via lumberjack)
    max_size_mb: 50           # Rotate after 50 MB
    max_backups: 3            # Keep 3 old log files
    max_age_days: 7           # Delete logs older than 7 days
    compress: true            # Gzip old log files
  
  dead_mans_switch:           # Detect missing collectors
    enabled: true
    timeout_minutes: 10       # Alert if no events for 10 minutes
    check_interval: "*/2 * * * *"  # Check every 2 minutes
    webhooks:
      - discord
  
  promtail:                   # Optional: Forward to Loki
    enabled: false
    endpoint: "http://localhost:9080/loki/api/v1/push"

Log Rotation Examples:

High-frequency monitoring (lots of events):

log_rotation:
  max_size_mb: 100      # Larger files
  max_backups: 7        # Keep more history
  max_age_days: 14      # 2 weeks retention
  compress: true

Low-frequency monitoring:

log_rotation:
  max_size_mb: 20       # Smaller files
  max_backups: 3        # Less history
  max_age_days: 7       # 1 week retention
  compress: true

Dead Man's Switch:

  • Monitors when each collector last sent events
  • Fires webhook alert if collector goes silent
  • Helps detect crashed collectors or network issues

Dead Man's Switch Examples:

# Tight monitoring (5 min timeout)
timeout_minutes: 5
check_interval: "*/1 * * * *"

# Relaxed monitoring (30 min timeout)
timeout_minutes: 30
check_interval: "*/10 * * * *"

# 24-hour timeout, checked hourly (for non-critical servers)
timeout_minutes: 1440  # 24 hours
check_interval: "0 * * * *"  # Hourly check

Authentication:

# Development (no auth)
auth:
  enabled: false

# Production (required!)
auth:
  enabled: true
  token: "use-a-long-random-string-here"
  # Generate token: openssl rand -base64 32

8. Node Identity

Configure how this server identifies itself in events:

node:
  hostname: ""                # Empty = auto-detect from OS
                               # Or set manually: "web-server-01"
  
  environment: "production"   # Environment label
                               # Options: production, staging, dev, test
  
  tags:                       # Custom tags for filtering/grouping
    - "ubuntu"
    - "backend"
    - "api-server"
    - "us-east-1"

Hostname Examples:

hostname: ""                  # Auto-detect (recommended)
hostname: "web-server-01"     # Manual override
hostname: "db-primary"        # For databases
hostname: "worker-03"         # For worker nodes

Environment Best Practices:

environment: "production"     # Live production servers
environment: "staging"        # Staging/QA environment
environment: "development"    # Dev servers
environment: "test"           # CI/CD test runners

Tag Examples:

By Role:

tags: ["web-server", "nginx", "frontend"]
tags: ["database", "postgres", "primary"]
tags: ["worker", "celery", "background-jobs"]

By Location:

tags: ["aws", "us-east-1", "production"]
tags: ["on-premise", "datacenter-1"]
tags: ["cloud", "digitalocean", "sgp1"]

By Stack:

tags: ["nodejs", "express", "api"]
tags: ["python", "django", "web"]
tags: ["go", "microservice"]

🔧 Complete Configuration Example

Here's a production-oriented config.yml for a collector. The receiver section is included but disabled; enable it only on the receiver server:

# ─────────────────────────────────────────────
#  Baadal — Production Configuration
# ─────────────────────────────────────────────

app:
  name: "baadal"
  enabled: true
  log_level: "info"
  
  heartbeat:
    enabled: true
    interval: "*/5 * * * *"
  
  deduplication:
    enabled: true
    window_seconds: 60

transport:
  mode: "remote"
  
  remote:
    endpoint: "http://100.64.1.100:5170/ingest"  # Replace with your receiver IP
    auth:
      enabled: true
      token: "your-generated-secret-token-here"  # Generate: openssl rand -base64 32
    batch_size: 10
    flush_interval: "5s"
    retry_attempts: 3
    retry_delay: "2s"

disk:
  enabled: true
  paths:
    - /
    - /var
    - /var/lib/docker
    - /home
    - /opt
  top_n: 5
  max_depth: 3
  schedule: "*/10 * * * *"
  alert:
    enabled: true
    condition: "size_gb > 50"
    webhooks:
      - discord

dmesg:
  enabled: true
  schedule: "*/1 * * * *"
  filter_levels:
    - err
    - crit
    - alert
    - emerg
  alert:
    enabled: true
    condition: "level == crit"
    webhooks:
      - discord

docker:
  enabled: true
  containers:
    - "all"
  tail_lines: 100
  schedule: "*/2 * * * *"
  filter_keywords:
    - "error"
    - "fatal"
    - "panic"
    - "OOM"
    - "killed"
    - "segfault"
  alert:
    enabled: true
    condition: "keyword == fatal"
    webhooks:
      - discord

webhooks:
  - name: "discord"
    enabled: true
    url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_TOKEN"
    method: POST
    headers:
      Content-Type: "application/json"
    payload_template: |
      {
        "content": "🚨 {{.Title}}",
        "embeds": [{
          "description": "{{.Message}}",
          "fields": [
            {"name": "Host",     "value": "{{.Hostname}}",  "inline": true},
            {"name": "Type",     "value": "{{.Type}}",      "inline": true},
            {"name": "Severity", "value": "{{.Severity}}",  "inline": true},
            {"name": "Time",     "value": "{{.Timestamp}}", "inline": true}
          ],
          "color": 16711680
        }]
      }

receiver:
  enabled: false  # Set to true only on receiver server
  port: 5170
  auth:
    enabled: true
    token: "your-generated-secret-token-here"  # Must match collector token
  log_output: "/var/log/baadal/events.log"
  log_rotation:
    max_size_mb: 50
    max_backups: 3
    max_age_days: 7
    compress: true
  dead_mans_switch:
    enabled: true
    timeout_minutes: 10
    check_interval: "*/2 * * * *"
    webhooks:
      - discord

node:
  hostname: ""  # Auto-detect
  environment: "production"
  tags:
    - "ubuntu"
    - "web-server"
    - "backend"

🏃 Usage

Collector Mode (on monitored servers)

# Run directly
./baadal --mode=collector --config=config.yml

# Run in background
nohup ./baadal --mode=collector --config=config.yml > /dev/null 2>&1 &

# Check it's running
ps aux | grep baadal

Receiver Mode (on central monitoring server)

# Run directly
./baadal --mode=receiver --config=config.yml

# Run in background
nohup ./baadal --mode=receiver --config=config.yml > /dev/null 2>&1 &

Systemd Installation (Recommended)

# On collector servers
sudo bash install.sh collector

# On receiver server
sudo bash install.sh receiver

# Service management
sudo systemctl start baadal
sudo systemctl enable baadal    # Start on boot
sudo systemctl status baadal
sudo journalctl -u baadal -f    # Follow logs

# Reload configuration without restart
sudo systemctl kill -s HUP baadal

# Restart service
sudo systemctl restart baadal

# Stop service
sudo systemctl stop baadal

Configuration Hot Reload

Baadal supports reloading configuration without restart:

# If running via systemd
sudo systemctl kill -s HUP baadal

# If running manually (get PID first)
ps aux | grep baadal
kill -HUP <PID>

What gets reloaded:

  • ✅ Schedule intervals
  • ✅ Alert thresholds
  • ✅ Webhook configurations
  • ✅ Filter keywords
  • ✅ Paths to monitor
  • ❌ Mode (collector/receiver) - requires restart

🐳 Docker

Using Pre-built Image

# Pull from GitHub Container Registry
docker pull ghcr.io/YOUR_USERNAME/baadhal:latest

# Run collector
docker run -d \
  --name baadal-collector \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v $(pwd)/config.yml:/app/config.yml:ro \
  -e MODE=collector \
  --restart unless-stopped \
  ghcr.io/YOUR_USERNAME/baadhal:latest

# Run receiver
docker run -d \
  --name baadal-receiver \
  -p 5170:5170 \
  -v $(pwd)/config.yml:/app/config.yml:ro \
  -v baadal-logs:/var/log/baadal \
  -e MODE=receiver \
  --restart unless-stopped \
  ghcr.io/YOUR_USERNAME/baadhal:latest

# View logs
docker logs -f baadal-collector
docker logs -f baadal-receiver

# Reload config
docker kill -s HUP baadal-collector

Docker Compose

version: '3.8'

services:
  # Receiver (central monitoring server)
  baadal-receiver:
    image: ghcr.io/YOUR_USERNAME/baadhal:latest
    container_name: baadal-receiver
    ports:
      - "5170:5170"
    volumes:
      - ./config.yml:/app/config.yml:ro
      - baadal-logs:/var/log/baadal
    environment:
      - MODE=receiver
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:5170/health"]
      interval: 30s
      timeout: 3s
      retries: 3

  # Collector (on same server or different server)
  baadal-collector:
    image: ghcr.io/YOUR_USERNAME/baadhal:latest
    container_name: baadal-collector
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./config.yml:/app/config.yml:ro
    environment:
      - MODE=collector
    restart: unless-stopped

volumes:
  baadal-logs:

📊 Event Types

| Type             | Description               | Data Fields                                               | Triggers                |
|------------------|---------------------------|-----------------------------------------------------------|-------------------------|
| disk_usage       | Directory size monitoring | top_dirs, scan_root, total_scanned_gb                     | When scanned            |
| dmesg            | Kernel message monitoring | level, message, kernel_ts                                 | On new kernel messages  |
| docker_log       | Container log monitoring  | container, line, matched_keyword                          | On keyword match        |
| heartbeat        | Periodic health signal    | status, uptime_seconds                                    | On schedule             |
| lifecycle        | Start/stop events         | event, mode, version, uptime_seconds                      | On start/stop           |
| collector_stats  | Performance metrics       | disk_scan_ms, events_sent, events_deduped, cycle_total_ms | Every 5 min             |
| dead_mans_switch | Missing host alert        | N/A (webhook only)                                        | Receiver detects silence |

Example Event JSON

Disk Usage Event:

{
  "timestamp": "2026-03-06T18:45:00+05:30",
  "type": "disk_usage",
  "host": "web-server-01",
  "environment": "production",
  "tags": ["ubuntu", "backend"],
  "data": {
    "top_dirs": [
      {
        "path": "/var/lib/docker",
        "size_gb": 45.3,
        "size_mb": 46387,
        "rank": 1
      }
    ],
    "scan_root": "/var",
    "total_scanned_gb": 67.8
  },
  "alert_triggered": true,
  "alert_condition": "size_gb > 20"
}

Docker Log Event:

{
  "timestamp": "2026-03-06T18:50:12+05:30",
  "type": "docker_log",
  "host": "api-server-02",
  "environment": "production",
  "tags": ["nodejs", "api"],
  "data": {
    "container": "api-backend",
    "line": "Fatal error: Cannot connect to database",
    "matched_keyword": "fatal"
  },
  "alert_triggered": true,
  "alert_condition": "keyword == fatal"
}

Heartbeat Event:

{
  "timestamp": "2026-03-06T18:55:00+05:30",
  "type": "heartbeat",
  "host": "worker-01",
  "environment": "production",
  "tags": ["worker", "celery"],
  "data": {
    "status": "alive",
    "uptime_seconds": 86400
  },
  "alert_triggered": false
}
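
Consumers of events.log may want to decode these events in their own tooling. The Go sketch below mirrors the JSON fields shown in the examples above; the struct and function names are hypothetical, and keeping `data` as raw JSON until the event type is known is a design assumption that suits the per-type payloads.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event (hypothetical) mirrors the common envelope of the example events.
type Event struct {
	Timestamp      time.Time       `json:"timestamp"`
	Type           string          `json:"type"`
	Host           string          `json:"host"`
	Environment    string          `json:"environment"`
	Tags           []string        `json:"tags"`
	Data           json.RawMessage `json:"data"` // decoded later, per event type
	AlertTriggered bool            `json:"alert_triggered"`
	AlertCondition string          `json:"alert_condition,omitempty"`
}

// HeartbeatData matches the data object of a heartbeat event.
type HeartbeatData struct {
	Status        string `json:"status"`
	UptimeSeconds int64  `json:"uptime_seconds"`
}

const sample = `{
  "timestamp": "2026-03-06T18:55:00+05:30",
  "type": "heartbeat",
  "host": "worker-01",
  "environment": "production",
  "tags": ["worker", "celery"],
  "data": {"status": "alive", "uptime_seconds": 86400},
  "alert_triggered": false
}`

// decodeHeartbeat parses the envelope, then the heartbeat-specific data.
func decodeHeartbeat(raw []byte) (Event, HeartbeatData, error) {
	var ev Event
	var hb HeartbeatData
	if err := json.Unmarshal(raw, &ev); err != nil {
		return ev, hb, err
	}
	if ev.Type != "heartbeat" {
		return ev, hb, fmt.Errorf("unexpected event type %q", ev.Type)
	}
	err := json.Unmarshal(ev.Data, &hb)
	return ev, hb, err
}

func main() {
	ev, hb, err := decodeHeartbeat([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s from %s: %s, up %ds\n", ev.Type, ev.Host, hb.Status, hb.UptimeSeconds)
	// heartbeat from worker-01: alive, up 86400s
}
```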

🔧 Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Monitored Servers                          │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐         │
│  │ Collector #1 │  │ Collector #2 │  │ Collector #N │         │
│  │              │  │              │  │              │         │
│  │ - Disk scan  │  │ - Disk scan  │  │ - Disk scan  │         │
│  │ - Dmesg      │  │ - Dmesg      │  │ - Dmesg      │         │
│  │ - Docker     │  │ - Docker     │  │ - Docker     │         │
│  │ - Heartbeat  │  │ - Heartbeat  │  │ - Heartbeat  │         │
│  │ - Dedup      │  │ - Dedup      │  │ - Dedup      │         │
│  └───────┬──────┘  └───────┬──────┘  └───────┬──────┘         │
│          │                 │                  │                │
└──────────┼─────────────────┼──────────────────┼────────────────┘
           │                 │                  │
           │   Batched       │   Batched        │   Batched
           │   Events        │   Events         │   Events
           │   (HTTP/JSON)   │   (HTTP/JSON)    │   (HTTP/JSON)
           │                 │                  │
           └─────────────────┼──────────────────┘
                             │
                             ▼
              ┌──────────────────────────────┐
              │    Central Receiver Server   │
              │                              │
              │  - HTTP endpoint (port 5170) │
              │  - Authentication            │
              │  - Log rotation (lumberjack) │
              │  - Dead man's switch         │
              │  - Webhook dispatcher        │
              └───────────┬──────────────────┘
                          │
                          ├─────────────┬──────────────┐
                          ▼             ▼              ▼
                   ┌──────────┐  ┌──────────┐  ┌──────────┐
                   │ Discord  │  │  Slack   │  │ Custom   │
                   │ Webhook  │  │ Webhook  │  │   API    │
                   └──────────┘  └──────────┘  └──────────┘

Data Flow

  1. Collection: Collectors run scheduled jobs (cron)
  2. Deduplication: Events checked against recent history
  3. Batching: Events accumulated until batch_size or flush_interval
  4. Transport: HTTP POST to receiver with optional auth
  5. Logging: Receiver writes to rotated log file
  6. Alerting: Matching conditions trigger webhooks
  7. Dead Man's Switch: Receiver monitors for missing collectors

Performance Characteristics

Collector (per server):

  • CPU: < 1% average
  • Memory: ~15-20 MB
  • Disk I/O: Minimal (reads only during scans)
  • Network: < 1 KB/min average

Receiver:

  • CPU: < 1% average
  • Memory: ~20-30 MB + log buffer
  • Disk I/O: Sequential writes only
  • Network: Depends on number of collectors

🛠️ Development

Requirements

  • Go 1.24+
  • Docker (for container log monitoring)

Run Tests

go test ./...
go vet ./...
gofmt -s -l .

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing)
  3. Commit your changes (git commit -am 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing)
  5. Open a Pull Request

📝 License

MIT License - see LICENSE file for details.

Made with ☁️ by the Baadal Team
