# SOGOMS Vigil - Multi-Host Monitoring Platform

> Real-time monitoring of servers and containers with Go agents and a Flutter dashboard

**Website**: https://vigil.sogoms.com

---

## 1. Overview

### 1.1 Concept

**SOGOMS Vigil** is an extension of the SOGOMS platform dedicated to monitoring. It monitors servers, containers, and services in real time through lightweight Go agents, and presents the results in a Flutter interface.

### 1.2 Components

| Component | Description |
|-----------|-------------|
| **Sovigilant** | Lightweight Go agent installed on each monitored host |
| **Vigil Server** | Central server built on the SOGOMS architecture |
| **Vigil Dashboard** | Flutter application (Web/Mobile) for visualization |

### 1.3 Features

- Real-time monitoring (CPU, RAM, disk, network)
- Container monitoring (Incus, Docker)
- Service monitoring (nginx, mariadb, redis...)
- Configurable alerting (email, Slack, webhook)
- Remote actions (restart, exec, deploy)
- Multi-tenant (each customer sees only its own hosts)
- History and charts
- Container auto-discovery

---

## 2. Architecture

### 2.1 High-level view

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               MONITORED HOSTS                               │
│                                                                             │
│   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐             │
│   │    Server A     │  │    Server B     │  │    Server C     │  ...        │
│   │                 │  │                 │  │                 │             │
│   │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │             │
│   │  │Sovigilant │  │  │  │Sovigilant │  │  │  │Sovigilant │  │             │
│   │  │   Agent   │  │  │  │   Agent   │  │  │  │   Agent   │  │             │
│   │  └─────┬─────┘  │  │  └─────┬─────┘  │  │  └─────┬─────┘  │             │
│   │        │        │  │        │        │  │        │        │             │
│   └────────┼────────┘  └────────┼────────┘  └────────┼────────┘             │
│            │                    │                    │                      │
└────────────┼────────────────────┼────────────────────┼──────────────────────┘
             │                    │                    │
             │    gRPC stream (TLS, bidirectional)     │
             │                    │                    │
             └──────────────────┬─┴────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             VIGIL SERVER (gw3)                              │
│                          13.23.33.5 - Alpine 3.21                           │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                               Sogoctl                               │   │
│   │                            (Supervisor)                             │   │
│   └─────────────────────────────┬───────────────────────────────────────┘   │
│                                 │                                           │
│   ┌─────────────────────────────┴───────────────────────────────────────┐   │
│   │                                                                     │   │
│   │  Sogoway                                                            │   │
│   │  ├── REST API    :8080     (Flutter, integrations)                  │   │
│   │  ├── WebSocket   :8080/ws  (real-time Flutter)                      │   │
│   │  └── gRPC        :9090     (Sovigilant agents)                      │   │
│   │                                                                     │   │
│   │  Sogorch                                                            │   │
│   │  └── Action orchestration (restart, deploy, alerting)               │   │
│   │                                                                     │   │
│   │      ┌─────────────┬─────────────┬─────────────┬─────────────┐      │   │
│   │      │sogoms-      │sogoms-      │sogoms-db    │sogoms-      │      │   │
│   │      │  collect    │  alert      │             │  action     │      │   │
│   │      │             │             │             │             │      │   │
│   │      │Aggregates   │Evaluates    │Stores       │Executes     │      │   │
│   │      │metrics      │rules        │metrics      │commands     │      │   │
│   │      │Buffering    │Notifications│History      │on agents    │      │   │
│   │      └─────────────┴─────────────┴─────────────┴─────────────┘      │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│   ┌─────────────────┐  ┌─────────────────┐                                  │
│   │     maria3      │  │     redis3      │                                  │
│   │   13.23.33.4    │  │   13.23.33.6    │                                  │
│   │     MariaDB     │  │      Redis      │                                  │
│   │   (metrics,     │  │    (cache,      │                                  │
│   │    config)      │  │    pub/sub)     │                                  │
│   └─────────────────┘  └─────────────────┘                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
                                │
                                │ WebSocket (real time)
                                │ REST API
                                ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                               VIGIL DASHBOARD                               │
│                            Flutter Web / Mobile                             │
│                                                                             │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  │
│  │ Dashboard │  │   Hosts   │  │Containers │  │  Alerts   │  │  Actions  │  │
│  │ Overview  │  │  Details  │  │ Services  │  │  History  │  │  Scripts  │  │
│  └───────────┘  └───────────┘  └───────────┘  └───────────┘  └───────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Data flow

```
1. Sovigilant collects metrics (every 10s)
2. Sovigilant → gRPC stream → Sogoway
3. Sogoway → sogoms-collect (aggregation)
4. sogoms-collect → sogoms-db (storage)
5. sogoms-collect → Redis pub/sub (real time)
6. sogoms-alert evaluates the rules
7. On alert → sogoms-alert sends notifications (email, Slack...)
8. Redis → WebSocket → Flutter (real time)
9. Flutter ← REST API ← Sogoway (history, config)
```

---

## 3. Sovigilant (Agent)

### 3.1 Characteristics

| Aspect | Value |
|--------|-------|
| **Size** | ~5-8 MB (static binary) |
| **RAM** | ~10-20 MB while running |
| **CPU** | < 1% on average |
| **Interval** | Configurable (default 10s) |
| **Protocol** | gRPC over TLS |
| **Auto-update** | Yes |

### 3.2 Installation

```bash
# Download and install
curl -fsSL https://vigil.sogoms.com/install.sh | sh

# Or manually
wget https://vigil.sogoms.com/releases/sovigilant-linux-amd64
chmod +x sovigilant-linux-amd64
mv sovigilant-linux-amd64 /usr/local/bin/sovigilant

# Configure
sovigilant init --server vigil.sogoms.com --token <TOKEN>

# Start as a service
sovigilant install-service
systemctl enable sovigilant
systemctl start sovigilant
```

### 3.3 Configuration

```yaml
# /etc/sovigilant/config.yaml

agent:
  id: auto                          # Auto-generated or set explicitly
  name: "{{hostname}}"              # Display name

server:
  address: vigil.sogoms.com:9090
  token: "eyJhbGciOiJIUzI1NiIs..."  # Auth token (contains tenant_id)
  tls:
    enabled: true
    ca_cert: /etc/sovigilant/ca.pem # Optional with a public CA

collection:
  interval: 10s                     # Collection interval

  host:
    enabled: true
    metrics:
      - cpu
      - memory
      - disk
      - network
      - load
      - uptime

  containers:
    enabled: true
    runtime: auto                   # auto | incus | docker | podman
    metrics:
      - cpu
      - memory
      - disk
      - network
      - status

  services:
    enabled: true
    discover: true                  # Auto-discovery
    watch:                          # Explicitly watched services
      - nginx
      - mariadb
      - redis
      - php-fpm
    custom:                         # Custom services
      - name: myapp
        check: systemctl is-active myapp
        port: 8080

  custom_metrics:                   # Custom metrics
    - name: nginx_connections
      # Assumes nginx exposes stub_status at /nginx_status
      # ("nginx -s status" is not a valid nginx signal)
      command: "curl -s http://127.0.0.1/nginx_status | awk '/Active connections/ {print $3}'"
      type: gauge
      interval: 30s

logging:
  level: info
  output: /var/log/sovigilant.log
  max_size: 10MB
  max_files: 5

actions:
  enabled: true                     # Allow remote actions
  allowed:                          # Command allow list
    - "systemctl restart *"
    - "systemctl status *"
    - "docker restart *"
    - "incus restart *"
  denied:                           # Command deny list
    - "rm -rf *"
    - "shutdown*"
    - "reboot*"
```
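The `allowed` and `denied` lists above could be enforced on the agent with a small pattern matcher. The Go sketch below is illustrative only. It assumes that `*` matches any run of characters, that a deny match always wins over an allow match, and that a command matching neither list is rejected; none of these rules is stated in this document, and `matchPattern` and `isAllowed` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// matchPattern reports whether cmd matches pattern. Assumption: '*' matches
// any run of characters, including spaces and slashes.
func matchPattern(pattern, cmd string) bool {
	parts := strings.Split(pattern, "*")
	if len(parts) == 1 {
		return pattern == cmd // no wildcard: exact match only
	}
	if !strings.HasPrefix(cmd, parts[0]) {
		return false
	}
	rest := cmd[len(parts[0]):]
	// Literals between wildcards must appear in order.
	for _, p := range parts[1 : len(parts)-1] {
		i := strings.Index(rest, p)
		if i < 0 {
			return false
		}
		rest = rest[i+len(p):]
	}
	return strings.HasSuffix(rest, parts[len(parts)-1])
}

// isAllowed checks the deny list first, then the allow list.
// A command matching neither list is rejected (assumed default).
func isAllowed(allowed, denied []string, cmd string) bool {
	for _, p := range denied {
		if matchPattern(p, cmd) {
			return false
		}
	}
	for _, p := range allowed {
		if matchPattern(p, cmd) {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"systemctl restart *", "systemctl status *", "docker restart *", "incus restart *"}
	denied := []string{"rm -rf *", "shutdown*", "reboot*"}
	for _, cmd := range []string{"systemctl restart nginx", "rm -rf /", "ls -la"} {
		fmt.Printf("%-26q allowed=%v\n", cmd, isAllowed(allowed, denied, cmd))
	}
}
```

Pattern matching on a raw command string is easy to bypass (for example with shell chaining), so a production agent would more likely parse the command into arguments before checking it.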

### 3.4 Collected metrics

```yaml
# Structure of the metrics sent by the agent

host:
  timestamp: "2025-01-15T10:30:00Z"
  uptime: 864000

  cpu:
    percent: 23.5
    cores: 8
    model: "Intel Xeon E5-2680"
    per_core: [20.1, 25.3, 22.0, ...]

  memory:
    total: 8589934592
    used: 4293918720
    available: 4296015872
    percent: 50.0
    swap_total: 2147483648
    swap_used: 0

  disk:
    mounts:
      - path: /
        total: 107374182400
        used: 53687091200
        percent: 50.0
        fs_type: ext4
      - path: /data
        total: 536870912000
        used: 107374182400
        percent: 20.0
        fs_type: ext4

  network:
    interfaces:
      - name: eth0
        rx_bytes: 123456789012
        tx_bytes: 987654321098
        rx_packets: 12345678
        tx_packets: 9876543
        rx_errors: 0
        tx_errors: 0

  load:
    load_1m: 0.52
    load_5m: 0.48
    load_15m: 0.45

containers:
  - name: maria3
    runtime: incus
    status: running
    created: "2025-01-01T00:00:00Z"
    image: alpine/3.21

    cpu:
      percent: 5.2

    memory:
      used: 536870912
      limit: 2147483648
      percent: 25.0

    disk:
      used: 10737418240

    network:
      rx_bytes: 12345678
      tx_bytes: 87654321

    processes: 45

  - name: gw3
    runtime: incus
    status: running
    # ...

services:
  - name: nginx
    status: active
    pid: 1234
    memory: 52428800
    cpu_percent: 0.5
    ports: [80, 443]
    uptime: 864000

  - name: mariadb
    status: active
    pid: 5678
    memory: 268435456
    cpu_percent: 2.1
    ports: [3306]
    connections: 15

  - name: php-fpm
    status: active
    pid: 9012
    memory: 134217728
    workers_active: 5
    workers_idle: 15
```

### 3.5 Sovigilant CLI

```bash
# Agent status
$ sovigilant status
Agent ID:     host-abc123
Status:       connected
Server:       vigil.sogoms.com:9090
Uptime:       5d 12h 30m
Last push:    2s ago
Metrics sent: 45,230

# Test collection
$ sovigilant collect --once
Collecting host metrics... OK
Collecting container metrics... OK (5 containers)
Collecting service metrics... OK (4 services)

# Show current metrics
$ sovigilant metrics
HOST
  CPU:     23.5%
  Memory:  4.0GB / 8.0GB (50%)
  Disk /:  50GB / 100GB (50%)
  Load:    0.52 0.48 0.45

CONTAINERS
  maria3   running  CPU: 5.2%   MEM: 512MB/2GB
  gw3      running  CPU: 12.1%  MEM: 256MB/1GB
  redis3   running  CPU: 0.5%   MEM: 64MB/512MB

SERVICES
  nginx    active  PID: 1234  MEM: 50MB
  mariadb  active  PID: 5678  MEM: 256MB
  php-fpm  active  PID: 9012  MEM: 128MB

# Logs
$ sovigilant logs --follow

# Update
$ sovigilant update
Current version: 1.2.0
Latest version:  1.3.0
Downloading... OK
Installing... OK
Restarting... OK

# Reset
$ sovigilant reset
```

---

## 4. Vigil Server

### 4.1 Binaries

| Binary | Role | Port/Socket |
|--------|------|-------------|
| **sogoctl** | Supervisor | TCP :9000 |
| **sogoway** | REST API + WebSocket + gRPC | TCP :8080, :9090 |
| **sogorch** | Action orchestration | Unix socket |
| **sogoms-collect** | Metrics aggregation | Unix socket |
| **sogoms-alert** | Alert evaluation | Unix socket |
| **sogoms-db** | MariaDB storage | Unix socket |
| **sogoms-action** | Command execution | Unix socket |

### 4.2 Sogoway - Endpoints

#### REST API

```yaml
# Authentication
POST   /api/auth/login               # Log in, returns a JWT
POST   /api/auth/refresh             # Refresh the token
POST   /api/auth/logout              # Log out

# Hosts
GET    /api/hosts                    # List the tenant's hosts
GET    /api/hosts/{id}               # Host details
GET    /api/hosts/{id}/metrics       # Host metrics
GET    /api/hosts/{id}/containers    # Host containers
GET    /api/hosts/{id}/services      # Host services
POST   /api/hosts/{id}/action        # Run an action

# Containers
GET    /api/containers               # All containers
GET    /api/containers/{id}          # Container details
GET    /api/containers/{id}/metrics  # Container metrics
POST   /api/containers/{id}/action   # Action (start/stop/restart)

# Alerts
GET    /api/alerts                   # Active alerts
GET    /api/alerts/history           # Alert history
GET    /api/alerts/rules             # Alert rules
POST   /api/alerts/rules             # Create a rule
PUT    /api/alerts/rules/{id}        # Update a rule
DELETE /api/alerts/rules/{id}        # Delete a rule
POST   /api/alerts/{id}/acknowledge  # Acknowledge an alert

# Aggregated metrics
GET    /api/metrics/overview         # Global overview
GET    /api/metrics/history          # History (query params)

# Configuration
GET    /api/config/tenant            # Tenant configuration
PUT    /api/config/tenant            # Update the configuration
GET    /api/config/notifications     # Notification channels
PUT    /api/config/notifications     # Update the channels
```

#### WebSocket

```yaml
# Connection
ws://vigil.sogoms.com:8080/ws?token=<JWT>

# Server → client messages
{
  "type": "metrics",
  "host_id": "host-abc123",
  "timestamp": "2025-01-15T10:30:00Z",
  "data": {
    "cpu": 23.5,
    "memory": 50.0,
    "containers": [
      {"name": "gw3", "cpu": 12.1, "memory": 25.0, "status": "running"}
    ]
  }
}

{
  "type": "alert",
  "id": "alert-xyz789",
  "severity": "warning",
  "host_id": "host-abc123",
  "message": "CPU > 80% for 5 minutes",
  "timestamp": "2025-01-15T10:30:00Z"
}

{
  "type": "host_status",
  "host_id": "host-abc123",
  "status": "offline",  # online | offline | degraded
  "timestamp": "2025-01-15T10:30:00Z"
}

# Client → server messages
{
  "type": "subscribe",
  "hosts": ["host-abc123", "host-def456"]  # Optional, default = all
}

{
  "type": "unsubscribe",
  "hosts": ["host-abc123"]
}
```
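Every server message above carries a `type` field, so a client can decode in two passes: first read the type, then decode the full payload into the matching structure. The Go sketch below illustrates that pattern for the `alert` and `host_status` messages. The struct and function names are hypothetical, and only the JSON field names come from the examples above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope reads only the discriminator field shared by all messages.
type envelope struct {
	Type string `json:"type"`
}

type alertMessage struct {
	ID       string `json:"id"`
	Severity string `json:"severity"`
	HostID   string `json:"host_id"`
	Message  string `json:"message"`
}

type hostStatusMessage struct {
	HostID string `json:"host_id"`
	Status string `json:"status"`
}

// describe decodes one server message and returns a one-line summary.
func describe(raw []byte) (string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return "", err
	}
	switch env.Type {
	case "alert":
		var m alertMessage
		if err := json.Unmarshal(raw, &m); err != nil {
			return "", err
		}
		return fmt.Sprintf("[%s] %s: %s", m.Severity, m.HostID, m.Message), nil
	case "host_status":
		var m hostStatusMessage
		if err := json.Unmarshal(raw, &m); err != nil {
			return "", err
		}
		return fmt.Sprintf("%s is %s", m.HostID, m.Status), nil
	default:
		return "", fmt.Errorf("unsupported message type %q", env.Type)
	}
}

func main() {
	raw := []byte(`{"type":"host_status","host_id":"host-abc123","status":"offline","timestamp":"2025-01-15T10:30:00Z"}`)
	summary, err := describe(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(summary)
}
```

Returning an error for unknown types keeps the client explicit about which messages it handles; a dashboard might prefer to ignore them silently so that new server message types do not break older clients.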

#### gRPC (Agents)

```protobuf
// vigil.proto

syntax = "proto3";
package vigil;

service VigilService {
  // Bidirectional stream: metrics ↑ commands ↓
  rpc Connect(stream AgentMessage) returns (stream ServerMessage);

  // Initial registration
  rpc Register(RegisterRequest) returns (RegisterResponse);
}

message AgentMessage {
  string agent_id = 1;
  oneof payload {
    Metrics metrics = 2;
    ActionResult action_result = 3;
    Heartbeat heartbeat = 4;
  }
}

message ServerMessage {
  oneof payload {
    Action action = 1;
    ConfigUpdate config = 2;
    Ack ack = 3;
  }
}

message Metrics {
  string timestamp = 1;
  HostMetrics host = 2;
  repeated ContainerMetrics containers = 3;
  repeated ServiceMetrics services = 4;
}

message Action {
  string id = 1;
  string type = 2;  // restart_service, restart_container, exec
  map<string, string> params = 3;
}

message ActionResult {
  string action_id = 1;
  bool success = 2;
  string output = 3;
  string error = 4;
}
```

### 4.3 Sogoms-collect

Aggregates and buffers metrics before they are stored.

```yaml
# Actions
actions:
  - ingest        # Receive metrics from an agent
  - query_latest  # Latest metrics for a host
  - query_range   # Metrics over a time range
  - downsample    # Aggregation for long-term history
```

```yaml
# Config
collect:
  buffer:
    size: 1000              # Metrics held in memory before a flush
    flush_interval: 5s

  retention:
    raw: 7d                 # Raw metrics (10s interval)
    hourly: 90d             # Aggregated per hour
    daily: 2y               # Aggregated per day

  downsample:
    enabled: true
    schedule: "0 * * * *"   # Every hour
```

### 4.4 Sogoms-alert

Evaluates alert rules and sends notifications.

```yaml
# Actions
actions:
  - evaluate     # Evaluate a metric against the rules
  - notify       # Send a notification
  - acknowledge  # Acknowledge an alert
  - resolve      # Resolve an alert
```

```yaml
# Alert rules
rules:
  - id: cpu_high
    name: "High CPU"
    condition: "host.cpu.percent > 80"
    duration: 5m              # Must hold for 5 minutes
    severity: warning
    channels: [email, slack]
    message: "CPU at {{value}}% on {{host.name}}"

  - id: disk_critical
    name: "Disk critical"
    condition: "host.disk.percent > 95"
    duration: 1m
    severity: critical
    channels: [email, slack, pagerduty]
    message: "Disk {{mount}} at {{value}}% on {{host.name}}"

  - id: container_down
    name: "Container stopped"
    condition: "container.status != 'running'"
    duration: 30s
    severity: critical
    channels: [email, slack]
    message: "Container {{container.name}} stopped on {{host.name}}"

  - id: service_down
    name: "Service stopped"
    condition: "service.status != 'active'"
    duration: 1m
    severity: critical
    channels: [email]
    message: "Service {{service.name}} stopped on {{host.name}}"
```

```yaml
# Notification channels
channels:
  email:
    type: smtp
    config:
      host: smtp.example.com
      port: 587
      user: alerts@example.com
      from: "SOGOMS Vigil <alerts@sogoms.com>"

  slack:
    type: webhook
    config:
      url: https://hooks.slack.com/services/xxx/yyy/zzz
      channel: "#alerts"

  pagerduty:
    type: pagerduty
    config:
      service_key: "xxxxx"

  webhook:
    type: http
    config:
      url: https://my-service.com/webhook
      method: POST
      headers:
        Authorization: "Bearer xxx"
```

### 4.5 Sogoms-action

Executes commands on remote agents.

```yaml
# Actions
actions:
  - exec               # Run a command
  - restart_service    # Restart a service
  - restart_container  # Restart a container
  - deploy             # Deploy (via a scenario)
```

```yaml
# Automatic action scenario
name: auto_restart_on_high_memory
trigger:
  alert: memory_high

steps:
  - id: restart_php
    service: action
    action: restart_service
    params:
      host_id: "{{alert.host_id}}"
      service: php-fpm

  - id: notify
    service: alert
    action: notify
    params:
      channel: slack
      message: "PHP-FPM restarted automatically on {{alert.host.name}}"
```
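Scenario parameters and alert messages both use `{{...}}` placeholders. The Go sketch below shows one simple way such placeholders could be resolved from a flat map of values. The `render` function and the choice to leave unknown placeholders untouched are assumptions made for illustration; the document does not specify the template engine.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {{ name }} where name is made of word characters and dots.
var placeholder = regexp.MustCompile(`\{\{\s*([\w.]+)\s*\}\}`)

// render replaces each known placeholder with its value from vars.
// Unknown placeholders are left as-is so that missing data stays visible.
func render(tmpl string, vars map[string]string) string {
	return placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := placeholder.FindStringSubmatch(m)[1]
		if v, ok := vars[key]; ok {
			return v
		}
		return m
	})
}

func main() {
	vars := map[string]string{
		"alert.host_id":   "host-abc123",
		"alert.host.name": "IN3",
	}
	fmt.Println(render("host_id={{alert.host_id}}", vars))
	fmt.Println(render("PHP-FPM restarted automatically on {{alert.host.name}}", vars))
}
```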

---

## 5. Database Schema

### 5.1 Main tables

```sql
-- Registered hosts
CREATE TABLE hosts (
    id VARCHAR(50) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    name VARCHAR(100) NOT NULL,
    hostname VARCHAR(255),
    ip_address VARCHAR(45),
    os VARCHAR(100),
    arch VARCHAR(20),
    agent_version VARCHAR(20),
    status ENUM('online', 'offline', 'degraded') DEFAULT 'offline',
    last_seen DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    config JSON,

    INDEX idx_tenant (tenant_id),
    INDEX idx_status (tenant_id, status)
);

-- Host metrics (partitioned)
CREATE TABLE metrics_host (
    id BIGINT AUTO_INCREMENT,
    host_id VARCHAR(50) NOT NULL,
    timestamp DATETIME NOT NULL,
    cpu_percent DECIMAL(5,2),
    memory_used BIGINT,
    memory_total BIGINT,
    disk_used BIGINT,
    disk_total BIGINT,
    load_1m DECIMAL(5,2),
    load_5m DECIMAL(5,2),
    load_15m DECIMAL(5,2),
    network_rx BIGINT,
    network_tx BIGINT,

    PRIMARY KEY (id, timestamp),
    INDEX idx_host_time (host_id, timestamp DESC)
) PARTITION BY RANGE (TO_DAYS(timestamp)) (
    PARTITION p_current VALUES LESS THAN (TO_DAYS(DATE_ADD(CURDATE(), INTERVAL 1 DAY))),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Container metrics
CREATE TABLE metrics_container (
    id BIGINT AUTO_INCREMENT,
    host_id VARCHAR(50) NOT NULL,
    container_name VARCHAR(100) NOT NULL,
    timestamp DATETIME NOT NULL,
    status VARCHAR(20),
    cpu_percent DECIMAL(5,2),
    memory_used BIGINT,
    memory_limit BIGINT,
    disk_used BIGINT,
    network_rx BIGINT,
    network_tx BIGINT,

    PRIMARY KEY (id, timestamp),
    INDEX idx_container_time (host_id, container_name, timestamp DESC)
) PARTITION BY RANGE (TO_DAYS(timestamp)) (...);

-- Service metrics
CREATE TABLE metrics_service (
    id BIGINT AUTO_INCREMENT,
    host_id VARCHAR(50) NOT NULL,
    service_name VARCHAR(100) NOT NULL,
    timestamp DATETIME NOT NULL,
    status VARCHAR(20),
    pid INT,
    memory_used BIGINT,
    cpu_percent DECIMAL(5,2),

    PRIMARY KEY (id, timestamp),
    INDEX idx_service_time (host_id, service_name, timestamp DESC)
) PARTITION BY RANGE (TO_DAYS(timestamp)) (...);

-- Aggregated metrics (hourly)
CREATE TABLE metrics_host_hourly (
    host_id VARCHAR(50) NOT NULL,
    hour DATETIME NOT NULL,
    cpu_avg DECIMAL(5,2),
    cpu_max DECIMAL(5,2),
    memory_avg BIGINT,
    memory_max BIGINT,
    disk_avg BIGINT,
    network_rx_total BIGINT,
    network_tx_total BIGINT,

    PRIMARY KEY (host_id, hour),
    INDEX idx_hour (hour)
);

-- Alerts
CREATE TABLE alerts (
    id VARCHAR(50) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    host_id VARCHAR(50),
    container_name VARCHAR(100),
    service_name VARCHAR(100),
    rule_id VARCHAR(50) NOT NULL,
    severity ENUM('info', 'warning', 'critical') NOT NULL,
    status ENUM('firing', 'acknowledged', 'resolved') DEFAULT 'firing',
    message TEXT,
    value VARCHAR(50),
    fired_at DATETIME NOT NULL,
    acknowledged_at DATETIME,
    acknowledged_by VARCHAR(100),
    resolved_at DATETIME,

    INDEX idx_tenant_status (tenant_id, status),
    INDEX idx_host (host_id),
    INDEX idx_fired (fired_at DESC)
);

-- Alert history
CREATE TABLE alerts_history (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    alert_id VARCHAR(50) NOT NULL,
    tenant_id VARCHAR(50) NOT NULL,
    host_id VARCHAR(50),
    rule_id VARCHAR(50) NOT NULL,
    severity ENUM('info', 'warning', 'critical') NOT NULL,
    message TEXT,
    fired_at DATETIME NOT NULL,
    resolved_at DATETIME,
    duration_seconds INT,

    INDEX idx_tenant_time (tenant_id, fired_at DESC)
);

-- Alert rules
CREATE TABLE alert_rules (
    id VARCHAR(50) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    `condition` TEXT NOT NULL,  -- CONDITION is a reserved word, hence the backticks
    duration_seconds INT DEFAULT 60,
    severity ENUM('info', 'warning', 'critical') NOT NULL,
    channels JSON NOT NULL,
    message_template TEXT,
    enabled BOOLEAN DEFAULT TRUE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME ON UPDATE CURRENT_TIMESTAMP,

    INDEX idx_tenant (tenant_id)
);

-- Notification channels
CREATE TABLE notification_channels (
    id VARCHAR(50) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    name VARCHAR(100) NOT NULL,
    type ENUM('email', 'slack', 'webhook', 'pagerduty') NOT NULL,
    config JSON NOT NULL,
    enabled BOOLEAN DEFAULT TRUE,

    INDEX idx_tenant (tenant_id)
);

-- Executed actions
CREATE TABLE actions_log (
    id VARCHAR(50) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    host_id VARCHAR(50) NOT NULL,
    user_id VARCHAR(50),
    action_type VARCHAR(50) NOT NULL,
    params JSON,
    status ENUM('pending', 'running', 'success', 'failed') DEFAULT 'pending',
    output TEXT,
    error TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    completed_at DATETIME,

    INDEX idx_tenant (tenant_id),
    INDEX idx_host (host_id)
);

-- Agent tokens
CREATE TABLE agent_tokens (
    token_hash VARCHAR(64) PRIMARY KEY,
    tenant_id VARCHAR(50) NOT NULL,
    host_id VARCHAR(50),
    name VARCHAR(100),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    expires_at DATETIME,
    last_used DATETIME,

    INDEX idx_tenant (tenant_id)
);
```
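The `agent_tokens` table stores a `token_hash` of 64 characters rather than the token itself. That width is consistent with a hex-encoded SHA-256 digest, although the document does not name the hash function. Assuming SHA-256, the lookup key could be derived as in the Go sketch below, where `hashToken` is a hypothetical helper. Storing only the digest means a leaked database dump does not expose usable agent tokens.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of an agent token.
// Assumption: this is the value stored in agent_tokens.token_hash.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

func main() {
	h := hashToken("example-agent-token")
	// A SHA-256 digest is 32 bytes, so its hex form is 64 characters.
	fmt.Println(len(h), h)
}
```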

---

## 6. Vigil Dashboard (Flutter)

### 6.1 Main screens

```
┌─────────────────────────────────────────────────────────────────────────┐
│  DASHBOARD                                                    [user] ▼  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │
│   │   5 Hosts   │  │12 Containers│  │  2 Alerts   │  │    99.9%    │    │
│   │   Online    │  │   Running   │  │   Warning   │  │   Uptime    │    │
│   └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘    │
│                                                                         │
│  HOSTS                                                    [+ Add Host]  │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ ● IN3 (prod-1)               CPU: ████░░ 45%   MEM: ██████░ 70% │   │
│   │   5 containers, 4 services   Disk: ████░░░ 50%                  │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │ ● IN4 (prod-2)               CPU: ██░░░░ 23%   MEM: ████░░░ 50% │   │
│   │   3 containers, 3 services   Disk: ██░░░░░ 30%                  │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │ ○ DEV-1 (offline)            Last seen: 5 minutes ago           │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  ACTIVE ALERTS                                                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ ⚠ WARNING  CPU > 80% on IN3 for 10 min                    [Ack] │   │
│   │ ⚠ WARNING  Disk > 85% on IN4/data for 2h                  [Ack] │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  [Dashboard] [Hosts] [Containers] [Alerts] [Actions] [Settings]         │
└─────────────────────────────────────────────────────────────────────────┘
```

### 6.2 Host details

```
┌─────────────────────────────────────────────────────────────────────────┐
│  ← HOSTS / IN3 (prod-1)                                    [Actions ▼]  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  STATUS: ● Online        UPTIME: 45 days         AGENT: v1.3.0          │
│  OS: Debian 13           IP: 192.168.1.10        ARCH: amd64            │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │  CPU                           │  MEMORY                        │   │
│   │  ████████████░░░░░░░░ 45%      │  ██████████████░░░░░░ 70%      │   │
│   │  [Graph 24h ~~~~~~~~~~~~~~~]   │  [Graph 24h ~~~~~~~~~~~~~~~]   │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │  DISK /                        │  DISK /data                    │   │
│   │  ██████████░░░░░░░░░░ 50%      │  ██████░░░░░░░░░░░░░░ 30%      │   │
│   │  50GB / 100GB                  │  150GB / 500GB                 │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  CONTAINERS                                                             │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ ● gw3     CPU: 12%  MEM: 256MB/1GB     ▸ restart │ logs │ ●●●   │   │
│   │ ● maria3  CPU:  5%  MEM: 512MB/2GB     ▸ restart │ logs │ ●●●   │   │
│   │ ● redis3  CPU:  1%  MEM: 64MB/512MB    ▸ restart │ logs │ ●●●   │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  SERVICES                                                               │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ ● nginx   active  PID: 1234  MEM: 50MB   ▸ restart │ status     │   │
│   │ ● sshd    active  PID: 567   MEM: 5MB    ▸ restart │ status     │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### 6.3 Flutter architecture

```
lib/
├── main.dart
├── app.dart
│
├── core/
│   ├── api/
│   │   ├── api_client.dart
│   │   ├── websocket_client.dart
│   │   └── endpoints.dart
│   ├── auth/
│   │   ├── auth_provider.dart
│   │   └── token_storage.dart
│   └── theme/
│       └── app_theme.dart
│
├── features/
│   ├── dashboard/
│   │   ├── dashboard_screen.dart
│   │   ├── dashboard_provider.dart
│   │   └── widgets/
│   │       ├── stats_cards.dart
│   │       ├── hosts_list.dart
│   │       └── alerts_panel.dart
│   │
│   ├── hosts/
│   │   ├── hosts_screen.dart
│   │   ├── host_detail_screen.dart
│   │   ├── hosts_provider.dart
│   │   └── widgets/
│   │       ├── host_card.dart
│   │       ├── metrics_chart.dart
│   │       ├── containers_list.dart
│   │       └── services_list.dart
│   │
│   ├── alerts/
│   │   ├── alerts_screen.dart
│   │   ├── alert_rules_screen.dart
│   │   ├── alerts_provider.dart
│   │   └── widgets/
│   │       ├── alert_card.dart
│   │       └── rule_editor.dart
│   │
│   ├── actions/
│   │   ├── actions_screen.dart
│   │   └── action_dialog.dart
│   │
│   └── settings/
│       ├── settings_screen.dart
│       └── notifications_screen.dart
│
├── models/
│   ├── host.dart
│   ├── container.dart
│   ├── service.dart
│   ├── metrics.dart
│   ├── alert.dart
│   └── alert_rule.dart
│
└── widgets/
    ├── metric_gauge.dart
    ├── status_indicator.dart
    ├── time_series_chart.dart
    └── loading_overlay.dart
```

---

## 7. Deployment

### 7.1 Vigil server

```bash
#!/bin/bash
# deploy/incus/setup-vigil.sh

# Create the container
incus launch images:alpine/3.21 vigil

# Static IP
incus config device override vigil eth0 ipv4.address=13.23.33.5

# Directories
incus exec vigil -- mkdir -p /opt/sogoms/bin
incus exec vigil -- mkdir -p /config
incus exec vigil -- mkdir -p /var/log/sogoms

# Copy the binaries
incus file push dist/sogoctl vigil/opt/sogoms/bin/
incus file push dist/sogoway vigil/opt/sogoms/bin/
incus file push dist/sogorch vigil/opt/sogoms/bin/
incus file push dist/sogoms-* vigil/opt/sogoms/bin/

# Copy the configs
incus file push -r config/* vigil/config/

# Service
incus exec vigil -- rc-update add sogoms default
incus exec vigil -- rc-service sogoms start
```

### 7.2 Sovigilant agent

```bash
#!/bin/bash
# Agent installation script

VIGIL_SERVER="${1:-vigil.sogoms.com}"
VIGIL_TOKEN="${2}"

# Map uname output to the release naming shown in section 3.2
# (sovigilant-linux-amd64); raw uname would give "Linux" and "x86_64".
# Assumes all release files follow the sovigilant-<os>-<arch> pattern.
OS="$(uname -s | tr '[:upper:]' '[:lower:]')"
case "$(uname -m)" in
    x86_64)  ARCH="amd64" ;;
    aarch64) ARCH="arm64" ;;
    *)       ARCH="$(uname -m)" ;;
esac

# Download
curl -fsSL -o /usr/local/bin/sovigilant \
    "https://${VIGIL_SERVER}/downloads/sovigilant-${OS}-${ARCH}"
chmod +x /usr/local/bin/sovigilant

# Configure
mkdir -p /etc/sovigilant
cat > /etc/sovigilant/config.yaml << EOF
agent:
  name: "$(hostname)"
server:
  address: ${VIGIL_SERVER}:9090
  token: "${VIGIL_TOKEN}"
collection:
  interval: 10s
  host:
    enabled: true
  containers:
    enabled: true
    runtime: auto
  services:
    enabled: true
    discover: true
EOF

# systemd service
cat > /etc/systemd/system/sovigilant.service << EOF
[Unit]
Description=SOGOMS Vigil Agent
After=network.target

[Service]
ExecStart=/usr/local/bin/sovigilant run
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable sovigilant
systemctl start sovigilant
```

---

## 8. Project Structure

```
sogoms/
├── cmd/
│   ├── sogoctl/
│   ├── sogoway/
│   ├── sogorch/
│   ├── sogoms/
│   │   ├── db/
│   │   ├── pdf/
│   │   ├── email/
│   │   ├── storage/
│   │   ├── collect/        # Vigil: metrics aggregation
│   │   ├── alert/          # Vigil: alerting
│   │   └── action/         # Vigil: remote actions
│   └── sovigilant/         # Monitoring agent
│       └── main.go
│
├── vigil-dashboard/        # Flutter app
│   ├── lib/
│   ├── pubspec.yaml
│   └── ...
│
└── ...
```

---

## 9. Roadmap

### Phase 1: MVP

- [ ] Sovigilant: host + container collection
- [ ] Sogoway: REST API + gRPC for agents
- [ ] Sogoms-collect: basic aggregation
- [ ] Sogoms-db: metrics storage
- [ ] Flutter: dashboard + host list

### Phase 2: Alerting

- [ ] Sogoms-alert: rule evaluation
- [ ] Email + Slack notifications
- [ ] Flutter: alert management
- [ ] Alert history

### Phase 3: Actions

- [ ] Sogoms-action: remote commands
- [ ] Sogorch: action scenarios
- [ ] Flutter: actions interface
- [ ] Automatic actions on alert

### Phase 4: Advanced

- [ ] Service auto-discovery
- [ ] Custom metrics
- [ ] Advanced charts
- [ ] Prometheus export
- [ ] Full multi-tenancy

---

## 10. References

- [gRPC Go](https://grpc.io/docs/languages/go/) - Agent communication
- [gopsutil](https://github.com/shirou/gopsutil) - System metrics collection
- [Incus API](https://linuxcontainers.org/incus/docs/main/api/) - Container metrics
- [Flutter Charts](https://pub.dev/packages/fl_chart) - Charts

---

**Website**: https://vigil.sogoms.com
**Documentation**: https://docs.sogoms.com/vigil
**Agent Downloads**: https://vigil.sogoms.com/downloads

*Document created on 2025-01-15 - Version 1.0.0*