Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Os : General nodata | high | n/a | k.os.knock.nodata | This trigger fires when we have no data for a specific server (meaning the server seems to be down). Note that this trigger acts as the master trigger for all no-data triggers. | Check this was expected, then check the server, your network, connectivity, and the knock daemon on this server.
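
In practice this is a freshness check on the most recent sample received per server. A minimal sketch of the idea (hypothetical names and data; the knock daemon's actual implementation is not shown here):

```python
import time

# Hypothetical in-memory map of server -> last sample timestamp (epoch seconds).
last_seen = {"web-01": time.time() - 30, "db-01": time.time() - 9000}

NODATA_THRESHOLD_SEC = 7200

def nodata_servers(samples: dict, threshold: float = NODATA_THRESHOLD_SEC) -> list:
    """Return servers whose last sample is older than the threshold."""
    now = time.time()
    return [host for host, ts in samples.items() if now - ts > threshold]

print(nodata_servers(last_seen))  # ['db-01'] => master no-data trigger fires
```
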
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Hdd.Health no data | high | Per disk | k.hard.hd.health, no data for 7200 sec | This trigger fires when a disk has been removed or has disappeared, and we have had no data for more than 7200 sec. | Check this was expected and/or replace the disk.
Hdd.Health status | high | Per disk | k.hard.hd.health != KOK | This trigger fires when the SMART status of a disk is invalid. | Check this was expected and/or replace the disk.
Hdd re-allocated sectors | high | Per disk | k.hard.hd.reallocated_sector_ct > 0 | This trigger fires when a disk starts re-allocating sectors, which usually indicates a predicted disk failure. | Check the disk, replace it if necessary.
Hdd serial number modified | high | Per disk | k.hard.hd.serial_number modified | This trigger fires when a disk serial number has changed. | Check this was expected.
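
These disk checks can be reproduced by hand with smartctl. A minimal sketch wrapping it from Python (assumes smartmontools is installed and sufficient privileges; `/dev/sda` is an example device):

```python
import subprocess

def smart_health(device: str = "/dev/sda") -> bool:
    """Return True when the SMART overall health self-assessment is PASSED."""
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True).stdout
    return "PASSED" in out

def reallocated_sectors(device: str = "/dev/sda") -> int:
    """Return the raw value of the Reallocated_Sector_Ct attribute (0 if absent)."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])  # raw value is the last column
    return 0

print(smart_health(), reallocated_sectors())  # False or > 0 fires the triggers
```
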
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Network full duplex | high | Per interface | k.net.if.status != full | This trigger fires when a network interface is not in full duplex mode. | Check configuration and restore full duplex mode. |
Network speed | average | Per interface | k.net.if.status, speed < 1000 | This trigger fires when a network interface speed is lower than 1 Gbit/s. | Check configuration and connectivity, and restore at least 1 Gbit/s interface speed.
Network status | average | Per interface | k.net.if.status != ok | This trigger fires when a network interface operational status is not ok. | Check configuration and connectivity, and restore the interface status.
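
On Linux, the duplex, speed, and operational status of an interface can be read straight from sysfs, which is a quick way to verify these triggers by hand. A sketch (`eth0` is an example interface; reading `speed` or `duplex` while the link is down raises OSError):

```python
def read_sysfs(iface: str, attr: str) -> str:
    try:
        with open(f"/sys/class/net/{iface}/{attr}") as f:
            return f.read().strip()
    except OSError:
        return "unknown"  # speed/duplex are unreadable while the link is down

iface = "eth0"
print("operstate:", read_sysfs(iface, "operstate"))  # expected: up
print("duplex   :", read_sysfs(iface, "duplex"))     # expected: full
print("speed    :", read_sysfs(iface, "speed"))      # Mbit/s, expected >= 1000
```
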
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Ping delay | info | Per ping ip | k.ping.delay > 0.5 | This trigger fires when the ping delay toward the specified IP is greater than 0.5 seconds. | Check your network, and your local and remote devices.
Ping delay | high | Per ping ip | k.ping.delay > 1 | This trigger fires when the ping delay toward the specified IP is greater than 1 second. | Check your network, and your local and remote devices.
Ping delay no data | disaster | Per ping ip | k.ping.delay, no data for 7200 sec | This trigger fires when we have no ping delay data for at least 7200 seconds toward the specified IP. | Check your network, and your local and remote devices.
Ping lost | info | Per ping ip | k.ping.lost > 0 | This trigger fires when ping packet loss is detected (> 0). | Check your network, and your local and remote devices.
Ping lost | high | Per ping ip | k.ping.lost > 25 | This trigger fires when ping packet loss is detected (> 25). | Check your network, and your local and remote devices.
Ping lost no data | disaster | Per ping ip | k.ping.lost, no data for 7200 sec | This trigger fires when we have no ping loss data for at least 7200 seconds toward the specified IP. | Check your network, and your local and remote devices.
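
To check delay and loss toward an IP by hand, the output of the system ping can be parsed. A minimal sketch (GNU/Linux iputils ping output format assumed; the address is an example):

```python
import re
import subprocess

def ping_stats(host: str, count: int = 5):
    """Return (loss_percent, avg_rtt_seconds); (100.0, None) when ping fails."""
    out = subprocess.run(["ping", "-c", str(count), "-W", "1", host],
                         capture_output=True, text=True).stdout
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", out)  # min/avg/max/mdev, avg is 2nd
    return (float(loss.group(1)) if loss else 100.0,
            float(rtt.group(1)) / 1000.0 if rtt else None)

loss, delay = ping_stats("192.0.2.1")  # example address (TEST-NET-1)
print(loss, delay)  # triggers: delay > 0.5 s (info), > 1 s (high); loss > 0 / > 25
```
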
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Free disk space | high | Per volume | k.vfs.fs.size, % free < 15 | This trigger fires when free disk space on the specified volume is lower than 15%. | Check your disk usage, remove some files, archive some files, upgrade volume capacity.
Free disk space | warn | Per volume | k.vfs.fs.size, % free < 30 | This trigger fires when free disk space on the specified volume is lower than 30%. | Check your disk usage, remove some files, archive some files, upgrade volume capacity.
Free inode space | high | Per volume | k.vfs.fs.inode, % free < 15 | This trigger fires when free inodes on the specified volume are lower than 15%. | Check your disk usage, remove some files, archive some files, increase inode capacity (may be difficult depending on the underlying FS).
Free inode space | warn | Per volume | k.vfs.fs.inode, % free < 30 | This trigger fires when free inodes on the specified volume are lower than 30%. | Check your disk usage, remove some files, archive some files, increase inode capacity (may be difficult depending on the underlying FS).
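
Both the space and inode percentages map directly onto statvfs. A minimal sketch computing them for a mount point (`/` is an example):

```python
import os

def fs_free_percent(path: str = "/"):
    """Return (% free space, % free inodes) for the filesystem holding path."""
    st = os.statvfs(path)
    pct_space = 100.0 * st.f_bavail / st.f_blocks  # blocks available to non-root
    pct_inode = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return pct_space, pct_inode

space, inodes = fs_free_percent("/")
print(space, inodes)  # warn below 30%, high below 15%
```
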
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Os.Cpu : CPU is overloaded | warn | n/a | k.os.cpu.util, idle < 10% | This trigger fires when idle CPU usage is below 10%. | Check this was expected, and check server, process, and user activity.
Os.Cpu : Load is too high | warn | n/a | k.os.cpu.load, per cpu > 5 | This trigger fires when the server load per CPU is greater than 5. | Check this was expected, and check server, process, and user activity.
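
Both CPU checks are easy to approximate from standard interfaces: load per CPU from the load average, and idle percentage from two samples of `/proc/stat`. A sketch (Linux assumed):

```python
import os
import time

def load_per_cpu() -> float:
    """1-minute load average divided by the number of CPUs (trigger: > 5)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)

def idle_percent(interval: float = 1.0) -> float:
    """Idle CPU percentage over an interval, from /proc/stat (trigger: < 10%)."""
    def sample():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3], sum(fields)  # idle is the 4th field of the 'cpu' line
    idle1, total1 = sample()
    time.sleep(interval)
    idle2, total2 = sample()
    return 100.0 * (idle2 - idle1) / (total2 - total1)

print(load_per_cpu(), idle_percent())
```
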
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Os.Host : /etc/passwd has been changed | warn | n/a | k.vfs.file.cksum has changed | This trigger fires when /etc/passwd is modified. | Check this was expected.
Os.Host : Configured max number of opened files is too low | info | n/a | k.os.maxfiles < 1025 | This trigger fires when the configured maximum number of open files is too low. | Check your server configuration & tuning.
Os.Host : Configured max number of processes is too low | info | n/a | k.os.maxproc < 1025 | This trigger fires when the configured maximum number of processes is too low. | Check your server configuration & tuning.
Os.Host : Time Diff | average | n/a | k.os.timediff > 1 | This trigger fires when the server time difference is greater than 1 second. | Check your server time synchronization.
Os.Host : Time Diff | disaster | n/a | k.os.timediff > 2 | This trigger fires when the server time difference is greater than 2 seconds. | Check your server time synchronization.
Os.Host : Server has just been restarted | info | n/a | k.os.uptime < 600 | This trigger fires when the server has been up for less than 600 seconds. | Check this was expected.
Os.Memory : Low available memory | average | n/a | k.os.memory.size available < 16MB and k.os.memory.size cached < 32MB | This trigger fires when the server is in a low-memory condition (available and cached memory both nearly exhausted). | Check your server & process memory usage. Increase memory.
Os.Swap : Lack of free swap space | high | n/a | k.os.swap.size, % free < 25 | This trigger fires when free swap space is less than 25%. | Check your server & process memory usage. Increase memory. Note that swapping is hell; fix it.
Os.Swap : Lack of free swap space | average | n/a | k.os.swap.size, % free < 50 | This trigger fires when free swap space is less than 50%. | Check your server & process memory usage. Increase memory. Note that swapping is hell; fix it.
Os.Swap : Lack of free swap space | info | n/a | k.os.swap.size, % free < 75 | This trigger fires when free swap space is less than 75%. | Check your server & process memory usage. Increase memory. Note that swapping is hell; fix it.
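  
Most of these host-level values can be checked by hand from /proc. A sketch for Linux; whether k.os.maxfiles/k.os.maxproc map to these kernel-wide limits or to per-user ulimits is an assumption, and the checksum algorithm used for /etc/passwd is unspecified (md5 here is an example):

```python
import hashlib

def proc_num(path: str) -> int:
    """Read the first numeric field of a /proc file."""
    with open(path) as f:
        return int(float(f.read().split()[0]))

def meminfo() -> dict:
    """Parse /proc/meminfo into {name: kB}."""
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, value = line.split(":")
            out[name] = int(value.split()[0])
    return out

print("max files :", proc_num("/proc/sys/fs/file-max"))    # trigger: < 1025
print("max procs :", proc_num("/proc/sys/kernel/pid_max")) # trigger: < 1025
print("uptime    :", proc_num("/proc/uptime"))             # trigger: < 600 s

m = meminfo()
print("available :", m["MemAvailable"], "kB")  # trigger: < 16 MB with low cache
print("cached    :", m["Cached"], "kB")
swap_free = 100.0 * m["SwapFree"] / m["SwapTotal"] if m["SwapTotal"] else 100.0
print("swap free :", swap_free, "%")           # triggers: < 75 / < 50 / < 25

with open("/etc/passwd", "rb") as f:
    print("passwd md5:", hashlib.md5(f.read()).hexdigest())  # compare over time
```
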
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Dns status | high | Per hostname / dns server | k.dns.resolv==KO | This trigger fires when DNS resolution of the specified hostname, toward the specified DNS server, is not ok (invalid reply, resolution failure, timeout). | Possible remote DNS server issue, platform connectivity issue, DNS entry issue, or DNS resolver configuration issue.
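
Resolving against a specific DNS server (rather than the system resolver) needs a resolver library. A sketch with dnspython (a third-party package; the hostname and server are examples):

```python
import dns.exception
import dns.resolver  # pip install dnspython

def resolve_ok(hostname: str, dns_server: str, timeout: float = 3.0) -> bool:
    """Return True when hostname resolves via dns_server within the timeout."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [dns_server]
    resolver.lifetime = timeout
    try:
        resolver.resolve(hostname, "A")
        return True
    except dns.exception.DNSException:
        return False  # invalid reply, resolution failure, or timeout => KO

print(resolve_ok("example.com", "8.8.8.8"))
```
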
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
CheckProcess : pidfile for process | average | Per process monitored | k.proc.pidfile!=ok | This trigger fires when the process pidfile is not ok. | Possible process down and/or crashed, pidfile deleted, or process not stopped cleanly.
CheckProcess : running for process | average | Per process monitored | k.proc.running!=ok | This trigger fires when the process is not running. | Possible process down and/or crashed.
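
The classic pidfile check is: read the PID, then probe it with signal 0. A minimal sketch (the pidfile path is an example):

```python
import os

def pidfile_ok(pidfile: str) -> bool:
    """Return True when the pidfile exists and its PID is a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # pidfile missing, unreadable, or garbage
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        return True
    except PermissionError:
        return True      # process exists but belongs to another user
    except ProcessLookupError:
        return False     # stale pidfile: the process is gone

print(pidfile_ok("/var/run/nginx.pid"))
```
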
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Nginx started | high | Per instance | k.nginx.started!=1 | This trigger fires when the nginx status is down (no valid reply, no status reply, HTTP timeout). | Possible nginx status endpoint not correctly deployed (check daemon logs), possible nginx worker overload, possible nginx instance crashed or stopped.
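
The nginx check above relies on a reachable status endpoint (ngx_http_stub_status_module). A sketch probing it (the URL is an example and must match your `stub_status` location):

```python
import urllib.request

def nginx_started(url: str = "http://127.0.0.1/nginx_status",
                  timeout: float = 3.0) -> bool:
    """Return True when the stub_status page answers and looks valid."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
    except OSError:
        return False  # no reply, connection refused, or HTTP timeout
    return "Active connections" in body

print(nginx_started())  # False => the 'started' trigger fires
```
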
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Apache idle workers | average | Per instance | k.apache.stat.idle_workers==0 | This trigger fires when no apache idle workers are available (meaning that all apache workers are in use). | Possible platform slowdown or issue. May require apache tuning, server code optimization, platform optimization, benchmarking.
Apache started | high | Per instance | k.apache.started!=1 | This trigger fires when the apache status is down (no valid reply, no status reply, HTTP timeout). | Possible apache status endpoint not correctly deployed (check daemon logs), possible apache worker overload, possible apache instance crashed or stopped.
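
Apache exposes the same kind of data through mod_status; the machine-readable `?auto` variant is easy to parse. A sketch (the URL is an example and must match your mod_status configuration):

```python
import urllib.request

def apache_status(url: str = "http://127.0.0.1/server-status?auto",
                  timeout: float = 3.0):
    """Return (started, idle_workers); idle_workers is None when unavailable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
    except OSError:
        return False, None  # no reply or HTTP timeout
    for line in body.splitlines():
        if line.startswith("IdleWorkers:"):
            return True, int(line.split(":")[1])
    return True, None

started, idle = apache_status()
print(started, idle)  # triggers: started != 1 (high), idle workers == 0 (average)
```
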
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
PhpFpm idle processes | average | Per instance | k.phpfpm.idle_processes==0 | This trigger fires when no PhpFpm idle processes are available (meaning that all processes are in use). | Possible platform slowdown or issue. May require PhpFpm tuning, server code optimization, platform optimization, benchmarking.
PhpFpm started | high | Per instance | k.phpfpm.started!=1 | This trigger fires when the PhpFpm status is down (no valid reply, no status reply, HTTP timeout). | Possible PhpFpm status endpoint not correctly deployed (check daemon logs), possible PhpFpm pool overload, possible PhpFpm instance crashed or stopped, possible issue with the web server in front (Nginx, Apache...).
PhpFpm restarted | info | Per instance | k.phpfpm.start_since<600 | This trigger fires when PhpFpm has been restarted recently (<600 seconds). | Check instance restart was expected. |
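
PHP-FPM's built-in status page (pm.status_path) reports the values used in this table. A sketch parsing its plain-text output (the URL is an example; the page must be enabled in the pool configuration):

```python
import urllib.request

def phpfpm_status(url: str = "http://127.0.0.1/fpm-status",
                  timeout: float = 3.0):
    """Return {field: value} from the plain-text status page; None when down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode()
    except OSError:
        return None
    fields = {}
    for line in body.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            fields[name.strip()] = value.strip()
    return fields

st = phpfpm_status()
if st is None:
    print("down")  # trigger: started != 1
else:
    print(st.get("idle processes"), st.get("start since"))  # == 0 / < 600
```
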
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Uwsgi overload | average | Per instance | k.uwsgi.cores.cur.idle==0 | This trigger fires when no idle cores are available (meaning that all cores are in use). | Possible platform slowdown or issue. May require uwsgi tuning, server code optimization, platform optimization, benchmarking.
Uwsgi started | high | Per instance | k.uwsgi.started!=1 | This trigger fires when the uwsgi stats server is down (no valid reply, no stats socket reply, stats timeout). | Possible uwsgi stats not correctly deployed (check daemon logs), possible server overload, possible uwsgi instance crashed or stopped.
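
uwsgi's stats server emits a JSON document per connection; idle cores are those not currently handling a request. A sketch reading a TCP stats socket (the address is an example; enable it with something like `stats = 127.0.0.1:9191` in the uwsgi config):

```python
import json
import socket

def uwsgi_idle_cores(host: str = "127.0.0.1", port: int = 9191,
                     timeout: float = 3.0):
    """Return the number of idle cores, or None when the stats server is down."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        stats = json.loads(b"".join(chunks))
    except (OSError, ValueError):
        return None  # no reply, socket timeout, or invalid JSON
    return sum(1 for worker in stats.get("workers", [])
               for core in worker.get("cores", [])
               if core.get("in_request") == 0)

idle = uwsgi_idle_cores()
print(idle)  # triggers: None => started != 1 (high), 0 => overload (average)
```
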
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Varnish main.uptime | info | Per instance | k.varnish.main.uptime<600 | This trigger fires when main varnish process has been restarted recently (<600 seconds). | Check instance restart was expected. |
Varnish mgt.uptime | info | Per instance | k.varnish.mgt.uptime<600 | This trigger fires when mgt varnish process has been restarted recently (<600 seconds). | Check instance restart was expected. |
Varnish started | high | Per instance | k.varnish.started!=1 | This trigger fires when varnishstat is down (no valid reply, no invoke reply, invoke timeout). | Possible varnishstat not correctly installed (check daemon logs), possible server overload, possible varnish instance crashed or stopped.
Varnish backend_busy | average | Per instance | k.varnish.backend_busy>0 | This trigger fires when some busy backends are detected. | You may have to check your underlying backends. |
Varnish backend_fail | average | Per instance | k.varnish.backend_fail>0 | This trigger fires when some failed backends are detected. | You may have to check your underlying backends. |
Varnish cur.thread_queue_len | average | Per instance | k.varnish.cur.thread_queue_len>0 | This trigger fires when some sessions are waiting for available threads. | Possible varnish thread pool tuning required, possible server and/or backend slowdown. |
Varnish sess_drop | average | Per instance | k.varnish.sess_drop>0 | This trigger fires when sessions are silently dropped due to a lack of worker threads. | Possible varnish thread pool tuning required, possible server and/or backend slowdown.
Varnish sess_dropped | average | Per instance | k.varnish.sess_dropped>0 | This trigger fires when sessions are dropped because the queue was already too long. | Possible varnish thread pool tuning required, possible server and/or backend slowdown.
Varnish sess_fail | average | Per instance | k.varnish.sess_fail>0 | This trigger fires when a TCP accept fails. This can be caused by the client, or by the server running out of a resource such as file descriptors. | Possible OS and/or varnish process tuning required, possible server and/or backend slowdown.
Varnish sess_queued | average | Per instance | k.varnish.sess_queued>0 | This trigger fires when sessions are queued waiting for a thread. | Possible varnish process tuning required, possible server and/or backend slowdown.
Varnish threads_failed | average | Per instance | k.varnish.threads_failed>0 | This trigger fires when thread creation failed. | This is a real issue; investigate the varnish instance (open file limits?).
Varnish threads_limited | average | Per instance | k.varnish.threads_limited>0 | This trigger fires when the thread pool is maxed out. | Possible varnish thread pool tuning required, possible server and/or backend slowdown.
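
All of the varnish counters above are visible through varnishstat; `-1` prints one snapshot and exits. A sketch parsing that output (field names as in recent varnish versions; older versions may not prefix counters with `MAIN.`):

```python
import subprocess

def varnishstat(fields):
    """Return {field: value} from one 'varnishstat -1' snapshot; None when down."""
    try:
        out = subprocess.run(["varnishstat", "-1"], capture_output=True,
                             text=True, timeout=5, check=True).stdout
    except (OSError, subprocess.SubprocessError):
        return None  # varnishstat missing, failing, or timing out
    values = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in fields:
            values[parts[0]] = int(parts[1])
    return values

v = varnishstat({"MAIN.uptime", "MAIN.backend_busy", "MAIN.sess_dropped",
                 "MAIN.threads_limited"})
print(v)  # any failure counter > 0 fires the matching trigger; None => down
```
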
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Mysql started | high | Per instance | k.mysql.started!=1 | This trigger fires when the daemon is unable to connect and/or execute basic SQL statements against the Mysql instance (SQL timeout, SQL failure, instance down). | Possible debian-sys-maint account issue, Mysql instance down or crashed.
Mysql Replication Threads | high | Per instance | k.mysql.repli.cur.lag_sec<0 | This trigger fires when a replication setup is detected, but the replication threads are not running. | Possible broken replication, master server connectivity issue, replication thread crash, instance issue.
Mysql Replication Lag | info | Per instance | k.mysql.repli.cur.lag_sec>600 | This trigger fires when a replication setup is detected and replication is up, but the slave server lags the master by 600+ seconds. | Possible slave server slowdown, long-running SQL queries on the slave server, replication of a table without a primary key (causing massive full scans on the slave server)...
Mysql Replication Lag | warn | Per instance | k.mysql.repli.cur.lag_sec>3600 | This trigger fires when a replication setup is detected and replication is up, but the slave server lags the master by 3600+ seconds. | Possible slave server slowdown, long-running SQL queries on the slave server, replication of a table without a primary key (causing massive full scans on the slave server)...
Mysql Replication Lag | high | Per instance | k.mysql.repli.cur.lag_sec>7200 | This trigger fires when a replication setup is detected and replication is up, but the slave server lags the master by 7200+ seconds. | Possible slave server slowdown, long-running SQL queries on the slave server, replication of a table without a primary key (causing massive full scans on the slave server)...
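
The replication figures come from `SHOW SLAVE STATUS`; `Seconds_Behind_Master` is NULL when the replication threads are not running, which is presumably what the `lag_sec < 0` convention encodes. A hedged sketch with PyMySQL (a third-party package; credentials are placeholders):

```python
import pymysql  # pip install pymysql

def replication_lag(host="127.0.0.1", user="monitor", password="secret"):
    """Return lag in seconds, -1 when threads are down, None when not a slave."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return None                    # no replication setup on this instance
    lag = row["Seconds_Behind_Master"]
    return -1 if lag is None else lag  # NULL => replication threads not running

print(replication_lag())  # triggers: < 0 (high), > 600 / > 3600 / > 7200
```
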
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
Redis started | high | Per instance | k.redis.started!=1 | This trigger fires when the redis status is down (no valid reply, no INFO reply, socket timeout). | Possible instance overload, server slowdown, possible redis instance crashed or stopped.
Redis uptime | info | Per instance | k.redis.uptime_in_seconds<600 | This trigger fires when Redis has been restarted recently (<600 seconds). | Check instance restart was expected. |
Redis replication down | high | Per instance | k.redis.master_link_down_since_seconds>60 | This trigger fires when a replication setup is detected and the link toward the master has been down for 60+ seconds. | Possible master connectivity issue, possible replication issue.
Redis rdb save | average | Per instance | k.redis.rdb_last_bgsave_status!=ok | This trigger fires when the last RDB save was not ok. | Possible instance issue, possible disk full.
Redis aof save (write) | average | Per instance | k.redis.aof_last_write_status!=ok | This trigger fires when the last AOF write was not ok. | Possible instance issue, possible disk full.
Redis aof save (rewrite) | average | Per instance | k.redis.aof_last_bgrewrite_status!=ok | This trigger fires when the last AOF rewrite was not ok. | Possible instance issue, possible disk full.
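
Every field in this table is one key of the reply to the Redis INFO command. A sketch with redis-py (a third-party package; note that `master_link_down_since_seconds` only appears on a replica whose master link is down):

```python
import redis  # pip install redis

def redis_checks(host="127.0.0.1", port=6379):
    try:
        info = redis.Redis(host=host, port=port, socket_timeout=3).info()
    except redis.RedisError:
        return None  # no valid reply, no INFO reply, or socket timeout
    return {
        "uptime_in_seconds": info.get("uptime_in_seconds"),            # < 600
        "rdb_last_bgsave_status": info.get("rdb_last_bgsave_status"),  # != ok
        "aof_last_write_status": info.get("aof_last_write_status"),    # != ok
        "master_link_down_since_seconds":
            info.get("master_link_down_since_seconds"),                # > 60
    }

print(redis_checks())  # None means the 'started' trigger fires
```
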
Trigger | Level | Discovery | Key | Description | Fix |
---|---|---|---|---|---|
MemCached started | high | Per instance | k.memcached.started!=1 | This trigger fires when the memcached status is down (no valid reply, no stats reply, socket timeout). | Possible instance overload, server slowdown, possible memcached instance crashed or stopped.
MemCached uptime | info | Per instance | k.memcached.uptime<600 | This trigger fires when MemCached has been restarted recently (<600 seconds). | Check instance restart was expected. |
MemCached accepting stop | info | Per instance | k.memcached.listen_disabled_num>0 | This trigger fires when MemCached has stopped accepting connections recently. | Possible instance overload; tuning may be required on the maxconns side and/or the TCP stack side.
MemCached accepting disabled | high | Per instance | k.memcached.accepting_conns!=1 | This trigger fires when MemCached no longer accepts connections. | Possible instance overload, crash, misconfiguration, or bug.
MemCached auth errors | info | Per instance | k.memcached.auth_errors>0 | This trigger fires when MemCached authentication errors have occurred. | Possible configuration issue (at the client or server end), possible break-in attempts.
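
Memcached's plain-text `stats` command returns every counter used above. A minimal sketch speaking the protocol directly over a socket (the default port is assumed):

```python
import socket

def memcached_stats(host="127.0.0.1", port=11211, timeout=3.0):
    """Return {stat: value} from the 'stats' command, or None when down."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"stats\r\n")
            buf = b""
            while not buf.endswith(b"END\r\n"):
                data = sock.recv(4096)
                if not data:
                    return None
                buf += data
    except OSError:
        return None  # no reply or socket timeout => 'started' trigger fires
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = value
    return stats

s = memcached_stats()
if s:
    print(s.get("uptime"), s.get("accepting_conns"),
          s.get("listen_disabled_num"), s.get("auth_errors"))
```
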