Kapacitor is a data processing engine. It can process both stream and batch data. This guide will walk through both workflows and teach the basics of using and running a Kapacitor daemon.
What will be needed
Do not worry about installing anything at this point. Instructions are found below.
The following applications will be required:
- InfluxDB - While Kapacitor does not require InfluxDB, it is the easiest integration to set up, so it will be used in this guide. InfluxDB >= 1.3.x will be needed.
- Telegraf - Telegraf >= 1.3.x will be required.
- Kapacitor - The latest Kapacitor binary and installation packages for most OSes can be found at the downloads page.
- Terminal - The Kapacitor client application works using the CLI and so a basic terminal will be needed to issue commands.
The Use Case
This guide will follow the classic use case of triggering an alert for high cpu usage on a server. CPU data is among the default system metrics generated by Telegraf out of the box.
The Process
- Install InfluxDB and Telegraf.
- Start InfluxDB and send it data from Telegraf.
- Install Kapacitor.
- Start Kapacitor.
- Define and run a stream task to trigger CPU alerts.
- Define and run a batch task to trigger CPU alerts.
Installation
The TICKStack services can be installed to run on the host machine as a part of Systemd, or they can be run from Docker containers. This guide will focus on installing and running them all on the same host as Systemd services.
If you would like to explore using Docker deployments of these components, check out these instructions.
The applications InfluxDB, Telegraf and Kapacitor will need to be installed in that order and on the same host.
All examples will assume that Kapacitor is running on http://localhost:9092 and InfluxDB on http://localhost:8086.
InfluxDB + Telegraf
Install InfluxDB using the Linux system packages (.deb, .rpm) if available.
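For example, on a Debian-based system the download and install might look like the following sketch. The exact version number and package URL are assumptions here; take the current release from the downloads page:
$ # version and URL are assumptions; check the downloads page for the current release
$ wget https://dl.influxdata.com/influxdb/releases/influxdb_1.3.5_amd64.deb
$ sudo dpkg -i influxdb_1.3.5_amd64.deb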
Start InfluxDB using systemctl:
$ sudo systemctl start influxdb
Verify InfluxDB startup:
$ sudo journalctl -f -n 128 -u influxdb
zář 01 14:47:43 algonquin systemd[1]: Started InfluxDB is an open-source, distributed, time series database.
zář 01 14:47:43 algonquin influxd[14778]: 8888888 .d888 888 8888888b. 888888b.
zář 01 14:47:43 algonquin influxd[14778]: 888 d88P" 888 888 "Y88b 888 "88b
zář 01 14:47:43 algonquin influxd[14778]: 888 888 888 888 888 888 .88P
zář 01 14:47:43 algonquin influxd[14778]: 888 88888b. 888888 888 888 888 888 888 888 888 8888888K.
zář 01 14:47:43 algonquin influxd[14778]: 888 888 "88b 888 888 888 888 Y8bd8P\' 888 888 888 "Y88b
zář 01 14:47:43 algonquin influxd[14778]: 888 888 888 888 888 888 888 X88K 888 888 888 888
zář 01 14:47:43 algonquin influxd[14778]: 888 888 888 888 888 Y88b 888 .d8""8b. 888 .d88P 888 d88P
zář 01 14:47:43 algonquin influxd[14778]: 8888888 888 888 888 888 "Y88888 888 888 8888888P" 8888888P"
zář 01 14:47:43 algonquin influxd[14778]: [I] 2017-09-01T12:47:43Z InfluxDB starting, version 1.3.5, branch HEAD, commit 9d9001036d3585cf21925c13a57881bc6c8dcc7e
zář 01 14:47:43 algonquin influxd[14778]: [I] 2017-09-01T12:47:43Z Go version go1.8.3, GOMAXPROCS set to 8
zář 01 14:47:43 algonquin influxd[14778]: [I] 2017-09-01T12:47:43Z Using configuration at: /etc/influxdb/influxdb.conf
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Using data dir: /var/lib/influxdb/data service=store
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z opened service service=subscriber
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting monitor system service=monitor
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z 'build' registered for diagnostics monitoring service=monitor
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z 'runtime' registered for diagnostics monitoring service=monitor
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z 'network' registered for diagnostics monitoring service=monitor
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z 'system' registered for diagnostics monitoring service=monitor
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting precreation service with check interval of 10m0s, advance period of 30m0s service=shard-precreation
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting snapshot service service=snapshot
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting continuous query service service=continuous_querier
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting HTTP service service=httpd
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Authentication enabled:false service=httpd
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Listening on HTTP:[::]:8086 service=httpd
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Starting retention policy enforcement service with check interval of 30m0s service=retention
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Listening for signals
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Sending usage statistics to usage.influxdata.com
zář 01 14:47:44 algonquin influxd[14778]: [I] 2017-09-01T12:47:44Z Storing statistics in database '_internal' retention policy 'monitor', at interval 10s service=monitor
...
Next install Telegraf using the Linux system packages (.deb, .rpm) if available.
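As with InfluxDB, a Debian-based install might look like this sketch (again, the version and URL are assumptions; use the current release from the downloads page):
$ # version and URL are assumptions; check the downloads page for the current release
$ wget https://dl.influxdata.com/telegraf/releases/telegraf_1.3.3-1_amd64.deb
$ sudo dpkg -i telegraf_1.3.3-1_amd64.deb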
Once Telegraf is installed and started, it will, by default, send system metrics to InfluxDB, creating the ‘telegraf’ database automatically if it does not exist.
The Telegraf configuration file can be found at its default location: /etc/telegraf/telegraf.conf. For this introduction it is worth noting some values that will be relevant to the Kapacitor tasks shown below. Namely:
- [agent] interval - declares the frequency at which system metrics will be sent to InfluxDB.
- [[outputs.influxdb]] - declares how to connect to InfluxDB and the destination database, which is the default ‘telegraf’ database.
- [[inputs.cpu]] - declares how to collect the system cpu metrics to be sent to InfluxDB.
Example - relevant sections of /etc/telegraf/telegraf.conf
[agent]
## Default data collection interval for all inputs
interval = "10s"
...
[[outputs.influxdb]]
## The HTTP or UDP URL for your InfluxDB instance. Each item should be
## of the form:
## scheme "://" host [ ":" port]
##
## Multiple urls can be specified as part of the same cluster,
## this means that only ONE of the urls will be written to each interval.
# urls = ["udp://localhost:8089"] # UDP endpoint example
urls = ["http://localhost:8086"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "telegraf" # required
...
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false
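Before moving on, one optional sanity check is to run Telegraf once in test mode, which samples the configured inputs and prints the collected metrics to stdout without writing them to any output:
$ # optional: sample inputs once and print the metrics instead of sending them
$ telegraf -config /etc/telegraf/telegraf.conf -test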
Telegraf will most likely have been started automatically upon installation.
Check the current status of the Telegraf service:
$ sudo systemctl status telegraf
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
Active: active (running) since Pá 2017-09-01 14:52:10 CEST; 20min ago
Docs: https://github.com/influxdata/telegraf
Main PID: 15068 (telegraf)
Tasks: 18
Memory: 14.4M
CPU: 6.789s
CGroup: /system.slice/telegraf.service
└─15068 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
zář 01 14:52:10 algonquin systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
zář 01 14:52:11 algonquin telegraf[15068]: 2017-09-01T12:52:11Z I! Starting Telegraf (version 1.3.3)
zář 01 14:52:11 algonquin telegraf[15068]: 2017-09-01T12:52:11Z I! Loaded outputs: influxdb
zář 01 14:52:11 algonquin telegraf[15068]: 2017-09-01T12:52:11Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes in
zář 01 14:52:11 algonquin telegraf[15068]: 2017-09-01T12:52:11Z I! Tags enabled: host=algonquin
zář 01 14:52:11 algonquin telegraf[15068]: 2017-09-01T12:52:11Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"algonquin", Flush Interval:10s
If Telegraf is ‘inactive’, start it as follows:
$ sudo systemctl start telegraf
Check its status as above, and check the system journal to ensure that there are no connection errors to InfluxDB.
$ sudo journalctl -f -n 128 -u telegraf
-- Logs begin at Pá 2017-09-01 09:59:06 CEST. --
zář 01 15:15:42 algonquin systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
zář 01 15:15:43 algonquin telegraf[16968]: 2017-09-01T13:15:43Z I! Starting Telegraf (version 1.3.3)
zář 01 15:15:43 algonquin telegraf[16968]: 2017-09-01T13:15:43Z I! Loaded outputs: influxdb
zář 01 15:15:43 algonquin telegraf[16968]: 2017-09-01T13:15:43Z I! Loaded inputs: inputs.disk inputs.diskio inputs.kernel inputs.mem inputs.processes inputs.swap inputs.system inputs.cpu
zář 01 15:15:43 algonquin telegraf[16968]: 2017-09-01T13:15:43Z I! Tags enabled: host=algonquin
zář 01 15:15:43 algonquin telegraf[16968]: 2017-09-01T13:15:43Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"algonquin", Flush Interval:10s
InfluxDB and Telegraf are now running and listening on localhost. Wait about a minute for Telegraf to supply a small amount of system metric data to InfluxDB. Then, confirm that InfluxDB has the data that Kapacitor will use.
This can be achieved with the following query:
$ curl -G 'http://localhost:8086/query?db=telegraf' --data-urlencode 'q=SELECT mean(usage_idle) FROM cpu'
This should return results similar to the following example.
Example - results from InfluxDB REST query
{"results":[{"statement_id":0,"series":[{"name":"cpu","columns":["time","mean"],"values":[["1970-01-01T00:00:00Z",91.82304336748372]]}]}]}
Installing and Starting Kapacitor
Install Kapacitor using the Linux system packages (.deb, .rpm) if available.
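Following the same pattern, a Debian-based install might look like this sketch (version and URL are assumptions; use the current release from the downloads page):
$ # version and URL are assumptions; check the downloads page for the current release
$ wget https://dl.influxdata.com/kapacitor/releases/kapacitor_1.3.1_amd64.deb
$ sudo dpkg -i kapacitor_1.3.1_amd64.deb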
The default Kapacitor configuration file is unpacked to /etc/kapacitor/kapacitor.conf. A copy of the current configuration can be extracted from the Kapacitor daemon as follows:
kapacitord config > kapacitor.conf
The configuration is a TOML file and is very similar to the InfluxDB configuration. That is because any input that can be configured for InfluxDB also works for Kapacitor.
Start the Kapacitor service:
$ sudo systemctl start kapacitor
Verify the status of the Kapacitor service:
$ sudo systemctl status kapacitor
● kapacitor.service - Time series data processing engine.
Loaded: loaded (/lib/systemd/system/kapacitor.service; enabled; vendor preset: enabled)
Active: active (running) since Pá 2017-09-01 15:34:16 CEST; 3s ago
Docs: https://github.com/influxdb/kapacitor
Main PID: 18526 (kapacitord)
Tasks: 13
Memory: 9.3M
CPU: 122ms
CGroup: /system.slice/kapacitor.service
└─18526 /usr/bin/kapacitord -config /etc/kapacitor/kapacitor.conf
zář 01 15:34:16 algonquin systemd[1]: Started Time series data processing engine..
zář 01 15:34:16 algonquin kapacitord[18526]: '##:::'##::::'###::::'########:::::'###:::::'######::'####:'########::'#######::'########::
zář 01 15:34:16 algonquin kapacitord[18526]: ##::'##::::'## ##::: ##.... ##:::'## ##:::'##... ##:. ##::... ##..::'##.... ##: ##.... ##:
zář 01 15:34:16 algonquin kapacitord[18526]: ##:'##::::'##:. ##:: ##:::: ##::'##:. ##:: ##:::..::: ##::::: ##:::: ##:::: ##: ##:::: ##:
zář 01 15:34:16 algonquin kapacitord[18526]: #####::::'##:::. ##: ########::'##:::. ##: ##:::::::: ##::::: ##:::: ##:::: ##: ########::
zář 01 15:34:16 algonquin kapacitord[18526]: ##. ##::: #########: ##.....::: #########: ##:::::::: ##::::: ##:::: ##:::: ##: ##.. ##:::
zář 01 15:34:16 algonquin kapacitord[18526]: ##:. ##:: ##.... ##: ##:::::::: ##.... ##: ##::: ##:: ##::::: ##:::: ##:::: ##: ##::. ##::
zář 01 15:34:16 algonquin kapacitord[18526]: ##::. ##: ##:::: ##: ##:::::::: ##:::: ##:. ######::'####:::: ##::::. #######:: ##:::. ##:
zář 01 15:34:16 algonquin kapacitord[18526]: ..::::..::..:::::..::..:::::::::..:::::..:::......:::....:::::..::::::.......:::..:::::..::
zář 01 15:34:16 algonquin kapacitord[18526]: 2017/09/01 15:34:16 Using configuration at: /etc/kapacitor/kapacitor.conf
Since InfluxDB is running on http://localhost:8086, Kapacitor finds it during startup and creates several subscriptions on InfluxDB. These subscriptions tell InfluxDB to send all the data it receives to Kapacitor.
For more log data, check the log file in the traditional /var/log/kapacitor directory.
$ sudo tail -f -n 128 /var/log/kapacitor/kapacitor.log
[run] 2017/09/01 15:34:16 I! Kapacitor starting, version 1.3.1, branch master, commit 3b5512f7276483326577907803167e4bb213c613
[run] 2017/09/01 15:34:16 I! Go version go1.7.5
[srv] 2017/09/01 15:34:16 I! Kapacitor hostname: localhost
[srv] 2017/09/01 15:34:16 I! ClusterID: e181c0c9-f173-42b5-92c7-10878c15887b ServerID: b0a73d8a-dae8-473c-a053-c06fcaacae7d
[task_master:main] 2017/09/01 15:34:16 I! opened
[scrapers] 2017/09/01 15:34:17 I! [Starting target manager...]
[httpd] 2017/09/01 15:34:17 I! Starting HTTP service
[httpd] 2017/09/01 15:34:17 I! Authentication enabled: false
[httpd] 2017/09/01 15:34:17 I! Listening on HTTP: [::]:9092
[run] 2017/09/01 15:34:17 I! Listening for signals
[httpd] 127.0.0.1 - - [01/Sep/2017:15:34:20 +0200] "POST /write?consistency=&db=_internal&precision=ns&rp=monitor HTTP/1.1" 204 0 "-" "InfluxDBClient" 422971ab-8f1a-11e7-8001-000000000000 1373
[httpd] 127.0.0.1 - - [01/Sep/2017:15:34:20 +0200] "POST /write?consistency=&db=telegraf&precision=ns&rp=autogen HTTP/1.1" 204 0 "-" "InfluxDBClient" 42572567-8f1a-11e7-8002-000000000000 336
...
Here some basic startup messages can be seen: the daemon is listening on an HTTP port and write requests are arriving from InfluxDB. At this point InfluxDB is streaming the data it receives from Telegraf to Kapacitor.
Trigger Alert from Stream data
The TICKStack is now set up (excluding Chronograf, which is not covered here). This guide will now introduce the fundamentals of actually working with Kapacitor.
A task in Kapacitor represents an amount of work to do on a set of data. There are two types of tasks: stream and batch. A simple stream task will be used first to present core Kapacitor features, followed by some more sophisticated use cases. Finally, the first simple use case will be covered as a batch task.
Kapacitor uses a DSL called TICKscript to define tasks. Each TICKscript defines a pipeline that tells Kapacitor which data to process and how.
So what should Kapacitor be instructed to do?
The most common Kapacitor use case is triggering alerts. The example that follows will set up an alert on high cpu usage. But how should high cpu usage be defined? Telegraf writes a cpu measurement to InfluxDB that includes the field usage_idle, the percentage of time the cpu spent in an idle state. For demonstration purposes, assume that a critical alert should be triggered whenever idle usage drops below 70%.
A TICKscript can now be written to cover these criteria. Copy the script below into a file called cpu_alert.tick:
stream
    // Select just the cpu measurement from our example database.
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: int("usage_idle") < 70)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
Kapacitor has an HTTP API with which all communication happens. The kapacitor client application exposes the API over the command line. Now use this CLI tool to define the task, along with the database and retention policy it can access:
kapacitor define cpu_alert \
-type stream \
-tick cpu_alert.tick \
-dbrp telegraf.autogen
Verify that the alert has been created using the list command.
$ kapacitor list tasks
ID Type Status Executing Databases and Retention Policies
cpu_alert stream disabled false ["telegraf"."autogen"]
View details about the task using the show command.
$ kapacitor show cpu_alert
ID: cpu_alert
Error:
Template:
Type: stream
Status: disabled
Executing: false
...
This command will be covered in more detail below.
Kapacitor now knows how to trigger the alert.
However, nothing is going to happen until the task has been enabled. Before being enabled, it should first be tested to ensure that it does not spam the log files or communication channels with alerts. Record the current data stream for a bit so it can be used to test the new task:
kapacitor record stream -task cpu_alert -duration 60s
Since the task was defined with a database and retention policy pair, the recording knows to only record data from that database and retention policy.
- NOTE – troubleshooting connection refused – If, when running the record command, an error is returned of the type getsockopt: connection refused (Linux) or connectex: No connection could be made... (Windows), please ensure that the Kapacitor service is running. See the section above, Installing and Starting Kapacitor. If Kapacitor is started and this error is still encountered, check the firewall settings of the host machine and ensure that port 9092 is accessible. Check as well the messages in /var/log/kapacitor/kapacitor.log. There may be an issue with the http or other configuration in /etc/kapacitor/kapacitor.conf, and this will appear in the log. If the Kapacitor service is running on another host machine, set the KAPACITOR_URL environment variable in the local shell to the Kapacitor endpoint on the remote machine.
Now grab the ID that was returned and put it in a bash variable for easy use later on (the actual UUID returned will be different):
rid=cd158f21-02e6-405c-8527-261ae6f26153
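Alternatively, since the record command prints the new recording ID to stdout, the ID can be captured in one step when making the recording; a small optional shortcut:
# capture the recording ID directly from the record command's output
rid=$(kapacitor record stream -task cpu_alert -duration 60s)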
Confirm that the recording captured some data. Run
kapacitor list recordings $rid
The output should appear like:
ID Type Status Size Date
cd158f21-02e6-405c-8527-261ae6f26153 stream finished 2.2 kB 04 May 16 11:44 MDT
As long as the size is more than a few bytes it is certain that some data was captured.
If Kapacitor is not receiving data yet, check each layer: Telegraf → InfluxDB → Kapacitor. Telegraf will log errors if it cannot communicate to InfluxDB. InfluxDB will log an error about connection refused if it cannot send data to Kapacitor. Run the query SHOW SUBSCRIPTIONS to find the endpoint that InfluxDB is using to send data to Kapacitor.
$ curl -G 'http://localhost:8086/query?db=telegraf' --data-urlencode 'q=SHOW SUBSCRIPTIONS'
{"results":[{"statement_id":0,"series":[{"name":"_internal","columns":["retention_policy","name","mode","destinations"],"values":[["monitor","kapacitor-ef3b3f9d-0997-4c0b-b1b6-5d0fb37fe509","ANY",["http://localhost:9092"]]]},{"name":"telegraf","columns":["retention_policy","name","mode","destinations"],"values":[["autogen","kapacitor-ef3b3f9d-0997-4c0b-b1b6-5d0fb37fe509","ANY",["http://localhost:9092"]]]}]}]}
With a snapshot of data recorded from the stream, that data can then be replayed to the new task. The replay action replays data only to a specific task. This way the task can be tested in complete isolation:
kapacitor replay -recording $rid -task cpu_alert
Since the data has already been recorded, it can be replayed as fast as possible instead of waiting for real time to pass. When the flag -real-clock is set, the data will be replayed by waiting for the deltas between the timestamps to pass, though the result is identical whether real time passes or not, because time is measured on each node by the data points it receives.
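For comparison, to replay the same recording while preserving the original timing between data points, add the flag described above:
kapacitor replay -recording $rid -task cpu_alert -real-clock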
Check the log using the command below.
sudo cat /tmp/alerts.log
Were any alerts received? The file should contain lines of JSON, where each line represents one alert. The JSON line contains the alert level and the data that triggered the alert.
Depending on how busy the host machine was, maybe not.
The task can be modified to be really sensitive to ensure the alerts are working.
In the TICKscript, change the lambda function .crit(lambda: int("usage_idle") < 70) to .crit(lambda: int("usage_idle") < 100) and define the task once more. Any time you want to update a task, change the TICKscript and then run the define command again with just the TASK_NAME and -tick arguments:
kapacitor define cpu_alert -tick cpu_alert.tick
Now every data point that was received during the recording will trigger an alert.
Replay it again and verify the results.
kapacitor replay -recording $rid -task cpu_alert
Once the alerts.log results verify that it is working, change the usage_idle threshold back to a more reasonable level and redefine the task once more using the define command as shown above.
Enable the task, so it can start processing the live data stream, with:
kapacitor enable cpu_alert
Now alerts will be written to the log in real time.
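If the live alerts ever need to be paused, for example while tuning thresholds, the task can be switched off and back on without deleting it:
kapacitor disable cpu_alert
kapacitor enable cpu_alert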
To see that the task is receiving data and behaving as expected, run the show command once again to get more information about it:
$ kapacitor show cpu_alert
ID: cpu_alert
Error:
Type: stream
Status: Enabled
Executing: true
Created: 04 May 16 21:01 MDT
Modified: 04 May 16 21:04 MDT
LastEnabled: 04 May 16 21:03 MDT
Databases Retention Policies: ["telegraf"."autogen"]
TICKscript:
stream
    // Select just the cpu measurement from our example database.
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: int("usage_idle") < 70)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')
DOT:
digraph asdf {
    graph [throughput="0.00 points/s"];
    stream0 [avg_exec_time_ns="0" ];
    stream0 -> from1 [processed="12"];
    from1 [avg_exec_time_ns="0" ];
    from1 -> alert2 [processed="12"];
    alert2 [alerts_triggered="0" avg_exec_time_ns="0" ];
}
The first part has information about the state of the task and any error it may have encountered.
The TICKscript section displays the version of the TICKscript that Kapacitor has stored in its local database. The last section, DOT, is a graphviz dot formatted tree that describes the data processing pipeline defined by the TICKscript. Its members are nodes, whose key-value entries hold statistics about each node, and edges linking each node to the next, likewise annotated with statistics. The processed key in the edge members indicates the number of data points that have passed along the specified edge of the graph.
For example, in the above, the stream0 node (i.e. the stream var from the TICKscript) has sent 12 points to the from1 node. The from1 node has also sent 12 points on to the alert2 node. Since Telegraf is configured to send cpu data, all 12 points match the from/measurement criteria of the from1 node and are passed on.
NOTE: When installing graphviz on Debian or RedHat (if not already installed) use the package provided by the OS provider. The packages offered in the download section of the graphviz site are not up-to-date.
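With graphviz installed, the DOT section can be rendered as an image for easier reading; a rough sketch, assuming the DOT output has been saved to a file named cpu_alert.dot:
# render the pipeline graph to a PNG for inspection
dot -Tpng cpu_alert.dot -o cpu_alert.png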
Now that the task is running with live data, here is a quick hack to use 100% of one core to generate some artificial cpu activity:
while true; do i=0; done
There are plenty of ways to get a threshold alert. So, why all this pipeline TICKscript stuff? In short because TICKscripts can quickly be extended to become much more powerful.
Gotcha - single versus double quotes
Single quotes and double quotes in TICKscripts do very different things. Note the following example:
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        // NOTE: Double quotes on server1
        .where(lambda: "host" == "server1")
The result of this search will always be empty, because double quotes were used around “server1”. This means that Kapacitor will search for series where the field “host” is equal to the value held in the field “server1”. This is probably not what was intended. More likely the intention was to search for series where the tag “host” has the value ‘server1’, so single quotes should be used: double quotes denote data fields, single quotes denote string values. To match the tag value, the TICKscript above should look like this:
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        // NOTE: Single quotes on server1
        .where(lambda: "host" == 'server1')
Extending TICKscripts
The TICKscript below will compute the running mean and compare current values to it.
It will then trigger an alert if the values are more than 3 standard deviations away from the mean.
Replace the cpu_alert.tick script with the TICKscript below:
stream
    |from()
        .measurement('cpu')
    |alert()
        // Compare values to running mean and standard deviation
        .crit(lambda: sigma("usage_idle") > 3)
        .log('/tmp/alerts.log')
Just like that, a dynamic threshold can be created, and, if cpu usage drops in the day or spikes at night, an alert will be issued.
Try it out.
Use define to update the task TICKscript.
kapacitor define cpu_alert -tick cpu_alert.tick
NOTE: If a task is already enabled, redefining the task with the define command will automatically reload it. To define a task without reloading it, use -no-reload.
Now tail the alert log:
sudo tail -f /tmp/alerts.log
There should not be any alerts triggering just yet. Next, start a while loop to add some load:
while true; do i=0; done
An alert trigger should be written to the log shortly, once enough artificial load has been created. Leave the loop running for a few minutes. After canceling the loop, another alert should be issued indicating that cpu usage has again changed. Using this technique, alerts can be generated for the rising and falling edges of cpu usage, as well as for any outliers.
A Real-World Example
Now that the basics have been covered, here is a more real-world example. Once the metrics from several hosts are streaming to Kapacitor, it is possible to do something like the following: aggregate and group the cpu usage for each service running in each datacenter, and then trigger an alert based off the 95th percentile. In addition to writing the alert to a log, Kapacitor can integrate with third-party utilities: currently Slack, PagerDuty, HipChat, VictorOps and more are supported. The alert can also be sent by email, be posted to a custom endpoint, or trigger the execution of a custom script. Custom message formats can also be defined so that alerts have the right context and meaning. The TICKscript for this would look like the following example.
Example - TICKscript for stream on multiple service cpus and alert on 95th percentile
stream
    |from()
        .measurement('cpu')
    // create a new field called 'used' which inverts the idle cpu.
    |eval(lambda: 100.0 - "usage_idle")
        .as('used')
    |groupBy('service', 'datacenter')
    |window()
        .period(1m)
        .every(1m)
    // calculate the 95th percentile of the used cpu.
    |percentile('used', 95.0)
    |eval(lambda: sigma("percentile"))
        .as('sigma')
        .keep('percentile', 'sigma')
    |alert()
        .id('{{ .Name }}/{{ index .Tags "service" }}/{{ index .Tags "datacenter"}}')
        .message('{{ .ID }} is {{ .Level }} cpu-95th:{{ index .Fields "percentile" }}')
        // Compare values to running mean and standard deviation
        .warn(lambda: "sigma" > 2.5)
        .crit(lambda: "sigma" > 3.0)
        .log('/tmp/alerts.log')
        // Post data to custom endpoint
        .post('https://alerthandler.example.com')
        // Execute custom alert handler script
        .exec('/bin/custom_alert_handler.sh')
        // Send alerts to slack
        .slack()
        .channel('#alerts')
        // Sends alerts to PagerDuty
        .pagerDuty()
        // Send alerts to VictorOps
        .victorOps()
        .routingKey('team_rocket')
Something as simple as defining an alert can quickly be extended to apply to a much larger scope. With the above script, an alert will be triggered if any service in any datacenter deviates more than 3 standard deviations away from normal behavior, as defined by the historical 95th percentile of cpu usage, and will do so within 1 minute!
For more information on how alerting works, see the AlertNode docs.
Trigger Alert from Batch data
Instead of just processing the data in streams, Kapacitor can also periodically query InfluxDB and then process that data in batches. While triggering an alert based off cpu usage is more suited to the streaming case, the basic idea of how batch tasks work is demonstrated here by following the same use case.
This TICKscript does roughly the same thing as the earlier stream task, but as a batch task:
batch
    |query('''
        SELECT mean(usage_idle)
        FROM "telegraf"."autogen"."cpu"
    ''')
        .period(5m)
        .every(5m)
        .groupBy(time(1m), 'cpu')
    |alert()
        .crit(lambda: "mean" < 70)
        .log('/tmp/batch_alerts.log')
Copy the script above into the file batch_cpu_alert.tick.
Define this task:
kapacitor define batch_cpu_alert -type batch -tick batch_cpu_alert.tick -dbrp telegraf.autogen
Verify its creation:
$ kapacitor list tasks
ID Type Status Executing Databases and Retention Policies
batch_cpu_alert batch disabled false ["telegraf"."autogen"]
cpu_alert stream enabled true ["telegraf"."autogen"]
The result of the query in the task can be recorded like so (again, the actual UUID will differ):
kapacitor record batch -task batch_cpu_alert -past 20m
# Save the id again
rid=b82d4034-7d5c-4d59-a252-16604f902832
This will record the last 20 minutes of batches using the query in the batch_cpu_alert task. In this case, since the period is 5 minutes, the last 4 batches will be saved in the recording.
The batch recording can be replayed in the same way:
kapacitor replay -recording $rid -task batch_cpu_alert
Check the alert log to make sure alerts were generated as expected.
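Note that this batch task logs to its own file, as defined in the TICKscript:
sudo tail -f /tmp/batch_alerts.log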
The sigma-based alert above can also be adapted for working with batch data.
Play around and get comfortable with updating, testing, and running tasks in Kapacitor.
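When finished experimenting, test artifacts can be cleaned up with the client's disable and delete commands, for example:
# stop a task without removing it
kapacitor disable cpu_alert
# remove a task and a recording entirely
kapacitor delete tasks batch_cpu_alert
kapacitor delete recordings $rid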
What’s next?
Take a look at the example guides for how to use Kapacitor. The use cases demonstrated there explore some of the richer features of Kapacitor.