A Better Way to Monitor
God is an easy to configure, easy to extend monitoring framework written in Ruby.
Keeping your server processes and tasks running should be a simple part of your deployment process. God aims to be the simplest, most powerful monitoring application available.
Tom Preston-Werner
tom@mojombo.com
Google Group: https://groups.google.com/g/god-rb
Features
- Config file is written in Ruby
- Easily write your own custom conditions in Ruby
- Supports both poll and event based conditions
- Different poll conditions can have different intervals
- Integrated notification system (write your own too!)
- Easily control non-daemonizing scripts
Installation
The best way to get god is via rubygems:
$ [sudo] gem install resurrected_god
Requirements
God currently only works on Linux (kernel 2.6.15+), BSD, and Darwin
systems. No support for Windows is planned. Event based conditions on Linux
systems require the cn (connector) kernel module to be loaded or compiled into
the kernel, and god must be run as root.
The following systems have been tested. Help us test it on others!
- Darwin 10.4.10
- RedHat Fedora 6-15
- Ubuntu Dapper (no events)
- Ubuntu Feisty
- CentOS 4.5 (no events), 5, 6
Quick Start
Note: this quick start guide requires god 0.12.0 or above. You can check your version by running:
$ god --version
The easiest way to understand how god will make your life better is by trying out a simple example. To get you up and running quickly, I’ll show you how to keep a trivial server running.
Open up a new directory and write a simple server. Let’s call it simple.rb:
loop do
puts 'Hello'
sleep 1
end
Now we’ll write a god config file that tells god about our process. Place it
in the same directory and call it simple.god:
God.watch do |w|
w.name = "simple"
w.start = "ruby /full/path/to/simple.rb"
w.keepalive
end
This is the simplest possible god configuration. We start by declaring a
God.watch block. A watch in god represents a process that we want to watch
and control. Each watch must have, at minimum, a unique name and a command that
tells god how to start the process. The keepalive declaration tells god to
keep this process alive. If the process is not running when god starts, it will
be started. If the process dies, it will be restarted.
In this example the simple process runs in the foreground, so god will take
care of daemonizing it and keeping track of the PID for us. When possible, it’s
best to let god daemonize processes for us; that way we don’t have to worry
about specifying and keeping track of PID files. Later on we’ll see how to
manage processes that can’t run in the foreground or that require PID files to
be specified.
To run god, we give it the configuration file we wrote with -c. To see what’s
going on, we can ask it to run in the foreground with -D:
$ god -c path/to/simple.god -D
There are two ways that god can monitor your process. The first and better way is with process events. Not every system supports them, but those that do will use them automatically. With events, god knows immediately when a process exits. On systems without process event support, god falls back to a polling mechanism. The output you see throughout this section will show both ways.
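For intuition about the polling fallback, here is a minimal sketch in plain Ruby of checking whether a PID is still alive. The process_running? helper is a hypothetical name used for illustration; it is not god’s API, and god’s own check may differ.

```ruby
# Returns true if a process with the given PID currently exists.
# Hypothetical helper for illustration; not part of god.
def process_running?(pid)
  # Signal 0 sends nothing; it only checks that the PID can be signalled.
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

puts process_running?(Process.pid)
```

A poller has to run a check like this on a timer, so an exit is only noticed at the next tick. An event-based monitor is told about the exit as soon as it happens.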
After starting god, you should see some output like the following:
# Events
I [2011-12-10 15:24:34] INFO: Loading simple.god
I [2011-12-10 15:24:34] INFO: Syslog enabled.
I [2011-12-10 15:24:34] INFO: Using pid file directory: /Users/tom/.god/pids
I [2011-12-10 15:24:34] INFO: Started on drbunix:///tmp/god.17165.sock
I [2011-12-10 15:24:34] INFO: simple move 'unmonitored' to 'init'
I [2011-12-10 15:24:34] INFO: simple moved 'unmonitored' to 'init'
I [2011-12-10 15:24:34] INFO: simple [trigger] process is not running (ProcessRunning)
I [2011-12-10 15:24:34] INFO: simple move 'init' to 'start'
I [2011-12-10 15:24:34] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-10 15:24:34] INFO: simple moved 'init' to 'start'
I [2011-12-10 15:24:34] INFO: simple [trigger] process is running (ProcessRunning)
I [2011-12-10 15:24:34] INFO: simple move 'start' to 'up'
I [2011-12-10 15:24:34] INFO: simple registered 'proc_exit' event for pid 23298
I [2011-12-10 15:24:34] INFO: simple moved 'start' to 'up'
# Polls
I [2011-12-07 09:40:18] INFO: Loading simple.god
I [2011-12-07 09:40:18] INFO: Syslog enabled.
I [2011-12-07 09:40:18] INFO: Using pid file directory: /Users/tom/.god/pids
I [2011-12-07 09:40:18] INFO: Started on drbunix:///tmp/god.17165.sock
I [2011-12-07 09:40:18] INFO: simple move 'unmonitored' to 'up'
I [2011-12-07 09:40:18] INFO: simple moved 'unmonitored' to 'up'
I [2011-12-07 09:40:18] INFO: simple [trigger] process is not running (ProcessRunning)
I [2011-12-07 09:40:18] INFO: simple move 'up' to 'start'
I [2011-12-07 09:40:18] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-07 09:40:19] INFO: simple moved 'up' to 'up'
I [2011-12-07 09:40:19] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 09:40:24] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 09:40:29] INFO: simple [ok] process is running (ProcessRunning)
Here you can see god starting up, noticing that the simple process isn’t
running, starting it, and then checking every five seconds to make sure it’s
up. If you’d like to see god work its magic, go ahead and kill the simple
process. You should then see something like this:
# Events
I [2011-12-10 15:33:38] INFO: simple [trigger] process 23416 exited (ProcessExits)
I [2011-12-10 15:33:38] INFO: simple move 'up' to 'start'
I [2011-12-10 15:33:38] INFO: simple deregistered 'proc_exit' event for pid 23416
I [2011-12-10 15:33:38] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-10 15:33:38] INFO: simple moved 'up' to 'start'
I [2011-12-10 15:33:38] INFO: simple [trigger] process is running (ProcessRunning)
I [2011-12-10 15:33:38] INFO: simple move 'start' to 'up'
I [2011-12-10 15:33:38] INFO: simple registered 'proc_exit' event for pid 23601
I [2011-12-10 15:33:38] INFO: simple moved 'start' to 'up'
# Polls
I [2011-12-07 09:54:59] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 09:55:04] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 09:55:09] INFO: simple [trigger] process is not running (ProcessRunning)
I [2011-12-07 09:55:09] INFO: simple move 'up' to 'start'
I [2011-12-07 09:55:09] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-07 09:55:09] INFO: simple moved 'up' to 'up'
I [2011-12-07 09:55:09] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 09:55:14] INFO: simple [ok] process is running (ProcessRunning)
While keeping a process up is useful, it would be even better if we could make
sure our process was behaving well and restart it when resource utilization
exceeds our specifications. With a few additions, we can easily have our
process restarted when memory usage or CPU goes above certain limits. Edit
your simple.god config file to look like this:
God.watch do |w|
w.name = "simple"
w.start = "ruby /full/path/to/simple.rb"
w.keepalive(memory_max: 150.megabytes,
cpu_max: 50.percent)
end
Here I’ve specified a :memory_max option to the keepalive command. Now if
the process memory usage goes above 150 megabytes, god will restart it.
Similarly, by setting the :cpu_max, god will restart my process if its CPU
usage goes over 50%. By default these properties will be checked every 30
seconds and will be acted upon if there is an overage for three out of any
five checks. This prevents the process from getting restarted for temporary
resource spikes.
To test this out, modify your simple.rb server script to introduce a memory
leak:
data = +''
loop do
puts 'Hello'
100000.times { data << 'x' }
end
Ctrl-C out of the foregrounded god instance. Notice that your current simple
server will continue to run. Start god again with the same command as before.
Now instead of starting the simple process, it will notice that one is
already running and simply switch to the up state.
# Events
I [2011-12-10 15:36:00] INFO: Loading simple.god
I [2011-12-10 15:36:00] INFO: Syslog enabled.
I [2011-12-10 15:36:00] INFO: Using pid file directory: /Users/tom/.god/pids
I [2011-12-10 15:36:00] INFO: Started on drbunix:///tmp/god.17165.sock
I [2011-12-10 15:36:00] INFO: simple move 'unmonitored' to 'init'
I [2011-12-10 15:36:00] INFO: simple moved 'unmonitored' to 'init'
I [2011-12-10 15:36:00] INFO: simple [trigger] process is running (ProcessRunning)
I [2011-12-10 15:36:00] INFO: simple move 'init' to 'up'
I [2011-12-10 15:36:00] INFO: simple registered 'proc_exit' event for pid 23601
I [2011-12-10 15:36:00] INFO: simple moved 'init' to 'up'
# Polls
I [2011-12-07 14:50:46] INFO: Loading simple.god
I [2011-12-07 14:50:46] INFO: Syslog enabled.
I [2011-12-07 14:50:46] INFO: Using pid file directory: /Users/tom/.god/pids
I [2011-12-07 14:50:47] INFO: Started on drbunix:///tmp/god.17165.sock
I [2011-12-07 14:50:47] INFO: simple move 'unmonitored' to 'up'
I [2011-12-07 14:50:47] INFO: simple moved 'unmonitored' to 'up'
I [2011-12-07 14:50:47] INFO: simple [ok] process is running (ProcessRunning)
In order to get our new simple server running, we can issue a command to god
to have our process restarted:
$ god restart simple
From the logs you can see god killing and restarting the process:
# Events
I [2011-12-10 15:38:13] INFO: simple move 'up' to 'restart'
I [2011-12-10 15:38:13] INFO: simple deregistered 'proc_exit' event for pid 23601
I [2011-12-10 15:38:13] INFO: simple stop: default lambda killer
I [2011-12-10 15:38:13] INFO: simple sent SIGTERM
I [2011-12-10 15:38:14] INFO: simple process stopped
I [2011-12-10 15:38:14] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-10 15:38:14] INFO: simple moved 'up' to 'restart'
I [2011-12-10 15:38:14] INFO: simple [trigger] process is running (ProcessRunning)
I [2011-12-10 15:38:14] INFO: simple move 'restart' to 'up'
I [2011-12-10 15:38:14] INFO: simple registered 'proc_exit' event for pid 23707
I [2011-12-10 15:38:14] INFO: simple moved 'restart' to 'up'
# Polls
I [2011-12-07 14:51:13] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 14:51:13] INFO: simple move 'up' to 'restart'
I [2011-12-07 14:51:13] INFO: simple stop: default lambda killer
I [2011-12-07 14:51:13] INFO: simple sent SIGTERM
I [2011-12-07 14:51:14] INFO: simple process stopped
I [2011-12-07 14:51:14] INFO: simple start: ruby /Users/tom/dev/mojombo/god/simple.rb
I [2011-12-07 14:51:14] INFO: simple moved 'up' to 'up'
I [2011-12-07 14:51:14] INFO: simple [ok] process is running (ProcessRunning)
God will now start reporting on memory and CPU utilization of your process:
# Events and Polls
I [2011-12-07 14:54:37] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 14:54:37] INFO: simple [ok] memory within bounds [2032kb] (MemoryUsage)
I [2011-12-07 14:54:37] INFO: simple [ok] cpu within bounds [0.0%%] (CpuUsage)
I [2011-12-07 14:54:42] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 14:54:42] INFO: simple [ok] memory within bounds [2032kb, 13492kb] (MemoryUsage)
I [2011-12-07 14:54:42] INFO: simple [ok] cpu within bounds [0.0%%, *99.7%%] (CpuUsage)
I [2011-12-07 14:54:47] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 14:54:47] INFO: simple [ok] memory within bounds [2032kb, 13492kb, 25568kb] (MemoryUsage)
I [2011-12-07 14:54:47] INFO: simple [ok] cpu within bounds [0.0%%, *99.7%%, *100.0%%] (CpuUsage)
I [2011-12-07 14:54:52] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 14:54:52] INFO: simple [ok] memory within bounds [2032kb, 13492kb, 25568kb, 37556kb] (MemoryUsage)
I [2011-12-07 14:54:52] INFO: simple [trigger] cpu out of bounds [0.0%%, *99.7%%, *100.0%%, *98.4%%] (CpuUsage)
I [2011-12-07 14:54:52] INFO: simple move 'up' to 'restart'
In the last two lines of the above log you can see that CPU usage has gone
above 50% for three cycles, so god issues a restart operation. God will
continue to monitor the simple process for as long as god is running and the
process is set to be monitored.
Now, before you kill the god process, let’s kill the simple server by asking
god to stop it for us. In a new terminal, issue the command:
$ god stop simple
You should see the following output:
Sending 'stop' command
The following watches were affected:
simple
And in the foregrounded god terminal window, you’ll see the log of what happened:
# Events
I [2011-12-10 15:41:04] INFO: simple stop: default lambda killer
I [2011-12-10 15:41:04] INFO: simple sent SIGTERM
I [2011-12-10 15:41:05] INFO: simple process stopped
I [2011-12-10 15:41:05] INFO: simple move 'up' to 'unmonitored'
I [2011-12-10 15:41:05] INFO: simple deregistered 'proc_exit' event for pid 23707
I [2011-12-10 15:41:05] INFO: simple moved 'up' to 'unmonitored'
# Polls
I [2011-12-07 09:59:59] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 10:00:04] INFO: simple [ok] process is running (ProcessRunning)
I [2011-12-07 10:00:07] INFO: simple stop: default lambda killer
I [2011-12-07 10:00:07] INFO: simple sent SIGTERM
I [2011-12-07 10:00:08] INFO: simple process stopped
I [2011-12-07 10:00:08] INFO: simple move 'up' to 'unmonitored'
I [2011-12-07 10:00:08] INFO: simple moved 'up' to 'unmonitored'
Now feel free to Ctrl-C out of god. Congratulations! You’ve just taken god for a test ride and seen how easy it is to keep your processes running.
This is just the beginning of what god can do, and in reality, the keepalive
command is a convenience method written using more advanced transitional and
condition constructs that may be used directly. You can configure many
different kinds of conditions to have your process restarted when memory or
CPU are too high, when disk usage is above a threshold, when a process returns
an HTTP error code on a specific URL, and many more. In addition you can write
your own custom conditions and use them in your configuration files. Many
different lifecycle controls are available alongside a sophisticated and
extensible notifications system. Keep reading to find out what makes god
different from other monitoring systems and how it can help you solve many of
your process monitoring and control problems.
Config Files are Ruby Code!
Now that you’ve seen how to get started quickly, let’s see how to use the more powerful aspects of god. Once again, the best way to learn will be through an example. The following configuration file is what I once used at gravatar.com to keep the mongrels running:
RAILS_ROOT = "/Users/tom/dev/gravatar2"
%w{8200 8201 8202}.each do |port|
God.watch do |w|
w.name = "gravatar2-mongrel-#{port}"
w.start = "mongrel_rails start -c #{RAILS_ROOT} -p #{port} \
-P #{RAILS_ROOT}/log/mongrel.#{port}.pid -d"
w.stop = "mongrel_rails stop -P #{RAILS_ROOT}/log/mongrel.#{port}.pid"
w.restart = "mongrel_rails restart -P #{RAILS_ROOT}/log/mongrel.#{port}.pid"
w.pid_file = File.join(RAILS_ROOT, "log/mongrel.#{port}.pid")
w.behavior(:clean_pid_file)
w.start_if do |start|
start.condition(:process_running) do |c|
c.interval = 5.seconds
c.running = false
end
end
w.restart_if do |restart|
restart.condition(:memory_usage) do |c|
c.above = 150.megabytes
c.times = [3, 5] # 3 out of 5 intervals
end
restart.condition(:cpu_usage) do |c|
c.above = 50.percent
c.times = 5
end
end
# lifecycle
w.lifecycle do |on|
on.condition(:flapping) do |c|
c.to_state = [:start, :restart]
c.times = 5
c.within = 5.minute
c.transition = :unmonitored
c.retry_in = 10.minutes
c.retry_times = 5
c.retry_within = 2.hours
end
end
end
end
That’s a lot to take in at once, so I’ll break it down by section and explain what’s going on in each.
RAILS_ROOT = "/Users/tom/dev/gravatar2"
Here I’ve set a constant that is used throughout the file. Keeping the
RAILS_ROOT value in a constant makes it easy to adapt this script to other
applications. Because the config file is Ruby code, I can set whatever
variables or constants I want that make the configuration more concise and
easier to work with.
%w{8200 8201 8202}.each do |port|
...
end
Because the config file is written in actual Ruby code, we can construct loops and do other intelligent things that are impossible in your everyday, run-of-the-mill config file. I need to watch three mongrels, so I simply loop over their port numbers, eliminating duplication and making my life a whole lot easier.
God.watch do |w|
w.name = "gravatar2-mongrel-#{port}"
w.start = "mongrel_rails start -c #{RAILS_ROOT} -p #{port} \
-P #{RAILS_ROOT}/log/mongrel.#{port}.pid -d"
w.stop = "mongrel_rails stop -P #{RAILS_ROOT}/log/mongrel.#{port}.pid"
w.restart = "mongrel_rails restart -P #{RAILS_ROOT}/log/mongrel.#{port}.pid"
w.pid_file = File.join(RAILS_ROOT, "log/mongrel.#{port}.pid")
...
end
A watch represents a single process that has concrete start, stop, and/or
restart operations. You can define as many watches as you like. In the example
above, I’ve got some Rails instances running in Mongrels that I need to keep
alive. Every watch must have a unique name so that it can be identified
later on. The start and stop attributes specify the commands to start
and stop the process. If no restart attribute is set, restart will be
represented by a call to stop followed by a call to start. The
optional grace attribute sets the amount of time following a
start/stop/restart command to wait before resuming normal monitoring
operations. If the process you’re watching runs as a daemon (as
mine does), you’ll need to set the pid_file attribute.
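As a sketch of the optional grace attribute described above, a watch might set it as follows. The watch name, the command, and the 10-second value are placeholders chosen for illustration, and the fragment only runs when god loads it as a config file.

```ruby
# Config fragment, to be loaded with `god -c`; the name and command are placeholders.
God.watch do |w|
  w.name  = "example"
  w.start = "ruby /full/path/to/example.rb"
  # Wait 10 seconds after a start/stop/restart before resuming monitoring.
  w.grace = 10.seconds
end
```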
w.behavior(:clean_pid_file)
Behaviors allow you to execute additional commands around start/stop/restart
commands. In our case, if the process dies it will leave a PID file behind.
The next time a start command is issued, it will fail, complaining about the
leftover PID file. We’d like the PID file cleaned up before a start command is
issued. The built-in behavior clean_pid_file will do just that.
w.start_if do |start|
start.condition(:process_running) do |c|
c.interval = 5.seconds
c.running = false
end
end
Watches contain conditions grouped by the action to execute should they return
true. I start with a start_if block that contains a single condition.
Conditions are specified by calling condition with an identifier, in this
case :process_running. Each condition can specify a poll interval that will
override the default watch interval. In this case, I want to check that the
process is still running every 5 seconds instead of the 30 second interval
that other conditions will inherit. The ability to set condition specific poll
intervals makes it possible to run critical tests (such as :process_running)
more often than less critical tests (such as :memory_usage and :cpu_usage).
w.restart_if do |restart|
restart.condition(:memory_usage) do |c|
c.above = 150.megabytes
c.times = [3, 5] # 3 out of 5 intervals
end
...
end
Similar to start_if there is a restart_if command that groups conditions
that should trigger a restart. The memory_usage condition will fail if the
specified process is using too much memory. The maximum allowable amount of
memory is specified with the above attribute (you can use the kilobytes,
megabytes, or gigabytes helpers). The number of times the test needs to
fail in order to trigger a restart is set with times. This can be either an
integer or an array. An integer means it must fail that many times in a row
while an array [x, y] means it must fail x times out of the last y tests.
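To make the two forms of times concrete, here is a small plain-Ruby sketch of the rule as described above. The FailureWindow class is a hypothetical illustration, not god’s implementation; it assumes an integer n behaves like [n, n].

```ruby
# Keeps the most recent check results and reports whether enough of them
# failed. Hypothetical illustration; not god's implementation.
class FailureWindow
  # times - Integer n (n failures in a row) or Array [x, y]
  #         (x failures within the last y checks).
  def initialize(times)
    @needed, @size = times.is_a?(Array) ? times : [times, times]
    @results = []
  end

  # Record one check (failed is true or false).
  # Returns true when the threshold is reached.
  def record(failed)
    @results << failed
    @results.shift if @results.size > @size
    @results.count(true) >= @needed
  end
end

window = FailureWindow.new([3, 5])
[true, false, true, false, true].each { |failed| puts window.record(failed) }
```

With [3, 5], occasional passing checks do not reset the count, so a process that is over the limit most of the time is still restarted. With a plain integer, a single passing check breaks the streak.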
w.restart_if do |restart|
...
restart.condition(:cpu_usage) do |c|
c.above = 50.percent
c.times = 5
end
end
To keep an eye on CPU usage, I’ve employed the cpu_usage condition. When CPU
usage for a Mongrel process is over 50% for 5 consecutive intervals, it will
be restarted.
w.lifecycle do |on|
on.condition(:flapping) do |c|
c.to_state = [:start, :restart]
c.times = 5
c.within = 5.minute
c.transition = :unmonitored
c.retry_in = 10.minutes
c.retry_times = 5
c.retry_within = 2.hours
end
end
Conditions inside a lifecycle section are active as long as the process is being monitored (they live across state changes).
The :flapping condition guards against the edge case wherein god rapidly
starts or restarts your application. Things like server configuration changes
or the unavailability of external services could make it impossible for my
process to start. In that case, god will try to start my process over and over
to no avail. The :flapping condition provides two levels of giving up on
flapping processes. If I were to translate the options of the code above, it
would be something like: If this watch is started or restarted five times
within 5 minutes, then unmonitor it…then after ten minutes, monitor it
again to see if it was just a temporary problem; if the process is seen to be
flapping five times within two hours, then give up completely.
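The first level of that rule (times transitions within a time window) can be sketched in plain Ruby. FlapDetector is a hypothetical name used for illustration, and this is not god’s implementation; the retry_in, retry_times, and retry_within level is left out.

```ruby
# Counts transitions inside a sliding time window.
# Hypothetical illustration; not god's implementation.
class FlapDetector
  # times  - number of transitions that counts as flapping
  # within - window length in seconds
  def initialize(times:, within:)
    @times = times
    @within = within
    @stamps = []
  end

  # Record a start/restart at time `now` (in seconds).
  # Returns true once `times` transitions fall inside the window.
  def record(now)
    @stamps << now
    @stamps.reject! { |t| now - t > @within }
    @stamps.size >= @times
  end
end

detector = FlapDetector.new(times: 5, within: 300)
[0, 60, 120, 180, 240].each { |t| puts detector.record(t) }
```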
That’s it!
Starting and Controlling God
To start the god monitoring process as a daemon, simply run the god
executable, passing in the path to the config file (you need to sudo if you’re
using events on Linux or want to use the setuid/setgid functionality):
$ sudo god -c /path/to/config.god
While you’re writing your config file, it can be helpful to run god in the foreground so you can see the log messages. You can do that with:
$ sudo god -c /path/to/config.god -D
You can start/restart/stop/monitor/unmonitor your Watches with the same utility like so:
$ sudo god stop gravatar2-mongrel-8200
Watching Non-Daemon Processes
Need to watch a script that doesn’t have built in daemonization? No problem!
God will daemonize and keep track of your process for you. If you don’t
specify a pid_file attribute for a watch, it will be auto-daemonized and a
PID file will be stored for it in /var/run/god.
God.pid_file_directory = '/home/tom/pids'
# Watch for a process that daemonizes itself (-d) and writes its own PID file
God.watch do |w|
  w.name = 'mongrel'
  w.pid_file = File.join(RAILS_ROOT, "log/mongrel.pid")
  w.start = "mongrel_rails start -P #{RAILS_ROOT}/log/mongrel.pid -d"
  # ...
end
# Watch with no pid_file set: god daemonizes the process and tracks its PID
God.watch do |w|
  w.name = 'worker'
  # w.pid_file is not set
  w.start = "rake resque:worker"
  # ...
end
If you’d rather have the PID file stored in a different location, you can set it at the top of your config:
God.pid_file_directory = '/home/tom/pids'
The directory you specify must be writable by god.
Grouping Watches
Watches can be assigned to groups. These groups can then be controlled together from the command line.
God.watch do |w|
...
w.group = 'mongrels'
...
end
The above configuration now allows you to control the watch (and any others that are in the group) with a single command:
$ sudo god stop mongrels
Invoke Commands for all watches
If you need to invoke a command (e.g. Stop / Start / Restart) on all watches you can simply omit the second parameter. For example, to start all watches:
$ sudo god start
Redirecting STDOUT and STDERR of your Process
By default, the STDOUT stream for your process is redirected to /dev/null.
To get access to this output, you can redirect the stream either to a file or
to a command.
To redirect STDOUT to a file, set the log attribute to a file path. The file
will be written in append mode and created if it does not exist.
God.watch do |w|
...
w.log = '/var/log/myprocess.log'
...
end
To redirect STDOUT to a command that will be run for you, set the log_cmd
attribute to a command.
God.watch do |w|
...
w.log_cmd = '/usr/bin/logger'
...
end
By default, STDERR is redirected to STDOUT. You can redirect it to a file or a
command just like STDOUT by setting the err_log or err_log_cmd attributes
respectively.
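As a sketch, a watch that sends the two streams to separate files could look like this. The watch name, the command, and both file paths are placeholders for illustration, and the fragment only runs when god loads it as a config file.

```ruby
# Config fragment, to be loaded with `god -c`; name, command, and paths are placeholders.
God.watch do |w|
  w.name    = "example"
  w.start   = "ruby /full/path/to/example.rb"
  w.log     = '/var/log/myprocess.log'     # STDOUT
  w.err_log = '/var/log/myprocess.err.log' # STDERR
end
```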
Changing UID/GID for processes
It is possible to have god run your start/stop/restart commands as a specific
user/group. This can be done by setting the uid and/or gid attributes of a
watch.
God.watch do |w|
...
w.uid = 'tom'
w.gid = 'devs'
...
end
This only works for commands specified as a string. Lambda commands are unaffected.
Setting the Working Directory
By default, God sets the working directory to / before running your process.
You can change this by setting the dir attribute on the watch.
God.watch do |w|
...
w.dir = '/var/www/myapp'
...
end
Setting environment variables
You can set any number of environment variables you wish via the env
attribute of a watch.
God.watch do |w|
...
w.env = { 'RAILS_ROOT' => "/var/www/myapp",
'RAILS_ENV' => "production" }
...
end
Using chroot to Change the File System Root
If you want your process to run chrooted, simply use the chroot attribute on
the watch. The specified directory must exist and have a /dev/null.
God.watch do |w|
...
w.chroot = '/var/myroot'
...
end
Lambda commands
In addition to specifying start/stop/restart commands as strings (to be executed via the shell), you can specify a lambda that will be called.
God.watch do |w|
...
w.start = lambda { ENV['APACHE'] ? `apachectl -k graceful` : `lighttpd restart` }
...
end
Customizing the Default Stop Lambda
If you do not provide a stop command, God will attempt to stop your process by
first sending a SIGTERM. It will then wait for ten seconds for the process to
exit. If after this time it still has not exited, it will be sent a SIGKILL.
You can customize the stop signal and/or the time to wait for the process to
exit by setting the stop_signal and stop_timeout attributes on the watch.
God.watch do |w|
...
w.stop_signal = 'QUIT'
w.stop_timeout = 20.seconds
...
end
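The stop sequence described above (signal, wait, then SIGKILL) can be sketched in plain Ruby. The stop_process helper is hypothetical and is not god’s code. It assumes the target is a child of the calling process so that it can be reaped with waitpid.

```ruby
# Sends `signal`, waits up to `timeout` seconds for the child to exit,
# and falls back to SIGKILL. Hypothetical illustration; not god's code.
def stop_process(pid, signal: 'TERM', timeout: 10)
  Process.kill(signal, pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    # WNOHANG makes waitpid return nil while the child is still running.
    return :stopped if Process.waitpid(pid, Process::WNOHANG)
    sleep 0.1
  end
  Process.kill('KILL', pid)
  Process.waitpid(pid)
  :killed
end

child = fork { sleep 60 }
puts stop_process(child, timeout: 5)
```

A longer timeout gives a process more time to finish in-flight work before it is killed. A signal such as QUIT only helps if the process actually handles it.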
Loading Other Config Files
You should feel free to separate your god configs into separate files for
easier organization. You can load in other configs using Ruby’s normal load
method, or use the convenience method God.load which allows for glob-style
paths:
# load in all god configs
God.load "/usr/local/conf/*.god"
God won’t start its monitoring operations until all configurations have been loaded.
Dynamically Loading Config Files Into an Already Running God
God allows you to load or reload configurations into an already running instance. There are a few things to consider when doing this:
- Existing Watches with the same name as the incoming Watches will be overridden by the new config.
- All paths must be either absolute or relative to the path from which god was started.
To load a config into a running god, issue the following command:
$ sudo god load path/to/config.god
Config files that are loaded dynamically can contain anything that a normal
config file contains. However, global options such as God.pid_file_directory
will be ignored (and produce a warning in the logs).
Getting Logs for a Single Watch
Sifting through the god logs for statements specific to a single Watch can be frustrating when you have many of them. You can get the realtime logs for a single Watch via the command line:
$ sudo god log local-3000
This will display log output for the 'local-3000' Watch and update every second with new log messages.
You can also supply a shorthand to the log command that will match one of your watches. If it happens to match several, the shortest match will be used:
$ sudo god log l3
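One way such shorthand matching could work is sketched below in plain Ruby. The match_watch helper and its rule are assumptions for illustration: the characters of the shorthand must appear in order in the watch name. God’s actual matching may differ.

```ruby
# Returns the shortest watch name that contains the pattern's characters
# in order, or nil if none match. Assumed rule; not taken from god's source.
def match_watch(pattern, names)
  regex = Regexp.new(pattern.chars.map { |ch| Regexp.escape(ch) }.join('.*'))
  names.select { |name| name =~ regex }.min_by(&:size)
end

puts match_watch('l3', ['local-3000', 'local-3001-extra', 'worker'])
```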
Notifications
God has an extensible notification framework built in that makes it easy to have notifications sent when conditions are triggered. Each notification type has a set of configuration parameters that must be set. These parameters may be set globally via Contact Defaults or individually via Contact Instances.
Contact Defaults - Some parameters are unlikely to change on a per-contact basis. You should set those parameters via the defaults mechanism.
God::Contacts::Email.defaults do |d|
d.from_email = 'god@example.com'
d.from_name = 'God'
d.delivery_method = :sendmail
end
Contact Instances - Each contact must have a unique name set. You may
optionally assign each contact to a group.
God.contact(:email) do |c|
c.name = 'tom'
c.group = 'developers'
c.to_email = 'tom@example.com'
end
God.contact(:email) do |c|
c.name = 'vanpelt'
c.group = 'developers'
c.to_email = 'vanpelt@example.com'
end
God.contact(:email) do |c|
c.name = 'kevin'
c.group = 'developers'
c.to_email = 'kevin@example.com'
end
Condition Attachment - To have a specific contact notified when a condition
is triggered, simply set the condition’s notify attribute to the name of the
individual contact.
w.transition(:up, :start) do |on|
on.condition(:process_exits) do |c|
c.notify = 'tom'
end
end
There are two ways to specify that a notification should be sent. The first,
easier way is shown above. Every condition can take an optional notify
attribute that specifies which contacts should be notified when the condition
is triggered. The value can be a contact name, a contact group, or an array
of contact names and/or contact groups.
w.transition(:up, :start) do |on|
on.condition(:process_exits) do |c|
c.notify = {:contacts => ['tom', 'developers'], :priority => 1, :category => 'product'}
end
end
The second way allows you to specify the priority and category in addition
to the contacts. The extra attributes can be arbitrary integers or strings and
will be passed as-is to the notification subsystem.
The above notification will arrive as an email similar to the following.
From: God <god@example.com>
To: tom <tom@example.com>
Subject: [god] mongrel-8600 [trigger] process exited (ProcessExits)
Message: mongrel-8600 [trigger] process exited (ProcessExits)
Host: candymountain.example.com
Priority: 1
Category: product
Available Notification Types
Email
Send a notice to an email address.
God::Contacts::Email.defaults do |d|
...
end
God.contact(:email) do |c|
...
end
to_email - The String email address to which the email will be sent.
to_name - The String name corresponding to the recipient.
from_email - The String email address from which the email will be sent.
from_name - The String name corresponding to the sender.
delivery_method - The Symbol delivery method. [ :smtp | :sendmail ]
(default: :smtp).
=== SMTP Options (when delivery_method = :smtp) ===
server_host - The String hostname of the SMTP server (default: localhost).
server_port - The Integer port of the SMTP server (default: 25).
server_auth - A Boolean or Symbol, false if no authentication else a symbol
for the type of authentication [false | :plain | :login | :cram_md5]
(default: false).
=== SMTP Auth Options (when server_auth = true) ===
server_domain - The String domain.
server_user - The String username.
server_password - The String password.
=== Sendmail Options (when delivery_method = :sendmail) ===
sendmail_path - The String path to the sendmail executable
(default: "/usr/sbin/sendmail").
sendmail_args - The String args to send to sendmail (default "-i -t").
Webhook
Send a notice to a webhook (https://www.webhooks.org).
God::Contacts::Webhook.defaults do |d|
...
end
God.contact(:webhook) do |c|
...
end
url - The String webhook URL.
format - The Symbol format [ :form | :json ] (default: :form).
Airbrake
Send a notice to airbrake (https://airbrake.io).
God::Contacts::Airbrake.defaults do |d|
...
end
God.contact(:airbrake) do |c|
...
end
apikey - The String API key.
Slack
Send a message to a channel in Slack (https://slack.com).
First, set up an Incoming Webhook in your Slack account.
Then, in your God configuration, set the defaults:
God::Contacts::Slack.defaults do |d|
  d.account = "foo"
  d.token = "abc123abc123abc123"
  d.notify_channel = true
  d.format = '%{host} alert: %{message}'
end
account is the name of your Slack account; if you view Slack at
"foo.slack.com", then your account is "foo". token is from your
newly-created webhook, and will be a string of unintelligible
characters.
The notify_channel and format settings are optional. The first
controls whether the message includes @channel (sending notifications
to everyone in the channel); the second controls how the message is
formatted. Acceptable values within the format are priority, host,
message, category, and time.
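The %{name} placeholders follow the syntax of Ruby’s named format references. As a sketch, and assuming the template is expanded with Ruby’s own format method, a message could be rendered like this. The render_notice helper and the sample values are hypothetical.

```ruby
# Expands %{name} placeholders from a hash of values.
# Hypothetical helper; assumes Ruby's Kernel#format is used for expansion.
def render_notice(template, values)
  format(template, values)
end

values = {
  priority: 1,
  host: 'candymountain.example.com',
  message: 'mongrel-8600 [trigger] process exited (ProcessExits)',
  category: 'product',
  time: Time.now
}

puts render_notice('%{host} alert: %{message}', values)
```

Values that the template does not mention are simply ignored, so the same hash can feed any format string built from the five names listed above.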
Once you’ve set the defaults, create contacts for the channels that you want to notify. You can create as many as you like, and they’ll look something like this:
God.contact(:slack) do |c|
  c.name = '#ops'
  c.channel = '#ops'
end
Advanced Configuration with Transitions and Events
So far you’ve been introduced to a simple poll-based config file and seen how
to run it. Poll-based monitoring works great for simple things, but falls
short for highly critical tasks. God has native support for kqueue/netlink
events on BSD/Darwin/Linux systems. For instance, instead of using the
process_running condition to poll for the status of your process, you can
use the process_exits condition that will be notified immediately upon the
exit of your process. This means less load on your system and shorter downtime
after a crash.
While the configuration syntax you saw in the previous example is very simple,
it lacks the power that we need to deal with event based monitoring. In fact,
the start_if and restart_if methods are really just calling out to a
lower-level API. If we use the low-level API directly, we can harness the full
power of god’s event based lifecycle system. Let’s look at another example
config file.
RAILS_ROOT = "/Users/tom/dev/gravatar2"
God.watch do |w|
  w.name = "local-3000"
  w.start = "mongrel_rails start -c #{RAILS_ROOT} -P #{RAILS_ROOT}/log/mongrel.pid -p 3000 -d"
  w.stop = "mongrel_rails stop -P #{RAILS_ROOT}/log/mongrel.pid"
  w.restart = "mongrel_rails restart -P #{RAILS_ROOT}/log/mongrel.pid"
  w.pid_file = File.join(RAILS_ROOT, "log/mongrel.pid")
  # clean pid files before start if necessary
  w.behavior(:clean_pid_file)
  # determine the state on startup
  w.transition(:init, { true => :up, false => :start }) do |on|
    on.condition(:process_running) do |c|
      c.running = true
    end
  end
  # determine when process has finished starting
  w.transition([:start, :restart], :up) do |on|
    on.condition(:process_running) do |c|
      c.running = true
    end
    # failsafe
    on.condition(:tries) do |c|
      c.times = 5
      c.transition = :start
    end
  end
  # start if process is not running
  w.transition(:up, :start) do |on|
    on.condition(:process_exits)
  end
  # restart if memory or cpu is too high
  w.transition(:up, :restart) do |on|
    on.condition(:memory_usage) do |c|
      c.interval = 20
      c.above = 50.megabytes
      c.times = [3, 5]
    end
    on.condition(:cpu_usage) do |c|
      c.interval = 10
      c.above = 10.percent
      c.times = [3, 5]
    end
  end
  # lifecycle
  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end
A bit longer, I know, but very straightforward once you understand how the
transition calls work. The name, interval, start, stop, and pid_file
attributes should be familiar. We also specify the clean_pid_file behavior.
Before jumping into the code, it’s important to understand the different
states that a Watch can have, and how that state changes over time. At any
given time, a Watch will be in one of the init, up, start, or restart
states. As different conditions are satisfied, the Watch will progress from
state to state, enabling and disabling conditions along the way.
When god first starts, each Watch is placed in the init state.
You’ll use the transition method to tell god how to transition between
states. It takes two arguments. The first argument may be either a symbol or
an array of symbols representing the state or states during which the
specified conditions should be enabled. The second argument may be either a
symbol or a hash. If it is a symbol, then that is the state that will be
transitioned to if any of the conditions return true. If it is a hash, then
that hash must have both true and false keys, each of which points to a
symbol that represents the state to transition to given the corresponding
return from the single condition that must be specified.
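The two forms of the second argument can be modelled in a few lines of plain Ruby. This is a toy sketch of the rule described above, not god’s implementation; next_state is a hypothetical helper name.

```ruby
# Toy model of how the second argument to transition picks the next state.
# next_state is a hypothetical helper, not part of god's API.
def next_state(destination, result)
  case destination
  when Symbol
    # Symbol form: move to that state only when a condition returns true.
    result ? destination : nil
  when Hash
    # Hash form: the single condition's return value selects the state.
    destination.fetch(result)
  end
end

p next_state(:up, true)   # symbol form, condition returned true
p next_state(:up, false)  # symbol form, condition returned false: no move
p next_state({ true => :up, false => :start }, false) # hash form
```

In this model a nil result means "stay in the current state", which is why the symbol form suits conditions that only ever signal in one direction.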
# determine the state on startup
w.transition(:init, { true => :up, false => :start }) do |on|
  on.condition(:process_running) do |c|
    c.running = true
  end
end
The first transition block tells god what to do when the Watch is in the
init state (first argument). This is where I tell god how to determine if my
task is already running. Since I’m monitoring a process, I can use the
process_running condition to determine whether the process is running. If
the process is running, it will return true, otherwise it will return false.
Since I sent a hash as the second argument to transition, the return from
process_running will determine which of the two states will be transitioned
to. If the process is running, the return is true and god will put the Watch
into the up state. If the process is not running, the return is false and
god will put the Watch into the start state.
# determine when process has finished starting
w.transition([:start, :restart], :up) do |on|
  on.condition(:process_running) do |c|
    c.running = true
  end
  ...
end
If god has determined that my process isn’t running, the Watch will be put
into the start state. Upon entering this state, the start command that I
specified on the Watch will be called. In addition, the above transition
specifies a condition that should be enabled when in either the start or
restart states. The condition is another process_running, however this
time I’m only interested in moving to another state once it returns true. A
true return from this condition means that the process is running and it’s
ok to transition to the up state (second argument to transition).
# determine when process has finished starting
w.transition([:start, :restart], :up) do |on|
  ...
  # failsafe
  on.condition(:tries) do |c|
    c.times = 5
    c.transition = :start
  end
end
The other half of this transition uses the tries condition to ensure that
god doesn’t get stuck in this state. It’s possible that the process could go
down while the transition is being made, in which case god would end up
polling forever to see if the process is up. Here I’ve specified that if this
condition is called five times, god should override the normal transition
destination and move to the start state instead. If you specify a
transition attribute on any condition, that state will be transferred to
instead of the normal transfer destination.
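The failsafe can be pictured with a small stand-alone model. The ToyTries class and destination_for helper below are illustrative names only, not god’s code; they just mirror the two rules in the paragraph above (fire after a fixed number of calls, and let a condition’s own transition attribute win).

```ruby
# Toy model of the tries failsafe; these names are illustrative, not god's code.
class ToyTries
  attr_reader :transition

  def initialize(times:, transition: nil)
    @times = times
    @transition = transition
    @calls = 0
  end

  # Returns true once the condition has been checked `times` times.
  def test
    @calls += 1
    @calls >= @times
  end
end

# A condition's own transition attribute, if set, overrides the default.
def destination_for(condition, default)
  condition.transition || default
end

tries = ToyTries.new(times: 5, transition: :start)
results = Array.new(5) { tries.test }
p results                      # one entry per poll
p destination_for(tries, :up)  # destination used when tries fires
```

Without the transition attribute, destination_for falls back to the block’s normal destination, which is the behavior every other condition in the example relies on.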
# start if process is not running
w.transition(:up, :start) do |on|
  on.condition(:process_exits)
end
This is where the event based system comes into play. Once in the up state,
I want to be notified when my process exits. The process_exits condition
registers a callback that will trigger a transition change when it is fired
off. Event conditions (like this one) cannot be used in transitions that have
a hash for the second argument (as they do not return true or false).
# restart if memory or cpu is too high
w.transition(:up, :restart) do |on|
  on.condition(:memory_usage) do |c|
    c.interval = 20
    c.above = 50.megabytes
    c.times = [3, 5]
  end
  on.condition(:cpu_usage) do |c|
    c.interval = 10
    c.above = 10.percent
    c.times = [3, 5]
  end
end
Notice that I can have multiple transitions with the same start state. In this
case, I want to have the memory_usage and cpu_usage poll conditions going
at the same time that I listen for the process exit event. In the case of
runaway CPU or memory usage, however, I want to transition to the restart
state. When a Watch enters the restart state it will either call the
restart command that you specified, or if none has been set, call the stop
and then start commands.
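The restart fallback described above amounts to a simple rule, sketched here in plain Ruby. restart_commands and the sample command strings are hypothetical; the sketch only models which commands would be chosen, not how god runs them.

```ruby
# Toy model of the restart rule; restart_commands is a hypothetical helper.
def restart_commands(watch)
  # Use the explicit restart command when one was configured...
  return [watch[:restart]] if watch[:restart]

  # ...otherwise fall back to stop followed by start.
  [watch[:stop], watch[:start]]
end

with_restart = { start: 'svc start', stop: 'svc stop', restart: 'svc restart' }
without_restart = { start: 'svc start', stop: 'svc stop' }

p restart_commands(with_restart)
p restart_commands(without_restart)
```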
Extend God with your own Conditions
God was designed from the start to allow you to easily write your own custom conditions, making it simple to add tests that are application specific.
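As a rough sketch of what a custom poll condition might look like, the fragment below defines a hypothetical MarkerFileExists condition. It assumes that conditions live in the God::Conditions module, subclass PollCondition, and report their result from a test method; check the conditions that ship with god for the authoritative pattern.

```ruby
# Fragment for a god config file (requires god to be loaded).
# MarkerFileExists and its path attribute are hypothetical.
module God
  module Conditions
    class MarkerFileExists < PollCondition
      attr_accessor :path

      # Poll conditions report their result as true or false.
      def test
        File.exist?(path)
      end
    end
  end
end
```

Assuming god maps the condition symbol to the class name as it does for its built-in conditions, a transition block could then refer to this condition as on.condition(:marker_file_exists) and set c.path and c.interval inside the block.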
Contribute
If you’d like to hack on god itself or contribute fixes or new functionality, read this section.
The codebase can be found at https://github.com/mishina2228/resurrected_god. To get started, fork god on GitHub into your own account and then pull that down to your local machine. This way you can easily submit changes via Pull Requests later on.
$ git clone git@github.com:yourusername/resurrected_god
We recommend using rbenv and ruby-build to manage multiple versions of Ruby and their separate gemsets. Any changes to god must work on Ruby 2.6+.
God uses bundler to deal with development dependencies. Once you have the code locally, you can pull in all the dependencies like so:
$ cd resurrected_god
$ bundle install
In order for process events to function during development you’ll need to compile the C extensions:
$ cd ext/god
$ ruby extconf.rb
$ make
$ cd ../..
Now you’re ready to run the tests and make sure everything is configured properly. On Linux you’ll need to run the tests as root in order for the events system to load. On macOS there is no need to run the tests as root.
$ [sudo] bundle exec rake
To run your development god to make sure config files and such still work properly, just run:
$ [sudo] bundle exec god -c myconfig.god -D
There are a bunch of example config files for various scenarios in
test/configs that you can try out. For big new features, it’s great to add a
new test config showing off the usage of the feature.
If you intend to contribute your changes back to god core, make sure you create a new branch and do your work there. Then, when your changes are ready to be shared with the world, push them to your fork and issue a Pull Request against mishina2228/resurrected_god. Make sure to describe your changes in detail and add relevant tests.
Any feature additions or changes should be accompanied by corresponding updates
to the documentation. It can be found in the doc directory. The
documentation is done in AsciiDoc format and then converted into the public
site at https://mishina2228.github.io/resurrected_god. To see the
generated site locally you’ll first need to commit your changes to git and then
issue the following:
$ bundle exec rake site
This will open the site in your browser so you can check for correctness.