We shouldn’t ever rely on the luxury of hoping for the best when it comes to production application uptime. There are a series of steps we can take to prevent and minimize fatal exceptions, while maximizing uptime and improving monitoring.
Log exceptions with their full stack traces
This one is obvious enough. You have to watch out for exceptions getting lost in the sea of asynchronous code. This basically means figuring out whether a piece of code throw
s exceptions or uses the callback convention.
// the usual, sync way
try{
syncOperation();
}catch(e){
// rethrow, or handle it
}
// the async way
asyncOperation(function(err, data){
if(err){
// bubble the exception up, or handle it
return;
}
});
Keep in mind another important aspect of exception logging, is doing so in a persistant way. That is, use a database storage for your logging purposes. The console
is just fine for development environments, but you probably want something more robust in production.
If you are, however, hosting on a platform such as heroku, where console
output is persisted, then you can opt not to log to a database yourself.
Popular logging options for Node include winston and bunyan. Both support various logging adapters.
Log uncaughtException
, but then exit
When an exception is not handled anywhere else, it will be emitted on the process
's 'uncaughtException'
event. We can listen for this event to do some logging, but we should allow the process to shut down gracefully.
process.on('uncaughtException', function(err){
// log the error
process.exit(1);
});
Using Domains
Node has recently released the Domain API, which provides a context in which we can deal with uncaught exceptions.
It’s hard to put it any better than what the Node Docs have to offer:
Domains provide a way to handle multiple different IO operations as a single group. If any of the event emitters or callbacks registered to a domain emit an
error
event, or throw an error, then the domain object will be notified, rather than losing the context of the error in theprocess.on('uncaughtException')
handler, or causing the program to exit immediately with an error code.
Failover Clusters
The Cluster API was published alongside domain
. Clusters allow us to use several processes, taking advantage of multi-core systems. The usefulness of clusters lies in the ability to listen to the same port using several processes. This is provided by the API itself.
Here is a very unpolished example HTTP server, using clusters.
var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;
if (cluster.isMaster) {
// Fork workers.
for (var i = 0; i < numCPUs; i++) {
cluster.fork();
}
cluster.on('exit', function(worker, code, signal) {
console.log('worker ' + worker.process.pid + ' died');
});
} else {
// Workers can share any TCP connection
// In this case its a HTTP server
http.createServer(function(req, res) {
res.writeHead(200);
res.end("hello world\n");
}).listen(8000);
}
Obviously we’d like to separate this into two files, the master cluster and the forks. A nice improvement over this design might be forking the cluster whenever a worker crashes. We would simply need to add cluster.fork()
to the exit
listener.
Node Docs explain:
Because workers are all separate processes, they can be killed or re-spawned depending on your program’s needs, without affecting other workers. As long as there are some workers still alive, the server will continue to accept connections. Node does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application’s needs.
I highly recommend skimming (at the very least) through both domain and cluster documentation pages, as they are really short and extremely valuable for whoever’s interested in keeping their servers uptime at a respectable level.
The Last Stand, Uptime Monitoring
Ultimately, even the master cluster can fail. As a last resort, we might set up a process that monitors our application’s port. We can determine a finite number of states our application and port might be in.
- Server completely shut down. No port listener
- Server preparing to listen. No port listener
- Server listening
Armed with this knowledge, we could assert whether our application is down, starting, or up. We might set up a monitoring process, which would run in parallel with our server process(es).
I created an npm
package explicitly to deal with this kind of scenario. The process-finder package helps us find processes listening on a port, and even more handily, it lets us watch the port for changes!
Here’s a tentative monitor.js
application.
var finder = require('process-finder');
var port = 3000; // port to watch
var watcher = finder.watch(port);
var runner = require('./runner.js');
watcher.on('error', console.error);
watcher.on('listen', function(pid){
console.log('Cluster Up!', pid);
});
watcher.on('unlisten', function(pid){
console.log('Cluster Down!', pid);
runner.start();
});
runner.start();
Where runner.js
would simply spawn a new server. The spawned process would eventually listen in the application’s port, and it might even use a cluster, as we previously discussed, to improve its resilience.
We got this far in our efforts to reduce server downtime, we should go the extra mile here. Most definitely, we would benefit from having our logger send us an email whenever a worker dies, or at the very least, whenever the monitor
has to restart our cluster
because a previous cluster went deaf.
Performance Analytics
Monitoring your application is great, but you’d probably like charts with that, too. You can use a tool such as NodeFly, or Nodetime for this purpose.
These solutions allow you to track CPU usage, server load, database load, perform memory profiling, and more. Make sure to check them out. They also allow you to set up alerts when certain thresholds are surpassed.
Nodetime
Nodetime’s documentation explains:
It is important to be notified when an application is experiencing performance problems in order to prevent downtime and be able to quickly locate the problem’s root cause, while profiling exact problem symptoms, which might disappear later. Nodetime allows users to create threshold and anomaly alerts for many internal metrics of the application - for example, if HTTP response time is continuously high or there are too few requests. It is also possible to set alerts on API call metrics of different supported libraries, such as MongoDB, Redis, and MySQL.
Both solutions are trivially easy to set up.
For Nodetime, all you need to do is the following:
Sign Up
Sign up with them to get an API key.
Install
Install their npm
module.
$ npm install nodetime --save
Setup
Load and configure the module using the API key linked to your account.
require('nodetime').profile({
accountKey: 'your_account_key',
appName: 'your_application_name'
});
And, that’s it!
Comments