
Pony Foo

Ramblings of a degenerate coder

Monitoring Production Grade Node Applications


Catching, or even tracing, errors in Node applications can be tricky. I wanted to share a few practices that can help you keep your application up in most scenarios.

We shouldn't ever rely on the luxury of hoping for the best when it comes to production application uptime. There are a series of steps we can take to prevent and minimize fatal exceptions, while maximizing uptime and improving monitoring.

Log exceptions with their full stack traces

This one is obvious enough. You have to watch out for exceptions getting lost in the sea of asynchronous code. This boils down to figuring out whether a piece of code throws exceptions synchronously or follows the error-first callback convention.

// the usual, sync way
try {
    syncOperation();
} catch (e) {
    // rethrow, or handle it
}

// the async way
asyncOperation(function(err, data){
    if (err) {
        // bubble the exception up, or handle it
        return;
    }
    // use data
});

Keep in mind another important aspect of exception logging: doing it in a persistent way. That is, use database storage for your logging purposes. The console is just fine for development environments, but you probably want something more robust in production.


If you are, however, hosting on a platform such as Heroku, where console output is persisted, then you can opt not to log to a database yourself.

Popular logging options for Node include winston and bunyan. Both support various logging adapters.

Log uncaughtException, but then exit

When an exception is not handled anywhere else, it will be emitted on the process's 'uncaughtException' event. We can listen for this event to do some logging, but we should allow the process to shut down gracefully.

process.on('uncaughtException', function(err){
    // log the error
    process.exit(1);    
});

Using Domains

Node has recently released the Domain API, which provides a context in which we can deal with uncaught exceptions.

It's hard to put it any better than what the Node Docs have to offer:

Domains provide a way to handle multiple different IO operations as a single group. If any of the event emitters or callbacks registered to a domain emit an error event, or throw an error, then the domain object will be notified, rather than losing the context of the error in the process.on('uncaughtException') handler, or causing the program to exit immediately with an error code.

Failover Clusters

cluster.jpg

The Cluster API was published alongside domain. Clusters allow us to use several processes, taking advantage of multi-core systems. The usefulness of clusters lies in the ability to listen to the same port using several processes. This is provided by the API itself.

Here is a very unpolished example HTTP server, using clusters.

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork workers.
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', function(worker, code, signal) {
    console.log('worker ' + worker.process.pid + ' died');
  });
} else {
  // Workers can share any TCP connection
  // In this case it's an HTTP server
  http.createServer(function(req, res) {
    res.writeHead(200);
    res.end("hello world\n");
  }).listen(8000);
}

Obviously we'd like to separate this into two files, the master cluster and the forks. A nice improvement over this design might be forking a new worker whenever one crashes. We would simply need to add cluster.fork() to the exit listener.

Node Docs explain:

Because workers are all separate processes, they can be killed or re-spawned depending on your program's needs, without affecting other workers. As long as there are some workers still alive, the server will continue to accept connections. Node does not automatically manage the number of workers for you, however. It is your responsibility to manage the worker pool for your application's needs.

I highly recommend at the very least skimming through both the domain and cluster documentation pages, as they are really short and extremely valuable for whoever's interested in keeping their server's uptime at a respectable level.

The Last Stand, Uptime Monitoring

Ultimately, even the master cluster can fail. As a last resort, we might set up a process that monitors our application's port. We can determine a finite number of states our application and port might be in.

  • Server completely shut down. No port listener
  • Server preparing to listen. No port listener
  • Server listening

Armed with this knowledge, we could assert whether our application is down, starting, or up. We might set up a monitoring process, which would run in parallel with our server process(es).

I created an npm package explicitly to deal with this kind of scenario. The process-finder package helps us find processes listening on a port, and even more handily, it lets us watch the port for changes!

Here's a tentative monitor.js application.

var finder = require('process-finder');
var port = 3000; // port to watch
var watcher = finder.watch(port);
var runner = require('./runner.js');

watcher.on('error', console.error);
watcher.on('listen', function(pid){
    console.log('Cluster Up!', pid);
});
watcher.on('unlisten', function(pid){
    console.log('Cluster Down!', pid);
    runner.start();
});

runner.start();

Where runner.js would simply spawn a new server. The spawned process would eventually listen on the application's port, and it might even use a cluster, as we previously discussed, to improve its resilience.

Having come this far in our efforts to reduce server downtime, we should go the extra mile. Most definitely, we would benefit from having our logger send us an email whenever a worker dies, or at the very least, whenever the monitor has to restart our cluster because a previous cluster went deaf.

Performance Analytics

Monitoring your application is great, but you'd probably like charts with that, too. You can use a tool such as NodeFly or Nodetime for this purpose.

These solutions allow you to track CPU usage, server load, database load, perform memory profiling, and more. Make sure to check them out. They also allow you to set up alerts when certain thresholds are surpassed.

Nodetime

Nodetime's documentation explains:

It is important to be notified when an application is experiencing performance problems in order to prevent downtime and be able to quickly locate the problem's root cause, while profiling exact problem symptoms, which might disappear later. Nodetime allows users to create threshold and anomaly alerts for many internal metrics of the application - for example, if HTTP response time is continuously high or there are too few requests. It is also possible to set alerts on API call metrics of different supported libraries, such as MongoDB, Redis, and MySQL.

Both solutions are trivially easy to set up.

For Nodetime, all you need to do is the following:

Sign Up

Sign up with them to get an API key.

Install

Install their npm module.

$ npm install nodetime --save

Setup

Load and configure the module using the API key linked to your account.

require('nodetime').profile({
    accountKey: 'your_account_key', 
    appName: 'your_application_name'
});

And, that's it!

Comments (4)

Yair Even Or

Very interesting stuff! I am new to Node.js and the server world, but old to JavaScript and the client-side world, which is my home. I am currently building quite a complex game in Node using Socket.io, and basically all I do is write normal JavaScript using the very simple Socket.io API, but the thing is, my server keeps crashing on every JS error. From your article I get the hint that there is a solution, but I just can't find any real-life example that could help me. I don't know much, if anything, about Node itself, but I really don't want my server to crash on every error some user runs into. Can you maybe post on GitHub or YouTube or somewhere a real-life test case to handle such things? Would be super helpful. Thanks!

Nicolas Bevacqua

It depends on the kind of errors you want to be handling. If you are using Express, then you can add a special error handling middleware:

app.use(function(err, req, res, next){
    // handle the error and end the response
    res.statusCode = 500;
    res.end('oops!');
});

You should add that middleware last, and in any given route, you can call next(err), which will invoke this special error handler.

That would do nothing to help you catch errors, though. Just report them, basically.

If you want to catch errors like you describe, you should use JSHint or a similar linting tool, maybe even using a build process.

The ultimate bug catcher, though, is having a comprehensive unit test suite that tests your code as thoroughly as possible. Throw in a few integration tests, and you'll have a pretty well covered codebase that, paired with what I talked about in this article, will be pretty resilient, even when an exception does throw your stack off balance.

Yair Even Or

I prefer not to waste time writing tests, and rather just write my code as best I can. This will eventually save me time, because writing these tests would take me weeks, and in less than half that time I can make my code bullet-proof against errors anyway. About that middleware, I didn't quite understand why I need it; if the server crashes, I already see the problem in the console when, let's say, node app.js is running. It prints it anyway and then just... dies.

Nicolas Bevacqua

There is no such thing as bullet-proof code. Tests are the way to have some more certainty that your app won't come down crashing like a house of cards if a little wind blows its way.

Writing tests will eventually save you time, contrary to your belief.

It's all a matter of scope and objectives, though. You are probably fine not writing tests for your pet project, but if you expect your application to be resilient and not so error-prone, then you should at least consider writing some tests.

While it's true that writing tests might take you more time than not writing any, consider this: once you write a unit test, it's in your test suite forever, and you can run it as many times as you want. If you don't write a test, you risk running into the same issue several times, and you might not even notice, since you have no way to track it.
