ponyfoo.com

Two-way Synchronization for a Web App and Git

For a long while I wanted to implement a feature where people reading an article on Pony Foo could click on an “Improve this article” button whenever they spotted an error, submit a PR, and help us promptly fix the article. Features like this are usually built on git, but the issue was that Pony Foo had no understanding of git whatsoever.

This article explains the challenges I went through and how I ended up implementing a two-way synchronization between the web application and a git repository for the articles.

Several challenges had to be sorted out before the implementation, which didn’t take that long once I settled on an approach.

  1. A script that converted any database article into its file system representation was necessary so that I could use that representation when talking to git.

  2. The files need to be straightforward.

    • A huge JSON file doesn’t make a lot of sense for Markdown articles, and a single Markdown file might prove confusing due to the amount of fields present on a Pony Foo article.
    • The ability to see the rendered article on GitHub is pretty important, but the problem is that the Markdown in Pony Foo has diverged quite a bit from GitHub’s Markdown. At the same time, articles rely pretty heavily on domain-relative links (such as /subscribe) and expecting the git user to enjoy an article by looking at half a dozen different files didn’t seem reasonable.
    • Not every field needs to be on git. The git representation can act as a mirror of the database, which is the single source of truth. Which fields should be on git?
  3. The repository needs to be provisioned with all pre-existing articles. I needed a script that went through every published article and converted it to files in a git repository.

  4. Drafts need to be excluded from the repository. They haven’t been published on the site yet, and as such providing them in the open in an unpolished form isn’t the best idea. Since drafts aren’t available on git, we don’t need to concern ourselves with publishing or handling new articles being created directly through git.

  5. Whenever an article is updated on the website, we need to update its files and push to git.

  6. Whenever an article is updated on git, we need to update the web version.

Let’s follow the logical progression. We’ll start with provisioning the repository, and then look at how we can keep it up to date whenever an article is updated on the site. Once the repository is kept up to date with changes to the website, we can look at reacting to updates made against git.

Shall we?

Provisioning the Repository

The first order of business was to figure out the file structure. I settled on a pattern where I’d have a standalone articles repository. It would include licensing and contributing information at the top level, along with some dotfiles and other commonplace open-source files. I didn’t want a deep folder structure for the articles, so I decided on one folder per year, and inside it one folder per article published that year.

Here is the folder structure for the first five articles this year. I used a ${ month }-${ day }--${ article.slug } pattern so that they would be properly sorted but also have a meaningful description.

2016/01-11--asynchronous-i-o-with-generators-and-promises/
2016/01-21--controversial-state-of-javascript-tooling/
2016/01-30--es2016-features-and-ecmascript-as-a-living-standard/
2016/02-02--understanding-javascript-async-await/
2016/02-09--ecmascript-string-padding/
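The naming pattern above can be sketched as a small helper. This is only an illustration: `getArticleDirectory` and the `publication` field are hypothetical names, and I'm assuming the date is interpreted in UTC.

```javascript
// Builds the repository-relative folder for an article, e.g.
// "2016/02-09--ecmascript-string-padding".
// `article.publication` is a hypothetical field holding the publication date.
function getArticleDirectory (article) {
  const date = new Date(article.publication);
  const pad = n => String(n).padStart(2, '0'); // zero-pad so folders sort correctly
  const year = date.getUTCFullYear();
  const month = pad(date.getUTCMonth() + 1); // getUTCMonth is zero-based
  const day = pad(date.getUTCDate());
  return `${ year }/${ month }-${ day }--${ article.slug }`;
}
```

Zero-padding the month and day is what keeps a plain alphabetical listing in chronological order.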

Each article folder contains a metadata.json file, which looks like the piece of JSON seen below. The id and author fields are used to uniquely identify the article, and changes made to them in git are ignored. The article’s title, slug, tags, and hero image can be updated on GitHub.

{
  "id": "57783d1df2a76b840314377d",
  "author": "543d222f4683586910034197",
  "title": "<div>How Pony Foo is ridiculously over-engineered</div><div><em>— and why that is awesome</em></div>",
  "slug": "most-over-engineered-blog-ever",
  "tags": [
    "side-projects",
    "ponyfoo"
  ],
  "heroImage": "https://i.imgur.com/IF2aFsB.jpg"
}
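One way to enforce the rule that id and author are never updated from git is to copy over only a whitelist of fields. The sketch below is a hypothetical helper rather than the code running on Pony Foo. It assumes the article is a plain object.

```javascript
// Fields that may be changed through git. `id` and `author` are deliberately absent,
// so edits to them in metadata.json are ignored.
const editableFields = ['title', 'slug', 'tags', 'heroImage'];

// Returns a copy of `article` with the editable fields taken from `metadata`.
function applyMetadata (article, metadata) {
  const updated = Object.assign({}, article);
  editableFields.forEach(field => {
    if (Object.prototype.hasOwnProperty.call(metadata, field)) {
      updated[field] = metadata[field];
    }
  });
  return updated;
}
```

Fields missing from metadata.json are left as they were, which helps avoid erasing information by accident.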

There’s a single file for each different piece of Markdown:

  • One file for the teaser below the title
  • One file for the introduction to the article, used when sending out an email about a new article
  • One file for the body of the article, containing everything else in the article
  • One file for the article summary and another for the notes from the editor, both of which are optional

Lastly, we also have a rendered readme.markdown preview, which is the product of compiling all Markdown fields into HTML with a header that includes the title, tags, summary, and a warning about the readme file being autogenerated (and thus read-only). We say the readme is in .markdown format even though it’s spewed as raw HTML, so that GitHub renders the previews when humans visit one of the folders on the repository.

All of these files are written to disk for a given article by a single updateSyncRoot(article, done) function. Calling that function for every published article on the site is all it takes to provision the repository.
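To give a rough idea of what writing an article to disk involves, here is a minimal sketch. It is not the actual updateSyncRoot: it returns a promise instead of taking a callback, it skips the readme preview, and the file names and article field names (`teaser`, `introduction`, `body`, `summary`, `editorNote`) are assumptions made for illustration.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical mapping from article fields to file names.
const markdownFiles = {
  teaser: 'teaser.markdown',
  introduction: 'introduction.markdown',
  body: 'body.markdown',
  summary: 'summary.markdown',        // optional
  editorNote: 'editor-notes.markdown' // optional
};

// Writes metadata.json plus one file per Markdown field into `root/directory`.
async function writeArticleFiles (root, directory, article) {
  const target = path.join(root, directory);
  await fs.promises.mkdir(target, { recursive: true });

  const metadata = {
    id: article.id,
    author: article.author,
    title: article.title,
    slug: article.slug,
    tags: article.tags,
    heroImage: article.heroImage
  };
  const json = JSON.stringify(metadata, null, 2) + '\n';
  await fs.promises.writeFile(path.join(target, 'metadata.json'), json);

  for (const [field, filename] of Object.entries(markdownFiles)) {
    const file = path.join(target, filename);
    if (typeof article[field] === 'string' && article[field].length > 0) {
      await fs.promises.writeFile(file, article[field]);
    } else {
      // An empty or missing field shouldn't leave a stale file behind.
      await fs.promises.rm(file, { force: true });
    }
  }
  return target;
}
```

Provisioning then amounts to looping over every published article and calling this function once per article.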

So far we have a directory tree with a bunch of folders, Markdown and JSON files. For local development, I cloned the git repository and then called updateSyncRoot for every article, provisioning the repository.

An important note was to use a branch other than master, for SEO purposes, as GitHub indexes master by default, but I’d rather not have search engines crawl GitHub for entire copies of every article on the website. I use a noindex branch name to communicate that intent clearly, and made that the default branch on GitHub.

At this point, I manually commit and push to GitHub every single article, in file format. The rendered preview of an article on GitHub is shown below.

An article on GitHub

Next up, we need to push updates to the repository.

Keeping the Repository Up To Date

This step is a bit more challenging, as it involves several automated git commands. I used a mongoose post-save hook so that whenever an article is saved, for both inserts and updates, a function gets called. We use that hook to call a pushToGit({ article, oldSlug }, done) function, passing in the article and the slug it had before the article was saved.

In pushToGit we start by running git pull on the repository. I used the simple-git package to run git commands from node. After pulling, we use the updateSyncRoot(article, done) function to update the file representation of our article. Then, if the oldSlug is different from the new one, we remove the files in the old slug’s directory. Finally, we run git add, git commit, and git push on the changed files.
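The sequence of steps in pushToGit can be sketched as follows. This is a simplified, promise-based illustration rather than the real function. The `git` argument is assumed to be shaped like a simple-git instance (its `pull`, `raw`, `status`, `commit`, and `push` methods), while `writeFiles` and `removeDirectory` are hypothetical helpers supplied by the caller. The early return on a clean working tree is my own addition.

```javascript
// Mirrors a saved article onto the git repository.
// Resolves to true when a commit was pushed, false when there was nothing to commit.
async function pushArticle ({ article, oldSlug }, { git, writeFiles, removeDirectory }) {
  await git.pull();          // start from the latest remote state
  await writeFiles(article); // refresh the file representation

  if (oldSlug && oldSlug !== article.slug) {
    await removeDirectory(oldSlug); // the article moved, drop its old folder
  }

  await git.raw(['add', '--all']); // stage additions, changes, and deletions
  const status = await git.status();
  if (status.isClean()) {
    return false; // files already match the article, so skip commit and push
  }

  await git.commit(`Update "${ article.slug }"`);
  await git.push();
  return true;
}
```

Passing the git client and helpers in as arguments keeps the flow easy to exercise with a fake client, without touching a real repository.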

Whenever an article is deleted, in a similar fashion to our update hook, we remove its files from the repository. We do this by pulling first, then running git rm for the related files, committing, and pushing.

These two actions ensure that any updates to the articles are mirrored on the git repository. The last piece of the equation is to handle git push events.

Handling Updates Pushed To Git

We can register a webhook on GitHub so that whenever new commits are pushed onto our repository, our web app gets a notification. I chose the /api/git-hooks/articles endpoint and a secret code. GitHub uses the secret to sign the payloads it sends to /api/git-hooks/articles, so that when we receive a request on that endpoint, we can verify that it came from GitHub.
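To illustrate what the secret buys us, here is a minimal sketch of checking the signature by hand with node's crypto module. It assumes the X-Hub-Signature header format, which is `sha1=` followed by the hex HMAC-SHA1 of the raw request body. The function name is hypothetical, and in practice a webhook library would normally take care of this check for you.

```javascript
const crypto = require('crypto');

// `rawBody` must be the exact bytes GitHub sent, before any JSON parsing.
// `signatureHeader` is the value of the X-Hub-Signature header, e.g. "sha1=ab12...".
function isValidSignature (secret, rawBody, signatureHeader) {
  if (typeof signatureHeader !== 'string') {
    return false;
  }
  const digest = crypto.createHmac('sha1', secret).update(rawBody).digest('hex');
  const expected = Buffer.from('sha1=' + digest);
  const received = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}
```

The constant-time comparison avoids leaking, through response timing, how many leading characters of a forged signature were correct.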

On the web app, I use github-webhook-handler to receive the event. The articleGit service is where I have the git-related functions I mentioned earlier, and the env module contains all secrets used by the application. I made a small configure helper function which takes care of creating the webhook handler middleware for an Express app, using a key as part of the endpoint, and any number of ...events the application accepts and knows how to process.

const winston = require('winston');
const createHandler = require('github-webhook-handler');
const env = require('./env');
const articleGitService = require('../services/articleGit');
const secret = env('X_HUB_SECRET');

function configure (app, key, ...events) {
  const path = `/api/git-hooks/${key}`;
  const handler = createHandler({ path, secret, events });
  app.use(handler);
  handler.on('error', err => {
    winston.warn('Error in GitHub hook handler', err.stack || err);
  });
  return handler;
}

When we receive a push event from the articles webhook on GitHub, we’ll invoke articleGitService.pullFromGit.

function webhooks (app) {
  configure(app, 'articles', 'push').on('push', event => {
    articleGitService.pullFromGit(event);
  });
}

module.exports = webhooks;

The service receives an event.payload such as the following. We’ll only leverage a handful of its fields, namely the commits collection and the removed and modified lists on each commit.

{
  "ref": "refs/heads/changes",
  "before": "9049f1265b7d61be4a8904a9a27120d2064dab3b",
  "after": "0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
  "created": false,
  "deleted": false,
  "forced": false,
  "base_ref": null,
  "compare": "https://github.com/baxterthehacker/public-repo/compare/9049f1265b7d...0d1a26e67d8f",
  "commits": [
    {
      "id": "0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
      "tree_id": "f9d2a07e9488b91af2641b26b9407fe22a451433",
      "distinct": true,
      "message": "Update README.md",
      "timestamp": "2015-05-05T19:40:15-04:00",
      "url": "https://github.com/baxterthehacker/public-repo/commit/0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
      "author": {
        "name": "baxterthehacker",
        "email": "baxterthehacker@users.noreply.github.com",
        "username": "baxterthehacker"
      },
      "committer": {
        "name": "baxterthehacker",
        "email": "baxterthehacker@users.noreply.github.com",
        "username": "baxterthehacker"
      },
      "added": [

      ],
      "removed": [

      ],
      "modified": [
        "README.md"
      ]
    }
  ],
  "head_commit": {
    // ...
  },
  "repository": {
    // ...
  },
  "pusher": {
    // ...
  },
  "sender": {
    // ...
  }
}

Example extracted from GitHub Help pages

There are quite a few things that could have happened due to git push. Let’s look at those.

  • Changes could be completely unrelated to any articles, such as when we update the license for the repository
  • Changes could have updated a piece or pieces of information for an article
  • Changes could have removed an article entirely

I decided to go for a naïve but realistic implementation, where we’ll look at the commits collection. We .reduce the commits once for removed files and again for modified files. We’ll interpret deleted metadata.json files as the article for that metadata.json having been deleted. We’ll interpret any modified files as the article having changed.
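A minimal sketch of that reduction is shown below. It assumes the folder layout described earlier, and the names `collectChanges` and `toDirectory` are made up for this example. Dropping modified articles that were also deleted in the same push is an assumption on my part.

```javascript
// Matches the article folder at the start of a path,
// e.g. "2016/02-09--ecmascript-string-padding".
const articleDirectory = /^\d{4}\/\d{2}-\d{2}--[^/]+/;

// Maps a file path to its article folder, or null for unrelated files.
function toDirectory (file) {
  const match = articleDirectory.exec(file);
  return match ? match[0] : null;
}

// Reduces the commits in a push payload into removed and modified article folders.
function collectChanges (commits) {
  const removed = commits.reduce((set, commit) => {
    (commit.removed || [])
      .filter(file => file.endsWith('/metadata.json')) // deleted metadata means deleted article
      .map(toDirectory)
      .filter(Boolean)
      .forEach(directory => set.add(directory));
    return set;
  }, new Set());

  const modified = commits.reduce((set, commit) => {
    (commit.modified || [])
      .map(toDirectory)
      .filter(Boolean) // ignore files outside article folders, such as the license
      .forEach(directory => set.add(directory));
    return set;
  }, new Set());

  return {
    removed: [...removed],
    modified: [...modified].filter(directory => !removed.has(directory))
  };
}
```

Using sets means an article touched by several commits in the same push is only processed once.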

Next up, we’ll remove from the database any articles that were deleted on git. This triggers the “remove from git” hook, but the article was already removed from git and, since deletion is idempotent, all is well.

After that, we git pull the changes in the repository into our local clone, read all the files related to a modified article, and update its database representation, being careful not to erase any important information. When the article gets saved, the post-save hook gets triggered, which calls updateSyncRoot and attempts to push to git. However, given that the file system representation is already equivalent to the article after saving it, there is nothing to commit.

Certainly this is all a tad brittle, but it has worked quite well thus far without many surprises. A nice aspect of this setup is that we get versioning for free. We’re now able to look at an article on GitHub and see every change applied to it, when it was applied, and how the article looked before and after the change was made.

How would you improve a two-way synchronization mechanism between a database and a git repository?


Comments (2)

Mathieu Dutour wrote

Interesting article!

One thing to be careful at is that the event doesn’t list all the commits of the push, only the first 30 if I remember correctly. I implemented something similar and found that I needed to get the list of files changed between the 2 refs.

Also I’m using the github API instead of using git directly

Nicolás Bevacqua wrote

Absolutely! 😅

Like I said in the article, the implementation is quite brittle, but in my use case I expect two commits at most, changing some of the files related to a single article, so it’s good enough for now!