There were several challenges that had to be sorted out before the implementation, which didn’t take long once I settled on an approach.
- A script that converted any database article into its file system representation was necessary, so that I could use that representation when talking to git.
- The files need to be straightforward.
  - A huge JSON file doesn’t make a lot of sense for Markdown articles, and a single Markdown file might prove confusing due to the amount of fields present on a Pony Foo article.
  - The ability to see the rendered article on GitHub is pretty important, but the problem is that the Markdown in Pony Foo has diverged quite a bit from GitHub’s Markdown. At the same time, articles rely pretty heavily on domain-relative links (such as /subscribe), and expecting the git user to enjoy an article by looking at half a dozen different files didn’t seem reasonable.
  - Not every field needs to be on git. The git representation can act as a mirror of the database, which is the single source of truth. Which fields should be on git?
- The repository needs to be provisioned with all pre-existing articles. I needed a script that went through every published article and converted it to files in a git repository.
- Drafts need to be excluded from the repository. They haven’t been published on the site yet, and as such providing them in the open in an unpolished form isn’t the best idea. Since drafts aren’t available on git, we don’t need to concern ourselves with publishing or handling new articles being created directly through git.
- Whenever an article is updated on the website, we need to update its files and push to git.
- Whenever an article is updated on git, we need to update the web version.
Let’s follow the logical progression. We’ll start with provisioning the repository, and then look at how we can keep it up to date whenever an article is updated on the site. Once the repository is kept up to date with changes to the website, we can look at reacting to updates made against git.
Shall we?
Provisioning the Repository
The first order of business was to figure out the file structure. I settled on a pattern where I’d have a standalone articles repository. It would include licensing and contributing information at the top level, along with some dotfiles and other commonplace open-source files. I didn’t want a deep folder structure for the articles, so I decided on one folder per year, and a folder per article within that year.
Here is the folder structure for the first five articles this year. I used a ${ month }-${ day }--${ article.slug } pattern so that the folders would sort properly while also carrying a meaningful description.
2016/01-11--asynchronous-i-o-with-generators-and-promises/
2016/01-21--controversial-state-of-javascript-tooling/
2016/01-30--es2016-features-and-ecmascript-as-a-living-standard/
2016/02-02--understanding-javascript-async-await/
2016/02-09--ecmascript-string-padding/
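Deriving those directory names is straightforward. Here’s a minimal sketch; the helper name and the `created` date field are my own assumptions, not necessarily how Pony Foo implements it:

```js
// Hypothetical helper following the ${ month }-${ day }--${ article.slug }
// pattern, producing names like "2016/02-09--ecmascript-string-padding".
function getArticleDirectory (article) {
  const date = new Date(article.created);
  const month = String(date.getMonth() + 1).padStart(2, '0');
  const day = String(date.getDate()).padStart(2, '0');
  return `${ date.getFullYear() }/${ month }-${ day }--${ article.slug }`;
}
```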
The metadata.json file looks like the piece of JSON seen below. The id and author fields are used to uniquely identify the article, but they won’t be updated should they change in git. The article’s title, slug, tags, and hero image can be updated on GitHub.
{
  "id": "57783d1df2a76b840314377d",
  "author": "543d222f4683586910034197",
  "title": "<div>How Pony Foo is ridiculously over-engineered</div><div><em>— and why that is awesome</em></div>",
  "slug": "most-over-engineered-blog-ever",
  "tags": [
    "side-projects",
    "ponyfoo"
  ],
  "heroImage": "https://i.imgur.com/IF2aFsB.jpg"
}
There’s a single file for each different piece of Markdown:
- One file for the teaser below the title
- One file for the introduction to the article, used when sending out an email about a new article
- One file for the body of the article, containing everything else in the article
- One file for the article summary and another for the notes from the editor, both of which are optional
Lastly, we also have a rendered readme.markdown preview, which is the product of compiling all Markdown fields into HTML, with a header that includes the title, tags, summary, and a warning about the readme file being autogenerated (and thus read-only). We say the readme is in .markdown format even though it’s spewed out as raw HTML, so that GitHub renders the previews when humans visit one of the folders in the repository.
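Putting it all together, an article’s folder might look something like this; the exact file names are my guess based on the description above:

```
2016/02-09--ecmascript-string-padding/
├── metadata.json
├── teaser.markdown
├── introduction.markdown
├── body.markdown
├── summary.markdown
├── editor-notes.markdown
└── readme.markdown
```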
These files are written to disk for an article via a single updateSyncRoot(article, done) function. The same function is called for every published article on the site, and we have our provisioning set up.
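The source doesn’t show updateSyncRoot itself, so here’s a simplified sketch of what it might do. The syncRoot path, the toMetadata and renderReadme helpers, and the exact file names are assumptions; it reuses the getArticleDirectory helper sketched earlier.

```js
const fs = require('fs');
const path = require('path');
const mkdirp = require('mkdirp');

// Hypothetical sketch of updateSyncRoot: writes every piece of an
// article into its folder in the local clone of the repository.
function updateSyncRoot (article, done) {
  const dir = path.join(syncRoot, getArticleDirectory(article));
  mkdirp(dir, err => {
    if (err) {
      done(err); return;
    }
    write('metadata.json', JSON.stringify(toMetadata(article), null, 2));
    write('teaser.markdown', article.teaser);
    write('introduction.markdown', article.introduction);
    write('body.markdown', article.body);
    write('readme.markdown', renderReadme(article)); // compiled HTML preview
    done(null);
  });
  function write (filename, contents) {
    fs.writeFileSync(path.join(dir, filename), contents);
  }
}
```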
So far we have a directory tree with a bunch of folders, Markdown, and JSON files. For local development, I cloned the git repository and then called updateSyncRoot for every article, provisioning the repository.
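That provisioning run can be a one-off script. A sketch, assuming a mongoose Article model with a status field and the async library:

```js
const async = require('async');

// One-off provisioning run: converts every published article into
// files in the local clone of the repository.
Article
  .find({ status: 'published' })
  .exec((err, articles) => {
    if (err) { throw err; }
    async.eachSeries(articles, updateSyncRoot, err => {
      if (err) { throw err; }
      console.log('Provisioned %s articles.', articles.length);
    });
  });
```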
One important detail was to use a branch other than master, for SEO purposes: GitHub indexes master by default, but I’d rather not have search engines crawl GitHub for entire copies of every article on the website. I used a noindex branch name to communicate that intent clearly, and made it the default branch on GitHub.
At this point, I manually committed and pushed every single article to GitHub, in file format. The rendered preview of an article on GitHub is shown below.
Next up, we need to push updates to the repository.
Keeping the Repository Up To Date
This step is a bit more challenging, as it involves several automated git commands. I used a mongoose post-save hook so that whenever an article is saved, for both inserts and updates, a function gets called. We use that hook to call a pushToGit({ article, oldSlug }, done) function, passing in the article and the slug it had before it was saved.
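A sketch of what that hook might look like. How oldSlug gets captured isn’t shown in the source, so the _oldSlug property here is a hypothetical stand-in:

```js
// Mongoose document middleware: post('save') fires after both inserts
// and updates. The previous slug would have to be stashed before saving,
// e.g. in a pre('save') hook; `_oldSlug` is a hypothetical stand-in.
articleSchema.post('save', function (article) {
  pushToGit({ article, oldSlug: article._oldSlug }, err => {
    if (err) {
      winston.warn('Error pushing article to git', err.stack || err);
    }
  });
});
```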
In pushToGit, we start by running git pull on the repository. I used the simple-git package to run git commands from node. After pulling, we use the updateSyncRoot(article, done) function to update the file representation of our article. Then, if the oldSlug is different from the new one, we remove the files at the old slug’s directory structure. Finally, we run git add, git commit, and git push on the changed files.
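The source doesn’t include the implementation, but a sketch of pushToGit built on simple-git could look like this. The syncRoot path, the commit message, and the removeOldDirectory helper are all assumptions:

```js
const simpleGit = require('simple-git');
const git = simpleGit(syncRoot); // syncRoot: path to the local clone

// Hypothetical sketch of pushToGit: pull, rewrite the article’s files,
// drop the old directory if the slug changed, then add, commit, and push.
function pushToGit ({ article, oldSlug }, done) {
  git.pull('origin', 'noindex', err => {
    if (err) { done(err); return; }
    updateSyncRoot(article, err => {
      if (err) { done(err); return; }
      if (oldSlug && oldSlug !== article.slug) {
        removeOldDirectory(article, oldSlug); // assumed cleanup helper
      }
      git
        .add('./*')
        .commit(`Updated "${ article.slug }" via the website.`)
        .push('origin', 'noindex', done);
    });
  });
}
```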
Whenever an article is deleted, in a similar fashion to our update hook, we remove its files from the repository. We do this by pulling first, then running git rm for the related files, committing, and pushing.
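A sketch of that deletion counterpart, assuming a filesForArticle helper that lists every file under the article’s directory:

```js
// Hypothetical sketch: mirrors an article deletion onto the repository.
function removeFromGit (article, done) {
  git.pull('origin', 'noindex', err => {
    if (err) { done(err); return; }
    git
      .rm(filesForArticle(article)) // stages the file deletions
      .commit(`Removed "${ article.slug }" via the website.`)
      .push('origin', 'noindex', done);
  });
}
```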
These two actions ensure that any updates to the articles are mirrored on the git repository. The last piece of the equation is to handle git push events.
Handling Updates Pushed To Git
We can register a webhook on GitHub so that whenever new commits are pushed to our repository, our web app gets a notification. I chose the /api/git-hooks/articles endpoint and a secret code. GitHub uses the secret to sign the payloads sent to /api/git-hooks/articles, so that when we receive a request for that endpoint, we can verify it came from GitHub.
On the web app, I use github-webhook-handler to receive the event. The articleGit service is where I keep the git-related functions I mentioned earlier, and the env module contains all the secrets used by the application. I made a small configure helper function which takes care of creating the webhook handler middleware for an Express app, using a key as part of the endpoint, and any number of ...events that the application accepts and knows how to process.
const winston = require('winston');
const createHandler = require('github-webhook-handler');
const env = require('./env');
const articleGitService = require('../services/articleGit');
const secret = env('X_HUB_SECRET');

function configure (app, key, ...events) {
  const path = `/api/git-hooks/${key}`;
  const handler = createHandler({ path, secret, events });
  app.use(handler);
  handler.on('error', err => {
    winston.warn('Error in GitHub hook handler', err.stack || err);
  });
  return handler;
}
When we receive a push event from the articles webhook on GitHub, we’ll invoke articleGitService.pullFromGit.
function webhooks (app) {
  configure(app, 'articles', 'push').on('push', event => {
    articleGitService.pullFromGit(event);
  });
}

module.exports = webhooks;
The service receives an event.payload such as the one below. We’ll only leverage a few of its fields: the commits collection, and the removed and modified file lists within each commit.
{
  "ref": "refs/heads/changes",
  "before": "9049f1265b7d61be4a8904a9a27120d2064dab3b",
  "after": "0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
  "created": false,
  "deleted": false,
  "forced": false,
  "base_ref": null,
  "compare": "https://github.com/baxterthehacker/public-repo/compare/9049f1265b7d...0d1a26e67d8f",
  "commits": [
    {
      "id": "0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
      "tree_id": "f9d2a07e9488b91af2641b26b9407fe22a451433",
      "distinct": true,
      "message": "Update README.md",
      "timestamp": "2015-05-05T19:40:15-04:00",
      "url": "https://github.com/baxterthehacker/public-repo/commit/0d1a26e67d8f5eaf1f6ba5c57fc3c7d91ac0fd1c",
      "author": {
        "name": "baxterthehacker",
        "email": "baxterthehacker@users.noreply.github.com",
        "username": "baxterthehacker"
      },
      "committer": {
        "name": "baxterthehacker",
        "email": "baxterthehacker@users.noreply.github.com",
        "username": "baxterthehacker"
      },
      "added": [],
      "removed": [],
      "modified": [
        "README.md"
      ]
    }
  ],
  "head_commit": {
    // ...
  },
  "repository": {
    // ...
  },
  "pusher": {
    // ...
  },
  "sender": {
    // ...
  }
}
Example extracted from GitHub Help pages
There are quite a few things that could have happened due to git push. Let’s look at those.

- Changes could be completely unrelated to any articles, such as when we update the license for the repository
- Changes could have updated a piece or pieces of information for an article
- Changes could have removed an article entirely
I decided to go for a naïve but realistic implementation, where we’ll look at the commits collection. We .reduce the commits once for removed files and again for modified files. We’ll interpret deleted metadata.json files as the article for that metadata.json having been deleted. We’ll interpret any modified files as the article having changed.
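Here’s a sketch of that interpretation step. The slug-parsing helper is my own guess at how file paths would map back to articles:

```js
// Folds the commits in a push event into two sets of article slugs:
// articles that were deleted, and articles that were modified.
function interpretCommits (commits) {
  const deleted = commits.reduce((slugs, commit) => {
    commit.removed
      .filter(file => file.endsWith('metadata.json'))
      .forEach(file => slugs.add(slugFromPath(file)));
    return slugs;
  }, new Set());
  const modified = commits.reduce((slugs, commit) => {
    commit.modified.forEach(file => slugs.add(slugFromPath(file)));
    return slugs;
  }, new Set());
  deleted.delete(null);
  modified.delete(null); // drop files that aren’t part of an article
  return { deleted, modified };
}

// Assumed helper: extracts the slug from paths such as
// "2016/02-09--ecmascript-string-padding/body.markdown".
function slugFromPath (file) {
  const [, dir = ''] = file.split('/');
  const marker = dir.indexOf('--');
  return marker === -1 ? null : dir.slice(marker + 2);
}
```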
Next up, we remove from the database any articles whose files were deleted on git. This would trigger the “remove from git” hook, but the article was already removed from git and, since deletion is idempotent, all is well.
After that, we git pull the changes in the repository into our local clone, read all the files related to a modified article, and update its database representation, being careful not to erase any important information. When the article gets saved, the post-save hook gets triggered again, calling updateSyncRoot and attempting another push to git. However, given that the file system representation is now equivalent to the article, there is nothing to commit.
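Tying it all together, here’s a sketch of what pullFromGit might look like. The interpretCommits helper is the one sketched above; Article.remove and updateArticleFromDisk stand in for the actual database plumbing, which isn’t shown in the source:

```js
// Hypothetical sketch of pullFromGit, reacting to a GitHub push event.
function pullFromGit (event) {
  const { deleted, modified } = interpretCommits(event.payload.commits);
  deleted.forEach(slug => {
    // This fires the “remove from git” hook again, but the files are
    // already gone from the repository, and deletion is idempotent.
    Article.remove({ slug }, err => {
      if (err) {
        winston.warn('Error removing article', err.stack || err);
      }
    });
  });
  git.pull('origin', 'noindex', err => {
    if (err) {
      winston.warn('Error pulling from git', err.stack || err);
      return;
    }
    // Assumed helper: reads the article’s files back from disk and saves
    // the changes to the database, without erasing fields not kept in git.
    modified.forEach(slug => updateArticleFromDisk(slug));
  });
}
```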
Certainly this is all a tad brittle, but it has worked quite well thus far without many surprises. A nice aspect of this setup is that we get versioning for free. We’re now able to look at an article on GitHub and see any changes applied to it, when they were applied, and how the article looked before and after each change.
How would you improve a two-way synchronization mechanism between a database and a git repository?