
Introduction to SEO and Content Indexing


Just a few days ago, Google started indexing this blog, and it’s just starting to show up in their search results. I wanted to go over the steps I took to make a good impression on them.

I must admit that I’m no expert on SEO; I just read a lot, and I think I’ve developed a good sense of what makes a site relevant in the eyes of web crawlers such as Google’s.

Semantic HTML and Content Relevance

The absolute first step was semantic markup: using HTML5 tags such as <article>, <section>, <header>, <footer>, and so on. For a full list of HTML5 tags, visit MDN.

This helps crawlers assign weight (importance) to each piece of HTML in your page. It also makes your pages future-proof, meaning that if and when crawlers start giving more importance to semantic markup, you will be ready for it. Additionally, it makes your CSS more semantic, too, which can’t hurt.
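As a rough sketch, a semantically marked up blog page might look like this (the structure and text below are illustrative, not taken from this blog’s actual markup):

```html
<body>
  <header>
    <h1>Pony Foo</h1>
  </header>
  <section>
    <!-- each post is a self-contained article -->
    <article>
      <header>
        <h2>Introduction to SEO and Content Indexing</h2>
      </header>
      <p>Article text goes here…</p>
      <footer>Posted by the author</footer>
    </article>
  </section>
  <footer>© Pony Foo</footer>
</body>
```

Compare this with a page built entirely out of nested <div> elements: a crawler can tell the article body apart from the site chrome here without any heuristics.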

Always make sure your content is relevant to the keywords you aspire to be found for; don’t just spit a bunch of keywords onto your site and expect good things to happen. If you do, your visitors won’t take your site seriously, and earning their interest is your end goal anyway.

What is the point in being well-ranked if your visitors don’t consume or appreciate the contents of your site?

Analytics

The second step I took was adding Google Analytics to my solution. I later went on to add a couple of other services, Clicky and [New Relic](http://newrelic.com/ "New Relic Application Monitoring"), to improve my analytics and get at least some sort of uptime monitoring. All these services are really easy to include in your application, and they provide a lot of value.

Analytics can tell you what pages users land on, what pages are the most linked to, where your users come from, as well as how your users behave and what they are looking for. In summary, it’s really important to know what’s going on with your site in the grand scheme of things, and figure out how to proceed, and analytics tools are a great way to accomplish just that.

You also definitely want to sign up for Google Webmaster Tools, which will be immensely helpful in putting it all together, and will also help you track your index status closely.

AJAX Crawling

This is the most complex step I took. Since this site is entirely AJAX-driven, users load the site only once, and the page is, for all intents and purposes, static: no matter what page you request, you get the same HTML, and JavaScript then takes care of displaying the appropriate view. This is clearly a deal-breaking issue for web crawlers, because they cannot index your site if all your pages look the same.

Without a proper AJAX crawling strategy, any other efforts to improve SEO are utterly useless. You need a good crawling strategy that allows search engines to get the content a regular surfer would find in each page.

The desired behavior, then, would look something like this mockup:

[Figure: crawler.png — mockup of the crawler request flow]

Implementation

In my research, I figured out a headless browser was the way to go. A headless browser is basically a way to make a request to a page and execute any JavaScript in it, without resorting to a full-fledged web browser. This would allow me to transparently serve web crawlers with static versions of the dynamic pages on my blog, without having to resort to obscure techniques or manually editing the static versions of the site.

Once I had this down, it was just a matter of adding a little helper to every single GET route to handle crawler requests differently. If a request comes in and it matches one of the known web crawler user agents, a second request is triggered on behalf of the crawler, against the same resource, and through the headless browser.
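The user-agent check can be as simple as matching against a list of known crawler names. This is a minimal sketch of that helper, not the blog’s actual code; the crawler list below is illustrative and real deployments should keep it up to date:

```javascript
// Known crawler user-agent substrings, lowercased (illustrative list)
var knownCrawlers = ['googlebot', 'bingbot', 'yandex', 'baiduspider', 'slurp'];

// Returns true when the user-agent string belongs to a known web crawler
function isCrawler(userAgent) {
  if (!userAgent) {
    return false;
  }
  var ua = userAgent.toLowerCase();
  return knownCrawlers.some(function (name) {
    return ua.indexOf(name) !== -1;
  });
}
```

A route helper could then branch on `isCrawler(req.headers['user-agent'])`, and proxy matching requests through the headless browser instead of serving the regular single-page shell.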

After waiting for all the JavaScript to get executed, and after a little cleanup (since the page is static, it makes sense to me to remove all script tags before serving it), we are ready to serve this view to the crawler agent.

One last step, if you care about performance, is dumping this into a relatively short-lived file cache (meaning you’ll invalidate the cached page after a set amount of time elapses), in order to save yourself a web request on subsequent calls made by a crawler agent against that resource.

If you are curious about how to implement this, here is my take for this blog; it is implemented in Node.

Note that this might not be the latest version; it’s the one contained in the v0.2 tag, although I don’t expect it to change much.

Once that’s settled, and working, you can do awesome stuff such as updating your <meta> tags through JavaScript, and the web crawlers will pick up on it!

Metadata

Metadata is crucial to being well-positioned in search results. There is a lot of meta content you can enrich your site with. I’ll talk about some of the metadata you can include, particularly what I have chosen to include.

Use <meta> tags

The single most important <meta> tag is <meta name='description' content='...'>. This tag should uniquely describe each page on your site, meaning different pages should never share the same content attribute value. Keep the description brief, though not too short, since it’s the description users will see when your page gets a search results impression.
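For instance, this page’s description might look like the following (the description text here is made up for illustration):

```html
<meta name='description' content='An introduction to SEO: semantic markup, AJAX crawling, and the metadata I chose for my blog'>
```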

The <meta name='keywords' content='...'> tag is much discussed, and seems to have been mostly irrelevant for a while now, but it won’t do any harm should you decide to include it.

Provide Open Graph metadata

Open Graph is a set of meta properties pushed by Facebook. These are mostly useful when sharing links to your site on social networks. Try to include a relevant thumbnail, the actual title of each page, and an appropriate description of that page. You definitely should include these in your website if you care about SEO.
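A typical set of Open Graph tags for a blog post might look like this; the property names are the standard Open Graph ones, while the URLs and text below are placeholders:

```html
<meta property='og:type' content='article'>
<meta property='og:title' content='Introduction to SEO and Content Indexing'>
<meta property='og:description' content='Steps I took to make a good impression on web crawlers'>
<meta property='og:image' content='http://example.com/thumbnail.png'>
<meta property='og:url' content='http://example.com/introduction-to-seo'>
```

When someone shares the link, social networks read these tags to build the preview card, rather than guessing at a title and thumbnail.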

Implement Schema.org microdata

Schema.org microdata allows you to mark up your site with attributes indicating what each piece of content is. You can read more about microdata on Google. This will help Google display your site in search results, and figure out the types of content published on your site.
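For a blog post, the markup might look like this sketch; the `itemscope`, `itemtype`, and `itemprop` attributes are standard microdata, while the headline and date are placeholders:

```html
<article itemscope itemtype='http://schema.org/BlogPosting'>
  <h1 itemprop='headline'>Introduction to SEO and Content Indexing</h1>
  <time itemprop='datePublished' datetime='2013-01-01'>January 1st, 2013</time>
  <div itemprop='articleBody'>
    Article text goes here…
  </div>
</article>
```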

Search Engine Support

There are at least a couple of ways to help search engines get on the right track when indexing your website. You should take full advantage of these.

Provide a robots.txt file

There isn’t a lot to say about robots.txt. Don’t depend on crawlers honoring your rules; instead, make your site follow the REST guidelines to the letter (in particular, GET requests should never modify data), so that your unsuspecting data doesn’t get ravaged by a curious spider.
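A minimal robots.txt might look like this; the disallowed path and sitemap URL below are illustrative:

```
User-agent: *
Disallow: /api/
Sitemap: http://example.com/sitemap.xml
```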

Publish a sitemap.xml

By submitting a sitemap, you can give a web crawler hints about how you value your site’s content, and help it index your website. You can read more about sitemaps on Google or here.
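A sitemap is just an XML file listing your URLs, optionally annotated with how often they change and how important they are relative to each other. A sketch, with placeholder URLs:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://example.com/introduction-to-seo</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

On a dynamic site, it makes sense to generate this file from your data store rather than maintaining it by hand.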

Alternative traffic sources

It is always wise to provide users with alternative means of accessing your website’s content.

Implement OpenSearch Protocol

A while back I talked about implementing OpenSearch. This allows users to search your site directly from their browser’s address bar.

Publish Feeds

I don’t think feeds such as RSS need an introduction. I recommend publishing at least one feed of your site’s content. Keeping it up to date is just as important.

