
ElasticSearch for your Jekyll Blog

Search functionality is very helpful to have in pretty much any website, but it’s not particularly easy to add to a static Jekyll site. Fully fledged blog platforms such as WordPress give you a partial solution (no full text search) for free, but you also have to deal with all the associated bloat and the need for a database running in the background. On statically generated sites, you have to roll your own. Most of the solutions on the internet lean towards doing full text search entirely on the client side using a library such as Lunr.js. This works well, but you end up having to ship your whole site to the client as a JSON blob before you can perform the search. For smaller sites this might be fine, but that file can get quite large when it has to include all the content across your entire site - no thanks.

My, perhaps heavy-handed, solution (which won’t work for GitHub Pages) is to use a small ElasticSearch instance on the server side to provide great full text search across your site. It takes a little more work to set up, but once it’s all automated you can just leave it alone and still take advantage of all the capabilities of ElasticSearch.

I put together elastic-jekyll which is a small Python library that you can use to automatically index and search across your entire Jekyll blog. I’ll cover below how it all fits together and how to use it.

Parsing your Posts

The first step in the process is to find all of your posts within your site and create an in-memory representation of them with all the attributes we require. In this case the library will try to go through ~/blog/_posts unless you pass in another path to main.py. Once all of the markdown files are found, each one is parsed using BeautifulSoup to extract the title and text content (find_posts.py):

from bs4 import BeautifulSoup

def parse_post(path):
    with open(path, encoding="utf8") as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'html.parser')
    title = soup.find('h1', { "class" : "post-title" }).text.strip()

    # remove the title and date elements so only the post body text remains
    post_elem = soup.find("div", {"class": "post"})
    post_elem.find(attrs={"class": "post-title"}).decompose()
    post_elem.find(attrs={"class": "post-date"}).decompose()

    paras = post_elem.find_all(text=True)

    body = " ".join(p.strip() for p in paras).replace("  ", " ").strip()
    return (title, body)

The output is passed into create_posts which creates a generator of Post instances (a rough sketch of this structure is shown after the list below). Each contains:

  • Id - A unique identifier to let ElasticSearch keep track of this document (modified version of the post filename)
  • Url - The relative url of this post so we can create links in the search results (again uses the filename and site base directory)
  • Title - The title of the post extracted from the frontmatter of the markdown file
  • Text - The text content of the post. Note that this is still in markdown format so contains all of the associated special characters. A future extension might be to do some sort of sanitization on this text
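
A hedged sketch of what create_posts might look like - the field names match how the posts are used later during indexing, but the exact id/url derivation inside elastic-jekyll may well differ:

import os
from collections import namedtuple

Post = namedtuple("Post", ["id", "url", "title", "body"])

def create_posts(paths):
    for path in paths:
        title, body = parse_post(path)
        # hypothetical: derive the id and relative url from the post filename
        slug = os.path.splitext(os.path.basename(path))[0]
        yield Post(id=slug, url="/blog/" + slug, title=title, body=body)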

Indexing your Posts

Once we have all of the current posts properly parsed, we’re ready to dump them into ElasticSearch so it can perform its indexing magic on them and let us search through them. In Python this is very straightforward using the Python ElasticSearch client library.

First we establish a connection to the ElasticSearch server you should already have running on your system. It defaults to port 9200 although you can override it if you want.

from elasticsearch import Elasticsearch

def connect_elastic(host="localhost", port=9200):
    return Elasticsearch([{'host': host, 'port': port}])

For simplicity, the library will currently blow away any existing blog index on the Elastic instance and recreate a new one from scratch. You could of course figure out deltas from the version control history etc., but for a small set of data it’s far easier just to re-index everything each time:

index_name = "blog"

# remove the existing blog index and create a new blank one
def refresh_index(es):
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)
    es.indices.create(index=index_name)

Then we just loop through each of the posts we got from the previous step and push them into the index:

doc_type = "post"  # older ElasticSearch versions require a document type label

for post in posts:
    doc = {
        "title": post.title,
        "url": post.url,
        "body": post.body
    }

    # the id lets ElasticSearch overwrite the same document on each re-index
    es.index(index=index_name, doc_type=doc_type, id=post.id, body=doc)

At this point we have an index sitting in ElasticSearch that is ready to receive search queries from your users and turn them into a set of search results for relevant posts.
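
As a quick sanity check that everything made it in, you can ask ElasticSearch how many documents the index now holds (this uses the count API from the same Python client):

# should report one document per post that was indexed
print(es.count(index=index_name)["count"])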

Searching for your Posts

To actually provide users with the ability to search through your index, you will need some kind of web service ready to receive Ajax calls from your site. In my case I have a lightweight Flask server running which has an endpoint for searching. It simply passes the query string into ElasticSearch and returns the response as a JSON object. It is of course up to you how you want to do this, so I’ve just provided a generic way of querying your index within searcher.py:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

user_query = "python"

query = {
    "query": {
        "multi_match": {
            "query": user_query,
            "type": "best_fields",
            "fuzziness": "AUTO",
            "tie_breaker": 0.3,
            "fields": ["title^3", "body"]
        }
    },
    "highlight": {
        "fields": {
            "body": {}
        }
    },
    "_source": ["title", "url"]
}

res = es.search(index="blog", body=query)
print("Found %d Hits:" % res['hits']['total'])

for hit in res['hits']['hits']:
    print(hit["_source"])

This snippet will connect to your ElasticSearch instance running under localhost and query the blog index with a search term of python. The query object uses the Elastic-specific search DSL, which you can read more about in their documentation. ElasticSearch is a complicated and powerful beast with a ton of options at your disposal. In this case we are doing a simple multi_match query on the title and body fields (giving more weight to the title field). We also use fuzziness to tolerate potential spelling mistakes in the user input. ElasticSearch will return us a set of hits consisting of objects containing just the title and url fields, as specified in the _source field - we have no use for the others, so there’s no point in bloating the response. One cool feature is the use of highlighting, which will add <em> tags around the matched terms in the body field within the response. This can then be used to apply styling on the client side to show which sections of text the engine has matched on.
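
For illustration, a single hit in the response comes back shaped roughly like this (the values here are made up):

{
    "_id": "elasticsearch-for-your-jekyll-blog",
    "_score": 2.53,
    "_source": {
        "title": "ElasticSearch for your Jekyll Blog",
        "url": "/blog/elasticsearch-for-your-jekyll-blog"
    },
    "highlight": {
        "body": ["...use a small <em>ElasticSearch</em> instance on the server side..."]
    }
}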

This search query seems to work well for my use cases and I’ve literally just copied the above into the corresponding Flask endpoint. On the client side in your Jekyll search page, I’ve just used a bit of good old jQuery to perform the Ajax call and fill in a list with the search results. Keep it simple. You can find the JS I use in the search page source.
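
If it helps, here’s a minimal sketch of what such a Flask endpoint could look like - the route and query parameter names are my own choices rather than anything mandated, and the query dict is the same one shown above:

from flask import Flask, jsonify, request
from elasticsearch import Elasticsearch

app = Flask(__name__)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# hypothetical endpoint: GET /search?q=python
@app.route("/search")
def search():
    user_query = request.args.get("q", "")
    query = {
        "query": {
            "multi_match": {
                "query": user_query,
                "type": "best_fields",
                "fuzziness": "AUTO",
                "tie_breaker": 0.3,
                "fields": ["title^3", "body"]
            }
        },
        "highlight": {"fields": {"body": {}}},
        "_source": ["title", "url"]
    }
    res = es.search(index="blog", body=query)
    return jsonify(res['hits'])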

As far as automating the process goes, I have a script which rebuilds my Jekyll blog after a Git push to GitHub (via hooks). After the main site is rebuilt I just call python main.py and everything is kept up to date. As I said before, it takes a bit of work to set things up, but once you have, it will sync itself every time you make an update.
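
As a rough illustration only (the paths here are placeholders, not my actual setup), the post-push step boils down to something like:

import subprocess

# rebuild the static site, then re-index the freshly generated posts
subprocess.run(["jekyll", "build"], cwd="/path/to/blog", check=True)
subprocess.run(["python", "main.py"], cwd="/path/to/elastic-jekyll", check=True)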

Full source code can be found in the GitHub repository.


PNG Image Optimisation

Some tools that can be used to reduce PNG file sizes whilst maintaining good image quality. All of the tools below can be installed and used within the WSL (Windows Subsystem for Linux).

Sample Image:

A graphic with transparency is probably better suited for a PNG, but who doesn’t love a bit of tilt shift?

Original Size: 711KB

Original Image

PNG Crush (lossless)

Probably the most popular tool, but it has a lot of options and you may need to know some compression details to get the best results out of it.

> sudo apt-get install pngcrush

(also works on WSL)

> pngcrush input.png output.png

> pngcrush -brute input.png output.png

The -brute option will run through 148 different reduction methods and choose the best result.

> pngcrush -brute -reduce -rem allb input.png output.png

The -reduce option counts the number of distinct colours and reduces the pixel depth to the smallest size that can contain the palette. The -rem allb option removes all ancillary chunks except transparency and gamma.

Compressed size: 539KB (24% reduction)

Optipng (lossless)

Based on pngcrush, but it tries to figure out the best config options for you. In this case it’s no surprise that we get the same results.

> sudo apt-get install optipng

> optipng -o7 -out outfile.png input.png

The -o7 option specifies maximum optimisation but will take the longest to process.

Compressed size: 539KB (24% reduction)

PNGQuant (lossy)

The conversion reduces file sizes significantly (often as much as 70%) and preserves full alpha transparency. It turns 24-bit RGB files into palettized 8-bit ones. You lose some color depth, but for small images it’s often imperceptible.

> sudo apt-get install pngquant

> pngquant input.png

Compressed size: 193KB (73% reduction)

Original Image

If you look closely you can see some minor visual differences between this and the original image. However, the file size reduction is huge and the image quality remains very good. Definitely a great tool for the vast majority of images you find on the web.


Firefox Quantum - It's Fast Again

Firefox has always been installed on my system and it used to be my browser of choice. For the last few years or so however, it has been lagging behind Chrome in speed and general responsiveness. I have always hated the terrible startup times of Firefox compared to the relative instantaneousness of Chrome. General browsing and usability has also been more snappy in Chrome - which for most people is the single most important factor when choosing a browser.

Firefox Quantum Beta

This story seems to have changed quite a bit in the latest pre-release of Firefox however. Version 57, dubbed Quantum, uses a completely new CSS engine, and various components have been recreated in Rust to make much better use of multi-core processors. Mozilla says that these improvements give Quantum a 2x speed improvement over v52, along with using up to 30% less RAM than Chrome.

This all sounds great, but does it actually make any notable difference? I have been using the beta release alongside Chrome for a couple of weeks now (both with the same extensions installed), and I must say that the performance improvement is quite significant. I generally don’t care much about RAM usage and have no problem with Chrome eating loads of it as long as it’s well used to make things faster (if it’s there, why not use it?), so I won’t comment on that, but you can definitely notice the difference. Firefox feels a lot more snappy now and page loads are generally much faster. Can’t really argue with that. I wouldn’t say that it feels faster than Chrome, but it’s probably just as good, which is quite impressive. Always good to have some competition back in the marketplace. Startup times are also much better now!

Other notable differences in Firefox Quantum include the new Photon UI, which I must say looks pretty good. Things seem a lot simpler now and they’ve thankfully done away with the old huge hamburger menu, which was terrible. Transitions seem smooth and everything is where it should be. One thing to note is that the newer version forces the use of the new WebExtensions framework, so if/when you do update, it’s certainly possible that not all of your extensions will work. One big example is the LastPass extension, which has yet to be updated. It’s still a beta though, so that’s acceptable. Most of the popular extensions have already been updated to work in Quantum and hopefully more will follow after general release.

Firefox Quantum (v57) is due for release on November 14th. In the meantime, you can still try it out by installing the Beta (or Nightly releases).


Microsoft Rewards - Earn points by using Bing & Edge

Microsoft Rewards has been around for ages now in the USA, but it’s now made its way over to the UK. The general idea is that you get awarded points by using Microsoft services (predominantly Edge and Bing) which you can then redeem for a range of rewards. You can also get points by purchasing products such as Xbox Live etc.

Currently, I fire up Edge every so often and look at the front page news on Bing to rack up points for the day. You can get a maximum of 90 points per day for using Bing to search (although there have been offers to get more if you also use Edge to perform your searches). Unfortunately, you can’t get points by visiting the same page over and over again, but it still doesn’t take too long to fill the daily quota.

There are also daily challenges and quizzes on the main Microsoft Rewards portal which give you one-off boosts to your points. Again, they don’t take too long and you can pretty much get through them by button mashing.

Microsoft Rewards Portal page

Above shows the main rewards portal page where you can see your total and redeem your points for prizes. I’ve managed to accumulate over 15,000 points in a few months with very little effort. In some ways it’s similar to Google Opinion Rewards - but instead of giving Google all your personal information, you just have to use Bing for a bit.

Some of the prizes you can get include:

  • Skype credit (£2 = 900 points)
  • Skype Unlimited (3 months for 8000 points)
  • Xbox Live Gold (3 months for 15000 points or 12 months for 29000 points)
  • Xbox gift card (£10 for 12000 points)

As you can see, my 15,000 points equate to around £13 already, which is pretty good. Annoyingly the UK version doesn’t include Amazon gift cards like the US one seems to - which is frustrating as that is pretty much the only thing I would redeem for. They have also removed the Groove music passes, which seems strange to me. Maybe they will bring that back at some point. Edit - Microsoft is now apparently killing off Groove Music, which explains why it suddenly disappeared from the rewards.

Visit the Microsoft Rewards page to sign up (you will need an account obviously).


Brave Browser for Android

Excessive ads and tracker scripts are a pain on desktop, let alone on mobile browsers with limited network connectivity. There’s nothing worse than loading a couple of megabytes of Javascript on a ‘modern’ website, along with a bunch of tracking scripts, ads and an autoplaying video just for good measure. Not only does it take ages to load a simple site - even on a reasonable 3G/4G connection - but it starts to eat through your data plan pretty quickly.

On desktop browsers this is most commonly solved by installing blocking extensions such as uBlock Origin, Adblock Plus, Ghostery etc. Unfortunately on mobile devices the apps are much more limited, and most don’t have support for installing custom extensions or addons. N.B - ok, yes the Firefox mobile app does allow this, and yes I have tried it. Even with just one extension installed however, the app seems too sluggish and laggy - especially when compared to the stock Chrome experience on Android devices (I am running a Nexus 6P so CPU performance shouldn’t be a factor).

I have however found what seems to be a good middle ground that gives you the speed and responsiveness of Chrome whilst also blocking most ads and trackers - the Brave browser.

Brave browser screenshot

The browser is based on Chromium so you get the familiar interface and speed, but it will also block a ton of the major trackers and ads by default. The site has some stats on how load times are improved, which are probably under ideal circumstances, but I can definitely tell that there is a good improvement when surfing the web on my phone. I now get the same feeling of horror when I use any other browser on my phone as I do when using a stock browser on desktop. It’s amazing how fast the web could be without all the excess crap we push through the wire (or air).

To set your default browser on Android (source):

  1. Open Settings.
  2. Go to Apps.
  3. On the All tab, look for your current default browser and tap on it.
  4. Under Launch by Default, press the “Clear defaults” button to reset the default browser.
  5. Then open a link - you will be asked to select a browser; select Brave, then select Always.

You can get the Brave browser through the Google Play Store/Apple App Store. There is also a desktop version available, although I haven’t tried it yet.
