If you read the OxfordWords Blog, you may have seen the launch of the OxfordWords Text Analyser, coinciding with the 200th anniversary of Charles Dickens's birth. What follows is a technical description of the feature: what it does and how it works.
The challenge was to show the logophile visitors to OxfordWords some of the computational linguistic techniques used in the preparation of the Oxford English Corpus, and thus give some insight into the preparation of a modern dictionary.
Unable to share the corpus itself with the public, we settled on the idea of analysing a much smaller text with techniques similar to those used by our corpus analysis software. Since we already publish a huge range of classic texts in the Oxford World's Classics range, it made the most sense to use those as our sources and provide collocate and frequency analysis as well as example sentences for each text. This presented a data problem: incorporating all this in a single web page would mean adding several megabytes of data to the page, resulting in an unacceptable page load time.
With AJAX decided upon as the way to fetch that data on demand, the next decision related to the server-side component. I wrote a piece about this last summer, describing how experience of database-driven back ends for language analysis had led me to precompute data as JSON files rather than query a database. Disk space is fast and cheap; server processing power isn't. So my next task was to write a set of command-line PHP scripts that generated a tree of JSON files for each word in the source text, containing collocates, frequencies and example sentences.
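As a rough illustration of the precomputed-file idea, here is a minimal sketch of how a word might map to a file in such a tree. It is written in JavaScript for consistency with the front-end code rather than in the PHP actually used, and the directory layout (`data/<first letter>/<word>.json`), the function names and the record fields are all assumptions made for the example, not the real scheme.

```javascript
// Hypothetical layout: <root>/<first letter>/<word>.json
// Sharding by first letter keeps any one directory from holding
// thousands of files. The real tree may be organised differently.
function jsonPathForWord(word, root = "data") {
  // Normalise to lower case and drop anything unsafe in a file name.
  const key = word.toLowerCase().replace(/[^a-z0-9'-]/g, "");
  if (key === "") {
    throw new Error("word has no usable characters: " + word);
  }
  // Words that do not start with a letter share one "_" directory.
  const shard = /[a-z]/.test(key[0]) ? key[0] : "_";
  return `${root}/${shard}/${encodeURIComponent(key)}.json`;
}

// Hypothetical record shape: everything the page needs for one word,
// serialised once at build time so the server only serves static files.
function buildRecord(word, frequency, collocates, examples) {
  return JSON.stringify({ word, frequency, collocates, examples });
}

// e.g. jsonPathForWord("Bridal") gives "data/b/bridal.json"
```

The point of the design is that the browser can work out the file name from the word alone, so no server-side code runs at request time.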
While these tasks are essentially simple, they are quite computationally intensive. A typical Dickens novel, for instance, contains as many as 20,000 unique words, and every one of those words needs to be searched for across the whole text, and the frequency of its collocates computed at every location where it appears. The whole process takes about six hours and produces a roughly 20 MB tree of thousands of tiny JSON files, which are then uploaded to the web server.
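To make the cost concrete, the collocate count for a single word might look something like the sketch below. This is JavaScript rather than the PHP scripts described, and the tokenisation rule and the window of four words either side are assumptions for illustration only; the real analysis may define collocates differently.

```javascript
// Split text into lower-case word tokens (assumed rule: runs of letters,
// optionally joined by straight apostrophes, e.g. "don't").
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
}

// Count the words that appear within `windowSize` tokens of each
// occurrence of `target`. Returns a Map of collocate -> count.
function collocateCounts(tokens, target, windowSize = 4) {
  const counts = new Map();
  tokens.forEach((token, i) => {
    if (token !== target) return;
    const start = Math.max(0, i - windowSize);
    const end = Math.min(tokens.length - 1, i + windowSize);
    for (let j = start; j <= end; j++) {
      if (j === i) continue; // skip the target occurrence itself
      counts.set(tokens[j], (counts.get(tokens[j]) || 0) + 1);
    }
  });
  return counts;
}
```

Each call scans the whole token list once, so repeating it for every unique word in a novel means tens of thousands of full passes over the text, which is consistent with a run time measured in hours.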
So that's the surprisingly low-tech back end for the feature. How about the front end?
The core functionality of jQuery makes coding an application like this very simple. The three main parts of the feature live in hidden DIVs which are shuffled using the jQuery show() and hide() functions. The JSON data is pulled in using jQuery's getJSON() function. Collocates are shown in a word cloud courtesy of the excellent jQCloud plugin, example sentences are simply loaded into an unordered list, and frequency graphs are created using Google's Image Chart API.
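As a sketch of the glue between the JSON data and the word cloud: jQCloud expects an array of objects with `text` and `weight` properties, so precomputed collocate counts need a small conversion step. The helper name `toCloudWords`, the `collocates` field, the element id and the cap of 50 words are assumptions made for this example, not the feature's actual code.

```javascript
// Convert a { word: count } object into the [{ text, weight }] array
// that jQCloud accepts, keeping only the `limit` most frequent words.
function toCloudWords(collocates, limit = 50) {
  return Object.entries(collocates)
    .map(([text, weight]) => ({ text, weight }))
    // Highest weight first; ties broken alphabetically for stable output.
    .sort((a, b) => b.weight - a.weight || a.text.localeCompare(b.text))
    .slice(0, limit);
}

// In the browser, with jQuery and jQCloud loaded, usage might look like:
//
//   $.getJSON("data/b/bridal.json", function (data) {
//     $("#cloud").jQCloud(toCloudWords(data.collocates));
//   });
//
// The path, the "#cloud" id and the data.collocates field are hypothetical.
```

Capping the list keeps the cloud readable for very common words, which can have hundreds of distinct collocates.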
These plugins and code made a working feature. But to make a single-page jQuery application like this one feel like a proper application, one further component was required. Users expect to be able to use the back button, and to be able to return to a particular part of the application by URL alone. Miss Havisham's dress needs to be readily conjured up by linking directly to the word "bridal".
We thus used the jQuery-BBQ plugin to provide URL and history functionality through the use of in-page anchors. Because the page must not reload, the plugin appends the word to the end of the URL after a # symbol whenever it triggers a change to what is displayed.
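The underlying idea can be sketched without the plugin. The two helpers below are hypothetical and use only standard JavaScript; they are not jQuery-BBQ's own API, which handles the same job along with cross-browser history support.

```javascript
// Encode a word as a URL fragment, e.g. "bridal" -> "#bridal".
function wordToHash(word) {
  return "#" + encodeURIComponent(word);
}

// Recover the word from a fragment. Returns null when the fragment is
// empty or is not valid percent-encoding.
function hashToWord(hash) {
  const raw = hash.replace(/^#/, "");
  if (raw === "") return null;
  try {
    return decodeURIComponent(raw);
  } catch (e) {
    return null; // malformed escape sequence in the URL
  }
}

// In the browser, the page would react to fragment changes, which fire
// for both the back button and direct links:
//
//   window.addEventListener("hashchange", function () {
//     var word = hashToWord(window.location.hash);
//     if (word !== null) { /* load and display that word */ }
//   });
```

Because changing only the fragment does not trigger a page load, each word viewed becomes its own history entry and its own shareable link.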