Friday, 12 August 2011

Precomputing for fun and profit

    Last year I wrote a post about my keyword analysis tool. I wrote this piece of software in the years before I started working for OUP, and only now that I have seen the tools at the disposal of the lexicographers do I realise that I created what was in effect a corpus analysis tool.
    I used the technologies I was familiar with from building web sites: PHP and MySQL. These are the bread and butter of web development, making it possible to create dynamic and flexible web content engines with relative ease. Since an SQL engine is designed for querying and analysing data, I thought MySQL would be perfect for the task of language analysis.
    I was proved right, to a point. I was able to quickly hack together a fully functional tool that would analyse a corpus to provide me with collocates and their frequencies for any n-word phrase. But in sticking dogmatically to what I knew, beyond that point I had taken entirely the wrong approach. I was using a very powerful database engine to perform the same very complex task again and again on an extremely large data set, and as my corpus edged into the tens of millions of words I found that my analysis was redolent of the early days of mainframe computing, when programmers would start a job and go away to have a cup of tea while it ran. The system became so slow as to be unusable. It still works, but even with all the MySQL optimisations in the book it does not work in the real time demanded by today's users.
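    For the curious, here is a minimal sketch of the kind of collocate counting I mean. It's in Python rather than the PHP/MySQL of the real tool, and the tokenisation, window size and function name are all illustrative assumptions, not how my code actually works:

```python
from collections import Counter

def collocates(tokens, phrase, window=2):
    """Count the words occurring within `window` tokens of each
    occurrence of `phrase` (a tuple of words) in the token list."""
    n = len(phrase)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == phrase:
            lo = max(0, i - window)
            hi = min(len(tokens), i + n + window)
            counts.update(tokens[lo:i])      # left context
            counts.update(tokens[i + n:hi])  # right context
    return counts

corpus = "the cat sat on the mat and the cat ate".split()
print(collocates(corpus, ("the", "cat")))
```

    Scanning every token like this is fine on a toy corpus, but it is exactly the kind of repeated full pass over the data that becomes unbearable at tens of millions of words, whichever engine performs it.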
    My mistake was to rely on an expensive commodity when I should have used a cheap one. When I first used computers twenty-plus years ago, storage was the expensive commodity. We used to have disk compression systems using a cheap commodity - the processor time on our PC-ATs - to expand an expensive commodity - the free space on our 10 and 20MB hard drives. Thus we were trained as programmers to be as efficient as possible in our use of storage space and memory.
    Space efficiency is a good thing if you are programming for a tiny embedded system but in 2011 disk space on web servers is laughably cheap. Processing power, as represented by the waiting time of your users, is not. Clearly a rethink is called for.
    So as an experiment I tried precomputing collocates and frequencies for a subset of phrases and saving them as a structure of separate tiny JSON files. A simple jQuery-based browser interface was then cobbled together to browse them, resulting in an application that was near-instantaneous for analysis whose separate component queries would have taken minutes to run in the MySQL tool.
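    In case it helps anyone heading down the same road, here is a rough Python sketch of the precompute-then-serve idea. Everything in it - the single-word "phrases", the directory layout, the file-per-word naming - is an illustrative assumption, not my actual engine:

```python
import json
from collections import Counter
from pathlib import Path

def precompute(tokens, out_dir="collocates", window=2):
    """Count, for every distinct word in the corpus, the words seen
    within `window` tokens of it, then write each result to its own
    tiny JSON file named after the word."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        results.setdefault(word, Counter()).update(context)
    for word, counts in results.items():
        (out / (word + ".json")).write_text(json.dumps(counts))

def lookup(word, out_dir="collocates"):
    """Serving a query is now just reading one small file."""
    return json.loads((Path(out_dir) / (word + ".json")).read_text())

precompute("the cat sat on the mat".split())
print(lookup("cat"))  # {'the': 1, 'sat': 1, 'on': 1}
```

    The write side does the expensive pass over the corpus once; the read side is a single file fetch, which is why the browser interface felt near-instantaneous.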
    The moral of this story, I guess, is to stand back from your choice of underlying software tools and ask yourself whether it was driven by suitability or by familiarity. I used PHP/MySQL because I was used to them, and I never gave a thought to whether a simpler solution might be much better.
    So I'll be revisiting my hacked-together precomputing engine and applying it to my entire corpus. I may end up with a few gigabytes of JSON files instead of a large MySQL database but to the user that doesn't matter.
    I never thought I'd be returning to flat-file storage; in fact I've mocked people who use Perl DB files for similar tasks in the past. Guess I'd better eat those words then, precomputed of course!


  1. Heh, I had a similar thought recently re online turn-based games. For the first large-scale DB-based game that I designed, we ended up saving out flat files holding all the turn results, because the pgSQL server couldn't serve out query results fast enough when all the players wanted to see what had happened.

    As time went on, with processors becoming faster and RDBMSes (and my coding ;-) becoming more efficient, we moved to doing everything dynamically, with live database queries.

    If I were to go back to it now, though, I would do as you discuss and put all the popular resultsets into JSON because, why not, it's even quicker for the user that way. Funny the way these things come back round.

  2. I dread to think what would happen with concurrent queries on this one; it takes minutes with just me. But it does do some significant querying on very large data sets.

    I've wondered whether I could craft some incredibly clever take on the SQL that did the queries much more quickly, but you can't escape that it's a huge amount of data to sift.