I launched my startup this week. On a very modest scale: no debauched parties or penthouse offices on scads of VC money, just me and my laptop in my dad's living room, popping open a bottle of home-made cider to mark the event.
Language Spy is the tangible fruit of a seven- or eight-year side project, creating a searchable corpus of political language. It's driven by a pair of Raspberry Pis doing the number crunching and uploading data to Google Cloud Storage buckets, from which the site is served by Google App Engine.
You can see events unfolding through the words used about them, for example the correlation between "Hillary Clinton" and "Email" in the last week of US politics. If like me you're a news junkie, it's compelling viewing.
Unfortunately, if you click the link as I write this, you'll see a Google App Engine quota exceeded message. The site won't work because it has exceeded my daily budget: I have reached the point at which I can't afford the traffic it's serving.
This traffic spike would be no problem if it were generated by real site users, as then I'd be able to monetise the traffic, but sadly it isn't. Instead it's generated by Googlebot. That's right, being indexed by a search engine has taken my site down. The bot looks at the site, decides it's on some very fast infrastructure, and issues millions of requests per hour.
When I examine my problem, it becomes clear that it has several aspects:
- It's a language analysis site, so it has a *lot* of pages for the spider to crawl.
- Being a language analysis site there are no pieces of language I can exclude using robots.txt, so I can't reduce the load by conventional means. How do you decide which language is more important than other pieces? You can't, at least not when your aim is to have it all open for analysis.
- I can't tell Google to slow down a little. Where I'd expect to be able to do this in Webmaster Tools, I get a "Your site has been assigned special crawl rate settings. You will not be able to change the crawl rate." message. I see this as the sticking point: if I could restrict Googlebot's rate I'd be able to keep the site running and take the hit of not being so well indexed.
Right now my only hope lies with a crawl issue report I filed with the Webmaster Tools team. If they can give me control over my indexing rate I'll be good to go. But I can't say when they'll come back to me, if ever, so I may just have to come up with a Plan B.
Is there a moral to this story? Perhaps it's a cautionary tale for a small startup tempted to use cloud hosting. Google Cloud Storage has proved very cost-effective for a huge language database, but the sting in the tail has turned out to be how Googlebot behaves when it sees a cloud server and how per-instance billing on App Engine handles unexpected traffic surges. The fact that it's Google who are causing me to use up my budget with Google is annoying but not sinister. However, having no option either to limit my GAE instance count or to slow down the crawl rate doesn't leave me the happiest of customers.
So yes, I've launched a startup. It's live for an hour or two a day while it has budget, in the morning UK time. Perhaps that will be my epitaph.

 
Why don't you just put some code in to block all the Googlebot requests???
Each time it was called, it would still spin up a GAE instance. So no benefit.
You can set load limits in Google Webmaster Tools, and in your robots.txt.
Exactly the same issue has taken high profile forums offline following URL changes - when Google decides to do a full re-index very quickly (it happened to Neowin.net a few years back).
Sadly Google doesn't respect crawl delays in robots.txt, and my problem is that I don't have the option to limit it in Webmaster Tools.
You can try this:
User-agent: *
Crawl-delay: 3
Using this syntax, a crawler that honors “crawl-delay” will wait at least three seconds between visits. Obviously, the larger value you use, the more slowly your site will be crawled.
Sadly Google doesn't honour crawl-delay. But yes.
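For anyone wondering what "honouring" the directive looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBot` is made up for illustration. It only shows how a polite crawler could read the value; as noted above, Googlebot does not act on it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt suggested in the comment above.
CRAWL_DELAY_ROBOTS = """\
User-agent: *
Crawl-delay: 3
"""


def read_crawl_delay(robots_text, agent):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(agent)


# "ExampleBot" is a hypothetical crawler that chooses to respect the delay.
delay = read_crawl_delay(CRAWL_DELAY_ROBOTS, "ExampleBot")
```

A crawler that respects the directive would sleep for `delay` seconds between fetches; nothing forces it to.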
You made Hacker News
I had a VERY similar thing happen to my startup www.metrink.com as well. GAE kept spinning up instances to handle bursty load, then not shutting them down and running through my budget. I opened a support ticket and went back-and-forth with Google (including 5 phone calls) to try and resolve the issue, to no avail. My solution: switched to Digital Ocean. Doesn't have all the nice things about GAE, but is MUCH cheaper...
It may have to come to that :(
Wrote this on Hacker News but I was a bit late to the party so it'll probably end up buried:
Rather than it hitting quota error pages, would it be feasible to give Googlebot a 503 header back after a certain number of pages, with a Retry-After set to the next day?
From https://plus.google.com/+PierreFar/posts/Gas8vjZ5fmB
Primarily the section
"2. Googlebot's crawling rate will drop when it sees a spike in 503 headers. This is unavoidable but as long as the blackout is only a transient event, it shouldn't cause any long-term problems and the crawl rate will recover fairly quickly to the pre-blackout rate. How fast depends on the site and it should be on the order of a few days."
I think having a handler return a 503 would still require a GAE instance. So while it would have the effect of lowering the rate it would still cost me a significant amount.
More significant than not having your site running at all?
Yes. Each 503 handler is an instance running per simultaneous request, and each instance is 5 cents an hour. Get hundreds of instances running and that's a lot of cash in a short time.
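The 503 idea from this exchange could be sketched as a small piece of WSGI middleware. This is an illustration only, not what the site runs: the class name `BotThrottle`, the `budget` parameter and the per-process counter are all assumptions, and on App Engine each instance would keep its own count. As the reply above points out, every 503 is still served by a billable instance, so this lowers the crawl rate rather than the cost per request.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_next_utc_day(now):
    """Seconds from `now` (an aware UTC datetime) until the next UTC midnight."""
    tomorrow = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int((tomorrow - now).total_seconds())


class BotThrottle:
    """Hypothetical WSGI middleware: once Googlebot has made `budget`
    requests in a UTC day, answer 503 with Retry-After until midnight."""

    def __init__(self, app, budget, clock=lambda: datetime.now(timezone.utc)):
        self.app = app
        self.budget = budget
        self.clock = clock
        self.day = None    # UTC date the counter belongs to
        self.count = 0     # Googlebot requests seen on that date

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" not in agent:
            return self.app(environ, start_response)
        now = self.clock()
        if now.date() != self.day:
            self.day, self.count = now.date(), 0   # new day, reset counter
        self.count += 1
        if self.count <= self.budget:
            return self.app(environ, start_response)
        retry_after = str(seconds_until_next_utc_day(now))
        start_response("503 Service Unavailable",
                       [("Content-Type", "text/plain"),
                        ("Retry-After", retry_after)])
        return [b"Crawl budget exhausted for today.\n"]
```

Wrapping an existing WSGI app would look like `app = BotThrottle(app, budget=1000)`, where 1000 is an arbitrary example figure.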
I believe you can have a single F1 instance up 24/7 just on the free quota that Appengine gives you. Shouldn't cost you a thing.
Absolutely, you get 28 instance hours free. But the GAE model isn't the same as a server model, every simultaneous request gets another instance. Lots of simultaneous requests means the free quota evaporates very quickly in the face of a DoS level of traffic.
It's a good model for a site visited by a manageable number of humans because their requests are rarely simultaneous and you're mostly within the free quota. If you're getting the level of human traffic at which you do get huge numbers of instances running you can at least get whatever benefit your business model returns from those visits. As I found though it's a bad model if you get a huge amount of unthrottled bot traffic.
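To put rough numbers on that, here is a back-of-envelope sketch using only the two figures quoted in this thread: 28 free instance hours and 5 cents per instance hour. It assumes the free hours are a daily allowance and that billing is a flat rate per instance hour; the instance counts passed in are hypothetical, not measurements from the site.

```python
FREE_INSTANCE_HOURS_PER_DAY = 28    # figure quoted in the thread
PRICE_CENTS_PER_INSTANCE_HOUR = 5   # figure quoted in the thread


def daily_cost_cents(concurrent_instances, hours_running):
    """Billable cost in cents for one day, after the free allowance."""
    used = concurrent_instances * hours_running
    billable = max(0, used - FREE_INSTANCE_HOURS_PER_DAY)
    return billable * PRICE_CENTS_PER_INSTANCE_HOUR


def hours_until_free_quota_gone(concurrent_instances):
    """How long the free allowance lasts at a steady instance count."""
    return FREE_INSTANCE_HOURS_PER_DAY / concurrent_instances


# One instance all day stays inside the free allowance.
single = daily_cost_cents(1, 24)
# A hypothetical 200 simultaneous instances for a full day does not.
swarm = daily_cost_cents(200, 24)
```

Under these assumptions a single instance costs nothing, while 200 instances would burn through the free allowance in under ten minutes and run to a couple of hundred dollars over a day.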
Could you put "Welcome/who I am" pages in the site's root directory, then put the meat in a subdirectory? Use robots.txt to exclude the subdirectory.
That's more or less what I ended up doing
Well thanks everyone for all your comments. For reference, a Hacker News front page gets you 17000 page loads.
In the end after a long chat with a former colleague I excluded my word trend data using robots.txt. Immediately the bot traffic stopped. Yes I take the hit of not having the data indexed, but I still have hundreds of pages of timeline data that's search visible.
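The shape of that fix can be checked offline with Python's standard `urllib.robotparser`. The paths `/trends/` and `/timeline/` and the `example.com` host are placeholders invented for this sketch; the site's real URL layout and robots.txt are not given in the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of the high-volume section only.
EXCLUDE_TRENDS_ROBOTS = """\
User-agent: *
Disallow: /trends/
"""


def may_crawl(robots_text, agent, url):
    """True if `agent` is permitted to fetch `url` under `robots_text`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


trend_ok = may_crawl(EXCLUDE_TRENDS_ROBOTS, "Googlebot",
                     "https://example.com/trends/email")
timeline_ok = may_crawl(EXCLUDE_TRENDS_ROBOTS, "Googlebot",
                        "https://example.com/timeline/2015")
```

With this file the trend URL is refused and the timeline URL is allowed, which matches the trade-off described above: the bulk data drops out of the index while the smaller set of pages stays visible.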
Hi, you mention you're using a Raspberry Pi for number crunching. Mind if I ask why you went with Pis, and not say a hosted server? How are the pair of Pis working out?
I went with the Pi originally because I had one to hand, and when it was just a side project I wanted a machine to take the task away from my laptop. You know how it is, you buy an SBC because it's a cool toy, then it sits looking reproachfully at you because you haven't done anything with it.
Originally it was running on an early Chinese 256MB Pi, now it runs on a model B+. In time it'll probably receive a Pi 2, but there's no special need for that.
Why did I keep using the Pi? It makes sense as an extremely cheap and reliable Linux box that doesn't use much power. Anything else would cost me more money for not much benefit.