a 'foreign spelling' test for GloWBE corpus
In blogging, I rely a lot on the Global Web-Based English corpus, GloWBE, which has millions of words of internet data categori{z/s}ed by the country of the website. It's divided into excerpts from 'blogs' (which includes comments on blogs) and 'general', which includes all sorts of things, even some blogs. It's an invaluable tool for judging whether a word or phrasing is used in a particular place. But national borders are very weak on the internet, and commenters comment on all kinds of things from all kinds of places. And there are even people like me who are blogging in the 'wrong' country for their dialect (and I have run into some of my own writing in the corpus!). So, how can we know how much of the data that's in the 'US' category is actually by Americans and so forth? This is a problem that has struck me as I've tried to use GloWBE data to research politeness markers. (I'm giving a paper on 'please' in GloWBE next week.) So,...