Sunday, March 11, 2018

Reddit bots

Just for kicks, I've been creating a bot that will suggest spelling corrections when the wrong homophone is used, starting with "then" and "than". Mainly because I want the experience of creating a bot, and secondarily because these kinds of errors drive me nuts.

The high-level description of the bot is like this:
  1. Train the Stanford Parser on news text.
  2. Periodically scrape some sample of comments from Reddit (I don't want or need to cover ALL of Reddit).
  3. Save only comments that contain "then" or "than".
  4. For "then" sentences, generate an identical "than" sentence, and vice versa.
  5. Parse both versions with the Stanford Parser using settings that provide a parser confidence score with each parsed sentence.
  6. If the parser is significantly more confident about the changed version, post a reply comment suggesting the change.
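
Steps 3 and 4 can be sketched in a few lines of Python. This is only a rough illustration, not the bot's actual code: the names `HOMOPHONE`, `swap_word` and `alternatives` are made up for this example, and I'm assuming that a sentence with several occurrences gets one variant per occurrence, with a single word swapped in each.

```python
import re

# Matches "then" or "than" as whole words, ignoring case.
HOMOPHONE = re.compile(r"\b(then|than)\b", re.IGNORECASE)


def swap_word(word):
    """Return the other homophone, keeping the original capitalization."""
    other = "than" if word.lower() == "then" else "then"
    if word.isupper():
        return other.upper()
    if word[0].isupper():
        return other.capitalize()
    return other


def alternatives(sentence):
    """Return one variant per occurrence, each with a single word swapped.

    An empty list means the sentence contains neither word and can be skipped.
    """
    variants = []
    for match in HOMOPHONE.finditer(sentence):
        swapped = swap_word(match.group(0))
        variants.append(sentence[:match.start()] + swapped + sentence[match.end():])
    return variants
```

Each variant would then be parsed alongside the original sentence in step 5.
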
I'll need to establish thresholds for the difference in confidence scores, and probably a minimum confidence score overall (for cases where neither word is correct; e.g., when "then" should really be "them"). I only want to post replies in cases where I'm highly confident that a change should be made; i.e., I'm concerned about precision, but not about recall.
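
The decision rule could look something like the sketch below. The function name `should_suggest` and both threshold values are placeholders I made up for illustration; the real numbers would have to be tuned on actual data. I'm also assuming that a higher score means a more confident parse (as with log probabilities, which are negative and closer to zero for better parses).

```python
def should_suggest(original_score, swapped_score, min_margin=5.0, min_score=-150.0):
    """Decide whether to reply with a suggested correction.

    Assumes higher scores mean a more confident parse. Both thresholds
    are placeholder values, not tuned numbers.
    """
    # If even the swapped version parses poorly, neither word may be right
    # (e.g. "then" should have been "them"), so stay quiet.
    if swapped_score < min_score:
        return False
    # Only speak up when the swapped version wins by a clear margin,
    # which favors precision over recall.
    return swapped_score - original_score >= min_margin
```

For example, with these placeholder thresholds, `should_suggest(-120.0, -110.0)` would trigger a reply, while `should_suggest(-120.0, -118.0)` (margin too small) and `should_suggest(-200.0, -180.0)` (both parses too weak) would not.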

If this works well, I'd also like to periodically retrain the parser model by adding previously seen "then" and "than" sentences to the training set, assuming they were parsed with a high level of confidence. This gives the parser a wider range of contexts for these words, and should improve its performance on this task. This is similar in concept to some domain adaptation work I've done in the past: Baucom, King & Kuebler (2013).

It's here on GitHub, but there's not a working copy there yet. Stay tuned!