Sunday, March 11, 2018

Reddit bots

Just for kicks, I've been creating a bot that will suggest spelling corrections when the wrong homophone is used, starting with "then" and "than". Mainly because I want the experience of creating a bot, and secondarily because these kinds of errors drive me nuts.

The high-level description of the bot goes like this:
  1. Train the Stanford Parser on news text.
  2. Periodically scrape some sample of comments from Reddit (I don't want or need to cover ALL of Reddit).
  3. Save only comments that contain "then" or "than".
  4. For "then" sentences, generate an identical "than" sentence, and vice versa.
  5. Parse both versions with the Stanford Parser using settings that provide a parser confidence score with each parsed sentence.
  6. If the parser is significantly more confident about the changed version, post a reply comment suggesting the change.
I'll need to establish some kind of threshold for the difference in confidence scores, and probably a minimum overall confidence score (for cases where neither word is correct; e.g., when "then" should really be "them"). I only want to post replies in cases where I'm highly confident that a change should be made; i.e., I'm concerned about precision, but not about recall.
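
Here's a minimal sketch of steps 4-6 in Python, where parse_confidence() is just a stand-in for a call to the Stanford Parser and the threshold values are placeholders I haven't settled on yet:

    import re

    SWAP = {"then": "than", "than": "then"}

    MIN_CONFIDENCE = 0.9   # placeholder: overall confidence floor
    MIN_MARGIN = 0.2       # placeholder: required gap between the two parses

    def swap_homophone(sentence, word):
        """Replace "then" with "than" (or vice versa); case handling is ignored here."""
        pattern = re.compile(r"\b" + word + r"\b", re.IGNORECASE)
        return pattern.sub(SWAP[word], sentence)

    def should_suggest(sentence, word, parse_confidence):
        """Return the corrected sentence if the parser strongly prefers it, else None."""
        original_score = parse_confidence(sentence)
        swapped_sentence = swap_homophone(sentence, word)
        swapped_score = parse_confidence(swapped_sentence)
        if swapped_score >= MIN_CONFIDENCE and swapped_score - original_score >= MIN_MARGIN:
            return swapped_sentence
        return None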

If this works well, I'd also like to periodically retrain the parser model by adding previously seen "then" and "than" sentences to the training set, assuming they were parsed with a high level of confidence. This gives the parser a wider range of contexts for these words, and should improve its performance on this task. This is similar in concept to some domain adaptation work I've done in the past: Baucom, King & Kuebler (2013).

It's here on GitHub, but there's not a working copy there yet. Stay tuned!

Monday, February 26, 2018

Flood, farm, etc.

I'm back in Bloomington now, but I spent the last few days back home in Jefferson County, catching up on some work there. With planting season around the corner, my brothers and I spent some time planning and doing some maintenance. Large parts of the Midwest are flooded right now due to melting snow and lots of heavy rain. In the Ohio River Valley, where we live, the river is expected to crest today and begin receding. We were spared the worst, but I'll definitely be back soon to fill in some badly eroded spots around the fields and driveways once the flooding recedes enough to allow it. Let's hope the farms, homes, and businesses that really were affected can recover quickly.

In the meantime, we did some minor repairs on the buildings, fences and equipment. I'll be moving to NYC within the next few months, so I've been trying to do what I can before I go. My younger brother recently moved back to the area, which means he can help out and I don't need to be around as much. In fact, this is the first year in a long time that the three of us won't be actively involved in the farming operation in some way. We're out of the livestock business as well. Things have settled down enough now that we've been able to scale back, work out leasing arrangements, and focus on our own careers. I'm grateful for the help and guidance we've had since being thrust back into this, and I've learned some valuable lessons, but I'm also excited to finally have the time I need to polish off my PhD and put it to work professionally.

It really comes down to the fact that the economics of farming these days only work on large scales; if you're going to own and maintain all of the expensive equipment and buy the chemicals, seeds, fuel, insurance, etc. needed to farm effectively, you really need to be running a large operation with several employees and several hundred acres at a minimum. That's why crop farmers today are either huge operations that own and lease thousands of acres, or very small operations where someone is farming his family's 20 acres with a tractor from the 1970s and working a full-time job somewhere else. Add to that the fact that today's land prices make it very difficult for small operations to expand or for younger people to start out farming. There's just not a lot of room for the little guys.


Saturday, February 24, 2018

SAILS Corpus hype

For the better part of the past year, I've been working on the dataset at the core of my dissertation. I'll be releasing it sometime in March. In my field, as in any kind of science, it's important to share your research data for a number of reasons. First, it makes your work more transparent -- anyone should be able to check your calculations for accuracy (and honesty), and ideally, anyone should be able to repeat your work and reach similar outcomes and conclusions. And second, it simply allows other researchers to use your data for new and interesting kinds of research and development.

I plan to call this the SAILS Corpus, for "Semantic Analysis of Image-based Learner Sentences". A corpus is just a collection of text data used for some kind of research or for building statistical models. The corpus consists of responses to a picture description task (PDT), which is just what it sounds like -- a person is shown an image and asked to describe it or respond to some question about it. The PDT here consists of 30 simple, cartoon-like images, each depicting some common activity. Half of the participants were asked, "What is happening?" and the other half were asked, "What is x doing?", where x is replaced with the subject of the image, e.g., "the girl" or "the bird". This was set up as an online survey, and participants were instructed to provide a single sentence for each picture.

Example PDT item: "What is the boy doing?"
Example response: He's carrying a big bag of fruits.


I collected about 13,000 responses total. These come from different populations -- about 70 participants were English as a Second Language students at Indiana University who completed the task under my supervision. About 30 were native English speakers I know personally. Another 220 or so native English speakers were crowdsourced through a survey site; the survey creator pays the site, and the participants receive rewards and gift cards for completing surveys. The reason I use these different groups is that the next step of this work involves using the native speakers' responses to automatically rate the non-native speakers' responses with various natural language processing techniques. I'll use the results of these techniques to see how well this kind of non-native speaker content assessment can be automated simply by crowdsourcing the task to native speakers and using their responses as a "gold standard" or a kind of answer key.

The real bulk of the work with this corpus has been annotating it. In this context, annotation simply means adding some kind of score to each response. If you want to develop an assessment tool that, for example, reads an essay and assigns it a grade of A through F, at minimum you need a sample of essays that have been manually annotated A through F by a competent human. You then give the same essays to your assessment tool (with the human annotations removed) and obtain its automatic grades. Finally, you judge the quality of your assessment tool by how well its annotations match the human ("gold standard") annotations.
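
If the gold grades and the automatic grades are stored as parallel lists, that comparison can be as simple as an accuracy calculation. A toy sketch with scikit-learn (the grades here are made up):

    from sklearn.metrics import accuracy_score

    gold = ["A", "C", "B", "F", "A"]   # human ("gold standard") grades
    auto = ["A", "C", "C", "F", "A"]   # the tool's grades for the same essays

    print(accuracy_score(gold, auto))  # fraction of matching grades: 0.8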

I started out with something like a three point scale in mind:

  • 2 points = accurate and native-like response
  • 1 point = accurate but not native-like
  • 0 points = not accurate


When I tried to implement this, however, I found that it quickly broke down. There are simply too many characteristics to consider for each response, and a great many responses don't fit neatly into this scale. What does "accurate" mean? How "accurate" is accurate? What does "native-like" mean? So the scale had to be rethought, and through a long iterative process, I arrived at five binary features. This means each response is given five different scores, where each score is 1 ("yes") or 0 ("no"). These five features are:
  • Core event: Does the response capture the core event ("eating pizza", "buying a car", etc.) depicted in the image?
  • Answerhood: (Yeah, I guess I made that word up!) Does the response attempt to answer the question?
  • Grammaticality: Does the response use good grammar and spelling?
  • Interpretability: Does the response provide a clear mental image for the reader?
  • Verifiability: Does the response contain only information that is verifiable from the image?
As simple as this sounds, it took a long time and a lot of interesting conversations with my advisor and other linguists and language teachers to arrive at, and it ultimately resulted in a 40-page manual of annotation guidelines with lots of rules and examples. The problem is that just when you have rules that cover all the weird sentences you think you will encounter, you find an even weirder one and have to make significant changes to the feature definitions and the annotation guidelines.
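
Just to make that concrete, a single annotated response might look something like this (the field names are illustrative, not the actual column names in the corpus):

    response = {
        "item": 7,                                 # which of the 30 images
        "text": "He's carrying a big bag of fruits.",
        "core_event": 1,
        "answerhood": 1,
        "grammaticality": 1,
        "interpretability": 1,
        "verifiability": 1,
    }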

Because I'm focused on automatically assessing content rather than form (grammar, spelling, etc.), my scoring tool will rely mostly on the Core Event and Answerhood features. Each of the features will provide some interesting insights into patterns of language variation between and within the native and non-native speakers.

The annotation itself is also quite a lot of work: 13,000 responses * 5 features = 65,000 annotation decisions. And it's pretty mind-numbing work after a while.

A second annotator annotated about 5% of the data while the annotation rules were being developed, so that we could compare our annotations and discuss and modify the rules as necessary. When the annotation guidelines were complete, I personally annotated all of the data. The second annotator then completed another 5% sample using only the guidelines -- no consultation with me. By comparing both annotators' annotations for this sample, I can report agreement scores, which give an indication of how reliable the annotations and the guidelines are. If humans can't agree, there's probably no way for an automated system to agree with a human. So far, the agreement scores look good.
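
One common agreement measure for binary annotations like these is Cohen's kappa, which corrects for chance agreement. A minimal sketch with scikit-learn, using made-up labels for a single feature:

    from sklearn.metrics import cohen_kappa_score

    annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
    annotator_2 = [1, 1, 0, 1, 1, 1, 1, 0]

    print(cohen_kappa_score(annotator_1, annotator_2))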

I'm currently writing all of this up as a chapter for my dissertation. When that's ready, I'll post the chapter, the annotation guidelines, and the corpus to my GitHub page. I'll most likely edit the chapter down to a paper and submit it to a conference as well.

Monday, October 09, 2017

Blade Runner 2049: My take

I'm back! A friend has been asking for my impressions of the movie, so I decided to write up my thoughts. I will avoid any major spoilers here, but I will discuss some of the context.

This weekend I went to see Blade Runner 2049.

I want to preface my comments by saying that I love the original (preferably the Final Cut). It's my favorite movie, hands down. It's more than just a great movie -- I believe that it is one of humanity's crowning achievements. It's visually gorgeous and has a compelling narrative, but it's more than the sum of its parts. Blade Runner is an exploration of what it means to be human, particularly in a world of advanced machines, and whether that even matters. And it's the best at what it does, in any medium. Had Voyager left Earth in 1982 and not 1977, a reel of Blade Runner would have been an ideal addition to the Beethoven and Chuck Berry it carried out beyond our solar system.

I deliberately avoided most hype regarding Blade Runner 2049, because I wasn't really keen on the idea of a sequel. I wasn't sure if I'd even see it. But last week I saw that it was getting very high ratings on Rotten Tomatoes and Metacritic, so I got a little excited and decided to go.

It's a pretty good sequel to a movie that never should have had one.

2049 is visually stunning. The exteriors return us to the neon noir of Los Angeles, trading the rain for snow this time. 2049 is the future, but it's not our future; it's the future extrapolated from Ridley Scott's Blade Runner. Long-dead brands like Atari and Pan Am appear prominently, despite the "Blade Runner curse" of our world. The interiors are equally marvelous, often contrasting modern, minimalist luxury with the grimy chaos of the megacity outside. In general, the palette is a perfect match with the original -- this is a cold world, devoid of natural brilliance and punctuated only by the gaudy colors of intrusive advertising. The costumes are great and the art direction is perfect. I could watch this movie again for the picture alone.

The music is a fair match for Vangelis' iconic synths of the original, albeit less memorable and probably overpowering at times.

So suffice it to say that production-wise, this is a stellar complement to the original Blade Runner.

I wasn't as thrilled with the story. It wasn't bad. In fact, it was probably pretty good, but it didn't hit enough of the original notes for me. It's a good sci-fi action movie, but it isn't as thoughtful or layered as the original. It also had quite a few holes.

The story takes place about 30 years after the original. A new blade runner, "Joe", played by Ryan Gosling, now works for the LAPD "retiring" fugitive replicants. He stumbles onto evidence of a "miracle" -- a live replicant birth. (I guess "life, uh, finds a way"?) The possibility of replicants reproducing has dangerous implications, and Joe is tasked with finding the replicant offspring. The ability to breed replicants has eluded the corporation that produces them, and Joe has competition from the bounty hunters sent by this corporation. Joe's search eventually leads him to Deckard, who has long been in hiding but may have valuable information. I can't say much more without spoiling things.

I appreciated a couple of nods to the Philip K. Dick story that were absent from the original. We see Deckard living alone in an abandoned building, much like he did in the novel. We first see his dog standing in the shadows, looking very much like a black sheep, which was Deckard's pet in Dick's story.

But there are holes. And I've already watched three YouTube videos discussing many of them, so I know I'm not alone here. Perhaps the biggest one that remains for me is this: Why was Deckard taken, but Joe was left behind (alive, and ready to cause problems)?

I'm glad I watched the movie, and I may well watch it again, but I can't take it as canon. Blade Runner is a singular work of art.

Sunday, March 22, 2015

Lists and arrays

I found this post on lists and arrays in python.

http://www.wired.com/2011/08/python-notes-lists-vs-arrays/

At this point, I've only had two programming courses in my life, one of which consisted of about seven weeks of installing software packages and one week of actual instruction in java. Of course I've had several computational linguistics / NLP courses, and these inevitably involve a good amount of programming, but as far as formal instruction in the theory and application of programming or computer science fundamentals goes, my experience is very limited. This has left me with some pretty glaring holes in my knowledge. Over the last five years, largely through projects, I've managed to patch a few of these, but I always feel like I'm playing catch-up. I'm a linguist who has picked up some programming, but I often feel like I'd be better off if I were a programmer who picked up some linguistics. I keep telling myself that when I get the time, I'd like to go back to the beginning and try to systematically learn a lot of the fundamentals so I can be on par with anyone with a bachelor's degree in CS, but I haven't been proactive enough in making such time.

Recently, I've been preparing for some technical interviews for summer internships. I've come to realize that python hides a lot of the basics from the user. For example, in some programming languages, a programmer may need to choose between lists, arrays, stacks, queues or linked lists as a data type for a given task, but in python these all basically conflate to simple lists. At least that's my primitive understanding. On one hand, this simplifies things and makes python a nice language for the beginner, but on the other hand, it removes some functionality and obscures a lot of the CS details from the programmer.
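
For the record, several of these structures do exist in Python's standard library; a quick sketch:

    from array import array
    from collections import deque

    nums = [1, 2, 3]              # list: resizable, can mix types
    nums.append("four")

    ints = array("i", [1, 2, 3])  # array: typed, compact storage (integers only here)
    ints.append(4)

    stack = deque()               # deque works as a stack or a queue
    stack.append("a")
    stack.append("b")
    print(stack.pop())            # LIFO -> "b"

    queue = deque(["a", "b"])
    print(queue.popleft())        # FIFO -> "a"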

If I'm not out of town this summer, I'm looking to find a class or two to audit, or maybe a MOOC to help fill in some of my knowledge gaps. I'll try to post more helpful articles like this one as I encounter them (if only as a reminder to myself). If anyone has found him- or herself in the same boat before, I'd love to hear suggestions for resources or strategies to overcome this hurdle.

Wednesday, November 12, 2014

Arrival in Uppsala

I'm in Uppsala, Sweden!

I'm exhausted, it's late, and I have a presentation in the morning, so I'll keep this short.

Quick observations:

This place is gorgeous! Took a long walk around the city tonight and it's gotta be a top 5 walk. Very walkable, livable city. So many bicycles! The people seem lovely. They're certainly fashionable. Public art everywhere, which is awesome.


On the other hand--so much tobacco! I guess that's just Europe, though. Actually, pretty much anyplace but North America. Maybe the bicycling cancels it out?

I'm sometimes nervous before traveling alone, but that has passed and this trip is going well. I'm looking forward to a great workshop tomorrow.

Saturday, November 01, 2014

Notes on an ongoing academic project ("SSAILL")

What follows is a summary and discussion of a project I've been working on for a year or two under the direction of my advisor (Markus). It was written as a sort of note-to-self/internal memo for us, so it might not make perfect sense to others. We first published a paper about this work last summer. The paper was called "Shallow Semantic Analysis of Interactive Learner Language", so I've decided to start using the acronym "SSAILL" in discussing this ongoing project. Not because I need to have a fancypants name for my work, but because like any grad student, I have three or four projects in some stage of work at any given time. Most of mine are easy to refer to by some name or phrase because they tackle one specific task or were done for a specific shared task, e.g., language ID or SemEval. SSAILL is different, however, because we're approaching a more loosely defined task and throwing a few different tools at it. So, it just kinda needs an easy name.

By the way, I can happily report that our second paper from this project, "Leveraging Known Semantics for Spelling Correction", was accepted at the NLP4CALL workshop, so I'll be presenting that in Uppsala, Sweden in just about 12 days. So that's pretty exciting. 

So anywho, here's the writeup I did a few days ago.

Brief Recap

We have collected one-sentence descriptions for 10 items of a PDT (picture description task) depicting transitive events. The responses come from 14 native speakers (NS) and 39 non-native speakers (NNS) of English. In our first experiment, we dependency parsed and lemmatized the responses, then used our own custom, rule-based script to extract a semantic triple (verb(subject,object)) from each. We evaluated our system's ability to perform this extraction, and we evaluated the set of NS triples' suitability as a gold standard (for evaluating triples extracted from NNS responses). We found that our process of extracting triples (given the constraints on response form imposed by the PDT) achieved 92.3% and 92.9% accuracy for the NNS and NS responses, respectively. Reliance on the small set of NS triples as a gold standard proved woefully inadequate, however, with roughly half of correct NNS responses not covered by the gold standard.
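
Purely for illustration (this is not our actual pipeline), here is roughly what that kind of triple extraction looks like with an off-the-shelf dependency parser such as spaCy:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_triple(sentence):
        """Return a (verb, subject, object) triple from a simple transitive sentence."""
        doc = nlp(sentence)
        for tok in doc:
            if tok.pos_ == "VERB":
                subj = next((c.lemma_ for c in tok.children if c.dep_ == "nsubj"), None)
                obj = next((c.lemma_ for c in tok.children if c.dep_ == "dobj"), None)
                if subj or obj:
                    return (tok.lemma_, subj, obj)
        return None

    print(extract_triple("The woman is washing a shirt."))  # expected: ('wash', 'woman', 'shirt')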

Next, we extended this work to include preprocessing of the NNS responses with spelling correction methods to improve the extraction of semantic triples and the coverage afforded by our gold standard. As a preliminary experiment, we tried using a spelling correction tool (Aspell) to "correct" any unrecognized words in the NNS responses. In cases where a spelling suggestion matched a word found in the NS responses, the matched word was chosen as the "correct" word; if no match was found, the top suggestion was chosen. This resulted in a net loss in performance, because many misspellings were themselves valid words (e.g., "shout" instead of "shoot") and were therefore recognized by Aspell rather than corrected, and in other cases, with no context other than the NS word list, the chosen suggestion often was not the intended word.

Most recently, we attempted a more sophisticated approach to spelling correction. This time, we used the Python Enchant module, which uses Aspell as a backend but extends its functionality, allowing us to request spelling suggestions for all words, including those that appear to be correctly spelled. Additionally, we made use of a trigram language model trained on 250 million words of newspaper text. Again, we attempted to match words (and their spelling suggestions) with words from the NS word list. Where no match was found, we iterated through the lists of spelling suggestions to form all the possible combinations (sentences). These sentences were evaluated by the language model, and the most likely candidate sentence (along with the original) was retained and passed through the rest of the pipeline (parsing, lemmatization, triple extraction, gold standard evaluation).
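
A rough sketch of that candidate-generation step, where lm_logprob() stands in for the trigram language model and ns_words is assumed to be a set of lowercased words from the NS responses:

    from itertools import product
    import enchant

    d = enchant.Dict("en_US")

    def candidates(token, ns_words, n=3):
        """Prefer a suggestion found in the NS word list; otherwise keep a few options."""
        suggestions = [token] + d.suggest(token)
        matched = [s for s in suggestions if s.lower() in ns_words]
        return matched[:1] if matched else suggestions[:n]

    def best_sentence(tokens, ns_words, lm_logprob):
        """Form all combinations of candidate words and keep the most likely sentence."""
        options = [candidates(t, ns_words) for t in tokens]
        sentences = [" ".join(combo) for combo in product(*options)]
        return max(sentences, key=lm_logprob)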

[I should mention that the numbers below are no longer valid. It's complicated, but we had been classifying false negatives as "gold errors"--these are not really errors; they are simply good answers that aren't covered by our limited gold standard. So, we reconsidered and reclassified the false negatives, and we think this is both more accurate and more fair. As a result, the trend reflected below is the same, but the numbers reflect an even stronger effect, because the error counts are no longer exaggerated by the false negatives. I would simply rewrite the paragraph, but I don't have time right now.]

We found that we reduced errors (which included valid but non-covered responses in this experiment) by 7% overall. However, this was again dependent on our very limited gold standard. We showed that this reduction in error would be roughly 11.6% given a better gold standard, because through preprocessing, we shifted an additional 4.6% of errors from the "form error" category (primarily spelling and grammar errors) to the "gold error" category ("good" triples that simply are not covered by the current gold standard; i.e., these are not really errors at all). However, given that native speaker responses (in the form of word lists) are used to bias the preprocessing, a richer gold standard would likely have an even stronger benefit. As it stands, without preprocessing, roughly 32% of all responses are gold errors; with preprocessing, roughly 36% of all responses are gold errors.

Future Directions

Given the discussion above, I believe the most immediate means of increasing performance on this task is simply by improving the gold standard. Initially, this project was conceived of as a low-effort process for automatically evaluating learner responses: the gold standard---and in turn, the task of evaluating NNS responses---was effectively crowdsourced, minimizing the amount of labor and expertise needed. The "pipe dream" of this line of work has been the development of a language learning game or intelligent language tutoring system (ILT) that is modular, where new learning modules (stories) could be plugged in with little effort and the gold standard could be obtained by having NSs play the game. The notion of automatically deriving evaluation of NNS responses simply by having NSs and NNSs perform the exact same task has been a major (theoretical) advantage to this approach. At this juncture, however, this insistence on simplicity and minimizing researcher effort needs to be reconsidered. If this constraint is dropped, we would be free to simply brainstorm an unlimited number of reasonable responses to the PDT items. With regard to researcher effort, this would be a reasonable task; a single PDT item in our task could likely be described extensively in sentence form by a single researcher in less than 30 minutes. Alternatively, the task and instructions for NSs could be revised in a way that encourages variety. For example, we could ask NSs to describe the image with three unique sentences, or using five unique verbs, etc. We could also make the task "adaptive" by asking NSs to avoid using content words found among the responses of previous NSs. Another approach could involve attempting to automatically discover new gold triples by combining elements from known triples and asking NSs to approve or disapprove; e.g., from NS triples "do(woman,laundry)" and "wash(woman,shirt)", we could derive "do(woman,shirt)" and "wash(woman,laundry)", but NS knowledge would be useful in disapproving the former and approving the latter. (As Markus suggests, however, simple methods exist to automatically perform this filtering.)
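
A quick sketch of the recombination idea (the derived candidates would still need filtering, by NSs or automatically):

    from itertools import product

    gold = {("do", "woman", "laundry"), ("wash", "woman", "shirt")}

    verbs = {v for v, s, o in gold}
    subjects = {s for v, s, o in gold}
    objects = {o for v, s, o in gold}

    candidates = set(product(verbs, subjects, objects)) - gold
    # e.g., ('wash', 'woman', 'laundry') is plausible; ('do', 'woman', 'shirt') is not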

The topic of expanding the gold standard relates to another open question in this work: How do we produce feedback? Given that this project is conceived of as the backbone of a game or tutoring system, providing useful feedback to the language learner should be one of its functions. Another philosophical underpinning of this project comes into play here; our priority is maximizing learner use of the L2 and minimizing grammar and spelling feedback, as SLA research has established these to be ineffective. The entire pipeline here is an attempt to overlook minor errors in spelling and grammar, instead abstracting away from the immediate form of the NNS response to the intended meaning (much the way an NS may do in communication with an NNS). To this end, we believe that most (or all) feedback should focus on resolving situations in which a response lacks the target meaning. If we assume a "Choose Your Own Adventure" style game built on this system, feedback could come from a sidekick, "guide", or other interlocutor. At this stage, I envision feedback following a very formulaic pattern. An NNS who provides a response for which no match is found could simply be asked to restate their response. If a triple is a partial match, a matched part of the triple could be used to elicit another response. Consider the following toy example: an item showing a man reading a newspaper; a gold standard of a single triple, "read(man,newspaper)"; and an NNS response and triple of "a man is reading" and "read(man,NONE)". Upon matching "man" and "read", we could nudge the NNS toward a better response with questions like "What did the *man* *read*?", "What did the *man* do?" or "Who *read* what?"
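
A toy sketch of that kind of formulaic feedback, using the example above (with NONE represented as None):

    def feedback(gold, response):
        g_verb, g_subj, g_obj = gold
        r_verb, r_subj, r_obj = response
        if (r_verb, r_subj, r_obj) == (g_verb, g_subj, g_obj):
            return None                          # full match: move the story forward
        if (r_verb, r_subj) == (g_verb, g_subj) and r_obj is None:
            return f"What did the {g_subj} {g_verb}?"
        if r_subj == g_subj:
            return f"What did the {g_subj} do?"
        return "Could you say that another way?"

    print(feedback(("read", "man", "newspaper"), ("read", "man", None)))
    # -> "What did the man read?"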

This relates to another open question here: How should we handle partial responses? We noted in previous work that as the basic unit of analysis here is the semantic triple (verb(subject,object)), a number of "partially correct" responses are counted as errors. Whether or not we want to award partial credit for partially matched responses relates heavily to the particular game or ILT in which we would like to implement our system. If we are focused only on moving the action of the game forward, the most likely use of partial matches would be in generating feedback intended to elicit a better response, as discussed above. In a testing scenario, simply assigning a non-binary score to partial matches may be useful in its own right.

As we continue to investigate this line of work, we should also consider using more linguistic processing. One simple way of doing this, as mentioned above, would be to automatically derive unseen triples by breaking triples into their subject, verb, and object, and recombining those elements. We will also consider using some kind of lexical ontology like WordNet. WordNet stores lexical entries in a hierarchy from most general to most specific (entity > animal > mammal > dog > cocker spaniel), which could allow us to automatically discover a hypernym/hyponym relationship between, e.g., a known subject and an unknown subject. This knowledge could allow us to assign full or partial credit to the triple with the unseen subject, and possibly even modify the gold standard automatically. Additionally, the use of a semantic role labeler could improve performance, most likely in the triple extraction step of the process.
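
A sketch of that kind of hypernym check with NLTK's WordNet interface (the synset choices are simplified, and the WordNet data must be installed via nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    known = wn.synset("animal.n.01")    # e.g., a subject already in the gold standard
    unseen = wn.synset("dog.n.01")      # e.g., a subject from a new NNS response

    hypernyms = set(unseen.closure(lambda s: s.hypernyms()))
    print(known in hypernyms)           # True: a dog is a kind of animal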

The language model has been shown to introduce errors due to its bias toward newspaper text. Improving the LM is another obvious modification to investigate. The challenge here is finding a suitable training text. The text must be sufficiently long to give robust coverage of the language, but it should also be more suitable for a domain primarily containing descriptions of transitive, physical actions. We may need to think creatively to truly gain some benefit here; perhaps a smaller, domain-specific model could be used in conjunction with a larger newspaper model, with the results from the two models weighted in some way.
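
One simple weighting scheme would be linear interpolation of the two models' probabilities; a sketch, where p_domain and p_news stand in for the two language models:

    def interpolated_prob(word, history, p_domain, p_news, lam=0.3):
        """P(word | history) = lam * P_domain + (1 - lam) * P_news."""
        return lam * p_domain(word, history) + (1 - lam) * p_news(word, history)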

Ultimately, we will likely need to collect more data. This time, we will want to have a better vision of the kind of game or tutoring or testing application motivating this work. This would almost certainly involve images that are linked by some narrative, unlike the unrelated images used in the pilot task. As a result, this will likely mean that some items would show a sequence of actions for users to describe. An experimental item like this was given on the previous task, and a closer examination of those results may influence our decision of how to organize this task for users. For example, we believe it may be better (or at least easier to process) if participants are presented with a single image from a sequence and asked to describe it before the next image is shown.