Wednesday, November 12, 2014

Arrival in Uppsala

I'm in Uppsala, Sweden!

I'm exhausted, it's late, and I have a presentation in the morning, so I'll keep this short.

Quick observations:

This place is gorgeous! Took a long walk around the city tonight and it's gotta be a top 5 walk. Very walkable, livable city. So many bicycles! The people seem lovely. They're certainly fashionable. Public art everywhere, which is awesome.


On the other hand--so much tobacco! I guess that's just Europe, though. Actually, pretty much anyplace but North America. Maybe the bicycling cancels it out?

I'm sometimes nervous before traveling alone, but that has passed and this trip is going well. I'm looking forward to a great workshop tomorrow.

Saturday, November 01, 2014

Notes on an ongoing academic project ("SSAILL")

What follows is a summary and discussion of a project I've been working on for a year or two under the direction of my advisor (Markus). It was written as a sort of note-to-self/internal memo for us, so it might not make perfect sense to others. We first published a paper about this work last summer. The paper was called "Shallow Semantic Analysis of Interactive Learner Language", so I've decided to start using the acronym "SSAILL" when discussing this ongoing project. Not because I need to have a fancypants name for my work, but because, like any grad student, I have three or four projects in some stage of work at any given time. Most of mine are easy to refer to by some name or phrase because they tackle one specific task or were built for a specific shared task, e.g., language ID or SemEval. SSAILL is different, however, because we're approaching a more loosely defined task and throwing a few different tools at it. So, it just kinda needs an easy name.

By the way, I can happily report that our second paper from this project, "Leveraging Known Semantics for Spelling Correction", was accepted at the NLP4CALL workshop, so I'll be presenting that in Uppsala, Sweden in just about 12 days. So that's pretty exciting. 

So anywho, here's the writeup I did a few days ago.

Brief Recap

We have collected one-sentence descriptions of 10 items of a PDT task depicting transitive events. This includes 14 native speakers (NS) and 39 non-native speakers (NNS) of English. In our first experiment, we dependency parsed and lemmatized responses, then used our own custom, rule-based script to extract a semantic triple (verb(subject,object)) from each. We evaluated our system's ability to perform this extraction, and we evaluated the set of NS triples' suitability as a gold standard (for evaluating triples extracted from NNS responses). We found that our process of extracting triples (given the constraints on response form imposed by the PDT) achieved 92.3% and 92.9% accuracy for the NNS and NS responses, respectively. Reliance on the small set of NS triples as a gold standard proved woefully inadequate, however, with roughly half of correct NNS responses not covered by the gold standard.
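To make the extraction step concrete, here is a rough sketch of the rule-based logic, written against a spaCy parse purely for illustration (our actual pipeline uses a different parser and our own scripts; the names below are illustrative, not our real code):

    # Minimal sketch of rule-based triple extraction over a dependency parse.
    # Assumes spaCy for illustration; our real pipeline differs.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_triple(sentence):
        """Return a verb(subject,object) triple of lemmas; missing slots are None."""
        doc = nlp(sentence)
        root = next((tok for tok in doc if tok.dep_ == "ROOT"), None)  # main verb
        if root is None:
            return None
        subj = next((c.lemma_ for c in root.children
                     if c.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((c.lemma_ for c in root.children
                    if c.dep_ in ("dobj", "obj")), None)
        return (root.lemma_, subj, obj)

    print(extract_triple("The woman is washing a shirt"))
    # -> ('wash', 'woman', 'shirt')

The real script is more involved, but this captures the basic move: pull the lemmas of the root verb and its subject and object dependents.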

Next, we extended this work to include preprocessing of the NNS responses with spelling correction methods to improve the extraction of semantic triples and the coverage afforded by our gold standard. As a preliminary experiment, we used a spelling correction tool (Aspell) to "correct" any unrecognized words in the NNS responses. In cases where a spelling suggestion matched a word found in the NS responses, the matched word was chosen as the "correct" word; if no match was found, the top suggestion was chosen. This resulted in a net loss in performance: many misspellings were themselves valid words that Aspell recognized (e.g., "shout" instead of "shoot") and so were never corrected, and in other cases, with no context other than the NS word list, the top suggestion was chosen even though it often was not the intended word.
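The selection logic amounts to something like the following sketch (pyenchant's dictionary interface stands in for our Aspell setup here; the names are illustrative):

    # Rough sketch of the preliminary correction step; enchant.Dict is a stand-in
    # for our Aspell setup, and the names here are illustrative.
    import enchant

    d = enchant.Dict("en_US")

    def correct_word(word, ns_wordlist):
        """Replace an unrecognized word with an NS-attested suggestion if possible,
        otherwise with the top suggestion."""
        if d.check(word):
            return word                      # recognized words are left untouched
        suggestions = d.suggest(word)
        for s in suggestions:
            if s.lower() in ns_wordlist:     # prefer words seen in NS responses
                return s
        return suggestions[0] if suggestions else word

    ns_words = {"man", "gun", "shoot", "target"}
    print(correct_word("shooot", ns_words))  # -> "shoot"

The failure mode described above falls straight out of the first branch: a real-word misspelling like "shout" passes the check and is never touched.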

Most recently, we attempted a more sophisticated approach to spelling correction. This time, we used the Python Enchant module, which uses Aspell as a backend but extends its functionality, allowing us to request spelling suggestions for all words, including those that appear to be correctly spelled. Additionally, we made use of a trigram language model, trained on 250 million words of newspaper text. Again, we attempted to match words (and their spelling suggestions) with words from the NS word list. Where no match was found, we iterated through the list of spelling suggestions to form all the possible combinations (sentences). These sentences were evaluated by the language model, and the most likely candidate sentence (along with the original) was retained and passed through the rest of the pipeline (parsing, lemmatization, triple extraction, gold standard evaluation).
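In sketch form, the candidate-generation step looks something like this (the names and the LM scoring function are placeholders, not our actual code; e.g., a KenLM model's score method could serve as lm_logprob):

    # Simplified sketch of candidate generation and LM reranking. NS-attested words
    # are kept as-is; everything else gets a suggestion list; the combinations are
    # scored by a trigram LM. `lm_logprob` is a placeholder for the LM scorer.
    from itertools import product
    import enchant

    d = enchant.Dict("en_US")

    def candidate_sentences(words, ns_wordlist, max_suggestions=3):
        options = []
        for w in words:
            if w.lower() in ns_wordlist:
                options.append([w])                  # trusted: keep as-is
            else:
                sugg = [s for s in d.suggest(w) if " " not in s][:max_suggestions]
                options.append([w] + sugg)           # always keep the original too
        return [" ".join(combo) for combo in product(*options)]

    def best_candidate(sentence, ns_wordlist, lm_logprob):
        return max(candidate_sentences(sentence.split(), ns_wordlist), key=lm_logprob)

In practice the suggestion lists have to be capped (hence max_suggestions), since the number of combinations grows multiplicatively with sentence length.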

[I should mention that the numbers below are no longer valid. It's complicated, but we had been classifying false negatives as "gold errors"--these are not really errors, they are simply good answers that aren't covered by our limited gold standard. So, we reconsidered and reclassified false negatives, and we think this is both more accurate and more fair. As a result, the trend reflected below is the same, but the numbers reflect an even stronger effect, because the error counts are no longer exaggerated by the false negatives. I would simply rewrite the paragraph, but I don't have time right now.] 

We found that we reduced errors (which included valid but non-covered responses in this experiment) by 7% overall. However, this was again dependent on our very limited gold standard. We showed that this reduction in error would be roughly 11.6% given a better gold standard, because through preprocessing, we shifted an additional 4.6% of errors from the "form error" category (primarily spelling and grammar errors) to the "gold error" category ("good" triples that simply are not covered by the current gold standard; i.e., these are not really errors at all). However, given that native speaker responses (in the form of word lists) are used to bias the preprocessing, a richer gold standard would likely have an even stronger benefit. As it stands, without preprocessing, roughly 32% of all responses are gold errors; with preprocessing, roughly 36% of all responses are gold errors.

Future Directions

Given the discussion above, I believe the most immediate means of increasing performance on this task is simply improving the gold standard. Initially, this project was conceived of as a low-effort process for automatically evaluating learner responses: the gold standard---and in turn, the task of evaluating NNS responses---was effectively crowdsourced, minimizing the amount of labor and expertise needed. The "pipe dream" of this line of work has been the development of a language learning game or intelligent language tutoring system (ILT) that is modular, where new learning modules (stories) could be plugged in with little effort and the gold standard could be obtained by having NSs play the game. The notion of automatically deriving evaluation of NNS responses simply by having NSs and NNSs perform the exact same task has been a major (theoretical) advantage of this approach. At this juncture, however, this insistence on simplicity and minimizing researcher effort needs to be reconsidered. If this constraint is dropped, we would be free to simply brainstorm any number of reasonable responses to the PDT items. With regard to researcher effort, this would be a reasonable task; a single PDT item in our task could likely be described extensively in sentence form by a single researcher in less than 30 minutes.

Alternatively, the task and instructions for NSs could be revised in a way that encourages variety. For example, we could ask NSs to describe the image with three unique sentences, or using five unique verbs, etc. We could also make the task "adaptive" by asking NSs to avoid using content words found among the responses of previous NSs. Another approach could involve attempting to automatically discover new gold triples by combining elements from known triples and asking NSs to approve or disapprove; e.g., from NS triples "do(woman,laundry)" and "wash(woman,shirt)", we could derive "do(woman,shirt)" and "wash(woman,laundry)", but NS knowledge would be useful in disapproving the former and approving the latter. (As Markus suggests, however, simple methods exist to automatically perform this filtering.)
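The recombination idea is trivial to prototype; here is a toy version, with illustrative names:

    # Toy sketch: split known gold triples into verb/subject/object slots and
    # propose every unseen recombination for NS (or automatic) filtering.
    from itertools import product

    def derive_candidate_triples(gold_triples):
        """gold_triples: set of (verb, subj, obj) tuples; returns unseen recombinations."""
        verbs    = {v for v, s, o in gold_triples}
        subjects = {s for v, s, o in gold_triples}
        objects  = {o for v, s, o in gold_triples}
        return set(product(verbs, subjects, objects)) - set(gold_triples)

    gold = {("do", "woman", "laundry"), ("wash", "woman", "shirt")}
    print(sorted(derive_candidate_triples(gold)))
    # includes ('wash', 'woman', 'laundry') -- keep -- and ('do', 'woman', 'shirt') -- reject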

The topic of expanding the gold standard relates to another open question in this work: How do we produce feedback? Given that this project is conceived of as the backbone of a game or tutoring system, providing useful feedback to the language learner should be one of its functions. Another philosophical underpinning of this project comes into play here: our priority is maximizing learner use of the L2 and minimizing grammar and spelling feedback, as SLA research has established these to be ineffective. The entire pipeline here is an attempt to overlook minor errors in spelling and grammar, instead abstracting away from the immediate form of the NNS response to the intended meaning (much the way an NS might in communication with an NNS). To this end, we believe that most (or all) feedback should focus on resolving situations in which a response lacks the target meaning. If we assume a "Choose Your Own Adventure" style game built on this system, feedback could come from a sidekick, "guide", or other interlocutor. At this stage, I envision feedback following a very formulaic pattern. An NNS who provides a response for which no match is found could simply be asked to restate their response. If a triple is a partial match, the matched part of the triple could be used to elicit another response. Consider the following toy example: an item showing a man reading a newspaper; a gold standard of a single triple, "read(man,newspaper)"; and an NNS response and triple of "a man is reading" and "read(man,NONE)". Upon matching "man" and "read", we could nudge the NNS toward a better response with questions like "What did the *man* *read*?", "What did the *man* do?", or "Who *read* what?"
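As a sketch, this formulaic feedback could be as simple as a few templates keyed to which slot of the triple is missing or mismatched (hypothetical code, not part of the current system):

    # Sketch of template-based feedback for partially matched triples.
    def feedback(gold, response):
        """gold, response: (verb, subj, obj) triples; missing slots are None."""
        g_verb, g_subj, g_obj = gold
        r_verb, r_subj, r_obj = response
        if response == gold:
            return None                               # full match: no prompt needed
        if r_verb == g_verb and r_subj == g_subj and r_obj is None:
            return f"What did the *{g_subj}* *{g_verb}*?"
        if r_subj == g_subj and r_verb != g_verb:
            return f"What did the *{g_subj}* do?"
        if r_verb == g_verb:
            return f"Who *{g_verb}* what?"
        return "Could you say that another way?"      # no match: ask for a restatement

    print(feedback(("read", "man", "newspaper"), ("read", "man", None)))
    # -> "What did the *man* *read*?"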

This relates to another open question here: How should we handle partial responses? We noted in previous work that as the basic unit of analysis here is the semantic triple (verb(subject,object)), a number of "partially correct" responses are counted as errors. Whether or not we want to award partial credit for partially matched responses relates heavily to the particular game or ILT in which we would like to implement our system. If we are focused only on moving the action of the game forward, the most likely use of partial matches would be in generating feedback intended to elicit a better response, as discussed above. In a testing scenario, simply assigning a non-binary score to partial matches may be useful in its own right.
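If we do go the partial-credit route, the simplest scoring scheme is just the fraction of matched slots (a placeholder idea, not a claim about what the system should ultimately do):

    # One simple non-binary score: the fraction of triple slots that match.
    def partial_score(gold, response):
        matched = sum(g == r and g is not None for g, r in zip(gold, response))
        return matched / 3.0

    print(partial_score(("read", "man", "newspaper"), ("read", "man", None)))
    # -> 0.67 (two of three slots matched)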

As we continue to investigate this line of work, we should also consider using more linguistic processing. One simple way of doing this, as mentioned above, would be to automatically derive unseen triples by breaking triples into their subject, verb, and object, and recombining those elements. We will also consider using some kind of lexical ontology like WordNet. WordNet stores lexical entries in a hierarchy from most general to most specific (entity > animal > mammal > dog > cocker spaniel), which could allow us to automatically discover a hypernym/hyponym relationship between, e.g., a known subject and an unknown subject. This knowledge could allow us to assign full or partial credit to the triple with the unseen subject, and possibly even modify the gold standard automatically. Additionally, the use of a semantic role labeler could improve performance, most likely in the triple extraction step of the process.
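A hedged sketch of the WordNet idea, using NLTK's WordNet interface (this only illustrates the lookup; how to turn it into credit assignment is still open):

    # Check whether an unseen noun sits on the hypernym path of a known one (or vice
    # versa), using NLTK's WordNet. Requires nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    def related_by_hierarchy(known_word, unseen_word):
        """True if any sense of one word lies on a hypernym path of any sense of the other."""
        for s1 in wn.synsets(known_word, pos=wn.NOUN):
            for s2 in wn.synsets(unseen_word, pos=wn.NOUN):
                if any(s1 in path for path in s2.hypernym_paths()):
                    return True
                if any(s2 in path for path in s1.hypernym_paths()):
                    return True
        return False

    print(related_by_hierarchy("dog", "cocker_spaniel"))  # True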

The language model has been shown to introduce errors due to its bias toward newspaper text. Improving the LM is another obvious modification to investigate. The challenge here is finding a suitable training text. The text must be sufficiently long to give robust coverage of the language, but it should also be more suitable for a domain primarily containing descriptions of transitive, physical actions. We may need to think creatively to truly gain some benefit here; perhaps a smaller, domain-specific model could be used in conjunction with a larger newspaper model, with the results from the two models weighted in some way.
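One simple way to combine the two models would be linear interpolation; a sketch, assuming both models expose a log10 score method (as KenLM's does), with a mixing weight that would need tuning:

    # Sketch of linearly interpolating a small in-domain LM with a large newspaper LM.
    # `domain_lm` and `news_lm` are placeholders for models whose score() returns a
    # log10 probability (as KenLM's does); `lam` weights the in-domain model.
    import math

    def interpolated_logprob(sentence, domain_lm, news_lm, lam=0.3):
        p_domain = 10 ** domain_lm.score(sentence)
        p_news   = 10 ** news_lm.score(sentence)
        return math.log10(lam * p_domain + (1 - lam) * p_news)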

Ultimately, we will likely need to collect more data. This time, we will want to have a better vision of the kind of game, tutoring, or testing application motivating this work. This would almost certainly involve images that are linked by some narrative, unlike the unrelated images used in the pilot task. As a result, some items would likely show a sequence of actions for users to describe. An experimental item like this was included in the previous task, and a closer examination of those results may influence our decision about how to organize this task for users. For example, we believe it may be better (or at least easier to process) if participants are presented with a single image from a sequence and asked to describe it before the next image is shown.