Monday, February 26, 2018

Flood, farm, etc.

I'm back in Bloomington now, but I spent the last few days back home in Jefferson County, catching up on some work there. With planting season around the corner, my brothers and I spent some time planning and doing some maintenance. Large parts of the Midwest are flooded right now due to snowmelt and lots of heavy rain. In the Ohio River Valley, where we live, the river is expected to crest today and begin receding. We were spared the worst, but I'll definitely be back soon to fill in some badly eroded spots around the fields and driveways once the water recedes enough to do so. Let's hope the farms, homes, and businesses that really were affected can recover quickly.

In the meantime, we did some minor repairs on the buildings, fences, and equipment. I'll be moving to NYC within the next few months, so I've been trying to do what I can before I go. My younger brother recently moved back to the area, which means he can help out and I don't need to be around as much. In fact, this is the first year in a long time that the three of us won't be actively involved in the farming operation in some way. We're out of the livestock business as well. Things have settled down enough now that we've been able to scale back, work out leasing arrangements, and focus on our own careers. I'm grateful for the help and guidance we've had since being thrust back into this, and I've learned some valuable lessons, but I'm also excited to finally have the time I need to polish off my PhD and put it to work professionally.

It really comes down to the fact that the economics of farming these days only work at large scale; if you're going to own and maintain all of the expensive equipment and buy the chemicals, seeds, fuel, insurance, etc. needed to farm effectively, you need to be running a large operation with several employees and several hundred acres at a minimum. That's why crop farmers today are either huge operations that own and lease thousands of acres, or very small operations where someone is farming his family's 20 acres with a tractor from the 1970s while working a full-time job somewhere else. Add to that the fact that today's land prices make it very difficult for small operations to expand or for younger people to start out farming. There's just not a lot of room for the little guys.


Saturday, February 24, 2018

SAILS Corpus hype

For the better part of the past year, I've been working on the dataset at the core of my dissertation. I'll be releasing it sometime in March. In my field, as in any kind of science, it's important to share your research data for a number of reasons. First, it makes your work more transparent -- anyone should be able to check your calculations for accuracy (and honesty), and ideally, anyone should be able to repeat the work and reach similar outcomes and conclusions. Second, it simply allows other researchers to use your data for new and interesting kinds of research and development.

I plan to call this the SAILS Corpus, for "Semantic Analysis of Image-based Learner Sentences". A corpus is just a collection of text data used for some kind of research or for building statistical models. This corpus consists of responses to a picture description task (PDT), which is just what it sounds like -- a person is shown an image and asked to describe it or respond to some question about it. The task uses 30 simple, cartoon-like images, each depicting some common activity. Half of the participants were asked, "What is happening?" and the other half were asked, "What is x doing?", where x is replaced with the subject of the image, e.g., "the girl" or "the bird". The task was set up as an online survey, and participants were instructed to provide a single sentence for each picture.

Example PDT item: "What is the boy doing?"
Example response: He's carrying a big bag of fruits.
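
Just to make the data concrete, a single record in the corpus might be represented something like this (a rough sketch in Python; the field names are my illustration here, not the actual release format):

    # One hypothetical corpus record (illustrative field names, not the actual release format)
    record = {
        "item_id": 7,                         # which of the 30 images
        "prompt": "What is the boy doing?",   # or "What is happening?" for the other half of participants
        "group": "non-native",                # ESL student, known native speaker, or crowdsourced native speaker
        "response": "He's carrying a big bag of fruits.",
    }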


I collected about 13,000 responses total. These come from different populations: about 70 participants were English as a Second Language students at Indiana University who completed the task under my supervision, and about 30 were native English speakers I know personally. Another 220 or so native English speakers were crowdsourced through a paid survey site; the survey creator pays the site, and the participants receive rewards and gift cards for completing surveys. The reason I use these different groups is that the next step of this work involves using the native speakers' responses to automatically rate the non-native speakers' responses with various natural language processing techniques. I'll use the results to see how well this kind of non-native speaker content assessment can be automated simply by crowdsourcing the task to native speakers and using their responses as a "gold standard" or a kind of answer key.
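
The NLP techniques themselves are a topic for another post, but as a toy illustration of the general idea -- treating the crowdsourced native responses as an answer key -- here's a sketch that scores a learner response by simple word overlap with the native responses for the same image. This is just a naive similarity measure for illustration, not the actual method:

    def overlap_score(learner_response, native_responses):
        """Toy similarity: the fraction of the learner's words that also appear
        in at least one native-speaker response for the same image."""
        learner_words = {w.strip(".,!?\"") for w in learner_response.lower().split()}
        native_words = set()
        for resp in native_responses:
            native_words.update(w.strip(".,!?\"") for w in resp.lower().split())
        if not learner_words:
            return 0.0
        return len(learner_words & native_words) / len(learner_words)

    # Native responses act as the "gold standard" answer key (made-up examples)
    natives = ["The boy is carrying a bag of fruit.", "He is carrying groceries."]
    print(overlap_score("He's carrying a big bag of fruits.", natives))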

The real bulk of the work with this corpus has been annotating it. In this context, annotation simply means adding some kind of score to each response. If you want to develop an assessment tool that, for example, reads an essay and assigns it a grade of A through F, at minimum you need a sample of essays that have been manually annotated A through F by a competent human. You then give the same example essays to your assessment tool (after removing the human annotations) and obtain the automatic grades. Finally, you judge the quality of your assessment tool by how well its annotations match the human ("gold standard") annotations.
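
Sticking with the A-through-F example, that last evaluation step is conceptually simple; a minimal sketch (not my actual evaluation code) might look like this:

    def accuracy(gold_grades, auto_grades):
        """Fraction of essays where the tool's grade matches the human 'gold standard' grade."""
        assert len(gold_grades) == len(auto_grades)
        matches = sum(g == a for g, a in zip(gold_grades, auto_grades))
        return matches / len(gold_grades)

    gold = ["A", "B", "C", "B", "F"]  # human annotations (made-up)
    auto = ["A", "B", "B", "B", "F"]  # the tool's output for the same essays
    print(accuracy(gold, auto))       # 0.8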

I started out with something like a three-point scale in mind:

  • 2 points = accurate and native-like response
  • 1 point = accurate but not native-like
  • 0 points = not accurate


When I tried to implement this, however, I found that it quickly broke down. There are simply too many characteristics to consider for each response, and a great many responses don't fit neatly into this scale. What does "accurate" mean? How "accurate" is accurate? What does "native-like" mean? So the scale had to be rethought, and through a long iterative process, I arrived at five binary features. This means each response is given five different scores, where each score is 1 ("yes") or 0 ("no"). These five features are:
  • Core event: Does the response capture the core event ("eating pizza", "buying a car", etc.) depicted in the image?
  • Answerhood: (Yeah, I guess I made that word up!) Does the response attempt to answer the question?
  • Grammaticality: Does the response use good grammar and spelling?
  • Interpretability: Does the response provide a clear mental image for the reader?
  • Verifiability: Does the response contain only information that is verifiable from the image?
As simple as this sounds, it took a long time and a lot of interesting conversations with my advisor and other linguists and language teachers to arrive at these features, and the process ultimately resulted in a 40-page manual of annotation guidelines with lots of rules and examples. The problem is that just when you have rules that cover all the weird sentences you think you'll encounter, you find an even weirder one and have to make significant changes to the feature definitions and the annotation guidelines.
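
To make that concrete, a single annotated response might be represented something like this (the field names and values here are made up for illustration, not the published format):

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        """Five binary features for one response; 1 = yes, 0 = no."""
        core_event: int        # captures the core event depicted in the image
        answerhood: int        # attempts to answer the question
        grammaticality: int    # good grammar and spelling
        interpretability: int  # gives the reader a clear mental image
        verifiability: int     # contains only information verifiable from the image

    # Made-up example values for a single response
    example = Annotation(core_event=1, answerhood=1, grammaticality=1,
                         interpretability=1, verifiability=0)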

Because I'm focused on automatically assessing content rather than form (grammar, spelling, etc.), my scoring tool will rely mostly on the Core Event and Answerhood features. Each of the features will provide some interesting insights into patterns of language variation between and within the native and non-native speakers.

The annotation itself is also quite a lot of work: 13,000 responses * 5 features = 65,000 annotation decisions. And it's pretty mind-numbing work after a while.

A second annotator annotated about 5% of the data while the annotation rules were being developed, so that we could compare our annotations and discuss and modify the rules as necessary. Once the annotation guidelines were complete, I annotated all of the data myself, and the second annotator completed another 5% sample using only the guidelines -- no consultation with me. By comparing both annotators' annotations for this sample, I can report agreement scores, which give an indication of how reliable the annotations and the guidelines are. If humans can't agree, there's probably no way for an automated system to agree with a human. So far, the agreement scores look good.
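
I won't go into the specific agreement measures here, but for two annotators making binary decisions, something like Cohen's kappa is a common choice, since it corrects raw agreement for the agreement you'd expect by chance. A rough sketch:

    def cohens_kappa(ann1, ann2):
        """Cohen's kappa for two annotators' binary (0/1) labels on the same items."""
        assert len(ann1) == len(ann2)
        n = len(ann1)
        observed = sum(a == b for a, b in zip(ann1, ann2)) / n
        # Chance agreement, from each annotator's marginal frequency of 1s
        p1, p2 = sum(ann1) / n, sum(ann2) / n
        expected = p1 * p2 + (1 - p1) * (1 - p2)
        if expected == 1.0:
            return 1.0  # both annotators used a single label for everything
        return (observed - expected) / (1 - expected)

    # Made-up labels for one feature on ten responses
    print(cohens_kappa([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                       [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]))  # ~0.74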

I'm currently writing all of this up as a chapter for my dissertation. When that's ready, I'll post the chapter, the annotation guidelines, and the corpus to my GitHub page. I'll most likely edit the chapter down to a paper and submit it to a conference as well.