Sunday, July 14, 2013

Scientifically Approaching Language


Let's take a look at a quote from a journalistic piece in The American Scholar:
The science behind search may change how linguists view natural language. For one, linguists may move away from modeling language using formal grammars, Chomskyan or otherwise. “What you end up finding is that the things that people do with language very rarely fit into the formal grammar that you carved out from the outset,” says Mailhot. “The data are what they are and people do what they do and the best strategy is to make inferences based on what people do rather than carve something out ahead of time and shoehorn the data into it.” 
This hit home for Mailhot while briefly working on a project using data from Twitter, where “people just do the absolute most fascinating stuff with language.” Take the ironic construction X much—as in Jealous much? or Hypocritical much? “There are no grammars that will give you something out of this,” he says, “but you have to know that X much is a thing that people are doing.”
On this specific point, Mailhot is right: as a scientist working on language, you do have to know that "X much?" is a thing that people are doing.  More generally, however, his comments are misinformed at best, and ignorant at worst.

Allow me to summarize how Mailhot's last comments come across to a linguist: "A theory that's been built up over half a century by some of the greatest minds ever to consider how language works needs to be completely thrown out, because I found some things it can't account for!"

This line of thinking is flawed in two ways: in the disconnect between model and system, and in how to scientifically deal with counterexamples.



First, the problem regarding models and systems.

Whether or not a model successfully predicts actual behavior is not, by itself, an indicator of how well the model reflects the underlying system it is trying to describe.

A good working model of how humans actually walk is not necessarily a good reflection of the system of walking and how it is represented in the brain. Similarly, a good working model of the system of walking and how it is represented in the brain is not necessarily a good reflection of how humans actually walk.

In the same way, a good working model of how humans actually use language is not necessarily a good reflection of the system of language and how it is represented in the brain. Similarly, a good working model of the system of language and how it is represented in the brain is not necessarily a good reflection of how humans actually use language.

Linguistic engineers approach language by looking at outwardly observable data as it actually occurs in the world, and move towards a stochastic, generalized model of usage.  This bottom-up-only, usage-based approach is especially helpful for improving the performance of computational models, like those used by search engines.  This kind of approach, in which data *is* the theory, is thus very useful, but it does not necessarily inform us about the system.
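To make the engineering approach concrete, here is a deliberately tiny sketch of a purely usage-based, stochastic model (the mini-corpus and all details are made up for illustration; real systems train on enormous amounts of text): count what people actually say, and estimate probabilities directly from those counts, with no grammar specified ahead of time.

```python
from collections import Counter, defaultdict

# Toy usage-based model: estimate P(word | previous word) directly from
# observed strings, with no grammar assumed in advance.
# (Hypothetical mini-corpus, for illustration only.)
corpus = [
    "jealous much",
    "hypocritical much",
    "people do what they do",
    "the data are what they are",
]

bigram_counts = defaultdict(Counter)
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1

def prob(curr, prev):
    """Relative-frequency estimate of P(curr | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(prob("much", "jealous"))   # 1.0 here -- "much" follows "jealous" in the data
```

A model like this can be extremely good at predicting usage, which is exactly what a search engine needs; notice, though, that nothing in it says anything about the system that produces the usage.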

Theoretical linguists approach language somewhat differently.  It's not that we do not start with observable data.  Of course we do; otherwise, how would we know where to start?  What's different is that we use both bottom-up and top-down approaches in our work.  We look at outwardly observable data, make (potentially huge) abstractions/inferences, and generate hypotheses about the basic principles that determine how the system works.  Theorists focus on the latter part, and only use the data to support/modify the hypothesis.  Data plays a secondary role, informing the theoretical model.  Developing theoretical models is important for helping us understand why and how different languages are different/the same, and it helps us understand the basic principles of language.

Theoretical approaches, at this early stage of linguistics' scientific lifetime, have more limited applicability in computational models with direct real-world applications.  For this reason, theoretical linguistics often does not get much attention from search engine engineers (perhaps rightfully!).  They prefer more directly useful computational models based on probabilistic language patterns (though computational models that take grammatical approaches do exist, and are very successful in their own way).

In order to form a good model of a system, we need more than just data and a collection of generalizations.  We need to understand the basic principles of the system, which we come to through creating and refining hypotheses with actual language data.



Second, the problem of how to scientifically approach counterexamples.

Let us start with the specific (purported) counterexamples at hand.  Mailhot seems to confidently believe that a grammatical model, Chomskyan or otherwise, canNOT predict "X much?" or other such linguistic innovations.  But how did he come to this conclusion?  And what do we do if it truly is something that no grammatical model can predict?

It is possibly true that no one has attempted to use a grammatical approach to predict this behavior in published work, but that is very different from all grammatical accounts predicting it to be impossible.  I would put large amounts of money on the idea that a grammatical approach could account for this, and without any substantial changes to the framework.  (In fact, through the magic of Facebook, I have seen theoretical linguists offer good first hypotheses of how it is already predicted to be possible in a Chomskyan grammatical approach.)
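Just to illustrate the point (this is a quick, purely hypothetical toy, not a worked-out syntactic analysis, and the category labels are my own guesses), here is a tiny context-free fragment that treats "much?" as the head of an elliptical question taking a phrase on its left.  It generates "X much?" strings without any change to the basic machinery of phrase structure:

```python
from itertools import product

# Toy context-free fragment (illustrative only; not a published analysis).
# Idea: "X much?" is an elliptical question headed by "much?", with a phrase
# of almost any category filling the X slot.
grammar = {
    "Q":  [["XP", "much?"]],
    "XP": [["AP"], ["NP"], ["VP"]],
    "AP": [["jealous"], ["hypocritical"]],
    "NP": [["conspiracy", "theory"]],
    "VP": [["overreact"]],
}

def expand(symbol):
    """Yield every terminal string derivable from a symbol."""
    if symbol not in grammar:                      # terminal word
        yield [symbol]
        return
    for production in grammar[symbol]:
        for parts in product(*(expand(s) for s in production)):
            yield [word for part in parts for word in part]

for words in expand("Q"):
    print(" ".join(words).capitalize())
# Jealous much?  Hypocritical much?  Conspiracy theory much?  Overreact much?
```

A real analysis would have to constrain what can fill the X slot and connect the construction to its ironic interpretation, but the point stands: a grammar that "gives you something out of this" is not hard to imagine.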

But let us suppose (though I believe this is not the case) that the model as it stands cannot account for "X much?" et alia.  In other words, let us suppose that Mailhot has found a *true* counterexample to all current "grammatical" approaches.  What conclusion can we draw?

Does this mean that the grammatical approach is just wrong?  Does it mean that the models we are using are truly and basically flawed?  Should we just throw it all away, because we have some counterexamples?

That kind of extreme reaction is not how theoretical science deals with problems.  A well-founded theory that makes lots of attested predictions is not dismissed every time a problem is found.  Instead, we make incremental changes to the theory: we revise it and increase its complexity so that the discovered problem is no longer a problem.  When those changes are on the right track, further observations of the data show that the changes have corrected other problems as well -- an indicator that we've got a successful theory.  If further observations show that there are new (worse) problems, we change the theory in a different way.  Either way, the theory's empirical predictions end up matching the actual data better, and we gain a better understanding of the system, through a model.

This is called the scientific method.  It is literally how science is defined.

Eventually, we may make so many changes that the model we end up with only vaguely resembles the one we started with.  This is how science gets rid of models: by making incremental changes until those changes have eventually produced an entirely new model.  Starting over because of some empirical problems with a model that otherwise gets a lot right is unjustifiably rash.



I have often pondered why it is that so many people feel so strongly that theoretical linguists must be wrong about how language works.

This is something I frequently encounter from people who don't actually study language, but who have taken a linguistics course, or a course in some other department in which modern, mainstream linguistic approaches are (briefly) discussed.

My guess is that people see a simplified version of the theory, and rightfully find problems with it.  Wrongfully, however, they assume that the simplified version of the theory is the theory, and that there is no other way the same ideas can be manifested.

So, I would suppose that Mailhot was exposed to an elementary/simplified grammatical approach (like the ones taught in introductory classes), and rightfully noticed that it has no actual way of dealing with things like "X much?".  The mistake was in then inferring that all grammatical approaches will be similarly unable to deal with it.

But consider the fact that the model of genetics you learned in high school (Mendelian inheritance) can't explain even the most basic facts about how genetics actually works.  In fact, pea color isn't determined quite as simply as Mendel's model predicts.  Would you say that the model is totally broken?  No.  He had to start with basic observations -- he made a theory and model of how inheritance works -- and he probably knew that it couldn't account for every empirical fact about inheritance.

He put his theory out there, though, because it was (in its message) basically right, and so that he or others could refine the model after making more observations.  Over time, our best model of inheritance has become more complicated -- so complicated that you can't teach all of it in an introductory class.  You've got to start somewhere, so you start with the basic-but-flawed idea.
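For concreteness, here is that high-school model as a worked example (a sketch of the textbook Punnett-square idea, not a claim about how real pea genetics works): one gene, a dominant and a recessive allele, and a predicted 3:1 phenotype ratio from crossing two heterozygotes.  It is precisely where real data depart from that clean prediction that the model gets refined rather than discarded.

```python
from collections import Counter
from itertools import product

# Textbook Mendelian model: one gene, allele "Y" (dominant) vs. "y" (recessive).
# Cross two heterozygous ("Yy") parents and tally the predicted phenotypes.
def phenotype(genotype):
    return "dominant" if "Y" in genotype else "recessive"

parent1, parent2 = "Yy", "Yy"
offspring = Counter(phenotype(a + b) for a, b in product(parent1, parent2))
print(offspring)   # Counter({'dominant': 3, 'recessive': 1}) -- the textbook 3:1 ratio
```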

The same is true for linguistics.  You start with a theory whose general ideas (constituency, hierarchy, basic principles) are correct, while recognizing that it is incomplete, and while working to make it more complete.  That's how science works.
