E Identification

Nick HK

Since I’m a causal inference guy, I want to be clear: what I mean by identification in this section is not causal identification (although that would be a subset). I don’t think we need to teach high school students matching estimators or RDD or whatever. Nor is it statistical identification (too much math!). By “identification” I mean the much broader concept of: “we did a calculation in order to answer a question. Does this calculation actually answer that question?”

As an example of what I mean, and how this concept applies in a non-causal way: let’s say that you’re trying to decide whether to open up a Taco Bell franchise location. You look at the data and notice that Taco Bell’s total revenue has gone up for the past few years. Seems like Taco Bell is doing well. Does this tell you anything about whether you should open a Taco Bell? No! You’re not going to own all of Taco Bell, you’re going to own a single store. You should instead look at revenue per location. Perhaps revenues are going up because they opened a bunch of new locations, and each individual store is actually doing worse!

The calculation that identifies how much a single Taco Bell location is likely to sell is sales per store, not total Taco Bell sales.
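To make the arithmetic concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration (they are not real Taco Bell figures), and the names `chain_data` and `revenue_per_location` are hypothetical. It shows how total revenue can rise every year while revenue per store falls, simply because the number of stores grows faster than revenue does.

```python
# Hypothetical figures, invented for illustration only (not real Taco Bell data).
# Each entry: (year, total revenue in $ millions, number of locations)
chain_data = [
    (2021, 1000.0, 500),
    (2022, 1100.0, 600),
    (2023, 1200.0, 720),
]


def revenue_per_location(total_revenue, locations):
    """Average revenue per store, in the same units as total_revenue."""
    return total_revenue / locations


totals = [total for _, total, _ in chain_data]
per_store = [revenue_per_location(total, n) for _, total, n in chain_data]

# Compare each year with the next: is the series strictly rising or falling?
total_rising = all(later > earlier for earlier, later in zip(totals, totals[1:]))
per_store_falling = all(
    later < earlier for earlier, later in zip(per_store, per_store[1:])
)

for (year, total, n), avg in zip(chain_data, per_store):
    print(f"{year}: total = {total:.0f}M, stores = {n}, per store = {avg:.2f}M")

print("Total revenue rising every year:", total_rising)
print("Revenue per store falling every year:", per_store_falling)
```

With these made-up inputs, both checks come out true at once: the chain as a whole is growing while the typical store is shrinking. Which of the two numbers matters depends entirely on the question being asked.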

My verdict: Yes. I think that this kind of identification is both teachable and highly useful in the real world, even in cases where you don’t have your own hands on data and are just seeing someone talk about data, or just seeing someone make a claim based on their own observations and no data at all.

How is identification teachable? I think most errors of identification tend to be obvious after they’re pointed out, so giving students practice carefully considering whether a given measure actually relates to a question, a set of questions to ask themselves when evaluating a data-backed claim, and even a checklist of common ways that identification goes wrong would actually go a long way and be of everyday use.

That checklist includes: construct validity (does the data we have actually represent the concept we’re interested in, for example in a study on “happiness levels”); selection bias (as economists use the term; basically, the presence of confounders that make a correlation not represent a causal effect of interest, for example whenever we hear about something rich people do being correlated with good health outcomes); appropriately scaled comparisons (figuring out whether a 50% increase is big or small depends on the base rate and what variable we’re talking about, and a claim of “outcome Y just fell, this must be the fault of policy X I don’t like” should probably check whether Y also fell in areas that don’t have policy X); and appropriate transformation (is the data calculated in the right format or at the right observation level to answer the question we have, such as in the Taco Bell example above, or in choosing between an absolute and a relative increase in a rate). Some of these terms would probably be renamed into less-wonky terminology.
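The point about base rates, and about absolute vs. relative increases, can also be shown with a little arithmetic. The sketch below uses invented rates (cases per 1,000 people) and hypothetical helper names, `absolute_change` and `relative_change`. It is only meant to show that the same “50% increase” can describe very different situations.

```python
def absolute_change(before, after):
    """Change in the same units as the inputs (here, cases per 1,000 people)."""
    return after - before


def relative_change(before, after):
    """Change as a fraction of the starting value (0.5 means a 50% increase)."""
    return (after - before) / before


# Hypothetical rates, invented for illustration: cases per 1,000 people.
rare_before, rare_after = 2, 3          # a rare outcome
common_before, common_after = 200, 300  # a common outcome

rare_rel = relative_change(rare_before, rare_after)
common_rel = relative_change(common_before, common_after)
rare_abs = absolute_change(rare_before, rare_after)
common_abs = absolute_change(common_before, common_after)

print(f"Rare outcome:   {rare_rel:.0%} relative increase, "
      f"{rare_abs} extra cases per 1,000")
print(f"Common outcome: {common_rel:.0%} relative increase, "
      f"{common_abs} extra cases per 1,000")
```

Both outcomes show the same relative increase, but one adds a single case per 1,000 people and the other adds a hundred. A headline reporting only the percentage would hide that difference.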

Nick HK (2024) What Should School Teach Teenagers About Statistics?