Who’s sorting who? Or the explosion of metrics and how we can take back control

For those of us still working with paper student evaluations, we receive our spring semester results during the summer. It is a time of mixed emotions for many of us. While I heard one lucky young professor describe opening up her student evaluations as tantamount to Christmas, I am probably not alone in likening it to Halloween, where the tricks far outnumber the treats! Too hard, too boring, does not provide enough guidelines or makeup opportunities, not what I expected—the list goes on and on for the types of complaints students can and will make anonymously. As professors and grad students, we have seen them all.

I am at a point where these student evaluations mean little to my career. However, there are plenty of instructors and junior professors, particularly those in precarious circumstances, who have to worry about each little bump and dip in their average. Colleges and universities, particularly those more oriented to teaching, rely on student evaluations to make tenure decisions. And less than super-positive evaluations can be used to bludgeon contingent faculty.

Student evaluations may have some place in assessment. There is no other way for students to provide anonymous feedback about a course or an instructor. However, the problems with this sort of measurement are legion. For one thing, research has repeatedly demonstrated that many students evaluate female instructors more harshly than male instructors (see the accompanying figure). Even the words that students use are different. Racial bias is also present, as is bias against foreign-born instructors. Nervous professors have figured out that the best way to receive more positive evaluations is to grade more generously. High grades are clearly correlated with good evaluations, though long-term studies have shown that students in courses given lower ratings learned as much as, if not more than, students in courses with higher ratings. I remember an instructor who always plied students with pizza and cookies on evaluation day. Turns out, he was helping boost his evaluations significantly.

Summertime is also when many of us who work as journal editors receive an evaluation of another sort: the yearly impact factor. For those of you who have spent the last couple of decades in monastic solitude, a journal’s impact factor measures the mean number of citations its recent articles garner over the course of a year. So an impact factor of 1 indicates that articles published during the measurement years attracted, on average, one citation apiece per year. The calculations can be done in several ways, and while it is not possible to bring pizza and cookies to all potential citers, there are still means of manipulating the process. For instance, some very unscrupulous journal editors will insist that accepted publications include recent citations to their very own journal.
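To make the arithmetic concrete, here is a minimal sketch in Python of one common version of the calculation, the two-year impact factor: citations received in a given year to items a journal published in the two preceding years, divided by the number of citable items it published in those two years. The function name and the figures are hypothetical and chosen only for illustration; other versions of the metric use different windows.

```python
def two_year_impact_factor(citations_received: int, citable_items: int) -> float:
    """Return citations received this year to items from the two previous
    years, divided by the number of citable items from those two years."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations_received / citable_items


# Hypothetical journal: 40 citable articles over the two previous years,
# cited 40 times in total during the measurement year.
impact_factor = two_year_impact_factor(citations_received=40, citable_items=40)
print(impact_factor)  # 1.0
```

Because the numerator counts every citation, including a journal's citations to itself, an editor who pressures authors to cite the journal raises the result directly.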

Like student evaluations, journal impact factors have risen up to conquer all of academe. Originally meant for life science and medical journals, the impact factor is now used in all fields. The quality of one’s scholarship is also conveniently “measured” by the impact factors of journals that people list on their CVs. The numbers are easily grasped and can be a proxy for a journal’s research and scholarly reputation. I myself watched in dismay as journals I had long held up as models, both in and out of geography, were dismissed because of their relatively low impact factors while some niche journals rapidly scrambled their way up the scholarly ladder. Even more dismaying is how these impact factors are used to box in scholars, rapidly quantifying and sorting what should be careful decisions. An extreme example was a recent ad for a postdoctoral position in process engineering (thankfully, not in Geography), where applicants were required to have published in a journal with an impact factor above 10 or they “will get a rejection.” It is not as if impact factors suddenly became calculable; it is just that they became terribly urgent.

Journal impact factors have multiple biases. They favor journals composed largely of review essays or “debates.” They clearly slight some fields. For instance, only two journals in all of History rise above a journal impact factor of 1 and then just barely. Impact factors also screen out many varieties of scholarship, notably books. They can be “gamed” by editorial policy. And as I have found myself, a journal’s impact factor may ride on just one or two well-cited articles. Many new and innovative journals may not have an impact factor at all. To me the oddest thing of all is just how significant journal impact factors have become in an age where scarcely anybody physically handles an entire printed journal. Most of us download relevant articles, no matter what the journal. The well-known deficiencies of journal impact factors have sparked a strong backlash, culminating in the San Francisco Declaration on Research Assessment, which recommends that universities, potential funders, and publishers “not use journal-based metrics, such as Journal Impact Factors, as a surrogate measure of the quality of individual research articles, to assess an individual scientist’s contributions, or in hiring, promotion, or funding decisions.”
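The observation that an impact factor may ride on just one or two well-cited articles follows from its being a mean, which a single outlier can move a long way. The short Python sketch below uses invented citation counts (not data from any real journal) to show the effect.

```python
from statistics import mean, median

# Hypothetical citation counts for 20 articles in one journal's measurement
# window: 19 ordinary articles plus one heavily cited article.
ordinary = [0] * 5 + [1] * 9 + [2] * 5
citations = ordinary + [60]

with_outlier = mean(citations)     # 79 citations / 20 articles
without_outlier = mean(ordinary)   # 19 citations / 19 articles
typical = median(citations)        # middle of the sorted counts

print(with_outlier)     # 3.95
print(without_outlier)  # 1
print(typical)          # 1.0
```

In this made-up case the one outlier nearly quadruples the mean, while the median article is still cited only once.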

What is it about these measures? We have rankings of departments, of universities, and even of individuals. The United Kingdom has long undergone periodic research assessment exercises, and the state of Texas embarked several years ago on a program to evaluate the value added for every faculty member. Is this a needed corrective to professorial deadwood? Or is it yet another cudgel with which to intimidate and mold the professoriate?

Previous AAG presidents have weighed in on the explosion of metrics, while other academics on social media lament “the mushrooming of metrics and their influence on student, academic, and other university professionals’ lives.” My own view is that most metrics can be useful; after all, I employ them myself in research and evaluation. But they are inherently obtuse and contain all manner of biases. Because they are perceived as objective, they end up hiding their role in slighting certain groups and particular practices. Because they themselves become the desired objective, they lead to a warped process. Much like a company that wants only to maximize its quarterly earnings, too much reliance on metrics can lead to timid teaching and overly opportunistic scholarship. Exclusive use of these metrics cannot possibly account for everything people do to advance the enterprise: the unplanned service, the approachable demeanor, the helpful hearing out of a student’s or colleague’s research ideas, the emotional labor. These are, in other words, the soul of many a department, the types of things that should be valued more but are so egregiously overlooked.

So what can we do? For starters, it could help to work within your departments and institutions to ensure that teaching evaluations are used only within the appropriate safeguards, if they are used at all, with due respect to their inherent and discriminatory distortions. Beyond this, you can urge institutions and publishers you work with to join the signatories of the San Francisco Declaration on Research Assessment as a way to scale back the reliance on misleading measures of research quality.

Professors like to say that we are not producing widgets, and I would agree. Many bemoan the corporatization of the academy. I agree with this as well. Metrics can be a form of empowerment as they provide alternative means of assessment outside the purely subjective. But they reduce a great deal of complexity to one simple and misleading number. Keeping our academic autonomy and retaining the purpose and dignity of our profession means putting these metrics in their place.

— Dave Kaplan
AAG President

DOI: 10.14433/2017.0057

* Gender Disparities in Student Evaluation Scores. Figure from Lisa Martin.