The Challenge: Making Better Predictions to Improve Student Outreach and Enrollment
Technology providers in the higher education sector are now offering new analytic capabilities within CRM and ERP systems, or even as stand-alone services. This presents a potential boon to administrators and enrollment management professionals trying to rapidly adapt to a changing landscape, particularly at smaller institutions that lack the internal resources to build analytic teams. However, it is precisely because these institutions lack internal analytic expertise that vetting technology providers and solutions becomes problematic. This memo responds to several inquiries I have fielded regarding how to incorporate AI (Artificial Intelligence) into predictive modeling to inform institutional practices and goal setting.
Before evaluating analytic methods like AI, it is important to frame the problem within a broader context.
Analytic methods occupy an intermediary role between data and action. The selection of a “good” method needs to respond to what is known about data quality and scope. It likewise has to consider how analytic results will inform strategic action. As a result, instead of focusing narrowly on an analytic method, it is important to continually work through these three steps:
- The Data: Quality and scope
- The Methods: How to extract meaning from data
- The Action: How to use the data to improve outcomes
Improved data quality and scope must always be a priority, just as it is imperative to continually assess new modeling techniques, learn from prior year cycles, and “stress test” possible outcomes. Of course, these improvements only matter to the extent that specific interventions are planned in response to the data and methods.
How do we know which students to direct our scarce resources toward? Let’s look at these three steps in turn:
I. The Data: Improving Quality and Scope
Garbage in, garbage out: your model is only as good as your data. This is a particularly pressing problem in higher education where, frankly, data quality is abysmal. A small institution may run multiple CRMs for different functions, and each data source may have only a rickety connection to the central ERP, if any connection at all. I have seen cases in which critical data literally exist on note cards on someone’s desk. There is also the curious phenomenon of prestigious schools (with massive endowments) having horrible data infrastructure, because they have never needed to rely on good data: once your “brand” is big enough, students will come regardless of how much you botch the enrollment experience.
The challenge, then, is to increase the scope of data collection (what we are measuring and how) and its quality (consistency and reliability). This is not a trivial undertaking. But until a college or university establishes a basic data infrastructure, any modeling exercise will be for naught. This may sound like a truism, but it remains a very real problem for many institutions.
II. The Method
There is a dizzying array of analytic techniques for extracting meaning from data, an array further expanded (or compounded) by recent progress in the fields of machine learning and artificial intelligence. Contributing to the confusion is the fact that many techniques arose in different disciplinary settings and often use different language to refer to similar things. This is not to say that a newer discipline is merely old wine in a new bottle, but rather that the disciplinary jargon needs to be unpacked to assess what each technique actually contributes.
It is useful to distinguish between three “cultures” of data analytics: 1) machine learning, 2) statistical learning, and 3) social science modeling. Contemporary technology companies are heavily indebted to #1 and #2, and as a result there are more resources devoted to them, such as books and online communities. Of course, there are also real and unique contributions from social science modeling, including its recognition that a statistical model is an approximation of an underlying social process (see below). Personally, I combine elements of all three cultures, with a heavy and deliberate weighting toward statistical learning and social science modeling.
Exhibit A: Problem Conceptualization/Paradigm
- Machine Learning: Input → Output
- Statistical Learning: Defining a function, f(x) = …
  While this may seem identical to the input/output paradigm, expressing the problem in these terms creates a bias toward mathematical functional forms, which tend to be parametric.
- Social Science Modeling: An underlying social process can be approximated through statistical modeling.
This is an extreme oversimplification, and there is literally a century of academic debate on the topic across a dozen social science fields. However, in contrast with machine learning and statistical learning, I believe this emphasis on an underlying process is distinctive to social science modeling.
In selecting an appropriate method, it is important to recognize the real trade-offs of each model. (Also remember that there is no single “best” model, as certain models perform better in different contexts.) One of the most important trade-offs is between model flexibility and interpretability. A flexible model is one that is responsive to underlying complexity in relationships. If, for example, the relationship between family income and yield follows a seesaw pattern (as opposed to a smooth curvilinear pattern), a flexible model will be required to capture it. The major downside of a flexible model is that it can be prone to overfitting, which means the model is capturing noise (i.e., spurious relationships) and may not be a good predictor of future data.
The flip side of flexibility is interpretability. An interpretable model is one in which the modeled relationships can be easily captured in a functional form and, as a result, translated into plain language. The downside of an interpretable model is that it can be prone to oversimplifying the data.
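To make this trade-off concrete, here is a minimal sketch using simulated data and a generic Python/scikit-learn workflow (an assumption on my part, not any provider’s tooling). It compares a rigid fit, a moderately flexible fit, and a very flexible fit to a noisy “seesaw” relationship; the income framing is purely illustrative.

```python
# A minimal, hypothetical sketch of the flexibility trade-off on simulated data.
# No real student records are used; the "income" variable is a stand-in.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(60, 1))               # e.g., standardized family income
y = np.sin(3 * x[:, 0]) + rng.normal(0, 0.3, 60)   # a "seesaw" relationship plus noise

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in (1, 6, 20):  # rigid, balanced, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # A rigid model misses the seesaw; an overly flexible one can chase noise,
    # showing up as low train error but worse error on held-out data.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The point is not the specific numbers but the pattern: added flexibility helps until the model starts fitting noise, which is exactly what held-out data are there to reveal.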
I make it a practice to select modeling techniques that strike a balance between flexibility and interpretability. Most important, these techniques must be able to be specified in such a way as to emphasize one or the other. The two main families of techniques I employ are 1) regression modeling (linear and logistic) and 2) decision tree modeling (CHAID, CRT, etc.). Depending on the type of model needed, I tailor these models to be very flexible (incorporating interaction terms, non-parametric effects, etc.) or very interpretable.
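As a hedged illustration of these two families, the sketch below fits a logistic regression with an interaction term and a shallow decision tree to simulated enrollment data. The field names (gpa, visit, aid_offer) are invented, and scikit-learn’s CART-style tree stands in for CHAID/CRT, which typically live in other packages; treat this as a sketch of the approach, not a production specification.

```python
# A hedged sketch of the two model families on simulated data. All fields and
# effect sizes are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "gpa": rng.normal(3.2, 0.4, n).clip(0, 4),
    "visit": rng.integers(0, 2, n),          # 1 = made a campus visit
    "aid_offer": rng.uniform(0, 30, n),      # institutional aid, $000s
})
# Simulated enrollment decisions, including a visit-by-aid interaction.
true_logit = (-4 + 1.0 * df.gpa + 0.8 * df.visit
              + 0.05 * df.aid_offer + 0.04 * df.visit * df.aid_offer)
df["enrolled"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# 1) Logistic regression: interpretable coefficients; the interaction term adds
#    flexibility where the underlying process seems to call for it.
logit_fit = smf.logit("enrolled ~ gpa + visit * aid_offer", data=df).fit(disp=0)
print(logit_fit.summary())

# 2) Decision tree: capping the depth keeps the resulting segments readable.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100, random_state=0)
tree.fit(df[["gpa", "visit", "aid_offer"]], df["enrolled"])
print(export_text(tree, feature_names=["gpa", "visit", "aid_offer"]))
```

Either specification can then be pushed toward flexibility (more interactions, non-parametric terms) or toward interpretability (fewer terms, shallower trees), which is the balance described above.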
Some providers allow more advanced techniques, such as neural nets and random forests, to be implemented by non-technical users. Yet one of the main benefits of these advanced techniques is also their primary weakness: a well-specified neural net/AI model functions like a black box. You fit the model to existing data, and it will then score new data without providing an interpretable justification. This can be a good thing. Consider AI systems developed to play complex games (e.g., Chess, Backgammon, Go): they often find strategies that look like mistakes to experts but prove to be winning approaches. Yet when the context is higher education, and more specifically scoring individual students and making projections about class enrollment, there is a huge premium on interpretability. It is simply unacceptable to use a model that provides a result such as “the yield on the class will be X%” without providing a justification for this projection. Defending and adjusting projections like these requires interpretable models combined with professional expertise, not a black-box solution.
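To illustrate the gap, the brief sketch below scores one simulated applicant with a random forest (standing in here for the black-box techniques above) and with a logistic regression whose coefficients can be read back as a justification. Everything is simulated and the names are hypothetical.

```python
# A hypothetical contrast: a black-box score versus an interpretable one.
# All data are simulated; nothing here reflects a real provider's implementation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1500
X = pd.DataFrame({
    "gpa": rng.normal(3.2, 0.4, n),
    "visit": rng.integers(0, 2, n),
    "aid_offer": rng.uniform(0, 30, n),
})
y = rng.binomial(1, 1 / (1 + np.exp(-(-4 + X.gpa + 0.8 * X.visit + 0.07 * X.aid_offer))))

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
glm = LogisticRegression(max_iter=1000).fit(X, y)

applicant = X.iloc[[0]]  # score a single (simulated) applicant
print("forest score:", forest.predict_proba(applicant)[0, 1])  # a number, no stated rationale
print("logit score: ", glm.predict_proba(applicant)[0, 1])
# The regression's coefficients double as a plain-language justification
# (approximate change in log-odds per unit of each predictor):
print(dict(zip(X.columns, glm.coef_[0].round(3))))
```

Both models produce a score; only one comes with reasons that can be examined and debated, which is the premium on interpretability described above.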
The problem with black-box solutions in this context goes even deeper. Not only do black-box models have the potential of being over-fit and delivering spurious results (especially when implemented by non-statisticians), but they tend to obscure the relationships and values that should be the foundation of institutions of higher education. Developing a meaningful relationship with a student (whether a prospective student, an enrolled student, or an alum) requires knowing key characteristics about that student and how those characteristics inform their experiences and subsequent behavior. Furthermore, models are tools to inform interventions to improve student experience, and the choice of interventions is inherently value-laden. If, for example, different retention models identify different student segments that are at-risk, we would want to know why the models disagree, what the underlying social process is that leads to student attrition, and how best to intervene. Students are better served, and institutions have better outcomes, when these value-laden decisions are made explicit than when they are left to the devices of a (potentially spurious) black-box model. This is not a wholesale denouncement of black-box models in higher education, but rather a call to their cautious and judicious use.
III. Taking Action: What this means for Enrollment Management
Efforts to collect data and improve modeling techniques need to be guided by strategic priorities. Without a clear sense of why data is being collected and analyzed, the results cannot inform strategic decisions. Whether the problem is assessing student academic preparation, evaluating pricing, or predicting students’ enrollment, retention, and graduation prospects, the rationale must drive the analysis, never the reverse.
IV. Further Reading
Specific technology providers are not linked here, as most can easily be found through a web search. Instead, I am including some resources I highly recommend, whether you are in an analytic role or in a leadership position at an institution and want to understand how to better incorporate new technology.
James, Gareth et al. 2015. An Introduction to Statistical Learning. Springer.
Silver, Nate. 2015. The Signal and the Noise. Penguin Books.