Introduction

Since its inception, the free encyclopedia has had its share of critics, particularly in higher education institutions. My personal experience with Wikipedia has been generally positive. While studying calculus on my own through Khan Academy, I found the site to be a great resource for exploring the history behind the development of mathematical fields of study. Wikipedia articles, which profiled famous scientists and used a language (mathematics) that was still foreign to me, were such a strong motivator that I decided to finally enroll in college and pursue a STEM degree to learn how to think like a mathematician.

While my math professors seemed not to have a strong opinion, professors in other disciplines were reluctant to endorse the website as a learning tool. “Wikipedia is not a source” was a common refrain, implying that we should never cite a Wiki page in our academic work. However, some of the educators I encountered held a more nuanced view: “Wikipedia is a tertiary source. It’s fine for a general overview, but you need to follow the sources cited on the page and use those in your research projects.” These differences in opinion have inspired research into identifying the factors that contribute to the adoption of Wikipedia as a teaching aid.

A 2016 paper titled Factors that influence the teaching use of Wikipedia in higher education surveyed faculty members at two universities in Spain to understand how Wikipedia has been adopted in teaching and which criteria make an educator likely to use Wikipedia in their classes. The team shared the dataset with the University of California Irvine’s Machine Learning Repository. That research used a Technology Acceptance Model to identify connections between the survey question types. This article explores a few fundamental machine learning models, namely logistic regression and random forest classifiers, to predict how a given professor will use Wikipedia.

Survey Data

A common technique in data gathering via survey is to frame questions as statements to be self-assessed on a scale from 1 to 5, with the low and high ends corresponding to strong disagreement and strong agreement, respectively. The middle value in the scale is typically interpreted as having a neutral stance on the question at hand. This tool is known as a Likert scale, named after psychologist Rensis Likert.

This type of ordinal data brings about many decisions for data scientists to make when constructing predictive models. It isn’t strictly quantitative, because the differences between adjacent values are subjective and not necessarily uniform across features, yet it isn’t merely categorical either, because the responses are ordered. If treated as quantitative, a regression model would be a good candidate to try. If treated as categorical, then ensemble tree methods like random forest might yield a useful model. Which one is better for this data? This article will compare the performance of a logistic regression model to a random forest classifier.

Target: Use Behavior

Five use behaviors may serve as a target for a predictive binary classification model:

  • USE1: I use Wikipedia to develop my teaching materials
  • USE2: I use Wikipedia as a platform to develop educational activities with students
  • USE3: I recommend my students use Wikipedia
  • USE4: I recommend my colleagues use Wikipedia
  • USE5: I agree my students use Wikipedia in my courses

You can try a model for each of these using the notebooks in the repo. This post will focus on USE2.

Features Used

The features include demographic data (age, gender, job position, …) in addition to self-assessed survey responses (Likert scale). We’ll use all of the dataset variables not classified as Use Behavior to build a model.

Data Processing

Though the dataset is relatively clean and tidy, it contains missing values that have to be handled before the modeling framework can use them. There are many options here, but I opted to fill in the missing data with the most frequent value for that variable since the dataset is relatively small.
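As a rough sketch, the imputation step might look like the following in pandas and scikit-learn. The file name, the semicolon separator, and the “?” missing-value marker are assumptions based on the UCI copy of the dataset, not details taken from the repo’s notebooks:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the survey responses. The UCI copy of the wiki4HE dataset is assumed
# here to be semicolon-separated with "?" marking missing answers.
df = pd.read_csv("wiki4HE.csv", sep=";", na_values="?")

# Replace each column's missing entries with that column's most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```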

The USE2 values are grouped and remapped for the target variable so that 4 and 5 correspond to a ‘Yes’ response, while values of 1 and 2 map to a ‘No’ response. Entries where USE2 is 3 are dropped from the training set. The result is a binary outcome variable, which allows for simpler models that are easier to interpret than a multiclass classifier.
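Continuing the sketch above, the target remapping and the selection of the non-use-behavior columns could be done along these lines (column names follow the survey items listed earlier; the exact code in the repo’s notebooks may differ):

```python
# Binary target: agreement (4, 5) -> 1 ('Yes'), disagreement (1, 2) -> 0 ('No');
# neutral responses (3) become NaN and are dropped.
use2 = df_imputed["USE2"].astype(int)
target = use2.map({1: 0, 2: 0, 4: 1, 5: 1})
mask = target.notna()
y = target[mask].astype(int)

# Features: every column except the five use-behavior items.
use_cols = ["USE1", "USE2", "USE3", "USE4", "USE5"]
X = df_imputed.loc[mask].drop(columns=use_cols)
```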

Use Behavior 2 Distribution

Building the Models

Baseline Metrics

A standard metric for evaluating classification models is accuracy, where a baseline model simply predicts the most frequently observed outcome (the majority class). With binary outcomes, a class ratio of roughly 3:1 or greater is generally considered imbalanced, and in that situation accuracy is misleading: a model that always predicts the majority class already scores well without learning anything.

In such cases, a better metric is the ROC-AUC score, which ranges from 0 to 1: a no-skill baseline model yields a score of 0.5, and the larger the score, the better the model performs as a diagnostic tool.
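To make the baseline concrete, here is a hedged sketch using scikit-learn’s DummyClassifier. The 80/20 split, the stratification, and the random seed are illustrative choices rather than details from the original analysis:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out a test set; the split proportions and seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A majority-class model scores well on accuracy when classes are imbalanced,
# but its ROC-AUC stays at the no-skill value of 0.5.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Baseline ROC-AUC:",
      roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```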

Cross Validation

To avoid a situation where a particular subset of the training data biases the model, I performed 5-fold cross-validation: the training data was split into 5 subsets of nearly equal size, with each subset used once to validate the model’s ROC-AUC score while the remaining 4 subsets were used to train the model.
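A sketch of the cross-validation step, under the same assumptions as above (default scikit-learn hyperparameters; the notebooks in the repo may configure the models differently):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validation on the training data, scored with ROC-AUC.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f} (std = {scores.std():.3f})")
```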

Logistic Regression Cross-Validated ROC Scores


Random Forest Classifier Cross-Validated ROC Scores

Feature Importance

I used two methods to determine the importance of each variable: the model’s built-in feature importance and permutation importance.

Logistic Regression Feature Importance (coefficient magnitude) and Random Forest Classifier Feature Importance

However, feature importance is defined differently for a logistic regression model (coefficient magnitudes) than it is for a random forest classifier (impurity-based importance). Permutation importance, which shuffles the values of a single column and measures the resulting drop in model performance, is computed the same way for both, so it allows a direct comparison of variables across the two models.

Logistic Regression Permutation Importance and Random Forest Classifier Permutation Importance
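Permutation importance is available through scikit-learn. The sketch below fits each model on the training split and shuffles columns of the held-out split; the number of repeats and the seed are arbitrary choices for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle one column at a time and record the drop in ROC-AUC;
# a larger drop means the model leans more heavily on that feature.
for name, model in [("Logistic Regression", log_reg), ("Random Forest", forest)]:
    result = permutation_importance(
        model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
    )
    top = sorted(zip(X_test.columns, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)[:5]
    print(name, top)
```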

Model Performances

After building the models, it is time to test them on the data they haven’t seen yet.

Logistic Regression ROC-AUC: 0.89
Random Forest Classifier ROC-AUC: 0.79
Logistic Regression ROC and Random Forest Classifier ROC
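The test-set scores come from evaluating each fitted model on the held-out split. A minimal sketch of that step is shown below; the exact figures depend on the split and seeds used in the notebooks:

```python
from sklearn.metrics import roc_auc_score

# Score each fitted model on the test split using predicted probabilities.
for name, model in [("Logistic Regression", log_reg), ("Random Forest", forest)]:
    probs = model.predict_proba(X_test)[:, 1]
    print(f"{name} test ROC-AUC: {roc_auc_score(y_test, probs):.2f}")
```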

Confusion Matrices

Logistic Regression Confusion Matrix and Random Forest Classifier Confusion Matrix
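For reference, confusion matrices like these can be produced from the same fitted models; this sketch uses the default 0.5 decision threshold:

```python
from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns are the predicted classes.
for name, model in [("Logistic Regression", log_reg), ("Random Forest", forest)]:
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))
```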

Conclusion

The logistic regression model performed better on the held-out test data than the random forest classifier, despite having a lower mean ROC-AUC score on the validation folds. This experiment is evidence that the logistic regression model generalizes better than the random forest classifier when used on the same dataset with the same preprocessing steps. No tuning or feature selection was performed on either model, so it is possible that the random forest classifier could generalize just as well as the logistic regression model if restricted to a limited set of variables.

Resources