Weighted Ensemble L2
Train a dozen tabular models with AutoGluon and the row at the top of the leaderboard is almost always something called WeightedEnsemble_L2. It isn't a model. It's a recipe for blending the models that got trained first. The name sounds like it has to do with L2 regularization. It doesn't. There's also no ridge, no sklearn stacker, no learned meta-model. What's inside is a twenty-year-old greedy trick that fits in thirty lines of NumPy.
What the "L2" is (and isn't)
Layer 2. Not L2 regularization. AutoGluon trains its models in stacking layers: L1 is the library of base learners (random forest, extra trees, LightGBM, XGBoost, CatBoost, KNN, a neural net, a few others). L2 sits on top and takes the L1 predictions as its input. If you turn on multi-layer stacking you get WeightedEnsemble_L3 on top of that. No ridge penalty is involved anywhere.
The specific L2 model AutoGluon defaults to is a weighted average of the base predictions — non-negative weights that sum to one. That's the whole output space. No learned transform, no interaction terms. The only thing to figure out is the weight vector w.
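Applying such a weight vector is a single tensordot. A minimal sketch with made-up shapes and weights, just to pin down the tensor layout used throughout this post:

```python
import numpy as np

# Hypothetical shapes: M base models, N rows, K classes.
M, N, K = 3, 4, 2
rng = np.random.default_rng(0)
P = rng.random((M, N, K))
P /= P.sum(axis=2, keepdims=True)           # each row a probability vector

w = np.array([0.6, 0.4, 0.0])               # non-negative, sums to one
blend = np.tensordot(w, P, axes=1)          # (N, K) weighted average

# A convex mix of points on the simplex stays on the simplex.
assert np.allclose(blend.sum(axis=1), 1.0)
```

Because the weights live on the simplex, the blend is still a valid probability vector per row with no renormalization needed.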
Why not fit a meta-model
The textbook answer is stacking: take the out-of-fold predictions as features, fit something (logistic regression, another GBM) on top. sklearn ships StackingClassifier for exactly this. Two things go wrong in practice.
First, the OOF predictions of a well-tuned boosted tree already look a lot like the labels. A logistic regression will gladly dump a huge coefficient on that one column and drive everything else to zero, unless you regularize — and tuning the shrinkage to produce a useful blend is fiddly. Second, the meta-model's coefficients can go negative. A weight of −0.4 says "do the opposite of this model." That helps on whichever validation fold surfaced the trick and blows up on test.
What you actually want is something that (a) stays on the simplex, (b) is coarse enough that it can't fit validation noise, (c) is cheap, (d) cannot produce negative or exploding weights. Caruana et al. described exactly that procedure in 2004 and called it "ensemble selection."
The greedy step
Start with an empty bag. Do this twenty-five times: for each base model, ask what would the running average of the bag look like if I added this model? Score that hypothetical average against the validation labels. Pick the model whose addition lowers the error most. Drop it in the bag. You can pick the same model repeatedly.
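The "running average of the bag" can be maintained incrementally instead of re-averaging every pick. A toy check (random predictions, a hand-picked bag) that the incremental update matches a plain mean over the bag:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 5, 3))                   # 4 models, 5 rows, 3 classes
picks = [2, 0, 2, 1]                        # the same model may repeat

running = np.zeros((5, 3))
for s, j in enumerate(picks):
    # average of s+1 picks = (s * old average + new pick) / (s + 1)
    running = (s * running + P[j]) / (s + 1)

assert np.allclose(running, P[picks].mean(axis=0))
```

This is why scoring a hypothetical addition is cheap: each candidate costs one axpy and one metric call, not a re-average of the whole bag.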
Why sampling with replacement
Because it's a coarse, fast approximation of continuous convex weights. If you draw twenty-five slots from four models with replacement, each model can contribute 0, 1, 2, …, 25 picks — a grid of 26 values per model in steps of 0.04. On a real validation set, the difference between 0.36 and 0.40 is rarely distinguishable from noise, so the coarse grid is a feature, not a bug. Greedy selection at this granularity is cheap (O(size × M × N) — a few milliseconds) and beats nonlinear optimizers on Caruana's original benchmarks.
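Converting pick counts to weights is the entire "optimization output". A sketch with a hypothetical bag of 25 draws over four models:

```python
from collections import Counter

size = 25
picks = [0] * 10 + [1] * 9 + [3] * 6        # hypothetical bag after 25 draws
counts = Counter(picks)
weights = {j: c / size for j, c in counts.items()}

# Every weight is a multiple of 1/25 = 0.04; unpicked models get exactly zero.
assert weights == {0: 0.4, 1: 0.36, 3: 0.24}
assert abs(sum(weights.values()) - 1.0) < 1e-12
```

Model 2 never appears in the bag, so it gets weight zero and can be dropped at inference time.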
Where the predictions come from

Not the training set. If you score candidates on rows the base models memorized, the best-fitting base model looks oracle-perfect and vacuums up every weight. The scoring signal has to be out-of-sample for every row it touches.
AutoGluon's bagging mode handles this. Each base model gets k fold-copies (typically eight); for each row, the prediction used is the one from the fold-copy that didn't see that row in training. Concatenate and you have an out-of-fold (OOF) vector as long as the full training set. The ensembler never sees a training-set prediction. No separate holdout needed. With bagging off, AutoGluon falls back to a real validation split and runs the same algorithm on the smaller signal.
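The fold mechanics can be sketched in a few lines of NumPy. The "model" here is deliberately trivial (each fold-copy just predicts the class frequencies of its own training rows), and the fold count and shapes are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, k = 12, 3, 4
y = rng.integers(0, K, size=N)

folds = np.arange(N) % k                    # fold assignment per row
oof = np.empty((N, K))
for f in range(k):
    train = folds != f
    freqs = np.bincount(y[train], minlength=K) / train.sum()
    # Each row is scored by the fold-copy that never saw it in training.
    oof[folds == f] = freqs

assert np.allclose(oof.sum(axis=1), 1.0)    # still a probability vector per row
```

Swap the frequency predictor for a real `fit`/`predict_proba` pair and this is the shape of the OOF matrix the ensembler consumes: one out-of-sample prediction per training row, no separate holdout.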
Reading the trajectory
The validation error changes at every step: usually down, sometimes up. The greedy step always adds the best available model, but the best available addition can still leave the bag worse than it was two steps ago.
AutoGluon records the entire trajectory, finds its minimum, and truncates the bag to that prefix. (This is the use_best flag, on by default.) Iteration 1 corresponds to picking the single best base model, so in the worst case truncation rolls back to there. On the OOF signal, the ensemble can never lose to the best individual base model — a surprisingly strong guarantee for thirty lines of NumPy.
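On hypothetical trajectory numbers, the truncation is a single argmin:

```python
import numpy as np

# Made-up per-step validation errors; the minimum is at step 4 (index 3).
trajectory = [0.30, 0.26, 0.27, 0.24, 0.25]
picks = [1, 0, 2, 1, 3]

cutoff = int(np.argmin(trajectory)) + 1     # keep the best prefix
picks = picks[:cutoff]

assert picks == [1, 0, 2, 1]
```

In the worst case `cutoff` is 1 and the ensemble collapses to the single best base model, which is exactly the floor the guarantee describes.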
Not sklearn
Common confusion: sklearn has a StackingClassifier, so surely that's what AutoGluon uses? No. StackingClassifier is the fit-a-meta-model approach Caruana's paper was specifically written against. AutoGluon's implementation is pure NumPy and lives at autogluon.core.models.greedy_ensemble.ensemble_selection. If you grep the repo for sklearn inside that file you'll come up empty. The class is called EnsembleSelection, wrapped by GreedyWeightedEnsembleModel for the rest of the AutoGluon pipeline.
Gotchas
- The metric matters. The greedy search optimizes whatever you passed to eval_metric. Accuracy, log loss, RMSE, AUC: each gives different weights. Accuracy especially can land on weights that look strange because the objective only cares about which class wins the argmax.
- Diversity within a family is weak. Ten near-duplicate XGBoost configs correlate tightly. The greedy step will pile picks onto one of them and waste slots. Diversity across algorithms (trees + KNN + neural net) helps more than diversity across hyperparameters inside one algorithm.
- Truncation guarantees OOF-goodness, not test-goodness. With a small dataset and easy folds, the twenty-five weights can still overfit the OOF signal. The guarantee is only that you won't look worse on the signal you selected on.
- Time-series breaks OOF. Row-wise bagging assumes rows are interchangeable. With temporal structure you need time-aware folds before any of this applies.
- Zero-weight models are dropped. Any base model that never got picked is removed from inference, so the final ensemble is usually smaller than the library — fast at prediction time.
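The metric in the first gotcha is just a lower-is-better callable on (labels, probabilities). A sketch of two such metrics (the function names and toy data are illustrative, not AutoGluon's API):

```python
import numpy as np

def error_rate(y, proba):
    # Accuracy's complement: only the argmax matters.
    return float((proba.argmax(axis=1) != y).mean())

def log_loss(y, proba, eps=1e-15):
    # Negative log-likelihood of the true class; confidence matters.
    p = np.clip(proba[np.arange(len(y)), y], eps, 1 - eps)
    return float(-np.log(p).mean())

y = np.array([0, 1, 1])
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]])

assert error_rate(y, proba) == 1 / 3
assert log_loss(y, proba) > 0
```

Hand either one to the greedy search and it will happily optimize it; the resulting weight vectors can differ substantially because the two objectives reward different things.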
The whole thing in 30 lines of Python
Works on any library of base models that produce probability predictions. The only hard requirement is that P be out-of-fold, not training-set predictions.
```python
import numpy as np
from collections import Counter

def weighted_ensemble_l2(P, y, metric, size=25):
    """Caruana-style greedy selection. Returns weights in the simplex.

    P: (M, N, K) — OOF predictions from M base models, N rows, K classes
    y: (N,) — true labels
    metric: callable(y_true, y_pred_proba) -> float, lower is better
    """
    M, N, K = P.shape
    running = np.zeros((N, K))
    picks: list[int] = []
    trajectory: list[float] = []
    for s in range(size):
        best_j, best_err = -1, float("inf")
        for j in range(M):
            candidate = (s * running + P[j]) / (s + 1)  # bag avg if we add j
            err = metric(y, candidate)
            # ties broken by preferring models already in the bag
            if err < best_err or (err == best_err and j in picks):
                best_j, best_err = j, err
        picks.append(best_j)
        running = (s * running + P[best_j]) / (s + 1)
        trajectory.append(best_err)
    # use_best: truncate to the prefix with minimum error
    cutoff = int(np.argmin(trajectory)) + 1
    picks = picks[:cutoff]
    weights = np.zeros(M)
    for j, c in Counter(picks).items():
        weights[j] = c / len(picks)
    return weights
```