Weighted Ensemble L2
Train a dozen tabular models with AutoGluon and the row at the top of the leaderboard is almost always something called WeightedEnsemble_L2. It isn't a model. It's a recipe for blending the models that got trained first. The name sounds like it has to do with L2 regularization. It doesn't. There's also no ridge, no sklearn stacker, no learned meta-model. What's inside is a twenty-year-old greedy trick that fits in thirty lines of NumPy.
What the "L2" is (and isn't)
Layer 2. Not L2 regularization. AutoGluon trains its models in stacking layers: L1 is the library of base learners (random forest, extra trees, LightGBM, XGBoost, CatBoost, KNN, a neural net, a few others). L2 sits on top and takes the L1 predictions as its input. If you turn on multi-layer stacking you get WeightedEnsemble_L3 on top of that. No ridge penalty is involved anywhere.
The specific L2 model AutoGluon defaults to is a weighted average of the base predictions — non-negative weights that sum to one. That's the whole output space. No learned transform, no interaction terms. The only thing to figure out is the weight vector w.
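Applying such a weight vector is a single tensordot. A minimal sketch with made-up shapes and weights, just to pin down the tensor layout used throughout this post:

```python
import numpy as np

# Hypothetical shapes: M base models, N rows, K classes.
M, N, K = 3, 4, 2
rng = np.random.default_rng(0)
P = rng.random((M, N, K))
P /= P.sum(axis=2, keepdims=True)           # each row a probability vector

w = np.array([0.6, 0.4, 0.0])               # non-negative, sums to one
blend = np.tensordot(w, P, axes=1)          # (N, K) weighted average

# A convex mix of points on the simplex stays on the simplex.
assert np.allclose(blend.sum(axis=1), 1.0)
```

Because the weights live on the simplex, the blend is still a valid probability vector per row with no renormalization needed.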
Why not fit a meta-model
The textbook answer is stacking: take the out-of-fold predictions as features, fit something (logistic regression, another GBM) on top. sklearn ships StackingClassifier for exactly this. Two things go wrong in practice.
First, the OOF predictions of a well-tuned boosted tree already look a lot like the labels. A logistic regression will gladly dump a huge coefficient on that one column and drive everything else to zero, unless you regularize — and tuning the shrinkage to produce a useful blend is fiddly. Second, the meta-model's coefficients can go negative. A weight of −0.4 says "do the opposite of this model." That helps on whichever validation fold surfaced the trick and blows up on test.
What you actually want is something that (a) stays on the simplex, (b) is coarse enough that it can't fit validation noise, (c) is cheap, (d) cannot produce negative or exploding weights. Caruana et al. described exactly that procedure in 2004 and called it "ensemble selection."
The greedy step
Start with an empty bag. Do this twenty-five times: for each base model, ask what would the running average of the bag look like if I added this model? Score that hypothetical average against the validation labels. Pick the model whose addition lowers the error most. Drop it in the bag. You can pick the same model repeatedly.
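The "running average of the bag" can be maintained incrementally instead of re-averaging every pick. A toy check (random predictions, a hand-picked bag) that the incremental update matches a plain mean over the bag:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 5, 3))                   # 4 models, 5 rows, 3 classes
picks = [2, 0, 2, 1]                        # the same model may repeat

running = np.zeros((5, 3))
for s, j in enumerate(picks):
    # average of s+1 picks = (s * old average + new pick) / (s + 1)
    running = (s * running + P[j]) / (s + 1)

assert np.allclose(running, P[picks].mean(axis=0))
```

This is why scoring a hypothetical addition is cheap: each candidate costs one axpy and one metric call, not a re-average of the whole bag.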
Why sampling with replacement
Because it's a coarse, fast approximation of continuous convex weights. If you draw twenty-five slots from four models with replacement, each model can contribute 0, 1, 2, …, 25 picks — a grid of 26 values per model in steps of 0.04. On a real validation set, the difference between 0.36 and 0.40 is rarely distinguishable from noise, so the coarse grid is a feature, not a bug. Greedy selection at this granularity is cheap (O(size × M × N) — a few milliseconds) and beats nonlinear optimizers on Caruana's original benchmarks.
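Converting pick counts to weights is the entire "optimization output". A sketch with a hypothetical bag of 25 draws over four models:

```python
from collections import Counter

size = 25
picks = [0] * 10 + [1] * 9 + [3] * 6        # hypothetical bag after 25 draws
counts = Counter(picks)
weights = {j: c / size for j, c in counts.items()}

# Every weight is a multiple of 1/25 = 0.04; unpicked models get exactly zero.
assert weights == {0: 0.4, 1: 0.36, 3: 0.24}
assert abs(sum(weights.values()) - 1.0) < 1e-12
```

Model 2 never appears in the bag, so it gets weight zero and can be dropped at inference time.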
Where the predictions come from

Not the training set. If you score candidates on rows the base models memorized, the best-fitting base model looks oracle-perfect and vacuums up every weight. The scoring signal has to be out-of-sample for every row it touches.
AutoGluon's bagging mode handles this. Each base model gets k fold-copies (typically eight); for each row, the prediction used is the one from the fold-copy that didn't see that row in training. Concatenate and you have an out-of-fold (OOF) vector as long as the full training set. The ensembler never sees a training-set prediction. No separate holdout needed. With bagging off, AutoGluon falls back to a real validation split and runs the same algorithm on the smaller signal.
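The fold mechanics can be sketched in a few lines of NumPy. The "model" here is deliberately trivial (each fold-copy just predicts the class frequencies of its own training rows), and the fold count and shapes are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, k = 12, 3, 4
y = rng.integers(0, K, size=N)

folds = np.arange(N) % k                    # fold assignment per row
oof = np.empty((N, K))
for f in range(k):
    train = folds != f
    freqs = np.bincount(y[train], minlength=K) / train.sum()
    # Each row is scored by the fold-copy that never saw it in training.
    oof[folds == f] = freqs

assert np.allclose(oof.sum(axis=1), 1.0)    # still a probability vector per row
```

Swap the frequency predictor for a real `fit`/`predict_proba` pair and this is the shape of the OOF matrix the ensembler consumes: one out-of-sample prediction per training row, no separate holdout.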
Reading the trajectory
The validation error changes at every step: usually down, sometimes up. The greedy step always adds the best available model, but the best available addition can still leave the bag worse than it was two steps ago.
AutoGluon records the entire trajectory, finds its minimum, and truncates the bag to that prefix. (This is the use_best flag, on by default.) Iteration 1 corresponds to picking the single best base model, so in the worst case truncation rolls back to there. On the OOF signal, the ensemble can never lose to the best individual base model — a surprisingly strong guarantee for thirty lines of NumPy.
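On hypothetical trajectory numbers, the truncation is a single argmin:

```python
import numpy as np

# Made-up per-step validation errors; the minimum is at step 4 (index 3).
trajectory = [0.30, 0.26, 0.27, 0.24, 0.25]
picks = [1, 0, 2, 1, 3]

cutoff = int(np.argmin(trajectory)) + 1     # keep the best prefix
picks = picks[:cutoff]

assert picks == [1, 0, 2, 1]
```

In the worst case `cutoff` is 1 and the ensemble collapses to the single best base model, which is exactly the floor the guarantee describes.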
Not sklearn
Common confusion: sklearn has a StackingClassifier, so surely that's what AutoGluon uses? No. StackingClassifier is the fit-a-meta-model approach Caruana's paper was specifically written against. AutoGluon's implementation is pure NumPy and lives at autogluon.core.models.greedy_ensemble.ensemble_selection. If you grep the repo for sklearn inside that file you'll come up empty. The class is called EnsembleSelection, wrapped by GreedyWeightedEnsembleModel for the rest of the AutoGluon pipeline.
Gotchas
- The metric matters. The greedy search optimizes whatever you passed to eval_metric. Accuracy, log loss, RMSE, AUC: each gives different weights. Accuracy especially can land on weights that look strange because the objective only cares about which class wins the argmax.
- Diversity within a family is weak. Ten near-duplicate XGBoost configs correlate tightly. The greedy step will pile picks onto one of them and waste slots. Diversity across algorithms (trees + KNN + neural net) helps more than diversity across hyperparameters inside one algorithm.
- Truncation guarantees OOF-goodness, not test-goodness. With a small dataset and easy folds, the twenty-five weights can still overfit the OOF signal. The guarantee is only that you won't look worse on the signal you selected on.
- Time-series breaks OOF. Row-wise bagging assumes rows are interchangeable. With temporal structure you need time-aware folds before any of this applies.
- Zero-weight models are dropped. Any base model that never got picked is removed from inference, so the final ensemble is usually smaller than the library — fast at prediction time.
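The metric in the first gotcha is just a lower-is-better callable on (labels, probabilities). A sketch of two such metrics (the function names and toy data are illustrative, not AutoGluon's API):

```python
import numpy as np

def error_rate(y, proba):
    # Accuracy's complement: only the argmax matters.
    return float((proba.argmax(axis=1) != y).mean())

def log_loss(y, proba, eps=1e-15):
    # Negative log-likelihood of the true class; confidence matters.
    p = np.clip(proba[np.arange(len(y)), y], eps, 1 - eps)
    return float(-np.log(p).mean())

y = np.array([0, 1, 1])
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]])

assert error_rate(y, proba) == 1 / 3
assert log_loss(y, proba) > 0
```

Hand either one to the greedy search and it will happily optimize it; the resulting weight vectors can differ substantially because the two objectives reward different things.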
The whole thing in 30 lines of Python
Works on any library of base models that produce probability predictions. The only hard requirement is that P be out-of-fold, not training-set predictions.
```python
import numpy as np
from collections import Counter

def weighted_ensemble_l2(P, y, metric, size=25):
    """Caruana-style greedy selection. Returns weights in the simplex.

    P: (M, N, K) — OOF predictions from M base models, N rows, K classes
    y: (N,) — true labels
    metric: callable(y_true, y_pred_proba) -> float, lower is better
    """
    M, N, K = P.shape
    running = np.zeros((N, K))
    picks: list[int] = []
    trajectory: list[float] = []
    for s in range(size):
        best_j, best_err = -1, float("inf")
        for j in range(M):
            candidate = (s * running + P[j]) / (s + 1)  # bag avg if we add j
            err = metric(y, candidate)
            # ties broken by preferring models already in the bag
            if err < best_err or (err == best_err and j in picks):
                best_j, best_err = j, err
        picks.append(best_j)
        running = (s * running + P[best_j]) / (s + 1)
        trajectory.append(best_err)
    # use_best: truncate to the prefix with minimum error
    cutoff = int(np.argmin(trajectory)) + 1
    picks = picks[:cutoff]
    weights = np.zeros(M)
    for j, c in Counter(picks).items():
        weights[j] = c / len(picks)
    return weights
```