
Feature Importance and Partial Dependence Plots

cracking open the black box

Most ML models are black boxes, too tangled to read off what they "learned" by squinting at the weights. These two methods let you ask any trained model two questions: which inputs is it actually using, and how does each one bend the prediction? The model eats rows and returns predictions; we want to strip out one column's contribution and draw a curve for it.

[Interactive: the partial dependency explorer. Left, permutation-importance bars; right, the partial-dependence curve for the selected feature (PetalLength, importance 0.4222, impact 100%).]

Above: a model that guesses iris species from four measurements. The bars on the left rank features by how much accuracy collapses when that column is scrambled (permutation importance). The plot on the right shows what the model predicts as one feature sweeps across its range, with the rest of each row held at its real values (partial dependence). Petal length carries the prediction; sepal length is mostly along for the ride. The rest of this page is how you actually compute that.

[Diagram: DATASET → model f(x) (opaque) → y_pred. But what does any one column do to the answer?]
The model eats rows, returns predictions. The question is: strip out one column's contribution and draw a curve for it.

Why you can't just plot the data

If you wanted to know what petal length does, the lazy answer is: scatter petal length against the model's predictions on the training set and call it a day.

[Two plots: left, "the lazy scatter" of P(versicolor) against petal_length; right, the same data as petal_width against petal_length, where petal_width tracks petal_length almost perfectly.]
Left: the lazy scatter. Looks decisive, a clean hill, peak around petal length ≈ 4. Right: but in the same data, petal width is glued to petal length. The hill on the left could be petal length, petal width, or both. Just looking can't tell you.

It lies. Petal length and petal width grow together: long petals tend to be wide. So a scatter of "predictions vs petal length" is really "predictions vs petal length and everything correlated with it." You can't separate the two by looking.

To isolate petal length, you have to ask a counterfactual: if I changed only petal length, what would the model do? The trick is constructing rows that don't exist in your data.

Partial dependence: the recipe

Pick a value, say petal length = 1.5 cm. Take every row in your dataset, but overwrite the petal length column with 1.5. The other columns keep their real values. Ask the model to predict on this fake dataset and average the answers. That's one dot.

Slide the value across the range. Each value gives you one dot. Connect them and you have the curve.
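The recipe for a single dot fits in a few lines. This is a sketch with hypothetical names (`pdp_point`, `j` as the column index); `model` is any fitted classifier with `predict_proba` and `X` a NumPy feature matrix:

```python
import numpy as np

def pdp_point(model, X, j, v):
    """One partial-dependence dot: pin column j to v, average the predictions."""
    X_fake = X.copy()
    X_fake[:, j] = v                                 # overwrite one column, keep the rest real
    return model.predict_proba(X_fake).mean(axis=0)  # average class probabilities
```

Calling this once per grid value gives you the dots; connecting them gives the curve.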

[Interactive: v sweeps a 20-step grid; the petal length column is pinned to v on a copy of the dataset, the model scores every row, and the average P(class) per species lands on the curve. Shown: v = 1.10, step 1 of 20.]
Move the slider, or hit play. Each value of v overwrites the petal length column on every row, the model scores all of them, and the average lands on the curve.

Permutation importance: does this feature matter at all?

Partial dependence shows the shape of one feature's effect. To rank features overall, there's a different test: scramble one column in your validation set so its values are still real numbers but no longer attached to the right rows. Score the model again. If accuracy collapsed, the model was relying on that feature. If it didn't budge, the model wasn't using that column.

[Diagram: validation set scores accuracy = 0.96; shuffle only column j; the permuted copy scores 0.58; importance(j) = 0.96 − 0.58 = 0.38. The drop is the importance.]
Shuffle one column, rescore, measure the drop. The shuffle destroys the feature's signal without changing the model or any other column.
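The shuffle-and-rescore step in isolation, as a sketch (the helper name `shuffle_drop` is made up; `model` is anything with `predict`, `X_val` a NumPy matrix):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def shuffle_drop(model, X_val, y_val, j, seed=0):
    """Importance of column j: accuracy drop after shuffling that one column."""
    rng = np.random.default_rng(seed)
    base = accuracy_score(y_val, model.predict(X_val))
    shuffled = X_val.copy()
    rng.shuffle(shuffled[:, j])   # break row alignment of column j only
    return base - accuracy_score(y_val, model.predict(shuffled))
```

A column the model ignores gives a drop near zero; in practice you average the drop over several reshuffles to smooth out shuffle luck.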

Reading the chart

Once features are sorted by importance and you draw one curve per class for each, each feature's chart reads like this:

[Chart: P(class) against petal_length, one curve per species, annotated: a flat stretch means the feature doesn't matter there, a steep stretch is a sensitive zone, and a crossing is where the decision flips.]
Probability of each class as a function of one feature, with the other features averaged out over the dataset.
  1. A flat line means the model isn't using this feature in that range.
  2. A steep rise or fall is a sensitive zone, small nudges flip predictions.
  3. Lines crossing are decision boundaries, the point where the predicted class changes.
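Point 3 can be read off programmatically. A sketch (the helper `class_flips` is hypothetical): given the grid and a `(grid_n, n_classes)` probability matrix, a decision boundary is wherever the argmax class changes between adjacent grid points.

```python
import numpy as np

def class_flips(grid, probas):
    """Grid values where the predicted (argmax) class changes between
    adjacent grid points: the decision boundaries visible in the PDP."""
    winners = probas.argmax(axis=1)          # predicted class at each grid value
    flips = np.nonzero(np.diff(winners))[0]  # indices where the winner changes
    return [(grid[i] + grid[i + 1]) / 2 for i in flips]  # midpoint of each flip
```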

Caveats

  1. Correlated features can produce fake combos. Pinning petal_length = 1.5 on a row with petal_width = 4.0 constructs a flower that doesn't exist. The model still draws a curve; treat it with care. Accumulated Local Effects (ALE) is the principled fix.
  2. Categorical features need category sweeps, not numeric grids. Don't interpolate between "red" and "blue."
  3. Sparse or heavy-tailed features usually need a clipped grid (5th–95th percentile) instead of min–max.
  4. Time-series breaks the shuffle. Permutation importance assumes rows are interchangeable.
  5. These are statements about your model, not about reality. A weak model shows weak importances even on features that genuinely matter.

The whole thing in 25 lines of Python

Once the batching trick clicks, both algorithms fit on one screen. This works on any sklearn-compatible classifier with predict and predict_proba.

import numpy as np
from sklearn.metrics import accuracy_score


def feature_importance_and_partial_dependency(
    model, X_train, X_val, y_val, features, grid_n=20, seed=0
):
    """Returns (importance, pdp) for any sklearn-style classifier."""
    rng = np.random.default_rng(seed)
    base_acc = accuracy_score(y_val, model.predict(X_val))


    # permutation importance — drop in accuracy when one column is shuffled
    importance = {}
    for j in features:
        shuffled = X_val.copy()
        rng.shuffle(shuffled[:, j])
        importance[j] = max(0.0, base_acc - accuracy_score(y_val, model.predict(shuffled)))

    # partial dependence — average prediction as one column sweeps its range
    pdp = {}
    N = X_train.shape[0]
    for j in features:
        lo, hi = np.quantile(X_train[:, j], [0.05, 0.95])
        grid = np.linspace(lo, hi, grid_n)
        batch = np.tile(X_train, (grid_n, 1))      # (grid_n * N, num_features)
        for g, v in enumerate(grid):
            batch[g * N : (g + 1) * N, j] = v
        probas = model.predict_proba(batch)         # one call, fully vectorized
        pdp[j] = (grid, probas.reshape(grid_n, N, -1).mean(axis=1))

    return importance, pdp
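scikit-learn ships hardened versions of both techniques under `sklearn.inspection`, so you can sanity-check the hand-rolled code against the library. A sketch on the iris data (assumes a reasonably recent scikit-learn; defaults differ slightly, e.g. the library averages the drop over `n_repeats` reshuffles):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# permutation importance, averaged over several independent shuffles per column
imp = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# partial dependence of petal length (column 2), averaged over X_train
pd_res = partial_dependence(model, X_train, features=[2], kind="average")
```

`imp.importances_mean` plays the role of the `importance` dict above, and `pd_res["average"]` holds one curve per class, like `pdp[j]`.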