Regularized Neural Networks

Implementing a regularized neural network from scratch using PyTorch.

Introduction

Building a machine learning pipeline from scratch can be a daunting task, especially for those new to the field. This project post aims to simplify this process by providing a step-by-step guide to creating a comprehensive machine learning pipeline. We’ll cover everything from data preparation and model building to training, cross-validation, hyperparameter tuning, and visualization of results.

In this post, we’ll walk through the Python code necessary to build and evaluate machine learning models using libraries such as pandas, numpy, scikit-learn, PyTorch, and plotnine. We will start by importing the essential libraries and creating custom classes for modeling. These classes include a featureless baseline model, a neural network implemented in PyTorch, and a custom cross-validation class. We will then move on to data preparation, where we’ll preprocess datasets for training and testing.

Following data preparation, we’ll train our models using various hyperparameters and generate diagnostic plots to visualize the results. Finally, we will apply our models to different datasets, comparing the performance of multiple algorithms to determine the best approach.

By the end of this post, you should have a clear understanding of how to build a robust machine learning pipeline, ready to be adapted and expanded for your specific needs. Let’s dive in!

Imports

The first step in any data science project is to import the necessary libraries. These libraries provide the tools for data manipulation, machine learning, and visualization.

import pandas as pd
import numpy as np
import torch
import plotnine as p9
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

Custom Classes

In previous posts we created the Featureless class, a simple baseline model that predicts the majority class in the training data.
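
Its code is not repeated here, but a minimal sketch, assuming the interface used later in this post (fit takes only the label vector, predict returns one prediction per test row), looks like this:

class Featureless:
    # baseline that ignores the features and always predicts the most frequent training label
    def fit(self, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_class = labels[counts.argmax()]
        return self

    def predict(self, X):
        # one prediction per test row, all equal to the majority class
        return np.repeat(self.majority_class, X.shape[0])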

The TorchModel class defines a neural network using PyTorch. It allows for flexible architecture by specifying the number of hidden layers and units per layer.

class TorchModel(torch.nn.Module):
    def __init__(self, n_hidden_layers, units_in_first_layer, units_per_hidden_layer=100):
        super(TorchModel, self).__init__()
        # layer sizes: the input features, then the hidden layers, then one output unit
        units_per_layer = [units_in_first_layer]
        for layer_i in range(n_hidden_layers):
            units_per_layer.append(units_per_hidden_layer)
        units_per_layer.append(1)
        seq_args = []
        for layer_i in range(len(units_per_layer)-1):
            units_in = units_per_layer[layer_i]
            units_out = units_per_layer[layer_i+1]
            seq_args.append(torch.nn.Linear(units_in, units_out))
            # ReLU activation after every layer except the final output layer
            if layer_i != len(units_per_layer)-2:
                seq_args.append(torch.nn.ReLU())
        self.stack = torch.nn.Sequential(*seq_args)

    def forward(self, feature_mat):
        return self.stack(feature_mat)
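
As a quick sanity check (not part of the pipeline itself), we can construct a small instance and pass a random feature matrix through it; the sizes below are just illustrative, with 57 input features as in the spam data used later:

toy_model = TorchModel(n_hidden_layers=2, units_in_first_layer=57)
print(toy_model.stack)  # Linear(57, 100), ReLU, Linear(100, 100), ReLU, Linear(100, 1)
toy_scores = toy_model(torch.randn(5, 57))
print(toy_scores.shape)  # torch.Size([5, 1]): one raw score (logit) per row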

The NumpyData class converts numpy arrays into a PyTorch dataset, making it easier to work with DataLoader for batch processing.

class NumpyData(torch.utils.data.Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __getitem__(self, item):
        return self.features[item, :], self.labels[item]

    def __len__(self):
        return len(self.labels)
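
To illustrate how this plugs into batch processing, the short snippet below (with made-up array sizes) wraps random numpy arrays in NumpyData and iterates over mini-batches with a DataLoader:

example_features = np.random.rand(10, 3)            # 10 rows, 3 features
example_labels = np.random.randint(0, 2, size=10)   # binary labels
example_dl = torch.utils.data.DataLoader(
    NumpyData(example_features, example_labels), batch_size=4, shuffle=True)
for batch_features, batch_labels in example_dl:
    # the default collate function converts the numpy rows into tensors
    print(batch_features.shape, batch_labels.shape)  # e.g. torch.Size([4, 3]) torch.Size([4])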

The MyCV class handles cross-validation: it iterates over a parameter grid, evaluates each candidate value with K-fold cross-validation on the training data, and keeps track of the best-performing parameters.

class MyCV:
    def __init__(self, estimator, param_grid, cv):
        self.estimator = estimator
        self.param_grid = param_grid
        self.cv = cv

    def fit(self, X, y):
        self.train_features = X
        self.train_labels = y
        self.best_params_ = {}
        np.random.seed(1)
        # randomly assign each training row to one of the cv folds
        fold_vec = np.random.randint(low=0, high=self.cv, size=self.train_labels.size)
        best_mean_accuracy = 0
        for param_dict in self.param_grid:
            # each dict maps one parameter name to a single-element list of values
            for param_name, [param_value] in param_dict.items():
                setattr(self.estimator, param_name, param_value)
                accuracy_list = []
                loss_df_list = []
                for test_fold in range(self.cv):
                    is_set_dict = {"validation": fold_vec == test_fold, "subtrain": fold_vec != test_fold}
                    set_features = {set_name: self.train_features[is_set, :] for set_name, is_set in is_set_dict.items()}
                    set_labels = {set_name: self.train_labels[is_set] for set_name, is_set in is_set_dict.items()}
                    self.estimator.fit(X=set_features, y=set_labels)
                    predicted_labels = self.estimator.predict(X=set_features["validation"])
                    accuracy = np.mean(predicted_labels == set_labels["validation"])
                    loss_df_list.append(self.estimator.loss_df)
                    accuracy_list.append(accuracy)
                mean_accuracy = np.mean(accuracy_list)
                if mean_accuracy > best_mean_accuracy:
                    best_mean_accuracy = mean_accuracy
                    self.best_params_[param_name] = param_value
        # keep the per-fold loss curves from the last parameter setting for plotting
        self.loss_mean_df = pd.concat(loss_df_list)
        # restore the best parameter values found during the grid search,
        # then refit on the full training set before predicting
        for param_name, param_value in self.best_params_.items():
            setattr(self.estimator, param_name, param_value)
        self.estimator.fit(
            X={"subtrain": self.train_features},
            y={"subtrain": self.train_labels})
        print(self.loss_mean_df)

    def predict(self, X):
        return self.estimator.predict(X)

The Regularized Multi-Layer Perceptron (RegularizedMLP) class builds and trains a neural network with regularization, where the regularization hyperparameter is the number of hidden layers. Note that the hidden_layers attribute is not set in the constructor; it is assigned by MyCV during the grid search, and fit then trains one model for each number of hidden layers from 0 up to hidden_layers - 1, recording the subtrain and validation loss at every epoch. The class uses TorchModel to define the network architecture and NumpyData to convert numpy arrays into PyTorch datasets.

class RegularizedMLP:
    def __init__(self, max_epochs, batch_size, step_size, units_per_hidden_layer):
        self.max_epochs = max_epochs
        self.batch_size = batch_size
        self.step_size = step_size
        self.units_per_hidden_layer = units_per_hidden_layer
        # binary cross-entropy on raw logits, so the network needs no output activation
        self.loss_fun = torch.nn.BCEWithLogitsLoss()

    def fit(self, X, y):
        # X and y are dicts of numpy arrays keyed by set name ("subtrain", "validation")
        set_features = X
        set_labels = y
        subtrain_dataset = NumpyData(set_features["subtrain"], set_labels["subtrain"])
        subtrain_dl = torch.utils.data.DataLoader(
            subtrain_dataset, batch_size=self.batch_size, shuffle=True)
        loss_df_list = []
        # self.hidden_layers is the regularization hyperparameter, assigned by MyCV
        for n_hidden_layers in range(self.hidden_layers):
            model = TorchModel(
                n_hidden_layers,
                set_features["subtrain"].shape[1],
                units_per_hidden_layer=self.units_per_hidden_layer)
            model.train()
            optimizer = torch.optim.SGD(model.parameters(), lr=self.step_size)
            for epoch in range(self.max_epochs):
                # one pass of stochastic gradient descent over the subtrain batches
                for batch_features, batch_labels in subtrain_dl:
                    pred_tensor = model(batch_features.float()).reshape(len(batch_labels))
                    loss_tensor = self.loss_fun(pred_tensor, batch_labels.float())
                    optimizer.zero_grad()
                    loss_tensor.backward()
                    optimizer.step()
                # record the loss on each set at the end of the epoch
                for set_name in set_features:
                    feature_mat = set_features[set_name]
                    label_vec = set_labels[set_name]
                    feature_mat_tensor = torch.from_numpy(feature_mat.astype(np.float32))
                    label_vec_tensor = torch.from_numpy(label_vec.astype(np.float32))
                    pred_tensor = model(feature_mat_tensor).reshape(len(label_vec_tensor))
                    set_loss = self.loss_fun(pred_tensor, label_vec_tensor).item()
                    loss_df_list.append(pd.DataFrame({
                        "n_hidden_layers": n_hidden_layers,
                        "set_name": set_name,
                        "loss": set_loss,
                        "epoch": epoch,
                    }, index=[0]))
        self.model = model
        self.loss_df = pd.concat(loss_df_list)

    def decision_function(self, X):
        self.model.eval()
        with torch.no_grad():
            return self.model(torch.Tensor(X)).numpy().ravel()

    def predict(self, X):
        # threshold the logit at 0 (equivalently, the predicted probability at 0.5)
        return np.where(self.decision_function(X) > 0, 1, 0)
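
A brief aside on the 0 threshold in predict: because BCEWithLogitsLoss applies the sigmoid internally, the network outputs raw logits, and a logit above 0 corresponds exactly to a sigmoid probability above 0.5. A tiny check of that equivalence:

logits = torch.tensor([-2.0, -0.1, 0.1, 3.0])
probs = torch.sigmoid(logits)
print(probs)                           # tensor([0.1192, 0.4750, 0.5250, 0.9526])
print((logits > 0) == (probs > 0.5))   # tensor([True, True, True, True])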

Data Preparation

Load and preprocess the datasets (spam and zip code data).

# spam data: 57 numeric features plus a 0/1 label in the last column
spam_df = pd.read_csv("./data/spam.data", header=None, sep=" ")
spam_features = spam_df.iloc[:, :-1].to_numpy()
# standardize each feature column to mean 0 and standard deviation 1
spam_scaled_features = (spam_features - spam_features.mean(axis=0)) / spam_features.std(axis=0)
spam_labels = spam_df.iloc[:, -1].to_numpy()

# zip code data: first column is the digit, the rest are pixel intensities
zip_df = pd.read_csv("./data/zip.test.gz", header=None, sep=" ")
# keep only the 0 and 1 digits so the problem is binary, like spam
is01 = zip_df[0].isin([0, 1])
zip01_df = zip_df.loc[is01, :]
zip_features = zip01_df.loc[:, 1:].to_numpy()
zip_labels = zip01_df[0].to_numpy()
# the pixel intensities are already on a common scale, so no standardization is
# needed (constant pixel columns would give a zero standard deviation anyway)

data_dict = {
    "spam": (spam_scaled_features, spam_labels),
    "zip": (zip_features, zip_labels),
}
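
Before training, it is worth printing the shape and label balance of each prepared dataset to confirm the preprocessing behaved as expected (the exact numbers depend on the data files):

for data_set, (features, labels) in data_dict.items():
    # spam should have 57 feature columns; zip should contain only the 0/1 digits
    print(data_set, features.shape, "positive fraction:", labels.mean())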

Hyperparameter Training and Diagnostic Plot

Train the models using cross-validation and generate diagnostic plots.

def hyperparameter_training_and_diagnostic_plot():
    for data_set, (input_mat, output_vec) in data_dict.items():
        # candidate values for the regularization hyperparameter: number of hidden layers
        param_dicts = [{'hidden_layers': [L]} for L in range(1, 5)]
        rmlp = RegularizedMLP(max_epochs=100, batch_size=200, step_size=0.2, units_per_hidden_layer=100)

        learner_instance = MyCV(estimator=rmlp, param_grid=param_dicts, cv=2)
        learner_instance.fit(input_mat, output_vec)
        print(learner_instance.best_params_)

        loss_df = learner_instance.loss_mean_df
        # average the loss over cross-validation folds for each
        # (number of hidden layers, set, epoch) combination
        loss_df = loss_df.groupby(
            ["n_hidden_layers", "set_name", "epoch"], as_index=False)["loss"].mean()

        def get_min(series_obj):
            # index label of the row with minimum loss within one group
            return series_obj.index[series_obj.argmin()]

        # one row per (n_hidden_layers, set_name): the epoch with minimum loss
        rows_to_plot = loss_df.groupby(["n_hidden_layers", "set_name"])["loss"].apply(get_min)
        layers_plot_data = loss_df.iloc[rows_to_plot, :]

        set_colors = {"subtrain": "red", "validation": "blue"}
        gg = p9.ggplot() +\
            p9.facet_grid(". ~ set_name") +\
            p9.scale_color_manual(values=set_colors) +\
            p9.geom_line(
                p9.aes(
                    x="epoch",
                    y="loss",
                    color="n_hidden_layers"
                ),
                data=loss_df) +\
            p9.geom_point(
                p9.aes(
                    x="epoch",
                    y="loss",
                    color="n_hidden_layers"
                ),
                data=layers_plot_data)
        gg.save(f"./11_regularization/{data_set}_03.png", width=10, height=5)

        # one panel per set; one curve per number of hidden layers, with a point
        # at the epoch that minimizes the loss for each curve
        hidden_layer_colors = ["red", "green", "blue", "orange"]
        gg = p9.ggplot() +\
            p9.facet_grid(". ~ set_name") +\
            p9.scale_color_manual(values=hidden_layer_colors) +\
            p9.geom_line(
                p9.aes(
                    x="epoch",
                    y="loss",
                    color="factor(n_hidden_layers)"
                ),
                data=loss_df) +\
            p9.geom_point(
                p9.aes(
                    x="epoch",
                    y="loss",
                    color="factor(n_hidden_layers)"
                ),
                data=layers_plot_data)
        gg.save(f"./{data_set}_01.png", width=10, height=5)

        # one panel per number of hidden layers; subtrain vs validation loss, with a
        # black-outlined point at the overall minimum of the validation loss
        validation_df = loss_df.query("set_name=='validation'")
        min_i = validation_df.loss.argmin()
        min_row = pd.DataFrame(dict(validation_df.iloc[min_i, :]), index=[0])
        set_colors = {"subtrain": "red", "validation": "blue"}
        gg = p9.ggplot() +\
            p9.facet_grid(". ~ n_hidden_layers") +\
            p9.scale_color_manual(values=set_colors) +\
            p9.scale_fill_manual(values=set_colors) +\
            p9.geom_line(
                p9.aes(
                    x="epoch",
                    y="loss",
                    color="set_name"
                ),
                data=loss_df) +\
            p9.geom_point(
                p9.aes(
                    x="epoch",
                    y="loss",
                    fill="set_name"
                ),
                color="black",
                data=min_row)
        gg.save(f"./{data_set}_02.png", width=10, height=5)

Experiments and Application

This function applies the models to both datasets, comparing the test accuracy of the regularized MLP against logistic regression, K-nearest neighbors, and the featureless baseline over three train/test splits.

def experiments_and_application():
    test_acc_df_list = []
    for data_set, (input_mat, output_vec) in data_dict.items():
        # three train/test splits; each algorithm is trained and evaluated on each split
        k_fold = KFold(n_splits=3, shuffle=True, random_state=1)
        for fold_id, indices in enumerate(k_fold.split(input_mat)):
            index_dict = dict(zip(["train", "test"], indices))

            set_data_dict = {}
            for set_name, index_vec in index_dict.items():
                set_data_dict[set_name] = {
                    "X": input_mat[index_vec],
                    "y": output_vec[index_vec]
                }

            rmlp = RegularizedMLP(
                max_epochs=100,
                batch_size=200,
                step_size=0.2,
                units_per_hidden_layer=100,
            )
            learner_instance = MyCV(estimator=rmlp, param_grid=[
                                    {'hidden_layers': [L]} for L in range(1, 5)], cv=2)
            learner_instance.fit(**set_data_dict["train"])

            logistic_reg_cv = LogisticRegressionCV(max_iter=1000)
            logistic_reg_cv.fit(**set_data_dict["train"])

            grid_search_cv = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=[
                                          {'n_neighbors': [x]} for x in range(1, 21)], cv=5)
            grid_search_cv.fit(**set_data_dict["train"])

            featureless = Featureless()
            featureless.fit(set_data_dict["train"]['y'])

            test_data_x = set_data_dict["test"]['X']
            test_data_y = set_data_dict["test"]['y']

            pred_dict = {
                "LogisticRegressionCV": logistic_reg_cv.predict(test_data_x),
                "Featureless": featureless.predict(test_data_x),
                "GridSearchCV+KNC": grid_search_cv.predict(test_data_x),
                "MyCV+RegularizedMLP": learner_instance.predict(set_data_dict["test"]['X']),
            }
            for algorithm, pred_vec in pred_dict.items():
                test_acc_dict = {
                    "test_accuracy_percent": (pred_vec == test_data_y).mean()*100,
                    "data_set": data_set,
                    "fold_id": fold_id,
                    "algorithm": algorithm
                }
                test_acc_df_list.append(pd.DataFrame(test_acc_dict, index=[0]))

    test_acc_df = pd.concat(test_acc_df_list)
    print(test_acc_df)

    gg = p9.ggplot() +\
        p9.geom_point(
            p9.aes(
                x="test_accuracy_percent",
                y="algorithm"
            ),
            data=test_acc_df) +\
        p9.facet_wrap("data_set")
    gg.save("./p8_accuracy_facetted.png")

Conclusion

This pipeline integrates the main components of a machine learning workflow, including data preparation, model building, training, hyperparameter tuning, and result visualization. The use of custom classes and functions allows for flexibility and reusability in different machine learning scenarios, and this post aims to provide both a comprehensive overview and a practical implementation guide for building such a pipeline from scratch.