BasePipeline class

class BasePipeline(self, model: sam_ml.models.main_classifier.Classifier | sam_ml.models.main_regressor.Regressor, vectorizer: str | sam_ml.data.preprocessing.embeddings.Embeddings_builder | None, scaler: str | sam_ml.data.preprocessing.scaler.Scaler | None, selector: str | tuple[str, int] | sam_ml.data.preprocessing.feature_selection.Selector | None, sampler: str | sam_ml.data.preprocessing.sampling.Sampler | sam_ml.data.preprocessing.sampling_pipeline.SamplerPipeline | None, model_name: str)

BasePipeline class - parent class Model

Parameters

model (Classifier or Regressor class object): model used in the pipeline
vectorizer (str, Embeddings_builder, or None): object or algorithm of the Embeddings_builder class used for automatic vectorizing of string columns (None for no vectorizing)
scaler (str, Scaler, or None): object or algorithm of the Scaler class for scaling the data (None for no scaling)
selector (str, tuple[str, int], Selector, or None): object, tuple of algorithm and feature number, or algorithm of the Selector class for feature selection (None for no selecting)
sampler (str, Sampler, SamplerPipeline, or None): object or algorithm of the Sampler / SamplerPipeline class for sampling the train data (None for no sampling)
model_name (str): name of the model

Attributes

cv_scores (dict[str, float]): dictionary with cross validation results
data_classes_trained (bool): If True, the preprocessing step classes are fitted. Important for methods that use warm_start
feature_names (list[str]): names of all the features that the model saw during training. Is empty if the model was not fitted yet.
grid (ConfigurationSpace): hyperparameter tuning grid of the model
model (model object): model with ‘fit’, ‘predict’, ‘set_params’, and ‘get_params’ methods (see sklearn API)
model_name (str): name of the model. Used in loading bars and dictionaries as identifier of the model
model_type (str): kind of estimator (e.g. ‘RFC’ for RandomForestClassifier)
rCVsearch_results (pd.DataFrame or None): results from randomCV hyperparameter tuning. Is None if randomCVsearch was not used yet.
steps (list[tuple[str, any]]): list with preprocessing + model pipeline steps as tuples
string_columns (list[str]): list with detected string columns that are used in auto-vectorizing
train_score (float): train score value
train_time (str): train time in format: “0:00:00” (hours:minutes:seconds)

Methods

Method	Description
`_auto_vectorizing`	Function to detect string columns, create a vectorizer for each, and vectorize them
`_changed_parameters`	Function to get parameters that differ from the default ones
`_data_prepare`	Function to run data class objects on data to prepare them for the model
`_get_all_scores`	Function to create multiple scores for given y_true-y_pred pairs
`_get_score`	Calculate a score for given y true and y prediction values
`_inherit_from_model`	Function to inherit methods and attributes from model
`_make_cv_scores`	Function to create a dictionary from the cross validation results
`_make_scorer`	Function to create a dictionary with scorers for the cross validation
`_print_scores`	Function to print out the values of a dictionary
`_validate_component`	Function to create the data preprocessing steps
`cross_validation`	Random split cross validation
`cross_validation_small_data`	One-vs-all (leave-one-out) cross validation for small datasets
`evaluate`	Function to create multiple scores with predict function of model
`evaluate_score`	Function to create a score with self._get_score of model
`feature_importance`	Function to generate a matplotlib plot of the top 45 feature importances from the model.
`fit`	Function to fit the model
`fit_warm_start`	Function to warm_start fit the model
`get_deepcopy`	Function to create a deepcopy of object
`get_params`	Function to get the parameter from the model object
`get_random_config`	Function to generate one grid configuration
`get_random_configs`	Function to generate grid configurations
`load_model`	Function to load a pickled model class object
`predict`	Function to predict with predict-method from model object
`predict_proba`	Function to predict with predict_proba-method from model object
`randomCVsearch`	Hyperparameter tuning with randomCVsearch
`replace_grid`	Function to replace self.grid
`save_model`	Function to pickle and save the class object
`set_params`	Function to set the parameter of the model object
`smac_search`	Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig]
`train`	Function to train the model
`train_warm_start`	Function to warm_start train the model

BasePipeline._auto_vectorizing(X: DataFrame, train_on: bool) → DataFrame

Function to detect string columns, create a vectorizer for each, and vectorize them

Parameters

X (pd.DataFrame): data to vectorize
train_on (bool): whether the vectorizer is trained before transforming (train_on=True) or the data is only transformed (train_on=False)

Returns

X_vectorized (pd.DataFrame): dataframe X with string columns replaced by vectorized columns

BasePipeline._changed_parameters()

Function to get parameters that differ from the default ones

Returns

dictionary of model parameters that are different from default values

BasePipeline._data_prepare(X: DataFrame, y: Series, train_on: bool = True) → tuple[DataFrame, Series]

Function to run data class objects on data to prepare them for the model

Parameters

X (pd.DataFrame): feature data to prepare
y (pd.Series): target column. Only needed if train_on=True and the pipeline contains a Selector or Sampler. Otherwise, just input None
train_on (bool): the data will always be transformed. If train_on=True, the transformers are fitted first (fit_transform)

Returns

X (pd.DataFrame): transformed feature data
y (pd.Series): transformed target column. Only differs from the input if train_on=True and the pipeline contains a Sampler

abstract BasePipeline._get_all_scores(y_test: Series, pred: list, custom_score: Callable, **kwargs) → dict[str, float]

Function to create multiple scores for given y_true-y_pred pairs

Parameters

y_test, pred (pd.Series, list): Data to evaluate the model
custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.
**kwargs: additional parameters from child-class

Returns

scores (dict): dictionary with score names as keys and score values as values
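The optional “custom_score” key can be illustrated with a minimal, self-contained sketch. It is not the sam_ml implementation: `get_all_scores` and `error_count` are hypothetical names, and accuracy stands in for whatever scores a child class actually computes.

```python
from typing import Callable, Optional, Sequence


def get_all_scores(
    y_true: Sequence,
    y_pred: Sequence,
    custom_score: Optional[Callable] = None,
    **kwargs,
) -> dict:
    """Hypothetical stand-in for a child-class _get_all_scores."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    scores = {"accuracy": matches / len(y_true)}
    # The "custom_score" key only exists when a scorer was passed in.
    if custom_score is not None:
        scores["custom_score"] = float(custom_score(y_true, y_pred, **kwargs))
    return scores


def error_count(y, y_pred, **kwargs):
    """Example scorer with the signature score_func(y, y_pred, **kwargs)."""
    return sum(1 for t, p in zip(y, y_pred) if t != p)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
scores_without_custom = get_all_scores(y_true, y_pred)
scores_with_custom = get_all_scores(y_true, y_pred, custom_score=error_count)
```

Callers that read the returned dictionary should therefore check for the key (or use `dict.get`) rather than assume it is present.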

abstract BasePipeline._get_score(scoring: str, y_test: Series, pred: list, **kwargs) → float

Calculate a score for given y true and y prediction values

Parameters

scoring ({“accuracy”, “precision”, “recall”, “s_score”, “l_score”} or callable (custom score), default=“accuracy”): metric to evaluate the model, given either as a metric name or as a custom score function (or loss function) with signature score_func(y, y_pred, **kwargs)
y_test, pred (pd.Series, list): Data to evaluate the model

Returns

score (float): metrics score value
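A scoring argument that is either a metric name or a callable can be handled with a small dispatch. The sketch below is an assumption about how such a dispatch might look, not the code of any sam_ml child class; `NAMED_SCORES`, `get_score`, and the two toy metrics are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, pos_label=1):
    """Share of true positives that were predicted as positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == pos_label]
    return sum(p == pos_label for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical registry of metric names (only two, for illustration).
NAMED_SCORES = {"accuracy": accuracy, "recall": recall}


def get_score(scoring, y_true, y_pred, **kwargs):
    # A callable is treated as a custom score with the signature
    # score_func(y, y_pred, **kwargs).
    if callable(scoring):
        return scoring(y_true, y_pred, **kwargs)
    if scoring not in NAMED_SCORES:
        raise ValueError(f"unknown scoring: {scoring!r}")
    return NAMED_SCORES[scoring](y_true, y_pred)


def error_rate(y, y_pred, **kwargs):
    return 1 - accuracy(y, y_pred)


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
named_result = get_score("recall", y_true, y_pred)
custom_result = get_score(error_rate, y_true, y_pred)
```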

BasePipeline._inherit_from_model(model: Classifier | Regressor)

Function to inherit methods and attributes from model

Parameters

model (Classifier or Regressor class object): model used in pipeline (Classifier or Regressor)

abstract BasePipeline._make_cv_scores(score: dict, custom_score: Callable | None) → dict[str, float]

Function to create a dictionary from the cross validation results

Parameters

score (dict): cross validation average column results
custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.

Returns

cv_scores (dict): restructured dictionary

abstract BasePipeline._make_scorer(custom_score: Callable | None, **kwargs) → dict[str, Callable]

Function to create a dictionary with scorers for the cross validation

Parameters

custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.
**kwargs: additional parameters from child-class

Returns

scorer (dict[str, Callable]): dictionary with scorer functions

BasePipeline._print_scores(scores: dict, y_test: Series, pred: list)

Function to print out the values of a dictionary

Parameters

scores (dict): dictionary with score names and values
y_test, pred (pd.Series, list): Data to evaluate the model

Returns

key-value pairs in console, format:

key1: value1

key2: value2

…

static BasePipeline._validate_component(component: str | tuple | Data, component_class: Data, pipeline_class: Data | None = None) → Data

Function to create the data preprocessing steps

Parameters

component (str, tuple, or Data object): __init__-method input value for data preprocessing component
component_class (Data object): class of component like Sampler, Selector, …
pipeline_class (Data object, default=None): currently, only used for SamplerPipeline class input in __init__-method

Returns

Data object
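As a rough illustration of turning a string, a tuple, or a ready-made object into a preprocessing object, consider the sketch below. Everything in it is hypothetical: `ToySelector` and its `algorithm` / `num_features` arguments are stand-ins, the None pass-through is an assumption, and pipeline_class is left out. The real Selector, Sampler, and SamplerPipeline constructors may differ.

```python
class ToySelector:
    """Hypothetical stand-in for a preprocessing class such as Selector."""

    def __init__(self, algorithm: str = "kbest", num_features: int = 10):
        self.algorithm = algorithm
        self.num_features = num_features


def validate_component(component, component_class):
    """Return an instance of component_class built from the given input."""
    if component is None:
        # Assumption: None means "skip this preprocessing step".
        return None
    if isinstance(component, component_class):
        # An already constructed object is used as-is.
        return component
    if isinstance(component, str):
        # A string names the algorithm.
        return component_class(algorithm=component)
    if isinstance(component, tuple):
        # A tuple carries the algorithm and the number of features.
        algorithm, num_features = component
        return component_class(algorithm=algorithm, num_features=num_features)
    raise TypeError(f"unsupported component type: {type(component).__name__}")


from_string = validate_component("kbest", ToySelector)
from_tuple = validate_component(("kbest", 5), ToySelector)
existing = ToySelector(algorithm="rfe")
passed_through = validate_component(existing, ToySelector)
```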

BasePipeline.cross_validation(X: DataFrame, y: Series, cv_num: int, console_out: bool, custom_score: Callable | None, **kwargs) → dict[str, float]

Random split cross validation

Parameters

X, y (pd.DataFrame, pd.Series): Data to cross validate on
cv_num (int): number of different random splits
console_out (bool): whether the result dataframe with the scores of the different runs is printed
custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.
**kwargs: additional parameters from child-class for make_scorer method

Returns

scores (dict): dictionary of the format from the self._make_cv_scores function

The scores are also saved in self.cv_scores.

BasePipeline.cross_validation_small_data(X: DataFrame, y: Series, leave_loadbar: bool, console_out: bool, custom_score: Callable | None, **kwargs) → dict[str, float]

One-vs-all (leave-one-out) cross validation for small datasets

In the cross_validation_small_data method, the model is trained on all datapoints except one and then tested on the held-out one. This is repeated for every datapoint, so a prediction exists for each datapoint.

Advantage: optimal use of information for training

Disadvantage: long train time

This approach is most useful for small datasets (recommended: fewer than 150 datapoints): the train time stays manageable, and with so little data it matters that the model trains on nearly all of it.

Parameters

X, y (pd.DataFrame, pd.Series): Data to cross validate on
leave_loadbar (bool): whether the loading bar of the training stays visible after training (True: the loading bar stays visible)
console_out (bool): whether the scores and a classification_report are printed to the console
custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.
**kwargs: additional parameters from child-class for _get_all_scores method

Returns

scores (dict): dictionary of the format from the self._get_all_scores function

The scores are also saved in self.cv_scores.
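The train-on-all-but-one loop described for this method can be sketched in plain Python. This is an illustration of the general idea only, not sam_ml code: a toy 1-nearest-neighbor rule on a single feature replaces the real model, and all names are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Toy model: return the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def leave_one_out_predictions(xs, ys):
    """Predict every datapoint from a model that never saw that datapoint."""
    predictions = []
    for i in range(len(xs)):
        # Train on everything except datapoint i ...
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        # ... and test on the single held-out datapoint.
        predictions.append(nearest_neighbor_predict(train_x, train_y, xs[i]))
    return predictions


xs = [1.0, 1.2, 3.0, 3.2, 2.9, 2.5]
ys = [0, 0, 1, 1, 1, 0]
predictions = leave_one_out_predictions(xs, ys)
accuracy = sum(p == t for p, t in zip(predictions, ys)) / len(ys)
```

The loop fits one model per datapoint, which is where the long train time comes from: n datapoints mean n separate training runs.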

BasePipeline.evaluate(x_test: DataFrame, y_test: Series, console_out: bool, custom_score: Callable, **kwargs) → dict[str, float]

Function to create multiple scores with predict function of model

Parameters

x_test, y_test (pd.DataFrame, pd.Series): Data to evaluate the model
console_out (bool): whether the scores and a classification_report are printed to the console
custom_score (callable or None): custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key “custom_score” does not exist in the returned dictionary.
**kwargs: additional parameters from child-class for _get_all_scores method

Returns

scores (dict): dictionary of the format from the self._get_all_scores function

BasePipeline.evaluate_score(scoring: str | Callable, x_test: DataFrame, y_test: Series, **kwargs) → float

Function to create a score with self._get_score of model

Parameters

scoring (str or callable (custom score)): metric to evaluate the model, given either as a metric name or as a custom score function (or loss function) with signature score_func(y, y_pred, **kwargs)
x_test, y_test (pd.DataFrame, pd.Series): Data for evaluating the model
**kwargs: additional parameters from child-class for _get_score method

Returns

score (float): metrics score value

BasePipeline.feature_importance() → show

Function to generate a matplotlib plot of the top 45 feature importances from the model. The method can only be used after the model has been trained.

Returns

plt.show object

Examples

>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> 
>>> # train and plot features of model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(X, y)
>>> model.feature_importance()

BasePipeline.fit(x_train: DataFrame, y_train: Series, **kwargs)

Function to fit the model

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to train the model
**kwargs: additional parameters from child-class for fit method

Returns

self (estimator instance): Estimator instance

BasePipeline.fit_warm_start(x_train: DataFrame, y_train: Series, **kwargs)

Function to warm_start fit the model

This function differs from the fit method only for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess.

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to train the model
**kwargs: additional parameters from child-class for fit method

Returns

self (estimator instance): Estimator instance
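The warm-start behaviour for preprocessing steps can be sketched with a toy pipeline. This is a guess at the control flow based on the description and the data_classes_trained attribute, not the actual implementation; `MeanCenterer` and `ToyPipeline` are hypothetical.

```python
class MeanCenterer:
    """Toy preprocessing step: subtracts the mean learned in fit()."""

    def __init__(self):
        self.mean_ = None
        self.fit_calls = 0

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        self.fit_calls += 1
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class ToyPipeline:
    """Hypothetical pipeline with a single preprocessing step."""

    def __init__(self):
        self.scaler = MeanCenterer()
        self.data_classes_trained = False
        self.last_model_input = None

    def fit_warm_start(self, values):
        # Fit the preprocessing step only on the first call.
        if not self.data_classes_trained:
            self.scaler.fit(values)
            self.data_classes_trained = True
        # Every call still transforms the data before the model sees it.
        self.last_model_input = self.scaler.transform(values)
        return self


pipe = ToyPipeline()
pipe.fit_warm_start([1.0, 2.0, 3.0])     # fits the scaler (mean 2.0)
pipe.fit_warm_start([10.0, 20.0, 30.0])  # reuses the fitted scaler
```

The second call is transformed with the mean learned from the first batch, which saves refitting time but means later batches do not update the preprocessing.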

BasePipeline.get_deepcopy()

Function to create a deepcopy of object

Returns

self (estimator instance): deepcopy of estimator instance

BasePipeline.get_params(deep: bool = True) → dict[str, any]

Function to get the parameters from the model object

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained sub-objects that are estimators

Returns

params (dict): parameter names mapped to their values

BasePipeline.get_random_config() → dict

Function to generate one grid configuration

Returns

config (dict): dictionary of random parameter configuration from grid

Examples

>>> from sam_ml.models.classifier import LR
>>> 
>>> model = LR()
>>> model.get_random_config()
{'C': 0.31489116479568624,
'penalty': 'elasticnet',
'solver': 'saga',
'l1_ratio': 0.6026718993550663}

BasePipeline.get_random_configs(n_trails: int) → list[dict]

Function to generate grid configurations

Parameters

n_trails (int): number of grid configurations

Returns

configs (list): list with sets of random parameters from grid

Notes

Duplicates are filtered out, so fewer than n_trails configurations may be returned.

Examples

>>> from sam_ml.models.classifier import LR
>>> 
>>> model = LR()
>>> model.get_random_configs(3)
[Configuration(values={
    'C': 1.0,
    'penalty': 'l2',
    'solver': 'lbfgs',
}),
Configuration(values={
    'C': 2.5378155082656657,
    'penalty': 'l2',
    'solver': 'saga',
}),
Configuration(values={
    'C': 2.801635158716261,
    'penalty': 'l2',
    'solver': 'lbfgs',
})]
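Why the result can be shorter than requested is easiest to see with a tiny search space. The sketch below is not the ConfigSpace-based implementation; it assumes a draw-then-filter strategy, and `sample_config` and the two-parameter grid are hypothetical.

```python
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from a tiny grid with only 4 combinations."""
    return {
        "penalty": rng.choice(["l1", "l2"]),
        "solver": rng.choice(["saga", "liblinear"]),
    }


def get_random_configs(n_trials: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    seen = set()
    configs = []
    for _ in range(n_trials):
        config = sample_config(rng)
        # Dicts are not hashable, so use a sorted tuple of items as the key.
        key = tuple(sorted(config.items()))
        if key not in seen:
            seen.add(key)
            configs.append(config)
    return configs


# Ten draws from a grid with four combinations must contain duplicates,
# so the filtered list is shorter than the number of draws.
configs = get_random_configs(10)
```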

static BasePipeline.load_model(path: str)

Function to load a pickled model class object

Parameters

path (str): path to the saved model file with suffix ‘.pkl’

Returns

model (estimator instance): estimator instance

BasePipeline.predict(x_test: DataFrame) → list

Function to predict with predict-method from model object

Parameters

x_test (pd.DataFrame): Data for prediction

Returns

prediction (list): list with predicted class numbers for data

BasePipeline.predict_proba(x_test: DataFrame) → list

Function to predict with predict_proba-method from model object

Parameters

x_test (pd.DataFrame): Data for prediction

Returns

prediction (np.ndarray): np.ndarray with probability for every class per datapoint

BasePipeline.randomCVsearch(x_train: DataFrame, y_train: Series, n_trails: int, cv_num: int, scoring: str | Callable, small_data_eval: bool, leave_loadbar: bool, **kwargs) → tuple[dict, float]

Hyperparameter tuning with randomCVsearch

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to cross validate on
n_trails (int): max number of parameter sets to test
cv_num (int): number of different random splits
scoring (str or callable (custom score)): metric to evaluate the models, given either as a metric name or as a custom score function (or loss function) with signature score_func(y, y_pred, **kwargs)
small_data_eval (bool): if True, trains the model on all datapoints except one and repeats this for all datapoints (recommended for datasets with fewer than 150 datapoints)
leave_loadbar (bool): whether the loading bar of the different parameter sets stays visible after training (True: the loading bar stays visible)
**kwargs: additional parameters from child-class for cross validation methods

Returns

best_hyperparameters (dict): best hyperparameter set
best_score (float): the score of the best hyperparameter set

Notes

If you trigger a keyboard interrupt (e.g. Ctrl+C) while randomCVsearch is running, the interim result is returned.
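Returning an interim result on interruption typically comes down to catching KeyboardInterrupt around the search loop. The following sketch shows that pattern under the assumption that a higher score is better; `random_search` and `evaluate` are hypothetical and the interrupt is simulated.

```python
def random_search(configs, evaluate):
    """Try each configuration; on Ctrl+C return the best one found so far."""
    best_config, best_score = None, float("-inf")
    try:
        for config in configs:
            score = evaluate(config)
            if score > best_score:
                best_config, best_score = config, score
    except KeyboardInterrupt:
        # Stop searching, but keep the interim result.
        pass
    return best_config, best_score


def evaluate(config):
    if config["C"] == 3:
        # Simulates the user pressing Ctrl+C during the third trial.
        raise KeyboardInterrupt
    return config["C"] * 0.1


configs = [{"C": 1}, {"C": 2}, {"C": 3}, {"C": 4}]
best_config, best_score = random_search(configs, evaluate)
```

Here the fourth configuration is never evaluated, so the returned best set reflects only the trials that finished before the interrupt.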

BasePipeline.replace_grid(new_grid: ConfigurationSpace)

Function to replace self.grid

See ConfigurationSpace documentation.

Parameters

new_grid (ConfigurationSpace): new grid to replace the old one with

Returns

changes self.grid variable

Examples

>>> from ConfigSpace import ConfigurationSpace, Categorical, Float
>>> from sam_ml.models.classifier import LDA
>>>
>>> model = LDA()
>>> new_grid = ConfigurationSpace(
...     seed=42,
...     space={
...         "solver": Categorical("solver", ["lsqr", "eigen"]),
...         "shrinkage": Float("shrinkage", (0, 0.5)),
...     })
>>> model.replace_grid(new_grid)

BasePipeline.save_model(path: str, only_estimator: bool = False)

Function to pickle and save the class object

Parameters

path (str): path to save the model with suffix ‘.pkl’
only_estimator (bool, default=False): If True, only the estimator of the class object will be saved
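The save/load round trip behind save_model and load_model can be sketched with the standard pickle module. This is a generic illustration, not sam_ml code: a plain dictionary stands in for the model object, and the helper names and file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path) -> None:
    """Pickle obj to path (binary mode is required for pickle)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A dictionary stands in for the model class object here.
model_state = {"model_name": "LR", "params": {"C": 1.0, "penalty": "l2"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_model.pkl"
    save_object(model_state, path)
    restored = load_object(path)
```

Unpickling can execute arbitrary code, so '.pkl' files should only be loaded from trusted sources.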

BasePipeline.set_params(**params)

Function to set the parameters of the model object

Parameters

**params (dict): Estimator parameters

Returns

self (estimator instance): Estimator instance

BasePipeline.smac_search(x_train: DataFrame, y_train: Series, scoring: str | Callable, n_trails: int, cv_num: int, small_data_eval: bool, walltime_limit: int, log_level: int, **kwargs) → Configuration

Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig]

The smac_search method searches the hyperparameter space more “intelligently” than randomCVsearch and returns the best hyperparameter set. In addition to the n_trails parameter, it takes a walltime_limit parameter that defines the maximum time in seconds that the search may take.

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to cross validate on
scoring (str or callable (custom score)): metric to evaluate the models, given either as a metric name or as a custom score function (or loss function) with signature score_func(y, y_pred, **kwargs)
n_trails (int): max number of parameter sets to test
cv_num (int): number of different random splits
small_data_eval (bool): if True, trains the model on all datapoints except one and repeats this for all datapoints (recommended for datasets with fewer than 150 datapoints)
walltime_limit (int): the maximum time in seconds that SMAC is allowed to run
log_level (int): 10 - DEBUG, 20 - INFO, 30 - WARNING, 40 - ERROR, 50 - CRITICAL (SMAC3 library log levels)
**kwargs: additional parameters from child-class for cross validation methods

Returns

incumbent (ConfigSpace.Configuration): ConfigSpace.Configuration with best hyperparameters (can be used like dict)
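How a trial budget and a wall-clock budget interact can be shown with a small loop. Note that this sketch uses plain random sampling, not SMAC's model-based optimization, and does not call the SMAC library at all; `bounded_random_search` and the single `C` hyperparameter are hypothetical.

```python
import random
import time


def bounded_random_search(evaluate, n_trials, walltime_limit, seed=0):
    """Stop after n_trials evaluations or walltime_limit seconds,
    whichever comes first."""
    rng = random.Random(seed)
    start = time.monotonic()
    best_config, best_score, trials_run = None, float("-inf"), 0
    for _ in range(n_trials):
        # Check the wall-clock budget before starting another trial.
        if time.monotonic() - start >= walltime_limit:
            break
        config = {"C": rng.uniform(0.01, 10.0)}
        score = evaluate(config)
        trials_run += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials_run


def evaluate(config):
    # Toy objective that peaks at C = 1.0.
    return -abs(config["C"] - 1.0)


# Generous time budget: the trial budget is the binding limit.
_, _, trials_with_time = bounded_random_search(evaluate, 5, walltime_limit=60)
# Zero time budget: no trial is started at all.
no_time_result = bounded_random_search(evaluate, 5, walltime_limit=0)
```

In this sketch the time check happens only between trials, so a single slow evaluation can still overrun the limit.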

BasePipeline.train(x_train: DataFrame, y_train: Series, console_out: bool = True, **kwargs) → tuple[float, str]

Function to train the model

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to train the model
console_out (bool, default=True): whether the score and time are printed out
**kwargs: additional parameters from child-class for evaluate_score method

Returns

train_score (float): train score value
train_time (str): train time in format: “0:00:00” (hours:minutes:seconds)
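The “0:00:00” format matches what Python's `datetime.timedelta` prints for whole seconds. Whether sam_ml builds the string this way, and whether it truncates or rounds fractional seconds, is an assumption; `format_train_time` is a hypothetical helper.

```python
from datetime import timedelta


def format_train_time(seconds: float) -> str:
    """Format a duration as hours:minutes:seconds, e.g. '0:01:15'."""
    # Truncate to whole seconds so no microseconds appear in the string.
    return str(timedelta(seconds=int(seconds)))


short_run = format_train_time(0.4)
minute_run = format_train_time(75.9)
long_run = format_train_time(3725)
```

Durations of 24 hours or more would be printed with a leading day count by `timedelta`, so the plain hours:minutes:seconds shape only holds for shorter runs.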

BasePipeline.train_warm_start(x_train: DataFrame, y_train: Series, console_out: bool = True, **kwargs) → tuple[float, str]

Function to warm_start train the model

This function differs from the train method only for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess.

Parameters

x_train, y_train (pd.DataFrame, pd.Series): Data to train the model
console_out (bool, default=True): whether the score and time are printed out
**kwargs: additional parameters from child-class for evaluate_score method

Returns

train_score (float): train score value
train_time (str): train time in format: “0:00:00” (hours:minutes:seconds)