RandomForestClassifier (RFC)
- class RFC(model_name: str = 'RandomForestClassifier', n_jobs: int = -1, random_state: int = 42, **kwargs)
Wrapper class for scikit-learn's RandomForestClassifier - parent class: Classifier
Parameters
- model_name : str, default='RandomForestClassifier'
name of the model
- n_jobs : int, default=-1
number of jobs to run in parallel (-1 means using all processors)
- random_state : int, default=42
random state for reproducible results
- **kwargs:
parameters of the wrapped RandomForestClassifier model
Attributes
- cv_scores : dict
scores of the last cross validation (set by the cross validation methods)
- grid : ConfigurationSpace
hyperparameter grid of the model (see replace_grid)
Note
You can use all parameters of the wrapped model when initialising the wrapper class.
Example
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> print(model)
RFC(model_name='RandomForestClassifier')
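Since all parameters of the wrapped model are passed through (see the Note above), an initialisation like the following also works; n_estimators and max_depth are parameters of the wrapped scikit-learn RandomForestClassifier:
>>> model = RFC(n_estimators=200, max_depth=10)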
Methods

| Method | Description |
|---|---|
| cross_validation | Random split cross validation |
| cross_validation_small_data | One-vs-all cross validation for small datasets |
| evaluate | Function to create multiple scores with predict function of model |
| evaluate_proba | Function to create multiple scores for binary classification with predict_proba function of model |
| evaluate_score | Function to create a score with predict function of model |
| evaluate_score_proba | Function to create a score for binary classification with predict_proba function of model |
| feature_importance | Function to generate a matplotlib plot of the top 45 feature importances of the model |
| fit | Function to fit the model |
| fit_warm_start | Function to warm_start fit the model |
| get_deepcopy | Function to create a deepcopy of object |
| get_params | Function to get the parameters from the model object |
| get_random_config | Function to generate one grid configuration |
| get_random_configs | Function to generate grid configurations |
| load_model | Function to load a pickled model class object |
| predict | Function to predict with predict-method from model object |
| predict_proba | Function to predict with predict_proba-method from model object |
| randomCVsearch | Hyperparameter tuning with randomCVsearch |
| replace_grid | Function to replace self.grid |
| save_model | Function to pickle and save the class object |
| set_params | Function to set the parameters of the model object |
| smac_search | Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig] |
| train | Function to train the model |
| train_warm_start | Function to warm_start train the model |
Note
Many methods use parameters for advanced scoring. For additional information on advanced scoring, see the scoring documentation.
- RFC.cross_validation(X: DataFrame, y: Series, cv_num: int = 10, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
Random split cross validation
Parameters
- X, y : pd.DataFrame, pd.Series
Data to cross validate on
- cv_num : int, default=10
number of different random splits
- console_out : bool, default=True
whether the result dataframe of the different scores for the different runs shall be printed
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …, 'custom_score': …}
The scores are also saved in self.cv_scores.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # cross validate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> scores = model.cross_validation(X, y, cv_num=3)
                                0         1         2   average
fit_time                 1.194662  1.295036  1.210156  1.233285
score_time               0.167266  0.149569  0.173546  0.163460
test_precision (macro)   0.779381  0.809037  0.761263  0.783227
train_precision (macro)  0.951738  0.947397  0.943044  0.947393
test_recall (macro)      0.774488  0.800144  0.761423  0.778685
train_recall (macro)     0.948928  0.943901  0.940066  0.944298
test_accuracy            0.776978  0.803121  0.762305  0.780802
train_accuracy           0.950180  0.945411  0.941212  0.945601
test_s_score             0.923052  0.937806  0.917214  0.926024
train_s_score            0.990794  0.990162  0.989660  0.990206
test_l_score             0.998393  0.998836  0.998575  0.998602
train_l_score            1.000000  1.000000  1.000000  1.000000
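To illustrate the custom_score parameter, here is a sketch using scikit-learn's f1_score; any callable with the signature score_func(y, y_pred, **kwargs) should work, and its value appears under the "custom_score" key of the returned dictionary:
>>> from sklearn.metrics import f1_score
>>>
>>> # custom score function with the required signature
>>> def macro_f1(y, y_pred, **kwargs):
...     return f1_score(y, y_pred, average="macro")
>>>
>>> scores = model.cross_validation(X, y, cv_num=3, custom_score=macro_f1)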
- RFC.cross_validation_small_data(X: DataFrame, y: Series, leave_loadbar: bool = True, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
One-vs-all cross validation for small datasets
In the cross_validation_small_data method, the model is trained on all datapoints except one and then tested on the held-out one. This is repeated for every datapoint, so that predictions are obtained for the whole dataset.
Advantage: optimal use of the available information for training
Disadvantage: long training time
This concept is very useful for small datasets (recommended: fewer than 150 datapoints) because the training time remains manageable, and with little information available to the model it is important to use all of it for training.
Parameters
- X, y : pd.DataFrame, pd.Series
Data to cross validate on
- leave_loadbar : bool, default=True
whether the loading bar of the training shall still be visible after training (True - loading bar stays visible)
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed into the console
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …, 'custom_score': …}
The scores are also saved in self.cv_scores.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # cross validate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> scores = model.cross_validation_small_data(X, y)
accuracy: 0.7
precision: 0.7747221430607011
recall: 0.672883787661406
s_score: 0.40853182756324635
l_score: 0.7812935895658734
train_time: 0:00:00
train_score: 0.9946286670687757

classification report:
              precision    recall  f1-score   support

           0       0.65      0.96      0.78        82
           1       0.90      0.38      0.54        68

    accuracy                           0.70       150
   macro avg       0.77      0.67      0.66       150
weighted avg       0.76      0.70      0.67       150
- RFC.evaluate(x_test: DataFrame, y_test: Series, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
Function to create multiple scores with predict function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed into the console
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'custom_score': …}
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(x_train, y_train)
Train score: 0.9891840171120917 - Train time: 0:00:02
>>> scores = model.evaluate(x_test, y_test)
accuracy: 0.802
precision: 0.8030604133545309
recall: 0.7957575757575757
s_score: 0.9395778023942218
l_score: 0.9990945415060262

classification report:
              precision    recall  f1-score   support

           0       0.81      0.73      0.77       225
           1       0.80      0.86      0.83       275

    accuracy                           0.80       500
   macro avg       0.80      0.80      0.80       500
weighted avg       0.80      0.80      0.80       500
- RFC.evaluate_proba(x_test: DataFrame, y_test: Series, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None, probability: float = 0.5) -> dict[str, float]
Function to create multiple scores for binary classification with predict_proba function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed. Also prints stats for the probabilities
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
- probability : float (0 to 1), default=0.5
probability threshold for class 1 (with the value 0.5, this behaves like the evaluate method). Increasing the probability parameter will likely increase precision and decrease recall (decreasing it has the opposite effect).
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'custom_score': …}
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(x_train, y_train)
Train score: 0.9775 - Train time: 0:00:00
>>> scores = model.evaluate_proba(x_test, y_test, probability=0.4)
accuracy: 0.9733333333333334
precision: 0.9728695961572674
recall: 0.9742405994915028
s_score: 0.9930964441542017
l_score: 0.9999999991441061
min proba: 5.126066053780961e-12
max proba: 0.9999731025066587
mean proba: 0.4701783612343521
median proba: 0.11068735707926472
std proba: 0.474678546763958

classification report:
              precision    recall  f1-score   support

           0       0.99      0.96      0.97       318
           1       0.96      0.99      0.97       282

    accuracy                           0.97       600
   macro avg       0.97      0.97      0.97       600
weighted avg       0.97      0.97      0.97       600
- RFC.evaluate_score(x_test: DataFrame, y_test: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3) -> float
Function to create a score with predict function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
Returns
- score : float
metrics score value
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.fit(x_train, y_train)
>>> recall = model.evaluate_score(x_test, y_test, scoring="recall")
>>> print(f"recall: {recall}")
recall: 0.4
- RFC.evaluate_score_proba(x_test: DataFrame, y_test: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, probability: float = 0.5) -> float
Function to create a score for binary classification with predict_proba function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- probability : float (0 to 1), default=0.5
probability threshold for class 1 (with the value 0.5, this behaves like the evaluate_score method). Increasing the probability parameter will likely increase precision and decrease recall (decreasing it has the opposite effect).
Returns
- score : float
metrics score value
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.fit(x_train, y_train)
>>> recall = model.evaluate_score_proba(x_test, y_test, scoring="recall", probability=0.4)
>>> print(f"recall: {recall}")
recall: 0.9742405994915028
- RFC.feature_importance() -> plt.show
Function to generate a matplotlib plot of the top 45 feature importances of the model. This method can only be used after the model has been trained.
Returns
plt.show object
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train and plot features of model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(X, y)
>>> model.feature_importance()
- RFC.fit(x_train: DataFrame, y_train: Series, **kwargs)
Function to fit the model
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- **kwargs:
additional parameters from child-class for fit method
Returns
- self : estimator instance
Estimator instance
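Examples
A minimal sketch in the style of the other examples (iris stands in for your own data; fit returns the estimator instance):
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # fit model
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit(X, y)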
- RFC.fit_warm_start(x_train: DataFrame, y_train: Series, **kwargs)
Function to warm_start fit the model
This method only differs from the fit method for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess the data.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- **kwargs:
additional parameters from child-class for fit method
Returns
- self : estimator instance
Estimator instance
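Examples
A minimal sketch (for a plain model wrapper like RFC this behaves like fit; the difference only appears for pipeline objects with preprocessing):
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit_warm_start(X, y)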
- RFC.get_deepcopy()
Function to create a deepcopy of object
Returns
- self : estimator instance
deepcopy of estimator instance
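Examples
A minimal sketch; the returned copy is independent of the original object:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model_copy = model.get_deepcopy()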
- RFC.get_params(deep: bool = True) -> dict
Function to get the parameter from the model object
Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained sub-objects that are estimators
Returns
- params : dict
parameter names mapped to their values
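Examples
A minimal sketch; the returned dictionary should contain the parameters of the wrapped model object:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> params = model.get_params()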
- RFC.get_random_config() -> dict
Function to generate one grid configuration
Returns
- config : dict
dictionary of random parameter configuration from grid
Examples
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.get_random_config()
{'C': 0.31489116479568624, 'penalty': 'elasticnet', 'solver': 'saga', 'l1_ratio': 0.6026718993550663}
- RFC.get_random_configs(n_trails: int) -> list[dict]
Function to generate grid configurations
Parameters
- n_trails : int
number of grid configurations
Returns
- configs : list
list with sets of random parameter from grid
Notes
Duplicates are filtered out, so the result can contain fewer than n_trails configurations.
Examples
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.get_random_configs(3)
[Configuration(values={
  'C': 1.0,
  'penalty': 'l2',
  'solver': 'lbfgs',
}),
Configuration(values={
  'C': 2.5378155082656657,
  'penalty': 'l2',
  'solver': 'saga',
}),
Configuration(values={
  'C': 2.801635158716261,
  'penalty': 'l2',
  'solver': 'lbfgs',
})]
- static RFC.load_model(path: str)
Function to load a pickled model class object
Parameters
- path : str
path of the saved model (with suffix '.pkl')
Returns
- model : estimator instance
estimator instance
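Examples
A usage sketch; the file name is hypothetical and should point to a model that was previously saved with save_model:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC.load_model("my_rfc_model.pkl")  # hypothetical path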
- RFC.predict(x_test: DataFrame) -> list
Function to predict with predict-method from model object
Parameters
- x_test : pd.DataFrame
Data for prediction
Returns
- prediction : list
list with predicted class numbers for data
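Examples
A minimal sketch in the style of the other examples (iris stands in for your own data):
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # fit model and predict
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit(x_train, y_train)
>>> predictions = model.predict(x_test)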
- RFC.predict_proba(x_test: DataFrame) -> ndarray
Function to predict with predict_proba-method from model object
Parameters
- x_test : pd.DataFrame
Data for prediction
Returns
- prediction : np.ndarray
np.ndarray with probability for every class per datapoint
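Examples
A minimal sketch; each row of the returned array holds the class probabilities for one datapoint:
>>> # assuming a model fitted as in the predict example above
>>> probabilities = model.predict_proba(x_test)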
- RFC.randomCVsearch(x_train: DataFrame, y_train: Series, n_trails: int = 10, cv_num: int = 5, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, small_data_eval: bool = False, leave_loadbar: bool = True) -> tuple[dict, float]
Hyperparameter tuning with randomCVsearch
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to cross validate on
- n_trails : int, default=10
max number of parameter sets to test
- cv_num : int, default=5
number of different random splits
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- small_data_eval : bool, default=False
if True: trains the model on all datapoints except one and does this for all datapoints (recommended for datasets with fewer than 150 datapoints)
- leave_loadbar : bool, default=True
whether the loading bar of the different parameter sets shall still be visible after training (True - loading bar stays visible)
Returns
- best_hyperparameters : dict
best hyperparameter set
- best_score : float
the score of the best hyperparameter set
Notes
If you interrupt the run of randomCVsearch from the keyboard (e.g. with ctrl+c), the interim result will be returned.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # initialise model
>>> from sam_ml.models.classifier import LR
>>> model = LR()
>>>
>>> # use randomCVsearch
>>> best_hyperparam, best_score = model.randomCVsearch(X, y, n_trails=20, cv_num=5, scoring="recall")
>>> print(f"best hyperparameters: {best_hyperparam}, best score: {best_score}")
best hyperparameters: {'C': 8.471801418819979, 'penalty': 'l2', 'solver': 'newton-cg'}, best score: 0.765
- RFC.replace_grid(new_grid: ConfigurationSpace)
Function to replace self.grid
See ConfigurationSpace documentation.
Parameters
- new_grid : ConfigurationSpace
new grid to replace the old one with
Returns
changes self.grid variable
Examples
>>> from ConfigSpace import ConfigurationSpace, Categorical, Float
>>> from sam_ml.models.classifier import LDA
>>>
>>> model = LDA()
>>> new_grid = ConfigurationSpace(
...     seed=42,
...     space={
...         "solver": Categorical("solver", ["lsqr", "eigen"]),
...         "shrinkage": Float("shrinkage", (0, 0.5)),
...     })
>>> model.replace_grid(new_grid)
- RFC.save_model(path: str, only_estimator: bool = False)
Function to pickle and save the class object
Parameters
- path : str
path to save the model with suffix ‘.pkl’
- only_estimator : bool, default=False
If True, only the estimator of the class object will be saved
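Examples
A usage sketch; the file names are hypothetical:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.save_model("my_rfc_model.pkl")  # hypothetical path
>>> model.save_model("my_rfc_estimator.pkl", only_estimator=True)  # saves only the wrapped model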
- RFC.set_params(**params)
Function to set the parameter of the model object
Parameters
- **params : dict
Estimator parameters
Returns
- self : estimator instance
Estimator instance
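Examples
A minimal sketch; n_estimators and max_depth are parameters of the wrapped scikit-learn RandomForestClassifier:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.set_params(n_estimators=200, max_depth=10)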
- RFC.smac_search(x_train: DataFrame, y_train: Series, n_trails: int = 50, cv_num: int = 5, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, small_data_eval: bool = False, walltime_limit: int = 600, log_level: int = 20) -> Configuration
Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig]
The smac_search method searches your hyperparameter space more "intelligently" than randomCVsearch and returns the best hyperparameter set. In addition to the n_trails parameter, it also takes a walltime_limit parameter that defines the maximum time in seconds the search may run.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to cross validate on
- n_trails : int, default=50
max number of parameter sets to test
- cv_num : int, default=5
number of different random splits
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- small_data_eval : bool, default=False
if True: trains the model on all datapoints except one and does this for all datapoints (recommended for datasets with fewer than 150 datapoints)
- walltime_limit : int, default=600
the maximum time in seconds that SMAC is allowed to run
- log_level : int, default=20
10 - DEBUG, 20 - INFO, 30 - WARNING, 40 - ERROR, 50 - CRITICAL (SMAC3 library log levels)
Returns
- incumbent : ConfigSpace.Configuration
ConfigSpace.Configuration with best hyperparameters (can be used like dict)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # use smac_search
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> best_hyperparam = model.smac_search(X, y, n_trails=20, cv_num=5, scoring="recall")
[INFO][abstract_initial_design.py:82] Using `n_configs` and ignoring `n_configs_per_hyperparameter`.
[INFO][abstract_initial_design.py:147] Using 2 initial design configurations and 0 additional configurations.
[INFO][abstract_initial_design.py:147] Using 3 initial design configurations and 0 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[INFO][abstract_intensifier.py:515] Added config 12be8a as new incumbent because there are no incumbents yet.
[INFO][abstract_intensifier.py:590] Added config ce10f4 and rejected config 12be8a as incumbent because it is not better than the incumbents on 1 instances:
[INFO][abstract_intensifier.py:590] Added config b35335 and rejected config ce10f4 as incumbent because it is not better than the incumbents on 1 instances:
[INFO][smbo.py:327] Configuration budget is exhausted:
[INFO][smbo.py:328] --- Remaining wallclock time: 590.5625982284546
[INFO][smbo.py:329] --- Remaining cpu time: inf
[INFO][smbo.py:330] --- Remaining trials: 0
>>> print(f"best hyperparameters: {best_hyperparam}")
best hyperparameters: Configuration(values={
  'C': 66.7049177605834,
  'penalty': 'l2',
  'solver': 'lbfgs',
})
- RFC.train(x_train: DataFrame, y_train: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, console_out: bool = True) -> tuple[float, str]
Function to train the model
Every classifier has a train and a fit method. Both use the fit method of the wrapped model, but the train method additionally returns the train time and the train score of the model.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- console_out : bool, default=True
whether the score and time shall be printed out
Returns
- train_score : float
train score value
- train_time : str
train time in format: "0:00:00" (hours:minutes:seconds)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(X, y)
Train score: 0.9891840171120917 - Train time: 0:00:02
- RFC.train_warm_start(x_train: DataFrame, y_train: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, console_out: bool = True) -> tuple[float, str]
Function to warm_start train the model
This method only differs from the train method for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess the data.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- console_out : bool, default=True
whether the score and time shall be printed out
Returns
- train_score : float
train score value
- train_time : str
train time in format: "0:00:00" (hours:minutes:seconds)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train_warm_start(X, y)
Train score: 0.9891840171120917 - Train time: 0:00:02