RandomForestClassifier (RFC)
- class RFC(model_name: str = 'RandomForestClassifier', n_jobs: int = -1, random_state: int = 42, **kwargs)
Wrapper class for scikit-learn's RandomForestClassifier - parent class: Classifier
Parameters
- model_name : str, default='RandomForestClassifier'
name of the model
- n_jobs : int, default=-1
number of jobs to run in parallel (-1 means using all processors)
- random_state : int, default=42
random state for reproducible results
- **kwargs:
parameters of the wrapped RandomForestClassifier model
Attributes
- cv_scores : dict
scores of the last cross validation (set by the cross validation methods)
- grid : ConfigurationSpace
hyperparameter grid of the model (see replace_grid)
Note
You can use all parameters of the wrapped model when initialising the wrapper class.
Example
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> print(model)
RFC(model_name='RandomForestClassifier')
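Since all parameters of the wrapped model are passed through (see the Note above), an initialisation like the following also works; n_estimators and max_depth are parameters of the wrapped scikit-learn RandomForestClassifier:
>>> model = RFC(n_estimators=200, max_depth=10)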
Methods

| Method | Description |
|---|---|
| cross_validation | Random split cross validation |
| cross_validation_small_data | One-vs-all cross validation for small datasets |
| evaluate | Function to create multiple scores with predict function of model |
| evaluate_proba | Function to create multiple scores for binary classification with predict_proba function of model |
| evaluate_score | Function to create a score with predict function of model |
| evaluate_score_proba | Function to create a score for binary classification with predict_proba function of model |
| feature_importance | Function to generate a matplotlib plot of the top 45 feature importances of the model |
| fit | Function to fit the model |
| fit_warm_start | Function to warm_start fit the model |
| get_deepcopy | Function to create a deepcopy of object |
| get_params | Function to get the parameters from the model object |
| get_random_config | Function to generate one grid configuration |
| get_random_configs | Function to generate grid configurations |
| load_model | Function to load a pickled model class object |
| predict | Function to predict with predict-method from model object |
| predict_proba | Function to predict with predict_proba-method from model object |
| randomCVsearch | Hyperparameter tuning with randomCVsearch |
| replace_grid | Function to replace self.grid |
| save_model | Function to pickle and save the class object |
| set_params | Function to set the parameters of the model object |
| smac_search | Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig] |
| train | Function to train the model |
| train_warm_start | Function to warm_start train the model |
Note
Many methods use parameters for advanced scoring. For additional information on advanced scoring, see the scoring documentation.
- RFC.cross_validation(X: DataFrame, y: Series, cv_num: int = 10, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
Random split cross validation
Parameters
- X, y : pd.DataFrame, pd.Series
Data to cross validate on
- cv_num : int, default=10
number of different random splits
- console_out : bool, default=True
whether the result dataframe of the different scores for the different runs shall be printed
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …, 'custom_score': …}
The scores are also saved in self.cv_scores.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # cross validate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> scores = model.cross_validation(X, y, cv_num=3)
                                0         1         2   average
fit_time                 1.194662  1.295036  1.210156  1.233285
score_time               0.167266  0.149569  0.173546  0.163460
test_precision (macro)   0.779381  0.809037  0.761263  0.783227
train_precision (macro)  0.951738  0.947397  0.943044  0.947393
test_recall (macro)      0.774488  0.800144  0.761423  0.778685
train_recall (macro)     0.948928  0.943901  0.940066  0.944298
test_accuracy            0.776978  0.803121  0.762305  0.780802
train_accuracy           0.950180  0.945411  0.941212  0.945601
test_s_score             0.923052  0.937806  0.917214  0.926024
train_s_score            0.990794  0.990162  0.989660  0.990206
test_l_score             0.998393  0.998836  0.998575  0.998602
train_l_score            1.000000  1.000000  1.000000  1.000000
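To illustrate the custom_score parameter, here is a sketch using scikit-learn's f1_score; any callable with the signature score_func(y, y_pred, **kwargs) should work, and its value appears under the "custom_score" key of the returned dictionary:
>>> from sklearn.metrics import f1_score
>>>
>>> # custom score function with the required signature
>>> def macro_f1(y, y_pred, **kwargs):
...     return f1_score(y, y_pred, average="macro")
>>>
>>> scores = model.cross_validation(X, y, cv_num=3, custom_score=macro_f1)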
- RFC.cross_validation_small_data(X: DataFrame, y: Series, leave_loadbar: bool = True, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
One-vs-all cross validation for small datasets
In the cross_validation_small_data method, the model is trained on all datapoints except one and then tested on the held-out one. This is repeated for every datapoint, so that predictions are obtained for the whole dataset.
Advantage: optimal use of the available information for training
Disadvantage: long training time
This concept is very useful for small datasets (recommended: fewer than 150 datapoints) because the training time remains manageable, and with little information available to the model it is important to use all of it for training.
Parameters
- X, y : pd.DataFrame, pd.Series
Data to cross validate on
- leave_loadbar : bool, default=True
whether the loading bar of the training shall still be visible after training (True - loading bar stays visible)
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed into the console
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'train_time': …, 'train_score': …, 'custom_score': …}
The scores are also saved in self.cv_scores.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # cross validate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> scores = model.cross_validation_small_data(X, y)
accuracy: 0.7
precision: 0.7747221430607011
recall: 0.672883787661406
s_score: 0.40853182756324635
l_score: 0.7812935895658734
train_time: 0:00:00
train_score: 0.9946286670687757

classification report:
              precision    recall  f1-score   support

           0       0.65      0.96      0.78        82
           1       0.90      0.38      0.54        68

    accuracy                           0.70       150
   macro avg       0.77      0.67      0.66       150
weighted avg       0.76      0.70      0.67       150
- RFC.evaluate(x_test: DataFrame, y_test: Series, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None) -> dict[str, float]
Function to create multiple scores with predict function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed into the console
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'custom_score': …}
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(x_train, y_train)
Train score: 0.9891840171120917 - Train time: 0:00:02
>>> scores = model.evaluate(x_test, y_test)
accuracy: 0.802
precision: 0.8030604133545309
recall: 0.7957575757575757
s_score: 0.9395778023942218
l_score: 0.9990945415060262

classification report:
              precision    recall  f1-score   support

           0       0.81      0.73      0.77       225
           1       0.80      0.86      0.83       275

    accuracy                           0.80       500
   macro avg       0.80      0.80      0.80       500
weighted avg       0.80      0.80      0.80       500
- RFC.evaluate_proba(x_test: DataFrame, y_test: Series, console_out: bool = True, avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, custom_score: Callable[[list[int], list[int]], float] | None = None, probability: float = 0.5) -> dict[str, float]
Function to create multiple scores for binary classification with predict_proba function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- console_out : bool, default=True
whether the result of the different scores and a classification_report shall be printed. Also prints stats for the probabilities
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- custom_score : callable or None, default=None
custom score function (or loss function) with signature score_func(y, y_pred, **kwargs). If None, no custom score will be calculated and the key "custom_score" will not appear in the returned dictionary.
- probability : float (0 to 1), default=0.5
probability threshold for class 1 (with the value 0.5, this behaves like the evaluate method). Increasing the probability parameter will likely increase precision and decrease recall (decreasing it has the opposite effect).
Returns
- scores : dict
dictionary of format:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …}
or if custom_score != None:
{'accuracy': …, 'precision': …, 'recall': …, 's_score': …, 'l_score': …, 'custom_score': …}
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(x_train, y_train)
Train score: 0.9775 - Train time: 0:00:00
>>> scores = model.evaluate_proba(x_test, y_test, probability=0.4)
accuracy: 0.9733333333333334
precision: 0.9728695961572674
recall: 0.9742405994915028
s_score: 0.9930964441542017
l_score: 0.9999999991441061
min proba: 5.126066053780961e-12
max proba: 0.9999731025066587
mean proba: 0.4701783612343521
median proba: 0.11068735707926472
std proba: 0.474678546763958

classification report:
              precision    recall  f1-score   support

           0       0.99      0.96      0.97       318
           1       0.96      0.99      0.97       282

    accuracy                           0.97       600
   macro avg       0.97      0.97      0.97       600
weighted avg       0.97      0.97      0.97       600
- RFC.evaluate_score(x_test: DataFrame, y_test: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3) -> float
Function to create a score with predict function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
Returns
- score : float
metrics score value
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.fit(x_train, y_train)
>>> recall = model.evaluate_score(x_test, y_test, scoring="recall")
>>> print(f"recall: {recall}")
recall: 0.4
- RFC.evaluate_score_proba(x_test: DataFrame, y_test: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, probability: float = 0.5) -> float
Function to create a score for binary classification with predict_proba function of model
Parameters
- x_test, y_test : pd.DataFrame, pd.Series
Data to evaluate model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- probability : float (0 to 1), default=0.5
probability threshold for class 1 (with the value 0.5, this behaves like the evaluate_score method). Increasing the probability parameter will likely increase precision and decrease recall (decreasing it has the opposite effect).
Returns
- score : float
metrics score value
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # train and evaluate model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.fit(x_train, y_train)
>>> recall = model.evaluate_score_proba(x_test, y_test, scoring="recall", probability=0.4)
>>> print(f"recall: {recall}")
recall: 0.9742405994915028
- RFC.feature_importance() -> plt.show
Function to generate a matplotlib plot of the top 45 feature importances of the model. This method can only be used after the model has been trained.
Returns
plt.show object
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train and plot features of model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(X, y)
>>> model.feature_importance()
- RFC.fit(x_train: DataFrame, y_train: Series, **kwargs)
Function to fit the model
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- **kwargs:
additional parameters from child-class for fit method
Returns
- self : estimator instance
Estimator instance
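Examples
A minimal sketch in the style of the other examples (iris stands in for your own data; fit returns the estimator instance):
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # fit model
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit(X, y)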
- RFC.fit_warm_start(x_train: DataFrame, y_train: Series, **kwargs)
Function to warm_start fit the model
This method only differs from the fit method for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess the data.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- **kwargs:
additional parameters from child-class for fit method
Returns
- self : estimator instance
Estimator instance
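Examples
A minimal sketch (for a plain model wrapper like RFC this behaves like fit; the difference only appears for pipeline objects with preprocessing):
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit_warm_start(X, y)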
- RFC.get_deepcopy()
Function to create a deepcopy of object
Returns
- self : estimator instance
deepcopy of estimator instance
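Examples
A minimal sketch; the returned copy is independent of the original object:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model_copy = model.get_deepcopy()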
- RFC.get_params(deep: bool = True) -> dict
Function to get the parameter from the model object
Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained sub-objects that are estimators
Returns
- params : dict
parameter names mapped to their values
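Examples
A minimal sketch; the returned dictionary should contain the parameters of the wrapped model object:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> params = model.get_params()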
- RFC.get_random_config() -> dict
Function to generate one grid configuration
Returns
- config : dict
dictionary of random parameter configuration from grid
Examples
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.get_random_config()
{'C': 0.31489116479568624, 'penalty': 'elasticnet', 'solver': 'saga', 'l1_ratio': 0.6026718993550663}
- RFC.get_random_configs(n_trails: int) -> list[dict]
Function to generate grid configurations
Parameters
- n_trails : int
number of grid configurations
Returns
- configs : list
list with sets of random parameter from grid
Notes
Duplicates are filtered out, so the result can contain fewer than n_trails configurations.
Examples
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.get_random_configs(3)
[Configuration(values={
  'C': 1.0,
  'penalty': 'l2',
  'solver': 'lbfgs',
}),
Configuration(values={
  'C': 2.5378155082656657,
  'penalty': 'l2',
  'solver': 'saga',
}),
Configuration(values={
  'C': 2.801635158716261,
  'penalty': 'l2',
  'solver': 'lbfgs',
})]
- static RFC.load_model(path: str)
Function to load a pickled model class object
Parameters
- path : str
path of the saved model (with suffix '.pkl')
Returns
- model : estimator instance
estimator instance
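Examples
A usage sketch; the file name is hypothetical and should point to a model that was previously saved with save_model:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC.load_model("my_rfc_model.pkl")  # hypothetical path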
- RFC.predict(x_test: DataFrame) -> list
Function to predict with predict-method from model object
Parameters
- x_test : pd.DataFrame
Data for prediction
Returns
- prediction : list
list with predicted class numbers for data
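Examples
A minimal sketch in the style of the other examples (iris stands in for your own data):
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>>
>>> # fit model and predict
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.fit(x_train, y_train)
>>> predictions = model.predict(x_test)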
- RFC.predict_proba(x_test: DataFrame) -> ndarray
Function to predict with predict_proba-method from model object
Parameters
- x_test : pd.DataFrame
Data for prediction
Returns
- prediction : np.ndarray
np.ndarray with probability for every class per datapoint
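Examples
A minimal sketch; each row of the returned array holds the class probabilities for one datapoint:
>>> # assuming a model fitted as in the predict example above
>>> probabilities = model.predict_proba(x_test)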
- RFC.randomCVsearch(x_train: DataFrame, y_train: Series, n_trails: int = 10, cv_num: int = 5, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, small_data_eval: bool = False, leave_loadbar: bool = True) -> tuple[dict, float]
Hyperparameter tuning with randomCVsearch
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to cross validate on
- n_trails : int, default=10
max number of parameter sets to test
- cv_num : int, default=5
number of different random splits
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- small_data_eval : bool, default=False
if True: trains the model on all datapoints except one and does this for all datapoints (recommended for datasets with fewer than 150 datapoints)
- leave_loadbar : bool, default=True
whether the loading bar of the different parameter sets shall still be visible after training (True - loading bar stays visible)
Returns
- best_hyperparameters : dict
best hyperparameter set
- best_score : float
the score of the best hyperparameter set
Notes
If you interrupt the run of randomCVsearch from the keyboard (e.g. with ctrl+c), the interim result will be returned.
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # initialise model
>>> from sam_ml.models.classifier import LR
>>> model = LR()
>>>
>>> # use randomCVsearch
>>> best_hyperparam, best_score = model.randomCVsearch(X, y, n_trails=20, cv_num=5, scoring="recall")
>>> print(f"best hyperparameters: {best_hyperparam}, best score: {best_score}")
best hyperparameters: {'C': 8.471801418819979, 'penalty': 'l2', 'solver': 'newton-cg'}, best score: 0.765
- RFC.replace_grid(new_grid: ConfigurationSpace)
Function to replace self.grid
See ConfigurationSpace documentation.
Parameters
- new_grid : ConfigurationSpace
new grid to replace the old one with
Returns
changes self.grid variable
Examples
>>> from ConfigSpace import ConfigurationSpace, Categorical, Float
>>> from sam_ml.models.classifier import LDA
>>>
>>> model = LDA()
>>> new_grid = ConfigurationSpace(
...     seed=42,
...     space={
...         "solver": Categorical("solver", ["lsqr", "eigen"]),
...         "shrinkage": Float("shrinkage", (0, 0.5)),
...     })
>>> model.replace_grid(new_grid)
- RFC.save_model(path: str, only_estimator: bool = False)
Function to pickle and save the class object
Parameters
- path : str
path to save the model with suffix ‘.pkl’
- only_estimator : bool, default=False
If True, only the estimator of the class object will be saved
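Examples
A usage sketch; the file names are hypothetical:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.save_model("my_rfc_model.pkl")  # hypothetical path
>>> model.save_model("my_rfc_estimator.pkl", only_estimator=True)  # saves only the wrapped model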
- RFC.set_params(**params)
Function to set the parameter of the model object
Parameters
- **params : dict
Estimator parameters
Returns
- self : estimator instance
Estimator instance
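Examples
A minimal sketch; n_estimators and max_depth are parameters of the wrapped scikit-learn RandomForestClassifier:
>>> from sam_ml.models.classifier import RFC
>>>
>>> model = RFC()
>>> model.set_params(n_estimators=200, max_depth=10)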
- RFC.smac_search(x_train: DataFrame, y_train: Series, n_trails: int = 50, cv_num: int = 5, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, small_data_eval: bool = False, walltime_limit: int = 600, log_level: int = 20) -> Configuration
Hyperparameter tuning with SMAC library HyperparameterOptimizationFacade [can only be used in the sam_ml version with swig]
The smac_search method searches your hyperparameter space more "intelligently" than randomCVsearch and returns the best hyperparameter set. In addition to the n_trails parameter, it also takes a walltime_limit parameter that defines the maximum time in seconds the search may run.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to cross validate on
- n_trails : int, default=50
max number of parameter sets to test
- cv_num : int, default=5
number of different random splits
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- small_data_eval : bool, default=False
if True: trains the model on all datapoints except one and does this for all datapoints (recommended for datasets with fewer than 150 datapoints)
- walltime_limit : int, default=600
the maximum time in seconds that SMAC is allowed to run
- log_level : int, default=20
10 - DEBUG, 20 - INFO, 30 - WARNING, 40 - ERROR, 50 - CRITICAL (SMAC3 library log levels)
Returns
- incumbent : ConfigSpace.Configuration
ConfigSpace.Configuration with best hyperparameters (can be used like dict)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # use smac_search
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> best_hyperparam = model.smac_search(X, y, n_trails=20, cv_num=5, scoring="recall")
[INFO][abstract_initial_design.py:82] Using `n_configs` and ignoring `n_configs_per_hyperparameter`.
[INFO][abstract_initial_design.py:147] Using 2 initial design configurations and 0 additional configurations.
[INFO][abstract_initial_design.py:147] Using 3 initial design configurations and 0 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[INFO][abstract_intensifier.py:515] Added config 12be8a as new incumbent because there are no incumbents yet.
[INFO][abstract_intensifier.py:590] Added config ce10f4 and rejected config 12be8a as incumbent because it is not better than the incumbents on 1 instances:
[INFO][abstract_intensifier.py:590] Added config b35335 and rejected config ce10f4 as incumbent because it is not better than the incumbents on 1 instances:
[INFO][smbo.py:327] Configuration budget is exhausted:
[INFO][smbo.py:328] --- Remaining wallclock time: 590.5625982284546
[INFO][smbo.py:329] --- Remaining cpu time: inf
[INFO][smbo.py:330] --- Remaining trials: 0
>>> print(f"best hyperparameters: {best_hyperparam}")
best hyperparameters: Configuration(values={
  'C': 66.7049177605834,
  'penalty': 'l2',
  'solver': 'lbfgs',
})
- RFC.train(x_train: DataFrame, y_train: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, console_out: bool = True) -> tuple[float, str]
Function to train the model
Every classifier has a train and a fit method. Both use the fit method of the wrapped model, but the train method additionally returns the train time and the train score of the model.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- console_out : bool, default=True
whether the score and time shall be printed out
Returns
- train_score : float
train score value
- train_time : str
train time in format: "0:00:00" (hours:minutes:seconds)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train(X, y)
Train score: 0.9891840171120917 - Train time: 0:00:02
- RFC.train_warm_start(x_train: DataFrame, y_train: Series, scoring: Literal['accuracy', 'precision', 'recall', 's_score', 'l_score'] | Callable[[list[int], list[int]], float] = 'accuracy', avg: str | None = 'macro', pos_label: int | str = -1, secondary_scoring: Literal['precision', 'recall'] | None = None, strength: int = 3, console_out: bool = True) -> tuple[float, str]
Function to warm_start train the model
This method only differs from the train method for pipeline objects (with preprocessing). For pipeline objects, it trains the preprocessing steps only the first time and afterwards only uses them to preprocess the data.
Parameters
- x_train, y_train : pd.DataFrame, pd.Series
Data to train model
- scoring : {"accuracy", "precision", "recall", "s_score", "l_score"} or callable (custom score), default="accuracy"
metrics to evaluate the models. A custom score function (or loss function) must have the signature score_func(y, y_pred, **kwargs)
- avg : {"micro", "macro", "binary", "weighted"} or None, default="macro"
average to use for precision and recall score. If None, the scores for each class are returned.
- pos_label : int or str, default=-1
if avg="binary", pos_label says which class to score. pos_label is used by s_score/l_score
- secondary_scoring : {"precision", "recall"} or None, default=None
weights the scoring (only for "s_score"/"l_score")
- strength : int, default=3
higher strength means a higher weight for the preferred secondary_scoring/pos_label (only for "s_score"/"l_score")
- console_out : bool, default=True
whether the score and time shall be printed out
Returns
- train_score : float
train score value
- train_time : str
train time in format: "0:00:00" (hours:minutes:seconds)
Examples
>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> df = load_iris()
>>> X, y = pd.DataFrame(df.data, columns=df.feature_names), pd.Series(df.target)
>>>
>>> # train model
>>> from sam_ml.models.classifier import LR
>>>
>>> model = LR()
>>> model.train_warm_start(X, y)
Train score: 0.9891840171120917 - Train time: 0:00:02