Pipeline Factory

The create_pipeline function dynamically builds a machine learning pipeline around the given model. All methods of the model, including model-specific ones such as plot_tree from DTC, remain available on the pipeline. It works with both Classifier and Regressor models.

sam_ml.models.create_pipeline(model: Classifier | Regressor, vectorizer: str | Embeddings_builder | None = None, scaler: str | Scaler | None = None, selector: str | tuple[str, int] | Selector | None = None, sampler: str | Sampler | SamplerPipeline | None = None, model_name: str = 'pipe') -> BasePipeline

Parameters

model : Classifier or Regressor class object

Model used in pipeline (Classifier or Regressor)

vectorizer : str, Embeddings_builder, or None

Embeddings_builder object, or the name of an Embeddings_builder algorithm, used to vectorize string columns automatically (None for no vectorizing)

scaler : str, Scaler, or None

Scaler object, or the name of a Scaler algorithm, used to scale the data (None for no scaling)

selector : str, tuple[str, int], Selector, or None

Selector object, or the name of a Selector algorithm, used for feature selection (None for no selection)

sampler : str, Sampler, SamplerPipeline, or None

Sampler / SamplerPipeline object, or the name of a sampling algorithm, used to resample the training data (None for no sampling). For Regressor models this must always be None; sampler support for regressors is planned for a future release.

model_name : str

name of the model
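
The sampler parameter is described as acting on the train data only. The sketch below illustrates that idea in plain Python and is not the sam_ml implementation: undersample, TrainOnlySamplingPipeline, and the toy "model" are hypothetical names invented for this illustration. The point is that resampling happens inside train(), while predict() returns exactly one prediction per input row.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=42):
    """Hypothetical sampler: shrink every class to the size of the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_keep = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_keep):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


class TrainOnlySamplingPipeline:
    """Hypothetical pipeline: the sampler runs in train() and never in predict()."""

    def train(self, rows, labels):
        # Resample the training data before fitting the toy model.
        rows, labels = undersample(rows, labels)
        self.n_train_rows = len(rows)
        # Toy "model": remember the per-class mean of the first feature.
        sums, counts = Counter(), Counter()
        for row, label in zip(rows, labels):
            sums[label] += row[0]
            counts[label] += 1
        self.centers = {label: sums[label] / counts[label] for label in counts}
        return self

    def predict(self, rows):
        # No resampling here: one prediction per input row.
        return [
            min(self.centers, key=lambda c: abs(row[0] - self.centers[c]))
            for row in rows
        ]


# 9 rows of class 0 and 1 row of class 1 (imbalanced on purpose)
x_train = [[float(i)] for i in range(9)] + [[100.0]]
y_train = [0] * 9 + [1]

pipe = TrainOnlySamplingPipeline().train(x_train, y_train)
preds = pipe.predict([[1.0], [2.0], [99.0]])
print(pipe.n_train_rows, len(preds))
```

Keeping the resampling out of predict() means test data is scored with its original class distribution, which is why evaluation metrics on an imbalanced test set stay meaningful.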

Returns

DynamicPipeline object, which inherits from both the model's parent class and BasePipeline
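
As a rough illustration of how a returned object can inherit from both a pipeline base and the model's parent class while still exposing model-specific methods, here is a minimal sketch. It is an assumption about one possible design, not the actual sam_ml code: the Classifier, DTC, and BasePipeline classes below are simplified stand-ins, and make_pipeline is a hypothetical helper.

```python
class Classifier:
    """Hypothetical parent class shared by all classifier wrappers."""

    def train(self, x, y) -> str:
        return "trained"


class DTC(Classifier):
    """Hypothetical model with one model-specific method."""

    def plot_tree(self) -> str:
        return "tree plot"


class BasePipeline:
    """Hypothetical base: stores the wrapped model and forwards unknown attributes."""

    def __init__(self, model):
        self._model = model

    def __getattr__(self, name):
        # Only called when normal lookup fails, so inherited methods win.
        if name == "_model":
            # Avoid infinite recursion if _model has not been set yet.
            raise AttributeError(name)
        return getattr(self._model, name)


def make_pipeline(model):
    """Create a class at runtime from BasePipeline and the model's parent class."""
    parent = type(model).__base__
    dynamic_cls = type("DynamicPipeline", (BasePipeline, parent), {})
    return dynamic_cls(model)


pipe = make_pipeline(DTC())
print(type(pipe).__name__)
print(isinstance(pipe, Classifier), isinstance(pipe, BasePipeline))
print(pipe.plot_tree())
```

In this sketch, methods defined on the parent class (such as train) are resolved through normal inheritance, while model-specific methods (such as plot_tree) are reached through attribute forwarding to the wrapped model.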

Examples

>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, weights=[0.9], random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=42)
>>> 
>>> # train and evaluate model pipeline
>>> from sam_ml.models import create_pipeline
>>> from sam_ml.models.classifier import LR
>>>
>>> model = create_pipeline(LR(), scaler="standard", sampler="SMOTE_rus_20_50")
>>> model.train(x_train, y_train)
>>> scores = model.evaluate(x_test, y_test)
Train score: 0.9625 - Train time: 0:00:00
accuracy: 0.9583333333333334
precision: 0.8563762626262625
recall: 0.9377241446156828
s_score: 0.9603691957893064
l_score: 0.9989822522866367

classification report: 
                precision   recall  f1-score    support

        0       0.99        0.96    0.98        543
        1       0.72        0.91    0.81        57

accuracy                            0.96        600
macro avg       0.86        0.94    0.89        600
weighted avg    0.97        0.96    0.96        600