SamplerPipeline

class SamplerPipeline(self, algorithm: str | list[sam_ml.data.preprocessing.sampling.Sampler] = 'SMOTE_rus_20_50')

Class that applies multiple up- and down-sampling algorithms in sequence instead of only one. Parent class: Data

Parameters

algorithm: str or list[Sampler], default="SMOTE_rus_20_50"

algorithm format:

  • "A1_A2_..._An_x1_x2_..._xn", where A1, A2, ... are Sampler algorithms and x1, x2, ... are their sampling_strategy values in percent

    First, Sampler A1 is applied to the data with sampling_strategy x1%, then Sampler A2 with sampling_strategy x2%, and so on up to Sampler An with sampling_strategy xn%. (This format ONLY works for binary data!)

    Note: sampling_strategy is the size of the minority class as a percentage of the size of the majority class

    Examples (for binary classification):

    1. ros_rus_10_50: RandomOverSampler for minority class to 10% of majority class and then RandomUnderSampler for majority class to 2*minority class

    2. SMOTE_rus_20_50: SMOTE for minority class to 20% of majority class and then RandomUnderSampler for majority class to 2*minority class

  • list[Sampler]: use each Sampler in list one after the other on data
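
The string format described above can be read as "first half names, second half percentages". The following is a minimal sketch of that reading, not the parsing code used by sam_ml; the helper name parse_algorithm is hypothetical, and it assumes that no sampler name contains an underscore.

```python
def parse_algorithm(algorithm: str) -> list[tuple[str, float]]:
    """Split "A1_..._An_x1_..._xn" into (sampler name, sampling_strategy) pairs.

    Hypothetical helper for illustration only; assumes sampler names
    contain no underscore and percentages are whole numbers.
    """
    parts = algorithm.split("_")
    if len(parts) % 2 != 0:
        raise ValueError("expected as many percentages as sampler names")
    half = len(parts) // 2
    names, percents = parts[:half], parts[half:]
    # A percentage such as "20" becomes the fraction 0.2.
    return [(name, int(pct) / 100) for name, pct in zip(names, percents)]


print(parse_algorithm("SMOTE_rus_20_50"))
```

The resulting pairs correspond to the Sampler(algorithm='SMOTE', sampling_strategy=0.2) and Sampler(algorithm='rus', sampling_strategy=0.5) entries shown in the class example.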

Attributes

algorithm: str

name of the used algorithm

transformer: transformer instance

transformer instance (e.g. StandardScaler)

Example

>>> from sam_ml.data.preprocessing import SamplerPipeline
>>>
>>> model = SamplerPipeline()
>>> print(model)
SamplerPipeline(Sampler(algorithm='SMOTE', sampling_strategy=0.2, ), Sampler(algorithm='rus', sampling_strategy=0.5, ))

Methods

Method        Description
get_params    Function to get the parameters from the transformer instance
params        Function to get the recommended parameter values for the class
sample        Function for up- and downsampling
set_params    Function to set the parameters of the transformer instance

SamplerPipeline.get_params(deep: bool = True)

Function to get the parameters from the transformer instance

Parameters

deep: bool, default=True

If True, will return the parameters for this estimator and contained sub-objects that are estimators

Returns

params: dict

parameter names mapped to their values

static SamplerPipeline.params() -> dict

Function to get the recommended parameter values for the class

Returns

param: dict

recommended values for the parameter "algorithm"

Examples

>>> # get possible parameters
>>> from sam_ml.data.preprocessing import SamplerPipeline
>>>
>>> # first way without class object
>>> params1 = SamplerPipeline.params()
>>> print(params1)
{"algorithm": ["SMOTE_rus_20_50", ...]}
>>> # second way with class object
>>> model = SamplerPipeline()
>>> params2 = model.params()
>>> print(params2)
{"algorithm": ["SMOTE_rus_20_50", ...]}

SamplerPipeline.sample(x_train: DataFrame, y_train: Series) -> tuple[DataFrame, Series]

Function for up- and downsampling

Parameters

x_train, y_train: pd.DataFrame, pd.Series

data to sample

Returns

x_train_sampled: pd.DataFrame

sampled x data

y_train_sampled: pd.Series

sampled y data

Notes

ONLY sample the train data. NEVER sample all data: after a random split, copies of the same samples would then appear in both the train and the test data
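
The effect described in this note can be reproduced without any sampling library. The sketch below is an illustration under stated assumptions: it uses plain random duplication as a stand-in for an up-sampler (it is not SMOTE and not sam_ml code), and the helpers oversample and split as well as the row labels are hypothetical.

```python
import random


def oversample(rows, target, rng):
    """Duplicate random rows until there are `target` rows (stand-in up-sampler)."""
    return rows + rng.choices(rows, k=target - len(rows))


def split(rows, train_share, rng):
    """Shuffle a copy of the rows and cut it into a train and a test part."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(42)
minority = [f"min_{i}" for i in range(10)]
majority = [f"maj_{i}" for i in range(100)]

# Wrong order: sample all data, then split. Duplicates land on both sides.
train, test = split(majority + oversample(minority, 100, rng), 0.8, rng)
leaked_wrong = len(set(train) & set(test))

# Right order: split first, then sample only the train part.
train, test = split(majority + minority, 0.8, rng)
train_min = [row for row in train if row.startswith("min_")]
train_maj = [row for row in train if row.startswith("maj_")]
train = train_maj + oversample(train_min, 100, rng)
leaked_right = len(set(train) & set(test))

print("rows shared by train and test, sample then split:", leaked_wrong)
print("rows shared by train and test, split then sample:", leaked_right)
```

In the second ordering the shared count is always zero, because every duplicate is created from rows that are already in the train part.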

Examples

>>> # load data (replace with own data)
>>> import pandas as pd
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> X, y = make_classification(n_samples=3000, n_features=4, n_classes=2, weights=[0.9], random_state=42)
>>> X, y = pd.DataFrame(X, columns=["col1", "col2", "col3", "col4"]), pd.Series(y)
>>> x_train, x_test, y_train, y_test = train_test_split(X,y, train_size=0.80, random_state=42)
>>> 
>>> # sample data
>>> from sam_ml.data.preprocessing import SamplerPipeline
>>> model = SamplerPipeline()
>>> x_train_sampled, y_train_sampled = model.sample(x_train, y_train)
>>> print("before sampling:")
>>> print(y_train.value_counts())
>>> print()
>>> print("after sampling:")
>>> print(y_train_sampled.value_counts())
before sampling:
0    2140
1     260
Name: count, dtype: int64

after sampling:
0    856
1    428
Name: count, dtype: int64

SamplerPipeline.set_params(**params)

Function to set the parameters of the transformer instance

Parameters

**params: dict

Estimator parameters

Returns

self: estimator instance

Estimator instance