Embeddings_builder

class Embeddings_builder(algorithm: Literal['bert', 'count', 'tfidf'] = 'tfidf', **kwargs)

Vectorizer wrapper class (parent class: Data)

Parameters

algorithm: {"bert", "count", "tfidf"}, default="tfidf": which vectorizing algorithm to use:
- "count": CountVectorizer
- "tfidf": TfidfVectorizer (default)
- "bert": SentenceTransformer("quora-distilbert-multilingual")
**kwargs: additional parameters for CountVectorizer or TfidfVectorizer

Attributes

algorithm: str: name of the used algorithm
transformer: transformer instance: transformer instance (e.g. StandardScaler)

Example

>>> from sam_ml.data.preprocessing import Embeddings_builder
>>>
>>> model = Embeddings_builder()
>>> print(model)
Embeddings_builder()

Methods

Method	Description
`create_parallel_bert_embeddings`	Function to create embeddings of the given strings in parallel with the BERT model
`get_params`	Function to get the parameters from the transformer instance
`params`	Function to get the possible parameter values for the class
`set_params`	Function to set the parameters of the transformer instance
`vectorize`	Function to vectorize text data column

Embeddings_builder.create_parallel_bert_embeddings(content: list[str]) → list

Function to create embeddings of the given strings in parallel with the BERT model

Parameters

content: list[str]: list of strings that shall be embedded

Returns

content_embeddings: list: list of embedding vectors from content strings
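
The source does not say how the embeddings are parallelized, so the following is only a sketch of the general pattern, not the library's implementation. It uses a thread pool from the standard library, and `fake_embed` and `embed_in_parallel` are hypothetical names: `fake_embed` stands in for the real model call and returns two trivial numeric features instead of a real embedding.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a sentence-embedding model call."""
    # Not a real embedding: just two cheap numeric features of the string.
    return [float(len(text)), float(len(text.split()))]


def embed_in_parallel(content: list[str], max_workers: int = 4) -> list[list[float]]:
    """Embed every string concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(fake_embed, content))


content_embeddings = embed_in_parallel(["Hallo world!", "Goodbye Island"])
print(len(content_embeddings))  # one vector per input string
```

As in the documented method, the result is a list with one vector per input string, in the same order as `content`.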

Embeddings_builder.get_params(deep: bool = True) → dict

Function to get the parameters from the transformer instance

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained sub-objects that are estimators

Returns

params: dict: parameter names mapped to their values

static Embeddings_builder.params() → dict

Function to get the possible parameter values for the class

Returns

param: dict: possible values for the parameter “algorithm”

Examples

>>> # get possible parameters
>>> from sam_ml.data.preprocessing import Embeddings_builder
>>>
>>> # first way without class object
>>> params1 = Embeddings_builder.params()
>>> print(params1)
{"algorithm": ["tfidf", ...]}
>>> # second way with class object
>>> model = Embeddings_builder()
>>> params2 = model.params()
>>> print(params2)
{"algorithm": ["tfidf", ...]}

Embeddings_builder.set_params(**params)

Function to set the parameters of the transformer instance

Parameters

**params: dict: Estimator parameters

Returns

self: estimator instance: Estimator instance
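
Both `get_params` and `set_params` are described as acting on the transformer instance. A minimal sketch of that delegation pattern might look as follows. `ToyTransformer` and `ToyWrapper` are hypothetical classes written only for this illustration; they are not part of sam_ml, and the real class may differ in detail.

```python
from typing import Optional


class ToyTransformer:
    """Hypothetical transformer with two settings, standing in for a vectorizer."""

    def __init__(self, lowercase: bool = True, max_features: Optional[int] = None):
        self.lowercase = lowercase
        self.max_features = max_features

    def get_params(self, deep: bool = True) -> dict:
        return {"lowercase": self.lowercase, "max_features": self.max_features}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class ToyWrapper:
    """Hypothetical wrapper that forwards parameter access to its transformer."""

    def __init__(self, **kwargs):
        # Extra keyword arguments configure the wrapped transformer.
        self.transformer = ToyTransformer(**kwargs)

    def get_params(self, deep: bool = True) -> dict:
        return self.transformer.get_params(deep)

    def set_params(self, **params):
        self.transformer.set_params(**params)
        return self  # returning self allows chained calls


wrapper = ToyWrapper(max_features=100)
wrapper.set_params(lowercase=False)
print(wrapper.get_params())
```

The wrapper stores no parameters of its own in this sketch, so the values read back always reflect the current state of the wrapped transformer.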

Embeddings_builder.vectorize(data: Series, train_on: bool = True) → DataFrame

Function to vectorize text data column

Parameters

data: pd.Series: column with text to vectorize
train_on: bool, default=True: If True, the estimator instance will be trained to build embeddings and then vectorize. Otherwise, it uses the trained instance for vectorizing.

Returns

emb_df: pd.DataFrame: pandas DataFrame with vectorized data

Examples

>>> import pandas as pd
>>> x_train = pd.Series(["Hallo world!", "Goodbye Island", "Greetings Berlin"], name="text")
>>> x_test = pd.Series(["Goodbye world!", "Greetings Island"], name="text")
>>> 
>>> # vectorize data
>>> from sam_ml.data.preprocessing import Embeddings_builder
>>> 
>>> model = Embeddings_builder()
>>> x_train = model.vectorize(x_train) # train vectorizer
>>> x_test = model.vectorize(x_test, train_on=False) # vectorize test data
>>> print("x_train:")
>>> print(x_train)
>>> print()
>>> print("x_test:")
>>> print(x_test)
x_train:
    0_text    1_text    2_text    3_text    4_text    5_text
0   0.000000  0.000000  0.000000  0.707107  0.000000  0.707107
1   0.000000  0.707107  0.000000  0.000000  0.707107  0.000000
2   0.707107  0.000000  0.707107  0.000000  0.000000  0.000000

x_test:
    0_text  1_text    2_text    3_text  4_text    5_text
0   0.0     0.707107  0.000000  0.0     0.000000  0.707107
1   0.0     0.000000  0.707107  0.0     0.707107  0.000000
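
The effect of `train_on` can be illustrated with a toy word-count vectorizer written in plain Python. This is a simplified sketch, not the library's implementation: `ToyCountVectorizer` is a hypothetical class, it counts words instead of computing TF-IDF weights, and it returns nested lists rather than a DataFrame. The point it shows is that the vocabulary is learned only when `train_on=True` and is reused unchanged otherwise, so train and test data share the same columns.

```python
import re


class ToyCountVectorizer:
    """Hypothetical, simplified stand-in illustrating the train_on flag."""

    def __init__(self):
        self.vocabulary: list[str] = []

    @staticmethod
    def _tokens(text: str) -> list[str]:
        # Lowercase the text and keep alphabetic runs only.
        return re.findall(r"[a-z]+", text.lower())

    def vectorize(self, texts: list[str], train_on: bool = True) -> list[list[int]]:
        if train_on:
            # Learn the vocabulary from this data (sorted for stable columns).
            self.vocabulary = sorted(
                {tok for text in texts for tok in self._tokens(text)}
            )
        rows = []
        for text in texts:
            tokens = self._tokens(text)
            # Words outside the learned vocabulary are ignored.
            rows.append([tokens.count(word) for word in self.vocabulary])
        return rows


model = ToyCountVectorizer()
train_rows = model.vectorize(["Hallo world!", "Goodbye Island", "Greetings Berlin"])
test_rows = model.vectorize(["Goodbye world!", "Greetings Island"], train_on=False)
print(model.vocabulary)
print(train_rows)
print(test_rows)
```

The non-zero positions in `train_rows` and `test_rows` match the non-zero columns in the output shown above, because both use the same alphabetically sorted six-word vocabulary.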