Embeddings_builder

class Embeddings_builder(self, algorithm: Literal['bert', 'count', 'tfidf'] = 'tfidf', **kwargs)

Vectorizer wrapper class (parent class: Data)

Parameters

algorithm : {"bert", "count", "tfidf"}, default="tfidf"

which vectorizing algorithm to use:
- "count": CountVectorizer
- "tfidf": TfidfVectorizer (default)
- "bert": SentenceTransformer("quora-distilbert-multilingual")

**kwargs:

additional parameters for CountVectorizer or TfidfVectorizer

Attributes

algorithm : str

name of the used algorithm

transformer : transformer instance

transformer instance (e.g. TfidfVectorizer)

Example

>>> from sam_ml.data.preprocessing import Embeddings_builder
>>>
>>> model = Embeddings_builder()
>>> print(model)
Embeddings_builder()
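
For intuition about the "count" option, the following is a minimal pure-Python sketch of bag-of-words counting. It is an illustration only, assuming that "count" behaves like scikit-learn's CountVectorizer with default settings (lowercased word tokens of two or more characters, alphabetically sorted vocabulary). The helper name `count_vectorize` is hypothetical and not part of sam_ml.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, word tokens of two or more characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_vectorize(docs):
    """Return (vocabulary, rows); each row holds the raw token counts of one document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Sorted vocabulary so that column i always refers to the same token
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing tokens count as 0
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_vectorize(["Goodbye world!", "Goodbye, goodbye Island"])
print(vocabulary)
print(rows)
```

Each column corresponds to one vocabulary token, which is presumably why the wrapper's output columns are numbered per token (0_text, 1_text, ...).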

Methods

Method                             Description
create_parallel_bert_embeddings    Function to create embeddings of the given strings in parallel with the bert model
get_params                         Function to get the parameters of the transformer instance
params                             Function to get the possible parameter values for the class
set_params                         Function to set the parameters of the transformer instance
vectorize                          Function to vectorize a text data column

Embeddings_builder.create_parallel_bert_embeddings(content: list[str]) -> list

Function to create embeddings of the given strings in parallel with the bert model

Parameters

content : list[str]

list of strings that shall be embedded

Returns

content_embeddings : list

list of embedding vectors from content strings
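
The source does not state how the parallelism is implemented. As a rough illustration of the idea, the sketch below maps an embedding function over the input strings with a thread pool. Both `toy_embed` (a stand-in for a real sentence encoder) and `embed_in_parallel` are hypothetical names, and the choice of `ThreadPoolExecutor` is an assumption, not the library's actual mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: returns a tiny fixed-size vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def embed_in_parallel(content, max_workers=4):
    """Embed every string in `content`; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order,
        # regardless of which worker finishes first.
        return list(pool.map(toy_embed, content))


embeddings = embed_in_parallel(["Hallo world!", "Goodbye Island"])
print(len(embeddings))
```

The point of the sketch is the contract rather than the mechanism: one embedding vector per input string, returned in the same order as `content`.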

Embeddings_builder.get_params(deep: bool = True) -> dict

Function to get the parameters of the transformer instance

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained sub-objects that are estimators

Returns

params: dict

parameter names mapped to their values

static Embeddings_builder.params() -> dict

Function to get the possible parameter values for the class

Returns

param : dict

possible values for the parameter “algorithm”

Examples

>>> # get possible parameters
>>> from sam_ml.data.preprocessing import Embeddings_builder
>>>
>>> # first way without class object
>>> params1 = Embeddings_builder.params()
>>> print(params1)
{"algorithm": ["tfidf", ...]}
>>> # second way with class object
>>> model = Embeddings_builder()
>>> params2 = model.params()
>>> print(params2)
{"algorithm": ["tfidf", ...]}

Embeddings_builder.set_params(**params)

Function to set the parameters of the transformer instance

Parameters

**params : dict

Estimator parameters

Returns

self : estimator instance

Estimator instance
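
The get_params / set_params pair follows the familiar scikit-learn parameter convention: get_params returns a dict of parameter names mapped to values, and set_params returns the instance itself so calls can be chained. The sketch below illustrates that contract with a hypothetical stand-in class (`TinyTransformer`); it is not the wrapper's actual implementation, and the parameter names are chosen for illustration only.

```python
class TinyTransformer:
    """Hypothetical stand-in following the scikit-learn parameter convention."""

    def __init__(self, lowercase=True, max_features=None):
        self.lowercase = lowercase
        self.max_features = max_features

    def get_params(self, deep=True):
        # `deep` is accepted for API compatibility; this toy has no sub-estimators.
        return {"lowercase": self.lowercase, "max_features": self.max_features}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chained calls


model = TinyTransformer().set_params(max_features=100)
params = model.get_params()
print(params)
```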

Embeddings_builder.vectorize(data: Series, train_on: bool = True) -> DataFrame

Function to vectorize text data column

Parameters

data : pd.Series

column with text to vectorize

train_on : bool, default=True

If True, the estimator instance will be trained to build embeddings and then vectorize. Otherwise, it uses the trained instance for vectorizing.

Returns

emb_df : pd.DataFrame

pandas DataFrame with vectorized data

Examples

>>> import pandas as pd
>>> x_train = pd.Series(["Hallo world!", "Goodbye Island", "Greetings Berlin"], name="text")
>>> x_test = pd.Series(["Goodbye world!", "Greetings Island"], name="text")
>>> 
>>> # vectorize data
>>> from sam_ml.data.preprocessing import Embeddings_builder
>>> 
>>> model = Embeddings_builder()
>>> x_train = model.vectorize(x_train) # train vectorizer
>>> x_test = model.vectorize(x_test, train_on=False) # vectorize test data
>>> print("x_train:")
>>> print(x_train)
>>> print()
>>> print("x_test:")
>>> print(x_test)
x_train:
    0_text    1_text    2_text    3_text    4_text    5_text
0   0.000000  0.000000  0.000000  0.707107  0.000000  0.707107
1   0.000000  0.707107  0.000000  0.000000  0.707107  0.000000
2   0.707107  0.000000  0.707107  0.000000  0.000000  0.000000

x_test:
    0_text  1_text    2_text    3_text  4_text    5_text
0   0.0     0.707107  0.000000  0.0     0.000000  0.707107
1   0.0     0.000000  0.707107  0.0     0.707107  0.000000
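
The value 0.707107 in the output above is 1/sqrt(2): each document contains two equally weighted tokens, and each row is scaled to unit length. The pure-Python sketch below reproduces the numbers under the assumption that the default "tfidf" algorithm behaves like scikit-learn's TfidfVectorizer with default settings (smoothed IDF, L2 normalization, lowercased word tokens of two or more characters). The helper names `fit_tfidf` and `transform_tfidf` are hypothetical; they mirror the train_on=True and train_on=False steps, respectively.

```python
import math
import re
from collections import Counter

# Assumption: TfidfVectorizer-style default tokenization.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_tfidf(docs):
    """Learn a sorted vocabulary and smoothed IDF weights from the training texts."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    n_docs = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    return vocabulary, idf


def transform_tfidf(docs, vocabulary, idf):
    """Vectorize texts with an already fitted vocabulary; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(token for token in tokenize(doc) if token in idf)
        weights = [counts[token] * idf[token] for token in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid division by zero
        rows.append([w / norm for w in weights])
    return rows


x_train = ["Hallo world!", "Goodbye Island", "Greetings Berlin"]
x_test = ["Goodbye world!", "Greetings Island"]

vocabulary, idf = fit_tfidf(x_train)                    # like train_on=True
train_rows = transform_tfidf(x_train, vocabulary, idf)
test_rows = transform_tfidf(x_test, vocabulary, idf)    # like train_on=False

print(vocabulary)
for row in test_rows:
    print([round(value, 6) for value in row])
```

Because the test texts are transformed with the vocabulary learned from the training texts, both results share the same six columns.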