dvb.datascience.transform package

Submodules

dvb.datascience.transform.classes module

class dvb.datascience.transform.classes.LabelBinarizerPipe

Bases: dvb.datascience.pipe_base.PipeBase

Split the label column into separate columns, one per label value

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('lb', 'pickle', 'pickle')]
input_keys = ('df',)
lb = None
output_keys = ('df', 'df_metadata')
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
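The lb fit attribute suggests this pipe wraps sklearn’s LabelBinarizer. A minimal sketch of the underlying operation (plain scikit-learn, not the pipe itself):

    from sklearn.preprocessing import LabelBinarizer

    lb = LabelBinarizer()
    binarized = lb.fit_transform(['a', 'b', 'c', 'a'])
    # one column per label value, in the order of lb.classes_:
    # [[1 0 0]
    #  [0 1 0]
    #  [0 0 1]
    #  [1 0 0]]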

dvb.datascience.transform.core module

class dvb.datascience.transform.core.GetCoreFeatures(model=None, n_features: int = 10, method='RFE')

Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase

Get the features (at most n_features) which are most important for predicting the label.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('core_features', None, None)]
get_core_features(X, y) → List[str]
input_keys = ('df', 'df_metadata')
output_keys = ('features',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
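With the default method=’RFE’, the selection presumably relies on scikit-learn’s recursive feature elimination. A rough standalone equivalent (the choice of LogisticRegression as estimator is an assumption):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
    selector.fit(X, y)
    core = selector.support_  # boolean mask of the 10 selected features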

dvb.datascience.transform.features module

class dvb.datascience.transform.features.ComputeFeature(column_name, f: Callable, c: Callable = None)

Bases: dvb.datascience.pipe_base.PipeBase

Add a computed feature to the dataframe

input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
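The exact contract of f (called per row or per dataframe) is not documented here; the plain-pandas equivalent of adding a computed column looks like this:

    import pandas as pd

    df = pd.DataFrame({'length': [1.70, 1.80], 'weight': [65.0, 80.0]})
    # hypothetical computed feature: body mass index, computed per row
    df['bmi'] = df.apply(lambda row: row['weight'] / row['length'] ** 2, axis=1)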

class dvb.datascience.transform.features.DropFeatures(features: List[str] = None, features_function: Callable = None)

Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.SpecifyFeaturesBase

class dvb.datascience.transform.features.DropFeaturesMixin

Bases: dvb.datascience.transform.features.FeaturesBase

Mixin for classes which will drop features. Subclasses need to set self.features, which contains the features that will be dropped.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

class dvb.datascience.transform.features.DropHighlyCorrelatedFeatures(threshold: float = 0.9, absolute: bool = True)

Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.FeaturesBase

When two columns are highly correlated, one of them is removed: from each correlated pair, the column that appears later in the list of columns is dropped.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.
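A sketch of this selection in plain pandas (not the actual implementation), keeping only the earlier column of each correlated pair:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
    corr = df.corr().abs()  # absolute correlations, as with absolute=True
    # look only above the diagonal, so the later column of each pair is flagged
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    df = df.drop(columns=to_drop)  # drops 'b', which duplicates 'a'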

class dvb.datascience.transform.features.DropNonInvertibleFeatures

Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.FeaturesBase

Drops features that are not invertible, to prevent singularity.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

static is_invertible(a)
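One common way to implement such a check (whether this matches the actual implementation is an assumption):

    import numpy as np

    def is_invertible(a):
        a = np.asarray(a)
        # a matrix is invertible iff it is square and has full rank
        return a.shape[0] == a.shape[1] and np.linalg.matrix_rank(a) == a.shape[0]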
class dvb.datascience.transform.features.FeaturesBase

Bases: dvb.datascience.pipe_base.PipeBase, abc.ABC

features = None
fit_attributes = [('features', None, None)]
input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

class dvb.datascience.transform.features.FilterFeatures(features: List[str] = None, features_function: Callable = None)

Bases: dvb.datascience.transform.features.SpecifyFeaturesBase

FilterFeatures returns a dataframe which contains only the specified columns. Note: when a requested column does not exist in the input dataframe, it is silently ignored.

input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
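The silent-ignore behaviour corresponds to this plain-pandas pattern (a sketch, not the actual implementation):

    import pandas as pd

    df = pd.DataFrame({'a': [1], 'b': [2]})
    features = ['a', 'c']  # 'c' does not exist in df
    kept = df[df.columns.intersection(features)]  # only 'a'; 'c' is silently ignored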

class dvb.datascience.transform.features.FilterTypeFeatures(type_=<class 'numpy.number'>)

Bases: dvb.datascience.pipe_base.PipeBase

Keep only the columns of the given type (np.number is default)

input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
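This corresponds directly to pandas’ dtype selection:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'age': [30, 40], 'name': ['ann', 'bob']})
    numeric_only = df.select_dtypes(include=[np.number])  # keeps only 'age'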

class dvb.datascience.transform.features.SpecifyFeaturesBase(features: List[str] = None, features_function: Callable = None)

Bases: dvb.datascience.transform.features.FeaturesBase

Base class for classes which can be initialised with a list of features or a callable which computes those features. The subclass specifies what will be done with the features during transform.

features_function = None
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

dvb.datascience.transform.filter module

class dvb.datascience.transform.filter.FilterObservations(filter_: Callable)

Bases: dvb.datascience.pipe_base.PipeBase

Filter observations by row based on a function

input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
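A minimal sketch of the row filtering in plain pandas, assuming filter_ is applied per row:

    import pandas as pd

    df = pd.DataFrame({'age': [15, 30, 45]})
    filter_ = lambda row: row['age'] >= 18
    adults = df[df.apply(filter_, axis=1)]  # keeps the rows for which filter_ is True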

dvb.datascience.transform.impute module

class dvb.datascience.transform.impute.CategoricalImpute(missing_values='NaN', strategy='mode', replacement='')

Bases: dvb.datascience.pipe_base.PipeBase

Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.

Args:
missing_values : string or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. None and np.nan are treated as the same; use the string value “NaN” for them.
strategy : string, optional (default=’mode’)
If set to ‘mode’, replace all instances of missing_values with the modal value. Otherwise, replace with the value specified via replacement.
replacement : string, optional (default=’’)
The value that all instances of missing_values are replaced with if strategy is not set to ‘mode’. This is useful if you don’t want to impute with the mode, or if there are multiple modes in your data and you want to choose a particular one. If strategy is set to ‘mode’, this parameter is ignored.
fill : str
Most frequent value of the training data.
fit(data: Dict[str, Any], params: Dict[str, Any])

Get the most frequent value.

fit_attributes = [('fill', 'pickle', 'pickle')]
input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Replaces missing values in the input data with the most frequent value of the training data.
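The default strategy corresponds to this pandas operation (a sketch of the concept, not the actual implementation):

    import pandas as pd

    s = pd.Series(['a', 'b', 'a', None])
    fill = s.mode()[0]        # most frequent value, learned during fit -> 'a'
    imputed = s.fillna(fill)  # applied during transform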

class dvb.datascience.transform.impute.ImputeWithDummy(strategy: str = 'median', impValueTrain=None)

Bases: dvb.datascience.pipe_base.PipeBase

Impute missing values with the mean, median, mode or set to a value. Takes as input strategy (str). Possible strategies are “mean”, “median”, “mode” and “value”. If the strategy is “value”, an extra argument impValueTrain can be given, denoting which value should be set.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('impValueTrain', 'pickle', 'pickle')]
impValueTrain = None
input_keys = ('df',)
output_keys = ('df',)
possible_strategies = ['mean', 'median', 'mode', 'value']
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
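A sketch of the ‘median’ strategy in plain pandas; ‘mean’, ‘mode’ and a fixed ‘value’ work analogously:

    import pandas as pd

    s = pd.Series([1.0, None, 3.0, 5.0])
    imp_value_train = s.median()         # learned during fit
    imputed = s.fillna(imp_value_train)  # applied during transform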

dvb.datascience.transform.metadata module

class dvb.datascience.transform.metadata.MetadataPipeline(file_path: str, remove_vars: List = None)

Bases: dvb.datascience.sub_pipe_base.SubPipelineBase

Read metadata and create pipes for processing the data

input_keys = ('df',)
output_keys = ('df',)

dvb.datascience.transform.outliers module

class dvb.datascience.transform.outliers.RemoveOutliers(nr_of_std: int = 6, skip_columns: List[str] = None, min_outliers: int = 1)

Bases: dvb.datascience.pipe_base.PipeBase

Remove observations when at least one of the features has an outlier.

Args:
nr_of_std (int): the number of standard deviations a value must lie above/below the mean to be counted as an outlier (default = 6)
skip_columns (List[str]): columns to be skipped
min_outliers (int): minimum number of outliers a row must have to be removed from the dataframe (default = 1)
Returns:
The dataframe, without the rows that have at least min_outliers outliers.
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('boundaries', 'pickle', 'pickle')]
input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
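A sketch of the removal logic in plain pandas (not the actual implementation):

    import numpy as np
    import pandas as pd

    nr_of_std, min_outliers = 6, 1
    df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'c'])
    lower = df.mean() - nr_of_std * df.std()  # boundaries learned during fit
    upper = df.mean() + nr_of_std * df.std()
    is_outlier = df.lt(lower, axis=1) | df.gt(upper, axis=1)
    df = df[is_outlier.sum(axis=1) < min_outliers]  # drop rows with >= min_outliers outliers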

class dvb.datascience.transform.outliers.ReplaceOutliersFeature(method: str = 'median', nr_of_std: float = 1.5)

Bases: dvb.datascience.pipe_base.PipeBase

Replace all outliers in features with the median, mean or a clipped value.

Args:
method (str): the method to use when replacing (default = ‘median’).
Options are:
- median: replace outliers with the median of the feature
- mean: replace outliers with the mean of the feature
- clip: replace outliers with nr_of_std standard deviations +/- the mean

nr_of_std (float): the number of standard deviations from the mean beyond which a value is treated as an outlier (default = 1.5)

Returns:
The dataframe, with outliers replaced by the method indicated.
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('features_mean', 'pickle', 'pickle'), ('features_median', 'pickle', 'pickle'), ('features_limit', 'pickle', 'pickle')]
input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
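The ‘clip’ method corresponds to this pandas operation (a sketch, assuming per-feature boundaries learned during fit):

    import numpy as np
    import pandas as pd

    nr_of_std = 1.5
    df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'c'])
    lower = df.mean() - nr_of_std * df.std()
    upper = df.mean() + nr_of_std * df.std()
    clipped = df.clip(lower=lower, upper=upper, axis=1)  # method='clip'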

dvb.datascience.transform.pandaswrapper module

class dvb.datascience.transform.pandaswrapper.PandasWrapper(s: Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame])

Bases: dvb.datascience.pipe_base.PipeBase

Generic Wrapper for Pandas operations. The callable will get the DataFrame from the input ‘df’ and the returned DataFrame will be put in the output ‘df’.

Besides the DataFrame, the callable gets the transform_params, so these can be used to change the operation.

input_keys = ('df',)
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
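A minimal usage sketch; whether the transform params arrive as a second argument is an assumption based on the docstring above:

    from dvb.datascience.transform.pandaswrapper import PandasWrapper

    # drop rows with missing values; *args absorbs the transform_params if passed
    pipe = PandasWrapper(lambda df, *args: df.dropna())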

dvb.datascience.transform.sklearnwrapper module

class dvb.datascience.transform.sklearnwrapper.SKLearnBase

Bases: object

fit(data: Any)
transform(data: Any)
class dvb.datascience.transform.sklearnwrapper.SKLearnWrapper(cls, **kwargs)

Bases: dvb.datascience.pipe_base.PipeBase

Generic wrapper for an SKLearn fit / transform class.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

fit_attributes = [('s', 'pickle', 'pickle')]
input_keys = ('df',)
output_keys = ('df',)
s = None
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
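A minimal usage sketch, assuming cls is the (uninstantiated) sklearn class and the kwargs are passed to its constructor:

    from sklearn.preprocessing import StandardScaler
    from dvb.datascience.transform.sklearnwrapper import SKLearnWrapper

    # the fitted instance is stored in the pipe's `s` attribute (see fit_attributes)
    pipe = SKLearnWrapper(StandardScaler, with_mean=True)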

dvb.datascience.transform.smote module

class dvb.datascience.transform.smote.SMOTESampler(**kwargs)

Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase

Resample the dataset.

Note: the new df will not keep the original indexes; because extra rows are created, the indexes would no longer be unique.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on the dataset df and store what is learned, so that transform can be called later to transform data based on that training.

input_keys = ('df', 'df_metadata')
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
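The pipe presumably wraps imbalanced-learn’s SMOTE, with the kwargs passed through; a plain imbalanced-learn sketch of the resampling:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))  # the minority class is oversampled to balance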

dvb.datascience.transform.split module

class dvb.datascience.transform.split.CallableTrainTestSplit(c: Callable[Any, int])

Bases: dvb.datascience.transform.split.TrainTestSplitBase

Return the train set, the test set or the complete set, as selected via params[‘split’].

For every row, the callable is called with the row as its single argument and returns one of:
- CallableTrainTestSplit.TRAIN
- CallableTrainTestSplit.TEST

When the return value is neither TRAIN nor TEST, the row is excluded from both sets.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
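A minimal sketch, assuming each row is passed as a pandas Series (the id column here is hypothetical):

    import pandas as pd
    from dvb.datascience.transform.split import CallableTrainTestSplit

    df = pd.DataFrame({'id': [1, 2, 3, 4], 'x': [10, 20, 30, 40]})
    pipe = CallableTrainTestSplit(
        lambda row: CallableTrainTestSplit.TRAIN if row['id'] % 2 == 0
        else CallableTrainTestSplit.TEST
    )
    train = pipe.transform({'df': df}, {'split': CallableTrainTestSplit.TRAIN})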

class dvb.datascience.transform.split.RandomTrainTestSplit(random_state: int = 42, test_size: float = 0.25)

Bases: dvb.datascience.transform.split.TrainTestSplitBase

Return the train set, the test set or the complete set, as selected via params[‘split’]. The split is random. A random state is set by default, to make the pipeline reproducible.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

class dvb.datascience.transform.split.TrainTestSplit(*args, **kwargs)

Bases: dvb.datascience.transform.split.RandomTrainTestSplit

class dvb.datascience.transform.split.TrainTestSplitBase

Bases: dvb.datascience.pipe_base.PipeBase

ALL = -1
TEST = 1
TRAIN = 0
input_keys = ('df',)
output_keys = ('df',)

dvb.datascience.transform.union module

class dvb.datascience.transform.union.Union(number_of_dfs, join: str = 'outer', axis=1, remove_duplicated_columns: bool = False)

Bases: dvb.datascience.pipe_base.PipeBase

Merge the results of different pipes. Merging can be done on columns (default, axis=1) or rows (axis=0). When columns are merged, a column may be present in more than one input dataframe. By default, the second occurrence of such a column is renamed with an underscore as suffix. Optionally, duplicated columns are removed.

The input_keys are generated at initialisation based on the number of dfs, like:

input_keys = (‘df0’, ‘df1’, ‘df2’, …)

input_keys = ()
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
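The column-wise merge corresponds to this pandas operation (a sketch of the concept; Union additionally renames or removes duplicated columns):

    import pandas as pd

    df0 = pd.DataFrame({'a': [1, 2]})
    df1 = pd.DataFrame({'b': [3, 4]})
    merged = pd.concat([df0, df1], axis=1, join='outer')  # the Union defaults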

Module contents