dvb.datascience.transform package¶
Submodules¶
dvb.datascience.transform.classes module¶
class dvb.datascience.transform.classes.LabelBinarizerPipe¶
Bases: dvb.datascience.pipe_base.PipeBase
Split the label column into separate columns, one per label value.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('lb', 'pickle', 'pickle')]¶
- input_keys = ('df',)¶
- lb = None¶
- output_keys = ('df', 'df_metadata')¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
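The effect of this pipe can be sketched with plain pandas; `get_dummies` here stands in for the `LabelBinarizer` instance the pipe stores in its `lb` attribute, so this is an illustration of the outcome, not the pipe's actual code.

```python
import pandas as pd

# A label column with three distinct values.
df = pd.DataFrame({"label": ["a", "b", "a", "c"]})

# Split the label column into one indicator column per label value.
binarized = pd.get_dummies(df["label"], prefix="label")
```

Each resulting column is 1 (True) where the row had that label and 0 elsewhere.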
dvb.datascience.transform.core module¶
class dvb.datascience.transform.core.GetCoreFeatures(model=None, n_features: int = 10, method='RFE')¶
Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase
Get the features (at most n_features) which are key to predicting the label.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('core_features', None, None)]¶
- get_core_features(X, y) → List[str]¶
- input_keys = ('df', 'df_metadata')¶
- output_keys = ('features',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
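The class defaults to recursive feature elimination (RFE); the sketch below uses a much simpler correlation ranking as a stand-in, purely to illustrate the idea of `get_core_features(X, y)` returning the names of the most label-relevant columns. The function name and logic here are illustrative, not the library's implementation.

```python
import numpy as np
import pandas as pd

def core_features_by_correlation(X: pd.DataFrame, y: pd.Series, n_features: int = 10):
    """Rank features by absolute correlation with the label and keep the
    top n_features. A simplified stand-in for RFE-based selection."""
    scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    return scores.sort_values(ascending=False).index[:n_features].tolist()

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": np.arange(100, dtype=float),   # perfectly tied to the label
    "noise": rng.normal(size=100),           # unrelated to the label
})
y = pd.Series(np.arange(100, dtype=float) * 2 + 1)
```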
dvb.datascience.transform.features module¶
class dvb.datascience.transform.features.ComputeFeature(column_name, f: Callable, c: Callable = None)¶
Bases: dvb.datascience.pipe_base.PipeBase
Add a computed feature to the dataframe.
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
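In plain pandas terms, adding a computed feature amounts to deriving a new column from existing ones, which is a sketch of what this pipe does with the callable `f` (the exact signature `f` receives is not specified here, so this is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"length_cm": [150.0, 180.0]})

# Derive a new column from an existing one, as ComputeFeature would
# do with the supplied callable.
df["length_m"] = df["length_cm"].apply(lambda v: v / 100)
```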
class dvb.datascience.transform.features.DropFeatures(features: List[str] = None, features_function: Callable = None)¶
Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.SpecifyFeaturesBase
class dvb.datascience.transform.features.DropFeaturesMixin¶
Bases: dvb.datascience.transform.features.FeaturesBase
Mixin for classes which drop features. Subclasses need to set self.features, which contains the features that will be dropped.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
-
Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.FeaturesBase
When two columns are highly correlated, one will be removed. From a pair of correlated columns, the one that appears latest in the list of columns will be removed.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
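The rule "from a correlated pair, drop the later column" can be sketched with a pandas correlation matrix; the 0.95 threshold below is an assumed value for illustration, not one taken from the library.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 1.0, 3.0, 2.0],   # weakly correlated with the others
})

# For each column, drop it if some EARLIER column is highly correlated
# with it; this keeps the first of each correlated pair.
corr = df.corr().abs()
to_drop = [col for i, col in enumerate(df.columns)
           if any(corr.iloc[j, i] > 0.95 for j in range(i))]
reduced = df.drop(columns=to_drop)
```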
class dvb.datascience.transform.features.DropNonInvertibleFeatures¶
Bases: dvb.datascience.transform.features.DropFeaturesMixin, dvb.datascience.transform.features.FeaturesBase
Drops features that are not invertible, to prevent singularity.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- static is_invertible(a)¶
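A plausible shape for the `is_invertible` check is a full-rank test on a square matrix; the actual implementation in the library may differ, so treat this as a sketch.

```python
import numpy as np

def is_invertible(a) -> bool:
    """A square matrix is invertible exactly when it has full rank.
    Sketch of the kind of check is_invertible could perform."""
    a = np.asarray(a)
    return a.shape[0] == a.shape[1] and np.linalg.matrix_rank(a) == a.shape[0]
```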
class dvb.datascience.transform.features.FeaturesBase¶
Bases: dvb.datascience.pipe_base.PipeBase, abc.ABC
- features = None¶
- fit_attributes = [('features', None, None)]¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
class dvb.datascience.transform.features.FilterFeatures(features: List[str] = None, features_function: Callable = None)¶
Bases: dvb.datascience.transform.features.SpecifyFeaturesBase
FilterFeatures returns a dataframe which contains only the specified columns. Note: when a requested column does not exist in the input dataframe, it is silently ignored.
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
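The "silently ignored" behaviour can be sketched in pandas by intersecting the requested names with the columns actually present (an illustration of the documented behaviour, not the pipe's code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
requested = ["a", "missing"]

# Keep only the requested columns that exist; a requested column that
# is absent from the input is silently ignored.
present = [c for c in requested if c in df.columns]
filtered = df[present]
```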
class dvb.datascience.transform.features.FilterTypeFeatures(type_=<class 'numpy.number'>)¶
Bases: dvb.datascience.pipe_base.PipeBase
Keep only the columns of the given type (np.number by default).
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
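The default behaviour corresponds to pandas' `select_dtypes` with `np.number` (a sketch of the outcome; the pipe may implement the selection differently):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3, 4], "name": ["a", "b"]})

# Keep only numeric columns, the default of FilterTypeFeatures.
numeric_only = df.select_dtypes(include=[np.number])
```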
class dvb.datascience.transform.features.SpecifyFeaturesBase(features: List[str] = None, features_function: Callable = None)¶
Bases: dvb.datascience.transform.features.FeaturesBase
Base class for classes which can be initialised with a list of features or a callable which computes those features. The subclass needs to specify what will be done with the features during transform.
- features_function = None¶
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
dvb.datascience.transform.filter module¶
class dvb.datascience.transform.filter.FilterObservations(filter_: Callable)¶
Bases: dvb.datascience.pipe_base.PipeBase
Filter observations row by row, based on a function.
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
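Row-wise filtering with a callable looks like this in plain pandas (the exact argument passed to `filter_` by the pipe is assumed here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": [10, 25, 40]})

# Keep only rows for which the filter callable returns True.
adults = df[df["age"].apply(lambda age: age >= 18)]
```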
dvb.datascience.transform.impute module¶
class dvb.datascience.transform.impute.CategoricalImpute(missing_values='NaN', strategy='mode', replacement='')¶
Bases: dvb.datascience.pipe_base.PipeBase
Impute missing values from a categorical/string np.ndarray or pd.Series with the most frequent value on the training data.
Args:
- missing_values : string or "NaN", optional (default="NaN")
  The placeholder for the missing values. All occurrences of missing_values will be imputed. None and np.nan are treated as being the same; use the string value "NaN" for them.
- strategy : string, optional (default='mode')
  If set to 'mode', replace all instances of missing_values with the modal value. Otherwise, replace with the value specified via replacement.
- replacement : string, optional (default='')
  The value that all instances of missing_values are replaced with if strategy is not set to 'mode'. This is useful if you don't want to impute with the mode, or if there are multiple modes in your data and you want to choose a particular one. If strategy is set to 'mode', this parameter is ignored.
- fill : str
  Most frequent value of the training data.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Get the most frequent value.
- fit_attributes = [('fill', 'pickle', 'pickle')]¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Replaces missing values in the input data with the most frequent value of the training data.
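The fit/transform split matters here: the modal value is learned on the training data and then applied to new data. A minimal pandas sketch of the 'mode' strategy:

```python
import pandas as pd

train = pd.Series(["red", "red", "blue", None])
test = pd.Series(["green", None])

# fit: learn the most frequent value on the training data.
fill = train.mode()[0]

# transform: replace missing values with the learned value.
imputed = test.fillna(fill)
```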
class dvb.datascience.transform.impute.ImputeWithDummy(strategy: str = 'median', impValueTrain=None)¶
Bases: dvb.datascience.pipe_base.PipeBase
Impute missing values with the mean, median, mode, or a fixed value. Takes as input strategy (str). Possible strategies are "mean", "median", "mode" and "value". If the strategy is "value", an extra argument impValueTrain can be given, denoting which value should be set.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('impValueTrain', 'pickle', 'pickle')]¶
- impValueTrain = None¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- possible_strategies = ['mean', 'median', 'mode', 'value']¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
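A sketch of the "median" strategy in plain pandas, showing that the imputation value is learned on the training data and reused at transform time:

```python
import numpy as np
import pandas as pd

train = pd.Series([1.0, 3.0, 100.0])
test = pd.Series([2.0, np.nan])

# fit: with strategy "median", the imputation value is learned on train.
imp_value_train = train.median()

# transform: missing values in new data are filled with that value.
imputed = test.fillna(imp_value_train)
```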
dvb.datascience.transform.metadata module¶
class dvb.datascience.transform.metadata.MetadataPipeline(file_path: str, remove_vars: List = None)¶
Bases: dvb.datascience.sub_pipe_base.SubPipelineBase
Read metadata and make some pipes for processing the data.
- input_keys = ('df',)¶
- output_keys = ('df',)¶
dvb.datascience.transform.outliers module¶
class dvb.datascience.transform.outliers.RemoveOutliers(nr_of_std: int = 6, skip_columns: List[str] = None, min_outliers: int = 1)¶
Bases: dvb.datascience.pipe_base.PipeBase
Remove observations when at least one of the features has an outlier.
Args:
- nr_of_std (int): the number of standard deviations a value has to lie above/below the mean to be counted as an outlier (default = 6)
- skip_columns (List[str]): columns to be skipped
- min_outliers (int): minimum number of outliers a row must have to be removed from the dataframe (default = 1)
Returns:
- The dataframe, minus any rows with at least min_outliers outliers.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('boundaries', 'pickle', 'pickle')]¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
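The fit step learns per-feature boundaries (stored in the `boundaries` fit attribute); the transform step drops rows that exceed them. A sketch under the assumption that the boundaries are mean ± nr_of_std standard deviations:

```python
import pandas as pd

df = pd.DataFrame({"x": [0.0] * 40 + [1000.0]})
nr_of_std, min_outliers = 6, 1

# fit: learn per-feature boundaries from the mean and standard deviation.
lower = df.mean() - nr_of_std * df.std()
upper = df.mean() + nr_of_std * df.std()

# transform: drop rows with at least min_outliers values outside the boundaries.
outlier_counts = ((df < lower) | (df > upper)).sum(axis=1)
cleaned = df[outlier_counts < min_outliers]
```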
class dvb.datascience.transform.outliers.ReplaceOutliersFeature(method: str = 'median', nr_of_std: float = 1.5)¶
Bases: dvb.datascience.pipe_base.PipeBase
Replace all outliers in features with the median, mean, or a clipped value.
Args:
- method (str): what method to use when replacing (default = 'median'). Options are:
  - median: replace outliers with the median of the feature
  - mean: replace outliers with the mean of the feature
  - clip: replace outliers with nr_of_std standard deviations +/- the mean
- nr_of_std (float): the number of standard deviations a value has to lie above/below the mean to be counted as an outlier (default = 1.5)
Returns:
- The dataframe, with outliers replaced using the indicated method.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('features_mean', 'pickle', 'pickle'), ('features_median', 'pickle', 'pickle'), ('features_limit', 'pickle', 'pickle')]¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
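A sketch of the "clip" method on a single feature, assuming the limits are mean ± nr_of_std standard deviations as described above:

```python
import pandas as pd

s = pd.Series([1.0] * 9 + [100.0])
nr_of_std = 1.5

# "clip": values outside mean +/- nr_of_std * std are replaced by the limit.
lower = s.mean() - nr_of_std * s.std()
upper = s.mean() + nr_of_std * s.std()
clipped = s.clip(lower=lower, upper=upper)
```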
dvb.datascience.transform.pandaswrapper module¶
class dvb.datascience.transform.pandaswrapper.PandasWrapper(s: Callable[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame])¶
Bases: dvb.datascience.pipe_base.PipeBase
Generic wrapper for pandas operations. The callable will get the DataFrame from the input 'df' and the returned DataFrame will be put in the output 'df'.
Besides the DataFrame, the callable gets the transform_params, so these can be used to change the operation.
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
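A DataFrame-to-DataFrame callable of the kind PandasWrapper expects might look like this (`add_ratio` and its columns are hypothetical; the transform_params argument is omitted for brevity):

```python
import pandas as pd

def add_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """A pure DataFrame -> DataFrame operation, suitable for wrapping."""
    out = df.copy()  # leave the input untouched
    out["ratio"] = out["a"] / out["b"]
    return out

df = pd.DataFrame({"a": [2.0, 9.0], "b": [1.0, 3.0]})
result = add_ratio(df)
```

Returning a copy keeps the wrapped operation side-effect free, so the same input 'df' can feed other pipes unchanged.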
dvb.datascience.transform.sklearnwrapper module¶
class dvb.datascience.transform.sklearnwrapper.SKLearnBase¶
Bases: object
- fit(data: Any)¶
- transform(data: Any)¶
class dvb.datascience.transform.sklearnwrapper.SKLearnWrapper(cls, **kwargs)¶
Bases: dvb.datascience.pipe_base.PipeBase
Generic wrapper for an SKLearn fit/transform class.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- fit_attributes = [('s', 'pickle', 'pickle')]¶
- input_keys = ('df',)¶
- output_keys = ('df',)¶
- s = None¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
dvb.datascience.transform.smote module¶
class dvb.datascience.transform.smote.SMOTESampler(**kwargs)¶
Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase
Resample the dataset.
Note: the new df will not keep the original indexes; because extra rows are created, the indexes would no longer be unique.
- fit(data: Dict[str, Any], params: Dict[str, Any])¶ Train on a dataset df and store the learnings so transform can be called later to transform based on those learnings.
- input_keys = ('df', 'df_metadata')¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
dvb.datascience.transform.split module¶
class dvb.datascience.transform.split.CallableTrainTestSplit(c: Callable[Any, int])¶
Bases: dvb.datascience.transform.split.TrainTestSplitBase
Return the train set, the test set, or the complete set, as defined in params['split'].
For every row, the callable will be called with the row as its single argument and returns CallableTrainTestSplit.TRAIN or CallableTrainTestSplit.TEST. When the return value is not equal to TRAIN or TEST, the row will be excluded from both sets.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
class dvb.datascience.transform.split.RandomTrainTestSplit(random_state: int = 42, test_size: float = 0.25)¶
Bases: dvb.datascience.transform.split.TrainTestSplitBase
Return the train set, the test set, or the complete set, as defined in params['split']. The split is random. A random state is set by default, to make the pipeline reproducible.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
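A reproducible random split can be sketched with pandas alone; the fixed random_state guarantees that repeating the split yields the same rows (this illustrates the contract, not the pipe's internals):

```python
import pandas as pd

df = pd.DataFrame({"x": range(8)})
random_state, test_size = 42, 0.25

# A fixed random_state makes the split reproducible across runs.
test = df.sample(frac=test_size, random_state=random_state)
train = df.drop(test.index)

# Repeating the split with the same seed selects the same rows.
same_test = df.sample(frac=test_size, random_state=random_state)
```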
class dvb.datascience.transform.split.TrainTestSplit(*args, **kwargs)¶
dvb.datascience.transform.union module¶
class dvb.datascience.transform.union.Union(number_of_dfs, join: str = 'outer', axis=1, remove_duplicated_columns: bool = False)¶
Bases: dvb.datascience.pipe_base.PipeBase
Merge the results of different pipes. Merging can be done on columns (default, axis=1) or on rows (axis=0). When columns are merged, a column may be present in more than one input dataframe. By default, the second occurrence of the column is renamed with an underscore suffix. Optionally, duplicated columns are removed.
The input_keys are generated at initialisation based on the number of dfs, like:
input_keys = ('df0', 'df1', 'df2', …)
- input_keys = ()¶
- output_keys = ('df',)¶
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]¶ Perform operations on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
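A column-wise merge with a duplicated column can be sketched with `pd.concat`; note this sketch drops later duplicates rather than renaming them with a suffix as the pipe does by default, so it corresponds to remove_duplicated_columns=True:

```python
import pandas as pd

df0 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df1 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Merge on columns (axis=1); "b" occurs in both inputs.
merged = pd.concat([df0, df1], axis=1, join="outer")

# Remove duplicated columns, keeping the first occurrence.
deduped = merged.loc[:, ~merged.columns.duplicated()]
```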