dvb.datascience package

Submodules

dvb.datascience.classification_pipe_base module

class dvb.datascience.classification_pipe_base.ClassificationPipeBase

Bases: dvb.datascience.pipe_base.PipeBase

Base class for classification pipes, so that classification-related attributes and methods are reusable for different kinds of classification-based pipes.

X = None
X_labels = None
classes = None
fit_attributes = [('classes', None, None), ('n_classes', None, None), ('y_true_label', None, None), ('y_pred_label', None, None), ('y_pred_proba_labels', None, None), ('X_labels', None, None)]
n_classes = 0
threshold = None
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the given params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

y_pred = None
y_pred_label = ''
y_pred_proba = None
y_pred_proba_labels = None
y_true = None
y_true_label = ''

dvb.datascience.pipe_base module

class dvb.datascience.pipe_base.PipeBase

Bases: object

Common base class for all pipes

figs = None
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on a dataset df and store the learnings, so transform can be called later on to transform based on those learnings.

fit_attributes = ()
fit_transform(data: Dict[str, Any], transform_params: Dict[str, Any], fit_params: Dict[str, Any]) → Dict[str, Any]
get_fig(idx: Any)

Set the figure in plt to the one to be used.

When idx has already been used, the same Figure will be set so data can be added to that plot. Otherwise, a new Figure will be set.
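A minimal sketch of how a custom pipe might use get_fig inside its transform; the pipe and the column names are hypothetical:

import matplotlib.pyplot as plt
from dvb.datascience.pipe_base import PipeBase

class ScatterPipe(PipeBase):
    """Hypothetical pipe which plots two columns of the incoming df."""

    def transform(self, data, params):
        self.get_fig('scatter')  # a repeated call with the same idx adds to the same Figure
        plt.scatter(data['df']['x'], data['df']['y'])
        return {'df': data['df']}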

get_transform_data_by_key(key: str) → List[Any]

Get all values for a certain key for all transforms

input_keys = ('df',)
load(state: Dict[str, Any])

Load all fitted attributes of this Pipe from state.

Note: All PipeBase subclasses can define a fit_attributes attribute which contains a tuple for every attribute that is set during the fit phase. Those are the attributes which need to be saved in order to be loaded in a new process without having to train (fit) the pipeline. This is useful e.g. for model inference. The tuple for every attribute consists of (name, serializer, deserializer).

The (de)serializers are needed to convert to/from a JSON-serializable format and can be:

- None: no conversion needed, e.g. for str, int, float, list, bool
- 'pickle': the attribute will be pickled and stored as base64, so it can be part of a JSON document
- callable: a function which will get the object to be (de)serialized and must return the (de)serialized version
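As an illustration, a sketch of how a subclass could declare fit_attributes; the pipe and its attributes are hypothetical:

import numpy as np
from dvb.datascience.pipe_base import PipeBase

class ScalePipe(PipeBase):
    """Hypothetical pipe with three fitted attributes."""

    fit_attributes = (
        ('labels', None, None),                     # plain list: no conversion needed
        ('model', 'pickle', 'pickle'),              # pickled and stored as base64
        ('means', lambda a: a.tolist(), np.array),  # callables to/from a JSON-serializable form
    )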

name = None
output_keys = ('df',)
save() → Dict[str, Any]

Return all fitted attributes of this Pipe in a Dict which is JSON serializable.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the given params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

dvb.datascience.pipeline module

class dvb.datascience.pipeline.Pipeline

Bases: object

A connector specifies which output of which Pipe (identified by the Pipe's name and the key of the output) will be the input of which Pipe (identified by the Pipe's name and the key of the input).

Example

>>> pipeline = Pipeline()
>>> pipeline.addPipe('read', ReadCSV())
>>> pipeline.addPipe('impute', Impute(), [("read", "df", "df")])
>>> pipeline.fit()
>>> pipeline.transform()
addPipe(name: str, pipe: dvb.datascience.pipe_base.PipeBase, inputs: List[Tuple[Union[str, dvb.datascience.pipe_base.PipeBase], str, str]] = None, comment: str = None) → dvb.datascience.pipeline.Pipeline

Add the pipe pipe to the pipeline under the given name. Optionally add the input connectors by listing them in inputs: a list with, for each input, a tuple (output_pipe, output_key, input_key).
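For instance, a pipe with two inputs could be connected as follows; ComparePipe and the input key names are hypothetical:

>>> pipeline.addPipe('train', ReadCSV())
>>> pipeline.addPipe('test', ReadCSV())
>>> pipeline.addPipe('compare', ComparePipe(), [
...     ('train', 'df', 'df_train'),
...     ('test', 'df', 'df_test'),
... ])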

current_transform_nr = -1
draw_design()

Returns an image with all pipes and connectors.

end()

When all fits and transforms are finished, end the pipeline so some clean-up can be done. At this moment, that is mainly needed to close plots, so they won't be shown twice in the notebook.

fit_transform(data: Optional[Dict[str, Any]] = None, transform_params: Optional[Dict[str, Any]] = None, fit_params: Optional[Dict[str, Any]] = None, name: str = 'fit', close_plt: bool = False) → None

Train all pipes in the pipeline and run the transform for the first time.

fit_transform_try(*args, **kwargs)
static get_params(params: Dict, key: str, metadata: Dict = None) → Dict

Get a dict with the contents of params relevant only to the pipe with the given key as its name. In addition, params['default'] and metadata will be added.

get_pipe(name) → Optional[dvb.datascience.pipe_base.PipeBase]
get_pipe_input(name) → Optional[Dict]

Get the input for the pipe named name from the transformed outputs. Returns a dict with all data when all data for the pipe is collectable. Returns None when not all data is present yet.

get_pipe_output(name: str, transform_nr: int = None) → Dict

Get the output of the pipe named name for the given transform_nr (which defaults to None, selecting the last one). When no output is present, an empty dict is returned.

get_processable_pipes() → List[dvb.datascience.pipe_base.PipeBase]

Get the pipes which are processable given the status of the pipeline.

input_connectors = None
static is_valid_name(name)
load(file_path: str) → None

Load the fitted parameters from the file at file_path and load them into all Pipes.

output_connectors = None
pipes = None
reset_fit()
save(file_path: str) → None

Save the fitted parameters of all Pipes to the file at file_path.
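A sketch of the intended save/load round trip for model inference; the file name is illustrative:

>>> pipeline.fit_transform()      # train all pipes
>>> pipeline.save('fitted.json')  # persist the fitted attributes of all pipes
>>> # in a new process: construct the same pipeline again, then
>>> pipeline.load('fitted.json')
>>> pipeline.transform()          # inference without refitting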

transform(data: Optional[Dict[str, Any]] = None, transform_params: Optional[Dict[str, Any]] = None, fit_params: Optional[Dict[str, Any]] = None, fit: bool = False, name: Optional[str] = None, close_plt: bool = False)

When transform_params or fit_params contains the key 'default', those params will be given to all pipes, unless overridden by a specific value for that pipe in transform_params or fit_params. The default can be useful for params which are needed by a lot of pipes.
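For example, a sketch passing a default param to every pipe while overriding it for one; the 'verbose' param is hypothetical:

>>> pipeline.transform(transform_params={
...     'default': {'verbose': True},   # given to all pipes...
...     'impute': {'verbose': False},   # ...unless overridden per pipe
... })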

transform_outputs = None
transform_status = None
transform_try(*args, **kwargs)
class dvb.datascience.pipeline.Status

Bases: enum.Enum

An enumeration of the processing states of a pipe during a pipeline run.

FINISHED = 3
NOT_STARTED = 1
PROCESSING = 2

dvb.datascience.score module

class dvb.datascience.score.ClassificationScore(score_methods: List[str] = None)

Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase

Some scores for classification problems

accuracy() → float
auc() → Optional[float]
classification_report()
confusion_matrix()
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on a dataset df and store the learnings, so transform can be called later on to transform based on those learnings.

input_keys = ('predict', 'predict_metadata')
log_loss()
mcc(threshold: float = None) → float
output_keys = ('scores',)
params = None
plot_auc()
plot_confusion_matrix()
plot_model_performance()
possible_predict_methods = ['plot_model_performance']
possible_score_methods = ['auc', 'plot_auc', 'accuracy', 'mcc', 'confusion_matrix', 'plot_confusion_matrix', 'precision_recall_curve', 'log_loss', 'classification_report', 'plot_model_performance']
precision_recall_curve()
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the given params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
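A sketch of adding this pipe to a pipeline, assuming an upstream pipe named 'predict' whose output keys match the input_keys above; the chosen score_methods are illustrative:

>>> from dvb.datascience.score import ClassificationScore
>>> pipeline.addPipe('score', ClassificationScore(score_methods=['auc', 'accuracy']), [
...     ('predict', 'predict', 'predict'),
...     ('predict', 'predict_metadata', 'predict_metadata'),
... ])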

dvb.datascience.sub_pipe_base module

class dvb.datascience.sub_pipe_base.PassData(subpipeline, output_keys)

Bases: dvb.datascience.pipe_base.PipeBase

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the given params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

class dvb.datascience.sub_pipe_base.SubPipelineBase(output_pipe_name: str)

Bases: dvb.datascience.pipe_base.PipeBase

fit_transform(data: Dict[str, Any], transform_params: Dict[str, Any], fit_params: Dict[str, Any]) → Dict[str, Any]
load(state: Dict[str, Any]) → None

Load all fitted attributes of this Pipe from state.

save() → Dict[str, Any]

Return all fitted attributes of this Pipe in a Dict which is JSON serializable.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the given params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

Module contents

dvb.datascience.load_module(name: str, disable_warnings: bool = True, random_seed: Optional[int] = 1122) → Any

Convenience function for running an experiment. This function reloads the experiment when it is already loaded, so any changes in the code of that experiment will be used. Usage:

import dvb.datascience as ds
p = ds.load_module('experiment').run()

p can be used to access the contents of the pipeline, like:

p.get_pipe_output('predict')

in case you define a run() method in experiment.py returning the pipeline object.

dvb.datascience.run_module(name: str, disable_warnings: bool = True) → Any