dvb.datascience package
Subpackages
- dvb.datascience.data package
- dvb.datascience.eda package
- Submodules
- dvb.datascience.eda.andrews module
- dvb.datascience.eda.base module
- dvb.datascience.eda.boxplot module
- dvb.datascience.eda.corrmatrix module
- dvb.datascience.eda.describe module
- dvb.datascience.eda.dimension_reduction module
- dvb.datascience.eda.dump module
- dvb.datascience.eda.ecdf module
- dvb.datascience.eda.hist module
- dvb.datascience.eda.logit_summary module
- dvb.datascience.eda.scatter module
- dvb.datascience.eda.swarm module
- Module contents
- dvb.datascience.predictor package
- dvb.datascience.transform package
- Submodules
- dvb.datascience.transform.classes module
- dvb.datascience.transform.core module
- dvb.datascience.transform.features module
- dvb.datascience.transform.filter module
- dvb.datascience.transform.impute module
- dvb.datascience.transform.metadata module
- dvb.datascience.transform.outliers module
- dvb.datascience.transform.pandaswrapper module
- dvb.datascience.transform.sklearnwrapper module
- dvb.datascience.transform.smote module
- dvb.datascience.transform.split module
- dvb.datascience.transform.union module
- Module contents
Submodules
dvb.datascience.classification_pipe_base module
class dvb.datascience.classification_pipe_base.ClassificationPipeBase
Bases: dvb.datascience.pipe_base.PipeBase

Base class for classification pipes, making classification-related attributes and methods reusable for different kinds of classification-based pipes.
- X = None
- X_labels = None
- classes = None
- fit_attributes = [('classes', None, None), ('n_classes', None, None), ('y_true_label', None, None), ('y_pred_label', None, None), ('y_pred_proba_labels', None, None), ('X_labels', None, None)]
- n_classes = 0
- threshold = None
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]
  Perform an operation on df using the params and the learnings from training. transform returns a tuple with the transformed dataset and some output. The transformed dataset will be the input for the next pipe; the output will be collected and shown to the user.
- y_pred = None
- y_pred_label = ''
- y_pred_proba = None
- y_pred_proba_labels = None
- y_true = None
- y_true_label = ''
dvb.datascience.pipe_base module
class dvb.datascience.pipe_base.PipeBase
Bases: object

Common base class for all pipes.
- figs = None
- fit(data: Dict[str, Any], params: Dict[str, Any])
  Train on the dataset df and store the learnings, so transform can be called later to transform based on those learnings.
- fit_attributes = ()
- fit_transform(data: Dict[str, Any], transform_params: Dict[str, Any], fit_params: Dict[str, Any]) → Dict[str, Any]
- get_fig(idx: Any)
  Set in plt the figure to the one to be used. When idx has already been used, the same Figure will be set so data can be added to that plot; otherwise a new Figure will be set.
- get_transform_data_by_key(key: str) → List[Any]
  Get all values for a certain key across all transforms.
- input_keys = ('df',)
- load(state: Dict[str, Any])
  Load all fitted attributes of this Pipe from state.
  Note: every PipeBase subclass can define a fit_attributes attribute containing one tuple for each attribute that is set during the fit phase. These are the attributes that need to be saved in order to be loaded in a new process without having to train (fit) the pipeline, which is useful e.g. for model inference. The tuple for each attribute consists of (name, serializer, deserializer).
  The (de)serializer is needed to convert to/from a JSON-serializable format and can be:
  - None: no conversion needed, e.g. for str, int, float, list, bool
  - 'pickle': the attribute will be pickled and stored as base64, so it can be part of a JSON document
  - callable: a function which receives the object to be (de)serialized and returns the (de)serialized version
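The (name, serializer, deserializer) convention can be sketched in plain Python. This is a minimal, hypothetical reimplementation for illustration only (the ToyPipe class and helper names are invented here), not the library's actual code:

```python
import base64
import json
import pickle


def serialize_attr(value, serializer):
    """Apply one of the three serializer forms: None, 'pickle', or a callable."""
    if serializer is None:          # value is already JSON-serializable
        return value
    if serializer == "pickle":      # pickle, then base64 so it fits in JSON
        return base64.b64encode(pickle.dumps(value)).decode("ascii")
    return serializer(value)        # custom callable


def deserialize_attr(value, deserializer):
    if deserializer is None:
        return value
    if deserializer == "pickle":
        return pickle.loads(base64.b64decode(value.encode("ascii")))
    return deserializer(value)


class ToyPipe:
    # one tuple per fitted attribute: (name, serializer, deserializer)
    fit_attributes = (
        ("classes", None, None),
        ("model", "pickle", "pickle"),
    )

    def save(self):
        # return all fitted attributes in a JSON-serializable dict
        return {name: serialize_attr(getattr(self, name), ser)
                for name, ser, _ in self.fit_attributes}

    def load(self, state):
        # restore all fitted attributes from the saved state
        for name, _, deser in self.fit_attributes:
            setattr(self, name, deserialize_attr(state[name], deser))
```

With this pattern, a fitted pipe can round-trip through JSON and be restored in a fresh process without refitting.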
- name = None
- output_keys = ('df',)
- save() → Dict[str, Any]
  Return all fitted attributes of this Pipe in a Dict which is JSON serializable.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]
  Perform an operation on df using the params and the learnings from training. transform returns a tuple with the transformed dataset and some output. The transformed dataset will be the input for the next pipe; the output will be collected and shown to the user.
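The fit/transform contract above can be illustrated with a minimal, hypothetical pipe in plain Python (not using the actual library; the MeanImputePipe name and its list-valued 'df' are invented for this sketch):

```python
# A minimal, hypothetical pipe following the fit/transform contract:
# fit() learns from the data, transform() applies what was learned and
# returns a dict keyed by output_keys, which feeds the next pipe.
class MeanImputePipe:
    input_keys = ("df",)
    output_keys = ("df",)

    def fit(self, data, params):
        # learn the mean of the non-missing values
        vals = [v for v in data["df"] if v is not None]
        self.mean_ = sum(vals) / len(vals)

    def transform(self, data, params):
        # replace missing values with the learned mean
        return {"df": [self.mean_ if v is None else v for v in data["df"]]}
```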
dvb.datascience.pipeline module
class dvb.datascience.pipeline.Pipeline
Bases: object

A connector specifies which Pipe (identified by its name) and which output of that Pipe (identified by the output's key) will be the input to another Pipe (identified by its name) and which input of that Pipe (identified by its key).

Example

>>> pipeline = Pipeline()
>>> pipeline.addPipe('read', ReadCSV())
>>> pipeline.addPipe('impute', Impute(), [("read", "df", "df")])
>>> pipeline.fit_transform()
>>> pipeline.transform()
- addPipe(name: str, pipe: dvb.datascience.pipe_base.PipeBase, inputs: List[Tuple[Union[str, dvb.datascience.pipe_base.PipeBase], str, str]] = None, comment: str = None) → dvb.datascience.pipeline.Pipeline
  Add the pipe pipe to the pipeline under the given name. Optionally add the input connectors by passing them in inputs: a list with, for each input, a tuple of (output_pipe, output_key, input_key).
- current_transform_nr = -1
- draw_design()
  Returns an image with all pipes and connectors.
- end()
  When all fits and transforms are finished, end the pipeline so some cleanup can be done. At the moment this is mainly needed to close plots, so they won't be shown twice in the notebook.
- fit_transform(data: Optional[Dict[str, Any]] = None, transform_params: Optional[Dict[str, Any]] = None, fit_params: Optional[Dict[str, Any]] = None, name: str = 'fit', close_plt: bool = False) → None
  Train all pipes in the pipeline and run the transform for the first time.
- fit_transform_try(*args, **kwargs)
- static get_params(params: Dict, key: str, metadata: Dict = None) → Dict
  Get a dict with the contents of params relevant only to the pipe whose name is the given key. In addition, params['default'] and metadata will be added.
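The described merging could look roughly like the following plain-Python sketch. It is a hypothetical reimplementation for illustration only; in particular, the precedence of pipe-specific values over the 'default' section is an assumption, not something this reference states:

```python
def get_params(params, key, metadata=None):
    # Start from the shared 'default' section (assumed lowest precedence),
    # overlay the pipe-specific section, then attach metadata if given.
    merged = dict(params.get("default", {}))
    merged.update(params.get(key, {}))
    if metadata is not None:
        merged["metadata"] = metadata
    return merged
```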
- get_pipe(name) → Optional[dvb.datascience.pipe_base.PipeBase]
- get_pipe_input(name) → Optional[Dict]
  Get the input for the pipe with name from the transformed outputs. Returns a dict with all data when all data for the pipe can be collected; returns None when not all data is present yet.
- get_pipe_output(name: str, transform_nr: int = None) → Dict
  Get the output of the pipe with name for the given transform_nr (defaults to None, which selects the last one). When no output is present, an empty dict is returned.
- get_processable_pipes() → List[dvb.datascience.pipe_base.PipeBase]
  Get the pipes which are processable given the status of the pipeline.
- input_connectors = None
- static is_valid_name(name)
- load(file_path: str) → None
  Load the fitted parameters from the file at file_path and load them into all Pipes.
- output_connectors = None
- pipes = None
- reset_fit()
- save(file_path: str) → None
  Save the fitted parameters of all Pipes to the file at file_path.
- transform(data: Optional[Dict[str, Any]] = None, transform_params: Optional[Dict[str, Any]] = None, fit_params: Optional[Dict[str, Any]] = None, fit: bool = False, name: Optional[str] = None, close_plt: bool = False)
  When transform_params or fit_params contains a key 'default', those params will be given to all pipes, unless overridden by a pipe-specific value in transform_params or fit_params. The default can be useful for params which are needed in many pipes.
- transform_outputs = None
- transform_status = None
- transform_try(*args, **kwargs)
dvb.datascience.score module
class dvb.datascience.score.ClassificationScore(score_methods: List[str] = None)
Bases: dvb.datascience.classification_pipe_base.ClassificationPipeBase

Some scores for classification problems.
- accuracy() → float
- auc() → Optional[float]
- classification_report()
- confusion_matrix()
- fit(data: Dict[str, Any], params: Dict[str, Any])
  Train on the dataset df and store the learnings, so transform can be called later to transform based on those learnings.
- input_keys = ('predict', 'predict_metadata')
- log_loss()
- mcc(threshold: float = None) → float
- output_keys = ('scores',)
- params = None
- plot_auc()
- plot_confusion_matrix()
- plot_model_performance()
- possible_predict_methods = ['plot_model_performance']
- possible_score_methods = ['auc', 'plot_auc', 'accuracy', 'mcc', 'confusion_matrix', 'plot_confusion_matrix', 'precision_recall_curve', 'log_loss', 'classification_report', 'plot_model_performance']
- precision_recall_curve()
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]
  Perform an operation on df using the params and the learnings from training. transform returns a tuple with the transformed dataset and some output. The transformed dataset will be the input for the next pipe; the output will be collected and shown to the user.
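For reference, the two simplest of these scores can be computed by hand. The following plain-Python sketch illustrates what accuracy and the Matthews correlation coefficient (mcc) measure for binary labels; it is not the library's implementation:

```python
import math


def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    # Matthews correlation coefficient from the binary confusion matrix
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```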
dvb.datascience.sub_pipe_base module
class dvb.datascience.sub_pipe_base.PassData(subpipeline, output_keys)
Bases: dvb.datascience.pipe_base.PipeBase
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]
  Perform an operation on df using the params and the learnings from training. transform returns a tuple with the transformed dataset and some output. The transformed dataset will be the input for the next pipe; the output will be collected and shown to the user.
class dvb.datascience.sub_pipe_base.SubPipelineBase(output_pipe_name: str)
Bases: dvb.datascience.pipe_base.PipeBase
- fit_transform(data: Dict[str, Any], transform_params: Dict[str, Any], fit_params: Dict[str, Any]) → Dict[str, Any]
- load(state: Dict[str, Any]) → None
  Load all fitted attributes of this Pipe from state.
- save() → Dict[str, Any]
  Return all fitted attributes of this Pipe in a Dict which is JSON serializable.
- transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]
  Perform an operation on df using the params and the learnings from training. transform returns a tuple with the transformed dataset and some output. The transformed dataset will be the input for the next pipe; the output will be collected and shown to the user.
Module contents
- dvb.datascience.load_module(name: str, disable_warnings: bool = True, random_seed: Optional[int] = 1122) → Any
  Convenience function for running an experiment. This function reloads the experiment when it is already loaded, so any changes in the code of that experiment will be used. Usage:

  import dvb.datascience as ds
  p = ds.load_module('experiment').run()

  In case you define a 'run()' method in 'experiment.py' returning the pipeline object, p can be used to access the contents of the pipeline, like:

  p.get_pipe_output('predict')
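The reload-on-repeat behaviour described above can be sketched with the standard library. This is a hypothetical illustration, not the library's actual implementation (warning suppression and random seeding are omitted):

```python
import importlib
import sys


def load_module(name):
    # If the module was imported before, reload it so code changes take
    # effect; otherwise import it for the first time.
    if name in sys.modules:
        return importlib.reload(sys.modules[name])
    return importlib.import_module(name)
```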
- dvb.datascience.run_module(name: str, disable_warnings: bool = True) → Any