Welcome to dvb.datascience’s documentation!

dvb.datascience

A Python data science pipeline package.

At de Volksbank, our data scientists used to write a lot of overhead code from scratch for every experiment. To help them focus on the more exciting and value-adding parts of their jobs, we created this package. With it, you can easily create and reuse pipeline code (consisting of frequently used data transformations and modeling steps) across experiments.

This package has (among others) the following features:

Make easy-to-follow model pipelines of fits and transforms (what exactly is a pipeline?)
Make a graph of the pipeline
Output graphics, data, metadata, etc. from the pipeline steps
Data preprocessing such as filtering feature and observation outliers
Adding and merging intermediate dataframes
Every pipe stores all intermediate output, so the output can be inspected later on
Transforms can store the outputs of previous runs, so the data from different transforms can be compared in one graph
Data is in Pandas DataFrame format
Parameters for every pipe can be passed through the pipeline's fit_transform() and transform() methods
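
To make the idea of a pipeline of fits and transforms concrete, here is a minimal sketch in plain Python. It is a hypothetical illustration, not the dvb.datascience implementation: the names MeanCenter, Scale and TinyPipeline are made up for this example. It only shows the general pattern of fitting each step, transforming the data, and keeping every intermediate output for later inspection.

```python
# Hypothetical illustration only: these classes are NOT part of dvb.datascience.

class MeanCenter:
    """Learns the mean during fit and subtracts it during transform."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)

    def transform(self, values):
        return [v - self.mean_ for v in values]


class Scale:
    """Stateless step: multiplies every value by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, values):
        pass  # nothing to learn

    def transform(self, values):
        return [v * self.factor for v in values]


class TinyPipeline:
    """Runs named steps in order and keeps every intermediate output."""

    def __init__(self, steps):
        self.steps = steps    # list of (name, step) pairs
        self.outputs = {}     # step name -> output of the most recent run

    def fit_transform(self, values):
        for name, step in self.steps:
            step.fit(values)
            values = step.transform(values)
            self.outputs[name] = values
        return values

    def transform(self, values):
        # Reuses what each step learned during fit_transform.
        for name, step in self.steps:
            values = step.transform(values)
            self.outputs[name] = values
        return values


pipeline = TinyPipeline([("center", MeanCenter()), ("scale", Scale(10))])
train_result = pipeline.fit_transform([1.0, 2.0, 3.0])
new_result = pipeline.transform([4.0])
```

After the second call, pipeline.outputs["center"] holds the centered version of the new data, which is the kind of intermediate result the feature list refers to.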

Scope

This package was developed specifically for fast prototyping with relatively small datasets on a single machine. Because the intermediate output of each pipeline step can be stored, the package may underperform on larger datasets (100,000 rows or more).
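
As a rough back-of-the-envelope sketch of why stored intermediate output matters, the snippet below estimates the memory needed when every step keeps a full copy of a numeric table. The column count (50) and number of steps (10) are assumptions chosen for illustration, and the estimate assumes 8-byte float64 values; actual usage depends on the data types and on what each pipe really stores.

```python
# Rough estimate only; real DataFrame memory use depends on dtypes and copies.
BYTES_PER_FLOAT64 = 8


def stored_output_bytes(rows, columns, steps):
    """Bytes needed if each of `steps` pipes keeps a rows x columns float64 table."""
    return rows * columns * BYTES_PER_FLOAT64 * steps


# Assumed sizes for illustration: 100,000 rows, 50 columns, 10 pipeline steps.
estimate = stored_output_bytes(rows=100_000, columns=50, steps=10)
print(f"{estimate / 1e6:.0f} MB")
```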

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. For a more extensive overview of all the features, see the docs directory.

Prerequisites

This package requires Python 3 and has been developed and tested using Python 3.6.

Installing

The easiest way to install the library for use is:

pip install dvb.datascience

Development

To set up a checkout of the repository for developing dvb.datascience, run the following in the checkout directory:

pipenv install --dev

For using dvb.datascience in your project:

pipenv install dvb.datascience

Development - Anaconda

In the checkout directory, create and activate an environment and install the package:

conda create --name dvb.datascience
conda activate dvb.datascience
pip install -e .

Or install the released package with:

pip install dvb.datascience

Jupyter table-of-contents

When working with longer pipelines, the output in a Jupyter notebook can become quite long. We recommend installing the nbextensions package, which provides the toc2 (table of contents) extension:

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install

Next, start a Jupyter notebook, navigate to Edit > nbextensions config, enable the toc2 extension, and optionally set other properties. After that, return to your notebook, refresh the page, and click the toc icon in the menu to load the table of contents in the side panel.

Examples

This example loads the Iris dataset and makes some plots of it:

import dvb.datascience as ds


p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.TrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('boxplot', ds.eda.BoxPlot(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'train': True}})

This example shows a number of features of the package and its usage:

Adding 3 steps to the pipeline using addPipe().
Linking the 3 steps using tuples such as ("read", "df", "df"): the "df" output (2nd element) of the "read" pipe (1st element) is connected to the "df" input (3rd element) of the "split" pipe.
The usage of 3 subpackages: ds.data, ds.transform and ds.eda. The other 2 subpackages are ds.predictor and ds.score.
Passing additional input to p.fit_transform() for running the defined pipeline. This input can differ for each call to p.fit_transform() or p.transform().
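
The link tuples and per-call parameters described above can be pictured with a small plain-Python sketch. This is a hypothetical illustration of the (source pipe, output key, input key) format, not how dvb.datascience is implemented: the helper gather_inputs and the sample data are made up, and the way parameters are looked up per pipe name is an assumption based on the transform_params argument in the example.

```python
# Hypothetical sketch: mimics the (source pipe, output key, input key) link
# format from the example; it is not the dvb.datascience implementation.

def gather_inputs(links, outputs):
    """Build the keyword inputs for one pipe from earlier pipes' outputs."""
    return {
        input_key: outputs[source][output_key]
        for source, output_key, input_key in links
    }


# Output already produced by an upstream pipe named "read" (made-up data).
outputs = {"read": {"df": [[5.1, 3.5], [4.9, 3.0]]}}

# One link: the "df" output of "read" becomes the "df" input of the next pipe.
split_inputs = gather_inputs([("read", "df", "df")], outputs)

# Per-call parameters are keyed by pipe name, so each pipe sees only its own.
transform_params = {"split": {"train": True}}
split_params = transform_params.get("split", {})
boxplot_params = transform_params.get("boxplot", {})
```

Keying both the links and the parameters by pipe name is what lets the same pipeline definition be rerun with different settings on each call.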

This example applies the KNeighborsClassifier from sklearn to the Iris dataset:

import dvb.datascience as ds
from sklearn.neighbors import KNeighborsClassifier

p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
p.addPipe('score', ds.score.ClassificationScore(), [("clf", "predict", "predict"), ("clf", "predict_metadata", "predict_metadata")])
p.fit_transform()

This example shows:

The use of the KNeighborsClassifier from sklearn
Coupling multiple outputs as inputs: [("read", "df", "df"), ("read", "df_metadata", "df_metadata")]

For a more extensive overview of all the features, see the docs directory.

Unit testing

The unit tests for the project can be run using pytest:

pytest

Code coverage

Pytest will also output the coverage to the console.

To generate an HTML report, you can use:

pytest --cov-report html

Code styling

Code styling is done using Black.

Built With

For an extensive list, see setup.py.

scipy / numpy / pandas / matplotlib - For calculations and visualizations
sklearn - Machine learning algorithms
statsmodels - Statistics
mlxtend - Feature selection
tabulate - Printing tabular data
imblearn - SMOTE

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

Marc Rijken - Initial work - mrijken
Wouter Poncin - Maintenance - wpbs
Daan Knoope - Contributor - daanknoope
Christopher Huijting - Contributor - chuijting

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions, please don’t hesitate to contact us at tc@devolksbank.nl.

Work in progress

Adding support for multiclass classification problems
Adding support for regression problems
Adding support for Apache Spark ML