Welcome to dvb.datascience’s documentation!

dvb.datascience

A python data science pipeline package.

At de Volksbank, our data scientists used to write a lot of overhead code from scratch for every experiment. To help them focus on the more exciting and value-adding parts of their jobs, we created this package. With it, you can easily create and reuse pipeline code (consisting of frequently used data transformations and modeling steps) across experiments.

This package has (among others) the following features:

  • Make easy-to-follow model pipelines of fits and transforms (what exactly is a pipeline?)
  • Make a graph of the pipeline
  • Output graphics, data, metadata, etc. from the pipeline steps
  • Data preprocessing such as filtering feature and observation outliers
  • Adding and merging intermediate dataframes
  • Every pipe stores all intermediate output, so the output can be inspected later on
  • Transforms can store the outputs of previous runs, so the data from different transforms can be compared in one graph
  • Data is in Pandas DataFrame format
  • Parameters for every pipe can be given with the pipeline fit_transform() and transform() methods

Scope

This package was developed specifically for fast prototyping with relatively small datasets on a single machine. Because the intermediate output of each pipeline step can be stored, this package might underperform on bigger datasets (100,000 rows or more).

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. For a more extensive overview of all the features, see the docs directory.

Prerequisites

This package requires Python 3 and has been developed and tested using Python 3.6.
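
A quick way to confirm that the active interpreter meets this requirement is sketched below. The helper name meets_minimum is made up for illustration and is not part of dvb.datascience; the (3, 6) minimum is taken from the version mentioned above.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
if meets_minimum():
    print("Python version is sufficient")
else:
    print("Python 3.6 or newer is required")
```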

Installing

The easiest way to install the library for use in your own project is with pip:

pip install dvb.datascience
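
To check from Python whether the installation succeeded, something like the following sketch can be used. The function is_installed is a hypothetical helper, not part of the package, and it assumes the import name is dvb.datascience, as in the examples further down (import dvb.datascience as ds).

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located by the import system."""
    try:
        # For a dotted name, find_spec first imports the parent package.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError (an ImportError) is raised when the parent
        # package, e.g. `dvb`, is missing altogether.
        return False


print(is_installed("dvb.datascience"))
```

The result depends on the environment in which the snippet is run, so no expected output is shown.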

Development

To set up a development environment from a checkout of the dvb.datascience repository, run the following in the checkout directory:

pipenv install --dev

For using dvb.datascience in your project:

pipenv install dvb.datascience

Development - Anaconda

In the checkout directory, create and activate an environment and install the package in editable mode:

conda create --name dvb.datascience python=3.6
conda activate dvb.datascience
pip install -e .

Or install the released package instead:

pip install dvb.datascience

Jupyter table-of-contents

When working with longer pipelines, the output in a Jupyter notebook can become quite long. In that case it is advisable to install the nbextensions package, which provides the toc2 (table of contents) extension:

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install

Next, start a Jupyter notebook, navigate to Edit > nbextensions config, and enable the toc2 extension; optionally set other properties there. Then return to your notebook, refresh the page, and click the table-of-contents icon in the menu to load the TOC in the side panel.

Examples

This example loads the Iris dataset and makes some plots of it:

import dvb.datascience as ds


p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.TrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('boxplot', ds.eda.BoxPlot(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'train': True}})

This example shows a number of features of the package and its usage:

  • Adding 3 steps to the pipeline using addPipe().
  • Linking the 3 steps with connection tuples such as [("read", "df", "df")]: the "df" output (2nd element) of the "read" pipe (1st element) is connected to the "df" input (3rd element) of the pipe being added, here "split".
  • The use of 3 subpackages: ds.data, ds.transform and ds.eda. The other 2 subpackages are ds.predictor and ds.score.
  • The final call, p.fit_transform(), takes additional per-pipe input for running the defined pipeline; this input can differ for each call to p.fit_transform() or p.transform().
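
The wiring described in the list above can be illustrated with a small toy model. This is a sketch in plain Python, not the dvb.datascience implementation: ToyPipeline, add_pipe, run, and the read and split functions are all hypothetical names invented here, and the short list of numbers merely stands in for a DataFrame.

```python
class ToyPipeline:
    """Toy stand-in that mimics the wiring idea; NOT dvb.datascience code."""

    def __init__(self):
        self.pipes = {}    # pipe name -> function(inputs, params) -> dict of outputs
        self.links = {}    # pipe name -> list of (source, output_key, input_key)
        self.outputs = {}  # pipe name -> outputs of the most recent run

    def add_pipe(self, name, func, links=()):
        self.pipes[name] = func
        self.links[name] = list(links)

    def run(self, params=None):
        params = params or {}
        # Pipes run in the order they were added, so a source pipe must be
        # added before any pipe that consumes its output.
        for name, func in self.pipes.items():
            inputs = {
                input_key: self.outputs[source][output_key]
                for source, output_key, input_key in self.links[name]
            }
            self.outputs[name] = func(inputs, params.get(name, {}))
        return self.outputs


def read(inputs, params):
    # Stand-in for loading a dataset.
    return {"df": [5.1, 4.9, 6.3, 5.8, 7.1, 6.0]}


def split(inputs, params):
    rows = inputs["df"]
    cut = len(rows) - 2  # hold out the last two rows as the test part
    return {"df": rows[:cut] if params.get("train", True) else rows[cut:]}


p = ToyPipeline()
p.add_pipe("read", read)
p.add_pipe("split", split, [("read", "df", "df")])

# The same pipeline is run twice with different per-pipe parameters.
train = p.run({"split": {"train": True}})["split"]["df"]
test = p.run({"split": {"train": False}})["split"]["df"]
print(train)  # [5.1, 4.9, 6.3, 5.8]
print(test)   # [7.1, 6.0]
```

Because every pipe's output is kept in p.outputs, the intermediate result of "read" remains available for inspection after a run, which mirrors the idea of stored intermediate output described under the features.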

This example applies the KNeighborsClassifier from sklearn to the Iris dataset:

import dvb.datascience as ds

from sklearn.neighbors import KNeighborsClassifier
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
p.addPipe('score', ds.score.ClassificationScore(), [("clf", "predict", "predict"), ("clf", "predict_metadata", "predict_metadata")])
p.fit_transform()

This example shows:

  • The use of the KNeighborsClassifier from sklearn
  • Coupling multiple outputs of one pipe to the inputs of another: [("read", "df", "df"), ("read", "df_metadata", "df_metadata")]

For a more extensive overview of all the features, see the docs directory.

Unit testing

The unit tests for the project can be run using pytest:

pytest

Code coverage

Pytest will also output the coverage to the console.

To generate an html report, you can use:

pytest --cov-report html

Code styling

Code styling is done using Black.

Built With

For an extensive list, see setup.py

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

  • Marc Rijken - Initial work - mrijken
  • Wouter Poncin - Maintenance - wpbs
  • Daan Knoope - Contributor - daanknoope
  • Christopher Huijting - Contributor - chuijting

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contact

For any questions please don’t hesitate to contact us at tc@devolksbank.nl
