{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction dvb.datascience" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, you will learn what the basic usage of dvb.datascience is and how you can use in in your datascience activities.\n", "\n", "If you have any suggestions for features or if you encounter any bugs, please let us know at [tc@devolksbank.nl](mailto:tc@devolksbank.nl)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# %matplotlib inline\n", "import dvb.datascience as ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining a pipeline\n", "Defining a pipeline is just adding some Pipes (actions) which will be connected\n", "\n", "Every Pipe can have 0, 1 or more inputs from other pipes.\n", "Every Pipe can have 0, 1 or more outputs to other pipes.\n", "Every Pipe has a name. Every input and output of the pipe has a key by which the input/output is identified. The name of the Pipe and the key of the input/output are used to connect pipes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe(\"read\", ds.data.SampleData(dataset_name=\"iris\"))\n", "p.addPipe(\"metadata\", ds.data.DataPipe(\"df_metadata\", {\"y_true_label\": \"label\"}))\n", "p.addPipe(\"write\",ds.data.CSVDataExportPipe(\"dump_input_to_output.csv\", sep=\",\", index_label=\"caseId\"),[(\"read\", \"df\", \"df\")],)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pipeline has two main methods: `fit_transform()` and `transform().fit_transform()` is training the pipeline. Depending on the Pipe, the training can be computing the mean, making a decision tree, etc. During the transform, those learnings are used to transform() the input to output, for example by replacing outliers by means, predicting with the trained model, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p.fit_transform()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the transform, the output of the transform is available." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p.get_pipe_output('read')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple inputs\n", "Some Pipes have multiple inputs, for example to merge two datasets we can do the following." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read1', ds.data.CSVDataImportPipe())\n", "p.addPipe('read2', ds.data.CSVDataImportPipe())\n", "p.addPipe('merge', ds.transform.Union(2, axis=0, join='outer'), [(\"read1\", \"df\", \"df0\"), (\"read2\", \"df\", \"df1\")])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p.fit_transform(transform_params={'read1': {'file_path': '../test/data/train.csv'}, 'read2': {'file_path': '../test/data/test.csv'}})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p.get_pipe_output('merge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plots\n", "It's easy to get some plots of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [(\"read\", \"df\", \"df\")])\n", "p.addPipe('boxplot', ds.eda.BoxPlot(), [(\"split\", \"df\", \"df\")])\n", "p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.get_pipe_output('read')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some plots can combine transforms to one plot\n", "You can add a name to the transform in order to add it to the legend.\n", "By default, the transform won't close the plots. So when you leave out close_plt=True in the call of (fit_)transform, plots of the next transform will be integrated in the plots of the previous transform.\n", "Do not forget to call close_plt=True on the last transform, otherwise all plots will remain open and will be plotted by jupyter again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [(\"read\", \"df\", \"df\")])\n", "p.addPipe('ecdf', ds.eda.ECDFPlots(), [(\"split\", \"df\", \"df\")])\n", "p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})\n", "p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [(\"read\", \"df\", \"df\")])\n", "p.addPipe('scatter', ds.eda.ScatterPlots(), [(\"split\", \"df\", \"df\")])\n", "p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})\n", "p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [(\"read\", \"df\", \"df\")])\n", "p.addPipe('hist', ds.eda.Hist(), [(\"split\", \"df\", \"df\")])\n", "p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})\n", "p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drawing a pipeline\n", "Once defined, a pipeline can be drawn." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.CSVDataImportPipe())\n", "p.addPipe('read2', ds.data.CSVDataImportPipe())\n", "p.addPipe('numeric', ds.transform.FilterTypeFeatures(), [(\"read\", \"df\", \"df\")])\n", "p.addPipe('numeric2', ds.transform.FilterTypeFeatures(), [(\"read2\", \"df\", \"df\")])\n", "p.addPipe('boxplot', ds.eda.BoxPlot(), [(\"numeric\", \"df\", \"df\"), (\"numeric2\", \"df\", \"df\")])\n", "p.draw_design()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [(\"read\", \"df\", \"df\"), (\"read\", \"df_metadata\", \"df_metadata\")])\n", "p.fit_transform()\n", "p.get_pipe_output('clf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scoring" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "p = ds.Pipeline()\n", "p.addPipe('read', ds.data.SampleData('iris'))\n", "p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [(\"read\", \"df\", \"df\"), (\"read\", \"df_metadata\", \"df_metadata\")])\n", "p.addPipe('score', ds.score.ClassificationScore(), [(\"clf\", \"predict\", \"predict\"), (\"clf\", \"predict_metadata\", \"predict_metadata\")])\n", "p.fit_transform()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fetching the output\n", "You can fetch the output of a pipe using the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.get_pipe_output('clf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confusion matrix\n", "You can print the confusion matrix of a score pipe using the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.get_pipe('score').plot_confusion_matrix()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Precision Recall Curve\n", "And the same holds for the precision recall curve:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.get_pipe('score').precision_recall_curve()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# AUC plot\n", "As well as the AUC plot" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p.get_pipe('score').plot_auc()\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "toc": { "base_numbering": 1.0, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }