dvb.datascience.data package

Submodules

dvb.datascience.data.arff module

class dvb.datascience.data.arff.ARFFDataExportPipe

Bases: dvb.datascience.pipe_base.PipeBase

Exports a dataframe to an ARFF file and writes it to disk.

Args:
    file_path (str): Path of the file to write
    wekaname (str): The wekaname to be used
Returns:
    A file written to file_path.
file_path = None
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on a dataset df and store the learnings, so that transform can be called later to transform data based on those learnings.

fit_attributes = [('file_path', None, None), ('wekaname', None, None)]
input_keys = ('df',)
output_keys = ()
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

wekaname = None
class dvb.datascience.data.arff.ARFFDataImportPipe

Bases: dvb.datascience.pipe_base.PipeBase

Imports an ARFF file and returns a dataframe.

Args:
    file_path (str): Path of the file to import
Returns:
    A dataframe.
file_path = None
fit(data: Dict[str, Any], params: Dict[str, Any])

Train on a dataset df and store the learnings, so that transform can be called later to transform data based on those learnings.

fit_attributes = [('file_path', None, None)]
input_keys = ()
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
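A minimal usage sketch, assuming the pipes can be driven directly through the fit/transform interface shown above and that file_path and wekaname are supplied via the fit params (inferred from the fit_attributes lists; the file names are hypothetical):

>>> from dvb.datascience.data.arff import ARFFDataImportPipe, ARFFDataExportPipe
>>> reader = ARFFDataImportPipe()
>>> reader.fit(data={}, params={"file_path": "weather.arff"})
>>> df = reader.transform(data={}, params={})["df"]  # output_keys == ('df',)
>>> writer = ARFFDataExportPipe()
>>> writer.fit(data={}, params={"file_path": "weather_out.arff", "wekaname": "weather"})
>>> writer.transform(data={"df": df}, params={})  # consumes the 'df' input key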

dvb.datascience.data.csv module

class dvb.datascience.data.csv.CSVDataExportPipe(file_path: str = None, sep: str = None, **kwargs)

Bases: dvb.datascience.pipe_base.PipeBase

Exports a dataframe to a CSV file. Takes as input file_path (str) and sep (str). Writes a CSV file to the specified location.

input_keys = ('df',)
output_keys = ()
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.

class dvb.datascience.data.csv.CSVDataImportPipe(file_path: str = None, content: str = None, sep: str = None, engine: str = 'python', index_col: str = None)

Bases: dvb.datascience.pipe_base.PipeBase

Imports data from CSV and creates a dataframe using pd.read_csv().

Args:
    file_path (str): Path of the file to read
    content (str): Raw data to import
    sep (str): Separator character to use
    engine (str): Engine to be used; defaults to "python"
    index_col (str): Column to use as index
Returns:
    A dataframe with index_col as its index column.
input_keys = ()
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
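A minimal sketch of round-tripping a dataframe, assuming the pipes can be called directly via transform() outside a pipeline; the file names and the "id" index column are hypothetical:

>>> from dvb.datascience.data.csv import CSVDataImportPipe, CSVDataExportPipe
>>> reader = CSVDataImportPipe(file_path="patients.csv", sep=",", index_col="id")
>>> df = reader.transform(data={}, params={})["df"]  # output_keys == ('df',)
>>> writer = CSVDataExportPipe(file_path="patients_out.csv", sep=";")
>>> writer.transform(data={"df": df}, params={})  # consumes the 'df' input key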

dvb.datascience.data.excel module

class dvb.datascience.data.excel.ExcelDataImportPipe(file_path: str = None, sheet_name=0, index_col: str = None)

Bases: dvb.datascience.pipe_base.PipeBase

Imports data from Excel and creates a dataframe using pd.read_excel().

Args:
    file_path (str): Path of the file to read
    sheet_name (int): Sheet number to be used (default 0)
    index_col (str): Index column to be used
Returns:
    A dataframe with index_col as its index column.
input_keys = ()
output_keys = ('df',)
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
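A minimal sketch, again assuming direct invocation outside a pipeline; the file name and index column are hypothetical:

>>> from dvb.datascience.data.excel import ExcelDataImportPipe
>>> reader = ExcelDataImportPipe(file_path="patients.xlsx", sheet_name=0, index_col="id")
>>> df = reader.transform(data={}, params={})["df"]  # output_keys == ('df',)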

dvb.datascience.data.teradata module

class dvb.datascience.data.teradata.TeraDataImportPipe

Bases: dvb.datascience.pipe_base.PipeBase

Reads data from Teradata and returns a dataframe.

Args:
    file_path (str): Path of a file containing the SQL query to run
    sql (str): Raw SQL query to be used
Returns:
    A dataframe built using pd.read_sql_query(), with its index sorted alphabetically.
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
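Per the description above, the pipe boils down to running the query with pd.read_sql_query() and sorting the index. A sketch of that core logic (the connection object and the helper name are hypothetical; the pipe itself manages the Teradata session):

>>> import pandas as pd
>>> def read_teradata(connection, sql: str) -> pd.DataFrame:
...     df = pd.read_sql_query(sql, connection)  # run the raw SQL query
...     return df.sort_index()                   # index sorted alphabetically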

class dvb.datascience.data.teradata.customDataTypeConverter

Bases: teradata.datatypes.DefaultDataTypeConverter

Converts Teradata data types to the corresponding Python types: replaces the decimal comma with a decimal point and maps BYTEINT, BIGINT, SMALLINT and INTEGER to the Python type int.

convertValue(dbType, dataType, typeCode, value)

Converts the value returned by the database into the desired Python object.
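A sketch of plugging the converter into a session created with the teradata package; the connection parameters are placeholders, and passing dataTypeConverter to connect() is an assumption based on the teradata package's interface:

>>> import teradata
>>> from dvb.datascience.data.teradata import customDataTypeConverter
>>> udaExec = teradata.UdaExec(appName="example", version="1.0", logConsole=False)
>>> session = udaExec.connect(method="odbc", system="tdprod",
...     username="user", password="secret",
...     dataTypeConverter=customDataTypeConverter())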

Module contents

class dvb.datascience.data.DataPipe(key: str = 'data', data=None)

Bases: dvb.datascience.pipe_base.PipeBase

Add some data to the pipeline via the constructor, the fit params or the transform params. The data can be added at three different moments:

>>> pipe = DataPipe(data=[1,2,3])
>>> pipeline.fit_transform(fit_params={"data": [4,5,6]})
>>> pipeline.transform(transform_params={"data": [7,8,9]})

The most recently supplied data will be used.

fit(data: Dict[str, Any], params: Dict[str, Any])

Train on a dataset df and store the learnings, so that transform can be called later to transform data based on those learnings.

transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
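A minimal sketch of using DataPipe standalone; that transform() emits the stored data under the configured key is an assumption based on the class description above:

>>> from dvb.datascience.data import DataPipe
>>> pipe = DataPipe(key="df_metadata", data={"y_true_label": "target"})
>>> out = pipe.transform(data={}, params={})  # expected: data under 'df_metadata'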

class dvb.datascience.data.GeneratedSampleClassification(n_classes: int = 10, n_features: int = 20, n_samples: int = 100, random_state: int = None)

Bases: dvb.datascience.pipe_base.PipeBase

input_keys = ()
output_keys = ('df', 'df_metadata')
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
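No description is given above; judging by its name and parameters, the class presumably generates a synthetic classification dataset (in the style of sklearn's make_classification). A hedged usage sketch, assuming direct invocation outside a pipeline:

>>> from dvb.datascience.data import GeneratedSampleClassification
>>> gen = GeneratedSampleClassification(n_classes=3, n_features=4,
...     n_samples=150, random_state=42)
>>> out = gen.transform(data={}, params={})
>>> df, metadata = out["df"], out["df_metadata"]  # output_keys as documented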

class dvb.datascience.data.SampleData(dataset_name: str = 'iris')

Bases: dvb.datascience.pipe_base.PipeBase

input_keys = ()
output_keys = ('df', 'df_metadata')
possible_dataset_names = ['iris', 'diabetes', 'wine', 'boston', 'breast_cancer', 'digits', 'linnerud']
transform(data: Dict[str, Any], params: Dict[str, Any]) → Dict[str, Any]

Perform an operation on df using the params and the learnings from training. Transform returns a dict with the transformed dataset and some output. The transformed dataset will be the input for the next pipe. The output will be collected and shown to the user.
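The dataset names match sklearn's toy datasets, so the pipe presumably wraps the corresponding sklearn.datasets loaders. A minimal sketch, assuming direct invocation outside a pipeline:

>>> from dvb.datascience.data import SampleData
>>> sample = SampleData(dataset_name="wine")  # any name from possible_dataset_names
>>> out = sample.transform(data={}, params={})
>>> df, metadata = out["df"], out["df_metadata"]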