fairxai.data.descriptor package

Submodules

fairxai.data.descriptor.base_descriptor module

class fairxai.data.descriptor.base_descriptor.BaseDatasetDescriptor(data)

Bases: ABC

Represents an abstract base class for dataset descriptors.

This class serves as a blueprint for dataset descriptor implementations, allowing for uniform representation of datasets. It enforces the implementation of a method to describe the dataset in a structured manner.

Initializes the descriptor and stores the provided data on the instance.

Parameters:

data – The data to be associated with this instance.

abstract describe() → dict

Abstract method that every subclass must implement to return a structured description of the dataset as a dictionary. The keys of the returned dictionary depend on the dataset type handled by the subclass.
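The subclassing contract described above can be sketched as follows. This is a minimal illustration, not the library source: the BaseDatasetDescriptor below is a local stand-in whose shape is assumed from this page, and ListDatasetDescriptor is a hypothetical subclass invented for the example.

```python
from abc import ABC, abstractmethod


class BaseDatasetDescriptor(ABC):
    """Local stand-in for the documented base class (assumed shape)."""

    def __init__(self, data):
        # Store the dataset on the instance, as the constructor docs describe.
        self.data = data

    @abstractmethod
    def describe(self) -> dict:
        """Return a structured description of the dataset."""


class ListDatasetDescriptor(BaseDatasetDescriptor):
    """Hypothetical subclass that describes a plain Python list."""

    def describe(self) -> dict:
        return {"type": "list", "n_samples": len(self.data)}


descriptor = ListDatasetDescriptor([1, 2, 3])
description = descriptor.describe()
```

Because describe is abstract, instantiating the base class directly raises a TypeError; only subclasses that implement it can be created.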

fairxai.data.descriptor.image_descriptor module

class fairxai.data.descriptor.image_descriptor.ImageDatasetDescriptor(data: List[str | ndarray])

Bases: BaseDatasetDescriptor

Descriptor for image datasets.

Analyzes a dataset composed of NumPy arrays or image file paths, providing metadata including number of samples, resolution, number of channels, and optional model input shape via hwc_permutation.

Initialize the descriptor with dataset data.

Parameters:

data (list[Union[str, np.ndarray]]) – List of image file paths or NumPy arrays

describe(hwc_permutation: List[int] | None = None) → dict

Analyze and describe the dataset.

Parameters:

hwc_permutation (list[int], optional) – Permutation of the image dimensions expected by the model (e.g., [1, 2, 0])

Returns:

Dictionary containing dataset description

Return type:

dict

Raises:
  • ValueError – If the dataset is empty or the permutation is invalid

  • TypeError – If the dataset contains unsupported types
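To make the idea of a dimension permutation concrete, the sketch below shows how a permutation such as [1, 2, 0] reorders the axes of a NumPy array. It illustrates NumPy's axis semantics only. How describe interprets hwc_permutation internally is not stated on this page, and is_valid_permutation is a hypothetical helper, not part of the library.

```python
import numpy as np

# A channels-first (C, H, W) image: 3 channels, 32 rows, 48 columns.
chw_image = np.zeros((3, 32, 48), dtype=np.uint8)

# With np.transpose, new axis i is taken from old axis permutation[i],
# so [1, 2, 0] turns (C, H, W) into (H, W, C).
permutation = [1, 2, 0]
hwc_image = np.transpose(chw_image, permutation)


def is_valid_permutation(perm, ndim=3):
    """Hypothetical check: perm must contain each axis index exactly once."""
    return sorted(perm) == list(range(ndim))
```

A list such as [0, 0, 1] repeats an axis and would fail this check, which is one plausible reading of the "permutation is invalid" error case.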

fairxai.data.descriptor.tabular_descriptor module

class fairxai.data.descriptor.tabular_descriptor.TabularDatasetDescriptor(data: DataFrame, categorical_columns: list = None, ordinal_columns: list = None)

Bases: BaseDatasetDescriptor

Handles the description of a tabular dataset by categorizing its columns into categorical, ordinal, and numeric types and providing summary statistics.

This class requires explicit declaration of all non-numeric columns through the categorical_columns and ordinal_columns parameters. Columns not listed there and not recognized as numeric (based on their dtype) will raise a ValueError during the description process.

It provides methods to describe the dataset, retrieve specific column types, and export the computed descriptions as a dictionary.

data

The main tabular dataset for analysis.

Type:

DataFrame

categorical_columns

A list of column names which are considered categorical variables.

Type:

list

ordinal_columns

A list of column names which are considered ordinal variables.

Type:

list

Initializes the descriptor with the dataset and the declared non-numeric columns.

Parameters:
  • data – The DataFrame to describe.

  • categorical_columns – Optional list of column names to treat as categorical.

  • ordinal_columns – Optional list of column names to treat as ordinal.

describe(target: Series = None, target_name: str = None) → dict

Compute column descriptors for numeric, categorical, and ordinal features.

Parameters:
  • target – Optional target column (Series). If provided, its summary is included under ‘target’ in the returned descriptor.

  • target_name – Optional name of the target column.

Returns:

Descriptor dictionary including features and optional target.

Return type:

dict
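The column classification rule stated for this class (declared categorical and ordinal columns, numeric columns detected by dtype, ValueError for anything else) can be sketched as below. classify_columns is a hypothetical helper written for illustration. It mirrors the documented rule but is not the library's implementation, and the returned keys are assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns(df, categorical_columns=None, ordinal_columns=None):
    """Hypothetical helper mirroring the documented classification rule."""
    categorical = list(categorical_columns or [])
    ordinal = list(ordinal_columns or [])
    numeric = []
    for column in df.columns:
        if column in categorical or column in ordinal:
            continue  # explicitly declared by the caller
        if is_numeric_dtype(df[column]):
            numeric.append(column)
        else:
            # Undeclared non-numeric columns are rejected.
            raise ValueError(f"Column {column!r} is not numeric and was not declared")
    return {"numeric": numeric, "categorical": categorical, "ordinal": ordinal}


df = pd.DataFrame(
    {"age": [31, 45], "city": ["Rome", "Pisa"], "grade": ["low", "high"]}
)
columns = classify_columns(df, categorical_columns=["city"], ordinal_columns=["grade"])
```

Calling classify_columns(df) without declaring "city" and "grade" raises a ValueError, which matches the behavior described for undeclared non-numeric columns.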

get_categorical_columns()

Returns the list of categorical column names.

Returns:

A list containing the names of categorical columns.

Return type:

List[str]

get_numeric_columns()

Returns the names of numeric columns.

Returns:

A list containing the names of numeric columns.

Return type:

list

get_ordinal_columns()

Retrieves the list of ordinal column names.

Returns:

A list of column names corresponding to ordinal data.

Return type:

list

fairxai.data.descriptor.text_descriptor module

class fairxai.data.descriptor.text_descriptor.TextDatasetDescriptor(data)

Bases: BaseDatasetDescriptor

Descriptor for text datasets that analyzes and describes textual data.

This class extends BaseDatasetDescriptor to provide specific functionality for text-based datasets, supporting both raw text strings and dictionary formats.

Initializes the descriptor and stores the provided data on the instance.

Parameters:

data – The data to be associated with this instance.

describe() → dict

Analyzes the text dataset and returns a dictionary with descriptive information.

Returns:

A dictionary containing:
  • type: Always “text”

  • n_documents: Total number of documents

  • input_format: Either “dict” or “raw_text”

  • Additional format-specific metadata

Return type:

dict

Raises:
  • ValueError – If the dataset is empty

  • TypeError – If the data format is not supported (not string or dict)
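The documented return keys and error cases can be sketched with a small standalone function. describe_text is hypothetical and is not the library's code. It assumes the dataset is a list and that the input format is inferred from the first element, neither of which is confirmed on this page. The format-specific metadata is omitted because it is not specified here.

```python
def describe_text(data):
    """Hypothetical sketch of the documented keys and error cases."""
    if len(data) == 0:
        raise ValueError("dataset is empty")
    first = data[0]  # assumption: the first element determines the format
    if isinstance(first, str):
        input_format = "raw_text"
    elif isinstance(first, dict):
        input_format = "dict"
    else:
        raise TypeError("documents must be strings or dicts")
    return {
        "type": "text",
        "n_documents": len(data),
        "input_format": input_format,
    }


summary = describe_text(["first document", "second document"])
```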

fairxai.data.descriptor.timeserie_descriptor module

class fairxai.data.descriptor.timeserie_descriptor.TimeSeriesDatasetDescriptor(data)

Bases: BaseDatasetDescriptor

Descriptor for time series datasets.

This class analyzes time series data stored in a pandas DataFrame and provides structured information about the dataset, including the number of series, total rows, and temporal range.

Initializes the descriptor and stores the provided data on the instance.

Parameters:

data – The data to be associated with this instance.

describe() → dict

Analyzes the time series dataset and returns a description dictionary.

Returns:

A dictionary containing:
  • type (str): Always “timeseries”

  • n_rows (int): Total number of rows in the dataset

  • n_series (int): Number of unique time series (based on ‘id’ column if present)

  • timestamps_range (tuple): Min and max timestamps (if ‘timestamp’ column exists)

Return type:

dict

Raises:

TypeError – If the data is not a pandas DataFrame
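The fields listed above can be computed with plain pandas, as in the sketch below. describe_timeseries is a hypothetical function, not the library's implementation. In particular, treating a frame without an 'id' column as a single series is an assumption, since this page does not say what n_series is in that case.

```python
import pandas as pd


def describe_timeseries(df):
    """Hypothetical sketch of the documented description fields."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    description = {"type": "timeseries", "n_rows": len(df)}
    # Assumption: without an 'id' column the frame holds a single series.
    description["n_series"] = df["id"].nunique() if "id" in df.columns else 1
    if "timestamp" in df.columns:
        description["timestamps_range"] = (
            df["timestamp"].min(),
            df["timestamp"].max(),
        )
    return description


df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
        "value": [1.0, 2.0, 3.0],
    }
)
summary = describe_timeseries(df)
```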

Module contents