fairxai.data.dataset package

Submodules

fairxai.data.dataset.dataset module

class fairxai.data.dataset.dataset.Dataset(data=None, class_name: str = None)[source]

Bases: ABC

Generic abstract class to handle datasets of different modalities (tabular, image, text, timeseries).

data: The raw dataset (DataFrame, list, or array depending on modality)

descriptor: Dictionary describing dataset structure and statistics

class_name: Optional target column name (for tabular datasets)

target: Optional, contains the target column values

set_class_name(class_name: str)[source]: Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)[source]: Assign a descriptor dictionary to the dataset.

abstract update_descriptor(*args, **kwargs)[source]: Must create and assign the dataset descriptor. Each subclass should call its specific BaseDatasetDescriptor.describe().
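The contract above (an abstract base whose subclasses must build and assign a descriptor) can be sketched with the standard library alone. This is a minimal illustration, not fairxai's implementation: SketchDataset, WordListDataset, and the descriptor keys are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class SketchDataset(ABC):
    """Hypothetical stand-in for the documented Dataset base class."""

    def __init__(self, data: Any = None, class_name: Optional[str] = None):
        self.data = data
        self.class_name = class_name
        self.descriptor: Optional[Dict[str, Any]] = None

    def set_descriptor(self, descriptor: Dict[str, Any]) -> None:
        # Assign a descriptor dictionary to the dataset.
        self.descriptor = descriptor

    @abstractmethod
    def update_descriptor(self, *args, **kwargs) -> Dict[str, Any]:
        """Subclasses must create and assign the descriptor."""


class WordListDataset(SketchDataset):
    """Toy subclass describing a list of strings (illustrative only)."""

    def update_descriptor(self, *args, **kwargs) -> Dict[str, Any]:
        # The descriptor keys below are assumptions made for this sketch.
        descriptor = {
            "n_items": len(self.data),
            "max_length": max((len(s) for s in self.data), default=0),
        }
        self.set_descriptor(descriptor)
        return descriptor


ds = WordListDataset(["fair", "explainable"])
summary = ds.update_descriptor()
```

Because update_descriptor is abstract, instantiating SketchDataset directly raises TypeError; only concrete subclasses can be created.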

fairxai.data.dataset.dataset_factory module

class fairxai.data.dataset.dataset_factory.DatasetFactory[source]

Bases: object

Factory class responsible for creating dataset instances (tabular, image, text, timeseries) using a registry pattern and dataset-specific initialization parameters.

classmethod create(data: Any, dataset_type: str, class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None)[source]

Create and return a dataset instance based on the specified type.

For tabular datasets, additional arguments such as categorical and ordinal columns can be provided to correctly configure the dataset descriptor.

Parameters:

data (Any) – Input data.
dataset_type (str) – One of [“tabular”, “image”, “text”, “timeseries”].
class_name (str, optional) – Target/label column name (for supervised datasets).
categorical_columns (list[str], optional) – Columns to treat as categorical (tabular only).
ordinal_columns (list[str], optional) – Columns to treat as ordinal (tabular only).

Returns:

An instance of the appropriate dataset subclass.

Return type:

Dataset

Raises:

ValueError – If the dataset_type is unsupported.

classmethod get_class(dataset_type: str)[source]

Return the dataset class corresponding to the dataset_type string.

Parameters:

dataset_type (str) – “tabular”, “image”, “text”, or “timeseries”.

Returns:

Dataset subclass (type).

Raises:

ValueError – If dataset_type is unsupported.
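The registry pattern mentioned for DatasetFactory can be illustrated with a small stand-in. This is a sketch under stated assumptions: SketchFactory, its register decorator, and the ToyTabular and ToyText classes are hypothetical and do not reflect how fairxai populates its registry; only the create and get_class behavior (type lookup, ValueError on unsupported types, tabular-only column hints) follows the description above.

```python
from typing import Any, Callable, Dict, Optional


class SketchFactory:
    """Hypothetical stand-in showing a registry-backed factory."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, dataset_type: str) -> Callable[[type], type]:
        # Hypothetical helper: maps a type string to a dataset class.
        def decorator(dataset_cls: type) -> type:
            cls._registry[dataset_type] = dataset_cls
            return dataset_cls
        return decorator

    @classmethod
    def get_class(cls, dataset_type: str) -> type:
        if dataset_type not in cls._registry:
            raise ValueError(f"Unsupported dataset_type: {dataset_type!r}")
        return cls._registry[dataset_type]

    @classmethod
    def create(cls, data: Any, dataset_type: str,
               class_name: Optional[str] = None, **tabular_options: Any):
        dataset_cls = cls.get_class(dataset_type)
        if dataset_type == "tabular":
            # Only tabular datasets receive the column hints.
            return dataset_cls(data, class_name, **tabular_options)
        return dataset_cls(data, class_name)


@SketchFactory.register("tabular")
class ToyTabular:
    def __init__(self, data, class_name=None,
                 categorical_columns=None, ordinal_columns=None):
        self.data = data
        self.class_name = class_name
        self.categorical_columns = categorical_columns or []
        self.ordinal_columns = ordinal_columns or []


@SketchFactory.register("text")
class ToyText:
    def __init__(self, data, class_name=None):
        self.data = data
        self.class_name = class_name


tabular = SketchFactory.create([{"job": "nurse", "label": 1}], "tabular",
                               class_name="label", categorical_columns=["job"])
text = SketchFactory.create(["a short document"], "text")
```

Keeping the lookup in get_class means create and any other caller share a single place where unsupported types are rejected.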

fairxai.data.dataset.image_dataset module

class fairxai.data.dataset.image_dataset.ImageDataset(data: str | ndarray | List[ndarray], class_name: str | None = None)[source]

Bases: Dataset

Represents an image dataset that can be loaded either from a folder containing image files or directly from in-memory NumPy arrays.

The dataset supports two serialization modes:

Folder-based dataset: Only the folder path is serialized. Images are reloaded at project load time.
Memory-based dataset: Raw NumPy arrays are saved to a compressed .npz file inside the project folder. This allows reconstruction of datasets not tied to an external file system.

Parameters:

data (str | np.ndarray | list[np.ndarray]) – Either a folder path or image arrays.
class_name (str | None, optional) – Optional class label for the dataset.

Raises:

TypeError – If data is not a supported type.
ValueError – If no valid images can be loaded.

classmethod from_dict(meta: Dict, project_path: str) → ImageDataset[source]

Reconstruct an ImageDataset instance from serialized metadata.

Parameters:

meta (dict) – Serialized dataset information.
project_path (str) – Filesystem path to the root of the project.

Return type:

ImageDataset

Raises:

FileNotFoundError – If memory-based dataset arrays are missing.
ValueError – For unknown dataset source types.

get_instance(key: int | str) → ndarray[source]

Retrieve a single image instance either by index or filename.

Parameters:

key (int | str) – Integer index or filename.

Returns:

The requested image.

Return type:

np.ndarray

Raises:

IndexError – If index is out of range.
ValueError – If filename lookup fails or filenames are unavailable.
TypeError – If key is neither int nor str.
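The lookup rules for get_instance (index or filename, with three distinct error types) might be sketched as follows. The function lookup_image and its arguments are hypothetical; nested lists stand in for image arrays, and rejecting negative indices is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, List, Optional, Union


def lookup_image(images: List[Any], filenames: Optional[List[str]],
                 key: Union[int, str]) -> Any:
    """Return one image by integer index or by filename (illustrative)."""
    if isinstance(key, int):
        # Assumption: only non-negative, in-range indices are accepted.
        if not 0 <= key < len(images):
            raise IndexError(f"index {key} is out of range")
        return images[key]
    if isinstance(key, str):
        if filenames is None:
            raise ValueError("filenames are unavailable for this dataset")
        if key not in filenames:
            raise ValueError(f"no image named {key!r}")
        return images[filenames.index(key)]
    raise TypeError("key must be an int or a str")


# Two tiny 1x2 "images" represented as nested lists.
images = [[[0, 0]], [[255, 255]]]
names = ["black.png", "white.png"]

by_index = lookup_image(images, names, 1)
by_name = lookup_image(images, names, "black.png")
```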

save_memory_data(dest_folder: str) → None[source]

Save memory-based image arrays into a compressed .npz file.

Parameters:

dest_folder (str) – Directory where images.npz will be written.

Notes

If the dataset originates from a folder, this method does nothing.

set_class_name(class_name: str): Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict): Assign a descriptor dictionary to the dataset.

to_dict() → Dict[source]

Serialize dataset metadata into a dictionary.

Notes

Folder datasets store only folder_path.
Memory datasets do not store raw image arrays here; arrays are saved separately via save_memory_data.

Returns:

Metadata describing how to reconstruct the dataset.

Return type:

dict

update_descriptor(hwc_permutation: List[int] | None = None) → Dict[source]

Compute and attach the dataset descriptor.

Parameters:

hwc_permutation (list[int] | None) – Optional permutation of axes (H, W, C).

Returns:

The computed descriptor.

Return type:

dict
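As an illustration of what an axis permutation does, the sketch below reorders a shape tuple. It assumes the permutation follows the same convention as numpy.transpose (output axis i takes input axis permutation[i]); the documentation above does not state which convention hwc_permutation uses, and permute_shape is a hypothetical helper.

```python
from typing import List, Tuple


def permute_shape(shape: Tuple[int, int, int],
                  permutation: List[int]) -> Tuple[int, ...]:
    """Reorder a 3-axis shape; output axis i takes input axis permutation[i]."""
    if sorted(permutation) != [0, 1, 2]:
        raise ValueError("permutation must contain 0, 1 and 2 exactly once")
    return tuple(shape[axis] for axis in permutation)


# A channels-first image (C, H, W) rearranged to (H, W, C).
chw_shape = (3, 224, 160)
hwc_shape = permute_shape(chw_shape, [1, 2, 0])
```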

fairxai.data.dataset.tabular_dataset module

class fairxai.data.dataset.tabular_dataset.TabularDataset

Bases: Dataset

Tabular dataset container for explainers.

Supports initialization from:

pandas DataFrame
CSV file path
dict or list of dict

The dataset supports two serialization modes:

CSV-based dataset: Only the CSV path is stored. Data is reloaded from the original CSV on load.
Memory-based dataset: Raw DataFrame is saved as CSV inside the project folder for persistence.

Features and target are separated:

self._data: features-only DataFrame
self._target: target Series (if class_name provided)

Initialize TabularDataset.

Parameters:

data (DataFrame | str | dict | list[dict]) – Source data
class_name (str | None) – Target column name, if present
categorical_columns (list[str] | None) – Column names to treat as categorical
ordinal_columns (list[str] | None) – Column names to treat as ordinal
dropna (bool) – If reading CSV, drop rows with missing values

property X: DataFrame

property features: List[str]: Return feature names according to descriptor.

classmethod from_dict(meta: Dict[str, Any], project_path: str | None = None) → TabularDataset[source]

Reconstruct a TabularDataset from metadata.

Parameters:

meta (dict) – Serialized metadata
project_path (str | None) – Project folder path; used to locate memory-based CSV if needed

Return type:

TabularDataset

get_class_values() → List[Any][source]

get_feature_name(index: int) → str[source]

get_feature_names() → List[str][source]

save_memory_data(dest_folder: str) → None[source]

Save memory-based dataset to CSV inside project folder.

Parameters:

dest_folder (str) – Destination folder path.

set_class_name(class_name: str) → None[source]: Change target column name, moving column from features to target.
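Moving a column from the features to the target can be pictured with plain dictionaries. This is a sketch only: split_target is a hypothetical helper, rows are represented as a list of dicts rather than a DataFrame, and raising KeyError for a missing column is an assumption of the example.

```python
from typing import Any, Dict, List, Tuple


def split_target(rows: List[Dict[str, Any]],
                 class_name: str) -> Tuple[List[Dict[str, Any]], List[Any]]:
    """Separate the target column from the feature columns (illustrative)."""
    if any(class_name not in row for row in rows):
        raise KeyError(f"column {class_name!r} is missing from some rows")
    features = [{k: v for k, v in row.items() if k != class_name} for row in rows]
    target = [row[class_name] for row in rows]
    return features, target


rows = [
    {"age": 31, "job": "nurse", "approved": 1},
    {"age": 52, "job": "clerk", "approved": 0},
]
features, target = split_target(rows, "approved")
```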

set_descriptor(descriptor: dict): Assign a descriptor dictionary to the dataset.

property target: Series | None

to_dict() → Dict[str, Any][source]

Serialize dataset metadata for project persistence.

Returns:

Metadata describing dataset type, source, and column hints.

Return type:

dict
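The two serialization modes described for the tabular dataset (store only a CSV path, or write in-memory data to a CSV inside the project folder) can be sketched as a round trip. Everything here is an assumption for illustration: the metadata keys ("source", "csv_path", "filename"), the file name data.csv, and the helper functions are hypothetical and are not taken from fairxai.

```python
import csv
import json
import os
import tempfile
from typing import Any, Dict, List, Optional


def to_meta(csv_path: Optional[str] = None) -> Dict[str, Any]:
    # CSV-based: keep only the path. Memory-based: name a file in the project.
    if csv_path is not None:
        return {"type": "tabular", "source": "csv", "csv_path": csv_path}
    return {"type": "tabular", "source": "memory", "filename": "data.csv"}


def save_memory_rows(rows: List[Dict[str, str]], meta: Dict[str, Any],
                     dest_folder: str) -> None:
    if meta["source"] != "memory":
        return  # CSV-based datasets are reloaded from their original file.
    os.makedirs(dest_folder, exist_ok=True)
    with open(os.path.join(dest_folder, meta["filename"]), "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def load_rows(meta: Dict[str, Any],
              project_path: Optional[str] = None) -> List[Dict[str, str]]:
    if meta["source"] == "csv":
        path = meta["csv_path"]
    elif meta["source"] == "memory":
        path = os.path.join(project_path, meta["filename"])
    else:
        raise ValueError(f"unknown source: {meta['source']!r}")
    with open(path, newline="") as fh:
        return [dict(row) for row in csv.DictReader(fh)]


rows = [{"age": "31", "income": "low"}, {"age": "52", "income": "high"}]
with tempfile.TemporaryDirectory() as project:
    meta = to_meta()
    save_memory_rows(rows, meta, project)
    # Metadata is plain JSON, so it can be stored alongside the project.
    restored = load_rows(json.loads(json.dumps(meta)), project)
```

Values come back as strings because CSV carries no type information; a real implementation would need to restore column types separately.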

update_descriptor(categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None) → Dict[str, Any][source]: Compute dataset descriptor based on features-only DataFrame.

property y: Series | None

fairxai.data.dataset.text_dataset module

class fairxai.data.dataset.text_dataset.TextDataset(data, class_name=None)[source]

Bases: Dataset

Represents a dataset containing textual data.

This class manages a text-based dataset and can build its descriptor, which holds metadata characterizing the data. An optional class_name labels the dataset for classification purposes.

data: The raw textual data to be managed by the dataset.

class_name: Optional name or label for categorizing the dataset.

descriptor: Metadata descriptor of the text dataset, populated after invoking the update_descriptor method.

update_descriptor()[source]: Updates or generates the descriptor for the dataset and returns the resulting descriptor value.

Initializes the dataset instance.

Parameters:

data (Any) – The data associated with the instance.
class_name (str, optional) – The name of the class. Defaults to None.

set_class_name(class_name: str): Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict): Assign a descriptor dictionary to the dataset.

update_descriptor()[source]

Updates the descriptor for the text dataset by creating a description using the TextDatasetDescriptor and assigning it to the descriptor attribute.

Returns:

The updated descriptor of the text dataset as created by TextDatasetDescriptor.

fairxai.data.dataset.timeserie_dataset module

class fairxai.data.dataset.timeserie_dataset.TimeSeriesDataset(data, class_name=None)[source]

Bases: Dataset

Represents a dataset specifically designed for time series data.

This class stores time series data together with any associated metadata and provides a method to generate and update the dataset descriptor.

data: The time series data stored in the dataset.

class_name: An optional name or identifier for the dataset’s class/category.

descriptor: The descriptor for the dataset, initialized as None.

update_descriptor()[source]: Generates and updates the descriptor for the dataset.

Initializes the dataset with its data and an optional class name.

Attributes:

data – The primary data for the instance; its type depends on usage.
class_name (str, optional) – The name of the class, if applicable.
descriptor – Additional metadata, initialized as None.

Parameters:

data – The primary data for the object.
class_name (str, optional) – The name of the class.

set_class_name(class_name: str): Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict): Assign a descriptor dictionary to the dataset.

update_descriptor()[source]

Updates the descriptor for the timeseries dataset.

The method generates a descriptor for the timeseries dataset by using the TimeSeriesDatasetDescriptor class and sets it within the object. The generated descriptor is also returned.

Returns:

The generated descriptor for the timeseries dataset.

Return type:

dict

Module contents

class fairxai.data.dataset.Dataset(data=None, class_name: str = None)[source]

Bases: ABC

Generic abstract class to handle datasets of different modalities (tabular, image, text, timeseries).

data: The raw dataset (DataFrame, list, or array depending on modality)

descriptor: Dictionary describing dataset structure and statistics

class_name: Optional target column name (for tabular datasets)

target: Optional, contains the target column values

set_class_name(class_name: str)[source]: Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)[source]: Assign a descriptor dictionary to the dataset.

abstract update_descriptor(*args, **kwargs)[source]: Must create and assign the dataset descriptor. Each subclass should call its specific BaseDatasetDescriptor.describe().

TabularDataset is also listed at package level; its description, parameters, properties, and methods are identical to those in the fairxai.data.dataset.tabular_dataset module section above.