fairxai.data.dataset package

Submodules

fairxai.data.dataset.dataset module

class fairxai.data.dataset.dataset.Dataset(data=None, class_name: str = None)

Bases: ABC

Generic abstract class to handle datasets of different modalities (tabular, image, text, timeseries).

data

The raw dataset (DataFrame, list, or array depending on modality)

descriptor

Dictionary describing dataset structure and statistics

class_name

Optional target column name (for tabular datasets)

target

Optional, contains the target column values

set_class_name(class_name: str)

Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)

Assign a descriptor dictionary to the dataset.

abstract update_descriptor(*args, **kwargs)

Must create and assign the dataset descriptor. Each subclass should call its specific BaseDatasetDescriptor.describe().
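The subclassing contract described above can be sketched with a stand-in base class. This is only an illustration under stated assumptions, not fairxai's implementation: SketchDataset, WordListDataset, and the descriptor keys are hypothetical, and the real subclasses are documented as delegating to their own descriptor classes.

```python
from abc import ABC, abstractmethod


class SketchDataset(ABC):
    """Hypothetical stand-in mirroring the documented Dataset interface."""

    def __init__(self, data=None, class_name=None):
        self.data = data
        self.class_name = class_name
        self.descriptor = None
        self.target = None

    def set_descriptor(self, descriptor: dict) -> None:
        # Assign a descriptor dictionary to the dataset.
        self.descriptor = descriptor

    @abstractmethod
    def update_descriptor(self, *args, **kwargs):
        """Subclasses must build and assign the descriptor."""


class WordListDataset(SketchDataset):
    """Hypothetical subclass that describes a list of strings."""

    def update_descriptor(self):
        # The keys below are invented for this sketch.
        descriptor = {
            "n_samples": len(self.data),
            "max_length": max((len(s) for s in self.data), default=0),
        }
        self.set_descriptor(descriptor)
        return descriptor


ds = WordListDataset(["fair", "explainable"])
print(ds.update_descriptor())
```

Because update_descriptor is abstract, instantiating the base class directly raises TypeError, which forces every modality to provide its own descriptor logic.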

fairxai.data.dataset.dataset_factory module

class fairxai.data.dataset.dataset_factory.DatasetFactory

Bases: object

Factory class responsible for creating dataset instances (tabular, image, text, timeseries) using a registry pattern and dataset-specific initialization parameters.

classmethod create(data: Any, dataset_type: str, class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None)

Create and return a dataset instance based on the specified type.

For tabular datasets, additional arguments such as categorical and ordinal columns can be provided to correctly configure the dataset descriptor.

Parameters:
  • data (Any) – Input data.

  • dataset_type (str) – One of [“tabular”, “image”, “text”, “timeseries”].

  • class_name (str, optional) – Target/label column name (for supervised datasets).

  • categorical_columns (list[str], optional) – Columns to treat as categorical (tabular only).

  • ordinal_columns (list[str], optional) – Columns to treat as ordinal (tabular only).

Returns:

An instance of the appropriate dataset subclass.

Return type:

Dataset

Raises:

ValueError – If the dataset_type is unsupported.

classmethod get_class(dataset_type: str)

Return the dataset class corresponding to the dataset_type string.

Parameters:

dataset_type (str) – “tabular”, “image”, “text”, or “timeseries”

Returns:

Dataset subclass (type)

Raises:

ValueError – if dataset_type is unsupported
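The registry pattern behind create and get_class could look roughly like the sketch below. It is an assumption about the general shape only: SketchFactory, TabularSketch, TextSketch, and the _registry attribute are hypothetical names, and the real factory also registers the image and timeseries types.

```python
from typing import Any, Dict, List, Optional


class TabularSketch:
    """Hypothetical tabular dataset stand-in."""

    def __init__(self, data, class_name=None,
                 categorical_columns=None, ordinal_columns=None):
        self.data = data
        self.class_name = class_name
        self.categorical_columns = categorical_columns or []
        self.ordinal_columns = ordinal_columns or []


class TextSketch:
    """Hypothetical text dataset stand-in."""

    def __init__(self, data, class_name=None):
        self.data = data
        self.class_name = class_name


class SketchFactory:
    # Maps a dataset_type string to the class that handles it.
    _registry: Dict[str, type] = {"tabular": TabularSketch, "text": TextSketch}

    @classmethod
    def get_class(cls, dataset_type: str) -> type:
        try:
            return cls._registry[dataset_type]
        except KeyError:
            raise ValueError(f"Unsupported dataset_type: {dataset_type!r}") from None

    @classmethod
    def create(cls, data: Any, dataset_type: str,
               class_name: Optional[str] = None,
               categorical_columns: Optional[List[str]] = None,
               ordinal_columns: Optional[List[str]] = None):
        dataset_cls = cls.get_class(dataset_type)
        if dataset_type == "tabular":
            # Column hints are forwarded only to the tabular class.
            return dataset_cls(data, class_name,
                               categorical_columns=categorical_columns,
                               ordinal_columns=ordinal_columns)
        return dataset_cls(data, class_name)


ds = SketchFactory.create([{"sex": "F", "label": 1}], "tabular",
                          class_name="label", categorical_columns=["sex"])
print(type(ds).__name__, ds.categorical_columns)
```

Keeping the lookup in get_class means both methods report an unsupported type with the same ValueError.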

fairxai.data.dataset.image_dataset module

class fairxai.data.dataset.image_dataset.ImageDataset(data: str | ndarray | List[ndarray], class_name: str | None = None)

Bases: Dataset

Represents an image dataset that can be loaded either from a folder containing image files or directly from in-memory NumPy arrays.

The dataset supports two serialization modes:

  • Folder-based dataset: Only the folder path is serialized. Images are reloaded at project load time.

  • Memory-based dataset: Raw NumPy arrays are saved to a compressed .npz file inside the project folder. This allows reconstruction of datasets not tied to an external file system.

Parameters:
  • data (str | np.ndarray | list[np.ndarray]) – Either a folder path or image arrays.

  • class_name (str | None, optional) – Optional class label for the dataset.

Raises:
  • TypeError – If data is not a supported type.

  • ValueError – If no valid images can be loaded.

classmethod from_dict(meta: Dict, project_path: str) → ImageDataset

Reconstruct an ImageDataset instance from serialized metadata.

Parameters:
  • meta (dict) – Serialized dataset information.

  • project_path (str) – Filesystem path to the root of the project.

Return type:

ImageDataset

Raises:
  • FileNotFoundError – If memory-based dataset arrays are missing.

  • ValueError – For unknown dataset source types.

get_instance(key: int | str) → ndarray

Retrieve a single image instance either by index or filename.

Parameters:

key (int | str) – Integer index or filename.

Returns:

The requested image.

Return type:

np.ndarray

Raises:
  • IndexError – If index is out of range.

  • ValueError – If filename lookup fails or filenames are unavailable.

  • TypeError – If key is neither int nor str.
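The lookup rules listed above (index or filename, with three distinct error types) can be sketched in plain Python. This is a hypothetical helper, not the library's code: lookup_image and its arguments are invented for illustration, and nested lists stand in for NumPy arrays.

```python
from typing import List, Optional, Union


def lookup_image(images: list, filenames: Optional[List[str]],
                 key: Union[int, str]):
    """Return one image by integer index or by filename (illustrative only)."""
    if isinstance(key, int):
        if not 0 <= key < len(images):
            raise IndexError(f"Index {key} is out of range")
        return images[key]
    if isinstance(key, str):
        if filenames is None:
            # In this sketch, in-memory data carries no filenames.
            raise ValueError("Filenames are unavailable for this dataset")
        if key not in filenames:
            raise ValueError(f"No image named {key!r}")
        return images[filenames.index(key)]
    raise TypeError("key must be an int or a str")


images = [[[0, 0], [0, 0]], [[255, 255], [255, 255]]]
filenames = ["black.png", "white.png"]
print(lookup_image(images, filenames, 1))
print(lookup_image(images, filenames, "black.png"))
```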

save_memory_data(dest_folder: str) → None

Save memory-based image arrays into a compressed .npz file.

Parameters:

dest_folder (str) – Directory where the images.npz will be written.

Notes

If the dataset originates from a folder, this method does nothing.

set_class_name(class_name: str)

Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)

Assign a descriptor dictionary to the dataset.

to_dict() → Dict

Serialize dataset metadata into a dictionary.

Notes

  • Folder datasets store only folder_path.

  • Memory datasets do not store raw image arrays here; arrays are saved separately via save_memory_data.

Returns:

Metadata describing how to reconstruct the dataset.

Return type:

dict

update_descriptor(hwc_permutation: List[int] | None = None) → Dict

Compute and attach the dataset descriptor.

Parameters:

hwc_permutation (list[int] | None) – Optional permutation of axes (H, W, C).

Returns:

The computed descriptor.

Return type:

dict

fairxai.data.dataset.tabular_dataset module

class fairxai.data.dataset.tabular_dataset.TabularDataset(data: DataFrame | str | dict | List[dict], class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None, dropna: bool = False)

Bases: Dataset

Tabular dataset container for explainers.

Supports initialization from:
  • pandas DataFrame

  • CSV file path

  • dict or list of dict

The dataset supports two serialization modes:

  • CSV-based dataset: Only the CSV path is stored. Data is reloaded from the original CSV on load.

  • Memory-based dataset: Raw DataFrame is saved as CSV inside the project folder for persistence.

Features and target are separated:
  • self._data: features-only DataFrame

  • self._target: target Series (if class_name provided)

Initialize TabularDataset.

Parameters:
  • data (DataFrame | str | dict | list[dict]) – Source data

  • class_name (str | None) – Target column name, if present

  • categorical_columns (list[str] | None) – Column names to treat as categorical

  • ordinal_columns (list[str] | None) – Column names to treat as ordinal

  • dropna (bool) – If reading CSV, drop rows with missing values
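The separation of features and target described above can be illustrated without pandas, using the list-of-dict input form. This is a minimal sketch under stated assumptions: split_features_target is a hypothetical helper, and the real class stores a features-only DataFrame and a target Series rather than lists.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_features_target(
    records: List[Dict[str, Any]], class_name: Optional[str] = None
) -> Tuple[List[Dict[str, Any]], Optional[List[Any]]]:
    """Split records into features-only rows and a target list."""
    if class_name is None:
        # No target column: every column is a feature.
        return [dict(r) for r in records], None
    missing = [i for i, r in enumerate(records) if class_name not in r]
    if missing:
        raise KeyError(f"Column {class_name!r} missing in rows {missing}")
    features = [{k: v for k, v in r.items() if k != class_name} for r in records]
    target = [r[class_name] for r in records]
    return features, target


rows = [
    {"age": 31, "income": 42000, "approved": "yes"},
    {"age": 45, "income": 38000, "approved": "no"},
]
X, y = split_features_target(rows, class_name="approved")
print(X)
print(y)
```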

property X: DataFrame

property features: List[str]

Return feature names according to descriptor.

classmethod from_dict(meta: Dict[str, Any], project_path: str | None = None) → TabularDataset

Reconstruct a TabularDataset from metadata.

Parameters:
  • meta (dict) – Serialized metadata

  • project_path (str | None) – Project folder path; used to locate memory-based CSV if needed

Return type:

TabularDataset

get_class_values() → List[Any]

get_feature_name(index: int) → str

get_feature_names() → List[str]

save_memory_data(dest_folder: str) → None

Save memory-based dataset to CSV inside project folder.

Parameters:

dest_folder (str) – Destination folder path

set_class_name(class_name: str) → None

Change target column name, moving column from features to target.

set_descriptor(descriptor: dict)

Assign a descriptor dictionary to the dataset.

property target: Series | None

to_dict() → Dict[str, Any]

Serialize dataset metadata for project persistence.

Returns:

Metadata describing dataset type, source, and column hints.

Return type:

dict
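The two serialization modes described for this class can be sketched as a metadata round trip. Everything here is an assumption made for illustration: the function names reuse the documented method names but are free functions, and the metadata keys ("source", "csv_path", "file_name") and the default file name are hypothetical, not the library's actual schema.

```python
import csv
import os
import tempfile


def save_memory_data(records, dest_folder, file_name="dataset.csv"):
    """Write in-memory rows to a CSV file inside the project folder."""
    os.makedirs(dest_folder, exist_ok=True)
    path = os.path.join(dest_folder, file_name)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
    return path


def to_dict(source, class_name=None, csv_path=None, file_name="dataset.csv"):
    """Build metadata only; the rows themselves are never embedded."""
    meta = {"type": "tabular", "source": source, "class_name": class_name}
    if source == "csv":
        meta["csv_path"] = csv_path    # reload from the original file
    else:
        meta["file_name"] = file_name  # reload from the project folder
    return meta


def load_records(meta, project_path=None):
    """Locate and read the CSV that the metadata points to."""
    if meta["source"] == "csv":
        path = meta["csv_path"]
    elif meta["source"] == "memory":
        if project_path is None:
            raise ValueError("project_path is required for memory-based data")
        path = os.path.join(project_path, meta["file_name"])
    else:
        raise ValueError(f"Unknown source: {meta['source']!r}")
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


with tempfile.TemporaryDirectory() as project:
    rows = [{"age": "31", "label": "yes"}, {"age": "45", "label": "no"}]
    save_memory_data(rows, project)
    meta = to_dict("memory", class_name="label")
    restored = load_records(meta, project_path=project)
    print(meta)
    print(restored == rows)
```

Storing only a path or file name keeps the metadata small, at the cost of requiring the referenced CSV to still exist when the project is loaded.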

update_descriptor(categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None) → Dict[str, Any]

Compute dataset descriptor based on features-only DataFrame.

property y: Series | None

fairxai.data.dataset.text_dataset module

class fairxai.data.dataset.text_dataset.TextDataset(data, class_name=None)

Bases: Dataset

Represents a dataset containing textual data.

This class manages a text-based dataset and can compute its descriptor, which holds metadata characterizing the dataset. A class name can optionally be attached to the dataset.

data

The raw textual data to be managed by the dataset.

class_name

Optional name or label for categorizing the dataset.

descriptor

Metadata descriptor of the text dataset, populated after invoking the update_descriptor method.

Parameters:
  • data (Any) – The data associated with the instance.

  • class_name (str, optional) – The name of the class. Defaults to None.

set_class_name(class_name: str)

Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)

Assign a descriptor dictionary to the dataset.

update_descriptor()

Updates the descriptor for the text dataset by creating a description using the TextDatasetDescriptor and assigning it to the descriptor attribute.

Returns:

The updated descriptor of the text dataset as created by TextDatasetDescriptor.

fairxai.data.dataset.timeserie_dataset module

class fairxai.data.dataset.timeserie_dataset.TimeSeriesDataset(data, class_name=None)

Bases: Dataset

Represents a dataset specifically designed for time series data.

This class stores time series data together with its associated metadata and provides a method to generate and update the dataset descriptor.

data

The time series data stored in the dataset.

class_name

An optional name or identifier for the dataset’s class/category.

descriptor

The descriptor for the dataset, initialized as None.

Parameters:
  • data – The primary data for the dataset.

  • class_name (str, optional) – The name of the class, if applicable.

set_class_name(class_name: str)

Define the target column name. Optionally extracts the target values from the dataset (for tabular).

set_descriptor(descriptor: dict)

Assign a descriptor dictionary to the dataset.

update_descriptor()

Updates the descriptor for the timeseries dataset.

The method generates a descriptor for the timeseries dataset by using the TimeSeriesDatasetDescriptor class and sets it within the object. The generated descriptor is also returned.

Returns:

The generated descriptor for the timeseries dataset.

Return type:

dict

Module contents

class fairxai.data.dataset.Dataset(data=None, class_name: str = None)

Also available at the package level. Attributes and methods are documented above under the fairxai.data.dataset.dataset module.

class fairxai.data.dataset.TabularDataset(data: DataFrame | str | dict | List[dict], class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None, dropna: bool = False)

Also available at the package level. Parameters, properties, and methods are documented above under the fairxai.data.dataset.tabular_dataset module.