fairxai.data.dataset package
Submodules
fairxai.data.dataset.dataset module
- class fairxai.data.dataset.dataset.Dataset(data=None, class_name: str = None)[source]
Bases: ABC
Generic abstract class to handle datasets of different modalities (tabular, image, text, timeseries).
- data
The raw dataset (DataFrame, list, or array depending on modality)
- descriptor
Dictionary describing dataset structure and statistics
- class_name
Optional target column name (for tabular datasets)
- target
Optional, contains the target column values
fairxai.data.dataset.dataset_factory module
- class fairxai.data.dataset.dataset_factory.DatasetFactory[source]
Bases: object
Factory class responsible for creating dataset instances (tabular, image, text, timeseries) using a registry pattern and dataset-specific initialization parameters.
- classmethod create(data: Any, dataset_type: str, class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None)[source]
Create and return a dataset instance based on the specified type.
For tabular datasets, additional arguments such as categorical and ordinal columns can be provided to correctly configure the dataset descriptor.
- Parameters:
data (Any) – Input data.
dataset_type (str) – One of [“tabular”, “image”, “text”, “timeseries”].
class_name (str, optional) – Target/label column name (for supervised datasets).
categorical_columns (list[str], optional) – Columns to treat as categorical (tabular only).
ordinal_columns (list[str], optional) – Columns to treat as ordinal (tabular only).
- Returns:
An instance of the appropriate dataset subclass.
- Return type:
Dataset
- Raises:
ValueError – If the dataset_type is unsupported.
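The registry pattern described above can be illustrated with a minimal, self-contained sketch. It does not import fairxai and is not the library's implementation: `SketchDataset`, `SketchTabular`, `SketchImage`, and `SketchFactory` are hypothetical stand-ins, and the way the column hints are forwarded is an assumption.

```python
from typing import Any, Dict, List, Optional, Type


class SketchDataset:
    """Hypothetical stand-in for a dataset class (not the fairxai one)."""

    def __init__(self, data: Any, class_name: Optional[str] = None, **options: Any):
        self.data = data
        self.class_name = class_name
        self.options = options  # e.g. categorical/ordinal column hints


class SketchTabular(SketchDataset):
    pass


class SketchImage(SketchDataset):
    pass


class SketchFactory:
    # Registry mapping a dataset_type string to the class that builds it.
    _registry: Dict[str, Type[SketchDataset]] = {
        "tabular": SketchTabular,
        "image": SketchImage,
    }

    @classmethod
    def create(
        cls,
        data: Any,
        dataset_type: str,
        class_name: Optional[str] = None,
        categorical_columns: Optional[List[str]] = None,
        ordinal_columns: Optional[List[str]] = None,
    ) -> SketchDataset:
        try:
            builder = cls._registry[dataset_type]
        except KeyError:
            raise ValueError(f"Unsupported dataset_type: {dataset_type!r}") from None
        if dataset_type == "tabular":
            # Only tabular datasets receive the column hints.
            return builder(
                data,
                class_name,
                categorical_columns=categorical_columns,
                ordinal_columns=ordinal_columns,
            )
        return builder(data, class_name)


ds = SketchFactory.create(
    [{"city": "Pisa", "label": 1}],
    "tabular",
    class_name="label",
    categorical_columns=["city"],
)
```

Keeping the type-to-class mapping in one dictionary means a new modality can be supported by adding a registry entry rather than another conditional branch.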
fairxai.data.dataset.image_dataset module
- class fairxai.data.dataset.image_dataset.ImageDataset(data: str | ndarray | List[ndarray], class_name: str | None = None)[source]
Bases: Dataset
Represents an image dataset that can be loaded either from a folder containing image files or directly from in-memory NumPy arrays.
The dataset supports two serialization modes:
Folder-based dataset: Only the folder path is serialized. Images are reloaded at project load time.
Memory-based dataset: Raw NumPy arrays are saved to a compressed .npz file inside the project folder. This allows reconstruction of datasets not tied to an external file system.
- Parameters:
data (str | np.ndarray | list[np.ndarray]) – Either a folder path or image arrays.
class_name (str | None, optional) – Optional class label for the dataset.
- Raises:
TypeError – If data is not a supported type.
ValueError – If no valid images can be loaded.
- classmethod from_dict(meta: Dict, project_path: str) ImageDataset[source]
Reconstruct an ImageDataset instance from serialized metadata.
- Parameters:
meta (dict) – Serialized dataset information.
project_path (str) – Filesystem path to the root of the project.
- Return type:
ImageDataset
- Raises:
FileNotFoundError – If memory-based dataset arrays are missing.
ValueError – For unknown dataset source types.
- get_instance(key: int | str) ndarray[source]
Retrieve a single image instance either by index or filename.
- Parameters:
key (int | str) – Integer index or filename.
- Returns:
The requested image.
- Return type:
np.ndarray
- Raises:
IndexError – If index is out of range.
ValueError – If filename lookup fails or filenames are unavailable.
TypeError – If key is neither int nor str.
- save_memory_data(dest_folder: str) None[source]
Save memory-based image arrays into a compressed .npz file.
- Parameters:
dest_folder (str) – Directory where the images.npz will be written.
Notes
If the dataset originates from a folder, this method does nothing.
- set_class_name(class_name: str)
Define the target column name. Optionally extracts the target values from the dataset (for tabular).
- set_descriptor(descriptor: dict)
Assign a descriptor dictionary to the dataset.
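The lookup rules documented for get_instance (integer index or filename, with IndexError, ValueError, and TypeError for the failure cases) can be sketched as a standalone function. This is an illustration under assumptions, not the fairxai code: the `images` and `filenames` arguments are hypothetical, and nested lists stand in for NumPy arrays so the sketch runs without third-party packages.

```python
from typing import Any, List, Optional, Sequence, Union


def get_instance(
    images: Sequence[Any],
    filenames: Optional[List[str]],
    key: Union[int, str],
) -> Any:
    """Return one image by integer index or by filename."""
    if isinstance(key, int):
        if not -len(images) <= key < len(images):
            raise IndexError(f"index {key} is out of range")
        return images[key]
    if isinstance(key, str):
        if filenames is None:
            # Memory-based data may carry no filenames at all (assumption).
            raise ValueError("filenames are unavailable for this dataset")
        try:
            position = filenames.index(key)
        except ValueError:
            raise ValueError(f"unknown filename: {key!r}") from None
        return images[position]
    raise TypeError("key must be an int or a str")


# Two tiny 2x2 "images" represented as nested lists.
images = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
names = ["a.png", "b.png"]

first = get_instance(images, names, "a.png")
second = get_instance(images, names, 1)
```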
fairxai.data.dataset.tabular_dataset module
- class fairxai.data.dataset.tabular_dataset.TabularDataset(data: DataFrame | str | dict | List[dict], class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None, dropna: bool = False)[source]
Bases: Dataset
Tabular dataset container for explainers.
- Supports initialization from:
pandas DataFrame
CSV file path
dict or list of dict
The dataset supports two serialization modes:
CSV-based dataset: Only the CSV path is stored. Data is reloaded from the original CSV on load.
Memory-based dataset: Raw DataFrame is saved as CSV inside the project folder for persistence.
- Features and target are separated:
self._data: features-only DataFrame
self._target: target Series (if class_name provided)
Initialize TabularDataset.
- Parameters:
data (DataFrame | str | dict | list[dict]) – Source data
class_name (str | None) – Target column name, if present
categorical_columns (list[str] | None) – Column names to treat as categorical
ordinal_columns (list[str] | None) – Column names to treat as ordinal
dropna (bool) – If reading CSV, drop rows with missing values
- property X: DataFrame
- property features: List[str]
Return feature names according to descriptor.
- classmethod from_dict(meta: Dict[str, Any], project_path: str | None = None) TabularDataset[source]
Reconstruct a TabularDataset from metadata.
- Parameters:
meta (dict) – Serialized metadata
project_path (str | None) – Project folder path; used to locate memory-based CSV if needed
- Return type:
TabularDataset
- save_memory_data(dest_folder: str) None[source]
Save memory-based dataset to CSV inside project folder.
- Parameters:
dest_folder (str) – Destination folder path
- set_class_name(class_name: str) None[source]
Change target column name, moving column from features to target.
- set_descriptor(descriptor: dict)
Assign a descriptor dictionary to the dataset.
- property target: Series | None
- to_dict() Dict[str, Any][source]
Serialize dataset metadata for project persistence.
- Returns:
Metadata describing dataset type, source, and column hints.
- Return type:
dict
- update_descriptor(categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None) Dict[str, Any][source]
Compute dataset descriptor based on features-only DataFrame.
- property y: Series | None
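The two serialization modes described for TabularDataset (store only the CSV path, or write in-memory data to a CSV inside the project folder) can be sketched with the standard library alone. Everything here is hypothetical: the helper names `to_meta` and `from_meta`, the metadata keys `"source"` and `"path"`, and the file name `data.csv` are assumptions for illustration, and lists of dicts stand in for a DataFrame.

```python
import csv
import json
import os
import tempfile
from typing import Any, Dict, List, Optional


def to_meta(rows: List[Dict[str, str]], csv_path: Optional[str], project_path: str) -> Dict[str, Any]:
    """Describe where the data lives; write a CSV only for in-memory data."""
    if csv_path is not None:
        # CSV-based mode: keep only the path to the original file.
        return {"source": "csv", "path": csv_path}
    # Memory-based mode: persist the rows inside the project folder.
    # Assumes at least one row, so the header can be taken from it.
    dest = os.path.join(project_path, "data.csv")
    with open(dest, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return {"source": "memory", "path": "data.csv"}


def from_meta(meta: Dict[str, Any], project_path: Optional[str] = None) -> List[Dict[str, str]]:
    """Reload the rows according to the stored metadata."""
    if meta["source"] == "csv":
        path = meta["path"]
    elif meta["source"] == "memory":
        if project_path is None:
            raise ValueError("project_path is required for memory-based data")
        path = os.path.join(project_path, meta["path"])
    else:
        raise ValueError(f"unknown source: {meta['source']!r}")
    with open(path, newline="") as fh:
        return [dict(row) for row in csv.DictReader(fh)]


rows = [{"age": "31", "label": "1"}, {"age": "45", "label": "0"}]
with tempfile.TemporaryDirectory() as project:
    meta = to_meta(rows, csv_path=None, project_path=project)
    # Round-trip the metadata through JSON, as a project file might.
    restored = from_meta(json.loads(json.dumps(meta)), project_path=project)
```

Storing a path relative to the project folder in the memory-based case keeps the project relocatable, which is one plausible reason for passing `project_path` at load time.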
fairxai.data.dataset.text_dataset module
- class fairxai.data.dataset.text_dataset.TextDataset(data, class_name=None)[source]
Bases: Dataset
Represents a dataset containing textual data.
This class manages a text-based dataset and can update its descriptor, which holds metadata characterizing the dataset. It can optionally carry a class name describing the dataset’s classification purpose.
- data
The raw textual data to be managed by the dataset.
- class_name
Optional name or label for categorizing the dataset.
- descriptor
Metadata descriptor of the text dataset, populated after invoking the update_descriptor method.
- update_descriptor()[source]
Updates or generates the descriptor for the dataset and returns the resulting descriptor value.
Initialize the dataset.
- Parameters:
data (Any) – The data associated with the instance.
class_name (str, optional) – The name of the class. Defaults to None.
- set_class_name(class_name: str)
Define the target column name. Optionally extracts the target values from the dataset (for tabular).
- set_descriptor(descriptor: dict)
Assign a descriptor dictionary to the dataset.
fairxai.data.dataset.timeserie_dataset module
- class fairxai.data.dataset.timeserie_dataset.TimeSeriesDataset(data, class_name=None)[source]
Bases: Dataset
Represents a dataset specifically designed for time series data.
This class stores time series data together with any associated metadata, and provides a method to generate and update the dataset descriptor.
- data
The time series data stored in the dataset.
- class_name
An optional name or identifier for the dataset’s class/category.
- descriptor
The descriptor for the dataset, initialized as None.
Initialize the dataset with its data and an optional class name.
- Parameters:
data – The primary data for the object.
class_name (str, optional) – The name of the class.
- set_class_name(class_name: str)
Define the target column name. Optionally extracts the target values from the dataset (for tabular).
- set_descriptor(descriptor: dict)
Assign a descriptor dictionary to the dataset.
- update_descriptor()[source]
Updates the descriptor for the timeseries dataset.
The method generates a descriptor for the timeseries dataset by using the TimeSeriesDatasetDescriptor class and sets it within the object. The generated descriptor is also returned.
- Returns:
The generated descriptor for the timeseries dataset.
- Return type:
dict
Module contents
- class fairxai.data.dataset.Dataset(data=None, class_name: str = None)[source]
Bases: ABC
Generic abstract class to handle datasets of different modalities (tabular, image, text, timeseries).
- data
The raw dataset (DataFrame, list, or array depending on modality)
- descriptor
Dictionary describing dataset structure and statistics
- class_name
Optional target column name (for tabular datasets)
- target
Optional, contains the target column values
- class fairxai.data.dataset.TabularDataset(data: DataFrame | str | dict | List[dict], class_name: str | None = None, categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None, dropna: bool = False)[source]
Bases: Dataset
Tabular dataset container for explainers.
- Supports initialization from:
pandas DataFrame
CSV file path
dict or list of dict
The dataset supports two serialization modes:
CSV-based dataset: Only the CSV path is stored. Data is reloaded from the original CSV on load.
Memory-based dataset: Raw DataFrame is saved as CSV inside the project folder for persistence.
- Features and target are separated:
self._data: features-only DataFrame
self._target: target Series (if class_name provided)
Initialize TabularDataset.
- Parameters:
data (DataFrame | str | dict | list[dict]) – Source data
class_name (str | None) – Target column name, if present
categorical_columns (list[str] | None) – Column names to treat as categorical
ordinal_columns (list[str] | None) – Column names to treat as ordinal
dropna (bool) – If reading CSV, drop rows with missing values
- property X: DataFrame
- property features: List[str]
Return feature names according to descriptor.
- classmethod from_dict(meta: Dict[str, Any], project_path: str | None = None) TabularDataset[source]
Reconstruct a TabularDataset from metadata.
- Parameters:
meta (dict) – Serialized metadata
project_path (str | None) – Project folder path; used to locate memory-based CSV if needed
- Return type:
TabularDataset
- save_memory_data(dest_folder: str) None[source]
Save memory-based dataset to CSV inside project folder.
- Parameters:
dest_folder (str) – Destination folder path
- set_class_name(class_name: str) None[source]
Change target column name, moving column from features to target.
- set_descriptor(descriptor: dict)
Assign a descriptor dictionary to the dataset.
- property target: Series | None
- to_dict() Dict[str, Any][source]
Serialize dataset metadata for project persistence.
- Returns:
Metadata describing dataset type, source, and column hints.
- Return type:
dict
- update_descriptor(categorical_columns: List[str] | None = None, ordinal_columns: List[str] | None = None) Dict[str, Any][source]
Compute dataset descriptor based on features-only DataFrame.
- property y: Series | None