fairxai.data.descriptor package
Submodules
fairxai.data.descriptor.base_descriptor module
- class fairxai.data.descriptor.base_descriptor.BaseDatasetDescriptor(data)[source]
Bases: ABC
Represents an abstract base class for dataset descriptors.
This class serves as a blueprint for dataset descriptor implementations, allowing for uniform representation of datasets. It enforces the implementation of a method to describe the dataset in a structured manner.
The constructor stores the provided data on the instance.
- Parameters:
data – The data to be associated with this instance.
- abstract describe() → dict[source]
Returns a structured description of the dataset as a dictionary.
This is an abstract method: the base class defines only the interface, and every subclass must implement it to produce its own description.
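The contract above can be sketched in a few lines. This is a minimal illustration, not the fairxai source: the `BaseDatasetDescriptor` below is a local stand-in that mirrors the documented interface, the attribute name `self.data` is an assumption, and `ListDatasetDescriptor` with its returned keys is hypothetical.

```python
from abc import ABC, abstractmethod


class BaseDatasetDescriptor(ABC):
    """Local stand-in mirroring the documented interface (not the fairxai source)."""

    def __init__(self, data):
        # Assumption: the data is stored under the attribute name `data`.
        self.data = data

    @abstractmethod
    def describe(self) -> dict:
        """Return a structured description of the dataset."""


class ListDatasetDescriptor(BaseDatasetDescriptor):
    """Hypothetical subclass that describes a plain Python list."""

    def describe(self) -> dict:
        return {"type": "list", "n_samples": len(self.data)}


descriptor = ListDatasetDescriptor([10, 20, 30])
description = descriptor.describe()
print(description)
```

Because `describe` is abstract, instantiating the base class directly raises a `TypeError`; only subclasses that implement it can be created.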
fairxai.data.descriptor.image_descriptor module
- class fairxai.data.descriptor.image_descriptor.ImageDatasetDescriptor(data: List[str | ndarray])[source]
Bases: BaseDatasetDescriptor
Descriptor for image datasets.
Analyzes a dataset composed of NumPy arrays or image file paths, providing metadata including number of samples, resolution, number of channels, and optional model input shape via hwc_permutation.
Initialize the descriptor with dataset data.
- Parameters:
data (list[Union[str, np.ndarray]]) – List of image file paths or NumPy arrays
- describe(hwc_permutation: List[int] | None = None) → dict[source]
Analyze and describe the dataset.
- Parameters:
hwc_permutation (list[int], optional) – Optional permutation of dimensions expected by the model (e.g., [1,2,0])
- Returns:
Dictionary containing dataset description
- Return type:
dict
- Raises:
ValueError – If dataset is empty or permutation is invalid
TypeError – If dataset contains unsupported types
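To make the `hwc_permutation` argument more concrete, the sketch below shows what a permutation such as `[1, 2, 0]` does to a shape tuple. The helper `permute_shape` is hypothetical and follows the usual axis-transpose convention (output axis `i` takes input axis `permutation[i]`); whether fairxai applies the permutation in exactly this direction, or validates it this way, is an assumption.

```python
def permute_shape(shape, permutation):
    """Reorder the axes of `shape` according to `permutation` (hypothetical helper)."""
    # A valid permutation uses every axis index exactly once.
    if sorted(permutation) != list(range(len(shape))):
        raise ValueError(f"invalid permutation {permutation} for shape {shape}")
    return tuple(shape[axis] for axis in permutation)


# Channels-first image: 3 channels, 480 rows, 640 columns.
chw_shape = (3, 480, 640)
hwc_shape = permute_shape(chw_shape, [1, 2, 0])
print(hwc_shape)  # (480, 640, 3)
```

Under this convention, `[1, 2, 0]` moves the leading channel axis to the end, and a list with a repeated or out-of-range index is rejected with a `ValueError`.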
fairxai.data.descriptor.tabular_descriptor module
- class fairxai.data.descriptor.tabular_descriptor.TabularDatasetDescriptor(data: DataFrame, categorical_columns: list = None, ordinal_columns: list = None)[source]
Bases: BaseDatasetDescriptor
Handles the description of a tabular dataset by categorizing its columns into categorical, ordinal, and numeric types and providing summary statistics.
This class requires explicit declaration of all non-numeric columns through the categorical_columns and ordinal_columns parameters. Any column that is neither listed there nor recognized as numeric (based on its dtype) causes a ValueError during the description process.
It provides methods to describe the dataset, retrieve specific column types, and export the computed descriptions as a dictionary.
- data
The main tabular dataset for analysis.
- Type:
DataFrame
- categorical_columns
A list of column names which are considered categorical variables.
- Type:
list
- ordinal_columns
A list of column names which are considered ordinal variables.
- Type:
list
- describe()[source]
Describes the dataset by categorizing its columns and computing summary statistics for each type.
The constructor stores the dataset together with the declared categorical and ordinal column names.
- Parameters:
data – The tabular dataset to describe, as a DataFrame.
categorical_columns – Optional list of column names to treat as categorical.
ordinal_columns – Optional list of column names to treat as ordinal.
- describe(target: Series = None, target_name: str = None) → dict[source]
Compute column descriptors for numeric, categorical, and ordinal features.
- Parameters:
target – optional target column (Series). If provided, its summary will be included under ‘target’ in the returned descriptor.
target_name – optional target column name.
- Returns:
Descriptor dictionary including features and optional target.
- Return type:
dict
- get_categorical_columns()[source]
Returns the list of categorical column names.
- Returns:
A list containing the names of categorical columns.
- Return type:
List[str]
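The rule that every non-numeric column must be declared explicitly can be sketched without pandas. The function below is a simplified illustration, not the fairxai implementation: `classify_columns` is a hypothetical name, the table is a plain dict of lists standing in for a DataFrame, and the value-based numeric check replaces the dtype check the class is documented to use.

```python
from numbers import Number


def classify_columns(columns, categorical_columns=None, ordinal_columns=None):
    """Split column names into numeric, categorical, and ordinal groups."""
    categorical = list(categorical_columns or [])
    ordinal = list(ordinal_columns or [])
    numeric = []
    for name, values in columns.items():
        if name in categorical or name in ordinal:
            continue  # declared explicitly, no type check needed
        # Stand-in for a dtype check: every value must be a non-boolean number.
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if not is_numeric:
            raise ValueError(
                f"column {name!r} is not numeric and was not declared "
                "as categorical or ordinal"
            )
        numeric.append(name)
    return {"numeric": numeric, "categorical": categorical, "ordinal": ordinal}


table = {
    "age": [31, 45, 27],
    "city": ["Rome", "Pisa", "Rome"],
    "education": ["low", "high", "mid"],
}
groups = classify_columns(
    table, categorical_columns=["city"], ordinal_columns=["education"]
)
print(groups)
```

Leaving `city` out of both lists makes the same call fail with a `ValueError`, which is the behavior the class description attributes to undeclared non-numeric columns.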
fairxai.data.descriptor.text_descriptor module
- class fairxai.data.descriptor.text_descriptor.TextDatasetDescriptor(data)[source]
Bases: BaseDatasetDescriptor
Descriptor for text datasets that analyzes and describes textual data.
This class extends BaseDatasetDescriptor to provide specific functionality for text-based datasets, supporting both raw text strings and dictionary formats.
The constructor stores the provided data on the instance.
- Parameters:
data – The text documents to describe, given as raw text strings or as dictionaries.
- describe() → dict[source]
Analyzes the text dataset and returns a dictionary with descriptive information.
- Returns:
A dictionary containing:
- type: Always “text”
- n_documents: Total number of documents
- input_format: Either “dict” or “raw_text”
- Additional format-specific metadata
- Return type:
dict
- Raises:
ValueError – If the dataset is empty
TypeError – If the data format is not supported (not string or dict)
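The documented keys and error cases can be illustrated with a small stand-alone function. This is a sketch under assumptions, not the fairxai implementation: `describe_text` is a hypothetical name, the input is assumed to be a list of documents, mixed lists are rejected here as a design choice, and the additional format-specific metadata is omitted because the reference does not specify it.

```python
def describe_text(data):
    """Return the core documented fields for a list of text documents."""
    if len(data) == 0:
        raise ValueError("dataset is empty")
    if all(isinstance(doc, str) for doc in data):
        input_format = "raw_text"
    elif all(isinstance(doc, dict) for doc in data):
        input_format = "dict"
    else:
        # Assumption: anything other than all-strings or all-dicts is unsupported.
        raise TypeError("documents must all be strings or all be dicts")
    return {
        "type": "text",
        "n_documents": len(data),
        "input_format": input_format,
    }


raw_description = describe_text(["first document", "second document"])
dict_description = describe_text([{"text": "first document"}])
print(raw_description)
print(dict_description)
```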
fairxai.data.descriptor.timeserie_descriptor module
- class fairxai.data.descriptor.timeserie_descriptor.TimeSeriesDatasetDescriptor(data)[source]
Bases: BaseDatasetDescriptor
Descriptor for time series datasets.
This class analyzes time series data stored in a pandas DataFrame and provides structured information about the dataset, including the number of series, total rows, and temporal range.
The constructor stores the provided data on the instance.
- Parameters:
data – The time series data to describe, as a pandas DataFrame.
- describe() → dict[source]
Analyzes the time series dataset and returns a description dictionary.
- Returns:
A dictionary containing:
- type (str): Always “timeseries”
- n_rows (int): Total number of rows in the dataset
- n_series (int): Number of unique time series (based on ‘id’ column if present)
- timestamps_range (tuple): Min and max timestamps (if ‘timestamp’ column exists)
- Return type:
dict
- Raises:
TypeError – If the data is not a pandas DataFrame
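How the returned fields relate to the data can be sketched with the standard library alone. The example below is only an illustration: `describe_rows` is a hypothetical function, a list of row dicts stands in for the pandas DataFrame the real class requires, and leaving out `n_series` or `timestamps_range` when the corresponding column is missing is an assumption about behavior the reference does not spell out.

```python
from datetime import datetime


def describe_rows(rows):
    """Compute the documented fields from a list of row dicts (DataFrame stand-in)."""
    description = {"type": "timeseries", "n_rows": len(rows)}
    if rows and all("id" in row for row in rows):
        # One series per distinct value in the 'id' column.
        description["n_series"] = len({row["id"] for row in rows})
    if rows and all("timestamp" in row for row in rows):
        stamps = [row["timestamp"] for row in rows]
        description["timestamps_range"] = (min(stamps), max(stamps))
    return description


rows = [
    {"id": "a", "timestamp": datetime(2024, 1, 1), "value": 1.0},
    {"id": "a", "timestamp": datetime(2024, 1, 2), "value": 1.5},
    {"id": "b", "timestamp": datetime(2024, 1, 3), "value": 0.7},
]
series_description = describe_rows(rows)
print(series_description)
```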