Learning

DataCleaner

class MEDiml.learning.DataCleaner.DataCleaner(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]

Bases: BaseEstimator, TransformerMixin

A scikit-learn compatible transformer that cleans features by removing those with too many missing values or too little variation, removes samples with too many missing features, and imputes missing values.

__init__(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]

Initializes the DataCleaner with specified parameters for feature and sample filtering and imputation.

Parameters:

var_type (str) – Type of variable (“continuous”, “hcategorical”, “icategorical”).
imputation (str) – Method of imputation (“mean”, “median”, “mode”, “random”).
missingCutoffpf (float) – Max % of missing values allowed per feature (column).
missingCutoffps (float) – Max % of missing values allowed per sample (row).
covCutoff (float) – Min coefficient of variation allowed per feature.
random_state (int, RandomState instance or None) – Seed for reproducibility.

Returns:

None

_apply_imputation(X)[source]: Helper to apply the imputation.

_fit_imputer(X)[source]: Helper to initialize and fit the correct imputer logic.

_validate_input(X)[source]: Ensures X is a DataFrame and handles infinite values.

fit(X: DataFrame, y: DataFrame | None = None)[source]

Learns which features to keep based on missingness and variation thresholds.

Parameters:

X (pd.DataFrame) – Input feature data.
y (pd.DataFrame, optional) – Ignored, present for API consistency by convention.

Returns:

Returns self.

Return type:

DataCleaner

transform(X: DataFrame)[source]: Applies feature selection, sample filtering, and imputation.
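The cleaning steps described for DataCleaner can be illustrated with a small pandas sketch. This is not the MEDiml implementation: the function name clean_table, the order of the steps, the definition of the coefficient of variation as std / |mean|, and the restriction to mean imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def clean_table(X, missing_cutoff_pf=0.1, missing_cutoff_ps=0.25, cov_cutoff=0.1):
    """Illustrative cleaning pass (hypothetical helper, not part of MEDiml)."""
    # Treat infinite values as missing.
    X = X.replace([np.inf, -np.inf], np.nan)

    # Keep features whose fraction of missing values is within the cutoff.
    X = X.loc[:, X.isna().mean(axis=0) <= missing_cutoff_pf]

    # Keep features whose coefficient of variation (std / |mean|) is large enough.
    cov = X.std(axis=0, skipna=True) / X.mean(axis=0, skipna=True).abs()
    X = X.loc[:, cov >= cov_cutoff]

    # Drop samples with too many missing features.
    X = X.loc[X.isna().mean(axis=1) <= missing_cutoff_ps]

    # Mean imputation for the remaining gaps.
    return X.fillna(X.mean(axis=0))


table = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
        "b": [2.0 * i for i in range(1, 13)],
        "c": [float(i) for i in range(12, 0, -1)],
        # Three of twelve values missing: above the per-feature cutoff.
        "sparse": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
        # Almost constant: coefficient of variation far below the cutoff.
        "flat": [100.0, 100.1, 100.0, 99.9, 100.0, 100.1, 100.0, 99.9, 100.0, 100.1, 100.0, 99.9],
    }
)

# A looser per-sample cutoff is used here so the row with one gap survives
# and the imputation step is visible.
cleaned = clean_table(table, missing_cutoff_ps=0.5)
```

In this toy table the "sparse" and "flat" columns are dropped, and the single gap in column "a" is filled with the mean of the remaining values.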

Design experiment

class MEDiml.learning.DesignExperiment.DesignExperiment(path_study: Path, path_settings: Path, experiment_label: str)[source]

Bases: object

__init__(path_study: Path, path_settings: Path, experiment_label: str) → None[source]

Constructor of the class DesignExperiment.

Parameters:

path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.
path_settings (Path) – Path to the settings file.
experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiments_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).

Returns:

None

create_experiment() → Dict[source]

Creates the machine learning experiment dictionary and organizes each test/split information in a separate folder.

Parameters:: ml (dict, optional) – Dictionary containing all the machine learning settings. Defaults to None.
Returns:: Dictionary containing all the organized machine learning settings.
Return type:: Dict

generate_experiment()[source]: Generates the JSON files containing all the options of the experiment. The JSON files will then be used in machine learning.
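The naming scheme described for experiment_label can be sketched as follows. The helper experiment_paths is hypothetical, and placing both items directly under a single experiments folder is an assumption; only the ml__ and learn__ prefixes come from the parameter description.

```python
from pathlib import Path


def experiment_paths(path_experiments: Path, experiment_label: str) -> dict:
    """Build the file and folder names tied to one experiment label (hypothetical helper)."""
    return {
        # Settings file carrying the label.
        "settings_file": path_experiments / f"ml__{experiment_label}.json",
        # Folder holding the learning results for the same label.
        "learning_folder": path_experiments / f"learn__{experiment_label}",
    }


paths = experiment_paths(Path("experiments"), "experiment1_morph_CT")
```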

Feature set reduction

class MEDiml.learning.FSR.FSR(method: str = 'fda')[source]

Bases: object

__init__(method: str = 'fda') → None[source]

Feature set reduction class constructor.

Parameters:: method (str) – Method of feature set reduction. Can be “FDA”, “LASSO” or “mRMR”.

apply_fda(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, logging: bool = True, path_save_logging: Path | None = None) → List[source]

Applies false discovery avoidance method.

Parameters:

ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
logging (bool, optional) – If True, will save a dict that tracks features selected for each level. Defaults to True.
path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fda_balanced(ml: Dict, variable_table: List, outcome_table_binary: DataFrame) → List[source]

Applies false discovery avoidance method but balances the number of features on each level.

Parameters:

ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
logging (bool, optional) – If True, will save a dict that tracks features selected for each level. Defaults to True.
path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fda_one_space(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, del_variants: bool = True, logging_dict: Dict | None = None) → List[source]

Applies false discovery avoidance method.

Parameters:

ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
del_variants (bool, optional) – If True, will delete the variants of the same feature. Defaults to True.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fsr(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, path_save_logging: Path | None = None) → List[source]

Applies feature set reduction method.

Parameters:

ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
path_save_logging (Path, optional) – Path to save logging information. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_random_fsr(ml: Dict, variable_table: List) → List[source]

Applies random feature set reduction by choosing a random number of features.

Parameters:

ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

Returns:

Table of variables after feature set reduction.

Return type:

List
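As a rough illustration of random feature set reduction, the sketch below draws a random subset size and then a random subset of that size. The function random_feature_subset, its seed argument and the uniform choice of the subset size are assumptions; the library may select features differently.

```python
import random


def random_feature_subset(feature_names, seed=None):
    """Keep a random number of randomly chosen features (hypothetical helper)."""
    rng = random.Random(seed)
    # Number of features to keep, between 1 and all of them.
    n_keep = rng.randint(1, len(feature_names))
    chosen = set(rng.sample(feature_names, n_keep))
    # Preserve the original column order in the result.
    return [name for name in feature_names if name in chosen]


features = ["morph_vol", "stat_mean", "glcm_energy", "glrlm_sre", "ngtdm_contrast"]
subset = random_feature_subset(features, seed=42)
```

Fixing the seed makes the selection reproducible, which helps when a random baseline is compared against another reduction method.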

apply_random_fsr_one_space(ml: Dict, variable_table: DataFrame) → List[source]

Normalization

class MEDiml.learning.Normalization.CombatNormalization(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible Transformer for ComBat Normalization.

This transformer assumes the input X (DataFrame) contains both the features to be normalized and the column identifying the institution/batch.

__init__(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]

Parameters:

institution_col (str) – Name of the column in X containing the institution/batch IDs. If None, tries to derive from Index using util.
covariates (list) – List of column names in X to treat as covariates (biological retention).
drop_institution (bool) – If True, removes the institution column from output.

_process_institutions(institutions)[source]: Helper to map institution strings to integers.

fit(X, y=None)[source]: ComBat calculates parameters on the current batch data provided in transform. Standard fit does nothing but validate input exists.

transform(X)[source]: Applies ComBat Normalization.

Radiomics Learner

class MEDiml.learning.RadiomicsLearner.RadiomicsLearner(path_study: Path, path_settings: Path, experiment_label: str)[source]

Bases: object

__init__(path_study: Path, path_settings: Path, experiment_label: str) → None[source]

Constructor of the class RadiomicsLearner.

Parameters:

path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.
path_settings (Path) – Path to the settings folder.
experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiments_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).

Returns:

None

get_hold_out_set_table(ml: Dict, var_id: str, patients_id: List)[source]

Loads and pre-processes different radiomics tables then combines them to be used for hold-out testing.

Parameters:

ml (Dict) – The machine learning dictionary containing the information of the machine learning test.
var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.
patients_id (List) – List of patients of the hold-out set.

Returns:

Radiomics table for the hold-out set.

Return type:

pd.DataFrame

ml_run(path_ml: Path, holdout_test: bool = True, method: str = 'auto') → None[source]

This function runs the machine learning test for the created experiment.

Parameters:

path_ml (Path) – Path to the main dictionary containing info about the current ml experiment.
holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.

Returns:

None.

pre_process_radiomics_table(ml: Dict, var_id: str, outcome_table_binary: DataFrame, patients_train: list) → Tuple[DataFrame, DataFrame][source]

For the given variable, this function loads the corresponding radiomics tables and pre-processes them (cleaning, normalization and feature set reduction).

Note

Only patients of the training/learning set should be found in the given outcome table.

Parameters:

ml (Dict) – The machine learning dictionary containing the information of the machine learning test (parameters, options, etc.).
var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.
outcome_table_binary (pd.DataFrame) – outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.
patients_train (list) – List of patients to use for training.

Returns:

Two dataframes of processed radiomics tables, one for training and one for testing (no feature set reduction).

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

pre_process_variables(ml: Dict, outcome_table_binary: DataFrame) → Tuple[DataFrame, DataFrame][source]

Loads and pre-processes different radiomics tables from different variable types found in the ml dict.

Note

Only patients of the training/learning set should be found in this outcome table.

Parameters:

ml (Dict) – The machine learning dictionary containing the information of the machine learning test.
outcome_table_binary (pd.DataFrame) – outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.

Returns:

Two dicts of processed radiomics tables, one dict for training and one for testing (no feature set reduction).

Return type:

Tuple

run_experiment(holdout_test: bool = True, method: str = 'pycaret') → None[source]

Runs the machine learning experiment for each split/run.

Parameters:

holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.
method (str, optional) –
String specifying the method to use to train the model. Available options:
- “pycaret”: Use PyCaret to train the model (automatic).
- “grid_search”: Grid search with cross-validation to find the best parameters.
- “random_search”: Random search with cross-validation to find the best parameters.

Returns:

None
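To give an idea of what a grid search with cross-validation involves, here is a minimal scikit-learn sketch on synthetic data. It does not reproduce what run_experiment does internally: the estimator, the parameter grid, the scoring metric and the number of folds are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a table of radiomics features and binary outcomes.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Hypothetical grid; a real experiment would take it from its settings.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Every candidate value is scored by 5-fold cross-validated ROC AUC.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
```

A random search follows the same pattern with RandomizedSearchCV, which samples a fixed number of candidates instead of trying every combination.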

Results

class MEDiml.learning.Results.Results(model_dict: dict | None = None, model_id: str = '')[source]

Bases: object

A class to analyze the results of a given machine learning experiment, including the assessment of the model’s performance.

Parameters:

model_dict (dict, optional) – Dictionary containing the model’s parameters. Defaults to None.
model_id (str, optional) – ID of the model. Defaults to “”.

model_dict

Dictionary containing the model’s parameters.

Type:: dict

model_id

ID of the model.

Type:: str

results_dict

Dictionary containing the results of the model’s performance.

Type:: dict

__init__(model_dict: dict | None = None, model_id: str = '') → None[source]: Constructor of the class Results

average_results(path_results: Path, save: bool = False) → dict[source]

Averages the results (AUC, BAC, Sensitivity and Specificity) of all the runs of the same experiment, for training, testing and holdout sets.

Parameters:

path_results (Path) – path to the folder containing the results of the experiment.
save (bool, optional) – If True, saves the results in the same folder as the model.

Returns:

Averaged results for each dataset.

Return type:

dict
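Averaging per-run metrics across the runs of one experiment can be sketched as below. The input layout (one dict per run, keyed by dataset and then by metric), the function average_runs and the _std key are assumptions; the library reads its results from files under path_results and may store them differently.

```python
from statistics import mean, stdev


def average_runs(runs):
    """Average each metric over runs, per dataset (hypothetical helper)."""
    averaged = {}
    for dataset in runs[0]:
        averaged[dataset] = {}
        for metric in runs[0][dataset]:
            values = [run[dataset][metric] for run in runs]
            averaged[dataset][f"{metric}_mean"] = mean(values)
            # Sample standard deviation needs at least two runs.
            averaged[dataset][f"{metric}_std"] = stdev(values) if len(values) > 1 else 0.0
    return averaged


runs = [
    {"train": {"AUC": 0.90, "BAC": 0.80}, "test": {"AUC": 0.80, "BAC": 0.70}},
    {"train": {"AUC": 0.94, "BAC": 0.84}, "test": {"AUC": 0.70, "BAC": 0.66}},
]
summary = average_runs(runs)
```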

bootstrap_metrics(response: ndarray, labels: DataFrame, thresh: float, n_bootstraps: int = 100) → dict[source]

Computes 95% Confidence Intervals using bootstrap resampling.

Parameters:

response (np.ndarray) – Array of the probabilities of class “1” for all instances (prediction).
labels (pd.DataFrame) – Column vector specifying the outcome status (1 or 0) for all instances.
thresh (float) – Optimal threshold selected from the ROC curve.
n_bootstraps (int, optional) – Number of bootstrap samples. Defaults to 100.

Returns:

Dictionary containing the 95% confidence intervals for each metric.

Return type:

dict
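A percentile bootstrap for 95% confidence intervals can be sketched with NumPy as follows. This is an illustration under assumptions, not the library's code: the function bootstrap_ci, the seed argument, the choice of sensitivity and specificity as the only metrics, and the rule that a response greater than or equal to thresh counts as a positive prediction are all hypothetical.

```python
import numpy as np


def bootstrap_ci(response, labels, thresh, n_bootstraps=100, seed=0):
    """Percentile bootstrap 95% intervals for two metrics (hypothetical helper)."""
    response = np.asarray(response, dtype=float)
    labels = np.asarray(labels).ravel().astype(int)
    rng = np.random.default_rng(seed)
    sens, spec = [], []
    for _ in range(n_bootstraps):
        # Resample instances with replacement.
        idx = rng.integers(0, len(labels), size=len(labels))
        y = labels[idx]
        predicted = response[idx] >= thresh
        if y.min() == y.max():
            continue  # only one class drawn; both metrics are undefined
        sens.append(np.mean(predicted[y == 1]))
        spec.append(np.mean(~predicted[y == 0]))

    def interval(values):
        low, high = np.percentile(values, [2.5, 97.5])
        return float(low), float(high)

    return {"Sensitivity": interval(sens), "Specificity": interval(spec)}


# Synthetic predictions: positives tend to receive higher scores.
demo_rng = np.random.default_rng(1)
demo_labels = demo_rng.integers(0, 2, size=200)
demo_response = 0.3 * demo_labels + demo_rng.uniform(0.0, 0.7, size=200)
intervals = bootstrap_ci(demo_response, demo_labels, thresh=0.5)
```

With only 100 resamples the interval endpoints are fairly noisy; a larger n_bootstraps gives more stable bounds at a proportional cost.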

get_model_performance(response: list, outcome_table: DataFrame) → None[source]

Calculates the performance of the model.

Parameters:

response (list) – List of machine learning model predictions.
outcome_table (pd.DataFrame) – Outcome table with binary labels.

Returns:

Updates the run_results attribute.

Return type:

None
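The metrics named on this page (AUC, BAC, Sensitivity and Specificity) can be computed from predicted probabilities and binary labels as in the sketch below. The function binary_metrics and the default threshold of 0.5 are assumptions; the library may choose the threshold differently and may report additional metrics.

```python
def binary_metrics(response, labels, thresh=0.5):
    """Basic binary classification metrics (hypothetical helper)."""
    pos = [r for r, y in zip(response, labels) if y == 1]
    neg = [r for r, y in zip(response, labels) if y == 0]

    # AUC as the probability that a positive case scores above a negative one
    # (ties count as half a win).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))

    sensitivity = sum(p >= thresh for p in pos) / len(pos)
    specificity = sum(n < thresh for n in neg) / len(neg)

    return {
        "AUC": auc,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        # Balanced accuracy is the mean of sensitivity and specificity.
        "BAC": (sensitivity + specificity) / 2,
    }


metrics = binary_metrics([0.9, 0.8, 0.4, 0.3, 0.6, 0.2], [1, 1, 1, 0, 0, 0])
```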

get_optimal_level(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', p_value_test: str = 'wilcoxon', aggregate: bool = False) → None[source]

This function plots a heatmap of the metrics values for the performance of the models in the given experiment.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiments_labels (List) – List of experiments labels to use for the plot. Including variants is possible. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].
metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.
p_value_test (str, optional) –
Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:
- ‘delong’: Delong test.
- ‘ttest’: T-test.
- ‘wilcoxon’: Wilcoxon signed rank test.
- ‘bengio’: Bengio and Nadeau corrected t-test.
aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.

Returns:

None.

plot_fda_analysis_heatmap(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, save: bool = False) → None[source]

This function plots a heatmap of the percentage of stable features and final features selected by FDA for a given experiment.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].
modalities (List) – List of imaging modalities to include in the plot.
title (str, optional) – Title and name used to save the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_feature_analysis(path_experiments: Path, experiment: str, levels: List, modalities: List = [], title: str | None = None, save: bool = False) → None[source]

This function plots a pie chart of the percentage of the final features used to train the model per radiomics level.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].
modalities (List, optional) – List of imaging modalities to include in the plot. Defaults to [].
title (str, optional) – Title and name used to save the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_features_importance_histogram(path_experiments: Path, experiment: str, level: str, modalities: List, sort_option: str = 'importance', title: str | None = None, save: bool = True, figsize: tuple = (12, 12)) → None[source]

Plots a histogram of the features importance for the given experiment.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics level to plot. For example: ‘morph’.
modalities (List) – List of imaging modalities to use for the plot. A plot for each modality.
sort_option (str, optional) –
Option used to sort the features. Available options:
- ‘importance’: Sorts the features by importance.
- ‘times_selected’: Sorts the features by the number of times they were selected across the different splits.
- ‘both’: Sorts the features by importance and then by the number of times they were selected.
title (str, optional) – Title of the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 12).

Returns:

None. Plots the figure or saves it.

plot_heatmap(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', stat_extra: list = [], plot_p_values: bool = True, p_value_test: str = 'wilcoxon', aggregate: bool = False, title: str | None = None, save: bool = False, figsize: tuple = (8, 8)) → None[source]

This function plots a heatmap of the metrics values for the performance of the models in the given experiment.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiments_labels (List) – List of experiments labels to use for the plot. Including variants is possible. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].
metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.
stat_extra (list, optional) – List of extra statistics to include in the plot. Defaults to [].
plot_p_values (bool, optional) – If True, plots the p-value of the chosen test. Defaults to True.
p_value_test (str, optional) –
Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:
- ‘delong’: Delong test.
- ‘ttest’: T-test.
- ‘wilcoxon’: Wilcoxon signed rank test.
- ‘bengio’: Bengio and Nadeau corrected t-test.
aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.
extra_xlabels (List, optional) – List of extra x-axis labels. Defaults to [].
title (str, optional) – Title of the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.
figsize (tuple, optional) – Size of the figure. Defaults to (8, 8).

Returns:

None.
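The default ‘wilcoxon’ option refers to the Wilcoxon signed rank test on paired values. A minimal SciPy sketch is shown below, comparing per-split AUCs of two experiments. The numbers are made up for illustration, and how the library pairs splits or experiments internally is not shown here.

```python
from scipy.stats import wilcoxon

# Hypothetical AUC values of two experiments on the same eight splits.
auc_experiment_a = [0.80, 0.82, 0.79, 0.85, 0.81, 0.83, 0.78, 0.84]
auc_experiment_b = [0.75, 0.76, 0.75, 0.78, 0.78, 0.75, 0.76, 0.75]

# Paired, two-sided test on the per-split differences.
statistic, p_value = wilcoxon(auc_experiment_a, auc_experiment_b)
```

The test is paired, so both lists must follow the same split order; a small p-value suggests that the difference between the two experiments is consistent across splits.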

plot_lf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) → None[source]

Plots a tree explaining the impact of features in the linear filters radiomics complexity level.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

plot_original_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) → None[source]

Plots a tree explaining the impact of features in the original radiomics complexity level.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

plot_radiomics_starting_percentage(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, figsize: tuple = (15, 10), save: bool = False) → None[source]

This function plots a pie chart of the percentage of features used in experiment per radiomics level.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in the plot.
modalities (List) – List of imaging modalities to include in the plot.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (15, 10).
save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_tf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) → None[source]

Plots a tree explaining the impact of features in the textural filters radiomics complexity level.

Parameters:

path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

Creates a dictionary with the results of the model using the class attributes.

Parameters:

response_train (list) – List of machine learning model predictions for the training set.
response_test (list) – List of machine learning model predictions for the test set.
response_holdout (list) – List of machine learning model predictions for the holdout set.
patients_train (list) – List of patients in the training set.
patients_test (list) – List of patients in the test set.
patients_holdout (list) – List of patients in the holdout set.
outcome_table_binary_train (pd.DataFrame) – Binary outcome table for the training set.
outcome_table_binary_test (pd.DataFrame) – Binary outcome table for the test set.
outcome_table_binary_holdout (pd.DataFrame) – Binary outcome table for the holdout set.

Returns:

Dictionary with the responses of the model and the patients used for training, testing and holdout.

Return type:

Dict