Title: | A Backend for a 'nextflow' Pipeline that Performs Machine-Learning-Based Modeling of Biomedical Data |
---|---|
Description: | Provides functionality to perform machine-learning-based modeling in a computation pipeline. Its functions contain the basic steps of machine-learning-based knowledge discovery workflows, including model training and optimization, model evaluation, and model testing. To perform these tasks, the package builds heavily on existing machine-learning packages, such as 'caret' <https://github.com/topepo/caret/> and associated packages. The package can train multiple models, optimize model hyperparameters by performing a grid search or a random search, and evaluate model performance by different metrics. Models can be validated either on a test data set or, in the case of a small sample size, by k-fold cross validation or repeated bootstrapping. It also allows for null-hypothesis generation by performing permutation experiments. Additionally, it offers methods of model interpretation and item categorization to identify the most informative features in a high-dimensional data space. The functions of this package can easily be integrated into computation pipelines (e.g. 'nextflow' <https://www.nextflow.io/>) and thereby improve scalability, standardization, and reproducibility in the context of machine learning. |
Authors: | Sebastian Malkusch [aut, cre] , Kolja Becker [aut] , Alexander Peltzer [ctb] , Neslihan Kaya [ctb] , Boehringer Ingelheim Ltd. [cph, fnd] |
Maintainer: | Sebastian Malkusch <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.3 |
Built: | 2025-01-23 04:28:36 UTC |
Source: | https://github.com/boehringer-ingelheim/flowml |
Creates an object that defines and handles command line arguments.
create_parser()
A parser that organizes the communication between the user and the function. It also provides a help message.
An instance of type 'optparse::OptionParser'.
Sebastian Malkusch
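A minimal sketch of how the returned parser might be used outside a pipeline run. Passing the parser to 'optparse::parse_args' is an assumption based on the documented return type 'optparse::OptionParser'; the 'result_dir' option name is taken from the pipeline examples below.

```r
## Not run: 
# Create the option parser and turn the command line arguments
# into a named list of options.
parser_inst <- flowml::create_parser()
args <- optparse::parse_args(parser_inst)

# Individual options can then be inspected, e.g.:
print(args$result_dir)
## End(Not run)
```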
Creates an object of a resampling experiment.
create_resample_experiment( seed, data_df, parser_inst, model_inst, config_inst, n_features )
seed |
sets the seed for the random number generator to guarantee reproducibility. (int) |
data_df |
data frame to be learned from. (tibble::tibble) |
parser_inst |
instance of a parser object (optparse::parse_args). |
model_inst |
instance of caret_train object (caret::train). |
config_inst |
list of config options (list). |
n_features |
number of features (int). |
Creates a resampling experiment. It uses user defined parameters to set up the experiment. It creates an instance of the Resampler object and runs the experiment according to the user-defined parameters.
An instance of type 'Resampler'.
Sebastian Malkusch
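A hedged sketch of how the arguments documented above might be assembled from the bundled example files. Loading the data with 'readr::read_csv', the fitted model with 'readRDS', and the config with 'jsonlite::fromJSON' are assumptions for illustration, not part of this package's API; 'seed' and 'n_features' values are arbitrary.

```r
## Not run: 
# Assemble the inputs for a resampling experiment. The caret::train
# object is assumed to come from a previous training step (see fml_train).
parser_inst <- flowml::create_parser()
data_df <- readr::read_csv(flowml::fml_example(file = "reg_data.csv"))
model_inst <- readRDS(flowml::fml_example(file = "reg_fit.rds"))
config_inst <- jsonlite::fromJSON(flowml::fml_example(file = "reg_config.json"))

resampler_inst <- flowml::create_resample_experiment(
  seed = 42,
  data_df = data_df,
  parser_inst = parser_inst,
  model_inst = model_inst,
  config_inst = config_inst,
  n_features = 10
)
## End(Not run)
```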
Pipeline function that sets up and runs a resampling experiment.
fml_bootstrap(parser_inst)
parser_inst |
Instance of fml_parser class that comprises command line arguments. |
The experiment is run in parallel. All results are written to files.
none
Sebastian Malkusch
## Not run: 
parser_inst <- flowml::create_parser()
parser_inst$pipeline_segment <- "bootstrap"
parser_inst$config <- flowml::fml_example(file = "reg_config.json")
parser_inst$data <- flowml::fml_example(file = "reg_data.csv")
parser_inst$samples_train <- flowml::fml_example(file = "reg_samples_train.txt")
parser_inst$samples_test <- flowml::fml_example(file = "reg_samples_test.txt")
parser_inst$features <- flowml::fml_example(file = "reg_features.txt")
parser_inst$extended_features <- flowml::fml_example(file = "reg_features_extended.txt")
parser_inst$trained <- flowml::fml_example(file = "reg_fit.rds")
parser_inst$permutation <- "none"
parser_inst$result_dir <- tempdir()
flowml::fml_bootstrap(parser_inst = parser_inst)
## End(Not run)
Path to flowml example data
fml_example(file = NULL)
file |
Name of file. If 'NULL', the example files will be listed. |
flowml comes bundled with a number of sample files in its 'inst/extdata' directory. This function provides access to them.
The path to an example file, if file is defined; otherwise, a list of example files.
Sebastian Malkusch
## Not run: 
fml_example()
fml_example(file = "reg_config.json")
## End(Not run)
Pipeline function that sets up and runs a post-hoc interpretation of an ml experiment. All results are written to rds files.
fml_interpret(parser_inst)
parser_inst |
instance of fml_parser class that comprises command line arguments. |
none
Sebastian Malkusch
## Not run: 
parser_inst <- flowml::create_parser()
parser_inst$pipeline_segment <- "interpret"
parser_inst$config <- flowml::fml_example(file = "reg_config.json")
parser_inst$data <- flowml::fml_example(file = "reg_data.csv")
parser_inst$samples_train <- flowml::fml_example(file = "reg_samples_train.txt")
parser_inst$samples_test <- flowml::fml_example(file = "reg_samples_test.txt")
parser_inst$features <- flowml::fml_example(file = "reg_features.txt")
parser_inst$extended_features <- flowml::fml_example(file = "reg_features_extended.txt")
parser_inst$trained <- flowml::fml_example(file = "reg_fit.rds")
parser_inst$interpretation <- "shap"
parser_inst$result_dir <- tempdir()
flowml::fml_interpret(parser_inst = parser_inst)
## End(Not run)
Pipeline function that performs a hyperparameter screening experiment.
fml_train(parser_inst)
parser_inst |
instance of fml_parser class that comprises command line arguments. |
none
Kolja Becker
## Not run: 
parser_inst <- flowml::create_parser()
parser_inst$pipeline_segment <- "train"
parser_inst$config <- flowml::fml_example(file = "reg_config.json")
parser_inst$data <- flowml::fml_example(file = "reg_data.csv")
parser_inst$samples_train <- flowml::fml_example(file = "reg_samples_train.txt")
parser_inst$samples_test <- flowml::fml_example(file = "reg_samples_test.txt")
parser_inst$features <- flowml::fml_example(file = "reg_features.txt")
parser_inst$extended_features <- flowml::fml_example(file = "reg_features_extended.txt")
parser_inst$result_dir <- tempdir()
flowml::fml_train(parser_inst = parser_inst)
## End(Not run)
Pipeline function that performs a validation experiment on a caret train object based on test samples.
fml_validate(parser_inst)
parser_inst |
instance of fml_parser class that comprises command line arguments. |
none
Kolja Becker
## Not run: 
parser_inst <- flowml::create_parser()
parser_inst$pipeline_segment <- "validate"
parser_inst$config <- flowml::fml_example(file = "reg_config.json")
parser_inst$data <- flowml::fml_example(file = "reg_data.csv")
parser_inst$samples_train <- flowml::fml_example(file = "reg_samples_train.txt")
parser_inst$samples_test <- flowml::fml_example(file = "reg_samples_test.txt")
parser_inst$features <- flowml::fml_example(file = "reg_features.txt")
parser_inst$extended_features <- flowml::fml_example(file = "reg_features_extended.txt")
parser_inst$trained <- flowml::fml_example(file = "reg_fit.rds")
parser_inst$permutation <- "none"
parser_inst$result_dir <- tempdir()
flowml::fml_validate(parser_inst = parser_inst)
## End(Not run)
Formats the response variable based on the ml-type variable passed by the config file. For regression analyses the response variable will be explicitly transformed to type numeric. For classification experiments the response variable will be explicitly transformed to a factor. Time-to-event models are to be implemented in the near future.
format_y(y, ml.type)
y |
vector of response variable. |
ml.type |
type of experiment (character). |
a transformed version of the response variable y.
Kolja Becker
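A brief sketch of the transformation described above. The ml.type values "regression" and "classification" are inferred from the description; the input vectors are placeholders.

```r
## Not run: 
# For a regression experiment the response is coerced to numeric,
# for a classification experiment to a factor.
y_num <- flowml::format_y(c("1.5", "2.0", "3.5"), ml.type = "regression")
y_fct <- flowml::format_y(c("a", "b", "a"), ml.type = "classification")
## End(Not run)
```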
Model validation by repeated bootstrapping
[R6::R6Class] object.
Uses repeated bootstrapping to validate models without a test data set. For each experiment multiple metrics are measured. For classification experiments the confusion matrix is calculated additionally. In order to test hypotheses, either features or the response variable can be permuted.
permute
returns the instance variable 'permute'. (character)
permute_alphabet
returns the instance variable 'permute_alphabet'. (character)
n_resample
returns the instance variable 'n_resample'. (integer)
fml_method
returns the instance variable 'fml_method'. (character)
fml_type
returns the instance variable 'fml_type'. (character)
fml_type_alphabet
returns the instance variable 'fml_type_alphabet'. (character)
pre_process_lst
returns the instance variable 'pre_process_lst'. (character)
hyper_parameters
returns the instance variable 'hyper_parameters'. (list)
response_var
returns the instance variable 'response_var'. (character)
n_features
returns the instance variable 'n_features'. (integer)
strata_var
returns the instance variable 'strata_var'. (character)
metrics_df
returns the instance variable 'metrics_df'. (tibble::tibble)
confusion_df
returns the instance variable 'confusion_df'. (tibble::tibble)
new()
checks, if permutation is requested. If true, performs the permutation task.
Checks if ml.type is classification. If true, calculates confusion matrix.
Creates and returns instance of Resampler class.
Resampler$new(
  n_resample = 500,
  fml_method = "pcr",
  fml_type = "classification",
  hyper_parameters = "list",
  pre_process_lst = c("center", "scale"),
  permute = NULL,
  n_features = 0,
  response_var = "character",
  strata_var = NULL
)
n_resample
Number of bootstrap resamples. The default is 500 (integer).
fml_method
ML model that is being used. The default is 'pcr' (character).
fml_type
ML model type. Needs to be 'classification', 'regression' or 'censored'. Default is 'classification' (character).
hyper_parameters
List of model hyperparameters (list).
pre_process_lst
Vector of pre-processing steps. Default is 'c("center", "scale")' (character).
permute
Permutation method. Needs to be 'none', 'features' or 'response'. (character)
n_features
Number of features to be chosen in the permutation experiment. Default is 0 (integer).
response_var
Response variable of the model (character).
strata_var
Stratification variable (character).
Resampler
print()
Print instance variables of Resampler class.
Resampler$print()
character
fit()
Runs the bootstrap analysis based on the instance variables chosen under initialize.
Resampler$fit(data_df = "tbl_df")
data_df
data set to be analyzed (tibble::tibble).
None
clone()
The objects of this class are cloneable with this method.
Resampler$clone(deep = FALSE)
deep
Whether to make a deep clone.
Sebastian Malkusch
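A hedged sketch of using the class directly, assembled from the constructor signature and the fit() method documented above. The hyperparameter list for the 'pcr' method and the placeholder tibble 'my_data_df' are assumptions for illustration.

```r
## Not run: 
# Set up a resampling experiment via the R6 class and run it on a
# data set; metrics are then available from the active binding.
resampler_inst <- Resampler$new(
  n_resample = 100,
  fml_method = "pcr",
  fml_type = "regression",
  hyper_parameters = list(ncomp = 2),
  pre_process_lst = c("center", "scale"),
  permute = "none",
  response_var = "y"
)
resampler_inst$fit(data_df = my_data_df)
resampler_inst$metrics_df
## End(Not run)
```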
Performs item categorization on permutation or shap analysis object
run_abc_analysis(data_obj, method)
data_obj |
Results of a model interpretation experiment |
method |
Method used for model interpretation (permutation or shap) |
Interpretation results are passed to the function. Based on the type of interpretation experiment, the data is transformed into a uniformly structured data frame. Item categorization is performed by computing an ABC analysis. The result is returned in the form of a tibble.
A tibble with item categories.
Sebastian Malkusch
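A hedged sketch of categorizing features from a SHAP interpretation result. The file name "reg_interpret.rds" is a hypothetical placeholder for an rds result written by fml_interpret.

```r
## Not run: 
# Load a previously written interpretation result (placeholder path)
# and categorize its features into ABC classes.
interpretation_obj <- readRDS(file.path(tempdir(), "reg_interpret.rds"))
abc_df <- run_abc_analysis(data_obj = interpretation_obj, method = "shap")
## End(Not run)
```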