Rust compiler investigation
Do all the datasets have exactly the same columns?
Approach
They have the same columns. The datasets will be combined, but will have two columns, run and source, where run is 1, 2, or 3, and source is debug or opt.
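For reference, a minimal sketch of how the combined df might be assembled (df_debug1, df_debug2, df_debug3 appear in the second approach below; the df_opt1, df_opt2, df_opt3 names are assumed here, and each per-run data frame is assumed to already carry its run and source columns):

# Sketch only: row-bind the three debug runs and the three opt runs into one df
df = df_debug1 |>
  rbind(df_debug2) |>
  rbind(df_debug3) |>
  rbind(df_opt1) |>
  rbind(df_opt2) |>
  rbind(df_opt3)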
So my approaches will be in two parts.
The first approach, a meta learner among a few algorithms, will be to fit and find variable importance values for all t_* and for both debug + opt. This approach may help us see if there are any covariates that are important in all of t_all__, t_gen__, t_opt__, t_lto__, and for both debug + opt simultaneously.
The second approach, again a meta learner among a few algorithms, will be to fit and find variable importance values for only t_all and only for debug (unsure if that's more desirable than opt). This will be a bit more surgical and may help Rust devs narrow down a bit more for debug.
First approach
# Recipes
recipe_model_all = df |>
  recipe(t_all__ ~ .) |>
  step_select(-t_gen__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration') |>
  prep()

recipe_model_gen = df |>
  recipe(t_gen__ ~ .) |>
  step_select(-t_all__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration') |>
  prep()

recipe_model_opt = df |>
  recipe(t_opt__ ~ .) |>
  step_select(-t_all__, -t_gen__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration') |>
  prep()

recipe_model_lto = df |>
  recipe(t_lto__ ~ .) |>
  step_select(-t_all__, -t_gen__, -t_opt__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration') |>
  prep()
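Not in the original, but as a quick sanity check, the role assignments of a prepped recipe can be listed like this (it simply shows each column's type and role):

# Optional check: which columns are predictors vs. id / strata / iteration roles
summary(recipe_model_all)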
# Cross validation
resample_cv = df |>
  vfold_cv(v = 5,        # vfold_cv() takes v (number of folds), not times
           repeats = 2,
           strata = source)
# Algorithm specifications
spec_glmnet = linear_reg(penalty = tune(),
                         mixture = tune(),
                         engine = 'glmnet')

number_of_variables = 30

spec_mars = mars(num_terms = !!number_of_variables,
                 prod_degree = 2,
                 prune_method = 'backward',
                 mode = 'regression',
                 engine = 'earth')

metrics = metric_set(rmse, mae, huber_loss, smape)
wfs_meta_learner = workflow_set(models = list(enet = spec_glmnet,
                                              #spec_xgboost,
                                              mars = spec_mars),
                                preproc = list(recipe_model_all,
                                               recipe_model_gen,
                                               recipe_model_opt,
                                               recipe_model_lto),
                                cross = T)
# Multi-core
core_cluster = parallel::makePSOCKcluster(6)
doParallel::registerDoParallel(core_cluster)
# Tune
tuned_models = wfs_meta_learner |>
  workflow_map('tune_grid',
               grid = 10, # number of candidate parameter sets
               resamples = resample_cv,
               metrics = metrics,
               control = control_resamples(save_pred = T))
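One housekeeping step not shown in the original: once tuning has finished, the worker processes can be released.

# Optional cleanup after tuning: stop the PSOCK cluster, go back to sequential
parallel::stopCluster(core_cluster)
foreach::registerDoSEQ()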
# Best models
best_models = tuned_models |>
  rank_results(rank_metric = 'huber_loss',
               select_best = T)
cat('Model results overview: \n')
Model results overview:
best_models |>
  select(model, rank, .metric, mean, std_err) |>
  print.data.frame()
model rank .metric mean std_err
1 mars 1 huber_loss 1.988382 0.02801533
2 mars 1 mae 2.415753 0.02888400
3 mars 1 rmse 4.420764 0.07267093
4 mars 1 smape 32.104160 0.51762551
5 linear_reg 2 huber_loss 2.644729 0.01076610
6 linear_reg 2 mae 3.102345 0.01080196
7 linear_reg 2 rmse 5.638122 0.05003410
8 linear_reg 2 smape 41.108214 0.13934601
9 mars 3 huber_loss 12.396574 0.17642270
10 mars 3 mae 12.867571 0.17895599
11 mars 3 rmse 32.820151 0.84429605
12 mars 3 smape 114.822865 0.35223842
13 linear_reg 4 huber_loss 15.732693 0.12259686
14 linear_reg 4 mae 16.217066 0.12260450
15 linear_reg 4 rmse 38.686650 0.93551912
16 linear_reg 4 smape 120.730060 0.14334815
17 mars 5 huber_loss 30.693366 0.45823084
18 mars 5 mae 31.185624 0.45832850
19 mars 5 rmse 97.908504 3.55766679
20 mars 5 smape 34.150181 0.83265438
21 linear_reg 6 huber_loss 34.691675 0.31188623
22 linear_reg 6 mae 35.182605 0.31184456
23 linear_reg 6 rmse 112.526951 3.79990889
24 linear_reg 6 smape 37.009140 0.16044608
25 mars 7 huber_loss 39.537252 0.34278772
26 mars 7 mae 40.030793 0.34295011
27 mars 7 rmse 121.760634 3.26804067
28 mars 7 smape 30.962118 0.68995907
29 linear_reg 8 huber_loss 44.673637 0.35967970
30 linear_reg 8 mae 45.167056 0.35987220
31 linear_reg 8 rmse 133.559498 3.51880474
32 linear_reg 8 smape 35.113001 0.33728652
best_fits = collect_predictions(tuned_models)
# Set the KPI
metric_main = 'huber_loss'
# Get KPI and interpretable metric
best_model = best_models |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score) |>
  filter(.metric == metric_main | .metric == 'smape')

metric_main_for_print = best_model |>
  filter(.metric == metric_main) |>
  select(score) |>
  pull() |>
  round(4)

metric_smape_for_print = best_model |>
  filter(.metric == 'smape') |>
  select(score) |>
  pull() |>
  round(4)
# Best model
best_model = best_models |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score) |>
  filter(.metric == metric_main)

best_model_for_print = best_model |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()
# 2nd best model
best_model_2 = best_models |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score,
            n = 2) |>
  filter(.metric == metric_main,
         rank == 2)

best_model_for_print_2 = best_model_2 |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()
cat(paste0('The best performing model is ', best_model_for_print, '\n'))
The best performing model is mars
cat(paste0(metric_main, ' + standard deviation score: ', metric_main_for_print, '\n'))
huber_loss + standard deviation score: 2.0164
cat(paste0('SMAPE : ', metric_smape_for_print, '\n'))
SMAPE : 31.6521
cat(paste0('The second best performing model is ', best_model_for_print_2, '\n'))
The second best performing model is linear_reg
Get variable importance from the two best models to make the result more reliable.
Combine the two models' variable importance ranks.
Check how much the two models' rankings of each variable differ, to get an idea of how reliable the result may be.
# Note: GLMNET gives signs
# Best workflow
best_workflow = best_model |>
  filter(.metric == metric_main) |>
  select(wflow_id) |>
  pull()

df_vi = tuned_models |>
  extract_workflow(best_workflow) |>
  fit(df) |>
  extract_fit_parsnip(best_workflow) |>
  vip::vi(method = "model") |>
  mutate(rank = rank(-Importance, ties.method = "min"))
# 2nd best workflow
best_workflow_2 = best_model_2 |>
  filter(.metric == metric_main) |>
  select(wflow_id) |>
  pull()

df_vi_2 = tuned_models |>
  extract_workflow(best_workflow_2) |>
  fit(df) |>
  extract_fit_parsnip(best_workflow_2) |>
  vip::vi(method = "model") |>
  mutate(rank = rank(-Importance, ties.method = "min"))
# Plot
library(ggplot2)
df_vi |>
  ggplot() +
  geom_bar(aes(x = Importance, y = reorder(Variable, Importance)),
           stat = "identity") +
  theme_classic() +
  ylab("Variable") +
  labs(title = "Best model variable importance")

df_vi_2 |>
  ggplot() +
  geom_bar(aes(x = Importance, y = reorder(Variable, Importance)),
           stat = "identity") +
  theme_classic() +
  ylab("Variable") +
  labs(title = "2nd best model variable importance")
df_vi |>
  inner_join(df_vi_2, by = "Variable") |>
  select(-starts_with("Importance")) |>
  rename(rank_best = rank.x, rank_best_2 = rank.y) |>
  mutate(rank_diff = rank_best - rank_best_2,
         rank_summed = rank_best + rank_best_2,
         rank_combine = rank(rank_summed, ties.method = "min")) |>
  select(Variable, rank_combine, rank_diff, starts_with("rank"), everything()) |>
  arrange(rank_combine)
# A tidytable: 18 × 7
Variable rank_combine rank_diff rank_best rank_best_2 rank_summed Sign
<chr> <int> <int> <int> <int> <int> <chr>
1 rfn__ 1 0 1 1 2 POS
2 icnst 2 0 4 4 8 POS
3 iproj 3 -6 2 8 10 POS
4 ibb__ 4 10 12 2 14 POS
5 rproj 5 3 9 6 15 POS
6 iplac 6 -4 6 10 16 NEG
7 sttic 7 11 14 3 17 POS
8 idecl 8 9 14 5 19 NEG
9 rplac 8 1 10 9 19 POS
10 rdecl 8 -5 7 12 19 NEG
11 rssd_ 8 -11 4 15 19 NEG
12 est__ 12 -16 2 18 20 POS
13 rcnst 13 7 14 7 21 POS
14 issd_ 14 0 11 11 22 NEG
15 rstmt 15 -10 7 17 24 POS
16 rbb__ 16 1 14 13 27 NEG
17 istmt 16 -1 13 14 27 POS
18 ifn__ 18 -2 14 16 30 NEG
Conclusion, first approach
It seems at least somewhat reliable (low rank_diff) that rfn__ is the most 'important' covariate ('important' in the model sense, not everyday English) for everything, followed by icnst. Signs seem positive, i.e. an increase in rfn__ goes with an increase in t_*. These sign relationships come from only one model (glmnet), however, so they may not be as reliable.
On the other hand, issd_, rbb__, istmt, ifn__ do not seem to be very 'important'.
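Since the signs above come only from the glmnet fit, one optional sanity check (my addition, not part of the original analysis) would be to look at the MARS fit's hinge-function coefficients as well, reusing the objects defined earlier:

# Sketch: inspect the earth (MARS) basis functions to see in which direction
# the hinges on rfn__ (and the other covariates) move the prediction
tuned_models |>
  extract_workflow(best_workflow) |>
  fit(df) |>
  extract_fit_engine() |>   # the underlying earth object
  summary()                 # coefficients per hinge/basis function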
Second approach
# Focus on debug
df2 = df_debug1 |>
  rbind(df_debug2) |>
  rbind(df_debug3)
# Recipe
recipe_model_all2 = df2 |>
  recipe(t_all__ ~ .) |>
  step_select(-t_gen__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration') |>
  prep()
# Cross validation
resample_cv2 = df2 |>
  vfold_cv(v = 5,        # vfold_cv() takes v (number of folds), not times
           repeats = 2,
           strata = source)
wfs_meta_learner_debug = workflow_set(models = list(enet = spec_glmnet,
                                                    #spec_xgboost,
                                                    mars = spec_mars),
                                      preproc = list(recipe_model_all2),
                                      cross = T)
# Tune
tuned_models2 = wfs_meta_learner_debug |>
  workflow_map('tune_grid',
               grid = 10, # number of candidate parameter sets
               resamples = resample_cv2,
               metrics = metrics,
               control = control_resamples(save_pred = T))
# Best models
best_models2 = tuned_models2 |>
  rank_results(rank_metric = 'huber_loss',
               select_best = T)
cat('Model results overview: \n')
Model results overview:
best_models2 |>
  select(model, rank, .metric, mean, std_err) |>
  print.data.frame()
model rank .metric mean std_err
1 mars 1 huber_loss 9.388043 0.07714824
2 mars 1 mae 9.871018 0.07679225
3 mars 1 rmse 16.646647 0.25992042
4 mars 1 smape 19.512840 0.08416463
5 linear_reg 2 huber_loss 10.755656 0.08593158
6 linear_reg 2 mae 11.245593 0.08586785
7 linear_reg 2 rmse 19.238248 0.33855282
8 linear_reg 2 smape 24.256541 0.15787557
best_fits2 = collect_predictions(tuned_models2)
# Get KPI and interpretable metric
best_model2 = best_models2 |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score) |>
  filter(.metric == metric_main | .metric == 'smape')

metric_main_for_print = best_model2 |>
  filter(.metric == metric_main) |>
  select(score) |>
  pull() |>
  round(4)

metric_smape_for_print = best_model2 |>
  filter(.metric == 'smape') |>
  select(score) |>
  pull() |>
  round(4)
# Best model
best_model2 = best_models2 |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score) |>
  filter(.metric == metric_main)

best_model_for_print = best_model2 |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()
# 2nd best model
best_model_2 = best_models2 |>
  mutate(score = mean + std_err) |>
  slice_min(.by = .metric,
            order_by = score,
            n = 2) |>
  filter(.metric == metric_main,
         rank == 2)

best_model_for_print_2 = best_model_2 |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()
cat(paste0('The best performing model is ', best_model_for_print, '\n'))
The best performing model is mars
cat(paste0(metric_main, ' + standard deviation score: ', metric_main_for_print, '\n'))
huber_loss + standard deviation score: 9.4652
cat(paste0('SMAPE : ', metric_smape_for_print, '\n'))
SMAPE : 19.597
cat(paste0('The second best performing model is ', best_model_for_print_2, '\n'))
The second best performing model is linear_reg
Get variable importance from the two best models to make the result more reliable.
Combine the two models' variable importance ranks.
Check how much the two models' rankings of each variable differ, to get an idea of how reliable the result may be.
# Note: GLMNET gives signs
# Best workflow
best_workflow2 = best_model2 |>
  filter(.metric == metric_main) |>
  select(wflow_id) |>
  pull()

df_vi = tuned_models2 |>
  extract_workflow(best_workflow2) |>
  fit(df2) |>
  extract_fit_parsnip(best_workflow2) |>
  vip::vi(method = "model") |>
  mutate(rank = rank(-Importance, ties.method = "min"))
# 2nd best workflow
best_workflow_2 = best_model_2 |>
  filter(.metric == metric_main) |>
  select(wflow_id) |>
  pull()

df_vi_2 = tuned_models2 |>
  extract_workflow(best_workflow_2) |>
  fit(df2) |>
  extract_fit_parsnip(best_workflow_2) |>
  vip::vi(method = "model") |>
  mutate(rank = rank(-Importance, ties.method = "min"))
# Plot
library(ggplot2)
df_vi |>
  ggplot() +
  geom_bar(aes(x = Importance, y = reorder(Variable, Importance)),
           stat = "identity") +
  theme_classic() +
  ylab("Variable") +
  labs(title = "Best model variable importance")

df_vi_2 |>
  ggplot() +
  geom_bar(aes(x = Importance, y = reorder(Variable, Importance)),
           stat = "identity") +
  theme_classic() +
  ylab("Variable") +
  labs(title = "2nd best model variable importance")
df_vi |>
  inner_join(df_vi_2, by = "Variable") |>
  select(-starts_with("Importance")) |>
  rename(rank_best = rank.x, rank_best_2 = rank.y) |>
  mutate(rank_diff = rank_best - rank_best_2,
         rank_summed = rank_best + rank_best_2,
         rank_combine = rank(rank_summed, ties.method = "min")) |>
  select(Variable, rank_combine, rank_diff, starts_with("rank"), everything()) |>
  arrange(rank_combine)
# A tidytable: 18 × 7
Variable rank_combine rank_diff rank_best rank_best_2 rank_summed Sign
<chr> <int> <int> <int> <int> <int> <chr>
1 issd_ 1 2 4 2 6 POS
2 rfn__ 1 -2 2 4 6 POS
3 ifn__ 3 9 10 1 11 NEG
4 rstmt 3 -7 2 9 11 POS
5 sttic 5 7 10 3 13 POS
6 est__ 6 -12 1 13 14 POS
7 iproj 7 5 10 5 15 POS
8 icnst 8 4 10 6 16 POS
9 rbb__ 9 3 10 7 17 POS
10 rdecl 9 -5 6 11 17 POS
11 ibb__ 11 2 10 8 18 POS
12 rssd_ 12 0 10 10 20 NEG
13 rcnst 12 -8 6 14 20 POS
14 rproj 12 -10 5 15 20 NEG
15 rplac 15 -2 10 12 22 POS
16 istmt 16 -7 8 15 23 NEG
17 idecl 17 -6 9 15 24 NEG
18 iplac 18 -5 10 15 25 NEG
Conclusion
Investigating only t_all (and only debug), the rankings seem less reliable (the rank_diff values are larger), but issd_ and rfn__ appear to be of highest 'importance'.