Rust compiler investigation

Published

August 2, 2023

Do all the datasets have exactly the same columns?

Yes

Approach

They have the same columns. The datasets will be combined into one, with two columns, run and source, distinguishing the pieces: run is 1, 2, or 3 and source is debug or opt.
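
A minimal sketch of that combining step, assuming each per-run data frame already carries its run and source columns (df_debug1..df_debug3 appear later in this post; the opt-side names df_opt1..df_opt3 are assumed):

# Stack the six per-run data frames (3 debug runs + 3 opt runs) into one
# NOTE: df_opt1..df_opt3 are assumed names; only the debug frames appear below
df = rbind(df_debug1, df_debug2, df_debug3,
           df_opt1, df_opt2, df_opt3)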

So my analysis will be in two parts. The first approach:

  • fits models and extracts variable importance values for all t_* targets, using both debug and opt data

  • uses a meta-learner over a few algorithms

This approach may help us see whether any covariates are important across all of t_all__, t_gen__, t_opt__, and t_lto__, and for both debug and opt simultaneously.

Second approach

  • fits models and extracts variable importance values only for t_all__ and only for debug (unsure whether that is more desirable than opt)

  • uses a meta-learner over a few algorithms

This is a bit more surgical and may help Rust devs narrow things down for debug builds.

First approach

# Recipes
# (left untrained; tune/workflows will prep each recipe during resampling)
recipe_model_all = df |>
  recipe(t_all__ ~ .) |>
  step_select(-t_gen__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration')

recipe_model_gen = df |>
  recipe(t_gen__ ~ .) |>
  step_select(-t_all__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration')

recipe_model_opt = df |>
  recipe(t_opt__ ~ .) |>
  step_select(-t_all__, -t_gen__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration')

recipe_model_lto = df |>
  recipe(t_lto__ ~ .) |>
  step_select(-t_all__, -t_gen__, -t_opt__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration')
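
To sanity check what each recipe hands to the models (which columns act as predictors and which keep the id / strata / iteration roles), a recipe can be prepped and baked on its own; a minimal sketch for the first one:

# Inspect variable roles and the processed training data for one recipe
summary(recipe_model_all)   # variable, type, role
recipe_model_all |> prep() |> bake(new_data = NULL) |> glimpse()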

# Cross validation
resample_cv = df |> 
  vfold_cv(v = 5,
           repeats = 2,
           strata = source)

# Algorithm specifications
spec_glmnet = linear_reg(penalty = tune(),
                         mixture = tune(),
                         engine = 'glmnet')

number_of_variables = 30

spec_mars = mars(num_terms = !!number_of_variables,
                 prod_degree = 2,
                 prune_method = 'backward',
                 mode = 'regression',
                 engine = 'earth')

metrics = metric_set(rmse, mae, huber_loss, smape)

wfs_meta_learner = workflow_set(models = list(enet = spec_glmnet,
                                              #spec_xgboost,
                                              mars = spec_mars),
                                preproc = list(all = recipe_model_all,
                                               gen = recipe_model_gen,
                                               opt = recipe_model_opt,
                                               lto = recipe_model_lto),
                                cross = TRUE)

# Multi-core
core_cluster = parallel::makePSOCKcluster(6)
doParallel::registerDoParallel(core_cluster)

# Tune
tuned_models = wfs_meta_learner |> 
  workflow_map('tune_grid',
               grid = 10, # number of candidate parameter sets
               resamples = resample_cv,
               metrics = metrics,
               control = control_grid(save_pred = TRUE))

# Best models
best_models = tuned_models |> 
  rank_results(rank_metric = 'huber_loss',
               select_best = TRUE)

cat('Model results overview: \n')
Model results overview: 
best_models |> 
  select(model, rank, .metric, mean, std_err) |> 
  print.data.frame()
        model rank    .metric       mean    std_err
1        mars    1 huber_loss   1.988382 0.02801533
2        mars    1        mae   2.415753 0.02888400
3        mars    1       rmse   4.420764 0.07267093
4        mars    1      smape  32.104160 0.51762551
5  linear_reg    2 huber_loss   2.644729 0.01076610
6  linear_reg    2        mae   3.102345 0.01080196
7  linear_reg    2       rmse   5.638122 0.05003410
8  linear_reg    2      smape  41.108214 0.13934601
9        mars    3 huber_loss  12.396574 0.17642270
10       mars    3        mae  12.867571 0.17895599
11       mars    3       rmse  32.820151 0.84429605
12       mars    3      smape 114.822865 0.35223842
13 linear_reg    4 huber_loss  15.732693 0.12259686
14 linear_reg    4        mae  16.217066 0.12260450
15 linear_reg    4       rmse  38.686650 0.93551912
16 linear_reg    4      smape 120.730060 0.14334815
17       mars    5 huber_loss  30.693366 0.45823084
18       mars    5        mae  31.185624 0.45832850
19       mars    5       rmse  97.908504 3.55766679
20       mars    5      smape  34.150181 0.83265438
21 linear_reg    6 huber_loss  34.691675 0.31188623
22 linear_reg    6        mae  35.182605 0.31184456
23 linear_reg    6       rmse 112.526951 3.79990889
24 linear_reg    6      smape  37.009140 0.16044608
25       mars    7 huber_loss  39.537252 0.34278772
26       mars    7        mae  40.030793 0.34295011
27       mars    7       rmse 121.760634 3.26804067
28       mars    7      smape  30.962118 0.68995907
29 linear_reg    8 huber_loss  44.673637 0.35967970
30 linear_reg    8        mae  45.167056 0.35987220
31 linear_reg    8       rmse 133.559498 3.51880474
32 linear_reg    8      smape  35.113001 0.33728652
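
The ranked resampling results can also be compared visually; a minimal sketch using the workflowsets autoplot method:

# Plot the resampling estimates for each workflow, ranked by Huber loss
autoplot(tuned_models, rank_metric = 'huber_loss', select_best = TRUE)
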
best_fits = collect_predictions(tuned_models)

# Set the KPI
metric_main = 'huber_loss'

# Get KPI and interpretable metric
best_model = best_models |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score) |> 
  filter(.metric == metric_main | .metric == 'smape')

metric_main_for_print = best_model |>
  filter(.metric == metric_main) |> 
  select(score) |>
  pull() |>
  round(4)

metric_smape_for_print = best_model |>
  filter(.metric == 'smape') |> 
  select(score) |>
  pull() |>
  round(4)

# Best model
best_model = best_models |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score) |> 
  filter(.metric == metric_main)

best_model_for_print = best_model |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()

# 2nd best model
best_model_2 = best_models |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score,
            n = 2) |> 
  filter(.metric == metric_main,
         rank == 2)

best_model_for_print_2 = best_model_2 |>
  filter(.metric == metric_main) |> 
  select(model) |>
  pull()

cat(paste0('The best performing model is ', best_model_for_print, '\n'))
The best performing model is mars
cat(paste0(metric_main, ' + standard error score: ', metric_main_for_print, '\n'))
huber_loss + standard error score: 2.0164
cat(paste0('SMAPE : ', metric_smape_for_print, '\n'))
SMAPE : 31.6521
cat(paste0('The second best performing model is ', best_model_for_print_2, '\n'))
The second best performing model is linear_reg

Get variable importance from the two best models, to make the result more reliable.

Combine the two models’ ranks of variable importance.

Check how the two models’ rankings of the variables differ, to get an idea of how reliable the result may be.

# Note: the glmnet model also reports coefficient signs

# Best workflow
best_workflow = best_model |>
    filter(.metric == metric_main) |>
    select(wflow_id) |>
    pull()

# The MARS workflow has no tune() placeholders, so it can be fit directly
df_vi = tuned_models |>
    extract_workflow(best_workflow) |>
    fit(df) |>
    extract_fit_parsnip() |>
    vip::vi(method = "model") |>
    mutate(rank = rank(-Importance, ties.method = "min"))

# 2nd best workflow
best_workflow_2 = best_model_2 |>
    filter(.metric == metric_main) |>
    select(wflow_id) |>
    pull()

df_vi_2 = tuned_models |>
    extract_workflow(best_workflow_2) |>
    # the glmnet spec still contains tune() placeholders, so finalize the
    # workflow with its best tuning parameters before fitting on the full data
    finalize_workflow(select_best(extract_workflow_set_result(tuned_models, best_workflow_2),
                                  metric = metric_main)) |>
    fit(df) |>
    extract_fit_parsnip() |>
    vip::vi(method = "model") |>
    mutate(rank = rank(-Importance, ties.method = "min"))

# Plot
library(ggplot2)

df_vi |>
    ggplot(aes(x = Importance, y = reorder(Variable, Importance))) +
    geom_col() +
    theme_classic() +
    ylab("Variable") +
    labs(title = "Best model variable importance")

df_vi_2 |>
    ggplot(aes(x = Importance, y = reorder(Variable, Importance))) +
    geom_col() +
    theme_classic() +
    ylab("Variable") +
    labs(title = "2nd best model variable importance")

df_vi |>
    inner_join(df_vi_2, by = "Variable") |>
    select(-starts_with("Importance")) |>
    rename(rank_best = rank.x, rank_best_2 = rank.y) |>
    mutate(rank_diff = rank_best - rank_best_2, rank_summed = rank_best +
        rank_best_2, rank_combine = rank(rank_summed, ties.method = "min")) |>
    select(Variable, rank_combine, rank_diff, starts_with("rank"), everything()) |>
    arrange(rank_combine)
# A tidytable: 18 × 7
   Variable rank_combine rank_diff rank_best rank_best_2 rank_summed Sign 
   <chr>           <int>     <int>     <int>       <int>       <int> <chr>
 1 rfn__               1         0         1           1           2 POS  
 2 icnst               2         0         4           4           8 POS  
 3 iproj               3        -6         2           8          10 POS  
 4 ibb__               4        10        12           2          14 POS  
 5 rproj               5         3         9           6          15 POS  
 6 iplac               6        -4         6          10          16 NEG  
 7 sttic               7        11        14           3          17 POS  
 8 idecl               8         9        14           5          19 NEG  
 9 rplac               8         1        10           9          19 POS  
10 rdecl               8        -5         7          12          19 NEG  
11 rssd_               8       -11         4          15          19 NEG  
12 est__              12       -16         2          18          20 POS  
13 rcnst              13         7        14           7          21 POS  
14 issd_              14         0        11          11          22 NEG  
15 rstmt              15       -10         7          17          24 POS  
16 rbb__              16         1        14          13          27 NEG  
17 istmt              16        -1        13          14          27 POS  
18 ifn__              18        -2        14          16          30 NEG  

Conclusion, first approach

It seems at least somewhat reliable (the rank_diff is low) that rfn__ is the most ‘important’ covariate (in the modelling sense, not everyday English) for every target, followed by icnst. The signs appear positive, i.e. an increase in rfn__ is associated with increases in the t_* timings. These sign relationships come from only one of the two models (glmnet), however, so they may be less reliable.

On the other hand, issd_, rbb__, istmt, ifn__ do not seem to be very ‘important’.
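
Since the sign information comes from only one of the two models, a quick model-free sanity check could be to look at raw correlations between rfn__ and the timing columns. A minimal sketch, assuming the combined df and the column names used in the recipes above:

# Rough directional check: correlation of rfn__ with each timing column
df |>
  summarise(across(c(t_all__, t_gen__, t_opt__, t_lto__),
                   ~ cor(rfn__, .x, use = 'complete.obs')))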

Second approach

# Focus on debug
df2 = df_debug1 |>
  rbind(df_debug2) |> 
  rbind(df_debug3)

# Recipe
recipe_model_all2 = df2 |>
  recipe(t_all__ ~ .) |>
  step_select(-t_gen__, -t_opt__, -t_lto__) |>
  update_role(cgu_name, new_role = 'id') |>
  update_role(source, new_role = 'strata') |>
  update_role(run, new_role = 'iteration')

# Cross validation
resample_cv2 = df2 |> 
  vfold_cv(v = 5,
           repeats = 2,
           strata = source)

wfs_meta_learner_debug = workflow_set(models = list(enet = spec_glmnet,
                                                    #spec_xgboost,
                                                    mars = spec_mars),
                                      preproc = list(all = recipe_model_all2),
                                      cross = TRUE)

# Tune
tuned_models2 = wfs_meta_learner_debug |> 
  workflow_map('tune_grid',
               grid = 10, # number of candidate parameter sets
               resamples = resample_cv2,
               metrics = metrics,
               control = control_grid(save_pred = TRUE))

# Best models
best_models2 = tuned_models2 |> 
  rank_results(rank_metric = 'huber_loss',
               select_best = TRUE)

cat('Model results overview: \n')
Model results overview: 
best_models2 |> 
  select(model, rank, .metric, mean, std_err) |> 
  print.data.frame()
       model rank    .metric      mean    std_err
1       mars    1 huber_loss  9.388043 0.07714824
2       mars    1        mae  9.871018 0.07679225
3       mars    1       rmse 16.646647 0.25992042
4       mars    1      smape 19.512840 0.08416463
5 linear_reg    2 huber_loss 10.755656 0.08593158
6 linear_reg    2        mae 11.245593 0.08586785
7 linear_reg    2       rmse 19.238248 0.33855282
8 linear_reg    2      smape 24.256541 0.15787557
best_fits2 = collect_predictions(tuned_models2)

# Get KPI and interpretable metric
best_model2 = best_models2 |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score) |> 
  filter(.metric == metric_main | .metric == 'smape')

metric_main_for_print = best_model2 |>
  filter(.metric == metric_main) |> 
  select(score) |>
  pull() |>
  round(4)

metric_smape_for_print = best_model2 |>
  filter(.metric == 'smape') |> 
  select(score) |>
  pull() |>
  round(4)

# Best model
best_model2 = best_models2 |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score) |> 
  filter(.metric == metric_main)

best_model_for_print = best_model2 |>
  filter(.metric == metric_main) |>
  select(model) |>
  pull()

# 2nd best model
best_model_2 = best_models2 |>
  mutate(score = mean + std_err) |> 
  slice_min(.by = .metric,
            order_by = score,
            n = 2) |> 
  filter(.metric == metric_main,
         rank == 2)

best_model_for_print_2 = best_model_2 |>
  filter(.metric == metric_main) |> 
  select(model) |>
  pull()

cat(paste0('The best performing model is ', best_model_for_print, '\n'))
The best performing model is mars
cat(paste0(metric_main, ' + standard error score: ', metric_main_for_print, '\n'))
huber_loss + standard error score: 9.4652
cat(paste0('SMAPE : ', metric_smape_for_print, '\n'))
SMAPE : 19.597
cat(paste0('The second best performing model is ', best_model_for_print_2, '\n'))
The second best performing model is linear_reg

Get variable importance from the two best models, to make the result more reliable.

Combine the two models’ ranks of variable importance.

Check how the two models’ rankings of the variables differ, to get an idea of how reliable the result may be.

# Note: the glmnet model also reports coefficient signs

# Best workflow
best_workflow2 = best_model2 |>
    filter(.metric == metric_main) |>
    select(wflow_id) |>
    pull()

# The MARS workflow has no tune() placeholders, so it can be fit directly
df_vi = tuned_models2 |>
    extract_workflow(best_workflow2) |>
    fit(df2) |>
    extract_fit_parsnip() |>
    vip::vi(method = "model") |>
    mutate(rank = rank(-Importance, ties.method = "min"))

# 2nd best workflow
best_workflow_2 = best_model_2 |>
    filter(.metric == metric_main) |>
    select(wflow_id) |>
    pull()

df_vi_2 = tuned_models2 |>
    extract_workflow(best_workflow_2) |>
    # finalize the glmnet workflow with its best tuning parameters before fitting
    finalize_workflow(select_best(extract_workflow_set_result(tuned_models2, best_workflow_2),
                                  metric = metric_main)) |>
    fit(df2) |>
    extract_fit_parsnip() |>
    vip::vi(method = "model") |>
    mutate(rank = rank(-Importance, ties.method = "min"))

# Plot
library(ggplot2)

df_vi |>
    ggplot(aes(x = Importance, y = reorder(Variable, Importance))) +
    geom_col() +
    theme_classic() +
    ylab("Variable") +
    labs(title = "Best model variable importance")

df_vi_2 |>
    ggplot(aes(x = Importance, y = reorder(Variable, Importance))) +
    geom_col() +
    theme_classic() +
    ylab("Variable") +
    labs(title = "2nd best model variable importance")

df_vi |>
    inner_join(df_vi_2, by = "Variable") |>
    select(-starts_with("Importance")) |>
    rename(rank_best = rank.x, rank_best_2 = rank.y) |>
    mutate(rank_diff = rank_best - rank_best_2, rank_summed = rank_best +
        rank_best_2, rank_combine = rank(rank_summed, ties.method = "min")) |>
    select(Variable, rank_combine, rank_diff, starts_with("rank"), everything()) |>
    arrange(rank_combine)
# A tidytable: 18 × 7
   Variable rank_combine rank_diff rank_best rank_best_2 rank_summed Sign 
   <chr>           <int>     <int>     <int>       <int>       <int> <chr>
 1 issd_               1         2         4           2           6 POS  
 2 rfn__               1        -2         2           4           6 POS  
 3 ifn__               3         9        10           1          11 NEG  
 4 rstmt               3        -7         2           9          11 POS  
 5 sttic               5         7        10           3          13 POS  
 6 est__               6       -12         1          13          14 POS  
 7 iproj               7         5        10           5          15 POS  
 8 icnst               8         4        10           6          16 POS  
 9 rbb__               9         3        10           7          17 POS  
10 rdecl               9        -5         6          11          17 POS  
11 ibb__              11         2        10           8          18 POS  
12 rssd_              12         0        10          10          20 NEG  
13 rcnst              12        -8         6          14          20 POS  
14 rproj              12       -10         5          15          20 NEG  
15 rplac              15        -2        10          12          22 POS  
16 istmt              16        -7         8          15          23 NEG  
17 idecl              17        -6         9          15          24 NEG  
18 iplac              18        -5        10          15          25 NEG  

Conclusion, second approach

Investigating only t_all__ for debug builds, the rankings seem less reliable (larger rank_diff values), but issd_ and rfn__ appear to be of highest ‘importance’.
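
For a side-by-side look at the two approaches, the combined ranking tables could be saved and joined. The names vi_ranks_all and vi_ranks_debug below are hypothetical; the pipelines above print the combined tables without assigning them:

# Hypothetical comparison of the two combined rankings
# vi_ranks_all   = first-approach table (all targets, debug + opt)
# vi_ranks_debug = second-approach table (t_all__, debug only)
vi_ranks_all |>
  select(Variable, rank_all = rank_combine) |>
  inner_join(vi_ranks_debug |> select(Variable, rank_debug = rank_combine),
             by = "Variable") |>
  arrange(rank_all + rank_debug)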