Welcome, friend :)

This is another tutorial about Spark using the sparklyr package. This time, I am going to show you how to tune your model parameters. It’s not that difficult, but there are some details that I have to tell you about. If you are not confident about training your models in Spark yet, check my previous post and come back here later :)

Let’s get into action…

Pipeline

First of all, you need to create a pipeline. A pipeline is just a sequence of steps that you are going to execute on your data. In the sparklyr context, you don’t point a pipeline directly at your data; you point it at the connection where the data lives. Let’s see.

library(sparklyr)

# connect to a local Spark instance
sc <- spark_connect(master = "local")

# create an empty pipeline bound to that connection
pipeline <- ml_pipeline(sc)

Now that you have created a pipeline object, you can add the steps that you’d like to use. To add these steps, you can use the %>% operator. I’ll walk through an example using a random forest on our well-known iris dataset.

# copy iris into Spark (dots in column names become underscores, e.g. Sepal_Length)
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# formula step + random forest step
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_random_forest_classifier()

The ft_r_formula() function creates two columns named features and label. You don’t need to worry, it’s just a way to “tell” Spark which columns are the predictors (features) and which is the response (label), using R formula syntax (Species ~ .). If you want to check, run ft_r_formula(iris_tbl, Species ~ .) in your console.

The ml_random_forest() function, used in the previous post, is a wrapper around two other functions: ml_random_forest_classifier() (for classification) and ml_random_forest_regressor() (for regression). I’d rather be very explicit about the steps inside the pipeline; it’s good practice and avoids a headache when you have to read a very long pipeline with a lot of steps :(
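
Just to illustrate the difference, here is a minimal sketch of the regression counterpart. I’m assuming you want to predict Sepal_Length (remember, sdf_copy_to() turns the dots in the iris column names into underscores); we won’t use this pipeline in the rest of the tutorial:

# hypothetical regression pipeline: numeric response, explicit regressor
pipeline_reg <- ml_pipeline(sc) %>%
  ft_r_formula(Sepal_Length ~ .) %>%
  ml_random_forest_regressor()
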
Let’s see what the pipeline object looks like!

pipeline
## Pipeline (Estimator) with 2 stages
## <pipeline_192c555d699a> 
##   Stages 
##   |--1 RFormula (Estimator)
##   |    <r_formula_192c5cf9e6a> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |     (Parameters)
##   |      force_index_label: FALSE
##   |      formula: Species ~ .
##   |      handle_invalid: error
##   |      stringIndexerOrderType: frequencyDesc
##   |--2 RandomForestClassifier (Estimator)
##   |    <random_forest_classifier_192c5efe102c> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |      prediction_col: prediction
##   |      probability_col: probability
##   |      raw_prediction_col: rawPrediction
##   |     (Parameters)
##   |      cache_node_ids: FALSE
##   |      checkpoint_interval: 10
##   |      feature_subset_strategy: auto
##   |      impurity: gini
##   |      max_bins: 32
##   |      max_depth: 5
##   |      max_memory_in_mb: 256
##   |      min_info_gain: 0
##   |      min_instances_per_node: 1
##   |      num_trees: 20
##   |      seed: 207336481
##   |      subsampling_rate: 1

As you can see, our pipeline has two stages: RFormula and RandomForestClassifier. For each one, the parameters you can set are listed.
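
If you want to inspect a single stage instead of printing the whole pipeline, you can pull it out with ml_stage(). A quick sketch (I’m assuming the stage can be referenced by its 1-based position here):

# extract the second stage (the random forest classifier) and print it
ml_stage(pipeline, 2)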

The parameter grid

To specify the parameter values that you want to try, you need to create a grid object. An example is shown below.

grid <- list(
  random_forest = list(
    num_trees = c(5, 10),
    impurity = c("entropy", "gini")
  )
)

With this grid, I will test which combination of parameters is the best, varying the number of trees and the impurity measure. You can use any combination that you want to check. There is one thing that you have to keep in mind: the name of the list inside the grid has to match the name of the classifier stage. Let’s look at our pipeline again!

pipeline
## Pipeline (Estimator) with 2 stages
## <pipeline_192c555d699a> 
##   Stages 
##   |--1 RFormula (Estimator)
##   |    <r_formula_192c5cf9e6a> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |     (Parameters)
##   |      force_index_label: FALSE
##   |      formula: Species ~ .
##   |      handle_invalid: error
##   |      stringIndexerOrderType: frequencyDesc
##   |--2 RandomForestClassifier (Estimator)
##   |    <random_forest_classifier_192c5efe102c> 
##   |     (Parameters -- Column Names)
##   |      features_col: features
##   |      label_col: label
##   |      prediction_col: prediction
##   |      probability_col: probability
##   |      raw_prediction_col: rawPrediction
##   |     (Parameters)
##   |      cache_node_ids: FALSE
##   |      checkpoint_interval: 10
##   |      feature_subset_strategy: auto
##   |      impurity: gini
##   |      max_bins: 32
##   |      max_depth: 5
##   |      max_memory_in_mb: 256
##   |      min_info_gain: 0
##   |      min_instances_per_node: 1
##   |      num_trees: 20
##   |      seed: 207336481
##   |      subsampling_rate: 1

In our example, the random forest classifier is the second stage. Take a look at the line just below it: did you see <random_forest_classifier_###########>? That is the name you have to use, ok? You don’t need to use the entire name; note that I just used the beginning (random_forest), for short.
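
Just to make the matching concrete, here is a hedged sketch of a bigger grid, still keyed by that shortened stage name, that also varies max_depth (any other parameter printed in the stage would work the same way); the rest of the tutorial keeps using the smaller grid from before:

# hypothetical bigger grid: 2 x 2 x 2 = 8 parameter combinations
grid_bigger <- list(
  random_forest = list(
    num_trees = c(5, 10),
    impurity = c("entropy", "gini"),
    max_depth = c(3, 5)
  )
)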

Cross validator object

Our final step is to create a cross validator object. Let’s check it out.

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 5,
  parallelism = 4
)

As you can see, you have to pass your connection, your pipeline, and the grid of parameters that you want to test. Make sure you are using the correct evaluator; in our case, it is the multiclass one. num_folds is how many folds to use in the cross-validation split, and parallelism is the number of threads to use when running parallel algorithms.
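
By default, the multiclass evaluator scores the models with the f1 metric, which is what you will see below. If you’d rather rank the combinations by another metric, you can set it when building the evaluator; a small sketch (metric_name follows Spark’s MulticlassClassificationEvaluator, which also accepts "accuracy", "weightedPrecision" and "weightedRecall"):

# hypothetical variant: rank the parameter combinations by accuracy instead
evaluator_acc <- ml_multiclass_classification_evaluator(
  sc, metric_name = "accuracy"
)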

Train and check the metrics

Now we have arrived at the fun part: let’s train the models and check out the metrics!

# Train the models
cv_model <- ml_fit(cv, iris_tbl)

# Print the metrics
ml_validation_metrics(cv_model)
##          f1 impurity_1 num_trees_1
## 1 0.9563436    entropy           5
## 2 0.9622804    entropy          10
## 3 0.9630614       gini           5
## 4 0.9461037       gini          10

In our toy example, the best parameters were gini impurity with 5 trees, because its f1-score was the highest (0.963). Now it is your turn: try different parameters and have some fun :)
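
One last tip: Spark refits the winning combination on the whole dataset, and the fitted cross-validator keeps that model around, so you don’t have to retrain it yourself. A minimal sketch (I’m assuming the best_model element and ml_predict() from sparklyr here):

# grab the best pipeline model and use it to score data
best_model <- cv_model$best_model
predictions <- ml_predict(best_model, iris_tbl)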

That’s all folks

Liked it? You can share this tutorial using the buttons below. If you want to contribute to my website, you can fork me on GitHub. If you still have any doubts, feel free to contact me at samuelmacedo@recife.ifpe.edu.br.

See ya!