---
title: "Exploring Random Forests with ggRandomForests"
author: "John Ehrlinger"
date: today
format: 
  html:
    toc: true
    html-math-method: mathjax
editor: 
  markdown: 
    wrap: 80
vignette: >
  %\VignetteIndexEntry{Exploring Random Forests with ggRandomForests}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

A fitted random forest carries a lot of information, but getting at it
usually means digging through list structures that were never meant to be
plotted directly. **ggRandomForests** does that digging for you: it pulls
tidy data objects out of a `randomForestSRC` or `randomForest` fit, and
those objects drop straight into the `ggplot2` workflows you already know.
This vignette walks through the three objects you will reach for most often
(`gg_error`, `gg_variable`, and `gg_vimp`), plus a small helper for
cutting a predictor into evenly populated groups.

```{r pkg-setup, include=FALSE}
if (requireNamespace("ggRandomForests", quietly = TRUE)) {
  library(ggRandomForests)
} else if (requireNamespace("pkgload", quietly = TRUE)) {
  pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE)
} else {
  stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")
}
```

## Error trajectories with `gg_error()`

```{r error-demo}
library(randomForest)
set.seed(42)
rf_iris <- randomForest(Species ~ ., data = iris, ntree = 200, keep.forest = TRUE)
err_df <- ggRandomForests::gg_error(rf_iris, training = TRUE)
head(err_df)
```

A forest's error rate settles down as trees are added, and the `gg_error()`
object lets you watch that happen. It holds the cumulative out-of-bag (OOB)
error rate for each outcome column, indexed by the `ntree` counter. Ask for
`training = TRUE` and the function reconstructs the original model frame and
adds the in-bag error trajectory (`train`) as well, so you can see both
curves at once:

```{r error-plot, fig.height=4}
plot(err_df)
```
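
To make the "cumulative error by tree count" idea concrete, here is a minimal
sketch that builds a similar long-format table by hand from the `err.rate`
matrix that `randomForest` stores on a classification fit. It only approximates
the kind of data `gg_error()` tidies, not the package's implementation, and it
leaves out the in-bag `train` curve. The names `rf_sketch`, `err_wide`, and
`err_long` are made up for this illustration.

```r
library(randomForest)
library(ggplot2)

set.seed(42)
rf_sketch <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate is a matrix with one row per tree count and one column for the
# overall OOB rate plus one column per class.
err_wide <- as.data.frame(rf_sketch$err.rate)
err_wide$ntree <- seq_len(nrow(err_wide))

# Stack the error columns into long form (one row per ntree/outcome pair),
# the shape ggplot2 expects for a grouped line plot.
outcome_cols <- setdiff(names(err_wide), "ntree")
err_long <- data.frame(
  ntree   = rep(err_wide$ntree, times = length(outcome_cols)),
  outcome = rep(outcome_cols, each = nrow(err_wide)),
  error   = unlist(err_wide[outcome_cols], use.names = FALSE)
)

ggplot(err_long, aes(x = ntree, y = error, colour = outcome)) +
  geom_line() +
  labs(x = "Number of trees", y = "Cumulative OOB error rate")
```

The curves should flatten as `ntree` grows; if they are still drifting at the
right edge, the forest probably needs more trees.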

## Marginal dependence via `gg_variable()`

```{r variable-demo}
set.seed(99)
boston <- MASS::Boston
rf_boston <- randomForest(medv ~ ., data = boston, ntree = 150)
var_df <- ggRandomForests::gg_variable(rf_boston)
str(var_df[, c("lstat", "yhat")])
```

`gg_variable()` recovers the training data straight from the model call,
so it still works when the forest was fit inside a helper function or
against a `subset()` expression, cases where the data is not sitting in
the global environment. The object you get back keeps the raw predictors
alongside the prediction: a single `yhat` column for regression, or one
`yhat.<class>` column per class for classification. To plot one predictor,
name it with `xvar`:

```{r variable-plot, fig.height=4}
plot(var_df, xvar = "lstat")
```

For survival forests, pass one or more time points through the `time`
argument to get predictions at each horizon. Predictions are OOB by default;
set `oob = FALSE` to get the non-OOB predictions instead.
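
For the regression case, a similar predictor-versus-prediction picture can be
assembled by hand. The sketch below assumes a `randomForest` regression fit,
where the `predicted` component holds the OOB prediction for each training
row. It illustrates the idea rather than reproducing what `gg_variable()` does
internally, and `rf_sketch` and `dep_df` are names invented for the example.

```r
library(randomForest)
library(ggplot2)

set.seed(99)
boston <- MASS::Boston
rf_sketch <- randomForest(medv ~ ., data = boston, ntree = 150)

# For a randomForest regression fit, `predicted` holds the OOB prediction
# for each training row, in the same order as the training data.
dep_df <- data.frame(lstat = boston$lstat, yhat = rf_sketch$predicted)

# Scatter the predictions against one predictor and overlay a loess smooth
# to show the marginal trend.
ggplot(dep_df, aes(x = lstat, y = yhat)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "lstat", y = "OOB predicted medv")
```

This is marginal dependence: every point reflects all of that row's
predictors, not `lstat` alone, so the scatter around the smooth is expected.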

## Variable importance with `gg_vimp()`

```{r vimp-demo}
vimp_df <- ggRandomForests::gg_vimp(rf_boston)
head(vimp_df)
plot(vimp_df)
```

Variable importance is not always stored on the fitted object. If a
`randomForest` fit is missing its importance scores, `gg_vimp()` will try to
compute them for you. When even that is not possible (the forest was grown
with `importance = FALSE` and the predictors are no longer reachable),
the function warns and returns `NA` in place of the scores, so a plot still
draws rather than failing outright.
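
One way to avoid the fallback is to store the importance scores when the
forest is grown. The sketch below refits the Boston forest with
`importance = TRUE` and extracts the permutation measure with
`randomForest::importance()`. It shows where the scores live on a
`randomForest` fit, not how `gg_vimp()` computes them; `rf_imp`, `imp_mat`,
and `imp_df` are illustrative names.

```r
library(randomForest)
library(ggplot2)

set.seed(99)
boston <- MASS::Boston

# importance = TRUE asks randomForest to compute and store permutation
# importance while the forest is grown.
rf_imp <- randomForest(medv ~ ., data = boston, ntree = 150, importance = TRUE)

# type = 1 selects the permutation measure (%IncMSE for regression).
imp_mat <- importance(rf_imp, type = 1)
imp_df <- data.frame(
  variable = rownames(imp_mat),
  vimp     = unname(imp_mat[, 1])
)
imp_df <- imp_df[order(imp_df$vimp, decreasing = TRUE), ]

# Horizontal bars, most important variable at the top.
ggplot(imp_df, aes(x = reorder(variable, vimp), y = vimp)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Permutation importance (%IncMSE)")
```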

## Balanced conditioning cuts with `quantile_pts()`

```{r quantile-demo}
rm_breaks <- ggRandomForests::quantile_pts(boston$rm, groups = 6, intervals = TRUE)
rm_groups <- cut(boston$rm, breaks = rm_breaks)
table(rm_groups)
```

When you build a coplot, you want each conditioning group to hold a roughly
equal share of the data, because equal-width bins leave the sparse tails
nearly empty. `quantile_pts()` wraps `stats::quantile()` to give you break
points that do exactly that, and they pass straight to `cut()` for the
grouping or facet labels.
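
The difference between quantile breaks and equal-width breaks is easy to see
with base R alone. This sketch calls `stats::quantile()` directly at evenly
spaced probabilities; it is an approximation of the idea behind
`quantile_pts()`, which may handle ties and interval edges differently. The
names `n_groups`, `breaks_by_hand`, and `groups_by_hand` are invented for the
example.

```r
boston <- MASS::Boston
n_groups <- 6

# Quantiles at evenly spaced probabilities give n_groups + 1 break points.
breaks_by_hand <- stats::quantile(
  boston$rm,
  probs = seq(0, 1, length.out = n_groups + 1),
  names = FALSE
)

# include.lowest = TRUE keeps the minimum value in the first interval.
groups_by_hand <- cut(boston$rm, breaks = breaks_by_hand, include.lowest = TRUE)
table(groups_by_hand)

# Equal-width bins for comparison: the tail groups end up sparse.
table(cut(boston$rm, breaks = n_groups))
```

The first table should show six groups of similar size, while the second
concentrates most rows in the middle bins.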

## Next steps

* The full API reference lives at <https://ehrlinger.github.io/ggRandomForests/>.
* `?gg_error`, `?gg_variable`, `?gg_vimp`, and `?quantile_pts` cover the
  remaining arguments and have their own examples.
* The `gg_error`, `gg_variable`, and `gg_vimp` objects shown here are tidy
  data frames underneath, so you can skip the `plot()` methods entirely and
  build the figure yourself with `ggplot2`.
* For the full varPro toolkit (release-rule importance, lasso-refined
  importance, per-observation local importance, anomaly scores, and the
  dependency graph) walked across regression, classification, and
  survival examples, see `vignette("varpro", package = "ggRandomForests")`.
