RandomForest

Overview

The RandomForest class provides an implementation of the Random Forest algorithm integrated with the DataFrame library. Random Forest is an ensemble learning method for classification and regression: it builds multiple decision trees during training and outputs either the class chosen by a majority of the trees (classification) or the mean of the individual trees' predictions (regression).
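The aggregation step described above can be sketched in a few lines. This is an illustrative, standalone example and not the library's internal code; the function names `aggregate_mean` and `aggregate_majority` are hypothetical, and breaking ties toward the smallest label is an arbitrary choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Regression: average the per-tree predictions for one sample.
double aggregate_mean(const std::vector<double> &tree_outputs) {
    assert(!tree_outputs.empty());
    const double sum =
        std::accumulate(tree_outputs.begin(), tree_outputs.end(), 0.0);
    return sum / static_cast<double>(tree_outputs.size());
}

// Classification: return the most frequent class label among the trees.
// Ties go to the smallest label (an arbitrary choice for this sketch).
int aggregate_majority(const std::vector<int> &tree_votes) {
    assert(!tree_votes.empty());
    std::map<int, std::size_t> counts;
    for (int label : tree_votes) {
        ++counts[label];
    }
    int best_label = counts.begin()->first;
    std::size_t best_count = 0;
    // std::map iterates in ascending key order, so a strict '>' keeps
    // the smallest label when two labels have the same count.
    for (const auto &entry : counts) {
        if (entry.second > best_count) {
            best_label = entry.first;
            best_count = entry.second;
        }
    }
    return best_label;
}
```

For example, per-tree votes of {0, 2, 2, 1} yield class 2, and per-tree regression outputs of {1.0, 2.0, 3.0} yield 2.0.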

This implementation supports both regression and classification tasks, allowing customization of various hyperparameters such as the number of trees, maximum depth, and minimum samples required for splitting.

Applications in geosciences

The random forest algorithm is well suited to geological and geophysical applications that require modeling complex relationships among many variables. Some examples from these domains:

Mineral Exploration and Resource Estimation:
- Predicting mineral deposit locations using geochemical, geophysical, and geological features
- Estimating ore grade or resource quantity based on drilling samples and geophysical measurements
- Classifying rock types from geophysical logs and core sample data

Seismic Event Classification:
- Distinguishing between different types of seismic events (earthquakes, explosions, mining blasts)
- Predicting earthquake magnitudes from early-wave characteristics
- Identifying volcanic eruption precursors from seismic signal patterns

Reservoir Characterization:
- Predicting porosity, permeability, and fluid content in oil and gas reservoirs
- Classifying lithology from well log data
- Estimating reservoir connectivity based on pressure and production data

Landslide Susceptibility Mapping:
- Predicting landslide-prone areas using topography, geology, precipitation, and land cover data
- Estimating landslide volume or runout distance based on terrain characteristics

Groundwater Modeling:
- Predicting groundwater quality parameters based on hydrogeological conditions
- Estimating aquifer transmissivity and storage coefficients
- Identifying zones of potential groundwater contamination

Geohazard Assessment:
- Assessing risk levels for various geohazards (sinkholes, soil liquefaction, etc.)
- Predicting coastal erosion rates based on geological and oceanographic factors
- Mapping subsidence risk from mining or fluid extraction

Weather and Climate Applications in Geoscience:
- Downscaling climate model outputs for local geological applications
- Predicting extreme precipitation events and their geomorphic impacts
- Assessing climate change impacts on geological processes

Remote Sensing and Geological Mapping:
- Automated geological feature extraction from satellite or drone imagery
- Mapping hydrothermal alteration zones from multispectral satellite data
- Classifying land cover/land use with geological significance

These applications typically involve multivariate datasets with complex, non-linear relationships. Random forests suit such problems because they handle high-dimensional data, capture non-linear relationships, and provide feature importance rankings.

Class Definition

RandomForest Class Definition


namespace ml {

enum class TaskType { REGRESSION, CLASSIFICATION };

class RandomForest {
  public:
    // Constructor
    RandomForest(size_t num_trees = 100,
                 TaskType task_type = TaskType::REGRESSION,
                 size_t max_features = 0,
                 size_t max_depth = std::numeric_limits<size_t>::max(),
                 size_t min_samples_split = 2, size_t n_classes = 0);

    void fit(const df::Dataframe &data, const std::string &target_column);

    df::Serie<double> predict(const df::Dataframe &data);

    df::Serie<double> feature_importance(const df::Dataframe &data,
                                         const std::string &target_column);

    double oob_error(const df::Dataframe &data,
                     const std::string &target_column);

    df::Serie<double> permutation_importance(const df::Dataframe &data,
                                             const std::string &target_column,
                                             size_t n_repeats = 10);

    std::vector<std::string>
    get_feature_names(const df::Dataframe &data,
                      const std::string &target_column) const;

    df::Dataframe feature_importance_df(const df::Dataframe &data,
                                        const std::string &target_column);

    size_t get_num_trees() const { return num_trees; }

    TaskType get_task_type() const { return task_type; }

    std::map<std::string, double> evaluate(const df::Dataframe &data,
                                           const std::string &target_column);
};

// Utility function to create a Random Forest Regressor
RandomForest create_random_forest_regressor(
    size_t num_trees = 100, size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);

// Utility function to create a Random Forest Classifier
RandomForest create_random_forest_classifier(
    size_t num_trees = 100, size_t n_classes = 0, size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);

} // namespace ml

Constructor

RandomForest Constructor


RandomForest(size_t num_trees = 100,
             TaskType task_type = TaskType::REGRESSION,
             size_t max_features = 0,
             size_t max_depth = std::numeric_limits<size_t>::max(),
             size_t min_samples_split = 2, 
             size_t n_classes = 0);

Parameters

Parameter	Type	Description
num_trees	size_t	The number of trees in the forest. Increasing this generally improves accuracy, with diminishing returns, at the cost of longer training and prediction time. Default is 100.
task_type	TaskType	Type of task to perform, either `REGRESSION` or `CLASSIFICATION`. Default is `REGRESSION`.
max_features	size_t	The number of features to consider when looking for the best split. If 0, all features are considered. Default is 0.
max_depth	size_t	The maximum depth of each tree. Default is unlimited (std::numeric_limits<size_t>::max()).
min_samples_split	size_t	The minimum number of samples required to split an internal node. Default is 2.
n_classes	size_t	Number of classes for classification tasks. Only used when task_type is CLASSIFICATION. Default is 0 (auto-detect).
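To make the meaning of max_features concrete, here is a minimal sketch of how a random subset of feature indices could be drawn for one split. It is a standalone illustration under stated assumptions, not the library's implementation: the function name `sample_features` is hypothetical, and treating a value at or above the feature count as "all features" is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Pick the feature indices to consider at one split.
// max_features == 0 means "use all features", mirroring the documented default.
std::vector<std::size_t> sample_features(std::size_t n_features,
                                         std::size_t max_features,
                                         std::mt19937 &rng) {
    assert(n_features > 0);
    std::vector<std::size_t> indices(n_features);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    if (max_features == 0 || max_features >= n_features) {
        return indices; // consider every feature
    }
    // Shuffle, then keep the first max_features entries:
    // a random subset drawn without repeats.
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(max_features);
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

Restricting each split to a random subset decorrelates the trees and reduces the work per split, which is why a non-zero max_features is often chosen for wide datasets.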

Utility Factory Functions

The library provides two convenience functions for creating RandomForest instances configured for either regression or classification:

RandomForest Factory Functions


// Create a Random Forest Regressor
RandomForest create_random_forest_regressor(
    size_t num_trees = 100, 
    size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);

// Create a Random Forest Classifier
RandomForest create_random_forest_classifier(
    size_t num_trees = 100, 
    size_t n_classes = 0, 
    size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);

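Each function fixes the task type, so only the remaining hyperparameters need to be passed. Note the argument order: create_random_forest_classifier takes n_classes as its second parameter, before max_features, whereas the constructor takes n_classes last.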

Core Methods

fit

fit Method

void fit(const df::Dataframe &data, const std::string &target_column);

Trains the random forest model using the provided data. The target_column specifies which column in the DataFrame contains the target values to predict.

predict

predict Method

df::Serie<double> predict(const df::Dataframe &data);

Makes predictions on the provided data using the trained random forest model. Returns a Serie of predicted values.

evaluate

evaluate Method

std::map<std::string, double> evaluate(const df::Dataframe &data, const std::string &target_column);

Evaluates the model's performance on the provided data. Returns a map of evaluation metrics. For regression tasks, this typically includes metrics like MSE, RMSE, and R². For classification tasks, it includes metrics like accuracy, precision, recall, and F1-score.
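For reference, the regression metrics named above can be computed as in the following standalone sketch. The function name `regression_metrics` and the map keys "mse", "rmse", and "r2" are hypothetical; the keys actually returned by evaluate are not specified here and may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Compute MSE, RMSE, and R^2 from true and predicted values.
std::map<std::string, double>
regression_metrics(const std::vector<double> &y_true,
                   const std::vector<double> &y_pred) {
    assert(!y_true.empty() && y_true.size() == y_pred.size());
    const double n = static_cast<double>(y_true.size());
    const double mean =
        std::accumulate(y_true.begin(), y_true.end(), 0.0) / n;

    double ss_res = 0.0; // sum of squared residuals
    double ss_tot = 0.0; // total sum of squares around the mean
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        const double residual = y_true[i] - y_pred[i];
        const double deviation = y_true[i] - mean;
        ss_res += residual * residual;
        ss_tot += deviation * deviation;
    }

    std::map<std::string, double> metrics;
    metrics["mse"] = ss_res / n;
    metrics["rmse"] = std::sqrt(ss_res / n);
    // R^2 is undefined for a constant target; this sketch reports 0 then.
    metrics["r2"] = ss_tot > 0.0 ? 1.0 - ss_res / ss_tot : 0.0;
    return metrics;
}
```

A model that always predicts the mean of the targets gets R² = 0, and a perfect model gets R² = 1 with MSE = 0.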

Feature Importance Methods

feature_importance

feature_importance Method

df::Serie<double> feature_importance(const df::Dataframe &data, const std::string &target_column);

Calculates and returns the importance of each feature in the model. This is based on how much each feature contributes to the decrease in impurity or error across all trees in the forest.
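The idea of "decrease in impurity" can be illustrated with Gini impurity, a common choice for classification trees. Whether this library uses Gini, entropy, or another criterion is not stated here, so the sketch below is an assumption-laden illustration with hypothetical function names (`gini`, `impurity_decrease`).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Gini impurity of a set of class labels: 1 - sum_k p_k^2.
double gini(const std::vector<int> &labels) {
    if (labels.empty()) {
        return 0.0;
    }
    std::map<int, std::size_t> counts;
    for (int label : labels) {
        ++counts[label];
    }
    const double n = static_cast<double>(labels.size());
    double sum_sq = 0.0;
    for (const auto &entry : counts) {
        const double p = static_cast<double>(entry.second) / n;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}

// Impurity decrease of a split: parent impurity minus the
// size-weighted impurity of the two children.
double impurity_decrease(const std::vector<int> &left,
                         const std::vector<int> &right) {
    assert(!left.empty() || !right.empty());
    std::vector<int> parent(left);
    parent.insert(parent.end(), right.begin(), right.end());
    const double n = static_cast<double>(parent.size());
    const double w_left = static_cast<double>(left.size()) / n;
    const double w_right = static_cast<double>(right.size()) / n;
    return gini(parent) - w_left * gini(left) - w_right * gini(right);
}
```

An impurity-based importance score then sums these decreases over every split that uses a given feature, typically weighted by the number of samples reaching the node and averaged over the trees.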

permutation_importance

permutation_importance Method

df::Serie<double> permutation_importance(
    const df::Dataframe &data,
    const std::string &target_column,
    size_t n_repeats = 10);

Calculates feature importance using the permutation method: a single feature's values are randomly shuffled and the resulting drop in model performance is measured, which indicates how much the model depends on that feature. The n_repeats parameter sets how many times each feature is shuffled.
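The procedure can be sketched independently of the library as follows. This is a minimal illustration, assuming a generic model passed as a callable and mean squared error as the performance measure; the type aliases `Row` and `Model`, the helper `mse`, and this free-function signature are all hypothetical and unrelated to the method's actual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Row = std::vector<double>;
using Model = std::function<double(const Row &)>;

// Mean squared error of `model` on rows X against targets y.
double mse(const Model &model, const std::vector<Row> &X,
           const std::vector<double> &y) {
    assert(!X.empty() && X.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double err = model(X[i]) - y[i];
        sum += err * err;
    }
    return sum / static_cast<double>(X.size());
}

// For each feature: shuffle its column, measure how much the error grows
// relative to the unshuffled baseline, and average over n_repeats shuffles.
std::vector<double> permutation_importance(const Model &model,
                                           const std::vector<Row> &X,
                                           const std::vector<double> &y,
                                           std::size_t n_repeats,
                                           std::mt19937 &rng) {
    assert(n_repeats > 0);
    const double baseline = mse(model, X, y);
    const std::size_t n_features = X.front().size();
    std::vector<double> importance(n_features, 0.0);
    for (std::size_t f = 0; f < n_features; ++f) {
        std::vector<double> column;
        for (const Row &row : X) {
            column.push_back(row[f]);
        }
        for (std::size_t r = 0; r < n_repeats; ++r) {
            std::shuffle(column.begin(), column.end(), rng);
            std::vector<Row> shuffled = X; // copy, then overwrite column f
            for (std::size_t i = 0; i < shuffled.size(); ++i) {
                shuffled[i][f] = column[i];
            }
            importance[f] += mse(model, shuffled, y) - baseline;
        }
        importance[f] /= static_cast<double>(n_repeats);
    }
    return importance;
}
```

A feature the model ignores gets an importance of zero, because shuffling it cannot change any prediction. Averaging over several repeats reduces the noise introduced by any single random shuffle.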

feature_importance_df

feature_importance_df Method

df::Dataframe feature_importance_df(const df::Dataframe &data, const std::string &target_column);

Returns a DataFrame containing feature names and their importance scores, providing a more structured output for analyzing feature importance.

get_feature_names

get_feature_names Method

std::vector<std::string> get_feature_names(
    const df::Dataframe &data,
    const std::string &target_column) const;

Returns a vector of feature names from the DataFrame, excluding the target column. This is useful for mapping feature indices to their original names in the dataset.

Model Assessment Methods

oob_error

oob_error Method

double oob_error(const df::Dataframe &data, const std::string &target_column);

Calculates the out-of-bag (OOB) error estimate for the random forest model. In an OOB estimate, each sample is scored only by the trees whose bootstrap sample did not include it, so the result estimates the generalization error without requiring a separate validation set.
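The split between in-bag and out-of-bag samples for a single tree can be sketched as below. This is a standalone illustration, not the library's code; the `BootstrapSample` struct and `draw_bootstrap` function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct BootstrapSample {
    std::vector<std::size_t> in_bag;     // n indices drawn with replacement
    std::vector<std::size_t> out_of_bag; // indices that were never drawn
};

// Draw one bootstrap sample of size n_samples and record which
// original indices were left out.
BootstrapSample draw_bootstrap(std::size_t n_samples, std::mt19937 &rng) {
    assert(n_samples > 0);
    std::uniform_int_distribution<std::size_t> pick(0, n_samples - 1);
    std::vector<bool> drawn(n_samples, false);

    BootstrapSample sample;
    sample.in_bag.reserve(n_samples);
    for (std::size_t i = 0; i < n_samples; ++i) {
        const std::size_t idx = pick(rng);
        sample.in_bag.push_back(idx);
        drawn[idx] = true;
    }
    for (std::size_t i = 0; i < n_samples; ++i) {
        if (!drawn[i]) {
            sample.out_of_bag.push_back(i);
        }
    }
    return sample;
}
```

For large n, the probability that a given sample is never drawn is (1 - 1/n)^n, which approaches 1/e, so roughly 37% of the samples are out-of-bag for each tree. Those samples act as a per-tree held-out set.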

get_num_trees

get_num_trees Method

size_t get_num_trees() const;

Returns the number of trees in the random forest.

get_task_type

get_task_type Method

TaskType get_task_type() const;

Returns the task type (REGRESSION or CLASSIFICATION) that the random forest is configured for.

Usage Example

Complete Usage Example


#include <dataframe/Serie.h>
#include <dataframe/Dataframe.h>
#include <dataframe/io/csv.h>
#include <dataframe/core/split.h>
#include <dataframe/ml/random_forest.h>
#include <iostream>

int main() {
    // Load data from CSV
    df::Dataframe data = df::io::read_csv("iris.csv");
    
    // Print dataset info
    std::cout << "Dataset columns: ";
    for (const auto& name : data.names()) {
        std::cout << name << " ";
    }
    std::cout << "\nTotal samples: " << data.get<double>("sepal_length").size() << std::endl;
    
    // Split data into training and testing sets (80/20)
    auto splits = df::split(5, data);
    
    // Combine the first 4 parts for training (80%).
    // NOTE: this loop is a placeholder. train_data stays empty until
    // splits[0..3] are merged using the library's merge function.
    df::Dataframe train_data;
    for (size_t i = 0; i < 4; ++i) {
        // Merge splits[i] into train_data here.
    }
    
    // Use the 5th part for testing (20%)
    df::Dataframe test_data = splits[4];
    
    // Create and train a random forest classifier
    ml::RandomForest rf = ml::create_random_forest_classifier(
        100,    // num_trees
        3,      // n_classes
        0,      // max_features (0 = consider all features)
        10,     // max_depth
        2       // min_samples_split
    );
    
    // Train the model
    rf.fit(train_data, "species");
    std::cout << "Model trained successfully with " << rf.get_num_trees() << " trees." << std::endl;
    
    // Make predictions
    df::Serie<double> predictions = rf.predict(test_data);
    
    // Calculate feature importance
    df::Dataframe importance = rf.feature_importance_df(train_data, "species");
    
    // Print feature importance
    std::cout << "\nFeature Importance:" << std::endl;
    for (const auto& name : importance.names()) {
        std::cout << name << ": " << importance.get<double>(name)[0] << std::endl;
    }
    
    // Evaluate the model
    auto metrics = rf.evaluate(test_data, "species");
    std::cout << "\nModel Evaluation:" << std::endl;
    for (const auto& [metric, value] : metrics) {
        std::cout << metric << ": " << value << std::endl;
    }
    
    // Calculate OOB error
    double oob = rf.oob_error(train_data, "species");
    std::cout << "Out-of-bag error: " << oob << std::endl;
    
    return 0;
}

Implementation Notes

- The implementation uses bootstrap sampling to create a different training set for each decision tree in the forest.
- Out-of-bag (OOB) samples are used to estimate the generalization error without a separate validation set.
- For classification tasks, the model outputs class labels based on majority voting among the trees.
- For regression tasks, the model outputs the average prediction of all trees.
- The implementation supports feature importance calculation, which helps identify the most influential features in the model.
- Permutation importance provides an alternative way to assess feature importance by measuring the impact of shuffling each feature's values.
- When max_features is set to 0, all features are considered for each split, which can be computationally intensive for datasets with many features.