RandomForest
Overview
The RandomForest class provides an implementation of the Random Forest algorithm
integrated with the DataFrame library. Random Forest is an ensemble learning method for
classification and regression tasks that constructs multiple decision trees during
training and outputs the majority-vote class (classification) or the mean prediction (regression)
of the individual trees.
This implementation supports both regression and classification tasks, allowing customization of various hyperparameters such as the number of trees, maximum depth, and minimum samples required for splitting.
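The aggregation step described above can be illustrated with a minimal, standalone sketch. The helper names `aggregate_mean` and `aggregate_vote` are hypothetical and not part of the library; the tie-breaking rule is an arbitrary choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Hypothetical helper: regression output for one sample is the mean of the
// per-tree predictions.
double aggregate_mean(const std::vector<double>& tree_outputs) {
    assert(!tree_outputs.empty());
    const double sum = std::accumulate(tree_outputs.begin(), tree_outputs.end(), 0.0);
    return sum / static_cast<double>(tree_outputs.size());
}

// Hypothetical helper: classification output for one sample is the label
// predicted by the most trees. Ties go to the smallest label (an assumption
// of this sketch, not a documented behavior of the library).
int aggregate_vote(const std::vector<int>& tree_labels) {
    assert(!tree_labels.empty());
    std::map<int, std::size_t> counts;
    for (int label : tree_labels) ++counts[label];
    int best_label = counts.begin()->first;
    std::size_t best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best_count = entry.second;
            best_label = entry.first;
        }
    }
    return best_label;
}
```

For example, three trees predicting 1.0, 2.0 and 6.0 average to 3.0, and four trees voting 2, 0, 2, 1 yield class 2.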
Applications in geosciences
The random forest algorithm is well-suited for many geological and geophysical applications where there's a need to model complex relationships in data with many variables. Here are some interesting examples in these domains:
- Mineral Exploration and Resource Estimation:
  - Predicting mineral deposit locations using geochemical, geophysical, and geological features
  - Estimating ore grade or resource quantity based on drilling samples and geophysical measurements
  - Classifying rock types from geophysical logs and core sample data
- Seismic Event Classification:
  - Distinguishing between different types of seismic events (earthquakes, explosions, mining blasts)
  - Predicting earthquake magnitudes from early-wave characteristics
  - Identifying volcanic eruption precursors from seismic signal patterns
- Reservoir Characterization:
  - Predicting porosity, permeability, and fluid content in oil and gas reservoirs
  - Classifying lithology from well log data
  - Estimating reservoir connectivity based on pressure and production data
- Landslide Susceptibility Mapping:
  - Predicting landslide-prone areas using topography, geology, precipitation, and land cover data
  - Estimating landslide volume or runout distance based on terrain characteristics
- Groundwater Modeling:
  - Predicting groundwater quality parameters based on hydrogeological conditions
  - Estimating aquifer transmissivity and storage coefficients
  - Identifying zones of potential groundwater contamination
- Geohazard Assessment:
  - Assessing risk levels for various geohazards (sinkholes, soil liquefaction, etc.)
  - Predicting coastal erosion rates based on geological and oceanographic factors
  - Mapping subsidence risk from mining or fluid extraction
- Weather and Climate Applications in Geoscience:
  - Downscaling climate model outputs for local geological applications
  - Predicting extreme precipitation events and their geomorphic impacts
  - Assessing climate change impacts on geological processes
- Remote Sensing and Geological Mapping:
  - Automated geological feature extraction from satellite or drone imagery
  - Mapping hydrothermal alteration zones from multispectral satellite data
  - Classifying land cover/land use with geological significance
These applications typically involve multivariate datasets with complex, non-linear relationships - exactly the type of problems where random forests excel due to their ability to handle high-dimensional data, capture non-linear relationships, and provide feature importance rankings.
Class Definition
namespace ml {
enum class TaskType { REGRESSION, CLASSIFICATION };
class RandomForest {
public:
    // Constructor
    RandomForest(size_t num_trees = 100,
                 TaskType task_type = TaskType::REGRESSION,
                 size_t max_features = 0,
                 size_t max_depth = std::numeric_limits<size_t>::max(),
                 size_t min_samples_split = 2, size_t n_classes = 0);
    void fit(const df::Dataframe &data, const std::string &target_column);
    df::Serie<double> predict(const df::Dataframe &data);
    df::Serie<double> feature_importance(const df::Dataframe &data,
                                         const std::string &target_column);
    double oob_error(const df::Dataframe &data,
                     const std::string &target_column);
    df::Serie<double> permutation_importance(const df::Dataframe &data,
                                             const std::string &target_column,
                                             size_t n_repeats = 10);
    std::vector<std::string>
    get_feature_names(const df::Dataframe &data,
                      const std::string &target_column) const;
    df::Dataframe feature_importance_df(const df::Dataframe &data,
                                        const std::string &target_column);
    size_t get_num_trees() const { return num_trees; }
    TaskType get_task_type() const { return task_type; }
    std::map<std::string, double> evaluate(const df::Dataframe &data,
                                           const std::string &target_column);
};
// Utility function to create a Random Forest Regressor
RandomForest create_random_forest_regressor(
    size_t num_trees = 100, size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);
// Utility function to create a Random Forest Classifier
RandomForest create_random_forest_classifier(
    size_t num_trees = 100, size_t n_classes = 0, size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);
} // namespace ml
Constructor
RandomForest(size_t num_trees = 100,
             TaskType task_type = TaskType::REGRESSION,
             size_t max_features = 0,
             size_t max_depth = std::numeric_limits<size_t>::max(),
             size_t min_samples_split = 2,
             size_t n_classes = 0);
Parameters
| Parameter | Type | Description |
|---|---|---|
| num_trees | size_t | The number of trees in the forest. Increasing this generally improves performance at the cost of more training time. Default is 100. |
| task_type | TaskType | Type of task to perform, either REGRESSION or CLASSIFICATION. Default is REGRESSION. |
| max_features | size_t | The number of features to consider when looking for the best split. If 0, all features are considered. Default is 0. |
| max_depth | size_t | The maximum depth of each tree. Default is unlimited (std::numeric_limits<size_t>::max()). |
| min_samples_split | size_t | The minimum number of samples required to split an internal node. Default is 2. |
| n_classes | size_t | Number of classes for classification tasks. Only used when task_type is CLASSIFICATION. Default is 0 (auto-detect). |
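To make the `max_features` convention concrete, here is a minimal sketch of how a per-split feature subset could be drawn. The helpers `effective_max_features` and `sample_features` are hypothetical; the clamping of oversized values and the sampling without replacement are assumptions of this sketch, not documented behavior of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: number of candidate features per split.
// Follows the documented convention that 0 means "consider all features";
// clamping values larger than n_features is an assumption of this sketch.
std::size_t effective_max_features(std::size_t max_features, std::size_t n_features) {
    if (max_features == 0 || max_features > n_features) return n_features;
    return max_features;
}

// Hypothetical helper: draw the feature indices to evaluate at one split,
// sampled without replacement and returned in ascending order.
std::vector<std::size_t> sample_features(std::size_t max_features,
                                         std::size_t n_features,
                                         std::mt19937& rng) {
    assert(n_features > 0);
    std::vector<std::size_t> indices(n_features);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(effective_max_features(max_features, n_features));
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

With `max_features = 3` and 8 features, each call returns 3 distinct indices in the range 0 to 7; with `max_features = 0`, every index is returned.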
Utility Factory Functions
The library provides two convenience functions for creating RandomForest instances configured for either regression or classification:
// Create a Random Forest Regressor
RandomForest create_random_forest_regressor(
    size_t num_trees = 100,
    size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);
// Create a Random Forest Classifier
RandomForest create_random_forest_classifier(
    size_t num_trees = 100,
    size_t n_classes = 0,
    size_t max_features = 0,
    size_t max_depth = std::numeric_limits<size_t>::max(),
    size_t min_samples_split = 2);
These functions simplify the creation of RandomForest instances with appropriate defaults for either regression or classification tasks.
Core Methods
fit
void fit(const df::Dataframe &data, const std::string &target_column);
Trains the random forest model using the provided data. The target_column specifies
which column in the DataFrame contains the target values to predict.
predict
df::Serie<double> predict(const df::Dataframe &data);
Makes predictions on the provided data using the trained random forest model. Returns a Serie of predicted values.
evaluate
std::map<std::string, double> evaluate(const df::Dataframe &data, const std::string &target_column);
Evaluates the model's performance on the provided data. Returns a map of evaluation metrics. For regression tasks, this typically includes metrics like MSE, RMSE, and R². For classification tasks, it includes metrics like accuracy, precision, recall, and F1-score.
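The metrics named above follow standard definitions, which the following standalone sketch computes on plain vectors. The function names and the map keys (`"mse"`, `"rmse"`, `"r2"`, `"accuracy"`, and so on) are illustrative assumptions and may differ from what `evaluate` actually returns; the classification part reports precision, recall and F1 for a single class treated as positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: MSE, RMSE and R^2 for a regression task.
std::map<std::string, double> regression_metrics(const std::vector<double>& y_true,
                                                 const std::vector<double>& y_pred) {
    assert(!y_true.empty() && y_true.size() == y_pred.size());
    const double n = static_cast<double>(y_true.size());
    double mean = 0.0;
    for (double y : y_true) mean += y;
    mean /= n;
    double ss_res = 0.0, ss_tot = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        ss_res += (y_true[i] - y_pred[i]) * (y_true[i] - y_pred[i]);
        ss_tot += (y_true[i] - mean) * (y_true[i] - mean);
    }
    const double mse = ss_res / n;
    // R^2 is undefined for a constant target; this sketch reports 0 in that case.
    const double r2 = ss_tot > 0.0 ? 1.0 - ss_res / ss_tot : 0.0;
    return {{"mse", mse}, {"rmse", std::sqrt(mse)}, {"r2", r2}};
}

// Hypothetical helper: accuracy over all classes, plus precision, recall and
// F1 for the class `positive`.
std::map<std::string, double> classification_metrics(const std::vector<int>& y_true,
                                                     const std::vector<int>& y_pred,
                                                     int positive) {
    assert(!y_true.empty() && y_true.size() == y_pred.size());
    double tp = 0.0, fp = 0.0, fn = 0.0, correct = 0.0;
    for (std::size_t i = 0; i < y_true.size(); ++i) {
        if (y_true[i] == y_pred[i]) correct += 1.0;
        if (y_pred[i] == positive && y_true[i] == positive) tp += 1.0;
        else if (y_pred[i] == positive) fp += 1.0;
        else if (y_true[i] == positive) fn += 1.0;
    }
    const double precision = (tp + fp) > 0.0 ? tp / (tp + fp) : 0.0;
    const double recall = (tp + fn) > 0.0 ? tp / (tp + fn) : 0.0;
    const double f1 = (precision + recall) > 0.0
                          ? 2.0 * precision * recall / (precision + recall)
                          : 0.0;
    const double accuracy = correct / static_cast<double>(y_true.size());
    return {{"accuracy", accuracy}, {"precision", precision},
            {"recall", recall}, {"f1", f1}};
}
```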
Feature Importance Methods
feature_importance
df::Serie<double> feature_importance(const df::Dataframe &data, const std::string &target_column);
Calculates and returns the importance of each feature in the model. This is based on how much each feature contributes to the decrease in impurity or error across all trees in the forest.
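As an illustration of what "decrease in impurity" means for a single split, the sketch below uses Gini impurity. This is an assumption made for the example: the documentation does not state which impurity criterion the library uses. Summing this quantity over every split that uses a given feature, across all trees, and normalizing gives the usual impurity-based importance.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Gini impurity of a set of class labels: 1 - sum_k p_k^2.
double gini(const std::vector<int>& labels) {
    if (labels.empty()) return 0.0;
    std::map<int, std::size_t> counts;
    for (int label : labels) ++counts[label];
    const double n = static_cast<double>(labels.size());
    double sum_sq = 0.0;
    for (const auto& entry : counts) {
        const double p = static_cast<double>(entry.second) / n;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;
}

// Hypothetical helper: impurity decrease obtained by splitting `parent`
// into `left` and `right`, with children weighted by their share of samples.
double impurity_decrease(const std::vector<int>& parent,
                         const std::vector<int>& left,
                         const std::vector<int>& right) {
    assert(!parent.empty() && left.size() + right.size() == parent.size());
    const double n = static_cast<double>(parent.size());
    const double w_left = static_cast<double>(left.size()) / n;
    const double w_right = static_cast<double>(right.size()) / n;
    return gini(parent) - w_left * gini(left) - w_right * gini(right);
}
```

A split that separates the labels {0, 0, 1, 1} perfectly removes all of the parent's impurity (0.5), while a split that leaves both children as mixed as the parent yields a decrease of 0.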
permutation_importance
df::Serie<double> permutation_importance(
    const df::Dataframe &data,
    const std::string &target_column,
    size_t n_repeats = 10);
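Calculates feature importance via the permutation importance method. This method measures how much model performance degrades (for example, lower accuracy or higher prediction error) when a single feature's values are randomly shuffled, which indicates how much the model depends on that feature. The n_repeats parameter sets how many times the shuffle is repeated.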
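The idea can be sketched independently of the library. In the example below, `Model` is a stand-in for any trained predictor, MSE is used as the error measure, and `permutation_importance_sketch` is a hypothetical name; none of this is taken from the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Row = std::vector<double>;
using Model = std::function<double(const Row&)>;  // stand-in for a trained model

// Mean squared error of `model` on rows X with targets y.
double mse(const Model& model, const std::vector<Row>& X, const std::vector<double>& y) {
    double total = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double err = model(X[i]) - y[i];
        total += err * err;
    }
    return total / static_cast<double>(X.size());
}

// For each feature: shuffle its column n_repeats times and record the mean
// increase in MSE relative to the unshuffled baseline.
std::vector<double> permutation_importance_sketch(const Model& model,
                                                  const std::vector<Row>& X,
                                                  const std::vector<double>& y,
                                                  std::size_t n_repeats,
                                                  std::mt19937& rng) {
    assert(!X.empty() && X.size() == y.size() && n_repeats > 0);
    const double baseline = mse(model, X, y);
    const std::size_t n_features = X.front().size();
    std::vector<double> importance(n_features, 0.0);
    for (std::size_t f = 0; f < n_features; ++f) {
        std::vector<double> column(X.size());
        for (std::size_t i = 0; i < X.size(); ++i) column[i] = X[i][f];
        for (std::size_t r = 0; r < n_repeats; ++r) {
            std::shuffle(column.begin(), column.end(), rng);
            std::vector<Row> shuffled = X;  // copy, then overwrite one column
            for (std::size_t i = 0; i < X.size(); ++i) shuffled[i][f] = column[i];
            importance[f] += mse(model, shuffled, y) - baseline;
        }
        importance[f] /= static_cast<double>(n_repeats);
    }
    return importance;
}
```

A feature the model ignores gets an importance of exactly 0 under this scheme, because shuffling it never changes a prediction.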
feature_importance_df
df::Dataframe feature_importance_df(const df::Dataframe &data, const std::string &target_column);
Returns a DataFrame containing feature names and their importance scores, providing a more structured output for analyzing feature importance.
get_feature_names
std::vector<std::string> get_feature_names(
    const df::Dataframe &data,
    const std::string &target_column) const;
Returns a vector of feature names from the DataFrame, excluding the target column. This is useful for mapping feature indices to their original names in the dataset.
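A minimal sketch of this filtering, and of pairing the resulting names with importance scores, is shown below on plain standard-library containers. Both helper names are hypothetical, and the assumption that scores are ordered like the feature names is made for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keep every column name except the target, preserving order.
std::vector<std::string> feature_names_sketch(const std::vector<std::string>& columns,
                                              const std::string& target_column) {
    std::vector<std::string> features;
    for (const auto& name : columns) {
        if (name != target_column) features.push_back(name);
    }
    return features;
}

// Hypothetical helper: pair each name with its score (assumed to be in the
// same order) and sort from most to least important.
std::vector<std::pair<std::string, double>> rank_features(
    const std::vector<std::string>& names, const std::vector<double>& scores) {
    assert(names.size() == scores.size());
    std::vector<std::pair<std::string, double>> ranked;
    for (std::size_t i = 0; i < names.size(); ++i) {
        ranked.emplace_back(names[i], scores[i]);
    }
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;
}
```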
Model Assessment Methods
oob_error
double oob_error(const df::Dataframe &data, const std::string &target_column);
Calculates the out-of-bag (OOB) error estimate for the random forest model. Each sample is scored only by the trees whose bootstrap sample did not contain it, which yields an estimate of the generalization error without requiring a separate validation set.
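The following sketch shows one way an OOB error could be computed for regression, given per-tree predictions and in-bag flags. The function `oob_mse`, its inputs, and the choice of MSE are assumptions for illustration; the library's internal bookkeeping is not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: OOB mean squared error.
//   tree_pred[t][i] : prediction of tree t for sample i
//   in_bag[t][i]    : true if sample i was in tree t's bootstrap sample
// Each sample is predicted by averaging only the trees that did not see it.
// Samples that were in-bag for every tree are skipped.
double oob_mse(const std::vector<std::vector<double>>& tree_pred,
               const std::vector<std::vector<bool>>& in_bag,
               const std::vector<double>& y) {
    assert(tree_pred.size() == in_bag.size());
    double total = 0.0;
    std::size_t scored = 0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        double sum = 0.0;
        std::size_t votes = 0;
        for (std::size_t t = 0; t < tree_pred.size(); ++t) {
            if (!in_bag[t][i]) {
                sum += tree_pred[t][i];
                ++votes;
            }
        }
        if (votes == 0) continue;  // no tree can score this sample out-of-bag
        const double err = sum / static_cast<double>(votes) - y[i];
        total += err * err;
        ++scored;
    }
    return scored > 0 ? total / static_cast<double>(scored) : 0.0;
}
```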
get_num_trees
size_t get_num_trees() const;
Returns the number of trees in the random forest.
get_task_type
TaskType get_task_type() const;
Returns the task type (REGRESSION or CLASSIFICATION) that the random forest is configured for.
Usage Example
#include <dataframe/Serie.h>
#include <dataframe/Dataframe.h>
#include <dataframe/io/csv.h>
#include <dataframe/core/split.h>
#include <dataframe/ml/random_forest.h>
#include <iostream>
int main() {
    // Load data from CSV
    df::Dataframe data = df::io::read_csv("iris.csv");
    // Print dataset info
    std::cout << "Dataset columns: ";
    for (const auto& name : data.names()) {
        std::cout << name << " ";
    }
    std::cout << "\nTotal samples: " << data.get<double>("sepal_length").size() << std::endl;
    // Split data into training and testing sets (80/20)
    auto splits = df::split(5, data);
    // Combine first 4 parts for training (80%)
    df::Dataframe train_data;
    for (size_t i = 0; i < 4; ++i) {
        // Placeholder: merge splits[i] into train_data here.
        // The call depends on the library's merge function, so it is left
        // unspecified; until it is filled in, train_data stays empty.
    }
    // Use the 5th part for testing (20%)
    df::Dataframe test_data = splits[4];
    // Create and train a random forest classifier
    ml::RandomForest rf = ml::create_random_forest_classifier(
        100, // num_trees
        3,   // n_classes
        0,   // max_features (0 = consider all features)
        10,  // max_depth
        2    // min_samples_split
    );
    // Train the model
    rf.fit(train_data, "species");
    std::cout << "Model trained successfully with " << rf.get_num_trees() << " trees." << std::endl;
    // Make predictions
    df::Serie<double> predictions = rf.predict(test_data);
    // Calculate feature importance
    df::Dataframe importance = rf.feature_importance_df(train_data, "species");
    // Print feature importance
    std::cout << "\nFeature Importance:" << std::endl;
    for (const auto& name : importance.names()) {
        std::cout << name << ": " << importance.get<double>(name)[0] << std::endl;
    }
    // Evaluate the model
    auto metrics = rf.evaluate(test_data, "species");
    std::cout << "\nModel Evaluation:" << std::endl;
    for (const auto& [metric, value] : metrics) {
        std::cout << metric << ": " << value << std::endl;
    }
    // Calculate OOB error
    double oob = rf.oob_error(train_data, "species");
    std::cout << "Out-of-bag error: " << oob << std::endl;
    return 0;
}
Implementation Notes
- The implementation uses bootstrap sampling to create diverse training sets for each decision tree in the forest.
- Out-of-bag (OOB) samples, those left out of a given tree's bootstrap sample, are used to estimate the generalization error.
- For classification tasks, the model outputs class labels based on majority voting among the trees.
- For regression tasks, the model outputs the average prediction of all trees.
- The implementation supports feature importance calculation, which helps identify the most influential features in the model.
- Permutation importance provides an alternative method to assess feature importance by measuring the impact of shuffling each feature's values.
- When max_features is set to 0, all features are considered for each split, which can be computationally intensive for datasets with many features.
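The bootstrap sampling mentioned in the notes above can be sketched as follows. The struct `BootstrapSample` and the function `draw_bootstrap` are hypothetical names for this example. Drawing n indices with replacement from n samples leaves each sample out with probability (1 - 1/n)^n, which approaches e^-1 (about 36.8%) for large n; those left-out indices form the out-of-bag set.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical container for one tree's sample indices.
struct BootstrapSample {
    std::vector<std::size_t> in_bag;      // n indices drawn with replacement
    std::vector<std::size_t> out_of_bag;  // indices that were never drawn
};

// Draw a bootstrap sample of size n from the index range [0, n).
BootstrapSample draw_bootstrap(std::size_t n, std::mt19937& rng) {
    assert(n > 0);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<bool> seen(n, false);
    BootstrapSample sample;
    sample.in_bag.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t idx = pick(rng);
        sample.in_bag.push_back(idx);
        seen[idx] = true;
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (!seen[i]) sample.out_of_bag.push_back(i);
    }
    return sample;
}
```

Each tree would be trained on the rows listed in `in_bag` (duplicates included), and the rows in `out_of_bag` remain available for scoring that tree on data it has not seen.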