split Function - DataFrame Library Documentation

Overview

The split function divides a Serie (or multiple Series) into a specified number of approximately equal-sized parts. This is particularly useful for partitioning data for parallel processing, creating training/test sets, or performing k-fold cross-validation.

Function Signatures


// Single Serie version
template <typename T>
std::vector<Serie<T>> split(size_t n, const Serie<T>& serie);

// Multiple Series version
template <typename T, typename... Ts>
auto split(size_t n, const Serie<T>& first, const Serie<Ts>&... rest);

// Pipeline operation (bound version)
template <typename T>
auto bind_split(size_t n);

Parameters

Parameter	Type	Description
n	size_t	Number of parts to split the Serie(s) into. If n is larger than the Serie size, it will be adjusted to the Serie size.
serie	const Serie<T>&	The Serie to split (single Serie version).
first, rest...	const Serie<T>&, const Serie<Ts>&...	Multiple Series to split in parallel (multiple Series version). All Series must have the same size.

Return Value

Single Serie version: Returns a vector of Serie<T> objects, each containing a portion of the original Serie.
Multiple Series version: Returns a vector of tuples, where each tuple contains corresponding portions from each input Serie.

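Each returned portion has approximately the same size. If the Serie size is not evenly divisible by n, the first r parts each receive one additional element, where r is the remainder of the Serie size divided by n. For example, 10 elements split into 3 parts yields parts of sizes 4, 3, and 3.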
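The sizing rule above can be sketched as a small standalone function. This is an illustration only, not the library's code: `part_sizes` is a hypothetical helper, and its handling of an empty input (no parts) and of n == 0 (treated as a precondition violation) is an assumption, since the documentation does not specify either case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): returns the size of each
// part when `total` elements are split into `n` parts.
std::vector<std::size_t> part_sizes(std::size_t total, std::size_t n) {
    assert(n > 0);               // assumption: n == 0 is not a valid request
    if (total == 0) return {};   // assumption: an empty input yields no parts

    n = std::min(n, total);                   // clamp n to the number of elements
    const std::size_t base = total / n;       // size every part has at minimum
    const std::size_t remainder = total % n;  // how many parts get one extra

    std::vector<std::size_t> sizes(n, base);
    for (std::size_t i = 0; i < remainder; ++i) {
        ++sizes[i];  // the first `remainder` parts are one element larger
    }
    return sizes;
}
```

With these rules, `part_sizes(10, 3)` gives sizes 4, 3, 3, and `part_sizes(3, 5)` gives three parts of one element each.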

Example Usage

Single Serie Example


// Create a Serie with 10 elements
df::Serie<int> numbers{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

// Split into 3 parts
auto parts = df::split(3, numbers);

// Output the parts
for (size_t i = 0; i < parts.size(); ++i) {
    std::cout << "Part " << i + 1 << ": ";
    parts[i].forEach([](int val, size_t) {
        std::cout << val << " ";
    });
    std::cout << std::endl;
}

// Output:
// Part 1: 1 2 3 4 
// Part 2: 5 6 7 
// Part 3: 8 9 10

Multiple Series Example


// Create multiple Series with the same length
df::Serie<std::string> names{"Alice", "Bob", "Charlie", "Diana", "Eve"};
df::Serie<int> ages{25, 30, 35, 28, 42};
df::Serie<double> scores{95.5, 88.0, 76.5, 91.0, 85.5};

// Split into 2 parts (e.g., for training/test split)
auto splits = df::split(2, names, ages, scores);

// Process each split
for (size_t i = 0; i < splits.size(); ++i) {
    // Destructure the tuple for each split
    auto& [names_part, ages_part, scores_part] = splits[i];
    
    std::cout << "Split " << i + 1 << ":" << std::endl;
    for (size_t j = 0; j < names_part.size(); ++j) {
        std::cout << "  " << names_part[j] << ", " 
                  << ages_part[j] << " years, "
                  << scores_part[j] << " points" << std::endl;
    }
}

// Output:
// Split 1:
//   Alice, 25 years, 95.5 points
//   Bob, 30 years, 88 points
//   Charlie, 35 years, 76.5 points
// Split 2:
//   Diana, 28 years, 91 points
//   Eve, 42 years, 85.5 points

Pipeline Example


// Create a Serie
df::Serie<double> values{1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8};

// Use bind_split in a pipeline
auto parts = values | df::bind_split<double>(4);

// Process each part independently
std::vector<double> part_sums;
for (const auto& part : parts) {
    double sum = part.reduce([](double acc, double val, size_t) {
        return acc + val;
    }, 0.0);
    part_sums.push_back(sum);
}

// Output part sums
for (size_t i = 0; i < part_sums.size(); ++i) {
    std::cout << "Sum of part " << i + 1 << ": " << part_sums[i] << std::endl;
}

// Output:
// Sum of part 1: 3.3
// Sum of part 2: 7.7
// Sum of part 3: 12.1
// Sum of part 4: 16.5

K-Fold Cross-Validation Example


// Create data for a machine learning task
df::Serie<Vector2> features{{0.1, 0.2}, {0.3, 0.4}, {0.5, 0.6}, 
                            {0.7, 0.8}, {0.9, 1.0}, {1.1, 1.2}, 
                            {1.3, 1.4}, {1.5, 1.6}, {1.7, 1.8}, {1.9, 2.0}};
df::Serie<int> labels{0, 0, 0, 0, 0, 1, 1, 1, 1, 1};

// Perform 5-fold cross-validation
auto folds = df::split(5, features, labels);

for (size_t i = 0; i < folds.size(); ++i) {
    // Use fold i as test set, all others as training set
    auto& [test_features, test_labels] = folds[i];
    
    df::Serie<Vector2> train_features;
    df::Serie<int> train_labels;
    size_t train_size = 0;  // samples the training set holds once merged
    
    for (size_t j = 0; j < folds.size(); ++j) {
        if (j != i) {
            // Add fold j to the training set
            auto& [fold_features, fold_labels] = folds[j];
            // Append fold_features to train_features and fold_labels to
            // train_labels here; how depends on how you merge Series.
            train_size += fold_features.size();
        }
    }
    
    // Now train a model on train_features, train_labels
    // and evaluate it on test_features, test_labels
    std::cout << "Fold " << i + 1 << ": "
              << "Training on " << train_size << " samples, "
              << "Testing on " << test_features.size() << " samples" << std::endl;
}
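The merge step left open in the example above could look like the following sketch. It uses std::vector as a stand-in because this page does not document how Series are concatenated; `merge_training_folds` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: concatenates every fold except the one at
// `test_index`, which is held out as the test set.
template <typename T>
std::vector<T> merge_training_folds(const std::vector<std::vector<T>>& folds,
                                    std::size_t test_index) {
    assert(test_index < folds.size());
    std::vector<T> training;
    for (std::size_t j = 0; j < folds.size(); ++j) {
        if (j == test_index) continue;  // skip the held-out fold
        // Append a copy of fold j, preserving element order.
        training.insert(training.end(), folds[j].begin(), folds[j].end());
    }
    return training;
}
```

Calling the helper once for the features and once for the labels, with the same `test_index`, keeps the two training sequences aligned.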

Implementation Notes

The algorithm keeps all part sizes within one element of each other. If the Serie size is not evenly divisible by n, the leftover elements are assigned one each to the first parts.
When splitting multiple Series, the function ensures that corresponding elements from each Serie are kept together in the same split.
If n is greater than the Serie size, it will be adjusted to the Serie size (resulting in one element per part).
For multiple Series, all input Series must have the same size, or an exception will be thrown.
The split function creates copies of the data, not views. This means modifications to the split parts do not affect the original Serie(s).
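The notes above (aligned splits, clamping of n, the size check, and copy semantics) can be illustrated with a minimal sketch for two inputs. It uses std::vector rather than df::Serie and is not the library's implementation: `split_pair` is a hypothetical name, the exception type is an assumption (the notes only say an exception is thrown), and n == 0 is treated as a precondition violation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-in for the multiple-Series overload, limited to two std::vector inputs.
// Each returned pair holds copies of the same index range from both inputs.
template <typename A, typename B>
std::vector<std::pair<std::vector<A>, std::vector<B>>>
split_pair(std::size_t n, const std::vector<A>& a, const std::vector<B>& b) {
    assert(n > 0);  // assumption: n == 0 is not a valid request
    if (a.size() != b.size()) {
        // Assumed exception type; the documentation does not name one.
        throw std::invalid_argument("inputs must have the same size");
    }

    std::vector<std::pair<std::vector<A>, std::vector<B>>> parts;
    if (a.empty()) return parts;

    n = std::min(n, a.size());  // more parts than elements: one element per part
    const std::size_t base = a.size() / n;
    const std::size_t remainder = a.size() % n;

    std::size_t begin = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t len = base + (i < remainder ? 1 : 0);
        const auto first = static_cast<std::ptrdiff_t>(begin);
        const auto last = static_cast<std::ptrdiff_t>(begin + len);
        // The range constructors copy elements, so the parts own their data.
        parts.emplace_back(std::vector<A>(a.begin() + first, a.begin() + last),
                           std::vector<B>(b.begin() + first, b.begin() + last));
        begin += len;
    }
    return parts;
}
```

Because each part is built by copying a range, writing to a part leaves the inputs unchanged, which mirrors the copy-not-view behavior described in the notes.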

Common Use Cases

Parallel Processing: Split a large dataset into chunks for parallel processing or multi-threading.
Data Partitioning: Create training/validation/test splits for machine learning models.
Cross-Validation: Implement k-fold cross-validation by splitting data into k parts.
Batch Processing: Process large datasets in manageable chunks to reduce memory usage.
Data Sampling: Split data to create representative samples for analysis.
