Statistical Functions

Overview

The df::stats namespace provides statistical functions for analyzing and summarizing the data in a Serie. These functions support both scalar types (such as int and double) and array types (such as Vector2 and Vector3).

Basic Statistical Functions

Mean and Average

Mean and Average Functions

// Calculate mean of a Serie
template <typename T> T mean(const Serie<T> &serie);

// Legacy function for mean - same as mean()
template <typename T> T avg(const Serie<T> &serie);

// Return a Serie containing the average value (useful for pipeline operations)
template <typename T> Serie<T> avg_serie(const Serie<T> &serie);

The mean and avg functions calculate the arithmetic average of all elements in a Serie. For vector types, they calculate the average component-wise.

Mean/Average Example

// Calculate mean of numeric values
df::Serie<double> values{1.5, 2.5, 3.5, 4.5, 5.5};
double mean_val = df::stats::mean(values);  // 3.5

// For vector Serie, mean is calculated component-wise
df::Serie<Vector2> points{{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
Vector2 mean_point = df::stats::mean(points);  // {3.0, 4.0}

// Using in a pipeline
auto result = values | df::stats::bind_mean<double>();
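
To make "component-wise" concrete, here is a minimal standalone sketch that uses only the standard library. It is not the library's implementation: Vec2 (a std::array standing in for Vector2) and mean_componentwise are hypothetical names introduced for illustration, and std::vector stands in for df::Serie.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the library's Vector2 type.
using Vec2 = std::array<double, 2>;

// Average each component independently across all elements.
Vec2 mean_componentwise(const std::vector<Vec2> &points) {
    if (points.empty()) {
        throw std::runtime_error("cannot take the mean of an empty sequence");
    }
    Vec2 sum{0.0, 0.0};
    for (const Vec2 &p : points) {
        for (std::size_t c = 0; c < sum.size(); ++c) {
            sum[c] += p[c];  // accumulate x with x, y with y
        }
    }
    for (double &component : sum) {
        component /= static_cast<double>(points.size());
    }
    return sum;
}
```

For the three points used in the example above, the x components average to 3.0 and the y components to 4.0, independently of each other.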

Variance and Standard Deviation

Variance and Standard Deviation Functions

// Calculate variance
template <typename T>
auto variance(const Serie<T> &serie, bool population = false);

// Calculate standard deviation
template <typename T>
auto std_dev(const Serie<T> &serie, bool population = false);

These functions calculate the variance and standard deviation of a Serie. The population parameter controls whether to calculate the population statistics (dividing by n) or sample statistics (dividing by n-1).

Variance and Standard Deviation Example

// Calculate variance and standard deviation
df::Serie<double> values{2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};

// Population statistics (dividing by n)
double var_pop = df::stats::variance(values, true);  // 4.0
double std_pop = df::stats::std_dev(values, true);   // 2.0

// Sample statistics (dividing by n-1, default)
double var_sample = df::stats::variance(values);  // 32/7 ≈ 4.571
double std_sample = df::stats::std_dev(values);   // ≈ 2.138

// Using in a pipeline
auto std_values = values | df::stats::bind_std_dev<double>();
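
The effect of the population flag is easiest to see when the computation is written out. The following is a standalone sketch under stated assumptions, not the library's code: variance_sketch and std_dev_sketch are hypothetical names, and std::vector<double> stands in for df::Serie<double>.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Variance with a selectable denominator: n (population) or n - 1 (sample).
double variance_sketch(const std::vector<double> &xs, bool population = false) {
    const std::size_t n = xs.size();
    if (n == 0 || (!population && n < 2)) {
        throw std::runtime_error("not enough values to compute a variance");
    }
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(n);

    double sum_sq = 0.0;  // sum of squared deviations from the mean
    for (double x : xs) sum_sq += (x - mean) * (x - mean);

    const double denom = static_cast<double>(population ? n : n - 1);
    return sum_sq / denom;
}

// Standard deviation is the square root of the variance.
double std_dev_sketch(const std::vector<double> &xs, bool population = false) {
    return std::sqrt(variance_sketch(xs, population));
}
```

For the eight values in the example above, the mean is 5 and the squared deviations sum to 32, so the population variance is 32/8 = 4 and the sample variance is 32/7.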

Order Statistics

Median and Quantiles

Median and Quantile Functions

// Calculate median (middle value) of a Serie
template <typename T> auto median(Serie<T> serie);

// Calculate any quantile (value at a specific percentile)
template <typename T> auto quantile(Serie<T> serie, double q);

// Calculate Interquartile Range (IQR = Q3 - Q1)
template <typename T> T iqr(const Serie<T> &serie);

These functions calculate order statistics, which are based on the sorted values of a Serie:

  • median - The middle value (50th percentile) of the sorted Serie. When the Serie has an even number of elements, it is the average of the two middle values.
  • quantile - The value at any quantile q between 0 and 1. For example, 0.25 gives the first quartile (Q1), 0.5 gives the median, and 0.75 gives the third quartile (Q3).
  • iqr - The Interquartile Range, which is Q3 - Q1, a measure of statistical dispersion.

Median and Quantile Example

// Calculate median
df::Serie<int> values{5, 1, 8, 4, 3};
int median_val = df::stats::median(values);  // 4

// For even-sized Series, median is average of two middle values
df::Serie<double> even_values{2.5, 1.0, 8.5, 3.5, 4.5, 6.0};
double even_median = df::stats::median(even_values);  // 4.0 (avg of 3.5 and 4.5)

// Calculate quartiles
df::Serie<double> sorted_values{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
double q1 = df::stats::quantile(sorted_values, 0.25);  // First quartile
double q2 = df::stats::quantile(sorted_values, 0.5);   // Second quartile (median)
double q3 = df::stats::quantile(sorted_values, 0.75);  // Third quartile

// Calculate Interquartile Range
double iqr_val = df::stats::iqr(sorted_values);        // Q3 - Q1
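
Quantiles that fall between data points have to be interpolated, and several conventions exist. The sketch below assumes the common one where the fractional position in the sorted data is q × (n - 1), which is the default in NumPy and R. Whether df::stats uses exactly this convention is an assumption. quantile_sketch and iqr_sketch are hypothetical names, and std::vector<double> stands in for df::Serie<double>.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Takes the data by value so it can be sorted without modifying the caller's copy.
double quantile_sketch(std::vector<double> xs, double q) {
    if (xs.empty()) throw std::runtime_error("empty input");
    if (q < 0.0 || q > 1.0) throw std::invalid_argument("q must be in [0, 1]");
    std::sort(xs.begin(), xs.end());

    // Fractional position in the sorted data (assumed convention: q * (n - 1)).
    const double pos = q * static_cast<double>(xs.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, xs.size() - 1);
    const double frac = pos - static_cast<double>(lo);

    // Linear interpolation between the two neighbouring sorted values.
    return xs[lo] + frac * (xs[hi] - xs[lo]);
}

// Interquartile range: Q3 - Q1.
double iqr_sketch(const std::vector<double> &xs) {
    return quantile_sketch(xs, 0.75) - quantile_sketch(xs, 0.25);
}
```

Under this convention the values 1.0 through 9.0 give Q1 = 3.0, a median of 5.0, Q3 = 7.0, and an IQR of 4.0. The median of an even-sized input falls halfway between the two middle values.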

Outlier Detection

Outlier Detection Functions

// Identify outliers using the 1.5 * IQR criterion
template <typename T> Serie<bool> isOutlier(const Serie<T> &serie);

// Identify non-outliers
template <typename T> Serie<bool> notOutlier(const Serie<T> &serie);

These functions detect outliers using the common 1.5 × IQR criterion:

  • A value is considered an outlier if it is below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
  • isOutlier returns a Serie of boolean values indicating which elements are outliers.
  • notOutlier returns the complement, indicating which elements are not outliers.

Outlier Detection Example

// Create a Serie with outliers
df::Serie<double> data{1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0};

// Detect outliers
df::Serie<bool> outliers = df::stats::isOutlier(data);
// outliers might be {false, false, false, false, false, false, true}

// Get the complementary mask (true where a value is not an outlier)
df::Serie<bool> regular_values = df::stats::notOutlier(data);

// Use with filter to remove outliers
auto cleaned_data = data | df::filter([&](double, size_t idx) {
    return !outliers[idx];
});
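
The 1.5 × IQR rule can be written out in a few lines. This is a standalone sketch, not the library's implementation. is_outlier_sketch and sorted_quantile are hypothetical names, std::vector stands in for df::Serie, and the quartiles are computed with linear interpolation at position q × (n - 1), which is an assumed convention.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Quantile by linear interpolation on an already sorted vector (assumed convention).
double sorted_quantile(const std::vector<double> &sorted, double q) {
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Flags each element that lies outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
std::vector<bool> is_outlier_sketch(const std::vector<double> &xs) {
    if (xs.empty()) throw std::runtime_error("empty input");
    std::vector<double> sorted(xs);  // sort a copy; flags keep the input order
    std::sort(sorted.begin(), sorted.end());

    const double q1 = sorted_quantile(sorted, 0.25);
    const double q3 = sorted_quantile(sorted, 0.75);
    const double iqr = q3 - q1;
    const double lower = q1 - 1.5 * iqr;
    const double upper = q3 + 1.5 * iqr;

    std::vector<bool> flags;
    flags.reserve(xs.size());
    for (double x : xs) flags.push_back(x < lower || x > upper);
    return flags;
}
```

With the data from the example above, this convention gives Q1 = 2.25 and Q3 = 3.75, so the fences are 0.0 and 6.0 and only 10.0 is flagged.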

Advanced Statistics

Mode

Mode Function

// Calculate the mode (most frequent value)
template <typename T> T mode(const Serie<T> &serie);

The mode function finds the most frequently occurring value in a Serie. If multiple values have the same highest frequency, it returns the first (lowest) such value.

Mode Example

// Calculate mode
df::Serie<int> values{1, 2, 2, 3, 3, 3, 4, 4, 5};
int mode_val = df::stats::mode(values);  // 3 (appears three times)
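
The tie-breaking rule is worth seeing in code. The sketch below is one way to get "lowest value wins on ties", assuming that is what "first (lowest)" means here. mode_sketch is a hypothetical name and std::vector<int> stands in for df::Serie<int>. It is not the library's implementation.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Most frequent value. Ties resolve to the smallest value because std::map
// iterates keys in ascending order and only a strictly larger count wins.
int mode_sketch(const std::vector<int> &xs) {
    if (xs.empty()) throw std::runtime_error("empty input");

    std::map<int, std::size_t> counts;
    for (int x : xs) ++counts[x];

    int best_value = counts.begin()->first;
    std::size_t best_count = 0;
    for (const auto &[value, count] : counts) {
        if (count > best_count) {
            best_value = value;
            best_count = count;
        }
    }
    return best_value;
}
```

For example, an input of {4, 4, 1, 1, 7} has two values that each appear twice, and this sketch returns 1.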

Z-Scores

Z-Score Function

// Calculate z-scores (standardized values)
template <typename T>
Serie<double> z_score(const Serie<T> &serie, bool population = false);

The z_score function standardizes a Serie by subtracting the mean and dividing by the standard deviation. The result shows how many standard deviations each value is from the mean.

Z-Score Example

// Calculate z-scores
df::Serie<double> values{2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
df::Serie<double> z_scores = df::stats::z_score(values);

// Values near zero are close to the mean
// Values with |z| > 2 are often flagged as unusual; |z| > 3 is a common stricter cutoff

// Using in a pipeline
auto standardized = values | df::stats::bind_z_score<double>();
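
Standardization can be written out as follows. This is a standalone sketch, not the library's code. z_score_sketch is a hypothetical name, std::vector<double> stands in for df::Serie<double>, and throwing on a zero standard deviation is a choice made for this sketch only.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Standardize values: z_i = (x_i - mean) / standard deviation.
std::vector<double> z_score_sketch(const std::vector<double> &xs,
                                   bool population = false) {
    const std::size_t n = xs.size();
    if (n < 2) throw std::runtime_error("need at least two values");

    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(n);

    double sum_sq = 0.0;  // sum of squared deviations from the mean
    for (double x : xs) sum_sq += (x - mean) * (x - mean);
    const double sd =
        std::sqrt(sum_sq / static_cast<double>(population ? n : n - 1));

    // Sketch-only choice: refuse constant input rather than divide by zero.
    if (sd == 0.0) throw std::runtime_error("standard deviation is zero");

    std::vector<double> z;
    z.reserve(n);
    for (double x : xs) z.push_back((x - mean) / sd);
    return z;
}
```

With population = true, the data from the example above has mean 5 and standard deviation 2, so 2.0 maps to -1.5 and 9.0 maps to 2.0.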

Correlation and Covariance

Correlation and Covariance Functions

// Calculate covariance between two Series
template <typename T, typename U>
double covariance(const Serie<T> &serie1, const Serie<U> &serie2,
                  bool population = false);

// Calculate Pearson correlation coefficient
template <typename T, typename U>
double correlation(const Serie<T> &serie1, const Serie<U> &serie2);

These functions measure relationships between two Serie objects:

  • covariance - Measures how two variables change together. Positive values indicate that they tend to increase together; negative values indicate that one tends to decrease as the other increases.
  • correlation - The Pearson correlation coefficient, a normalized measure of covariance that always falls between -1 and 1. A value of 1 indicates perfect positive correlation, -1 indicates perfect negative correlation, and 0 indicates no linear correlation.

Correlation and Covariance Example

// Calculate covariance and correlation
df::Serie<double> x{1.0, 2.0, 3.0, 4.0, 5.0};
df::Serie<double> y{5.0, 4.0, 3.0, 2.0, 1.0};

// Calculate covariance
double cov = df::stats::covariance(x, y);  // -2.5 (negative covariance)

// Calculate correlation
double corr = df::stats::correlation(x, y);  // -1.0 (perfect negative correlation)

// Positive correlation example
df::Serie<double> z{1.0, 2.0, 3.0, 4.0, 5.0};
double pos_corr = df::stats::correlation(x, z);  // 1.0 (perfect positive correlation)
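
The relationship between the two measures is clearer when both are derived from the same sum of deviation products. The following is a standalone sketch, not the library's implementation. covariance_sketch, correlation_sketch, and the helpers are hypothetical names, and std::vector<double> stands in for df::Serie<double>.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

double mean_value(const std::vector<double> &v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Sum over i of (a_i - mean(a)) * (b_i - mean(b)).
double co_deviation(const std::vector<double> &a, const std::vector<double> &b) {
    if (a.size() != b.size()) throw std::runtime_error("length mismatch");
    if (a.size() < 2) throw std::runtime_error("need at least two pairs");
    const double ma = mean_value(a);
    const double mb = mean_value(b);
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - ma) * (b[i] - mb);
    return sum;
}

// Covariance: co-deviation divided by n (population) or n - 1 (sample).
double covariance_sketch(const std::vector<double> &x,
                         const std::vector<double> &y,
                         bool population = false) {
    const std::size_t n = x.size();
    return co_deviation(x, y) / static_cast<double>(population ? n : n - 1);
}

// Pearson correlation: the denominators cancel, leaving a ratio of sums.
double correlation_sketch(const std::vector<double> &x,
                          const std::vector<double> &y) {
    const double denom = std::sqrt(co_deviation(x, x) * co_deviation(y, y));
    if (denom == 0.0) throw std::runtime_error("a constant input has no correlation");
    return co_deviation(x, y) / denom;
}
```

For x and y from the example above, the deviation products sum to -10, giving a sample covariance of -10/4 = -2.5 and a correlation of -10/√(10 × 10) = -1.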

Summary Statistics

Summary Function

// Calculate a map of summary statistics
template <typename T> auto summary(const Serie<T> &serie);

The summary function calculates multiple statistics at once and returns them in a map, similar to summary() in R or describe() in Python's pandas.

Summary Example

// Get summary statistics
df::Serie<double> values{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
auto stats = df::stats::summary(values);

// The summary map contains:
// stats["count"] - Number of elements
// stats["min"] - Minimum value
// stats["q1"] - First quartile (25th percentile)
// stats["median"] - Median (50th percentile)
// stats["q3"] - Third quartile (75th percentile)
// stats["max"] - Maximum value
// stats["mean"] - Arithmetic mean
// stats["std_dev"] - Standard deviation

// Print the summary
for (const auto& [key, value] : stats) {
    std::cout << key << ": " << value << std::endl;
}

Pipeline Operations

The statistical functions that take a single Serie have corresponding bind_* versions for use in pipelines, listed below. These let you chain operations together with the pipe operator |.

Pipeline Binding Functions

// Pipeline binding functions
template <typename T> auto bind_avg();
template <typename T> auto bind_mean();
template <typename T> auto bind_variance(bool population = false);
template <typename T> auto bind_std_dev(bool population = false);
template <typename T> auto bind_median();
template <typename T> auto bind_quantile(double q);
template <typename T> auto bind_iqr();
template <typename T> auto bind_isOutlier();
template <typename T> auto bind_notOutlier();
template <typename T> auto bind_mode();
template <typename T> auto bind_z_score(bool population = false);

Pipeline Examples

// Create a Serie
df::Serie<double> values{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0};

// Complex pipeline with multiple operations
auto result = values
    | df::stats::bind_isOutlier<double>()            // Detect outliers
    | df::bind_map([&values](bool is_outlier, size_t i) {
        return is_outlier ? NAN : values[i];        // Replace outliers with NaN
      })
    | df::filter([](double v, size_t) {
        return !std::isnan(v);                      // Remove NaN values
      })
    | df::stats::bind_z_score<double>();            // Standardize remaining values

// Calculate statistics through a pipeline
double mean_val = values | df::stats::bind_mean<double>();
double std_val = values | df::stats::bind_std_dev<double>();
double median_val = values | df::stats::bind_median<double>();
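
For readers curious how a bind_* function and the pipe operator can fit together, here is one possible mechanism in miniature. It is an illustration of the general pattern only and makes no claim about how df implements it. MiniSerie, mini_mean, and bind_mean_sketch are all hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical miniature container standing in for df::Serie<T>.
template <typename T>
struct MiniSerie {
    std::vector<T> data;
};

// Generic pipe: passes the left-hand container to the callable on the right.
template <typename T, typename F>
auto operator|(const MiniSerie<T> &serie, F &&op) {
    return std::forward<F>(op)(serie);
}

// Plain function form of the operation.
template <typename T>
T mini_mean(const MiniSerie<T> &serie) {
    if (serie.data.empty()) throw std::runtime_error("empty input");
    T sum{};
    for (const T &v : serie.data) sum += v;
    return sum / static_cast<T>(serie.data.size());
}

// bind_* style factory: returns a callable that applies mini_mean later,
// once the pipe operator supplies the container.
template <typename T>
auto bind_mean_sketch() {
    return [](const MiniSerie<T> &serie) { return mini_mean(serie); };
}
```

With these definitions, `MiniSerie<double>{{1.0, 2.0, 3.0, 4.0}} | bind_mean_sketch<double>()` evaluates to 2.5. Because the pipe returns whatever the callable returns, a stage that yields another container can be followed by further stages.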

Implementation Notes

  • Most functions throw std::runtime_error if the input Serie is empty.
  • Functions that operate on multiple Series (like correlation) require both Series to have the same length.
  • For array types (vectors, matrices), functions typically operate component-wise.
  • The population parameter in variance and standard deviation controls whether to use n (population) or n-1 (sample) in the denominator.
  • When calculating quantiles, linear interpolation is used for quantiles that fall between data points.
  • For outlier detection, the standard 1.5 × IQR criterion is used.

Complex Example

Complete Analysis Example

#include <dataframe/Serie.h>
#include <dataframe/stats.h>
#include <dataframe/map.h>
#include <dataframe/filter.h>
#include <iostream>
#include <iomanip>

int main() {
    // Create a Serie with some outliers
    df::Serie<double> temperatures{
        22.5, 23.1, 21.8, 22.7, 23.5, 22.9, 23.2, 35.6, 22.0, 23.4, 22.8, 10.2
    };
    
    std::cout << "Temperature data analysis:\n";
    
    // Get summary statistics
    auto stats = df::stats::summary(temperatures);
    
    // Print summary
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Number of measurements: " << stats["count"] << std::endl;
    std::cout << "Min temperature: " << stats["min"] << "°C" << std::endl;
    std::cout << "Max temperature: " << stats["max"] << "°C" << std::endl;
    std::cout << "Mean temperature: " << stats["mean"] << "°C" << std::endl;
    std::cout << "Median temperature: " << stats["median"] << "°C" << std::endl;
    std::cout << "Standard deviation: " << stats["std_dev"] << "°C" << std::endl;
    
    // Detect outliers
    auto outliers = df::stats::isOutlier(temperatures);
    
    // Count outliers
    int outlier_count = outliers.reduce([](int acc, bool is_outlier, size_t) {
        return acc + (is_outlier ? 1 : 0);
    }, 0);
    
    std::cout << "Number of outliers detected: " << outlier_count << std::endl;
    
    // Show outlier values
    std::cout << "Outlier values: ";
    for (size_t i = 0; i < temperatures.size(); ++i) {
        if (outliers[i]) {
            std::cout << temperatures[i] << "°C ";
        }
    }
    std::cout << std::endl;
    
    // Remove outliers and recalculate statistics
    auto cleaned_data = temperatures | df::filter([&outliers](double, size_t idx) {
        return !outliers[idx];
    });
    
    std::cout << "\nAfter removing outliers:\n";
    std::cout << "Mean temperature: " << df::stats::mean(cleaned_data) << "°C" << std::endl;
    std::cout << "Median temperature: " << df::stats::median(cleaned_data) << "°C" << std::endl;
    std::cout << "Standard deviation: " << df::stats::std_dev(cleaned_data) << "°C" << std::endl;
    
    // Calculate z-scores for the cleaned data
    auto z_scores = df::stats::z_score(cleaned_data);
    
    std::cout << "\nZ-scores for non-outlier temperatures:\n";
    for (size_t i = 0; i < cleaned_data.size(); ++i) {
        std::cout << "Temperature " << cleaned_data[i] << "°C: z-score = " 
                  << z_scores[i] << std::endl;
    }
    
    return 0;
}