sort Function - DataFrame Library Documentation

Overview

The sort function creates an ordered copy of a Serie based on various sorting criteria. It supports both ascending and descending order, custom comparator functions, sorting by a key function, and special handling for NaN values. The function is designed with immutability in mind, returning a new Serie instead of modifying the original.

Function Signatures

// Basic sort with order control
template <typename T>
Serie<T> sort(const Serie<T>& serie, 
                SortOrder order = SortOrder::ASCENDING,
                ExecutionPolicy exec = ExecutionPolicy::SEQ);

// Sort with custom comparator
template <typename T, typename Compare>
Serie<T> sort(const Serie<T>& serie, 
                Compare comp,
                ExecutionPolicy exec = ExecutionPolicy::SEQ);

// Sort by key function
template <typename T, typename KeyFunc>
Serie<T> sort_by(const Serie<T>& serie, 
                   KeyFunc key_func,
                   SortOrder order = SortOrder::ASCENDING,
                   ExecutionPolicy exec = ExecutionPolicy::SEQ);

// Sort with NaN handling
template <typename T>
Serie<T> sort_nan(const Serie<T>& serie, 
                    SortOrder order = SortOrder::ASCENDING,
                    bool nan_first = false,
                    ExecutionPolicy exec = ExecutionPolicy::SEQ);

// Pipeline binding functions
template <typename T>
auto bind_sort(SortOrder order = SortOrder::ASCENDING,
               ExecutionPolicy exec = ExecutionPolicy::SEQ);

template <typename Compare>
auto bind_sort_with(Compare comp, 
                    ExecutionPolicy exec = ExecutionPolicy::SEQ);

template <typename KeyFunc>
auto bind_sort_by(KeyFunc key_func, 
                  SortOrder order = SortOrder::ASCENDING,
                  ExecutionPolicy exec = ExecutionPolicy::SEQ);

Parameters

Parameter	Type	Description
serie	const Serie<T>&	The input Serie to sort.
order	SortOrder	The sort direction: ASCENDING or DESCENDING. Default is ASCENDING.
comp	Compare	Custom comparison function that should return true if the first element is less than the second.
key_func	KeyFunc	Function that extracts a key for comparison from each element.
nan_first	bool	If true, NaN values will be placed at the beginning; if false, at the end. Default is false.
exec	ExecutionPolicy	Execution policy for parallel algorithms (SEQ, PAR). Default is SEQ (sequential).

Return Value

All sort functions return a new Serie containing the same elements as the input Serie, but in sorted order. The returned Serie has the same type as the input Serie.

Example Usage

Basic Sorting

// Create a Serie with unsorted values
df::Serie<double> values{5.2, 1.7, 3.9, 2.1, 4.4};

// Sort in ascending order (default)
auto asc_values = df::sort(values);
// asc_values: {1.7, 2.1, 3.9, 4.4, 5.2}

// Sort in descending order
auto desc_values = df::sort(values, df::SortOrder::DESCENDING);
// desc_values: {5.2, 4.4, 3.9, 2.1, 1.7}

// Sort strings (lexicographically)
df::Serie<std::string> names{"Charlie", "Alice", "Bob", "Dave"};
auto sorted_names = df::sort(names);
// sorted_names: {"Alice", "Bob", "Charlie", "Dave"}

Custom Comparators

// Sort by absolute value
df::Serie<double> numbers{-5.0, 3.0, -1.0, 4.0, -2.0};
auto abs_sorted = df::sort(numbers, [](double a, double b) {
    return std::abs(a) < std::abs(b);
});
// abs_sorted: {-1.0, -2.0, 3.0, 4.0, -5.0}

// Case-insensitive string sorting
df::Serie<std::string> words{"Apple", "banana", "Cat", "dog"};
auto case_insensitive = df::sort(words, [](const std::string& a, const std::string& b) {
    std::string a_lower = a, b_lower = b;
    std::transform(a.begin(), a.end(), a_lower.begin(), ::tolower);
    std::transform(b.begin(), b.end(), b_lower.begin(), ::tolower);
    return a_lower < b_lower;
});
// case_insensitive: {"Apple", "banana", "Cat", "dog"}

Sorting Complex Types with Key Functions

// Define a Person struct
struct Person {
    std::string name;
    int age;
    double salary;
};

// Create a Serie of Person objects
df::Serie<Person> people{
    {"Alice", 32, 75000.0},
    {"Bob", 25, 65000.0},
    {"Charlie", 45, 85000.0},
    {"Dave", 32, 72000.0}
};

// Sort by age (ascending)
auto by_age = df::sort_by(people, [](const Person& p) { return p.age; });
// Result: Bob (25), Alice/Dave (32), Charlie (45)

// Sort by name (descending)
auto by_name = df::sort_by(people, 
                          [](const Person& p) { return p.name; },
                          df::SortOrder::DESCENDING);
// Result: Dave, Charlie, Bob, Alice

// Sort by salary (ascending)
auto by_salary = df::sort_by(people, [](const Person& p) { return p.salary; });
// Result: Bob, Dave, Alice, Charlie

Handling NaN Values

// Create a Serie with NaN values
df::Serie<double> data{5.0, std::nan(""), 3.0, std::nan(""), 1.0, 4.0};

// Sort with NaN values at the end (default)
auto sorted1 = df::sort_nan(data);
// sorted1: {1.0, 3.0, 4.0, 5.0, NaN, NaN}

// Sort with NaN values at the beginning
auto sorted2 = df::sort_nan(data, df::SortOrder::ASCENDING, true);
// sorted2: {NaN, NaN, 1.0, 3.0, 4.0, 5.0}

// Sort descending with NaN values at the end
auto sorted3 = df::sort_nan(data, df::SortOrder::DESCENDING);
// sorted3: {5.0, 4.0, 3.0, 1.0, NaN, NaN}

Pipeline Usage

// Create a Serie
df::Serie<double> values{5.2, 1.7, 3.9, 2.1, 4.4};

// Use in pipeline with bind_sort
auto sorted_squared = values
    | df::bind_map([](double x, size_t) { return x * x; })  // Square values
    | df::bind_sort<double>(df::SortOrder::ASCENDING);     // Sort squared values
// sorted_squared: {2.89, 4.41, 15.21, 19.36, 27.04}

// Use with custom comparator in pipeline
auto abs_sorted = df::Serie<double>{-5.0, 3.0, -1.0, 4.0, -2.0}
    | df::bind_sort_with([](double a, double b) {
        return std::abs(a) < std::abs(b);
      });
// abs_sorted: {-1.0, -2.0, 3.0, 4.0, -5.0}

// Complex pipeline with key function
auto result = people
    | df::bind_sort_by([](const Person& p) { return p.age; })  // Sort by age
    | df::bind_map([](const Person& p, size_t) {               // Extract names
        return p.name;
      });
// result: {"Bob", "Alice", "Dave", "Charlie"}

Parallel Sorting

// Create a large Serie
df::Serie<int> large_serie = df::random_uniform<int>(100000, 0, 1000000);

// Sort using parallel execution policy
auto sorted_parallel = df::sort(large_serie, df::SortOrder::ASCENDING, df::ExecutionPolicy::PAR);

// Parallel sort with custom comparator
auto parallel_custom = df::sort(large_serie, 
                               [](int a, int b) { return a % 10 < b % 10; },  // Sort by last digit
                               df::ExecutionPolicy::PAR);

// Parallel sort by key in pipeline
auto result = large_serie
    | df::bind_sort_by([](int x) { return x % 100; },  // Sort by last two digits
                      df::SortOrder::ASCENDING,
                      df::ExecutionPolicy::PAR);

Implementation Notes

All sort operations create a new Serie rather than modifying the input Serie in place.
The sort algorithm has an average case complexity of O(n log n), where n is the size of the Serie.
For small Series, parallel sorting may actually be slower due to overhead. Consider using sequential execution for small datasets.
NaN handling is particularly useful for floating-point data with missing values.
The sort implementation relies on std::sort internally and respects the C++ standard library guarantees.
The parallel execution policy is only effective when the library is compiled with parallel algorithms support.

sort