Monday, September 8, 2014

Iterators part one: Lazy shuffling with the Knuth Shuffle

[This is part one in a series about iterators.]

tl;dr: Generate a random permutation of $1,2 \dots, n$ lazily. Skip to the section titled Lazy Shuffle if you already know how the Knuth Shuffle works.

Introduction

Suppose you are working on something where you needed to create a random permutation of $1,2,\dots, n$ and process the random permutation one element at a time, potentially stopping the processing long before you run out of the n elements. How many elements you need to process is dependent on the elements you see (i.e. you don't know it before hand).

For example, n could be a billion, and you process the first 1000 elements in some cases, 100 in some etc.

A well known way to create a random permutation/shuffle is the Knuth Shuffle. In the Knuth shuffle, you are given an array, and the algorithm creates a random, in-place permutation of the array, in $O(n)$ time.

The idea of the Knuth shuffle is actually quite simple.

An explanation of Knuth Shuffle


The array consists of two portions, the shuffled portion and the yet to be shuffled portion. Initially the whole array is yet to be shuffled.

The shuffled portion will grow from the right to the left (i.e. initially empty, then $(A[n-1])$, then $(A[n-1], A[n-2])$ etc). You now pick a random object from the yet to be shuffled portion and move it to the location which will grow the shuffled portion by one. This will result in the displacement of a yet to be shuffled object. That object you move to the empty space created by the newly selected object that was just moved to the shuffled portion.


Lazy Shuffle


We could apply the Knuth shuffle to our problem, by creating an array $A$ of size n, such that $A[i] = i$, performing the Knuth shuffle on it, and process the elements in the order $A[0], A[1], \dots$.

Since n could be a billion, the time and space costs of this could be wasteful.

A practical approach would be to repeatedly generate a random number from $1$ to $n$ and check if you have generated that number before. If you have seen it before, you regenerate. This is a big improvement, but still wasteful, with the potential to run for a long time.

We can actually make the Knuth Shuffle work for us in a lazy fashion, without having to pay the $O(n)$ space or time costs.

Assuming we need to process only $m$ elements ($m$ unknown to us), there is a way to make only $m$ random number calls, and use $O(m)$ space, and run in $O(m)$ expected time.

We can achieve this by simulating the array in the Knuth shuffle by a hashtable.

[We can also use a sparse array implementation of the array, which will use $\Theta(n)$ space, but the running time will still be $O(m)$]

If we are simulating the array $A$, then the key will be $i$, and the value will be $A[i]$. The hashtable will only contain entries for elements that have been touched (read or written to). The first time we try to read $A[i]$, the hashtable will create an entry for it, setting the value to be $i$.

This allows us to provide the hasNext, next style iterator for the shuffled elements.

Here is what the code might look like (C++).

#include <map>  // For unordered_map
#include <random>  // Assume exists with some API.

// Sample usage:
// LazyShuffle ls(1000000);
// while (ls.has_next()) {
//     int v = ls.next();
//     Status s = process(v);
//     if (s.Done()) break;
// }
class LazyShuffle {
public:
    // generates a random permutation of 0 to n-1.
    LazyShuffle(int n)
        : _to_shuffle(n) {
    };

    bool has_next() {
        return _to_shuffle > 0;
    }

    int next() {
        // Do the Knuth shuffle here.
        // [yet to be shuffled] [shuffled]

        // Generate a random number from
        // 0 to _to_shuffle-1;
        _to_shuffle--;
        int selected = _random.generate(0, _to_shuffle);

        // Now swap this with element which 
        // grows the shuffled part
        swap(selected, _to_shuffle);

        return get(_to_shuffle);
    }

private:
    void set(int idx, int value) {
        _array[idx] = value;
    }

    int get(int idx) {
        if (_array.find(idx) == _array.end()) {
            _array[idx] = idx;
        }
        return _array[idx];
    } 

    void swap(int idx1, int idx2) {
        int val1 = get(idx1);
        int val2 = get(idx2);
        set(idx1, val2);
        set(idx2, val1);
    }

    // We use unordered_map, but we can use a hashtable
    // with expected O(1) time operations.
    std::unordered_map<int, int> _array; 

    // The number of elements not shuffled.
    int _to_shuffle;

    // Assume exists
    random::Generator _random;
};


No comments:

Post a Comment