incest porn

Sandbox Web


Changes Index Search

Webs Book Compare GPCE06 Gmt Gpce Gpce04 Gpce05 IFIPWG211 IPA06 Main Octave PEPM07 PEPM08 PHP Sandbox Sdf Stratego Sts TWiki Tiger Tools Transform Variability default porn free porn

Andrin Von Rechenberg Sandbox

Sandbox

Super Scalar Sample Sort

This is a paper I wrote for ETH, I thought I might as well publish it here

Introduction

Abstract

When to use

How to read this paper

The Sample Sort Algorithm

How it works

The input is an array of integers of size n. We take the first k numbers in the array and use them as splitters. Splitters are elements of the input that divide the input into smaller pieces. It is fairly simple: You take out the splitters and sort them (using quicksort or inline algorithm). Between the splitters you put buckets. If our input are the numbers 1-100 and our splitters are 32,57 and 73 it will look like this:
1          32          57          73          100
 \ Bucket /  \ Bucket /  \ Bucket /  \ Bucket /
  """"""""    """"""""    """"""""    """"""""
Once we have the splitters and the buckets, we run through the input sequentially and throw eveQry element in to the right bucket. Do the whole thing recursivly for every bucket by splitting the bucket into smaller buckets (with the elements in the bucket as the input) until there is at most one element in every bucket. On the way, store the buckets and the splitters in sequential order.

In the end we will have many small buckets in a sorted order that only contain at most one element. If we now run over all buckets and splitters sequentially we have the sorted input.
[Image]

The Speed-Ups (Super Scalar Sample Sort)

Let us look a little more into processor architecture. In the following paragraphs the word processor means an Intel Pentium 4 (later 90nm generation) processor.

An instruction is not executed by the processor in one step. In fact an instruction in a processor runs through 31 stages (like reading the instruction, getting all the data from memory that is needed to execute this instruction, performing the instruction, writing the result of the instruction, etc) called the execution pipeline. It would not be very wise to wait until every instruction ran through all the execution stages before processing the next one. That's why processors use instruction parallelism, which means that they already start the next instruction if possible, before the previous instruction has completly finished. This is not always possible:

Conditional Branches

When the processor comes to a Conditional Branch it tries to predict its direction in order to keep the execution pipeline full. The prediction rule is simple: If a processor reaches a conditional branch, it always evaluates the condition as true and hence executes the fist branch. When the processor finally evaluates the expression as false and hence mispredicted the branch it needs to flush the execution pipeline (which costs 31 cycles)!
In many cases a programmer can predict the control flow of an application and can optimize the conditional branches, so that at execution time, the processor doesn't mispredict branches a lot.
With sorting though the problem of predicting a branch becomes impossible. If we have a conditional branch and we have to predict whether an element will be smaller or greater than another element, we have a 50% chance of beeing correct. This means that every second time we compare two elements we loose 31 processor cycles.
element = randomNumber();
splitter = randomNumber();
if ( element < splitter ) {
   // The processor will predict this branch. There is a 50% chance of beeing correct
   LeftBucket.add(element)
} else {
   // If this is the correct branch, we lose 31 processor cycles.
   RightBucket.add(element)
}

The k-Splitters and the buckets

In order to get rid of conditional branches, we do a neat trick to place the elements in the correct bucket, without using conditional branches.

k = 1

If we have only one splitter (_k_=1) it is fairly easy. We put the two buckets in an array and we are using a boolean expression to find the correct bucket number in the array (assuming that true==1 and false==0):
element = randomNumber();
splitter = randomNumber();

BucketArray[0]=LeftBucket;
BucketArray[1]=RightBucket;

BucketArray[element>splitter].add(element);

any k

If we have any k (preferably a power of 2 -1) we build a binary search tree out of the k splitters in order to find the correct bucket.[Image] If we decide k to be small enough to fit the k splitters into the processors registers we do so, but usually we should choose k=255 to allow byte sized comparisons. We keep the 255 splitters in a buffered array.
This is the example with k=3, which results in 5 buckets:
element = randomNumber();
s1,s2,s3 = get3Splitters();
BucketArray[] = get5Buckets();

firstBranch= element > s2;
BucketArray[ firstBranch*2 
             + (!fistBranch & element > s1)*1 
             + (fistBranch & element > s3)*1 
           ].add(element)

Pre-Allocating Memory

We can get an additional speed-up, by first allocating buckets with the correct bucket sizes in memory. In order to that we need to know the bucket sizes in advance. This is how we do it: Instead of throwing the elements directly into the buckets, we first run over all elements and tell them in which bucket they will be thrown and at the same time keep track of the number of elements that will be thrown in each bucket. Then we allocate the buckets with the correct size that we know now and ask every elemnt in which bucket it will be thrown in and then throw it there.

For all that we use an auxilary array with the same size as the original input. Than we swap the two array pointers (the auxilary becomes the input array and vice-versa) and we do the next step.

Oversampling k

To have more equaly sized buckets we should oversample k by a factor a (look at a*k) possible splitters and take the most equally distributed k ones.

Data Dependencies

What it is

There is instruction A that comes before an instruction B. Data Dendency occurs if an B gets stalled because it waits for the result of A, but A is has not yet finished:
c = a + b;   // Instruction A
d = c + 1;   // Instruction B

Why we don't have it

As soon as we have the _k_-splitters fixed, all elements can be placed in the correct bucket independently.

The Speed-Up

Since we don't have any data dependency we can interleave the corresponding instructions for several elements. The compiler will automaticly be able to do loop-unrollment and software-pipelining.

Memory Hierarchies / Cache-Oblivious

n = Number of input elements
k = Number of splitters
M = Number of elements that fit into primary memory
B = Number of elements that can be transfered from secondary to primary memory in one step.

Since we can read through all the elements sequentially we have perfect spatial locality. Quicksort needs to write the entire input and output log(n/m) times before a subproblem fits into M. But since Super Scalar Sample Sort is like a divide-and-conquer algorithm and splits the problem into subproblems of about size n/k we only need to read and write the input log.k.(n/m) times. In the I/O Model with an omniscent cache of size M and transfers of size B, we have k=O(M/B).

By appropratly choosing the size of k, this algorithm becomes cache oblivious, under the assumption that all k-splitters fit into memory and they are optimally distributed.