# Parallel Programming/Lesson 4

- Fundamental GPU algorithms

- Measure complexity

- Step complexity ("stages")

- Work complexity ("actual nodes in the tree")

- Efficient parallel algorithm can help reduce step complexity

- Work efficient: a parallel algorithm whose work complexity is asymptotically the same as the serial algorithm's

- Reduce

- Parallel implementation can have a smaller number of steps than a serial implementation

- Operator characteristics

- Serial: each loop iteration reuses the result of the previous iteration

- Parallel: break the computation into multiple trees – regroup operands so that multiple operations can run in parallel at the same time

- the regrouping is valid because the operator is associative

- Linear work, O(n)

- But O(log n) steps
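
The tree-shaped reduce can be sketched in Python (the function name, and the choice of + with identity 0, are illustrative):

```python
# Pairwise tree reduction: O(n) total work, and O(log n) steps if each
# level's independent additions were executed in parallel.
def tree_reduce(xs):
    xs = list(xs)
    while len(xs) > 1:
        if len(xs) % 2:          # pad odd-length levels with the identity
            xs.append(0)
        # one "step": all pairwise sums on this level are independent
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

print(tree_reduce(range(8)))  # 28
```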

- Brent's Theorem

- How much time will it take if we only have p processors

- If the maximum number of processors the algorithm can use at once is P, and P ≤ p, there is nothing to do

- but if P > p, step complexity grows by a factor of about P/p wherever we're short of processors

- Brent's theorem: something with t steps, m work items, and p processors takes at most t + (m - t) / p time

- (x + x / 2 + x / 2 + x / 4 + … + 2 + 1) / (x + 1) = 1 + 2(x / 2 + x / 4 + … + 1) / (x + 1)
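
As a sanity check on the t + (m - t)/p bound, a small worked example in Python (the values n = 1024 and p = 32 are illustrative, not from the lecture):

```python
import math

# Brent's bound for a tree reduce of n elements on p processors:
# t = log2(n) steps, m = n - 1 total additions.
n = 1024
t = int(math.log2(n))      # 10 steps
m = n - 1                  # 1023 work items
p = 32
bound = t + (m - t) / p
print(bound)               # 10 + 1013/32 = 41.65625
```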

- Potential improvements for reduction

- Have each thread reduce multiple items serially before the tree reduce

- Do the first step of the reduction while reading from global memory, before writing into shared memory

- Warps are synchronous: threads within a warp execute in lockstep, so the last levels of the tree need no explicit barrier
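
The first two improvements combine naturally: each thread serially sums several items, then a short tree reduce combines the partial sums. A pure-Python model of that structure (the thread count and chunking scheme are illustrative assumptions):

```python
# Each of p "threads" first serially sums its chunk (work-efficient),
# then a tree reduce combines the p partial sums in ~log2(p) steps.
def two_phase_reduce(xs, p=4):
    chunk = (len(xs) + p - 1) // p
    partials = [sum(xs[i * chunk:(i + 1) * chunk]) for i in range(p)]
    while len(partials) > 1:         # tree-reduce the partial sums
        if len(partials) % 2:
            partials.append(0)       # identity padding
        partials = [partials[i] + partials[i + 1]
                    for i in range(0, len(partials), 2)]
    return partials[0]

print(two_phase_reduce(list(range(100))))  # 4950
```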

- Scan

- Compaction, etc.

- Very useful applications in parallel

- Input array, binary associative operator, identity element

- [a_0, a_1, a_2, …, a_{n-1}] -> [I, a_0, a_0 + a_1, a_0 + a_1 + a_2, …]

- this formulation ignores the last element (it's an n -> n scan)

- Inclusive scan: f(i) = combination of everything up to, and also including, x_i

- Exclusive scan: f(i) = combination of everything up to, but not including, x_i
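
With + as the operator and 0 as the identity, the two flavors in Python (using itertools.accumulate; the sample input is arbitrary):

```python
from itertools import accumulate

xs = [3, 1, 4, 1, 5]
# inclusive: element i combines everything up to and including xs[i]
inclusive = list(accumulate(xs))                    # [3, 4, 8, 9, 14]
# exclusive: starts with the identity; the last input never contributes
exclusive = list(accumulate(xs[:-1], initial=0))    # [0, 3, 4, 8, 9]
print(inclusive, exclusive)
```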

- Parallelization

- The pattern in scan is very common – being able to parallelize it pays off significantly

- serial: n operations, n steps

- one approach: run a separate reduce for every single output (O(n^2) work, O(log n) steps)

- Hillis & Steele

- step efficient

- add neighbor 1 to the left; copy over the elements that have none

- then add neighbor 2 to the left / copy over the remaining

- repeat, doubling the hop each step

- O(log n) steps, O(n log n) work
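
A minimal Python sketch of the Hillis & Steele inclusive scan, where each pass of the loop models one parallel step (the operator defaults to + for concreteness):

```python
# Hillis & Steele inclusive scan: at each step, every element adds the
# element `hop` positions to its left; elements without one copy through.
# O(log n) steps, O(n log n) work.
def hillis_steele_scan(xs, op=lambda a, b: a + b):
    xs = list(xs)
    hop = 1
    while hop < len(xs):
        xs = [op(xs[i - hop], xs[i]) if i >= hop else xs[i]
              for i in range(len(xs))]
        hop *= 2
    return xs

print(hillis_steele_scan([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```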

- Blelloch – would be good to implement this

- reduce & downsweep

- downsweep starts by placing the identity element at the root

- add neighbors 1, 2, 4, etc. hops away

- work efficient

- ~2 log n steps, O(n) work – same work complexity as the serial implementation
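
A Python sketch of the Blelloch exclusive scan, assuming a power-of-two input length for simplicity; each while-loop iteration models one parallel step:

```python
# Blelloch exclusive scan: the upsweep builds a reduce tree in place,
# then the downsweep seeds the root with the identity and pushes
# partial sums back down. ~2 log n steps, O(n) work.
def blelloch_scan(xs):
    xs = list(xs)
    n = len(xs)                        # assumed to be a power of two
    d = 1
    while d < n:                       # upsweep (reduce)
        for i in range(2 * d - 1, n, 2 * d):
            xs[i] += xs[i - d]
        d *= 2
    xs[n - 1] = 0                      # identity at the root
    d = n // 2
    while d >= 1:                      # downsweep
        for i in range(2 * d - 1, n, 2 * d):
            xs[i - d], xs[i] = xs[i], xs[i] + xs[i - d]
        d //= 2
    return xs

print(blelloch_scan([3, 1, 4, 1, 5, 9, 2, 6]))  # [0, 3, 4, 8, 9, 14, 23, 25]
```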

- Choosing between the two

- More work than processors: prefer the more work-efficient algorithm

- more processors than work: prefer the more step-efficient algorithm

- Histogram

- exclusive sum scan

- parallelization requires atomics / synchronization

- reduce the cost by building local histograms, and then reducing them in parallel

- Alternatively, Sort & Reduce by Key
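
A pure-Python model of the local-histogram idea (the bin function `x % num_bins` and the thread count are illustrative assumptions):

```python
# Each "thread" builds a local histogram of its own chunk, needing no
# atomics; the locals are then combined with a per-bin reduce (which is
# itself parallelizable).
def histogram(data, num_bins, num_threads=4):
    chunk = (len(data) + num_threads - 1) // num_threads
    local_hists = []
    for t in range(num_threads):
        h = [0] * num_bins
        for x in data[t * chunk:(t + 1) * chunk]:
            h[x % num_bins] += 1        # illustrative bin function
        local_hists.append(h)
    return [sum(col) for col in zip(*local_hists)]  # reduce per bin

print(histogram([0, 1, 2, 1, 3, 1, 0, 2], 4))  # [2, 3, 2, 1]
```

The sort-and-reduce-by-key alternative instead sorts the values by bin so that each bin's elements form a contiguous segment, then reduces each segment.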

- Additional references: