# Parallel Programming/Lesson 4

• Fundamental GPU algorithms
• Reduce
• Scan
• Histogram
• Measure complexity
• Step complexity ("stages")
• Work complexity ("actual nodes in the tree")
• Efficient parallel algorithm can help reduce step complexity
• Work efficient: the work complexity of the parallel algorithm is asymptotically the same as that of the serial algorithm
• Reduce
• Parallel implementation can have a smaller number of steps than a serial implementation
• Operator characteristics
• Binary
• Associative
• Serial: each loop iteration depends on the result of the previous one
• Parallel: break the computation into a tree – regroup operands so multiple operations can run in parallel at the same time
• legal because the operator is associative
• Linear work, O(n)
• But O(log n) steps
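The tree-shaped reduce can be sketched serially; this is my own Python illustration (not code from the lecture), where each loop pass stands in for one parallel step:

```python
# Sketch: serial simulation of a tree reduce. Each pass combines adjacent
# pairs, so n elements need ceil(log2 n) passes (step complexity) while
# performing only n - 1 total operations (work complexity).
def tree_reduce(xs, op):
    steps = 0
    while len(xs) > 1:
        # one "parallel step": all pairs could be combined concurrently
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
        steps += 1
    return xs[0], steps

total, steps = tree_reduce(list(range(16)), lambda a, b: a + b)
# 16 elements: sum in 4 steps instead of 15 serial steps
```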
• Brent's Theorem
• How much time will it take if we only have p processors?
• If the maximum number of processors the algorithm can use at once is P, and P ≤ p, nothing changes
• but if P > p, step complexity grows by a factor of (P/p) wherever we're short of processors
• Brent's theorem: at most t + (m − t) / p time to run something that has t steps, m work items, and p processors
• (x + 2(x/2 + x/4 + … + 1)) / (x + 1) = 1 + 2(x/2 + x/4 + … + 1) / (x + 1)
• ≈ 1 + 2(x − 1) / (x + 1) ≈ 3
• Potential improvements for reduction
• Multiple items per thread reduction
• Do the first reduction step while reading from global memory, before writing to shared memory
• Warps are synchronous – threads in a warp execute in lockstep, so intra-warp steps need no explicit barrier
• Scan
• Compaction, etc.
• Very useful applications in parallel
• Input array, binary associative operator, identity element
• [a0, a1, a2, …, a(n−1)] → [I, a0, a0 ⊕ a1, a0 ⊕ a1 ⊕ a2, …]
• this formulation ignores the last input element (an n → n scan)
• Inclusive scan: output i combines everything up to and including xi
• Exclusive scan: output i combines everything up to but not including xi
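The two variants can be sketched serially; a minimal Python illustration (my own, not from the lecture), assuming the operator is + and the identity is 0:

```python
# Sketch: serial inclusive and exclusive scans with operator + and identity 0.
def inclusive_scan(xs):
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)          # includes x_i itself
    return out

def exclusive_scan(xs, identity=0):
    out, acc = [], identity
    for x in xs:
        out.append(acc)          # everything *before* x_i
        acc += x                 # the last element is never emitted
    return out

# inclusive_scan([1, 2, 3, 4]) -> [1, 3, 6, 10]
# exclusive_scan([1, 2, 3, 4]) -> [0, 1, 3, 6]
```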
• Parallelization
• The pattern in scan is very common – being able to parallelize it pays off significantly
• serial: n operations, n steps
• one approach: a separate reduce for every single output (O(n²) work, O(log n) steps)
• Hillis & Steele
• step efficient
• step 1: add the neighbor 1 to the left / copy over remaining
• step 2: add the neighbor 2 to the left / copy over remaining
• repeat, doubling the hop each step
• O(log n) steps, O(n log n) work
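The steps above can be sketched as a serial simulation (my own Python, one list comprehension per parallel step):

```python
# Sketch of the Hillis & Steele inclusive scan: on step d, each element
# adds its neighbor 2**d to the left; elements with no such neighbor just
# copy over. log2(n) steps, but O(n log n) total additions.
def hillis_steele_scan(xs):
    xs = list(xs)
    hop = 1
    while hop < len(xs):
        # one "parallel step": every position updates simultaneously
        xs = [xs[i] + xs[i - hop] if i >= hop else xs[i]
              for i in range(len(xs))]
        hop *= 2
    return xs

# hillis_steele_scan([1, 2, 3, 4]) -> [1, 3, 6, 10]
```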
• Blelloch – would be good to implement this
• reduce & downsweep
• downsweep starts with the identity element
• add neighbors 1, 2, 4, etc. hops away
• work efficient
• 2 log n steps, O(n) work – same work complexity as the serial implementation
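A serial sketch of the two phases (my own Python, assuming a power-of-two input length and operator + with identity 0):

```python
# Sketch of the Blelloch exclusive scan: an upsweep (reduce) phase builds
# partial sums at hops 1, 2, 4, ...; the downsweep replaces the root with
# the identity and swap-and-adds back down. ~2 log2(n) steps, O(n) work.
def blelloch_scan(xs):
    xs = list(xs)
    n = len(xs)                      # assumed to be a power of two
    # Upsweep: combine neighbors 1, 2, 4, ... hops away.
    hop = 1
    while hop < n:
        for i in range(2 * hop - 1, n, 2 * hop):
            xs[i] += xs[i - hop]
        hop *= 2
    # Downsweep: start from the identity at the root.
    xs[n - 1] = 0
    hop = n // 2
    while hop >= 1:
        for i in range(2 * hop - 1, n, 2 * hop):
            # left child gets the parent; right child gets parent + old left
            xs[i - hop], xs[i] = xs[i], xs[i] + xs[i - hop]
        hop //= 2
    return xs

# blelloch_scan([1, 2, 3, 4]) -> [0, 1, 3, 6]
```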
• Choosing between the two
• more work than processors: prefer the more work-efficient algorithm (Blelloch)
• more processors than work: prefer the more step-efficient algorithm (Hillis & Steele)
• Histogram
• exclusive sum scan
• parallelization requires atomics / synchronization
• reduce the cost by building local (per-block) histograms, then combining them with a parallel reduce
• Alternatively, sort by bin and then reduce by key
• map/reduce?
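The local-histogram idea can be sketched serially; this is my own Python illustration (chunk size and bin count are illustrative, not from the lecture):

```python
# Sketch of the privatized-histogram approach: each "thread block" builds
# a local histogram over its chunk with no cross-block contention, then
# the local histograms are summed bin by bin (a reduce per bin).
def histogram(data, num_bins, chunk=4):
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Phase 1 (parallel across blocks in reality): local histograms.
    local_hists = []
    for c in chunks:
        h = [0] * num_bins
        for v in c:
            h[v] += 1        # atomic on a GPU, but only within the block
        local_hists.append(h)
    # Phase 2: reduce the local histograms into the global one.
    return [sum(h[b] for h in local_hists) for b in range(num_bins)]

# histogram([0, 1, 1, 2, 3, 3, 3, 0], 4) -> [2, 2, 1, 3]
```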