Data access techniques to optimize code on modern processors:

  • Loop interchange: Reorder nested loops so that the innermost loop makes stride-1 accesses rather than larger-stride accesses, to exploit spatial locality
  • Loop fusion: Merge adjacent loops that share the same iteration space, a.k.a. loop jamming. Increases instruction-level parallelism and reduces loop overhead
  • Loop blocking: Transform a nested loop into a more deeply nested loop that works on small blocks (tiles) of the data at a time, to improve data locality. An example is the blocked algorithm for matrix multiplication. Also known as loop tiling.
  • Prefetching: Some processors provide instructions to prefetch data into cache (without loading the data into a CPU register)
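
The loop transformations above can be sketched in C as follows. This is a minimal illustration, not a tuned implementation: it assumes square n x n matrices stored in row-major order as flat arrays, and all function names and the block-size parameter `bs` are hypothetical choices made for this example.

```c
#include <stddef.h>

/* Before interchange: the inner loop walks down a column of a row-major
   n x n matrix, so consecutive accesses are n elements apart. */
double sum_column_order(const double *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}

/* After interchange: the inner loop walks along a row (stride-1). */
double sum_row_order(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Loop fusion: "scale x" and "add x into y" were two separate loops over
   the same range; here they share one loop. Assumes x and y do not overlap. */
void scale_then_add_fused(double *x, double *y, size_t n, double alpha) {
    for (size_t i = 0; i < n; i++) {
        x[i] *= alpha;
        y[i] += x[i];
    }
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Loop blocking: C += A * B on n x n row-major matrices. The three outer
   loops step over bs x bs tiles; the three inner loops work inside a tile.
   Assumes bs > 0; bs does not have to divide n. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t bs) {
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Both summation functions return the same value; only the order of memory accesses differs. In the blocked multiply, the block size would normally be chosen so that the tiles in use fit in cache together, which depends on the target machine.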

Data layout techniques to optimize code:

  • Array padding: To prevent conflict misses, insert padding (e.g. one cache line) between two consecutively declared arrays whose sizes are powers of 2, so that elements A[i] and B[i] map to different cache lines. Inter-array padding goes between different arrays; intra-array padding goes inside a single array to reduce self-interference.
  • Array merging: Also known as group and transpose. Instead of separate arrays A and B, create a single array AB in which each element is a pair (e.g. a struct in C)
  • Array transpose: Transpose the array to turn column-wise access into row-wise access in arrays stored in row-major order
  • Data copying: Create a temporary transposed copy of the array before intensive operations
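
The layout techniques above can be sketched in C as follows. This is an illustrative sketch under stated assumptions: the 64-byte cache line size, the array length `N`, and all type and function names are hypothetical values chosen for the example, and the real cache line size depends on the processor.

```c
#include <stddef.h>

#define N 1024          /* power-of-2 array length, for illustration only */
#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Inter-array padding: one cache line of padding separates a and b, so
   a[i] and b[i] are no longer an exact power-of-2 number of bytes apart. */
struct padded_arrays {
    double a[N];
    char pad[CACHE_LINE];
    double b[N];
};

/* Array merging: instead of separate arrays A and B, each element of the
   merged array holds both values, so a[i] and b[i] sit side by side. */
struct pair {
    double a;
    double b;
};

/* Dot product over two separate arrays. */
double dot_separate(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* The same dot product over the merged array of pairs. */
double dot_merged(const struct pair *ab, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += ab[i].a * ab[i].b;
    return s;
}

/* Data copying: write a transposed copy of a rows x cols row-major matrix
   into dst (cols x rows), so later column-wise work on src can become
   row-wise work on dst. */
void transpose_copy(const double *src, double *dst,
                    size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}
```

Whether the copy pays off depends on how many times the transposed data is reused afterwards, since the copy itself still has to touch the source in a strided pattern once.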

