

#### Innocenzo Mungiello Tutor: Alessandro Cilardo XXXI Cycle - III year presentation

#### Improving Multibank Memory Access Parallelism on SIMT Architectures





# Background

- Master of Science:
  - Cum laude in "Ingegneria Informatica" at University of Naples "Federico II"
- DIETI Group:
  - Seclab
- Type of Fellowship:
  - No Grant
- Collaboration:
  - With CeRICT in the context of the european project
    MANGO





| Scuola Politecnica e                          | Ser |
|-----------------------------------------------|-----|
| Scuola Politecnica e<br>delle Scienze di Base | MOR |
|                                               |     |



Università degli Studi di Napoli Federico II



#### Training Activities

Student: Innocenzo Mungiello

Tutor: Alessandro Cilardo

Cycle XXXI

innocenzo.mungiello@unina.it

acilardo@unina.it

|          | Credits year 1 |         |         |         |         | Credits year 2 |         |         |           |         |         | Credits year 3 |         |         |         |         |           |         |         |         |         |         |         |         |       |        |
|----------|----------------|---------|---------|---------|---------|----------------|---------|---------|-----------|---------|---------|----------------|---------|---------|---------|---------|-----------|---------|---------|---------|---------|---------|---------|---------|-------|--------|
|          |                | ~       | 2       | с       | 4       | 5              | 9       |         |           | ~       | 2       | с              | 4       | 5       | 9       |         |           | ~       | 2       | с       | 4       | 5       | 9       |         |       |        |
|          | Estimated      | bimonth | bimonth | bimonth | bimonth | bimonth        | bimonth | Summary | Estimated | bimonth | bimonth | bimonth        | bimonth | bimonth | bimonth | Summary | Estimated | bimonth | bimonth | bimonth | bimonth | bimonth | bimonth | Summary | Total | Check  |
| Modules  | 20             | 0       | 6       | 3       | 5       | 0              | 9       | 23      | 10        | 9       | 3       | 0              | 0       | 0       | 6       | 18      |           | 0       | 2       | 0       | 0       | 0       | 0       | 2       | 43    | 30-70  |
| Seminars | 5              | 1,8     | 0,4     | 0,7     | 1       | 0              | 1,3     | 5,2     | 5         | 2,9     | 1,3     | 1,4            | 0       | 0       | 0       | 5,6     |           | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 11    | 10-30  |
| Research | 35             | 7       | 4       | 6       | 4       | 7              | 4       | 32      | 45        | 1       | 5       | 7              | 10      | 10      | 3       | 36      | 60        | 10      | 8       | 10      | 10      | 10      | 10      | 58      | 126   | 80-140 |
|          | 60             | 8,8     | 10,4    | 9,7     | 10,0    | 7,0            | 14,3    | 60,2    | 60        | 13      | 9,3     | 8,4            | 10      | 10      | 9       | 60      | 60        | 10      | 10      | 10      | 10      | 10      | 10      | 60      | 180   | 180    |



# The End of Dennard Scaling

- Processors aren't getting faster, just wider
  - Future gains in performance are from parallelism
- Process matters less
  - One generation is 1.2x, not 2.8x
  - Dark silicon problem
- Future systems are energy limited
  - Efficiency *IS* Performance





## Heterogeneous Computing

- Today in a system we have:
  - manycore CPU
  - GPU
  - MIC
  - FPGA

- The goal of the parallel revolution:
  - programs for parallel easy as programs for sequential





#### Problem: The Unbroken Memory Wall

- Memory wall remains a fundamental limit to system performance
- Also in terms of performance per watt
  - Only the 15% of energy consumption is used for useful computation



#### All memory levels must be Optimized!



# A sub-problem: Bank Conflicts

- A bank conflict occurs when:
  - two or more threads *in a warp* access different words in the same bank
    - Think: two or more threads access different "rows" in the same bank
  - N-way bank conflict: N threads in a warp conflict
  - Instruction gets issued N times: increases latency
- There is no bank conflict if:
  - Several threads access the same word
  - Several threads access different bytes of the same word





## A sub-problem: Conflicts

- Shared memory performance may suffer bank conflicts
- Waste of memory ++ conflict resolution Trade off
- Shared memory can become a limiting factor



# Related Works (1/2)

- Several techniques have been presented to solve bank conflicts and reduce memory access latency:
  - NVIDIA Padding: wastes shared memory to solve conflicts
  - Mrugesh Gajjar and Ismayil Guracar<sup>1</sup> do not modify the data layout of input and output arrays but they exploit the computational structure of the convolution filtering operation to modify the algorithm and avoid bank conflicts.
  - Akihiko Kasagi, Koji Nakano, and Yasuaki Ito<sup>2</sup> used a Graph-Coloring technique in order to implement a conflict-free permutation algorithm on these models. They must use extra data structures in the kernel code, in order to implement the technique.
- 1. Mrugesh Gajjar and Ismayil Guracar, Efficient rate conversion filtering on GPUs with shared memory access pattern scrambling, Signal Processing Systems (SiPS), 2016 IEEE International Workshop on, IEEE, 2016.
- 2. Akihiko Kasagi, Koji Nakano, and Yasuaki Ito, An implementation of conflict-free offline permutation on the GPU, Networking and Computing (ICNC), 2012 Third International Conference on, IEEE, 2012.



# Related Works (2/2)

- Several techniques have been presented to solve bank conflicts and reduce memory access latency:
  - Dymaxion<sup>3</sup> is an automatic tool that analyses GPU kernels code and provides a data layout transformation in order to improve memory coalescing accesses.
  - Random Address Shift<sup>4</sup> uses a vector of random numbers in order to perform a shift of the elements in shared memory and avoid conflicts. The technique must use extra scratch-pad memory in order to store data structures
- All these techniques:
  - Waste shared memory;
  - Are not focused on finding a relationship between the memory access pattern, the shared memory bank conflict and the power consumption.
- 3. Shuai Che, Jeremy W Sheaer, and Kevin Skadron, Dymaxion: Optimizing memory access patterns for heterogeneous systems, Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, ACM, 2011.
- 4. Koji Nakano, Susumu Matsumae, and Yasuaki Ito, The random address shift to reduce the memory access congestion on the discrete memory machine, Computing and Networking (CANDAR), 2013 First International Symposium on, IEEE, 2013.



## Idea: Use Matematical Models to solve Bank Conflicts

- Polyhedral Model Transformation
  - Useful in some conditions
  - can achieve a 30% increase in performance per watt<sup>5</sup>
  - Find the right *T* matrix is an hardest work
  - Wastes shared memory
- Linear Programming Model
  - Useful in any condition
  - Does not waste shared memory
  - Too many solutions
  - 5. A. Cilardo, I. Mungiello, "Experimental evaluation of memory optimizations on an embedded GPU platform", 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2015



$$Banks(x, y) = \begin{bmatrix} 1 & N \end{bmatrix} \cdot \begin{bmatrix} a & b \\ c & d \end{bmatrix} \cdot \begin{bmatrix} x \\ y \end{bmatrix} mod \ banks$$

 $T = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ 





## Polyhedral Model Transformation



6. Darte, Alain, Michele Dion, and Yves Robert. "A characterization of one-to-one modular mappings." *Parallel Processing Letters 6.01* (1996): 145-157.



## Polyhedral Model Transformation: Implementation

- Using NVIDIA Jetson TK1 for
  - Running two Kernels:
    - One Non-Optimized
    - One Optimized
- With the Digilent Analog Discovery,
  - measure the power consumption through Resistor *R5C11*





#### Polyhedral Model Transformation: Results

- Matrix Multiplication, Non-Optimized
  - Duration: 6,5 s
  - Energy consumption: 5,2 J
  - 10,816 MFLOPS/watt
- Matrix Multiplication, Optimized
  - Duration: 5 s
  - Energy consumption: 4 J
  - 14,0608 MFLOPS/watt, 30% more
  - Use 3,73% more memory for the same kernel
  - Perform 3,84% more instructions for the same kernel







#### Polyhedral Model Transformation: Trade off

- Shared memory performance may suffer bank conflicts
- Waste of memory / conflict resolution Trade off
- Shared memory can become a limiting factor





## Idea: Use Matematical Models to solve Bank Conflicts

- Polyhedral Model Transformation
  - Useful in some conditions
  - can achieve a 30% increase in performance per watt<sup>5</sup>
  - Find the right *T* matrix is an hardest work
  - Wastes shared memory

#### Linear Programming Model<sup>7</sup>

- Useful in any condition
- Does not waste shared memory
- Too many solutions

 $\begin{aligned} &\mininimize\left(\sum_{t=1}^{N_{TH}}\sum_{b=1}^{N_{BK}}\sum_{i=1}^{N_{IT}}x_{t,b,i}\right)\\ &\text{subject to :}\\ &\sum_{i=1}^{N_{IT}}x_{t,b,i}=1 \quad \forall \ t\in \ [1,N_{TH}] \quad , \quad \forall \ b\in \ [1,N_{BK}]\\ &\sum_{b=1}^{N_{BK}}x_{t,b,i}=1 \quad \forall \ t\in \ [1,N_{TH}] \quad , \quad \forall \ i\in \ [1,N_{IT}]\\ &\sum_{t=1}^{N_{TH}}x_{t,b,i}=1 \quad \forall \ b\in \ [1,N_{BK}] \quad , \quad \forall \ i\in \ [1,N_{IT}]\\ &\sum_{t=1}^{N_{TH}}x_{t,b,i}=1 \quad \forall \ t\in \ [1,N_{TH}] \quad , \quad \forall \ i\in \ [1,N_{IT}]\\ &x_{t,1,i}=1 \quad \forall \ t\in \ [1,N_{TH}] \quad \forall \ i\in \ [1,N_{IT}] \land \ t=i\\ &x_{t,b,i} \quad \text{is binary} \end{aligned}$ 

7. A. Cilardo and I. Mungiello, "Zero-conflict Memory Mapping for Transpose-like Kernels in SIMT Architectures", *Journal of Parallel and Distributed Computing*, 2018, submitted.



#### Linear Programming Model: Implementation

 $\begin{aligned} &\mininimize\left(\sum_{t=1}^{N_{TH}}\sum_{b=1}^{N_{BK}}\sum_{i=1}^{N_{IT}}x_{t,b,i}\right) \\ &\text{subject to :} \\ &\sum_{i=1}^{N_{IT}}x_{t,b,i}=1 \quad \forall \ t \in \ [1,N_{TH}] \quad , \quad \forall \ b \in \ [1,N_{BK}] \\ &\sum_{b=1}^{N_{BK}}x_{t,b,i}=1 \quad \forall \ t \in \ [1,N_{TH}] \quad , \quad \forall \ i \in \ [1,N_{IT}] \\ &\sum_{t=1}^{N_{TH}}x_{t,b,i}=1 \quad \forall \ b \in \ [1,N_{BK}] \quad , \quad \forall \ i \in \ [1,N_{IT}] \\ &x_{t,1,i}=1 \quad \forall \ t \in \ [1,N_{TH}] \quad \forall \ i \in \ [1,N_{IT}] \land \ t = i \\ &x_{t,b,i} \quad \text{is binary} \end{aligned}$ 

#### • ILP Model

- Generalizes on existing conflictavoiding techniques
- Does not waste shared memory
- supporting a systematic exploration of feasible mapping schemes
- Not all solutions are useful for a SIMT architecture

#### • Filtering Tool-chain

 check on the solutions generated by the ILP model, filtering SIMT feasible solutions





# Linear Programming Model: Results

- Discovered some useful mapping schemes:
  - Adaptive Modular Mapping<sup>8</sup>
  - Triangular Based Mapping
  - Inverse Adaptive Modular
    Mapping
  - Inverse Triangular Based
    Mapping
  - And more...



8. A. Cilardo, I. Mungiello and F. De Rosa, "Adaptive Modular Mapping to Reduce Shared Memory Bank Conflicts on GPUs", 11th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2016



#### Linear Programming Model: Adaptive Modular Mapping







#### Linear Programming Model: Triangular Based Mapping





#### Linear Programming Model: Experimental Results (TK1)











#### Innocenzo Mungiello

## Linear Programming Model: Experimental Results (TK1)

| Parameters               | Mapping | DCT    | Transpose | Convolution<br>Col | n Convolution<br>Row | ı Lud<br>Perime-<br>ter | Lud<br>Diago-<br>nal |
|--------------------------|---------|--------|-----------|--------------------|----------------------|-------------------------|----------------------|
| Dataset Size (MB)        | All     | 664.06 | 256       | 256                | 256                  | 1024                    | 1024                 |
| CUDA Block Size          | All     | 128    | 256       | 128                | 64                   | 64                      | 64                   |
|                          | Native  | 12     | 3         | 6                  | 6                    | 4                       | 3                    |
| Resident CUDA Blocks     | Padding | 11     | 2         | 5                  | 5                    | 3                       | 2                    |
| Resident CODA Blocks     | AMM     | 12     | 3         | 6                  | 6                    | 4                       | 3                    |
|                          | TBM     | 12     | 3         | 6                  | 6                    | 4                       | 3                    |
|                          | Native  | 19     | 23        | 33                 | 32                   | 36                      | 34                   |
| Registers                | Padding | 19     | 23        | 33                 | 32                   | 36                      | 34                   |
| Registers                | AMM     | 32     | 30        | 33                 | 41                   | 36                      | 35                   |
|                          | TBM     | 32     | 28        | 41                 | 49                   | 36                      | 35                   |
|                          | Native  | 4      | 16        | 8                  | 8                    | 12                      | 16                   |
| Shared Memory (KB)       | Padding | 4.125  | 16.25     | 8.125              | 8.125                | 12.125                  | 16.25                |
| Shared Memory (RD)       | AMM     | 4      | 16        | 8                  | 8                    | 12                      | 16                   |
|                          | TBM     | 4      | 16        | 8                  | 8                    | 12                      | 16                   |
|                          | Native  | 25%    | 3%        | 1.56%              | 3.12%                | 8.54%                   | 4.52%                |
| Shared Memory Efficiency | Padding | 50%    | 50%       | 50%                | 50%                  | 50%                     | 50%                  |
| Shared Memory Eniciency  | AMM     | 50%    | 50%       | 50%                | 50%                  | 50%                     | 50%                  |
|                          | TBM     | 50%    | 50%       | 50%                | 50%                  | 50%                     | 50%                  |
|                          | Native  | 4-way  | 32-way    | 32-way             | 16-way               | 16-way                  | <i>20</i> -way       |
| N-way Conflicts          | Padding | None   | None      | None               | None                 | None                    | None                 |
| 11-way Connicts          | AMM     | None   | None      | None               | None                 | None                    | None                 |
|                          | TBM     | None   | None      | None               | None                 | None                    | None                 |
|                          | Native  | 73.04% | 36%       | 36%                | 18%                  | 12%                     | 9%                   |
| GPU Occupancy            | Padding | 67.03% | 24%       | 24%                | 14%                  | 9%                      | 6%                   |
| or o occupancy           | AMM     | 73.04% | 36%       | 36%                | 18%                  | 12%                     | 9%                   |
|                          | TBM     | 73.04% | 36%       | 36%                | 18%                  | 12%                     | 9%                   |









| Parameters               | Mapping | DCT    | Transpose      | Convolution<br>Col | Convolution<br>Row | Lud<br>Perime-<br>ter | Lud<br>Diago-<br>nal |
|--------------------------|---------|--------|----------------|--------------------|--------------------|-----------------------|----------------------|
| Dataset Size (MB)        | All     | 1638   | 1024           | 1024               | 1024               | 1024                  | 1024                 |
| CUDA Block Size          | All     | 128    | 256            | 512                | 256                | 64                    | 64                   |
|                          | Native  | 16     | 4              | 2                  | 2                  | 5                     | 4                    |
| Resident CUDA Blocks     | Padding | 15     | 3              | 1                  | 1                  | 5                     | 3                    |
| Resident CODA DIOCKS     | AMM     | 12     | 4              | 2                  | 2                  | 5                     | 4                    |
|                          | TBM     | 12     | 4              | 2                  | 2                  | 5                     | 4                    |
|                          | Native  | 25     | 32             | 32                 | 32                 | 32                    | 32                   |
| Registers                | Padding | 24     | 32             | 32                 | 32                 | 32                    | 32                   |
| negisters                | AMM     | 37     | 32             | 32                 | 40                 | 32                    | 32                   |
|                          | TBM     | 37     | 32             | 48                 | 48                 | 32                    | 32                   |
|                          | Native  | 4      | 16             | 32                 | 32                 | 12                    | 16                   |
| Shared Memory (KB)       | Padding | 4.125  | 16.25          | 32.5               | 33                 | 12.125                | 16.25                |
| Shared Memory (RD)       | AMM     | 4      | 16             | 32                 | 32                 | 12                    | 16                   |
|                          | TBM     | 4      | 16             | 32                 | 32                 | 12                    | 16                   |
|                          | Native  | 30%    | 6.1%           | 3.1%               | 3.12%              | 6.2%                  | 6%                   |
| Shared Memory Efficiency | Padding | 100%   | 100%           | 100%               | 100%               | 100%                  | 100%                 |
| Shared Memory Eniciency  | AMM     | 100%   | 100%           | 100%               | 100%               | 100%                  | 100%                 |
|                          | TBM     | 100%   | 100%           | 100%               | 100%               | 100%                  | 100%                 |
|                          | Native  | 4-way  | <i>32</i> -way | <i>32</i> -way     | 16-way             | 16-way                | <i>20</i> -way       |
| N-way Conflicts          | Padding | None   | None           | None               | None               | None                  | None                 |
| N-way Conflicts          | AMM     | None   | None           | None               | None               | None                  | None                 |
|                          | TBM     | None   | None           | None               | None               | None                  | None                 |
|                          | Native  | 96.83% | 47%            | 49.7%              | 23.2%              | 15.5%                 | 12.4%                |
| GPU Occupancy            | Padding | 90.50% | 35.5%          | 41.7%              | 21.4%              | 15.4%                 | 9.1%                 |
| Gr U Occupancy           | AMM     | 71.90% | 46.6%          | 48%                | 24.3%              | 15.4%                 | 12.5%                |
|                          | TBM     | 72.3%  | 46.3%          | 48%                | 23.7%              | 15.4%                 | 12.5%                |



#### Conclusions

- Polyhedral Model Transformation
  - A methodology based on Z-modules, Left Hermite Normal Form and Smith Normal Form has been proved to be effective to provide improvements on SIMT architecture performance also in terms of power consumption.
  - prove that there exists a strong relationship between the access pattern to the memory, the shared memory bank conflicts and the power consumption

#### Linear Programming Model

- Approach that explores the solution space in order to find memory mapping schemes that avoid bank conflicts and memory waste.
- The results pointed out a significant impact of the specific mapping choice adopted as a result of this analysis.



#### Future Works

- Polyhedral Model Transformation
  - Automate the processing of the transformation

#### • Linear Programming Model

- Automate the entire process, from the discovery of the mapping scheme to the source code transformation
  - E.g. implementing a parser which applies the scheme chosen by the user before the compilation process
  - insert an ad-hoc hardware component that routes the memory requests to the corresponding banks by computing the mapping dynamically
- Explore the other factors that affect the performance and the power consumption
  - E.g Half-Precision Floating Point Arithmetic in AI problems



#### Publications

- A. Cilardo, I. Mungiello, "Experimental evaluation of memory optimizations on an embedded GPU platform", 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2015
- A. Cilardo, I. Mungiello and F. De Rosa, "Adaptive Modular Mapping to Reduce Shared Memory Bank Conflicts on GPUs", 11th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2016
- A. Cilardo and I. Mungiello, "Zero-conflict Memory Mapping for Transpose-like Kernels in SIMT Architectures", Journal of Parallel and Distributed Computing, 2018, submitted.





#### Thanks!

innocenzo.mungiello@unina.it wpage.unina.it/innocenzo.mungiello









Innocenzo Mungiello

Lud Perimeter

Lud Diagonal

Convolution Row























