## CS 636: Memory Consistency Models

#### Swarnendu Biswas

Department of Computer Science and Engineering, Indian Institute of Technology Kanpur

Sem 2024-25-II



"To write correct and efficient shared memory programs, programmers need a precise notion of how memory behaves with respect to read and write operations from multiple processors"

S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorials. Journal of Computer, vol. 29, no. 12, pp. 66–76, Dec. 1996.

#### **Busy-Wait Paradigm**

```
1 Object X = null;
2 boolean done = false;
```

| Thread 1                                   | Thread 2                                                |
|--------------------------------------------|---------------------------------------------------------|
| <pre>X = new Object(); done = true; </pre> | <pre>while (!done) {} if (X != null) X.compute();</pre> |

#### What Value Can a Read Return?

1 
$$X = 0;$$
  
2 done = 0;

| Core 1           | Core 2                   |
|------------------|--------------------------|
| 1 S1: ST X, 10   | L1: LD r1, done          |
| 2 S2: ST done, 1 | B1: if (r1 != 1) goto L1 |
| 3                | L2: LD r2, X             |

## Reordering of Accesses by Hardware

Accesses are to different addresses

- Store-store reordering
  - Non-FIFO write buffer (first store misses in the cache while the second hits or the second store can coalesce with an earlier sore)
- Load-load reordering
  - ► Cache hits, dynamic scheduling, execute out of order
- Load-store reordering
  - ► Cache hits, out-of-order core
- Store-load reordering
  - ► FIFO write buffer with bypassing, out-of-order core

## Reordering of Accesses by Hardware

Accesses are to **different** addresses

- Store-store reordering
  - Non-FIFO write buffer (first store misses in the cache while the second hits or the second store and st

## Correct in a single-threaded context

Load-load reor

Cache hits

- Non-trivial in a multithreaded context
- Load-store reordering
  - ► Cache hits, out-of-order core
- Store-load reordering
  - ► FIFO write buffer with bypassing, out-of-order core

#### What values can a load return?

#### Return the "last" write

- Uniprocessor: program order defines the "last" write
- Multiprocessor: ?

## Memory Consistency Model

Set of rules that govern how systems process memory operation requests from multiple processors

Determines the order in which memory operations appear to execute

Specifies the allowed behaviors of multithreaded programs executing with shared memory

- Both at the hardware-level and at the programming-language-level
- There can be multiple correct behaviors

#### Importance of Memory Consistency Models

- + Determines what optimizations are correct
- + Contract between the programmer and the hardware
- + Influences ease of programming and program performance
- + Impacts program portability

#### Dekker's Algorithm

| Core 1             | Core 2             |
|--------------------|--------------------|
| 1 S1: ST flag1, 1  | 1 S2: ST flag2, 1  |
| 2 L1: LD r1, flag2 | 2 L2: LD r2, flag1 |

## Can both r1 and r2 be set to zero?

#### Issues with Memory Consistency

#### Visibility

When are the effects of one thread (e.g., updating a memory location) visible to another?

#### Ordering

When can operations of any given thread appear out of order to another thread?

A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.

- Uniprocessor Operations execute in the order specified by the program
- All operations execute in order, and the operations of each individual core appear in program order

#### **Uniprocessor Memory Model**

- Memory operations occur in program order, and maintain data and control dependences
- Read from memory returns the value from the last write in program order
- Compiler optimizations preserve these semantics

## Interleavings with SC

```
1 data = null;
2 flag = false;
```

| Core 1                     | Core 2                                    |
|----------------------------|-------------------------------------------|
| 1 S1: data = new Object(); | 1                                         |
| 2 S2: flag = true;         | <pre>2 L1: r1 = flag;</pre>               |
| 3                          | <pre>3 B1: if (r1 != true) goto L1;</pre> |
| 4                          | 4 L2: r2 = data;                          |

Should r2 always be set to the new Object() stored?

## Interleavings with SC



Swarnendu Biswas (IIT Kanpur)

#### SC Formalism

Every load gets its value from the last store before it (in global memory order) to the same address

Suppose we have two addresses a and b (a == b or a != b). L(a) is a load from a and S(a) is a store to a.

#### Challenges in Implementing SC

#### Is preserving program order on a per-location basis sufficient?

- Hardware implementations of SC need to satisfy the following requirements
  - Program orderPrevious memory operation completes before proceeding with the next memory operation in program order
  - Write atomicity 

     Writes to the same location should be serialized, i.e., writes to the same location should be visible in the same order to all processors

## Need for Write Atomicity

| Core 1 | Core 2               | Core 3                                   |
|--------|----------------------|------------------------------------------|
| A = 1  | if (A == 1)<br>B = 1 | if (B == 1)<br>tmp = A<br>(What should A |

## Need for Write Atomicity

| Core 1                                                           | Core 2                      | Core 3      |
|------------------------------------------------------------------|-----------------------------|-------------|
| A = 1                                                            | if (A == 1)<br>B = 1        | if (B == 1) |
| from all processors <ul> <li>The effect of a write op</li> </ul> | a single sequential order a | o all the   |

processors at the same time (i.e., instantaneous)

time

#### Importance of Maintaining Write-Read Order

- Assume a bus-based system with no caches
- Includes a write buffer with bypassing capabilities

| Core 1             | Core 2             |
|--------------------|--------------------|
| 1 S1: ST flag1, 1  | 1 S2: ST flag2, 1  |
| 2 L1: LD r1, flag2 | 2 L2: LD r2, flag1 |

### Importance of Maintaining Write-Read Order

- Assume a bus-based system with no caches
- Includes a write buffer with bypassing capabilities



#### SC in Architecture with Caches

- Replication of data requires a cache coherence protocol
  - Several definitions of cache coherence protocols exist
- Propagating new values to multiple other caches is non-atomic

## Providing Write Atomicity with Caches

- Consider a system with caches, and assume that all variables are cached by all the cores
- SC can be violated with a network with no ordering guarantees

| Core 1 | Core 2               | Core 3                            |
|--------|----------------------|-----------------------------------|
| A = 1  | if (A == 1)<br>B = 1 | <pre>if (B == 1)    tmp = A</pre> |

#### Providing Write Atomicity with Caches

- Consider a system with caches, and assume that all variables are cached by all the cores
- SC can be violated with a network with no ordering guarantees



## Serialization of Writes

| Core 1                                         | Core 2  | Core 3                                                       | Core 4 |
|------------------------------------------------|---------|--------------------------------------------------------------|--------|
| 1 <b>A</b> = <b>1</b><br>2 <b>B</b> = <b>1</b> | 2 C = 1 | <pre>while (B != 1) {} 1 while (C != 1) {} 2 tmp1 = A </pre> |        |

Writes to A in Cores 1 and 2 should not reach Cores 3 and 4 out of order even if the network is out of order or does not provide guarantees—it would violate SC

## Serialization of Writes

| Core 1 | Core 2  | Core 3                                                                                           | Core 4 |
|--------|---------|--------------------------------------------------------------------------------------------------|--------|
|        | 2 C = 1 | 1       while (B != 1) {} 1         2       while (C != 1) {} 2         3       tmp1 = A       3 |        |

- Cache coherence must serialize writes to the same memory location
- Writes to the same memory location must be seen in the same order by all

even if ate SC

Writes

the ne

# Cache Coherence

## Block Diagram of a SMP



#### Data Coherence

#### Private caches create data coherence problem

- Copies of a variable can be present in multiple caches
- Private copies of shared data must be **coherent**, i.e., all copies must have the same value (okay if the requirement holds eventually)

#### Consider the following sequence of operations on a single core system

Final value of x will be 30



#### Coherence Challenge with Multicores



## Coherence Challenge with Multicores



(iii) Each core updates x in its private cache

(iv) Cores write back x, a store is lost depending on the order of write backs

C0

C1

x = x + 5

x = x + 15

x = 25

Main

Memory

0

2

x = 15

x = 25

## Can Write-through Caches Avoid the Coherence Problem?

#### Assume 3 cores with write-through caches

- (i) Core C0 reads  ${\rm x}$  from memory, caches it, and gets the value 10
- (ii) Core C1 reads  ${\rm x}$  from memory, caches it, and gets the value 10
- (iii) C1 writes x=20, and updates its cached and memory values
- (iv) C0 reads x from its cache and gets the value 10
- (v) C2 reads  ${\rm x}$  from memory, caches it, and gets the value 20
- (vi) C2 writes x=30, and updates its cached and memory value

## Sources of Errors in the Previous Examples

#### Write-back cache

- Stores are not visible to memory immediately
- Order of write backs are important
- Lesson learned: do not allow more than one copy of a cache line in dirty state

#### Write-through cache

- The value in memory may be correct if the writes are correctly ordered
- Our example system allowed a store to proceed when there is already a cached copy
- Lesson learned: must invalidate all cached copies before allowing a store to proceed

## Understanding Coherence

#### A memory system is coherent if the following hold:

- (i) A read from a location X by a core C that follows a write by C to X always returns the value written by C provided there are no writes of X by another processor between the two accesses by C.
- (ii) A read from a location X by a core C that follows a write to X by another core returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
- (iii) Writes to the same location are serialized. That is, two writes to the same location by any two cores are seen in the same order by all processors.

#### **Correctness Requirement**

#### For sequential programs, there is only one correct output

A read from a memory location must return the "latest" value written to it

#### For parallel programs, there can be multiple correct outputs

- Defining "latest" precisely is crucial
- Assume that the latest value of a location is the latest value "committed" by any thread/process

## Cache Coherence Protocol

Multicore processors implement a cache coherence protocol to keep private caches in sync

# A "cache coherence protocol" is a set of actions that ensure that a load to address A returns the "last committed" value to A

- Essentially, makes one core's write visible to other cores by propagating the write to other caches
- Aims to make the presence of private caches functionally invisible
- Coherence protocols can operate on granularities from 1–64 bytes, usually operate on whole cache blocks (e.g., 64 bytes)

## Cache Coherence Protocol Invariants

#### **1.** Enforces the Single-Writer-Multiple-Reader (SWMR) invariant

For any given memory location, at any given moment in time, there is either a single core that may write it (including read) or some number of cores that may read it

#### **2.** Data values must be propagated correctly (data invariant)

The value of a memory location at the start of a read-only time period is the same as the value of the location at the end of its last read-write time period



#### **Definition 2**

A coherent system must appear to execute all threads' loads and stores to a **single** memory location in a total order that respects the program order of each thread

#### **Definition 3**

A coherent system satisfies two invariants:

write propagation every store is eventually made visible to all cores, and

write serialization writes to the same memory location are serialized (i.e., observed in the same order by all cores)

## Memory Consistency vs Cache Coherence

Memory Consistency

- Defines shared memory behavior
- Related to all shared-memory locations
- Policy on when new value is propagated to other cores
- Memory consistency implementations can use cache coherence as a "black box"

#### Cache Coherence

- Does not define shared memory behavior
- Specific to a single shared-memory location
- Propagates a new value to other cached copies
- Invalidation-based or update-based

# Sequential Consistency

### End-to-end SC

- Simple memory model that can be implemented both in hardware and in languages
- Performance can take a hit
  - Naïve hardware
  - Maintain program order—expensive for a write

D. Marino et al. A Case for an SC-Preserving Compiler. PLDI'11.

# SC-Preserving Optimizations

| Redundant load | Original<br>t = X; u = X; | $\Rightarrow$ | Optimized<br>t = X; u = t; |
|----------------|---------------------------|---------------|----------------------------|
| Forwarded load | Original<br>X = t; u = X; | $\Rightarrow$ | Optimized<br>X = t; u = t; |
|                |                           |               |                            |
| Dead store     | Original<br>X = t; X = u; | $\Rightarrow$ | Optimized<br>X = u;        |

### **Optimizations Forbidden in SC**

Loop invariant code motion, common sub-expression elimination, ...

| Original                                   | Optimized                                  |  |
|--------------------------------------------|--------------------------------------------|--|
| L1: t = X*2;<br>L2: u = Y;<br>L3: v = X*2; | ⇒ L1: t = X*2;<br>L2: u = Y;<br>L3: v = t; |  |

CSE reorders the memory accesses to Y and the second read from X (relaxes  $L \rightarrow L$  constraint, performs an eager load)

#### **Optimizations Forbidden in SC**



## Problematic Optimizations with SC



Eager load optimizations involve  $S \rightarrow L$  and  $L \rightarrow L$  reordering. These optimizations perform a load earlier than would have been performed without the optimizations.

## Problematic Optimizations with SC

| Dead store      | Original                               |               | Optimized                         |
|-----------------|----------------------------------------|---------------|-----------------------------------|
|                 | L1: X = 1;<br>L2: P = Q;<br>L3: X = 2; | $\Rightarrow$ | L1: ;<br>L2: P = Q;<br>L3: X = 2; |
| Redundant store | Original                               | I             | Optimized                         |
|                 | L1: t = X;<br>L2: P = Q;<br>L3: X = t; | $\Rightarrow$ | L1: t = X;<br>L2: P = Q;<br>L3: ; |

### Implementing SC with Compiler Support

Implement a compiler pass (e.g., in LLVM) to deal with non-SC preserving optimizations

| L1: | t | = | X*2; |
|-----|---|---|------|
| L2: | u | = | Υ;   |
| L3: | v | = | X*2; |

| L1: | t = X * 2                           |
|-----|-------------------------------------|
| L2: | u = Y                               |
| L3: | v = t                               |
| C3: | <pre>if (X modified since L1)</pre> |
| L3: | v = X * 2                           |

D. Marino et al. A Case for an SC-Preserving Compiler. PLDI'11.

#### SC Semantics

- SC is not a strong memory model
  - ► Does not guarantee data race freedom

| Thread 1                    | Thread 2                    |
|-----------------------------|-----------------------------|
| a++;                        | a++;                        |
|                             |                             |
| Thread 3                    | Thread 4                    |
| <pre>buffer[index++];</pre> | <pre>buffer[index++];</pre> |

#### Questions

- How would you implement an RMW instruction with SC?
- Are memory models only relevant in systems with support for caches?
- Is memory consistency not needed in presence of cache coherence?
- Do memory models only impact hardware design?

# Hardware Memory Models

### Characterizing Hardware Memory Models

#### Relax program order

- For example, Store  $\rightarrow$  Load and Store  $\rightarrow$  Store
- Applicable to pairs of operations with different addresses

#### Relax write atomicity

- Read other core's write early
- Applicable to only cache-based systems

#### Relax both program order and write atomicity

Read own write early

#### Possible Interleavings Under SC and TSO

1 
$$x = 0;$$
  
2  $y = 0;$ 

| Core 1                  | Core 2                  |
|-------------------------|-------------------------|
| 1 S1: x = new Object(); | 1 S2: y = new Object(); |
| 2 L1: r1 = y;           | 2 L2: r2 = x;           |

# Can both r1 and r2 be set to zero?

### **Total Store Order**

- Allows reordering stores to loads
  - A read is not allowed to return the value of another processor's write until it is made visible to all other processors (as in SC)
- Requires write atomicity, can read own write early, not other's writes
- Conjecture: widely-used x86 memory model is equivalent to TSO

#### **TSO** Formalism

Suppose we have two addresses a and b (a == b or a != b)

Constraints

1. If L(a) 
$$<_p L(b) \Rightarrow L(a) <_m L(b)$$
  
2. If L(a)  $<_p S(b) \Rightarrow L(a) <_m S(b)$   
3. If S(a)  $<_p S(b) \Rightarrow S(a) <_m S(b)$   
4. If S(a)  $<_p L(b) \Rightarrow S(a) <_m L(b)$  /\* Enables FIFO write buffer \*/

. . .

Every load gets its value from the last store before it to the same address

# Support for FENCE Operations in TSO

If L(a)  $<_p$  FENCE  $\Rightarrow$  L(a)  $<_m$  FENCE

If S(a)  $<_p$  FENCE  $\Rightarrow$  S(a)  $<_m$  FENCE

If FENCE  $<_p$  FENCE  $\Rightarrow$  FENCE  $<_m$  FENCE

If FENCE  $<_p L(a) \Rightarrow$  FENCE  $<_m L(a)$ 

If FENCE  $<_p S(a) \Rightarrow$  FENCE  $<_m S(a)$ 

If S(a)  $<_p$  FENCE  $\Rightarrow$  S(a)  $<_m$  FENCE

If FENCE  $<_p L(a) \Rightarrow$  FENCE  $<_m L(a)$ 

### Possible Outcomes with TSO

1 
$$x = 0;$$
  
2  $y = 0;$ 

| Core 1         | Core 2         |  |
|----------------|----------------|--|
| 1 S1: x = NEW; | 1 S2: y = NEW; |  |
| 2 L1: r1 = x;  | 2 L3: r3 = y;  |  |
| 3 L2: r2 = y;  | 3 L4: r4 = x;  |  |

Assume r2 and r4 are zero. Can r1 or r3 be set to zero?

#### Possible Outcomes with TSO



#### **Outcome**: r2 ==0, r4 == 0, r1 == NEW, and r3 == NEW

Swarnendu Biswas (IIT Kanpur)

### RMW in TSO

- Load of a RMW cannot be performed until earlier stores are performed (i.e., exited the write buffer). Why?
  - ► Effectively drains the write buffer
- Load requires read–write coherence permissions, not just read permissions
- To guarantee atomicity, the cache controller may not relinquish coherence permission to the block between the load and the store

### Relationship among Memory Models

- A memory model Y is strictly more relaxed (weaker) than a memory model X if all X executions are also Y executions, but not vice versa
- If Y is more relaxed than X, then all X implementations are also Y implementations
- Two memory models may be incomparable if both allow executions precluded by the other



# Which is correct?

### Processor Consistency (PC) and Partial Store Order (PSO)

#### PC is similar to TSO, but does not guarantee write atomicity

Writes may become visible to different processors in different order

#### PSO allows reordering of store to loads and stores to stores

- Writes to different locations from the same processor can be pipelined or overlapped and are allowed to reach memory or other cached copies out of program order
- Can read own write early, not other's writes
- Write-write reordering is present in many architectures, including Alpha, IA64, and POWER

#### **Opportunities to Reorder Memory Operations**

```
1 data1 = null;
2 data2 = null;
3 flag = false;
```

#### Core 1

```
Core 2
```

| 1 | S1: | data1 = <mark>new</mark> Object(); |
|---|-----|------------------------------------|
| 2 | S2: | <pre>data2 = new Object();</pre>   |
| 3 | S3: | <pre>flag = true;</pre>            |
| 4 |     |                                    |



What order ensures r2 and r3 always see initialized objects?

### Reorder Operations Within a Synchronization Block



Swarnendu Biswas (IIT Kanpur)

CS 636: Memory Consistency Models

## **Optimization Opportunities**

- (i) Non-FIFO coalescing write buffer
- (ii) Support non-blocking reads
  - ► Hide latency of reads
  - Use lockup-free caches and speculative execution
- (iii) Simpler support for speculation
  - Need not compare addresses of loads to coherence requests
  - ▶ For SC, need support to check whether the speculation is correct

Maintain TSO rules for ordering two accesses to the same address only

• Every load gets its value from the last store before it to the same address

Constraints

1. If 
$$L(a) <_p L'(a) \Rightarrow L(a) <_m L'(a)$$
  
2. If  $L(a) <_p S(b) \Rightarrow L(a) <_m S(b)$   
3. If  $S(a) <_n S(b) \Rightarrow S(a) <_m S(b)$ 

Maintain TSO rules for ordering two accesses to the same address only

• Every load gets its value from the last store before it to the same address

| Constraints 1. If $L(a) <_p L'(a) \Rightarrow L(a) <_m L'(a)$<br>2. If $L(a) <_p S(b) \Rightarrow L(a) <_m S(b)$<br>3. If $S(a) <_p S(b) \Rightarrow S(a) <_m S(b)$ |                                   |                                                    |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------|----------------------------------------------------|
| If L(a) $<_p$ FE                                                                                                                                                    | $NCE \Rightarrow L(a) <_m FENCE$  | If FENCE $<_p L(a) \Rightarrow$ FENCE $<_m L(a)$   |
| If S(a) $<_p$ FE                                                                                                                                                    | $ENCE \Rightarrow S(a) <_m FENCE$ | If FENCE $<_p$ S(a) $\Rightarrow$ FENCE $<_m$ S(a) |

#### If FENCE $<_p$ FENCE $\Rightarrow$ FENCE $<_m$ FENCE

#### Correct Implementation under Relaxed Consistency

```
1 data1 = null;
2 data2 = null;
3 flag = false;
```

#### Core 1



Core 2

#### Correct Implementation Under Relaxed Consistency

| Core 1                                                                                                                                                                                                                   | Core 2                                                                                                                                                                                                                        |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>1 F11: FENCE<br/>2 A11: acquire(lock);<br/>3 F12: FENCE<br/>4 // Loads L<sub>1i</sub> arbitrarily<br/>5 // interleaved with stores S<sub>1j</sub><br/>6 F13: FENCE<br/>7 R12: release(lock);<br/>8 F14: FENCE</pre> | 1<br>2<br>3<br>4<br>5<br>6<br>7 F21: FENCE<br>8 A21: acquire(lock);<br>9 F22: FENCE<br>10 // Loads $L_{2i}$ arbitrarily<br>11 // interleaved with stores $S_{2j}$<br>12 F23: FENCE<br>13 R22: release(lock);<br>14 F24: FENCE |

## Examples of Relaxed Consistency Memory Models

#### Weak ordering

- Distinguishes between data and synchronization operations
- A synchronization operation is not issued until all previous operations are complete
- No operations are issued until the previous synchronization operation completes

#### Release consistency

- Relaxes WO further, distinguishes between acquire and release synchronization operations
- All previous acquire operations must be performed before an ordinary load or store access is allowed to perform
- Previous accesses have to complete before a release is performed
- RC<sub>sc</sub> maintains SC between synchronization operations
- Acquire  $\rightarrow$  all, all  $\rightarrow$  release, and sync  $\rightarrow$  sync

#### Correct Implementation Under Relaxed Consistency

| Core 1                                                   | Core 2                                       |
|----------------------------------------------------------|----------------------------------------------|
| 1 F11: FENCE                                             | 1                                            |
| <pre>2 A11: acquire(lock);</pre>                         | 2                                            |
| 3 F12: FENCE                                             | 3                                            |
| 4 // Loads L <sub>1i</sub> arbitrarily                   | 4                                            |
| 5 // interleaved with stores $S_{1j}$                    | 5                                            |
| 6 F13: FENCE                                             | 6                                            |
| <pre>7 R12: release(lock);</pre>                         | 7 F21: FENCE                                 |
| 8 F14: FENCE                                             | <pre>8 A21: acquire(lock);</pre>             |
| 9                                                        | 9 F22: FENCE                                 |
| 10                                                       | 10 // Loads $L_{2i}$ arbitrarily             |
| 11 Which fences are needed to ensure correct             | 11 // interleaved with stores ${\it S}_{2j}$ |
| <sup>12</sup> ordering and visibility between C1 and C2? | 12 F23: FENCE                                |
| 13                                                       | <pre>13 R22: release(lock);</pre>            |
| 14                                                       | 14 F24: FENCE                                |

### **Relaxed Consistency Memory Models**

#### Why should we use them?

Performance

#### Why should we not use them?

Complexity

### Hardware Memory Models: One Slide Summary

| Model            | $\mathbf{W}  ightarrow \mathbf{R}$ | $\mathbf{W}  ightarrow \mathbf{W}$ | $\mathbf{R}  ightarrow \mathbf{RW}$ | Read Own<br>Write Early | Read Other's<br>Write Early |
|------------------|------------------------------------|------------------------------------|-------------------------------------|-------------------------|-----------------------------|
| SC               |                                    |                                    |                                     | $\checkmark$            |                             |
| TSO              | $\checkmark$                       |                                    |                                     | $\checkmark$            |                             |
| PC               | $\checkmark$                       |                                    |                                     | $\checkmark$            | $\checkmark$                |
| PSO              | $\checkmark$                       | $\checkmark$                       |                                     | $\checkmark$            |                             |
| WO               | $\checkmark$                       | $\checkmark$                       | $\checkmark$                        | $\checkmark$            |                             |
| $RC_{SC}$        | $\checkmark$                       | $\checkmark$                       | $\checkmark$                        | $\checkmark$            |                             |
| RC <sub>PC</sub> | $\checkmark$                       | $\checkmark$                       | $\checkmark$                        | $\checkmark$            | $\checkmark$                |

### Desirable Properties of a Memory Model

#### Desirable properties: Programmability, Performance, and Portability

Hard to satisfy all three properties

#### **Evaluating SC**

- + Intuitive when we think of uniprocessor executions
- + Serializability of instructions
- No atomicity of regions
- Inhibits many compiler transformations
- Almost all recent architectures violate SC

# Programming Language Memory Models

## Language Memory Models

- Data-Race-Free-0 (DRF0) model is conceptually similar to Weak Ordering (WO)
- Assumes no data races
  - ► DRF0 ensures SC for data-race-free programs
  - ► No guarantees for racy programs
- Allows many optimizations in the compiler and hardware
- Language memory models were developed much later than hardware models
  - ► Recent standardizations are largely driven by languages
- Most language models are based on DRF0

Why do we need one? Is the hardware memory model not enough?

## C++ Memory Model and Catch-Fire Semantics

- Adaptation of the DRF0 memory model
  - Provides SC for data-race-free programs
  - C/C++ simply ignores data races
- No safety guarantees in the language

```
1 X* x = null;
2 bool done = false;
```

| Thread 1                         | Thread 2                              |  |
|----------------------------------|---------------------------------------|--|
| 1 X = new X();<br>2 done = true; | <pre>if (done)    X-&gt;func();</pre> |  |

## C++ Memory Model and Catch-Fire Semantics

- Adaptation of the DRF0 memory model
  - Provides SC for data-race-free programs
  - C/C++ simply ignores data races
- No safety guarantees in the language



## Memory Operations in C++

### Synchronization Lock, unlock, atomic load, atomic store, atomic RMW Data Load, store

Compiler reordering is allowed for memory operations M1 and M2 if

- M1 is a data operation and M2 is a read synchronization operation
- M1 is write synchronization and M2 is data
- M1 and M2 are both data with no synchronization between them
- M1 is data and M2 is the write of a lock operation
- M1 is unlock and M2 is either a read or write of a lock

## Writing Correct Concurrent C++ Code Using Locks

```
std::mutex mtx;
bool ready = false;
```

| Thread 1                                                          | Thread 2                                                                           |  |
|-------------------------------------------------------------------|------------------------------------------------------------------------------------|--|
| <pre>mtx.lock(); prepareData(); ready = true; mtx.unlock();</pre> | <pre>1 mtx.lock();<br/>2 if (ready)<br/>3 consumeData();<br/>4 mtx.unlock();</pre> |  |

## Using Atomics Available from C++11

- "Data race free" by definition (e.g., std::atomic<int>)
  - A store synchronizes with operations that load the stored value Similar to volatile in Java
- C++ volatile is different!
  - Does not establish inter-thread synchronization
  - ► Can be part of a data race

```
std::atomic<bool> ready(false);
Thread 1
prepareData();
ready.store(true);
```

## **Ensuring Visibility**

- Writer thread releases a lock
  - ▶ Flushes all writes from the thread's working memory
- Reader thread acquires a lock
  - ► Forces a (re)load of the values of the affected variables
- std::atomic in C++ and volatile in Java
  - ► Values written are made visible immediately before any further memory operations
  - Readers reload the value upon each access
- Thread join
  - ▶ Parent thread is guaranteed to see the effects made by the child thread

## Memory Order of Atomics

- Specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation
- Default is sequential consistency

### atomic.h

| 1 | <pre>enum memory_order {</pre>    |
|---|-----------------------------------|
| 2 | <pre>memory_order_relaxed ,</pre> |
| 3 | <pre>memory_order_consume ,</pre> |
| 4 | <pre>memory_order_acquire ,</pre> |
| 5 | <pre>memory_order_release ,</pre> |
| 6 | <pre>memory_order_acq_rel ,</pre> |
| 7 | memory_order_seq_cst              |
| 8 | };                                |

## Memory Model Synchronization Modes

### Producer

- Producer thread creates data
- Producer thread stores to an atomic

#### Consumer

- Consumer threads read from the atomic
- When the expected value is seen, data from the producer thread is complete and visible to the consumer thread

The different memory model modes indicate how strong this data-sharing bond is between threads

Memory model synchronization modes

memory\_order\_seq\_cst

$$x = 0;$$
  
y = 0;

Thread 1

Thread 2

y = 1; x.store(2); if (x.load() == 2)
 assert(y == 1);



```
memory_order_seq_cst
```

$$x = 0;$$
  
y = 0;



memory\_order\_relaxed: no happens-before edge

#### Thread 1

```
y.store(20, memory_order_relaxed);
x.store(10, memory_order_relaxed);
```

#### Thread 2

```
if (x.load(memory_order_relaxed) == 10)
  assert(y.load(memory_order_relaxed) == 20);
  y.store(30, memory_order_relaxed);
```

#### Thread 3

```
Can these asserts fail?
```

```
if (y.load(memory_order_relaxed) == 30)
  assert(x.load(memory_order_relaxed) == 10);
```

memory\_order\_relaxed: no happens-before edge

#### Thread 1

```
x.store(10, memory_order_relaxed);
x.store(20, memory_order_relaxed);
```

#### Thread 2

```
y = x.load(memory_order_relaxed);
z = x.load(memory_order_relaxed);
assert(y <= z);
Can this assert
fail?
```

- In the absence of HB edges, a thread should not rely on the exact ordering of instructions in another thread
- Once a value of a variable from Thread 1 is observed in Thread 2, Thread 2 cannot see an earlier value

memory\_order\_acquire and memory\_order\_release: introduces HB edges only
between dependent variables



### Thread 1 y.store(20, memory\_order\_release); Thread 2 x.store(10, memory\_order\_release); Thread 3 assert(y.load(memory\_order\_acquire) == 20 && x.load(memory\_order\_acquire) == 0); What will happen with the asserts? Thread 4 assert(y.load(memory\_order\_acquire) == 0 && x.load(memory\_order\_acquire) == 10); Swarnendu Biswas (IIT Kanpur) CS 636: Memory Consistency Models Sem 2024-25-II

memory\_order\_consume: removes HB ordering on non-dependent variables

Thread 1

n = 1; m = 1; p.store(&n, memory\_order\_release);

#### Thread 2

```
t = p.load(memory_order_acquire);
assert(*t == 1 && m == 1);
```

Thread 3 t = p.load(memory\_order\_consume); assert(\*t == 1 && m == 1); Swarnendu Biswas (IIT Kanpur) CS 636: Memory Consistency Models Sem 2024-25-II

80/95

## Happens-Before Memory Model (HBMM)

Read operation a = rd(t, x, v) may return the value written by any write operation b = wr(t, x, v) provided

- (i) b does not happen after a, i.e.,  $b \prec_{HB} a$  or b||a
- (ii) There is no intervening write c to x where b  $\prec_{HB}$  c  $\prec_{HB}$  a

1 x = 0:  $_{2}$  y = 0; Thread 1 Thread 2 1 x = 1;y = 1; $_{2}$  r1 = x;  $_{2}$  r2 = y; assert(r1!=0); assert(r2!=0);Can these asserts fail?

## Happens-Before Memory Model (HBMM)

Read operation a = rd(t, x, v) may return the value written by any write operation b = wr(t, x, v) provided

- (i) b does not happen after a, i.e.,  $b \prec_{HB} a$  or b||a
- (ii) There is no intervening write c to x where b  $\prec_{HB}$  c  $\prec_{HB}$  a

1 x = 0:  $_{2}$  y = 0; Thread 1 Thread 2 1 r2 = y;r1 = x; $_{2}$  y = 1; 2 x = 1;assert(r2 == 0); Can these asserts assert(r1 == 0):fail?

### HBMM

$$\begin{array}{rcl}
 x &= 0; \\
 y &= 0;
\end{array}$$

### Thread 1



### Thread 2

| 1 | while (y | <i>r</i> == | 0) | {} |  |  |
|---|----------|-------------|----|----|--|--|
| 2 | x = 1;   |             |    |    |  |  |
| 3 |          |             |    |    |  |  |

### HBMM



### HBMM

HBMM has the potential to generate out-of-thin-air values

$$x = 0;$$

$$y = 0;$$
Thread 1
Thread 2
$$y = x;$$

- Problematic for garbage-collected languages since the "out-of-thin-air" value could be an invalid pointer
- Introduces potential security loopholes

### DRF0 vs HBMM

1 x = 0;2 y = 0;

### Thread 1

Thread 2

 1
 r1 = x;
 1
 r2 = y;

 2
 if (r1 == 1) {
 2
 if (r1 == 1) {

 3
 y = 1;
 3
 x = 1;

 4
 }
 4
 }

 5
 assert (r1 == 0);
 5
 assert (r2 == 0);

Is there a data race on x and y?
Remember that DRF0 provides SC only for data-race-free programs

CS 636: Memory Consistency Models

## DRF0 vs HBMM

### DRF0 allows arbitrary behavior for racy executions

DRF0 is not strictly stronger than HBMM

### HBMM does not guarantee SC for DRF programs

HBMM is not strictly stronger than DRF0

## Java Memory Model (JMM)

- First high-level language to incorporate a relaxed memory model
- JMM provides SC for data-race-free executions (like DRF0)
- Provides memory- and type-safety, so has to define some semantics for programs with data races
  - ► JMM prohibits out-of-thin-air values

## Outcomes Possible with JMM



## Outcomes Possible with JMM

1 x = 0;2 y = 0;

### Thread 1

Thread 2

| 1 | y = 1;            |
|---|-------------------|
| 2 | r1 = x;           |
| 3 | assert (r1 != 0); |

1 x = 1; 2 r2 = y; 3 assert (r2 != 0);

## Outcomes Possible with JMM

1 x = 0;2 y = 0;

### Thread 1

| 1 | r1 = x;    |    |     |
|---|------------|----|-----|
| 2 | y = 1;     |    |     |
| 3 | assert (r1 | == | 0); |

1 r2 = y; 2 x = 1; 3 assert (r2 == 0);



## Outcomes Not Possible with JMM

| x = 0;     y = 0;                        |                                                            |
|------------------------------------------|------------------------------------------------------------|
| Thread 1                                 | Thread 2                                                   |
| r1 = x;<br>y = r1;<br>assert (r1 != 42); | <pre>1 r2 = y;<br/>2 x = r2;<br/>3 assert (r2 == 0);</pre> |

- HBMM permits an execution in which each load reads say 42
- DRF0 allows any arbitrary behavior
- JMM is strictly stronger than DRF0 and HBMM

Swarnendu Biswas (IIT Kanpur)

CS 636: Memory Consistency Models

## JVMs do not comply with the JMM!



Thread 2

### Thread 1

1 r1 = x;2 y = r1;3 4 5 6 7 Can this assert fail under HBMM and JMM?

## JVMs do not comply with the JMM!



### Lessons Learned

### Specifying semantics for racy programs is hard

Simple optimizations may introduce unintended consequences

### SC for DRF is now the preferred baseline

- Make sure your program is free of data races
- System guarantees SC execution

### References

- V. Nagarajan et al. A Primer on Memory Consistency and Cache Coherence. Chapters 1,2,6–8, 2<sup>nd</sup> edition, Morgan and Claypool.
- S. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Journal of Computer, vol. 29, no. 12, pp. 66–76, Dec. 1996.
- D. Marino et al. A Case for an SC-Preserving Compiler. PLDI 2011.
- C. Flanagan and S. Freund. Adversarial Memory for Detecting Destructive Races. PLDI 2010.
- M. Cao et al. Prescient Memory: Exposing Weak Memory Model Behavior by Looking into the Future. ISMM 2016.