# CS 636: Transactional Memory

#### Swarnendu Biswas

Department of Computer Science and Engineering, Indian Institute of Technology Kanpur

Sem 2024-25-II



## Challenges with Concurrent Programming



# Challenges with Concurrent Programming



## Task Parallelism

- Different tasks run on the same data
  - Threads execute computation concurrently (e.g., pipelines)
- Explicit synchronization is used to coordinate threads



## HashMap in Java

```
public Object get(Object key) {
    int idx = hash(key); // Compute hash to find bucket
2
    HashEntry e = buckets[idx];
3
    while (e != null) { // Find element in bucket
4
      if (equals(key, e.key))
5
         return e.value;
6
      e = e.next;
7
    ş
8
    return null:
9
  Z
                        No lock overhead but
10
                        thread-unsafe
```

## Synchronized HashMap in Java



Thread-safe but uses explicit coarse-grained locking

## Coarse-Grained and Fine-Grained Locking

**Coarse-Grained Locking** 

- + Easy to implement correctly
- Limits concurrency, poor scalability

**Fine-Grained Locking** 

- + More concurrency, better performance
- Difficult to get correct, more error-prone

## Data Parallelism

Same task is applied on many data items in parallel

- E.g., processing pixels in an image
- Useful for numeric computations
- Not a universal programming model



## Task vs Data Parallelism

### Task Parallelism

- Different operations on same or different data
- Parallelization depends on task decomposition
- Speedup is usually less because of limited opportunities and synchronization

### Data Parallelism

- Same operation on different data
- Parallelization proportional to the input data size
- Speedup is usually more

## Abstraction and Composability

#### Programming models provide abstraction and composition

- For example, procedures, ADTs, and libraries
- Abstraction is a simplified view of an entity or a problem (e.g., procedures and ADT)
- Composability joins smaller units to form larger, more complex units (e.g., library methods)

## Abstraction and Composability

#### Programming models provide abstraction and composition

- For example, procedures, ADTs, and libraries
- Abstraction is a simplified view of an entity or a problem (e.g., procedures and ADT)
- Composability joins smaller units to form larger, more complex units (e.g., library methods)

#### Parallel programming lacks abstraction mechanisms

- Low-level parallel programming models, such as threads and explicit synchronization, are unsuitable for constructing abstractions
- Explicit synchronization is not composable

## Locks are difficult to program

- If a thread holding a lock is delayed, other contending threads cannot make progress
  - ► All contending threads will possibly wake up, but only one can make progress
- Lost wakeup missed notify for condition variable
- Deadlock
- Priority inversion
- Lock convoying
- Locking relies on programmer conventions

## Locks are difficult to program

- If a thread holding a lock is delayed, other contending threads cannot make progress
  - ► All contending threads will possibly wake up, but only one can make progress
- Lost wakeup missed notify for condition variable
- Deadlock
- Priority inversion
- Lock convoving

Loc /\* \* When a locked buffer is visible to the I/O layer \* BH\_Launder is set. This means before unlocking \* we must clear BH\_Launder,mb() on alpha and then \* clear BH\_Lock, so no reader can see BH\_Launder set \* on an unlocked buffer and then risk to deadlock. \*/

Bradley Kuszmaul

Swarnendu Biswas (IIT Kanpur)

## Lock-based Synchronization is not Composable



#### You may now want to add a new method move

| 1 | boolean | move(HashTable         | tab1, | HashTable | tab2, | Т | elem) |
|---|---------|------------------------|-------|-----------|-------|---|-------|
| 2 |         | $\Rightarrow$ remove() |       |           |       |   |       |
| 3 |         | $\Rightarrow$ insert() |       |           |       |   |       |

## Lock-based Synchronization is not Composable



Option: Add new methods such as LockHashTable() and You may UnlockHashTable()

- Breaks the abstraction by exposing an implementation detail
- Lock methods are error prone
  - A client that locks more than one table must be careful to lock them in a globally consistent order to prevent deadlock

## Choosing the right locks!

- Locking schemes for 4 threads may not be the most efficient at 64 threads
- Need to profile the amount of contention

## What about hardware atomic primitives?

# **Transactional Memory**

## Transactional Memory

Transaction A computation sequence that executes as if without external interference

- Computation sequence appears indivisible and instantaneous
- Proposed by Lomet ['77] and Herlihy and Moss ['93]

## Advantages of Transactional Memory (TM)

+ Provides reasonable tradeoff between abstraction and performance + No need for explicit locking

Avoids lock-related issues like lock convoying, priority inversion, and deadlock

```
boolean move(HashTable tab1, HashTable tab2, T elem) {
1
    atomic {
2
    boolean res = tab1.remove(elem):
    if (res)
      tab2.insert(\emph{}lem);
    ş
    return res;
7
8 }
```

5

6

## Advantages of TM

#### Programmer says what needs to be atomic

TM system/runtime implements synchronization

#### Declarative abstraction

- Programmer says what work should be done
- Programmer has to say how work should be done with imperative abstraction

#### Easy programmability (like coarse-grained locks)

Performance goal is like fine-grained locks

## **Basic TM Design**

- Transactions are executed speculatively
- If the transaction execution completes without a conflict, then the transaction commits
  - ► The updates are made permanent
- If the transaction experiences a conflict, then it aborts

## Database Systems as a Motivation

- Database systems have successfully exploited parallel hardware for decades
- Achieve good performance by executing many queries simultaneously and by running queries on multiple processors when possible

### ACID properties

- Atomicity
- Consistency
- Isolation
- Durability

## TM vs Database Transactions

ΤМ

- Supported by language runtime or hardware
- Not durable
- Operations are from main memory, performance is critical

#### Databases

- Application level concept
- Durable
- Operations involve mostly disk accesses

## Properties of TM execution

Atomic Appears to happen instantaneously Commit Appears atomic Abort Has no side effects Serializable Appears to happen serially in order Isolation Other transactions cannot observe writes before commit

## **TM Execution Semantics**

Thread 1

| 1 | atomic { |     |   |   |   |     |  |  |  |  |  |
|---|----------|-----|---|---|---|-----|--|--|--|--|--|
| 2 | a        | L = | = | а | - | 20; |  |  |  |  |  |
| 3 | b        | ) = | = | b | + | 20; |  |  |  |  |  |
| 4 | C        | : = | = | а | + | b;  |  |  |  |  |  |
| 5 | a        | L = | = | а | - | b;  |  |  |  |  |  |
| 6 | }        |     |   |   |   |     |  |  |  |  |  |

#### Thread 2



## **TM Execution Semantics**

Thread 1



No data race due to TM semantics

Thread 2

## Linked-List-based Double Ended Queue



## Linked-List-based Double Ended Queue



## **Atomicity Violations**

Thread 1

```
1 ...
2 if (thd->proc_info) {
3
4
5
6 puts(thd->proc_info, ...)
7 }
8 ...
```

## Thread 2



MySQL:ha\_innodb.cc

## Fixing Atomicity Violations with TM



# Fixing Atomicity Violations with TM

Thread 1



## Thread 2

## TM vs synchronized in Java

ТΜ

- A transaction is atomic w.r.t. all other transactions in the system
- Nested transactions never deadlock

#### synchronized

- Provides mutual exclusion compared to other blocks on the same lock
- Nested blocks can deadlock if locks are acquired in wrong order

## TM Interface

```
void startTx();
```

```
2 bool commitTx();
```

```
3 void abortTx();
```

```
4 T readTx(T *addr);
```

```
5 void writeTx(T *addr, T val);
```

Read set Set of variables read by the Tx Write set Set of variables written by the Tx

Functions can be overloaded by types or we can use generics

## Linked-List-based Double Ended Queue



## Transactions cannot replace all uses of locks



## Concurrency in TM



Thread 2



# Concurrency in TM



Two levels

- (i) Among Txs from concurrent thread
- (ii) Among individual Tx operations

# **Design Choices**

Concurrency Control, Version Management, and Conflict Detection

# TM Terminology

- A conflict occurs when two transactions perform conflicting operations on the same memory location
  - Let  $R_i$  and  $W_j$  be the read and write sets of Txs *i* and *j*. Then, a conflict occurs iff
    - $R_i \cap W_j \neq 0, \text{ or }$
    - $R_j \cap W_i \neq 0, \text{ or }$
    - $\blacktriangleright \quad W_i \cap W_j \neq 0,$
- The conflict is detected when the underlying TM system determines that the conflict has occurred
- The conflict is resolved when the underlying TM system takes some action to avoid the conflict
  - ► For example, delay or abort one of the conflicting transactions
- A conflict, its detection, and its resolution can occur at different times

bal = 1000;

| Thread | 1 |
|--------|---|
|--------|---|

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp + 100; |
| 4 | }                |

#### Thread 2

| 1 | atomic {              |
|---|-----------------------|
| 2 | <pre>tmp = bal;</pre> |
| 3 | bal = tmp - 100;      |
| 4 | }                     |

Location Value read Value written

Location Value read Value written

bal = 1000;

|   | Thread 1         |
|---|------------------|
| 1 | atomic {         |
| 2 | tmp = bal;       |
| 3 | bal = tmp + 100; |
|   | 7                |

4 }

#### Thread 2

| 1 | atomic {         |  |
|---|------------------|--|
| 2 | tmp = bal;       |  |
| 3 | bal = tmp - 100; |  |
| 4 | 7                |  |

| Location Value read Value written | Location | Value read | Value written |
|-----------------------------------|----------|------------|---------------|
|                                   | bal      | 1000       |               |

bal = 1000;

| Thread | 1 |
|--------|---|
|--------|---|

2 3

4

| atomic {         |
|------------------|
| tmp = bal;       |
| bal = tmp + 100; |
| }                |

#### Thread 2

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp - 100; |
| 4 | 3                |

| Location | Value read | Value written | Location | Value read | Value written |
|----------|------------|---------------|----------|------------|---------------|
| bal      | 1000       |               | bal      | 1000       |               |

bal = 1000;

| Thread | 1 |
|--------|---|
|--------|---|

| 1 | atomic {              |
|---|-----------------------|
| 2 | <pre>tmp = bal;</pre> |
| 3 | bal = tmp + 100;      |
| 4 | }                     |

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp - 100; |
| 4 | 3                |

| Location | Value read | Value written | Location | Value read | Value written |
|----------|------------|---------------|----------|------------|---------------|
| bal      | 1000       | 1100          | bal      | 1000       |               |

bal = 1100;

Thread 1

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp + 100; |
| 4 | }                |

Thread 2

| L | atomic {         |  |
|---|------------------|--|
| 2 | tmp = bal;       |  |
| 3 | bal = tmp - 100; |  |
| 1 | 7                |  |

| Location | Value read | Value written | L |
|----------|------------|---------------|---|
| bal      | 1000 🗸     | 1100          |   |

| Location | Value read | Value written |
|----------|------------|---------------|
| bal      | 1000       |               |

Thread 1's Tx ends, updates are committed, value of bal is written to memory, Tx log is discarded

3

bal = 1100;

#### Thread 1

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp + 100; |
| 4 | }                |

#### Thread 2

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp - 100; |
| 4 | }                |

| Location | Value read | Value written |
|----------|------------|---------------|
| bal      | 1000       | 900           |

bal = 1100;

#### Thread 1

| 1 | atomic {         |  |
|---|------------------|--|
| 2 | tmp = bal;       |  |
| 3 | bal = tmp + 100; |  |
| 4 | }                |  |

#### Thread 2

| 1 | atomic {         |
|---|------------------|
| 2 | tmp = bal;       |
| 3 | bal = tmp - 100; |
| 4 | 3                |

| Location | Value read | Value written |
|----------|------------|---------------|
| bal      | 1000 🗡     | 900           |

Thread 2's Tx ends, but commit fails, because value of bal in memory does not match the read log, Tx needs to rerun

# **Concurrency Control**

#### Pessimistic

- Occurrence, detection, and resolution happen at the same time during execution
- Claims ownership of data before modifications

#### Optimistic

- Conflict detection and resolution can happen after the conflict occurs
- Multiple conflicting transactions can continue to keep running, as long as the conflicts are detected and resolved before the Txs commit

#### Pessimistic Concurrency Control





# Time of Locking

- When the Tx first accesses a location
- When the Tx is about to commit

### **Optimistic Concurrency Control**





### Pessimistic vs Optimistic Concurrency Control

#### Pessimistic

- Usually claims exclusive ownership of data before accessing
- Needs to avoid or detect and recover from deadlock situations
- + Effective in high contention cases

#### Optimistic

- Avoids claiming exclusive ownership of data, provides more conflict resolution choices
- Needs to avoid livelock situations through contention management schemes
- + Effective in low contention cases

### Hybrid Concurrency Control

• Use pessimistic control for writes and optimistic control for reads

- Use optimistic control TM with pessimistic control of irrevocable Txs
  - ► Irrevocable Tx means that the changes cannot be rolled back
  - ► A Tx that has performed I/O or a Tx that has experienced frequent conflicts in the past

#### Version Management

TMs need to track updates for conflict resolution

- Eager
  - Tx directly updates data in memory (direct update)
  - Maintains an undo log with overwritten values
  - Values in the undo log are used to revert updates on an abort



#### Version Management

TMs need to track updates for conflict resolution

- Lazy
  - ► Tx updates data in a private redo log
  - Updates are made visible at commit (deferred update)
  - ► Tx reads must look up redo logs
  - Discard redo log on an abort



• Pessimistic concurrency control is straightforward

• Pessimistic concurrency control is straightforward

- Pessimistic concurrency control is straightforward
- Which concurrency control type should we use with eager version management, pessimistic or optimistic?

- Pessimistic concurrency control is straightforward
- Which concurrency control type should we use with eager version management, pessimistic or optimistic?

- Pessimistic concurrency control is straightforward
- Which concurrency control type should we use with eager version management, pessimistic or optimistic?
- How do you check for conflicts in optimistic concurrency control?

- Pessimistic concurrency control is straightforward
- Which concurrency control type should we use with eager version management, pessimistic or optimistic?
- How do you check for conflicts in optimistic concurrency control?
  - ► Validation operation Successful validation means Tx had no conflicts

## Conflict Detection in Optimistic Concurrency Control

#### Conflict granularity

- Object or field in software TM, line offset or whole cache line in hardware TM
- What are the tradeoffs?

#### Time of conflict detection

- Just before access (eager), during validation, during final validation before commit (lazy)
- Validation can occur at any time, and can occur multiple times

#### Conflicting access types

Among concurrent ongoing Txs, or between active and committed Txs

## **Object Layout**



#### Object Model in Jikes RVM

ObjectModel (Jikes RVM API)

Swarnendu Biswas (IIT Kanpur)

### Issues with Conflict Granularity



- Detect conflicts at the granularity of objects or fields
- A hardware technique can detect conflicts at the line/block level or at the level of individual byte offsets
- What are the tradeoffs?

# **Transaction Semantics**

# Concurrency in TM

#### Two levels

- (i) Among Txs from concurrent thread
- (ii) Among individual Tx operations



# Serializability

The result of executing concurrent transactions must be identical to a result in which these transactions executed serially

- Widely-used correctness condition in databases
- The TM system can reorder transactions
- Serializability requires the Txs appear to run in serial order
  - Does not require that the order has to be real-time



# Strict Serializability

In strict serializability, if transaction TA completes before transaction TB starts, then TA must occur before TB in the equivalent serial execution



#### Limitations of Strict Serializability



# Linearizability

- A method call is the interval that starts with an invocation event and ends with a response event
- A method call is pending if the response event has not yet occurred
- Linearizability of an operation (e.g., method call): each operation appears to execute atomically at some point between its invocation and its completion

Thread 1



# Linearizability

Linearizability of a transaction: a transaction is a single operation extending from the beginning of startTx() until the completion of its final commitTx()



# Linearizability

Linearizability of a transaction: a transaction is a single operation extending from the beginning of startTx() until the completion of its final commitTx()



# Snapshot Isolation (SI)

- Weaker isolation requirement than serializability
  - ► Can potentially allow greater concurrency between Txs
  - Many database implementations actually provide SI
- SI allows a Tx's reads to be serialized before the Tx's writes
- All reads must see a valid snapshot of memory
- Updates must not conflict

#### Example of SI



What are possible values of x and y after execution?

- With serializability
- With SI

#### Understanding SI

| 1 $X = 0;$<br>2 $y = 0;$                 |                                          |
|------------------------------------------|------------------------------------------|
| Thread 1                                 | Thread 2                                 |
| 1 <b>int</b> t = x + 1; // 1<br>2 x = t; | 1 <b>int</b> t = x + 1; // 1<br>2 x = t; |

#### Sequentially consistent but no SI

CS 636: Transactional Memory

M. Zhang et al. Avoiding Consistency Exceptions Under Strong Memory Models. ISMM 2017.

#### **Understanding SI**

| 1 $X = 0;$<br>2 $y = 0;$             |                               |
|--------------------------------------|-------------------------------|
| Thread 1                             | Thread 2                      |
| 1 x = 1;<br>2 <b>int</b> t = y; // 0 | 1 y = 1;<br>2 int t = x; // 0 |

SI but not sequentially consistent and not serializable

M. Zhang et al. Avoiding Consistency Exceptions Under Strong Memory Models. ISMM 2017.

#### Understanding SI

- Semantics of SI may seem unexpected when compared with simpler models based on serial ordering of complete transactions
- Potential increased concurrency often does not manifest as a performance advantage when compared with models such as strict serializability

# Other TM Considerations

#### **Consistency During Transactions**

- Semantics such as serializability characterize the behavior of committed Txs
- What about the Txs which fail to commit?
  - ► Tx may abort or may be slow to reach commitTx()

#### Inconsistent Reads and Zombie Txs

| 1 $\mathbf{x} = 0;$<br>2 $\mathbf{y} = 0;$                                                                                                    | Assume eager version management<br>and lazy conflict detection                                                      |
|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Thread 1                                                                                                                                      | Thread 2                                                                                                            |
| <pre>do {    startTx();    int tmp1 = readTx(&amp;x);    int tmp2 = readTx(&amp;y);    while (tmp1 != tmp2) {}    while (!commitTx()); </pre> | <pre>1 2 3 4 do { 5 startTx(); 6 writeTx(&amp;x, 10); 7 writeTx(&amp;y, 10); 8 } while (!commitTx()); 9 10 11</pre> |

1

#### Inconsistent Reads and Zombie Txs



#### Considerations with Zombie Txs

- A Tx that is inconsistent but is not yet detected is called a zombie Tx
- Careful handling of zombie Txs are required, especially for unsafe languages like C/C++
  - Inconsistent values can potentially be used in pointer arithmetic to access unwanted memory locations
- Possible workarounds: perform periodic validations
  - ▶ Increases run-time overhead, validating *n* locations once requires *n* memory accesses
  - Couples the program to the TM system
    - A TM using eager updates allows a zombie transaction's effects to become visible to other transactions
    - A TM using lazy updates only allows the effects of committed transactions to become visible

#### Challenges with Mixed-Mode Accesses

- TM semantics must consider the interaction between transactional and non-transactional memory accesses
- Many TMs do not detect conflicts between transactional and non-transactional accesses
  - ► Can lead to unexpected behavior with zombie Txs
- Requires the non-Tx thread to participate in conflict detection

#### Weak Atomicity

- Provides Tx semantics only among Txs
- Checks for conflicts only among Txs

#### **Strong Atomicity**

Guarantees Tx semantics among Txs and non-Txs

Swarnendu Biswas (IIT Kanpur)

## Think of Challenges with Weak Atomicity

- (i) Data races between Tx and non-Tx code
- (ii) Mismatched conflict detection granularity
  - ► Tx detects conflicts at a coarser granularity
- (iii) Complicated sharing idioms
  - ▶ Use a Tx to initialize shared data, expect other threads to read the data transactionally

#### Lock-Based Synchronization

- java.util.LinkedList list is shared
- Initially list == [Itemval1==0, val2==0]

#### Thread 1

Thread 2

```
1 Item item;
2 synchronized(list) {
3 item = list.removeFirst();
4 }
5 int r1 = item.val1;
6 int r2 = item.val2;
7
```

# synchronized(list) { if (!list.isEmpty()) { Item item = list.getFirst(); item.val1++; item.val2++; }

T. Shpeisman et al. Enforcing Isolation and Ordering in STM, PLDI 2007.

#### Can we safely replace synchronize with atomic?

Consider a TM design with eager versioning and lazy conflict detection

- java.util.LinkedList list is shared
- Initially list == [Item{val1==0,val2==0}]

| 1 | Item item;                            |  |  |
|---|---------------------------------------|--|--|
| 2 | <pre>weakly_atomic(list) {</pre>      |  |  |
| 3 | <pre>item = list.removeFirst();</pre> |  |  |
| 4 | }                                     |  |  |
| 5 | <pre>int r1 = item.val1;</pre>        |  |  |
| 6 | <pre>int r2 = item.val2;</pre>        |  |  |
| 7 |                                       |  |  |
|   |                                       |  |  |

#### Thread 2

```
1 weakly_atomic(list) {
2     if (!list.isEmpty()) {
3         Item item = list.getFirst();
4         item.val1++;
5         item.val2++;
6     }
7 }
```

T. Shpeisman et al. Enforcing Isolation and Ordering in STM, PLDI 2007.

Thread 1

#### Few Issues to Consider with Weak Isolation

- Non-repeatable reads
- Intermediate lost updates
- Intermediate dirty reads
- Granular lost updates
- . . .

#### Non-repeatable Reads

A non-repeatable read can occur if a Tx reads the same variable multiple times, and a non-Tx write is made to it in between.



Unless the TM buffers the value seen by the first read, the transaction will see the update.

#### Intermediate Lost Update

An intermediate lost update can occur if a non-Tx write interposes in a transactional read-modify-write sequence. The non-Tx write can be lost, without being seen by the Tx read.

$$x = 0;$$
fread 1
Thread 2
$$tomic \begin{cases} \\ r = x; \\ x = r + 1; \end{cases}$$

$$x = 10;$$

$$x = 10;$$

Th

3

#### Intermediate Dirty Read

An intermediate dirty read can occur with a TM using eager version management in which a non-Tx read sees an intermediate value written by a transaction, rather than the final, committed value.



# Single-Lock Atomicity for Transactions

- How do we provide semantics for mixed-mode accesses?
- A program executes as if all transactions acquire a single, program-wide mutual exclusion lock

| Thread 1                                          | Thread 2                                                    |
|---------------------------------------------------|-------------------------------------------------------------|
| <pre>startTx(); while (true) {} commitTx();</pre> | <pre>startTx(); int tmp = readTx(&amp;x); commitTx();</pre> |
| What will happen with SLA?                        |                                                             |

There are many other proposed models like DLA and TSC

## **Nested Transactions**

- Nested parallelism is important
  - ▶ Utilizes increasing number of cores
  - Integrates with programming models like OpenMP
- Execution of a nested Tx is wholely contained in the dynamic extent of another Tx
- Many choices on how nested Txs interact
  - Flattened
    - Aborting the inner Tx causes the outer Tx to abort
    - Committing the inner Tx has no effect until the outer Tx commits
  - Closed
    - Inner Tx can abort without terminating its parent Tx

```
1 // Parallelize loops
2 FOR I := ...
3 FOR J := ...
4 FOR K := ...
```

```
int x = 1:
  do f
2
    StartTx();
3
    WriteTx(&x, 2):
Δ
    do f
5
    StartTx();
6
    WriteTx(&x, 3);
7
    AbortTx():
8
9
     . . .
```

# Providing Txs: TM Implementations

Software Transactional Memory (STM)

Hardware Transactional Memory (HTM)

#### STM

- + Supports flexible techniques in TM design
- + Easy to integrate STMs with PL runtimes
- + Easier to support unbounded Txs with dynamically-sized logs
- More expensive than HTMs

#### HTM

- Restricted variety of implementations
- Need to adapt existing runtimes to make use of HTM
- Limited by bounded-sized structures like caches
- + Better performance than STMs

# Software Transactional Memory

#### Software Transactional Memory (STM)

#### Data structures

- Need to maintain per-thread Tx state
- Maintain either redo log or undo log
- Maintain per-Tx read/write sets

- McRT-STM, PPoPP'06
- Bartok-STM, PLDI'06
- JudoSTM, PACT'07
- RingSTM, SPAA'08

- NoRec STM, PPoPP'10
- DeuceSTM, HiPEAC'10
- LarkTM, PPoPP'15
- ...

C. Cascaval et al. Software Transactional Memory: Why Is It Only a Research Toy? ACM Queue, vol. 6, no. 5, Sep. 2008.

#### We love questions?

Remember well-designed applications should have low conflict rates

• Is the design of undo log important in a TM with eager version management?

#### We love questions?

Remember well-designed applications should have low conflict rates

- Is the design of undo log important in a TM with eager version management?
- Is the design of redo log important in a TM with lazy version management?

# Implementing STM

• Use compilation passes to instrument the program

startTx() Tx entry point (prologue) commitTx() Tx exit point (epilogue) readTx() Tx read access writeTx() Tx write access

• TM runtime tracks memory accesses, detects conflicts, and commits/aborts Txs

```
1 atomic {
2   tmp = x;
3   y = tmp + 1;
4 }
```

6 commitTx(td);

## Object Metadata and Word Metadata







#### Pros and Cons of Metadata in Object Header







- + May lie on the same cache line
- + Single update for accesses to all fields

#### Con

- Potential for false conflicts
- Increases coupling

   (e.g., complicates GC)

#### Variants of Word-based Metadata







Use hash functions to map addresses to a fixed-size metadata space Process-wide metadata space

#### Which granularity to use?

Potential impact due to false conflicts

Impact on memory usage

Impact on performance, i.e., speed of mapping location to metadata

#### Major STM Designs

Per-object versioned locks (McRT-STM, Bartok-STM)

• Use locks for protecting updates, and use versions to detect conflicts involving reads

Global clock with per-object metadata (TL2)

Fixed global metadata (JudoSTM, RingSTM, NOrec STM)

Nonblocking STMs (DSTM)

• Does not use locks

#### Lock-Based STM with Versioned Reads

High-level design

- Pessimistic concurrency control for writes
  - ► Locks are acquired dynamically
- Optimistic concurrency control for reads
  - ► Validation using per-object version numbers

#### Header Word Optimizations in Bartok STM



#### **Other Design Choices**

- Eager vs lazy version management
- Access-time locking or commit-time locking
- Access-time locking
  - ► Can support both eager or lazy version management
  - Detects conflicts between active transactions, irrespective of whether they ultimately commit
- Commit-time locking
  - ► Can support only lazy version management

#### STM Metadata

• Versioned locks

Lock mutual exclusion of writes Version number detect conflicts involving reads

- Lock is available no pending writes, holds the current version of the object
- Lock is taken refers to the owner Tx
- Invisible reads presence of a reading Tx is not visible to concurrent Txs which might try to commit updates to the objects being read

#### Read and Write Operations

```
readTx(tx, obj, off) {
                                             writeTx(tx, obj, off, newVal) {
    tx.readSet.obi = obi;
                                                acquire(obi):
2
                                           2
    tx.readSet.ver = getVerFromMd(obj);
                                           3
3
    tx.readSet++:
                                                tx.undoLog.obj = obj;
                                           Δ
                                                tx.undoLog.offset = off,
                                           5
5
    return read(obj, off);
                                                tx.undoLog.val = read(obj, off);
6
                                           6
  3
                                                tx.undoLog++:
7
                                  eager version
                                                tx.writeSet.obj = obj;
                                  management
                                                tx.writeSet.off = off:
                                                tx.writeSet.ver = ver;
                                          11
                                               tx.writeSet++:
                                          12
                                          13
                                                write(obj, off, newVal);
                                          14
                                                release(obj);
                                          15
                                          16 }
```

#### Read and Write Operations

```
readTx(tx, obj, off) {
                                             writeTx(tx, obj, off, newVal) {
    tx.readSet.obi = obi;
                                               acquire(obi):
2
                                           2
    tx.readSet.ver = getVerFromMd(obj);
                                               undoLogInt(tx, obj, off);
                                           3
3
                                               tx.writeSet.obj = obj;
    tx.readSet++:
                                           Δ
                                               tx.writeSet.off = off:
                                           5
5
    return read(obj, off);
                                               tx.writeSet.ver = ver;
6
                                           6
  3
                                              tx.writeSet++:
7
                                           7
                                               write(obj, off, newVal);
                                           8
                                               release(obj);
                                           0
                                type specialization
                                          undoLogInt(tx, obj, off) {
                                               tx.undoLog.obj = obj;
                                          13
                                               tx.undoLog.offset = off,
                                          14
                                               tx.undoLog.val = read(obj, off);
                                          15
                                               tx.undoLog++;
                                          16
                                          17
                                             ş
```

## **Conflict Detection on Writes**

Writes

Reads

How do you detect conflicts on writes?

## **Conflict Detection on Reads**

Writes

#### Reads

| 1                     | <pre>bool commitTx(tx) {</pre>  |
|-----------------------|---------------------------------|
| 2                     | foreach (entry e in tx.readSet) |
| 3                     | if (!validateTx(e.obj, e.ver))  |
| 4                     | abortTx(tx);                    |
| 5                     | return false;                   |
| 6                     | foreach (entr e in tx.writeSet) |
| 2                     | unlock(e.obj, e.ver);           |
| Unlock increments the | return true;                    |
|                       | }                               |
| version number        | -                               |

## No Conflict on Read from Addr=200



Transaction read from the object, and its version number is unchanged at commit time

## No Conflict on Read from and Write to Addr=200



## No Conflict on Write to and Read from Addr=200



Swarnendu Biswas (IIT Kanpur)

CS 636: Transactional Memory

# Conflict on Read from Addr=200, Concurrent Tx Updates and Commits





Transaction read from the object, and there is a version mismatch during commitTx()

## Conflict on Read from Addr=200, Concurrent Write





Transaction read from the object when it was owned by some other  $\mathsf{T} \mathsf{x}$ 

## Conflict on Read from Addr=200 during Commit







Transaction is owned by some other Tx when the current reader Tx tries to commit

## Conflict Between Read and Write from Addr=200



tween

## **Practical Issues**

#### Version overflow

- Theoretical concern, is a practical concern if the metadata is "packed"
- Globally renumber objects if overflow is rare
- Distinguish between an "old" and a wrapped-around "new" version
  - Ensure that each thread validates its current Tx at least once within *n* version increments

Do these techniques (McRT, Bartok) allow zombie txs?

## Semantics of McRT and Bartok

Read set may not remain consistent during txs

Does not detect conflicts between txs and non-txs

## Hardware Transactional Memory

## Hardware Transactional Memory (HTM)

- Can provide strong isolation without modifications to non-Tx accesses
- Easy to extend to unmanaged languages

- TCC, ISCA'04
- LogTM, HPCA'06
- Rock HTM, ASPLOS'09
- FlexTM, ICS'09
- Azul HTM
- Intel TSX
- IBM Blue Gene/Q

## Possible ISA Extensions

Similar to STMs, HTMs need to demarcate Tx boundaries and transactional memory accesses

#### Explicit

- begin\_transaction
- end\_transaction
- load\_transactional
- store\_transactional

Memory accessed within a Tx through ordinary memory instructions do not participate in any transactional memory protocol

#### Implicit

- begin\_transaction
- end\_transaction

All memory accesses are transactional

## Possible ISA Extensions

Similar to STMs, HTMs need to demarcate Tx boundaries and transactional memory accesses

#### Explicit

- begin\_transaction
- end\_transaction
- load\_transactional
- store\_transactional

Memory accessed within a Tx through ordinary memory instructions do not participate in any transactional memory protocol

#### Implicit

- begin\_transaction
- end\_transaction

All memory accesses are transactional



## Explicitly vs Implicitly Transactional HTMs

#### **Explicitly Transactional**

- Provides flexibility to choose desired memory locations
  - Reduced read and write set size
- May require multiple library versions
  - ► Limits reuse of legacy libraries in HTMs

### Implicitly Transactional

- Larger read and write sets
- Easy to reuse software libraries

## Design Issues in HTMs

#### Tracking read and write sets

- Introducing additional structures like transactional cache complicates the data path
- Recent ideas extend existing data caches to track accesses
  - Granularity matters (one read bit for a cache line)
- Need to be careful with writes

#### **Conflict detection**

- Natural to piggyback on cache coherence protocols to detect conflicts
- Most HTMs detect conflicts eagerly, and transfer control to a software handler

## Intel Transactional Synchronization Extensions (TSX)

- TSX supported by Intel in selected series based on Haswell microarchitecture
- TSX hardware can dynamically determine whether threads need to serialize lock-protected critical sections

Coarse-grained Locks and Transactional Synchronization Explained

Swarnendu Biswas (IIT Kanpur)

Transactional Synchronization with Intel® Core™ 4th Generation Processor

## High-Level Goal with Transactions

- Hardware dynamically determines whether threads need to serialize
  - ► For example, with lock-protected critical sections
- Hardware serializes only when required
- Thus, processor exposes and exploits concurrency that is hidden due to unnecessary synchronization
- Lock elision idea introduced by Ravi Rajwar and James R. Goodman in 2001
  - ► Remove locks, run code as a transaction
  - ▶ If there are conflicts, abort and rerun code with locks intact
  - On success, commit the transaction's writes to memory

R. Rajwar and J. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution, MICRO 2001.

## Intel Transactional Synchronization Extensions

#### **TSX** operation

- Optimistically executes critical sections eliding lock operations
- Commit if the Tx executes successfully
- Otherwise abort discard all updates, restore architectural state, and resume execution
- Resumed execution may fall back to locking

## **TSX** Interface

#### Hardware Lock Elision (HLE)

- xacquire
- xrelease

#### Extends HTM support to legacy hardware

Restricted Transactional Memory (RTM)

- xbegin
- xend
- xabort

New ISA extensions

## Hardware Lock Elision (HLE)

- Application uses legacy-compatible prefix hints to identify critical sections
  - ► Hints ignored on hardware without TSX
- HLE provides support to execute critical section transactionally without acquiring locks
- Abort causes a re-execution without lock elision
- Hardware manages all state

## Goal with Intel TSX



Coarse-grained Locks and Transactional Synchronization Explained

Swarnendu Biswas (IIT Kanpur)

#### CS 636: Transactional Memory

## Lock Acquire Code



## **HLE Interface**



## Restricted Transactional Memory (RTM)

- Software uses new instructions to identify critical sections
  - ▶ Similar to HLE, but more flexible interface for software
  - ► Requires programmers to provide an alternate fallback path
- Processor may abort RTM transactional execution for several reasons
- Abort transfers control to target specified by XBEGIN operand
  - ► Abort information encoded in the EAX GPR

## RTM Interface

acquire(mutex)





## **XTEST** Instruction

Queries whether the logical processor is transactionally executing in a transactional region identified by either HLE or RTM

## Aborts in TSX

- Conflicting accesses from different cores (data, locks, false sharing)
  - ▶ TSX maintains read/write sets at the granularity of cache lines
- Capacity misses
- Some instructions always cause aborts (system calls, I/O)
- Eviction of a transactionally-written cache line
- Eviction of transactionally-read cache lines do not cause immediate aborts
  - ► Backed up in a secondary structure which might overflow

Section 12.2.4 in Intel 64 and IA-32 Architectures Optimization Reference Manual

## Finding Reasons for Aborts can be Hard

| EAX register<br>bit position | Meaning                                                                                                          |
|------------------------------|------------------------------------------------------------------------------------------------------------------|
| 0                            | Set if abort caused by XABORT instruction                                                                        |
| 1                            | If set, the transaction may succeed on a retry. This bit is always clear if bit 0 is set.                        |
| 2                            | Set if another logical processor conflicted with a memory address that was part of the transaction that aborted. |
| 3                            | Set if an internal buffer overflowed                                                                             |
| 4                            | Set if debug breakpoint was hit                                                                                  |
| 5                            | Set if an abort occurred during execution of a nested transaction                                                |
| 23:6                         | Reserved                                                                                                         |
| 31:24                        | XABORT argument (only valid if bit 0 set, otherwise reserved)                                                    |

## **TSX Implementation Details**

- Every detail is not known
  - ▶ Read and write sets are at cache line granularity
  - Uses L1 data cache as the storage
- Conflict detection is through cache coherence protocol

#### TSX caveats

- No guarantees that Txs will commit
- There should be a software fallback independent of TSX to guarantee forward progress

## Applying Intel TSX



R. Rajwar. Going Under the Hood with Intel's Next Generation Microarchitecture Codename Haswell. QCon 2012.

CS 636: Transactional Memory

## Relevance of TSX-like Concepts

- GNU glibc 2.18 added support for lock elision of pthread mutexes of type PTHREAD\_MUTEX\_DEFAULT
- Glibc 2.19 added support for elision of read/write mutexes
  - Depends whether the -enable-lock-elision=yes parameter was set at compilation time of the library
- Java JDK 8u20 onward support adaptive elision for synchronized sections when the -XX:+UseRTMLocking option is enabled
- Intel Thread Building Blocks (TBB) 4.2 supports elision with the speculative\_spin\_rw\_mutex

## References

- T. Harris et al. Transactional Memory. 2<sup>nd</sup> edition, Morgan Kaufmann.
- Web Resources About Intel® Transactional Synchronization Extensions
- R. Rajwar and M. Dixon. Intel®Transactional Synchronization Extensions. Intel Developer Forum, 2012.
- A. Kleen. Adding lock elision to Linux. Linux Plumbers Conference, 2012.
- Dr. Roman Dementiev. Making the Most of Intel®Transactional Synchronisation Extension.
- R. Rajwar. Going Under the Hood with Intel's Next Generation Microarchitecture Codename Haswell. QCon 2012.