## Optimizing Spin Locks for Multicore Processors

中正大學 資訊工程研究所

指導教授: 羅習五 副教授

學生:林靖紳

## Outline

- Introduction
- Related Work
- Implementation
- Evaluation
- Conclusion



# Introduction

#### Introduction - Spinlocks Overview in Multi-Core CPUs

- Rise of Multi-core Processors
  - The advent of multi-core processors has revolutionized the computing field, offering

unprecedented performance capabilities.

- This architecture allows parallel processing but also introduces the challenge of ensuring synchronization mechanisms for concurrent access to shared resources.
- Why focus on spinlocks:
  - Spinlocks can be used independently, crucial for high-performance applications.
  - Other locks like mutexes and semaphores are implemented using spinlocks; for example, a mutex is a spinlock combined with a context switch.

#### Introduction - System Architecture Impact

- NUMA and ccNUMA Systems
  - Memory access times vary based on processor proximity.
  - Maintaining data coherence (cache coherence protocols) introduces overhead.
  - Spinlocks can increase interconnect traffic, reducing system performance.
- AMD Threadripper PRO 3995WX Case Study
  - Under the Multi-Chip Module (MCM) architecture, the system includes 4 NUMA nodes and multiple cores.
  - Uses snoop-based and directory-based cache coherence.
  - Serialization costs: Contention cost (next task determination) and Handoff cost (data passing). 5

#### Introduction - System Architecture Impact

#### • Core-to-Core Latency

• Significant latency variations in ccNUMA environments impact performance.



Operating System Lab, National Chung-Cheng University

### Introduction - Optimizing Spinlock Performance

- NUMA-Aware Spinlocks and RON
  - Traditional grouping spinlocks often assume uniform latencies within a numa node.
  - RON reduces handoff costs using a routing table, optimizing data exchange paths.
- Potential Issues with RON:
  - **Memory Overhead:** Data structure increases linearly with the number of cores, leading to higher memory usage.
  - **Contention Cost:** RON handoff locks with the time comlexity of O(N).

## Introduction - nxtRON ans shRON

- New Algorithms: next-RON (nxtRON) and shared-RON (shRON)
  - Developed to address limitations in RON.
  - Aim to enhance performance and scalability in multi-core systems.
- nxtRON:
  - Focuses on **optimizing contention costs**.
  - Aims to reduce the **time complexity** of RON
- shRON:
  - Targets **reducing memory overhead** and **cache contention** compared to traditional spinlock approaches.
  - Implements innovative data sharing and synchronization techniques tailored for modern multi-core CPUs.

# Related Work

#### **RON: One-Way Circular Shortest Routing**

- Problem Description
  - The number of cores on a single CPU is increasing
  - Accessing shared data involves transferring data between different CPU cores
  - Target: Minimize the data transfer between cores
- Problem Solution
  - "How to allow threads on different cores to access shared data" is more like a path planning problem
  - Pre-establish routing table using "Approximate Shortest Circular Path" to determine the order in which each core enters the critical section.

#### RON

#### Algorithm 1 The RON Algorithm

```
int TSP_ID_ARRAY[NUM_core]; /*per-process*/
   thread_local TSP_ID; /*thread-local-storage*/
   atomic_bool InUse=false; /*per-lock*/
   atomic int WaitArray[NUM core]; /*per-lock*/
4
    TSP_{ID} = TSP_{ID} ARRAY[qetcpu()]
   void spin_init()
6
        for (each element in WaitArray)
8
            element = 0:
   void spin_lock()
9
        WaitArray[TSP_ID]=1;
10
11
        while(1)
            if (WaitArray[TSP_ID]==0)
12
13
                return:
14
            if (cmp_xchq(&InUse, false, true)):
                WaitArray[TSP ID] = 0
15
16
                return:
17
   void spin_unlock()
        for (int i=1; i<NUM core; i++)</pre>
18
            if (WaitArray[(i+TSP_ID)%NUM_core]==1)
19
                WaitArray[(i+TSP_ID)%NUM_core]=0;
20
21
                return;
22
        InUse=false:
```

- int TSP\_ID\_ARRAY[]
  - Pre-establish routing table
- thread\_local TSP\_ID
  - The core's order in the routing table
  - Each thread has its own TSP\_ID
- atomic\_bool InUse
  - Someone gets the lock or not
  - If *InUse = false*: no thread in CS
- atomic\_int WaitArray[]
  - $\circ$   $\;$  Which core wants the lock (waiting to
    - enter CS)

#### RON

#### Algorithm 1 The RON Algorithm

```
int TSP ID ARRAY[NUM core]; /*per-process*/
   thread_local TSP_ID; /*thread-local-storage*/
   atomic_bool InUse=false; /*per-lock*/
3
   atomic_int WaitArray[NUM_core]; /*per-lock*/
4
   TSP_{ID} = TSP_{ID}_{ARRAY}[getcpu()]
 5
   void spin_init()
        for (each element in WaitArray)
            element = 0;
   void spin lock()
9
        WaitArray[TSP ID]=1;
10
        while (1)
            if (WaitArray[TSP_ID]==0)
                return;
            if (cmp_xchq(&InUse, false, true)):
14
                WaitArray[TSP ID] = 0
15
                return;
16
   void spin_unlock()
17
18
        for (int i=1; i<NUM_core; i++)</pre>
            if (WaitArray[(i+TSP_ID)%NUM_core]==1)
19
                WaitArray[(i+TSP ID)%NUM core]=0;
20
                return:
21
22
        InUse=false:
```

- Lock procedure
  - Goal: get the lock
  - Register the TSP\_ID wants the lock
  - Wait until the *previous TSP\_ID* hand the

lock to TSP\_ID

- Or wait until there is no thread in CS
  - InUse = false

Operating System Lab, National Chung-Cheng University

#### RON

#### Algorithm 1 The RON Algorithm

```
int TSP_ID_ARRAY[NUM_core]; /*per-process*/
   thread_local TSP_ID; /*thread-local-storage*/
   atomic_bool InUse=false; /*per-lock*/
3
   atomic_int WaitArray[NUM_core]; /*per-lock*/
4
   TSP_{ID} = TSP_{ID}_{ARRAY}[qetcpu()]
5
   void spin_init()
       for (each element in WaitArray)
            element = 0;
8
   void spin_lock()
9
       WaitArray[TSP ID]=1;
10
       while (1)
11
12
            if (WaitArray[TSP_ID]==0)
                return;
13
            if (cmp_xchq(&InUse, false, true)):
14
                WaitArray[TSP ID] = 0
15
                return:
16
   void spin_unlock()
        for (int i=1; i<NUM_core; i++)</pre>
18
            if (WaitArray[(i+TSP_ID)%NUM_core]==1)
                WaitArray[(i+TSP ID)%NUM core]=0;
                return:
        InUse=false:
```

• Unlock procedure

- Goal: find the next and release the lock
- Find the next core wants to get the lock from *TSP\_ID+1*
- $\circ$   $\;$  If find someone want the lock :
  - Hand the lock
  - Set WaitArray[TSP\_ID] to 1
- else if cannot find someone want the lock
  - set *InUse* into false

13

Operating System Lab, National Chung-Cheng University

## Related Work - Plock (GNU Pthread's Spinlock)

- Description:
  - Implements test-and-test-and-set (TTAS) mechanism for lock acquisition.
  - Simple and commonly used in POSIX environments like GNU.
  - Prone to fairness issues and scalability limitations under high contention.
- Implementation:
  - Threads repeatedly check and set the lock status using TTAS.
  - Can lead to increased contention and potential thread starvation.

## Related Work - Ticket Spinlock

- Description:
  - Uses a ticket-based mechanism where threads wait for their ticket number to match the service

#### number.

- Ensures fairness by strictly adhering to the order of ticket issuance.
- Implementation:
  - Threads increment their ticket number when waiting to enter the critical section.
  - Waits until their ticket number matches the current service number for entry.

#### **Related Work - RONPlock**

- Description:
  - Integrates RON algorithm with pthreads spinlock to tackle oversusbscription.
  - Uses optimized locking-unlocking order and routing table to reduce communication costs.
  - Manages thread contention and entry into critical sections efficiently.
- Implementation:
  - Utilizes *WaitArray* to track waiting threads per core.
  - Employs *atomic\_fetch\_add* for managing nWait variables locally.

#### **Related Work - RONPlock**

- Implementation:
  - Utilizes *WaitArray* to track waiting

threads per core.

• Employs *atomic\_fetch\_add* for

managing numWait variables locally.

#### Algorithm 3 The RON-plock Algorithm

```
struct PLock {numWait=0, lock=MUST WAIT;}
   atomic_bool InUse=false;//per-lock variable
   PLock WaitArray[NUM_core]; //per-lock variable
   void lock()
    atomic_inc(&WaitArray[TSP_ID].numWait);
5
    while(1)
6
     if (cmpxchg(&WaitArray[TSP_ID].lock,HAS_LOCK,
          MUST WAIT))
      return:
8
     if (cmpxchq(&InUse, false, true))
9
10
      return:
   void unlock()
12
    atomic_dec(&WaitArray[TSP_ID].numWait);
13
     for(int i = 1;i < NUM core+1;i++)</pre>
      if(WaitArray[(TSP_ID+i)%NUM_core].numWait>0)
14
        WaitArray[(TSP ID+i)%NUM core]=HAS LOCK;
15
16
       return:
    InUse=false:
17
```

#### **Related Work - RONTick**

- Description:
  - Oversubscribed version of RON algorithm with ticket spinlock.
  - Enhances scalability under low contention by using ticket numbers for queuing.
  - Ensures fairness in critical section access across cores.
- Implementation:
  - Each core manages *grant* and *ticket* variables to control thread entry order.
  - Threads spin on a loop until their ticket matches the *grant* value for their core.

## Related Work - MCS Spinlock

- Description:
  - Queue-based spinlock algorithm.
  - Each thread spins on a local flag and waits its turn in a queue structure.
- Implementation:
  - Known for its scalability and efficiency in NUMA architectures.
  - Supports multiple threads waiting for the lock in a FIFO order.

#### **Related Work - C-BO-MCS Spinlock**

- Description:
  - Grouping based Spinlock
  - Combines MCS locks for NUMA nodes with back-off locking strategy.
  - Prioritizes neighboring cores to reduce handover costs.
  - Uses test-and-test-and-set with back-off for inter-node competition.
- Implementation:
  - Manages two types of locks: local MCS and back-off for cross-node competition.
  - Enhances performance by minimizing cross-socket communication.

## Related Work - Shuffle Lock (ShflLock)

- Description:
  - Dynamically reorders the queue of waiting threads based on a predefined policy.
  - Improves fairness and reduces contention, though may introduce short delays during reordering.
- Implementation:
  - Ensures efficient thread scheduling without additional data structures on the critical path.
  - Optimizes performance by managing thread order dynamically.

## Implementation



## Implementation - nxtRON: Design Overview (1)

- RON-Plock Algorithm:
  - lock: Uses an integer array(WaitArray) to spin
  - unlock: Uses the same array to select the next thread by for loop traversing.
- nxtRON Introduction:
  - **Structure Separation**: nxtRON separates the "auxiliary variable" from the "spinning variable", keeping data consistent.
  - **Bitmap Usage**: Uses a bitmap as the auxiliary variable to track threads waiting for the lock.
- Bitmap Mechanism Advantages:
  - Reduces overhead during thread selection.
  - Reduces time complexity from O(#NUM\_CORE) to O(1).

### Implementation - nxtRON: Design Overview (2)



off the lock using the bitmap

Operating System Lab, National Chung-Cheng University

25

Thread C

Thread A

## Implementation - nxtRON: Design Overview (3)

- Observation:
  - Separation of auxiliary variable and spinning variable reduces the cost of *cmpxchg*.
    - Spinning variable is only changed during wait and wakeup.
- Key Advantages of nxtRON
  - **Reduced Time Complexity:** 
    - Reduces time complexity of selecting the next thread from O(N) to O(1).
  - Cache Coherence:
    - By ensuring that the data within the auxiliary and spinning variable is consistent, the nxtRON

algorithm benefits from improved cache coherence, further enhancing its performance.

#### Implementation - nxtRON: Implementation Details (1)

```
Data Structuros
1 /*per-process*/
   int TSP ID ARRAY[NUM CORE];
3 /*thread-local-storage*/
   thread local TSP ID;
4
   TSP ID = TSP ID ARRAY[getcpu()];
 5
6
   struct WAIT t:
 7
8
        /*Number of waiting threads*/
       int nWait=0;
9
       /*Release(REL) & Acquire(ACQ)*/
10
        int Lock=REL;
11
12
13
   /*per-lock structure*/
   struct nxtRONPlock t:
14
15
       WAIT t WaitAry[NUM CORE];
16
        atomic bool InUse=false;
17
        /*#bits=NUM CORE*/
18
        long nxt ary;
        int record next[NUM CORE];
19
```

- *WaitArray[]*: (spinning variable)
  - Array indicating the wait status of each core.
- atomic\_bool InUse:
  - if any thread is currently in the critical section.
  - InUse = false: no thread in CS.
- long nxt\_ary: (auxiliary variable)
  - Bitmap where each bit corresponds to a core's wait status.
- record\_next[]:
  - $\circ$   $\,$   $\,$  Recording the next core scheduled to enter CS  $\,$

#### Implementation - nxtRON: Implementation Details (2)

```
26
    void spin lock():
                                                       spin lock():
27
         atomic inc(&WaitAry[TSP ID].nWait,1);
28
         set bit(TSP ID, nxt ary);
                                                            Thread Request
29
         while(1):
             while(WaitAry[TSP ID].Lock!=0 &&
30
                                                                 Sets the corresbonding bit in nxt ary
                                                              0
             InUse==True):
31
32
                  CPU PAUSE();
                                                            Busy-Wait Loop
33
             if(cmp xchg(&WaitAry[TSP ID].Lock,
34
             ACQ, REL):
                                                                 Busy Waiting until its turn to acquire
                                                              0
35
                  return;
                                                                 the lock
36
             if(cmp xchg(&InUse, false, true)):
37
                  return:
                                                                 Or there is no thread in CS
                                                              \bigcirc
```

#### **Implementation - nxtRON: Implementation Details (3)**

```
/*Find the first 1 bit set (FFS) after ID*/
35
   int find next(int ID):
36
        long tmp = rotate right(nxt ary, TID)
37
38
        return (ffs(tmp)+ID)%NUM CORE;
39
40
   void spin unlock():
        int next = record next[TSP ID];
41
        if(next>-1 && WaitAry[next].nWait>0):
42
43
             WaitAry[next].Lock = 0;
             atomic dec(&WaitAry[TSP ID].nWait,1);
44
             clr bit(TSP ID, nxt ary);
45
             record next[next] = find next(next+1);
46
47
             return;
48
        InUse=false;
49
        atomic dec(&WaitAry[TSP ID].nWait,1);
50
        clr bit(TSP ID, nxt ary);
```

spin\_unlock():

- Handoff the lock
  - We use an int record\_next array to record the TSP\_ID's next

```
• If next > -1:
```

We had found the value in the last roud

- TSP\_ID hand the lock to next (line 43)
- TSP\_ID put down the hand (line 45)
- TSP\_ID find the next's next (line 46)
- else: set *InUse* into false (line 48)

#### Implementation - nxtRON: Implementation Details (4)



#### **Implementation - nxtRON: Conclusion**

- Efficiency and Scalability:
  - Reduces thread selection overhead.
  - Improves system performance and scalability in multi-core environments.
- Limitations
  - Does not provide optimizations for space complexity.
  - Focuses on improving time complexity and reducing overhead without optimizing memory footprint.



## Implementation - shRON: Design Overview (1)

- RON-Plock Algorithm:
  - Uses a **per-lock** *WaitArray[*], which size increases linearly with the number of cores.
- ShRON introduction:
  - **Per-Process Shared WaitArray[]:** Reduces memory contention and overhead.
  - WaitArray[] Bitmap: Uses *atomic\_long* to track cores waiting for the lock.
  - Lock Representation: Two bits represent a lock in WaitArray.
  - Capacity: Each atomic\_long can track 32 locks in our system.

### Implementation - shRON: Design Overview (2)

- Observation
  - Processor-Native Types:
    - Use types like *atomic\_long* for hardware support.
  - Performance Impact:
    - Multiple cores accessing the same *atomic\_long* leads to lower performance.
  - Bitmap Approach:
    - shRON uses a bitmap so that a core waits on multiple locks, minimizing performance issues.

#### Implementation - shRON: Design Overview (3)

- Key Advantages of shRON
  - Improved Cache Utilization:
    - Optimized data access patterns enhance cache memory usage.
  - Scalability:
    - Dynamic bitmap resizing ensures efficient resource use regardless of active locks.
  - Space Complexity Reduction:
    - Reduces space complexity by 2/(sizeof(atomic\_long)) times compared to RON-Plock.

## Implementation - shRON: Implementation Details (1)

- Data Structure:
  - 1 /\*per-process\*/
  - 2 int TSP\_ID\_ARRAY[NUM\_CORE];
  - 3 /\*PEND\_BIT and LOCK\_BIT denote one lock,
  - 4 each row in WaitAry represents 32 locks\*/
  - 5 atomic\_long WaitAry[NUM\_LOCK/32][NUM\_CORE];

- *WaitArray[]:* Array of *atomic\_long* structures indicating each core's wait status.
  - LOCK\_BIT: Shows lock status (ACQUIRE (ACQ) or RELEASE (REL)).
  - PEND\_BIT: Shows waiting status (PENDING (PEN) or IDLING (IDL)).

#### Implementation - shRON: Implementation Details (2)

- Assume there are 32 locks to create and the #NUM\_CORE is "n"
  - We will first malloc the WaitAry size as 1\*n

| WaitAry[0] | core 0 | core 1 |  | core n-1 |
|------------|--------|--------|--|----------|
|------------|--------|--------|--|----------|

• We zoom into the first core: WaitAry[0][0], which is a atomic\_long variable with 64 bits.

| #bits   | 63       | 62       | <br>3      | 2        | 1        | 0        |
|---------|----------|----------|------------|----------|----------|----------|
| Meaning | PEND_BIT | LOCK_BIT | PEND_BIT   | LOCK_BIT | PEND_BIT | LOCK_BIT |
| Lock    | Lock 31  |          | <br>Lock 1 |          | Lock 0   |          |

#### **Implementation - shRON: Implementation Details (3)**

- Data Structure:
- 7 /\*thread-local-storage\*/
- 8 thread\_local TSP\_ID;
- 9 TSP\_ID = TSP\_ID\_ARRAY[getcpu()];

```
10
```

14

11 /\*per-lock structure \*/

```
12 struct shRON_t:
```

```
13 /*The lock is acquired or not*/
```

```
int InUse:1;
```

```
15 /*The index bit in long WaitAry(0~63)*/
```

```
16 int WaitID:6;
```

- 17 /\*The column index of WaitAry\*/
- 18 int ColID:25;

- int InUse:1
  - if any thread is currently in the CS
  - InUse = 0: no thread in CS
- int WaitID:6
  - Six-bit integer representing the

LOCK\_BIT index. (0, 2, 4, 6,.., 62)

- The third lock will get the WaitID as 4
- int ColID: 25
  - 25-bit integer denoting the column index in *WaitAry[ColID][#CORE]*.
  - $\circ$  The 34th lock will get the ColID as 1.

#### Implementation - shRON: Implementation Details (4)

```
/*Denotes "WaitAry[ColID][TSP_ID]" as WaitPos*/
20
    void spin lock():
21
      while(1):
22
        set_bit(PEND_BIT, WaitPos);
23
        /*bitvl() means the bit value*/
24
        while(bitvl(LOCK_BIT, WaitPos)==REL
25
        && InUse==1)
26
27
            CPU PAUSE();
        if(bitvl(LOCK_BIT, WaitPos)==ACQ):
28
          /* cmp xchg set PEND BIT as IDL,
29
30
          LOCK BIT as REL*/
            if(putDownLock(LOCK BIT,WaitPos)):
31
                return;
32
33
        if (cmp_xchg(&InUse, false, true)):
            clr bit(PEND BIT, WaitPos);
34
35
            return:
```

- spin\_lock()
  - Thread Request
    - Set the PEND\_BIT as PEND
  - Busy-Wait Loop
    - Busy Waiting until its turn to acquire the lock
    - Or there is no thread in CS
  - $\circ$  Get the lock
    - Set PEND\_BIT as IDL
    - Set the LOCK\_BIT as REL (line 31)
       39

#### Implementation - shRON: Implementation Details (5)

```
/*Denotes "WaitAry[ColID][nxt]" as NxtWaitPos*/
37
    void spin_unlock():
38
        for(int i=1; i<NUM_CORE; i++):</pre>
39
            int nxt = (TSP ID+i)%NUM CORE;
40
            if(bitvl(PEND BIT, NxtWaitPos)):
41
42
                set bit(LOCK BIT, NxtWaitPos);
43
                return:
        InUse=0:
44
        clr_bit(PEND_BIT, WaitPos);
45
```

- spin\_unlock()
  - Like RON-Plock
    - Using a for-loop traverse
    - Visit the bit value of WaitAry
  - Check the PEND\_BIT as the thread want

the lock or not. (line 42)

• Release the lock by set the next's

```
LOCK_BIT as ACQ.
```

#### **Implementation - shRON: Conclusion**

- Efficiency and Scalability:
  - Reduces the space complexity of RON-Plock and nxtRON.
  - Improves system performance and scalability in multi-lock environments.
- Limitations
  - Despite these advantages, shRON does incur increased computational overhead due to

the additional bit operations involved in its design.

## Evaluation

## **Evaluation - Testing Environment**

- Model name: AMD Ryzen Threadripper PRO 3995WX
- Number of Cores: 64-Core Processor
- Virtual core: 128 (virtual core per core is 2)
- Architecture: x86\_64
- Compile Environment: gcc version 10.5.0
- Target: AMD Zen+ and x86 64-linux-gnu
- Thread model: POSIX
- Linux Version: 5.4.0-177-generic

#### **Evaluation - Testing Program - Userspace**

#### • Userspace:

• We analyze each lock method in a quantitative manner through a controlled

#### microbenchmark

- We construct our locking algorithms under a public library: LiTL
  - Library for Transparent Lock interposition
  - LiTL is a library that allows executing a program based on Pthread mutex locks with another locking algorithm

44

Author : Hugo Guiroux < hugo.guiroux at gmail dot com>

Related Publication: Multicore Locks: the Case is Not Closed Yet, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, USENIX ATC'16. Operating System Lab, National Chung-Cheng University

#### **Evaluation - Testing Program - Microbenchmarks**

spinunlock()

while(1) {

//non-critical section
for\_loop\_sleep(nCS ± 15%);

• nCS:

Non-critical section / Remainder section

• It reflects the contention of the process

• The random variable is utilized to

simulate varying program loads

## **Evaluation - Throughput in High Contention**



Operating System Lab, National Chung-Cheng University

#### **Evaluation - Normalized Throughput in Low Contention**



Operating System Lab, National Chung-Cheng University

#### Evaluation - Scalability with RS=10000 ns



Operating System Lab, National Chung-Cheng University

#### **Evaluation - Long-Term Fairness**

Long-term Fairness (CV, 10 secs)



Operating System Lab, National Chung-Cheng University

#### Evaluation - Handover Time per CS (ns)



Operating System Lab, National Chung-Cheng University

#### **Evaluation - Contention Cost**

#### ---- RON\_Plock hRON - MCS 320K ---- RON\_Ticket ShflLock - Plock - Ticket × nxtRON 280K s 240K 200K 160K 120K 120K 80K 73k 40K 52K 43K 20 120 40 60 80 100 Average Number of Threads in LS

Lock-Unlock Time (Contention Overhead)

Operating System Lab, National Chung-Cheng University

#### Evaluation - Lock Release Time (ns)



Operating System Lab, National Chung-Cheng University

#### **Evaluation - Oversubscribed**



Operating System Lab, National Chung-Cheng University

#### Evaluation - Google - leveldb (ms)



## **Evaluation - Testing Program - Linux Kernel**

- Linux kernel locking algorithm
  - **qspinlock**

111

112

113

115

}

- Fast path:
  - TTAS In low contention scenarios (few competitors), the lock is directly acquired through TTAS, like plock
- Slow path:
  - MCS In high contention scenarios (many competitors), the slow path is entered

```
107 static __always_inline void queued_spin_lock(struct qspinlock *lock)
```

```
108 {
109 int val = 0;
110
```

```
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
```

return;

114 queued\_spin\_lock\_slowpath(lock, val);

## **Evaluation - Testing Program - Linux Kernel**

- Linux kernel locking algorithm
  - By rewriting these two functions, we implement nxtRON in the Linux kernel.

```
80 extern void spin unlock(struct gspinlock *spin);
    static always inline void queued spin lock(struct qspinlock *lock)
 81
 82
    /*
 83
 84
        int val = 0:
 85
        if (likely(atomic try cmpxchg acquire(&lock->val, &val, Q LOCKED VAL)))
 86
 87
             return:
 88
 89
        queued spin lock slowpath(lock, val);
 90
    */
 91
        spin lock(lock);
 92
 93
    #endif
 94
 95
    #ifndef queued spin unlock
    /**
 96
     * queued spin unlock - release a queued spinlock
 97
     * @lock : Pointer to gueued spinlock structure
 98
99
     */
    static always inline void queued spin unlock(struct qspinlock *lock)
100
101
102
        //smp store release(&lock->locked, 0);
        spin unlock(lock);
103
104
105 #endif
```

#### **Testing Program - mmap**

#### Linux Kernel - mmap (ms)

|           | mmap<br>(ms) | clone<br>(ms) | mprotect<br>(ms) | munmap<br>(ms) | geomean |
|-----------|--------------|---------------|------------------|----------------|---------|
| qspinlock | 1883.0       | 276.6         | 911.6            | 87.4           | 451.4   |
| RON       | 2050.9       | 259.3         | 637.3            | 37.9           | 336.6   |
| nxtRON    | 1961.8       | 280.7         | 609.0            | 37.9           | 335.7   |

#### Linux Kernel - leveldb (MB/sec)



# Conclusion

## Conclusion

- nxtRON (next-RON):
  - Uses bitmap to track waiting threads, reducing time complexity of RON.
  - Cuts lock acquisition time by up to 20%, and outperforms in Linux Kernel
- shRON (shared-RON):
  - Implements shared data structures, reducing memory and cache contention.
  - Enhance throughput by about 25% in high contention.
- Performance Evaluations:
  - Outperform existing solutions (RON-Plock, RON-ticket) by 10% to 15%.
  - Enhance system throughput by up to 20% across various operational modes.

#### **Conclusion - Future Work**

- Combine the two algorithms
- Integrate the combined algorithm into Linux kernel for broader optimization.
- Explore impact on energy efficiency and thermal management.