# Work Stealing Queue

This is an implementation of a Work Stealing Queue described in a series of [blog articles](https://blog.molecular-matters.com) by [Stefan Reinalter](https://twitter.com/molecularmusing) at Molecular Matters, beginning with [Job System 2.0](https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/).

## Origin

In the blog articles, Stefan uses a lock-free structure with memory barriers and compare-exchange primitives. This algorithm uses the same ideas, but in a slightly different way than described by Stefan: it dispenses with the memory barriers and uses compare-exchange together with exchange primitives to schedule jobs through a [Work Stealing Queue](https://stackoverflow.com/questions/27830691/work-stealing-and-deques), a structure taught in many CS courses, such as [CSE P 506 - Concurrency](https://courses.cs.washington.edu/courses/csep506/11sp/Home.html). The version described by Stefan is a variation of the [Chase-Lev deque](https://www.dre.vanderbilt.edu/~schmidt/PDF/work-stealing-dequeue.pdf). There is an implementation of that queue in [Amanieu](https://github.com/Amanieu)'s [asyncplusplus](https://github.com/Amanieu/asyncplusplus), used by its default task scheduler; this is the header file for it: [work_steal_queue.h](https://github.com/Amanieu/asyncplusplus/blob/master/src/work_steal_queue.h).

## Description of Algorithm

A Work Stealing Queue is a double-ended queue that each thread/core maintains. The thread which owns the queue pushes jobs onto the bottom end, and other threads steal jobs from the top end when they have nothing to do in their own queues.

```console
+--------+ <- entries[ 0 ]
|  top   | <- stealers consume here: job = entries[ top++ ]
|        |
|   ||   |
|        |
|   vv   |
|        |
| bottom | <- owner pushes here:   entries[ bottom++ ] = job
|        |    owner consumes here: job = entries[ --bottom ]
|        |
+--------+ <- entries[ MASK_JOBS ]
```

The parallelism obtained by this lock-free structure is fine grained. The included test is able to schedule synthetic jobs with approximately 100 ns (per thread) of overhead on my i9-7960X CPU. I surmise that most of this is due to the cache-coherency traffic caused by the x86 `cmpxchg` and `xchg` instructions, along with the push/pop mechanics of the queue. I measure this 100 ns overhead by putting jobs into the queue with only one worker thread.
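The repository's code differs in its details, but the owner/stealer protocol in the diagram can be sketched with C++11 atomics. This is a minimal, illustrative version in the spirit of the Chase-Lev deque cited above, not the repository's actual implementation: the `Job` type, the `NUM_JOBS` size, and the sequentially consistent orderings are assumptions, and the full-queue check a bounded ring needs is omitted for brevity.

```cpp
#include <atomic>
#include <cstdint>

/* hypothetical job type; the repository's Job is more elaborate */
struct Job {
  void (*func)( void * );
  void  *arg;
};

static constexpr int64_t NUM_JOBS  = 4096; /* power of two (assumption) */
static constexpr int64_t MASK_JOBS = NUM_JOBS - 1;

class WorkStealQueue {
  std::atomic<int64_t> top;    /* stealers consume here */
  std::atomic<int64_t> bottom; /* owner pushes and pops here */
  Job *entries[ NUM_JOBS ];

public:
  WorkStealQueue() : top( 0 ), bottom( 0 ) {}

  /* owner only: entries[ bottom++ ] = job (no full-ring check in this sketch) */
  void push( Job *job ) {
    int64_t b = bottom.load( std::memory_order_relaxed );
    entries[ b & MASK_JOBS ] = job;
    bottom.store( b + 1, std::memory_order_seq_cst );
  }

  /* owner only: job = entries[ --bottom ], racing stealers for the last item */
  Job *pop( void ) {
    int64_t b = bottom.load( std::memory_order_relaxed ) - 1;
    bottom.store( b, std::memory_order_seq_cst );
    int64_t t = top.load( std::memory_order_seq_cst );
    if ( t > b ) {                          /* queue was already empty */
      bottom.store( b + 1, std::memory_order_relaxed );
      return nullptr;
    }
    Job *job = entries[ b & MASK_JOBS ];
    if ( t != b )
      return job;                           /* more than one entry, no race */
    /* last entry: compete with stealers via compare-exchange on top */
    if ( ! top.compare_exchange_strong( t, t + 1 ) )
      job = nullptr;                        /* a stealer won the race */
    bottom.store( b + 1, std::memory_order_relaxed );
    return job;
  }

  /* any other thread: job = entries[ top++ ] */
  Job *steal( void ) {
    int64_t t = top.load( std::memory_order_seq_cst );
    int64_t b = bottom.load( std::memory_order_seq_cst );
    if ( t >= b )
      return nullptr;                       /* nothing to steal */
    Job *job = entries[ t & MASK_JOBS ];
    if ( ! top.compare_exchange_strong( t, t + 1 ) )
      return nullptr;                       /* lost to the owner or a stealer */
    return job;
  }
};
```

Only the owning thread calls `push` and `pop`; any thread may call `steal`, and the single compare-exchange on `top` arbitrates the race for the last remaining entry.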
To build and run the included test:

```console
$ git clone https://github.com/injinj/WSQ
$ cd WSQ
$ g++ -Wall -Wextra -std=c++11 -O3 test_job.cpp -pthread
$ ./a.out -c 1
Sizeof Job Sys Ctx: 528
Sizeof Job Thread: 524408
Sizeof Job: 48
Sizeof Job Alloc: 65480
Number of threads: 1
Serial workload: 1000 iterations
Parallel workload: 10000 jobs
......................................................................
Workload Serial Elapsed Parallel Elapsed Speedup
-------- -------------- ---------------- -------
     100         142 ns           192 ns    0.74 (- 50 / thr: 50)
     200         287 ns           335 ns    0.86 (- 48 / thr: 48)
     300         432 ns           474 ns    0.91 (- 42 / thr: 42)
     400         579 ns           605 ns    0.96 (- 26 / thr: 26)
^C
$ ./a.out -c 2
...
Workload Serial Elapsed Parallel Elapsed Speedup
-------- -------------- ---------------- -------
     100         142 ns           106 ns    1.34
     200         289 ns           169 ns    1.71
     300         432 ns           237 ns    1.82
     400         579 ns           308 ns    1.88
^C
$ ./a.out -c 3
...
Workload Serial Elapsed Parallel Elapsed Speedup
-------- -------------- ---------------- -------
     100         142 ns            87 ns    1.63
     200         287 ns           128 ns    2.24
     300         435 ns           174 ns    2.50
     400         577 ns           218 ns    2.65
^C
$ ./a.out -c 4
...
Workload Serial Elapsed Parallel Elapsed Speedup
-------- -------------- ---------------- -------
     100         142 ns            70 ns    2.03
     200         289 ns            99 ns    2.92
     300         432 ns           133 ns    3.25
     400         580 ns           174 ns    3.33
^C
```

In the single-threaded run, the annotations on the right show the scheduling overhead: the parallel elapsed time minus the serial elapsed time, in total and per thread (for example, 192 ns - 142 ns = 50 ns at workload 100).

I also created a gnuplot script to graph the speedup of this synthetic workload. The graph below is the result of running it.

```console
$ gnuplot
Terminal type set to 'qt'
gnuplot> load "plot.gnuplot"
1-core   2-core   3-core   4-core
5-core   6-core   7-core   8-core
9-core   10-core  11-core  12-core
13-core  14-core  15-core  16-core
```

![Job Stealing Queue](jsq.svg)
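To show how the two ends of the deque are used together, here is a hedged sketch of a worker loop built on the `WorkStealQueue` sketch above. The random victim selection and the spin behavior are illustrative assumptions; the repository's `test_job.cpp` may organize its workers differently.

```cpp
#include <atomic>
#include <random>
#include <vector>
/* assumes the WorkStealQueue / Job sketch from the earlier example */

void worker_loop( WorkStealQueue &mine,
                  std::vector<WorkStealQueue *> &victims,
                  std::atomic<bool> &running ) {
  std::mt19937 rng( std::random_device{}() );
  while ( running.load( std::memory_order_relaxed ) ) {
    Job *job = mine.pop();                 /* drain the local queue first */
    if ( job == nullptr && ! victims.empty() ) {
      /* nothing local: steal from a randomly chosen victim (assumption) */
      job = victims[ rng() % victims.size() ]->steal();
    }
    if ( job != nullptr )
      job->func( job->arg );               /* run the job */
    /* else: a real worker would back off or wait here instead of spinning */
  }
}
```

Draining the local queue before stealing keeps the hot path free of cross-core traffic; only idle workers touch another thread's `top`, which is what keeps the measured overhead near the 100 ns figure quoted above.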