By now, unless you have been living under a rock, you have probably heard about Golang. Its claim to fame lies in its ability to handle concurrency through goroutines, which act as lightweight threads. Goroutines are highly efficient, far cheaper than kernel threads, and can be managed in the millions, thanks to Go’s unique design and scheduler.
But how does Go manage this impressive feat? How does it enable such efficient concurrency?
Let’s explore the inner workings of the Go Scheduler to uncover the magic behind its capabilities.
We are going to start with a few crucial building blocks whose existence and role we should understand.
System calls act as the vital communication channel between user-level programs and the heart of the operating system, the Kernel. Think of them as the carefully crafted interfaces that allow applications to request essential services and resources from the underlying operating system.
It’s similar to how web services present a user-friendly REST API, enabling seamless interaction between users and the service.
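To make this concrete, here is a minimal sketch showing that even an ordinary file read in Go is ultimately served by system calls (the file path is just an illustrative placeholder):

package main

import (
    "fmt"
    "os"
)

func main() {
    // os.ReadFile looks like a normal function call, but under the hood
    // it requests kernel services via open/read/close system calls.
    data, err := os.ReadFile("/etc/hostname") // illustrative path
    if err != nil {
        fmt.Println("system call failed:", err)
        return
    }
    fmt.Printf("read %d bytes through system calls\n", len(data))
}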
Threads, a fundamental concept in computer science, serve as the backbone of multitasking and parallel execution within modern computing systems.
Imagine your computer’s processor as a powerhouse with multiple cores, each functioning as an independent processing unit capable of executing tasks autonomously.
Threads, in essence, provide an abstraction layer above these cores. Through this abstraction, we can create/delete threads within our program, enabling us to execute specific code on each thread. These threads, once created, are then managed and scheduled by the operating system’s Kernel to run on individual cores.
Threads are essential in multitasking because they prevent chaos. In a computer, many programs run at once, all wanting to work on different tasks. If programs used the cores directly, they could monopolize resources or deadlock the system. Instead, threads are created through system calls, allowing programs to share the hardware without causing system-wide issues.
A goroutine in Go is a lightweight, concurrent unit of execution.
It’s similar to a thread in concept but more efficient.
Goroutines allow programs to perform tasks concurrently, meaning they can run independently alongside other goroutines, enabling efficient use of resources.
Goroutines are not directly mapped to operating system threads, the Go runtime manages a pool of operating system threads to execute goroutines. These threads are responsible for running the goroutines, providing the necessary parallelism and concurrency for Go programs.
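As a minimal sketch, launching goroutines is a one-keyword affair; note that none of these lines creates a kernel thread directly, the runtime multiplexes them onto its thread pool:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(id int) {
            // Each goroutine is a lightweight unit of execution managed
            // by the Go runtime, not a dedicated kernel thread.
            defer wg.Done()
            fmt.Println("hello from goroutine", id)
        }(i)
    }
    wg.Wait() // block until all goroutines finish
}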
How the whole thing works
In Go’s M:N scheduler, M goroutines are efficiently mapped to N kernel threads.
So, users can create as many goroutines as they want, while the runtime keeps control over a finite set of threads and schedules each goroutine onto a thread for a certain period of time.
The M:N Go Scheduler provides flexibility as millions of goroutines can be intelligently managed and run on a smaller set of kernel threads, optimizing system resources and allowing for high levels of concurrency. This flexibility and efficient utilization of threads are key features that contribute to Go’s ability to handle massive numbers of concurrent tasks efficiently.
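You can observe the M:N ratio yourself with a small sketch like the one below, which launches far more goroutines than there are threads allowed to run Go code in parallel:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100000; i++ { // M goroutines...
        wg.Add(1)
        go func() {
            defer wg.Done()
            runtime.Gosched() // yield so other goroutines get a turn
        }()
    }
    // ...multiplexed onto at most N threads executing Go code in parallel.
    fmt.Println("goroutines:", runtime.NumGoroutine())
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    wg.Wait()
}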
Concurrency is when two or more tasks can start, run, and complete in overlapping time periods. It doesn’t necessarily mean they’ll ever both be running at the same instant. Check the picture below.
Example:
One Core System
Parallelism is when tasks literally run at the same time, e.g., on a multicore processor.
Example:
Two Core System
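A minimal sketch of the difference: limiting the scheduler to one processor forces the two goroutines to interleave on one core (concurrency), while removing that limit on a multicore machine allows them to run at the same instant (parallelism):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    // One processor: goroutines A and B are concurrent but never parallel,
    // taking turns on a single kernel thread. Delete this line on a
    // multicore machine to allow true parallelism.
    runtime.GOMAXPROCS(1)

    var wg sync.WaitGroup
    for _, name := range []string{"A", "B"} {
        wg.Add(1)
        go func(n string) {
            defer wg.Done()
            for i := 0; i < 3; i++ {
                fmt.Println(n, i)
            }
        }(name)
    }
    wg.Wait()
}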
A goroutine can be in one of three states: Waiting, Runnable, or Running.
Running: This means the Goroutine has been placed on a Thread and is executing its instructions.
Runnable: A goroutine is runnable when it is ready to be executed and is waiting for a kernel thread to pick it up.
Waiting: This means the Goroutine has stopped and is waiting for an event to happen, typically the completion of a system call or a synchronization operation (channel, mutex, or atomic).
Traditional operating systems use preemptive schedulers, meaning the system can interrupt tasks and switch to another task without the program’s explicit consent. The kernel is making decisions – you can’t predict what the scheduler is going to do at any given time.
Cooperative Preemption
Go has historically used cooperative preemption with safe-points only at function calls. This meant that Go could only switch between concurrently-executing goroutines at specific points (switching G1 off and G2 on, as in the diagram above).
From an execution point of view, a goroutine could be given processor time only at these specific events (safe-points), which are function calls.
Real Life Example:
Imagine you have a list of tasks to do, like cooking, cleaning, and shopping. In the older versions of Go, these tasks (goroutines) could only switch when the person doing them took a break.
In our analogy, these breaks are like “break times” during work. When you’re cooking, you might take a break after chopping vegetables.
Similarly, in Go, tasks (goroutines) could only switch when they were between tasks, like when one goroutine finished a task/function (like chopping vegetables) and was about to start another.
There were problems with this approach: as the Go Runtime evolved, situations appeared where a “break” could never occur. What happens if you cannot take a break from cleaning while the dish is on fire?
Example: If you run any tight loops that are not making function calls, you will cause latencies within the Go Scheduler and garbage collection.
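A sketch of the problematic pattern: before Go 1.14 (which introduced non-cooperative preemption), a loop like this offered the scheduler no safe-point at which to take the goroutine off its thread:

package main

import "fmt"

func main() {
    done := make(chan int)

    go func() {
        sum := 0
        for i := 0; i < 1000000000; i++ { // tight loop, no function calls:
            sum += i // no safe-point for the cooperative scheduler
        }
        done <- sum
    }()

    fmt.Println("result:", <-done)
}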
You can read more in the original Go design documents on scheduling and preemption.
Non-Cooperative Preemption
Go introduced non-cooperative preemption because of the problems mentioned above.
In non-cooperative preemption, the Go runtime can forcibly pause a running goroutine even if it doesn’t explicitly yield control. This preemptive behavior ensures that no single goroutine can monopolize the CPU for an extended period.
Benefits of Non-Cooperative Preemption: no single goroutine can monopolize a processor, tight loops without function calls no longer stall the scheduler and garbage collection, and scheduling latencies become more predictable.
There are two different run queues in the Go Scheduler: the Global Run Queue (GRQ) and the Local Run Queue (LRQ).
When a goroutine begins, it joins a common waiting area called the Global Run Queue.
Later on, a system component called the processor (analysed below) assigns it to a specific, smaller waiting area called the Local Run Queue.
Each processing thread primarily deals with its own local queue, choosing a goroutine from there to execute. This approach efficiently distributes tasks among threads, ensuring that each thread manages its own queue independently.
Because each local queue is handled by just one thread, there’s no need for complex locking mechanisms like mutexes, simplifying the process and improving performance.
Both queues follow the FIFO (First In, First Out) principle.
It’s all good for now, but what will happen if Go makes a system call that might take a long time (like reading from a file)?
During this time, the thread handling that goroutine has no new tasks and is essentially inactive because it waits (blocked).
This can lead to a problem: the goroutines waiting in that thread’s local run queue are starved even though there is work to do.
To solve this issue, we need a way for threads to continue working on other tasks while waiting for system calls to complete.
To solve this problem, we add another layer of abstraction, called the “processor”.
“The Processor” entity is in charge of managing the interaction between the local run queues, the global run queue, and the kernel threads.
What the Go Scheduler looks like with processors
Processor States: a processor can be idle (nothing to run), running (executing user code on a thread), in a system call, stopped for garbage collection, or dead (unused after GOMAXPROCS was lowered).
The number of processors equals GOMAXPROCS, which by default is the number of logical CPUs available to the program.
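A quick sketch to inspect these numbers at runtime; calling GOMAXPROCS with 0 queries the current value without changing it:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("logical CPUs:", runtime.NumCPU())
    // GOMAXPROCS(0) reports the current number of processors (P)
    // without modifying it; by default it equals runtime.NumCPU().
    fmt.Println("processors:", runtime.GOMAXPROCS(0))
}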
We have talked about the higher-level concepts, but how does our application code affect the Go Scheduler, and when does it have to make a decision?
There are a couple of “events” that occur in your Go programs that allow the Go Scheduler to make scheduling decisions (put a goroutine in a queue, give a goroutine processor time, take processor time from a goroutine, attach/detach a thread, etc.): the use of the go keyword, garbage collection, system calls, and synchronization and orchestration operations (channels, mutexes, atomics).
This does not mean it will always happen on one of these events. It means the Go Scheduler gets the opportunity.
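A minimal sketch that touches three of these event classes in a few lines (the file path is only an illustrative placeholder):

package main

import (
    "fmt"
    "os"
)

func worker(done chan<- struct{}) {
    fmt.Println("working")
    close(done)
}

func main() {
    done := make(chan struct{})

    go worker(done) // the go keyword: a scheduling opportunity

    data, _ := os.ReadFile("/etc/hostname") // a system call: another opportunity
    fmt.Println("read", len(data), "bytes")

    <-done // channel synchronization: blocking parks this goroutine (Waiting)
}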
We have seen how goroutines end up running on the kernel threads, but what happens when a thread blocks itself?
For instance, if a goroutine does a system call, it will block the kernel thread causing the other goroutines in the local run queue to be starved.
When this happens, the Go Runtime will detach “the processor” from the blocked thread and hand it to an existing idle thread, or create a new thread if none is available.
The Go Runtime keeps already-created threads idle when they have no work, as parking and reusing a thread is more cost-efficient than creating and deleting one. The number of threads depends on the load (goroutines and cores), so it is impossible to say in general how many it will keep.
Switching tasks (or handoffs) can be costly, especially when creating new threads. For system calls, some are quick, and threads don’t stay blocked for long. To optimize, Go’s runtime handles this intelligently.
All I/O must be done through syscalls.
The syscalls in Go are always called through code that is controlled by the runtime.
This means that when you call a syscall, instead of just calling it directly (thus giving up control of the thread to the kernel), the runtime is notified of the syscall you want to make, and it does it on the goroutine’s behalf.
This allows it to let the goroutine know later once the result is ready.
In other words, the Go Scheduler keeps control of the thread instead of handing it over to the kernel for the duration of the call.
Operating systems often use signals to indicate when system calls are completed. When the operating system signals that a system call is done, the Go runtime can respond accordingly.
When a goroutine returns from a handed-off syscall, the Go Scheduler will check: whether the processor it previously ran on is free to take it back; if not, whether any other processor is idle; and if neither is available, the goroutine is placed on the global run queue and its thread goes idle.
The network poller is a component that acts like a traffic director, especially helpful when dealing with asynchronous system calls such as network I/O.
How it works:
Imagine you have a busy street (your computer) with various vehicles (goroutines) moving around. Now, sometimes, a vehicle (goroutine) needs to wait for something to happen, like a green signal at a junction (network request completion). In regular scenarios, this waiting could create a jam (blocking the system).
Here’s where the network poller steps in:
If a goroutine in the network poller has finished its network I/O wait and become runnable, a processor will add it to its local run queue.
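Here is a sketch of the scenario the network poller handles well; while each request is in flight, its goroutine sits parked in the poller and the kernel thread stays free for other work (the URLs are just placeholders):

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    urls := []string{"https://example.com", "https://example.org"}

    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            // While waiting for the response, this goroutine is parked in
            // the network poller instead of blocking a kernel thread.
            resp, err := http.Get(url)
            if err != nil {
                fmt.Println(url, "error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(url, resp.Status)
        }(u)
    }
    wg.Wait()
}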
Work stealing is a technique used to balance the load between the processors, and in turn between kernel threads.
The order of work is as follows: a periodic fairness check of the global run queue, then the local run queue, then the global run queue, then the network poller, and finally stealing from another processor.
If the local run queue, global run queue, and network poller are all empty, a processor has to steal work.
Say Processor 2 is idle: it checks its LRQ, the GRQ, and the network poller, and finds them all empty.
It will then randomly pick another processor and steal half of its work.
In the diagram above, it took half of the work from Processor 1.
Reference to the order of work stealing execution – runtime/proc.go.
To prove that the order is as described above, check the code snippet below, taken from the Go Scheduler (runtime/proc.go).
// Check the global runnable queue once in a while to ensure fairness.
// Otherwise two goroutines can completely occupy the local runqueue
// by constantly respawning each other.
if pp.schedtick%61 == 0 && sched.runqsize > 0 {
    lock(&sched.lock)
    gp := globrunqget(pp, 1)
    unlock(&sched.lock)
    if gp != nil {
        return gp, false, false
    }
}

// local runq
if gp, inheritTime := runqget(pp); gp != nil {
    return gp, inheritTime, false
}

// global runq
if sched.runqsize != 0 {
    lock(&sched.lock)
    gp := globrunqget(pp, 0)
    unlock(&sched.lock)
    if gp != nil {
        return gp, false, false
    }
}

// Poll network.
// This netpoll is only an optimization before we resort to stealing.
// We can safely skip it if there are no waiters or a thread is blocked
// in netpoll already. If there is any kind of logical race with that
// blocked thread (e.g. it has already returned from netpoll, but does
// not set lastpoll yet), this thread will do blocking netpoll below
// anyway.
if netpollinited() && netpollWaiters.Load() > 0 && sched.lastpoll.Load() != 0 {
    if list := netpoll(0); !list.empty() { // non-blocking
        gp := list.pop()
        injectglist(&list)
        casgstatus(gp, _Gwaiting, _Grunnable)
        if traceEnabled() {
            traceGoUnpark(gp, 0)
        }
        return gp, false, false
    }
}

// Spinning Ms: steal work from other Ps.
//
// Limit the number of spinning Ms to half the number of busy Ps.
// This is necessary to prevent excessive CPU consumption when
// GOMAXPROCS>>1 but the program parallelism is low.
if mp.spinning || 2*sched.nmspinning.Load() < gomaxprocs-sched.npidle.Load() {
    if !mp.spinning {
        mp.becomeSpinning()
    }

    gp, inheritTime, tnow, w, newWork := stealWork(now)
    if gp != nil {
        // Successfully stole.
        return gp, inheritTime, false
    }
    if newWork {
        // There may be new timer or GC work; restart to
        // discover.
        goto top
    }

    now = tnow
    if w != 0 && (pollUntil == 0 || w < pollUntil) {
        // Earlier timer to wait for.
        pollUntil = w
    }
}
What happens when there is a long-running goroutine?
It could potentially cause the other goroutines in its Local Run Queue to starve.
Take for example two cashiers (threads):
Both cashiers have very demanding first clients, which slows down the whole system (two pipelines, two threads).
That’s why we should do something: put the “long-running clients” aside so we can optimize the pipelines.
With that optimization we can quickly serve the other clients, who should not starve. Of course, the second client in line could also turn out to be “long-running”, in which case it meets the same fate.
The Go runtime dedicates a thread, called sysmon, to watch over your application and help with bottlenecks your program could face.
The name stands for system monitor, and it is not linked to any Processor, Thread, or Goroutine in our model.
The Go Scheduler does not account for it either. It is always running, but it is smart enough not to consume resources when there is nothing to do. Its cycle time is dynamic and depends on the current activity of the running program.
Any goroutine running for more than 10ms is marked as preemptible (a soft limit). The Go Scheduler can then decide whether it should be taken off its thread and returned to a run queue (in practice, the global run queue) or keep running. In the cashier example above, both time-consuming first clients (goroutines) would be preempted and queued again, making room for the less time-consuming clients.
That’s how the Go Scheduler tackles long-running goroutines.
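A small sketch that shows this in action: on Go 1.14+ the busy loop below gets preempted, so main still receives processor time even with a single P (under purely cooperative preemption this program could hang forever):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // a single processor shared by all goroutines

    go func() {
        for { // CPU-bound loop with no function calls
        }
    }()

    // sysmon marks the busy goroutine preemptible after ~10ms, letting
    // the scheduler take the processor back so main can continue.
    time.Sleep(50 * time.Millisecond)
    fmt.Println("main still runs despite the busy loop")
}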
The Go scheduler is a critical component of Go’s concurrency model. It efficiently manages goroutines, ensuring fair execution and optimal resource usage. By combining non-cooperative preemption, local and global run queues, the network poller, and work stealing, it enables millions of concurrent tasks to run efficiently, making Go a powerhouse for concurrent programming.