
shyaway 2018. 8. 30. 23:44

ThreadPool in details.


I'd like to write a post about Task, but first things first: I'm going to talk about the thread pool, since it is the underlying mechanism behind tasks. I already put up a post covering the basics of the thread pool last time, so this time I'm going to look at the details.



ThreadPool has two types of threads.

There's a thread pool API, ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads), and it returns the minimum number of threads for the two types respectively (a quick sketch follows the list below). What exactly are they?
    • WorkerThread
      This covers pretty much every thread except the main thread. The work the thread pool offloads to it is typically a delegate, which means any job in your code.

    • CompletionPortThread
      Also called an I/O thread. The .NET thread pool maintains these threads separately to dispatch NativeOverlapped callbacks.
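
As a quick illustration, here is a minimal sketch that reads both thread types through the public API. Nothing here depends on the pool's internals; the numbers you get depend on your machine and runtime version.

```csharp
using System;
using System.Threading;

class ThreadTypesDemo
{
    static void Main()
    {
        // Minimum threads the pool keeps available before it starts
        // throttling thread creation (more on that below).
        ThreadPool.GetMinThreads(out int minWorkers, out int minIo);

        // Upper bounds the pool will never exceed.
        ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIo);

        Console.WriteLine($"Worker threads:          min={minWorkers}, max={maxWorkers}");
        Console.WriteLine($"Completion port threads: min={minIo}, max={maxIo}");
    }
}
```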


First of all, these two kinds of threads are technically just normal threads. The reason the thread pool logically splits them into two types and maintains them separately is to avoid deadlock. Imagine there are too many tasks in the queue, and the thread pool, trying to consume the waiting tasks faster, offloads work to every thread it has, ending up assigning jobs to the threads meant for native I/O callbacks. Now there is no thread reserved for dispatching I/O completion callbacks, which directly leads to I/O resources being held for a long period and eventually causes deadlock.


    • NativeOverlapped Callback
      NativeOverlapped is a struct, and it represents an overlapped Win32 call. By Win32 call, I mean the Windows APIs in user32.dll or kernel32.dll that can be invoked with P/Invoke from managed code (a minimal sketch follows below).
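
To make that less abstract, here is a minimal sketch of my own (not anything from the runtime's source) that packs a managed callback into a NativeOverlapped* and asks the pool to dispatch it. It must be compiled with unsafe code enabled, and exactly which thread runs the callback is a platform-specific detail.

```csharp
using System;
using System.Threading;

class NativeOverlappedDemo
{
    static readonly ManualResetEvent Done = new ManualResetEvent(false);

    static unsafe void Main()
    {
        // Pack a managed IOCompletionCallback into a NativeOverlapped*.
        var overlapped = new Overlapped();
        NativeOverlapped* pOverlapped = overlapped.Pack(Callback, null);

        // Hand the callback to the pool's I/O side rather than its workers.
        ThreadPool.UnsafeQueueNativeOverlapped(pOverlapped);
        Done.WaitOne();
    }

    static unsafe void Callback(uint errorCode, uint numBytes, NativeOverlapped* pOv)
    {
        Console.WriteLine($"On a pool thread: {Thread.CurrentThread.IsThreadPoolThread}");
        Overlapped.Free(pOv); // release the packed native memory
        Done.Set();
    }
}
```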


If you study the stack diagram below for a bit, you may see the point of CompletionPortThread. I recommend reading There Is No Thread by Stephen Cleary, which explains briefly but precisely how the code on your thread ends up interacting with your devices. The yellow Overlapped box is the exact spot for the NativeOverlapped callback.

(Figure: the user-mode / kernel-mode stack diagram from "There Is No Thread", with the OVERLAPPED structure highlighted.)


Algorithm in ThreadPool

ThreadPool manages the life cycle of its threads. It adds a new thread or removes one on demand, improving throughput by offloading the jobs in its queue to available threads. No doubt a group of brilliant developers put a lot of effort into making the best thread pool they could, thinking through the variety of scenarios where a thread pool might perform poorly. But then why are there still performance issues around the thread pool, like the one I ran into before? I already posted an article about that slow-performance issue, but it was only a detailed yet half-proven guess. So I really wanted to know the inside of the thread pool, and this post is the result.

    • Starvation
      In concurrent computing, starvation means a process is perpetually denied the resources it needs to get its job done. An intentional DoS attack that causes starvation is a good example. If demanding tasks are offloaded to all of the worker threads in the thread pool while the queue keeps piling up, there will be plenty of starved items in the thread pool's queue. To deal with this, the thread pool has a mechanism to add a new thread for the tasks waiting in its queue.

    • Hill climbing heuristic
      "Heuristic" comes from a word meaning "find / discover"; more precisely, it's a method that actively and adaptively searches out a better solution to a problem on its own. Hill climbing is one such algorithm: an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by making an incremental change to it. If the change produces a better solution, another incremental change is made to the new one, and so on until no further improvement can be found. This is what goes on behind the scenes in the thread pool.

      This algorithm exists to utilize the CPU cores effectively when threads are blocked by I/O or sitting in a wait state. Every time a thread finishes its job, or at a fixed 500 ms interval, the pool decides on its own whether to add a new thread. When it thinks another thread would help, one joins the forces; otherwise it scales back until it reaches the minimum number of threads, which you can read with the ThreadPool.GetMinThreads API. This adaptive technique is what's called the hill climbing heuristic (see the sketch right after this list).
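
You can watch both the starvation and the thread injection from user code. In this sketch (my own experiment; the exact numbers will vary by machine), every queued item blocks its thread, so once the minimum worker count is exhausted, each remaining item has to wait for the hill climbing logic to inject a new thread, roughly one every 500 ms:

```csharp
using System;
using System.Threading;

class StarvationDemo
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int minWorkers, out _);
        Console.WriteLine($"Minimum worker threads: {minWorkers}");

        // Queue more blocking items than the minimum number of workers.
        for (int i = 0; i < minWorkers + 4; i++)
        {
            int id = i;
            ThreadPool.QueueUserWorkItem(state =>
            {
                Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} item {id} started");
                Thread.Sleep(10_000); // simulate a blocked thread
            });
        }

        // The last four items start noticeably later than the first batch,
        // spaced out by the pool's thread-injection throttle.
        Thread.Sleep(15_000);
    }
}
```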




Forget about the algorithm. What about having a lot of threads in the pool?

Well, if creating a new thread makes the pool consume the work items in its queue faster, wouldn't having a lot of threads from the start get rid of the need for the hill climbing heuristic? The answer is no.

    • The more threads you have, the more context switching occurs. Context switching costs CPU time, and with a large number of threads that overhead can have a significant impact on the system.

    • The more threads you have, the more stacks have to be created, and this hurts data locality. If the CPU has to juggle a separate stack for every thread, its caches become far less effective.


Holding more threads than the number of logical processors has limited benefit. Its only upside is keeping the CPU busy by offloading work items to available threads while some other threads are blocked and unavailable, so we should try not to overreact to blocked threads.


To put it in a nutshell, the hill climbing heuristic takes care of this kind of scenario as well, so unless you can come up with a better, breakthrough solution, accepting it as it is might be good for your mental health. FYI, there's a paper on the thread pool's design and implementation. ( It was published by three Microsoft employees, though it looks like one of them is working for Google now. )
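
That said, if you already know your workload will block many threads at once, the one knob the public API does expose is the minimum thread count, which lets the pool create that many threads without the hill climbing delay. A hedged sketch; the number 64 is purely illustrative, not a recommendation:

```csharp
using System;
using System.Threading;

class RaiseMinimum
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        Console.WriteLine($"Before: worker={worker}, io={io}");

        // Let up to 64 worker threads be created on demand without
        // waiting ~500 ms per thread for the injection throttle.
        ThreadPool.SetMinThreads(64, io);

        ThreadPool.GetMinThreads(out worker, out io);
        Console.WriteLine($"After:  worker={worker}, io={io}");
    }
}
```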





ThreadPool in the past.

Before the introduction of the TPL ( Task Parallel Library ) in .NET 4.0, what did the thread pool look like? In early versions of the CLR, a work item queued with ThreadPool.QueueUserWorkItem went into a linked list rather than a queue, and mutual exclusion was handled by acquiring a lock with the Monitor class. This meant that every call to the thread pool API allocated a fresh node and added it to the linked list, which is far more expensive than enqueueing an item into a queue, and it also meant the GC had to traverse all of the items in the linked list. Acquiring the lock on every enqueue and dequeue was additional overhead on top of that.
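
A minimal sketch of that old shape, just to make the costs visible ( my own simplification, not the CLR's actual source ):

```csharp
using System.Collections.Generic;
using System.Threading;

class LegacyStyleWorkQueue
{
    private readonly LinkedList<WaitCallback> _items = new LinkedList<WaitCallback>();
    private readonly object _lock = new object();

    public void Enqueue(WaitCallback work)
    {
        lock (_lock)                  // Monitor.Enter / Monitor.Exit underneath
        {
            _items.AddLast(work);     // a fresh node allocation per item
        }
    }

    public WaitCallback Dequeue()
    {
        lock (_lock)                  // the same single lock for every consumer
        {
            if (_items.Count == 0) return null;
            WaitCallback work = _items.First.Value;
            _items.RemoveFirst();
            return work;
        }
    }
}
```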

This approach wasn't as painful back then as it may look now. A heavy unit of work was what the thread pool was designed for; take an ASP.NET web application as the example. A task in such an application meant generating an entire web page. That kind of work took long enough that the CPU overhead of acquiring locks and allocating nodes could be ignored.

However, the concurrency requirements of applications kept increasing as new hardware ushered in the era of multiple CPU cores. Where the thread pool's purpose used to be taking a large chunk of work and offloading it to a few threads, it is now taking independent, fine-grained, lightweight tasks that might run for less than a second. That's why the thread pool was improved for the .NET 4.0 release.




Faster FIFO

Because of the CPU overhead of taking a lock on every enqueue and dequeue, and of traversing the whole linked list, as mentioned above, the thread pool had to minimize synchronization overhead by using a data structure that doesn't take locks. Moreover, the thread pool needed a GC-friendly queue: developers still allocate an object ( a delegate ) per work item, but it is now much smaller than before and easily traceable, so the GC doesn't have to suffer through searching an entire data structure. The queue in the thread pool is essentially identical to System.Collections.Concurrent.ConcurrentQueue&lt;T&gt;, introduced in .NET 4.0.
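
For contrast with the linked-list sketch earlier, the same enqueue/dequeue shape on top of ConcurrentQueue&lt;T&gt; takes no lock on either side:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class LockFreeQueueDemo
{
    static void Main()
    {
        var queue = new ConcurrentQueue<WaitCallback>();

        // Enqueue work items; no Monitor lock is involved.
        queue.Enqueue(state => Console.WriteLine("work item 1"));
        queue.Enqueue(state => Console.WriteLine("work item 2"));

        // Dequeue and run them; TryDequeue synchronizes without
        // taking a lock.
        while (queue.TryDequeue(out WaitCallback work))
        {
            work(null);
        }
    }
}
```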

Improving QueueUserWorkItem is good in that it makes existing applications using the thread pool run faster without changing any code. This performance improvement doesn't shine much in applications that still push heavy workloads through the thread pool, but it is significantly beneficial for applications that run on multi-core hardware and execute small, fine-grained work items.

But we still can't find out much about the work we hand over through the ThreadPool.QueueUserWorkItem API, which is just fire-and-forget, so there is still something left we want to improve.
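
Concretely, the limitation looks like this ( a sketch of my own ):

```csharp
using System;
using System.Threading;

class FireAndForget
{
    static void Main()
    {
        ThreadPool.QueueUserWorkItem(state =>
        {
            int answer = 6 * 7; // some computed result
            Console.WriteLine($"Computed {answer} on a pool thread.");
            // There is no built-in way to return `answer` to the caller,
            // to be notified of completion, or to observe an exception
            // thrown from inside this delegate.
        });

        // We can't even wait on the item properly; just keep the process alive.
        Thread.Sleep(1000);
    }
}
```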



Introduction of Task Parallel Library

The TPL is what the .NET thread pool team added to break through the limitations of that API. Now it's finally time to write about the TPL: my next post will cover its basics and details.