A Rare CancelIoEx Hang in Go on Windows

By | August 10, 2025

I don’t consider myself a Go expert and have only occasionally used this language, but I’d like to share a story about a bug at the intersection of Go and the Windows kernel that I was “lucky” enough to encounter.

This bug is still present (GitHub issue #64482), although there’s reason to hope it will be fixed in the next Go release.

Nevertheless, if the stars align unfavorably and your Go program suddenly hangs on a client machine during a CancelIoEx call — and you can’t reproduce or analyze the problem — I hope the material below will help you understand its cause and find a possible workaround.

How the Problem Manifested

In a real-world service, one of the threads on Windows became completely stuck in a CancelIoEx call.

Microsoft’s documentation claims that CancelIoEx does not wait for canceled operations to complete:

The CancelIoEx function does not wait for all canceled operations to complete.

However, in practice, the thread remained waiting indefinitely.

The problem turned out to be extremely difficult to reproduce. Proper analysis required a full memory dump, but capturing one without a reliable reproduction scenario was nearly impossible. This pushed me to look for other anomalies related to CancelIoEx, and eventually I stumbled upon mysterious failures in Go’s TestPipeIOCloseRace, an issue that had been quietly sitting in the backlog for two years.

To make diagnostics easier, I wrote a small tool based on this test.

Hangtest: Reproducing a Rare Bug

hangtest is a small application that launches several workers, each of which repeatedly creates a pipe, writes to it, reads from it, and closes it. Essentially, it is TestPipeIOCloseRace run on a massive scale.

Here’s the full code so you can reproduce the issue yourself:

package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"runtime"
	"strings"
	"sync"
	"time"
)

const (
	iterationsPerWorker = 100000
	numWorkers          = 10 // 10 concurrent workers, 1,000,000 iterations in total
)

func main() {
	if runtime.GOOS == "js" || runtime.GOOS == "wasip1" {
		fmt.Printf("skipping on %s: no pipes\n", runtime.GOOS)
		return
	}

	var wg sync.WaitGroup
	wg.Add(numWorkers)

	for i := 0; i < numWorkers; i++ {
		go func(workerID int) {
			defer wg.Done()
			for j := 0; j < iterationsPerWorker; j++ {
				if err := runPipeIOCloseRace(); err != nil {
					fmt.Printf("worker %d, iteration %d: %v\n", workerID, j, err)
				}
			}
		}(i)
	}

	wg.Wait()
}

func runPipeIOCloseRace() error {
	r, w, err := Pipe()
	if err != nil {
		return fmt.Errorf("pipe creation failed: %w", err)
	}

	var wg sync.WaitGroup
	wg.Add(3)

	var errOnce sync.Once
	var firstErr error

	fail := func(e error) {
		errOnce.Do(func() {
			firstErr = e
		})
	}

	go func() {
		defer wg.Done()
		for {
			n, err := w.Write([]byte("hi"))
			if err != nil {
				switch {
				case errors.Is(err, ErrClosed),
					strings.Contains(err.Error(), "broken pipe"),
					strings.Contains(err.Error(), "pipe is being closed"),
					strings.Contains(err.Error(), "hungup channel"):
					// Ignore expected errors
				default:
					fail(fmt.Errorf("write error: %w", err))
				}
				return
			}
			if n != 2 {
				fail(fmt.Errorf("wrote %d bytes, expected 2", n))
				return
			}
		}
	}()

	go func() {
		defer wg.Done()
		var buf [2]byte
		for {
			n, err := r.Read(buf[:])
			if err != nil {
				if err != io.EOF && !errors.Is(err, ErrClosed) {
					fail(fmt.Errorf("read error: %w", err))
				}
				return
			}
			if n != 2 {
				fail(fmt.Errorf("read %d bytes, want 2", n))
			}
		}
	}()

	go func() {
		defer wg.Done()
		time.Sleep(time.Millisecond)
		if err := r.Close(); err != nil {
			fail(fmt.Errorf("close reader: %w", err))
		}
		if err := w.Close(); err != nil {
			fail(fmt.Errorf("close writer: %w", err))
		}
	}()

	wg.Wait()
	return firstErr
}

// --------- Pipe Implementation (simple wrapper using os.Pipe) --------- //

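// ErrClosed aliases os.ErrClosed: reads and writes on a closed *os.File return an
// error wrapping os.ErrClosed, so a freshly created sentinel would never match.
var ErrClosed = os.ErrClosed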

func Pipe() (*pipeReader, *pipeWriter, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return nil, nil, err
	}
	return &pipeReader{r}, &pipeWriter{w}, nil
}

type pipeReader struct {
	*os.File
}

type pipeWriter struct {
	*os.File
}

Even with this tool, the problem did not reproduce right away. On my work laptop with a 12th-gen Intel processor, the hang never occurred. I first caught it on Windows 11 ARM64, and later reproduced it on a virtual x64 machine, which finally allowed us to collect a dump and analyze it together with Microsoft support.

Cause of the Hang

The joint dump analysis with Microsoft support showed the following:

  • Thread #1 was performing a synchronous ReadFile / WriteFile on a pipe.
  • The same thread had also issued the I/O that another thread was trying to cancel.
  • Thread #2 called CancelIoEx, which queued an APC to thread #1 to cancel that I/O. The APC could not be delivered, because while a thread waits in synchronous I/O the Windows kernel disables delivery of normal kernel APCs to it.
  • As a result, thread #2 also blocked, waiting for an APC that would never be delivered.

In Go, the scheduler, not the program, decides which OS thread runs a given goroutine. As a result, a thread that earlier issued asynchronous I/O (for example, on a socket) may later be reused for synchronous I/O on a pipe. Cancelling that earlier I/O with CancelIoEx requires an APC to run on the issuing thread. That cannot happen while the thread is stuck in synchronous I/O, so the caller of CancelIoEx waits forever.

Microsoft Kernel Team Analysis (WinDbg Output)

Comment from Microsoft on hangtest:

In hangtest.exe, thread (ffffa2029d6c3080) is performing a synchronous WriteFile on a Named Pipe. Because this is synchronous I/O, normal kernel APC delivery is disabled. Another thread (ffffa2029f0d7080) calls CancelIoEx() for the same FileObject. It queues an APC for IRP cancellation and waits for completion. However, the APC cannot be delivered because thread 0xffffa2029d6c3080 is in synchronous wait. As noted, it’s critical not to mix synchronous and asynchronous I/O. To cancel synchronous operations, use CancelSynchronousIo.

Thread executing WriteFile (synchronous I/O):

Process                         Thread           CID       UserTime KernelTime ContextSwitches Wait Reason      Time State
hangtest.exe (ffffa202a18f0080) ffffa2029d6c3080 2d38.1cd0     63ms      344ms            3380 Executive   9m:35.406 Waiting

Irp List:
    IRP              File   Driver Owning Process
    ffffa202a47f2b00 (null) Npfs   hangtest.exe

# Child-SP         Return           Call Site
0 ffffe881ff1f7e70 fffff8067f02a4d0 nt!KiSwapContext+0x76
1 ffffe881ff1f7fb0 fffff8067f0299ff nt!KiSwapThread+0x500
2 ffffe881ff1f8060 fffff8067f0292a3 nt!KiCommitThreadWait+0x14f
3 ffffe881ff1f8100 fffff8067f1f1744 nt!KeWaitForSingleObject+0x233
4 ffffe881ff1f81f0 fffff8067f406d96 nt!IopWaitForSynchronousIoEvent+0x50
5 (Inline)         ---------------- nt!IopWaitForSynchronousIo+0x23
6 ffffe881ff1f8230 fffff8067f3cfdb5 nt!IopSynchronousServiceTail+0x466
7 ffffe881ff1f82d0 fffff8067f487ec0 nt!IopWriteFile+0x23d
8 ffffe881ff1f83d0 fffff8067f212005 nt!NtWriteFile+0xd0
9 ffffe881ff1f8450 00007fffdb70d5f4 nt!KiSystemServiceCopyEnd+0x25
a 00000055e4fff638 0000000000000000 0x7fffdb70d5f4

Irp Details: ffffa202a47f2b00
    Thread           Process      Frame Count
    ============================= ===========
    ffffa2029d6c3080 hangtest.exe           2

Irp Stack Frame(s)
      # Driver           Major Minor Dispatch Routine Flg Ctrl Status  Device           File                                                            
    === ================ ===== ===== ================ === ==== ======= ================ ================
    ->2 \FileSystem\Npfs WRITE     0 IRP_MJ_WRITE       0    1 Pending ffffa202987985f0 ffffa2029f495670

================================================================
ffffa2029f495670 
Related File Object: 0xffffa2029f494d10 
Device Object: 0xffffa202987985f0   \FileSystem\Npfs
Vpb is NULL

Flags:  0x40082
  Synchronous IO
  Named Pipe
  Handle Created


FsContext: 0xffffc40ddc15a8c0      FsContext2: 0xffffc40ddc6bb291
Private Cache Map: 0x00000001
CurrentByteOffset: 0

Thread executing CancelIoEx:

Process                         Thread           CID       UserTime KernelTime ContextSwitches Wait Reason      Time State
hangtest.exe (ffffa202a18f0080) ffffa2029f0d7080 2d38.241c     31ms      391ms            4151 Executive   9m:35.406 Waiting

# Child-SP         Return           Call Site
0 ffffe881fedd8fe0 fffff8067f02a4d0 nt!KiSwapContext+0x76
1 ffffe881fedd9120 fffff8067f0299ff nt!KiSwapThread+0x500
2 ffffe881fedd91d0 fffff8067f0292a3 nt!KiCommitThreadWait+0x14f
3 ffffe881fedd9270 fffff8067f4a72dd nt!KeWaitForSingleObject+0x233
4 ffffe881fedd9360 fffff8067f5218dc nt!IopCancelIrpsInThreadList+0x125
5 ffffe881fedd93b0 fffff8067f4a70d6 nt!IopCancelIrpsInThreadListForCurrentProcess+0xc4
6 ffffe881fedd9470 fffff8067f212005 nt!NtCancelIoFileEx+0xc6
7 ffffe881fedd94c0 00007fffdb70e724 nt!KiSystemServiceCopyEnd+0x25
8 00000055e45ffc88 0000000000000000 0x7fffdb70e724

This thread has been waiting 9m:35.406 on a kernel component request

===============================================================

0xffffa2029f494d10

Related File Object: 0xffffa2029fb0dd60 
Device Object: 0xffffa202987985f0   \FileSystem\Npfs
Vpb is NULL
Event signalled

Flags:  0x40082
  Synchronous IO
  Named Pipe
  Handle Created

Second Example: Real Case with Sockets and Pipes

In addition to hangtest, we provided Microsoft with a dump from another real-world process where the same mechanism occurred.

In that case, a thread first performed asynchronous I/O on a socket and later got blocked on a synchronous ReadFile for a pipe. Another thread called CancelIoEx() for the AFD endpoint, queuing a normal kernel APC to cancel the IRP and waiting for completion. No progress was made because normal kernel APC delivery was disabled on the thread due to the synchronous I/O wait.

In other words, asynchronous I/O on a socket and synchronous I/O on a pipe overlapped on the same thread. When the socket I/O had to be cancelled via CancelIoEx, the cancellation could not complete, and the thread calling CancelIoEx hung.

Stacks of the blocked threads:

Process                       Thread           CID       UserTime KernelTime ContextSwitches Wait Reason         Time State
redacted_executable_name.exe (ffffaf85feb300c0) ffffaf8602c90080 e88.2b24   57s.703    24s.281          549327 Executive   3h:25:43.718 Waiting

Irp List:
    IRP              File   Driver Owning Process
    ffffaf8600e76640 (null) Npfs   redacted_executable_name.exe

# Child-SP         Return           Call Site
0 ffffd005e29d4570 fffff8025d7071d7 nt!KiSwapContext+0x76
1 ffffd005e29d46b0 fffff8025d706d49 nt!KiSwapThread+0x297
2 ffffd005e29d4770 fffff8025d705ad0 nt!KiCommitThreadWait+0x549
3 ffffd005e29d4810 fffff8025dca1548 nt!KeWaitForSingleObject+0x520
4 (Inline)         ---------------- nt!IopWaitForSynchronousIo+0x3e
5 ffffd005e29d48e0 fffff8025dc9ffc8 nt!IopSynchronousServiceTail+0x258
6 ffffd005e29d4990 fffff8025d877cc5 nt!NtReadFile+0x688
7 ffffd005e29d4a90 00007fff4e820094 nt!KiSystemServiceCopyEnd+0x25
8 000000002de1fbe8 00007fff4a9aa747 ntdll!ZwReadFile+0x14
9 000000002de1fbf0 000000000047337e KERNELBASE!ReadFile+0x77
a 000000002de1fc70 000000000000056c redacted_executable_name+0x7337e
b 000000002de1fc78 000000c000573ec0 0x56c
c 000000002de1fc80 0000000000008000 0xc000573ec0
d 000000002de1fc88 000000c001701d54 0x8000
e 000000002de1fc90 0000000000000000 0xc001701d54

Irp Details: ffffaf8600e76640
    Thread           Process    Frame Count
    =========================== ===========
    ffffaf8602c90080 redacted_executable_name.exe           2
Irp Stack Frame(s)
      # Driver           Major Minor Dispatch Routine Flg Ctrl Status  Device           File            
    === ================ ===== ===== ================ === ==== ======= ================ ================
    ->2 \FileSystem\Npfs READ      0 IRP_MJ_READ        0    1 Pending ffffaf85f7da2c00 ffffaf86107074b0

==============================================================
ffffaf86107074b0 
Related File Object: 0xffffaf8602a1f6a0

Device Object: 0xffffaf85f7da2c00   \FileSystem\Npfs
Vpb is NULL
Flags:  0x40082
  Synchronous IO
  Named Pipe
  Handle Created
File Object is currently busy and has 0 waiters.

Process                       Thread           CID       UserTime KernelTime ContextSwitches Wait Reason         Time State
redacted_executable_name.exe (ffffaf85feb300c0) ffffaf8602d06080 e88.2840   53s.969    24s.203          539612 Executive   3h:27:07.468 Waiting

# Child-SP         Return           Call Site
0 ffffd005e2b02630 fffff8025d7071d7 nt!KiSwapContext+0x76
1 ffffd005e2b02770 fffff8025d706d49 nt!KiSwapThread+0x297
2 ffffd005e2b02830 fffff8025d705ad0 nt!KiCommitThreadWait+0x549
3 ffffd005e2b028d0 fffff8025dd06e87 nt!KeWaitForSingleObject+0x520
4 ffffd005e2b029a0 fffff8025dd6b377 nt!IopCancelIrpsInThreadList+0x11f
5 ffffd005e2b029f0 fffff8025dd06cd2 nt!IopCancelIrpsInThreadListForCurrentProcess+0xc3
6 ffffd005e2b02ab0 fffff8025d877cc5 nt!NtCancelIoFileEx+0xc2
7 ffffd005e2b02b00 00007fff4e8211c4 nt!KiSystemServiceCopyEnd+0x25
8 000000002e01fc28 00007fff4a9f9ef0 ntdll!ZwCancelIoFileEx+0x14
9 000000002e01fc30 000000000047337e KERNELBASE!CancelIoEx+0x10
a 000000002e01fc70 000000002e01fc88 redacted_executable_name+0x7337e

0xffffaf86`11746570 
\Endpoint
Device Object: 0xffffaf85f7db0b20   \Driver\AFD
Vpb is NULL

Flags:  0x6040400
  Queue Irps to Thread
  Handle Created
  Skip Completion Port Queueing On Success
  Skip Set Of File Object Event On Completion

Microsoft’s Conclusion

Microsoft confirmed that this is not a bug in the traditional sense but a consequence of how the Windows I/O subsystem is implemented: CancelIoEx can indeed block in this situation, even though the official documentation does not mention it.

Unfortunately, Go does not account for this, which makes the problem especially dangerous. Nothing warns the developer that the call can block, so the hang first shows up in production. Because Go code does not normally control which OS thread performs its I/O, the scenario is hard to avoid without changes to the runtime.

Coordination with Go and Microsoft Developers

I was able to get the attention of a Go developer from Microsoft and coordinate their interaction with the Microsoft Kernel Team. The joint analysis confirmed the issue is reproducible and directly related to APC behavior during synchronous I/O.

In my opinion, this should either be handled properly in the Windows kernel or at least be clearly documented for CancelIoEx. The first option seems unrealistic, because any change to the APC mechanism could introduce unpredictable backward-compatibility problems. Meanwhile, the documentation still promises non-blocking behavior, while in practice the call can block.

Practical Consequences

  • In Go tests, this appeared as a rare flaky test — TestPipeIOCloseRace — which would occasionally fail in CI.
  • In a real-world service, a thread that had been doing asynchronous I/O on a network socket was later used for synchronous I/O on a pipe, and a CancelIoEx call on that socket from another thread hung.
  • From the outside, the process still appeared alive, but internally a critical goroutine was permanently blocked.
  • The problem is extremely difficult to reproduce — it may not appear for weeks in testing, but can suddenly strike in production and freeze a critical service.

Possible Solutions

  1. Proper handle management
    • Do not call CancelIoEx for synchronous handles (created without FILE_FLAG_OVERLAPPED).
    • To cancel synchronous operations, use CancelSynchronousIo.
      These measures eliminate the blocking risk but require changes in the Go runtime.
  2. Minimize pipe usage
    If possible, avoid using pipes in critical paths of the code to reduce the likelihood of hitting this issue.
  3. Wait for the official fix
    The Go team confirmed the problem and plans to fix it in Go 1.26 (see issue #64482).
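The first option boils down to a decision the runtime has to make when a pipe handle is closed while another goroutine is blocked on it. Below is a hypothetical, platform-independent model of that decision. It is not Go's internal/poll code, and the Go team's actual fix may take a different approach. The `canceller` interface merely stands in for the Win32 calls CancelIoEx (cancel by handle) and CancelSynchronousIo (cancel by the thread blocked in the I/O). The model illustrates one consequence: cancelling synchronous I/O requires knowing which thread is blocked, so a reader or writer has to record its thread before entering the call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// canceller is a hypothetical interface standing in for the two Win32 calls;
// nothing here calls the real Windows API.
type canceller interface {
	CancelIoEx(handle uintptr) error          // here used only for overlapped handles
	CancelSynchronousIo(thread uintptr) error // cancels synchronous I/O on one thread
}

// pipeFD is a hypothetical model of a pipe file descriptor.
type pipeFD struct {
	handle     uintptr
	overlapped bool // true if the handle was opened with FILE_FLAG_OVERLAPPED

	mu      sync.Mutex
	blocked map[uintptr]struct{} // threads currently inside synchronous ReadFile/WriteFile
}

// enterSyncIO records that thread tid is about to block in synchronous I/O.
func (fd *pipeFD) enterSyncIO(tid uintptr) {
	fd.mu.Lock()
	defer fd.mu.Unlock()
	if fd.blocked == nil {
		fd.blocked = make(map[uintptr]struct{})
	}
	fd.blocked[tid] = struct{}{}
}

// exitSyncIO records that thread tid has returned from synchronous I/O.
func (fd *pipeFD) exitSyncIO(tid uintptr) {
	fd.mu.Lock()
	defer fd.mu.Unlock()
	delete(fd.blocked, tid)
}

// cancelPending picks the call that matches the handle type: CancelIoEx for
// overlapped handles, CancelSynchronousIo for each blocked thread otherwise.
func (fd *pipeFD) cancelPending(c canceller) error {
	if fd.overlapped {
		return c.CancelIoEx(fd.handle)
	}
	fd.mu.Lock()
	defer fd.mu.Unlock()
	var errs []error
	for tid := range fd.blocked {
		errs = append(errs, c.CancelSynchronousIo(tid))
	}
	return errors.Join(errs...) // nil if nothing was blocked or every call succeeded
}

// recorder is a fake canceller that only records which call was made.
type recorder struct{ calls []string }

func (r *recorder) CancelIoEx(h uintptr) error {
	r.calls = append(r.calls, fmt.Sprintf("CancelIoEx(handle=%d)", h))
	return nil
}

func (r *recorder) CancelSynchronousIo(t uintptr) error {
	r.calls = append(r.calls, fmt.Sprintf("CancelSynchronousIo(thread=%d)", t))
	return nil
}

func main() {
	rec := &recorder{}

	syncFD := &pipeFD{handle: 7} // synchronous handle, as created by CreatePipe
	syncFD.enterSyncIO(42)       // thread 42 is blocked in ReadFile
	_ = syncFD.cancelPending(rec)

	ovFD := &pipeFD{handle: 8, overlapped: true}
	_ = ovFD.cancelPending(rec)

	fmt.Println(rec.calls) // [CancelSynchronousIo(thread=42) CancelIoEx(handle=8)]
}
```

A real runtime would also have to close a race. The thread could return from ReadFile just before CancelSynchronousIo runs and then start another, unrelated synchronous call, which would be cancelled instead.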

Conclusion

This case demonstrated that a rare flaky test can hide a very real and dangerous production problem.

  • Calling CancelIoEx on synchronous pipe handles can permanently block a thread, despite the documentation claiming otherwise.
  • In Go, the scheduler may reuse a thread that previously issued asynchronous I/O for a synchronous operation. While that thread is blocked, its pending asynchronous operations cannot be cancelled, and any thread calling CancelIoEx on them blocks as well.
  • Until the fix in Go 1.26, developers have only workarounds.

⚡ If your Go service on Windows hangs in CancelIoEx and you can’t reproduce the problem, I hope this post helps you understand the cause and find a temporary workaround.
