Windows Packet Filter and Gigabit networks

April 6, 2016

There is a very popular and important question about Windows Packet Filter: “Can I handle Gigabit traffic in a WinpkFilter user-mode application without noticeable performance degradation?” I am asked this quite often, and my answer usually starts with “that depends…”, continues with various performance-related considerations, and ends with something like “if you need the maximum possible performance, consider putting your packet filtering code directly into the WinpkFilter driver.” And yes, there are at least several important things to take into consideration:

  1. First, and most important, is CPU performance. The faster the CPU the better, because fetching a packet from the kernel, processing it and re-injecting it back into the network stack costs extra CPU cycles.
  2. Application architecture. This is strongly connected to the previous point, but even with a high-end, incredibly fast CPU a poor application design has a good chance of slowing down your network.
  3. Network latency. You may already be aware that TCP performance is very sensitive to network latency. The relevant parameter is the TCP round-trip time (RTT): the length of time it takes for a packet to be sent plus the length of time it takes for an acknowledgment of that packet to be received. By design TCP limits the amount of data which can be sent without confirmation (the TCP window size), so the theoretical throughput of a single TCP session = TCP maximum receive window size / RTT (see the short worked example after this list). This also highlights the importance of the previous point: fetching a network packet from the kernel and back not only costs CPU cycles, it also takes time and thus increases RTT. If your packet processing code is inefficient and slow, it adds even more, and all of this may in turn decrease single TCP session throughput.
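To put some purely illustrative numbers to the formula: with a 64 KB receive window and no window scaling, a 1 ms RTT caps a single TCP session at roughly 524 Mbit/s, and every extra millisecond of processing delay pushes that ceiling further down. A small sketch of the arithmetic (the window size and RTT values are assumptions, not measurements from the tests below):

#include <cstdio>

// Purely illustrative arithmetic for the formula above:
// throughput <= receive window / RTT. A fixed 64 KB window is assumed here
// for simplicity; real stacks use window scaling and larger windows.
int main()
{
   const double windowBits = 65535.0 * 8;                 // 64 KB window in bits
   const double rttSeconds[] = { 0.0005, 0.001, 0.002 };  // 0.5, 1 and 2 ms

   for (double rtt : rttSeconds)
   {
      printf("RTT %.1f ms -> at most %.0f Mbit/s per TCP session\n",
             rtt * 1000.0, windowBits / rtt / 1e6);
   }

   return 0;
}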

So it looks reasonable to run some tests and see what throughput can be achieved using user-mode packet filtering. As a side effect of this research, several performance optimizations were implemented in the WinpkFilter driver code (version 3.2.7).

Prerequisites

Hardware

The configuration is very common: two “far from new”, average-performance mobile workstations with built-in Gigabit LAN interfaces, connected by Cat 5e UTP cables to a Gigabit home router:

  • Core i7-2760QM 2.40 GHz 16GB RAM Windows 10 x64 Pro
  • Core i7-Q720 1.60 GHz 16GB RAM Windows 10 x64 Pro

Software

First of all we need a tool capable of transferring data between our hosts and measuring the effective throughput. I have two favorite utilities: NetCPS and iPerf3. The latter is more advanced and allows running more than one TCP session simultaneously, so iPerf3 (people like things starting with ‘i’) is the choice of the day.
We also need a simple WinpkFilter-based network filter application that passes network packets through and maybe gathers some basic statistics. There are two samples we could use for testing: PassThru and PackThru. However, they are designed to be as simple as possible in order to demonstrate basic API usage, and they heavily use console I/O to dump packet headers. Console I/O is very expensive from the performance point of view and we can’t use it in a high-load test. So let’s code something similar, but optimized for performance.

#include "stdafx.h"
 
//
// Maximum number of packets to fetch from the driver per single operation
//
#define MAX_PACKET_CHUNK 256
 
//
// Some global definitions
//
TCP_AdapterList          g_AdList;
DWORD                    g_iIndex;
CNdisApi                 g_api;
HANDLE                   g_hEvent;
volatile BOOLEAN         g_bIsRunning = TRUE;
volatile LONG64          g_llPacketFiltered = 0;
volatile LONG            g_dwReadOps = 1;

Instead of reading and forwarding packets in main(), we create a dedicated thread for reading and forwarding packets. Having this thread also allows us to experiment with creating multiple threads to filter a single network interface. Soon we will see whether having more than one thread per interface makes any sense.
MAX_PACKET_CHUNK defines the maximum number of network packets to read from the driver in a single API call. Please note that it is important to use ReadPackets/SendPacketsToAdapter/SendPacketsToMstcp to read/write a bulk of packets in one API call: you save a lot of the CPU cycles required for user/kernel context switches compared to the single-packet versions ReadPacket/SendPacketToAdapter/SendPacketToMstcp. The value of 256 for MAX_PACKET_CHUNK was chosen experimentally; in a real application you can make this a dynamic parameter and adjust it according to the average number of packets you read/write, as sketched below.
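A minimal sketch of such an adaptive scheme, assuming a hypothetical AdaptChunkSize helper that is not part of the sample (the thresholds and growth factor are arbitrary):

// Hypothetical helper (not part of the sample): grow or shrink the number of
// packets requested per ReadPackets call based on how full the last read was.
static unsigned AdaptChunkSize(unsigned currentChunk, unsigned lastRead)
{
   const unsigned minChunk = 16;
   const unsigned maxChunk = MAX_PACKET_CHUNK;

   if (lastRead == currentChunk && currentChunk < maxChunk)
      return currentChunk * 2;      // the chunk was completely filled - grow
   if (lastRead < currentChunk / 4 && currentChunk > minChunk)
      return currentChunk / 2;      // the chunk was mostly empty - shrink
   return currentChunk;             // keep the current size
}

// Possible usage inside the read loop:
//    ReadRequest->dwPacketsNumber = chunk;
//    if (g_api.ReadPackets(ReadRequest))
//       chunk = AdaptChunkSize(chunk, ReadRequest->dwPacketsSuccess);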

//
// Working thread routine. One instance is started per requested working thread
//
unsigned __stdcall WorkingThread(void * index)
{	
   PETH_M_REQUEST		ReadRequest;
   PETH_M_REQUEST		ToMstcpRequest;
   PETH_M_REQUEST		ToAdapterRequest;
   INTERMEDIATE_BUFFER 	PacketBuffer[MAX_PACKET_CHUNK];
   unsigned			dwMaxRead = 0;
 
   ULONG_PTR            dwThreadIndex = reinterpret_cast<ULONG_PTR>(index);
 
   printf("Thread: %I64d started\n", (ULONGLONG)dwThreadIndex);
 
   //
   // Initialize Request
   //
   ReadRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
   ToMstcpRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
   ToAdapterRequest = (PETH_M_REQUEST)malloc(sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
 
   ZeroMemory(ReadRequest, sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
   ZeroMemory(ToMstcpRequest, sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
   ZeroMemory(ToAdapterRequest, sizeof(ETH_M_REQUEST) +
      sizeof(NDISRD_ETH_Packet)*(MAX_PACKET_CHUNK - 1));
 
   ZeroMemory(&PacketBuffer, sizeof(INTERMEDIATE_BUFFER)*MAX_PACKET_CHUNK);
   ReadRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
   ToMstcpRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
   ToAdapterRequest->hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
   ReadRequest->dwPacketsNumber = MAX_PACKET_CHUNK;
 
   for (unsigned i = 0; i < MAX_PACKET_CHUNK; ++i)
   {
      ReadRequest->EthPacket[i].Buffer = &PacketBuffer[i];
   }
 
   while (g_bIsRunning)
   {
      WaitForSingleObject(g_hEvent, INFINITE);
 
      // Reset event, as we don't need to wake up all working threads at once
 
      if (g_bIsRunning)
          ResetEvent(g_hEvent);
 
      // Start reading packet from the driver
 
      while (g_api.ReadPackets(ReadRequest))
      {
         if (ReadRequest->dwPacketsSuccess > dwMaxRead)
         {
            dwMaxRead = ReadRequest->dwPacketsSuccess;
 
            printf(
                "Thread: %I64d New read operation maximum %d packets from driver\n",
                 (ULONGLONG)dwThreadIndex,
                 dwMaxRead
                 );
         }
 
         // Count this ReadPackets call to calculate the average later
         InterlockedIncrement(&g_dwReadOps);

         for (unsigned i = 0; i < ReadRequest->dwPacketsSuccess; ++i)
         {
            InterlockedIncrement64(&g_llPacketFiltered);

            if (PacketBuffer[i].m_dwDeviceFlags == PACKET_FLAG_ON_SEND)
            {
               ToAdapterRequest->EthPacket[ToAdapterRequest->dwPacketsNumber].Buffer =
                  &PacketBuffer[i];
               ToAdapterRequest->dwPacketsNumber++;
            }
            else
            {
               ToMstcpRequest->EthPacket[ToMstcpRequest->dwPacketsNumber].Buffer =
                  &PacketBuffer[i];
               ToMstcpRequest->dwPacketsNumber++;
            }
         }

         if (ToAdapterRequest->dwPacketsNumber)
         {
            g_api.SendPacketsToAdapter(ToAdapterRequest);
            ToAdapterRequest->dwPacketsNumber = 0;
         }

         if (ToMstcpRequest->dwPacketsNumber)
         {
            g_api.SendPacketsToMstcp(ToMstcpRequest);
            ToMstcpRequest->dwPacketsNumber = 0;
         }

         ReadRequest->dwPacketsSuccess = 0;
      }
   }

   free(ReadRequest);
   free(ToMstcpRequest);
   free(ToAdapterRequest);

   return 0;
}

The working thread is very simple: it reads a bulk of packets from the driver, sorts them into incoming and outgoing, and re-injects them back into the network stack. It also counts the number of filtered packets and ReadPackets API calls to calculate an average.

int main(int argc, char* argv[])
{
   if (argc < 3)
   {
      printf("Command line syntax:\n\tperftest.exe index threads\n\tindex - network interface index\n\tthreads - number of packet processing threads\n");
      return 0;
   }

   // Parse parameters: network interface index and number of working threads
   g_iIndex = atoi(argv[1]) - 1;
   unsigned concurentThreadsSupported = atoi(argv[2]);

   if (!g_api.IsDriverLoaded())
   {
      printf("Driver not installed on this system or failed to load.\n");
      return 0;
   }

   g_api.GetTcpipBoundAdaptersInfo(&g_AdList);

   if (g_iIndex + 1 > g_AdList.m_nAdapterCount)
   {
      printf("There is no network interface with such index on this system.\n");
      return 0;
   }
 
   ADAPTER_MODE Mode;
 
   Mode.dwFlags = MSTCP_FLAG_SENT_TUNNEL|MSTCP_FLAG_RECV_TUNNEL;
   Mode.hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
 
   // Create notification event
   g_hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
 
   // Set event for helper driver
   if ((!g_hEvent) || 
       (!g_api.SetPacketEvent((HANDLE)g_AdList.m_nAdapterHandle[g_iIndex], g_hEvent)))
   {
      printf ("Failed to create notification event or set it for driver.\n");
      return 0;
   }
 
   g_api.SetAdapterMode(&Mode);
 
   HANDLE* phThreads = new HANDLE[concurentThreadsSupported];
 
   for (unsigned i = 0; i < concurentThreadsSupported; ++i)
   {
      phThreads[i] = 
       (HANDLE)_beginthreadex (	
                  NULL,
                  0,
                  WorkingThread,
                   reinterpret_cast<void*>((ULONG_PTR)i),
                  0,
                  NULL
                  );
   }
 
   printf("Press any key to stop packet filtering... \n");
 
   _getch();
 
   g_bIsRunning = FALSE;
 
   SetEvent(g_hEvent);
 
   WaitForMultipleObjects(concurentThreadsSupported, phThreads, TRUE, INFINITE);
 
   for (unsigned i = 0; i < concurentThreadsSupported; ++i)
   {
       CloseHandle(phThreads[i]);
   }

   delete[] phThreads;
 
   printf(
      "Filtered %I64d packets in %d calls. Packets per call average:%I64d \n", 
      g_llPacketFiltered, 
      g_dwReadOps, 
      g_llPacketFiltered / g_dwReadOps
      );
 
   //
   // Although all resources would be released automatically when the
   // application exits, release everything explicitly before we exit
   //
   Mode.dwFlags = 0;
   Mode.hAdapterHandle = (HANDLE)g_AdList.m_nAdapterHandle[g_iIndex];
 
   // Set NULL event to release previously set event object
   g_api.SetPacketEvent(g_AdList.m_nAdapterHandle[g_iIndex], NULL);
 
   // Close Event
   if (g_hEvent)
      CloseHandle(g_hEvent);
 
   // Set default adapter mode
   g_api.SetAdapterMode(&Mode);
 
   // Empty adapter packets queue
   g_api.FlushAdapterPacketQueue(g_AdList.m_nAdapterHandle[g_iIndex]);
 
   return 0;
}

The resulting console application accepts an interface index and a number of working threads as parameters, starts the requested number of packet processing threads and waits for the user to terminate it. Now it is time to run some tests. We will be using the latest available WinpkFilter 3.2.7 (single TCP session performance was significantly improved in this version).
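For example, assuming the Gigabit Ethernet adapter is the first interface reported by the driver (the index on your system may differ), the single- and two-thread test runs could be started like this:

perftest.exe 1 1    (first interface, one packet processing thread)
perftest.exe 1 2    (first interface, two packet processing threads)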

Performance tests

Both mobile workstations have iPerf3 started in server mode to measure the throughput of incoming and outgoing TCP data streams. We then use iPerf3 in client mode on each of the workstations and measure the network throughput with a single and with 16 simultaneous TCP sessions in the following software configurations:

  1. WinpkFilter driver not installed
  2. WinpkFilter installed but not used
  3. WinpkFilter installed and perftest.exe (the utility we created above) started on the Ethernet interface with a single packet processing thread
  4. WinpkFilter installed and perftest.exe (the utility we created above) started on the Ethernet interface with two packet processing threads

Configuration         Receive 1 TCP        Send 1 TCP           Receive 16 TCP        Send 16 TCP
                      stream (Mbits/sec)   stream (Mbits/sec)   streams (Mbits/sec)   streams (Mbits/sec)
Not installed         939                  938                  940                   939
Installed             941                  940                  940                   933
PerfTest, 1 thread    937                  935                  934                   939
PerfTest, 2 threads   648                  880                  923                   934
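For reference, measurements like these can be taken with iPerf3 invocations along the following lines (192.168.1.2 stands for the address of the opposite workstation and is purely illustrative):

iperf3 -s                      run on the opposite workstation in server mode
iperf3 -c 192.168.1.2          single TCP stream from client to server
iperf3 -c 192.168.1.2 -P 16    16 parallel TCP streams
iperf3 -c 192.168.1.2 -R       reverse mode: the server sends, the client receives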

Conclusion

Well, as we can see from the test results, our simple WinpkFilter application handles 1 Gbit/s traffic for both single and multiple TCP connections without network throughput degradation. The resulting CPU load with perftest.exe running is 17-20%, compared to 7-8% without it. This is good news, and for now we can avoid moving our packet processing code into the kernel driver.
However, there is one more thing to note. What happens when we start a second working thread inside perftest.exe on the same network interface? Why does the throughput decrease so drastically? The reason is easy to find if you start a network sniffer and look at the order of network packets. You will notice that the order of packets within a single TCP session is sometimes broken, causing DUP ACKs and retransmits, and this seriously affects TCP session throughput. So it is important to understand: if you use more than one thread to read packets from the same network interface, or have some other packet processing flow which may potentially break the original order of network packets, then you have to take care to re-inject packets into the network stack in the correct order. One possible way to do this is sketched below.
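A minimal sketch of such ordered re-injection, assuming a ticket scheme that is not part of the PerfTest sample: each chunk read from the driver takes a monotonically increasing ticket, and chunks are re-injected strictly in ticket order.

#include <mutex>
#include <condition_variable>

// Sketch only: serialize re-injection by ticket so that packets leave in the
// same order they were read, even with several worker threads per adapter.
static std::mutex              g_orderLock;
static std::condition_variable g_orderCv;
static unsigned long long      g_readTicket = 0;    // next ticket to hand out
static unsigned long long      g_injectTicket = 0;  // next ticket allowed to inject

// Inside the worker loop, around the existing WinpkFilter calls:
//
//    unsigned long long ticket;
//    {
//       std::lock_guard<std::mutex> lock(g_orderLock);
//       g_api.ReadPackets(ReadRequest);      // read a chunk and take a ticket
//       ticket = g_readTicket++;             // atomically with the read
//    }
//
//    ... sort/process the chunk as before ...
//
//    {
//       std::unique_lock<std::mutex> lock(g_orderLock);
//       g_orderCv.wait(lock, [&] { return g_injectTicket == ticket; });
//       g_api.SendPacketsToAdapter(ToAdapterRequest);
//       g_api.SendPacketsToMstcp(ToMstcpRequest);
//       ++g_injectTicket;                    // let the next chunk go out
//       g_orderCv.notify_all();
//    }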

The PerfTest project source code is available on GitHub.

2 thoughts on “Windows Packet Filter and Gigabit networks”

    ntkernel.com (post author):

      The Windows network subsystem has not changed much since Windows Vista (when NDIS 6.0 was initially released), and for that reason WinpkFilter uses the same type of driver from Windows Vista through Windows 10 (the driver binaries differ slightly because they are built against different minor NDIS versions, which depend on the OS and are backward compatible, e.g. you can’t load the Windows 8 driver on Windows 7, but you can load the Windows 7 driver on Windows 8/10). So if Windows 7 x64 on particular hardware can handle Gigabit traffic without taking 100% CPU, then it will also be able to do it with WinpkFilter.

