Monday, July 11, 2022

Windows Performance Thresholds

This article outlines a general Windows performance troubleshooting approach. It also contains some of the more important performance counters and the thresholds that indicate some kind of performance issue.

 

Some Definitions

  • Complete System Hang - System does not respond to any input from the keyboard or mouse for more than 1 minute after the system has already booted.
  • Hardware Hang - Hang due to hardware related problem.
  • Software Hang - Hang due to software related problem.
  • Application Hang - Application does not respond to any input from the keyboard or mouse for more than 1 minute after application is fully loaded. System is responsive, but application is not. Can be verified in Application event log (Event 1001).
  • Application Crash - Application closes due to exception. Can be verified in Application event log (Event 1000).

 

Troubleshooting Steps

The general order for troubleshooting performance related issues is as follows:

  1. Get detailed problem description. Don't underestimate the importance of a detailed problem description from an end user perspective. What is happening that shouldn't be happening?
    1. Is an application unresponsive / slow?
    2. Is the entire system unresponsive / slow?
    3. Under what conditions is it unresponsive / slow? Is it reproducible?
    4. How long is it unresponsive / slow?
    5. What error messages do you see?
    6. How long does it normally take to complete the operation? How long is it taking now?
    7. Is it slow during certain times of the day?
    8. Are you seeing any 1000 events in the application log?
  2. Gather performance metrics from the time when the problem is occurring. Performance metrics gathered during normal use are helpful for a baseline, but to identify the problem, you need metrics that were collected during the time of the problem. Generally start with perfmon (Splunk) counters.
    1. Include performance counters that are appropriate for the problem that you are experiencing.
    2. Once data is collected from time of reproduced problem check for bottlenecks.
      1. CPU
      2. Memory
      3. Disk
      4. Network
  3. Once you have identified which resource is bottle-necking, then dive into which process is responsible for the bottleneck.
    1. If you can identify one process that is taxing the bottle-necked resource then you know where to focus your troubleshooting efforts.
    2. Collect additional, more granular data on that process (procmon trace, xperf, WPR, memory dump, etc).
    3. Analyze traces.
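
The "check for bottlenecks" step can be sketched as follows, assuming the counter samples have already been exported into plain Python lists. The article does not define how long "sustained" is, so `min_samples` is a hypothetical parameter that you would tune to your own sample interval, and the function names are illustrative only.

```python
from typing import Sequence


def longest_breach(samples: Sequence[float], threshold: float, above: bool = True) -> int:
    """Return the longest run of consecutive samples that breach the threshold.

    above=True  -> a sample breaches when it is >= threshold (e.g. % Processor Time)
    above=False -> a sample breaches when it is <= threshold (e.g. % Idle Time)
    """
    longest = current = 0
    for value in samples:
        breached = value >= threshold if above else value <= threshold
        current = current + 1 if breached else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples: Sequence[float], threshold: float,
                 min_samples: int, above: bool = True) -> bool:
    """True when the threshold is breached for at least min_samples in a row.

    min_samples is an assumption; the article gives no fixed duration.
    """
    return longest_breach(samples, threshold, above) >= min_samples


cpu_percent = [50, 95, 96, 97, 40, 99]
print(is_sustained(cpu_percent, threshold=91, min_samples=3))
```

A single spike above a threshold is rarely meaningful on its own, which is why the sketch looks at consecutive samples rather than the maximum value.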

 

Key Performance Counters

The following performance counters are part of the Windows counter set found in Windows Performance Monitor.

Each counter below is listed in this order: Resource, Performance Counter, Description, Threshold, Interpretation.

Processor Counter

 

 

 

 

 CPU

Processor Queue Length

How many threads are waiting to run on the processor

Sustained 2 or more.

It's not uncommon for there to be threads in queue. If this number goes up right before the problem occurs it's an indicator that the processor is taxed. Don't use this counter by itself, look at other counters to confirm that the processor is taxed. Generally you are looking for a buildup in the processor queue length that corresponds with the problem description.

 

Also, this counts all threads in one queue no matter how many processors the machine has, so divide this number by the number of processors you have (where processor = ability to run one thread). This counter shows ready threads, not threads that are in a "running" state.
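
The per-processor adjustment described above can be sketched like this. The function names are hypothetical, and the default threshold of 2 is the one from the Threshold entry for this counter.

```python
def queue_length_per_processor(queue_length: float, logical_processors: int) -> float:
    """Divide the single system-wide queue by the number of processors.

    "Processor" here means the ability to run one thread, as in the article.
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return queue_length / logical_processors


def queue_is_high(queue_length: float, logical_processors: int,
                  threshold: float = 2.0) -> bool:
    """True when the per-processor queue length is at or above the threshold."""
    return queue_length_per_processor(queue_length, logical_processors) >= threshold


# A raw queue length of 12 looks alarming, but on 8 processors it is 1.5 each.
print(queue_length_per_processor(12, 8))
print(queue_is_high(12, 8))
```

This checks one sample only. As noted above, confirm with other counters and look for a buildup that lines up with the problem description.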

 CPU

% Processor Time

Overall processor utilization (all instances)

Sustained 91% to 100%

If CPU utilization is consistently between 91% and 100%, it is likely that your compute resources are under spec for the given workload. If you see a sustained value of 100% along with an increase in the Processor Queue Length counter, then your CPU is a resource bottleneck.

 

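In some cases, this counter will show values of more than 100%. This is likely due to having multiple processors (for example, the Process object's % Processor Time can reach 100% multiplied by the number of logical processors). So consider the number of processors you have when you read this value.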

 CPU

% User Time

Percent of time processor spent on user mode threads

Sustained 80% to 100%

We look at this counter to see how much CPU is being utilized by user mode applications. In general, software applications run in user mode. If you see a lot of user-mode utilization it likely indicates that some application (Office, Notepad++, etc.) is using CPU.

 CPU

% Privileged Time

Percent of time processor spent on kernel mode threads

Sustained 91% to 100%

We look at this counter to see how much CPU is being utilized by threads running in kernel mode. In general this indicates that a system process is using the CPU. If this occurs, be sure to look into the "System" process. However, there are some applications (antivirus, drivers) that run in kernel mode.

Values over 80% often indicate that there are driver or network issues.

Usually values <30% for application / IIS server are normal

High CPU on one or two threads is usually a driver issue.

High CPU on 4 or more threads with the SAME address generally means that the NIC is very active. Use a network capture to troubleshoot further.

If you can't find the problem with the above methods, use Kernrate to list active API calls (search for Kernrate and/or KrView).

 CPU

% DPC Time

Percent of time processor spent on DPCs

 >15% should investigate

DPCs (deferred procedure calls) are work that interrupt handlers defer to run at a lower priority, and they are often associated with your NIC. Generally these values are very low (below 15%). If the values are much higher than that, you should keep it in mind as you continue to troubleshoot.

This counter reflects the time required to complete I/O requests. The threshold is 15% (this time is included in % Privileged Time). This counter, along with % Interrupt Time below, shows the amount of time spent completing I/O.

This counter is rolled up in the system counter but will not show up in any of the system’s thread counters. So look for confirmation there (system process CPU vs. system threads CPU); the difference will be the sum of interrupt and DPC time.

 

Look for variations in this counter; sometimes a hardware device will get blocked, yielding a delta increase in this counter.

The values observed here are quite wide so a baseline is essential.

Normally any value of 25% or more is something that needs to be investigated.

 CPU

% Interrupt Time

Percent of time processor spent receiving and servicing hardware interrupts

 >10% should investigate

This counter shows the percentage of time the processor spends receiving and servicing hardware interrupts during the sample interval. It does not include DPCs, which are counted separately. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity.

Interrupts are high priority work that kicks running threads off the processor. They can be initiated by hardware (keyboard, mouse, etc.) or software (a call for the operating system to do something). In general, these do not take up much CPU, so if they do, look for what is generating them.

This counter shows the amount of time spent doing I/O.

This counter is rolled up in the system counter but will not show up in any of the system’s thread counters. So look for confirmation there (system process CPU vs. system threads CPU).

 

Look for variations in this counter; sometimes a hardware device will get blocked, yielding a delta increase in this counter.

 

ALL interrupt time from any one adapter is posted on only one CPU.

To spread the interrupts, you must use multiple adapters.

 

 

This is the time to SCHEDULE an I/O, not the time to actually complete the I/O.

Physical Disk Counter

 

 

 

 

Disk

% Idle Time

This counter provides a very precise measurement of how much time the disk remained in idle state, meaning all the requests from the operating system to the disk have been completed and there are zero pending requests.

This is how it’s calculated: the system timestamps an event when the disk goes idle, then timestamps another event when the disk receives a new request. At the end of the capture interval, we calculate the percentage of the time spent in idle. This counter ranges from 100 (meaning always idle) to 0 (meaning always busy).
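
The calculation described above can be illustrated with a small sketch. It assumes the idle periods are already available as (went idle, next request) timestamp pairs in seconds; the function name and the input shape are hypothetical and do not reflect how Windows stores this data internally.

```python
from typing import Iterable, Tuple


def percent_idle(idle_periods: Iterable[Tuple[float, float]],
                 interval_start: float, interval_end: float) -> float:
    """Percentage of the capture interval that the disk spent idle.

    Each idle period is (time the disk went idle, time the next request arrived).
    Periods are clipped to the capture interval before being summed.
    """
    length = interval_end - interval_start
    if length <= 0:
        raise ValueError("capture interval must have a positive length")
    idle = 0.0
    for went_idle, next_request in idle_periods:
        start = max(went_idle, interval_start)
        end = min(next_request, interval_end)
        if end > start:
            idle += end - start
    return 100.0 * idle / length


# Idle from 0-2 s and 5-8 s inside a 10 s capture interval.
print(percent_idle([(0.0, 2.0), (5.0, 8.0)], 0.0, 10.0))
```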

 <60% should investigate

Generally, this counter should not consistently be below 60%. If it is consistently below 60% you should investigate more. If the value flat-lines at 0% and you see the Avg. Disk Queue Length increasing during this time, you have a disk bottleneck. In other words, the disk cannot keep up with the workload it is being given.

To determine % usage you must use % Idle Time (to get a 100% scale). This can also be misleading because if the drive has multiple disks, then all the disks must be idle for the drive to show up as idle.

 

This counter accurately determines the saturation of the disk subsystem. Some installations (especially SQL, Exchange and IIS servers) are designed to make the disk subsystem the bottleneck, which is the most cost effective method to build a large system.

 

But in general, we are looking for some idle time to be present, especially on the SYSTEM drive or page file drives. Disk idle time does not help determine throughput; it only determines how often the volume has nothing to do.

 

The administrator cannot use the disk read or write time counters to determine how busy the system is because they will commonly read more than 100%. Essentially, write times overlap because all controllers can issue multiple I/Os at the same time, so disk write times of 100% to 10,000% are common on high performance disk drive subsystems.

 

% Idle Time is a value that is independent of the number of spindles and of the number of simultaneous I/Os.

 

Thresholds are dependent on server roles.

·         Application servers like 30% Idle Time.

·         File and print servers like 20% Idle Time.

·         Batch servers (like Exchange) may show 0% Idle Time. In fact it is usually the design goal to be disk bound: 0% Idle Time.

·         BUT for a SAN LUN, expect zero percent idle time at all times. This is not a concern; look at other counters to measure performance.

Disk

Avg. Disk sec/Transfer

Displays the average time the disk transfers took to complete, in seconds. Although the scale is seconds, the counter has millisecond precision, meaning a value of 0.004 indicates the average time for disk transfers to complete was 4 milliseconds.

This is the counter in Perfmon used to measure IO latency.

 >.030 is bad

 Standard Disk Thresholds

 .008 or less = Excellent

 .012 or less = Good

 .020 or less = Fair

 .030 or greater = Poor
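
The standard disk thresholds above can be applied to a sample as in the sketch below. The function name is hypothetical. The list leaves values between .020 and .030 unnamed, so the label returned for that range is an assumption of this sketch and not a rating from the list.

```python
def classify_disk_latency(avg_disk_sec_per_transfer: float) -> str:
    """Rate an Avg. Disk sec/Transfer sample against the standard disk thresholds."""
    if avg_disk_sec_per_transfer < 0:
        raise ValueError("latency cannot be negative")
    if avg_disk_sec_per_transfer <= 0.008:
        return "Excellent"
    if avg_disk_sec_per_transfer <= 0.012:
        return "Good"
    if avg_disk_sec_per_transfer <= 0.020:
        return "Fair"
    if avg_disk_sec_per_transfer >= 0.030:
        return "Poor"
    # Assumption: the list does not name the .020 to .030 range.
    return "Between Fair and Poor"


# 0.004 seconds is 4 milliseconds per transfer.
print(classify_disk_latency(0.004))
```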

 Cache Thresholds

 < .001 = Excellent

 < .002 = Good

Disk

Avg. Disk Queue Length

Avg. Disk Queue Length is equal to the (Disk Transfers/sec) *( Disk sec/Transfer). This is based on “Little’s Law” from the mathematical theory of queues. It is important to note this is a derived value and not a direct measurement.
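
As a small worked example of the relationship above, the sketch below derives the queue length from the two measured counters. The function name is hypothetical; the formula is the one stated in the description.

```python
def avg_disk_queue_length(disk_transfers_per_sec: float,
                          avg_disk_sec_per_transfer: float) -> float:
    """Little's Law: average items in the queue = arrival rate * average time per item."""
    return disk_transfers_per_sec * avg_disk_sec_per_transfer


# 500 transfers per second at 4 ms each keeps about 2 requests outstanding on average.
print(avg_disk_queue_length(500, 0.004))
```

Because the value is derived, a high queue length always traces back to a high transfer rate, a high latency, or both, so it helps to read those two counters alongside it.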

 <2 is good

Less than 2 plus the number of spindles is an excellent value.

·         This should correspond to excellent response time.

Less than double the number of spindles is a good value.

·         This requires further investigation of the disk transfer time in order to see whether disk queue length would actually impact the system.

Less than triple the number of spindles is a fair value.

·         Generally not an issue if seen for a period of 5-10 seconds.

 

SANs assume that the average queue length will be less than:

·         16 for SCSIPort implementations

·         32 for StorPort implementations

 

If it is less than that, the SAN assumes it is lightly loaded; check the other counters to see if the Windows server thinks it is lightly loaded.

·         If it is greater than 16/32 you MUST investigate the possibility of throttling.

Disk

Split IO/Sec

Measures the rate of IO split due to file fragmentation. This happens if the IO request touches data on non-contiguous file segments.

<1 is good

If this value gets too high it can indicate disk fragmentation or that the NTFS block is too small or that free space is too low.

Disk

Disk Transfers/Sec

Perfmon captures the total number of individual disk IO requests completed over a period of one second. If the Perfmon capture interval is set for anything greater than one second, the average of the values captured is presented.

Disk Reads/sec and Disk Writes/sec are calculated in the same way, but break down the results in read requests only or write requests only, respectively.

 <400 is good depending on system.

SAN = 2000 max

Local Disk = 400 - 600 max

 

This concept is not well understood. ALL disk subsystems, even solid state ones, and especially SANs, have limits on the number of I/Os separate from limits on the volume (bytes) of I/Os!

 

Disks are saturated when EITHER limit is reached.

 

Current disk drive technology shows the following limits:

·         180 Sequential Transfers per 10,000 RPM of disk drive

·         Some spindle drives with good predictive read ahead will reach 240 Sequential Transfers per 10,000 RPM.

·         60 Random Transfers per 10,000 RPM of disk drive

It is necessary to know the disk speed and the type of I/O in order to determine the maximum throughput.
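
A rough sketch of that estimate for a single spindle follows. It assumes the "per 10,000 RPM" figures above scale linearly with drive speed, which is an interpretation of the list and not something it states outright. The function and constant names are hypothetical.

```python
SEQUENTIAL_TRANSFERS_PER_10K_RPM = 180  # from the list above
RANDOM_TRANSFERS_PER_10K_RPM = 60       # from the list above


def max_transfers_per_sec(rpm: int, sequential: bool) -> float:
    """Estimate the per-spindle transfer limit for a given drive speed.

    Assumption: the limit scales linearly with RPM.
    """
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    per_10k = SEQUENTIAL_TRANSFERS_PER_10K_RPM if sequential else RANDOM_TRANSFERS_PER_10K_RPM
    return per_10k * rpm / 10_000


# A 15,000 RPM drive doing random I/O.
print(max_transfers_per_sec(15_000, sequential=False))
```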

 

 

Caching disk drive controllers nullify this for writes only, yielding a 4x to 10x gain in write transfers, but only as long as the cache is not overrun.

 

Caching disk drive controllers on third generation LUN-free SANs take less than 1 ms for writes, even under load.

 

The above listed limits are per spindle, not an overall limit for a RAID set. Due to RAID set design, the limit on RAID set throughput is somewhat difficult to calculate. Below is a summary of the disk I/Os per second generated for each type of RAID configuration based on a given number of reads and writes per second.

·         RAID 0: READS + WRITES = I/Os / sec

·         RAID 1: READS + (2*WRITES) = I/Os / sec

·         RAID 5: READS + (4*WRITES) = I/Os / sec

·         RAID 0+1: READS + (2*WRITES) = I/Os per second.

See the whitepapers from your preferred hardware RAID vendor for a detailed explanation of how to observe Disk bottlenecks and to calculate disk I/O limits.
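
The RAID formulas above can be expressed as a single lookup of write multipliers, as in the sketch below. The names are hypothetical and the multipliers are exactly those listed in the summary.

```python
# Disk I/Os generated per logical write, per the RAID summary above.
WRITE_MULTIPLIER = {
    "RAID 0": 1,
    "RAID 1": 2,
    "RAID 5": 4,
    "RAID 0+1": 2,
}


def disk_ios_per_sec(raid_level: str, reads_per_sec: float, writes_per_sec: float) -> float:
    """Disk I/Os per second generated by a given read and write rate."""
    try:
        multiplier = WRITE_MULTIPLIER[raid_level]
    except KeyError:
        raise ValueError(f"unsupported RAID level: {raid_level}") from None
    return reads_per_sec + multiplier * writes_per_sec


# 300 reads/sec and 100 writes/sec on RAID 5 generate 300 + 4 * 100 disk I/Os.
print(disk_ios_per_sec("RAID 5", 300, 100))
```

Comparing the result with the per-spindle limits multiplied by the number of spindles gives a first estimate of whether the set is near saturation; vendor documentation remains the better source for exact limits.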

Network Interface Counter

 

 

 

 

Network

Bytes Total/sec

Rate at which bytes are sent and received on a NIC.

 65% - 100%

 < 40% = Healthy

 41% - 64% = Monitor or Caution

 65% - 100% = Critical, performance likely adversely affected
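
The percentages above imply comparing Bytes Total/sec with the link speed. A minimal sketch, assuming the link speed comes from the Network Interface Current Bandwidth counter, which reports bits per second. The function names are hypothetical. The bands above leave small gaps (for example between 40% and 41%), so this sketch assigns those values to the next band up, which is an assumption.

```python
def nic_utilization_percent(bytes_total_per_sec: float,
                            current_bandwidth_bits_per_sec: float) -> float:
    """Convert Bytes Total/sec into a percentage of the link speed."""
    if current_bandwidth_bits_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return 100.0 * bytes_total_per_sec * 8 / current_bandwidth_bits_per_sec


def classify_nic_utilization(percent: float) -> str:
    """Map a utilization percentage onto the bands listed above."""
    if percent < 40:
        return "Healthy"
    if percent < 65:  # assumption: 40% up to 41% is treated as Monitor
        return "Monitor or Caution"
    return "Critical"


# 25 MB/s on a 1 Gbps link.
utilization = nic_utilization_percent(25_000_000, 1_000_000_000)
print(utilization, classify_nic_utilization(utilization))
```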

Network

Output Queue Length

 Shows length of output packet queue.

 1 or 2

 0 = Healthy

 1 - 2 = Monitor or Caution

 >2 = Critical, performance likely adversely affected

Network

Packets Received Discarded

 Shows the number of inbound packets that were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher-layer protocol.

 >0

 You should not have any Packets Received Discarded. You may see packets received discarded if you have a bad NIC, mis-configured NIC, or bad NIC driver. Another reason you might see packets received discarded is that the NIC is trying to free up buffer space.

Memory Counters

 

 

 

 

 

Available Bytes

 Shows the amount of physical memory, in bytes, available to processes running on the computer.

 <10%

50% of free memory available or more = Healthy

25% of free memory available = Monitor

10% of free memory available = Warning

Less than 100MB or 5% of free memory available = Critical / out of Spec
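
The levels above can be turned into a simple check, sketched below. Available Bytes is an absolute value, so the total physical memory must be supplied separately; that input and the function name are hypothetical. The list names single percentages and not ranges, so this sketch assumes each status applies once available memory drops to that level or below, and it treats anything above 25% as Healthy.

```python
MIB = 1024 * 1024


def classify_available_memory(available_bytes: int, total_bytes: int) -> str:
    """Rate Available Bytes against the levels listed above."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    percent = 100.0 * available_bytes / total_bytes
    if available_bytes < 100 * MIB or percent < 5:
        return "Critical / out of Spec"
    if percent <= 10:
        return "Warning"
    if percent <= 25:
        return "Monitor"
    # Assumption: the list only names 50% or more as Healthy.
    return "Healthy"


GIB = 1024 * MIB
# 2 GiB available out of 16 GiB is 12.5%.
print(classify_available_memory(2 * GIB, 16 * GIB))
```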

 

Pool Paged Bytes

 Shows the size, in bytes, of the paged pool. Memory\ Pool Paged Bytes is calculated differently than Process\ Pool Paged Bytes, so it might not equal Process(_Total )\ Pool Paged Bytes.

>61%

<60% of pool consumed = Healthy

61% to 80% of pool consumed = Warning or Monitor

>80% of pool consumed = Critical / out of Spec

 

Pool NonPaged Bytes 

 Shows the size, in bytes, of the nonpaged pool. Memory\ Pool Nonpaged Bytes is calculated differently than Process\ Pool Nonpaged Bytes, so it might not equal Process(_Total )\ Pool Nonpaged Bytes.

>61%

<60% of pool consumed = Healthy

61% to 80% of pool consumed = Warning or Monitor

>80% of pool consumed = Critical / out of Spec

 

Free System PTE's

 Shows the number of page table entries not currently in use by the system.

<10,000 Free

Greater than 10,000 free = Healthy 

 

Handle Count

 Number of open handles

>10,000 per process

This is not a hard and fast rule, but if a single process consistently has more than 10,000 open handles, it should be investigated to ensure that the application is not coded incorrectly - it needs to release handles.

 

Commit Limit

Shows the amount of virtual memory, in bytes, that can be committed without having to extend the paging file(s). Committed memory is physical memory which has space reserved on the disk paging files. There can be one or more paging files on each physical drive. If the paging file(s) are expanded, this limit increases accordingly.

Should not Change

When paging file is dynamic and it grows, this is due to memory pressure.

 

Some Additional Resources

Pushing the Limits of Windows (Memory) - https://docs.microsoft.com/en-us/archive/blogs/markrussinovich/pushing-the-limits-of-windows-paged-and-nonpaged-pool

Perfmon Memory Counters - https://docs.microsoft.com/en-us/previous-versions/ms804008(v=msdn.10)?redirectedfrom=MSDN

Perfmon Disk Counters - https://docs.microsoft.com/en-us/archive/blogs/askcore/windows-performance-monitor-disk-counters-explained

Perfmon Processor Counters - https://docs.microsoft.com/en-us/previous-versions/ms804036(v=msdn.10)?redirectedfrom=MSDN

Perfmon Network Counters - https://docs.microsoft.com/en-us/previous-versions/ms803962(v=msdn.10)?redirectedfrom=MSDN

Ask Perf Blog - https://techcommunity.microsoft.com/t5/ask-the-performance-team/bg-p/AskPerf

NTdebugging Blog - https://docs.microsoft.com/en-us/archive/blogs/ntdebugging/
