Resplendence Software - Better never than late

Performance and Latency FAQ for audio developers on Windows

Last updated: Monday, September 13, 2010

General

I am writing an audio application or plugin and my audio stream hiccups and stutters, what can be the cause ?
There can be several causes, such as poorly behaving drivers and hardware. But once you have ruled those out, the first thing you should think about is a buffer underrun.

What is an audio buffer underrun ?
It means that audio buffers are not delivered in time. If your application or plugin delivers audio, it needs to calculate or at least provide an audio buffer to the audio subsystem, drivers and hardware. The samples are delivered in blocks. For instance, if your audio sample rate is 96 kHz and your block size is 1024 samples, there are theoretically about 10.7 ms (1024 / 96000 seconds) available for each block to be delivered to the hardware. If that deadline is missed, the audio hardware does not get the correct samples and this will have audible consequences.
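The deadline arithmetic above can be written down as a small helper. This is a minimal sketch in C++; the function and parameter names are illustrative and not part of any audio API.

```cpp
#include <cassert>

// Time available to produce one block, in milliseconds.
// sample_rate_hz and block_size_frames are illustrative parameter names.
inline double block_period_ms(double sample_rate_hz, unsigned block_size_frames)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * block_size_frames / sample_rate_hz;
}

// True if a measured processing time fits within the deadline of one block.
inline bool meets_deadline(double elapsed_ms, double sample_rate_hz,
                           unsigned block_size_frames)
{
    return elapsed_ms <= block_period_ms(sample_rate_hz, block_size_frames);
}
```

For 96 kHz and 1024 samples, block_period_ms returns roughly 10.67. In practice the real deadline is shorter than this theoretical figure, because the driver and the audio subsystem also need time.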

Why should I care only about audio latency and forget about performance ?
In an audio application, we want to avoid anything that has audible consequences, such as a buffer underrun. What we care about is the maximum execution times (latencies) of the audio process functions. It does not matter to anyone how many samples per second the audio application could calculate. Performance is important to other types of applications. If, for instance, a web server can serve more requests per hour on average, we talk about performance. We do not care if one of the many requests took much longer than the others. There is nothing wrong with measuring performance, but if this means calculating averages over a long period of time (say seconds), the values obtained are not of immediate interest. As an audio developer you don't particularly care about averages but about the number of times that you actually missed the boat (buffer underruns).

About Windows

Is Windows a real-time operating system ?
No, all requests to the operating system are delivered on a best effort basis. A real-time operating system guarantees that requests are completed within a certain time frame; Windows gives no such guarantees whatsoever. That is bad news for audio applications (which are considered soft real-time) because they have to avoid buffer underruns at any price.

What are ISRs ?
ISRs (Interrupt Service Routines) are kernel routines which are part of drivers or the OS which execute when hardware devices interrupt a CPU. They run at elevated IRQL which means that no other thread (or program) can run on the current processor until the ISR has finished executing. Awareness about ISRs is important because they can be the cause of audio buffer underruns. You can get some insight into ISR execution times with the LatencyMon analysis tool.

What are DPCs ?
DPCs (Deferred Procedure Calls) are kernel routines which are part of device drivers or the OS kernel. They are normally requested and scheduled by ISRs (interrupt service routines) or they are associated with a kernel timer. They run at elevated IRQL which means that no other thread (or program) can run on the current processor until the DPC has finished executing. Awareness about DPCs is important because they can be the cause of audio buffer underruns: an audio program that is interrupted by a DPC routine cannot continue until the DPC has completed execution. You can get some insight into DPC latencies (execution times) with the DPC latency checker tool or the Windows Performance Toolkit.

Can my audio thread really get interrupted by the OS at any point in time ?
Yes. An interrupt can occur on the processor that your program is running on. Execution of your program will temporarily halt. The interrupt service routine (ISR) is executed and may schedule a DPC (Deferred Procedure Call). The DPC will most likely run immediately on the same processor, which means your program will halt until both the ISR and the DPC have finished executing. The same goes for program errors which throw exceptions. An interrupt will take place on the CPU on which you are running and the exception handler (provided by Windows) will handle the problem. Some examples of exceptions are pagefaults, FPU faults, stack faults and GPFs.

About Pagefaults

What are pagefaults ?
Windows uses a concept of virtual memory which relies on the page translation system provided by the CPU. Whenever a memory address is requested which is not available in physical memory (not resident), an INT 14 will occur. The OS provided INT 14 handler will decide how to proceed next. If the page in which the address resides is known to Windows but not resident, Windows will read in the required page from the page file. That is known as a hard pagefault and can take a lot of time to complete.

What is the difference between hard pagefaults and soft pagefaults ?
Soft pagefaults are requests for pages which are resident in RAM but not immediately available to the current task. They will be resolved much faster than hard pagefaults which need to go all the way through the file system.

Are hard pagefaults really that expensive ?
Yes, a single hard pagefault can literally take millions of instruction cycles. The pagefault handler needs to go through the filesystem, which in turn needs to access the disk to read in the requested page from the page file. That is very expensive and one of the most common causes of audio buffer underruns on Windows (audio hiccups, clicks, pops and stutter).

How can I find out about pagefaults occurring while my audio app/plugin is running ?
You can start Task Manager, select your process and check the pagefaults and PF delta columns. Unfortunately these show both hard and soft pagefaults. It's mostly hard pagefaults you want to know about because they can take a LONG time to get resolved. If you experience stutter in your audio stream and at the same time you see the PF delta value change, you have a very good indication that pagefaults are the main cause of the hiccups in your audio stream. For in-depth analysis of pagefaults, use the LatencyMon tool.

What is the working set of a process and why should I care ?
The working set of a process is the part of its memory which is resident in RAM. You should care because if the working set of your audio application (or the host process if you are a plugin) is smaller than the amount of memory it actually uses, some of that memory is not resident in RAM but paged out to disk. Accessing this memory causes hard pagefaults, which are very expensive in terms of execution time and a very common cause of audio buffer underruns. You should make sure the minimum working set of your application is set to an acceptable value (by using the SetProcessWorkingSetSize API).
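A minimal sketch of such a call in C++ follows. The helper names and the sizes you pass in are assumptions made for illustration; only GetSystemInfo, GetCurrentProcess and SetProcessWorkingSetSize are real Win32 calls. The sketch compiles on other platforms too, where it simply does nothing and returns false.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#endif

// Round a byte count up to a whole number of pages.
inline std::size_t round_up_to_pages(std::size_t bytes, std::size_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size * page_size;
}

// Ask Windows to keep at least min_bytes of the current process resident.
// max_bytes must not be smaller than min_bytes.
// Returns false on failure, and always on non-Windows builds.
inline bool raise_minimum_working_set(std::size_t min_bytes, std::size_t max_bytes)
{
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    const SIZE_T min_ws = round_up_to_pages(min_bytes, info.dwPageSize);
    const SIZE_T max_ws = round_up_to_pages(max_bytes, info.dwPageSize);
    return SetProcessWorkingSetSize(GetCurrentProcess(), min_ws, max_ws) != 0;
#else
    (void)min_bytes;
    (void)max_bytes;
    return false;
#endif
}
```

Which sizes are appropriate depends on how much memory your audio code actually touches. If the call fails, the process may lack the quota or privilege for the requested size, so check the return value rather than assuming the setting took effect.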

How can I find out about the actual working set of my application ?
Start Windows Task Manager, find your process and check the working set column.

Measuring and Profiling

Should I care about the percentage of CPU consumption of my audio application in Task Manager ?
No. It's useless to look at it because it does not matter to audio applications or plugins. Also this number is only an indication and inaccurate for several reasons. At any point in time a CPU is either busy (100%) or idle (0%). The number displayed is an average over a long period of time. What you should care about is the maximum execution times of your audio processing functions and avoiding buffer underruns.

Why does the CPU column in Task Manager display inaccurate information ?
On each timer interrupt the OS checks what was running, and the full time slice is charged against the thread that was running at that moment. If multiple context switches took place, and possibly ISRs and DPCs were also running in a single quantum, only one of them will be charged with the full CPU time. Due to this "Monte Carlo" style of measuring it is easy to write a program that does nothing and remains idle most of the time but nonetheless displays 100% CPU usage (or the percentage of a full CPU) in Task Manager. This may not be true for certain versions of Windows which use another method of measuring. In any case what is displayed is an average over a long period, which is not of concern if you are measuring maximum execution latencies.

How can I measure the execution times of my audio plugin/app from within my code?
Use the QueryPerformanceCounter function (and QueryPerformanceFrequency) to get a high resolution time stamp at the beginning and end of your audio process functions. There are some caveats to using this function. Because each CPU has its own time stamp counter and they are not necessarily synchronized, you need to make sure you always execute this function on the same CPU to get results that make sense. You can use SetThreadAffinityMask to make sure your thread will only run on a specific CPU. Use the GetCurrentProcessorNumber API to get the number of the CPU you are currently running on. Check out the code sample at the bottom of this page.
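As a rough illustration, a timing wrapper might look like the sketch below. The helper names counter_ticks, counter_frequency and time_call_ms are made up for this example. On Windows it calls QueryPerformanceCounter and QueryPerformanceFrequency directly; on other platforms it falls back to std::chrono::steady_clock so the sketch stays portable. Pinning the thread with SetThreadAffinityMask is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#else
#include <chrono>
#endif

// Raw value of the high-resolution counter.
inline std::int64_t counter_ticks()
{
#ifdef _WIN32
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);
    return count.QuadPart;
#else
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
#endif
}

// Number of counter ticks per second.
inline std::int64_t counter_frequency()
{
#ifdef _WIN32
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return freq.QuadPart;
#else
    return 1000000000; // the fallback counts nanoseconds
#endif
}

// Run fn once and return how long it took, in milliseconds.
template <typename Fn>
double time_call_ms(Fn&& fn)
{
    const std::int64_t freq = counter_frequency();
    assert(freq > 0);
    const std::int64_t start = counter_ticks();
    fn();
    const std::int64_t stop = counter_ticks();
    return 1000.0 * static_cast<double>(stop - start) / static_cast<double>(freq);
}
```

You would wrap your audio process function, for example time_call_ms([&] { process(buffer); }), where process and buffer stand for your own code. Subtracting the raw tick values before converting to milliseconds avoids losing precision on machines with a long uptime.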

Why should I not use GetTickCount to measure execution times ?
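GetTickCount returns a value in milliseconds, but its resolution is much coarser than that because the value is updated only at each clock interrupt (typically every 10 to 16 ms). This means that calling this routine consecutively will often return the same value, which makes it useless for measuring the execution time of an audio process function.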

Should I use a RDTSC instruction directly or not ?
There is a long list of issues, some of which are outlined in the following article:
http://msdn.microsoft.com/en-us/library/ee417693
Still, what's not mentioned there is that RDTSC is not a serializing instruction and is subject to out-of-order execution and cache issues, leading to inaccurate results. The various implementations of QueryPerformanceCounter and KeQueryPerformanceCounter (kernel) do not add serializing instructions either, possibly because they come with a heavy price tag.

Now that I know how to measure, what should I do with this information ?
You should take a good look at the MAXIMUM execution times of your audio processing functions and how often they occur. As you wish to avoid buffer underruns at any price (and they are strictly hit or miss), you do not care about AVERAGE execution times. You should care about the number of times your audio process function did not deliver its buffers in time. Unfortunately, after your application has delivered the buffer to the audio subsystem, all sorts of things can still happen (interrupts, exceptions..). If possible, create a recording of your audio stream and do a graphical analysis to see if there are any hiccups in there.
In case you wish to measure execution times of a particular routine or section of code for performance optimization you should also check the minimum execution times so that you concentrate on your code and not the bad weather conditions. The difference between minimum and average execution times gives you an idea of the OS factors that impact the execution of your code. Some of these can be taken care of programmatically (pagefaults) while others depend on the system configuration (DPCs, ISRs and other OS factors).
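The bookkeeping described above can be sketched as a small accumulator. The LatencyStats type and its member names are hypothetical. It tracks the minimum, maximum and average execution time, plus the number of calls that exceeded the deadline.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Accumulates execution times of an audio process function.
struct LatencyStats
{
    double min_ms = std::numeric_limits<double>::max();
    double max_ms = 0.0;
    double sum_ms = 0.0;
    std::size_t count = 0;
    std::size_t missed = 0; // calls that took longer than the deadline

    // Record one measurement against the time available for one block.
    void record(double elapsed_ms, double deadline_ms)
    {
        assert(elapsed_ms >= 0.0);
        if (elapsed_ms < min_ms) min_ms = elapsed_ms;
        if (elapsed_ms > max_ms) max_ms = elapsed_ms;
        sum_ms += elapsed_ms;
        ++count;
        if (elapsed_ms > deadline_ms) ++missed;
    }

    // Average over all recorded calls, or 0 if nothing was recorded.
    double average_ms() const
    {
        return count ? sum_ms / static_cast<double>(count) : 0.0;
    }
};
```

For avoiding underruns, max_ms and missed are the numbers to watch. min_ms and average_ms() are only of interest when you are optimizing the code itself.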

About the clock resolution

Should I use the timeBeginPeriod API in my software to change the clock resolution ?
Although changing the clock resolution may reduce the time it takes for your thread to get attention from the dispatcher, it also means that the thread which processes your audio will get a shorter time slice. The default time slice of about 16ms on most systems is a healthy window for your audio application to fill an audio buffer. Setting the clock resolution to a shorter interval means an audio thread may use up its quantum and need to be scheduled multiple times in order to fill a single buffer, which leads to higher latency. You also need to realize that this is a global, system-wide setting that affects everything else in the system, so it's not an option for a plugin which runs in a host application. Whether or not it is a good idea depends on the outcome of a complicated equation with many factors, including the threads that your audio application makes use of and the number of CPUs in the system. It should not be considered the holy grail of improving audio latency on Windows. The best way to find out is to measure maximum execution times and see whether it decreases the number of buffer underruns. Some applications (Windows Messenger, Windows Media Player, Borland Delphi) change the clock resolution themselves, so you cannot rely on any particular resolution being in effect.

How can I check the current clock resolution ?
Get the ClockRes utility from www.sysinternals.com

Practices and recommendations

Does it make sense to "reserve" a CPU for my audio application ?
No. Although you could set affinities for your threads to make sure they execute on a particular CPU only, there is no way you can stop ISRs, DPCs and exceptions from interrupting your audio and executing on your CPU without hacking into the Windows kernel. On top of that, if another CPU becomes available your thread cannot run there, so that's a chance missed. You can use the technique of setting an affinity for your thread/application for the purpose of testing and measuring, but it makes no sense to use this technique in production code unless you wish to hurt audio latency.

How can floating point errors be the cause of buffer underruns ?
Divisions by zero, numeric overflows, FPU stack overflows and denormalizations can all generate exceptions. This means your program will be interrupted by an exception (INT 16), and the FPU exception handler will handle the matter. If this happens many times it will dramatically impact the execution time of your audio app/plug-in. If such faults are not avoidable due to the nature of your calculations, you should programmatically mask off FPU exceptions. In Visual Studio, FPU exceptions are masked off by default.
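One way to make your code avoid denormalized values is to flush them to zero before they propagate through feedback paths. The helpers below are a minimal, portable sketch with made-up names. Masking exceptions is a separate step that is not shown here; with the Visual C++ runtime it would go through a control-word function such as _controlfp_s.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Replace a denormalized (subnormal) sample with zero; pass anything else through.
inline float flush_denormal(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL ? 0.0f : x;
}

// Apply flush_denormal to a whole buffer in place.
inline void flush_denormals(float* samples, std::size_t count)
{
    assert(samples != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
        samples[i] = flush_denormal(samples[i]);
}
```

Typical places to apply this are filter states and decaying reverb or delay tails, where values shrink towards zero over time.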

How can I use LatencyMon tool to find out about hard pagefaults, DPCs and ISRs which interrupted my program ?
You can download LatencyMon from the Resplendence Software website. Note: only Windows Vista and higher are supported.

What can I do programmatically to improve audio latency for my audio plugin/application ?

Increase the MINIMUM working set of your application to increase the chances that the memory it touches stays resident in order to avoid hard pagefaults. The function used for that is SetProcessWorkingSetSize(Ex). If you are just one of many plugins (such as a VST DLL) inside a host application, check and set the minimum working set of the host application with consideration of the other plugins in mind.

Allocate memory which is guaranteed to be resident by using Address Windowing Extensions (AWE), available for APIs such as VirtualAllocEx. Other solutions include the unsafe technique of using process memory which was locked and made resident by a driver (MmProbeAndLockPages).

Mask off FPU exceptions or make sure your code avoids them (divisions by zero, overflows and denormalizations)

Do *NOT* run the debug version of your app/plug-in if you are measuring latencies

Do a silent dummy run of your routines so that the code sections of your application are resident and less likely to get swapped out

Allocate all required memory at initialization and do not allocate memory while processing audio

Touch all memory your application or plugin uses several times. Memory pages will get swapped out based on their use count.

Disable variable clock speed features such as Intel SpeedStep and AMD Cool'n'Quiet. There are many ways these can distort the outcome of your measurements.

It only starts to make sense to work on your latency problem once you have found its root cause. A very common failure is to optimize some microscopic aspect of some algorithm (example: CPU branch predictions) without determining whether it's the root cause of the problem or not. Pointlessly optimizing or improving this or that (say without having a clue about pagefaults) in the hope it might improve your software is not the way to go.
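Two of the recommendations above, allocating all memory at initialization and touching it, can be sketched as follows. PreallocatedBuffer is a hypothetical helper, and the 4096-byte page size is an assumption; query the real page size on your system if it matters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A sample buffer that is allocated once, up front, and can be touched
// page by page so that its memory has been referenced before audio starts.
class PreallocatedBuffer
{
public:
    explicit PreallocatedBuffer(std::size_t sample_count, std::size_t page_bytes = 4096)
        : samples_(sample_count, 0.0f), page_bytes_(page_bytes)
    {
        assert(page_bytes_ >= sizeof(float));
        touch();
    }

    // Rewrite one sample per page, plus the last sample because the buffer
    // need not start on a page boundary. Returns the number of writes made.
    std::size_t touch()
    {
        if (samples_.empty()) return 0;
        const std::size_t stride = page_bytes_ / sizeof(float);
        volatile float* p = samples_.data(); // volatile keeps the writes from being optimized away
        std::size_t writes = 0;
        for (std::size_t i = 0; i < samples_.size(); i += stride) {
            p[i] = p[i];
            ++writes;
        }
        p[samples_.size() - 1] = p[samples_.size() - 1];
        return writes + 1;
    }

    float* data() { return samples_.data(); }
    std::size_t size() const { return samples_.size(); }

private:
    std::vector<float> samples_;
    std::size_t page_bytes_;
};
```

Construct the buffer during initialization and only read and write through data() inside the audio callback. Calling touch() a few more times before processing starts follows the advice to touch memory several times.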

Code samples

Check out this code sample to get an idea how to measure minimum, maximum and average execution times of a routine or portion of code with Visual C++.

Links

Do you have some recommended reading and links to sites which discuss these topics ?
book: Windows Internals by Russinovich, Solomon and Ionescu
MSDN: http://msdn.microsoft.com
Raymond Chen's blog: http://blogs.msdn.com/oldnewthing/
Windows Performance Analysis Toolkit (XPERF) and forums: http://msdn.microsoft.com/en-us/library/cc305187.aspx
Pigs might Fly, Windows performance blog: http://blogs.msdn.com/pigscanfly/archive/2008/03/02/using-the-windows-sample-profiler-with-xperf.aspx
The search tool at the Windows kernel newsgroups at http://www.osronline.com