Tuesday 20 January 2015

High-Order DSD

As support for regular DSD (aka DSD64) comes close to being a requirement not only for manufacturers of high-end DACs but also for a number of entry-level models, so the cutting edge of audio technology moves ever upward to more exotic versions of DSD denoted by the terms DSD128, DSD256, DSD512, etc.  What are these, why do they exist, and what are the challenges faced in playing them?  I thought a post on that topic might be helpful.

Simply put, these formats are identical to regular DSD, except that the sample rate is increased.  The benefit in doing so is twofold.  First, you can reduce the magnitude of the noise floor in the audio band.  Second, you can push the onset of undesirable ultrasonic noise further away from the audio band.

DSD is a noise-shaped 1-bit PCM encoding format (Oh yes it is!).  Because of that, the encoded analog signal can be reconstructed simply by passing the raw 1-bit data stream through a low-pass filter.  One way of looking at this is that at any instant in time the analog signal is very close to being the average of a number of consecutive DSD bits which encode that exact moment.  Consider this: the average of the sequence 1,0,0,0,1,1,0,1 is exactly 0.5 because it comprises four zeros and four ones.  Obviously, any sequence of 8 bits comprising four zeros and four ones will have an average value of 0.5.  So, if all we want is for our average to be 0.5, we have many choices as to how we can arrange the four zeros and four ones.
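For the programming-inclined, here is a minimal sketch of that averaging idea in Python with NumPy (purely for illustration - this is not how any real DSD playback chain is implemented): any arrangement of the same ones and zeros gives the same local average, and a crude running average acts as the low-pass filter that recovers the waveform.

    import numpy as np

    bits = np.array([1, 0, 0, 0, 1, 1, 0, 1])          # four ones, four zeros
    print(bits.mean())                                   # 0.5
    print(np.array([0, 1, 0, 1, 0, 1, 0, 1]).mean())     # a different arrangement, also 0.5

    # A running 8-sample average is the crudest possible low-pass filter.
    # Applied to a long 1-bit stream it recovers the encoded waveform, while
    # the particular arrangement of the bits only determines where the
    # quantization noise ends up in frequency.
    stream = np.random.randint(0, 2, 1024)               # stand-in for raw DSD bits
    recovered = np.convolve(stream, np.ones(8) / 8, mode='same')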

That simplistic illustration captures the freedom that noise shaping exploits.  In effect we have a choice as to how we can arrange the stream of ones and zeros such that passing it through a low-pass filter recreates the original waveform.  Some of those choices result in a lower noise floor in the audio band, but figuring out how to make those choices optimally is rather challenging from a mathematical standpoint.  Theory, however, does tell us a few things.  The first is that you cannot simply take noise away from a certain frequency band.  You can only move it into another frequency band (or spread it over a selection of other frequency bands).  The second is that there are limits both to how far the noise floor can be depressed at the frequencies where you want to remove noise, and to how high it can be raised at the frequencies you want to move it to.
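To make that a little more concrete, here is a toy first-order sigma-delta modulator in Python (real DSD encoders use far more sophisticated high-order modulators, so treat this strictly as a sketch of the principle).  An FFT of its output shows the encoded tone sitting above a noise floor that rises with frequency - the noise has been moved, not removed.

    import numpy as np

    def first_order_sdm(x):
        """Encode a signal in the range [-1, 1] as a stream of +/-1 'bits'."""
        acc = 0.0
        out = np.empty_like(x)
        for i, sample in enumerate(x):
            acc += sample - (out[i - 1] if i else 0.0)   # accumulate the encoding error
            out[i] = 1.0 if acc >= 0 else -1.0           # 1-bit quantizer
        return out

    fs = 2_822_400                                   # DSD64 sample rate
    t = np.arange(fs // 100) / fs
    tone = 0.5 * np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone
    bits = first_order_sdm(tone)
    spectrum = np.abs(np.fft.rfft(bits * np.hanning(len(bits))))
    # Plot `spectrum` on a log scale: the 1 kHz tone, then shaped noise rising with frequency.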

Just like digging a hole in the ground, what you end up with is a low frequency area where you have removed as much of the noise as you can, and a high frequency area where all this removed noise has been piled up.  If DSD is to work, the low frequency area must cover the complete audio band, and the noise floor there must be pushed down by a certain minimum amount.  DSD was originally developed and specified to have a sample rate of 2,822,400 samples per second (2.8MHz) as this is the lowest convenient sample rate at which we can realize those key criteria.  We call it DSD64 because 2.8224MHz is exactly 64 times the standard sample rate of CD audio (44.1kHz).  The downside is that the removed noise starts to pile up uncomfortably close to the audio band, and it turns out that all the optimizing in the world does not make a significant dent in that problem.

This is the fundamental limitation of DSD64.  If we want to move the ultrasonic noise further away from the audio band we have to increase either the bit depth or the sample rate.  Of the two, there are, surprisingly enough, perhaps more reasons to want to increase the bit depth than the sample rate.  However, these are trumped by the great advantages in implementing an accurate D/A converter if the ‘D’ part is 1-bit.  Therefore we now have various new flavours of DSD with higher and higher sample rates.  DSD128 has a sample rate of 128 times 44.1kHz, which works out to about 5.6MHz.  Likewise we have DSD256, DSD512, and even DSD1024.
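For reference, the arithmetic behind the names is simple enough to fit in a couple of lines:

    CD_RATE = 44_100
    for n in (64, 128, 256, 512, 1024):
        print(f"DSD{n}: {n * CD_RATE / 1e6:.4f} MHz")
    # DSD64: 2.8224 MHz, DSD128: 5.6448 MHz, DSD256: 11.2896 MHz,
    # DSD512: 22.5792 MHz, DSD1024: 45.1584 MHz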

Of these, perhaps the biggest bang for the buck is obtained with DSD128.  Already, it moves the rise in the ultrasonic noise nearly twice as far from the audio band as it was with DSD64.  Critical listeners - particularly those who record microphone feeds direct to DSD - are close to unanimous in their preference for DSD128 over DSD64.  The additional benefits in going to DSD256 and above seem to be real enough, but definitely fall into the realms of diminishing returns.  And even though the remarkably low cost and huge capacity of today's hard disks make the storage of a substantial DSD library a practical possibility, a library in DSD512, for example, would start to represent a significant expense in both disk storage and download bandwidth.  In any case, as a result of all these developments, DSD128 recordings are now beginning to be made available in larger and larger numbers, and very occasionally we get sample tracks made available for evaluation in DSD256 format.  However, at the time of writing I don't know where you can go to download samples of DSD512 or higher.

In the Apple World where BitPerfect users live, playback of DSD requires the use of the DoP (“DSD over PCM”) protocol.  This dresses up a DSD bitstream in a faux PCM format, where a 24-bit PCM word comprises 16 bits of raw DSD data plus an 8-bit marker which identifies it as such.  Windows users have the ability to use an ASIO driver which dispenses with the need for the 8-bit marker and transmits the raw DSD data directly to the DAC in its “native” format.  ASIO for Mac, while possible, remains problematic.
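To give a flavour of how DoP packing works, here is a rough Python sketch for a single channel.  The alternating 8-bit markers (0x05 and 0xFA) are the ones defined in the DoP specification; everything else here (the function name, the byte-ordering details) is just illustrative and not taken from any particular implementation.

    def pack_dop(dsd_bytes):
        """Pack raw DSD data for one channel (oldest byte first) into 24-bit DoP words."""
        markers = (0x05, 0xFA)
        words = []
        for i in range(0, len(dsd_bytes) - 1, 2):
            marker = markers[(i // 2) % 2]               # marker alternates frame by frame
            word = (marker << 16) | (dsd_bytes[i] << 8) | dsd_bytes[i + 1]
            words.append(word)                           # one 24-bit "PCM" sample
        return words

    # Example: eight bytes of DSD become four 24-bit DoP words.
    print([hex(w) for w in pack_dop(bytes(range(8)))])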

As mentioned, DoP encoding transmits the data to the DAC using a faux PCM stream format.  For DSD64 the DAC’s USB interface must provide 24-bit/176.4kHz support, which is generally not a particularly challenging requirement.  For DSD128 the required PCM stream format is 24-bit/352.8kHz which is still not especially challenging, but is less commonly encountered.  But if we go up to DSD256 we now have a requirement for a 24-bit/705.6kHz PCM stream format.  The good news is that your Mac can handle it out of the box, but unfortunately, very few DACs offer this.  Inside your DAC, if you prise off the cover, you will find that the USB subsystem is separate from the DAC chip itself.  USB receiver chipsets are sourced from specialist suppliers, and if you want one that will support a 24/705.6 format it will cost you more.  Additionally, if you are currently using a different receiver chipset, you may have a lot of time and effort invested in programming it, and you will have to return to GO if you move to a new design (do not collect $200).  The situation gets progressively worse with higher rate DSD formats.
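The carrier rates follow directly from the packing: each 24-bit DoP word holds 16 DSD bits per channel, so the PCM frame rate a DAC must accept is simply the DSD bit rate divided by 16.

    # 16 DSD bits per 24-bit DoP word, per channel:
    for dsd_mhz, name in ((2.8224, "DSD64"), (5.6448, "DSD128"), (11.2896, "DSD256")):
        print(f"{name}: {dsd_mhz * 1e6 / 16 / 1000:.1f} kHz PCM carrier")
    # DSD64 -> 176.4 kHz, DSD128 -> 352.8 kHz, DSD256 -> 705.6 kHz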

Thus it is that we see examples of DSD-compatible DACs such as the OPPO HA-1 which offers DSD256 support, but only in “native” mode.  What this means is that if you have a Mac and are therefore constrained to using DoP, you need access to a 24/705.6 PCM stream format in order to deliver DSD256, and the HA-1 has apparently been designed with a USB receiver chipset that does not support it.  It may not be as simple as that, and there may be other considerations at play, but if so I am not aware of them.

Interestingly, the DoP specification does offer a workaround for precisely this circumstance.  It provides for an alternative to a 2-channel 24/705.6 PCM format using a 4-channel 24/352.8 PCM format.  The 8-bit DoP marker specified is different, which enables the DAC to tell 4-channel DSD128 from 2-channel DSD256 (they would otherwise be indistinguishable).  Very few DAC manufacturers currently support this variant format.  Mytek is the only one I know of - as I understand it, their 192-DSD DAC supports DSD128 using the standard 2-channel DoP over USB, and the 4-channel variant over FireWire.
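A quick back-of-the-envelope check (just the numbers, not any manufacturer's implementation) shows why the two carrier formats are interchangeable - the DSD payload bandwidth is identical either way:

    dsd256_rate = 256 * 44_100                    # 11.2896 MHz per channel
    stereo_payload = 2 * dsd256_rate              # raw DSD bits per second for stereo
    two_ch_carrier = 2 * 705_600 * 16             # 2 PCM channels at 705.6 kHz, 16 payload bits each
    four_ch_carrier = 4 * 352_800 * 16            # 4 PCM channels at 352.8 kHz
    assert stereo_payload == two_ch_carrier == four_ch_carrier   # 22.5792 Mbit/s in every case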

Because of its negligible adoption rate, BitPerfect currently does not support the 4-channel DoP variant.  If we did, it would require some additional configuration options in the DSD Support window.  I worry that such options are bound to end up confusing people.  For example, despite what our user manual says, you would not believe the number of customers who write to me because they have checked the “dCS DoP” checkbox and wonder why DSD playback isn’t working!  Maybe they were hoping it would make their DACs sound like a dCS, I dunno.  I can only imagine what they will make of a 2ch/4ch configurator!!!

As a final observation, some playback software will convert, on the fly, high-order DSD formats which are not supported by the user's DAC to a lower-order DSD format which is.  While this is a noble solution, it should be noted that format conversion in DSD is a fundamentally lossy process, and that all of the benefits of the higher-order DSD format - and then some - will be lost as a result.  In particular, the ultrasonic noise profile will be that of the output DSD format, not that of the source DSD format.  Additionally, DSD bitstreams are created by Sigma-Delta Modulators.  These are complex algorithms which are seriously hard to design and implement successfully, particularly if you want anything beyond modest performance out of them.  The FPGA-based implementation developed for the PS Audio DirectStream DAC is an example of a good one, but there are some less praiseworthy efforts out there.  In general, you can expect to obtain audibly superior results by pre-converting instead to 24/176.4 (or even 24/352.8) PCM using DSD Master, which will retain both the extended frequency response and the lower ultrasonic noise floor of the DSD256 original.

Thursday 1 January 2015

A Convoluted Discussion

In mathematics, the word ‘convolution’ describes a very important class of manipulations.  If you want to know more about it, a pretty good treatment is shown on its Wikipedia page.  And even if you don’t, I am going to briefly summarize it here before going on to make my point :)

A convolution is an operation performed on two functions, or on two sets of data.  Typically (but not always) one is the actual data that we are trying to manipulate, and the other is a weighting function, or set of weights.  Convolution is massively important in the field of signal processing, and is therefore something that anybody who wants (or needs) to talk knowledgeably about digital audio should bone up on.  The most prominent convolution-related processes that you may have heard of are Fourier Transforms (which are used to extract a waveform's audio spectrum) and digital filtering.  It is the latter of those that I want to focus on here.

In very simple terms, a filter (whether digital or analog) operates as a convolution between a waveform and an impulse response.  You will have heard of impulse responses, and indeed you may have read about them in some of my previous posts.  In digital audio, an impulse response is a graphical representation of the ‘weights’ or ‘coefficients’ which define a digital filter.  Complicated mathematical relationships describe the way in which the impulse response relates to the key characteristics of the filter, and I have covered those in my earlier posts on ‘Pole Dancing’.
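If you like to see these things in code rather than equations, here is a minimal sketch in Python/NumPy of filtering as convolution (the 5-tap moving average is just a placeholder impulse response, not any particular audiophile filter):

    import numpy as np

    fs = 44_100
    t = np.arange(fs) / fs
    waveform = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)

    impulse_response = np.ones(5) / 5             # the filter's "weights" or coefficients
    filtered = np.convolve(waveform, impulse_response, mode='same')

    # Feeding the filter a unit impulse returns the coefficients themselves,
    # which is exactly why a plot of the impulse response characterizes the filter.
    impulse = np.zeros(32)
    impulse[16] = 1.0
    print(np.convolve(impulse, impulse_response, mode='same')[14:19])   # [0.2 0.2 0.2 0.2 0.2]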

Impulse responses are therefore very useful.  They are nice to look at, and easy to categorize and classify.  Unfortunately, it has become commonplace to project the aesthetic properties of the impulse response onto the sonic properties arising from the filter which uses it.  In simple language, we see a feature on the impulse response, and we imagine that such a feature is impressed onto the audio waveform itself after it comes out of the filter.  It is an easy mistake to make, since the convolution process itself is exactly that - a mathematical impression of the impulse response onto the audio waveform.  But the mathematical result of the convolution is really not as simple as that.

The one feature I see misrepresented most often is pre-ringing.  In digital audio, an impulse is just one peak occurring in a valley of flat zeros.  It is useful as a tool to characterize a filter because it contains components of every frequency that the sampled bitstream is capable of representing.  Therefore if the filter does anything at all, the impulse is going to be disturbed as a result of passing through it.  For example, if you read my posts on square waves, you will know that removing high-frequency components from a square wave results in a waveform which is no longer square, and contains ripples.  Those ripples decay away from the leading edge of the square wave.  This is pleasing in a certain way, because the ripples appear to be caused by, and arise in response to, the abrupt leading edge of the square wave.  In our nice ordered world we like to see effect preceded by cause, and are disturbed by suggestions of the opposite.
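If you want to see those ripples for yourself, a band-limited square wave is easy to construct (Python/NumPy again, purely illustrative):

    import numpy as np

    t = np.linspace(0, 1, 4096, endpoint=False)
    square = np.zeros_like(t)
    for k in range(1, 16, 2):                       # odd harmonics 1, 3, 5, ... 15 only
        square += np.sin(2 * np.pi * k * t) / k
    square *= 4 / np.pi
    # Plotting `square` shows overshoot and ripple near each transition,
    # decaying as you move away from the edge (the Gibbs phenomenon).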

And so it is that with impulse responses we tend to be more comfortable seeing ripples decaying away after the impulse, and less comfortable when they precede the impulse, gathering in strength as they approach it.  Our flawed interpretation is that the impulse is the cause and the ripples the effect, and if these don’t occur in the correct sequence then the result is bound to be unnatural.  It is therefore common practice to dismiss filters whose impulse response contains what is termed “pre-ringing” because the result of such filters is bound to be somewhat “unnatural”.  After all, in nature, effects don’t precede their cause, do they?

I would like you to take a short break, and head over to your kitchen sink for a moment.  Turn on the tap (or faucet, if you prefer) and set the water flow to a very gentle stream.  What we are looking for is a smooth flow with no turbulence at all.  We call this 'laminar' flow.  What usually happens, if the tap outlet is sufficiently far above the bottom of the sink, is that the laminar flow is maintained for some distance and then breaks up into a turbulent flow.  The chances are good that you will see this happening, but it is no problem if you don't - so long as you can find a setting that gives you a stable laminar stream.  Now, take your finger, and gently insert it into the water stream.  Look closely.  What you will see are ripples forming in the water stream above your finger - upstream, between your finger and the tap.  If you don't, gradually move your finger up towards the tap and they should appear (YTMV/YFMV).  What you will be looking at is an apparently perfect example of an effect (the ripples) occurring before, or upstream of, the cause (your finger).

What I have demonstrated here is not your comfortable world breaking down before your eyes.  What is instead breaking down is the comfort zone of an over-simplistic interpretation of what you saw.  Because the idea of the finger being the cause and the ripples being the effect is not an adequate description of what actually happened.

In the same way, the notion that pre-ringing in the impulse response of a filter results in sonic effects that precede their cause in the resultant audio waveform is not an adequate description of what is happening.  However, the misconception gains credence for an important, if inconvenient, reason, which is that filters which exhibit pronounced pre-ringing do in fact tend to be preferred less than those which don't.  This sort of thing happens often in science - most notably in medical science - and when it does it opens a door to misinformation.  In this case, the potential for misinformation lies in the reason given for why one filter sounds better than another - that the one with pre-ringing in its impulse response produces sounds that precede the things which caused them.  By all means state your preference for filters with a certain type of impulse response, but please don't justify your preference with flawed reasoning.  It is OK to admit that you are unclear as to why.
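For what it is worth, a small experiment makes the causality point concrete.  Below is a sketch (Python with SciPy; the 101-tap low-pass is arbitrary) in which a symmetric, linear-phase filter - the classic 'pre-ringing' case - is applied strictly causally.  The pre-ring shows up ahead of the main peak in the output, but the entire output, pre-ring included, still occurs after the input impulse arrives; the main peak is simply delayed by the filter's group delay.

    import numpy as np
    from scipy.signal import firwin, lfilter

    taps = firwin(101, 0.25)             # symmetric (linear-phase) low-pass: rings on both sides of its peak
    impulse = np.zeros(300)
    impulse[50] = 1.0                    # the input event happens at sample 50

    out = lfilter(taps, 1.0, impulse)    # strictly causal, sample-by-sample filtering

    print(np.argmax(np.abs(out)))        # 100: the main peak is delayed by the 50-sample group delay
    print(np.count_nonzero(out[:50]))    # 0: nothing whatsoever comes out before the input goes in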

I want to finish this with an audio example to make my point.  The well-known Nyquist-Shannon sampling theorem states that a regularly sampled waveform can be perfectly recreated if (i) it is perfectly sampled; and (ii) it contains no frequency components at or above one half of the sample rate.  The theorem doesn't just set forth its claim, it provides a solid proof.  In essence, it does this by convolving the sampled waveform with a Sinc() function, in a process pretty much identical to the way a digital filter convolves the waveform with an impulse response.  Nyquist-Shannon proves that this convolution results in a mathematically perfect reconstruction of the original waveform if - and only if - the two stipulations I mentioned are strictly adhered to.  This is interesting in the context of this post because the Sinc() function which acts as the impulse response exhibits an infinitely long pre-ring at significant amplitude.  Neither Nyquist nor Shannon, nor the entire industry which their theorem spawned, harbour any concerns about causality in reconstructed waveforms!
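And if you want to play with that yourself, here is a bare-bones Whittaker-Shannon reconstruction in Python (a truncated sum, so not literally perfect, but the error shrinks as the sample window grows).  The sinc 'impulse response' rings symmetrically - an infinitely long pre-ring in theory - and yet the reconstruction is exact.

    import numpy as np

    fs = 100                                        # sample rate
    n = np.arange(-200, 200)                        # sample indices
    samples = np.sin(2 * np.pi * 13 * n / fs)       # a 13 Hz tone, well below fs/2

    t = np.linspace(-1, 1, 1000)                    # dense time grid, in seconds
    # x(t) = sum over n of x[n] * sinc(fs*t - n)
    recon = np.array([np.sum(samples * np.sinc(fs * ti - n)) for ti in t])

    err = np.max(np.abs(recon - np.sin(2 * np.pi * 13 * t)))
    print(err)                                      # modest truncation error; shrinks with a wider sample window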