VSVI Dev. Blog 7: Digital Audio Myths


Artificial harmonic glissando

In digital audio, we think about fidelity in four domains:

  • Bit Depth (bits)
  • Sample Rate (Hz)
  • Bitrate (kbps or kb/s)
  • Channels

It’s important to understand what each of these terms means when shopping for samples and sample libraries, as some advertised features do little more than multiply the size of the library, making it seem more valuable than it is!

Bit Depth describes just one element of the digital audio equation: the noise floor. It is the number of bits stored in each sample taken. It does not have ANY role in the sound quality of your audio beyond where the noise floor is located. Plain 16-bit audio puts that floor at about -96 dB, and with noise-shaped dither the perceived floor can be pushed as far down as about -120 dB, which comfortably covers the theoretical range of human hearing. We always record and process in 24-bit for improved filter performance, but provide our instruments in 16-bit.
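
As a rough check on the -96 dB figure, here is a minimal Python sketch. It assumes the usual rule for integer PCM that the ratio between full scale and one quantization step is 2 raised to the bit depth; the function name `dynamic_range_db` is only illustrative.

```python
import math

def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of integer PCM: 20 * log10(2 ** bit_depth)."""
    return 20 * math.log10(2 ** bit_depth)

for bits in (16, 24):
    # Prints roughly 96.3 dB for 16-bit and 144.5 dB for 24-bit.
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB")
```

This covers only the undithered, theoretical floor. The -120 dB figure for shaped dither is a perceptual result and is not modeled here.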

Why? The noise floor already present in any recording made with microphones, even with careful recording and denoising, is always considerably higher than the theoretical -96 dB floor of 16-bit audio (possibly extendable down to -120 dB with correct dither usage). In fact, any sample library developer that tries to sell you samples at more than 16-bit that were not recorded in a fully isolated and insulated anechoic chamber is wasting their bandwidth and storage space, and your hard-drive space and time.

The best way to understand bit depth is to imagine we have a set of sine waves we’ve cut into 44,100 columns (samples). At each slice (in 16-bit PCM), we pick the number between −32,768 and +32,767 (with 0 representing the middle line of the waveform) that most closely matches the point we see on our analog curve and place our sample there. It must be an integer (0, 1, 2, 3, etc.). If we record something very small (i.e. quiet) and boost it heavily in the digital domain, we will encounter artifacts from that earlier quantization (don’t worry, you would have to be recording something at around -40 dB or lower for this to happen).

For 24-bit, we get to pick from 16,777,216 possible integer values. Quieter waveforms can therefore be represented in more detail and boosted digitally without running into quantization distortion.

32-bit float is another format, which stores each sample as a floating-point number rather than an integer, so it can hold fractional values. The dynamic range it provides is essentially completely unnecessary for delivery (extending far beyond the range of human hearing), and it costs more in storage and processing, so 32-bit float is mostly reserved for recording highly unpredictable sources and for ultra-critical processing, i.e. industrial/scientific uses. Taking real advantage of it requires ultra-high-end equipment and recording conditions (most mic and preamp self-noise is far too high to approach a 32-bit noise floor). It can be useful for extremely heavy effects processing on a single signal, where repeated quantization could raise the noise level, but that much processing would require a very, very powerful computer just to function.
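
To make the "pick the nearest integer" idea concrete, here is a small Python sketch. It assumes a single sine wave whose peak sits at a hypothetical -80 dBFS (chosen only to make the effect obvious) and counts how many distinct integer values the wave lands on at each bit depth. The function names are illustrative, not part of any library.

```python
import math

def quantize(x: float, bit_depth: int) -> int:
    """Round a value in [-1.0, 1.0) to the nearest signed integer step."""
    full_scale = 2 ** (bit_depth - 1)      # 32,768 for 16-bit
    step = round(x * full_scale)
    return max(-full_scale, min(full_scale - 1, step))

def distinct_levels(peak_db: float, bit_depth: int, n: int = 100_000) -> int:
    """Count the integer levels used by one cycle of a sine with the given peak."""
    peak = 10 ** (peak_db / 20)            # convert dBFS to linear amplitude
    samples = (peak * math.sin(2 * math.pi * k / n) for k in range(n))
    return len({quantize(s, bit_depth) for s in samples})

print("16-bit levels:", distinct_levels(-80, 16))
print("24-bit levels:", distinct_levels(-80, 24))
```

At 16-bit the quiet sine collapses onto a handful of steps, so boosting it later boosts a staircase. At 24-bit the same wave is spread over more than a thousand steps.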

Sample Rate determines the total frequency range that can be represented in the digital audio. It is the number of times the audio signal is sampled every second. Under the Nyquist theorem, the sample rate must be at least twice the highest frequency we want to capture. If we accept that the hearing of a female toddler may reach, at its very greatest, 20,000 Hz, a sample rate that would fully include all frequencies in this range would be 40,000 Hz (40 kHz). Add a little buffer and do a little manipulation to make synchronization with video recording easier, and voilà, 44.1 kHz! Professional video and broadcast work added a bit more of a buffer (no pun intended) and went with 48 kHz. We use 44.1 kHz at all points in the sampling and distribution process.

Why? A sample rate of 44.1 kHz extends beyond the maximum range of human hearing (if you’re male and/or over 20, your hearing likely drops off around 17-18 kHz). Recording at much higher rates can cause distortion on equipment (amplifiers, speakers, etc.) not designed to handle ultrasonic content, which could result in issues for our customers, on top of using enormous amounts of space. Any developer who sells samples above 44.1 kHz that are not intended for very extensive resampling/manipulation is possibly multiplying the size of their library by 2, 3, or even 4x for NO perceivable improvement. Beware!

If we take our bit depth example from before, imagine we kept the same 65,536 (or 16,777,216) rows and wanted to cut our sine waves into a different number of columns. More columns (samples) per second means higher frequencies can be captured. Remember, a lower sample rate means any sound above 1/2 the sample rate will be lost (this is why applying an 8 kHz sample rate results in a sound not dissimilar to the fidelity of 78 rpm records, which could only reach around 4 kHz of total frequency range).
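
The half-the-sample-rate rule can be sketched in a few lines of Python. One assumption goes beyond the text above: in practice, content above the limit is filtered out before sampling, and if it were not, it would fold back ("alias") to a lower frequency rather than simply vanish. The helper names are illustrative.

```python
def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def alias_frequency(freq_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone sampled with no anti-aliasing filter."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_limit(44_100))          # 22050.0, above the 20 kHz hearing limit
print(nyquist_limit(8_000))           # 4000.0, the "78 rpm" range
print(alias_frequency(5_000, 8_000))  # 3000: a 5 kHz tone folds down to 3 kHz
```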

Bitrate measures the total number of bits stored every second. In uncompressed audio, this is Bit Depth × Sample Rate × Channels (for 44.1/16 mono, 705.6 kbps; stereo would be 1,411.2 kbps). Bitrate only departs from these figures if a form of compression is applied, most notably lossy compression such as the Ogg Vorbis or MP3 (e.g. LAME) codecs. Lossy compression, for obvious reasons, degrades the sound quality of the audio, no matter how little you use. For MP3, 320 kbps (the highest standard MP3 bitrate) is more or less indistinguishable from uncompressed signals for most music (particularly signals without strong transients) for consumers. We do not use any lossy compression in any stage of our development process.
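
The bitrate figures above can be reproduced with a short Python sketch. The function name `pcm_bitrate_kbps` is illustrative, and it assumes 1 kbps = 1,000 bits per second.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Uncompressed PCM bitrate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

mono = pcm_bitrate_kbps(44_100, 16, 1)
stereo = pcm_bitrate_kbps(44_100, 16, 2)
print(mono)    # 705.6
print(stereo)  # 1411.2

# How much smaller a 320 kbps MP3 stream is than 44.1/16 stereo PCM.
print(f"{stereo / 320:.2f}x")
```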

Why? Lossy compression permanently discards part of the signal, which uncompressed and lossless formats do not. Chances are, many customers will want higher-fidelity samples than compressed audio is capable of.

Channels describes the number of different audio streams used. Most modern audio work is recorded in stereo (2-channel), and occasionally in mono (1-channel), although recent advances in technology have led to affordable ambisonic microphone arrays, capable of recording a 360-degree signal that a decoder can then reduce to a 2-channel, 4-channel, 7-channel, or other layout. We record all instruments in stereo whenever possible, and when multi-mic recording is done, we occasionally use arrays of mono or stereo design to capture different angles.

How does this fit in with other digital formats, such as video?

In digital video, we think of a number of frames per second and an amount of data per frame. For example, 30 frames per second of 720p footage (that’s 921,600 pixels per frame) is 27.65 million pixels every second. If the color of each pixel is expressed in 8 bits, we end up with 221,184,000 bits every second (about 27.65 MB every second, or 221,184 kbps, compared to a mere 1,411.2 kbps for 44.1/16 stereo audio). Of course, in this example we assume zero compression, use a deliberately small 8 bits per pixel (full color is more commonly 8 bits per color channel, which would triple the figure), and leave out other information that might be included by the specific codec, but it is a good way to get a feeling for the size of audio data.
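
The video arithmetic can be checked with a small Python sketch. It assumes uncompressed 1280×720 frames and the same 8 bits per pixel used in the example; `bits_per_pixel` is left as a parameter so the 24-bit case can be tried too. The function name is illustrative.

```python
def video_bits_per_second(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Raw, uncompressed video data rate in bits per second."""
    return width * height * fps * bits_per_pixel

video = video_bits_per_second(1280, 720, 30, 8)
audio = 44_100 * 16 * 2                  # 44.1/16 stereo PCM, bits per second

print(video)                             # 221184000
print(video / 8 / 1_000_000)             # 27.648 (megabytes per second)
print(f"{video / audio:.1f}x audio")     # roughly 157 times the audio rate
```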

How do all of the above elements fit together?

In digital audio, a recording is made up of a series of samples per second (sample rate), each containing a certain number of bits (bit depth), across a certain number of channels. Multiplying these three values gives the total amount of data being transferred in bits per second (make sure you convert bits to bytes if you are concerned with storage space).
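
As a final sketch, the same multiplication can be turned into a storage estimate in Python. The one-minute duration and the 96 kHz / 24-bit comparison format are assumptions picked purely for illustration, and `pcm_bytes` is a hypothetical helper name.

```python
def pcm_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> float:
    """Storage needed for uncompressed PCM audio, in bytes (8 bits per byte)."""
    return sample_rate_hz * bit_depth * channels * seconds / 8

cd_quality = pcm_bytes(44_100, 16, 2, 60)   # one minute of 44.1/16 stereo
hi_res = pcm_bytes(96_000, 24, 2, 60)       # one minute of 96/24 stereo

print(f"44.1/16: {cd_quality / 1_000_000:.2f} MB")
print(f"96/24:   {hi_res / 1_000_000:.2f} MB")
print(f"ratio:   {hi_res / cd_quality:.2f}x")
```

The ratio lands a little above 3x, in line with the 2-4x library size increase mentioned earlier.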
