Posted on Category

MP3 Decoding in C++

Raw digital audio is stored within a pulse code modulation (PCM) stream. The problem with PCM is that it takes up a lot of memory and can pose an inconvenience especially when streaming audio over the internet, TV, or radio. But we can process the signal so that it takes up less space.

I'm by no means an expert on this topic. This is a project in the works. The source code is available at github.com/FlorisCreyf/mp3-decoder.

ID3MP3
HeaderSide InformationMain DataHeader . . .

Initially I tried to interpret my test file using a hex dump and then looking for patterns. That might sound absurd, and I've gone a long way to doing actual research, but that's how I found out that some MP3 files don't necessarily start on the first byte.

The Decoder

There are two ways a file can be compressed: lossless, and lossy. Compressing a file in a lossless manner means that when the file is reconstructed we end up exactly with what we started with. Lossy compression ideally removes information we have no use for. Because we can only hear between a certain range of frequencies, depending on the number of decibels, we can remove samples that fall outside of the hearable range. It's also possible for a frequency to "mask" another frequency causing the second frequency to become inaudible.

MP3 and AAC encoders combine both principles, so that we can store more music, and stream those files at a faster rate. Once the file is about to be played the decoder is responsible for undoing the encoder's compression.

The decoder is responsible for:

  • Deriving PCM data from the bitstream.
  • Handing the data over to the operating system (or music player).

ID3 Meta Data

ID3 (version 2) is a block of bytes wherein meta data is stored. This meta data is mostly irrelevant to the decoder, except for an offset that that points to the end of the tag. Inconveniently, there might come subsequent ID3 tags after the first. A full documentation is available at ID3.org.

hexdump

The first couple hundred bytes from a MP3 file.

TypeOffset (bytes)Description
Identifier0 - 2Indicates the presence of ID3.
Version3 - 4Version and revision of the tag.
Flags5Four single-bit booleans.
Size6 - 9Size of the ID3 tag excluding this ten byte header.

MP3 Header

The MP3 header identifies parameters such as the sampling rate, bit rate, number of channels, and version. With this data we can determine the size of the MP3 frame, which will also tell us where the next header is.

samples_per_frame 8 · bit_rate sampeling_rate + padding

Knowing the size of the first frame, we can find the second header. Within the first frame there usually is a INFO or XING tag that can aid in navigating the file.

Side Information

Each header is followed by side information and contains the data needed to find where the "main data" is located. The main data isn't located after the side information due to the varying sizes of the Huffman encoded samples. The side information also includes additional values that will be used in the requantization formula to reconstruct the samples into real numbers.

Main Data

The main data is divided into two granules. Depending on the channel mode defined in the header each granule can have up to two channels. A channel starts with scale factors followed by Huffman bits. From the Huffman bits, 576 frequency lines are extracted per channel. At the end of the main data there is user defined data called ancillary data.

The samples are scaled by scale factors in the requantization formula. Different scaling factors are applied to different groups, or subbands, of samples.

Located after the scale factors are the Huffman bits. This region is divided into three big value regions, a quadruples region, and a zero region. Depending on the side information, each region uses one of the many Huffman tables. The quadruples region containing higher pitches is much more tightly compressed and makes use of just two tables. The zero region contains omitted samples.

Big Value Region
Big Value Region
Big Value Region
Quadruples Region
Zero Region

It's probably neccessary at this point to write a seperate program that can embed the Huffman tables in C code. The Huffman tables in my source code should be parsed so that the Huffman bits are shifted to the right of each "int" and so that the length of each Huffman value (in bits) is stored as well.

unsigned bit_sample = get_bits(bitstream, bit, bit + 32);
for ( . . . ) {
	int value = table.hcod[entry];
	int size = table.hlen[entry];

	if (value == bit_sample >> (32 - size)) {

Each region is further subdivided into two block types:

  • Long blocks: Higher frequency resolution.
  • Short blocks: Lower frequency resolution. Short blocks are one-third the size of long blocks and aren't aligned in order so they have to be reorded. This is assuming that the frequencies are closer together at certain points, therefore using fewer Huffman codes. Short blocks are used to counter pre-echo.

Inverse Quantization

Quantization takes something that is continues, or infinite, and turns it into a discrete value. The Huffman samples represent a discrete data set. The inverse quantization formula processes the scale factors and other data from the side information to construct real numbers (which are only theoretically continues) from our original Huffman samples.

S i = s i g n s i · s i 4 3 · 2 1 4 a · 2 - b

The exponents for long blocks are:

a = global_gain[gr][ch] - 210
b = scalefac_scale[gr][ch] == 0 ? 0.5 : 1 · scalefactor[gr][ch][sb] + preflag[gr][ch] · pretab[sb]

The exponents for short blocks are:

a = global_gain[gr][ch] - 210 - 8 · subblock_gain[gr][ch][window]
b = scalefac_scale[gr][ch] == 0 ? 0.5 : 1 · scalefactor[gr][ch][sb][window]

Reordering

Only short blocks need to be reordered. If the first few subbands are long blocks, then those bands should be excluded. During this phase, samples in each subband are mapped to blocks of 18 samples.

 // OUTPUT:
 0, 1, 2, 3, 12, 13, 4, 5, 6, 7, 16, 17, 8, 9, 10 . . .
 // TABLE:
 // 0 - 17 | subband_short_32[0] (4)
 0, 4, 8,    // 0,  6,  12,       0 + 0, 0 + 6, 0 + 12
 1, 5, 9,    // 1,  7,  13,       1 + 0, 1 + 6, 1 + 12
 2, 6, 10,   // 2,  8,  14,       2 + 0, 2 + 6, 2 + 12
 3, 7, 11,   // 3,  9,  15,       3 + 0, 3 + 6, 3 + 12
 // subband_short_32[1] (4)
 12, 16, 20, // 4,  10, 16,       4 + 0, 4 + 6, 4 + 12
 13, 17, 21, // 5,  11, 17,       5 + 0, 5 + 6, 6 + 12
 // 18 - 35
 14, 18, 22, // 18, 24, 30,       18 + 0, 18 + 6, 18 + 12
 15, 19, 23, // 19, 25, 31,
 // subband_short_32[2] (4)
 24, 28, 32, // 20, 26, 32,
 25, 29, 33, // 21, 27, 33,
 26, 30, 34, // 22, 28, 34,
 27, 31, 35, // 23, 29, 35,
 // 36 - 53 | subband_short_32[3] (4)
 36, 40, 44, // 36, 42, 48,
 37, 41, 45, // 37, 43, 49,
 38, 42, 46, // 38, 44, 50,
 39, 43, 47, // 39, 45, 51,
 // subband_short_32[4] (6)
 48, 54, 60, // 40, 46, 52,
 49, 55, 61, // 41, 47, 53,
 // 54 - 71
 50, 56, 62, // 54, 60, 66,
 51, 57, 63, // 55, 61, 67,
 52, 58, 64, // 56, 62, 68,
 53, 59, 65, // 57, 63, 69
 . . .

Inverse Modified Discrete Cosine Transform (IMDCT)

Another layer of lossy compression is the DCT. The MDCT takes closely related samples and places them on a cosine function. For long blocks the MDCT reduces 32 samples to 18 samples, and for short blocks the MCDT reduces 12 samples to six.

x i = k = 0 n 2 - 1 X k cos π 2 n 2 i + 1 + n 2 2 k + 1

Variable "n" is 12 for short blocks and 36 for long blocks. The IMDCT produces x0 through xn - 1. Once the cosine transform is complete the samples need to be windowed and overlapped.

Fast Fourier Transform

The pulse code modulation stream, the input to the encoder, is in a time domain. The encoder converts these time domain samples to frequency domain samples through a Fast Fourier Transform. When decoding, the IMDCT converts the frequency domain samples into time domain samples.

hexdump

Time domain vs. frequency domain.

Synthesis Filter bank

The encoder takes a PCM stream and divides it into several bands that approximate critical bands. A critical band contains frequencies that sound similar and affect the same area of the basilar membrane in the cochlea. Bands are structured similar to critical bands so that artifacts due to quantization are masked in the same band.

Critical bands become larger as frequency increases, and the encoder's filter divides the frequency spectrum into equal sized bands. In other words, it's harder to discern between two higher frequency sounds than two lower frequency sounds. The decoder eventually takes care of reconstruction.

Source Code

Source code can be found on GitHub. The decoder outputs the resulting PCM stream to the advanced Linux Sound Architecture (ALSA).

Sources