Audio journey part 1: 10 keywords for audio processing

Tiya Vaj
6 min read · May 20, 2023

Let’s start with two definitions. Discrete data has clear gaps between its possible values, while continuous data can take any value along an unbroken range. Digital audio is a discrete representation of a continuous sound wave.


10 keywords for audio processing

1. Sample rate: Sample rate is the number of samples per second in an audio signal. It is crucial because it determines the frequency range that can be accurately represented: the highest representable frequency is half the sample rate (the Nyquist limit). It therefore affects bandwidth and fidelity throughout audio processing.

Adobe’s website gives another clear definition: converting sound to digital audio involves capturing samples of the source at regular intervals along the sound wave. The number of samples taken per second, the ‘sample rate,’ affects the fidelity of the resulting digital file. A higher sample rate captures higher frequencies and generally yields a more faithful reproduction of the original audio.

Sample rates are commonly expressed in hertz (Hz) or kilohertz (kHz), meaning samples per second. For instance, CDs use a sample rate of 44.1 kHz: 44,100 samples are recorded for every second of audio.
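To make the numbers concrete, here is a minimal sketch that samples a pure tone at the CD rate. The helper name `sample_sine` and the 440 Hz test tone are illustrative choices, not something from the original article, and only the Python standard library is used so the snippet runs anywhere.

```python
import math

SAMPLE_RATE = 44_100  # samples per second (the CD rate mentioned above)

def sample_sine(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sample a sine wave at regular intervals of 1 / sample_rate seconds."""
    n_samples = int(round(duration_s * sample_rate))
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

one_second = sample_sine(440.0, 1.0)   # hypothetical 440 Hz test tone
nyquist_hz = SAMPLE_RATE / 2           # highest frequency this rate can represent

print(len(one_second))  # 44100 samples for one second of audio
print(nyquist_hz)       # 22050.0
```

The Nyquist value explains why 44.1 kHz is enough for the roughly 20 kHz upper limit of human hearing.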

2. Bit depth: Bit depth represents the number of bits used to represent the amplitude of each audio sample. Higher bit depth allows for a greater dynamic range and better resolution, leading to improved audio quality.

When recording audio, each sample captured must be stored using a certain number of bits in your computer. The quality of sound reproduction improves as you increase the number of bits used to record each sample.

The best recording quality comes from combining a high sample rate with a high bit depth (the latter is also called “sample format”). Bit depth should not be confused with bit rate, which for uncompressed audio is sample rate × bit depth × number of channels. A greater bit depth gives a wider dynamic range, improving the overall fidelity of the recording.

Dynamic range is the difference between the quietest and loudest parts of a recording, measured in decibels (dB). Each bit adds roughly 6 dB, so 16-bit audio covers about 96 dB and 24-bit audio about 144 dB. The ear's usable range is often quoted at around 90 dB (estimates vary); recording with more range than that leaves headroom, so softer sounds can be amplified while keeping the reproduction high-fidelity.
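The relationship between bit depth, resolution, and dynamic range can be worked out with a few lines of arithmetic. This is a sketch assuming linear PCM audio; the function names are made up for illustration. It also computes the bit rate of uncompressed audio, which is a separate quantity from bit depth.

```python
import math

def quantization_levels(bit_depth):
    """Number of distinct amplitude values a sample can take."""
    return 2 ** bit_depth

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

def pcm_bit_rate(sample_rate, bit_depth, channels):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate * bit_depth * channels

for bits in (8, 16, 24):
    print(bits, quantization_levels(bits), round(dynamic_range_db(bits), 1))

print(pcm_bit_rate(44_100, 16, 2))  # CD-quality stereo: 1411200 bits per second
```

Each extra bit doubles the number of levels and adds about 6 dB, which is where the familiar figures of roughly 96 dB for 16-bit and 144 dB for 24-bit audio come from.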



3. Normalization: Normalization scales the amplitude of audio signals to a standard range, typically between -1 and 1. This keeps loudness levels consistent across audio samples, which helps prevent a classifier from being biased by recording volume.
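One common way to do this is peak normalization, sketched below. Peak normalization is only one option (the article does not say which method it has in mind), and `peak_normalize` is a hypothetical helper name.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]

quiet = [0.0, 0.1, -0.25, 0.05]   # a quiet recording
loud = [0.0, 2.0, -4.0, 1.0]      # a recording outside the [-1, 1] range

print(peak_normalize(quiet))
print(peak_normalize(loud))
```

After scaling, both clips have the same peak of 1.0, so a downstream model sees their shape rather than their recording volume.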

4. Frequency:

Frequency, which we perceive as pitch, is the number of times per second that a sound pressure wave repeats itself. It is measured in hertz (Hz). People with normal hearing can perceive sounds from about 20 Hz to 20,000 Hz. Frequencies above 20,000 Hz are called ultrasound. When your dog shows curiosity and attentiveness towards seemingly imperceptible sounds, it may be detecting ultrasonic frequencies; dogs can hear up to about 45,000 Hz. At the other end of the spectrum are very low-frequency sounds (below 20 Hz), known as infrasound. Elephants use infrasound for communication, making sounds too low for humans to hear.
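Because frequency is simply "repetitions per second", it can be roughly estimated from a sampled wave by counting how often the signal crosses zero going upward. The sketch below does that for a clean tone and labels a frequency using the 20 Hz and 20,000 Hz boundaries from the text. The function names are illustrative, and zero-crossing counting is a deliberately simple technique that only works well for clean, single-tone signals.

```python
import math

def sine(freq_hz, duration_s, sample_rate):
    """Generate a sampled sine wave."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def estimate_frequency(samples, sample_rate):
    """Estimate frequency as upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate)

def hearing_band(freq_hz):
    """Classify a frequency relative to the typical human hearing range."""
    if freq_hz < 20:
        return "infrasound"
    if freq_hz > 20_000:
        return "ultrasound"
    return "audible"

tone = sine(440.0, 1.0, 8000)
estimate = estimate_frequency(tone, 8000)
print(estimate)                # close to 440; may be off by about one cycle
print(hearing_band(estimate))  # audible
print(hearing_band(45_000))    # ultrasound
```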


5. Amplitude:

Amplitude is the measure of the strength of a sound wave, which corresponds to our perception of loudness or volume. Its level is usually expressed in decibels (dB), a logarithmic scale for sound pressure or intensity relative to a reference value.
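As an illustration, the sketch below converts sample amplitudes to decibels. Digital samples carry no absolute pressure reference, so this example assumes the common convention of measuring relative to full scale (dBFS, where an amplitude of 1.0 is 0 dB); that reference and the helper names are assumptions, not something stated in the article.

```python
import math

def rms(samples):
    """Root-mean-square amplitude, a common measure of average signal strength."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def to_dbfs(amplitude, reference=1.0):
    """Level in decibels relative to full scale (assumed reference: 1.0)."""
    if amplitude <= 0:
        return float("-inf")  # silence has no finite dB level
    return 20 * math.log10(amplitude / reference)

# One second of a full-scale 100 Hz sine sampled at 8 kHz
tone = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(8000)]

print(to_dbfs(1.0))                  # 0.0 (full scale)
print(round(to_dbfs(0.5), 2))        # halving the amplitude costs about 6 dB
print(round(to_dbfs(rms(tone)), 2))  # RMS level of a full-scale sine, about -3 dB
```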


6. Windowing:

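Windowing is the process of dividing an audio signal into short, usually overlapping segments (frames) and multiplying each by a window function, such as a Hann or Hamming window, that tapers toward zero at the edges. The tapering reduces spectral leakage, and the short segments give more localized frequency information, which is valuable for many audio analysis tasks, including feature extraction.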

The following diagram illustrates the concept of windowing in audio:

In the diagram, each rectangular box represents a window containing a segment of the audio signal. By applying windowing, the audio signal is analyzed in smaller chunks, allowing for more localized frequency analysis and providing better insights into the spectral characteristics of the audio over time.
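The idea can be sketched in a few lines: cut the signal into overlapping frames, then multiply each frame by a tapering window. The frame length of 8, the hop of 4 (50% overlap), and the Hann window are illustrative assumptions; real audio pipelines typically use frames of a few hundred to a few thousand samples.

```python
import math

def hann(n):
    """Hann window of length n; it tapers to zero at both ends."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def windowed_frames(samples, frame_len=8, hop_len=4):
    """Frame the signal and apply the Hann window to every frame."""
    window = hann(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_signal(samples, frame_len, hop_len)]

signal = [1.0] * 20  # constant signal, so each frame shows the window shape itself
frames = windowed_frames(signal)

print(len(frames))  # 4 frames starting at samples 0, 4, 8, 12
print([round(v, 3) for v in frames[0]])
```

Because the input is constant, the printed frame is just the window: near zero at the edges and largest in the middle, which is what suppresses the abrupt frame boundaries that cause spectral leakage.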

7. Spectral analysis:

Spectral analysis involves examining the frequency content of an audio signal. Techniques like Fourier Transform and Short-Time Fourier Transform (STFT) are used to analyze the spectral characteristics of audio signals, providing valuable insights for feature extraction and classification.

Here’s a simplified diagram illustrating the spectral analysis of audio:

In the diagram, the audio signal is first pre-processed to remove noise or unwanted artifacts. Then, windowing is applied to divide the audio signal into smaller segments. Next, the Fourier Transform is performed on each window to convert the audio signal from the time domain to the frequency domain. Finally, the spectrum analysis is conducted to analyze the resulting frequency components and their amplitudes.

This spectral analysis provides valuable insights into the frequency content of the audio, allowing for tasks such as audio classification, feature extraction, or visualization of the audio signal.
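The window-then-transform steps described above can be sketched as a very small short-time Fourier analysis. This version uses a naive O(N²) discrete Fourier transform written with the standard library so it is self-contained; in practice an FFT routine from a numerical library would be used instead. The test signal, frame length, and non-overlapping frames are simplifying assumptions.

```python
import cmath
import math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitude of DFT bins 0..N/2 (naive implementation for illustration)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def dominant_frequencies(samples, sample_rate, frame_len):
    """Return the strongest frequency (Hz) in each non-overlapping frame."""
    window = hann(frame_len)
    result = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = [x * w for x, w in zip(samples[start:start + frame_len], window)]
        mags = dft_magnitudes(frame)
        peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])  # skip the DC bin
        result.append(peak_bin * sample_rate / frame_len)  # bin index -> Hz
    return result

SAMPLE_RATE = 8000
FRAME_LEN = 256
# A 1000 Hz tone followed by a 2000 Hz tone, one frame each
signal = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
signal += [math.sin(2 * math.pi * 2000 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]

print(dominant_frequencies(signal, SAMPLE_RATE, FRAME_LEN))  # [1000.0, 2000.0]
```

Analyzing frame by frame is what lets the result show the frequency changing over time, which a single transform of the whole signal would not reveal.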

Time domain and frequency domain are two representations used to analyze and understand audio signals.

8. Time domain: The time-domain representation of an audio signal displays the variations of the signal over time. In this representation, the x-axis typically represents time, and the y-axis represents the amplitude or intensity of the audio signal. By analyzing the time-domain representation, one can observe the waveform of the audio signal, including its duration, amplitude, and any changes that occur over time. This representation is useful for tasks such as visualizing the audio signal or detecting temporal characteristics like onset, duration, and timing.
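As a small example of working directly in the time domain, the sketch below measures a clip's duration and finds a sound's onset by looking for the first short frame whose energy rises above a threshold. The frame length, threshold, and function names are illustrative assumptions; real onset detectors are more elaborate.

```python
import math

def short_time_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def first_onset_time(samples, sample_rate, frame_len=100, threshold=0.01):
    """Time in seconds of the first frame whose energy exceeds the threshold."""
    for index, energy in enumerate(short_time_energy(samples, frame_len)):
        if energy > threshold:
            return index * frame_len / sample_rate
    return None  # no frame was loud enough

SAMPLE_RATE = 1000
silence = [0.0] * 500  # 0.5 s of silence
tone = [math.sin(2 * math.pi * 50 * i / SAMPLE_RATE) for i in range(500)]
signal = silence + tone

print(len(signal) / SAMPLE_RATE)              # duration: 1.0 second
print(max(abs(s) for s in signal))            # peak amplitude
print(first_onset_time(signal, SAMPLE_RATE))  # 0.5 (the tone starts here)
```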

9. Frequency domain: The frequency-domain representation, on the other hand, provides information about the distribution of frequencies present in the audio signal. It shows how the audio signal is composed of different frequency components. In the frequency domain, the x-axis represents frequency, while the y-axis represents the magnitude or power of each frequency component. The most common technique used to convert an audio signal from the time domain to the frequency domain is the Fourier Transform. By analyzing the frequency-domain representation, one can identify the dominant frequencies, harmonic relationships, and spectral characteristics of the audio signal. This representation is useful for tasks such as audio equalization, pitch detection, and identifying specific frequency components.

10. Mel-frequency cepstral coefficients (MFCCs): MFCCs are widely used audio features that represent the spectral shape of an audio signal. They are obtained by applying a series of transformations, including Mel-scale filtering and the discrete cosine transform (DCT). MFCCs capture important perceptual characteristics of audio and are commonly used for audio classification.

Here’s a simplified diagram illustrating the process of computing Mel Frequency Cepstral Coefficients (MFCCs):

In the diagram, the audio signal is first pre-emphasized to enhance high-frequency components. Then, the audio signal is divided into frames using frame blocking. Windowing is then applied to each frame to reduce spectral leakage. The Fourier Transform is performed on each windowed frame to obtain the spectrum representation. The Mel Filterbank is applied to the spectrum to map it to the Mel scale. The resulting Mel-scaled spectrum is then subjected to a logarithmic operation. Next, the Discrete Cosine Transform (DCT) is applied to extract the MFCCs. The final step yields the MFCCs, which are coefficients that represent the spectral characteristics of the audio signal in a compressed form.

The MFCCs are widely used in various audio processing applications, such as speech recognition, speaker identification, and audio classification.
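The pipeline described above (pre-emphasis, framing, windowing, Fourier transform, Mel filterbank, logarithm, DCT) can be sketched end to end as follows. This is a simplified illustration, not a reference implementation: the parameter values (0.97 pre-emphasis, 256-sample frames, 20 filters, 13 coefficients), the Hamming window, and the 2595/700 Mel formula are common conventions assumed here, and the naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power of DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))) ** 2 / n
        for k in range(n // 2 + 1)
    ]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [j * top / (n_filters + 1) for j in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):    # rising edge of the triangle
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):   # falling edge of the triangle
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank

def dct2(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]

def mfcc(samples, sample_rate, frame_len=256, hop_len=128,
         n_filters=20, n_coeffs=13, pre_emphasis=0.97):
    # 1. Pre-emphasis boosts high frequencies
    emphasized = [samples[0]] + [samples[i] - pre_emphasis * samples[i - 1]
                                 for i in range(1, len(samples))]
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]  # Hamming window
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    coefficients = []
    # 2-3. Frame blocking and windowing
    for start in range(0, len(emphasized) - frame_len + 1, hop_len):
        frame = [x * w for x, w in zip(emphasized[start:start + frame_len], window)]
        spectrum = power_spectrum(frame)                                         # 4. Fourier transform
        energies = [sum(f * p for f, p in zip(filt, spectrum)) for filt in bank]  # 5. Mel filterbank
        log_energies = [math.log(e + 1e-10) for e in energies]                   # 6. Logarithm
        coefficients.append(dct2(log_energies, n_coeffs))                        # 7. DCT
    return coefficients

SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(1024)]
features = mfcc(signal, SAMPLE_RATE)
print(len(features), len(features[0]))  # number of frames, coefficients per frame
```

The result is one short vector of coefficients per frame, which is the compact form usually fed to a classifier. The small constant added before the logarithm is only there to avoid taking the log of zero.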




Tiya Vaj

Ph.D. research scholar in NLP, passionate about data-driven work for social good.