- Pre-trained Model: YAMNet is a deep learning model trained to predict 521 audio event classes.
- Dataset: Trained on 1,574,587 10-second YouTube soundtrack excerpts from the AudioSet dataset (unbalanced train segments).
- Architecture: Uses the MobileNet_v1 depthwise-separable convolution architecture for efficient computation (a minimal sketch of one such block appears after this list).
- Audio Processing: Designed to process mono audio sampled at 16 kHz and to make predictions at a 10 Hz frame rate (see the inference sketch after this list).
- Performance:
  - d-prime: 2.318
  - Balanced mAP: 0.306
  - Balanced average lwlrap: 0.393 (lwlrap: label-weighted label-ranking average precision, described in the DCASE 2019 Task 2 Overview Paper; a sketch of the metric computation follows this list).
- Keras Model: Includes Keras code for constructing the model and applying it to input audio files.
- Purpose: Released as a baseline for audio event classification and to inspire new applications.
- Improvements: The model includes refinements to handle the challenges of imbalanced class priors and weak labels.
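
The core building block of MobileNet_v1 is the depthwise-separable convolution: a depthwise 3x3 convolution for spatial mixing followed by a pointwise 1x1 convolution for channel mixing, which is far cheaper than a full 3x3 convolution. The Keras sketch below illustrates one such block; the filter count and strides are illustrative placeholders rather than YAMNet's actual layer configuration, though the 96x64 input does match YAMNet's log-mel patch shape.

```python
# Sketch: one MobileNet_v1-style depthwise-separable convolution block in Keras.
# Filter count and strides are illustrative, not YAMNet's actual configuration;
# the 96x64x1 input matches YAMNet's log-mel patch (96 frames x 64 mel bands).
import tensorflow as tf

def separable_conv_block(x, filters, strides=1):
    # Depthwise 3x3: one filter per input channel (spatial mixing only).
    x = tf.keras.layers.DepthwiseConv2D(3, strides=strides, padding='same',
                                        use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    # Pointwise 1x1: mixes channels; this is where most of the parameters live.
    x = tf.keras.layers.Conv2D(filters, 1, padding='same', use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

inputs = tf.keras.Input(shape=(96, 64, 1))
outputs = separable_conv_block(inputs, filters=32, strides=2)
model = tf.keras.Model(inputs, outputs)
model.summary()
```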
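As a sketch of end-to-end usage, the snippet below runs the TensorFlow Hub release of YAMNet on a WAV file: it downmixes to mono, resamples to 16 kHz, and reads the class map that ships as a model asset. The file name `audio.wav` and the assumption of 16-bit PCM input are placeholders.

```python
# Sketch: running the TF Hub release of YAMNet on a WAV file.
# 'audio.wav' is a placeholder path; input is assumed to be 16-bit PCM.
import csv
import io
import numpy as np
import scipy.signal
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

model = hub.load('https://tfhub.dev/google/yamnet/1')

sr, wav = wavfile.read('audio.wav')
wav = wav.astype(np.float32) / 32768.0  # int16 PCM -> float32 in [-1.0, 1.0]
if wav.ndim > 1:
    wav = wav.mean(axis=1)              # downmix to mono
if sr != 16000:
    wav = scipy.signal.resample(wav, int(round(len(wav) * 16000 / sr)))  # to 16 kHz

# scores: [frames, 521] per-frame class scores; embeddings: [frames, 1024].
scores, embeddings, spectrogram = model(wav)

# The class map CSV (index, mid, display_name) ships as a model asset.
class_map = tf.io.read_file(model.class_map_path()).numpy().decode('utf-8')
class_names = [row['display_name'] for row in csv.DictReader(io.StringIO(class_map))]
print('Top class:', class_names[int(np.argmax(scores.numpy().mean(axis=0)))])
```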
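On the metrics above: d-prime is a detection-theory measure of class separation, derived in the AudioSet literature from ROC AUC as d' = sqrt(2) * Phi^-1(AUC), and lwlrap averages, over every positive label in the evaluation set, the label-ranking precision at that label's rank. The sketch below is meant to illustrate the metric definitions, not to reproduce the exact evaluation setup behind the figures above.

```python
# Sketch: lwlrap (label-weighted label-ranking average precision) as defined
# in the DCASE 2019 Task 2 overview paper, plus d-prime from ROC AUC.
# `truth` and `scores` are [num_samples, num_classes]; `truth` is binary.
import numpy as np
from scipy.stats import norm

def lwlrap(truth, scores):
    precisions = []
    for y, s in zip(truth, scores):
        pos = np.flatnonzero(y)                  # indices of true labels
        if pos.size == 0:
            continue                             # samples with no labels are skipped
        ranks = np.argsort(np.argsort(-s)) + 1   # 1-based rank of every class score
        for c in pos:
            # Precision at label c's rank: true labels ranked at or above it / rank.
            precisions.append(np.sum(ranks[pos] <= ranks[c]) / ranks[c])
    # "Label-weighted": every positive label in the set counts equally.
    return float(np.mean(precisions))

def d_prime(auc):
    # Detection-theory d' derived from ROC AUC, as used in the AudioSet literature.
    return np.sqrt(2) * norm.ppf(auc)

# Toy check: a perfect ranking yields lwlrap = 1.0.
truth = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])
print(lwlrap(truth, scores))  # 1.0
```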