There is a part of the world we do not see, but it explains almost everything: sound.
Sound captures presence, context, friction, emotion, movement, environment and change. A machine may not see a door opening, but it can hear it. It may not see fatigue in a call, but it can find signal in the voice. It may not see a mechanical anomaly in a facility, but it can detect it as an acoustic pattern before it becomes a failure. Audio is dense with information, yet many teams still find it difficult to turn that signal into useful models.
Today at Valendra we are sharing something we are especially excited about: ASTW.
ASTW stands for Audio Shapes The World. It is a small audio classification library that turns a set of audio files and labels into a portable model artifact. The intended workflow is deliberately simple: pass audios and labels, call fit, save the result with save, and get a self-contained folder with model.safetensors, config.json, preprocessor_config.json, labels.json, metrics.json and a README.md ready to push to Hugging Face.
The repository is public on GitHub: valendra-tech/valendra-labs-astw.
Why build another audio library
Audio classification often gets trapped between two extremes. On one side there are large pretrained models. They are powerful, but they bring size, external dependency, operational cost and a decision surface that is harder to inspect. On the other side there are research scripts that may work for a paper or a local experiment, but do not always produce an artifact that is easy to save, load, version and share.
ASTW sits between those two worlds.
The original intuition was concrete: start from approaches like Whisper and ask what would happen if, instead of stopping there, we built a direct route for classification. Whisper made log-mel audio representations and sequence models widely practical. But if the problem is classification rather than transcription, you often do not need to load an architecture designed for full speech recognition. You need to turn acoustic signal into a stable representation and train a classification head for a clearly defined task.
That question led us to build ASTW from scratch, without Whisper weights and without external pretrained audio backbones. Pretrained models are useful, but sometimes the value lies precisely in reducing dependencies, size and opacity. There are cases where a compact model trained on your classes and exported as a portable artifact is easier to govern than a large component inheriting assumptions you do not control.
The central question behind ASTW is simple: how much can we lower the barrier to audio modeling for developers, researchers and product teams without forcing them to set up a full platform before validating an idea?
The workflow we wanted
ASTW tries to keep the happy path short and explicit:
from astw import ASTW
audios = [
"audio/cat_01.wav",
"audio/cat_02.wav",
"audio/dog_01.wav",
"audio/dog_02.wav",
]
labels = ["cat", "cat", "dog", "dog"]
model = ASTW()
model.fit(audios, labels, epochs=8, batch_size=4)
model.save("exports/my-audio-model")
Later, that model can be loaded from local disk or from a Hugging Face repository:
from astw import ASTW
local = ASTW.load("exports/my-audio-model")
print(local.predict("dog.wav"))
remote = ASTW.from_pretrained("your-user/your-audio-model")
print(remote.predict("clip.wav", top_k=3))
There is also a command line interface for one-off prediction:
astw predict path/to/audio.wav --model exports/my-audio-model
astw predict path/to/audio.wav --model your-user/your-audio-model --top-k 5
The API is small by design. Training, exporting, loading and predicting should not require reading ten modules before producing the first useful model.
What happens inside ASTW
The default model is a compact Transformer encoder audio classifier. The input is not text or language tokens. It is a local log-mel representation of the audio signal.
Feature extraction uses:
- mono audio resampled to 16 kHz when needed
- 80 log-mel bins
- n_fft=400
- hop_length=160
- Hann windows
- Whisper-style normalization over the log-mel spectrogram
- a default chunk length of 30 seconds
ASTW implements its own local extractor, astw-local-log-mel-80. That avoids downloading an external extractor when training new models. The resulting features feed the classifier.
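To make the numbers above concrete, here is a minimal sketch of the frame arithmetic and of the normalization that Whisper applies to its log-mel features. It is not ASTW's extractor: the function names are hypothetical, the input is a random stand-in for a real mel power spectrogram, and whether ASTW reproduces these exact constants is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000
HOP_LENGTH = 160
CHUNK_SECONDS = 30


def expected_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Approximate frame count: one spectrogram frame per hop."""
    return num_samples // hop_length


def whisper_style_normalize(mel_power: np.ndarray) -> np.ndarray:
    """Normalize a (n_mels, frames) mel power spectrogram the way Whisper does."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))       # avoid log(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)   # clip dynamic range to 8 decades
    return (log_spec + 4.0) / 4.0                           # shift and scale


frames = expected_frames(SAMPLE_RATE * CHUNK_SECONDS)

# Stand-in for a real mel power spectrogram (assumption: random positive values).
rng = np.random.default_rng(0)
mel_power = rng.random((80, frames)) ** 4

features = whisper_style_normalize(mel_power)
print(frames, features.shape)
```

A 30 second chunk at 16 kHz with a hop of 160 samples gives about 3000 frames of 80 bins each, which is the sequence the classifier has to handle.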
The classifier uses this default structure:
- two Conv1d layers as a temporal stem
- first convolution from 80 channels to d_model
- second convolution with stride=2 to reduce temporal length
- learned positional embedding
- a stack of encoder blocks with multi-head self-attention
- final normalization
- pooling over valid frames
- linear classification head
The default configuration is:
d_model: 384
num_heads: 6
num_layers: 8
ff_mult: 4
dropout: 0.1
pooling: mean
sample_rate: 16000
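A rough back-of-the-envelope sketch of what these defaults imply. It assumes a standard Transformer encoder block (four attention projections plus a two-layer feed-forward) and a stride-2 stem that halves the frame count. Biases, normalization layers, the stem, positional embeddings and the head are left out, so this is an estimate rather than ASTW's actual parameter count. EncoderConfig is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    d_model: int = 384
    num_heads: int = 6
    num_layers: int = 8
    ff_mult: int = 4

    @property
    def head_dim(self) -> int:
        if self.d_model % self.num_heads:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads

    def block_weight_params(self) -> int:
        """Weight-matrix parameters in one encoder block (assumed standard layout)."""
        attention = 4 * self.d_model * self.d_model                     # Q, K, V, output
        feed_forward = 2 * self.ff_mult * self.d_model * self.d_model   # up and down projections
        return attention + feed_forward

    def encoder_weight_params(self) -> int:
        return self.num_layers * self.block_weight_params()


cfg = EncoderConfig()
frames_in = 30 * 16_000 // 160        # 30 s chunk, hop of 160 samples
frames_after_stem = frames_in // 2    # assumption: the stride-2 convolution halves the length

print(cfg.head_dim, frames_after_stem, cfg.encoder_weight_params())
```

Under those assumptions the encoder blocks hold roughly 14 million weights and attend over about 1500 positions for a full 30 second chunk.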
Default pooling is masked mean pooling, so padded frames do not leak into the representation. ASTW also supports attentive_stats, which computes attention-weighted mean and standard deviation:
from astw import ASTW
model = ASTW(pooling="attentive_stats")
For small tasks, that choice matters. Mean pooling is simpler and stable. Attentive statistics pooling may capture localized signal more effectively, but it also adds capacity and therefore requires more careful validation.
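The difference between masked and naive mean pooling is easy to show with NumPy. This is a minimal sketch with hypothetical names, not ASTW's internal code. It assumes the encoder output has shape (batch, frames, d_model) and that the number of valid frames per clip is known.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Average over valid frames only.

    hidden:  (batch, frames, d_model) encoder outputs, padded on the right.
    lengths: (batch,) number of valid frames per clip.
    """
    frame_index = np.arange(hidden.shape[1])[None, :]                # (1, frames)
    mask = (frame_index < lengths[:, None]).astype(hidden.dtype)     # (batch, frames)
    summed = (hidden * mask[:, :, None]).sum(axis=1)                 # (batch, d_model)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)        # avoid division by zero
    return summed / counts


hidden = np.zeros((2, 4, 3))
hidden[0, :2] = 1.0          # clip 0: two valid frames of ones, two padded frames of zeros
hidden[1, :] = 2.0           # clip 1: four valid frames
lengths = np.array([2, 4])

pooled = masked_mean_pool(hidden, lengths)
naive = hidden.mean(axis=1)  # padding drags clip 0 toward zero
print(pooled[0, 0], naive[0, 0])
```

For the short clip, the naive mean is halved by the padding while the masked mean stays at the value of the valid frames.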
Training from scratch does not mean training blindly
ASTW does not load pretrained weights. It trains from scratch. That makes the training recipe important because there is no giant model compensating for poor setup.
The library combines two optimizers:
- Muon for matrix-shaped parameters such as linear and convolution weights
- AdamW for the remaining parameters such as biases and positional embeddings
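One way to picture that split is a routing rule over parameter shapes. The sketch below is an assumption about how such a rule could look, not ASTW's implementation. The parameter names, the kernel size of 3 and the "embed" substring check are all hypothetical.

```python
def split_parameters(shapes: dict[str, tuple[int, ...]]) -> tuple[list[str], list[str]]:
    """Route parameter names to Muon or AdamW based on shape and name.

    Assumed rule: tensors with two or more dimensions go to Muon,
    except embeddings; everything else goes to AdamW.
    """
    muon, adamw = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) >= 2
        is_embedding = "embed" in name
        if is_matrix and not is_embedding:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "stem.conv1.weight": (384, 80, 3),
    "stem.conv1.bias": (384,),
    "pos_embed.weight": (1500, 384),
    "blocks.0.attn.out.weight": (384, 384),
    "blocks.0.attn.out.bias": (384,),
    "head.weight": (10, 384),
}

muon_names, adamw_names = split_parameters(shapes)
print(muon_names)
print(adamw_names)
```

Note that a positional embedding is two-dimensional, so a pure shape test would not be enough to keep it on the AdamW side.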
The default training setup includes:
muon_lr: 0.01
adam_lr: 3e-4
weight_decay: 0.01
label_smoothing: 0.1
warmup_ratio: 0.1
patience: 30
random_state: 42
The scheduler applies warmup followed by cosine decay. Model selection uses validation macro-F1, not only accuracy. That is important in audio classification because classes can be imbalanced or confused asymmetrically. Accuracy can hide that a model ignores a minority class. Macro-F1 forces class-level performance into the selection criterion.
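A small worked example shows why accuracy alone can mislead. The functions below are a plain-Python sketch of macro-F1, written for illustration and not taken from ASTW. The toy labels are made up.

```python
def f1_per_class(y_true, y_pred, labels):
    """F1 score for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true))
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# A model that always answers "dog" on an imbalanced validation set.
y_true = ["dog"] * 9 + ["cat"]
y_pred = ["dog"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

Accuracy is 0.90 here, but macro-F1 falls to about 0.47 because the cat class is never predicted.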
Training also applies SpecAugment. The current implementation performs two frequency masks and two time masks per batch with bounded widths:
- frequency: up to 8 bins
- time: up to 20 valid frames
The point is not magic. The point is enough variation to reduce brittle memorization of the training set.
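The masking idea can be sketched in a few lines of NumPy. This version works on a single (n_mels, frames) feature matrix rather than a batch, fills masked regions with zeros, and draws widths uniformly. All three are assumptions, and spec_augment is a hypothetical name rather than ASTW's function.

```python
import numpy as np


def spec_augment(features, valid_frames, rng, num_masks=2,
                 max_freq_width=8, max_time_width=20):
    """Return a copy of `features` with random frequency and time bands zeroed."""
    out = features.copy()
    n_mels = out.shape[0]

    for _ in range(num_masks):
        # Frequency mask: a band of up to max_freq_width mel bins.
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = 0.0

    for _ in range(num_masks):
        # Time mask: a span of up to max_time_width frames, kept inside the valid region.
        width = int(rng.integers(0, min(max_time_width, valid_frames) + 1))
        start = int(rng.integers(0, valid_frames - width + 1))
        out[:, start:start + width] = 0.0

    return out


features = np.ones((80, 100))
augmented = spec_augment(features, valid_frames=60, rng=np.random.default_rng(0))
print(int((augmented == 0).sum()), "masked cells")
```

Restricting time masks to valid frames means the augmentation is spent on real signal rather than on padding.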
ASTW supports explicit validation through val_audios and val_labels, or an internal split through val_split=0.2. If explicit validation is provided, ASTW uses it. Otherwise, it creates a reproducible split, stratified when possible.
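A reproducible stratified split can be sketched with the standard library alone. This illustrates the idea and is not ASTW's splitting code. The rule that a class with a single example stays in training is an assumption about what "when possible" could mean.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_split=0.2, seed=42):
    """Return (train_indices, val_indices), sampling each class separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Assumption: a class with one example cannot be split and stays in training.
        n_val = max(1, round(len(indices) * val_split)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["cat"] * 10 + ["dog"] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), [labels[i] for i in val_idx])
```

Each class contributes to validation in proportion to its size, so a minority class is not left out of the selection metric by chance.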
Accepted audio inputs
A useful API should not force every user into the same data shape. fit and predict accept:
- an audio file path
- a numpy.ndarray
- an (array, sample_rate) tuple
- a dictionary with array and sample_rate
That makes ASTW usable with files on disk and with pipelines that already hold audio in memory. If the sample rate differs from 16 kHz, ASTW resamples with librosa. If the audio is multi-channel, it is converted to mono.
This reduces friction in real projects, where audio often arrives from mixed sources: local WAV files, arrays from datasets, downloaded clips, or fragments captured by a service.
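The in-memory cases can be normalized to one shape before feature extraction. The sketch below is a hypothetical helper, not ASTW's loader. It assumes a bare array is already at 16 kHz and guesses the channel axis as the shorter one. File decoding and resampling are left out.

```python
from pathlib import Path

import numpy as np

TARGET_SR = 16_000


def to_mono_array(audio, default_sr=TARGET_SR):
    """Normalize an in-memory audio input to (float32 mono array, sample_rate)."""
    if isinstance(audio, (str, Path)):
        raise NotImplementedError("file decoding is out of scope for this sketch")
    if isinstance(audio, dict):
        array, sr = audio["array"], audio["sample_rate"]
    elif isinstance(audio, tuple):
        array, sr = audio
    else:
        array, sr = audio, default_sr  # assumption: bare arrays are already at 16 kHz

    array = np.asarray(array, dtype=np.float32)
    if array.ndim == 2:
        # Assumption: the shorter axis holds the channels.
        channel_axis = 0 if array.shape[0] < array.shape[1] else 1
        array = array.mean(axis=channel_axis)
    elif array.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {array.shape}")
    # Resampling to TARGET_SR would follow here, for example with librosa.resample.
    return array, int(sr)


stereo = np.stack([np.ones(100), np.zeros(100)])  # shape (2, 100)
mono, sr = to_mono_array({"array": stereo, "sample_rate": 44_100})
print(mono.shape, sr, float(mono[0]))
```

Funnelling every input type through one function keeps the rest of the pipeline free of type checks.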
The exported artifact as a contract
One of ASTW's core ideas is export. After training, save creates a folder with:
model.safetensors
config.json
preprocessor_config.json
labels.json
metrics.json
README.md
model.safetensors stores weights in a safe, portable format. config.json describes the format, model type, sample rate, labels, encoder configuration, metrics and metadata. preprocessor_config.json documents feature extraction. labels.json makes the id-to-label mapping explicit. metrics.json stores validation information. The generated README.md includes Hugging Face compatible metadata such as pipeline_tag: audio-classification.
This makes the model closer to a package than a loose weights file. A weights file without configuration is future debt. A self-contained artifact can be inspected, versioned, uploaded, downloaded and reproduced with less tribal knowledge.
Remote loading uses huggingface_hub.snapshot_download and restricts the download to the required files:
README.md
config.json
labels.json
metrics.json
model.safetensors
preprocessor_config.json
The contract stays small and explicit.
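Because the contract is just a list of file names, it can be checked before an artifact is uploaded or loaded. The helper below is a hypothetical sketch that only tests for file presence. It does not read or validate the contents, whose schemas are not described here.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = (
    "README.md",
    "config.json",
    "labels.json",
    "metrics.json",
    "model.safetensors",
    "preprocessor_config.json",
)


def missing_artifact_files(folder) -> list[str]:
    """Return the required file names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES[:-1]:
        (Path(tmp) / name).touch()
    incomplete = missing_artifact_files(tmp)   # one file still missing

    (Path(tmp) / REQUIRED_FILES[-1]).touch()
    complete = missing_artifact_files(tmp)     # nothing missing

print(incomplete, complete)
```

On the download side, huggingface_hub.snapshot_download exposes an allow_patterns argument for this kind of filtering. Whether ASTW uses exactly that argument is an assumption.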
Example 1: ESC-50 animals
The first bundled example trains a 10-class animal classifier on ESC-50:
- cat
- cow
- crow
- dog
- frog
- hen
- insects
- pig
- rooster
- sheep
The workflow is split into two scripts:
python examples/esc50_animals/0_train.py --epochs 30 --batch-size 8 --test-fold 1
python examples/esc50_animals/1_test.py --test-fold 1
The test script loads the exported artifact and writes:
examples/esc50_animals/confusion_matrix.png
examples/esc50_animals/metrics.json
The confusion matrix is especially useful because several animal classes share acoustic texture. Global accuracy is not enough. You want to know whether rooster is confused with hen, whether dog separates from cow, or whether insects behaves like a more diffuse class.
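The point can be illustrated with a toy example that counts (true, predicted) pairs. The predictions below are invented for illustration and are not results from the ESC-50 example. The helper names are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {label: counts[(label, label)] / support[label] for label in support}


# Made-up predictions: half of the rooster clips are mistaken for hen.
y_true = ["rooster"] * 4 + ["hen"] * 4 + ["dog"] * 4
y_pred = ["rooster", "rooster", "hen", "hen"] + ["hen"] * 4 + ["dog"] * 4

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)

print(f"accuracy={accuracy:.2f}")
print(recall)
print("rooster predicted as hen:", counts[("rooster", "hen")])
```

Overall accuracy is about 0.83 in this toy case, yet rooster recall is only 0.5, and the off-diagonal count shows exactly where the errors go.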

Example 2: ESC-50 transport
The second example uses the same dataset but changes the domain to five transport classes:
- airplane
- car_horn
- engine
- helicopter
- train
The basic workflow is:
python examples/esc50_transport/0_train.py --epochs 24 --batch-size 8 --test-fold 1
python examples/esc50_transport/1_test.py --test-fold 1
This example can also test attentive statistics pooling:
python examples/esc50_transport/0_train.py --pooling attentive_stats
It is a good comparison case because some transport signals are continuous, such as engine, while others are localized events, such as car_horn.

Example 3: speech emotion with RAVDESS
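The third example works on speech emotion using RAVDESS, specifically xbgoose/ravdess. RAVDESS includes 1440 short speech clips, roughly 3 to 4 seconds each, recorded by 24 professional actors across 8 emotions. The classes are balanced except for neutral, which has half as many clips because it is recorded at a single intensity.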
The ASTW example focuses on four classes:
- neutral
- calm
- angry
- surprised
The other emotions (happy, sad, fearful, disgust) are filtered out to keep the task focused. They are acoustically closer and usually require pretraining, more data or a larger model to separate reliably.
The important part is the split. By default, actors 21, 22, 23 and 24 form the held-out test set, and the model trains and validates only on the remaining actors. This avoids an overly optimistic evaluation where the classifier learns speaker traits instead of emotion.
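A speaker-independent split amounts to partitioning clips by actor id instead of at random. The sketch below assumes each clip is a dictionary with a hypothetical actor key. The real column names in the dataset and in the example scripts may differ.

```python
DEFAULT_TEST_ACTORS = frozenset({21, 22, 23, 24})


def split_by_actor(clips, test_actors=DEFAULT_TEST_ACTORS):
    """Partition clips so that no actor appears on both sides of the split."""
    train = [clip for clip in clips if clip["actor"] not in test_actors]
    test = [clip for clip in clips if clip["actor"] in test_actors]
    return train, test


# Toy records: two clips for each of the 24 actors (paths are placeholders).
clips = [
    {"path": f"actor_{actor:02d}_clip_{take}.wav", "actor": actor}
    for actor in range(1, 25)
    for take in range(2)
]

train_clips, test_clips = split_by_actor(clips)
train_actors = {clip["actor"] for clip in train_clips}
held_out_actors = {clip["actor"] for clip in test_clips}

print(len(train_clips), len(test_clips), sorted(held_out_actors))
```

A random split over clips would put the same voices on both sides, which is the leakage the actor-level split is meant to prevent.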
The workflow is:
python examples/speech_emotion/0_train.py --epochs 24 --batch-size 16
python examples/speech_emotion/1_test.py
You can also change the held-out speakers:
python examples/speech_emotion/0_train.py --test-actors 19 20 21 22 23 24

What ASTW can solve
ASTW is not a universal audio platform. It is a compact tool for supervised classification. That makes it useful when classes are defined and the team needs to validate quickly:
- acoustic event classification
- environmental sound categorization
- basic machinery or transport sound classification
- focused speech emotion classification
- internal labeling workflows for audio datasets
- audio ML prototypes that need portable exports
The key is not to confuse classification with general audio understanding. ASTW does not transcribe, separate sources or perform diarization. It is designed to reduce the distance between labeled audio and a reusable classifier.
Clear limits
Publishing open source also means publishing limits.
ASTW trains from scratch. That gives control, but it also requires enough data per class and honest evaluation. If the dataset is small, noisy or biased, the model will reflect that. If classes are not acoustically separable, the architecture will not perform miracles. If the split mixes speakers, devices or environments incorrectly, metrics can look better than production behavior.
It is also not a full serving system. The artifact can be loaded and used, but production still requires versioning, dataset traceability, regression tests, latency measurement, drift monitoring and validation with real data. ASTW lowers the entry barrier to audio modeling. It does not remove MLOps responsibilities.
At Valendra, we prefer bounded artifacts with explicit limits over broad promises. A small, well-defined tool is often more useful than a large ambiguous claim.
Why open source
We include complete examples because open source is not only about publishing code. It is also about helping people use it.
ASTW includes examples for animals, transport and speech emotion because they show three different patterns: environmental classes, machine-related events and a speech task where the split matters. They also produce metrics and confusion matrices because evaluation should not be optional.
At Valendra, we want to stand on the side of applied research: build, publish, expose assumptions and let others reproduce, challenge or adapt the result. Not every experiment has to become a product. Some experiments exist to open a path, reduce friction and leave a reusable piece for the next team.
ASTW is one experiment in that direction.
Getting started
Install from PyPI:
pip install astw
Local development install:
python -m pip install -e .
Install with example dependencies:
python -m pip install -e '.[examples]'
CLI prediction:
astw predict path/to/audio.wav --model exports/my-audio-model --top-k 5
Build a wheel:
python -m pip install build
python -m build








