ASTW: Audio Shapes The World and the path to compact audio classifiers

There is a part of the world we do not see, but it explains almost everything: sound.

Sound captures presence, context, friction, emotion, movement, environment and change. A machine may not see a door opening, but it can hear it. It may not see fatigue in a call, but it can find signal in the voice. It may not see a mechanical anomaly in a facility, but it can detect it as an acoustic pattern before it becomes a failure. Audio contains a dense amount of information, yet many teams still find it difficult to turn that signal into useful models.

Today at Valendra we are sharing something we are especially excited about: ASTW.

ASTW stands for Audio Shapes The World. It is a small audio classification library that turns a set of audio files and labels into a portable model artifact. The intended workflow is deliberately simple: pass audios and labels, call fit, save the result with save, and get a self-contained folder with model.safetensors, config.json, preprocessor_config.json, labels.json, metrics.json and a README.md ready to push to Hugging Face.

The repository is public on GitHub: valendra-tech/valendra-labs-astw.

Why build another audio library

Audio classification often gets trapped between two extremes. On one side there are large pretrained models. They are powerful, but they bring size, external dependencies, operational cost and a decision surface that is harder to inspect. On the other side there are research scripts that may work for a paper or a local experiment, but do not always produce an artifact that is easy to save, load, version and share.

ASTW sits between those two worlds.

The original intuition was concrete: start from approaches like Whisper and ask what would happen if, instead of stopping there, we built a direct route for classification. Whisper made log-mel audio representations and sequence models widely practical. But if the problem is classification rather than transcription, you often do not need to load an architecture designed for full speech recognition. You need to turn acoustic signal into a stable representation and train a classification head for a clearly defined task.

That question led us to build ASTW from scratch, without Whisper weights and without external pretrained audio backbones. Not because pretrained models are not useful, but because sometimes the value is precisely in reducing dependencies, size and opacity. There are cases where a compact model trained on your classes and exported as a portable artifact is easier to govern than a large component inheriting assumptions you do not control.

The central question behind ASTW is simple: how much can we lower the barrier to audio modeling for developers, researchers and product teams without forcing them to set up a full platform before validating an idea?

The workflow we wanted

ASTW tries to keep the happy path short and explicit:

from astw import ASTW

audios = [
    "audio/cat_01.wav",
    "audio/cat_02.wav",
    "audio/dog_01.wav",
    "audio/dog_02.wav",
]
labels = ["cat", "cat", "dog", "dog"]

model = ASTW()
model.fit(audios, labels, epochs=8, batch_size=4)
model.save("exports/my-audio-model")

Later, that model can be loaded from local disk or from a Hugging Face repository:

from astw import ASTW

local = ASTW.load("exports/my-audio-model")
print(local.predict("dog.wav"))

remote = ASTW.from_pretrained("your-user/your-audio-model")
print(remote.predict("clip.wav", top_k=3))

There is also a command line interface for one-off prediction:

astw predict path/to/audio.wav --model exports/my-audio-model
astw predict path/to/audio.wav --model your-user/your-audio-model --top-k 5

The API is small by design. Training, exporting, loading and predicting should not require reading ten modules before producing the first useful model.

What happens inside ASTW

The default model is a compact Transformer encoder audio classifier. The input is not text or language tokens. It is a local log-mel representation of the audio signal.

Feature extraction uses:

mono audio resampled to 16 kHz when needed
80 log-mel bins
n_fft=400
hop_length=160
Hann windows
Whisper-style normalization over the log-mel spectrogram
a default chunk length of 30 seconds

ASTW implements its own local extractor, astw-local-log-mel-80. That avoids downloading an external extractor when training new models. The resulting features feed the classifier.
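To make the numbers above concrete, here is a minimal sketch of the frame arithmetic and of a Whisper-style normalization step. It follows the scheme in OpenAI's Whisper reference code (log10, an 8-unit dynamic range floor, then a shift and scale); whether ASTW's extractor matches it constant for constant is an assumption, and the function names are illustrative, not part of the ASTW API.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between frames, i.e. 10 ms at 16 kHz
CHUNK_SECONDS = 30


def frames_per_chunk(seconds=CHUNK_SECONDS, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Number of log-mel frames in one chunk, counting one frame per hop."""
    return (seconds * sample_rate) // hop_length


def whisper_style_normalize(mel_power):
    """Log-compress and rescale a mel power spectrogram of shape (n_mels, frames)."""
    log_spec = np.log10(np.clip(mel_power, 1e-10, None))
    # Floor everything more than 8 log10 units (80 dB of power) below the peak.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    # Shift and scale so typical values land roughly in [-1, 1.5].
    return (log_spec + 4.0) / 4.0


print(frames_per_chunk())  # 3000
```

With these defaults a 30 second chunk becomes 3000 frames of 80 bins, and a stride-2 convolution would halve that to about 1500 positions for the encoder.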

The classifier uses this default structure:

two Conv1d layers as a temporal stem
first convolution from 80 channels to d_model
second convolution with stride=2 to reduce temporal length
learned positional embedding
a stack of encoder blocks with multi-head self-attention
final normalization
pooling over valid frames
linear classification head

The default configuration is:

d_model: 384
num_heads: 6
num_layers: 8
ff_mult: 4
dropout: 0.1
pooling: mean
sample_rate: 16000
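As a back-of-envelope check on what "compact" means here, the sketch below estimates the parameter count of the encoder stack alone. It assumes textbook Transformer encoder blocks (query, key, value and output projections with biases, a two-layer feed-forward network, and two LayerNorms per block). The convolutional stem, positional embedding, final normalization and classification head are left out, so the real ASTW total may differ.

```python
D_MODEL = 384
NUM_HEADS = 6
NUM_LAYERS = 8
FF_MULT = 4


def encoder_block_params(d_model, ff_mult):
    """Parameter count of one assumed-standard Transformer encoder block."""
    # Four d_model x d_model projections (Q, K, V, output), each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    hidden = ff_mult * d_model
    # Feed-forward: d_model -> hidden -> d_model, both layers with biases.
    feed_forward = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


assert D_MODEL % NUM_HEADS == 0, "d_model must divide evenly across heads"
head_dim = D_MODEL // NUM_HEADS
total = NUM_LAYERS * encoder_block_params(D_MODEL, FF_MULT)

print(head_dim)                    # 64
print(total)                       # 14195712
print(round(total * 4 / 1e6, 1))   # 56.8 (megabytes in float32)
```

Under these assumptions the encoder stack is roughly 14 million parameters, which is small next to general-purpose speech models.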

Default pooling is masked mean pooling, so padded frames do not leak into the representation. ASTW also supports attentive_stats, which computes attention-weighted mean and standard deviation:

from astw import ASTW

model = ASTW(pooling="attentive_stats")

For small tasks, that choice matters. Mean pooling is simpler and more stable. Attentive statistics pooling may capture localized signal more effectively, but it also adds capacity and therefore requires more careful validation.
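The difference between the two pooling modes can be sketched in a few lines of NumPy. This is an illustration of the general techniques, not ASTW's code: the function names are hypothetical, and in a real model the attention scores would come from a small learned layer rather than being passed in.

```python
import numpy as np


def valid_mask(frames, lengths):
    """Boolean mask of shape (batch, frames): True where a frame is real audio."""
    return np.arange(frames)[None, :] < np.asarray(lengths)[:, None]


def masked_mean_pool(hidden, lengths):
    """Average hidden states of shape (batch, frames, d_model) over valid frames only."""
    mask = valid_mask(hidden.shape[1], lengths)
    summed = (hidden * mask[:, :, None]).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    return summed / counts


def attentive_stats_pool(hidden, scores, lengths, eps=1e-6):
    """Attention-weighted mean and standard deviation, concatenated.

    scores has shape (batch, frames). Assumes every item has at least one valid frame.
    """
    mask = valid_mask(hidden.shape[1], lengths)
    scores = np.where(mask, scores, -np.inf)            # padded frames get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    mean = (weights[:, :, None] * hidden).sum(axis=1)
    var = (weights[:, :, None] * hidden ** 2).sum(axis=1) - mean ** 2
    std = np.sqrt(np.clip(var, eps, None))
    return np.concatenate([mean, std], axis=1)
```

Note that the attentive variant returns a vector twice as wide as d_model, which is one place the extra capacity comes from.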

Training from scratch does not mean training blindly

ASTW does not load pretrained weights. It trains from scratch. That makes the training recipe important because there is no giant model compensating for poor setup.

The library combines two optimizers:

Muon for matrix-shaped parameters such as linear and convolution weights
AdamW for the remaining parameters such as biases and positional embeddings
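One way to picture that split is as a routing rule over parameter names and shapes. The sketch below is an assumption about how such a rule could look, not ASTW's implementation: the parameter names are hypothetical, and because a positional embedding is itself a 2-D table, it needs an explicit name-based exception to land in the AdamW group.

```python
def split_param_groups(param_shapes, adamw_name_hints=("pos_embed",)):
    """Route parameters to Muon or AdamW based on shape and name.

    param_shapes maps a parameter name to its shape tuple.
    Returns (muon_names, adamw_names).
    """
    muon, adamw = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) >= 2
        forced_adamw = any(hint in name for hint in adamw_name_hints)
        if is_matrix and not forced_adamw:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "stem.conv1.weight": (384, 80, 3),
    "stem.conv1.bias": (384,),
    "pos_embed.weight": (1500, 384),
    "blocks.0.ffn.fc1.weight": (1536, 384),
    "blocks.0.ffn.fc1.bias": (1536,),
}
muon_names, adamw_names = split_param_groups(shapes)
print(muon_names)   # ['stem.conv1.weight', 'blocks.0.ffn.fc1.weight']
print(adamw_names)  # ['stem.conv1.bias', 'pos_embed.weight', 'blocks.0.ffn.fc1.bias']
```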

The default training setup includes:

muon_lr: 0.01
adam_lr: 3e-4
weight_decay: 0.01
label_smoothing: 0.1
warmup_ratio: 0.1
patience: 30
random_state: 42

The scheduler applies warmup followed by cosine decay. Model selection uses validation macro-F1, not only accuracy. That is important in audio classification because classes can be imbalanced or confused asymmetrically. Accuracy can hide that a model ignores a minority class. Macro-F1 forces class-level performance into the selection criterion.
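A toy case shows why. The sketch below computes accuracy and macro-F1 by hand for a made-up imbalanced problem in which the model never predicts the minority class. The labels and counts are invented for illustration and are not ASTW results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Nine majority-class clips, one minority clip, and a model that always says "engine".
y_true = ["engine"] * 9 + ["car_horn"]
y_pred = ["engine"] * 10

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

Accuracy rewards the model for the nine easy clips. Macro-F1 averages an F1 of about 0.95 for engine with an F1 of 0 for car_horn, which exposes the ignored class.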

Training also applies SpecAugment. The current implementation performs two frequency masks and two time masks per batch with bounded widths:

frequency: up to 8 bins
time: up to 20 valid frames

The point is not magic. The point is enough variation to reduce brittle memorization of the training set.
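For readers who have not used SpecAugment, here is a minimal NumPy sketch of masking with those bounds. It works on a single example rather than a batch, fills masked regions with zero, and samples widths uniformly. All three are assumptions, since the text above only fixes the number of masks and their maximum widths.

```python
import numpy as np


def spec_augment(features, valid_frames, rng, num_masks=2,
                 max_freq_width=8, max_time_width=20, mask_value=0.0):
    """Return a masked copy of a (n_mels, frames) feature matrix.

    Time masks are placed only inside the first valid_frames columns,
    so padding is never counted as augmentation.
    """
    out = features.copy()
    n_mels = out.shape[0]

    for _ in range(num_masks):
        # Frequency mask: blank a band of 0..max_freq_width mel bins.
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = mask_value

    for _ in range(num_masks):
        # Time mask: blank a span of 0..max_time_width valid frames.
        width = int(rng.integers(0, min(max_time_width, valid_frames) + 1))
        start = int(rng.integers(0, valid_frames - width + 1))
        out[:, start:start + width] = mask_value

    return out


rng = np.random.default_rng(42)
clip = np.ones((80, 300), dtype=np.float32)   # stand-in for log-mel features
augmented = spec_augment(clip, valid_frames=200, rng=rng)
```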

ASTW supports explicit validation through val_audios and val_labels, or an internal split through val_split=0.2. If explicit validation is provided, ASTW uses it. Otherwise, it creates a reproducible split, stratified when possible.
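A reproducible stratified split can be sketched with the standard library alone. The function below is a hypothetical illustration of the idea, not ASTW's internal splitter. It assumes labels are sortable (for example, strings) and keeps single-example classes entirely in training, which is one possible reading of "stratified when possible".

```python
import random
from collections import defaultdict


def stratified_split(labels, val_split=0.2, seed=42):
    """Return (train_indices, val_indices), holding out val_split of each class."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out at least one example per class when the class has two or more.
        n_val = max(1, round(len(indices) * val_split)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["cat"] * 10 + ["dog"] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 12 3
```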

Accepted audio inputs

A useful API should not force every user into the same data shape. fit and predict accept:

an audio file path
a numpy.ndarray
an (array, sample_rate) tuple
a dictionary with array and sample_rate

That makes ASTW usable with files on disk and with pipelines that already hold audio in memory. If the sample rate differs from 16 kHz, ASTW resamples with librosa. If the audio is multi-channel, it is converted to mono.

This reduces friction in real projects, where audio often arrives from mixed sources: local WAV files, arrays from datasets, downloaded clips, or fragments captured by a service.
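The in-memory cases can be sketched as a small coercion step. This is a hypothetical helper, not ASTW's code. It assumes a bare array is already at 16 kHz, that multi-channel arrays are laid out as (channels, samples), and it leaves file loading and resampling out entirely.

```python
import numpy as np

TARGET_SR = 16_000


def coerce_audio(item, default_sr=TARGET_SR):
    """Return (mono float32 array, sample_rate) for the in-memory input shapes."""
    if isinstance(item, dict):
        array, sample_rate = item["array"], item["sample_rate"]
    elif isinstance(item, tuple):
        array, sample_rate = item
    elif isinstance(item, np.ndarray):
        # Assumption: a bare array is already at the target sample rate.
        array, sample_rate = item, default_sr
    else:
        raise TypeError(f"unsupported audio input: {type(item).__name__}")

    array = np.asarray(array, dtype=np.float32)
    if array.ndim == 2:
        # Assumption: channels-first layout, so averaging axis 0 gives mono.
        array = array.mean(axis=0)
    return array, int(sample_rate)


stereo = np.array([[1.0, 3.0], [3.0, 5.0]])
mono, sr = coerce_audio((stereo, 44_100))
print(mono, sr)  # [2. 4.] 44100
```

A caller would still need to resample when the returned rate differs from 16 kHz.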

The exported artifact as a contract

One of ASTW's core ideas is export. After training, save creates a folder with:

model.safetensors
config.json
preprocessor_config.json
labels.json
metrics.json
README.md

model.safetensors stores weights in a safe, portable format. config.json describes the format, model type, sample rate, labels, encoder configuration, metrics and metadata. preprocessor_config.json documents feature extraction. labels.json makes the id-to-label mapping explicit. metrics.json stores validation information. The generated README.md includes Hugging Face compatible metadata such as pipeline_tag: audio-classification.

This makes the model closer to a package than a loose weights file. A weights file without configuration is future debt. A self-contained artifact can be inspected, versioned, uploaded, downloaded and reproduced with less tribal knowledge.

Remote loading uses huggingface_hub.snapshot_download and restricts the download to the required files:

README.md
config.json
labels.json
metrics.json
model.safetensors
preprocessor_config.json

The contract stays small and explicit.
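Because the file list is fixed, a consumer can check an export before trying to load it. The helper below is a hypothetical sketch built only on the file names listed above; it is not part of the ASTW API.

```python
from pathlib import Path

REQUIRED_FILES = (
    "README.md",
    "config.json",
    "labels.json",
    "metrics.json",
    "model.safetensors",
    "preprocessor_config.json",
)


def missing_artifact_files(folder):
    """Return the required file names that are absent from an export folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means the folder satisfies the contract. For example, missing_artifact_files("exports/my-audio-model") could run in CI before an upload.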

Example 1: ESC-50 animals

The first bundled example trains a 10-class animal classifier on ESC-50:

cat
cow
crow
dog
frog
hen
insects
pig
rooster
sheep

The workflow is split into two scripts:

python examples/esc50_animals/0_train.py --epochs 30 --batch-size 8 --test-fold 1
python examples/esc50_animals/1_test.py --test-fold 1

The test script loads the exported artifact and writes:

examples/esc50_animals/confusion_matrix.png
examples/esc50_animals/metrics.json

The confusion matrix is especially useful because several animal classes share acoustic texture. Global accuracy is not enough. You want to know whether rooster is confused with hen, whether dog separates from cow, or whether insects behaves like a more diffuse class.

Figure: ESC-50 animals classifier confusion matrix.
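Questions like "is rooster confused with hen" can be read off a confusion matrix programmatically. The sketch below assumes rows are true classes and columns are predictions. The counts are made up for illustration and are not results from the ASTW example.

```python
import numpy as np


def per_class_recall(confusion):
    """Diagonal divided by row totals: the share of each true class that was found."""
    confusion = np.asarray(confusion, dtype=float)
    totals = confusion.sum(axis=1)
    return np.divide(np.diag(confusion), totals,
                     out=np.zeros_like(totals), where=totals > 0)


def top_confusions(confusion, labels, k=3):
    """Largest off-diagonal cells as (true_label, predicted_label, count)."""
    off_diag = np.asarray(confusion).copy()
    np.fill_diagonal(off_diag, 0)
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    pairs = []
    for flat in order:
        true_i, pred_i = np.unravel_index(flat, off_diag.shape)
        if off_diag[true_i, pred_i] > 0:
            pairs.append((labels[true_i], labels[pred_i], int(off_diag[true_i, pred_i])))
    return pairs


labels = ["hen", "rooster", "dog"]
confusion = [
    [6, 2, 0],   # true hen
    [3, 5, 0],   # true rooster
    [0, 1, 7],   # true dog
]
print(per_class_recall(confusion))             # [0.75  0.625 0.875]
print(top_confusions(confusion, labels, k=2))  # [('rooster', 'hen', 3), ('hen', 'rooster', 2)]
```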

Example 2: ESC-50 transport

The second example uses the same dataset but changes the domain to five transport classes:

airplane
car_horn
engine
helicopter
train

The basic workflow is:

python examples/esc50_transport/0_train.py --epochs 24 --batch-size 8 --test-fold 1
python examples/esc50_transport/1_test.py --test-fold 1

This example can also test attentive statistics pooling:

python examples/esc50_transport/0_train.py --pooling attentive_stats

It is a good comparison case because some transport signals are continuous, such as engine, while others are localized events, such as car_horn.

Figure: ESC-50 transport classifier confusion matrix.

Example 3: speech emotion with RAVDESS

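The third example works on speech emotion using RAVDESS, specifically xbgoose/ravdess. RAVDESS includes 1440 short speech clips, roughly 3 to 4 seconds each, recorded by 24 professional actors across 8 emotions. The classes are close to balanced, except that neutral has about half as many clips as the others because it is recorded at a single intensity.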

The ASTW example focuses on four classes:

neutral
calm
angry
surprised

The other emotions (happy, sad, fearful, disgust) are filtered out to keep the task focused. They are acoustically closer and usually require pretraining, more data or a larger model to separate reliably.

The important part is the split. By default, actors 21, 22, 23 and 24 form the held-out test set. The model trains and validates on the remaining actors and is tested on speakers it has never heard. This avoids an overly optimistic evaluation where the classifier learns speaker traits instead of emotion.
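A speaker-independent split is simple to express once each clip carries its actor id. The sketch below uses hypothetical record dictionaries with "actor" and "emotion" keys; the actual field names in the dataset and in the example script may differ.

```python
def split_by_actor(records, test_actors=(21, 22, 23, 24)):
    """Split clips so that no actor appears on both sides."""
    held_out = set(test_actors)
    train = [r for r in records if r["actor"] not in held_out]
    test = [r for r in records if r["actor"] in held_out]
    return train, test


# Toy records: one clip per actor and emotion, for illustration only.
records = [
    {"actor": actor, "emotion": emotion}
    for actor in range(1, 25)
    for emotion in ("neutral", "calm", "angry", "surprised")
]
train, test = split_by_actor(records)

train_actors = {r["actor"] for r in train}
test_actors = {r["actor"] for r in test}
assert train_actors.isdisjoint(test_actors)  # no speaker leakage
print(len(train_actors), len(test_actors))   # 20 4
```

The same grouping idea applies to devices or recording sessions whenever those could leak into the evaluation.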

The workflow is:

python examples/speech_emotion/0_train.py --epochs 24 --batch-size 16
python examples/speech_emotion/1_test.py

You can also change the held-out speakers:

python examples/speech_emotion/0_train.py --test-actors 19 20 21 22 23 24

Figure: RAVDESS speech emotion classifier confusion matrix.

What ASTW can solve

ASTW is not a universal audio platform. It is a compact tool for supervised classification. That makes it useful when classes are defined and the team needs to validate quickly:

acoustic event classification
environmental sound categorization
basic machinery or transport sound classification
focused speech emotion classification
internal labeling workflows for audio datasets
audio ML prototypes that need portable exports

The key is not to confuse classification with general audio understanding. ASTW does not transcribe, separate sources or perform diarization. It is designed to reduce the distance between labeled audio and a reusable classifier.

Clear limits

Publishing open source also means publishing limits.

ASTW trains from scratch. That gives control, but it also requires enough data per class and honest evaluation. If the dataset is small, noisy or biased, the model will reflect that. If classes are not acoustically separable, the architecture will not perform miracles. If the split mixes speakers, devices or environments incorrectly, metrics can look better than production behavior.

It is also not a full serving system. The artifact can be loaded and used, but production still requires versioning, dataset traceability, regression tests, latency measurement, drift monitoring and validation with real data. ASTW lowers the entry barrier to audio modeling. It does not remove MLOps responsibilities.

At Valendra, we prefer bounded artifacts with explicit limits over broad promises. A small, well-defined tool is often more useful than a large ambiguous claim.

Why open source

We include complete examples because open source is not only about publishing code. It is also about helping people use it.

ASTW includes examples for animals, transport and speech emotion because they show three different patterns: environmental classes, machine-related events and a speech task where the split matters. They also produce metrics and confusion matrices because evaluation should not be optional.

At Valendra, we want to stand on the side of applied research: build, publish, expose assumptions and let others reproduce, challenge or adapt the result. Not every experiment has to become a product. Some experiments exist to open a path, reduce friction and leave a reusable piece for the next team.

ASTW is one experiment in that direction.

Getting started

Install from PyPI:

pip install astw

Local development install:

python -m pip install -e .

Install with example dependencies:

python -m pip install -e '.[examples]'

CLI prediction:

astw predict path/to/audio.wav --model exports/my-audio-model --top-k 5

Build a wheel:

python -m pip install build
python -m build
