Tuesday, June 6, 2023

RStudio AI Blog: Audio classification with torch

Variations on a theme

Simple audio classification with Keras, Audio classification with Keras: Looking closer at the non-deep learning parts, Simple audio classification with torch: No, this is not the first post on this blog that introduces speech classification using deep learning. With two of those posts (the "applied" ones) it shares the general setup, the type of deep-learning architecture employed, and the dataset used. With the third, it has in common the interest in the ideas and concepts involved. Each of these posts has a different focus, so should you read this one?

Well, of course I can't say "no", all the more so because, here, you have an abbreviated and condensed version of the chapter on this topic in the forthcoming book from CRC Press, Deep Learning and Scientific Computing with R torch. Compared with the previous post that used torch, written by the creator and maintainer of torchaudio, Athos Damiani, significant developments have taken place in the torch ecosystem, the end result being that the code got a lot easier (especially in the model training part). That said, let's end the preamble already, and plunge into the topic!

Inspecting the data

We use the speech commands dataset (Warden (2018)) that comes with torchaudio. The dataset holds recordings of thirty different one- or two-syllable words, uttered by different speakers. There are about 65,000 audio files overall. Our task will be to predict, from the audio alone, which of thirty possible words was pronounced.


library(torch)
library(torchaudio)
library(luz)

ds <- speechcommand_dataset(
  root = "~/.torch-datasets",
  url = "speech_commands_v0.01",
  download = TRUE
)

We start by inspecting the data.

ds$classes
[1]  "bed"    "bird"   "cat"    "dog"    "down"   "eight"
[7]  "five"   "four"   "go"     "happy"  "house"  "left"
[13] "marvin" "nine"   "no"     "off"    "on"     "one"
[19] "right"  "seven"  "sheila" "six"    "stop"   "three"
[25] "tree"   "two"    "up"     "wow"    "yes"    "zero"

Picking a sample at random, we see that the information we'll need is contained in four properties: waveform, sample_rate, label_index, and label.

The first, waveform, will be our predictor.

sample <- ds[2000]
dim(sample$waveform)
[1]     1 16000

Individual tensor values are centered at zero, and range between -1 and 1. There are 16,000 of them, reflecting the fact that the recording lasted for one second, and was registered at (or has been converted to, by the dataset creators) a rate of 16,000 samples per second. The latter information is stored in sample$sample_rate:

sample$sample_rate
[1] 16000

All recordings have been sampled at the same rate. Their length almost always equals one second; the very few sounds that are minimally longer we can safely truncate.

Finally, the target is stored, in integer form, in sample$label_index, the corresponding word being available from sample$label:

sample$label
sample$label_index
[1] "bird"
[ CPULongType{} ]

How does this audio sign “look?”


library(ggplot2)

df <- data.frame(
  x = 1:length(sample$waveform[1]),
  y = as.numeric(sample$waveform[1])
)

# plot amplitude against sample position
ggplot(df, aes(x = x, y = y)) +
  geom_line(size = 0.3) +
  ggtitle(
    paste0(
      "The spoken word \"", sample$label, "\": Sound wave"
    )
  ) +
  xlab("time") +
  ylab("amplitude")

The spoken word “bird,” in time-domain representation.

What we see is a sequence of amplitudes, reflecting the sound wave produced by someone saying "bird." Put differently, we have here a time series of "loudness values." Even for experts, guessing which word resulted in those amplitudes is an impossible task. This is where domain knowledge comes in. The expert may not be able to make much of the signal in this representation; but they may know a way to represent it more meaningfully.

Two equivalent representations

Imagine that instead of as a sequence of amplitudes over time, the above wave were represented in a way that had no information about time at all. Next, imagine we took that representation and tried to recover the original signal. For that to be possible, the new representation would somehow have to contain "just as much" information as the wave we started from. That "just as much" is obtained from the Fourier Transform, and it consists of the magnitudes and phase shifts of the different frequencies that make up the signal.

How, then, does the Fourier-transformed version of the "bird" sound wave look? We obtain it by calling torch_fft_fft() (where fft stands for Fast Fourier Transform):

dft <- torch_fft_fft(sample$waveform)
dim(dft)
[1]     1 16000

The length of this tensor is the same; however, its values are not in chronological order. Instead, they represent the Fourier coefficients, corresponding to the frequencies contained in the signal. The higher their magnitude, the more they contribute to the signal:

mag <- torch_abs(dft[1, ])

# keep the first half of the coefficients only
df <- data.frame(
  x = 1:(length(sample$waveform[1]) / 2),
  y = as.numeric(mag[1:8000])
)

ggplot(df, aes(x = x, y = y)) +
  geom_line(size = 0.3) +
  ggtitle(
    paste0(
      "The spoken word \"",
      sample$label,
      "\": Discrete Fourier Transform"
    )
  ) +
  xlab("frequency") +
  ylab("magnitude")

The spoken word “bird,” in frequency-domain representation.

From this alternate representation, we could go back to the original sound wave by taking the frequencies present in the signal, weighting them according to their coefficients, and adding them up. But in sound classification, timing information must surely matter; we don't really want to throw it away.
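The round trip between the two representations can be tried out without torch. The following is a minimal sketch using base R's fft() on a made-up signal; the two sine components, their frequencies, and all variable names are assumptions chosen for illustration, not part of the analysis above.

```r
# Toy signal: one second at 16,000 samples per second, made up of
# a 440 Hz and a weaker 1,000 Hz sine wave (assumed values).
sample_rate <- 16000
t <- (0:(sample_rate - 1)) / sample_rate
signal <- sin(2 * pi * 440 * t) + 0.5 * sin(2 * pi * 1000 * t)

# Forward transform: one complex coefficient per frequency bin.
coefs <- fft(signal)

# Magnitudes tell us how strongly each frequency contributes.
mag <- Mod(coefs)[1:(sample_rate / 2)]
strongest_bin <- which.max(mag)
strongest_hz <- (strongest_bin - 1) * sample_rate / length(signal)

# Inverse transform: base R's fft() is unnormalized,
# so we divide by the signal length.
recovered <- Re(fft(coefs, inverse = TRUE)) / length(signal)

max_abs_error <- max(abs(recovered - signal))
```

Here, strongest_hz identifies the dominant frequency, and max_abs_error shows that the original wave is recovered up to floating-point error. What the coefficients do not tell us directly is when in the recording a given frequency occurred.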

Combining representations: The spectrogram

In fact, what really would help us is a synthesis of both representations; some sort of "have your cake and eat it, too." What if we could divide the signal into small chunks, and run the Fourier Transform on each of them? As you may have guessed from this lead-up, this indeed is something we can do; and the representation it creates is called the spectrogram.

With a spectrogram, we still keep some time-domain information (some, since there is an unavoidable loss in granularity). On the other hand, for each of the time segments, we learn about their spectral composition. There's an important point to be made, though. The resolutions we get in time versus in frequency, respectively, are inversely related. If we split up the signals into many chunks (called "windows"), the frequency representation per window will not be very fine-grained. Conversely, if we want to get better resolution in the frequency domain, we have to choose longer windows, thus losing information about how spectral composition varies over time. What sounds like a big problem (and in many cases, will be) won't be one for us, though, as you'll see very soon.
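To put numbers on this inverse relationship, here is a small sketch in base R. It assumes that the number of FFT points equals the window length (no zero-padding); the three window sizes are arbitrary examples.

```r
sample_rate <- 16000                 # samples per second, as in the dataset
window_sizes <- c(128, 512, 2048)    # window lengths in samples (assumed values)

resolution <- data.frame(
  window_size = window_sizes,
  # duration covered by one window: longer windows blur timing
  window_ms = 1000 * window_sizes / sample_rate,
  # spacing between adjacent frequency bins: longer windows refine it
  bin_width_hz = sample_rate / window_sizes
)

print(resolution)
```

Whatever the window size, the product of window duration and bin width stays constant, so any gain in one kind of resolution is paid for in the other.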

First, though, let's create and inspect such a spectrogram for our example signal. In the following code snippet, the size of the (overlapping) windows is chosen so as to allow for reasonable granularity in both the time and the frequency domain. We're left with sixty-three windows, and, for each window, obtain two hundred fifty-seven coefficients:

fft_size <- 512
window_size <- 512
power <- 0.5

spectrogram <- transform_spectrogram(
  n_fft = fft_size,
  win_length = window_size,
  normalized = TRUE,
  power = power
)

spec <- spectrogram(sample$waveform)$squeeze()
dim(spec)
[1]   257 63
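Where do these two numbers come from? The following sketch reproduces them with plain arithmetic. It rests on two assumptions that are not stated in the text: that the hop length defaults to half the window length, and that frames are centered, with the signal padded at both ends. If your torchaudio version uses different defaults, the frame count will differ.

```r
n_samples <- 16000                # one second at 16,000 samples per second
n_fft <- 512
win_length <- 512
hop_length <- win_length %/% 2    # assumed default: half the window

# A real-valued signal yields n_fft / 2 + 1 distinct frequency bins.
n_freq_bins <- n_fft %/% 2 + 1

# With centered (padded) frames, a new frame starts every hop_length samples.
n_frames <- n_samples %/% hop_length + 1

c(n_freq_bins, n_frames)
```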

We can display the spectrogram visually:

bins <- 1:dim(spec)[1]
freqs <- bins / (fft_size / 2 + 1) * sample$sample_rate
log_freqs <- log10(freqs)

frames <- 1:(dim(spec)[2])
seconds <- (frames / dim(spec)[2]) *
  (dim(sample$waveform$squeeze())[1] / sample$sample_rate)

image(x = as.numeric(seconds),
      y = log_freqs,
      z = t(as.matrix(spec)),
      ylab = 'log frequency [Hz]',
      xlab = 'time [s]',
      col = hcl.colors(12, palette = "viridis")
)
main <- paste0("Spectrogram, window size = ", window_size)
sub <- "Magnitude (square root)"
mtext(side = 3, line = 2, at = 0, adj = 0, cex = 1.3, main)
mtext(side = 3, line = 1, at = 0, adj = 0, cex = 1, sub)

The spoken word “bird”: Spectrogram.

We know that we've lost some resolution in both time and frequency. By displaying the square root of the coefficients' magnitudes, though, and thus enhancing sensitivity, we were still able to obtain a reasonable result. (With the viridis color scheme, long-wave shades indicate higher-valued coefficients; short-wave ones, the opposite.)

Finally, let's get back to the crucial question. If this representation, by necessity, is a compromise, why, then, would we want to employ it? This is where we take the deep-learning perspective. The spectrogram is a two-dimensional representation: an image. With images, we have access to a rich reservoir of techniques and architectures: Among all areas deep learning has been successful in, image recognition still stands out. Soon, you'll see that for this task, fancy architectures are not even needed; a straightforward convnet will do a very good job.

Training a neural network on spectrograms

We start by creating a torch::dataset() that, starting from the original speechcommand_dataset(), computes a spectrogram for every sample.

spectrogram_dataset <- dataset(
  inherit = speechcommand_dataset,
  initialize = function(...,
                        pad_to = 16000,
                        sampling_rate = 16000,
                        n_fft = 512,
                        window_size_seconds = 0.03,
                        window_stride_seconds = 0.01,
                        power = 2) {
    self$pad_to <- pad_to
    # convert window size and stride from seconds to samples
    self$window_size_samples <- sampling_rate *
      window_size_seconds
    self$window_stride_samples <- sampling_rate *
      window_stride_seconds
    self$power <- power
    self$spectrogram <- transform_spectrogram(
        n_fft = n_fft,
        win_length = self$window_size_samples,
        hop_length = self$window_stride_samples,
        normalized = TRUE,
        power = self$power
    )
    # pass the remaining arguments on to speechcommand_dataset
    super$initialize(...)
  },
  .getitem = function(i) {
    item <- super$.getitem(i)

    x <- item$waveform
    # make sure all samples have the same length (pad_to):
    # shorter ones will be padded,
    # longer ones will be truncated
    x <- nnf_pad(x, pad = c(0, self$pad_to - dim(x)[2]))
    x <- x %>% self$spectrogram()

    if (is.null(self$power)) {
      # in this case, there is an additional dimension, in position 4,
      # that we want to appear in front
      # (as a second channel)
      x <- x$squeeze()$permute(c(3, 1, 2))
    }

    y <- item$label_index
    list(x = x, y = y)
  }
)

In the parameter list to spectrogram_dataset(), note power, with a default value of 2. This is the value that, unless told otherwise, torch's transform_spectrogram() will assume that power should have. Under these circumstances, the values that make up the spectrogram are the squared magnitudes of the Fourier coefficients. Using power, you can change the default, and specify, for example, that you'd like absolute values (power = 1), any other positive value (such as 0.5, the one we used above to display a concrete example), or both the real and imaginary parts of the coefficients (power = NULL).
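To see what these settings mean for an individual value, consider a single complex coefficient. The sketch below uses base R complex arithmetic on a made-up coefficient; it only illustrates the exponent, and leaves out the additional scaling that normalized = TRUE applies.

```r
# A single (made-up) Fourier coefficient: real part 3, imaginary part 4.
coef <- complex(real = 3, imaginary = 4)

magnitude <- Mod(coef)          # length of the complex number

power_2 <- magnitude^2          # squared magnitude (the default)
power_1 <- magnitude^1          # absolute value
power_half <- magnitude^0.5     # square root: compresses large values

# power = NULL keeps the complex value itself,
# that is, both its real and its imaginary part.
power_null <- c(real = Re(coef), imaginary = Im(coef))

# The phase is what gets lost once only magnitudes are kept.
phase <- Arg(coef)
```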

Display-wise, of course, the full complex representation is inconvenient; the spectrogram plot would need an additional dimension. But we may well wonder whether a neural network could profit from the additional information contained in the "whole" complex number. After all, when reducing to magnitudes we lose the phase shifts for the individual coefficients, which might contain usable information. In fact, my tests showed that it did; use of the complex values resulted in enhanced classification accuracy.

Let’s see what we get from spectrogram_dataset():

ds <- spectrogram_dataset(
  root = "~/.torch-datasets",
  url = "speech_commands_v0.01",
  download = TRUE,
  power = NULL
)

dim(ds[1]$x)
[1]   2 257 101

We have 257 coefficients for 101 windows; and each coefficient is represented by both its real and imaginary parts.
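The channel layout can be mimicked in base R. In this sketch, a random complex matrix stands in for a real spectrogram; the frame count again assumes centered frames, and aperm() plays the role that permute(c(3, 1, 2)) plays in the dataset code.

```r
sampling_rate <- 16000
n_fft <- 512
window_stride_samples <- sampling_rate * 0.01    # 0.01 seconds per stride

n_freq_bins <- n_fft %/% 2 + 1
n_frames <- 16000 %/% window_stride_samples + 1  # assumes centered frames

# Stand-in for a complex spectrogram: random values, not real audio.
set.seed(1)
spec <- matrix(
  complex(real = rnorm(n_freq_bins * n_frames),
          imaginary = rnorm(n_freq_bins * n_frames)),
  nrow = n_freq_bins
)

# Stack real and imaginary parts along a trailing dimension,
# then move that dimension to the front (channels first).
stacked <- array(c(Re(spec), Im(spec)), dim = c(n_freq_bins, n_frames, 2))
channels_first <- aperm(stacked, c(3, 1, 2))

dim(channels_first)
```

The first channel then holds the real parts and the second the imaginary parts, which is the layout a two-channel convolution expects.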

Next, we split up the data, and instantiate the dataset() and dataloader() objects.

# 60% of the samples for training
train_ids <- sample(
  1:length(ds),
  size = 0.6 * length(ds)
)
# 20% for validation, drawn from what is left
valid_ids <- sample(
  setdiff(
    1:length(ds),
    train_ids
  ),
  size = 0.2 * length(ds)
)
# the remainder is the test set
test_ids <- setdiff(
  1:length(ds),
  union(train_ids, valid_ids)
)

batch_size <- 128

train_ds <- dataset_subset(ds, indices = train_ids)
train_dl <- dataloader(
  train_ds,
  batch_size = batch_size, shuffle = TRUE
)

valid_ds <- dataset_subset(ds, indices = valid_ids)
valid_dl <- dataloader(
  valid_ds,
  batch_size = batch_size
)

test_ds <- dataset_subset(ds, indices = test_ids)
test_dl <- dataloader(test_ds, batch_size = 64)

b <- train_dl %>%
  dataloader_make_iter() %>%
  dataloader_next()

dim(b$x)
[1] 128   2 257 101

The model is a straightforward convnet, with dropout and batch normalization. The real and imaginary parts of the Fourier coefficients are passed to the model's initial nn_conv2d() as two separate channels.

model <- nn_module(
  initialize = function() {
    # five convolutional blocks; batch normalization and ReLU
    # follow each convolution
    self$features <- nn_sequential(
      nn_conv2d(2, 32, kernel_size = 3),
      nn_batch_norm2d(32),
      nn_relu(),
      nn_max_pool2d(kernel_size = 2),
      nn_dropout2d(p = 0.2),
      nn_conv2d(32, 64, kernel_size = 3),
      nn_batch_norm2d(64),
      nn_relu(),
      nn_max_pool2d(kernel_size = 2),
      nn_dropout2d(p = 0.2),
      nn_conv2d(64, 128, kernel_size = 3),
      nn_batch_norm2d(128),
      nn_relu(),
      nn_max_pool2d(kernel_size = 2),
      nn_dropout2d(p = 0.2),
      nn_conv2d(128, 256, kernel_size = 3),
      nn_batch_norm2d(256),
      nn_relu(),
      nn_max_pool2d(kernel_size = 2),
      nn_dropout2d(p = 0.2),
      nn_conv2d(256, 512, kernel_size = 3),
      nn_batch_norm2d(512),
      nn_relu(),
      nn_adaptive_avg_pool2d(c(1, 1)),
      nn_dropout2d(p = 0.2)
    )

    self$classifier <- nn_sequential(
      nn_linear(512, 512),
      nn_batch_norm1d(512),
      nn_relu(),
      nn_dropout(p = 0.5),
      nn_linear(512, 30)
    )
  },
  forward = function(x) {
    x <- self$features(x)$squeeze()
    x <- self$classifier(x)
    x
  }
)
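
To get a feeling for how the convolutional part shrinks its input, we can trace the spatial dimensions by hand. The sketch below assumes the usual defaults: convolutions with stride 1 and no padding, and max pooling with a stride equal to its kernel size. The helper functions conv_out() and pool_out() are made up for this illustration.

```r
# Spatial size after a convolution without padding (stride 1).
conv_out <- function(size, kernel_size = 3) size - kernel_size + 1

# Spatial size after max pooling with stride equal to the kernel size.
pool_out <- function(size, kernel_size = 2) size %/% kernel_size

size <- c(257, 101)   # frequency bins x time frames, per channel
for (block in 1:4) {
  size <- pool_out(conv_out(size))
  cat("after block", block, ":", size, "\n")
}

# The fifth convolution is not followed by max pooling.
size <- conv_out(size)
cat("before adaptive average pooling:", size, "\n")
```

Adaptive average pooling then reduces whatever is left to a single value per channel, which is why the classifier can start from 512 input features regardless of the exact spectrogram size.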

We next determine a suitable learning rate:

model <- model %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_accuracy())
  )

rates_and_losses <- model %>%
  lr_finder(train_dl)
rates_and_losses %>% plot()

Learning rate finder, run on the complex-spectrogram model.

Based on the plot, I decided to use 0.01 as a maximal learning rate. Training went on for forty epochs.

fitted <- model %>%
  fit(train_dl,
    epochs = 50, valid_data = valid_dl,
    callbacks = list(
      luz_callback_early_stopping(patience = 3),
      # one-cycle schedule, updated after every batch
      luz_callback_lr_scheduler(
        lr_one_cycle,
        max_lr = 1e-2,
        epochs = 50,
        steps_per_epoch = length(train_dl),
        call_on = "on_batch_end"
      ),
      luz_callback_model_checkpoint(path = "models_complex/")
    ),
    verbose = TRUE
  )

plot(fitted)

Fitting the complex-spectrogram model.

Let's check actual accuracies.


With thirty classes to distinguish between, a final validation-set accuracy of ~0.94 looks like a very decent result!

We can confirm this on the test set:

evaluate(fitted, test_dl)
loss: 0.2373
acc: 0.9324

An interesting question is which words get confused most often. (Of course, even more interesting is how error probabilities are related to features of the spectrograms, but this we have to leave to the true domain experts.) A nice way of displaying the confusion matrix is to create an alluvial plot. We see the predictions, on the left, "flow into" the target slots. (Target-prediction pairs less frequent than a thousandth of test set cardinality are hidden.)

Alluvial plot for the complex-spectrogram setup.
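
The counts behind such a plot come from a plain confusion matrix. As a minimal sketch, here is how one could tabulate targets against predictions in base R and list the most frequent confusions. The labels below are made up for illustration; they are not results from the model.

```r
# Made-up targets and predictions, for illustration only.
classes <- c("bird", "bed", "tree", "three")
target <- factor(
  c("bird", "bird", "bed", "bed", "tree",
    "tree", "three", "three", "tree", "bed"),
  levels = classes
)
predicted <- factor(
  c("bird", "bed", "bed", "bed", "three",
    "tree", "three", "tree", "three", "bed"),
  levels = classes
)

confusion <- table(target = target, predicted = predicted)

# Keep only the misclassified pairs and sort them by frequency.
errors <- as.data.frame(confusion, stringsAsFactors = FALSE)
errors <- errors[errors$target != errors$predicted & errors$Freq > 0, ]
errors <- errors[order(-errors$Freq), ]

accuracy <- sum(diag(confusion)) / sum(confusion)
print(errors)
```

In this toy data, the most frequent confusion is between two similar-sounding words, which is the kind of pattern the alluvial plot makes visible at a glance.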


That's it for today! In the upcoming weeks, expect more posts drawing on content from the soon-to-appear CRC book, Deep Learning and Scientific Computing with R torch. Thanks for reading!

Photo by alex lauzon on Unsplash

Warden, Pete. 2018. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” CoRR abs/1804.03209. http://arxiv.org/abs/1804.03209.


