Chapter 10: The Magic of Speech Recognition

# 1. A Noisy World

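Sound is produced when an object vibrates, travels through a medium, and is perceived once it reaches the human ear. The outer ear collects the raw sound wave, which is conducted through a series of structures to the cochlea. The cochlea is rich in auditory receptors that convert the sound wave into bioelectrical signals, and these signals travel along the auditory nerve to produce the sensation of hearing. Because the human auditory system is so complex, its physiological structure and hearing characteristics still cannot be fully explained from the standpoint of physiology and anatomy. For that reason, research on the characteristics of human hearing is currently limited to psychoacoustics and speech acoustics.

The range of intensities and frequencies that the human ear can hear is called the audible range. Within it, the subjective perception of sound is described mainly by three features (loudness, pitch, and timbre) and by properties such as the masking effect and high-frequency localization. Loudness, pitch, and timbre can subjectively describe any complex sound that has the three physical quantities of amplitude, frequency, and phase, so they are known as the "three elements" of sound. When several sound sources are present, properties such as the masking effect matter more; masking is a foundation of psychoacoustics.

Loudness: how strong a sound seems to the listener (commonly called volume). It is determined by the amplitude and by the listener's distance from the source: the larger the amplitude, the louder the sound, and the closer the listener is to the source, the louder the sound.

Pitch: how high or low a sound is. It is determined by frequency, measured in hertz (Hz): the higher the frequency, the higher the pitch. The human ear hears roughly 20 to 20,000 Hz. Sound below 20 Hz is called infrasound, and sound above 20,000 Hz is called ultrasound.

Timbre: the waveform determines the timbre of a sound. Because sounding objects differ in their materials, the sounds they make have different characteristics. Timbre itself is an abstract notion, but the waveform expresses it visually. Waveforms differ with timbre, so different timbres can be told apart by their waveforms.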
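
To make the three elements concrete, here is a minimal sketch that uses only the Python standard library. The 16 kHz sample rate and the helper names `tone`, `peak`, and `zero_crossings` are assumptions made for this illustration; they are not part of the chapter's recognition code. Amplitude shows up as the peak sample value, frequency as the number of upward zero crossings per second, and timbre as a change in waveform shape when a second harmonic is mixed in while the fundamental frequency stays the same.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)

def tone(freq_hz, amplitude, second_harmonic=0.0, seconds=1.0, phase=0.5):
    """Sampled sine wave, optionally mixed with its second harmonic.

    The small phase offset keeps samples from landing exactly on a zero crossing.
    """
    samples = []
    for t in range(int(SAMPLE_RATE * seconds)):
        angle = 2 * math.pi * freq_hz * t / SAMPLE_RATE + phase
        samples.append(amplitude * (math.sin(angle) + second_harmonic * math.sin(2 * angle)))
    return samples

def peak(samples):
    """Largest absolute sample value, a stand-in for amplitude."""
    return max(abs(s) for s in samples)

def zero_crossings(samples):
    """Count upward zero crossings; a pure tone has one per cycle."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

quiet, loud = tone(100, 0.2), tone(100, 0.8)                      # same pitch, different loudness
low, high = tone(100, 0.5), tone(200, 0.5)                        # same loudness, different pitch
pure, rich = tone(100, 0.5), tone(100, 0.5, second_harmonic=0.5)  # different timbre

print(peak(quiet) < peak(loud))
print(zero_crossings(low), zero_crossings(high))
print(zero_crossings(pure) == zero_crossings(rich), pure != rich)
```

The last line is the timbre case: both tones repeat 100 times per second, so they share a pitch, yet their sample values differ because the waveform shape differs.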

# 2. Applications of Speech Recognition

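Speech recognition is already widely used in everyday life. The voice-to-text feature in WeChat chats is a typical example. Speech input systems fit people's everyday habits better than keyboard input and are more natural and efficient. Voice control systems, which use speech to operate devices, are faster and more convenient than manual control and can be used in many areas such as industrial control, voice dialing, smart home appliances, and voice-controlled smart toys. Intelligent dialogue and query systems act on a customer's spoken requests and give users natural, friendly access to database retrieval services, for example home services, hotel services, travel agency systems, ticket booking, medical services, banking services, and stock quote services.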

# 3. How Speech Recognition Works

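Speech recognition is a very complex task, and reaching a practical level of accuracy is not easy. One way to think of it is as a classification task: for every sound a person utters, find the corresponding written character. As you can imagine, this classification is very hard. But speech recognition also has a simpler side. Human language is highly regular, and a recognizer should take those regularities into account.

First, every language has characteristic sound patterns. Take Chinese as an example: we all learned pinyin, and pinyin tells us how to pronounce a character we do not recognize. There are far fewer pinyin initials and finals than there are Chinese characters, so the acoustic properties of Chinese can be used to improve recognition accuracy.

Second, the way Chinese is used also follows patterns. Suppose the acoustics lead us to recognize the word "hen hao". It is much more likely to be "很好" (very good) than "狠好", because the former is meaningful in Chinese and occurs frequently.

A recognizer first cuts a stretch of speech into many short segments, a step called framing. Each frame is then recognized as a state, and the states are combined into phonemes. Phonemes are generally the familiar initials and finals, while a state is a finer-grained unit of speech than a phoneme; a phoneme usually contains three states. Converting a sequence of speech frames into phonemes relies on the acoustic properties of the language, so this part is called the acoustic model. Going from phonemes to text relies on how the language is used, so that the right characters can be chosen among homophones and assembled into a sentence with a clear meaning; this part is called the language model.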
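
The role of the language model can be sketched with a toy example. The counts below are invented for this illustration (they are not measured corpus statistics), and `pick_word` is a hypothetical helper: the acoustic side proposes every candidate that sounds like "hen hao", and the language-model side keeps the candidate that is used more often.

```python
# Hypothetical usage counts for candidates that share the pinyin "hen hao".
# The numbers are made up for this sketch; a real language model is
# estimated from a large text corpus.
TOY_COUNTS = {
    "hen hao": {"很好": 9500, "狠好": 3},
}

def pick_word(pinyin, counts=TOY_COUNTS):
    """Return the most frequent candidate and its share among the homophones."""
    candidates = counts[pinyin]
    total = sum(candidates.values())
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / total

word, share = pick_word("hen hao")
print(word, round(share, 4))
```

Real systems score whole word sequences rather than single words, but the principle is the same: among candidates that sound alike, prefer the one the language actually uses.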

# 3.1 A Classic Acoustic Feature: Mel-Frequency Cepstral Coefficients (MFCC)

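To classify sounds we could in principle use the raw spectrum of the audio directly as the basis for the decision, but that is difficult in practice. We need a lower-dimensional feature to represent the sound, and Mel-frequency cepstral coefficients are one of the best choices. MFCCs, proposed by Davis and Mermelstein in 1980, are widely used in speech recognition. They give a rough description of the shape of the spectrum and therefore of how much energy the sound has at different frequencies. MFCCs also roughly reflect the formants of a sound (regions of the spectrum where energy is relatively concentrated).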
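
The "Mel" in MFCC refers to a perceptual frequency scale. One widely used conversion is m = 2595 · log10(1 + f / 700); the chapter does not say which variant it has in mind, so treat the formula and the helper names `hz_to_mel` and `mel_to_hz` as assumptions of this sketch. It shows that the same 100 Hz step covers fewer mels at high frequencies than at low ones, which is why mel-based features give more resolution to low frequencies.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

low_step = hz_to_mel(1100) - hz_to_mel(1000)   # a 100 Hz step near 1 kHz
high_step = hz_to_mel(4100) - hz_to_mel(4000)  # the same step near 4 kHz
print(low_step > high_step)
```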

# 3.2 The MFCC Feature Extraction Process

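MFCC extraction has two steps. First, the spectrum of the input audio signal is processed on the Mel frequency scale to obtain a 26-dimensional feature. Then the cepstrum of that feature is computed to give the final 13-dimensional MFCC feature. The detailed computation is fairly involved and is not covered here. What we need to know is that, during feature extraction, the audio is divided into short, equally spaced segments that may overlap, and MFCC features are extracted from each segment. Two parameters control the segmentation, the window width and the window step, and both can be tuned to suit the audio. A common choice is a 25 ms window with a 10 ms step. With this background in place, interested readers can follow along with a worked speech recognition example.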
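
The windowing arithmetic can be sketched as follows. The sketch assumes a 16 kHz sample rate and no padding at the end of the clip, and `frame_count` is a hypothetical helper name. With a 25 ms window and a 10 ms step, each window holds 400 samples and consecutive windows start 160 samples apart, so neighbouring windows overlap by 240 samples.

```python
def frame_count(num_samples, sample_rate, window_ms=25, step_ms=10):
    """Return (number of full windows, window size, step size) in samples."""
    window = sample_rate * window_ms // 1000
    step = sample_rate * step_ms // 1000
    if num_samples < window:
        return 0, window, step
    return 1 + (num_samples - window) // step, window, step

frames, window, step = frame_count(16000, 16000)  # one second of audio
print(frames, window, step)
```

Under these assumptions, a one-second clip with 13 coefficients per window would be described by a matrix with `frames` rows and 13 columns.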

# 4. Further Reading: Code Walkthrough

Import the required modules

import os
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display

Set the random seed

seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
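
Why a seed matters can be shown with the standard library's `random` module, used here as a stand-in for the TensorFlow and NumPy generators seeded above. The helper name `draw` is hypothetical. The same seed reproduces the same sequence, which makes shuffles and weight initialisation repeatable from run to run.

```python
import random

def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator created with the given seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

print(draw(42) == draw(42))  # same seed, same sequence
print(draw(42) == draw(43))  # different seed, different sequence
```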

# 4.1 Preparing the Data

data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')

List the words that the audio clips correspond to: 'no', 'stop', 'go', 'yes', 'down', 'right', 'up', 'left'

commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']
print('Commands:', commands)

Collect the audio file paths into a list and shuffle them

filenames = tf.io.gfile.glob(str(data_dir) + '/*/*')
filenames = tf.random.shuffle(filenames)
num_samples = len(filenames)
print('Number of total examples:', num_samples)
print('Number of examples per label:',
      len(tf.io.gfile.listdir(str(data_dir/commands[0]))))
print('Example file tensor:', filenames[0])

Split the dataset. There are 8,000 audio samples in total, divided into training, validation, and test sets in an 8:1:1 ratio

train_files = filenames[:6400]
val_files = filenames[6400: 6400 + 800]
test_files = filenames[-800:]

print('Training set size', len(train_files))
print('Validation set size', len(val_files))
print('Test set size', len(test_files))
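
The slicing above hard-codes 6400 and 800, which is correct only when there are exactly 8,000 files. A more general version of the same 8:1:1 split can be sketched in plain Python as below. `split_8_1_1` is a hypothetical helper, not a TensorFlow function, and the sketch shuffles a plain list rather than a tensor.

```python
import random

def split_8_1_1(items, seed=42):
    """Shuffle items and split them into train/validation/test parts (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_8_1_1(range(8000))
print(len(train), len(val), len(test))
```

The three parts are disjoint by construction, so no clip used for training can leak into validation or testing.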

# 4.2 Inspecting the Audio and Labels

Define helpers that decode an audio file and read its label from the file path

def decode_audio(audio_binary):
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)
def get_label(file_path):
  parts = tf.strings.split(file_path, os.path.sep)
  return parts[-2]
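
`get_label` relies on the directory layout: every clip is stored in a folder named after the word spoken, so the second-to-last path component is the label. The same idea in plain Python, with a hypothetical file name and helper name chosen only for illustration:

```python
import os

def label_from_path(file_path):
    """Return the name of the folder that directly contains the file."""
    parts = file_path.split(os.path.sep)
    return parts[-2]

# Hypothetical example path; the file name is made up for this sketch.
example = os.path.join("data", "mini_speech_commands", "yes", "example_clip.wav")
print(label_from_path(example))
```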

Define a function that returns the waveform and label for a file

def get_waveform_and_label(file_path):
  label = get_label(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveform = decode_audio(audio_binary)
  return waveform, label

# AUTOTUNE = tf.data.AUTOTUNE  # use this form on newer TensorFlow versions (e.g. 2.5)
AUTOTUNE = tf.data.experimental.AUTOTUNE
files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)

Plot a few audio clips with their labels as a sanity check; this code is not needed for training

rows = 3
cols = 3
n = rows*cols
fig, axes = plt.subplots(rows, cols, figsize=(10, 12))
for i, (audio, label) in enumerate(waveform_ds.take(n)):
  r = i // cols
  c = i % cols
  ax = axes[r][c]
  ax.plot(audio.numpy())
  ax.set_yticks(np.arange(-1.2, 1.2, 0.2))
  label = label.numpy().decode('utf-8')
  ax.set_title(label)

plt.show()

Define a function that computes the spectrogram of an audio clip

def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram
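
The shape of the resulting spectrogram can be worked out by hand. This sketch assumes the `tf.signal.stft` defaults that the call above leaves unchanged: no padding at the end of the signal, and an FFT length equal to the smallest power of two that is at least `frame_length`. `stft_shape` is a hypothetical helper name.

```python
def stft_shape(num_samples, frame_length=255, frame_step=128):
    """Return (time frames, frequency bins) of the magnitude spectrogram."""
    num_frames = 1 + (num_samples - frame_length) // frame_step
    fft_length = 1
    while fft_length < frame_length:  # smallest power of two >= frame_length
        fft_length *= 2
    num_bins = fft_length // 2 + 1    # a real-valued FFT keeps half the bins plus one
    return num_frames, num_bins

print(stft_shape(16000))
```

Because every clip is zero-padded to 16,000 samples first, all spectrograms come out with the same shape, which is what lets them be batched and fed to a fixed-size network.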

Next, take one clip from the dataset and look at its waveform, label, and spectrogram

for waveform, label in waveform_ds.take(1):
  label = label.numpy().decode('utf-8')
  spectrogram = get_spectrogram(waveform)

print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000))

Plot the waveform and spectrogram of the sample

def plot_spectrogram(spectrogram, ax):
  # Convert the magnitudes to log scale and transpose so that time is
  # represented on the x-axis (columns). A small epsilon avoids log(0)
  # for the zero-padded part of a clip.
  log_spec = np.log(spectrogram.T + np.finfo(float).eps)
  height = log_spec.shape[0]
  width = log_spec.shape[1]
  X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, log_spec)


fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, 16000])
plot_spectrogram(spectrogram.numpy(), axes[1])
axes[1].set_title('Spectrogram')
plt.show()

# 4.3 Spectrograms

Now convert the waveform dataset into a dataset of spectrogram images paired with the integer ID of each label.

def get_spectrogram_and_label_id(audio, label):
  spectrogram = get_spectrogram(audio)
  spectrogram = tf.expand_dims(spectrogram, -1)
  label_id = tf.argmax(label == commands)
  return spectrogram, label_id

spectrogram_ds = waveform_ds.map(
    get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE)
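
The expression `tf.argmax(label == commands)` compares the label with every command, which gives a boolean mask, and then returns the position of the `True` entry. A plain-Python sketch of the same idea follows. The order of `commands` here is an assumption for illustration; in the walkthrough it depends on how the dataset directory is listed.

```python
# Assumed ordering, for illustration only.
commands = ['down', 'go', 'left', 'no', 'right', 'stop', 'up', 'yes']

def label_to_id(label, commands=commands):
    """Index of the first command equal to label (0 if none matches, like argmax)."""
    mask = [label == c for c in commands]  # e.g. [False, False, False, True, ...]
    return max(range(len(mask)), key=lambda i: mask[i])

print(label_to_id('no'), label_to_id('yes'))
```

Note the fallback: a label that matches no command maps to 0, the same as a real class, so it is worth checking that every folder name appears in `commands`.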

View the spectrograms of different clips in the dataset

rows = 3
cols = 3
n = rows*cols
fig, axes = plt.subplots(rows, cols, figsize=(10, 10))
for i, (spectrogram, label_id) in enumerate(spectrogram_ds.take(n)):
  r = i // cols
  c = i % cols
  ax = axes[r][c]
  plot_spectrogram(np.squeeze(spectrogram.numpy()), ax)
  ax.set_title(commands[label_id.numpy()])
  ax.axis('off')

plt.show()

# 4.4 Building and Training the Model

Apply the same preprocessing to the training, validation, and test sets so that they are ready for training

def preprocess_dataset(files):
  files_ds = tf.data.Dataset.from_tensor_slices(files)
  output_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
  output_ds = output_ds.map(
      get_spectrogram_and_label_id,  num_parallel_calls=AUTOTUNE)
  return output_ds
train_ds = spectrogram_ds
val_ds = preprocess_dataset(val_files)
test_ds = preprocess_dataset(test_files)

Set the batch size for training

batch_size = 64
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)
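
A quick check of what batching means for the split sizes used in this walkthrough (6,400 training and 800 validation clips). The helper names are hypothetical; the arithmetic assumes the final, smaller batch is kept rather than dropped, which is the default behaviour of `batch()`.

```python
import math

def num_batches(num_examples, batch_size=64):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

def last_batch_size(num_examples, batch_size=64):
    """Size of the final batch, which may be smaller than batch_size."""
    return num_examples % batch_size or batch_size

print(num_batches(6400), last_batch_size(6400))
print(num_batches(800), last_batch_size(800))
```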

Use the dataset cache() and prefetch() methods to reduce read latency while the model trains

train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)

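Define the model. Here we use a convolutional neural network (CNN). You may wonder: aren't CNNs meant for image data, so why use one to classify audio? The reason is that feature extraction turned each audio clip into a spectrogram, so the sound information is now contained in an image, and a CNN can therefore work well as the model.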

for spectrogram, _ in spectrogram_ds.take(1):
  input_shape = spectrogram.shape
print('Input shape:', input_shape)
num_labels = len(commands)

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.summary()
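
The layer sizes that `model.summary()` reports can be checked by hand. The sketch below assumes 8 output labels and the Keras defaults used above (3×3 kernels with no padding and stride 1, 2×2 max pooling); the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D convolution layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

side = 32                # after Resizing(32, 32), one channel
side = conv_out(side)    # first Conv2D  -> 30 x 30 x 32
side = conv_out(side)    # second Conv2D -> 28 x 28 x 64
side = side // 2         # MaxPooling2D  -> 14 x 14 x 64
flat = side * side * 64  # Flatten

trainable = (conv_params(3, 1, 32) + conv_params(3, 32, 64)
             + dense_params(flat, 128) + dense_params(128, 8))
print(flat, trainable)
```

Almost all of the parameters sit in the first dense layer, which is one reason the spectrogram is resized to 32×32 before the convolutions.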

Set the optimizer and loss function

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
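
`from_logits=True` tells the loss that the last `Dense` layer outputs raw scores rather than probabilities, so the softmax is applied inside the loss. What that computes for a single example can be sketched in plain Python; this is a simplified illustration, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label_id):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label_id])

uncertain = sparse_categorical_crossentropy([0.0] * 8, 3)             # all 8 classes equally likely
confident = sparse_categorical_crossentropy([0.0] * 3 + [5.0] + [0.0] * 4, 3)
print(uncertain > confident)
```

"Sparse" refers to the label format: the true class is given as an integer ID, as produced in section 4.3, rather than as a one-hot vector.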

Start training for up to 10 epochs; the EarlyStopping callback may end training sooner

EPOCHS = 10
history = model.fit(
    train_ds, 
    validation_data=val_ds,  
    epochs=EPOCHS,
    callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
)
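
The effect of `patience=2` can be sketched as follows. This is a simplified model of early stopping on the validation loss (no minimum improvement threshold), not the Keras source, and the loss values are invented for illustration.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached at epoch 2,
# and the next two epochs fail to beat it.
print(early_stop_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.5]))
```

In this made-up run the later value 0.5 is never reached, because training stops after two consecutive epochs without a new best.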

Plot the training and validation loss to review the result

metrics = history.history
plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.show()

# 4.5 Complete Code

import os
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display


# Set seed for experiment reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')
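commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']

# List and shuffle the audio files, then split them 8:1:1 (see section 4.1);
# train_files, val_files and test_files are used further down.
filenames = tf.io.gfile.glob(str(data_dir) + '/*/*')
filenames = tf.random.shuffle(filenames)
train_files = filenames[:6400]
val_files = filenames[6400: 6400 + 800]
test_files = filenames[-800:]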

def decode_audio(audio_binary):
  audio, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audio, axis=-1)
def get_label(file_path):
  parts = tf.strings.split(file_path, os.path.sep)
  return parts[-2]
def get_waveform_and_label(file_path):
  label = get_label(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveform = decode_audio(audio_binary)
  return waveform, label
AUTOTUNE = tf.data.experimental.AUTOTUNE
files_ds = tf.data.Dataset.from_tensor_slices(train_files)
waveform_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
def get_spectrogram(waveform):
  # Padding for files with less than 16000 samples
  zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

  # Concatenate audio with padding so that all audio clips will be of the 
  # same length
  waveform = tf.cast(waveform, tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  spectrogram = tf.abs(spectrogram)

  return spectrogram
def plot_spectrogram(spectrogram, ax):
  # Convert the magnitudes to log scale and transpose so that time is
  # represented on the x-axis (columns). A small epsilon avoids log(0)
  # for the zero-padded part of a clip.
  log_spec = np.log(spectrogram.T + np.finfo(float).eps)
  height = log_spec.shape[0]
  width = log_spec.shape[1]
  X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
  Y = range(height)
  ax.pcolormesh(X, Y, log_spec)

def get_spectrogram_and_label_id(audio, label):
  spectrogram = get_spectrogram(audio)
  spectrogram = tf.expand_dims(spectrogram, -1)
  label_id = tf.argmax(label == commands)
  return spectrogram, label_id

spectrogram_ds = waveform_ds.map(
    get_spectrogram_and_label_id, num_parallel_calls=AUTOTUNE)

def preprocess_dataset(files):
  files_ds = tf.data.Dataset.from_tensor_slices(files)
  output_ds = files_ds.map(get_waveform_and_label, num_parallel_calls=AUTOTUNE)
  output_ds = output_ds.map(
      get_spectrogram_and_label_id,  num_parallel_calls=AUTOTUNE)
  return output_ds

train_ds = spectrogram_ds
val_ds = preprocess_dataset(val_files)
test_ds = preprocess_dataset(test_files)

batch_size = 64
train_ds = train_ds.batch(batch_size)
val_ds = val_ds.batch(batch_size)

train_ds = train_ds.cache().prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)

for spectrogram, _ in spectrogram_ds.take(1):
  input_shape = spectrogram.shape
print('Input shape:', input_shape)
num_labels = len(commands)

norm_layer = preprocessing.Normalization()
norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
EPOCHS = 10
history = model.fit(
    train_ds, 
    validation_data=val_ds,  
    epochs=EPOCHS,
    callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
)
metrics = history.history
plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.show()
