Gender Recognition by Voice with TensorFlow in Python



Voice gender recognition is a technique that determines the gender of a speaker by processing speech signals. In this tutorial, we will try to classify gender by voice using the TensorFlow framework in Python.

Gender recognition can be useful in many fields, including automatic speech recognition, where it can help improve the performance of those systems. It can also be used to categorize calls by gender, or you can add it as a feature to a virtual assistant so that it can distinguish the caller's gender.

1. Preparing the Dataset


We won't be using raw audio data, since audio samples can be of arbitrary length and can be problematic in terms of noise. As a result, we need to perform some kind of feature extraction before feeding anything into a neural network.

Feature extraction is always the first stage of any speech analysis task. It takes audio of any length as input and outputs a fixed-length vector that is suitable for classification. Examples of feature extraction methods are MFCCs and the Mel spectrogram.
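To make that concrete, here is a minimal sketch (assuming librosa is installed; "sample.wav" is just a placeholder path) of how a signal of any length can be reduced to a fixed 128-length vector by averaging a Mel spectrogram over time:

import librosa
import numpy as np

# load the audio at librosa's default sample rate (22050 Hz)
y, sr = librosa.load("sample.wav")
# the Mel spectrogram has shape (n_mels, n_frames); n_frames depends on the audio length
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
# averaging over the time axis gives a fixed-length vector
mel_vector = np.mean(mel.T, axis=0)
print(mel_vector.shape)  # (128,)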

We will use Mozilla's Common Voice Dataset (https://www.kaggle.com/mozillaorg/common-voice), a dataset of speech read by users on the Common Voice website, intended for training and testing automatic speech recognition systems. After looking through the dataset, I found that many samples are actually labeled in the gender column, so we can extract those labeled samples and perform gender recognition.

Here is what I did to prepare the dataset for voice gender recognition:

First, I kept only the samples that are labeled in the gender field.

Then I balanced the dataset so that the number of female samples equals the number of male samples; this helps keep the neural network from overfitting to one particular gender.

Finally, I used the Mel spectrogram extraction technique to get a 128-length vector from each voice sample.
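The filtering and balancing steps above might look roughly like this (a minimal sketch, not the author's exact script; it assumes one of the Common Voice metadata CSVs with filename and gender columns, and "balanced.csv" is just a placeholder output name):

import pandas as pd

df = pd.read_csv("cv-valid-train.csv")
# keep only rows explicitly labeled male or female
df = df[df["gender"].isin(["male", "female"])]
# balance the two classes by downsampling the larger one
n = min((df["gender"] == "male").sum(), (df["gender"] == "female").sum())
balanced = pd.concat([
    df[df["gender"] == "male"].sample(n, random_state=42),
    df[df["gender"] == "female"].sample(n, random_state=42),
])
balanced.to_csv("balanced.csv", index=False)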

You can find the dataset prepared for this tutorial in this repository (https://github.com/x4nth055/gender-recognition-by-voice).

Also, if you want to generate the dataset yourself, run the following script, which converts the .mp3 files into .npy feature files:

import glob
import os
import pandas as pd
import numpy as np
import shutil
import librosa
from tqdm import tqdm


def extract_feature(file_name, **kwargs):
    """
    Extract feature from audio file `file_name`
        Features supported:
            - MFCC (mfcc)
            - Chroma (chroma)
            - MEL Spectrogram Frequency (mel)
            - Contrast (contrast)
            - Tonnetz (tonnetz)
        e.g:
        `features = extract_feature(path, mel=True, mfcc=True)`
    """
    mfcc = kwargs.get("mfcc")
    chroma = kwargs.get("chroma")
    mel = kwargs.get("mel")
    contrast = kwargs.get("contrast")
    tonnetz = kwargs.get("tonnetz")
    X, sample_rate = librosa.core.load(file_name)
    if chroma or contrast:
        stft = np.abs(librosa.stft(X))
    result = np.array([])
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, chroma))
    if mel:
        mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
        result = np.hstack((result, mel))
    if contrast:
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, contrast))
    if tonnetz:
        tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
        result = np.hstack((result, tonnetz))
    return result

dirname = "data"

if not os.path.isdir(dirname):
    os.mkdir(dirname)


csv_files = glob.glob("*.csv")

for j, csv_file in enumerate(csv_files):
    print("[+] Preprocessing", csv_file)
    df = pd.read_csv(csv_file)
    # only take filename and gender columns
    new_df = df[["filename", "gender"]]
    print("Previously:", len(new_df), "rows")
    # take only male & female genders (i.e. dropping NaNs & 'other' gender)
    new_df = new_df[np.logical_or(new_df['gender'] == 'female', new_df['gender'] == 'male')]
    print("Now:", len(new_df), "rows")
    new_csv_file = os.path.join(dirname, csv_file)
    # save new preprocessed CSV 
    new_df.to_csv(new_csv_file, index=False)
    # get the folder name
    folder_name, _ = csv_file.split(".")
    audio_files = glob.glob(f"{folder_name}/{folder_name}/*")
    all_audio_filenames = set(new_df["filename"])
    for i, audio_file in tqdm(list(enumerate(audio_files)), f"Extracting features of {folder_name}"):
        splited = os.path.split(audio_file)
        # audio_filename = os.path.join(os.path.split(splited[0])[-1], splited[-1])
        audio_filename = f"{os.path.split(splited[0])[-1]}/{splited[-1]}"
        # print("audio_filename:", audio_filename)
        if audio_filename in all_audio_filenames:
            # print("Copyying", audio_filename, "...")
            src_path = f"{folder_name}/{audio_filename}"
            target_path = f"{dirname}/{audio_filename}"
            #create that folder if it doesn't exist
            if not os.path.isdir(os.path.dirname(target_path)):
                os.mkdir(os.path.dirname(target_path))
            features = extract_feature(src_path, mel=True)
            target_filename = target_path.split(".")[0]
            np.save(target_filename, features)
            # shutil.copyfile(src_path, target_path)

Alright, let's get to the actual steps.

First, install the following libraries with pip:

pip3 install numpy pandas tqdm scikit-learn tensorflow pyaudio librosa
Next, open a new notebook or bfwstudio and import the modules we need:
import pandas as pd
import numpy as np
import os
import tqdm
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, EarlyStopping
from sklearn.model_selection import train_test_split
Now, to get the gender of each sample, there is a CSV metadata file (check it here) that links each audio sample's file path to its gender:
df = pd.read_csv("balanced-all.csv")
df.head()
It looks like this:

filename gender
0 data/cv-other-train/sample-069205.npy female
1 data/cv-valid-train/sample-063134.npy female
2 data/cv-other-train/sample-080873.npy female
3 data/cv-other-train/sample-105595.npy female
4 data/cv-valid-train/sample-144613.npy female

Let's look at the end of the dataframe:
df.tail()
Output:

filename gender
66933 data/cv-valid-train/sample-171098.npy male
66934 data/cv-other-train/sample-022864.npy male
66935 data/cv-valid-train/sample-080933.npy male
66936 data/cv-other-train/sample-012026.npy male
66937 data/cv-other-train/sample-013841.npy male

Let's check how many samples we have of each gender:
# get total samples
n_samples = len(df)
# get total male samples
n_male_samples = len(df[df['gender'] == 'male'])
# get total female samples
n_female_samples = len(df[df['gender'] == 'female'])
print("Total samples:", n_samples)
print("Total male samples:", n_male_samples)
print("Total female samples:", n_female_samples)
Output:

Total samples: 66938
Total male samples: 33469
Total female samples: 33469


Perfect, we have a large number of balanced audio samples. The following function loads all the files into a single array; we don't need any batch-generation mechanism here because everything fits in memory (each audio sample is just its extracted features, about 1 KB in size):

# map each gender to an integer label (1 for male, 0 for female)
label2int = {"male": 1, "female": 0}

def load_data(vector_length=128):
    """A function to load gender recognition dataset from `data` folder
    After the second run, this will load from results/features.npy and results/labels.npy files
    as it is much faster!"""
    # make sure results folder exists
    if not os.path.isdir("results"):
        os.mkdir("results")
    # if features & labels already loaded individually and bundled, load them from there instead
    if os.path.isfile("results/features.npy") and os.path.isfile("results/labels.npy"):
        X = np.load("results/features.npy")
        y = np.load("results/labels.npy")
        return X, y
    # read dataframe
    df = pd.read_csv("balanced-all.csv")
    # get total samples
    n_samples = len(df)
    # get total male samples
    n_male_samples = len(df[df['gender'] == 'male'])
    # get total female samples
    n_female_samples = len(df[df['gender'] == 'female'])
    print("Total samples:", n_samples)
    print("Total male samples:", n_male_samples)
    print("Total female samples:", n_female_samples)
    # initialize an empty array for all audio features
    X = np.zeros((n_samples, vector_length))
    # initialize an empty array for all audio labels (1 for male and 0 for female)
    y = np.zeros((n_samples, 1))
    for i, (filename, gender) in tqdm.tqdm(enumerate(zip(df['filename'], df['gender'])), "Loading data", total=n_samples):
        features = np.load(filename)
        X[i] = features
        y[i] = label2int[gender]
    # save the audio features and labels into files
    # so we won't load each one of them next run
    np.save("results/features", X)
    np.save("results/labels", y)
    return X, y
The function above is responsible for reading that CSV file and loading all the audio samples into a single array. This takes some time the first time you run it, but the bundled arrays are saved in the results folder, which saves us that time on every subsequent run.

Now we have a single array, but we still need to split the dataset into training, testing, and validation sets; the function below does exactly that:

def split_data(X, y, test_size=0.1, valid_size=0.1):
    # split training set and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=7)
    # split training set and validation set
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=valid_size, random_state=7)
    # return a dictionary of values
    return {
        "X_train": X_train,
        "X_valid": X_valid,
        "X_test": X_test,
        "y_train": y_train,
        "y_valid": y_valid,
        "y_test": y_test
    }


We are using sklearn's convenient train_test_split() function, which shuffles our dataset and splits it into training and testing sets; we then run it again on the training set to get the validation set. Let's use these functions:

# load the dataset
X, y = load_data()
# split the data into training, validation and testing sets
data = split_data(X, y, test_size=0.1, valid_size=0.1)


Now the data dictionary contains everything we need to fit the model, so let's build the model!

2. Building the Model


In this tutorial we will use a deep feed-forward neural network with 5 hidden layers; it is not an ideal architecture, but it gets the job done so far:

def create_model(vector_length=128):
    """5 hidden dense layers from 256 units to 64, not the best model."""
    model = Sequential()
    model.add(Dense(256, input_shape=(vector_length,)))
    model.add(Dropout(0.3))
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    # one output neuron with sigmoid activation function, 0 means female, 1 means male
    model.add(Dense(1, activation="sigmoid"))
    # using binary crossentropy as it's male/female classification (binary)
    model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="adam")
    # print summary of the model
    model.summary()
    return model



We use a 30% dropout rate after each fully connected layer; this kind of regularization will hopefully prevent overfitting on the training data.

The important thing to note here is that we use a single output unit (neuron) with a sigmoid activation function in the output layer: the model outputs a scalar close to 1 when the speaker is male, and close to 0 when the speaker is female.

Also, we use binary cross-entropy as the loss function, since it is the special case of categorical cross-entropy when there are only 2 classes to predict. Let's use this function to build the model:
# construct the model
model = create_model()
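As a quick sanity check, binary cross-entropy for a single sample with true label y (0 or 1) and predicted probability p is -(y*log(p) + (1-y)*log(1-p)); the tiny sketch below (with arbitrary example numbers) compares the hand-computed value with Keras:

import numpy as np
from tensorflow.keras.losses import binary_crossentropy

y_true, p = 1.0, 0.9  # e.g. the true label is "male" and the model outputs 0.9
manual = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
keras_value = binary_crossentropy([y_true], [p]).numpy()
print(manual, keras_value)  # both are approximately 0.105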


3. Training the Model


Now that we have built the model, let's train it using the dataset we loaded earlier:

# use tensorboard to view metrics
tensorboard = TensorBoard(log_dir="logs")
# define early stopping to stop training after 5 epochs of not improving
early_stopping = EarlyStopping(mode="min", patience=5, restore_best_weights=True)

batch_size = 64
epochs = 100
# train the model using the training set and validating using validation set
model.fit(data["X_train"], data["y_train"], epochs=epochs, batch_size=batch_size, validation_data=(data["X_valid"], data["y_valid"]),
          callbacks=[tensorboard, early_stopping])


We defined two callbacks that run at the end of every epoch:

The first is TensorBoard, which we will use to view the model's loss and accuracy during training.
The second callback is early stopping, which stops training when the model stops improving. I set a patience of 5, which means training stops after 5 epochs without improvement; setting restore_best_weights to True restores the best weights recorded during training and assigns them back to the model.
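Note that ModelCheckpoint is imported above but not actually used; if you also want the best weights written to disk as training progresses, a minimal sketch would be something like this (the file path is only an example), added to the callbacks list alongside tensorboard and early_stopping:

# save the weights whenever the validation loss improves
checkpoint = ModelCheckpoint("results/best_model.h5", monitor="val_loss",
                             save_best_only=True, verbose=1)
# callbacks=[tensorboard, early_stopping, checkpoint]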
Let's save this model:

# save the model to a file
model.save("results/model.h5")


Here is my output:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 256) 33024 
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 65792
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 128) 32896
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_3 (Dense) (None, 128) 16512
_________________________________________________________________
dropout_3 (Dropout) (None, 128) 0
_________________________________________________________________
dense_4 (Dense) (None, 64) 8256
_________________________________________________________________
dropout_4 (Dropout) (None, 64) 0
_________________________________________________________________
dense_5 (Dense) (None, 1) 65
=================================================================
Total params: 156,545
Trainable params: 156,545
Non-trainable params: 0
_________________________________________________________________
Train on 54219 samples, validate on 6025 samples
Epoch 1/100
54219/54219 [==============================] - 8s 143us/sample - loss: 0.5514 - accuracy: 0.7651 - val_loss: 0.3807 - val_accuracy: 0.8508
Epoch 2/100
54219/54219 [==============================] - 5s 93us/sample - loss: 0.4159 - accuracy: 0.8326 - val_loss: 0.3464 - val_accuracy: 0.8536
Epoch 3/100
54219/54219 [==============================] - 5s 93us/sample - loss: 0.3860 - accuracy: 0.8466 - val_loss: 0.3112 - val_accuracy: 0.8744
<..SNIPPED..>
Epoch 16/100
54219/54219 [==============================] - 5s 96us/sample - loss: 0.2864 - accuracy: 0.8936 - val_loss: 0.2387 - val_accuracy: 0.9087
Epoch 17/100
54219/54219 [==============================] - 5s 95us/sample - loss: 0.2824 - accuracy: 0.8945 - val_loss: 0.2464 - val_accuracy: 0.9110
Epoch 18/100
54219/54219 [==============================] - 6s 103us/sample - loss: 0.2887 - accuracy: 0.8920 - val_loss: 0.2406 - val_accuracy: 0.9074
Epoch 19/100
54219/54219 [==============================] - 5s 95us/sample - loss: 0.2822 - accuracy: 0.8939 - val_loss: 0.2435 - val_accuracy: 0.9080
Epoch 20/100
54219/54219 [==============================] - 5s 96us/sample - loss: 0.2813 - accuracy: 0.8957 - val_loss: 0.2567 - val_accuracy: 0.8993
Epoch 21/100
54219/54219 [==============================] - 5s 89us/sample - loss: 0.2759 - accuracy: 0.8962 - val_loss: 0.2442 - val_accuracy: 0.9112


As you can see, training stopped at epoch 21, reaching a validation loss of 0.2387 and almost 91% validation accuracy (recorded at epoch 16).

4. Testing the Model


Now that the model is trained and the best weights have been restored, let's evaluate it on the test set we created earlier:

# evaluating the model using the testing set
print(f"Evaluating the model using {len(data['X_test'])} samples...")
loss, accuracy = model.evaluate(data["X_test"], data["y_test"], verbose=0)
print(f"Loss: {loss:.4f}")
print(f"Accuracy: {accuracy*100:.2f}%")
Check this out:

Evaluating the model using 6694 samples...
Loss: 0.2405
Accuracy: 90.95%


Impressively, we reached 91% accuracy on samples the model has never seen. Awesome!

If you open TensorBoard (using the command tensorboard --logdir="logs"), you will see loss and accuracy curves similar to these:


Binary cross-entropy loss during training / Model accuracy during training

The blue curves are the validation set and the orange curves are the training set; you can see the loss decreasing over time and the accuracy steadily improving, which is exactly what we expect!

Now let's test the model with our own voices.

I know this is the exciting part. I made a script that records your voice until you stop speaking (you can speak in any language) and saves the recording to a file; it then extracts the features from that audio and feeds them to the model to get the result:

import librosa
import numpy as np

def extract_feature(file_name, **kwargs):
    """
    Extract feature from audio file `file_name`
        Features supported:
            - MFCC (mfcc)
            - Chroma (chroma)
            - MEL Spectrogram Frequency (mel)
            - Contrast (contrast)
            - Tonnetz (tonnetz)
        e.g:
        `features = extract_feature(path, mel=True, mfcc=True)`
    """
    mfcc = kwargs.get("mfcc")
    chroma = kwargs.get("chroma")
    mel = kwargs.get("mel")
    contrast = kwargs.get("contrast")
    tonnetz = kwargs.get("tonnetz")
    X, sample_rate = librosa.core.load(file_name)
    if chroma or contrast:
        stft = np.abs(librosa.stft(X))
    result = np.array([])
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, chroma))
    if mel:
        mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
        result = np.hstack((result, mel))
    if contrast:
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, contrast))
    if tonnetz:
        tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
        result = np.hstack((result, tonnetz))
    return result
The function above is responsible for loading an audio file and extracting features from it; the following lines use the argparse module to parse an audio file path passed on the command line and run inference on it:

import os
import argparse
parser = argparse.ArgumentParser(description="""Gender recognition script, this will load the model you trained, 
                                    and perform inference on a sample you provide (either using your voice or a file)""")
parser.add_argument("-f", "--file", help="The path to the file, preferred to be in WAV format")
args = parser.parse_args()
file = args.file
# construct the model (create_model() is the same function defined in the training code above)
model = create_model()
# load the saved/trained weights
model.load_weights("results/model.h5")
if not file or not os.path.isfile(file):
    # if file not provided, or it doesn't exist, use your voice
    print("Please talk")
    # put the file name here
    file = "test.wav"
    # record the file (start talking)
    record_to_file(file)
# extract features and reshape it
features = extract_feature(file, mel=True).reshape(1, -1)
# predict the gender!
male_prob = model.predict(features)[0][0]
female_prob = 1 - male_prob
gender = "male" if male_prob > female_prob else "female"
# show the result!
print("Result:", gender)
print(f"Probabilities::: Male: {male_prob*100:.2f}%    Female: {female_prob*100:.2f}%")
If you execute this as-is it won't work, because the record_to_file() method isn't defined (you can check the full script code here), but it helps me explain the code.

We are using the argparse module to parse the file path passed from the command line; if no file is passed (with the --file or -f argument), the script starts recording from the default microphone.
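For completeness, here is a minimal, hypothetical record_to_file() built on PyAudio (which is in the pip install list above). The author's full script stops recording automatically when you stop speaking; this simplified sketch just records for a fixed number of seconds:

import wave
import pyaudio

def record_to_file(path, seconds=5, rate=16000, chunk=1024):
    """Record `seconds` of mono audio from the default microphone into a WAV file."""
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=rate,
                    input=True, frames_per_buffer=chunk)
    frames = [stream.read(chunk) for _ in range(int(rate / chunk * seconds))]
    stream.stop_stream()
    stream.close()
    sample_width = p.get_sample_size(pyaudio.paInt16)
    p.terminate()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))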

Then we create the model, load the best weights we trained earlier, extract the features of the passed (or recorded) audio file, and use model.predict() to get the resulting prediction. Here is an example:
$ python test.py --file "test-samples/16-122828-0002.wav"
Output:

Result: female
Probabilities: Male: 20.77% Female: 79.23%


And indeed, the sample, taken from the LibriSpeech dataset (https://www.openslr.org/12), is female!

5. Conclusion


You now have many options for improving the model's accuracy further. One is to experiment with a different model architecture; you can also try convolutional or recurrent networks and see the results! I expect you can reach better than 95% accuracy, and if you do, please share it with us in the comments below!
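For example, since LSTM is already imported in the training code, a recurrent variant could treat the 128-length feature vector as a sequence of 128 time steps with 1 feature each. This is only a rough, untuned sketch, not a recommendation:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout

def create_lstm_model(vector_length=128):
    """A rough recurrent alternative to the dense model; expects input of shape (vector_length, 1)."""
    model = Sequential()
    model.add(LSTM(64, input_shape=(vector_length, 1)))
    model.add(Dropout(0.3))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="adam")
    return model

# the feature arrays need an extra channel axis first, e.g.:
# X_train = data["X_train"][..., np.newaxis]  # shape becomes (n_samples, 128, 1)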

You can also download the original dataset from Kaggle (https://www.kaggle.com/mozillaorg/common-voice) and try another feature extraction technique, such as MFCCs, using the provided extract_feature() function, and then compare the results.
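For instance, you could change the feature extraction line in the preprocessing script to one of the lines below; note that the feature vector is then no longer 128 values long (MFCCs alone give 40 values, MFCC + Mel combined give 168), so the vector_length passed to load_data() and create_model() must be changed to match:

# 40 MFCCs instead of the 128 Mel bands
features = extract_feature(src_path, mfcc=True)            # shape (40,)
# or both feature types concatenated
features = extract_feature(src_path, mfcc=True, mel=True)  # shape (168,)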

Source code: https://github.com/x4nth055/gender-recognition-by-voice

