Audio Data Modeling, a Full-Workflow Code Example: Predicting Age from a Speaker's Voice (Part 2)

Posted by: 數據派THU   Date: 2022-03-13   Source: Engineer

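Feature extraction

The data is clean, so it is time to look at the audio-specific features that can be extracted.
1. Onset detection
By looking at the waveform of a signal, librosa can identify the onset of each new spoken word reasonably well.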

# Import librosa and the plotting helpers
import librosa
import librosa.display
from matplotlib import pyplot as plt

# Load the mp3 file at a specific sampling rate, here 16 kHz
y, sr = librosa.load("c4_sample-1.mp3", sr=16_000)

# Plot the signal stored in 'y'
# (waveshow replaces waveplot, which recent librosa releases no longer provide)
plt.figure(figsize=(12, 3))
plt.title("Audio signal as waveform")
librosa.display.waveshow(y, sr=sr);
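
The block above only loads and plots the recording; it does not yet locate where words begin. librosa ships an onset detector (`librosa.onset.onset_detect`) for that purpose. To show the underlying idea without depending on the sample file, here is a minimal sketch that swaps in a much simpler technique: it marks an onset wherever the short-time energy of a synthetic signal rises above a threshold. The function name `energy_onsets`, the frame size, and the threshold are assumptions chosen for illustration, not what librosa does internally.

```python
import numpy as np


def energy_onsets(y, sr, frame_length=400, threshold=0.1):
    """Return onset times (seconds) where frame energy first exceeds `threshold`.

    Simplified, energy-based stand-in for a real onset detector.
    """
    n_frames = len(y) // frame_length
    frames = y[: n_frames * frame_length].reshape(n_frames, frame_length)
    # Root-mean-square energy of each non-overlapping frame
    rms = np.sqrt(np.mean(frames**2, axis=1))
    active = rms > threshold
    # An onset is a frame that is active while the previous frame was not
    previous = np.concatenate(([False], active[:-1]))
    onset_frames = np.flatnonzero(active & ~previous)
    return onset_frames * frame_length / sr


# Synthetic "recording": two 0.2 s tone bursts separated by silence
sr = 16_000
t = np.arange(int(0.2 * sr)) / sr
burst = 0.5 * np.sin(2 * np.pi * 220 * t)
silence = np.zeros(int(0.3 * sr))
y = np.concatenate([silence, burst, silence, burst, silence])

onsets = energy_onsets(y, sr)
number_of_words = len(onsets)
print(onsets, number_of_words)
```

Counting the detected onsets gives a rough estimate of the number of spoken words, which is the kind of quantity the next step relies on.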

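2. Length of the recording

Closely related is the length of the recording. The longer the recording, the more words can be spoken. So let's compute the length of the recording and the speed at which words are spoken.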

# number_of_words is assumed to be defined already, e.g. as the number of detected word onsets
duration = len(y) / sr
words_per_second = number_of_words / duration
print(f"""The audio signal is {duration:.2f} seconds long,
with an average of {words_per_second:.2f} words per second.""")

>>> The audio signal is 1.70 seconds long,
>>> with an average of 4.13 words per second.

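3. Tempo
Speech is a very melodic signal, and everyone has their own way and speed of talking. Another feature that can therefore be extracted is the tempo of the speech, i.e. the number of beats that can be detected in the audio signal.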


# Computes the tempo of an audio recording
# (keyword arguments, since recent librosa releases no longer accept y and sr positionally)
tempo = librosa.beat.tempo(y=y, sr=sr, start_bpm=10)[0]
print(f"The audio signal has a speed of {tempo:.2f} bpm.")

>>> The audio signal has a speed of 42.61 bpm.
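
To make the idea of "beats per minute" concrete, the sketch below estimates a tempo from a synthetic onset-strength envelope by finding the lag with the strongest autocorrelation. This is a simplified stand-in, not librosa's implementation; the function name `tempo_from_envelope`, the frame rate, and the BPM search range are assumptions made for this example.

```python
import numpy as np


def tempo_from_envelope(envelope, frame_rate, min_bpm=30.0, max_bpm=300.0):
    """Estimate tempo (BPM) as the autocorrelation peak within a BPM range."""
    env = envelope - envelope.mean()
    # Full autocorrelation; keep non-negative lags only
    acf = np.correlate(env, env, mode="full")[len(env) - 1:]
    # Convert the BPM search range into a range of lags (in frames)
    min_lag = int(np.ceil(60.0 * frame_rate / max_bpm))
    max_lag = int(np.floor(60.0 * frame_rate / min_bpm))
    max_lag = min(max_lag, len(acf) - 1)
    best_lag = min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
    return 60.0 * frame_rate / best_lag


# Synthetic envelope: one pulse every 0.5 s at 100 frames per second
frame_rate = 100
envelope = np.zeros(10 * frame_rate)
envelope[::50] = 1.0

tempo_estimate = tempo_from_envelope(envelope, frame_rate)
print(f"Estimated tempo: {tempo_estimate:.2f} bpm")
```

One pulse every half second corresponds to two beats per second, so the estimate should land at 120 bpm. Real speech is far less regular, which is one reason a tempo value for a spoken sentence is best treated as a rough descriptive feature.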

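4. Fundamental frequency
The fundamental frequency is the lowest frequency at which a periodic sound appears. In music this is also known as pitch. In the spectrogram plots seen earlier, the fundamental frequency (also called f0) is the lowest bright horizontal band in the image, while the repetitions of this band pattern above the fundamental are called harmonics.
To better illustrate what exactly this means, let's extract the fundamental frequency and plot it in the spectrogram.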


# NumPy is needed for the array helpers below
import numpy as np

# Extract fundamental frequency using a probabilistic approach
f0, _, _ = librosa.pyin(y, sr=sr, fmin=10, fmax=8000, frame_length=1024)

# Establish timepoint of f0 signal
timepoints = np.linspace(0, duration, num=len(f0), endpoint=False)

# Plot fundamental frequency in spectrogram plot
plt.figure(figsize=(8, 3))
x_stft = np.abs(librosa.stft(y))
x_stft = librosa.amplitude_to_db(x_stft,ref=np.max)
librosa.display.specshow(x_stft, sr=sr, x_axis="time", y_axis="log")
plt.plot(timepoints, f0, color="cyan", linewidth=4)
plt.show();
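
The relationship between the fundamental and its harmonics can be illustrated on a synthetic tone. The sketch below builds a signal with a 100 Hz fundamental plus two weaker harmonics and then reads the lowest strong peak off the magnitude spectrum. It only illustrates what "lowest bright band" means; pYIN, used above, is a far more robust estimator for real speech. The amplitudes and the 10% peak threshold are arbitrary assumptions.

```python
import numpy as np

sr = 16_000                    # sampling rate in Hz
t = np.arange(sr) / sr         # exactly one second of samples

# Fundamental at 100 Hz with harmonics at 200 Hz and 300 Hz
signal = (
    1.00 * np.sin(2 * np.pi * 100 * t)
    + 0.50 * np.sin(2 * np.pi * 200 * t)
    + 0.25 * np.sin(2 * np.pi * 300 * t)
)

# Magnitude spectrum and the frequency belonging to each bin
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# Keep bins whose magnitude is at least 10% of the strongest bin
strong = freqs[magnitude >= 0.1 * magnitude.max()]
f0_estimate = strong.min()
print(f"Strong components: {strong} Hz, fundamental: {f0_estimate:.1f} Hz")
```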

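The cyan line visible at around 100 Hz is the fundamental frequency. But how can this be used for feature engineering? One option is to compute specific characteristics of this f0.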

# Computes mean, median, standard deviation, and 5th and 95th percentile of the fundamental frequency
f0_values = [
    np.nanmean(f0),
    np.nanmedian(f0),
    np.nanstd(f0),
    np.nanpercentile(f0, 5),
    np.nanpercentile(f0, 95),
]
print("""This audio signal has a mean of {:.2f}, a median of {:.2f}, a
std of {:.2f}, a 5-percentile at {:.2f} and a 95-percentile at {:.2f}.""".format(*f0_values))
>>> This audio signal has a mean of 81.98, a median of 80.46, a
>>> std of 4.42, a 5-percentile at 76.57 and a 95-percentile at 90.64.
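
`librosa.pyin` reports `NaN` for frames it considers unvoiced, which is why the NaN-aware NumPy functions are used above. A small sketch with made-up f0 values (not taken from the recording) shows the difference:

```python
import numpy as np

# Hypothetical f0 track in Hz; NaN marks unvoiced frames
f0_demo = np.array([np.nan, 80.0, 82.0, np.nan, 78.0, 90.0])

plain_mean = np.mean(f0_demo)               # NaN, because NaNs propagate
voiced_mean = np.nanmean(f0_demo)           # mean over voiced frames only
voiced_ratio = np.mean(~np.isnan(f0_demo))  # fraction of voiced frames

print(plain_mean, voiced_mean, voiced_ratio)
```

The fraction of voiced frames is not part of the feature list above; it is included here only as an example of another quantity that could be derived from the same array.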

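Beyond the techniques described above, there are many more audio feature extraction techniques to explore, but they are not covered in detail here.

Exploratory data analysis (EDA) on the audio dataset

Now that we know what audio data looks like and how to process it, let's perform a proper EDA on it. First, download a dataset: Common Voice from Kaggle. This 14 GB dataset is only a small snapshot of the 70+ GB dataset from Mozilla. For the examples in this article, only a subsample of roughly 9,000 audio files from this dataset is used.
Let's take a look at this dataset and some already extracted features.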

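1. Investigating the feature distributions

Class distribution of the target classes age and gender.

The target class distributions are imbalanced.

As a next step, let's take a closer look at the value distributions of the extracted features.

Except for words_per_second, most of these feature distributions are right-skewed and could therefore benefit from a log transformation.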

import numpy as np
# Applies log1p on features that are not age, gender, filename or words_per_second
df = df.apply(
    lambda x: np.log1p(x)
    if x.name not in ["age", "gender", "filename", "words_per_second"]
    else x)
# Let's look at the distribution once more
df.drop(columns=["age", "gender", "filename"]).hist(
  bins=100, figsize=(14, 10))
plt.show();
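
Why a log transform helps with right-skewed features can be checked numerically. The sketch below draws a right-skewed synthetic sample (a log-normal distribution, chosen purely for illustration) and compares the sample skewness before and after `np.log1p`. The helper `skewness` is written out here rather than taken from a library.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_before = skewness(feature)
skew_after = skewness(np.log1p(feature))
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness closer to zero means a more symmetric distribution, which tends to suit linear models such as logistic regression better.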

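Much better, but interestingly the f0 features all seem to have a bimodal distribution. Let's plot the same thing as before, but this time separated by gender.

As suspected, there seems to be a gender effect here! But it can also be seen that some f0 values (here in particular for males) are much lower or much higher than they should be. These could be outliers caused by poor feature extraction. Let's take a closer look at all the data points in the following figure.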

# Plot sample points for each feature individually
df.plot(lw=0, marker=".", subplots=True, layout=(-1, 3),
        figsize=(15, 7.5), markersize=2)
plt.tight_layout()
plt.show();

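Given that there are only a few features, and that their distributions look fairly clean with pronounced tails, we can go through each of them and decide on an outlier cut-off threshold feature by feature.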
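
A minimal sketch of such feature-by-feature filtering is shown below. The feature values and the cut-off thresholds are hypothetical; in practice the thresholds would be read off the plots for each feature.

```python
import numpy as np

# Hypothetical log-transformed feature values for five recordings
features = {
    "f0_mean": np.array([4.4, 4.5, 1.2, 4.6, 7.9]),
    "tempo": np.array([3.7, 3.8, 3.6, 0.1, 3.9]),
}

# Hypothetical (lower, upper) cut-offs, chosen by eye from each distribution
thresholds = {
    "f0_mean": (3.0, 6.0),
    "tempo": (2.0, 5.0),
}

# Start by keeping every sample, then drop those outside any feature's range
keep = np.ones(5, dtype=bool)
for name, (lower, upper) in thresholds.items():
    values = features[name]
    keep &= (values >= lower) & (values <= upper)

print(f"Keeping {keep.sum()} of {keep.size} samples: {keep}")
```

With a pandas DataFrame, the same boolean mask could be used to select the rows to keep.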

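2. Feature correlation
As a next step, let's look at the correlation between all features. Before doing so, the non-numerical target features need to be encoded. scikit-learn's OrdinalEncoder could be used for this, but that might break the correct order in the age feature. So the mapping is done manually here.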
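
The manual mapping itself is not shown in this copy of the article, so the following is only a sketch of what it could look like. The age label spellings and the 0/1 gender coding are assumptions. The sketch also shows why an automatically inferred category order is a problem: sorted alphabetically, the age groups no longer follow their natural order.

```python
# Age groups in their natural order (label spellings are assumptions)
AGE_ORDER = ["teens", "twenties", "thirties", "fourties", "fifties", "sixties"]
AGE_TO_INT = {label: rank for rank, label in enumerate(AGE_ORDER)}
GENDER_TO_INT = {"male": 0, "female": 1}

# Alphabetical sorting, as happens when categories are inferred from the data,
# does not match the natural age order
alphabetical = sorted(AGE_ORDER)

# Hypothetical labels for three recordings
ages = ["twenties", "sixties", "teens"]
genders = ["female", "male", "male"]
encoded_ages = [AGE_TO_INT[a] for a in ages]
encoded_genders = [GENDER_TO_INT[g] for g in genders]

print(alphabetical)
print(encoded_ages, encoded_genders)
```

With a pandas DataFrame, the same dictionaries can be passed to `Series.map` to encode the `age` and `gender` columns.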


Now pandas' .corr() function can be combined with seaborn's heatmap() to gain more insight into the feature correlations.


import seaborn as sns

plt.figure(figsize=(8, 8))
# numeric_only=True skips non-numeric columns such as filename
df_corr = df.corr(numeric_only=True) * 100
sns.heatmap(df_corr, square=True, annot=True, fmt=".0f",
            mask=np.eye(len(df_corr)), center=0)
plt.show();

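Very interesting! The extracted f0 features seem to have a fairly strong relationship with the gender target, while age does not seem to correlate much with any of the other features.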
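
The pattern described here can be reproduced on a toy example. The values below are invented purely for illustration (they are not from the Common Voice data): f0 is constructed to differ strongly between the two gender codes, while the age groups are spread evenly across both.

```python
import numpy as np

# Toy values, invented for illustration only
gender = np.array([0, 0, 0, 1, 1, 1])               # 0 = male, 1 = female
f0_mean = np.array([100.0, 110.0, 105.0, 200.0, 210.0, 205.0])
age = np.array([1, 3, 5, 1, 3, 5])                  # ordinal age group

# Rows are variables; the result is a 3 x 3 Pearson correlation matrix in percent
corr = np.corrcoef(np.vstack([gender, f0_mean, age])) * 100
print(np.round(corr))
```

In this toy setup gender and f0 are almost perfectly correlated, while gender and age are uncorrelated by construction.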

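3. Spectrogram features
So far the actual audio recordings have not been looked at. As seen before, there are many options (i.e. the waveform, or STFT, mel, or MFCC spectrograms).
The audio samples all differ in length, which means the spectrograms will differ in length too. To standardize all recordings, they are therefore first cut to a length of exactly 3 seconds: samples that are too short are padded, and samples that are too long are trimmed.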
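
A minimal sketch of this standardization step is shown below. It assumes that short recordings are zero-padded at the end and that long recordings lose their tail; the article does not say which variant was used, and the helper name `fix_duration` is hypothetical. librosa offers a comparable utility, `librosa.util.fix_length`.

```python
import numpy as np


def fix_duration(y, sr, seconds=3.0):
    """Zero-pad or truncate a 1-D signal to exactly `seconds` seconds."""
    target = int(round(seconds * sr))
    if len(y) >= target:
        return y[:target]                       # too long: cut the tail
    return np.pad(y, (0, target - len(y)))      # too short: pad with zeros


sr = 16_000
short_clip = np.ones(int(1.7 * sr))   # about 1.7 s of dummy samples
long_clip = np.ones(5 * sr)           # 5 s of dummy samples

fixed_short = fix_duration(short_clip, sr)
fixed_long = fix_duration(long_clip, sr)
print(len(fixed_short) / sr, len(fixed_long) / sr)
```

Once every signal has the same number of samples, every spectrogram computed with the same settings has the same number of frames.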
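Once all these spectrograms are computed, some EDA can be performed on them as well! And because gender seems to have a special relationship with the audio recordings, the average mel spectrograms of the two genders are visualized separately, along with their difference.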

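Male speakers have, on average, lower voices than female speakers. This can be seen in the difference plot as more intensity in the lower frequencies (visible as the red horizontal region).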
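
How such a difference map can be computed is sketched below on toy data. The array shapes, the random values, and the extra low-band energy added to the "male" group are all assumptions made for illustration; real mel spectrograms would take the place of `specs`.

```python
import numpy as np

# Toy stack of "spectrograms": (n_recordings, n_mel_bands, n_frames)
rng = np.random.default_rng(0)
specs = rng.normal(size=(6, 4, 5))
is_male = np.array([True, True, True, False, False, False])

# Put extra energy into the lowest band of the "male" recordings
specs[is_male, 0, :] += 10.0

mean_male = specs[is_male].mean(axis=0)
mean_female = specs[~is_male].mean(axis=0)
difference = mean_male - mean_female     # positive where males are stronger

print(difference.shape)
print(difference.mean(axis=1))           # average difference per mel band
```

Plotted as an image with a diverging colormap, positive regions of `difference` would show up in one color and negative regions in the other, which is how a low-frequency band can stand out.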

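Model selection

Now we are ready to do the modeling, and there are multiple options. With regard to the model, we could:

Train a classical (i.e. shallow) machine learning model, for example LogisticRegression or SVC.
Train a deep learning model, i.e. a deep neural network.
Use a pretrained neural network from TensorflowHub for feature extraction, and then train a shallow or deep model on these high-level features.

And the data to train on could be:

The data from the CSV file, combined with the "mel strength" features from the spectrograms, treated as a tabular dataset.
The mel spectrograms alone, treated as an image dataset.
The high-level features extracted with an existing TensorflowHub model, combined with the other tabular data and treated as a tabular dataset.

Of course, there are many other approaches and ways to create the dataset for the modeling part. Because we are not using the full dataset, this article sticks to the simplest machine learning model.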

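Classical (i.e. shallow) machine learning model

Here the data from the EDA is combined with a simple LogisticRegression model to see how well the speaker's age can be predicted. In addition, GridSearchCV is used to explore different hyperparameter combinations and to perform cross-validation.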


import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler, PowerTransformer, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create pipeline
pipe = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("pca", PCA()),
        ("logreg", LogisticRegression(class_weight="balanced")),
    ]
)

# Create grid
grid = {
    "scaler": [RobustScaler(), PowerTransformer(), QuantileTransformer()],
    "pca": [None, PCA(0.99)],
    "logreg__C": np.logspace(-3, 2, num=16),
}

# Create GridSearchCV
grid_cv = GridSearchCV(pipe, grid, cv=4, return_train_score=True, verbose=1)

# Train GridSearchCV (x_tr and y_tr are assumed to hold the training features and labels)
model = grid_cv.fit(x_tr, y_tr)

# Collect results in a DataFrame
cv_results = pd.DataFrame(grid_cv.cv_results_)

# Select the columns we are interested in
col_of_interest = [
    "param_scaler",
    "param_pca",
    "param_logreg__C",
    "mean_test_score",
    "mean_train_score",
    "std_test_score",
    "std_train_score",
]
cv_results = cv_results[col_of_interest]

# Show the dataframe sorted according to our performance metric
cv_results.sort_values("mean_test_score", ascending=False)
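
To get a feeling for how much work this grid search involves, the number of candidate models can be counted directly. The sketch below enumerates the same search space with plain string labels standing in for the estimator objects, so it runs without scikit-learn.

```python
from itertools import product

import numpy as np

# Same search space as above, with plain labels instead of estimator objects
scalers = ["RobustScaler", "PowerTransformer", "QuantileTransformer"]
pca_options = [None, "PCA(0.99)"]
c_values = np.logspace(-3, 2, num=16)
n_folds = 4

candidates = list(product(scalers, pca_options, c_values))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
print(f"C ranges from {c_values[0]:g} to {c_values[-1]:g}")
```

With its default `refit=True`, GridSearchCV additionally refits the best candidate once on the whole training set, which is what makes `best_estimator_` available afterwards.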

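As a complement to the DataFrame output above, the performance scores can also be plotted as a function of the explored hyperparameters. However, because multiple scalers and PCA options are used, a separate plot is needed for each individual hyperparameter combination.

In the plots, it can be seen that overall the models perform similarly well. Some show a quicker "drop-off" when the value of C is decreased, while others show a bigger gap between the train and test (here actually validation) scores, especially when PCA is not used.

Next, let's take the best_estimator_ model and see how well it performs on the withheld test set.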






# Compute score of the best model on the withheld test set
best_clf = model.best_estimator_
best_clf.score(x_te, y_te)
>>> 0.4354094579008074

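That is already a good score. But to better understand how well the classification model performs, the corresponding confusion matrix can be printed.

While the model detects more twenties samples than samples of any other class (left confusion matrix), overall it is actually better at classifying teens and sixties entries (e.g. with an accuracy of 59% and 55%, respectively).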
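
The difference between raw counts and per-class accuracy can be seen in a small sketch. The labels below are made up for three classes and have nothing to do with the model above; scikit-learn provides `sklearn.metrics.confusion_matrix` for real use, and the NumPy version here only spells out the counting.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count each (true, predicted) pair
    return cm


# Made-up labels for three classes, for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)   # recall per true class
overall_accuracy = cm.diagonal().sum() / cm.sum()
print(cm)
print(per_class_accuracy, overall_accuracy)
```

A class with many samples can dominate the raw counts while still having a lower per-class accuracy than a smaller class, which is why both views are worth inspecting on an imbalanced dataset.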

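Summary

This article first showed what audio data looks like, then which different forms it can be transformed into, how it can be cleaned and explored, and finally how it can be used to train some machine learning models. If you have any questions, feel free to leave a comment.
The source code for this article can be downloaded here: https://github.com/miykael/miykael.github.io/blob/master/assets/nb/04_audio_data_analysis/nb_audio_eda_and_modeling.ipynb
Author: Michael Notter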
