GPT-3 vs BERT vs GloVe vs Word2vec: A Performance Comparison of Text Embedding Techniques

Posted by: AI科技大本營 | Date: 2023-02-21 | Source: 工程師

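On January 25, 2022, OpenAI announced an embeddings endpoint (Neelakantan et al., 2022). Its neural network models convert text and code into vector representations, embedding them in a high-dimensional space. These models can capture the semantic similarity of text and appear to achieve state-of-the-art performance in some use cases.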

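With ChatGPT's surge in popularity, GPT-3 is back in the spotlight. This article takes text-embedding-ada-002 (one of the GPT-3 embedding models, chosen because it is reasonably priced and simple to use) and compares its embeddings with those produced by three traditional text embedding techniques: GloVe (Pennington, Socher, & Manning, 2014), Word2vec (Mikolov et al., 2013), and MPNet (Song et al., 2020). The embeddings are used to train several machine learning models that classify the review scores in the Amazon fine food reviews dataset. Each embedding technique is evaluated by comparing accuracy.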

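Data Preparation

The dataset used in this article is a subset of 1,000 reviews from the Amazon fine food reviews dataset. The subset already contains embeddings generated with GPT-3's "text-embedding-ada-002" model. Each embedding was generated from the combination of the review title (summary) and text. As Figure 1 shows, each review also has a ProductId, a UserId, a Score, and the number of tokens generated from the combined text.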

    # Libraries
    from sentence_transformers import SentenceTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import RobustScaler
    from sklearn.pipeline import Pipeline
    import gensim.downloader as api
    from sklearn.svm import SVC
    import pandas as pd
    import numpy as np
    import openai
    import re

    # Import data
    df1 = pd.read_csv(
        'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv',
        index_col=0)

    # View the first three rows
    df1.head(3)

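The embeddings are stored as strings, and their line breaks and spaces get in the way of representing them as arrays. We therefore need a function that removes the unnecessary characters and converts each embedding into a proper array. The GPT-3 embedding column is also renamed to 'gpt_3' so that it can be told apart from the other embeddings generated later in this article.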

    # Clean the OpenAI embeddings
    def clean_emb(text):
        # Remove line breaks
        text = re.sub(r'\n', '', text)
        # Remove square brackets
        text = re.sub(r'\[|\]', "", text)
        # Remove leading and trailing white space
        text = text.strip()
        # Convert the string into an array
        text = np.fromstring(text, dtype=float, sep=',')
        return text

    # Rename the column to gpt_3
    df1.rename(columns={'embedding': 'gpt_3'}, inplace=True)

    # Apply the clean_emb function
    df1['gpt_3'] = df1['gpt_3'].apply(lambda x: clean_emb(x))
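As a possible alternative to the regex-based cleanup, the sketch below parses one embedding string with the standard `json` module. It assumes each cell holds a JSON-style list such as `[0.1, -0.2]`; the function name `parse_embedding` and the sample string are made up for illustration and are not part of the original article.

```python
import json

import numpy as np


def parse_embedding(text):
    """Parse a string such as '[0.1, -0.2, 0.3]' into a 1-D float array.

    Assumption: the string is a valid JSON list of numbers.
    """
    return np.array(json.loads(text), dtype=float)


# Made-up sample value, not taken from the dataset
sample = "[0.5, -0.25,\n 0.125]"
vec = parse_embedding(sample)
print(vec.shape, vec.dtype)
```

Because `json.loads` ignores whitespace between list items, no separate step is needed for line breaks or spaces.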

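GPT-3 Embeddings

The dataset contains pre-generated GPT-3 embeddings. To generate fresh embeddings, however, we need an API key to access the model. The key can be obtained by signing up for the OpenAI API. The next step is to create a function that specifies the model to use (text-embedding-ada-002 in this case).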

    api_key = 'api key'

    # Set the API key as the default key for openai
    openai.api_key = api_key

    def get_embedding(text, model="text-embedding-ada-002"):
        # Replace new lines with spaces
        text = text.replace("\n", " ")
        # openai.Embedding.create converts the text into an embedding array
        # (this call belongs to the pre-1.0 openai package)
        return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

Since this only involves fetching the result returned by the API, the process is very simple.
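For readers on version 1.0 or later of the `openai` package, the same request could be sketched roughly as follows. This is an assumption-laden illustration rather than the article's method: `get_embedding_v1` is a hypothetical name, the client is passed in as a parameter, and a stand-in client is used so the sketch runs without network access or an API key.

```python
from types import SimpleNamespace


def get_embedding_v1(client, text, model="text-embedding-ada-002"):
    """Return the embedding of `text` using an openai>=1.0 style client.

    Assumption: `client` exposes `embeddings.create(input=..., model=...)`
    and the response carries the vector at `.data[0].embedding`.
    """
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


# Stand-in client so the sketch runs offline. A real client would be built
# with `from openai import OpenAI; client = OpenAI(api_key=...)`.
def _fake_create(input, model):
    # Returns a fake 2-dimensional "embedding" based on the text length
    fake_vector = [float(len(input[0])), 0.0]
    return SimpleNamespace(data=[SimpleNamespace(embedding=fake_vector)])


fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
print(get_embedding_v1(fake_client, "great\ncoffee"))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, which is the only reason for that design choice here.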

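GloVe Embeddings

GloVe (Global Vectors for Word Representation) is a text embedding technique that builds vector representations of words from their co-occurrence statistics in a large body of text. The idea behind GloVe is that words appearing in similar contexts are semantically related, and that the relationships between them can be inferred from the counts collected in a co-occurrence matrix.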
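To make the notion of co-occurrence statistics concrete, here is a minimal sketch of the counting step only. The toy corpus, the whitespace tokenization, the window size of 2, and the function name `cooccurrence_counts` are all assumptions made for illustration; the actual GloVe training step, which fits word vectors to such counts, is not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, made up for this example
corpus = ["the coffee tastes great", "the tea tastes great"]
counts = cooccurrence_counts(corpus)
print(counts[("coffee", "tastes")], counts[("tastes", "great")])
```

In this toy corpus "coffee" and "tea" never co-occur with each other, yet both co-occur with "tastes" and "great"; shared context of this kind is what lets a model place related words close together.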

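GloVe-based embeddings can be generated easily with the spaCy library. Here we use the "en_core_web_lg" English pipeline. The pipeline runs a series of steps on a given text input, such as tokenization, tagging, and lemmatization, to convert it into a suitable format. It contains 514,000 vectors, which is large enough for the current use case.

GloVe was released in 2014. Although that is almost ten years ago, GloVe was arguably the most successful word embedding method before transformers appeared, so we still include it in the comparison.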

    import spacy

    # Load pipeline
    nlp = spacy.load("en_core_web_lg")

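The text needs cleaning here as well. As Figure 2 shows, the first text input contains runs of consecutive periods. This pattern has to be corrected.

    df1.combined[0]

We create a function that replaces consecutive periods with a single period and strips white space from the ends of the text.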

    def replace_multiple_fullstops(text):
        # Replace 2 or more consecutive full stops with 1
        text = re.sub(r'\.{2,}', '.', text)
        # Strip white space from both ends of the text
        text = text.strip()
        return text

    # Apply function
    df1['clean_text'] = df1['combined'].apply(lambda x: replace_multiple_fullstops(x))
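As a quick check of what this cleanup does, the snippet below repeats the function in self-contained form and runs it on a made-up review string (the sample text is an assumption and does not come from the dataset).

```python
import re


def replace_multiple_fullstops(text):
    # Replace 2 or more consecutive full stops with 1
    text = re.sub(r'\.{2,}', '.', text)
    # Strip white space from both ends of the text
    return text.strip()


# Made-up sample input
sample = "  Good flavor.... would buy again..  "
print(replace_multiple_fullstops(sample))  # Good flavor. would buy again.
```

Note that the pattern only collapses runs of periods; single periods and other punctuation are left untouched.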

Once the text has been cleaned, the embeddings can be generated.

    df1['glove'] = df1['clean_text'].apply(lambda text: nlp(text).vector)

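Word2vec Embeddings

The word2vec technique is based on a neural network model, trained on a large amount of text, that predicts a target word from its surrounding context words. Word2vec represents every word in the vocabulary with a continuous vector that captures the meaning and context in which the word is used. These vectors are produced by an unsupervised learning process in which the neural network tries to predict words from a given context.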
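The "predict a target word from its context" setup can be illustrated by building the (context, target) training pairs that such a model would see. This is only a sketch of the data preparation under assumed settings (a window of 2 and whitespace tokenization); the function name `cbow_pairs` is hypothetical and no neural network is trained here.

```python
def cbow_pairs(tokens, window=2):
    """Build (context words, target word) pairs for each position."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


# Made-up example sentence
tokens = "this coffee tastes really great".split()
pairs = cbow_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Each pair asks the model to recover the word on the right from the words on the left, which is how the word vectors come to encode context.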

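The Gensim library can be used to load models trained with the word2vec technique. The "word2vec-google-news-300" model available through Gensim was trained on the Google News dataset, which contains about 100 billion words, and can represent most of the words in our dataset.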

    import gensim.downloader as api

    # Load the word2vec-google-news-300 model
    wv = api.load("word2vec-google-news-300")

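Because Gensim provides a model rather than a pipeline, the text input still has to be tokenized, cleaned, and lemmatized with spaCy before the word2vec model can generate vector representations.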

    def wv_preprocess_and_vectorize(text):
        # Process the input text with the spaCy pipeline
        doc = nlp(text)

        # Initialize a list to store the filtered tokens
        filtered_tokens = []

        # Loop through each token in the doc
        for token in doc:
            # If the token is a stop word or punctuation, skip it
            if token.is_stop or token.is_punct:
                continue
            # Otherwise, add the lemma of the token to the filtered_tokens list
            filtered_tokens.append(token.lemma_)

        # If there are no filtered tokens, return np.nan
        if not filtered_tokens:
            return np.nan
        else:
            # Otherwise, return the mean vector representation of the filtered tokens
            return wv.get_mean_vector(filtered_tokens)

    # Apply function
    df1['word2vec'] = df1['clean_text'].apply(lambda text: wv_preprocess_and_vectorize(text))
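The idea of turning a list of tokens into one document vector can be sketched with plain averaging over a toy lookup table. The three-dimensional vectors and the helper name `mean_vector` are invented for illustration; Gensim's own `get_mean_vector` may additionally normalize vectors depending on its arguments, so its numbers would not necessarily match this simplified version.

```python
import numpy as np

# Toy word vectors, made up for this example (real ones have 300 dimensions)
toy_vectors = {
    "coffee": np.array([1.0, 0.0, 2.0]),
    "great": np.array([3.0, 2.0, 0.0]),
}


def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)


doc_vec = mean_vector(["coffee", "great", "unknownword"], toy_vectors)
print(doc_vec)
```

Averaging discards word order, which is one reason sentence-level models can behave differently from averaged word vectors.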

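MPNet Embeddings (BERT)

MPNet (Masked and Permuted Language Model Pre-training) is a transformer-based language model pre-training technique for NLP. MPNet offers a variant of the BERT model. During pre-training, BERT masks a portion of the input tokens and trains the model to predict the masked tokens from the context of the unmasked ones. This process is known as masked language modeling, and it is effective at capturing the meaning and context of words in a text corpus.

In addition to masked language modeling, MPNet uses a permutation mechanism that randomly permutes the order of the input tokens. This permutation helps the model learn the global context and the relationships between words in the input sequence.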
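The masking step described above can be sketched in a heavily simplified form. The 15% mask ratio, the `[MASK]` placeholder, the fixed seed, and the function name `mask_tokens` are assumptions chosen for illustration; real pre-training operates on subword tokens, and this sketch does not attempt MPNet's permutation mechanism.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens and remember what was hidden.

    Returns the masked token list and a dict {position: original token},
    which is what a model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


# Made-up example sentence
tokens = "this coffee tastes really great every morning".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model only sees `masked`; the entries in `labels` are the answers it is scored against.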

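Here we use Hugging Face's sentence-transformers model "all-mpnet-base-v2" to obtain MPNet-based embeddings. The model is built on the MPNet base model and fine-tuned on a dataset of 1 billion sentence pairs.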

    model_sent = SentenceTransformer('all-mpnet-base-v2')
    df1['mpnet'] = df1['clean_text'].apply(lambda text: model_sent.encode(text))

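Dimension Comparison

Figure 3 below shows the dimension of each embedding. GPT-3 has the largest dimension at 1536, followed by MPNet, Word2vec, and GloVe with 768, 300, and 300 dimensions respectively.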

    # Assign the data as a dictionary of lists
    data = {'Name': ['gpt_3', 'mpnet', 'word2vec', 'glove'],
            'Dimension': [len(df1.gpt_3[0]), len(df1.mpnet[0]),
                          len(df1.word2vec[0]), len(df1.glove[0])]}

    # Create DataFrame
    df_emb_len = pd.DataFrame(data)

    # Set background style
    df_emb_len.style.background_gradient()

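Models Used for Evaluation

To evaluate the text embeddings, we use four classifiers (random forest, support vector machine, logistic regression, and decision tree) to predict the Score variable. The dataset is split 75:25 into training and test sets to evaluate accuracy. Because each embedding is stored as a separate one-dimensional array, the numpy function np.stack is used before training to combine them into a single two-dimensional array.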

    # Define a list of embedding methods to evaluate
    embedding_var = ['gpt_3', 'mpnet', 'word2vec', 'glove']

    # Define a list of classifier models to use
    classifiers = [('rf', RandomForestClassifier(random_state=76)),
                   ('svm', SVC(random_state=76)),
                   ('lr', LogisticRegression(random_state=76, max_iter=400)),
                   ('dt', DecisionTreeClassifier(random_state=76))]

    # Define a dictionary to store accuracy results for each classifier
    accuracy_lists = {
        'rf': [],
        'svm': [],
        'lr': [],
        'dt': []
    }

    # Loop through each embedding method
    for emb in embedding_var:

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            df1[emb].values,
            df1.Score,
            test_size=0.25,
            random_state=76
        )

        # Stack the per-review vectors into 2D arrays (samples x dimensions)
        X_train_stacked = np.stack(X_train)
        X_test_stacked = np.stack(X_test)

        # Loop through each classifier model
        for classifier_name, classifier in classifiers:

            # Create a pipeline that scales the data and fits the classifier
            pipe = Pipeline([('scaler', RobustScaler()), (classifier_name, classifier)])
            pipe.fit(X_train_stacked, y_train)

            # Use the pipeline to make predictions on the test data
            y_pred = pipe.predict(X_test_stacked)

            # Evaluate the accuracy of the predictions
            report = classification_report(y_test, y_pred, output_dict=True)
            acc = report['accuracy']

            # Store the accuracy results for each classifier
            accuracy_lists[classifier_name].append(acc)
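The stacking step is easy to misread, so here is a small self-contained illustration of what `np.stack` does to a column of per-row vectors. The two three-dimensional vectors are made up; in the article's data each row would instead hold a vector of 300, 768, or 1536 values.

```python
import numpy as np

# An object array whose elements are 1-D vectors, similar to what
# `df1[emb].values` yields when each cell stores an embedding array
values = np.empty(2, dtype=object)
values[0] = np.array([0.1, 0.2, 0.3])
values[1] = np.array([0.4, 0.5, 0.6])

# np.stack joins the vectors along a new first axis
stacked = np.stack(values)

print(values.shape, values.dtype)
print(stacked.shape, stacked.dtype)
```

The classifiers expect a numeric matrix of shape (samples, features), which the stacked array provides and the object array does not.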

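Results

As Figure 4 below shows, the models produced some interesting results. GPT-3 embeddings achieved the highest accuracy across all models.

MPNet embeddings came second with logistic regression and support vector machines, but were overtaken by word2vec embeddings with the random forest and performed worst with the decision tree. No clear conclusion can be drawn about the effect of dimensionality on model performance, but the results clearly show that GPT-3 embeddings consistently outperformed all the other embeddings, demonstrating their strength in text classification.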

    # Add a new key 'embeddings' to the dictionary 'accuracy_lists' and assign the list 'embedding_var' to it
    accuracy_lists['embeddings'] = embedding_var

    # Create a list of tuples using the values from the dictionary
    df_zip = list(zip(accuracy_lists['embeddings'], accuracy_lists['lr'], accuracy_lists['svm'],
                      accuracy_lists['rf'], accuracy_lists['dt']))

    # Create a DataFrame 'df_accuracy' from the list 'df_zip' and specify the column names
    df_accuracy = pd.DataFrame(df_zip, columns=['Embedding', 'Logistic_Regression', 'Support_Vector_Machine',
                                                'Random_Forest', 'Decision_Tree'])

    # Add a background gradient to the DataFrame for visual representation
    df_accuracy.style.background_gradient()

So the old line still holds: "Don't ask; if you must ask, the answer is GPT-3."
