Email spam has been growing since the early 1990s, and by 2014 it was estimated that spam made up around 90% of all email sent.
Since all of us have the problem of spam filling up our inboxes, in this tutorial we will build a model in Keras that can distinguish spam from legitimate email.
1. Installing and Importing Dependencies
We first need to install a few dependencies:

pip3 install keras sklearn tqdm numpy keras_metrics tensorflow==1.14.0

Now open an interactive shell or bfwstudio and import the following:
import time
import pickle
import tqdm
import numpy as np
import keras_metrics  # for recall and precision metrics
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn.model_selection import train_test_split
SEQUENCE_LENGTH = 100  # length of each sequence (number of words per sample)
EMBEDDING_SIZE = 100   # using the 100-dimensional GloVe embedding vectors
TEST_SIZE = 0.25       # test split ratio
BATCH_SIZE = 64        # batch size
EPOCHS = 20            # number of training epochs

# map spam and ham labels to 1 and 0, and back
label2int = {"ham": 0, "spam": 1}
int2label = {0: "ham", 1: "spam"}
If you are not sure what these parameters mean, don't worry; we will discuss them later when we build the model.

2. Loading the Dataset
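We will use the SMS Spam Collection dataset, a public set of labeled SMS messages where each line starts with the label (ham or spam) followed by the message text. The code below expects it at data/SMSSpamCollection; if you don't have it yet, here is a minimal sketch for fetching it from the UCI Machine Learning Repository (the URL and local paths are assumptions on my part, so verify them before running):

import os
import urllib.request
import zipfile

# hypothetical helper: download and unpack the SMS Spam Collection dataset
# the UCI mirror below is believed to host the official archive
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
os.makedirs("data", exist_ok=True)
if not os.path.isfile("data/SMSSpamCollection"):
    urllib.request.urlretrieve(url, "data/smsspamcollection.zip")
    with zipfile.ZipFile("data/smsspamcollection.zip") as z:
        z.extractall("data")  # extracts the SMSSpamCollection file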
def load_data():
    """
    Loads the SMS Spam Collection dataset
    """
    texts, labels = [], []
    with open("data/SMSSpamCollection") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())
            texts.append(' '.join(split[1:]).strip())
    return texts, labels
# load the data
X, y = load_data()
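As a quick sanity check (an optional snippet of our own, not part of the original pipeline), we can print the number of loaded samples and inspect one of them; the exact output depends on your copy of the dataset:

# X still holds raw message strings at this point; y holds "ham"/"spam" labels
print("Number of samples:", len(X))
print("Example label:", y[0])
print("Example message:", X[0])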
3. Preparing the Dataset
Now we need a way to vectorize the text corpus by converting each text into a sequence of integers. You might be wondering why we need to do this. Remember that we are going to feed the text into a neural network, and a neural network only understands numbers; more precisely, fixed-length sequences of integers.
Before we do any of that, however, we need to clean the corpus by removing punctuation, lowercasing all characters, and so on. Fortunately, Keras has a built-in class, keras.preprocessing.text.Tokenizer(), that does all of this in a few lines of code:
# Text tokenization
# vectorizing text, turning each text into a sequence of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# convert to sequences of integers
X = tokenizer.texts_to_sequences(X)
In [4]: print(X[0])
[49, 472, 4436, 843, 756, 659, 64, 8, 1328, 87, 123, 352, 1329, 148, 2996, 1330, 67, 58, 4437, 144]
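To see which words these integers stand for, we can invert the tokenizer's vocabulary. This is a small illustrative sketch of our own, not part of the original tutorial:

# build the reverse mapping from integer index to word and decode the first sample
index2word = {i: w for w, i in tokenizer.word_index.items()}
print(' '.join(index2word[i] for i in X[0]))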
# convert to numpy arrays
X = np.array(X)
y = np.array(y)
# pad sequences at the beginning of each sequence with 0's
# for example, if SEQUENCE_LENGTH=4:
# [[5, 3, 2], [5, 1, 2, 3], [3, 4]]
# will be transformed to:
# [[0, 5, 3, 2], [5, 1, 2, 3], [0, 0, 3, 4]]
X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)
As you may recall, we set SEQUENCE_LENGTH to 100, so all sequences now have a length of 100.
Our labels are also text, but we will take a different approach here: since the labels are only "spam" and "ham", we need to one-hot encode them:

# One-hot encoding the labels
# [spam, ham, spam, ham, ham] will be converted to:
# [1, 0, 1, 0, 0] and then to:
# [[0, 1], [1, 0], [0, 1], [1, 0], [1, 0]]
y = [ label2int[label] for label in y ]
y = to_categorical(y)
In [7]: print(y[0])
[1.0, 0.0]
# split and shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)
4. Building the Model
Now we are ready to build the model. The general architecture is shown in the figure below:
[Figure: network architecture]

The first layer is a pre-trained embedding layer that maps each word to an N-dimensional vector of real numbers (EMBEDDING_SIZE corresponds to the size of this vector, which is 100 in this case). Two words with similar meanings tend to have vectors that are very close to each other.
The second layer is a recurrent neural network with LSTM units. Finally, the output layer consists of 2 neurons, one for "spam" and one for "ham", with a softmax activation function.
Let's start by writing a function that loads the pre-trained embedding vectors:
def get_embedding_vectors(tokenizer, dim=100):
    embedding_index = {}
    with open(f"data/glove.6B.{dim}d.txt", encoding='utf8') as f:
        for line in tqdm.tqdm(f, "Reading GloVe"):
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embedding_index[word] = vectors
    word_index = tokenizer.word_index
    embedding_matrix = np.zeros((len(word_index)+1, dim))
    for word, i in word_index.items():
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            # words not found will be 0s
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
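This function expects the GloVe vectors at data/glove.6B.100d.txt. If you don't have them, the official archive is hosted by the Stanford NLP group; the sketch below downloads and extracts it (the URL and paths are assumptions, verify them before running, and note the archive is roughly 800 MB):

import os
import urllib.request
import zipfile

# glove.6B.zip contains the 50/100/200/300-dimensional vectors trained on 6B tokens
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
if not os.path.isfile("data/glove.6B.100d.txt"):
    urllib.request.urlretrieve(glove_url, "data/glove.6B.zip")
    with zipfile.ZipFile("data/glove.6B.zip") as z:
        z.extractall("data")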
def get_model(tokenizer, lstm_units):
    """
    Constructs the model:
    Embedding vectors => LSTM => 2 output fully-connected neurons with softmax activation
    """
    # get the GloVe embedding vectors
    embedding_matrix = get_embedding_vectors(tokenizer)
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index)+1,
                        EMBEDDING_SIZE,
                        weights=[embedding_matrix],
                        trainable=False,
                        input_length=SEQUENCE_LENGTH))
    model.add(LSTM(lstm_units, recurrent_dropout=0.2))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation="softmax"))
    # compile with the RMSprop optimizer,
    # as well as precision and recall metrics
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy", keras_metrics.precision(), keras_metrics.recall()])
    model.summary()
    return model
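Note that we pass trainable=False to the Embedding layer, which freezes the GloVe weights during training; as the model summary below confirms, the 901,300 embedding parameters are counted as non-trainable, and only the LSTM and Dense layers (117,506 parameters) are updated. Setting trainable=True would let the network fine-tune the embeddings, at the cost of a much larger trainable parameter count and, on a dataset this small, potentially a higher risk of overfitting.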
# construct the model with 128 LSTM units
model = get_model(tokenizer=tokenizer, lstm_units=128)
5. Training the Model
We now need to train this model on the data we just loaded:

# initialize our ModelCheckpoint and TensorBoard callbacks
# model checkpoint for saving best weights
model_checkpoint = ModelCheckpoint("results/spam_classifier_{val_loss:.2f}", save_best_only=True, verbose=1)
# for better visualization
tensorboard = TensorBoard(f"logs/spam_classifier_{time.time()}")
# print our data shapes
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("y_train.shape:", y_train.shape)
print("y_test.shape:", y_test.shape)
# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE, epochs=EPOCHS,
          callbacks=[tensorboard, model_checkpoint],
          verbose=1)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          901300
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 1,018,806
Trainable params: 117,506
Non-trainable params: 901,300
_________________________________________________________________
X_train.shape: (4180, 100)
X_test.shape: (1394, 100)
y_train.shape: (4180, 2)
y_test.shape: (1394, 2)
Train on 4180 samples, validate on 1394 samples
Epoch 1/20
4180/4180 [==============================] - 9s 2ms/step - loss: 0.1712 - acc: 0.9325 - precision: 0.9524 - recall: 0.9708 - val_loss: 0.1023 - val_acc: 0.9656 - val_precision: 0.9840 - val_recall: 0.9758
Epoch 00001: val_loss improved from inf to 0.10233, saving model to results/spam_classifier_0.10
Epoch 2/20
4180/4180 [==============================] - 8s 2ms/step - loss: 0.0976 - acc: 0.9675 - precision: 0.9765 - recall: 0.9862 - val_loss: 0.0809 - val_acc: 0.9720 - val_precision: 0.9793 - val_recall: 0.9883
Epoch 20/20
4180/4180 [==============================] - 8s 2ms/step - loss: 0.0130 - acc: 0.9971 - precision: 0.9973 - recall: 0.9994 - val_loss: 0.0629 - val_acc: 0.9821 - val_precision: 0.9916 - val_recall: 0.9875
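Because ModelCheckpoint was configured with save_best_only=True, the weights with the lowest validation loss were saved under results/. Before evaluating, you can optionally reload that best checkpoint; a minimal sketch, assuming a filename like the one below (the exact name depends on the best val_loss your run reached):

# hypothetical filename: ModelCheckpoint formats it from the epoch's val_loss
model.load_weights("results/spam_classifier_0.06")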
6. Evaluating the Model
Let's evaluate our model:

# get the loss and metrics
result = model.evaluate(X_test, y_test)
# extract them
loss = result[0]
accuracy = result[1]
precision = result[2]
recall = result[3]
print(f"[+] Accuracy: {accuracy*100:.2f}%")
print(f"[+] Precision: {precision*100:.2f}%")
print(f"[+] Recall: {recall*100:.2f}%")
1394/1394 [==============================] - 1s 569us/step
[+] Accuracy: 98.21%
[+] Precision: 99.16%
[+] Recall: 98.75%
Finally, let's write a small function that takes a raw message and predicts whether it is spam or ham:

def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad the sequence
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    # one-hot encoded vector, revert using np.argmax
    return int2label[np.argmax(prediction)]
text = "Congratulations! you have won 100,000$ this week, click here to claim fast"
print(get_predictions(text))
spam
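We can also sanity-check the other direction with an obviously legitimate message (a made-up example of our own):

text = "Hi man, I was wondering if we can meet tomorrow."
print(get_predictions(text))  # expected output: ham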
Awesome! This approach works remarkably well; try tweaking the training and model hyperparameters to see whether you can improve on these results.