Text Classification with TensorFlow
1 Introduction
In natural language processing, text classification is a very common application. This article introduces how to use TensorFlow to develop embedding-based text classification models. Because the TensorFlow API changes rapidly and versions are not always compatible, this article uses the latest releases of TensorFlow (TF) and related libraries as of April 16, 2022, mainly: tensorflow (v2.8.0), tensorflow_datasets (TFDS v4.0.1), and tensorflow_text (tf_text v2.8.1). If you hit a bug, first check the versions of the TensorFlow-related libraries (a quick version check follows the API list below). The main APIs used in this workflow are:
- tf.strings
- tfds
- tf_text
- tf.data.Dataset
- tf.keras (Sequential & Functional API)
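If you run into unexpected errors, a quick version check like the following (a minimal sketch using the standard version attributes of these packages) helps confirm that your environment matches the versions above:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

# Confirm the installed versions match the ones used in this article.
print("tensorflow:", tf.__version__)
print("tensorflow_datasets:", tfds.__version__)
print("tensorflow_text:", tf_text.__version__)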
2 Get data
TensorFlow Datasets (TFDS) contains many [example datasets](https://www.tensorflow.org/datasets 'TensorFlow Datasets') for research and experimentation. This article uses the classic IMDB movie review dataset for a binary sentiment classification task. First, load the data directly with the TFDS API; the result is a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset 'tf.data.Dataset') object.
import collections
import pathlib
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
import plotly.express as px
import matplotlib.pyplot as plt
# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    shuffle_files=True,
    as_supervised=True)
# Validation set - a tf.data.Dataset object
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    shuffle_files=True,
    as_supervised=True)
# Check the count of records
print(train_ds.cardinality().numpy())
print(val_ds.cardinality().numpy())
The return value is:
20000
5000
Use the following to view a sample record:
for data, label in train_ds.take(1):
    print(type(data))
    print('Text:', data.numpy())
    print('Label:', label.numpy())
The return value is:
<class 'tensorflow.python.framework.ops.EagerTensor'>
Text: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label: 0
3 Text preprocessing
This section uses the tf_text and tf.strings APIs to preprocess the data. tf.data.Dataset makes it easy to map the corresponding functions over the data, and is well worth learning and using.
3.1 Convert to lowercase
In this classification task, letter case does not contribute to the model's predictions. Therefore, apply a map operation on the Dataset to convert all text to lowercase; pay close attention to the data format inside the tf.data.Dataset.
train_ds = train_ds.map(lambda text, label: (tf_text.case_fold_utf8(text), label))
val_ds = val_ds.map(lambda text, label: (tf_text.case_fold_utf8(text), label))
3.2 Format the text
This step formats the text with regular expressions, for example adding spaces around punctuation, which makes it easier for later steps to tokenize on whitespace.
str_regex_pattern = [
    ("[^A-Za-z0-9(),!?\'\`]", " "), ("\'s", " \'s"), ("\'ve", " \'ve"), ("n\'t", " n\'t"), ("\'re", " \'re"),
    ("\'d", " \'d"), ("\'ll", " \'ll"), (",", " , "), ("!", " ! "), ("\(", " \( "), ("\)", " \) "), ("\?", " \? "),
    ("\s{2,}", " ")]
for pattern, rewrite in str_regex_pattern:
    train_ds = train_ds.map(lambda text, label: (tf.strings.regex_replace(text, pattern=pattern, rewrite=rewrite), label))
    val_ds = val_ds.map(lambda text, label: (tf.strings.regex_replace(text, pattern=pattern, rewrite=rewrite), label))
3.3 Build the vocabulary
Build the vocabulary from the training set only (be careful not to use the validation or test set, which would lead to information leakage). This step maps each token to a corresponding index, so that the data can be converted into a form the model can train on and predict from.
# Do not use validation set as that will lead to data leak
train_text = train_ds.map(lambda text, label: text)
tokenizer = tf_text.WhitespaceTokenizer()
unique_tokens = collections.defaultdict(lambda: 0)
sentence_length = []
for text in train_text.as_numpy_iterator():
    tokens = tokenizer.tokenize(text).numpy()
    sentence_length.append(len(tokens))
    for token in tokens:
        unique_tokens[token] += 1
# check out the average sentence length -> ~250 tokens
print(sum(sentence_length)/len(sentence_length))
# print 10 most used tokens - token, frequency
d_view = [ (v,k) for k,v in unique_tokens.items()]
d_view.sort(reverse=True)
for v, k in d_view[:10]:
    print("%s: %d" % (k, v))
The output shows that the high-frequency tokens are common English words:
b'the': 269406
b',': 221098
b'and': 131502
b'a': 130309
b'of': 116695
b'to': 108605
b'is': 88351
b'br': 81558
b'it': 77094
b'in': 75177
You can also plot the frequency of each token, which helps with choosing the vocabulary size.
fig = px.scatter(x=range(len(d_view)), y=[cnt for cnt, word in d_view])
fig.show()
As the figure shows, among the more than 70,000 unique tokens, many appear only a handful of times, so the vocabulary size is set to 20,000.
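As an optional, rough check (an illustrative sketch, not part of the original workflow), you can also compute how much of the total token count the 20,000 most frequent tokens cover:
# d_view holds (count, token) pairs sorted by descending frequency.
total_count = sum(cnt for cnt, token in d_view)
top_count = sum(cnt for cnt, token in d_view[:20000])
print("Coverage of the top 20,000 tokens: {:.2%}".format(top_count / total_count))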
3.4 Build the lookup table
Use TensorFlow's tf.lookup.StaticVocabularyTable to map tokens to their corresponding indices, and test it with a simple example.
vocab_size = 20000  # vocabulary size chosen from the frequency plot above
keys = [token for cnt, token in d_view][:vocab_size]
values = range(2, len(keys) + 2)  # Reserve `0` for padding, `1` for OOV tokens.
num_oov_buckets = 1
# Note: must assign the key_dtype and value_dtype when the keys and values are Python arrays
init = tf.lookup.KeyValueTensorInitializer(
    keys=keys,
    values=values,
    key_dtype=tf.string, value_dtype=tf.int64)
table = tf.lookup.StaticVocabularyTable(
    init,
    num_oov_buckets=num_oov_buckets)
# Test the look up table with sample input
input_tensor = tf.constant(["emerson", "lake", "palmer", "king"])
print(table[input_tensor].numpy())
The output is:
array([20000, 2065, 14207, 618])
Next, map the text to indices: define a transformation function and apply it to the datasets:
def text_index_lookup(text, label):
    tokenized = tokenizer.tokenize(text)
    vectorized = table.lookup(tokenized)
    return vectorized, label
train_ds = train_ds.map(text_index_lookup)
val_ds = val_ds.map(text_index_lookup)
3.5 Configure the datasets
tf.data.Dataset's cache and prefetch APIs can effectively improve performance. The cache method keeps data in memory for fast reads and writes, while prefetch overlaps data preparation with model execution, improving time utilization.
AUTOTUNE = tf.data.AUTOTUNE
def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)
Texts differ in length, but the neural network needs inputs with consistent dimensions. Therefore, pad the data so that the sequences in each batch have the same length.
BATCH_SIZE = 32
train_ds = train_ds.padded_batch(BATCH_SIZE )
val_ds = val_ds.padded_batch(BATCH_SIZE )
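To see what the padding produces (an optional check, not part of the original article), you can inspect the shapes of a single batch; sequences are padded with 0, the index reserved for padding:
for batch_ids, batch_labels in train_ds.take(1):
    print(batch_ids.shape)     # (32, length of the longest sequence in this batch)
    print(batch_labels.shape)  # (32,)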
3.6 Process the test set
The test set, used to evaluate model performance, is processed in the same way so that the model can make predictions on it:
# Test set.
test_ds = tfds.load(
    'imdb_reviews',
    split='test',
    # batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
test_ds = test_ds.map(lambda text, label: (tf_text.case_fold_utf8(text), label))
for pattern, rewrite in str_regex_pattern:
    test_ds = test_ds.map(lambda text, label: (tf.strings.regex_replace(text, pattern=pattern, rewrite=rewrite), label))
test_ds = test_ds.map(text_index_lookup)
test_ds = configure_dataset(test_ds)
test_ds = test_ds.padded_batch(BATCH_SIZE )
4 Build the models
4.1 Build a convolutional neural network with the Sequential API
vocab_size += 2  # 0 for padding and 1 for oov token
def create_model(vocab_size, num_labels, dropout_rate):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128, mask_zero=True),
        tf.keras.layers.Conv1D(32, 3, padding="valid", activation="relu", strides=1),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Conv1D(64, 4, padding="valid", activation="relu", strides=1),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Conv1D(128, 5, padding="valid", activation="relu", strides=1),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(num_labels)
    ])
    return model
tf.keras.backend.clear_session()
model = create_model(vocab_size=vocab_size, num_labels=2, dropout_rate=0.5)
# Momentum in SGD significantly speeds up convergence
loss = losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss=loss, optimizer=optimizer, metrics='accuracy')
print(model.summary())
The output is:
Model: "sequential"
_________________________________________________________________________
Layer (type)                                Output Shape         Param #
=========================================================================
embedding (Embedding)                       (None, None, 128)    2560256
conv1d (Conv1D)                             (None, None, 32)     12320
max_pooling1d (MaxPooling1D)                (None, None, 32)     0
conv1d_1 (Conv1D)                           (None, None, 64)     8256
max_pooling1d_1 (MaxPooling1D)              (None, None, 64)     0
conv1d_2 (Conv1D)                           (None, None, 128)    41088
global_max_pooling1d (GlobalMaxPooling1D)   (None, 128)          0
dropout (Dropout)                           (None, 128)          0
dense (Dense)                               (None, 2)            258
=========================================================================
Total params: 2,622,178
Trainable params: 2,622,178
Non-trainable params: 0
_________________________________________________________________________
Next, you can train and evaluate the model:
# early stopping reduces the risk of overfitting
early_stopping = tf.keras.callbacks.EarlyStopping(patience=10)
epochs = 100
history = model.fit(x=train_ds, validation_data=val_ds,epochs=epochs, callbacks=[early_stopping])
loss, accuracy = model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
Considering how simple the model structure is, the result is acceptable:
782/782 [==============================] - 57s 72ms/step - loss: 0.4583 - accuracy: 0.8678
Loss: 0.45827823877334595
Accuracy: 86.78%
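Optionally (an illustrative sketch, not part of the original article), you can use matplotlib, imported at the beginning, to visualize the training curves recorded in the history object:
# Plot training and validation accuracy across epochs.
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()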
4.2 Build a bidirectional LSTM with the Functional API
The steps are similar to those with the Sequential API, but the Functional API is more flexible.
num_labels = 2      # binary sentiment labels, as in the CNN model above
dropout_rate = 0.5  # same dropout rate as the CNN model
input = tf.keras.layers.Input([None])
x = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=128,
    # Use masking to handle the variable sequence lengths
    mask_zero=True)(input)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dropout(dropout_rate)(x)
output = tf.keras.layers.Dense(num_labels)(x)
lstm_model = tf.keras.Model(inputs=input, outputs=output, name="text_lstm_model")
loss = losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
lstm_model.compile(loss=loss, optimizer=optimizer, metrics='accuracy')
lstm_model.summary()
The output is:
Model: "text_lstm_model"
_________________________________________________________________________
Layer (type)                                Output Shape         Param #
=========================================================================
input_5 (InputLayer)                        [(None, None)]       0
embedding_5 (Embedding)                     (None, None, 128)    2560256
bidirectional_4 (Bidirectional)             (None, 128)          98816
dense_4 (Dense)                             (None, 64)           8256
dropout_2 (Dropout)                         (None, 64)           0
dense_5 (Dense)                             (None, 2)            130
=========================================================================
Total params: 2,667,458
Trainable params: 2,667,458
Non-trainable params: 0
_________________________________________________________________________
Similarly, train and evaluate the model:
history_2 = lstm_model.fit(x=train_ds, validation_data=val_ds, epochs=epochs, callbacks=[early_stopping])
loss, accuracy = lstm_model.evaluate(test_ds)
print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))
Considering how simple the model structure is, the result is acceptable:
782/782 [==============================] - 84s 106ms/step - loss: 0.4105 - accuracy: 0.8160
Loss: 0.4105057716369629
Accuracy: 81.60%
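Finally, as an optional illustration (a hypothetical helper, not part of the original article), either trained model can score a raw review by reusing the same preprocessing steps defined above; in the imdb_reviews dataset, label 0 is negative and label 1 is positive:
def predict_sentiment(trained_model, raw_text):
    # Reuse the preprocessing steps from Section 3 on a single raw string.
    text = tf_text.case_fold_utf8(tf.constant(raw_text))
    for pattern, rewrite in str_regex_pattern:
        text = tf.strings.regex_replace(text, pattern=pattern, rewrite=rewrite)
    tokens = tokenizer.tokenize(text)              # 1-D tensor of tokens
    ids = table.lookup(tokens)                     # 1-D tensor of vocabulary indices
    logits = trained_model.predict(tf.expand_dims(ids, 0))  # add a batch dimension
    return tf.nn.softmax(logits, axis=-1).numpy()  # [negative, positive] probabilities

print(predict_sentiment(lstm_model, "A wonderful, heartfelt film with great acting."))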
5 Summary
For text classification, there are many newer techniques worth trying, and many decisions in the workflow above are worth experimenting with (the "alchemy" of tuning). This article aims to walk through the important concepts and commonly used APIs for text classification tasks using the latest TensorFlow APIs. There are still many places that could be optimized in real work. I hope this sharing helps you; feel free to leave a message in the comments!