Object detection with a Vision Transformer in TensorFlow Keras: a sample code walkthrough for beginners.

Introduction
Sample code walkthrough
Summary

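Introduction

The Vision Transformer (ViT) is becoming a mainstream algorithm for image recognition, and it has also been proposed that a ViT model can be used for object detection. It is a versatile algorithm.

In this article I take sample code that performs object detection with a Vision Transformer in TensorFlow, break it down a little, and explain it for beginners.

The original sample code is here: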

Keras documentation: Object detection with Vision Transformers

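If you are interested, please also see my walkthrough of the ViT image classification sample code:

A code walkthrough of image recognition with the Vision Transformer (ViT): the TensorFlow Keras API code explained for beginners.

That article explains the sample code for the Vision Transformer (ViT), an image recognition algorithm that has recently attracted attention, using the TensorFlow Keras API. It uses plain language and avoids packing in more information than necessary, so that beginners can follow along. Start by simply running the code!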

Sample code walkthrough

Preparation

In an environment where TensorFlow 2.4 is available, install TensorFlow Addons.

Install the add-on with pip from the command line, as shown below.

pip install -U tensorflow-addons

Imports

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa
import matplotlib.pyplot as plt
import cv2
import os
import scipy.io
import shutil

Import the required libraries.

Apart from the TensorFlow Addons import on line 5, these are all familiar.

Preparing the dataset

# Path to images and annotations
path_images = "/101_ObjectCategories/airplanes/"
path_annot = "/Annotations/Airplanes_Side_2/"

path_to_downloaded_file = keras.utils.get_file(
    fname="caltech_101_zipped",
    origin="https://data.caltech.edu/records/mzrjq-6wc02/files/caltech-101.zip",
    extract=True,
    archive_format="zip",  # downloaded file format
    cache_dir="/",  # cache and extract under the root directory ("/")
)

# Extracting tar files found inside main zip file
shutil.unpack_archive("/datasets/caltech-101/101_ObjectCategories.tar.gz", "/")
shutil.unpack_archive("/datasets/caltech-101/Annotations.tar", "/")

# list of paths to images and annotations
image_paths = [
    f for f in os.listdir(path_images) if os.path.isfile(os.path.join(path_images, f))
]
annot_paths = [
    f for f in os.listdir(path_annot) if os.path.isfile(os.path.join(path_annot, f))
]

image_paths.sort()
annot_paths.sort()

image_size = 224  # resize input images to this size

images, targets = [], []

# loop over the annotations and images, preprocess them and store in lists
for i in range(0, len(annot_paths)):
    # Access bounding box coordinates
    annot = scipy.io.loadmat(path_annot + annot_paths[i])["box_coord"][0]

    top_left_x, top_left_y = annot[2], annot[0]
    bottom_right_x, bottom_right_y = annot[3], annot[1]

    image = keras.utils.load_img(
        path_images + image_paths[i],
    )
    (w, h) = image.size[:2]

    # resize train set images
    if i < int(len(annot_paths) * 0.8):
        # resize image if it is for training dataset
        image = image.resize((image_size, image_size))

    # convert image to array and append to list
    images.append(keras.utils.img_to_array(image))

    # apply relative scaling to bounding boxes as per given image and append to list
    targets.append(
        (
            float(top_left_x) / w,
            float(top_left_y) / h,
            float(bottom_right_x) / w,
            float(bottom_right_y) / h,
        )
    )

# Convert the list to numpy array, split to train and test dataset
(x_train), (y_train) = (
    np.asarray(images[: int(len(images) * 0.8)]),
    np.asarray(targets[: int(len(targets) * 0.8)]),
)
(x_test), (y_test) = (
    np.asarray(images[int(len(images) * 0.8) :]),
    np.asarray(targets[int(len(targets) * 0.8) :]),
)

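Line 2: the folder path where the airplane images are stored (the images are downloaded later).
Line 3: the folder path where the airplane annotation data is stored (the files are downloaded later).
Lines 5-11: download the dataset.
Lines 14-15: unpack the archives contained in the downloaded file.
Lines 18-20: build the list of airplane image file names.
Lines 21-23: build the list of airplane annotation file names.
Line 25: sort the list of image file names. os.listdir returns names in arbitrary order, and the loop below pairs image i with annotation i, so this sort is needed.
Line 26: sort the list of annotation file names in the same way so that the two lists stay aligned.
Line 28: the image size is 224 x 224 pixels.
Line 30: create empty lists.
Line 33: loop once per annotation file.
Line 35: load one annotation file with scipy.io.loadmat and take the first row stored under the "box_coord" key. Loading data with SciPy is a little unusual; it is used here because the annotations are MATLAB .mat files.
Line 37: from the data loaded on line 35, get the top-left x and y coordinates of the annotation.
Line 38: from the data loaded on line 35, get the bottom-right x and y coordinates of the annotation.
Lines 40-42: load the image that belongs to the annotation being read.
Line 43: get the width and height of the image. A PIL image's size is already a (width, height) pair, so the [:2] slice changes nothing.
Line 46: check whether the current item falls within the first 80% of the data (80% is used as training data).
Line 48: resize to 224 x 224 pixels. Test images are left at their original size and are resized at inference time instead.
Line 51: convert the image to an array and append it to the images list. There are not many files this time, so everything is simply kept in a list.
Lines 54-61: append the top-left and bottom-right coordinates of the bounding box to the targets list. Each x coordinate is divided by the width of the original image (before resizing) and each y coordinate by its height, so the values are normalized.
Lines 64-67: convert the lists to NumPy arrays. Only the first 80% is taken, giving the training images and training annotations.
Lines 68-71: likewise convert the lists to NumPy arrays and take the last 20% as the test images and test annotations.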
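
The relative scaling on lines 54-61 can be tried in isolation. The following is a minimal sketch, assuming the annotation order that the indexing in the sample implies (top y, bottom y, left x, right x). The helper name normalize_box and all numbers are made up for illustration.

```python
def normalize_box(box_coord, width, height):
    """Convert a box given as (top_y, bottom_y, left_x, right_x) in pixels
    into (left_x, top_y, right_x, bottom_y) scaled to the 0-1 range."""
    top_y, bottom_y, left_x, right_x = box_coord
    return (
        float(left_x) / width,
        float(top_y) / height,
        float(right_x) / width,
        float(bottom_y) / height,
    )


# Hypothetical annotation for a 400 x 200 pixel image.
target = normalize_box((20, 180, 40, 360), width=400, height=200)
print(target)

# The sample uses the first 80% of the sorted files for training.
num_files = 10  # made-up file count
split_index = int(num_files * 0.8)
print(split_index)
```

Because the boxes are stored as fractions of the original width and height, they remain valid after the image itself is resized to 224 x 224.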

Create the fully connected block, an MLP (multilayer perceptron).

def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x

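This helper builds the fully connected layers.

Line 1: the arguments are x (the output of the preceding layer), hidden_units (a list giving the number of units in each dense layer), and dropout_rate (the dropout rate).
Line 2: loop once per entry in hidden_units.
Line 3: add a dense layer with GELU activation.
Line 4: apply dropout.
Line 5: return the output.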
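
The tf.nn.gelu activation used in mlp may be unfamiliar. As a rough illustration in plain Python (not the TensorFlow implementation itself), the exact form of GELU multiplies the input by the standard normal CDF. The function name gelu below exists only in this sketch.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, round(gelu(value), 4))
```

Unlike ReLU, small negative inputs are not cut to exactly zero, while large positive inputs pass through almost unchanged.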

Creating the patch-extraction class

class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    #     Override function to avoid error while saving model
    def get_config(self):
        config = super().get_config().copy()
        config.update(
            {
                "input_shape": input_shape,
                "patch_size": patch_size,
                "num_patches": num_patches,
                "projection_dim": projection_dim,
                "num_heads": num_heads,
                "transformer_units": transformer_units,
                "transformer_layers": transformer_layers,
                "mlp_head_units": mlp_head_units,
            }
        )
        return config

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # return patches
        return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])

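Line 1: define the class, which inherits from the Keras Layer class.
Line 2: the initializer. Its argument is the patch size.
Line 3: initialize the parent class as well.
Line 4: store the patch size as an instance attribute so that it can be reused in call.
Line 7: the function that returns the layer's config.
Line 8: copy the config from the parent Layer class.
Lines 9-20: update the config with the following entries.
input_shape: size of the input image (handled on the model side)
patch_size: size of each patch
num_patches: number of patches
projection_dim: dimension of the projection (no need to worry about it here)
num_heads: number of attention heads
transformer_units: sizes of the dense layers inside each Transformer block
transformer_layers: number of Transformer blocks
mlp_head_units: sizes of the dense layers in the head
Note that these names are not attributes of the layer. They resolve to the module-level variables defined later in the training section, so get_config only works once those variables exist.

Line 23: the call function, which runs when the layer is applied.
Line 24: get the batch size from the shape of the input mini-batch.
Lines 25-31: create patches from the image data.
images: the image data
sizes: the patch size
strides: the step from the start of one patch to the start of the next. It equals the patch size here, so the patches do not overlap.
rates: the sampling rate inside each patch. A value of 1 means adjacent pixels are used without skipping.
padding: how the image border is handled. "VALID" keeps only patches that fit entirely inside the image.
Line 33: reshape the patches to [batch size, number of patches, values per patch] and return them.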
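
The name lookup in get_config is easy to misread, so here is a plain-Python stand-in (no Keras involved; FakeLayer and both method names are hypothetical). It shows that a bare patch_size inside a method reads the module-level variable, while self.patch_size reads the value passed to the constructor.

```python
patch_size = 32  # module-level variable, like the one defined later in the script


class FakeLayer:
    """Stand-in for a Keras layer; only the name lookup is illustrated."""

    def __init__(self, patch_size):
        self.patch_size = patch_size

    def config_from_global(self):
        # Bare name: Python does not look at the instance and finds the global.
        return {"patch_size": patch_size}

    def config_from_instance(self):
        # Attribute access: the value this layer was built with.
        return {"patch_size": self.patch_size}


layer = FakeLayer(patch_size=16)
print(layer.config_from_global())
print(layer.config_from_instance())
```

In a real layer, returning only values stored on self is the safer choice. The sample runs as written because those globals are defined before the model is saved.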

Try patch creation on a single image.

patch_size = 32  # Size of the patches to be extracted from the input images

plt.figure(figsize=(4, 4))
plt.imshow(x_train[0].astype("uint8"))
plt.axis("off")

patches = Patches(patch_size)(tf.convert_to_tensor([x_train[0]]))
print(f"Image size: {image_size} X {image_size}")
print(f"Patch size: {patch_size} X {patch_size}")
print(f"{patches.shape[1]} patches per image \n{patches.shape[-1]} elements per patch")


n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = tf.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(patch_img.numpy().astype("uint8"))
    plt.axis("off")

Here we open one image and try creating patches from it. Check the image cut into a grid.

Implementing the patch encoding layer
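
To see what the patch grid contains without TensorFlow, the same non-overlapping split can be sketched with NumPy reshapes. This is an illustration under the assumption that patches are taken in row-major order with no overlap; extract_patches_np is a hypothetical helper, not part of the sample.

```python
import numpy as np


def extract_patches_np(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    rows, cols = height // patch_size, width // patch_size
    # Drop any border that does not fill a whole patch (like "VALID" padding).
    image = image[: rows * patch_size, : cols * patch_size]
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch_h, patch_w, C)
    return grid.reshape(rows * cols, patch_size * patch_size * channels)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches_np = extract_patches_np(image, patch_size=32)
print(patches_np.shape)  # (49, 3072)
```

With a 224 x 224 image and 32 x 32 patches this gives 7 x 7 = 49 patches of 32 * 32 * 3 = 3072 values each, which matches the numbers printed by the sample.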

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    # Override function to avoid error while saving model
    def get_config(self):
        config = super().get_config().copy()
        config.update(
            {
                "input_shape": input_shape,
                "patch_size": patch_size,
                "num_patches": num_patches,
                "projection_dim": projection_dim,
                "num_heads": num_heads,
                "transformer_units": transformer_units,
                "transformer_layers": transformer_layers,
                "mlp_head_units": mlp_head_units,
            }
        )
        return config

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded

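Create the patch encoder, which projects each patch to projection_dim dimensions and adds a position embedding.

Line 1: the patch encoder class. It inherits from the Layer class.
Line 2: the initializer.
Line 3: initialize the parent class.
Line 4: instance attribute holding the number of patches.
Line 5: instance attribute projection, a Dense layer that maps each patch to projection_dim values.
Lines 6-8: instance attribute position_embedding, an Embedding layer that holds one learned vector per patch position.
Lines 11-25: update the config. This is written the same way as in the patch-extraction class, so see the explanation above.
Line 27: the call function.
Line 28: create the position indices 0 to num_patches - 1 with tf.range. This function may be a little unfamiliar.
Line 29: the encoder output, which is the sum of the projected patches and the position embeddings.
Line 30: return the encoder output.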
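
The addition on line 29 relies on broadcasting: the position embeddings have no batch axis, so the same table is added to every image in the batch. Below is a NumPy sketch with random weights, assuming the shapes used in this article (49 patches of 3072 values, projection_dim of 64). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_patches, patch_dim, projection_dim = 2, 49, 3072, 64

patch_batch = rng.normal(size=(batch_size, num_patches, patch_dim))

# Stand-ins for the Dense kernel/bias and the Embedding table.
kernel = rng.normal(size=(patch_dim, projection_dim)) * 0.01
bias = np.zeros(projection_dim)
position_table = rng.normal(size=(num_patches, projection_dim))

positions = np.arange(num_patches)            # 0, 1, ..., 48
projected = patch_batch @ kernel + bias       # shape (2, 49, 64)
position_vectors = position_table[positions]  # shape (49, 64)
encoded = projected + position_vectors        # broadcast over the batch axis
print(encoded.shape)  # (2, 49, 64)
```

Every patch ends up as a 64-value vector that carries both its content and its position in the grid.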

Building the ViT model

def create_vit_object_detector(
    input_shape,
    patch_size,
    num_patches,
    projection_dim,
    num_heads,
    transformer_units,
    transformer_layers,
    mlp_head_units,
):
    inputs = layers.Input(shape=input_shape)
    # Create patches
    patches = Patches(patch_size)(inputs)
    # Encode patches
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # MLP
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Normalize, then flatten to a [batch_size, num_patches * projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.3)(representation)
    # Add MLP.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.3)

    bounding_box = layers.Dense(4)(
        features
    )  # Final four neurons that output bounding box

    # return Keras model.
    return keras.Model(inputs=inputs, outputs=bounding_box)

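This is the most important part, where the ViT model is built.

Lines 1-10: the function that builds the ViT model. It takes many arguments; see the descriptions above.
Line 11: the image input layer.
Line 13: create the patches.
Line 15: encode the patches.
Line 18: loop once per Transformer block.
Line 20: layer normalization.
Lines 22-24: the multi-head attention layer. x1 is passed twice, as both query and value, which makes this self-attention.
Line 26: x2, the first skip connection, adds the attention output to the patch encoder output element-wise.
Line 28: apply layer normalization once more.
Line 30: x3, the MLP (dense layers).
Line 32: the second skip connection adds x3 and x2. The result becomes the input to the next block.
Line 35: apply layer normalization.
Line 36: flatten into a one-dimensional vector per image.
Line 37: apply dropout.
Line 39: pass through the dense layers to extract features.
Lines 41-43: one more dense layer outputs the bounding box (four numbers: top-left x and y, bottom-right x and y).
Line 46: return a model that takes an image as input and outputs bounding box coordinates.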
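
The order of operations inside one Transformer block (normalize, attend, add, normalize, MLP, add) can be sketched in NumPy. This is a simplified illustration, not the Keras layers: it assumes a single attention head, omits dropout and the learned scale and offset of layer normalization, and uses ReLU in place of GELU. All function names and weights are made up.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; query, key and value all come from x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def transformer_block(encoded_patches, w_q, w_k, w_v, w_1, w_2):
    x1 = layer_norm(encoded_patches)
    attention_output = self_attention(x1, w_q, w_k, w_v)
    x2 = attention_output + encoded_patches  # skip connection 1
    x3 = layer_norm(x2)
    x3 = np.maximum(x3 @ w_1, 0.0) @ w_2     # simplified MLP, ReLU stands in for GELU
    return x3 + x2                           # skip connection 2


rng = np.random.default_rng(0)
dim = 64
encoded = rng.normal(size=(2, 49, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
w_1 = rng.normal(size=(dim, dim * 2)) * 0.1
w_2 = rng.normal(size=(dim * 2, dim)) * 0.1

out = transformer_block(encoded, w_q, w_k, w_v, w_1, w_2)
print(out.shape)  # (2, 49, 64)
```

Because both skip connections are element-wise sums, the output of a block has the same shape as its input, which is what allows the sample to stack four blocks in a loop. After the loop, Flatten turns each (49, 64) sequence into 49 * 64 = 3136 values per image.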

Running training

def run_experiment(model, learning_rate, weight_decay, batch_size, num_epochs):

    optimizer = tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

    # Compile model.
    model.compile(optimizer=optimizer, loss=keras.losses.MeanSquaredError())

    checkpoint_filepath = "logs/"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_loss",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[
            checkpoint_callback,
            keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
        ],
    )

    return history


input_shape = (image_size, image_size, 3)  # input image shape
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 32
num_epochs = 100
num_patches = (image_size // patch_size) ** 2
projection_dim = 64
num_heads = 4
# Size of the transformer layers
transformer_units = [
    projection_dim * 2,
    projection_dim,
]
transformer_layers = 4
mlp_head_units = [2048, 1024, 512, 64, 32]  # Size of the dense layers


history = []
num_patches = (image_size // patch_size) ** 2

vit_object_detector = create_vit_object_detector(
    input_shape,
    patch_size,
    num_patches,
    projection_dim,
    num_heads,
    transformer_units,
    transformer_layers,
    mlp_head_units,
)

# Train model
history = run_experiment(
    vit_object_detector, learning_rate, weight_decay, batch_size, num_epochs
)

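Now run training. First, define the training function.

Line 1: the training function. Its arguments are the model, learning rate, weight decay, batch size, and number of epochs.
Lines 3-5: define the optimizer, using the learning rate and weight decay arguments. The optimization algorithm is AdamW.
Line 8: compile the model, specifying the optimizer and the loss. The loss is mean squared error (MSE); no square root is taken.
Line 10: the file path where checkpoints are saved.
Lines 11-16: set up the checkpoint callback.
Line 13: monitor val_loss.
Line 14: save only when the best value so far is improved.
Line 15: save only the weights (nothing else is needed).
Lines 18-28: run training.
Line 19: training data x.
Line 20: training data y.
Line 21: batch size.
Line 22: number of epochs.
Line 23: fraction of the training data set aside for validation.
Lines 24-27: callback settings. Checkpointing and early stopping are included.
Line 30: return the training history.
Line 33: shape of the input image.
Line 34: learning rate.
Line 35: weight decay.
Line 36: batch size.
Line 37: number of epochs.
Line 38: compute the number of patches.
Line 39: projection_dim, the projection dimension.
Line 40: number of attention heads.
Lines 42-45: sizes of the dense layers inside each Transformer block.
Line 46: number of Transformer blocks.
Line 47: sizes of the dense layers in the MLP head.
Line 51: this repeats line 38, so it is redundant but harmless. The history = [] on line 50 is likewise overwritten by the result on line 65.
Lines 53-62: build the ViT model using the function defined above.
Lines 65-67: run training using the function defined above. Well done!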
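
keras.losses.MeanSquaredError averages the squared differences between the predicted and true coordinates. Below is a plain-Python sketch of that calculation on made-up boxes. The helper mean_squared_error is written here for illustration and is not the Keras implementation.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared coordinate differences over all boxes."""
    squared = [
        (t - p) ** 2
        for true_box, pred_box in zip(y_true, y_pred)
        for t, p in zip(true_box, pred_box)
    ]
    return sum(squared) / len(squared)


# Two made-up normalized boxes in (x1, y1, x2, y2) order.
y_true = [(0.1, 0.2, 0.9, 0.8), (0.0, 0.0, 1.0, 1.0)]
y_pred = [(0.2, 0.2, 0.8, 0.8), (0.0, 0.0, 1.0, 1.0)]
print(mean_squared_error(y_true, y_pred))
```

No square root is taken, so the reported loss is in squared units of the normalized coordinates. Two coordinates that are each off by 0.1 contribute 0.01 apiece before averaging.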

Evaluate the model

import matplotlib.patches as patches

# Saves the model in current path
vit_object_detector.save("vit_object_detector.h5", save_format="h5")

# To calculate IoU (intersection over union, given two bounding boxes)
def bounding_box_intersection_over_union(box_predicted, box_truth):
    # get (x, y) coordinates of intersection of bounding boxes
    top_x_intersect = max(box_predicted[0], box_truth[0])
    top_y_intersect = max(box_predicted[1], box_truth[1])
    bottom_x_intersect = min(box_predicted[2], box_truth[2])
    bottom_y_intersect = min(box_predicted[3], box_truth[3])

    # calculate area of the intersection bb (bounding box)
    intersection_area = max(0, bottom_x_intersect - top_x_intersect + 1) * max(
        0, bottom_y_intersect - top_y_intersect + 1
    )

    # calculate area of the prediction bb and ground-truth bb
    box_predicted_area = (box_predicted[2] - box_predicted[0] + 1) * (
        box_predicted[3] - box_predicted[1] + 1
    )
    box_truth_area = (box_truth[2] - box_truth[0] + 1) * (
        box_truth[3] - box_truth[1] + 1
    )

    # calculate intersection over union by taking intersection
    # area and dividing it by the sum of predicted bb and ground truth
    # bb areas subtracted by  the interesection area

    # return ioU
    return intersection_area / float(
        box_predicted_area + box_truth_area - intersection_area
    )


i, mean_iou = 0, 0

# Compare results for 10 images in the test set
for input_image in x_test[:10]:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 15))
    im = input_image

    # Display the image
    ax1.imshow(im.astype("uint8"))
    ax2.imshow(im.astype("uint8"))

    input_image = cv2.resize(
        input_image, (image_size, image_size), interpolation=cv2.INTER_AREA
    )
    input_image = np.expand_dims(input_image, axis=0)
    preds = vit_object_detector.predict(input_image)[0]

    (h, w) = (im).shape[0:2]

    top_left_x, top_left_y = int(preds[0] * w), int(preds[1] * h)

    bottom_right_x, bottom_right_y = int(preds[2] * w), int(preds[3] * h)

    box_predicted = [top_left_x, top_left_y, bottom_right_x, bottom_right_y]
    # Create the bounding box
    rect = patches.Rectangle(
        (top_left_x, top_left_y),
        bottom_right_x - top_left_x,
        bottom_right_y - top_left_y,
        facecolor="none",
        edgecolor="red",
        linewidth=1,
    )
    # Add the bounding box to the image
    ax1.add_patch(rect)
    ax1.set_xlabel(
        "Predicted: "
        + str(top_left_x)
        + ", "
        + str(top_left_y)
        + ", "
        + str(bottom_right_x)
        + ", "
        + str(bottom_right_y)
    )

    top_left_x, top_left_y = int(y_test[i][0] * w), int(y_test[i][1] * h)

    bottom_right_x, bottom_right_y = int(y_test[i][2] * w), int(y_test[i][3] * h)

    box_truth = top_left_x, top_left_y, bottom_right_x, bottom_right_y

    mean_iou += bounding_box_intersection_over_union(box_predicted, box_truth)
    # Create the bounding box
    rect = patches.Rectangle(
        (top_left_x, top_left_y),
        bottom_right_x - top_left_x,
        bottom_right_y - top_left_y,
        facecolor="none",
        edgecolor="red",
        linewidth=1,
    )
    # Add the bounding box to the image
    ax2.add_patch(rect)
    ax2.set_xlabel(
        "Target: "
        + str(top_left_x)
        + ", "
        + str(top_left_y)
        + ", "
        + str(bottom_right_x)
        + ", "
        + str(bottom_right_y)
        + "\n"
        + "IoU: "
        + str(bounding_box_intersection_over_union(box_predicted, box_truth))
    )
    i = i + 1

print("mean_iou: " + str(mean_iou / len(x_test[:10])))
plt.show()

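Line 1: import matplotlib's patches module.
Line 4: save the ViT model.
Line 7: the function that computes IoU, an accuracy metric. It measures the overlap between the predicted bounding box and the ground-truth bounding box.
Line 9: the larger of the two top-left x coordinates (predicted and ground truth).
Line 10: the larger of the two top-left y coordinates.
Line 11: the smaller of the two bottom-right x coordinates.
Line 12: the smaller of the two bottom-right y coordinates.
Lines 15-17: compute the area where the predicted and ground-truth boxes overlap.
Lines 20-22: area of the predicted bounding box.
Lines 23-25: area of the ground-truth bounding box.
Lines 32-34: compute IoU from the three areas and return it.
Line 37: set the initial values.
Line 40: loop over 10 test images.
Line 41: create subplots with 1 row and 2 columns.
Line 42: keep a reference to the original image.
Line 45: show the original image on the left.
Line 46: show the original image on the right.
Lines 48-50: resize the image.
Line 51: add a batch axis with NumPy, because the model expects a batch of images rather than a single image.
Line 52: run inference with the trained model.
Line 54: remember the size of the original image.
Line 56: compute the top-left x and y coordinates of the bounding box in pixels.
Line 58: compute the bottom-right x and y coordinates in pixels.
Line 60: the predicted bounding box coordinates.
Lines 62-69: create the Rectangle object used to draw the bounding box with matplotlib: a red outline of width 1, specified as (top-left x, y), width, and height.
Line 71: draw the rectangle.
Lines 72-81: set the x-axis label, showing "Predicted" and the bounding box coordinates as text.
Line 83: the ground-truth top-left coordinates.
Line 85: the ground-truth bottom-right coordinates.
Line 87: the ground-truth bounding box.
Line 89: compute IoU with the function and add it to the running total.
Lines 91-98: the Rectangle object for drawing the ground-truth bounding box.
Line 100: draw the ground-truth bounding box.
Lines 101-113: set the x-axis label, showing "Target", the bounding box coordinates, and the IoU as text.
Line 114: increment i by one.
Line 116: compute and print the mean IoU over the 10 images.
Line 117: display the matplotlib figures.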
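
A small worked example makes the IoU arithmetic concrete. The sketch below restates the same formula as the sample, including its + 1 convention, together with the scaling from normalized to pixel coordinates. The helper names iou and to_pixels and all box values are made up for illustration.

```python
def iou(box_a, box_b):
    """IoU for (x1, y1, x2, y2) pixel boxes, using the + 1 inclusive-pixel convention."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1 + 1) * max(0, y2 - y1 + 1)
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return intersection / float(area_a + area_b - intersection)


def to_pixels(normalized_box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box back to integer pixel coordinates."""
    x1, y1, x2, y2 = normalized_box
    return [int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height)]


predicted = to_pixels((0.0, 0.0, 0.25, 0.25), width=40, height=40)    # [0, 0, 10, 10]
truth = to_pixels((0.125, 0.125, 0.375, 0.375), width=40, height=40)  # [5, 5, 15, 15]
print(iou(predicted, truth))
print(iou(truth, truth))
```

Here the overlap is 6 x 6 = 36 pixels and each box covers 11 x 11 = 121 pixels, so the IoU is 36 / (121 + 121 - 36) = 36 / 206, roughly 0.17. Identical boxes give an IoU of 1.0.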