TensorFlow Keras GPT-2: Python sample code walkthrough

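This time I walk through sample code for a large language model, the kind of model everyone knows from ChatGPT.

Keras publishes sample code for this, although it is for GPT-2.

Please keep in mind that everything below is about GPT-2, not the newer GPT models.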

Introduction
Sample code walkthrough
Summary

Introduction

The sample code I refer to is this one:

Keras documentation: GPT2 Text Generation with KerasNLP

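It's interesting to see a GPT model from OpenAI (which is backed by Microsoft) running on TensorFlow Keras (from Google).

The page also links to a Colab notebook, which is the convenient way to try it.

Sample code walkthrough

Installing and importing KerasNLP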

!pip install -q keras-nlp

This installs the package. When running from a command line instead of a notebook, drop the leading !.

import keras_nlp
import tensorflow as tf
from tensorflow import keras
import time

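These are the imports.

keras_nlp, TensorFlow itself, Keras, and time.

Overview of generative large language models (LLMs)

As you know, these are models that, like ChatGPT, can handle many natural language processing (NLP) tasks: text generation, question answering, machine translation, and so on.

They are built on the Transformer architecture, which came out of Google.

KerasNLP overview

KerasNLP ships pre-trained large language models, which makes it convenient.

Load a pre-trained GPT-2 model and generate some text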

# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

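Lines 3-6: define the preprocessor. "gpt2_base_en" selects the preset for GPT-2, and sequence_length=128 fixes the number of tokens per input: longer text is cut off and shorter text is padded. As the code comment says, the full length would be 1024, and 128 is used to speed things up.
Lines 7-9: define the model. "gpt2_base_en" selects GPT-2, and preprocessor= attaches the preprocessor defined above.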
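To get a feel for what a fixed sequence_length means, here is a minimal plain-Python sketch. It is not how KerasNLP implements it: the function name pad_or_truncate and the padding id 0 are my own assumptions, and the real preprocessor also tokenizes the text and does more than this.

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Return exactly `sequence_length` ids plus a mask marking real tokens.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    kept = list(token_ids[:sequence_length])   # cut off anything too long
    n_pad = sequence_length - len(kept)
    padded = kept + [pad_id] * n_pad           # fill the rest with padding
    mask = [1] * len(kept) + [0] * n_pad       # 1 = real token, 0 = padding
    return padded, mask


short_ids, short_mask = pad_or_truncate([11, 22, 33], sequence_length=5)
long_ids, long_mask = pad_or_truncate(range(10), sequence_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A short input gets padded up to the fixed length, and a long one loses its tail, which is why a smaller sequence_length is faster but sees less text.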

start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

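From here on the model is used to generate text.

Line 1: record the start time.
Line 3: feed a prompt to the model and generate text.
The prompt is "My trip to Yosemite was", and max_length sets the maximum length of the output, counted in tokens.
Lines 4-5: print the model output.
Line 7: record the end time.
Line 8: print how long the generation took.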

GPT-2 output:
My trip to Yosemite was one of the best experiences of my life. I was so close to the top of the mountains, I could feel the sun shining through my eyes. I was so close to the top of the mountains, the sun had a nice view of the valley and I couldn't believe the sun came out of nowhere. The sun shone in all directions and I could feel it. I was so close to the top of the mountains, it felt like I was in the middle of a volcano. It was amazing to see all of that. I felt like a volcano. I felt so close to all of the things. I felt like an island in a sea of lava.

This is the kind of text it produces.

The trip to Yosemite was "one of the best experiences of my life", and so on.

start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

Let's try a different prompt.

Line 3: the prompt is "That Italian restaurant is".

GPT-2 output:
That Italian restaurant is now closed, according to a report from Bloomberg.

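The output looks like this.

In other words, the restaurant is reported to be closed, and so on.

I was expecting it to say the food was good, but GPT-2 caught me off guard.

Incidentally, the second inference is much faster: 1.72 s, versus 18.25 s for the first. The Keras sample attributes this to the computation graph being compiled with XLA during the first call and reused afterwards.

That is quite a speedup.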
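The "slow first call, fast later calls" pattern is easy to reproduce with a toy. The sketch below is only an illustration: time_call and ToyGenerator are hypothetical names, and the 0.05 s sleep stands in for a one-time compilation cost, it does not run GPT-2.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class ToyGenerator:
    """Hypothetical stand-in for a model with a one-time setup cost."""

    def __init__(self):
        self._ready = False

    def generate(self, prompt):
        if not self._ready:
            time.sleep(0.05)  # simulated one-time compilation
            self._ready = True
        return prompt + " ..."


toy = ToyGenerator()
_, first = time_call(toy.generate, "My trip to Yosemite was")
_, second = time_call(toy.generate, "That Italian restaurant is")
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

When benchmarking generation speed, it therefore makes sense to discard the first call and time the later ones.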

More on the GPT-2 model from KerasNLP

This part of the sample describes the model structure.

Finetune on Reddit dataset

Here the model is fine-tuned on the Reddit TIFU dataset.

import tensorflow_datasets as tfds

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

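Line 1: import tensorflow_datasets (tfds), a separate package that provides ready-made datasets for TensorFlow.
Line 3: load the "reddit_tifu" dataset. split="train" selects the training split, and as_supervised=True returns each example as an (input, label) pair, here (document, title).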

for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

Let's take a quick look at the Reddit dataset.

It is a dataset of posts paired with their titles.

b"me and a friend decided to go to the beach last sunday. we loaded up and headed out. we were about half way there when i decided that i was not leaving till i had seafood. \n\nnow i'm not talking about red lobster. no friends i'm talking about a low country boil. i found the restaurant and got directions. i don't know if any of you have heard about the crab shack on tybee island but let me tell you it's worth it. \n\nwe arrived and was seated quickly. we decided to get a seafood sampler for two and split it. the waitress bought it out on separate platters for us. the amount of food was staggering. two types of crab, shrimp, mussels, crawfish, andouille sausage, red potatoes, and corn on the cob. i managed to finish it and some of my friends crawfish and mussels. it was a day to be a fat ass. we finished paid for our food and headed to the beach. \n\nfunny thing about seafood. it runs through me faster than a kenyan \n\nwe arrived and walked around a bit. it was about 45min since we arrived at the beach when i felt a rumble from the depths of my stomach. i ignored it i didn't want my stomach to ruin our fun. i pushed down the feeling and continued. about 15min later the feeling was back and stronger than before. again i ignored it and continued. 5min later it felt like a nuclear reactor had just exploded in my stomach. i started running. i yelled to my friend to hurry the fuck up. \n\nrunning in sand is extremely hard if you did not know this. we got in his car and i yelled at him to floor it. my stomach was screaming and if he didn't hurry i was gonna have this baby in his car and it wasn't gonna be pretty. after a few red lights and me screaming like a woman in labor we made it to the store. \n\ni practically tore his car door open and ran inside. i ran to the bathroom opened the door and barely got my pants down before the dam burst and a flood of shit poured from my ass. \n\ni finished up when i felt something wet on my ass. 
i rubbed it thinking it was back splash. no, mass was covered in the after math of me abusing the toilet. i grabbed all the paper towels i could and gave my self a whores bath right there. \n\ni sprayed the bathroom down with the air freshener and left. an elderly lady walked in quickly and closed the door. i was just about to walk away when i heard gag. instead of walking i ran. i got to the car and told him to get the hell out of there."
b'liking seafood'

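The document begins "me and a friend decided to go to the beach last sunday", and so on.

The title is "liking seafood".

This dataset is used to fine-tune GPT-2. Note that the task is still next-word prediction: in the Keras sample only the document text is used for training and the title is dropped, so the model learns to write in the style of the posts rather than to output a title. The train_ds used in the next block is built from reddit_ds in the Keras sample; that step is not reproduced in this article.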
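As far as I can tell, the Keras sample gets train_ds by mapping each (document, title) pair to the document alone and then batching the result. The plain-Python sketch below shows just that idea on made-up data: keep_documents and batch are hypothetical helpers, the batch size of 2 is arbitrary, and none of this is the actual tf.data code.

```python
def keep_documents(pairs):
    """Drop the title from each (document, title) pair."""
    return [document for document, _title in pairs]


def batch(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Made-up pairs standing in for reddit_tifu examples.
pairs = [
    ("went to the beach and ate too much", "liking seafood"),
    ("forgot my keys at home", "locking myself out"),
    ("spilled coffee on my laptop", "drinking near electronics"),
]
documents = keep_documents(pairs)
batches = batch(documents, batch_size=2)
print(batches)
```

Each element of the batched result is what one training step sees, which matters when reading take(500) in the next block.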

train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

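Training on everything takes a very long time, so only a small part of the data is used.

Line 1: only the first 500 elements of train_ds are used. If train_ds is already batched, that means 500 batches, not 500 posts.
Line 2: only 1 epoch.
Lines 5-9: the learning-rate schedule, set up in a bit more detail than a constant.
Line 6: the initial learning rate, 5e-5.
Line 7: decay_steps, the number of optimizer steps over which the learning rate shrinks. It is counted in batches, not epochs, so even with 1 epoch the rate falls steadily across the 500 steps.
Line 8: the learning rate ends at 0.
Line 10: the loss, sparse categorical cross-entropy.
Lines 11-15: compile. The optimizer is the trusty Adam, the loss is the one defined above, and weighted_metrics is accuracy.
Line 17: run the training.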
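The schedule in lines 5-9 is a straight line from 5e-5 down to 0. Here is a small plain-Python sketch of that formula (PolynomialDecay with its default power of 1). The function name linear_decay is my own, and decay_steps=500 assumes train_ds really has 500 batches after take(500).

```python
def linear_decay(step, initial_lr=5e-5, decay_steps=500, end_lr=0.0):
    """Learning rate after `step` optimizer updates, decaying linearly.

    decay_steps=500 is an assumption: 500 batches times 1 epoch.
    """
    step = min(step, decay_steps)          # stay at end_lr once decay is done
    remaining = 1.0 - step / decay_steps   # fraction of the schedule left
    return (initial_lr - end_lr) * remaining + end_lr


for step in (0, 250, 500):
    print(step, linear_decay(step))
```

Halfway through the single epoch the rate is already half of its initial value, so the schedule does have an effect even with num_epochs = 1.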

start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

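This is inference with the model fine-tuned on the Reddit dataset.

Line 1: record the start time.
Line 3: feed "I like basketball" to the fine-tuned model. max_length is the maximum length of the output.
Lines 4-5: print the output.
Lines 7-8: record the end time and print how long the generation took.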

GPT-2 output:
I like basketball. i've been a big fan of it since high school, and it's been pretty cool to me.

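That's the output. According to the sample, the model now answers in the style of the Reddit dataset, which I take to mean a casual tone.

It says it has been a big fan of basketball since high school and finds it pretty cool, and so on.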

Into the Sampling Method

# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself,
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

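Changing the sampler changes how the model picks each next token, and therefore what it writes.

Lines 2-5: select the sampler by name, "top_k" (Top-K search). According to the sample this is also the default.
Lines 8-13: use a sampler instance, GreedySampler. As the code comment notes, it tends to repeat itself.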
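The difference between the two samplers can be sketched in a few lines of plain Python. This is only an illustration of the idea, not KerasNLP's implementation: greedy_pick and top_k_pick are hypothetical helpers, and the probabilities are made up.

```python
import random


def greedy_pick(probs):
    """Always choose the single most probable token."""
    return max(probs, key=probs.get)


def top_k_pick(probs, k, rng):
    """Sample among the k most probable tokens, weighted by probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token probabilities after "I like basketball".
probs = {".": 0.40, " and": 0.30, " because": 0.20, " xylophone": 0.10}
rng = random.Random(0)
greedy_tokens = {greedy_pick(probs) for _ in range(20)}
top_k_tokens = {top_k_pick(probs, k=3, rng=rng) for _ in range(20)}
print(greedy_tokens)
print(top_k_tokens)
```

Greedy picks the same token every time for the same context, which is where the repetition comes from. Top-K adds variety while still never choosing anything outside the k most likely tokens.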

Finetune on Chinese Poem Dataset

The model can also be trained on non-English data, for example a dataset of classical Chinese poems.

!# Load chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git

This downloads the dataset by cloning it from GitHub with git.

import os
import json

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
    with open(full_filename, "r", encoding="utf-8") as f:
        content = json.load(f)
        poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]

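This reads the JSON files.

Line 5: loop over every file name in the 全唐诗 (Complete Tang Poems) folder.
Line 6: skip anything that is not a .json file or does not have "poet" in its name.
Line 8: build the full path of the file.
Line 9: open the file.
Line 10: load the JSON.
Line 11: append its entries to the list.
Line 13: build a list that holds only the text, joining the "paragraphs" of each poem into one string.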
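To try the loading logic without cloning the whole repository, the same steps can be run on a tiny made-up folder. In this sketch load_paragraphs is a hypothetical wrapper around the loop above, and the file names and poem text are invented for the demo.

```python
import json
import os
import tempfile


def load_paragraphs(folder):
    """Join each poem's "paragraphs" list from the poet*.json files in folder."""
    poems = []
    for name in sorted(os.listdir(folder)):
        if ".json" not in name or "poet" not in name:
            continue  # same filter as the loop above
        with open(os.path.join(folder, name), "r", encoding="utf-8") as f:
            poems.extend(json.load(f))
    return ["".join(poem["paragraphs"]) for poem in poems]


# Tiny made-up folder: one matching file and one that gets skipped.
with tempfile.TemporaryDirectory() as folder:
    sample = [{"paragraphs": ["line one,", "line two."]}]
    with open(os.path.join(folder, "poet.demo.0.json"), "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(os.path.join(folder, "authors.demo.json"), "w", encoding="utf-8") as f:
        json.dump([{"name": "nobody"}], f)
    demo_paragraphs = load_paragraphs(folder)

print(demo_paragraphs)
```

Each poem ends up as a single string, which is the form the training pipeline below expects.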

print(paragraphs[0])

Check one entry.

train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes long, only take `500` and run 1
# epochs for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

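This is the training step. It is almost the same as in the English case; the main visible difference is the initial learning rate, 5e-4 instead of 5e-5.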
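Both fine-tuning runs use SparseCategoricalCrossentropy(from_logits=True). For a single token position, that loss is the negative log-probability of the correct next token, computed from the raw scores (logits). Below is a minimal plain-Python sketch of that one-position calculation. The function name and the logits are made up, and the real Keras loss additionally averages over positions and batches.

```python
import math


def sparse_cross_entropy_from_logits(logits, target_index):
    """Negative log-probability of the target token, from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


# Made-up logits over a 3-token vocabulary; the correct next token is index 0.
confident = sparse_cross_entropy_from_logits([5.0, 0.0, 0.0], target_index=0)
uniform = sparse_cross_entropy_from_logits([0.0, 0.0, 0.0], target_index=0)
print(confident, uniform)
```

The loss is small when the model puts most of its score on the correct token and grows as the prediction gets closer to a blind guess.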

output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)

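This is inference, i.e. text generation. It also works almost exactly as in the English case.

Fine-tuning on a Japanese dataset looks like it would make Japanese work too.

Summary

I walked through the TensorFlow Keras sample code for the large language model GPT-2.

It is simpler than I expected.

I suspect the latest GPT models will not be released publicly.

Still, even GPT-2 seems to have real potential for solving everyday problems and building services.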