Nano GPT
#ai
#colab
#gpt
#transformer
This post contains some notes on building nano GPT following Andrej Karpathy’s YouTube video “Let’s build GPT: from scratch, in code, spelled out.”
Setup
See https://yxiong.github.io/2023/05/19/colab-with-custom-gce-vm.html for more instructions.
- Create a new project in https://console.cloud.google.com/ named `Nano GPT`
  - Enable billing (I do this at the project level in order to properly keep track of costs)
  - Request to increase the GPU quota to 1 (the default is 0)
- Deploy the pre-configured Colab VM from the marketplace: https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab
  - Need to try different `Zone` values to see the available `Machine type` options
  - Started with a CPU instance `c2-standard-4`
- Go to https://colab.research.google.com/ and “Connect to a custom GCE VM”
- Deploy the pre-configured deep learning VM from the marketplace: https://pantheon.corp.google.com/marketplace/details/click-to-deploy-images/deeplearning
  - Start with a CPU VM, with the `PyTorch 2.4 (CUDA 12.4, Python 3.10)` framework
  - For a GPU VM, choose the following:
    - Zone: `us-west4-a` (the zone needs to have enough GPUs; found by trial and error)
    - GPU type: `NVIDIA T4` (1)
    - Machine type: `n1-standard-8`
    - Framework: `PyTorch 2.4 (CUDA 12.4, Python 3.10)`
    - Check the `Install NVIDIA GPU driver ...` box
- The dataset can be found at https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
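For reference, here is a minimal sketch of loading and encoding the dataset at the character level (assuming `input.txt` has been downloaded from the URL above; the variable names are my own and only loosely follow the video):

```python
import torch

# Read the Tiny Shakespeare text downloaded from the URL above.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary (65 unique characters for this dataset).
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]              # string -> list of integer ids
decode = lambda ids: "".join(itos[i] for i in ids)   # list of ids -> string

# Encode the full text and split into train/validation (90/10, as in the video).
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
```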
Bi-gram baseline model
- Random initial state before any optimization:
- Cross entropy loss is 4.68 (the theoretical loss for a uniform prediction over the 65 characters is $-\log(1/65)=4.17$)
- Generative output looks like
pdcbf?pGXepydZJSrF$Jrqt!:wwWSzPNxbjPiD&Q!a;yNt$Kr$o-gC$WSjJqfBKBySKtSKpwNNfyl&w:q-jluBatD$Lj;?yzyUca!UQ!vrpxZQgC-hlkq,ptKqHoiX-jjeLJ &slERj KUsBOL!mpJO!zLg'wNfqHAMgq'hZCWhu.W.IBcP RFJ&DEs,nw?pxE?xjNHHVxJ&D&vWWToiERJFuszPyZaNw$ EQJMgzaveDDIoiMl&sMHkzdRptRCPVjwW.RSVMjs-bgRkzrBTEa!!oP fRSxq.PLboTMkX'D
- After 10,000 steps of optimization (batch size 32, Adam optimizer)
- Cross entropy loss is 2.47 on training data and 2.49 on validation data
- Generative output looks like
od nos CAy go ghanoray t, co haringoudrou clethe k,LARof fr werar, Is fa! Thilemel cia h hmboomyorarifrcitheviPO, tle dst f qur'dig t cof boddo y t o ar pileas h mo wierl t, S: STENENEat I athe thounomy tinrent distesisanimald 3I: eliento ald, avaviconofrisist me Busarend un'soto vat s k, SBRI he the f wendleindd t acoe ts ansu, thy ppr h.QULY: KIIsqu pr odEd ch, APrnes ouse bll owhored miner t ooon'stoume bupromo! fifoveghind hiarnge s. MI aswimy or m, wardd tw'To tee abifewoetsphin sed The a
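A simplified sketch of the bigram baseline (not a verbatim copy of the video’s code; batch size 32 and 10,000 Adam steps are the settings quoted above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    """Each token reads the logits for the next token directly from a lookup table."""

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)                 # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Sample one character at a time, feeding the growing sequence back in.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

# Training sketch: ~10,000 steps, batches of 32 shifted character windows drawn
# from train_data, optimized with Adam as noted above.
```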
Transformer
- Adding a single 32-dimensional attention head:
- Cross entropy loss is 2.40 on training data and 2.41 on validation data
- Generative output looks like
Wes le isen. Woto teven INGO, ous into CYedd shou maithe ert thethens the the del ede cksy ow? Wlouby aicecat tisall wor G'imemonou mar ee hacreancad hontrt had wousk ucavere. Baraghe lfousto beme, S m; ten gh; S: Ano ice de bay alysathef beatireplim serbeais I fard Sy, Me hallil: DWAR: us, Wte hse aecathate, parrise in hr'd pat ERY: Bf bul walde betl'ts I yshore grest atre ciak aloo; wo fart hets atl. That at Wh kear ben. hend. KTh'd foushe d'l otacaengs p bloul blod arme foot buthes fo boe
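A sketch of the single masked self-attention head (head size 32, so at this step `head_size == n_embed == 32`; this is a simplified rendition following the video, with the `tril` buffer implementing the causal mask described in the Notes section below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (causal) self-attention."""

    def __init__(self, n_embed, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # Lower-triangular mask: a token may not attend to future positions.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                       # (B, T, head_size)
        q = self.query(x)                                     # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # scaled scores, (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                     # (B, T, head_size)
        return wei @ v                                        # weighted average of values
```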
- Changing from a single 32-dimensional attention head to multi-head attention (4 heads, 8 dimensions each):
- Cross entropy loss is 2.27 on training data and 2.29 on validation data
- Generative output looks like
We! le ises. Wmay they row we thutinte Caldd shou mait tiertlentthens the the dol ede cksy ba? Wlouby arceckentisste wre G'imemonot mar ef hacr COngd Go mringt thouskiu? Fre. Bardageplftisto be ess the to hon; Soretr ice we bay, Thouthe wome isspe, laveberis I fald Sy, Whissitill there git; the se aTist, Anos arrise in wito pat ER: Tuche'l walde, Anl'th I yourre grest at west ont of; wonf Gost theatl. That Ongss to habe to hend. KTh'd forshe do- of warngs pis ond blou arout: o ce ther fo boe
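A sketch of the multi-head version, reusing the `Head` module from the previous sketch (4 heads of size 8, so the concatenated output is again 32-dimensional):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel; their outputs are concatenated."""

    def __init__(self, n_embed, num_heads, head_size, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embed, head_size, block_size) for _ in range(num_heads)]
        )

    def forward(self, x):
        # E.g. 4 heads of size 8 concatenate back to a 32-dimensional embedding.
        return torch.cat([h(x) for h in self.heads], dim=-1)
```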
- Adding a feed-forward layer after multi-head attention:
- Cross entropy loss is 2.23 on training data and 2.24 on validation data
- Generative output looks like
Ba hil thill shat coo he hot mes fin. Cy hirad I four shat son yald hat lods guk- of din! Cod swo-t sand tlo, with. BANTAUCHAR: I ork sak's willl eger aepale ganed ay wouce song thy in noduace being uliths tot upiord--mard meme the fles, Tharr, leanven-ty's. I Ron it in weancepte, digus I the souts pof ther cork, is woth' that he the bre tolf, An live: Mrins, What, Whorend; Ant to mal ry fa of thaw a tes tir, to tis argito-, setolf brntens? I' sit mem, thand yought is an he frifperend tepefur
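A sketch of the feed-forward layer at this stage. Following my reading of the video, it can be as simple as a single per-token linear layer with a ReLU; the 4x expansion is introduced in the next step:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """A simple per-token linear layer followed by a non-linearity."""

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```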
- Using 3 multi-head attention + feed-forward blocks, adding residual connections, and increasing the feed-forward dimension by 4x:
- Cross entropy loss is 2.00 on training data and 2.08 on validation data
- Generative output looks like
Row duke be greford, What prach-deay loved, of persack'd the your farod, dide a coble, but so her had not muchth-. O, baving hear! the of these wamist I hat in and your eif hew leart a hese facce, To mockn youd you fam do be king up too ever tow, omme as to louds as the moweres it? That I'll to doys batise dut bed sover your nithere dusery goe agay sif. JULET: Where' but of theirst they cwound must ke? Had mady to those, That king my seep, Ceagoond parcullick, be but as come: Bur on the owese f
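A sketch of the block structure at this step: each block is multi-head attention followed by the feed-forward layer, each wrapped in a residual connection, and the feed-forward inner dimension is expanded to `4 * n_embed` with a projection back. (If I recall the video correctly, it also adds an output projection after the concatenated attention heads around this point; that detail is omitted here for brevity.)

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network with a 4x inner expansion."""

    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),   # project back onto the residual path
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (feed-forward), with residuals."""

    def __init__(self, n_embed, n_head, block_size):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_embed, n_head, head_size, block_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = x + self.sa(x)      # residual connection around attention
        x = x + self.ffwd(x)    # residual connection around the feed-forward layer
        return x

# The model then stacks 3 such blocks, e.g. nn.Sequential(*[Block(...) for _ in range(3)]).
```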
- Adding LayerNorm for normalization
- Cross entropy loss is 1.98 on training data and 2.06 on validation data
- Generative output looks like
Row duke beig eftall you seep'd-- Fy loved, of, I say that ond the Suplat she a cebad, brive dong hat not my on this that with thout a lost, Lead, As to Hear have not woer leew leat that yower come staved I nod yop and do broke. ancallowelds towful my as to lo, shat the most our but hat I owser'd you cale most bender; hay ale, the by be and deready sid. JULEOLIO: I weesecter Sell'TIt, you sudden but Enive: mady tock os and. COLY ELOMENB: The good, pricI. Or, beesterath, Thus grons! I Loody To
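A sketch of the block with LayerNorm added. The video follows the now-common pre-norm arrangement, normalizing before each sub-layer rather than after as in the original transformer paper:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block with pre-norm: LayerNorm -> sub-layer -> residual add."""

    def __init__(self, n_embed, n_head, block_size):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_embed, n_head, head_size, block_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # normalize, attend, then add the residual
        x = x + self.ffwd(self.ln2(x))   # normalize, feed-forward, then add the residual
        return x
```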
- Scaling up the transformer and adding dropout, training on a T4 GPU for ~1 hour
- Hyperparameters:
batch_size = 32 -> 64
block_size = 8 -> 256
learning_rate = 1e-3 -> 3e-4
n_embed = 32 -> 384
n_head = 4 -> 6
n_layer = 3 -> 6
dropout = 0 -> 0.2
- Cross entropy loss is 1.08 on training data and 1.49 on validation data
- Generative output looks like
But with prison: I will stead with you. ISABELLA: Carress, all do; and I'll say your honour self good: Then I'll regn your highness and Compell'd by my sweet gates that you may: Valiant make how I heard of you. ANGELO: Nay, sir, Isay! ISABELLA: I am sweet men sister as you steed. LUCIO: As it if you in the case would princily, I'll rote, sir, I did cannot now at me? That look thence, thy children shall be you called. DUKE VINCENTIO: Marry, though I do read you! LUCIO: O that mufflest than
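For the scaled-up run, dropout acts as a regularizer in a few places: on the attention weights after the softmax, and on the outputs of the attention and feed-forward sub-layers before they rejoin the residual path. A sketch of the feed-forward case (the attention case is analogous; the exact placement follows my reading of the video and may differ in detail):

```python
import torch.nn as nn

dropout = 0.2  # from the hyperparameter list above

class FeedForward(nn.Module):
    """Feed-forward sub-layer with dropout before rejoining the residual path."""

    def __init__(self, n_embed, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),   # randomly zero activations during training
        )

    def forward(self, x):
        return self.net(x)

# Analogously, inside the attention head one can apply nn.Dropout to the softmaxed
# attention weights (`wei`) before multiplying them with the values.
```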
Notes
- At each layer of the transformer, we have an embedding $e_i$ for each token.
- From these embeddings, we derive keys, queries, and values for the next layer through linear projections.
- Attention is in essence a weighted average:
- The weights are data dependent: $W = \textrm{softmax}(QK^T)$
- In an auto-regressive decoder, each token may only look at previous tokens, not future ones; this is achieved by wiping out (setting to $-\infty$) the upper-right portion of $QK^T$, so that $W$ is lower-triangular
- For numerical stability of the softmax, we scale the logits to $QK^T/\sqrt{d_k}$, where $d_k$ is the head dimension
- The attention head output (which later becomes the new embedding for the next layer) is the weighted average $h_i = \sum_j w_{i,j} v_j$
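Putting the formulas above together, here is a minimal tensor sketch of the masked, scaled attention for one head (random `Q`, `K`, `V` standing in for the projected queries, keys, and values):

```python
import math
import torch
import torch.nn.functional as F

T, d_k = 8, 16                        # sequence length and head dimension
Q = torch.randn(T, d_k)               # queries
K = torch.randn(T, d_k)               # keys
V = torch.randn(T, d_k)               # values

scores = Q @ K.T / math.sqrt(d_k)     # scaled dot products QK^T / sqrt(d_k), (T, T)
mask = torch.tril(torch.ones(T, T))   # lower-triangular causal mask
scores = scores.masked_fill(mask == 0, float("-inf"))  # wipe out future positions
W = F.softmax(scores, dim=-1)         # data-dependent weights; each row sums to 1
H = W @ V                             # h_i = sum_j w_{ij} v_j, one new embedding per token
```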
References
- Let’s build GPT: from scratch, in code, spelled out.: YouTube video by Andrej Karpathy
- ng-video-lecture: GitHub repo by Andrej Karpathy
- My colab: https://colab.research.google.com/gist/yxiong/0c36d1c23b10c68a64ce614a394bb44a/nano-gpt.ipynb
- My script: https://gist.github.com/yxiong/90944710251d2467675ad7adee38b4ee