This post contains some notes on building nanoGPT, following Andrej Karpathy’s YouTube video “Let’s build GPT: from scratch, in code, spelled out”.

Setup

See https://yxiong.github.io/2023/05/19/colab-with-custom-gce-vm.html for more instructions.

  1. Create a new project in https://console.cloud.google.com/ named Nano GPT
    • enable billing (I do this at project level in order to properly keep track of costs)
    • request to increase GPU quota to 1 (the default is 0)
  2. Deploy a pre-configured Colab VM from the Marketplace: https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab
    • need to try different Zones to see the available Machine types
    • started with a CPU instance c2-standard-4
    • Go to https://colab.research.google.com/ and “Connect to a custom GCE VM”
  3. Deploy a pre-configured deep learning VM from the Marketplace: https://console.cloud.google.com/marketplace/details/click-to-deploy-images/deeplearning
    • Start with a CPU VM using the PyTorch 2.4 (CUDA 12.4, Python 3.10) framework
    • For GPU, choose the following:
      • Zone: us-west4-a (the zone needs to have GPUs available; finding one takes trial and error)
      • GPU type: NVIDIA T4 (1)
      • Machine type: n1-standard-8
      • Framework: PyTorch 2.4 (CUDA 12.4, Python 3.10)
      • Check the Install NVIDIA GPU driver ... box
  4. The dataset can be found at https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
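
The script below is a minimal sketch for fetching and inspecting the dataset; the local filename input.txt is an assumption.

    import urllib.request

    # Download the tiny Shakespeare dataset (assumed local filename: input.txt).
    url = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
           "master/data/tinyshakespeare/input.txt")
    urllib.request.urlretrieve(url, "input.txt")

    with open("input.txt", "r", encoding="utf-8") as f:
        text = f.read()

    # Character-level vocabulary; this dataset has 65 unique characters.
    chars = sorted(set(text))
    print(len(text), len(chars))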

Bi-gram baseline model

  1. Random initial state before any optimization:
    • Cross entropy loss is 4.68 (a uniform prediction over the 65-character vocabulary would give $-\log(1/65) \approx 4.17$)
    • Generative output looks like
      pdcbf?pGXepydZJSrF$Jrqt!:wwWSzPNxbjPiD&Q!a;yNt$Kr$o-gC$WSjJqfBKBySKtSKpwNNfyl&w:q-jluBatD$Lj;?yzyUca!UQ!vrpxZQgC-hlkq,ptKqHoiX-jjeLJ &slERj KUsBOL!mpJO!zLg'wNfqHAMgq'hZCWhu.W.IBcP 
      RFJ&DEs,nw?pxE?xjNHHVxJ&D&vWWToiERJFuszPyZaNw$
      EQJMgzaveDDIoiMl&sMHkzdRptRCPVjwW.RSVMjs-bgRkzrBTEa!!oP fRSxq.PLboTMkX'D
      
  2. After 10,000 steps of optimization (batch size 32, Adam optimizer; see the model sketch after this list):
    • Cross entropy loss is 2.47 on training data and 2.49 on validation data
    • Generative output looks like
      od nos CAy go ghanoray t, co haringoudrou clethe k,LARof fr werar,
      Is fa!
      
      
      Thilemel cia h hmboomyorarifrcitheviPO, tle dst f qur'dig t cof boddo y t o ar pileas h mo wierl t,
      S:
      STENENEat I athe thounomy tinrent distesisanimald 3I: eliento ald, avaviconofrisist me Busarend un'soto vat s k,
      SBRI he the f wendleindd t acoe ts ansu, thy ppr h.QULY:
      KIIsqu pr odEd ch,
      APrnes ouse bll owhored miner t ooon'stoume bupromo! fifoveghind hiarnge s.
      MI aswimy or m, wardd tw'To tee abifewoetsphin sed The a
      
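A minimal sketch of the bi-gram baseline, along the lines of the lecture’s model: the logits for the next character are read directly from an embedding table indexed by the current character. vocab_size is the 65-character vocabulary from the dataset above.

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class BigramLanguageModel(nn.Module):
        def __init__(self, vocab_size):
            super().__init__()
            # Each token directly reads off the logits for the next token.
            self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        def forward(self, idx, targets=None):
            logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
            loss = None
            if targets is not None:
                B, T, C = logits.shape
                loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
            return logits, loss

        def generate(self, idx, max_new_tokens):
            for _ in range(max_new_tokens):
                logits, _ = self(idx)
                probs = F.softmax(logits[:, -1, :], dim=-1)  # last time step only
                idx_next = torch.multinomial(probs, num_samples=1)
                idx = torch.cat((idx, idx_next), dim=1)
            return idx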

Transformer

  1. Adding a single self-attention head of dimension 32 (see the head sketch after this list):
    • Cross entropy loss is 2.40 on training data and 2.41 on validation data
    • Generative output looks like
      Wes le isen.
      Woto teven INGO, ous into CYedd shou maithe ert thethens the the del ede cksy ow? Wlouby aicecat tisall wor
      G'imemonou mar ee hacreancad hontrt had wousk ucavere.
      
      Baraghe lfousto beme,
      S m; ten gh;
      S:
      Ano ice de bay alysathef beatireplim serbeais I fard
      Sy,
      Me hallil:
      DWAR: us,
      Wte hse aecathate, parrise in hr'd pat
      ERY:
      Bf bul walde betl'ts I yshore grest atre ciak aloo; wo fart hets atl.
      
      That at Wh kear ben.
      hend.
      
      KTh'd foushe d'l otacaengs p bloul blod arme foot buthes fo boe
      
  2. Changing from a single 32-dimensional attention head to multi-head attention (4 heads of 8 dimensions each; see the multi-head and feed-forward sketches after this list):
    • Cross entropy loss is 2.27 on training data and 2.29 on validation data
    • Generative output looks like
      We! le ises.
      Wmay they row we thutinte Caldd shou mait tiertlentthens the the dol ede cksy ba? Wlouby arceckentisste wre
      G'imemonot mar ef hacr
      COngd Go mringt thouskiu?
      
      Fre.
      
      Bardageplftisto be ess the to hon;
      Soretr ice we bay, Thouthe wome isspe, laveberis I fald
      Sy,
      Whissitill there git; the se aTist,
      Anos arrise in wito pat
      ER:
      Tuche'l walde,
      Anl'th I yourre grest at west ont of; wonf Gost theatl.
      
      That Ongss to habe to hend.
      
      KTh'd forshe do- of warngs pis ond blou arout: o ce ther fo boe
      
  3. Adding a feed-forward layer after the multi-head attention:
    • Cross entropy loss is 2.23 on training data and 2.24 on validation data
    • Generative output looks like
      Ba hil thill shat coo he hot mes fin.
      
      Cy hirad I four shat son yald hat lods guk- of din!
      
      Cod swo-t sand tlo, with.
      
      BANTAUCHAR:
      I ork sak's willl eger aepale ganed ay wouce
      song thy in noduace being uliths tot upiord--mard meme the fles,
      Tharr, leanven-ty's. I Ron it in weancepte, digus I the souts pof ther cork, is woth' that he the bre tolf,
      An live:
      Mrins,
      What,
      Whorend;
      Ant to mal ry fa of thaw a tes tir, to tis argito-, setolf brntens?
      I' sit mem, thand yought is an he frifperend tepefur
      
  4. Using 3 blocks of multi-head attention plus feed-forward, adding residual connections, and increasing the feed-forward inner dimension to 4x the embedding size (see the Block sketch after this list):
    • Cross entropy loss is 2.00 on training data and 2.08 on validation data
    • Generative output looks like
      Row duke be greford,
      What prach-deay loved, of persack'd the your farod, dide a coble, but so her had not muchth-. O, baving hear! the of these wamist I hat in and your eif hew leart a hese facce,
      To mockn youd you fam do be king up too ever tow, omme as to louds as the moweres it?
      That I'll to doys batise dut bed sover your nithere dusery goe agay sif.
      
      JULET:
      Where' but of theirst they cwound must ke?
      Had mady to those,
      That king my seep,
      Ceagoond parcullick, be but as come:
      Bur on the owese f
      
  5. Adding LayerNorm for normalization:
    • Cross entropy loss is 1.98 on training data and 2.06 on validation data
    • Generative output looks like
      Row duke beig eftall you seep'd--
      Fy loved, of, I say that ond the
      Suplat she a cebad, brive dong hat not my on this that with thout a lost,
      Lead,
      As to Hear have not
      woer leew leat that yower come staved
      I nod yop and do broke. ancallowelds towful my as to lo, shat the most our but hat I owser'd you cale most bender; hay ale, the by be and deready sid.
      
      JULEOLIO:
      I weesecter
      Sell'TIt, you sudden but Enive: mady tock os and.
      
      COLY ELOMENB:
      The good, pricI.
      
      Or, beesterath,
      Thus grons! I Loody
      To
      
  6. Scaling up the transformer and adding dropout, training on a T4 GPU for ~1 hour:
    • Hyperparameters:
      • batch_size = 32 -> 64
      • block_size = 8 -> 256
      • learning_rate = 1e-3 -> 3e-4
      • n_embed = 32 -> 384
      • n_head = 4 -> 6
      • n_layer = 3 -> 6
      • dropout = 0 -> 0.2
    • Cross entropy loss is 1.08 on training data and 1.49 on validation data
    • Generative output looks like
      But with prison: I will stead with you.
      
      ISABELLA:
      Carress, all do; and I'll say your honour self good:
      Then I'll regn your highness and
      Compell'd by my sweet gates that you may:
      Valiant make how I heard of you.
      
      ANGELO:
      Nay, sir, Isay!
      
      ISABELLA:
      I am sweet men sister as you steed.
      
      LUCIO:
      As it if you in the case would princily,
      I'll rote, sir, I did cannot now at me?
      That look thence, thy children shall be you called.
      
      DUKE VINCENTIO:
      Marry, though I do read you!
      
      LUCIO:
      O that mufflest than 
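
Below are a few minimal sketches of the pieces added in the steps above. They follow the structure built up in the lecture; the argument names (n_embed, head_size, block_size) are assumptions matching the hyperparameter list rather than exact code from these runs. First, a single masked self-attention head (step 1, where n_embed = head_size = 32):

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    class Head(nn.Module):
        """One head of masked (decoder-style) self-attention."""

        def __init__(self, n_embed, head_size, block_size):
            super().__init__()
            self.key = nn.Linear(n_embed, head_size, bias=False)
            self.query = nn.Linear(n_embed, head_size, bias=False)
            self.value = nn.Linear(n_embed, head_size, bias=False)
            # Lower-triangular mask: a token attends only to earlier positions.
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):
            B, T, C = x.shape
            k = self.key(x)                                     # (B, T, head_size)
            q = self.query(x)                                   # (B, T, head_size)
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5 # scaled dot product
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = F.softmax(wei, dim=-1)
            return wei @ self.value(x)                          # (B, T, head_size)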
      
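Multi-head attention (step 2) runs several such heads in parallel and concatenates their outputs, and the feed-forward layer (step 3) is a small per-token MLP. This continues the imports and the Head class from the sketch above; the 4x inner dimension reflects step 4, and the output projection is the variant used once residual connections are added:

    class MultiHeadAttention(nn.Module):
        """Several attention heads in parallel, concatenated along the channel dim."""

        def __init__(self, n_embed, num_heads, head_size, block_size):
            super().__init__()
            self.heads = nn.ModuleList(
                [Head(n_embed, head_size, block_size) for _ in range(num_heads)]
            )
            self.proj = nn.Linear(num_heads * head_size, n_embed)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            return self.proj(out)

    class FeedForward(nn.Module):
        """Per-token MLP; the inner dimension is 4x the embedding size (step 4)."""

        def __init__(self, n_embed):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embed, 4 * n_embed),
                nn.ReLU(),
                nn.Linear(4 * n_embed, n_embed),
            )

        def forward(self, x):
            return self.net(x)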

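Steps 4-5 wrap these into a residual block with LayerNorm; the pre-norm placement shown here (LayerNorm before each sub-layer) is the variant used in the lecture. Dropout, used in the scaled-up run of step 6, is omitted for brevity:

    class Block(nn.Module):
        """Transformer block: communication (attention) then computation (MLP)."""

        def __init__(self, n_embed, n_head, block_size):
            super().__init__()
            head_size = n_embed // n_head
            self.sa = MultiHeadAttention(n_embed, n_head, head_size, block_size)
            self.ffwd = FeedForward(n_embed)
            self.ln1 = nn.LayerNorm(n_embed)
            self.ln2 = nn.LayerNorm(n_embed)

        def forward(self, x):
            # Residual connections around each sub-layer (step 4),
            # with LayerNorm applied before the sub-layer (step 5).
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x
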
Notes

  1. At each layer of the transformer, we have an embedding $e_i$ for each token.
    • From these embeddings, we derive key, query, and value vectors through linear projections; these are used to compute the embeddings of the next layer
  2. Attention is in essence a weighted average:
    • The weights are data dependent: $W = \textrm{softmax}(QK^T)$
      • In an auto-regressive decoder, each token attends only to previous tokens (not future ones): we mask out (set to $-\infty$) the upper-right portion of $QK^T$, so that $W$ is lower-triangular
      • For numerical stability of the softmax, we scale the logits by $1/\sqrt{d_k}$, i.e. $W = \textrm{softmax}(QK^T/\sqrt{d_k})$, where $d_k$ is the head dimension
    • The attention output (which becomes the new embedding for the next layer) is the weighted average $h_i = \sum_j w_{i,j} v_j$ (see the small numerical sketch after this list)
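
A small numerical sketch of this weighted-average view, with made-up shapes:

    import torch
    from torch.nn import functional as F

    B, T, d_k = 1, 4, 8                                    # made-up example shapes
    q, k, v = (torch.randn(B, T, d_k) for _ in range(3))

    logits = q @ k.transpose(-2, -1) / d_k ** 0.5          # (B, T, T), scaled by sqrt(d_k)
    mask = torch.tril(torch.ones(T, T))                    # causal (lower-triangular) mask
    logits = logits.masked_fill(mask == 0, float("-inf"))  # wipe out future positions
    W = F.softmax(logits, dim=-1)                          # data-dependent weights; rows sum to 1
    h = W @ v                                              # h_i = sum_j w_ij v_j
    print(W[0])                                            # lower-triangular weight matrix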

References