This post contains some notes on building nano GPT, following Andrej Karpathy’s YouTube video “Let’s build GPT: from scratch, in code, spelled out.”

Set up

See https://yxiong.github.io/2023/05/19/colab-with-custom-gce-vm.html for more instructions.

  1. Create a new project in https://console.cloud.google.com/ named Nano GPT
    • enable billing (I do this at project level in order to properly keep track of costs)
    • request to increase GPU quota to 1 (the default is 0)
  2. Deploy pre-configured colab VM from marketplace: https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab
    • need to try different Zones to see the available Machine types
    • started with a CPU instance c2-standard-4
  3. Go to https://colab.research.google.com/ and “Connect to a custom GCE VM”
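
After connecting, it is worth checking that the notebook really runs on the custom VM and not on a default Colab runtime. The cell below is a minimal sketch using only the Python standard library; `describe_runtime` is a hypothetical helper name, not something from Colab or GCE. On a c2-standard-4 instance the CPU count should come out as 4, and `has_nvidia_smi` should stay False until a GPU instance is used.

```python
import os
import platform
import shutil


def describe_runtime():
    """Collect a few facts about the machine running this notebook."""
    return {
        "hostname": platform.node(),
        "cpu_count": os.cpu_count(),
        # nvidia-smi is only on PATH when an NVIDIA driver is installed.
        "has_nvidia_smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```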

Bi-gram baseline model

  1. Random initial state before any optimization:
    • Cross entropy loss is 4.68 (the theoretical loss for a uniform prediction over 65 characters is $-\log(1/65) \approx 4.17$)
    • Output looks like
      pdcbf?pGXepydZJSrF$Jrqt!:wwWSzPNxbjPiD&Q!a;yNt$Kr$o-gC$WSjJqfBKBySKtSKpwNNfyl&w:q-jluBatD$Lj;?yzyUca!UQ!vrpxZQgC-hlkq,ptKqHoiX-jjeLJ &slERj KUsBOL!mpJO!zLg'wNfqHAMgq'hZCWhu.W.IBcP 
      RFJ&DEs,nw?pxE?xjNHHVxJ&D&vWWToiERJFuszPyZaNw$
      EQJMgzaveDDIoiMl&sMHkzdRptRCPVjwW.RSVMjs-bgRkzrBTEa!!oP fRSxq.PLboTMkX'D
      
  2. After 10,000 steps of optimization (batch size 32, Adam optimizer):
    • Cross entropy loss is 2.43
    • Output looks like
      CYOx?
      
      DUThinqunt.
      
      LaZAnde.
      athave l.
      KEONH:
      ARThanco be y,-hedarwnoddy scace, tridesar, wnl'shenous s ls, theresseys
      PlorseelapinghiybHen yof GLUCEN t l-t E:
      I hisgothers je are!-e!
      QLYotouciullle'z,
      Thitertho s?
      NDan'spererfo cist ripl chys er orlese;
      Yo jehof h hecere ek? wferommot mowo soaf yoi
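
The initial loss of 4.68 sits above the 4.17 baseline, and one plausible explanation is that a freshly initialized model is not uniform: its logits are random. The sketch below is pure Python, not the training code from the video. It compares the uniform baseline with the average cross entropy of a random 65 x 65 bigram logit table. It assumes the logits are drawn from a standard normal distribution (the default for a PyTorch `nn.Embedding`) and that every (context, target) pair is equally likely, which real text does not satisfy, so the second number is only indicative. The function names are made up for this illustration.

```python
import math
import random

VOCAB_SIZE = 65  # number of distinct characters, taken from the loss formula


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def uniform_loss(vocab_size=VOCAB_SIZE):
    """Cross entropy when every character gets probability 1/vocab_size."""
    return -math.log(1.0 / vocab_size)


def random_table_loss(vocab_size=VOCAB_SIZE, seed=0):
    """Average cross entropy of a random bigram logit table.

    Assumption: logits ~ N(0, 1), and all (context, target) pairs
    are weighted equally.
    """
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
             for _ in range(vocab_size)]
    total = 0.0
    for row in table:
        norm = log_sum_exp(row)
        # Cross entropy for target t in this row is norm - row[t].
        total += sum(norm - logit for logit in row)
    return total / (vocab_size * vocab_size)


if __name__ == "__main__":
    print(f"uniform prediction: {uniform_loss():.2f}")
    print(f"random N(0, 1) logits: {random_table_loss():.2f}")
```

The first line prints 4.17. The second lands noticeably higher, because the log-sum-exp of random logits exceeds $\log 65$ on average while the target logit averages to zero. Training then has to first undo this overconfidence before it can learn actual bigram statistics.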
      

References