My Predictions About AI From May 2022

It's the end of 2024. I recently rediscovered my notes from May 2022; let's see how many of them came true.

[x] AI search engine
[x] AI's good with tools, not precise (recurse it)
[ ] AI package installer
[ ] obsolete code/file/command update (e.g. old hexo to new format, old dependencies to new)
[x] GPT Twitch chat interaction
[ ] how to train LLMs to delete?

My Opinion on LLM With Episodic Memory

About a week ago I listened to a Lex Fridman Podcast episode with Charan Ranganath and took notes on it.

I immediately thought these ideas could be applied to LLMs, as many of the points C.R. made are quite interesting, such as:

  • Similar neurons are used by the brain when predicting the future and remembering the past
  • Humans don't replay the past; we imagine what the past could've been by assembling bits and pieces
  • "You don't want to remember more, but better" (remembering things at a higher level of abstraction; cramming vs. actually knowing)
    • Maybe there’s something here
  • He suggested Internal Models of Events
  • Forgetting and Retrieval failure

Internal Models of Events

Internal Models of Events are formed from both semantic and episodic memory at particular points of high prediction error, and those points are exactly when it's optimal to encode an episodic memory.
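
If I were to prototype this with an LLM, the simplest version I can think of is to gate memory writes on prediction error: only store a chunk as an "episode" when the model is surprised by it. Everything below is made up for illustration; token_surprisal is just a stand-in for whatever LM scoring you'd actually use.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicStore:
    """Keep only the chunks the model found surprising (high prediction error)."""
    threshold: float = 4.0            # bits/token; arbitrary cutoff for "surprising"
    episodes: list = field(default_factory=list)

    def maybe_write(self, chunk: str, surprisal_bits_per_token: float) -> bool:
        # High prediction error -> worth encoding as an episode.
        if surprisal_bits_per_token >= self.threshold:
            self.episodes.append(chunk)
            return True
        return False                  # low surprise -> semantic memory covers it already

def token_surprisal(chunk: str) -> float:
    """Placeholder: in a real setup this would be the LM's average
    negative log-likelihood (bits/token) on `chunk` given its context."""
    return 5.0 if "unexpected" in chunk else 1.0

store = EpisodicStore()
for chunk in ["the meeting started at 10am as usual",
              "unexpected: the building was evacuated mid-meeting"]:
    store.maybe_write(chunk, token_surprisal(chunk))

print(store.episodes)  # only the surprising chunk gets stored
```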

Thoughts

  • wait, do humans store episodic memories for training while sleeping?

Streetfight Transformers Notes

My Takeaway

Example: Llama 3.1 8B
Layers (N): 32, Model Dim (D): 4096, FFN Dim: 14336, Attention Heads (NH): 32, K/V Heads: 8, Vocab Size (V): 128,000
Estimated params ≈ 128,000 * 4096 + 32 * 4 * 4096 * 4096 + 32 * (4096 * 14336 * 2) ≈ 6.4 B
Missing params that I can think of: WPE
Overhead ≈ 25%, I guess (8 B actual vs my 6.4 B estimate)
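
Same arithmetic as a quick script, so the numbers are easy to re-check (it keeps the same simplifications as above: attention counted as 4*D*D per layer with no GQA discount, and only 2 FFN matrices):

```python
# Back-of-envelope parameter count for Llama 3.1 8B (same formula as above).
N, D, FFN, V = 32, 4096, 14336, 128_000

embeddings = V * D                  # token embedding table
attention  = N * 4 * D * D          # Wq, Wk, Wv, Wo per layer (ignores the GQA shrink of K/V)
ffn        = N * (D * FFN * 2)      # up + down projections only

total = embeddings + attention + ffn
print(f"{total / 1e9:.2f} B params")  # ~6.43 B, vs the real ~8 B -> the ~25% overhead above
```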

Memory-wise:
To run inference you need:
B = 1
Params + KV cache = Params + N * T * B * 2D = 8B + 32 * 128000 * 1 * 2 * 4096 ≈ 8B values for the weights alone (the KV cache term is only negligible at short context; at the full 128K it adds another ~33B values)
Inference (weights only):
INT8 = 1 byte = 8 GB
BFloat16 = 2 bytes = 16 GB
TF32 = 2.375 bytes = 19 GB
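
Same thing as code, so I can vary the context length T (rough formula, KV cache counted at the full model dim with no GQA discount):

```python
# Inference memory: weights + KV cache, using the same rough formula as above.
N, D, PARAMS = 32, 4096, 8e9               # use the real ~8B parameter count
BYTES = {"INT8": 1, "BFloat16": 2, "TF32": 2.375}

def inference_gb(dtype: str, T: int, B: int = 1) -> float:
    kv_cache = N * T * B * 2 * D           # K and V for every layer and token
    return (PARAMS + kv_cache) * BYTES[dtype] / 1e9

print(inference_gb("INT8", T=2_048))       # ~8.5 GB: weights dominate at short context
print(inference_gb("BFloat16", T=2_048))   # ~17 GB
print(inference_gb("BFloat16", T=128_000)) # ~83 GB: at 128K the KV cache is no longer free
```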

Train:
3 * N * 12 * D * D + N * T * B * 12 * D = 3 * 32 * 12 * 4096 * 4096 + 32 * 128000 * 1 * 12 * 4096 ≈ 220B
TF32 = 220 * 2.375 ≈ 522 GB = 8 H100s (80 GB each) to train at batch size 1
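
As code (my guess is that the 3x is meant to cover weights + grads + optimizer state, and 12*D is the per-token, per-layer activation count):

```python
# Rough training-memory estimate, same formula as above.
N, D, T, B = 32, 4096, 128_000, 1
BYTES_TF32 = 2.375

params_grads_opt = 3 * N * 12 * D * D       # weights + grads + optimizer state (rough 3x)
activations      = N * T * B * 12 * D       # ~12*D activation values per token per layer

total = params_grads_opt + activations
print(f"{total / 1e9:.0f} B values")                 # ~221 B
print(f"{total * BYTES_TF32 / 1e9:.0f} GB in TF32")  # ~524 GB -> 8x 80GB H100s
```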

Finetune:
B = 1
R = 64
N*12*D*D + N*2*D*R (params) + 2*N*2*D*R (Adam) + B*N*T*(D+R+D+D+D) (activations) = 32 * 12 * 4096 * 4096 + 32 * 2 * 4096 * 64 + 2 * 32 * 2 * 4096 * 64 + 1 * 32 * 128000 * (4 * 4096 + 64) ≈ 74B
TF32 ≈ 175 GB = 8× 5090s (if it's 24/32 GB) to fine-tune at batch size 1
B = 4
32 * 12 * 4096 * 4096 + 32 * 2 * 4096 * 64 + 2 * 32 * 2 * 4096 * 64 + 4 * 32 * 128000 * (4 * 4096 + 64) ≈ 276B ≈ 655 GB ≈ 8 H100s with mixed precision maybe

So maybe with mixed precision you can fine-tune at B = 4 on 8×H100s.
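
The fine-tune estimate as code (my reading of the formula: base weights are frozen with no grads or optimizer state, the LoRA A/B matrices get 2 Adam moments, and B*N*T*(D+R+D+D+D) is the activations):

```python
# Rough LoRA fine-tuning memory estimate, same formula as above.
N, D, T, R = 32, 4096, 128_000, 64
BYTES_TF32 = 2.375

def finetune_values(B: int) -> float:
    base_weights = N * 12 * D * D            # frozen base model, no grads/optimizer state
    lora_params  = N * 2 * D * R             # low-rank A and B matrices
    adam_state   = 2 * lora_params           # two Adam moments, LoRA params only
    activations  = B * N * T * (4 * D + R)   # the (D+R+D+D+D) term above
    return base_weights + lora_params + adam_state + activations

for B in (1, 4):
    v = finetune_values(B)
    print(f"B={B}: {v / 1e9:.0f} B values, {v * BYTES_TF32 / 1e9:.0f} GB in TF32")
# B=1: ~74 B values, ~175 GB  -> fits on 8x 24GB cards
# B=4: ~276 B values, ~655 GB -> 8x H100 only with mixed precision
```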

Ok time to earn enough money to buy 8xH100s! It’s only $260,500!
