My Predictions About AI From May 2022

It's the end of 2024. I recently rediscovered my notes from May 2022; let's see how many of them came true.

[x] AI search engine
[x] AI's good with tools, not precise (recurse it)
[ ] AI package installer
[ ] obsolete code/file/command update (e.g. old hexo to new format, old dependencies to new)
[x] GPT Twitch chat interaction
[ ] how to train LLMs to delete?

My Opinion on LLM With Episodic Memory

About a week ago I listened to a Lex Fridman Podcast episode with Charan Ranganath and took notes on it.

I immediately thought these ideas could be applied to LLMs, as many of the points C.R. made are quite interesting, such as:

  • Similar neurons are used by the brain when predicting the future and remembering the past
  • Humans don't replay the past; we imagine what the past could've been by assembling bits and pieces
  • "You don't want to remember more, but better" (remembering things at a higher level of abstraction; cramming vs. actually knowing)
    • Maybe there’s something here
  • He suggested Internal Models of Events
  • Forgetting and Retrieval failure

Internal Models of Events

Internal Models of Events are formed from both semantic and episodic memory at particular points of high prediction error, and those points are exactly when it's optimal to encode an episodic memory.
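
If I were to prototype this with an LLM, the simplest version I can think of is to gate memory writes on prediction error: only store a chunk as an "episode" when the model is surprised by it. Everything below is made up for illustration; token_surprisal is just a stand-in for whatever LM scoring you'd actually use.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicStore:
    """Keep only the chunks the model found surprising (high prediction error)."""
    threshold: float = 4.0            # bits/token; arbitrary cutoff for "surprising"
    episodes: list = field(default_factory=list)

    def maybe_write(self, chunk: str, surprisal_bits_per_token: float) -> bool:
        # High prediction error -> worth encoding as an episode.
        if surprisal_bits_per_token >= self.threshold:
            self.episodes.append(chunk)
            return True
        return False                  # low surprise -> semantic memory covers it already

def token_surprisal(chunk: str) -> float:
    """Placeholder: in a real setup this would be the LM's average
    negative log-likelihood (bits/token) on `chunk` given its context."""
    return 5.0 if "unexpected" in chunk else 1.0

store = EpisodicStore()
for chunk in ["the meeting started at 10am as usual",
              "unexpected: the building was evacuated mid-meeting"]:
    store.maybe_write(chunk, token_surprisal(chunk))

print(store.episodes)  # only the surprising chunk gets stored
```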

Thoughts

  • wait, do humans store episodic memories for training while sleeping?

Streetfight Transformers Notes

My Takeaway

Example: Llama 3.1 8B
Layers (N): 32, Model Dim (D): 4096, FFN Dim: 14336, Attention Heads (NH): 32, K/V Heads: 8, Vocab Size (V): 128,000
Estimated params ≈ 128,000 * 4096 + 32 * 4 * 4096 * 4096 + 32 * (4096 * 14336 * 2) ≈ 6.4 B
Missing params that I can think of: WPE
Overhead ≈ 25%, I guess (8 B actual vs my 6.4 B estimate)
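
Same arithmetic as a quick script, so the numbers are easy to re-check (it keeps the same simplifications as above: attention counted as 4*D*D per layer with no GQA discount, and only 2 FFN matrices):

```python
# Back-of-envelope parameter count for Llama 3.1 8B (same formula as above).
N, D, FFN, V = 32, 4096, 14336, 128_000

embeddings = V * D                  # token embedding table
attention  = N * 4 * D * D          # Wq, Wk, Wv, Wo per layer (ignores the GQA shrink of K/V)
ffn        = N * (D * FFN * 2)      # up + down projections only

total = embeddings + attention + ffn
print(f"{total / 1e9:.2f} B params")  # ~6.43 B, vs the real ~8 B -> the ~25% overhead above
```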

Memory-wise:
To run inference you need:
B = 1
Params + KV cache = Params + N * T * B * 2D = 8B + 32 * 128000 * 1 * 2 * 4096 ≈ 8B values for the weights alone (the KV cache term is only negligible at short context; at the full 128K it adds another ~33B values)
Inference (weights only):
INT8 = 1 byte = 8 GB
BFloat16 = 2 bytes = 16 GB
TF32 = 2.375 bytes = 19 GB
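
Same thing as code, so I can vary the context length T (rough formula, KV cache counted at the full model dim with no GQA discount):

```python
# Inference memory: weights + KV cache, using the same rough formula as above.
N, D, PARAMS = 32, 4096, 8e9               # use the real ~8B parameter count
BYTES = {"INT8": 1, "BFloat16": 2, "TF32": 2.375}

def inference_gb(dtype: str, T: int, B: int = 1) -> float:
    kv_cache = N * T * B * 2 * D           # K and V for every layer and token
    return (PARAMS + kv_cache) * BYTES[dtype] / 1e9

print(inference_gb("INT8", T=2_048))       # ~8.5 GB: weights dominate at short context
print(inference_gb("BFloat16", T=2_048))   # ~17 GB
print(inference_gb("BFloat16", T=128_000)) # ~83 GB: at 128K the KV cache is no longer free
```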

Train:
3 * N * 12 * D * D + N * T * B * 12 * D = 3 * 32 * 12 * 4096 * 4096 + 32 * 128000 * 1 * 12 * 4096 ≈ 220B
TF32 = 220 * 2.375 ≈ 522 GB = 8 H100s (80 GB each) to train at batch size 1
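
As code (my guess is that the 3x is meant to cover weights + grads + optimizer state, and 12*D is the per-token, per-layer activation count):

```python
# Rough training-memory estimate, same formula as above.
N, D, T, B = 32, 4096, 128_000, 1
BYTES_TF32 = 2.375

params_grads_opt = 3 * N * 12 * D * D       # weights + grads + optimizer state (rough 3x)
activations      = N * T * B * 12 * D       # ~12*D activation values per token per layer

total = params_grads_opt + activations
print(f"{total / 1e9:.0f} B values")                 # ~221 B
print(f"{total * BYTES_TF32 / 1e9:.0f} GB in TF32")  # ~524 GB -> 8x 80GB H100s
```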

Finetune:
B = 1
R = 64
N*12*D*D + N*2*D*R (params) + 2*N*2*D*R (Adam) + B*N*T*(D+R+D+D+D) (activations) = 32 * 12 * 4096 * 4096 + 32 * 2 * 4096 * 64 + 2 * 32 * 2 * 4096 * 64 + 1 * 32 * 128000 * (4 * 4096 + 64) ≈ 74B
TF32 ≈ 175 GB = 8× 5090s (if it's 24/32 GB) to fine-tune at batch size 1
B = 4
32 * 12 * 4096 * 4096 + 32 * 2 * 4096 * 64 + 2 * 32 * 2 * 4096 * 64 + 4 * 32 * 128000 * (4 * 4096 + 64) ≈ 276B ≈ 655 GB ≈ 8 H100s with mixed precision maybe

So maybe with mixed precision you can fine-tune at B = 4 on 8×H100s.
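
The fine-tune estimate as code (my reading of the formula: base weights are frozen with no grads or optimizer state, the LoRA A/B matrices get 2 Adam moments, and B*N*T*(D+R+D+D+D) is the activations):

```python
# Rough LoRA fine-tuning memory estimate, same formula as above.
N, D, T, R = 32, 4096, 128_000, 64
BYTES_TF32 = 2.375

def finetune_values(B: int) -> float:
    base_weights = N * 12 * D * D            # frozen base model, no grads/optimizer state
    lora_params  = N * 2 * D * R             # low-rank A and B matrices
    adam_state   = 2 * lora_params           # two Adam moments, LoRA params only
    activations  = B * N * T * (4 * D + R)   # the (D+R+D+D+D) term above
    return base_weights + lora_params + adam_state + activations

for B in (1, 4):
    v = finetune_values(B)
    print(f"B={B}: {v / 1e9:.0f} B values, {v * BYTES_TF32 / 1e9:.0f} GB in TF32")
# B=1: ~74 B values, ~175 GB  -> fits on 8x 24GB cards
# B=4: ~276 B values, ~655 GB -> 8x H100 only with mixed precision
```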

Ok time to earn enough money to buy 8xH100s! It’s only $260,500!
