Great breakdown. Since Ollama support for AMD has become decent, a good bang for the buck is the MI50 16GB. I did a similar test for comparison and it comes in a bit above the 4060 Ti for output, with prompt tokens processed faster due to sheer memory speed (HBM2). ~20 tok/s out. Not bad for a card that can be had on eBay for $150-$200 USD.
Like the screen layout you've got for the test runs, and nice intro. Impressive, well-structured presentation.
Thank you. It would be interesting to see some evaluation of multiple consumer GPUs working on the same LLM.
I was considering buying 12 Tesla M40s so I could train and use the largest language models. But after calculating how much wattage and electricity that is, I realized the city and the electric company might pay me a visit to figure out what's going on. 😅
Thanks for this nice test! I just bought the 4060 Ti 16GB to complement my two RTX 3070 8GB cards, so now I have 32GB, enough to run Mixtral 8x7B or the Qwen2.5-Coder 32B model. A note: for small models like Llama 3.2 3B, I keep everything on just one GPU, since splitting a model across all the GPUs really hurts tokens per second. Only big models take advantage of multi-GPU, due to memory constraints.
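In case it helps anyone else, here's a minimal sketch of pinning the Ollama server to a single card via CUDA_VISIBLE_DEVICES (a standard CUDA environment variable). The GPU index and the idea of launching the server from Python are just illustrative; any already-running Ollama service would need to be stopped first.

```python
import os
import subprocess

# Restrict the Ollama server to a single GPU (index 0) so small models
# are not split across cards; CUDA_VISIBLE_DEVICES is honored by the CUDA runtime.
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"   # "0,1,2" would expose all three cards again

# Launch the server with the restricted device list; clients connect as usual
# on the default port 11434. (Stop any Ollama service that is already running,
# otherwise this will fail to bind the port.)
subprocess.run(["ollama", "serve"], env=env)
```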
13:34 Considering the power draw and the DIY fan, I prefer the i9 over the M40.
Great content, and relevant to me since I recently bought a 4060 Ti 16GB for AI.
Really enjoyed this exploration of old and new. Hope you make it a regular feature (I will go and check for updates), and maybe consider multi-GPU: you've got 4 slots there, what can they do combined?
Based on your findings I'm a suit and tie kind of guy. I just purchased a Xilinx ALVEO U200 Data Center Accelerator Card (A-U200-A64G) off eBay, which I will be running LLMs on. It would be interesting to see how this enterprise FPGA compares to the A4500. Great video!!
18:05 I am VERY GOOD with this! Just wondering if the 4070 Super can do the same or close. Maybe not: I'm seeing 15GB in memory, and the 4070 Super has 12GB, so it would have to load a smaller quantization.
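A back-of-envelope way to guess whether a given quant fits in 12GB. The bits-per-weight figures below are rough averages for common GGUF quant levels, and the 20% overhead factor for KV cache and runtime buffers is an assumption, not a measurement:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weight size plus ~20% for KV cache and runtime buffers."""
    weight_gb = params_b * bits_per_weight / 8  # billions of params * bytes per param
    return weight_gb * overhead

# Illustrative numbers for a ~14B-class model at different quantization levels.
for name, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{estimate_vram_gb(14, bits):.1f} GB")
```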
I want to run big models cheaply. I use a 1080 Ti now on 8B Llama, which is fast enough, but I'd like a reliable code assistant with a bigger model. Suggestions? Can you test multiple 3060s in parallel on a big model?
I may be wrong, but I'm pretty sure you can change the seed from random to fixed, so given the same prompt with the same seed, the responses should be exactly the same across multiple tests.
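For what it's worth, a minimal sketch against a local Ollama server (default port assumed; the model tag is just an example) that pins both the seed and the temperature, so repeated runs of the same prompt should come back identical:

```python
import requests

def generate(prompt: str, seed: int = 42) -> str:
    """Ask a local Ollama server for a deterministic completion."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",    # example tag; use whatever model is pulled locally
            "prompt": prompt,
            "stream": False,
            "options": {
                "seed": seed,          # fixed seed instead of random
                "temperature": 0,      # greedy sampling removes the remaining randomness
            },
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# The same prompt with the same seed should produce the same output across runs.
print(generate("Explain KV cache in one sentence.") == generate("Explain KV cache in one sentence."))
```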
Dunno, going from 10 tokens/sec to 30 tokens/sec - yes, it's 3x, but it still feels slow. Excellent content, thank you.
It doesn't make sense to buy anything less than an RTX 3090. Eight of them will be okay for 671B heavily quantized with a small context window, using tensor-parallel inference.
Thanks for comparing the different GPU hardware. Can you run a test with 6k input tokens and 1k output tokens, so we can see how large LLMs perform with a big prompt and a long output?
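Something like this would do it against a local Ollama server. The model tag and the filler prompt are placeholders, and the exact token count of the filler depends on the tokenizer; num_ctx and num_predict are the relevant options, and the timing fields in the response are reported in nanoseconds:

```python
import requests

# Roughly 6k tokens of filler input; a real benchmark would use meaningful text.
long_prompt = ("lorem ipsum " * 2000) + "\nSummarize the text above."

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",       # example model tag
        "prompt": long_prompt,
        "stream": False,
        "options": {
            "num_ctx": 8192,      # context window large enough for ~6k in + 1k out
            "num_predict": 1024,  # cap the output at ~1k tokens
        },
    },
    timeout=1200,
)
stats = resp.json()
# Ollama reports token counts alongside durations in nanoseconds.
print("prompt tok/s:", stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9))
print("output tok/s:", stats["eval_count"] / (stats["eval_duration"] / 1e9))
```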
Can you try to run the llama3.1 405B model on the CPU and see what kind of response we can get?
If I run Codestral 22B Q4_K_M on my P5000 (Pascal architecture), I get 11 t/s evaluation, which means the P5000 performs at around 75% of a 4060 Ti. But when I open Nvidia power management I can see it only consumes 140W under load, while it should be able to go up to 180W. BTW, both these cards have 288GB/s memory bandwidth. I must have a bottleneck in my system, which is an Intel 11th-gen i7 laptop (4-core CPU) with the eGPU over Thunderbolt 3.
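In case it helps chase that down, here's a quick check (assuming nvidia-smi is on the PATH) of the actual draw versus the configured power limit, plus the PCIe link the eGPU negotiated, since Thunderbolt 3 typically limits it to x4:

```python
import subprocess

# Query power draw vs. configured limit and the negotiated PCIe link;
# a narrow or downgraded link can starve prompt processing long before
# the GPU ever reaches its power limit.
fields = "name,power.draw,power.limit,pcie.link.gen.current,pcie.link.width.current"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```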
I am going back through your videos to see if you did one, but I would LOVE to see what a P40 will do.
Does the M40 work over an OCuLink dock connected to a mini PC?