@ovalwingnut

First 5:30 is talking... I bet it was going to start any time, but I just ran out of patience. GR8T Job. You RoCk! ❤️

@DarrenReidAu

Great breakdown. Since Ollama's AMD support has become decent, a good bang-for-buck card is the MI50 16GB. I did a similar test for comparison and it comes in at about the 4060 Ti for output, with prompt tokens faster due to sheer memory speed (HBM2). ~20 tok/s out. Not bad for a card that can be had on eBay for $150-$200 USD.
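If anyone wants to reproduce the numbers, here's roughly how I time it - a minimal sketch against Ollama's local HTTP API (assumes the default port, and the model tag below is just an example; swap in whatever you actually pulled):
```python
# Rough output tokens/sec measurement against a local Ollama server.
import json
import urllib.request

MODEL = "llama3.1:8b"  # example tag, substitute your own model

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": MODEL,
        "prompt": "Explain PCIe lanes in one paragraph.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_count / eval_duration cover the generated (output) tokens;
# durations are reported in nanoseconds.
out_tps = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"output: {out_tps:.1f} tokens/sec")
```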

@udirt

Like the screen layout you've got for the test runs, and nice intro. Impressive, well-structured presentation.

@andrewowens5653

Thank you. It would be interesting to see some evaluation of multiple consumer GPUs working on the same LLM.

@aaviusa835

I was considering buying 12 Tesla M40s so I could train and use the largest language models. But after calculating how much wattage and electricity that is, I realized the city and the electric company might pay me a visit to figure out what's going on. 😅
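The back-of-envelope math, in case anyone's curious (the host overhead and electricity rate below are just guesses):
```python
# Rough power/cost estimate for a 12x Tesla M40 rig.
cards = 12
tdp_watts = 250            # Tesla M40 board power
host_overhead_watts = 400  # rough guess for CPU, fans, PSU losses
rate_per_kwh = 0.15        # assumed $/kWh; varies a lot by region

total_kw = (cards * tdp_watts + host_overhead_watts) / 1000
monthly_cost = total_kw * 24 * 30 * rate_per_kwh
print(f"{total_kw:.1f} kW under full load, ~${monthly_cost:.0f}/month if it never idles")
```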

@Lemure_Noah

Thanks for this nice test!
I just bought the 4060 Ti 16GB to complement my two RTX 3070 8GB cards - now I have 32GB, good enough to run Mixtral 8x7B or the Qwen2.5-Coder 32B model.
A note: for small models like Llama 3.2 3B, I put the model on just one GPU, since splitting an LLM across all GPUs really hurts tokens per second. Only big models take advantage of multi-GPU, due to memory constraints.
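If it helps anyone, this is roughly how I pin a small model to a single card - a sketch that assumes an NVIDIA setup where Ollama honors the standard CUDA_VISIBLE_DEVICES variable (normally you'd just set it in the shell or service config before starting the server):
```python
# Launch an Ollama server that can only see GPU 0, so a small model
# isn't split across cards. Index 0 is just an example.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
subprocess.Popen(["ollama", "serve"], env=env)
```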

@mrbabyhugh

13:34 Considering the power draw and the DIY fan, I prefer the i9 over the M40.

@fooboomoo

Great content, and relevant to me since I recently bought a 4060 Ti 16GB for AI.

@kimroscoe5725

Really enjoyed this exploration of old and new - hope you use it as a regular feature (I will go and check for updates) and maybe consider multi-GPU - you've got 4 slots there - what can they do combined???

@squeakytoyrecords1702

Based on your findings I'm a suit and tie kind of guy. I just purchased a Xilinx ALVEO U200 Data Center Accelerator Card (A-U200-A64G) off eBay which I will be running LLMs on. It would be interesting to see how this enterprise FPGA compares to the A4500. Great video!!

@mrbabyhugh

18:05 I am VERY GOOD with this! Just wondering if the 4070 Super can do the same or close. Maybe not - I'm seeing 15GB in memory and the 4070 Super has 12GB, so it would have to load a smaller quantization.
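Rough sizing math for what fits in 12GB - the parameter count and bits-per-weight figures here are assumptions, not the exact model from the video, and KV cache grows with context on top of this:
```python
# Approximate VRAM footprint of a model at different quantization levels.
params_billion = 14   # example model size
overhead_gb = 1.5     # runtime buffers, rough guess

for name, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    weights_gb = params_billion * bits / 8
    print(f"{name}: ~{weights_gb + overhead_gb:.1f} GB")
```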

@C650101

I want to run big models cheaply. I use a 1080 Ti now on an 8B Llama - fast enough, but I'd like a reliable code assistant with a bigger model. Suggestions? Can you test multiple 3060s in parallel on a big model?

@benjaminhudsondesign

I may be wrong, but I'm pretty sure you can change the seed from random to fixed, so given the same prompt with the same seed the responses should be exactly the same across multiple tests.
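For what it's worth, with Ollama you can pass the seed (plus zero temperature) in the request options - a sketch assuming the default local endpoint and an example model tag:
```python
# Two identical requests with a fixed seed should produce identical output.
import json
import urllib.request

def generate(prompt: str) -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": "llama3.2:3b",  # example model tag
            "prompt": prompt,
            "stream": False,
            "options": {"seed": 42, "temperature": 0},
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

a = generate("Name three Maxwell-era Tesla cards.")
b = generate("Name three Maxwell-era Tesla cards.")
print("identical:", a == b)
```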

@eukaliptal

Dunno - going from 10 tokens/sec to 30 tokens/sec, yes, it's 3x,
but it still feels slow.

Excellent content, thank you

@alekseyburrovets4747

It doesn't make sense to buy anything less than an RTX 3090. Eight of them will be okay for 671B heavily quantized with a small context window for tensor-parallel inference.
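The arithmetic behind that, with assumed bits-per-weight (real GGUF sizes vary a bit, and the KV cache for even a small context comes on top):
```python
# Does a heavily quantized 671B model fit in 8x 24GB?
gpus, vram_each_gb = 8, 24
total_vram_gb = gpus * vram_each_gb  # 192 GB

params_b = 671
for bits in (1.58, 2.0, 2.5):
    size_gb = params_b * bits / 8
    fits = "fits" if size_gb < total_vram_gb else "does not fit"
    print(f"~{bits} bpw -> ~{size_gb:.0f} GB ({fits} in {total_vram_gb} GB)")
```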

@nithinbhandari3075

Thanks for comparing the different GPU hardware.

Can you run a test with, say, 6k input tokens and 1k output tokens?
That way we can see how a large LLM performs with 6k input and 1k output tokens.
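Something like this would do it against a local Ollama server - a sketch where the filler prompt, model tag, and option values are just stand-ins for a real 6k-token document:
```python
# Rough 6k-in / 1k-out benchmark against a local Ollama server.
# The filler text only approximates 6k tokens; for a proper test use a
# real long document and check prompt_eval_count in the response.
import json
import urllib.request

long_prompt = ("The quick brown fox jumps over the lazy dog. " * 500
               + "\n\nSummarize the text above in detail.")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3.1:8b",  # example model tag
        "prompt": long_prompt,
        "stream": False,
        "options": {"num_predict": 1024, "num_ctx": 8192},
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    r = json.load(resp)

print("prompt tokens:", r["prompt_eval_count"],
      f"({r['prompt_eval_count'] / (r['prompt_eval_duration'] / 1e9):.0f} tok/s)")
print("output tokens:", r["eval_count"],
      f"({r['eval_count'] / (r['eval_duration'] / 1e9):.1f} tok/s)")
```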

@ZIaIqbal

Can you try to run the llama3.1 405B model on the CPU and see what kind of response we can get?
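A back-of-envelope estimate of what to expect, assuming token generation is memory-bandwidth bound and using the guessed figures below (not measurements from the video):
```python
# Why CPU inference on a 405B model will be very slow: token generation is
# roughly memory-bandwidth bound, so tok/s ~= RAM bandwidth / model size.
model_size_gb = 405 * 4.8 / 8   # ~Q4_K_M weights, about 243 GB
ram_bandwidth_gbps = 80         # e.g. dual-channel DDR5, rough figure

est_tps = ram_bandwidth_gbps / model_size_gb
print(f"~{est_tps:.2f} tokens/sec upper bound -> minutes per sentence")
```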

@jeroenadamdevenijn4067

If I run Codestral 22B Q4_K_M on my P5000 (Pascal architecture), I get 11 t/s evaluation, so the P5000 performs at around 75% of a 4060 Ti. But when I open NVIDIA power management I can see it only consumes 140W under load, while it should be able to go up to 180W. BTW, both these cards have 288GB/s memory bandwidth. I must have a bottleneck in my system, which is an Intel 11th-gen i7 laptop (4-core CPU) with the card in an eGPU enclosure over Thunderbolt 3.
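One way I'd check is to log power and utilization while a generation is running - a sketch assuming nvidia-smi is installed and on the PATH:
```python
# Log GPU power draw and utilization once a second while a model is
# generating, to see whether the card is actually the bottleneck.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=power.draw,power.limit,utilization.gpu",
         "--format=csv,noheader"]

for _ in range(30):  # ~30 seconds of samples
    print(subprocess.check_output(QUERY, text=True).strip())
    time.sleep(1)
```
If utilization sits well below 100% while tokens are streaming, the card is waiting on the host (the 4-core CPU or the Thunderbolt link) rather than running flat out.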

@jcirclev2

I am going back through your videos to see if you did one, but I would LOVE to see what a P40 will do.

@luckystrikehk

Does the M40 work over an OCuLink eGPU dock connected to a mini PC?