Which one is more important: more parameters or more computation? (2021)

(parl.ai)

43 points | by jxmorris12 1 day ago

5 comments

vorticalbox 6 hours ago
This reminds me of https://dnhkng.github.io/posts/rys/
David looks into the LLM finds the thinking layers and cut duplicates then and put them back to back.
This increases the LLM scores with basically no over head.
Very interesting read.
[-]
- renticulous 3 hours ago
  Jeff Dean says models hallucinate because their training data is "squishy."
  But what's in the context window is sharp, the exact text or video frame right in front of them.
  The goal is to bring more of the world into that context.
  Compression gives it intuition. Context gives it precision.
  Imagine if we could extract the model's reasoning core and plug it anywhere we want.
  [-]
  - 2ndorderthought 2 hours ago
    LLMs "hallucinate" because they are stochastic processes predicting the next word without any guarantees at being correct or truthful. It's literally an unavoidable fact unless we change the modelling approach. Which very few people are bothering to attempt right now.
    Training data quality does matter but even with "perfect" data and a prompt in the training data it can still happen. LLMs don't actually know anything and they also don't know what they don't know.
    https://arxiv.org/abs/2401.11817
    [-]
    - electroglyph 1 hour ago
      > they also don't know what they don't know
      they sort of do tho:
      https://transformer-circuits.pub/2025/introspection/index.ht...
      [-]
      - 2ndorderthought 56 minutes ago
        I won't quibble even though I likely should. Have to remember this is HN and companies need to shill their work otherwise ... Yes.
        I will play along and assume this is sound. 10-40% +/- 10% is along the lines of "sort of" in a completely unreliable, unguaranteed and unproven way sure.
kang 3 hours ago
The answer should be obvious that its both.
Zurada was one of our AI textbook that makes it visual that right from a simple classifier to a large language model, we are mathematically creating a shape(, that the signal interacts with). More parameters would mean shape can be curved in more ways and more data means the curve is getting hi-definition.
They reach something with data, treating neural network as blackbox, which could be derived mathematically using the information we know.
mskogly 2 hours ago
Selective training data, lora fine tuning or MOE are other solutionsZ Sure, creating a model with 100 billion parameters will yield good results, but it’s sort of like employing a million random people to play darts. Or shooting sparrows with A nuclear bomb.
l4tq3 4 hours ago
[dead]
34ylsh 5 hours ago
[flagged]