• 0 Posts
  • 59 Comments
Joined 2 years ago
cake
Cake day: March 14th, 2023

help-circle







  • From someone in the field

    It lowered training costs by quite a bit. To learn from preference data (whats termed as alignment with human values), we used a very large reward model as a proxy for human feedback.

    They completely got rid of this, hence also the need to have very large clusters

    This has serious implications for spending though. Big companies who would have to train foundation models coz they couldnt directly use meta’s llama, can now just use deepseek.

    and directly move to the human/customer alignment phase, which was already  significantly cheaper than pretraining (first phase of foundation model training). With their new algorithm, even the later stage does not need huge compute

    so they def got rid of a big chunk of compute by not relying on what is called a “reward” model

    GRPO: group relative policy optimization

    huggingface is trying to replicate their results

    https://github.com/huggingface/open-r1