A Comparative Analysis of Popular Large Language Models in 2024
In recent years, the field of natural language processing has seen remarkable advances with the emergence of powerful Large Language Models (LLMs). These models have revolutionized applications from chatbots to content generation. In this post, we’ll explore six prominent LLMs: Llama, Gemini, OpenAI’s GPT, Mistral AI’s models, Wizard LM, and Vicuna, discussing how each is trained and where each excels.
1. Llama
Llama, developed by Meta, has gained significant attention in the AI community. According to Meta, Llama 3 was trained on approximately 15 trillion tokens of data, seven times more than the dataset used to train Llama 2. This data was collected from publicly available sources. Meta used heuristic and NSFW filters, semantic deduplication, and text classifiers that predict quality to ensure the model ingested a high-quality training dataset. Meta’s model card on GitHub also states that the fine-tuning data for Llama 3 included 10 million human-annotated examples on top of publicly available instruction datasets.
Interestingly, Meta combined human annotators with Llama 2 itself to build this training dataset. Having discovered that Llama 2 was proficient at identifying high-quality data, Meta used it to generate training data for the quality classifiers applied in Llama 3’s pipeline.
To achieve Llama 3’s performance level, Meta did five things:
- Supervised Fine-Tuning (SFT): the model is trained on carefully selected examples with desired outputs; the quality of these human-curated examples largely determines how effectively the model learns to generate accurate, contextually relevant responses.
- Rejection Sampling: improves the model’s outputs by generating several candidate responses and screening out the less satisfactory ones according to predetermined criteria (see the sketch after this list).
- Proximal Policy Optimization (PPO): the model learns through trial and error, experimenting with different outputs and reinforcing those that earn the most favorable rewards. PPO makes only incremental adjustments, setting a boundary on how far the model’s decision-making can shift in each training iteration.
- Direct Preference Optimization (DPO): a more direct method for integrating human feedback into training. The model’s parameters are fine-tuned using human preferences between pairs of outputs, aligning its results with human judgments without a separate reward model (a minimal sketch of the loss follows this list).
- Multiple Quality Assurance Iterations: a classic data-centric AI (DCAI) approach. Iteratively refining data quality corrects errors and inconsistencies in the training data; according to Meta, these repeated rounds produced substantial improvements in Llama 3’s quality and performance.
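To make rejection sampling concrete, here is a minimal best-of-n sketch. The `generate` and `reward` functions are hypothetical stand-ins (a real pipeline would call the LLM and a trained reward model); only the filtering pattern is the technique itself.

```python
# Minimal rejection-sampling (best-of-n) sketch. `generate` and `reward` are
# placeholders for the LLM and a trained reward model, not a real API.
import random

def generate(prompt, n=8):
    return [f"{prompt} -> candidate {i}" for i in range(n)]  # n candidate responses

def reward(response):
    return random.random()  # placeholder score; a reward model would go here

def best_of_n(prompt, n=8):
    candidates = generate(prompt, n)
    return max(candidates, key=reward)  # keep only the top-scoring response

print(best_of_n("Explain rejection sampling"))
```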
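And here is a minimal sketch of the DPO loss itself, assuming we already have per-sequence log-probabilities from the policy being trained and from a frozen reference model (variable names are ours, not Meta’s):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    more strongly than the frozen reference model does."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref (chosen)
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref (rejected)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-5.5]), torch.tensor([-6.5]))
print(loss)
```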
2. Gemini
Gemini is a family of advanced multimodal language models created by Google. They are designed to be more capable and versatile than earlier LLMs, able to handle a broader variety of tasks and data types. According to Google, the training data was collected from publicly available sources along with usage data, though detailed information about the data sources remains scarce. The models were trained on datasets spanning multiple modalities and languages.
Google is also working on ambiguity, a persistent obstacle to better logical reasoning. Natural language is often ambiguous in ways that human readers resolve effortlessly, but resolving such ambiguity is hard for LLMs. Google is currently exploring Retrieval-Interleaved Generation (RIG) and Retrieval-Augmented Generation (RAG), which ground the model’s answers in external data and can partially mitigate the issue, improving the model’s performance and accuracy.
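To illustrate what retrieval-augmented generation involves, here is a minimal sketch; the corpus, embedding function, and generator call are all placeholders, not Google’s implementation.

```python
# Minimal RAG sketch. The corpus, embed(), and generate() are placeholders;
# real systems use a vector database, a trained embedder, and an LLM.
import numpy as np

corpus = ["Gemini is trained on multimodal data.",
          "SentencePiece tokenizes non-Latin scripts efficiently."]

def embed(text):
    rng = np.random.default_rng(abs(hash(text)) % 2**32)  # fake but deterministic
    return rng.standard_normal(64)

def generate(prompt):
    return f"[LLM answer conditioned on]\n{prompt}"  # stand-in for the model

def rag_answer(question, k=1):
    q = embed(question)
    scores = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)))
              for d in corpus]  # cosine similarity between question and documents
    context = "\n".join(corpus[i] for i in np.argsort(scores)[-k:])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(rag_answer("What data was Gemini trained on?"))
```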
The key details Google has disclosed are summarized below:
- Tokenizer: the SentencePiece tokenizer, which can efficiently tokenize scripts outside the Latin alphabet as well (a short usage sketch follows this list).
- Data Sources: an extensive amount of multimodal information gathered from online documents, books, and code, in addition to select internal Google data sources. Dataset size scales with model size, and the number of tokens used to train the largest models was chosen following established scaling-law analyses.
- Training Techniques: smaller models are trained on disproportionately more tokens to enhance performance within a given inference budget.
- Data Curation: quality filters, including heuristic rules and model-based classifiers, plus safety filtering to eliminate harmful content.
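As a quick illustration of the tokenizer choice, here is a minimal sketch using the open-source sentencepiece package; the corpus file and settings are placeholders, since Google’s actual vocabulary and training corpus are not public.

```python
# Train a toy SentencePiece model and tokenize non-Latin text.
# "corpus.txt" is a placeholder path; supply any UTF-8 text file.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="toy", vocab_size=2000,
    character_coverage=0.9995,  # high coverage helps non-Latin scripts
)
sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("Tokenization works beyond Latin scripts: こんにちは", out_type=str))
```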
3. GPT
GPT, developed by OpenAI, has become synonymous with advanced language models. The latest iteration, GPT-4, represents a significant advance in capability, pairing larger training datasets with more refined training techniques than GPT-3.5. OpenAI has not published GPT-4’s architecture, but widely circulated (and unconfirmed) analyses estimate roughly 1.8 trillion parameters across 120 layers, over ten times the size of GPT-3. GPT-4 also reportedly includes a vision encoder, similar in architecture to Flamingo, that enables autonomous agents to read web pages and transcribe images and videos; this component adds parameters on top of the base model and is said to be fine-tuned with another ~2 trillion tokens.
GPT-4 was reportedly trained on ~13 trillion tokens, including both text and code, with some fine-tuning data from Scale AI and OpenAI’s internal sources.
The key techniques used in GPT-4’s development include:
- Supervised Fine-tuning: OpenAI uses supervised fine-tuning on annotated examples to steer the model toward intended results. Samples carefully selected by human annotators help ensure the model learns relevant and precise replies.
- Reinforcement Learning from Human Feedback (RLHF): RLHF helps align the model’s outputs with human preferences. By gathering human feedback on various outputs, the model can be fine-tuned iteratively to produce responses that align better with human expectations. This iterative feedback loop significantly improves the model’s safety and alignment.
- Sparse Mixture of Experts (MoE): GPT-4 is reported to employ MoE techniques to improve computational efficiency. Instead of activating all model parameters for every input, a subset of experts within the model is selected dynamically, allowing the model to focus its capacity on the most relevant aspects of the input. This strategy optimizes performance while reducing computational costs (see the routing sketch after this list).
- Prompt Engineering and Prompt Chaining: By leveraging prompt engineering and chaining techniques, GPT-4 can be adapted for various tasks, ensuring it performs well in different scenarios without requiring extensive re-training.
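Since OpenAI has not released details of GPT-4’s routing, the following is a generic top-k MoE layer in PyTorch, illustrating how only a few experts run per token; all sizes and the gating scheme are illustrative choices, not GPT-4’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse mixture-of-experts layer: each token is routed to k experts."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)           # top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize the gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 64))  # 10 tokens, each processed by only 2 of 8 experts
```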
4. Mistral AI
Mistral AI is an emerging player in the LLM space, introducing new techniques to optimize model training and inference efficiency. Mistral’s approach focuses on building smaller yet highly capable models that perform competitively with much larger LLMs. Mistral offers several models that excel in specific fields: its code model Codestral, for example, was trained on more than 80 programming languages for code completion and suggestion, and Mathstral is designed for complicated, advanced mathematical problems.
Mistral employs several techniques to achieve its goals:
- Knowledge Distillation: Mistral utilizes knowledge distillation to transfer the learning from larger models (teachers) to smaller models (students). The smaller models are trained to mimic the behavior of the teacher models, allowing them to achieve high performance with fewer parameters (a loss sketch follows this list).
- Quantization: techniques such as 8-bit or 4-bit quantization reduce the memory footprint and speed up inference. This enables Mistral models to run efficiently on a wider range of hardware, making them suitable for deployment on edge devices (see the int8 example after this list).
- Efficient Training Regimes: Mistral adopts efficient training schedules that balance computational cost and performance. The training strategy includes iterative pruning and re-training to optimize the model’s structure while retaining the performance of the original dense model.
- Sparse Architectures: The model incorporates sparsity at both the architectural and training levels to further reduce the number of active parameters during inference. This allows for a reduction in computational cost while maintaining a high level of accuracy.
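Mistral has not published a distillation recipe, so here is the standard (Hinton-style) distillation loss as a minimal PyTorch sketch; the temperature and mixing weight are free parameters chosen for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft teacher targets (KL at temperature T) with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep the same magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10 classes, random teacher/student logits.
s, t = torch.randn(4, 10), torch.randn(4, 10)
print(distillation_loss(s, t, torch.randint(0, 10, (4,))))
```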
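And a minimal sketch of symmetric per-tensor int8 weight quantization, the simplest variant of the techniques named above (production deployments typically quantize per channel or per group):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # worst-case reconstruction error
```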
5. Wizard LM
Wizard LM is another innovative LLM designed to optimize performance by leveraging domain-specific fine-tuning and user interaction data; its hallmark Evol-Instruct method automatically rewrites seed instructions into progressively more complex variants to enrich the fine-tuning data. Wizard LM aims to offer highly customized language models for specific industries, such as healthcare, finance, and education.
Key components of Wizard LM’s training approach include:
- Domain-Specific Fine-Tuning: Wizard LM employs domain-specific fine-tuning to adapt general-purpose language models to specialized areas. This involves using domain-related corpora and curated datasets to make the model more effective for specific applications, such as medical diagnostics or financial analysis.
- Direct Preference Optimization (DPO): DPO is used to incorporate user feedback directly into the training process, optimizing the model’s responses according to user preferences (the same loss sketched in the Llama section). This feedback-driven approach allows the model to learn from real-world interactions, continually improving its output quality.
- Human-in-the-Loop Feedback (HITL): Wizard LM incorporates HITL to iteratively refine the model based on user feedback. This process involves deploying the model in a real-world setting, gathering user interactions, and using this data to inform subsequent fine-tuning cycles (a stubbed version of this loop follows the list).
- Adaptive Learning Techniques: The model dynamically adjusts its training objectives based on feedback trends and evolving domain requirements. This adaptability ensures the model remains relevant and accurate over time, even as domain-specific knowledge evolves.
- Quality Control Iterations: Multiple quality control cycles ensure that the training data is refined and filtered to maintain high quality. This involves rigorous filtering of low-quality or irrelevant data, thereby improving the model’s consistency and reliability.
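The HITL cycle is easiest to see as code. Everything below is a stub: the deployment, the 1-to-5 rating scale, and the fine-tuning call are placeholders rather than Wizard LM’s actual pipeline, and only the loop structure reflects the technique.

```python
import random

def deploy_and_collect(model_version, n=100):
    """Stub for serving the model and logging rated interactions."""
    return [("prompt", f"reply-from-v{model_version}", random.randint(1, 5))
            for _ in range(n)]

def finetune(model_version, examples):
    """Stub: in practice this launches a supervised fine-tuning job."""
    return model_version + 1

model_version = 0
for cycle in range(3):  # iterative refinement cycles
    logs = deploy_and_collect(model_version)
    keepers = [(p, r) for p, r, stars in logs if stars >= 4]  # keep well-rated replies
    model_version = finetune(model_version, keepers)
```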
6. Vicuna
Vicuna is an open-source LLM developed by the LMSYS team of researchers and engineers, with a focus on accessibility and customization for a broader community of developers. Built on top of Meta’s Llama models, Vicuna aims to offer a more affordable yet powerful alternative to commercial LLMs by providing high-quality performance with fewer resources. Vicuna’s approach leverages community-driven efforts and fine-tuning techniques to achieve competitive performance on natural language processing tasks.
Here are some key aspects of Vicuna’s development process:
- Foundation on Llama: Vicuna builds directly on top of Meta’s Llama, fine-tuning its checkpoints (the original Llama at first, Llama 2 in later versions) to enhance their capabilities. By leveraging a solid pre-trained foundation, Vicuna can devote its resources to fine-tuning and performance improvements without having to train from scratch.
- Instruction-Tuned with User Data: to make Vicuna more effective for real-world applications, it is fine-tuned on high-quality instruction-following data derived from user-shared conversations (notably from ShareGPT). Exposure to varied prompt types, user queries, and responses teaches the model the nuances of generating relevant and helpful answers, significantly improving its ability to follow complex instructions and deliver contextually appropriate responses.
- Community-Driven Development: Unlike many proprietary LLMs, Vicuna’s development is community-centric, involving contributions from researchers, developers, and AI enthusiasts who help fine-tune the model with specialized datasets, scripts, and evaluation techniques. This collaborative approach fosters innovation and allows Vicuna to be rapidly updated with the latest advancements in LLM research.
- Cost-Effective Training: Vicuna’s fine-tuning process is designed to be far more resource-efficient than training a new model from scratch. Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) reduce the computational costs of training while still allowing the model to achieve high accuracy: they adapt the weights of the pre-trained Llama model by learning small low-rank updates rather than retraining every parameter (a minimal LoRA layer is sketched after this list).
- Evaluation against GPT-3.5 and GPT-4: Vicuna’s performance has been benchmarked against models like GPT-3.5 and GPT-4, showing competitive results across tasks including question answering, text summarization, and dialogue generation. In the LMSYS team’s preliminary evaluation, which used GPT-4 as an automated judge, Vicuna reached roughly 90% of ChatGPT’s quality, making it a viable open-source alternative for many applications.
- Support for Custom Fine-Tuning: Vicuna is particularly valuable for developers and researchers who need customized LLMs for specific tasks. The model’s architecture and open-source nature make it easier to fine-tune using task-specific data, enabling the creation of specialized LLMs that are fine-tuned for unique requirements, such as legal document analysis, healthcare-related queries, or technical support.
- Active Community and Continuous Improvement: The Vicuna project benefits from an active community that regularly updates the model with fine-tuning scripts, new datasets, and optimization techniques. The open-source nature of the model encourages experimentation, ensuring that the latest LLM advancements are quickly incorporated into Vicuna’s training pipeline.
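To illustrate the LoRA idea mentioned above, here is a minimal PyTorch sketch; the rank, scaling, and layer size are arbitrary illustrative choices, and real setups wrap the attention projections inside the transformer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank factors train: 2 * 8 * 512 = 8192 parameters
```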
Conclusion
Each LLM brings unique strengths and strategies to the table, from GPT’s massive data-driven approach to Mistral’s efficient architectures and Wizard LM’s domain-specific fine-tuning. These models represent a growing trend toward optimizing LLMs for various use cases, balancing performance, efficiency, and adaptability.
Understanding their development techniques and training methodologies sheds light on how the field is evolving to address the ever-expanding range of natural language processing tasks, making LLMs more accessible and capable for both general-purpose and specialized applications.
References:
[1] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., … Zhao, Z. (2024). The Llama 3 Herd of Models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2407.21783
[2] Google AI Blog. (2024). DataGemma: Using real-world data to address AI hallucinations. https://blog.google/technology/ai/google-datagemma-ai-llm/
[3] DataCamp Blog. (2024). What is GPT-4 and Why Does it Matter? https://www.datacamp.com/blog/what-we-know-gpt4
[4] Mistral AI. (2024). https://mistral.ai/technology
[5] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., & Jiang, D. (2023). WizardLM: Empowering Large Language Models to Follow Complex Instructions (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2304.12244
[6] Chiang, W. L., et al. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/