What Is Llama? Meta’s Open-Weight AI Model Family That Became the De Facto Default

What is Llama: Meta’s family of open-weight large language models, launched in February 2023 with Llama 1 and now in its fourth generation, with Llama 4 spanning 8-billion, 70-billion, and 405-billion parameter sizes. It has become the most-downloaded open-weight LLM by a substantial margin and the default starting point for teams running local AI inference, building agent frameworks on top of open-weight models, or fine-tuning models with their own data. It is distributed under the Llama Community License, which permits commercial use, with the notable threshold that organizations exceeding 700 million monthly active users must negotiate a separate commercial agreement with Meta. Around it sits an ecosystem of fine-tuned variants, specialized models like Code Llama for code generation, and the Llama Stack reference framework, which Meta maintains as an opinionated path to building applications on Llama models with consistent patterns across deployment targets.

Llama is Meta’s family of open-weight large language models, first released in February 2023 and now in its fourth generation with Llama 4 across 8-billion, 70-billion, and 405-billion parameter sizes. Through 2024-2026, Llama has become the most-downloaded open-weight LLM by a substantial margin and the default starting point for teams running local AI inference, building agent frameworks on top of open-weight models, or fine-tuning models with their own data. The distribution model matters: Llama is released under the Llama Community License, which permits commercial use without per-request fees but includes the notable threshold that organizations exceeding 700 million monthly active users must negotiate a separate commercial agreement with Meta. For the vast majority of organizations (which are well below that threshold), the practical effect is "free for commercial use" with attribution. The license combined with the model quality is most of why Llama became the de facto open-weight default rather than just one option among several.

This post covers what Llama actually is, the lineage from Llama 1 through Llama 4, the current Llama 4 generation in detail, the license structure and what it means in practice, why Llama became the default open-weight choice, the surrounding ecosystem (Code Llama for code generation, the Llama Stack reference framework, the fine-tuned variant community), where Llama fits among other open-weight families, and the practical considerations for teams deciding whether to build on Llama. For more context, our Local AI Models pillar covers the open-weight category as a whole, and our Ollama pillar covers the runtime tool that most local Llama deployments use.

What Llama actually is

Llama is a family of large language models trained by Meta and released as open-weight downloads. The "open-weight" framing is specific: Meta publishes the trained model parameters (the actual weights that determine model behavior) under a license that permits download, modification, deployment, and commercial use within stated boundaries. This is distinct from "open source" in the strict sense (which would also require training code and data to be open) but represents the most accessible model release pattern among major commercial labs.

The architectural pieces that matter for understanding what Llama is:

Decoder-only transformer architecture. Llama uses the same fundamental architecture as GPT, Claude, Gemini, and most other modern LLMs: a decoder-only transformer that generates text autoregressively (one token at a time, each conditioned on the previous tokens). The architecture isn’t distinctive; what’s distinctive is the training quality and the release model.

Multiple parameter sizes. Each Llama generation ships in multiple sizes (8B, 70B, 405B for Llama 4) to serve different hardware capabilities and use cases. Small models run on consumer laptops; large models need workstation or server hardware. The size mix is what makes the family broadly applicable.

Trained at large scale. Meta has invested heavily in pre-training data quality, alignment, and safety processes. The specific data composition isn’t fully public, but the training scale is large enough that Llama models compete with proprietary frontier models on many benchmarks.

Released as model files, not as a service. Llama is not an API you call; it’s a model you download. The distribution is via Hugging Face Hub, Meta’s own download channels, and curated libraries like Ollama’s model registry. Customers run inference on their own infrastructure or via third-party inference providers (Together AI, Fireworks AI, Replicate, Groq, AWS Bedrock, and others) that host Llama as a service.

Maintained by Meta with substantial community contribution. The base models come from Meta; the broader Llama ecosystem (fine-tuned variants, specialized derivatives, quantization formats, runtime integrations) comes from a substantial community of developers and labs building on the foundation.
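
The autoregressive generation described above (one token at a time, each conditioned on the previous tokens) can be sketched as a short loop. This is a toy illustration, not Llama's implementation: the lookup table TOY_MODEL and the function names are hypothetical stand-ins for a real transformer forward pass, and the toy conditions only on the last token, whereas a real model attends to the whole sequence.

```python
# Hypothetical toy "model": maps the most recent token to a distribution
# over next tokens. A real LLM computes this with a transformer forward pass.
TOY_MODEL = {
    "<s>": {"llama": 0.9, "model": 0.1},
    "llama": {"is": 1.0},
    "is": {"open": 0.8, "closed": 0.2},
    "open": {"weight": 1.0},
    "weight": {"</s>": 1.0},
}


def next_token_distribution(tokens):
    """Stand-in for the model: return P(next token | tokens so far)."""
    return TOY_MODEL.get(tokens[-1], {"</s>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: append the most likely token each step."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)  # greedy choice; samplers pick randomly
        if token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(token)  # the new token becomes part of the context
    return tokens
```

Calling generate(["<s>"]) walks the table one token per step until the end-of-sequence marker appears. Production runtimes usually replace the greedy choice with temperature or top-p sampling.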

The Llama lineage

The Llama family has shipped four generations over roughly three years, each meaningfully more capable than the last:

Llama 1 (February 2023). The original family, released initially for researchers under a restrictive license and subsequently leaked publicly. Sizes from 7B to 65B parameters. Llama 1 was the model that demonstrated open-weight LLMs could be competitive with closed-weight commercial alternatives at substantially smaller scale than competitors assumed possible. The leak (and the subsequent flowering of community work built on Llama 1) is what established Llama as a meaningful open-weight option in the first place.

Llama 2 (July 2023). The first generation released with a commercial license (the Llama 2 Community License, predecessor to the current license structure). Sizes of 7B, 13B, and 70B parameters. Llama 2 was the first Llama generation that organizations could deploy commercially without legal ambiguity, which substantially expanded adoption. Also notably included instruction-tuned and chat-tuned variants alongside the base models, making it the first Llama generation that was directly usable for conversational applications without further fine-tuning.

Llama 3 (April 2024). Substantially better quality through improved training data, refined alignment, and architecture improvements. The generation launched in 8B and 70B sizes; the 405B model arrived with the Llama 3.1 release in July 2024. Llama 3 was the generation where the open-weight quality gap with closed-weight frontier models narrowed substantially. The 405B variant in particular demonstrated that open-weight could compete at the frontier when the training investment was sufficient. Point releases during 2024 (Llama 3.1, the Llama 3.2 vision models, and Llama 3.3) extended the family across multimodal and specialized use cases.

Llama 4 (2025-2026). The current generation, with continued capability improvements and architectural refinements. Sizes of 8B, 70B, and 405B parameters in the standard release; specialized variants for specific use cases continue to ship through the Llama 4 generation. The current models are competitive with closed-weight frontier models on many benchmarks for general tasks, with the closed-weight alternatives retaining advantages primarily on specific edge cases (peak reasoning, multimodal capability, certain professional domain knowledge).

The progression from Llama 1 to Llama 4 represents one of the most consequential developments in commercial AI: open-weight models gaining capability fast enough that they’re now genuine alternatives to closed-weight commercial APIs for many use cases, rather than the strictly-research-grade tools they were in early 2023.

The current Llama 4 generation

The Llama 4 generation as of mid-2026 includes three main sizes plus specialized variants:

Llama 4 8B is the small-model variant. Suitable for consumer hardware (modern laptops with 16GB+ RAM, mid-range GPUs, M-series Macs). Inference latency is excellent. The capability ceiling is below the larger variants but covers most practical workloads. The starting point for most local Llama deployment.

Llama 4 70B is the medium-large variant and arguably the most operationally useful for serious work. Capability is meaningfully higher than 8B and competitive with closed-weight commercial alternatives on most workloads. Requires more substantial hardware (Apple Silicon M-series Max or Ultra, RTX 5090 or workstation cards, server-class hardware for production serving) but produces output quality that justifies the hardware investment for many use cases.

Llama 4 405B is the flagship variant. Competitive with frontier closed-weight models on many benchmarks. Requires substantial hardware for inference: multiple H100 or B100 GPUs, Apple’s M3 Ultra Mac Studio configurations with 192GB+ unified memory, or similar server-class infrastructure. The use case here is genuinely serious local inference rather than personal experimentation.

Llama 4 specialized variants ship for specific use cases: code generation (continuing the Code Llama tradition with Llama 4-based code models), multimodal capability (vision-language variants), function calling (variants tuned for tool use and structured output), multilingual capability, and others. The specialized variants extend the family’s coverage across use cases that the general-purpose variants don’t address as directly.

Quantization variants. Each size ships in multiple quantization levels. The standard pattern uses 4-bit or 8-bit quantization to reduce memory requirements substantially for modest quality loss. The Ollama model library curates the appropriate quantization for typical hardware; manual quantization choice is available for users with specific requirements.
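
The effect of quantization on memory follows from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below applies that to the three sizes named in this post. It is a rough lower bound under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), and real 4-bit formats typically store slightly more than 4 bits per weight because of scaling metadata. The function name is illustrative.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare 16-bit weights with 8-bit and 4-bit quantization for each size.
for size in (8, 70, 405):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size, bits)
        print(f"{size}B at {bits}-bit: ~{gb:.1f} GB for weights")
```

By this estimate an 8B model needs about 16 GB at 16-bit but about 4 GB at 4-bit, which is why the small variant fits on a 16GB laptop once quantized.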

For most teams getting started with Llama, the right starting point is Llama 4 8B for evaluation and small-scale work, with graduation to 70B when capability matters, and consideration of 405B only when peak capability matters more than hardware cost.

The license structure

The Llama Community License is the legal framework that makes Llama broadly useful. The key terms in practical language:

Free for commercial use under most conditions. Organizations can download, modify, deploy, and commercially use Llama models without paying per-request fees, subscription fees, or any other Meta-collected charges. The license permits substantial commercial value creation on the Llama base.

The 700-million MAU threshold. Organizations with monthly active users greater than 700 million across their products must request a separate commercial license from Meta. This threshold is high enough that essentially every organization except a handful of consumer-tech giants is below it. For practical purposes, "free for commercial use" is the accurate framing for nearly everyone.

Attribution requirement. Modified Llama models must include a "Built with Llama" attribution in their documentation, user-facing materials where appropriate, and the model card. The requirement is light operationally but should be planned for in product design.

Modification permitted. Fine-tuning, distillation, derivative model creation, and other modifications are all permitted. The substantial community of Llama-derived models (specialized fine-tunes, instruction-tuned variants, multilingual adaptations, etc.) exists because the license explicitly permits this work.

Use restrictions. Some specific use cases are prohibited under the license’s Acceptable Use Policy: military weapons systems, exploitation of minors, generation of CSAM, fraud, harassment, and various other categories. The restrictions are similar to other commercial AI vendors’ acceptable use policies but worth reviewing for specific use cases.

Redistribution carries conditions. The license permits distributing Llama weights and derivative models, but whoever does so must provide a copy of the license agreement and the required attribution notice, and recipients remain bound by the same terms, including the Acceptable Use Policy. In practice Meta’s own channels and Hugging Face remain the canonical sources for the base weights, while derivative work flows through community channels.

For most teams, the license terms are unambiguously favorable: the model is freely usable for nearly any commercial purpose without per-request fees, with the only meaningful operational requirement being the attribution. The 700M MAU threshold sounds restrictive but applies to almost no real organizations.
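
The terms above can be turned into a simple triage checklist for an internal review. The sketch below encodes only what this post states (the 700M MAU threshold, the attribution requirement for modified models, and the Acceptable Use Policy review); the class and function names are hypothetical, and the license text itself remains the authority.

```python
from dataclasses import dataclass

MAU_THRESHOLD = 700_000_000  # threshold as described in this post


@dataclass
class Deployment:
    monthly_active_users: int
    ships_modified_model: bool


def license_checklist(deployment):
    """Return the review items that apply to a planned Llama deployment."""
    # The Acceptable Use Policy applies to every user regardless of size.
    items = ["Review the Acceptable Use Policy against the use case"]
    if deployment.monthly_active_users > MAU_THRESHOLD:
        items.append("Request a separate commercial license from Meta")
    if deployment.ships_modified_model:
        items.append('Include "Built with Llama" attribution')
    return items
```

For a typical team, license_checklist(Deployment(5_000_000, True)) returns the policy review and the attribution item, and nothing about a separate agreement.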

Why Llama became the default

The question of why Llama specifically became the default open-weight choice (rather than just one option among Mistral, Qwen, DeepSeek, Phi, Gemma, and others) has several reasonable answers:

Meta’s training investment was substantially larger than most competing open-weight labs. The compute, data curation, and alignment investment in Llama 3 and Llama 4 produced quality that smaller labs couldn’t match at equivalent model sizes. The quality gap became part of the default-choice argument.

The community ecosystem around Llama compounded fastest. Fine-tuned variants, runtime integrations, tooling support, and documentation accumulated around Llama at higher rates than around alternatives. Each Llama release benefited from a larger ecosystem than its competitors, which made Llama the easier default choice for new teams.

The license terms were broadly acceptable for commercial use. The 700M MAU threshold gives Meta strategic optionality without practically restricting most users. Earlier open-weight releases often carried more restrictive terms, and later ones are competitive, but Llama’s license set the de facto pattern and Meta has kept to it.

Ollama, Hugging Face, AWS Bedrock, Together AI, and other inference infrastructure made Llama the path of least resistance. When the runtime tools default to Llama (Ollama’s "ollama run llama4" command is the canonical example), Llama becomes the model new users encounter first.

Meta’s strategic commitment to open-weight AI is unambiguous. Mark Zuckerberg has publicly committed Meta to releasing future Llama generations as open-weight. The commitment matters for organizations betting on the long-term Llama trajectory; it reduces the risk that future Llama releases will move behind a commercial API barrier.

The deployment pattern is genuinely broad. Llama runs on customer hardware via Ollama, in customer cloud via various inference providers, in AWS Bedrock as a managed service, on Together AI and Fireworks AI for pay-per-token usage, and across many other paths. The deployment flexibility removes friction for teams choosing their inference architecture.

The result is that "Llama" has become roughly synonymous with "open-weight LLM" for many practitioners. Other families (Mistral, DeepSeek, Qwen, Phi) are genuine alternatives with specific advantages, but Llama is the default that other choices are compared against.
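
The local Ollama path mentioned above can also be driven from code through Ollama's local HTTP API. The sketch below is a minimal example using only the standard library. It assumes an Ollama server is running on its default port (11434) and that the "llama4" model tag used in this post has already been pulled; adjust the tag to whatever your installation actually lists.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama4"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama4"):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Summarize the Llama license in one sentence.") returns the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect and test without a live server.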

The broader Llama ecosystem

Beyond the base Llama models, several adjacent projects and tools matter for the broader Llama ecosystem:

Code Llama is Meta’s specialized variant for code generation, available in sizes tuned for code-focused workloads. The Code Llama tradition continues across Llama generations; current code-specialized variants build on the Llama 4 foundation. For code completion, code review, code generation, and developer-tool integration use cases, the code-specialized variants are typically better than general-purpose Llama variants of the same parameter count.

Llama Stack is Meta’s opinionated reference framework for building applications on Llama models. The Stack provides consistent patterns across deployment targets (local, cloud, third-party providers) for the common application primitives: agents, tool calling, RAG, evaluation, fine-tuning. For teams wanting Meta’s recommended path to Llama-based applications, the Llama Stack is the canonical reference.

Llama Guard is Meta’s content moderation model derived from the Llama base. Tuned specifically for evaluating whether user inputs or model outputs cross various policy lines. Useful as a safety layer for applications building on Llama models.

LlamaIndex (originally called GPT Index) is a popular open-source framework for building RAG (retrieval-augmented generation) applications. Despite the name, LlamaIndex is not a Meta project and is not Llama-specific: it works with any LLM, but it pairs naturally with Llama models for fully local RAG pipelines.

The fine-tuned variant community. Hundreds of fine-tuned Llama variants exist on Hugging Face Hub for specialized use cases: medical, legal, multilingual, creative writing, customer support, function calling, and many others. The variant community is one of the strongest signals of Llama’s ecosystem maturity; the equivalent variant counts for other open-weight families are substantially lower.

Inference providers as a service. Together AI, Fireworks AI, Replicate, Groq, AWS Bedrock, and others host Llama models for pay-per-token API access. For teams that want Llama’s open-weight benefits (no vendor lock-in for the model, full data control if they self-host) but also want the operational simplicity of API access, the inference providers fill a useful middle ground.

The ecosystem depth is most of what makes Llama practically useful versus just technically capable. The model itself is one piece; the runtime tools, deployment paths, fine-tuned variants, application frameworks, and inference providers together make Llama-based development practical at the scale most teams operate.
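
The safety-layer role described for Llama Guard above follows a simple pattern: check the input, generate, then check the output. The sketch below shows that pattern only. It does not call Llama Guard; toy_is_safe and toy_generate are hypothetical stand-ins that you would replace with a real moderation model and a real Llama call.

```python
REFUSAL = "Request declined by policy."


def guarded_generate(prompt, generate, is_safe, refusal=REFUSAL):
    """Run generation only if both the prompt and the output pass the check."""
    if not is_safe(prompt):  # screen the user input first
        return refusal
    output = generate(prompt)
    # Screen the model output before it reaches the user.
    return output if is_safe(output) else refusal


def toy_is_safe(text):
    """Hypothetical classifier stub: flags text containing a blocked word."""
    return "forbidden" not in text.lower()


def toy_generate(prompt):
    """Hypothetical model stub: echoes the prompt in upper case."""
    return prompt.upper()
```

Passing the classifier and generator in as arguments keeps the wrapper independent of any particular runtime, so the same function works whether the model runs locally or behind an inference provider.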

Where Llama fits versus other open-weight families

Llama is the default open-weight choice for most teams, but several alternatives serve specific use cases better:

Mistral family is the closest competitor on general capability with particular strengths in efficient inference (Mixtral’s mixture-of-experts architecture) and permissive licensing (Apache 2.0 for many models). For teams that want unambiguously open-source licensing or that benefit from Mistral’s specific architectural choices, Mistral is a strong alternative.

DeepSeek V4 is distinguished by strong reasoning capability that competes with closed-weight frontier models. For workloads where reasoning quality dominates, DeepSeek may produce better results than equivalent-sized Llama variants. The license is generally permissive though specific variants have specific terms worth reviewing.

Qwen 3 from Alibaba has strong multilingual capability (particularly Chinese) and competitive general capability. For multilingual workloads or teams with Chinese-language requirements, Qwen often outperforms Llama. Apache 2.0 licensing is broadly permissive.

Microsoft’s Phi-4 is the small-model specialist. For use cases where capability per parameter matters more than absolute ceiling, Phi-4 punches above its weight on benchmarks. Particularly suited for edge deployment, mobile, and other small-hardware use cases.

Google’s Gemma 3 draws on the same research lineage as Gemini and is released under Google’s own Gemma Terms of Use rather than a standard open-source license such as Apache 2.0. For teams that want Google-style capability without using the Gemini API, Gemma 3 is the path. Smaller scale than the largest Llama variants but capable in its size class.

Smaller specialized models (TinyLlama, Phi-3, smaller Mistral variants) target specific resource-constrained use cases where larger models can’t fit at all. For embedded, IoT, or extremely cost-sensitive deployments, the smaller specialized models matter.

The honest framing: Llama is the right starting point for most teams. Alternatives become attractive for specific characteristics (Mistral’s licensing or efficiency, DeepSeek’s reasoning, Qwen’s multilingual capability, Phi’s small-model performance, Gemma’s lineage). Most teams evaluating open-weight options should start with Llama and consider alternatives only when specific requirements push toward them.

What teams considering Llama should think about

Six concrete considerations:

  • Start with Llama 4 8B for evaluation. The small variant runs on accessible hardware and lets you judge whether Llama-family quality fits your use case. If 8B is sufficient, no hardware investment is needed. If you need more capability, the evaluation tells you whether to invest in hardware for 70B or 405B.
  • Plan the deployment path deliberately. Local inference via Ollama is the simplest for development. Production deployment may use vLLM for self-hosted serving, AWS Bedrock or other managed providers for hands-off operation, or inference providers like Together AI for pay-per-token access. The right deployment depends on your specific operational requirements.
  • Verify the license terms for your specific use case. The 700M MAU threshold is high, but it does apply to a few organizations, and those near it should plan accordingly. The Acceptable Use Policy restrictions apply to all users; review them against your use case.
  • Evaluate fine-tuned variants for your domain. The community has produced substantial specialized variants. Before fine-tuning your own variant, check whether existing community variants already cover your use case. For specialized domains (medical, legal, multilingual, code), pre-existing variants often produce better results than custom fine-tuning of the base Llama.
  • Plan the upgrade path across Llama generations. New Llama generations typically improve capability substantially but introduce breaking changes in tokenization, fine-tuning artifacts, or other technical details. Build with eventual model upgrade in mind rather than tightly coupling to a specific Llama version.
  • Consider the operational overhead honestly. Local Llama deployment requires hardware management, model file management, runtime updates, and operational discipline that managed cloud APIs don’t require. For some teams the operational overhead is acceptable; for others, paying for managed Llama access (AWS Bedrock, Together AI) is the right tradeoff between the open-weight benefits and the operational simplicity of vendor-managed inference.
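
One way to act on the upgrade-path and deployment-path points above is to keep model identifiers and endpoints in a single registry, so application code asks for a role rather than a specific Llama version. The sketch below is one possible shape, not a recommended standard: the model IDs, the example endpoint, and the MODEL_<ROLE> environment variable convention are all hypothetical placeholders.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model_id: str  # runtime tag or provider model ID (placeholder values below)
    endpoint: str  # where inference is served


# Single place to change when moving to a new generation or provider.
REGISTRY = {
    "chat": ModelConfig("llama-small-placeholder", "http://localhost:11434"),
    "batch": ModelConfig("llama-large-placeholder", "https://inference.example.com"),
}


def resolve(role, env=os.environ):
    """Look up the model for a role, allowing an environment override."""
    cfg = REGISTRY[role]
    override = env.get(f"MODEL_{role.upper()}")
    return ModelConfig(override, cfg.endpoint) if override else cfg
```

Application code calls resolve("chat") and never hard-codes a version string, so trying a newer model becomes a configuration change (or a one-off environment override) rather than a code change.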

The deeper takeaway is that Llama in 2026 is the de facto default for teams running open-weight AI. The combination of model quality, broad license terms, deep ecosystem support, and Meta’s commitment to continued open-weight releases makes Llama the natural starting point rather than one option among many. For teams getting started with local AI or open-weight inference, the question typically isn’t "Llama versus alternatives" but "which Llama variant and what deployment path."

Frequently Asked Questions

What is Llama?

Llama is Meta’s family of open-weight large language models, first released in February 2023 and now in its fourth generation with Llama 4 across 8-billion, 70-billion, and 405-billion parameter sizes. The “open-weight” framing means Meta publishes the trained model parameters under a license permitting download, modification, deployment, and commercial use within stated boundaries. Through 2024-2026, Llama has become the most-downloaded open-weight LLM by a substantial margin and the default starting point for teams running local AI inference, building agent frameworks on open-weight models, or fine-tuning models with custom data.

What’s the license for Llama?

Llama is distributed under the Llama Community License, which permits commercial use without per-request fees but includes the notable threshold that organizations exceeding 700 million monthly active users must negotiate a separate commercial license with Meta. The license requires “Built with Llama” attribution in documentation and user-facing materials where appropriate. Modifications, fine-tuning, and derivative model creation are permitted. The Acceptable Use Policy prohibits specific use cases (military weapons systems, CSAM, fraud, harassment, etc.) similar to other commercial AI vendors’ policies. For the vast majority of organizations (well below 700M MAU), the practical effect is “free for commercial use” with attribution.

What sizes does Llama 4 come in?

The Llama 4 generation includes 8-billion, 70-billion, and 405-billion parameter sizes in the standard release, plus specialized variants for code generation, multimodal capability, function calling, and other specific use cases. The 8B variant runs on consumer hardware (modern laptops with 16GB+ RAM, mid-range GPUs, M-series Macs). The 70B variant requires more substantial hardware (Apple Silicon M-series Max or Ultra, RTX 5090, workstation cards). The 405B variant requires server-class hardware (multiple H100 or B100 GPUs, Apple’s M3 Ultra Mac Studio with 192GB+ unified memory). Each size ships in multiple quantization variants to fit different hardware capabilities.

Why is Llama the default open-weight choice?

Several reasons combined. Meta’s training investment in Llama 3 and Llama 4 produced quality that smaller labs couldn’t match at equivalent sizes. The ecosystem around Llama compounded fastest (fine-tuned variants, runtime integrations, tooling, documentation). The license terms are broadly acceptable for commercial use. Runtime tools like Ollama default to Llama, making it the model new users encounter first. Inference providers (Together AI, Fireworks AI, AWS Bedrock, others) all host Llama prominently. Meta’s strategic commitment to continued open-weight Llama releases is unambiguous. The combination makes Llama the path of least resistance for new teams, which compounds into the default-choice positioning.

How does Llama compare to Mistral, DeepSeek, and Qwen?

Each alternative has specific strengths. Mistral is the closest competitor on general capability, with more permissive licensing (Apache 2.0 for many models) and an efficient mixture-of-experts architecture in the Mixtral variants. DeepSeek V4 has distinctively strong reasoning capability that competes with closed-weight frontier models on reasoning-heavy workloads. Qwen 3 from Alibaba has strong multilingual capability (particularly Chinese) and Apache 2.0 licensing. Where one of those characteristics dominates (Mistral’s licensing, DeepSeek’s reasoning, Qwen’s multilingual capability), the alternative may fit better than Llama. For most teams, Llama is the right starting point, with alternatives evaluated against specific requirements.

Can I run Llama on my laptop?

The 8B variant runs comfortably on modern laptops with 16GB+ RAM, including most laptops sold in the past few years. M-series Macs handle small Llama variants particularly well. Mid-range discrete GPUs (RTX 4060, RTX 4070) run small Llama models with good latency. The 70B variant requires more substantial hardware (Apple Silicon M-series Max or Ultra with 64GB+ unified memory, or workstation-class GPUs). The 405B variant requires server-class hardware. For most teams getting started, installing Ollama on existing laptop hardware and running Llama 4 8B is the right starting point; upgrade only if capability ceiling matters for your specific use case.

What is Code Llama?

Code Llama is Meta’s specialized variant of Llama tuned for code generation use cases. The Code Llama tradition started with Llama 2-based code models and continues through Llama 4-based variants. For code completion, code review, code generation, and developer-tool integration use cases, the code-specialized variants are typically better than general-purpose Llama variants of the same parameter count. Multiple sizes ship to match different hardware constraints; many third-party code-focused AI tools (CodeGPT, various IDE integrations, autocomplete tools) use Code Llama variants as their underlying model.

What is the Llama Stack?

Llama Stack is Meta’s opinionated reference framework for building applications on Llama models. The Stack provides consistent patterns across deployment targets (local, cloud, third-party providers) for common application primitives: agents, tool calling, RAG (retrieval-augmented generation), evaluation, and fine-tuning. For teams wanting Meta’s recommended path to building Llama-based applications, the Llama Stack is the canonical reference. Alternatives (LangChain, LlamaIndex, Vercel AI SDK, custom integrations) also work well with Llama; the Stack is one option among several rather than required.

Digital Matters

Artificial Intelligence (AI) Desk