
How To Use Open Source LLMs (Large Language Models)?

Written by Ajay Patel
Apr 24, 2026
4 Min Read

Open source LLMs have made it possible for developers, researchers, and builders to run powerful language models without paying for API access or building proprietary infrastructure. The challenge is knowing where to start. Between model selection, hardware requirements, and setup steps, it can feel overwhelming before you run a single line of code.

This guide walks through how to use open source LLMs using Hugging Face and Google Colab, from picking a model to running your first inference.

Where Open Source LLMs Live: Hugging Face

Just as GitHub is the standard platform for storing and sharing code, and Docker Hub is where container images are distributed, Hugging Face is the central hub for AI models. It hosts over 800,000 models and 186,000 datasets, most of them publicly available, though some (like Gemma) are gated behind a license agreement.

Hugging Face is a platform where developers and researchers can discover pre-trained models, share their own, collaborate on projects, and access datasets. It is the starting point for anyone looking to work with open source LLMs.

How to Find the Right Model on Hugging Face?

Getting started on Hugging Face takes three steps.

First, create a free account at huggingface.co. Second, navigate to the Models section. Third, use the left sidebar to filter by task type. Options include text generation, translation, question answering, summarization, and more.

For this guide, we are using the google/gemma-2-2b-it model, a capable instruction-tuned model from Google that runs well on a free GPU in Google Colab.

CPU vs GPU: What Hardware You Actually Need

When running an LLM, you can use either a CPU or a GPU. A CPU handles general computing tasks with a small number of powerful cores, processing relatively few operations at a time. A GPU is built for parallel mathematical computation and can run thousands of operations simultaneously.

For LLM inference, this difference matters. A GPU completes the same task significantly faster than a CPU because neural network inference is dominated by large matrix multiplications, which benefit directly from parallel processing. The output is essentially the same either way; the difference is how long you wait for it.

For most open source LLMs, a GPU is strongly recommended.
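
To see which device your environment actually exposes before loading anything, a quick PyTorch check works (a minimal sketch, assuming torch is installed):

python

import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")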

Why Google Colab Is the Easiest Way to Start

Google Colab is a free cloud-based notebook environment that gives you access to GPUs without any local setup. You write and run Python code directly in your browser, and the compute happens on Google's servers.

For learning how to use open source LLMs, Colab removes the biggest barrier: hardware. It provides free access to NVIDIA Tesla T4 GPUs, which are well suited for running models like Gemma-2-2b-it. You do not need to install drivers, configure environments, or pay for cloud compute to get started.

To set up your environment, create a new Colab notebook and change the runtime type to T4 GPU under Runtime > Change runtime type.
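
To confirm the GPU is actually attached, run nvidia-smi in a notebook cell; on the free tier it should list a Tesla T4:

python

# Shows the attached GPU, driver version, and current memory usage.
!nvidia-smi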

Step-by-Step: Running Gemma-2-2b-it on Google Colab

Step 1: Install Required Packages

python

# bitsandbytes and accelerate enable memory-efficient loading; huggingface_hub handles login.
!pip install transformers torch bitsandbytes accelerate huggingface_hub

Step 2: Log Into Hugging Face

python

from huggingface_hub import notebook_login
notebook_login()

Enter your Hugging Face access token when prompted. You can generate one from your account settings at huggingface.co/settings/tokens.
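
If you prefer to authenticate without the interactive widget (in a script, for example), huggingface_hub also provides a login() function that accepts the token directly. The token string below is a placeholder:

python

from huggingface_hub import login

# Non-interactive alternative to notebook_login(). Replace the placeholder
# with your own token from huggingface.co/settings/tokens.
login(token="hf_xxxxxxxxxxxxxxxx")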

Step 3: Accept Google's Usage License

Navigate to the google/gemma-2-2b-it model page on Hugging Face and accept the license agreement. This is required before you can download the model weights.

Step 4: Load the Model and Tokenizer

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

This loads the tokenizer and model weights. The device_map="auto" setting automatically places the model on the available GPU. The torch_dtype=torch.bfloat16 setting reduces memory usage without significantly affecting output quality.
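
If you run out of GPU memory, this is where the bitsandbytes package from Step 1 comes in: you can load the weights quantized to 4-bit instead. A minimal variant (quantization can slightly affect output quality):

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit at load time, cutting memory use
# well below the bfloat16 footprint.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    quantization_config=quantization_config,
)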

Step 5: Run Inference

python

query = "what is AI?"
inputs = tokenizer(query, return_tensors="pt").to("cuda")  # tokenize and move to the GPU
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This tokenizes the input query, passes it to the model, generates up to 1024 new tokens, and decodes the output back into readable text.

Model Output:

what is AI?

Artificial Intelligence (AI) is a broad field of computer science that aims 
to create machines capable of performing tasks that typically require human 
intelligence.

Key Concepts:
- Learning: AI systems can learn from data and improve their performance over time.
- Reasoning: AI systems can use logic and rules to solve problems.
- Problem-solving: AI systems can identify and solve complex problems.
- Perception: AI systems can interpret sensory information such as images and sounds.
- Natural Language Processing: AI systems can understand and generate human language.
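
Because gemma-2-2b-it is instruction-tuned, it generally responds better when the query is wrapped in its chat template rather than passed as raw text. A sketch reusing the same model and tokenizer:

python

# apply_chat_template wraps the query in Gemma's expected turn markers
# and appends the marker that tells the model to begin its reply.
messages = [{"role": "user", "content": "what is AI?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))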

Conclusion

Learning how to use open source LLMs is now within reach for anyone with a browser and a Hugging Face account. Platforms like Hugging Face and Google Colab remove the hardware and infrastructure barriers that previously made running large models impractical. Follow the steps in this guide, pick a model that fits your use case, and you can go from zero to running inference in under an hour.

Frequently Asked Questions

1. What is Hugging Face?

Hugging Face is an open-source AI platform that hosts over 800,000 models and 186,000 datasets. It is the primary hub for discovering, downloading, and sharing pre-trained language models.

2. Do I need a GPU to run an open source LLM?

A GPU is strongly recommended for LLM inference. It significantly reduces inference time compared to a CPU. Google Colab provides free GPU access, making it the easiest option for getting started without any hardware investment.

3. What is Google Colab and why use it for LLMs?

Google Colab is a free cloud-based Python notebook environment with free GPU access. It requires no local setup and is well suited for running and experimenting with open source LLMs like Gemma, LLaMA, and Mistral.

4. What does torch_dtype=torch.bfloat16 do?

It loads the model in 16-bit precision instead of 32-bit, which reduces GPU memory usage by roughly half. This makes it possible to run larger models on free-tier GPUs without running out of memory.
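
As a rough back-of-envelope estimate, assuming about 2.6 billion parameters for Gemma-2-2b-it: 2.6B × 4 bytes ≈ 10.4 GB in 32-bit precision versus 2.6B × 2 bytes ≈ 5.2 GB in bfloat16, before counting activations and the KV cache. On a 16 GB T4, that is the difference between barely fitting and fitting comfortably.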

5. Can I use other models besides Gemma-2-2b-it?

Yes. The same steps work for most models on Hugging Face that use the Transformers library, including Mistral, LLaMA, Falcon, and others. Some models may require additional license agreements or different loading configurations.

Ajay Patel

Hi, I am an AI engineer with 3.5 years of experience, passionate about building intelligent systems that solve real-world problems through cutting-edge technology and innovative solutions.
