
Open source LLMs have made it possible for developers, researchers, and builders to run powerful language models without paying for API access or building proprietary infrastructure. The challenge is knowing where to start. Between model selection, hardware requirements, and setup steps, it can feel overwhelming before you run a single line of code.
This guide walks through using open source LLMs with Hugging Face and Google Colab, from picking a model to running your first inference.
Just as GitHub is the standard platform for storing and sharing code, and Docker Hub is where container images are distributed, Hugging Face is the central hub for AI models. It hosts over 800,000 models and 186,000 datasets, most of them publicly available under open licenses (some, like Gemma, require accepting a license agreement before download).
Hugging Face is a platform where developers and researchers can discover pre-trained models, share their own, collaborate on projects, and access datasets. It is the starting point for anyone looking to work with open source LLMs.
Getting started on Hugging Face takes three steps.
First, create a free account at huggingface.co. Second, navigate to the Models section. Third, use the left sidebar to filter by task type. Options include text generation, translation, question answering, summarization, and more.
For this guide, we are using the google/gemma-2-2b-it model, a capable instruction-tuned model from Google that runs well on a free GPU in Google Colab.
When running an LLM, you can use either a CPU or a GPU. A CPU handles general computing tasks largely sequentially, while a GPU is designed for parallel mathematical computation and can process thousands of operations simultaneously.
For LLM inference, this difference matters. A GPU can complete the same task significantly faster than a CPU because neural network inference is dominated by large matrix multiplications, which benefit directly from parallel processing. The model output is effectively identical either way. The only difference is how long you wait for it.
For most open source LLMs, a GPU is strongly recommended.
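To make the parallelism point concrete, here is a small NumPy sketch (the shapes and NumPy itself are illustrative, not part of the guide's setup): a serial loop and a single matrix multiplication compute the same numbers, and a GPU simply executes the matmul's many independent dot products at once.

```python
import numpy as np

# Toy illustration: the core of LLM inference is matrix multiplication.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))    # one token's hidden state
W = rng.standard_normal((512, 512))  # a toy weight matrix

# Serial (CPU-style): one output element at a time.
serial = np.array([[x[0] @ W[:, j] for j in range(W.shape[1])]])

# Vectorized: one matmul over all columns at once -- this is the
# operation a GPU parallelizes.
parallel = x @ W

assert np.allclose(serial, parallel)
```

The results match; only the execution strategy, and therefore the speed, differs.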
Google Colab is a free cloud-based notebook environment that gives you access to GPUs without any local setup. You write and run Python code directly in your browser, and the compute happens on Google's servers.
For learning how to use open source LLMs, Colab removes the biggest barrier: hardware. It provides free access to NVIDIA Tesla T4 GPUs, which are well suited for running models like Gemma-2-2b-it. You do not need to install drivers, configure environments, or pay for cloud compute to get started.
To set up your environment, create a new Colab notebook and change the runtime type to T4 GPU under Runtime > Change runtime type.
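Once the runtime is switched, you can confirm a GPU is actually attached. A quick sanity check, assuming PyTorch is available (Colab includes it by default):

```python
import torch

# Prints the attached GPU's name (e.g. "Tesla T4" on the free tier),
# or a reminder to switch the runtime if none is found.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No GPU detected -- check Runtime > Change runtime type")
```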
```python
!pip install transformers torch bitsandbytes accelerate huggingface_hub
```

Next, authenticate with Hugging Face from inside the notebook:

```python
from huggingface_hub import notebook_login

notebook_login()
```

Enter your Hugging Face access token when prompted. You can generate one from your account settings at huggingface.co/settings/tokens.
Navigate to the google/gemma-2-2b-it model page on Hugging Face and accept the license agreement. This is required before you can download the model weights.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

This loads the tokenizer and model weights. The device_map="auto" setting automatically places the model on the available GPU. The torch_dtype=torch.bfloat16 setting reduces memory usage without significantly affecting output quality.
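The bitsandbytes package installed earlier is not needed for bfloat16 loading, but it enables an optional further step: 4-bit quantization, which roughly quarters weight memory and helps larger models fit on a free T4. A hedged sketch of that variant (the config values shown are common defaults, not the only valid choices):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Quantize weights to 4-bit on load; computation still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    quantization_config=bnb_config,
)
```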
```python
query = "what is AI?"
input_ids = tokenizer(query, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This tokenizes the input query, passes it to the model, generates up to 1024 new tokens, and decodes the output back into readable text.
The output looks like this:

```
what is AI?
Artificial Intelligence (AI) is a broad field of computer science that aims
to create machines capable of performing tasks that typically require human
intelligence.

Key Concepts:
- Learning: AI systems can learn from data and improve their performance over time.
- Reasoning: AI systems can use logic and rules to solve problems.
- Problem-solving: AI systems can identify and solve complex problems.
- Perception: AI systems can interpret sensory information such as images and sounds.
- Natural Language Processing: AI systems can understand and generate human language.
```

Learning how to use open source LLMs is now within reach for anyone with a browser and a Hugging Face account. Platforms like Hugging Face and Google Colab remove the hardware and infrastructure barriers that previously made running large models impractical. Follow the steps in this guide, pick a model that fits your use case, and you can go from zero to running inference in under an hour.
Hugging Face is an open-source AI platform that hosts over 800,000 models and 186,000 datasets. It is the primary hub for discovering, downloading, and sharing pre-trained language models.
A GPU is strongly recommended for LLM inference. It significantly reduces inference time compared to a CPU. Google Colab provides free GPU access, making it the easiest option for getting started without any hardware investment.
Google Colab is a free cloud-based Python notebook environment with free GPU access. It requires no local setup and is well suited for running and experimenting with open source LLMs like Gemma, LLaMA, and Mistral.
Setting torch_dtype=torch.bfloat16 loads the model in 16-bit precision instead of 32-bit, which reduces GPU memory usage by roughly half. This makes it possible to run larger models on free-tier GPUs without running out of memory.
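As a rough back-of-envelope check (the 2.6-billion-parameter figure for gemma-2-2b-it is approximate, and this counts weights only, not activations or the KV cache):

```python
# Approximate weight memory for a ~2.6B-parameter model.
params = 2.6e9
fp32_gb = params * 4 / 1e9  # 4 bytes per weight in 32-bit
bf16_gb = params * 2 / 1e9  # 2 bytes per weight in bfloat16
print(f"fp32: {fp32_gb:.1f} GB, bf16: {bf16_gb:.1f} GB")
```

That difference is why bfloat16 loading fits comfortably within a free T4's 16 GB of VRAM while full 32-bit precision leaves little headroom.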
The same steps work for most models on Hugging Face that use the Transformers library, including Mistral, LLaMA, Falcon, and others. Some models may require additional license agreements or different loading configurations.