Large Language Models (LLMs) such as ChatGPT have gained a lot of traction over the last year or two, and more tech companies are looking to include artificial intelligence (AI) in their products via LLMs. This includes Apple with its Apple Intelligence in iOS and macOS, which can be extended with ChatGPT, and Microsoft with its Microsoft 365 Copilot and Copilot for Bing offerings. Even GitHub has a Copilot option.

In this blog post I am going to look at installing Ollama, then running some queries against one of the LLMs it can serve.

Ollama

Ollama is an open-source tool for downloading and running LLMs locally; its name is a nod to Llama, the family of models from Meta, the company probably most famous for Facebook, WhatsApp and Instagram. Ollama has many different models available, each potentially offering something to meet a requirement. These models come in various sizes and have various requirements to run. The list of models can be found at: https://ollama.com/library. The number of parameters roughly indicates the size and complexity of the model (the weights it learned during training). A model with a larger number of parameters will generally be larger in size (requiring more space to download) and require more memory to run.

TinyLlama

The model I am going to use is called tinyllama. As the name suggests, tinyllama is small. At the time of writing it is roughly 640MB in size and has around 1.1 billion parameters. This makes it quick to download and light on system requirements.

Installing Ollama

curl -fsSL https://ollama.com/install.sh | sh

The above command downloads the Linux install script from ollama.com and runs it. Ollama can then be run using the ollama command in a terminal.

Adding a Model

A model can be added using the run command in a terminal, for example:

ollama run tinyllama

Once a model is running, Ollama provides an interactive prompt where you can ask questions and receive answers. To exit the session, type:

/bye

Once a model has been installed, Ollama can be used offline.

Exposing the Ollama API

Although Ollama runs okay in the terminal, I’m going to want to send it questions from other applications (e.g. potentially my own web apps). With this in mind, I’m going to adjust the ollama.service file so that the API is available via a particular port.

Note: In a production environment this would need to be properly restricted and tested.

Edit /etc/systemd/system/ollama.service, for example:

sudo nano /etc/systemd/system/ollama.service

Within the [Service] section add a new line that reads:

Environment="OLLAMA_HOST=0.0.0.0:8181"

This adds an environment variable telling Ollama which address and port to listen on. I’ve asked it to listen on port 8181 on all interfaces (0.0.0.0) whilst I’m testing. Make sure to save the file when exiting it.

The systemd configuration will then need reloading, Ollama restarting, and its status checking to make sure it is running okay.

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama

API via Python Requests

Opening a browser to http://127.0.0.1:8181 on the local device should show a page saying “Ollama is running”. This means Ollama should also be available to web requests. The Python requests and json libraries can be used to send queries and display the results.

Ollama expects queries at /api/generate (e.g. http://127.0.0.1:8181/api/generate) and expects them in JSON format. The request should contain JSON data specifying which model to use, the prompt (i.e. the question) and whether the response should be streamed.

The response will then be sent back in JSON format with keys for model, created_at, response and context.

If you only want the response (i.e. the answer) then use the response key to grab it, as in the sketch below.
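
Below is a minimal sketch using the requests and json libraries. It assumes Ollama is listening on 127.0.0.1:8181 as configured earlier, that tinyllama has already been pulled, and the prompt is just a placeholder question.

import json
import requests

url = "http://127.0.0.1:8181/api/generate"

payload = {
    "model": "tinyllama",              # which installed model to use
    "prompt": "Why is the sky blue?",  # the question to ask
    "stream": False,                   # return one JSON object rather than a stream
}

reply = requests.post(url, json=payload, timeout=120)
reply.raise_for_status()

data = reply.json()

# Show the whole response, which includes keys such as model,
# created_at, response and context.
print(json.dumps(data, indent=2))

# If only the answer itself is wanted, grab the response key.
print(data["response"])

Setting stream to False keeps the example simple; with streaming enabled, Ollama sends back a series of JSON objects that would need to be read line by line instead.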

Docker and Python Libraries

There is also an official Docker image for Ollama, as well as an official Ollama Python library.
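
As a rough illustration of the Python library (installed with pip install ollama), the sketch below assumes the same 8181 port as earlier; a default install would normally be reachable on port 11434 instead.

from ollama import Client

# Point the client at the locally running Ollama instance.
client = Client(host="http://127.0.0.1:8181")

result = client.generate(model="tinyllama", prompt="Why is the sky blue?")

# As with the raw API, the generated answer sits under the response key.
print(result["response"])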