George Robotington - image made with DALL-E

How easy is it to create a Retrieval Augmented Generation (RAG) application? By using the recently released Pinecone Assistant you can start interrogating your documents in minutes.

And with a bit more effort you can have a fully functional Streamlit app.

We are going to use a Streamlit chat UI to develop an app that answers queries or generates documents about the US Constitution. It will mimic the online course notes that a student might have access to; the document set consists of the Constitution itself and a Wikipedia article about it.

RAG and Vector Databases: a very quick overview

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by grounding their knowledge in external sources. Instead of relying solely on their training data, RAG models can access and process information from a corpus of documents, enabling them to provide more accurate and relevant responses.

Vector databases play a fundamental role in RAG by efficiently storing and retrieving information from your documents. They can identify the most relevant information required to answer a query, allowing a RAG app to access precisely the data it needs.

Pinecone is well-known as a provider of vector databases, which have been used in RAG applications for some time. Their products include a vector database service that is aimed at RAG developers. However, they have recently released Pinecone Assistant, an out-of-the-box RAG solution that makes it easy to create apps by simply uploading documents and querying them with plain-English prompts.

Even better, they provide a free tier that can be used for small apps or experimentation.

To give you an idea of how Pinecone Assistant simplifies RAG solutions, below is a diagram from their documentation (reproduced with permission).

Diagram Image courtesy of Pinecone

The Assistant does three things:

Data ingestion: A RAG system needs to be provided with documents to work with. The ingestion process splits a document into smaller parts and generates vector embeddings for each part. The embeddings are stored in the vector database and indexed.

Data retrieval: When the assistant receives a query, it is processed and the relevant chunks of the uploaded content are retrieved from the database.

Response generation: The assistant ranks the chunks for relevance; these ranked chunks, the chat history and the assistant instructions are then used by a large language model (LLM) to generate an appropriate response.
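To make that flow concrete, here is a minimal conceptual sketch of the three stages in Python. To be clear, this is not Pinecone’s implementation - the Assistant does all of this for you - and embed() is a hypothetical stand-in for a real embedding model.

import numpy as np

def chunk(text, size=500):
    # Data ingestion: split a document into smaller parts
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts):
    # Hypothetical embedding model: a real one maps similar text
    # to nearby vectors; random vectors stand in for the idea here
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

document = "We the People of the United States..." * 50
chunks = chunk(document)
vectors = embed(chunks)   # these are stored and indexed in the vector database

# Data retrieval: rank the chunks by cosine similarity to the query
query_vec = embed(["List the articles"])[0]
scores = vectors @ query_vec / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec))
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

# Response generation: the top chunks, chat history and instructions
# become the context that an LLM uses to generate the answer, e.g.
# answer = llm(f"Context: {top_chunks}\nQuestion: List the articles")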

Basically, all you need to do to create an app is to upload documents and provide a user interface - Pinecone Assistant takes care of the rest.

Getting started - the assistant console

There are two ways of interacting with an assistant: through the API or via the Assistant console. We will look at both.

First, though, you need to get on board with Pinecone. Go to the pricing page and sign up for the free Starter plan. You can then get an API key and, of course, to use the API you will need to install the Python library. When you get your API key, you are given one chance to copy it and keep it somewhere safe - don’t lose it!
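Installation is a single pip command (these were the package names at the time of writing, so check Pinecone’s docs if they have changed):

pip install pinecone pinecone-plugin-assistant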

The quick start documentation is a good guide to the Python library, but we are going to begin with the console and move on to the API when we look at the Streamlit app.

The Assistant console lets us create an assistant, upload documents and write queries in a web interface. Below is a screenshot of my console with a single assistant called demo1.

Console screenshot

Selecting demo1 will take you to the assistant, where you can upload documents and query them. Below is a screenshot of demo1. You can see that I have uploaded two PDF documents: one is the US Constitution document from the US Senate and the other is a PDF version of the Wikipedia entry for the US Constitution. (You can find lots of versions of the Constitution document on the web - some are more attractive than others, but they all contain the same information, of course.)

In this screenshot you will also see that I have made a query “List the articles” and that demo1 has responded accordingly. It is good to see that the assistant gives references in its response. If you hover over the ‘[1]’ in the response it will tell you which document it found the information in and where.

assistant with query screenshot

Not all of the functions of the API are available in the console. One of those that we will explore shortly is the ability to specify a JSON format for the response.

The API and a Streamlit app

In the app, I build on the demo1 assistant and the documents that I have uploaded. You can find the code and the PDFs in the GitHub repo for this project.

The Streamlit code is quite short: three functions and a bit of main code. I’ll go through how it works first and then we’ll see how it can be used.

Streamlit interface components create a nice UI: here is a screenshot of the app in action.

The main panel contains the chat interface and you can see that I have submitted a query and the assistant has answered.

On the left is a sidebar that contains two UI elements: a pair of radio buttons that let you select the format of the response, and an expander panel that contains the full response from the assistant. There is more to the full response than you can see in the screenshot. The content field of the response is what is displayed in the main panel; the other fields that may be populated by the assistant are not relevant here but can be seen in the full response panel.

I’ll go through the code showing how the Pinecone API is used but first, we need to import the libraries.

import streamlit as st
from pinecone import Pinecone
from pinecone_plugins.assistant.models.chat import Message

The first two are obvious and the third is, as we shall see, necessary in order to construct a prompt for the assistant.

The remainder of the code consists of three functions initialize_pinecone, retrieve_answer, and main.

The first is called once to get the assistant and requires the API key, which should be stored in a Streamlit secret in the file .streamlit/secrets.toml (see here if you are unfamiliar with Streamlit secrets).
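For reference, the secrets file is just a TOML file containing the key; the value below is, of course, a placeholder:

# .streamlit/secrets.toml
PINECONE_API_KEY = "your-api-key-here"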

The function retrieve_answer queries the assistant and main contains the UI.

Before looking at these in more detail we should see the main code which is run first.

if __name__ == "__main__":

    pa = initialize_pinecone()

    st.sidebar.markdown("# :blue[Options]")

    json_mode = st.sidebar.radio("Select Answer Format", 
                                 ("Normal text", "JSON"),
                                 horizontal=True) == "JSON"

    full_response = st.sidebar.expander("Full response",
                                        expanded=False)

    main(pa)

We first initialise the assistant, then create a sidebar that contains the radio buttons and the expandable panel for the full response. Note that the variables that are set by these UI elements, json_mode and full_response, are globally available.

Lastly, we call the main function and pass the assistant to it.

Here’s the code that initialises the assistant:

def initialize_pinecone():
    api_key = st.secrets["PINECONE_API_KEY"]
    pc = Pinecone(api_key=api_key)
    assistant = pc.assistant.Assistant(
        assistant_name="demo1",
    )
    return assistant

It reads the key from a secret, creates a Pinecone instance, retrieves the assistant demo1 and returns it.
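One optional refinement that is not in the original code: Streamlit re-runs the whole script on every interaction, so you could wrap the function in Streamlit’s st.cache_resource decorator to ensure that the assistant is only created once per session. A minimal sketch:

@st.cache_resource
def initialize_pinecone():
    # Cached: created once and reused across script re-runs
    api_key = st.secrets["PINECONE_API_KEY"]
    pc = Pinecone(api_key=api_key)
    return pc.assistant.Assistant(assistant_name="demo1")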

Next, we’ll look at the main function.

def main(assistant):
    st.markdown("# :blue[Pinecone Assistant]: US Constitution")

    # User query input
    user_query = st.text_input("Enter your query:")
    if st.button("Submit"):
        if user_query:
            answer = retrieve_answer(assistant, user_query, json_mode)
            if json_mode:
                st.json(answer.content)
            else:
                st.write(answer.content)
            full_response.write(answer)
        else:
            st.warning("Please enter a query.")

This prompts the user for a query and, when the Submit button is pressed, retrieves an answer from the assistant, passing the assistant instance, the actual query and the output mode.

Having retrieved the response it writes the content either as JSON or plain text. The full response is written to the panel in the sidebar.

That leaves the function retrieve_answer which makes the actual query.

def retrieve_answer(assistant, query, json_mode):
    msg = Message(role="user", content=query)

    resp = assistant.chat(messages=[msg],
                          json_response=json_mode)
    return resp.message

This is quite straightforward. First, we create a Pinecone Message containing the query; we then call the assistant’s chat method and this returns a response. The part of the response we want is called message and this is what is returned for display by the main function.

So, that is how it works; now let’s see how I used it.

Usage

I’m imagining a hypothetical high school class that is studying the US Constitution. This app can be used by students to easily find answers to questions about the subject as we have already seen. For example, you could ask for a list of the articles, or amendments, in the document and then drill down to a particular item by asking more about that.

But it can do much more.

I asked the assistant to “Write a brief study guide” and it responded with a comprehensive overview that included the structure of the Constitution, key concepts, historical context and so on. This sort of thing could be an invaluable guide that a teacher could create and circulate to students. Here is a snippet of what it generated:

Study guide snippet screenshot

Another neat use was to ask “Write a list of 5 multiple choice questions” and set the output mode to JSON. Here is part of the response:

{
    "questions": [
        {
            "question": "What does the Thirteenth Amendment of the U.S. Constitution address?",
            "choices": [
                "Abolition of slavery",
                "Women's suffrage",
                "Prohibition of alcohol",
                "Income tax"
            ],
            "answer": "Abolition of slavery"
        },
        {
            "question": "Which amendment granted women the right to vote?",
            "choices": [
                "Fifteenth Amendment",
                "Nineteenth Amendment",
                "Twenty-First Amendment",
                "Twenty-Sixth Amendment"
            ],
            "answer": "Nineteenth Amendment"
        },

This output could easily be processed into an online test for the students. Moreover, the Streamlit st.json function that is used to display the response incorporates a widget for easily copying the JSON code.
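To illustrate, here is a minimal sketch of how that processing might look. It assumes the "questions"/"choices"/"answer" schema shown above, which may vary between runs, and build_quiz is my own hypothetical helper, not part of the app:

import json

def build_quiz(json_text):
    # Turn the assistant's JSON response into a simple text quiz
    data = json.loads(json_text)
    for number, q in enumerate(data["questions"], start=1):
        print(f"Q{number}. {q['question']}")
        for letter, choice in zip("abcd", q["choices"]):
            print(f"  {letter}) {choice}")
        # q["answer"] holds the correct choice for a separate answer key

# In the app this could be called as: build_quiz(answer.content)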

These examples will also be in the GitHub repo.

Conclusion

Pinecone Assistant makes writing RAG apps a breeze. The free tier allows you up to three assistants and each assistant has its own documents, so you could have different assistants for different purposes, each with its own set of documents. (The paid plan gives you an unlimited number of assistants.)

A good thing is that you don’t need your own subscription to an LLM service. You can choose between OpenAI and Anthropic models and the associated costs are included in your Pinecone plan (even the free one).

There are limitations on the resources that you can use in the free plan, see the docs for details.

You will also notice from the console image, above, that while I have used next to no storage, I’m nearly up to my limit of input tokens. Input tokens are used up by each query and are presumably partly a function of the data that is retrieved from the database. The “List the articles” query, for example, saw the input tokens rise from 1.3 M to 1.4 M. That’s quite a lot for a simple query, but it’s probably because just about the whole of each document needs to be used to answer it.

When I asked “What is the fifth amendment?” there was no noticeable change in the number of input tokens used and I imagine this must be because only a small part of one document was needed to answer the question.

My usage may not be typical: I’ve uploaded very little but have made quite a lot of queries, most of which required fairly short answers. A real application may have quite a different usage profile.

So, whether the quota in the free tier is enough to support serious work is an open question. If you are using Pinecone commercially then the Standard plan might be an option - it is probably not going to break the finances of your company. If you want to use it for personal use and don’t want to pay, you will need to keep an eye on your usage.

Personally, I am definitely thinking about how I can use Pinecone Assistant… I just need a project.


Thanks for reading and, if you haven’t tried Pinecone Assistant yet, maybe this will encourage you to give it a try.

To read more of my articles and tutorials please see my website and consider subscribing to my occasional newsletter where I link to new articles. You can also follow me on Medium.

Notes

  • The code and data for this project can be found here
  • The US Constitution is a public domain document and the Wikipedia article has a Creative Commons license.
  • All images and screenshots are by the author unless otherwise noted.
  • Disclaimer: I have no commercial interest in Pinecone