Retrieval Augmented Generation (RAG) lets you enrich LLM prompts with your own knowledge base. You embed documents into vectors and search them at runtime to pull in context. The question is: do you really need an external vector database? If you're just experimenting or working with small datasets, keeping everything in memory might be all you need.
Before deciding on infrastructure, it's worth grounding ourselves in how RAG works. In a typical flow you break your documents into chunks, embed them, and store the vectors in a searchable index. When the user submits a query, you embed the query, find the closest vectors, and send those documents along with the prompt to the language model. The idea is to provide context from your private data without having to train the LLM itself.
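The whole flow fits in a few lines. Here's a toy sketch that swaps the real embedding model for a bag-of-words counter — `VOCAB`, `embed`, and `cosine` are illustrative stand-ins, not a real API — just to make the chunk-embed-retrieve-prompt loop concrete:

```python
import math

# Toy stand-in for a real embedding model: bag-of-words vectors over a
# fixed vocabulary. In practice you'd call an embedding API instead.
VOCAB = ["refund", "policy", "shipping", "delay", "invoice"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Chunk and embed the documents, keeping the vectors in memory.
docs = [
    "our refund policy lasts 30 days",
    "shipping delay updates",
    "invoice questions",
]
index = [(doc, embed(doc)) for doc in docs]

# 2. Embed the query and pull in the closest chunk as context.
query = "how do I get a refund"
qvec = embed(query)
best_doc, _ = max(index, key=lambda pair: cosine(qvec, pair[1]))

# 3. The retrieved chunk is prepended to the LLM prompt.
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
```

Real systems replace the toy embedder with a model and the linear scan with an approximate nearest-neighbor index, but the shape of the pipeline is exactly this.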
RAG is simple in concept but relies heavily on fast vector search. Local libraries like FAISS handle this beautifully when the dataset is small. Managed services such as Pinecone, Weaviate, or Qdrant offer additional features like sharding, durability, and advanced filtering. The trick is deciding at which point it makes sense to switch.
When you're first tinkering with a RAG workflow, it's tempting to stand up a fully managed vector database from the start. But doing so adds unnecessary complexity. Local, in-memory indexing offers a handful of benefits:

- No infrastructure to provision, secure, or pay for — the index lives inside your process.
- Fast iteration: you can rebuild the index from scratch in seconds while you tune chunking and embeddings.
- Trivial teardown: when the experiment ends, nothing is left running.
Let's look at a simple example using FAISS and LangChain to create a throwaway in-memory index:
```python
# Current import paths; older LangChain releases used langchain.embeddings
# and langchain.vectorstores instead.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = ["alpha", "beta", "gamma"]

# Embed the texts and build an in-memory FAISS index in one call.
embeddings = OpenAIEmbeddings()
index = FAISS.from_texts(texts, embeddings)

# Semantic search: "bet" has no exact match but lands closest to "beta".
result = index.similarity_search("bet")
print(result)
```
Running this snippet gives you instant semantic search. It's perfect for situations where the data lives only in memory:

- Local experiments and prototypes where you iterate on chunking and prompts.
- Command-line tools that build the index fresh on each invocation.
- Batch jobs that process a dataset once and discard it.
- Demos and notebooks where standing up infrastructure would slow you down.

In each of these cases, an in-memory index is simple to manage and fast to query, giving you a quick feedback loop during development.
As soon as your project grows past a certain point, you'll start feeling friction:

- Rebuilding the index on every run takes minutes instead of seconds.
- The vectors no longer fit comfortably in a single machine's memory.
- Other services or teams need to query the same index concurrently.
- You need durability, replication, or metadata filtering beyond what a flat in-memory index provides.

At this point, relying on a managed vector database or self-hosted service often makes more sense.
FAISS stores vectors as 32-bit floats, so each dimension uses 4 bytes of memory. A 1,536‑dimensional embedding (the size used by many popular models) therefore occupies about 1,536 × 4 = 6,144 bytes – roughly 6 KB. Multiply that by the number of documents to get a ballpark figure for your memory needs. For example:

- 10,000 vectors ≈ 60 MB
- 100,000 vectors ≈ 600 MB
- 1,000,000 vectors ≈ 6 GB
This back‑of‑the‑envelope math shows why a small virtual machine with 4–8 GB of RAM can comfortably handle tens of thousands of vectors. An AWS t3.medium has 4 GB, while a t3.large offers 8 GB. Once your collection pushes beyond a few hundred thousand vectors, you'll start bumping into memory limits and should consider a service or sharding across machines.
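The estimate is easy to script. A quick sanity check, assuming float32 vectors and ignoring index overhead:

```python
DIM = 1536           # dimensions per embedding
BYTES_PER_FLOAT = 4  # float32

def index_size_bytes(num_vectors: int) -> int:
    """Rough memory footprint of a flat index: vectors only, no overhead."""
    return num_vectors * DIM * BYTES_PER_FLOAT

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} vectors ~ {index_size_bytes(n) / 1024**3:.2f} GiB")
```

A flat index adds little beyond the raw vectors, but structures like HNSW carry extra per-vector graph overhead, so treat this as a lower bound.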
Some projects only need persistence while the data volume remains modest. You can serialize an in-memory index to disk and reload it later. FAISS exposes a simple API for this:
```python
# Write the index (vectors plus docstore) to a local directory.
index.save_local("tmp/faiss_index")

# Later...
# load_local deserializes with pickle, so recent LangChain versions
# require opting in explicitly.
restored = FAISS.load_local(
    "tmp/faiss_index", embeddings, allow_dangerous_deserialization=True
)
```
Persisting the index as a plain file means a microservice can load it at startup and still benefit from in‑memory speed. You can sync that file to S3, NFS, or even ship it in a Docker image.
Imagine delivering a support tool that runs on laptops out in the field. Rather than rely on an external service, you precompute the embeddings, package the index alongside your application, and load it in memory each time the tool launches. Updates are distributed by simply shipping a new file.
A batch job may run once a day to process fresh documents. Instead of maintaining a live vector database, the job loads the index file from object storage, performs its tasks, then writes the updated file back. You get durability without a constantly running service.
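That load–update–write cycle can be sketched as follows. This is a minimal stand-in that uses `pickle` and the local filesystem in place of FAISS's `save_local`/`load_local` and object storage; `INDEX_PATH` and the dict-based index are illustrative only:

```python
import os
import pickle

INDEX_PATH = "index.pkl"  # in production this would live in object storage

def load_index() -> dict[str, list[float]]:
    """Load the serialized index if it exists, else start empty."""
    if os.path.exists(INDEX_PATH):
        with open(INDEX_PATH, "rb") as f:
            return pickle.load(f)
    return {}

def save_index(index: dict[str, list[float]]) -> None:
    """Write the index back so the next run picks it up."""
    with open(INDEX_PATH, "wb") as f:
        pickle.dump(index, f)

# Daily batch run: load, process fresh documents, persist.
index = load_index()
index["new-doc"] = [0.1, 0.2, 0.3]  # placeholder for a real embedding
save_index(index)
```

The point is the shape of the job: nothing stays running between executions, yet the index survives.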
Services like Pinecone, Qdrant Cloud, or Weaviate Cloud handle the heavy lifting of distributed indexing. They provide high‑level APIs for upserting vectors, filtering by metadata, and storing the data durably. Here's a short example showing how you might replace the in-memory FAISS code with a Pinecone index:
```python
# Uses the current Pinecone SDK; the legacy client used pinecone.init(...)
# with an environment parameter instead.
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key="YOUR_KEY")
index = pc.Index("my-index")

vector_store = PineconeVectorStore(index=index, embedding=OpenAIEmbeddings())
vector_store.add_texts(["alpha", "beta", "gamma"])
```
Now your vectors live in Pinecone and can be queried from multiple machines. Updates are persisted automatically, and the service scales as your data grows.
Using a service can remove a lot of operational burden, but you pay for that convenience both in dollars and additional complexity. It's wise to start with in-memory and move up only when necessary.
A common pattern is to begin with a lightweight in-memory setup and graduate to a service once traffic or data volume increases. Here's a step-by-step approach I've followed on several projects:

1. Prototype with an in-memory FAISS index rebuilt on every run.
2. Once rebuilds get slow, persist the index to disk and reload it at startup.
3. Hide the vector store behind a small interface so callers never depend on FAISS directly.
4. When data volume, uptime requirements, or multi-service access demand it, swap in a managed service behind the same interface.
The key is to let real usage guide your infrastructure choices rather than over-engineering from the start.
The following diagram summarizes the thought process when deciding between an in-memory approach and a managed vector database:
```mermaid
flowchart TD
    Start([Start]) --> Size{Dataset < 500,000 docs?}
    Size -- Yes --> Single{Single machine runtime?}
    Size -- No --> Service[Vector DB Service]
    Single -- No --> Service
    Single -- Yes --> Persist{Need persistence?}
    Persist -- No --> InMem[In-Memory RAG]
    Persist -- Yes --> Service
```
Use this as a quick rule of thumb. If you're below a few hundred thousand documents and running a single service instance, sticking with in-memory is usually just fine. Otherwise, the benefits of a managed store quickly outweigh the overhead.
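The flowchart's rules of thumb can also be captured as a tiny helper, handy as a checklist in design discussions. The 500,000-document threshold is the same rough cutoff used above, not a hard limit:

```python
def choose_vector_store(
    num_docs: int, single_machine: bool, needs_persistence: bool
) -> str:
    """Mirror the decision flowchart: thresholds are rules of thumb."""
    if num_docs >= 500_000:
        return "vector DB service"   # too big for one machine's memory
    if not single_machine:
        return "vector DB service"   # multiple consumers need shared state
    if needs_persistence:
        return "vector DB service"   # durability without DIY file syncing
    return "in-memory RAG"

# A prototype on one box with rebuildable data stays in memory:
print(choose_vector_store(1_000, True, False))     # in-memory RAG
# A large shared corpus goes to a service:
print(choose_vector_store(600_000, False, True))   # vector DB service
```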
Imagine you're building a documentation assistant for a small startup. In the early days you only have fifty markdown files with internal process docs. You start with a simple script that loads the files, embeds them with OpenAI, and builds a FAISS index each time it runs. The system answers employee questions in a Slack bot. Everything lives on a single machine, and deploying updates means rerunning the script. It's fast and free—perfect during the experiment phase.
Fast forward six months and your company has hundreds of new docs plus a growing set of spreadsheets, presentations, and meeting transcripts. Now the index rebuild takes several minutes and the dataset doesn't fit comfortably in memory. On top of that, other teams want to access the same knowledge base from their own services. At this point, you stand up a Pinecone index. Migrating is straightforward because your code already encapsulates the vector store behind an interface. The new service provides persistence, replication, and an easy path to scale as your knowledge base keeps expanding.
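The "encapsulated behind an interface" part is what makes that migration cheap. A minimal sketch using a `typing.Protocol` — the `VectorStore` and `answer` names are hypothetical, chosen to match the methods FAISS and Pinecone stores happen to share:

```python
from typing import Protocol


class VectorStore(Protocol):
    """The minimal surface the application depends on. Both the FAISS and
    Pinecone LangChain stores already satisfy it structurally."""

    def add_texts(self, texts: list[str]) -> None: ...
    def similarity_search(self, query: str, k: int = 4) -> list[str]: ...


def answer(question: str, store: VectorStore) -> str:
    """Application code talks to the protocol, never to a concrete backend."""
    context = "\n".join(store.similarity_search(question))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Swapping FAISS for Pinecone then means changing only the line that constructs the store; `answer` and everything above it stay untouched.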
When deciding whether to use in-memory storage or a vector database, consider the following questions:

- How many vectors do you have today, and how fast is that number growing?
- Does the index need to survive restarts, or can you rebuild it on demand?
- Will more than one machine or service need to query it?
- Do you have the budget and operational appetite for another managed dependency?
By answering these questions honestly, you can pick the right tool for the stage your project is in.
In-memory RAG is a fantastic way to start your journey with minimal friction. It lets you experiment rapidly and keeps your stack simple. As your data grows and your use case solidifies, managed vector databases offer durability and advanced features that become worth the added complexity. The key is to evolve your architecture gradually, adopting the tools you need when you need them. Start small, validate, then scale with confidence.