bogwog 4 days ago

I don't know how this is different from regular RAG yet, but that Harry Potter example sucks. The "inferior answer" seems much more accurate to the prompt, with much higher information density, and the "good answer" just seems like the kind of generic slop any old LLM would produce if you asked it to summarize Harry Potter.

Also, the prompt itself is in semi-broken English, and it's not clear what exactly is being asked.

simpaticoder 4 days ago

I am naive about LLM technology, in particular the relationship between base models, fine-tuning, and RAG. This particular branch of effort seems aimed at something that is of great interest to me (and I'm sure many others): specializing a more general base model to know a particular domain in great detail, and so improve its responses within that domain. In the past, this might have been called an "expert system". For example, you might want to train an LLM on your project codebase and documentation such that subsequent code suggestions prioritize the use of internal libraries or code conventions over those represented by the public sources encoded in the base model.

I found the Google Colab notebook of MemoRag[1] to be of great use in understanding roughly the scope and workflow of this work. The interesting step is when you submit your domain text to be encoded into something new, a GPU-intensive process they call "forming memory"[2]. Perhaps there is some sort of back-and-forth between the base model and your data that results in new weights added to the base model. As I said, I am naive about LLM technology so I'm not sure about the details or the nomenclature. However, if this is even partially correct, I'd like to understand how the "formed memory" and the base model cohabitate during inference, because this would create memory pressure on the GPU. If the memory required for the base model is M, and the formed memory is N, it's reasonable to assume you'd need M+N memory to use both.

1 - https://colab.research.google.com/drive/1fPMXKyi4AwWSBkC7Xr5...

2 - https://colab.research.google.com/drive/1fPMXKyi4AwWSBkC7Xr5...

  • nl 3 days ago

    > However, if this is even partially correct I'd like to understand how the "formed memory" and the base model cohabitate during inference, because this would create memory pressure on the GPU.

    Not really. RAG loads the selected data into the model's context window; it doesn't change the weights (the "neurons", aka parameters) themselves, so the GPU memory usage is essentially just the size of the neural network plus a comparatively small cache for the loaded context.

    You will hear about "context size" a lot. This is the number of tokens a model can have loaded at once before it saturates and starts losing track of things that were loaded earlier.
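
    For a rough sense of scale (a back-of-the-envelope with made-up but typical shapes, not measurements): the loaded context only adds a KV cache on top of the weights, not a second copy of the model.

      # Rough estimate of inference-time GPU memory: model weights + KV cache
      # for whatever is in context. Shapes are illustrative (roughly Llama-3-8B-like).
      def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
          return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem  # 2x for K and V

      weights_gb = 8e9 * 2 / 1e9                        # ~16 GB of fp16 weights
      kv_gb = kv_cache_bytes(32, 8, 128, 32_000) / 1e9  # ~4 GB for a 32k-token context
      print(f"weights ~= {weights_gb:.0f} GB, KV cache ~= {kv_gb:.1f} GB")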

  • bbor 4 days ago

       In the past, this might have been called an "expert system". 
    
    Heh, it comes full circle... After ~50 years of Expert Systems winter, we're training our new AGIs to become more specialized! This is a memorable lesson that binaries must always be deconstructed, at least to some extent -- kinda like the endless dance we're doing between monoliths and microservices as each new generation of tools runs into the problems inherent in each.

      I am naive about LLM technology so I'm not sure about the details or the nomenclature
    
    You've got all the details right though, so that's pretty impressive :). AFAICT from a quick glance at the code (https://github.com/qhjqhj00/MemoRAG/blob/main/memorag/memora...), it is indeed "fine tuning" (jargon!) a model on your chosen book, presumably in the most basic/direct sense: asking it to reproduce sections of text at random from the book given their surrounding context, and rewarding/penalizing the neural network based on how well it did. The mention of GPU memory in the Colab notebook is merely because this process is expensive -- "fine tuning" is the same thing as "training", just with a nearly-complete starting point. Thus the call to `AutoModelForCausalLM.from_pretrained()`.
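
    Purely as a sketch of that guess (plain causal-LM fine-tuning on a book with the Transformers `Trainer`; hypothetical, not this repo's actual code), it would look something like:

      # Hypothetical sketch: continued next-token training on a single book.
      from datasets import Dataset
      from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                DataCollatorForLanguageModeling, Trainer, TrainingArguments)

      model_name = "gpt2"  # placeholder base model
      tok = AutoTokenizer.from_pretrained(model_name)
      tok.pad_token = tok.eos_token
      model = AutoModelForCausalLM.from_pretrained(model_name)

      book = open("my_book.txt").read()  # your chosen book
      chunks = [book[i:i + 2048] for i in range(0, len(book), 2048)]
      ds = Dataset.from_dict({"text": chunks}).map(
          lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="book-ft", num_train_epochs=1,
                                 per_device_train_batch_size=2),
          train_dataset=ds,
          data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
      )
      trainer.train()  # "rewarding/penalizing" = minimizing next-token loss on the book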

    To answer your question explicitly: the fine-tuning step creates a modified version of the base model as an "offline" step, so the memory requirements during inference (aka "online" operation) are unaffected. Both in terms of storage and in terms of GPU VRAM. I'm not the dev tho so obv apologies if I'm off base!

    I would passionately argue that that step is more of a small addition to the overall pipeline than a core necessity, though. Fine-tuning is really good for teaching a model to recreate style, tone, structure, and other linguistic details, but it's not a very feasible way to teach it facts. That's what "RAG" is for: making up for this deficiency in fine-tuning.

    In other words, this repo is basically like that post from a few weeks back that was advocating for "modular monoliths" that employ both strategies (monolith vs. microservices) in a deeply collaborative way. And my reaction is the same: I'm not convinced the details of this meshing will be very revolutionary, but the idea itself is deceptively clever!

    • spmurrayzzz 4 days ago

      > AFAICT from a quick glance at the code (https://github.com/qhjqhj00/MemoRAG/blob/main/memorag/memora...), it is indeed "fine tuning" (jargon!) a model on your chosen book, presumably in the most basic/direct sense: asking it to reproduce sections of text at random from the book given their surrounding context, and rewarding/penalizing the neural network based on how well it did.

      Maybe your use of quotes is intentional here, but for posterity's sake there is no actual fine-tuning happening using user input in the code you linked, insofar as the weights of the model aren't being touched at all, nor are they modifying anything else that could impact the original weights (like a LoRA adapter). You touch on this, I think (?), in some of your subsequent language but it read as a little confusing to me at first glance. Or maybe I've been too deep in the ML weeds for too many years at this point.

      The paper details the actual process, but the TL;DR is that the memory module they use, basically a draft model, does go through a pretraining phase using the redpajama dataset, and then an SFT phase with a different objective. This all happens before and irrespective of the inference-time task (i.e. asking questions about a given text). Also, as has been pointed out in other comments, the draft model could really be any model that supports long context and has decent retrieval performance. So the actual training phases here may be non-essential depending on your infra/cost constraints.
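
      The objective there (section 2.2) is essentially standard next-token loss conditioned on the KV cache of earlier "memory" tokens. A toy illustration of that idea, with gpt2 and ordinary tokens standing in for their memory module and compressed memory tokens:

        # Toy illustration: next-token loss given the KV cache of a "memory" prefix.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "gpt2"  # stand-in model, not the paper's memory module
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)

        memory_ids = tok("a long stretch of earlier context ...", return_tensors="pt").input_ids
        target_ids = tok(" and the text we want to predict", return_tensors="pt").input_ids

        with torch.no_grad():  # encode the memory span once, keep only its KV cache
            memory_kv = model(memory_ids, use_cache=True).past_key_values

        mask = torch.ones(1, memory_ids.size(1) + target_ids.size(1))
        out = model(input_ids=target_ids, attention_mask=mask,
                    past_key_values=memory_kv, labels=target_ids)
        out.loss.backward()  # standard causal-LM loss, conditioned on the cached prefix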

      • bbor 4 days ago

        Thanks for the corrections! I’m very much not an expert on LLM usage in the real world. But I’m a bit confused:

          does go through a pretraining phase using the redpajama dataset, and then an SFT phase with a different objective
        
        Isn’t that equivalent to what I said, since “SFT” seems to stand for “supervised fine-tuning”? That it starts with a pre-trained model, and then modifies that model according to your corpus?

        Perhaps the confusion here is my ambiguity with “model”; I now see that there are really two models: one for generating a draft + clues, and one for constructing the final output. This library only concerns/modifies the former. Maybe?

        • spmurrayzzz 4 days ago

          I should have quoted you more specifically, my apologies. I was responding to the comment that there was some training of the "model on your chosen book".

          There is no fine-tuning done specific to the corpus you own. I noted this in a sibling comment, but both the pretraining and the fine-tuning use a generic dataset (RedPajama), with an objective that "aims to maximize the generation probability of the next token given the KV cache of the previous memory tokens" (quote from section 2.2 of the paper).

          This is why I noted you could really use any long-context model that also has good retrieval performance. They're training their own draft model in lieu of using an existing model, but you could get similar/better outcomes using something like claude sonnet 3.5.

          • bbor 4 days ago

            Thanks for taking the time, that makes sense. This is not the first time I've misunderstood something by having opinions about what it should be doing, haha. I absolutely agree with your last point, too.

quantadev 4 days ago

The overview paragraph needs to be expanded quite a bit. The only operative phrase about how this thing works is "By recalling query-specific clues". I think people need a bit more knowledge about what this is and how this works, in an overview, to get them interested in trying it. Surely we can be a bit more specific.

  • 3abiton 4 days ago

    This comment brought back some academic-paper-reviewer PTSD.

  • diggan 4 days ago

    I think leaving just an overview in the repository is fine considering they've released a paper describing it in detail (https://arxiv.org/pdf/2409.05591, linked in the README).

    • quantadev 4 days ago

      Sure, an "overview" is fine. However, four words of meaningful content isn't an overview. This one says essentially nothing about whatever it is they claim to have done.

  • afro88 4 days ago

    It reads like an LLM wrote it. Word salad that waffles on without any substance. In fact I think an LLM wrote most of the README. There are the telltale bullet points with bold starting words for example.

    • thelastparadise 4 days ago

      > There are the telltale bullet points with bold starting words for example.

      Is this where we're at now, really? Basic markdown formatting is a telltale sign that something was written by AI?

      • afro88 3 days ago

        Basic markdown formatting, no. But bullet points with bold starting words, coming after a really waffly introduction, make it more likely they just asked an LLM to write it.

    • quantadev 4 days ago

      I think lots of modern writing apps let people have AI "reword" their own content into better sentence structures, etc. I'm fine with that, actually. It doesn't mean the AI invented the content itself.

      • isoprophlex 3 days ago

        If only the AI could get to the point quickly instead of running its mouth on and on...

    • herval 3 days ago

      your examples are all telltale signals an LLM DID NOT write this text, to be fair

      • afro88 3 days ago

        You need to spend more time getting LLMs to write documentation. Without examples of how you want it to do it, it defaults to word salad that sounds impressive on the surface, but doesn't really say anything. And it very commonly uses bullet points with a few starting words bolded. At least in my experience.

davedx 4 days ago

I don’t understand what the memory is or does from the README. Can anyone explain how it works differently from vector database results in vanilla RAG applications?

  • jszymborski 4 days ago

    Ok, I think I get it now from scanning the paper and reading Eq. 1 and 2.

    Normally RAG just sends your query `q` to an information retrieval function, which searches a database of documents using full-text search or vector search. Those documents are then passed to a generative model along with your query to give you your final answer.

    MemoRAG instead immediately passes `q` to a generative model to generate some uninformed response `y`. `y` is then passed to the information retrieval function. Then, just like vanilla RAG, `q` and the retrieved documents are sent to a generative model to give you your final answer.

    Not sure how this is any more "memory-based" than regular RAG, but it seems interesting.

    Def check out the pre-print, especially eq. 1 and 2. https://arxiv.org/abs/2409.05591

    EDIT: The "memory" part comes from the first generative model being able to handle larger context, covered in Section 2.1

    • mycall a day ago

      I do memory-based RAGs using Semantic Kernel function calls, which maintain specialized memory caches with useful calculations based on telemetry data. It is so simple to do, and I love that the LLM figures out how to call the SemanticFunctions on its own.

      Ollama and LangChain can do something similar.

    • bbor 4 days ago

        Not sure how this is any more "memory-based" than regular RAG, but it seems interesting.
      
      I can't remember where I read this joke, but as a self-proclaimed Cognitive Engineer I think about it every day: "An AI startup's valuation is directly proportional to how many times they can cram 'mind' into their pitch deck!"
    • isoprophlex 4 days ago

      thanks for boiling it down to the most salient point... to me, their approach is just query rewriting, which is pretty standard when doing RAG.

      • fraboniface 4 days ago

        Not exactly: they use a small but long-context model that has the whole dataset (or a large part of it) in its context to generate the chunks as elements of the reply, before passing those to the final model. So the retrieval itself is different; there is no embedding model or vector DB.

      • opdahl 4 days ago

        Agreed. In the RAG space there are a million Open Source projects on GitHub all calling it memory, recreating the same thing over and over again.

      • jszymborski 4 days ago

        There's a lot there about the generative model ("Memory Models") in the paper, so perhaps I've misrepresented it, but generally speaking yah I agree with you. It doesn't sound like a fundamental change to how we think about RAG, but it might be a nice formalization of an incremental improvement :)

    • danielbln 4 days ago

      • jszymborski 4 days ago

        It seems to be fundamentally the same deal except instead of passing `q` to GPT-4, they have some long-context "Memory Model" (whose details I've yet to fully understand). Also, MemoRAG uses a more conventional Retrieve/Generate pipeline downstream of the generated queries than "Contriever" (whose details I similarly haven't informed myself on).

        It would be interesting to see a performance comparison; it certainly seems the most relevant one (that, or an ablation of their "memory model" against the LLMs upon which it is based).

        • spmurrayzzz 4 days ago

          > they have some long-context "Memory Model" (whose details I've yet to fully understand)

          Section 2.2 of the paper[1] goes into this in more detail. They pretrain the draft model using the redpajama dataset, followed by a supervised fine-tuning step. The training objective "aims to maximize the generation probability of the next token given the KV cache of the previous memory tokens".

          This suggests that any model with long context and good retrieval performance could do the same job (and maybe better in the case of the SOTA frontier models).

          [1] https://arxiv.org/pdf/2409.05591

novoreorx 3 days ago

Interesting. Splitting one question into multiple clues is actually how the human mind thinks about a question. This makes me think of OpenAI's GPT-4, though GPT-4 focuses on rethinking mistakes. It seems that imitating the human mind is the trend for improving LLM technology.