btown 21 hours ago

One of the most interesting things to me about CRDTs, and something a skim of the article (with its focus on low-level CRDTs) might leave the wrong impression about... is that things like https://automerge.org/ are not just "libraries" that "throw together" low-level CRDTs. They are themselves full CRDTs, with strong proofs about their characteristics under stress.

Per the Automerge website:

> We are driven to build high performance, reliable software you can bet your project on. We develop rigorous academic proofs of our designs using theorem proving tools like Isabelle, and implement them using cutting edge performance techniques adopted from the database world. Our standard is to be both fast and correct.

While the time and storage-space performance of these new-generation CRDTs may not be ideal for all projects, their convergence characteristics are formalized, proven, and predictable.

If you're building a SaaS that benefits from team members editing structured and unstructured data, and seeing each others' changes in real time (as one would expect of Notion or Figma), you can reach for CRDTs that give you actionable "collaborative deep data structures" today, without understanding the entire history of the space that the article walks through. All you need for the backend is key-value storage with range/prefix queries; all you need for the frontend is a library and a dream.
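To make "key-value storage with range/prefix queries" concrete, here is a minimal sketch of that backend contract. All names here are hypothetical; real libraries like Automerge ship their own sync protocols on top of exactly this kind of store.

```typescript
// Sketch: a CRDT sync backend is just a KV store with prefix scans.
// Each change a client produces is an opaque blob, keyed under its doc.
type KV = Map<string, Uint8Array>;

// Append a change blob under "doc/<docId>/change/<seq>".
function appendChange(store: KV, docId: string, seq: number, blob: Uint8Array): void {
  // Zero-pad seq so lexicographic key order matches numeric order.
  const key = `doc/${docId}/change/${String(seq).padStart(10, "0")}`;
  store.set(key, blob);
}

// A prefix query returns every change for a doc, in order -- enough
// for a client to catch up and merge the changes locally.
function changesFor(store: KV, docId: string): Uint8Array[] {
  const prefix = `doc/${docId}/change/`;
  return [...store.keys()]
    .filter((k) => k.startsWith(prefix))
    .sort()
    .map((k) => store.get(k)!);
}
```

A real deployment would also publish each appended key on a pub/sub channel so connected clients learn about new changes immediately, which is what the Redis-based sibling comment describes.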

  • michelpp 19 hours ago

    Automerge is an excellent library with a great API, not just in Rust but also in JavaScript and C.

    > All you need for the backend is key-value storage with range/prefix queries;

    This is true. I was able to quickly put together a Redis Automerge library that supports the full API, including pub/sub of changes to subscribers, for a full persistent sync server [0]. I was surprised how quickly it came together. With some LLM assistance (I'm not a frontend specialist), I also built a usable web demo of documents synchronized across multiple browsers, using the Webdis [1] websocket support over pub/sub channels.

    [0] https://github.com/michelp/redis-automerge

    [1] https://webd.is/

  • mentalgear 18 hours ago

    Automerge is a great project, but it still feels way too academic in its setup. If you need superior DX and a CRDT-based full-stack database, I'd recommend looking at Triplit.dev and their docs. (While development has slowed somewhat, the product is fully featured and should work well for anything from small to medium, and probably even very large, projects depending on your configuration.) Give it a try, you will like it.

    • satvikpendem 15 hours ago

      Sadly Triplit is web only, the one place I'd least expect people to need offline access, because... they can access the website from the browser. CRDTs are primarily useful in mobile or desktop apps, and yes, Electron and React Native exist, but it's better to have a language-agnostic API, or something straight in the database, like Postgres, that any app can use regardless of implementation language.

    • GermanJablo 18 hours ago

      Triplit is my favorite local-first database. However, it doesn't compete in the same space as Automerge, which is doc-based. If you want a user-friendly alternative, I'm launching my proposal this week: https://docnode.dev

    • isaachinman 12 hours ago

      InstantDB is a far better choice

      • jtesp 11 hours ago

        Yeah, it's great as long as you understand it's not a traditional character-level CRDT. It's last-write-wins, so you have to be careful with it.
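        For the curious, last-write-wins at the register level can be sketched like this (a textbook LWW register, not InstantDB's actual implementation):

```typescript
// Sketch of a last-write-wins (LWW) register: each write carries a
// timestamp, and merge keeps the value with the larger one. Ties are
// broken by replica id so merge stays deterministic on every replica.
interface LWW<T> {
  value: T;
  ts: number;       // logical or wall-clock timestamp
  replica: string;  // tie-breaker
}

function mergeLWW<T>(a: LWW<T>, b: LWW<T>): LWW<T> {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.replica > b.replica ? a : b;
}
```

        The caution above follows directly: if two users set the same field concurrently, one write simply disappears after merge, whereas a character-level text CRDT keeps both users' insertions.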

  • satvikpendem 15 hours ago

    Well yeah, who expected them not to be a full-blown CRDT? Similarly, I like Loro (https://loro.dev), but the fundamental problem remains that they're document-based without a good query engine, i.e. you have to literally target a specific nested entry in the CRDT to get the data you want.

rdtsc a day ago

That's a great summary of CRDTs, starting from the basics and working up to the more advanced ones.

Speaking of Riak, it's still around, in the form of https://github.com/OpenRiak!

  • macintux 19 hours ago

    Thank you, I was completely unfamiliar with OpenRiak. Pretty cool to see some of my former co-workers chiming in on the effort. Basho was a remarkable collection of smart people.

jv22222 10 hours ago

The thing I find interesting about CRDTs and OT is that they're built to solve people typing in the same paragraph at the exact same time, which is something that very rarely happens in my experience. (Talking about the text-based collaboration aspect.)

  • pentaphobe an hour ago

    Not sure if helpful, but..

    I've found that systems which _don't_ support this often end up accidentally putting people's cursors in the same sentence/block (resulting in one or more editors losing content or wasting time trying to get detached from the other cursors).

GermanJablo 18 hours ago

Interesting read. I’ve spent the past two years developing my own CRDT, but along the way, I realized a CRDT involves too many trade-offs, so I ended up implementing an ID-based OT framework. Coincidentally, I’m planning to launch it this Tuesday, so here’s an exclusive for you: https://docnode.dev. I'd like to hear your thoughts!

In the future, I plan to add a CRDT mode for scenarios where P2P is required.

  • josephg 18 hours ago

    Out of curiosity, which tradeoffs were problematic for your design?

    • GermanJablo 4 hours ago

      Hi Seph, great to hear from you! I emailed you about a week ago with a private beta to thank you for your contributions (you’re in the acknowledgements section [1]) and to ask for your feedback. I’m not sure if I got your email right.

      The tradeoffs I mention mostly concern metadata: insert (OriginLeft, OriginRight), delete (tombstones), and moving (a full topic, you know what I mean).
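      To make that metadata cost concrete for other readers, here is a toy illustration (my own sketch, not DocNode's design) of why deletes in a sequence CRDT leave tombstones: the deleted element's id has to survive so that concurrent inserts anchored on it can still be placed.

```typescript
// Toy sequence-CRDT element: each character keeps a unique id and the
// id of the element it was inserted after (its left origin). Deleting
// only sets a flag -- the id must remain so concurrent inserts that
// reference it as an origin can still resolve their position.
interface Elem {
  id: string;            // unique per insert, e.g. "<replica>:<counter>"
  origin: string | null; // id of the left neighbour at insert time
  char: string;
  deleted: boolean;      // tombstone flag
}

function visibleText(elems: Elem[]): string {
  return elems.filter((e) => !e.deleted).map((e) => e.char).join("");
}

function deleteElem(elems: Elem[], id: string): void {
  const e = elems.find((x) => x.id === id);
  if (e) e.deleted = true; // tombstone: never physically removed
}
```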

      I know that with eg-walker you managed to reduce those costs by loading metadata into memory only when required. Still, I believe that for me, and for many others, a central server makes more sense, since P2P is a requirement for very few.

      DocNode isn’t traditional OT. It’s ID based instead of positional. I essentially started from a CRDT and stripped out the compromises that come with supporting P2P.

      That said, CRDT trade-offs weren’t my only motivation for building DocNode. Even if I had gone with a “classic” CRDT, I wanted a different API and a new approach to type safety.

      On top of that, I also have a non-mainstream stance regarding text CRDTs. I wrote a blog post explaining it, and I mention you there as well [2].

      I'd love to hear your feedback!

      [1] https://docnode.dev/docs#credits--acknowledgements [2] https://docnode.dev/blog/text-conflicts

trm217 17 hours ago

Very interesting read! Thanks for sharing!

bbor 10 hours ago

Great article that I haven't finished, but if the author ends up reading this: any good dictionary of terms needs an index!

fellowniusmonk 21 hours ago

CRDTs are something you still have to write by hand. I finished creating a custom sequence-based CRDT engine about 2 months ago (inspired by Diamond Types), and it was hilarious to ask AI for assistance.

It's interesting, when you are working on something that:

1. is essentially a logic problem,

2. isn't in the LLM's training data, and

3. produces dense character sequences when testing,

to see how completely useless an LLM is outside of its pre-trained areas.

There needs to be some black-box test based on pure but niche logic to see whether an LLM is capable of understanding, or even noticing exposure to, new logics.

  • canadiantim 20 hours ago

    What about just using something like Loro?

    • fellowniusmonk 20 hours ago

      I love Loro and it's probably my favorite open source project (you can see me refer to it as such in my comment history), but I have a very specific multi-CRDT and search-indexing architecture that precluded me from using it.

    • fellowniusmonk 18 hours ago

      Ah, sorry, I completely left out context: when I say "by hand" I don't mean there are no good CRDT projects. Loro is absolutely great, and others are very good as well.

      I mean only in the context of writing your own: you can't use AI. AI can be used to write code, and it can certainly explain a lot of code, so people start ascribing more reasoning power to it than it has. CRDTs are an area where current models just completely lose the plot.

      If you're only using AI in well-mapped areas, it's easy to start assuming it has human-level reasoning capabilities; the illusion is quickly shattered if you're operating at the edge.

      • satvikpendem 15 hours ago

        I don't understand what AI has to do with CRDTs; you don't need to use one with the other. As in, your initial comment seems like a non sequitur to the topic of the post.

ForHackernews 16 hours ago

Am I right in thinking CRDTs just push merge conflicts out of the database and up into the application logic?

  • macintux 15 hours ago

    They can be designed into the database itself; that was our objective with Riak.

    And if written and applied correctly, they can automate a resolution to the merge conflict, no matter what layer they reside in.
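    The textbook example of automating that resolution is a grow-only counter (a standard G-Counter sketch, not Riak's code): the merge function itself resolves the conflict, deterministically, at whatever layer it runs.

```typescript
// Grow-only counter (G-Counter): each replica increments only its own
// slot; merge takes the per-replica max. Merge is commutative,
// associative, and idempotent, so replicas converge in any merge order.
type GCounter = Map<string, number>;

function increment(c: GCounter, replica: string): void {
  c.set(replica, (c.get(replica) ?? 0) + 1);
}

function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const out = new Map(a);
  for (const [r, n] of b) out.set(r, Math.max(out.get(r) ?? 0, n));
  return out;
}

function total(c: GCounter): number {
  return [...c.values()].reduce((s, n) => s + n, 0);
}
```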

  • cryptonector 11 hours ago

    I've been thinking about building a graph DB in PG with a DB schema that uses CRDT techniques. Every time I think about it, the biggest problem ends up being unique/primary keys, and then I end up with an idea that looks a lot like Active Directory's approach to UNIQUE/PRIMARY KEYs:

    - every object has an ID from a single object ID namespace allocated in large chunks to all the participating servers, and

    - first UNIQUE/PRIMARY KEY creation wins; competing ones get renamed (to "Copy of <original>" or similar) or deleted.
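    That "first creation wins, competitors get renamed" rule can be sketched as a merge function. This is my own illustration with hypothetical names, not an actual PG extension; I'm assuming the lower object id (allocated earlier from the shared namespace) counts as "first".

```typescript
// Sketch: resolving concurrent creations of the same unique key.
// Each creation carries the object id that created it; the lower id
// wins, and the loser is renamed "Copy of <original>" rather than
// silently dropped, so no data is lost.
interface Creation { key: string; objectId: number }

function resolveUnique(a: Creation, b: Creation): Creation[] {
  if (a.key !== b.key) return [a, b]; // no conflict
  const [winner, loser] = a.objectId < b.objectId ? [a, b] : [b, a];
  return [winner, { ...loser, key: `Copy of ${loser.key}` }];
}
```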

    I'm a fan of EAV schema design for graph DBs, so that's what my schema would use, at least for inter-object relations, though maybe also for object attributes (that's tricky to do in PG since there is no ANY type for columns, even though the JSON and JSONB support essentially gives you an approximation of ANY). Why EAV? Because graph traversal (for transitive closure or reachability computations) is very simple to express generically in SQL when using EAV schemas, via a RECURSIVE CTE.
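    To illustrate why the fixed EAV shape makes traversal generic (this is the same closure a WITH RECURSIVE query computes in PG, shown in plain code rather than SQL):

```typescript
// EAV triples: (entity, attribute, value). Because every edge has the
// same shape, one generic traversal works for any relation -- you only
// parameterize the attribute name, exactly as a RECURSIVE CTE would.
type Triple = { e: string; a: string; v: string };

// All entities reachable from `start` via `attr` edges (transitive closure).
function reachable(triples: Triple[], start: string, attr: string): Set<string> {
  const seen = new Set<string>();
  const frontier = [start];
  while (frontier.length > 0) {
    const cur = frontier.pop()!;
    for (const t of triples) {
      if (t.e === cur && t.a === attr && !seen.has(t.v)) {
        seen.add(t.v);
        frontier.push(t.v);
      }
    }
  }
  return seen;
}
```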

    Yeah, so you have to re-implement FOREIGN KEY functionality, but you use the same sort of SQL that the RDBMS would generate internally for FKs, so no problem.

    EAV also ends up creating a second layer of schema for the application, whose metaschema you define in the selected RDBMS's language (typically SQL). But another thing you get trivially out of EAV schemas is class inheritance for your higher-level schema, and, yeah, OOP _is_ bad, but for a graph DB it's actually quite handy and convenient, and it need not bleed into client applications. SQL RDBMSes often don't do inheritance well; at least PG doesn't.

    It'd be nice to be able to create first-class CRDT types in PG for columns that are not UNIQUE/PRIMARY KEY columns... but IIUC while you can create TYPEs and operators, you can't limit the operators to the type's monoid operators. The CRDTs for the non-key columns would have to be exported to the application schema layer, which is what I'd want anyways. The typical CRDTs for non-key columns would be ones like the one in TFA, especially last-write-wins types.

    All of this is by way of answering your question by mentioning PG's first-class TYPEs and operators: CRDTs _can_ be built into the RDBMS, or the RDBMS can let you implement them yourself. But a question I have is: what is the application in the RDBMS context? Is it... the SQL tables and views? The tables, views, triggers, and the rest of the SQL schema? Is it the code that generates SQL statements and sends them to the RDBMS? Is it both those things (schema, clients)? The answer will depend on how much of the business logic you end up implementing in the SQL schema.