The Oximeter Query Language (Oxide RFD 463)

tapoxi 9 months ago

When a small hardware company is not only making its own full hardware and software stack but brings that all the way down to the telemetry query language, I get a lot of NIH vibes and question whether any of these elements will get the attention they deserve.

mtndew4brkfst 9 months ago

Note that some Oxide employees (including the ones I personally admire) have histories of working with Solaris, building their own cloud stack and hypervisor atop it (Joyent SmartOS), and, if you'll permit me some editorializing, some general nonconformist habits. This is not the first time your sentiment has been expressed about these folks, and I'm sure it won't be the last.
I don't personally see it as NIH tendencies, especially since after firsthand experience they're making similar choices again, but that's something subjective to evaluate for yourself. I happen to admire the professed goal of disrupting data center hardware-rack design and the economy around it, even if I'm indifferent to Solaris as a technical centerpiece.

steveklabnik 9 months ago

(I work at Oxide and helped review this RFD.)
There are always pros and cons to the choice to use something that already exists before making something new. Ultimately, it is our job as engineers to take a look at the tradeoffs and make the call.
I would agree that Oxide has taken the "so we should build it" choice more than other places I have ever worked. However, Oxide is a bigger swing than many of the places I have worked. And we have higher intentions of excellence than many places I have worked. I think both of these factors make it more likely to want to build your own thing: it is less often that something off the shelf does exactly what we need, and it is easier to provide something great if you have the ability to take control of it.
Furthermore, there's also some amount of bias in the stuff you hear about: people love to hear stories about some new shiny thing, but they don't want to hear about very meat-and-potatoes, boring choices. So you're going to see the story on HN about us making a query language, but you're not going to hear the story about us using TypeScript on our frontend rather than, say, some Rust + wasm thing, or about us choosing CockroachDB for the control plane. Until circumstances change and suddenly the Cockroach choice becomes interesting, which it did! So it's worth keeping that in mind as well.
Time will tell, but I can't think of an obvious place yet where we made the choice to build our own something and it clearly was the wrong decision in retrospect. And it has very often been clearly the right decision in retrospect. That's my own personal metric for "are we making good decisions over time?"

jsiepkes 9 months ago

It's explained in section 1.2.
My interpretation is: Oxide already chose to use ClickHouse as the DB for metrics in its rack. You can't just slap PromQL on ClickHouse, because DSLs like PromQL reflect some of the inner workings of, for example, Prometheus. Hence they needed a DSL which was somewhat opinionated towards ClickHouse in order to make it work.
- antonyt 9 months ago
  
  You can, however, slap PRQL or KQL on top of ClickHouse. https://clickhouse.com/docs/en/guides/developer/alternative-...
  1.2 doesn't go into nearly enough detail to be convincing, in my opinion.
  - jsiepkes 9 months ago
    
    As far as I know, neither PRQL nor KQL is a DSL intended for querying timeseries (like PromQL). Even other timeseries DBs like InfluxDB, which are somewhat similar to Prometheus, have their own DSL (for example, InfluxQL in the case of InfluxDB).
    Personally I think this sentence summarizes the need pretty well: "Many telemetry systems have implemented their own DSL, and each tends to be tailored (intentionally or not) to the underlying data model and storage mechanisms."
    
    antonyt 9 months ago
    
    KQL, at least, is exactly a DSL intended for querying timeseries.
    Also, it's not like they've implemented a custom storage back-end; it's ClickHouse. ClickHouse itself thinks SQL is fine for querying time-series data (as do Snowflake and Timescale, for that matter).
    Ultimately of course Oxide can do what it wants, but the justifications provided in this doc are thin enough to make NIH seem like a very plausible explanation. Perhaps the problem domain is more complicated than just retrieving some time-series data out of ClickHouse, and the doc fails to make that clear.