Rust question, cyclic data structures

solrize@lemmy.world · edit-2 2 months ago

FizzyOrange · 2 months ago

Basically, if your tree is static or you only add nodes, the easiest option is to store all nodes in a Vec and have the child/parent links be indices. I recommend the typed-index-collection crate if you go that route.
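A minimal sketch of the Vec-plus-indices idea, assuming an append-only tree. The names Tree, Node and NodeId are made up for illustration, and NodeId is a hand-written stand-in for what a typed-index crate would provide:

```rust
/// Index into `Tree::nodes`. A newtype (hypothetical name) so that a node id
/// cannot be mixed up with an ordinary usize.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(usize);

struct Node<T> {
    value: T,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Append-only tree: nodes are never removed, so an id handed out by `add`
/// stays valid for as long as the tree exists.
struct Tree<T> {
    nodes: Vec<Node<T>>,
}

impl<T> Tree<T> {
    fn new() -> Self {
        Tree { nodes: Vec::new() }
    }

    /// Adds a node and links it under `parent`.
    /// Panics if `parent` is an id that did not come from this tree.
    fn add(&mut self, value: T, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { value, parent, children: Vec::new() });
        if let Some(p) = parent {
            self.nodes[p.0].children.push(id);
        }
        id
    }

    fn parent(&self, id: NodeId) -> Option<NodeId> {
        self.nodes[id.0].parent
    }

    fn children(&self, id: NodeId) -> &[NodeId] {
        &self.nodes[id.0].children
    }

    fn value(&self, id: NodeId) -> &T {
        &self.nodes[id.0].value
    }
}

fn main() {
    let mut tree = Tree::new();
    let root = tree.add("html", None);
    let body = tree.add("body", Some(root));
    let p = tree.add("p", Some(body));

    // Walk both directions without any reference cycles.
    assert_eq!(tree.parent(p), Some(body));
    assert_eq!(tree.children(root).to_vec(), vec![body]);
    println!("{} is the parent of {}", tree.value(body), tree.value(p));
}
```

Every link is a plain Copy value, so the borrow checker only ever sees one owner (the Vec). The cost is a bounds check on each lookup.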

If you need to move or delete nodes a lot, then either use Rc<RefCell<T>>, or wrap the Vec in something that takes care of the admin of fixing up pointers.

E.g. see this crate https://crates.io/crates/orx-tree

If you want to get really fancy you can even do things like compacting GC. You’re basically implementing a memory allocator at that point.

Note that you might say “what’s the point of Rust if I just have to implement a memory allocator to bypass the borrow checker?” but that’s silly. You still get the benefits of Rust for the rest of your code, and Rust still prevents things like type confusion and UB.

solrize@lemmy.world · edit-2 2 months ago

Thanks, I’ll look at the box-tree crate, but I’d say using indices instead of pointers is somewhat unsatisfying. C++'s advertised property is zero-cost abstractions, and Rust is supposed to have C++'s performance without the footguns. But if you compare those vector index operations with primitive pointer dereferences, there are a bunch of extra instructions run at each operation, including a layer of memory lookups that can cause cache pressure and misses.

I think the most practical answers I’ve heard so far are to 1) use Rc/Arc and weakrefs, or 2) use unsafe operations on machine pointers in a hopefully carefully contained and abstracted part of the code.

Sorry about the slow responses as I’m occupied with RL stuff most of the time for now.

Jan :rust: :ferris:@floss.social · 2 months ago

@FizzyOrange

Wow, this crate looks like the most feature-rich tree crate I’ve ever seen!

It seems very underrated (only ~1000 downloads and one star on GitHub (by me)).

Thank you for the suggestion!😊

#Rust #RustLang #DataStructure #Tree #Algorithms

d_k_bo@feddit.org · 2 months ago

One way to do this is to use reference-counting pointers such as std::rc::Rc or std::sync::Arc. The parent node can hold a strong reference to each child node, and each child node holds a Weak reference to its parent.
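A small sketch of that layout, for illustration only. The Node type and the helper names are invented here, and the RefCell wrappers are an assumption added so the links can be set after the nodes have been created:

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Node {
    name: String,
    // Weak back link: does not keep the parent alive, so no reference cycle.
    parent: RefCell<Weak<Node>>,
    // Strong links: a parent owns its children.
    children: RefCell<Vec<Rc<Node>>>,
}

fn new_node(name: &str) -> Rc<Node> {
    Rc::new(Node {
        name: name.to_string(),
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    })
}

fn add_child(parent: &Rc<Node>, child: Rc<Node>) {
    *child.parent.borrow_mut() = Rc::downgrade(parent);
    parent.children.borrow_mut().push(child);
}

/// Name of the parent, or None if there is no parent (or it was dropped).
fn parent_name(node: &Node) -> Option<String> {
    node.parent.borrow().upgrade().map(|p| p.name.clone())
}

fn main() {
    let root = new_node("html");
    let body = new_node("body");
    add_child(&root, Rc::clone(&body));

    assert_eq!(parent_name(&body).as_deref(), Some("html"));
    // The back link is weak, so it does not add to the strong count.
    assert_eq!(Rc::strong_count(&root), 1);
    assert_eq!(Rc::weak_count(&root), 1);

    // Dropping the last strong handle frees the root; the weak link
    // in `body` then simply fails to upgrade.
    drop(root);
    assert!(parent_name(&body).is_none());
}
```

Because only the downward links are strong, dropping the root releases the whole tree, and a child that outlives its parent sees None instead of a dangling pointer.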

solrize@lemmy.world · 2 months ago

I guess that’s possible. The sibling pointers can also be weakrefs, I think. I don’t know if there are ever arbitrary links between DOM nodes rather than just the tree-like ones. But, you are onto something if creation of weakrefs safely avoids the pointer ownership rules. I’ll have to look into this further.

quilan@lemmy.world · edit-2 2 months ago

Typically for this I’ve done some wrapper type around a vector storage for the node data (a wish.com arena, if you will). Links are just some abstraction around indices.

It’s kind of annoying to write initially, but it was easier than learning unsafe stuff. To do it correctly & most efficiently though, one would likely need unsafe code.

solrize@lemmy.world · 2 months ago

Typically for this I’ve done some wrapper type around a vector storage for the node data (a wish.com arena, if you will). Links are just some abstraction around indices.

Not sure what wish.com is, but yeah, you’re describing a traditional Fortran approach. I would say machine pointers are a hardware primitive, and using indices like that is an abstraction. Anyway, the reply is appreciated. It at least tells me that I’m not missing something super obvious.

arendjr · 2 months ago

Another data structure that you can consider is the red green tree: https://willspeak.me/2021/11/24/red-green-syntax-trees-an-overview.html

We use it in Biome too, and it’s great for building trees that are immutable and yet still need frequent updates, as well as traversal in all directions. Its implementation contains quite a bit of unsafe to make it fast, though as a consumer you’re not really exposed to that.

calcopiritus@lemmy.world · 2 months ago

The safe, fast and easy way to do trees is by using Rc<RefCell<T>>. Rc/Arc allows data to have multiple owners. You want this because this way a node can be referenced by its parent and its child at the same time. However, Rc makes the inner type immutable, and you will probably want to mutate it in a tree. That’s what RefCell is for. With RefCell you do the borrow checking at run-time instead of at compile-time. This allows you to mutate T even though Rc only gives you an immutable reference. This is called interior mutability.

RefCell doesn’t eliminate the borrow checker though, you must still follow its rules. If you try to get 2 mutable references to the inner type of RefCell, it will panic.
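To make the run-time check concrete, here is a small sketch. The helper names are invented for this example; try_borrow_mut is used in place of borrow_mut so that the conflict shows up as an Err instead of a panic:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Mutates the Vec through a second Rc handle: interior mutability.
fn push_through_alias(shared: &Rc<RefCell<Vec<i32>>>, x: i32) {
    let alias = Rc::clone(shared); // second owner of the same Vec
    alias.borrow_mut().push(x);
}

/// Returns true if a second mutable borrow is refused while one is alive.
/// Calling `borrow_mut()` at that point would panic instead.
fn second_mut_borrow_is_refused(cell: &RefCell<Vec<i32>>) -> bool {
    let _first = cell.borrow_mut();
    cell.try_borrow_mut().is_err()
}

fn main() {
    let shared = Rc::new(RefCell::new(vec![1, 2]));

    push_through_alias(&shared, 3);
    assert_eq!(*shared.borrow(), vec![1, 2, 3]);

    // The aliasing rules still apply, they are just enforced at run time.
    assert!(second_mut_borrow_is_refused(&shared));

    // Once the first borrow is gone, a new mutable borrow succeeds.
    assert!(shared.try_borrow_mut().is_ok());
}
```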

I know you don’t want to read unsafe, but you gotta hear about the alternative. Just use pointers. Pointers don’t have the borrow checker to restrict them. And self-referencing structures with interior mutability are not easy to borrow-check automatically. You can have the raw pointers as private fields of the struct so the code that is actually unsafe will be a few very small functions.
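As a rough illustration of keeping the raw pointers private, here is a sketch of a tree with a built-in cursor. The design (a Tree that owns every node and exposes only a cursor) and all the names are assumptions made for this example, not something taken from an existing crate:

```rust
use std::ptr;

struct Node {
    value: String,
    parent: *mut Node,        // null for the root
    children: Vec<*mut Node>, // each one allocated with Box::into_raw
}

/// Owns every node. The raw pointers never leave this type, so callers
/// only ever see a safe API.
pub struct Tree {
    root: *mut Node,
    cursor: *mut Node, // invariant: always points at a live node of this tree
}

fn alloc(value: &str, parent: *mut Node) -> *mut Node {
    Box::into_raw(Box::new(Node {
        value: value.to_string(),
        parent,
        children: Vec::new(),
    }))
}

impl Tree {
    pub fn new(root_value: &str) -> Self {
        let root = alloc(root_value, ptr::null_mut());
        Tree { root, cursor: root }
    }

    /// Appends a child under the cursor and moves the cursor to it.
    pub fn push_child(&mut self, value: &str) {
        let child = alloc(value, self.cursor);
        // SAFETY: `cursor` points at a live node, and `&mut self` means no
        // other reference into the tree exists while we mutate it.
        let current: &mut Node = unsafe { &mut *self.cursor };
        current.children.push(child);
        self.cursor = child;
    }

    /// Moves the cursor to its parent. Returns false at the root.
    pub fn go_up(&mut self) -> bool {
        // SAFETY: `cursor` points at a live node.
        let parent = unsafe { (*self.cursor).parent };
        if parent.is_null() {
            return false;
        }
        self.cursor = parent;
        true
    }

    pub fn value(&self) -> &str {
        // SAFETY: the node lives as long as the tree, and the returned
        // reference borrows `self`, so it cannot outlive the tree.
        let current: &Node = unsafe { &*self.cursor };
        current.value.as_str()
    }

    pub fn child_count(&self) -> usize {
        // SAFETY: same reasoning as in `value`.
        let current: &Node = unsafe { &*self.cursor };
        current.children.len()
    }
}

impl Drop for Tree {
    fn drop(&mut self) {
        // Free every node exactly once, iteratively, starting at the root.
        let mut pending = vec![self.root];
        while let Some(ptr) = pending.pop() {
            // SAFETY: every pointer came from Box::into_raw and is only
            // reachable through its parent's `children` list.
            let node = unsafe { Box::from_raw(ptr) };
            pending.extend(node.children.iter().copied());
        }
    }
}

fn main() {
    let mut dom = Tree::new("html");
    dom.push_child("body");
    dom.push_child("p");
    assert_eq!(dom.value(), "p");
    assert!(dom.go_up());
    assert_eq!(dom.value(), "body");
    assert_eq!(dom.child_count(), 1);
}
```

All the unsafe blocks sit in a handful of short methods, each with a comment stating the invariant it relies on. That is what makes the code reviewable, although nothing in the compiler checks those comments.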

Here’s why the other options are worse than pointers:

Rc<RefCell<T>> will clutter your code with boilerplate and it’s a pain to deal with. Pointers are not too ergonomic in Rust (mainly because there is no -> operator), but they need way less boilerplate. Also, you already need to manually check the mutability rules, so why not all the rules?

Another option that I’ve seen is “have a hashmap with all the nodes, and just store the id of the node instead of a reference”. This is the same as “have all the nodes in a Vec and store the index”. Think about this for a second. You have a pool of memory and a number that identifies what part of that pool is the memory you want. Seen it yet? That is exactly what a pointer is! If you do that, you’re just disabling the borrow checker anyway. You just created your own memory allocator and will have to manage your memory manually. At that point just use pointers: it will be the same except with less boilerplate and indirection.

solrize@lemmy.world · 2 months ago

Thanks, that is interesting about using a refcell. It hadn’t occurred to me that a refcell was considered immutable for purposes of Rc. What about the issue of refcounting cyclic structures though? I thought the route around that was weakrefs per someone else’s suggestion.

By pointers, do you mean unsafe pointers that aren’t borrow checked? I guess that’s not really worse than writing a little bit of the program in C so it may be the right approach pragmatically. So I’m again wondering about Servo. I guess eventually Rust will get verification tools that let you check the soundness of “unsafe” code too. So the discomfort might only be temporary.

Sure, hashmap or numeric indices are abstractly equivalent to pointers, but Rust is supposed to be a low level language that can deal with machine primitives, and pointers are machine primitives. Simulating them through extra levels of tables is unsatisfying. Also, the uses of those pseudo-pointers are no longer checked by the compiler, so your program is vulnerable to many (not all) of the same old-fashioned pointer errors that Rust was supposed to rescue us from.

calcopiritus@lemmy.world · 2 months ago

RefCell is neither mutable nor immutable. It’s a type like any other. What is special about RefCell is that it has a method like:

fn borrow_mut(&self) -> RefMut<'_, T>

RefMut is a guard that dereferences to &mut T, which means you can get a mutable reference to the contents while holding only a normal (shared) reference to the RefCell.

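By pointers I mean raw pointers (*const T and *mut T). The pointers themselves are not unsafe: creating and copying them is safe code, and only dereferencing them requires an unsafe block. Otherwise they are just normal pointers like you would have in C.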

Rc can be used to generate weak refs (via Rc::downgrade), which is what you want for your tree.

I don’t know about Servo, so I can’t tell you much about it.

Don’t get your hopes up about unsafe being only temporary. It’s not practical (maybe impossible) to make a safety checker that checks the entire program. The practical way is to make safe abstractions using unsafe code. For example, Rust itself is built upon plenty of unsafe code, but it provides you with the appropriate abstractions so that what you do is safe.

In this case, you can have a bit of unsafe code in your tree, so that the users of that tree have a safe API to do something that you needed unsafe code for.

For example, one of the cases where you cannot automatically check the safety is dereferencing a raw pointer returned by an FFI function call to a dynamic library. Your automatic safety checker would need to be able to read the library’s documentation. And at that point it’s not a real safety checker because the documentation may lie or have bugs.

livingcoder · edit-2 2 months ago

One way of solving this is to structure all of your nodes into a HashMap with the node ID as the key and the node type as the value. The underlying node type could have node IDs for referencing purposes. You lose the ability to reference the parent/child/sibling directly, but you avoid direct circular dependencies. That said, now you need to manage dangling references for when the node is removed from the main HashMap collection.
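A short sketch of the HashMap variant, including the clean-up on removal. The Dom and Node types and the u32 id scheme are assumptions made up for this example:

```rust
use std::collections::HashMap;

type NodeId = u32;

struct Node {
    tag: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

#[derive(Default)]
struct Dom {
    nodes: HashMap<NodeId, Node>,
    next_id: NodeId, // ids are never reused in this sketch
}

impl Dom {
    fn insert(&mut self, tag: &str, parent: Option<NodeId>) -> NodeId {
        let id = self.next_id;
        self.next_id += 1;
        self.nodes.insert(
            id,
            Node { tag: tag.to_string(), parent, children: Vec::new() },
        );
        if let Some(p) = parent.and_then(|p| self.nodes.get_mut(&p)) {
            p.children.push(id);
        }
        id
    }

    /// Removes a node together with its subtree, and unlinks it from its
    /// parent so that no stale id stays behind inside the map.
    fn remove(&mut self, id: NodeId) {
        let Some(node) = self.nodes.remove(&id) else { return };
        if let Some(parent) = node.parent.and_then(|p| self.nodes.get_mut(&p)) {
            parent.children.retain(|&c| c != id);
        }
        let mut pending = node.children;
        while let Some(child) = pending.pop() {
            if let Some(n) = self.nodes.remove(&child) {
                pending.extend(n.children);
            }
        }
    }

    /// Lookup by id. A stale id yields None rather than undefined behaviour.
    fn tag(&self, id: NodeId) -> Option<&str> {
        self.nodes.get(&id).map(|n| n.tag.as_str())
    }
}

fn main() {
    let mut dom = Dom::default();
    let html = dom.insert("html", None);
    let body = dom.insert("body", Some(html));
    let p = dom.insert("p", Some(body));

    dom.remove(body);

    // The caller still holds `p`, which is now dangling in the logical
    // sense, but using it is a recoverable None.
    assert_eq!(dom.tag(p), None);
    assert_eq!(dom.tag(html), Some("html"));
    assert!(dom.nodes[&html].children.is_empty());
}
```

Ids held outside the map can still go stale, which is the dangling-reference bookkeeping mentioned above. The difference from a raw pointer is that a stale id fails a lookup instead of reading freed memory.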

solrize@lemmy.world · 2 months ago

Thanks, this sounds like a pretty serious abstraction inversion, exposes you to various kinds of bugs such as memory leaks, and gives a performance hit from all the hashing compared with using machine pointers, but I guess it does get rid of the circular references. Is it really the idiomatic answer? I’m ok if it is, but it’s not what I’d have expected from a low level language designed to replace C.

livingcoder · 2 months ago

There are a few different ways to solve this problem without using unsafe and I’m not sure what would be considered idiomatic. Another option is to ultimately encapsulate all of the nodes in a reference-counted box like Rc or Arc and specify the type of your parent/child/sibling references to the same reference-counted box type. With that, you just share cloned instances around as needed.

The primary drawback here is that for mutable access you end up having to perform a lock every time on an underlying Mutex (or something similar). You also no longer have direct access to the singular instance of the node.
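For illustration, a sketch of the thread-safe flavour with Arc and Mutex. The names are hypothetical, and the parent link is a Weak here (an assumption that departs slightly from sharing plain clones everywhere) so the tree does not leak through a strong cycle:

```rust
use std::sync::{Arc, Mutex, Weak};
use std::thread;

struct Node {
    value: i32,
    parent: Weak<Mutex<Node>>,
    children: Vec<Arc<Mutex<Node>>>,
}

type Link = Arc<Mutex<Node>>;

fn new_node(value: i32) -> Link {
    Arc::new(Mutex::new(Node {
        value,
        parent: Weak::new(),
        children: Vec::new(),
    }))
}

fn add_child(parent: &Link, child: &Link) {
    // Each lock is taken and released within its own statement, so the
    // two mutexes are never held at the same time.
    child.lock().unwrap().parent = Arc::downgrade(parent);
    parent.lock().unwrap().children.push(Arc::clone(child));
}

/// Every mutation goes through a lock, even from another thread.
fn add_from_other_thread(node: &Link, by: i32) {
    let node = Arc::clone(node);
    thread::spawn(move || {
        node.lock().unwrap().value += by;
    })
    .join()
    .unwrap();
}

fn parent_value(node: &Link) -> Option<i32> {
    // The child's lock is released at the end of this statement.
    let parent = node.lock().unwrap().parent.upgrade()?;
    let value = parent.lock().unwrap().value;
    Some(value)
}

fn main() {
    let root = new_node(0);
    let child = new_node(1);
    add_child(&root, &child);

    add_from_other_thread(&child, 41);

    assert_eq!(child.lock().unwrap().value, 42);
    assert_eq!(parent_value(&child), Some(0));
}
```

The lock-per-access cost is visible in every function. In single-threaded code, Rc<RefCell<T>> has the same shape with a cheaper run-time check.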

There are pros and cons to every approach.

solrize@lemmy.world · 2 months ago

Thanks. I might ask on irc how Servo does it. Given Servo’s connection with Rust, it would be surprising for this to be too awkward.

livingcoder · 2 months ago

Sounds good. Please share what you find and what you end up going with.

solrize@lemmy.world · 2 months ago

I’m not trying to code anything right now. I’m just trying to understand Rust itself better, so I asked how its users deal with a particular problem. Thanks!

milicent_bystandr@lemm.ee · 2 months ago

One note about doubly linked lists - I believe the received wisdom these days is that in almost all situations they’re a bad algorithm to solve your problem. (I watched a video about it sometime… And read something in the Rust docs… I’m not an expert!)

I see other comments mentioning a hash map… Hopefully you’ve got some good options.

solrize@lemmy.world · 2 months ago

Linked lists in general are cache unfriendly, though it helps if you have a relocating GC that puts the nodes back in order. The hash map idea is a possibility. I’m not actually trying to implement something like this. It’s more of a question about Rust’s approach from a PL perspective.

badmin@lemm.ee · 2 months ago

Check out the crate indexmap. Check its dependants too if you want to see how higher level abstractions can be built utilizing it.

solrize@lemmy.world · 2 months ago

Thanks, I will check that. It sounds interesting.

BB_C · 2 months ago

A doubly linked list is one of std’s collections. And safety in Rust is built on top of unsafety, because there is no way around that.

Did you try to actually look up literally anything before asking?! Because simply checking out std::collections docs would have given you some answers.

solrize@lemmy.world · edit-2 2 months ago

I don’t see anything in the std::collections doc relevant to this question. Yes I knew about the presence of doubly linked lists in that library. Obviously you could implement the DOM structure unsafely with C-style pointers but then what are you getting from using Rust instead of C? So I hoped to hear how to do it safely, maybe using a library object as a building block. But, I don’t think the stuff in std::collections suffices for the purpose.

Rust safety isn’t necessarily built on unsafety, though apparently it is in the case of doubly linked lists. I think singly linked lists can be made safely in Rust, similar to using C++ std::unique_ptr. It’s possible, though very complicated, to eliminate all the unsafety everywhere. Rust doesn’t attempt this, but (for example) ATS does (ats-lang.org).
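For what it is worth, the singly linked case can be sketched without any unsafe, with Box playing the role of std::unique_ptr. The Stack and Node names are made up for this example:

```rust
struct Node<T> {
    value: T,
    // Each node uniquely owns the next one, like a unique_ptr chain.
    next: Option<Box<Node<T>>>,
}

struct Stack<T> {
    head: Option<Box<Node<T>>>,
    len: usize,
}

impl<T> Stack<T> {
    fn new() -> Self {
        Stack { head: None, len: 0 }
    }

    fn push(&mut self, value: T) {
        // `take` moves the old head out and leaves None behind, so
        // ownership is never duplicated.
        let old_head = self.head.take();
        self.head = Some(Box::new(Node { value, next: old_head }));
        self.len += 1;
    }

    fn pop(&mut self) -> Option<T> {
        self.head.take().map(|node| {
            self.head = node.next;
            self.len -= 1;
            node.value
        })
    }

    fn peek(&self) -> Option<&T> {
        self.head.as_ref().map(|node| &node.value)
    }
}

impl<T> Drop for Stack<T> {
    fn drop(&mut self) {
        // Unlink nodes one at a time so that dropping a very long list
        // does not recurse once per node.
        let mut current = self.head.take();
        while let Some(mut node) = current {
            current = node.next.take();
        }
    }
}

fn main() {
    let mut stack = Stack::new();
    stack.push(1);
    stack.push(2);
    stack.push(3);
    assert_eq!(stack.peek(), Some(&3));
    assert_eq!(stack.pop(), Some(3));
    assert_eq!(stack.len, 2);
}
```

This works because every node has exactly one owner. A back pointer would give each node a second incoming link, which is where Weak, indices or raw pointers come in.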

Anyway your scolding is unhelpful. If you’ve got a usable and clear answer to my question I’d be happy to try to digest it. Otherwise I don’t see any point in continuing to respond.