This gets even trickier when it comes to serialization and how that integrates with the expectations and understanding of the group of ppl the format is designed for.

Neuroscientists (and probably most scientists and normal computer-using people) deal with files. They don't usually deal with abstract types that can be realized equivalently as a database entry, a JSON object, or whatever other kind of gobbledygook the computer priests deal with. So #NWB committed early to #HDF5 as a means of storing complex datasets in a single file so it would be easy to share.

This ends up being pretty deeply reflected in the schema itself, so the abstract representation of the data is tightly coupled to the storage medium and file format - not trivial to untangle.

The natural problem is that now the files are big because they store lots and lots of data in them, and so checking some scalar metadata value in the header requires me to download this huge 16GB blob. So external file references are added: instead of storing everything in the HDF5 file, some terminal dataset object is replaced with a relative path.
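To make the relative path idea concrete, here's a rough sketch in plain Python. It is not NWB or HDF5 code: the `header` dict, its keys, and the file paths are all made up for illustration. It just shows that the small metadata becomes cheap to read, while the external reference only means something relative to wherever the referring file happens to live.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in for a small "header" file: scalar metadata stays
# inline, and the bulky dataset is replaced by a relative path.
header = {
    "session_id": "example-001",
    "sampling_rate_hz": 30000.0,
    "raw_data": {"external": "data/raw_timeseries.bin"},
}


def resolve_external(header_path, entry):
    """Resolve an external reference relative to the file that contains it."""
    return PurePosixPath(header_path).parent / entry["external"]


# Reading a scalar no longer requires touching the big blob.
rate = header["sampling_rate_hz"]

# The same reference points somewhere else once the header file moves.
original = resolve_external("/lab/session1/header.json", header["raw_data"])
moved = resolve_external("/downloads/header.json", header["raw_data"])
```

If I only download the header, `moved` points at a file that was never copied along with it, which is the fragility of relative references in a nutshell.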

Additionally, since files are the unit of exchange, there's no means of sharing objects across files - so say I have a huge time series that gets analyzed a zillion different ways. I can refer to the top-level file object, but I can't index within that. It's the same problem, structurally, as the anonymous subobject problem, except now the best I can do is assume the file is in a fixed relative position to my referring file, or a globally absolute position (i.e. a URI).
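A sketch of what "index within that" could look like, assuming (and this is purely my assumption here, not something NWB or HDF5 defines) that a reference is a URI whose fragment names an object path inside the file. The URLs and the `/acquisition/lfp` path are hypothetical; only the standard library's `urljoin` and `urldefrag` are real.

```python
from urllib.parse import urljoin, urldefrag


def resolve_ref(base_uri, ref):
    """Split a reference into (file location, object path inside the file).

    `ref` may be relative to the referring file or a globally absolute URI.
    Treating the fragment as an in-file object path is an assumed convention.
    """
    absolute = urljoin(base_uri, ref)
    location, object_path = urldefrag(absolute)
    return location, object_path


referring_file = "https://example.org/lab/session1/analysis.nwb"

# Relative: only works if the target stays in a fixed position next to us.
relative = resolve_ref(referring_file, "../shared/timeseries.nwb#/acquisition/lfp")

# Absolute: works from anywhere, as long as the URI keeps resolving.
absolute = resolve_ref(referring_file, "https://other.example/data.nwb#/acquisition/lfp")
```

Either way the reference is two things at once: where the file is in the world, and where the object is within the file.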

So we end up in the same place as RDF again. At the root of a lot of these problems is the ability to link and reference, and you get there sooner or later once your format has to contend with very real complexities of use. We could then introduce related problems like schema extensions - "where" is my colleague's schema extension, how can I reuse it? - that we mostly paper over by e.g. assuming that GitHub will exist forever unchanged.

So anyway, yet another reason why a p2p system of language-like references and local systems of meaning is necessary for a lot of interesting digital infrastructure, and why it pops up in perhaps unexpected places. The problem has reflections at a global indexing level (where is this thing in the world) and also in very local, format-level problems (where is this object within my system of information). The tension of "unnamed but identifiable" is something we solve in language with context and norms, but we only have very crude analogies in data systems.

Dr. jonny phd on Nostr: This gets even trickier when it comes to serialization and how that integrates with ...