Improved ways of working with tabular data #1996

cmungall · 2024-03-20T02:33:30Z

cmungall
Mar 20, 2024
Maintainer

If we assume a highly limited profile of LinkML:

no multivalued
a single class
all ranges are literals/types

Then CSV/TSV loading is mostly trivial, because the CSV is isomorphic to the underlying objects. In practical terms we can feed the dict into python csvreader/writer and it will do the right thing.

Of course, even with this limited subset, there are multiple annoying edge cases due to lack of adhered to standards around CSVs, in particular

conventions for missing data (Dealing with missing data #1994)
escaping of newlines inside values

When we start to extend the profile, it gets harder to have predictable behavior. Even something as simple as allowing multivalued has annoying edge cases. We can pick an internal delimiter, such as |. This mostly works, modulo the missing data case above, and also assuming sensible escaping rules. There are edge cases where we might want a range to be an any_of multivalued and single valued, and there is no way to distinguish these cases.

any_of is problematic in general - consider trying to distinguish "1" from 1 in a CSV.

Things get harder when we start to allow ranges to be classes; sometimes there are "obvious" ways to flatten this but usually not.

Another approach with multivalued is to use the relmodel_transformer, which introduces new linking tables, but here let's assume that by "tabular format" we mean wide table denormalized where all observations fit into one table rather than linked relational tables.

For modern tabular databases and formats, there is no problem

jsonl
parquet
arrow
duckdb
pickled data frames

allow for a richer tabular profiles, including mutlivalued and inlined object references, with clear non-YOLO distinctions between different base types.

An orthogonal concern is that there is a lack of standards for dataset-level metadata. A sensible paradigm is to follow sssom and include a schema-controlled yaml block in the header, but everyone does this differently.

But if we are limited to CSV/TSV, one approach is to treat this as a transformation problem. Different schemas can define different transforms for flattening data. See https://github.com/linkml/linkml-transformer. This is the most principled approach. However, it may be overkill for cases when the transformation is "obvious" (e.g. use a pipe to separate multivalued). The problem is when everyone's obvious transforms come together it can be hard to reason over the combination.

The current approach with the tsv loaders and dumpers is to try and do the "obvious" transforms, and it mostly works. It uses this under the hood: https://github.com/cmungall/json-flattener -- it will do things like use json serialization for nested objects, etc.

Can we do better than this? There is an argument for having a separate library just for tabular data, where we could escape from the "ismorphism assumption" underpinning the current runtime loaders/dumpers. This could also have a plugin architecture that would make it easy for people to use other tabular/columnar formats like parquet, dask, etc. It would allow for a certain amount of flexibility without requiring the use of linkml-transformers, with some reasonable defaults.

We may want to look into using flatten-tool as a drop in replacement for json-flattener (see #1031). This is schema guided and uses simple json-schema profiles to guide behavior.

turbomam · 2024-06-26T14:51:51Z

turbomam
Jun 26, 2024
Collaborator

I jut wrote a some really crude Makefile targets for Kai Blumberg's Minimal Information about Food Components repo. They convert TSV example data files to YAML, which are then included in the standard examples validation step. It makes assumptions about the names of the TSV files, among other things.

At this time, all slots in all classes (besides the Container class) use atomic type or enum ranges. This is the same pattern that MIxS follows. MIFC doesn't use mutlivalued slots outside of Container but MIxS does.

I really want to work towards very fast and easy validation of tabular data in these projects, because I believe that's when the utility of the schemas will become most apparent to end users.

0 replies

turbomam · 2024-06-28T21:10:25Z

turbomam
Jun 28, 2024
Collaborator

cmungall
Aug 16, 2024
Maintainer Author

FWIW, LinkML-store is using its own approach for unifying loading and parsing of frame-shaped data
https://github.com/linkml/linkml-store/blob/main/src/linkml_store/utils/format_utils.py

Although this would not have been necessary had we merged @sneakers-the-rat PR in earlier: linkml/linkml-runtime#305 (which provides a raw dict iterator for the different loaders)

We may also want to look at https://ibis-project.org/ suggested by a linkml community member, which provides a unified interface onto multiple backends.

0 replies

sneakers-the-rat · 2024-08-16T03:16:19Z

sneakers-the-rat
Aug 16, 2024
Collaborator

i would be interested in this, i am doing my own hack on top of the pydantic generator to treat a special-cased class in the nwb schema as tables: https://github.com/p2p-ld/nwb-linkml/blob/ff77f0a2b8c95a46ee03ef81e35d8dd699feef15/nwb_linkml/src/nwb_linkml/includes/hdmf.py

and would way rather have a way of indicating that a given class expects tabular data, which as far as i can tell would just be a way of expressing that each of its slots/attributes are multivalued and have the same length. is there an elegant way we can think of expressing that? or should we just ducktype it?

The basic problem (i think) is that there are two orientations that tabular data could be modeled in, a class with slots s.t. each slot is multivalued and becomes a column, or a class with non-multivalued slots and each class instance is a row. In general I prefer the latter, but for the sake of expressiveness and compatibility with other schemas the former should be possible as well.

0 replies

turbomam · 2025-03-28T20:29:38Z

turbomam
Mar 28, 2025
Collaborator

2025-03-28 LinkML developer's meeting

Ideally this would be maximally informed by the schema, as opposed to lots of command line or fields in a config file
Different version of JSON schema have different options about truthiness (like 'yes')
Might want to move away from JSON flattener
- Old linkml-validate did use JSON flattener. New validator does not.
Should we expect that the CSV is provided with SSSOM-like headers including a specification of delimiter, quotation rules, etc.

0 replies

tfliss · 2025-04-01T01:06:35Z

tfliss
Apr 1, 2025
Collaborator

This is an informative discussion, thank you @turbomam for pointing it out. I will try to restate some of what you, @cmungall and @sneakers-the-rat say above along with some thoughts from the perspective of working on the Pandera generator.

I agree that modern tabular databases/dataframes are not a problem (compared to CSV) in the sense that they are strongly typed and operations are well defined (executable). Conversion to/from CSV may best be treated as a read of simple string columns and then casting operations based on the model. I do believe multivalued and even nesting or association can also be handled by other transformations (split, unnest). I briefly looked at nesting and multiples in the context of MIxS containers with good results. This collab notebook Gist shows simple manual transforms on the MIxS example between the table form(s) from loading JSON and CSV. I do also have a LinkML Pandera validator running on these files in code that is not yet committed.

The multiple tabular representations @sneakers-the-rat speaks of correctly gets to a core issue. Though even in this short gist, there are more than two tabular representations, though they are transforms of each other. Any other popular database table arrangement would be another example, and there are contextual design decisions in favor of any one of them. It may be necessary to be able to represent many/all of them.

Having the configuration be part of the LinkML model sounds correct (as it is part of the dataframe schema), however that may lead to a proliferation of slightly different schemas. Also fine as long as transforming between the models forms is easy to do. I need to read more about LinkML transformers that @cmungall linked to. I would expect most dataframe operations could map to an analogous operation on a LinkML model.

These considerations will come up for the next stage of the Pandera generator for class ranges. I'm interested in expanding the example in the gist to cover the most needed cases.

1 reply

tfliss Apr 3, 2025
Collaborator

I'm also looking at an apparent 'impedance mismatch' between tabular schemas and the inlined-as-simple-dict convention.The dataframe libraries tend to prefer the more strongly typed list[struct] form and don't have a Union type that I'm aware of. The built in loaders and constructors also struggle with the convention. I have Pandera code that is passing the main compliance for inlined-as-simple-dict by removing the convention after loading data using a generic Object type. However I'll also circle back to check how linkml-store and flatten-tool approach this.

Linked data Modeling Language

Improved ways of working with tabular data #1996

Uh oh!

Uh oh!

cmungall Mar 20, 2024 Maintainer

Replies: 6 comments · 1 reply

Uh oh!

turbomam Jun 26, 2024 Collaborator

Uh oh!

turbomam Jun 28, 2024 Collaborator

Uh oh!

Uh oh!

cmungall Aug 16, 2024 Maintainer Author

Uh oh!

sneakers-the-rat Aug 16, 2024 Collaborator

Uh oh!

Uh oh!

turbomam Mar 28, 2025 Collaborator

Uh oh!

tfliss Apr 1, 2025 Collaborator

Uh oh!

tfliss Apr 3, 2025 Collaborator

cmungall
Mar 20, 2024
Maintainer

Replies: 6 comments 1 reply

turbomam
Jun 26, 2024
Collaborator

turbomam
Jun 28, 2024
Collaborator

cmungall
Aug 16, 2024
Maintainer Author

sneakers-the-rat
Aug 16, 2024
Collaborator

turbomam
Mar 28, 2025
Collaborator

tfliss
Apr 1, 2025
Collaborator

tfliss Apr 3, 2025
Collaborator