
Task label feature refined #983

Open
ottointhesky wants to merge 29 commits into ipython:main from ottointhesky:task_label_feature

Conversation

@ottointhesky
Contributor

This merge request will contain small improvements to the documentation and unit tests of the label feature. For now, only unit tests were added, which check submitted labels with the dictDB and sqliteDB backends.

A couple of days ago I realized that label is written to the DB twice, which is maybe unwanted (since it wastes resources):

[image]

We need label as an explicit column to make it queryable. Hence, we could remove the entry from the metadata before writing the record to the database. This can be handled centrally. However, retrieving a record then requires re-adding the label to the metadata, which makes everything more complicated since it requires specific handling for the different DB backends. So is it worth the effort, given that label will be empty for most users anyway? Probably not...

As mentioned earlier, we also need the possibility to find records based on substrings within DB columns. In MongoDB syntax this can be achieved using regex. E.g.
{'label': {'$regex': 'my'}}
would find any record whose label contains the string my (at any position). This would require a new comparison operator ($regex) for the filter definition in ipp. Supporting this operator in dictDB shouldn't be too difficult, but for sqlite only a strongly reduced regex definition could be supported via LIKE. SQL LIKE basically only supports wildcards (single- and multi-character matching), so only a regex containing nothing but ^ $ . .* .+ could be translated. Anything else isn't possible.

So the question here is: should we extend the supported operators by $regex, or should we go a different/new way by adding the possibility of passing backend-specific filter objects (e.g. a lambda object for dictDB and where clauses for sqliteDB)? If you are thinking of dropping support for mongoDB, the second option might be more appealing. If you do not want to drop support for mongoDB yet, I suggest that we add a MongoDB installation to the GitHub Actions. Using the following action script, this shouldn't be too difficult. No matter which way you want to go, I'm happy to provide the necessary implementation...
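To make the dictDB side of the $regex idea concrete, here is a minimal sketch. It assumes a plain in-memory list of record dicts and a hypothetical operator table (`OPERATORS`, `matches`); none of these names are taken from the ipyparallel code base. It uses `re.search`, so the pattern may match at any position, as described above.

```python
import re

# Hypothetical operator table for an in-memory backend (not the ipyparallel one).
OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    # Unanchored search: the pattern may match anywhere in the string.
    "$regex": lambda value, arg: isinstance(value, str)
    and re.search(arg, value) is not None,
}


def matches(record, query):
    """Return True if the record satisfies every condition in the query dict."""
    for key, condition in query.items():
        value = record.get(key)
        if isinstance(condition, dict):
            # Operator form, e.g. {'$regex': 'my'}
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Plain value form means strict equality
            return False
    return True


records = [
    {"msg_id": "1", "label": "my-run"},
    {"msg_id": "2", "label": "dummy"},
    {"msg_id": "3", "label": None},
    {"msg_id": "4", "label": "other"},
]

found = [r["msg_id"] for r in records if matches(r, {"label": {"$regex": "my"}})]
```

Note that "dummy" also matches here, because the substring my appears inside it.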

@minrk
Member

minrk commented Feb 12, 2026

I don't think we need to worry about the cost of writing the label twice to make it queryable. It's quite small compared to anything else, so the impact will be negligible.

I don't imagine full regex search is going to be that useful, since users would only craft the labels specifically to make them searchable, I imagine wildcard matching is plenty.

If you wanted to put some time into testing mongodb, that would be super appreciated! If it takes too much of your time, just say so, and we can probably drop it.

@ottointhesky
Contributor Author

> I don't think we need to worry about the cost of writing the label twice to make it queryable. It's quite small compared to anything else, so the impact will be negligible.

Ok & thanks. I just wanted to double check with you...

> If you wanted to put some time into testing mongodb, that would be super appreciated! If it takes too much of your time, just say so, and we can probably drop it.

As presumed, adding mongodb to the GitHub tests was easy. supercharge/mongodb-github-action only works for Linux containers, but that's definitely better than no test. I also switched to the pymongo 4.x API and now raise an exception if the pymongo version is below 4.

> I don't imagine full regex search is going to be that useful, since users would only craft the labels specifically to make them searchable, I imagine wildcard matching is plenty.

Agreed, but what should wildcard matching look like in Python code? So far the query object syntax is defined by mongodb (query objects are passed to mongodb untouched) and there is no wildcard syntax there. If we come up with something new, e.g. based on SQL LIKE

{'label': {'$like': '%my%'}}

query objects would also need preprocessing for mongodb, which is currently NOT the case. Which direction should we go?

@ottointhesky
Contributor Author

FYI: for whatever reason, the mongodb container seems to interfere with the slurm container. Sometimes it works, but most of the time it doesn't. Deactivating mongodb via if for the slurm test doesn't seem to work. Hopefully I can find a solution to this problem...

@ottointhesky
Contributor Author

> FYI: for whatever reason, the mongodb container seems to interfere with the slurm container. Sometimes it works, but most of the time it doesn't. Deactivating mongodb via if for the slurm test doesn't seem to work. Hopefully I can find a solution to this problem...

The error message

image "docker.io/library/ipp-cluster:slurm": already exists

seems to be a race condition in docker build. With a little bit of research I discovered this bug report. Calling the actual build command a second time (on error) seems to resolve the problem. Once the problem in docker is fixed, the double call can be removed.

@ottointhesky
Contributor Author

  • I have updated the documentation regarding label
  • the GitHub Actions now include all mongodb tests
  • the label tests now also run with mongodb
  • I have added label support for View.execute and View.run

From my point of view only one question remains:
How do we proceed with the wildcard matching (for labels)?

@ottointhesky
Contributor Author

ottointhesky commented Feb 23, 2026

By the way, today I realized that the broadcast view does not write any entries to the hub db. Is this on purpose?

Again, reading the documentation on Broadcast View helps to understand its idea/concept :-) Since the Broadcast View is tuned for efficiency (and no task entries are written to the hub db), it doesn't make much sense to support labels for this view. Do you agree?

@minrk
Member

minrk commented Feb 24, 2026

That's great!

> How do we proceed with the wildcard matching (for labels)?

Both do seem to have similar definitions with different symbols:

| meaning | sql LIKE | mongodb wildcard | python fnmatch |
|---|---|---|---|
| 0-to-many | % | * | * |
| exactly one | _ | ? | ? |

so I think it makes sense to use the fnmatch-style, as the Python-native form, which needs no modification for dictdb or mongodb, and use

pattern.replace('*', '%').replace('?', '_')

in the sql backend. Does that sound reasonable?
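As a minimal sketch of that substitution against an in-memory SQLite table (the helper name `fnmatch_to_like` and the `tasks` table layout are hypothetical, chosen only for illustration):

```python
import sqlite3


def fnmatch_to_like(pattern):
    """Naive two-character substitution; literal % or _ are NOT escaped."""
    return pattern.replace('*', '%').replace('?', '_')


conn = sqlite3.connect(':memory:')
# Hypothetical table layout, only for demonstration.
conn.execute("CREATE TABLE tasks (msg_id TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("a", "my-label"), ("b", "other"), ("c", "not-my-task")],
)

rows = conn.execute(
    "SELECT msg_id FROM tasks WHERE label LIKE ? ORDER BY msg_id",
    (fnmatch_to_like('*my*'),),
).fetchall()
matched = [row[0] for row in rows]
```

One caveat with this approach: SQLite's LIKE is case-insensitive for ASCII characters by default, whereas `fnmatch.fnmatchcase` is case-sensitive, so the backends can disagree on labels that differ only in case.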

> I realized that the broadcast view does not write any entries to the hub db. Is this on purpose?

I don't think it is, but you don't need to fix that here, it might be complicated. We can open an Issue for it.

@ottointhesky
Contributor Author

Thanks for your input, but I fear it’s not that easy. Mongodb supports wildcard matching, but not in the way the ipp-mongodb client accesses entries in the DB. The code uses find to query the corresponding entries:

matches = list(self._records.find(check, keys))

and only the following operators are supported there. The wildcard operator only works with aggregation (which is a different query concept), and I wasn't able to get it to work with find in my local mongodb installation. Hence, it cannot be directly integrated into the current code concept (at least to my understanding, but I’m definitely not a mongodb expert).

So, I think we are back to my original suggestions:

  1. Introducing a new $like operator which is translated to regular expressions for dictdb and mongodb. I already have some code that can handle this, even supporting definable escape characters as is possible with SQL LIKE.
  2. Alternatively, a $wildcard operator could be introduced using ? and * for matching, which is maybe more commonly known.
  3. Preserving the original concept of strictly sticking to the mongodb syntax and using regular expressions ($regex). This is straightforward for dictdb, and for sqlite we could support regexes that only contain ^ $ . .* .+ (or filter entries within Python after the sqlite query).
  4. Passing native db dependent query objects to the db backend (new function or additional parameter)

Of course, it is also possible to implement multiple solutions.
May I ask you again which way you want to go (or do you have a better solution)? If you are unsure it’s maybe better to merge and close this pull request and I will create a new one…

@minrk
Member

minrk commented Mar 2, 2026

In that case, I think we can say that wildcard matches aren't supported in mongodb, only dictdb/sqlitedb. If someone ever comes wanting to add support in the mongo backend, we can do it, but no need to put in the work now.

So let's use:

  • fnmatch syntax as input
  • use fnmatch module in dictdb
  • two-character substitution for sql LIKE
  • only support strict equality in mongodb

How does that sound?

@ottointhesky
Contributor Author

Thanks for your comments!
Sounds good to me. What should the query object look like? E.g.

{'label': {'$fnmatch' : '*my?'}}

Translation to sql is straightforward (as you suggested). Even mongodb support could be added easily by using fnmatch.translate to convert the wildcards to a regular expression. But if you do not want me to touch the mongodb code, I will postpone it...
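A minimal sketch of both ideas, using only the standard library. The helper names (`glob_filter`, `glob_to_mongo_query`) and the record layout are hypothetical. Whether MongoDB's regex engine accepts the exact output of `fnmatch.translate` is an assumption that is not verified here; the sketch only checks the generated regex with Python's own `re` module.

```python
import fnmatch
import re


def glob_filter(records, key, pattern):
    """dictdb-style filtering: case-sensitive fnmatch on one string field."""
    return [
        r for r in records
        if isinstance(r.get(key), str) and fnmatch.fnmatchcase(r[key], pattern)
    ]


def glob_to_mongo_query(key, pattern):
    """Build a $regex query from a glob (assumes the server accepts this regex)."""
    return {key: {"$regex": fnmatch.translate(pattern)}}


records = [
    {"msg_id": "1", "label": "my1"},
    {"msg_id": "2", "label": "xmy2"},
    {"msg_id": "3", "label": "my"},
    {"msg_id": "4", "label": "myab"},
    {"msg_id": "5", "label": None},
]

glob_ids = [r["msg_id"] for r in glob_filter(records, "label", "*my?")]

# Check the translated regex locally with Python's re module.
regex = glob_to_mongo_query("label", "*my?")["label"]["$regex"]
regex_ids = [
    r["msg_id"] for r in records
    if isinstance(r["label"], str) and re.match(regex, r["label"])
]
```

Both lists contain the same records here: the pattern requires my followed by exactly one character at the end of the label, so "my" and "myab" do not match.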

So main question here: Do you like the naming of the new wildcard operator $fnmatch?

@minrk
Member

minrk commented Mar 3, 2026

The structure looks great!

Naming things is hard, but I wouldn't pick the Python module name, I'd pick a more generic word like 'match' or what these patterns are, which are often called 'globs'. Naming discussions can get into the weeds, so I'll suggest you pick from this list and not go back and forth too much:

  • $glob - specific, refers to this kind of matching we are doing so indicates syntax for folks who know the name, but a bit jargony
  • $wildcard - like glob, but a bit more generic
  • $match - generic, might imply regex
  • $like - generic, but references the sql function we use

If the regex fits easily, feel free to implement the mongo one. I only didn't want to require it for you to be able to finish the feature, but by all means feel free if it's not a problem to implement the same semantics across the board.

@ottointhesky
Contributor Author

Thanks again for your suggestions.

I have two favourites: $glob and $wildcard. Maybe the right candidate emerges by itself if we consider escaping (or not), which we haven’t discussed yet.
I wasn’t aware that glob supports not only * and ? but also character sequence matching if wrapped in []. This also allows escaping ? or *. Hence, if we use glob/fnmatch (operator name -> $glob) as the basis for the new operator, the sql translation should also understand [] expressions, at least for escaping the meta-characters. Unfortunately, there is no equivalent for any-character-in-sequence matching in the sql like syntax. Hence, we could translate such an expression to an any-single-character match or throw an exception.

If we create our own wildcard operator (operator name -> $wildcard), we could limit its functionality to * and ? without any escape character support. Literals such as *, ?, _, %, [ and ] should then not be used in the relevant db columns or in the wildcard pattern, so that wildcard matching works consistently across all db classes. If we want to ensure consistent behaviour, at least wildcard pattern checking is needed in all three db classes. Another option: only document it but do not check it...

So in short
$glob:

  • Easy and rigorous support for dictdb and mongodb (via fnmatch.translate as regex)
  • Limited syntax support for sqlite (translation to like syntax more complicated)
  • Escape character support
  • Known and standardized behaviour

$wildcard:

  • Simple implementation without pattern checking
  • With rigorous pattern checking, maybe more complicated than the $glob implementation
  • Easy support for dictdb, mongodb and sqlite
  • No escape character support
  • Without rigorous pattern checking, different results may be returned by different db backends (maybe not so relevant for users, since one typically stays with one db)

I would go for $glob since it’s more rigorous and has a (hopefully) 100% predictable behaviour. What do you think?
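To illustrate what the more involved $glob to LIKE translation could look like, here is a minimal sketch. It assumes the "throw an exception" variant: a single-character class such as [*] or [?] is treated as an escaped literal, and any other [] expression raises. The function name `glob_to_like` and the `tasks` table are hypothetical.

```python
import sqlite3


def glob_to_like(pattern, escape="\\"):
    """Translate a glob to a LIKE pattern; reject multi-character [] classes."""

    def literal(ch):
        # Characters that are special in LIKE must be escaped when meant literally.
        return escape + ch if ch in ("%", "_", escape) else ch

    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "*":
            out.append("%")
        elif c == "?":
            out.append("_")
        elif c == "[":
            end = pattern.find("]", i + 2)
            if end == -1:
                out.append("[")  # no closing bracket: treat '[' as a literal
            elif end == i + 2 and pattern[i + 1] != "!":
                out.append(literal(pattern[i + 1]))  # e.g. [*] -> literal *
                i = end
            else:
                raise ValueError(f"unsupported character class in {pattern!r}")
        else:
            out.append(literal(c))
        i += 1
    return "".join(out)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (label TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO tasks VALUES (?)", [("50%_done",), ("50x_done",), ("50%",)]
)
rows = conn.execute(
    "SELECT label FROM tasks WHERE label LIKE ? ESCAPE '\\' ORDER BY label",
    (glob_to_like("50%*"),),
).fetchall()
matched = [row[0] for row in rows]
```

The explicit ESCAPE clause is what lets a literal % or _ in a label survive the translation, which the plain two-character substitution cannot do.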

@minrk
Member

minrk commented Mar 4, 2026

Great, let's do $glob. For ~100% of cases, all glob means to people is * support, so that seems totally fine to me. No need to go to too much trouble.
