ICAT Collaboration Meeting - 26th October 2023

Attendance

Attendees:

Salman Matalgah SESAME
Andy Gotz
Louise Davies
Alex de Maria
Alexander Kemp
Patrick Austin
Malik AlMohammad
Rodrigo Cabezas
Allan Pinto
Viktor Bozhin
Marjolaine Bodin

Apologies:

Alejandra Gonzalez-Beltran
Rolf Krahl

Site Updates

Sesame:

Recent visit to ESRF Plan to start with beamlines to use ICAT, browse and download data. Document issues from beamlines. Installed the h5web browser (hibou) as a service in the data portal. So far, so good. Intending to produce best practice documentation, suitable for new starters. Intending to produce paper for Sesame's use case. Ingest raw data with Python code, data acquisition already has a directory structure. Want to make it more performant. 20TB in 16 minutes, which is OK.

Q: Have you installed all parts of ICAT+? Including logbook? Not focused on the logbook yet. Focused on delivery of the service first. Want to use all the features eventually

ESRF:

Been busy, partly with Sesame's visit Trying to replace ispyb with ICAT

Single production beamline ingesting processed data into ICAT. Working pretty good. Adding a second beamline soon.
Users can visualise processed data in DataHub
Can send jobs to the cluster to reprocess data
Will present interface to ispyb community @ end of Nov

New data policy agreed by directors - processed data as a "first class citizen"

Interest from ICALEPCS - AI of logbooks, ask "what went wrong" and would give answer based on log history.

SLides from ICALEPCS talk are available here: https://cloud.esrf.fr/s/7AM52CepXWH8HSE

ICALEPCS paper has been edited and accepted.

An internal seminar on activities around data held last week can be viewed here:

https://cloud.esrf.fr/s/b7MtBXoKDHkcFR8

HZB:

No one present to give update

ALBA:

Deployed in production, IT making security checks, on the firewall, before going public. Using Keycloak for authn. Requirement for 2FA. Waiting to ingest more data. "Hopefully" working by end of November

Sirius:

Trying to connect tool for data standardization to ICAT, hopefully soon Started putting ICAT into production, but not there yet. Trying to validate the incoming data in dev environment. Q: What is the current blocker? Not really a blocker, working with current infra to map data automatically into ICAT. More like just working with their specific infrastructure. Running OK in development.

ISIS:

Recently started migrating to ICAT6/Payara6

Early(?) November planned for production system

Machine room at RAL has power work this weekend (affects DLS as well)

Not everything is going down, but a lot is

DLS:

Ditto for power works. Ongoing investigations in long running queries.

Oracle driver was an old version. Changed it but that broke something else. Now resolved and using the most recent driver

Work on the underlying tape storage is ongoing and blocking ICAT work.

Component Updates

icat.lucene & icat.server (& icat.utils, icat.client DataGateway)

Patrick gave a presentation

Snapshot releases for both icat.lucene & icat.server. DataGateway testing has been performed so current snapshots are considered release candidates.

Release for DLS early next year? Blocked by underlying tape storage work

New versions: icat.server: 6.0.0 -> 6.1.0 icat.lucene: 2.0.2 -> 3.0.0 icat.utils: 4.16.1 -> 4.17.0 icat.client: 6.0.0 -> 6.1.0

Separate endpoint for new search functionality

Justification:

Performance improvements
- Ability to index more than 2 billion entities per index (relevant for DLS Datafiles)
- Reducing number of DB calls
- Results are authorised in batched
- Optional ability for users only to search their own data (data where they're the instrumentscientist or investigationuser)
- Timeout long running searches
New features
- Can target specific fields e.g. title:"example"
- Synonym injection e.g. XAS = x-ray absorption spectroscopy
- Sorting options: relevancy (default), date, size, alphabetical
- Faceting results by e.g. instrument
- Filter results by parameters
- Infinite scrolling support
Engine support
- icat.server SearchManager has engine independent logic
- Implemented an OpensearchAPI for use with ElasticSearch/OpenSearch
  - Would mean you don't need icat.lucene - you talk to ES/OS directly
  - This hasn't been tested as extensively

Impact

If you don't use icat.lucene: no impact
If you do
- old endpoints still work but deprecated
- New endpoints should accept requests using old format
- icat.server 6.1.0 not compatible with icat.lucene < 3.0.0
- icat.lucene indexes pre v3 will need to be re-indexed

Andy: ElasticSearch uses Lucene as well? Patrick: yes, ES, OS and Solr all use Lucene under the hood. icat.lucene is effectively equivalent to these solutions, just ICAT specific.

Andy: Results should be the same right then? No matter which engine you use? Patrick: Yes, should be equivalent to users. Performance would be the big difference between different engines.

Marjolaine: Synonym injection? Do you rely on ontology? Patrick: Yes, this feature came out of PANET. There's an import script that helps automate ingest PANET terms. Theoretically you can manually specify. We're going to ship with PANET and periodic table symbols, but it's fully configurable Andy: we're looking into adopting PANET stuff Patrick: this is a simple version, just synonyms

Andy: Is Lucene forgiving with typos? Patrick: it's configurable. Verb conjugation is stripped off e.g. ionizing = ion. We remove things like plurals as well. (this is called stemming). Users can opt in to typos, typing test~1 allows 1 character to be mispelt. Can make this a global option as well.

Alex: Units? Our units are a mess, how are you managing them? Patrick: Diamond units are also bad. By default it understands SI units and prefixes. This is all configurable, e.g. can say eV = J Alex: Can you get it to standardise? e.g. User wants to search for 180 m, how to ignore 180 cm? Patrick: Should convert to SI unit on indexing. If users specified no unit, it either goes by SI unit or raw value (we can probably make this configurable). If user specifies units, Lucene will convert to SI unit and search against this (index is already in in SI units). You need to know at index time & and search time for it to work properly.

DataGateway

Deveoping E2E tests for the new Lucene new functionality. Leading to small bugfixes/improvements on the backend.

This is significant, don't want to start on too much other stuff while this is pending (but outside of our control, somewhat)

Made sorting more intuitive for users.

Q: DOI landing page? At the begining had two different applicatons, one for page and one for data?

Currently ISIS have a separate component for the landing page. Want to bring the landing page into DG. Done the work to enable this. Problem is, the migration to DataPublications hasn't completed so don't have pages for everything.
Plan to integrate all the metadata information, and either download or add data to cart from the landing page (in addition to navigating to the data - which would should its own metadata).
- Landing pages for Diamond under development - back and forth over features.
Alex: Worried about the uptime - need to keep the pages up. Static html is ugly but maybe easier to keep running.
- We (STFC) haven't considered it yet. Maintenance would be the only reason to take them all down. E.G. expect our existing, non-DG pages to be down during out upcoming power works.

icat-doi-api

DataHub

Lots of work to add support for rich metadata for MX. The latest version has equivalent set of features and performance as ISPyB and could replace it.

icat+ e-logbook

Improved searching of tags. Added support for exporting to text + improved pdf export.

AOB

Other meetings

Allan: Upcoming sample tracking meeting - can we have this by 13th(?) of November?

Should define sample tracking - some people think of couriers, some people think of metadata
- ESRF have tracking of Samples "sent in" already, for all techniques
Allan to send proposed dates

Next Collaboration Meeting

ESRF won't be able to join on the 30th of November.

Move it to another Thursday
Try to move it earlier & not on Thursday for Sesame as well (current meeting time is equivalent to 5pm on a Friday)