Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mordred (eos78ao) output formatting issue #42

Open
JHlozek opened this issue Apr 16, 2024 · 7 comments
Open

Mordred (eos78ao) output formatting issue #42

JHlozek opened this issue Apr 16, 2024 · 7 comments

Comments

@JHlozek
Copy link
Collaborator

JHlozek commented Apr 16, 2024

Describe the bug
I've identified the source of the issue I'm having in ZairaChem with the Mordred (eos78ao) model.
The descriptors from the Mordred model are saved as a list '[]' whereas all the other descriptor models don't save the feature lists with the square brackets.

This causes issues after reading the raw.h5 descriptor file during further processing in zairachem.utils.matrices.py in save_summary_as_csv() where 'values[0]' in line 128 is not the expected 1-dimensional array and cannot be converted to floats.

I confirmed this by editing the code to reshape the array from (342, 1, 1613) to (342, 1613) which stopped this specific issue (but breaks the pipeline in other areas). So I suggest altering the output of eos78ao to be consistent with the other descriptor outputs (i.e. without explicitly saving the brackets)?

To Reproduce
Steps to reproduce the behavior in Ersilia:

  1. ersilia serve eos78ao
  2. ersilia run -i test_smiles.csv -o mordred_out.csv
  3. See descriptors saved with '[]'

Steps to reproduce the behavior in ZairaChem:

  1. fresh install of ZairaChem and eos78ao
  2. zairachem fit -i test_smiles.csv -m test_model.csv
  3. See test_model/descriptors/mordred only has 'raw.h5' and no 'raw_summary.csv'
  4. Navigate to test_model/descriptors/mordred and open a Python runtime
  5. import zairachem
  6. from zairachem.utils.matrices import Hdf5
  7. h = Hdf5("raw.h5")
  8. h.save_summary_as_csv()

Expected behavior
ZairaChem completes post-processing of Mordred descriptors (in zairachem.utils.matrices.py in save_summary_as_csv()).
I think this requires eos78ao to save its output without the square brackets in the feature list.

Desktop (please complete the following information):
Both in Rocky Linux 9 and Ubuntu 20.04

@JHlozek
Copy link
Collaborator Author

JHlozek commented Apr 20, 2024

I've been digging some more into this and discovered that the reason the descriptors are not read correctly is due to the eos78ao model's api_schema.json not getting configured correctly during the model fetch. Then the data is not stored as expected. So I think this is more an Ersilia model issue rather than ZairaChem.

The data type and shape are not specified and gets configured as:
"output": {
"descriptor": {
"meta": null
"type": "null",
}
}

If I manually edit the file to the following, then the model output is saved correctly and ZairaChem reads the data and proceeds as expected:
"output": {
"descriptor": {
"type": "mixed_array",
"shape": [
1613
],
"meta": null
}
}

I've attached the fetch log from the hub. I see that there is a file that couldn't be fetched that may be related to production of the api_schema file?:
"Error response from daemon: Could not find the file /root/eos/dest/eos78ao/model/framework/example.csv"

eos78ao_fetch.log

@GemmaTuron
Copy link
Member

I can get the model to work fine when fetching from DockerHub (but not tested the model inside ZairaChem - maybe that pops up the error due to h5 conversion)
I do get fetched a weird image artifact that might be due to the new image of ersilia? see the screenshot:

Screenshot 2024-04-24 at 09 15 00

@GemmaTuron
Copy link
Member

indeed with h5 it does fail

ersilia run -i my_input.csv -o my_output.h5
eos78ao_h5.txt

@DhanshreeA
Copy link
Member

DhanshreeA commented Apr 24, 2024

Hi @GemmaTuron this, to the best of my knowledge, seems to be a result of two issues - one from the ersilia CLI, and the other related to Docker itself. Ersilia thinks it is having trouble working with conventional pull which is truly not the case so it ends up pulling the same image in both ARM and AMD versions. And currently Docker is not mature enough with multi platform builds wherein even if both the architectures exist on your system, docker should list it as a single image. The default behavior within Docker is to reuse the label when two images are labeled the same way, thereby making one of them dangling (or none as you see here).

My best guess is that it is some combination of those things, but knowing your Docker version might be helpful as well!

@GemmaTuron
Copy link
Member

ah that part makes sense, is there anything we can do to avoid it happening? I had docker 4.23 but updating to 4.29 right now

@DhanshreeA
Copy link
Member

From our end, I need to work on the conventional pull issue so we don't end up pulling two flavors of the same image ever.

@JHlozek
Copy link
Collaborator Author

JHlozek commented May 2, 2024

Just in case this got lost on Slack, the recent changes did fix some of the metadata processing issue but the api_schema.json is still not completely correct. The 'type' field is still null:
api_schema.json

These are the molecules that were used during the fetch:
example_standard_input.csv
example_standard_output.csv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants