MAPL 2.34.0 failed in ufs weather model #207

Open
junwang-noaa opened this issue Feb 24, 2023 · 47 comments
@junwang-noaa

We are trying to update the MAPL library from 2.23.1 to 2.34.0 in ufs-weather-model. However, we got the following error message in GOCART:

pe=00098 FAIL at line=03053 Base_Base_implementation.F90 <status=57>
pe=00098 FAIL at line=00685 SU2G_GridCompMod.F90 <status=57>
pe=00098 FAIL at line=01817 MAPL_Generic.F90 <status=57>
pe=00098 FAIL at line=00193 BaseProfiler.F90 <Timer does not match start timer >
pe=00098 FAIL at line=01838 MAPL_Generic.F90 <status=1>
pe=00098 FAIL at line=00161 Aerosol_GridComp.F90
pe=00098 FAIL at line=01817 MAPL_Generic.F90 <status=1>

I tried the latest GOCART version, acc574f on the develop branch (https://github.com/GEOS-ESM/GOCART/tree/develop), and got the same error. The code ran fine when I switched back to MAPL 2.23.1. The ESMF library version is 8.4.1b07.

May I ask if anything needs to be updated to use MAPL 2.34.0? Thanks
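For reading tracebacks like the one above, here is a minimal Python sketch that turns the `FAIL at line=` records into structured entries. It assumes only the line layout visible in this log; the function name `parse_fail_lines` is made up for illustration and is not part of MAPL. The order of file names suggests that the first line printed is the deepest failing call, but that is an inference from this log, not a documented guarantee.

```python
import re

# Matches lines such as:
#   pe=00098 FAIL at line=03053 Base_Base_implementation.F90 <status=57>
# The trailing <...> message is optional (one line in the log has none).
FAIL_RE = re.compile(
    r"pe=(?P<pe>\d+)\s+FAIL at line=(?P<line>\d+)\s+(?P<file>\S+)"
    r"(?:\s+<(?P<msg>.*)>)?\s*$"
)


def parse_fail_lines(log_text):
    """Return one dict per FAIL line, in the order the lines were printed."""
    frames = []
    for raw in log_text.splitlines():
        m = FAIL_RE.search(raw)
        if m:
            frames.append({
                "pe": int(m.group("pe")),
                "line": int(m.group("line")),
                "file": m.group("file"),
                "message": (m.group("msg") or "").strip(),
            })
    return frames


sample = """\
pe=00098 FAIL at line=03053 Base_Base_implementation.F90 <status=57>
pe=00098 FAIL at line=00685 SU2G_GridCompMod.F90 <status=57>
pe=00098 FAIL at line=00161 Aerosol_GridComp.F90
"""
for frame in parse_fail_lines(sample):
    print(frame["file"], frame["line"], frame["message"])
```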

@mathomp4
Member

@junwang-noaa I think your libraries are fine. The line it is dying on is:

    call ESMF_AttributeGet(grid, name='GridType', value=grid_type, _RC)
    if(trim(grid_type) == "Cubed-Sphere") then

This came in between 2.23.1 and 2.34.0 in changes from @aoloso. Before it was:

    if (im_world*6==jm_world) then

So it's like it doesn't know your grid is a cubed-sphere grid. We might need to ping @weiyuan-jiang and @bena-nasa to see if maybe you need to set something in a file somewhere?

@bena-nasa
Contributor

bena-nasa commented Feb 24, 2023

@junwang-noaa
As Matt said, the issue is that we changed the logic in that library. Rather than relying on a silly hack to detect a cubed-sphere grid (checking that the global size of the 2nd dimension of the grid is 6 times the 1st dimension), we now require you to add an attribute to the grid that explicitly says what type of grid it is, so the procedure can take the appropriate action. The hack was no longer tenable with other changes to the GEOS model.

The grid you are using in the UFS application that is passed to GOCART, however it is created, clearly does not have this attribute set. The solution is to add the appropriate attribute (key 'GridType', value 'Cubed-Sphere') to the grid used in GOCART so that this routine knows what the grid type is.
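The difference between the old size-based check and the new attribute-based check can be sketched in Python. This is an illustration only, not MAPL code: grids are modeled as plain dicts, the function names are invented, and the sample grid sizes are hypothetical. Note that in MAPL the missing attribute surfaces as a failed `ESMF_AttributeGet` call (the `status=57` in the traceback above), not as a quiet `False`.

```python
def is_cubed_sphere_by_size(im_world, jm_world):
    """Old check: a global cubed sphere stacks six square faces, so jm = 6 * im."""
    return im_world * 6 == jm_world


def is_cubed_sphere_by_attribute(grid_attrs):
    """New check: rely on an explicit 'GridType' attribute attached to the grid."""
    return grid_attrs.get("GridType", "").strip() == "Cubed-Sphere"


# Hypothetical grids, modeled as dicts for illustration only.
global_grid = {"im_world": 96, "jm_world": 576, "GridType": "Cubed-Sphere"}
local_patch = {"im_world": 24, "jm_world": 24, "GridType": "Cubed-Sphere"}
untagged = {"im_world": 96, "jm_world": 576}  # no attribute, like the grid in this issue

for name, grid in [("global", global_grid), ("local", local_patch), ("untagged", untagged)]:
    by_size = is_cubed_sphere_by_size(grid["im_world"], grid["jm_world"])
    by_attr = is_cubed_sphere_by_attribute(grid)
    print(f"{name}: size check={by_size}, attribute check={by_attr}")
```

The local patch is the case the size check misses, and the untagged grid is the case the attribute check rejects.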

@weiyuan-jiang since you know how to build UFS, can you take it from here? I've no idea where the grid comes from within UFS, so I can't really offer further advice.

@bbakernoaa

I think that we need to come up with a more robust way to do this. For instance, what if we run this on a regional grid (a single tile)? We should not expect the global cubed sphere to be the only case.

@junwang-noaa
Author

junwang-noaa commented Feb 24, 2023

Thank you all for looking into this issue. @weiyuan-jiang I transferred the run directory to Orion at:

/work/noaa/stmp/junwang/gocart/rt_189024/cpld_control_p8_mixedmode

It has all the configure .rc files. Thanks

@weiyuan-jiang
Contributor

@junwang-noaa I don't have permission to access that folder.

@junwang-noaa
Author

junwang-noaa commented Feb 24, 2023

@weiyuan-jiang Please try it again. Thanks

@weiyuan-jiang
Contributor

@junwang-noaa I still cannot access the gocart directory. To build mapl_v2.34.0, where can I load ESMF 8.4.0? I got this error:
CMake Error at /apps/cmake-3.22.1/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find ESMF: Found unsuitable version "8.3.0", but required is at
least "8.4.0" (found
/work/noaa/epic-ps/hpc-stack/libs/intel/2022.1.2/intel-2022.1.2/impi-2022.1.2/esmf/8.3.0b09/lib,
)
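The CMake message above comes down to a version comparison: 8.3.0b09 was found, at least 8.4.0 is required. A rough Python sketch of that comparison follows. The helper names are invented, the parsing assumes the `major.minor.patch[bNN]` pattern seen in this thread, and this is not how CMake's `find_package` implements its check. Ignoring the beta suffix is a simplification: a beta of the required release would compare as equal.

```python
import re


def parse_esmf_version(text):
    """Split a version such as '8.4.1b07' into ((8, 4, 1), 7).

    The beta number is None for a plain release such as '8.4.0'.
    """
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:b(\d+))?", text.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    release = tuple(int(part) for part in m.group(1, 2, 3))
    beta = int(m.group(4)) if m.group(4) is not None else None
    return release, beta


def meets_minimum(found, required):
    """Compare only the numeric release triple, ignoring any beta suffix."""
    return parse_esmf_version(found)[0] >= parse_esmf_version(required)[0]


print(meets_minimum("8.3.0b09", "8.4.0"))  # the library found on Orion
print(meets_minimum("8.4.1b07", "8.4.0"))  # the library used in the UFS tests
```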

@junwang-noaa
Author

@weiyuan-jiang Sorry, please try again. The library team installed the library on acorn. UFS failed with ESMF 8.4.0 due to a bug in that release. ESMF 8.4.1b07 works in UFS, and GOCART runs with MAPL v2.23.1. Only when we try MAPL v2.34.0 do we get the error message. We don't have the library installed on Orion yet. Please let me know if you need to run tests on Orion, and I can see if the EPIC team can install it.

@bena-nasa
Contributor

bena-nasa commented Feb 24, 2023

> I think that we need to come up with a more robust way to do this. For instance, what if we run this on a regional grid (a single tile)? We should not expect the global cubed sphere to be the only case.

The whole reason we changed the logic to what is there now is for MORE robustness. We were doing some open work where the component had a "grid" that was still a cubed-sphere, in the sense that it had a copy of the local cubed-sphere domain. In that case we still want to go through this code path, since the cells are still bounded by great circles on all 4 sides and the same search algorithm should be used.

A fully robust, efficient, grid-agnostic implementation of this (additive point binning to an arbitrary grid), where the cell boundaries may not be great circles (as on a tripolar grid, for example), is beyond the scope of what MAPL can do, but I fully agree that it is what we need. I believe ESMF has an action item to someday implement this: a generic, additive point binning onto a grid from a set of points, as an extension of the existing regridding methods.
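For readers unfamiliar with the term, here is a minimal Python sketch of additive point binning in the easiest case, a regular global lat-lon grid, where the cell lookup is plain arithmetic. It is a generic illustration, not MAPL's routine, and the function name is invented. On a cubed sphere the cell edges are great circles, so the lookup needs a geometric search instead, which is why the routine has to know the grid type.

```python
def bin_points(lons, lats, values, nlon, nlat):
    """Accumulate point values onto a regular global lat-lon grid.

    Cells are nlon x nlat boxes covering [-180, 180) x [-90, 90].
    Values that land in the same cell are summed (additive binning).
    """
    grid = [[0.0] * nlon for _ in range(nlat)]
    dlon = 360.0 / nlon
    dlat = 180.0 / nlat
    for lon, lat, value in zip(lons, lats, values):
        # Wrap longitude into [0, 360) measured from -180, then pick the column.
        i = min(int(((lon + 180.0) % 360.0) / dlon), nlon - 1)
        # Clamp so that lat = 90 falls in the top row instead of out of range.
        j = min(int((lat + 90.0) / dlat), nlat - 1)
        grid[j][i] += value
    return grid


# Two points share a cell and are summed; the third lands elsewhere.
grid = bin_points([0.5, 0.7, -179.0], [0.5, 0.6, -89.0], [1.0, 2.0, 5.0], nlon=4, nlat=2)
print(grid)
```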

@weiyuan-jiang
Contributor

@junwang-noaa I think the problem is this line

call cap % cap_gc % set_grid(grid, lm=nlev, _RC)

The temporary fix should be

call cap % cap_gc % set_grid(grid, lm=nlev, grid_type ="Cubed-Sphere", _RC)

I am wondering how to pass the type info into the cap options.

@weiyuan-jiang
Contributor

weiyuan-jiang commented Feb 24, 2023

Oh, sorry, I spoke too soon. Let me verify again. Yes, the change should work. Does UFS always work with a cubed-sphere grid? @junwang-noaa

@junwang-noaa
Author

@weiyuan-jiang Currently UFS uses the cubed-sphere atmosphere model fv3atm, which has 6 tiles for the global domain and one or more domains for regional configurations. I am not sure UFS will always use a cubed-sphere grid, though, as people might use other grids if they want to integrate other dycores.

@bbakernoaa

bbakernoaa commented Feb 25, 2023 via email

@junwang-noaa
Author

junwang-noaa commented Feb 26, 2023

@bbakernoaa I want to confirm that the regional application in UFS also runs on the native cubed-sphere grid, as I explained in my message. UFS does not currently have a regional fv3atm coupled with GOCART. Please clarify whether your group is working on a different grid, or whether you know what other grid it will be. Also, I believe coupling GOCART in a regional configuration requires additional work on boundaries, which is beyond the scope of this issue. Thanks

@junwang-noaa
Author

@weiyuan-jiang There is an issue on our wcoss2 test platform, so it may take some time to verify your fix in UFS. Thanks

@bbakernoaa

bbakernoaa commented Feb 26, 2023 via email

@junwang-noaa
Author

@weiyuan-jiang The fix you provided resolved the issue, and the test now runs successfully in UFS WM. Please let us know whether you'd like to release a new MAPL library or update GOCART with the fix. If you are going to update MAPL, we will stop installing MAPL 2.34.0. Thanks

@mathomp4
Member

@junwang-noaa If I can get my tests done, I hope to put out a MAPL 2.35 today that should have this fix.

@mathomp4
Member

mathomp4 commented Mar 1, 2023

@junwang-noaa MAPL 2.35.0 has been released:

https://github.com/GEOS-ESM/MAPL/releases/tag/v2.35.0

I think @weiyuan-jiang can tell you how to use this release for this issue.

@weiyuan-jiang
Contributor

@junwang-noaa, adding this line to AERO.rc, without changing UFS, should work:

GridType: Cubed-Sphere

@junwang-noaa
Author

@mathomp4 @weiyuan-jiang Thank you very much for fixing the issue! I will ask our library team to install MAPL 2.35.0 and test in UFS with the change Weiyuan suggested. Will let you know how it goes.

@mathomp4
Member

mathomp4 commented Mar 2, 2023

> @mathomp4 @weiyuan-jiang Thank you very much for fixing the issue! I will ask our library team to install MAPL 2.35.0 and test in UFS with the change Weiyuan suggested. Will let you know how it goes.

@junwang-noaa You might actually want to wait for 2.35.1, which I'll issue soon. 2.35.0 had a bug in handling monthly history output. My guess is you don't use that, but you might as well not have a buggy version!

@junwang-noaa
Author

Sure, we will wait for 2.35.1. Thanks

@mathomp4
Member

mathomp4 commented Mar 2, 2023

> Sure, we will wait for 2.35.1. Thanks

Whoops. Forgot to update this. 2.35.1 is out now! My guess is 2.35.0 is fine for your runs, but now the bug isn't there. 😄

@junwang-noaa
Author

@mathomp4 Thanks a lot!
@weiyuan-jiang With MAPL 2.35.1, I updated AERO.rc as shown below:

NX: 4
NY: 24            

# Atmospheric Model Configuration Parameters
# ------------------------------------------
IOSERVER_NODES: 0

DYCORE: NONE

NUM_BANDS: 30

GridType: Cubed-Sphere

Now I get this error message:

pe=00086 FAIL at line=01464    MAPL_CapGridComp.F90                     <status=51>
pe=00086 FAIL at line=00342    Aerosol_Cap.F90                          <status=51>

I see that line 342 in Aerosol_Cap.F90 is:

    call cap % cap_gc % set_grid(grid, lm=nlev, _RC)

Do I need to make any change in this line? Thanks

@junwang-noaa
Author

@weiyuan-jiang I tried the following in AERO.rc and got the same error.

GridType: "Cubed-Sphere"

@weiyuan-jiang
Contributor

@junwang-noaa I know what happens here. The cf_root has not been created yet when we call set_grid. It seems that we need to exchange the two lines.

call cap % cap_gc % set_clock(clock, _RC)
! initialize aerosol grid component
call cap % cap_gc % initialize(rc=rc)

However, the grid should be set at this point. That creates a circular dependence:
https://github.com/GEOS-ESM/MAPL/blob/7da78c3664acbe39f328543cb7427502a1a1a9fc/gridcomps/Cap/MAPL_CapGridComp.F90#L625

Maybe the best solution is to just change line 342 of Aerosol_Cap.F90 and go back to the old set_grid of MAPL?

call cap % cap_gc % set_grid(grid, lm=nlev, grid_type ="Cubed-Sphere", _RC)

@bena-nasa @tclune
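The ordering problem described in this comment can be modeled with a toy Python class. It is a sketch only and not MAPL's implementation: `ToyCap` and its attributes are hypothetical, and only the names `set_grid`, `initialize`, and the role of `cf_root` come from the thread. It shows why passing `grid_type` explicitly sidesteps the cycle: `set_grid` no longer has to read a config that `initialize` has not created yet.

```python
class ToyCap:
    """Toy stand-in for the cap component; not MAPL's actual code."""

    def __init__(self, rc_settings):
        self._rc_settings = rc_settings  # what the rc file would contain
        self.config = None               # plays the role of cf_root
        self.grid = None
        self.grid_type = None

    def initialize(self):
        # initialize() expects the grid to be in place already.
        if self.grid_type is None:
            raise RuntimeError("initialize() needs the grid to be set first")
        self.config = dict(self._rc_settings)

    def set_grid(self, grid, grid_type=None):
        if grid_type is None:
            # Fallback: read the type from a config that may not exist yet.
            if self.config is None:
                raise RuntimeError("set_grid() needs the config, which initialize() creates")
            grid_type = self.config["GridType"]
        self.grid = grid
        self.grid_type = grid_type


cap = ToyCap({"GridType": "Cubed-Sphere"})
try:
    cap.set_grid("my-grid")  # config not created yet, so this fails
except RuntimeError as err:
    print("without grid_type:", err)

cap.set_grid("my-grid", grid_type="Cubed-Sphere")  # explicit type breaks the cycle
cap.initialize()
print("with grid_type:", cap.grid_type)
```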

@junwang-noaa
Author

@weiyuan-jiang Thanks for looking into this. I confirm that with the change above in Aerosol_Cap.F90 and MAPL 2.34.0, the UFS WM tests ran successfully. Thanks

@weiyuan-jiang
Contributor

@junwang-noaa Could you please try this branch? https://github.com/GEOS-ESM/MAPL/tree/fix/wjiang/set_grid_fix
The cf_root is not created when setting the grid, so you would need to move this line to CAP.rc:
GridType: Cubed-Sphere

@junwang-noaa
Author

@weiyuan-jiang I tried your branch and I got this error:

pe=00053 FAIL at line=03053    Base_Base_implementation.F90             <status=57>
pe=00053 FAIL at line=00685    SU2G_GridCompMod.F90                     <status=57>
pe=00053 FAIL at line=01818    MAPL_Generic.F90                         <status=57>
pe=00053 FAIL at line=00193    BaseProfiler.F90                         <Timer <GOCART2G> does not match start timer <SU>>
pe=00053 FAIL at line=01839    MAPL_Generic.F90                         <status=1>
pe=00053 FAIL at line=00161    Aerosol_GridComp.F90                     <Failed to run child component>

@weiyuan-jiang
Contributor

@junwang-noaa I have no problem running the new MAPL branch with unchanged ufs. There may be two reasons for the crash:

  1. The line GridType: Cubed-Sphere is not added to CAP.rc (not AERO.rc)
  2. The new MAPL branch is not really linked (need a fresh build)
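The first of these two checks can be scripted. The Python sketch below reads simple `LABEL: value` lines and reports whether `GridType: Cubed-Sphere` is present. It assumes only the single-line layout shown in this thread, so tables and other resource-file features are ignored. The function names are invented, and the CAP.rc contents in the example are hypothetical apart from the `GridType` line.

```python
def read_rc_labels(text):
    """Collect simple 'LABEL: value' pairs from resource-file text.

    Blank lines and lines starting with '#' are skipped. Tables and
    multi-line entries are not handled in this sketch.
    """
    labels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        label, _, value = line.partition(":")
        labels[label.strip()] = value.strip()
    return labels


def has_cubed_sphere_grid_type(rc_text):
    """True if the text sets GridType to exactly 'Cubed-Sphere'."""
    return read_rc_labels(rc_text).get("GridType") == "Cubed-Sphere"


aero_rc = """\
NX: 4
NY: 24

# Atmospheric Model Configuration Parameters
DYCORE: NONE
NUM_BANDS: 30
"""
cap_rc = "GridType: Cubed-Sphere\n"  # hypothetical minimal CAP.rc content

print("AERO.rc sets GridType:", has_cubed_sphere_grid_type(aero_rc))
print("CAP.rc sets GridType:", has_cubed_sphere_grid_type(cap_rc))
```

In practice the text would come from reading the CAP.rc in the run directory.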

@junwang-noaa
Author

@weiyuan-jiang Thanks for looking into the issue. After I added GridType to CAP.rc instead of AERO.rc, the UFS WM test finished successfully. Please let us know if you have a MAPL release version available.

@mathomp4
Member

mathomp4 commented Mar 6, 2023

@junwang-noaa MAPL 2.35.2 is now out:

https://github.com/GEOS-ESM/MAPL/releases/tag/v2.35.2

@weiyuan-jiang Can you inform us how to use your new fixes?

@junwang-noaa
Author

@mathomp4 I want to confirm that the MAPL 2.35.2 has the fixes in Weiyuan's MAPL branch: https://github.com/GEOS-ESM/MAPL/tree/fix/wjiang/set_grid_fix, right? Thanks.

@mathomp4
Member

mathomp4 commented Mar 6, 2023

> @mathomp4 I want to confirm that the MAPL 2.35.2 has the fixes in Weiyuan's MAPL branch: https://github.com/GEOS-ESM/MAPL/tree/fix/wjiang/set_grid_fix, right? Thanks.

Yes. It has GEOS-ESM/MAPL#2003 inside

ETA: For Git Flow reasons, @weiyuan-jiang made a new branch against main, so we actually used a different branch, but it should be the same code.

@weiyuan-jiang
Contributor

Yes. That is right

@junwang-noaa
Author

junwang-noaa commented Mar 7, 2023

@weiyuan-jiang I want to confirm something with you. When using the new MAPL library, our gocart history files (gocart.inst_aod.20130401_0600z.nc4) now have one additional dimension, lev, with values 1 to 4, as shown below:

netcdf gocart.inst_aod.20210323_0600z {
dimensions:
        lat = 361 ;
        lev = 4 ;
        lon = 720 ;
        time = UNLIMITED ; // (1 currently)
variables:
        double lon(lon) ;
                lon:long_name = "longitude" ;
                lon:units = "degrees_east" ;
        double lat(lat) ;
                lat:long_name = "latitude" ;
                lat:units = "degrees_north" ;
        double lev(lev) ;
                lev:coordinate = "N/A" ;
                lev:standard_name = "N/A" ;
                lev:units = "level" ;
        float time(time) ;
                time:begin_date = 20210323 ;
                time:begin_time = 60000 ;
                time:long_name = "time" ;
                time:time_increment = 60000 ;
                time:units = "minutes since 2021-03-23 06:00:00" ;
        float AOD(time, lev, lat, lon) ;
                AOD:_FillValue = 1.e+15f ;
                AOD:add_offset = 0.f ;
                AOD:fmissing_value = 1.e+15f ;
                AOD:long_name = "Total Aerosol Extinction AOT [550 nm]" ;
                AOD:missing_value = 1.e+15f ;
                AOD:regrid_method = "bilinear" ;
                AOD:scale_factor = 1.f ;
                AOD:standard_name = "Total Aerosol Extinction AOT [550 nm]" ;
                AOD:units = "1" ;
                AOD:valid_range = -1.e+15f, 1.e+15f ;
                AOD:vmax = 1.e+15f ;
                AOD:vmin = -1.e+15f ;
...
 lev = 1, 2, 3, 4 ;

Is this what we expect? May I ask what "lev" means? Thanks

@weiyuan-jiang
Contributor

It should be the number of wavelengths. But I am not sure why it uses the confusing name "lev", which usually represents levels. @bena-nasa, I think we should change this dimension name.

@mathomp4
Member

mathomp4 commented Mar 7, 2023

@weiyuan-jiang I'll let @bena-nasa chime in, but I think one reason we did that (at least in the past) was for plotting packages like GrADS that could only handle 3rd dimensions that were layer or level. Without some post processing, the variables were unviewable.

I suppose we should do whatever CF says is "right" for these sorts of things, but I want to say these might be part of the "discrete axis" part of the Conventions which is confusing to read.

@junwang-noaa
Author

@mathomp4 @weiyuan-jiang Thanks for the information. I think we are OK with using "lev" for the 3rd dimension. Is it possible to have attributes that specify which wavelengths the lev values represent?

@bbakernoaa @rmontuoro @lipan-NOAA FYI.

@bbakernoaa

bbakernoaa commented Mar 7, 2023

@junwang-noaa I believe that is specified in the GOCART2G_GridComp.rc file

In the global-workflow it is here: https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/chem/GOCART2G_GridComp.rc#L41
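Until such attributes exist in the files, a downstream job could build the lookup itself. The Python sketch below pairs each `lev` index with a wavelength. It assumes that `lev` value k corresponds to the k-th wavelength listed in the rc file, which the thread does not confirm. The function name is invented, and the wavelengths used here are placeholders, not the real GOCART2G settings.

```python
def lev_to_wavelength(lev_values, wavelengths_nm):
    """Pair each 1-based lev index with the wavelength listed at that position."""
    if sorted(lev_values) != list(range(1, len(wavelengths_nm) + 1)):
        raise ValueError("lev values must be 1..N, with N the number of wavelengths")
    return {lev: wavelengths_nm[lev - 1] for lev in lev_values}


# Placeholder wavelengths for illustration only. The real list is in
# GOCART2G_GridComp.rc and is not reproduced here.
placeholder_nm = [100.0, 200.0, 300.0, 400.0]

mapping = lev_to_wavelength([1, 2, 3, 4], placeholder_nm)
for lev, wavelength in mapping.items():
    print(f"lev={lev} -> {wavelength} nm (placeholder)")
```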

@junwang-noaa
Author

junwang-noaa commented Mar 7, 2023

@bbakernoaa Thanks for the info. So do we want to have this information in the gocart files as attributes, or is it OK to leave lev with values 1-4? I mean for downstream jobs (post or verification jobs).

@bbakernoaa

@junwang-noaa I think it is ok. In production we may send it through the UPP anyway. We currently do not pass the AOD back to the physics radiation.

@junwang-noaa
Author

@bbakernoaa Thanks for confirming. I will ask the library team to install the MAPL 2.35.2 for UFS.

@junwang-noaa
Author

I have run some tests with threads using MAPL 2.35.2 and the GOCART develop branch e2245c2. It looks to me like threading does not work in UFS. We may still need some work to enable threads in UFS WM:

single thread:

      [CHM] RunPhase1                                                      144    144    120      67.9798     55.2096     142     91.1209     39
        [CAP] Run 1                                                        144    144    120      66.5963     53.6441     142     89.8835     39
          [EXTDATA] Run 1                                                  144    144    120      39.2418     26.1711     142     64.7887     39
            [EXTDATA] Run 11                                               144    144    120      39.2375     26.1678     142     64.7829     39
          [AERO] Run 1                                                     144    144    120      25.3808     23.0391     20      28.2873     54
            [AERO] Run 11                                                  144    144    120      25.3781     23.0370     20      28.2846     54
                [GOCART2G] Run 12                                          144    144    120      24.0121     21.6550     133     26.9398     54

4 threads:

      [CHM] RunPhase1                                                      36     144    120      131.4783    124.9583    12      139.0562    36
        [CAP] Run 1                                                        36     144    120      129.3455    123.0336    12      136.7338    36
          [AERO] Run 1                                                     36     144    120      88.4206     81.5668     36      95.0078     88
            [AERO] Run 11                                                  36     144    120      88.4186     81.5650     36      95.0057     88
              [GOCART2G] Run 2                                             36     144    120      83.9012     77.0662     36      90.4515     88
...
          [EXTDATA] Run 1                                                  36     144    120      40.1352     33.6841     128     54.1526     36
            [EXTDATA] Run 11                                               36     144    120      40.1316     33.6808     128     54.1485     36
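To put a number on the slowdown in the two tables above, the rows can be parsed and compared with a short Python sketch. The column order (PETs, PEs, Count, Mean, Min, Min PET, Max, Max PET) is an assumption based on the usual ESMF_Profile.summary layout, since the pasted tables carry no header. The function name is invented.

```python
def parse_profile_row(line):
    """Split one timing row into its region name and numeric columns.

    Assumed column order: PETs, PEs, Count, Mean (s), Min (s),
    Min PET, Max (s), Max PET.
    """
    # The region name may contain spaces, so split the 8 numbers off the right.
    region, *cols = line.strip().rsplit(maxsplit=8)
    pets, pes, count, mean, tmin, min_pet, tmax, max_pet = cols
    return {
        "region": region,
        "pets": int(pets),
        "pes": int(pes),
        "count": int(count),
        "mean": float(mean),
        "min": float(tmin),
        "max": float(tmax),
    }


single = parse_profile_row(
    "  [AERO] Run 1    144    144    120      25.3808     23.0391     20      28.2873     54"
)
threaded = parse_profile_row(
    "  [AERO] Run 1    36     144    120      88.4206     81.5668     36      95.0078     88"
)
slowdown = threaded["mean"] / single["mean"]
print(f"{single['region']}: mean time ratio (4 threads / 1 thread) = {slowdown:.2f}")
```

With the same 144 PEs in both runs, a ratio well above 1 is consistent with the threaded run not using the extra threads in this region.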

@mathomp4
Member

mathomp4 commented Mar 9, 2023

@junwang-noaa I think @weiyuan-jiang and @aoloso are looking at this...

@junwang-noaa
Author

junwang-noaa commented Mar 17, 2023

@weiyuan-jiang @aoloso I have two test cases ready on Orion for you to run some tests.

Case 1: single thread (total PETs 192 for atm forecast and chem, layout 4x8, 1 thread)

/work/noaa/stmp/junwang/stmp/junwang/FV3_RT/rt_72194/atmaero_control_p8

Timing profile is in ESMF_Profile.summary

        [fv3_fcst] RunPhase1               192    192    120      130.4369    113.6451    52      143.5964    107
              [GOCART2G] Run 2             192    192    120      22.0089     17.5834     56      28.3196     72
              [GOCART2G] Run 1             192    192    120      1.0615      1.0312      186     1.0915      1

Case 2: 4 threads (total PETs 192 for atm forecast and chem, layout 1x8, 4 threads)

/work/noaa/stmp/junwang/stmp/junwang/FV3_RT/rt_72194/atmaero_control_p8_thrd4

Timing profile is in ESMF_Profile.summary

        [fv3_fcst] RunPhase1               48     192    120      140.5584    131.2598    52      146.3891    12
              [GOCART2G] Run 2             48     192    120      84.5028     73.9572     20      93.3594     68
              [GOCART2G] Run 1             48     192    120      3.6855      3.5674      168     3.7565      0

My branch is: https://github.com/junwang-noaa/ufs-weather-model/blob/newmapl
The code is on Orion at:

/work/noaa/nems/junwang/ufs-weather/20230306/mapl/ufs-weather-model

You can compile the code with:

cd ufs-weather-model/tests
./compile.sh orion.intel "-DAPP=ATMAERO -DCCPP_SUITES=FV3_GFS_v17_p8 -D32BIT=ON" 001
You will get an executable, fv3_001.exe. Copy it to fv3.exe in the run directory and, from the run directory, submit:
sbatch job_card

Please let me know if you have any questions, Thanks.
