
upload client sometimes misses either declaring the file replica or declaring the file to a dataset #6670

Closed
StevenCTimm opened this issue Apr 11, 2024 · 21 comments
Labels: backport (Issues which are backported to additional releases), bug, Clients

@StevenCTimm

Description

On several occasions DUNE has observed that the uploadclient.py of Rucio v33.3 (as called in our utility justin-rucio-upload)

https://github.com/DUNE/dune-justin/blob/main/commands/justin-rucio-upload

either fails to register the file replica, or fails to attach the DID to the requested dataset.

This tends to happen when the rucio server is under high load.

6093.90.justin-prod-sched02.dune.hep.ac.uk.txt

A snippet of the debug output generated by the rucio upload client is below.
In short, what you see is 5 attempts at rucio upload: 3 to QMUL and 2 to RAL-PP, the last of which
finally succeeded. The QMUL RSE was very busy, but we were also getting 503 errors from the
server at the time.

But what actually happened under the covers is that the file in question was attached to the dataset
but the replica was not declared to rucio.

A layman's review of uploadclient.py suggests that there are few, if any, checks for exceptions around the several lower-level tasks this client performs; i.e. there is an attach to the DID and no handler for the exception, so it is quite possible that the two tasks are not atomic. What needs to be done is either to make the client code retry when a call fails, or to return an error if either of those two operations fails.
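To illustrate the behaviour being requested, here is a minimal, hedged sketch (not the actual uploadclient.py code; the helper names register_with_retry and finalise_upload are hypothetical) of retrying the two post-upload registration calls and re-raising if they still fail, so the caller sees the error instead of a silent success:

```python
import time
from rucio.client import Client
from rucio.common.exception import RucioException


def register_with_retry(call, attempts=3, backoff=1.0):
    """Retry a registration call a few times, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except RucioException:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)


def finalise_upload(client, rse, scope, name, dataset_scope, dataset_name):
    # Mark the replica as available on the RSE; a failure here is re-raised
    # instead of being silently ignored.
    register_with_retry(lambda: client.update_replicas_states(
        rse, files=[{'scope': scope, 'name': name, 'state': 'A'}]))
    # Attach the file DID to the target dataset; also retried and re-raised.
    register_with_retry(lambda: client.attach_dids(
        dataset_scope, dataset_name, [{'scope': scope, 'name': name}]))
```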

DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2ba369504ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
2024-04-07 00:02:59,967 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
2024-04-07 00:03:22,081 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
2024-04-07 00:03:37,845 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 503 299
2024-04-07 00:03:53,947 WARNING Waiting 1.0s due to reason: server returned 503
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b17ceb98ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
2024-04-07 00:04:24,097 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
2024-04-07 00:04:39,902 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
2024-04-07 00:04:55,837 WARNING Waiting 1.0s due to reason: server returned 503
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674216bca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 503 299
2024-04-07 00:05:21,496 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=QMUL HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL HTTP/1.1" 200 1192
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/attr/ HTTP/1.1" 503 299
2024-04-07 00:06:07,145 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/att
r/ HTTP/1.1" 503 299
2024-04-07 00:06:22,830 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/QMUL/attr/ HTTP/1.1" 503 299
2024-04-07 00:06:38,838 WARNING Waiting 1.0s due to reason: server returned 503
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
WARNING:root:Attributes of the RSE: QMUL not available.
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674216bdf0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 67
DEBUG:root:Trying to create dataset: fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 HTTP/1.1" 409 104
INFO:root:Dataset fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 503 299
2024-04-07 00:07:43,747 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (5): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 404 238
DEBUG:root:File DID does not exist
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b674344dca0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
2024-04-07 00:08:09,206 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
2024-04-07 00:08:24,971 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 503 299
2024-04-07 00:08:40,984 WARNING Waiting 1.0s due to reason: server returned 503
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b77fa844ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
2024-04-07 00:09:04,890 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
2024-04-07 00:09:20,654 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 503 299
2024-04-07 00:09:36,591 WARNING Waiting 1.0s due to reason: server returned 503
WARNING:baseclient:Waiting 1.0s due to reason: server returned 503
justin-rucio-upload fails: An unknown exception occurred.
Details: no error information passed (http status code: 503)
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b9c7f6e6ca0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=RAL-PP HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP HTTP/1.1" 200 1634
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 503 299
2024-04-07 00:10:18,472 WARNING Waiting 0.25s due to reason: server returned 503
WARNING:baseclient:Waiting 0.25s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 503 299
2024-04-07 00:10:34,158 WARNING Waiting 0.5s due to reason: server returned 503
WARNING:baseclient:Waiting 0.5s due to reason: server returned 503
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/RAL-PP/attr/ HTTP/1.1" 200 350
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b9c809c78b0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 678
DEBUG:root:Trying to create dataset: fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 HTTP/1.1" 409 104
INFO:root:Dataset fardet-vd:fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/fardet-vd/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root/meta?plugin=DID_COLUMN HTTP/1.1" 404 238
DEBUG:root:File DID does not exist
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 201 7
INFO:root:Successfully added replica in Rucio catalogue at RAL-PP
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:gfal.NoRename: checking if file exists None
DEBUG:root:Checking if root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root exists
DEBUG:root:gfal.NoRename: checking if file exists root://mover.pp.rl.ac.uk:1094/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: closing protocol connection
DEBUG:root:[{'hostname': 'mover.pp.rl.ac.uk', 'scheme': 'root', 'port': 1094, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 1, 'write': 2, 'delete': 2}, 'wan': {'read': 1, 'write': 2, 'delete': 2, 'third_party_copy_read': 10, 'third_party_copy_write': 10}}, 'extended_attributes': None}, {'hostname': 'heplnx204.pp.rl.ac.uk', 'scheme': 'srm', 'port': 8443, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.Default', 'domains': {'lan': {'read': 0, 'write': 0, 'delete': 0}, 'wan': {'read': 3, 'write': 2, 'delete': 3, 'third_party_copy_read': 0, 'third_party_copy_write': 0}}, 'extended_attributes': {'space_token': 'DATADISK', 'web_service_path': '/srm/managerv2?SFN='}}, {'hostname': 'mover.pp.rl.ac.uk', 'scheme': 'davs', 'port': 2880, 'prefix': '/pnfs/pp.rl.ac.uk/data/dune', 'impl': 'rucio.rse.protocols.gfal.NoRename', 'domains': {'lan': {'read': 2, 'write': 1, 'delete': 1}, 'wan': {'read': 2, 'write': 1, 'delete': 1, 'third_party_copy_read': 1, 'third_party_copy_write': 1}}, 'extended_attributes': None}]
INFO:root:Trying upload with davs to RAL-PP
DEBUG:root:Processing upload with the domain: wan
DEBUG:root:gfal.NoRename: connecting to storage
DEBUG:root:The PFN created from the LFN: davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: checking if file exists davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: checking if file exists davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:put: Attempt 1
DEBUG:root:gfal.NoRename: uploading file from nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root to davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
INFO:root:Successful upload of temporary file. davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:skip_upload_stat=False
DEBUG:root:stat: pfn=davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:gfal.NoRename: getting stats of file davs://mover.pp.rl.ac.uk:2880/pnfs/pp.rl.ac.uk/data/dune/fardet-vd/3f/1f/nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:root:Filesize: Expected=11222972 Found=11222972
DEBUG:root:Checksum: Expected=87c05d33 Found=87c05d33
DEBUG:root:gfal.NoRename: closing protocol connection
DEBUG:root:Upload done.
INFO:root:Successfully uploaded file nu_numu2nutau_nue2numu_dunevd10kt_1x8x6_3view_30deg_1004_695_20230810T010722Z_gen_g4_detsim_hitreco__20240406T223117Z_reco2_ana.root
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v33_3_0/NULL/lib/python3.9/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dune-rucio.fnal.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /traces/ HTTP/1.1" 404 207
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b9c7f6e6df0> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "PUT /replicas HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/fardet-vd/fardet-vd-reco2ana_ritm1780305_nu_numu2nutau_nue2numu_fhc_skip0_end_1779/dids HTTP/1.1" 201 7
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
User: dunepro
Expires: Sun Apr 14 00:13:07 2024
Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org
Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/
User: dunepro
Expires: Sun Apr 14 00:13:23 2024
{
"checksums": {
"adler32": "2b80a75b"
},
"created_timestamp": null,
"creator": "dunepro",

Steps to reproduce

  1. Get approximately 8K jobs running.
  2. Have each of them attempt to upload 5 individual files in succession.

Rucio Version

Client v33.3; server v33.3 and v34.

Additional Information

This behavior happens with the Rucio v33_3 client, Python version 3.9, under both SL7 and Alma9.

@jamesp-epcc jamesp-epcc self-assigned this Apr 15, 2024
@jamesp-epcc
Contributor

After looking through the code and the log above, I don't think the problem is exceptions being missed. It's true that the upload client doesn't explicitly catch exceptions from some of the server calls, but they should still be propagated to a higher level.

In any case, it doesn't look as though any relevant exceptions are being thrown here. The server calls for the DID registration and replica registration are returning 200 or 201 (success). So the server is reporting to the client that the file was registered correctly, but it wasn't. Do you know if the replica registration is missing entirely or is it present but in the wrong state? (The replica is initially created before the file is uploaded, in state "C", then is set to state "A" if the upload succeeds, so this distinction could give us a clue as to what's going wrong).

Is there any possibility of getting the server logs from while this was going on? I can try to reproduce this in a test environment, but I'm not sure how easy it will be to reproduce a heavily loaded Rucio server.
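One way to check the missing-vs-wrong-state question from the Python client, as a hedged sketch with placeholder DID values (list_replicas with all_states=True should return replicas regardless of their state):

```python
from rucio.client import Client
from rucio.common.exception import DataIdentifierNotFound

client = Client()
did = {'scope': 'fardet-vd', 'name': 'example_file.root'}  # placeholder DID

try:
    replicas = list(client.list_replicas([did], all_states=True))
except DataIdentifierNotFound:
    replicas = []

if not replicas:
    print('No DID / replica row registered at all')
else:
    for rep in replicas:
        # 'states' maps RSE name -> state character ('A', 'C', 'U', ...)
        print(rep['name'], rep.get('states'))
```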

@StevenCTimm
Author

Unfortunately the server logs rotated out already. The Rucio replicas in question showed in state "U" when I did
rucio list-file-replicas --missing

@StevenCTimm
Author

We will watch again during the upcoming dress rehearsal to see if this happens again; I expect it will.

@StevenCTimm
Author

So one thing to look at is the following: DUNE has a privilege package which calls out to an external server (MetaCat).

It could be that some of the external calls to this server are not properly catching exceptions, and thus the server side is returning an OK status when things are not really OK, and so the fault is not in the client at all. How would we investigate this?

@jamesp-epcc
Contributor

I don't think the symptoms match what would happen if the MetaCat call was failing. If the file being uploaded doesn't exist in MetaCat then that should prevent the upload (this may still not be working yet due to issue #6058), but it should kill the upload entirely rather than allow the file to be uploaded but not registered. Conversely, if the check returned success when it should have returned failure, Rucio will just go ahead and upload the file as normal. The MetaCat entry will be missing but this shouldn't have any effect on the replica registration in Rucio.

@StevenCTimm
Author

StevenCTimm commented Apr 29, 2024 via email

@jamesp-epcc
Contributor

I'm not certain, but most likely either an empty list would be returned causing the permission check to fail, or an exception would be thrown in the MetaCat API library which would be caught somewhere in the Rucio server code and propagated back to the client as an internal server error, or at the very least would appear in the server logs.

@StevenCTimm
Author

I have a capture of a log where this happened and all server logs from that time, but I am not seeing any failures.

This is from our job log

OK now I have isolated the failure mode to the following:
May 01 16:46:36 Try 1/3 of executing these commands in a subshell: source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh ; setup python v3_9_13 ; setup metacat ; export X509_USER_PROXY=justin-jobs-production.proxy.pem ; export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune ; export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app ; metacat auth login -m x509 dunepro ; metacat file declare --json -f tmp.json -s 2381705143 -c "adler32:1cf32087" "dune:all" 2>&1
May 01 16:46:40 metacat returns 0 - success
May 01 16:46:40 Try usertests:prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root to NIKHEF/davs
May 01 16:46:40 Try 1/3 of executing these commands in a subshell: source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh ; setup python v3_9_13 ; setup rucio ; setup justin ; export X509_USER_PROXY=justin-jobs-production.proxy.pem ; unset GFAL_CONFIG_DIR GFAL_PLUGIN_DIR GFAL2_DIR ; justin-rucio-upload --rse NIKHEF --protocol davs --scope usertests --dataset april_pdhd_dress_rehearsal_reco_5_1_24_00 --timeout 1200 home/workspace/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root 2>&1
May 01 17:07:00 justin-rucio-upload returns 0 - success
May 01 17:07:00 Uploaded usertests:prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root to NIKHEF in 1220.309s

(apologies that this got cut into and out of mattermost)

In any case, I have 9 Rucio server pods active right now. I have searched through all of their logs for this file name and for anything with the "Failed" error message.

What should I be looking for?
As with all others in this sequence, the file actually does get uploaded successfully and renamed; the replica declaration doesn't work, but the declaration to the dataset does. Since we are not using --register-after-upload, the three operations happen in that order.

Lines 288-298 of the Rucio upload client appear to be where the code must be failing, but I can't find any error message like that either in the Rucio server logs or the httpd error logs.

What am I missing? What should I be looking for?

It appears that for most uploads we don't see the name of the file we are uploading in the Rucio server logs.
We do, however, see the file name in the subsequent list-file-replicas calls we make to check that things are there.

Suggestions? What should we be looking for? These files will rotate off the server pretty soon. I have them available in Elasticsearch going further back, but I don't know what to search for; especially since the upload client is returning zero, there may be no error messages to be found.

@jamesp-epcc
Contributor

I have a couple of questions.

  1. You said previously (when talking about previous files that had failed in this way) that the replicas were in state 'U'. Is that the case for this latest file as well?
  2. In the log you posted in the first post of this issue, the file upload failed several times (due to the Rucio server returning 503 errors) before the file was finally uploaded. Did this happen with this latest file as well? I noticed that on one of the previous upload attempts (for the first file), it was the replica registration step that returned 503, causing the entire operation to fail. I'm wondering whether this could have left the database in a bad state that resulted in the later upload attempts silently failing to declare the replica properly.

@StevenCTimm
Author

(1) Yes--in this latest test we have four files missing replicas, and if you do rucio list-file-replicas --missing they all show in state U.

(2) So here is the stderr of the corresponding stdout I posted above:

@StevenCTimm
Author

DEBUG:root:Num. of files that upload client is processing: 1
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:NeedRegenerationException
DEBUG:dogpile.lock:no value, waiting for create lock
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b8164c7f4f0> acquired
DEBUG:dogpile.cache.region:No value present for key: "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Calling creation function for not-yet-present value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/?expression=NIKHEF HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/NIKHEF HTTP/1.1" 200 1252
DEBUG:root:Input validation done.
INFO:root:Preparing upload for file prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /rses/NIKHEF/attr/ HTTP/1.1" 200 314
DEBUG:root:wan domain is used for the upload
DEBUG:root:Registering file
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /accounts/dunepro/scopes/ HTTP/1.1" 200 678
DEBUG:root:Trying to create dataset: usertests:april_pdhd_dress_rehearsal_reco_5_1_24_00
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/usertests/april_pdhd_dress_rehearsal_reco_5_1_24_00 HTTP/1.1" 409 104
INFO:root:Dataset usertests:april_pdhd_dress_rehearsal_reco_5_1_24_00 already exists - no rule will be created
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "GET /dids/usertests/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root/meta?plugin=DID_COLUMN HTTP/1.1" 404 246
DEBUG:root:File DID does not exist
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /replicas HTTP/1.1" 201 7
INFO:root:Successfully added replica in Rucio catalogue at NIKHEF
DEBUG:root:gfal.Default: connecting to storage
DEBUG:root:gfal.Default: checking if file exists None
DEBUG:root:Checking if root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root exists
DEBUG:root:gfal.Default: checking if file exists root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:gfal.Default: closing protocol connection
DEBUG:root:[{'hostname': 'dune.dcache.nikhef.nl', 'scheme': 'root', 'port': 1094, 'prefix': '/pnfs/nikhef.nl/data/dune/generic/rucio', 'impl': 'rucio.rse.protocols.gfal.Default', 'domains': {'lan': {'read': 1, 'write': 2, 'delete': 2}, 'wan': {'read': 1, 'write': 2, 'delete': 2, 'third_party_copy_read': 5, 'third_party_copy_write': 5}}, 'extended_attributes': None}, {'hostname': 'dune.dcache.nikhef.nl', 'scheme': 'davs', 'port': 2880, 'prefix': '/pnfs/nikhef.nl/data/dune/generic/rucio', 'impl': 'rucio.rse.protocols.gfal.Default', 'domains': {'lan': {'read': 2, 'write': 1, 'delete': 1}, 'wan': {'read': 2, 'write': 1, 'delete': 1, 'third_party_copy_read': 1, 'third_party_copy_write': 1}}, 'extended_attributes': None}]
INFO:root:Trying upload with davs to NIKHEF
DEBUG:root:Processing upload with the domain: wan
DEBUG:root:gfal.Default: connecting to storage
DEBUG:root:The PFN created from the LFN: davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:gfal.Default: checking if file exists davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:gfal.Default: checking if file exists davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload
DEBUG:root:put: Attempt 1
DEBUG:root:gfal.Default: uploading file from prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root to davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload
INFO:root:Successful upload of temporary file. davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload
DEBUG:root:skip_upload_stat=False
DEBUG:root:stat: pfn=davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload
DEBUG:root:gfal.Default: getting stats of file davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload
DEBUG:root:Filesize: Expected=2381705143 Found=2381705143
DEBUG:root:Checksum: Expected=1cf32087 Found=1cf32087
DEBUG:root:Renaming file davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload to davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:gfal.Default: renaming file from davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root.rucio.upload to davs://dune.dcache.nikhef.nl:2880/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/90/db/prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:gfal.Default: closing protocol connection
DEBUG:root:Upload done.
INFO:root:Successfully uploaded file prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): dune-rucio.fnal.gov:443
/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v33_3_0/NULL/lib/python3.9/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dune-rucio.fnal.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /traces/ HTTP/1.1" 404 207
DEBUG:urllib3.connectionpool:Resetting dropped connection: dune-rucio.fnal.gov
ERROR:root:Failed to update replica state for file prod_beam_p1GeV_cosmics_protodunehd_20240330T235955Z_186820_004405_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240501T142659Z_rerun_reco.root
DEBUG:root:Details: HTTPSConnectionPool(host='dune-rucio.fnal.gov', port=443): Read timed out. (read timeout=1200)
DEBUG:dogpile.lock:value creation lock <dogpile.cache.region.CacheRegion._LockWrapper object at 0x2b81650f9d00> acquired
DEBUG:dogpile.lock:Calling creation function for previously expired value
DEBUG:dogpile.cache.region:Cache value generated in 0.000 seconds for key(s): "host_to_choose_choice['https://dune-rucio.fnal.gov']"
DEBUG:dogpile.lock:Released creation lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): dune-rucio.fnal.gov:443
DEBUG:urllib3.connectionpool:https://dune-rucio.fnal.gov:443 "POST /dids/usertests/april_pdhd_dress_rehearsal_reco_5_1_24_00/dids HTTP/1.1" 201 7

@StevenCTimm
Author

So yes, in this case it did indeed fail to update the replica state for the file.
Yet the overall upload method returned status zero to justin-rucio-upload. What is going on?

@StevenCTimm
Author

Looking more carefully at lines 277-304, it seems clear that if either the self.client.update_replicas_state call on line 294 fails (which in this case it did) or the self.client.attach call on line 302 fails (which it didn't in this case but has at other times), the client is still going to count that uploaded file as successful, and if all files in the upload are successful (in this case only 1) it will return 0 at line 337.

This is a bug: if the replica is not declared/updated, it leads to dark data. We need this fixed.
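Until the client raises on these failures itself, a wrapper such as justin-rucio-upload could avoid trusting the zero return code and verify both registrations after the fact. A hedged sketch, with a hypothetical helper name and arguments:

```python
import sys
from rucio.client import Client


def registration_ok(scope, name, rse, dataset_scope, dataset_name):
    """Return True only if the replica is available and attached to the dataset."""
    client = Client()
    did = {'scope': scope, 'name': name}

    # 1) The replica on the target RSE must be in state 'A' (available).
    states = {}
    for rep in client.list_replicas([did], all_states=True):
        states = rep.get('states') or {}
    if states.get(rse) != 'A':
        print(f'replica on {rse} missing or not available: {states}', file=sys.stderr)
        return False

    # 2) The file DID must be attached to the requested dataset.
    attached = any(d['scope'] == scope and d['name'] == name
                   for d in client.list_content(dataset_scope, dataset_name))
    if not attached:
        print(f'{scope}:{name} not attached to {dataset_scope}:{dataset_name}',
              file=sys.stderr)
        return False
    return True
```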

@StevenCTimm
Author

Do you still think there's anything to be found on the server side? I am not sure what I am looking for. The DIDs are often truncated in the Rucio server logs, and the server logs are going to rotate out pretty soon if they haven't already, so time is of the essence.

@StevenCTimm
Author

To be clear, it would really be nice to have a Rucio server that doesn't time out, but if it must time out, then we have to catch that in the client and have a way to fix the damage that is caused.

@jamesp-epcc
Contributor

Yes, I think in this case the client should return non-zero. Specifically, uploadclient.upload should raise an exception if things are not registered correctly. Just printing a message to the log is too easy to miss.

I'm a bit concerned that maybe there is something else going on as well, because the original log doesn't show the "Failed to update replica state for file" message. It could be a similar issue but a slightly different failure mode. It's hard to tell exactly what's going on... the replica code is quite complicated and I can't see where the replica state could be getting set to U.

@StevenCTimm
Author

The earlier client log snippet that I pasted above was going to RAL-PP, where we have a NoRename implementation, so the flow could have been slightly different. The one from 1 May was going to NIKHEF, where the file does get renamed.
In any case, since justin-rucio-upload does not use the --register-after-upload option, the declaration that a replica is coming is done at the start of the flow, before the upload, and the call after the upload is just a state change--so the first replica call sets it to U and the second call then fails to set it away from U.

Or so I think.
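For reference, a hedged sketch of that sequence using the public client calls (placeholder values taken from the 1 May log; the real uploadclient.py flow is more involved):

```python
from rucio.client import Client

client = Client()
rse, scope, name = 'NIKHEF', 'usertests', 'example_file.root'  # placeholder values

# Step 1 (before the physical copy): declare that a replica is coming by
# creating the replica entry up front.
client.add_replicas(rse, files=[{'scope': scope, 'name': name,
                                 'bytes': 2381705143, 'adler32': '1cf32087'}])

# Step 2: the physical transfer via gfal happens here (not shown).

# Step 3 (after the copy): flip the replica state to 'A' (available). In the
# 1 May log this call timed out, so the replica never left its initial state.
client.update_replicas_states(rse, files=[{'scope': scope, 'name': name, 'state': 'A'}])

# Step 4: attach the file DID to the dataset; this succeeded in the 1 May log.
client.attach_dids('usertests', 'april_pdhd_dress_rehearsal_reco_5_1_24_00',
                   [{'scope': scope, 'name': name}])
```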

@StevenCTimm
Author

Eventually we also have to get to the bigger problem of why the server is timing out. These 4 timeouts out of 27,500 files processed happened with no more than 3,000 simultaneous jobs running at any point on Wednesday 5/1. With 10K jobs running we see timeouts in the hundreds, and that is the job level we have to get to eventually.

@jamesp-epcc
Contributor

The earlier client log snippet that I pasted above was going to RAL-PP where we have a noRename implementation so the flow of things could have been slightly different. The one from 1 May was going to NIKHEF where the file does get renamed. In any case, since the Justin-rucio-upload does not do the register-after-upload option, the fact that a replica is coming is done at the start of the flow before the upload and then the call after the upload is just a change state--so the first replica call sets it to U and the second call then fails to set it away from U.

From the code it looks as though the replica state should initially be set to C (copying), then changed to A (available) if the upload succeeds, so I am not sure where the U is coming from. However it also looks as if list-file-replicas sometimes modifies the state that gets returned, so it's possible it's never actually set to U in the database.

I will put in a PR to ensure that failures to update the replica state get flagged, and similarly for the call to attach the file to the dataset, which can also fail right now and still have the command return 0.

@StevenCTimm
Author

Indeed--we have many documented instances where the call to attach the file to the dataset has also failed; those are much easier to recover from afterwards.

@bari12 bari12 added this to the 34.4.1 / 34.5.0 milestone May 22, 2024
@bari12 bari12 closed this as completed May 22, 2024
@bari12
Member

bari12 commented May 22, 2024

Backport 32.8.2

@bari12 bari12 added the backport Issues which are backported to additional releases label May 22, 2024