Issue when reading 4000+ pdb files in the same Universe #4590

Open
samuelmurail opened this issue May 6, 2024 · 10 comments
Labels
Format-ChainReader (ChainReader virtual trajectory reader), Format-PDB

Comments

@samuelmurail

Expected behavior

Hello,

I am using MDAnalysis to compute PCA on a big set of pdb files.

u = mda.Universe(files[0], files)

Actual behavior

When I have more than roughly 4000 structures, I get this error message:

Exception ignored in: <function ReaderBase.__del__ at 0x742932ccbc70>
Traceback (most recent call last):
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/base.py", line 1532, in __del__
    self.close()
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py", line 597, in close
    self._apply('close')
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py", line 504, in _apply
    return [reader.__getattribute__(method)(**kwargs) for reader in self.readers]
AttributeError: 'ChainReader' object has no attribute 'readers'
Exception ignored in: <function ReaderBase.__del__ at 0x742932ccbc70>
Traceback (most recent call last):
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/base.py", line 1532, in __del__
    self.close()
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/PDB.py", line 485, in close
    self._pdbfile.close()
AttributeError: 'PDBReader' object has no attribute '_pdbfile'

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[50], line 1
----> 1 clustering.clustering(local_df, 0.3, show_dendrogram=True)

File ~/Documents/Code/af2_analysis/src/af2_analysis/clustering.py:104, in clustering(my_data_df, threshold, show_dendrogram, show_cluster_distribution)
    102 assert null_number == sum(pd.isnull(my_data_df[my_data_df['query'] == pdb]['pdb'][-null_number:])), f"Missing pdb data in the middle of the query {pdb}"
    103 print("Read all structures")
--> 104 u = mda.Universe(files[0], files)
    105 chain_pep_value=chain_pep(files[0])
    106 print("align structures")

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/core/universe.py:375, in Universe.__init__(self, topology, all_coordinates, format, topology_format, transformations, guess_bonds, vdwradii, fudge_factor, lower_bound, in_memory, in_memory_step, *coordinates, **kwargs)
    370 coordinates = _resolve_coordinates(self.filename, *coordinates,
    371                                    format=format,
    372                                    all_coordinates=all_coordinates)
    374 if coordinates:
--> 375     self.load_new(coordinates, format=format, in_memory=in_memory,
    376                 in_memory_step=in_memory_step, **kwargs)
    378 if transformations:
    379     if callable(transformations):

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/core/universe.py:580, in Universe.load_new(self, filename, format, in_memory, in_memory_step, **kwargs)
    577 # supply number of atoms for readers that cannot do it for themselves
    578 kwargs['n_atoms'] = self.atoms.n_atoms
--> 580 self.trajectory = reader(filename, format=format, **kwargs)
    581 if self.trajectory.n_atoms != len(self.atoms):
    582     raise ValueError("The topology and {form} trajectory files don't"
    583                      " have the same number of atoms!\n"
    584                      "Topology number of atoms {top_n_atoms}\n"
   (...)
    588                          fname=filename,
    589                          trj_n_atoms=self.trajectory.n_atoms))

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/lib/util.py:2553, in store_init_arguments.<locals>.wrapper(self, *args, **kwargs)
   2551             else:
   2552                 self._kwargs[key] = arg
-> 2553 return func(self, *args, **kwargs)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py:270, in ChainReader.__init__(self, filenames, skip, dt, continuous, convert_units, **kwargs)
    268 if dt is not None:
    269     kwargs['dt'] = dt
--> 270 self.readers = [core.reader(filename, convert_units=convert_units, **kwargs)
    271                 for filename in filenames]
    272 self.filenames = np.array([fn[0] if isinstance(fn, tuple) else fn
    273                                                 for fn in filenames])
    274 # pointer to "active" trajectory index into self.readers

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py:270, in <listcomp>(.0)
    268 if dt is not None:
    269     kwargs['dt'] = dt
--> 270 self.readers = [core.reader(filename, convert_units=convert_units, **kwargs)
    271                 for filename in filenames]
    272 self.filenames = np.array([fn[0] if isinstance(fn, tuple) else fn
    273                                                 for fn in filenames])
    274 # pointer to "active" trajectory index into self.readers

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/core.py:82, in reader(filename, format, **kwargs)
     80     Reader = get_reader_for(filename, format=format)
     81 try:
---> 82     return Reader(filename, **kwargs)
     83 except ValueError:
     84     errmsg = f'Unable to read {filename} with {Reader}.'

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/lib/util.py:2553, in store_init_arguments.<locals>.wrapper(self, *args, **kwargs)
   2551             else:
   2552                 self._kwargs[key] = arg
-> 2553 return func(self, *args, **kwargs)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/PDB.py:303, in PDBReader.__init__(self, filename, **kwargs)
    300 if isinstance(filename, util.NamedStream) and isinstance(filename.stream, StringIO):
    301     filename.stream = BytesIO(filename.stream.getvalue().encode())
--> 303 pdbfile = self._pdbfile = util.anyopen(filename, 'rb')
    305 line = "magical"
    306 while line:
    307     # need to use readline so tell gives end of line
    308     # (rather than end of current chunk)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/lib/util.py:402, in anyopen(datasource, mode, reset)
    400                 break
    401         if stream is None:
--> 402             raise IOError(errno.EIO, "Cannot open file or stream in mode={mode!r}.".format(**vars()), repr(filename))
    403 elif mode.startswith('w') or mode.startswith('a'):  # append 'a' not tested...
    404     if isstream(datasource):

OSError: [Errno 5] Cannot open file or stream in mode='rb'.: "'../colabfold_1.5.2/X/X_unrelaxed_rank_2017_alphafold2_multimer_v3_model_2_seed_804.pdb'"

Code to reproduce the behavior

To reproduce the behavior, I just need a list of 4000+ pdb files (colabfold predictions with 1000+ seeds).

It works with up to 4000 structures, but a few more produce the error above.

import MDAnalysis as mda

u = mda.Universe(files[0], files)

Current version of MDAnalysis

  • Which version are you using?
    2.7.0
  • Which version of Python (python -V)?
    Python 3.10.13
  • Which operating system?
    Ubuntu 22.0
@IAlibay
Member

IAlibay commented May 6, 2024

@samuelmurail any chance the exact number it fails at is 4096 files?

That would be the default file open limit in modern Linux if I remember correctly.
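
You can check it from Python with the standard-library resource module, e.g. (a minimal sketch):

import resource

# (soft, hard) limits on the number of open file descriptors for this process
print(resource.getrlimit(resource.RLIMIT_NOFILE))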

@IAlibay
Member

IAlibay commented May 6, 2024

Off the top of my head, options here could be to increase your hard limit using ulimit or to concatenate chunks of your PDBs into a multi-frame trajectory file.

@samuelmurail
Author

> @samuelmurail any chance the exact number it fails at is 4096 files?

No, it is more around 4200 structures.

> Off the top of my head, options here could be to increase your hard limit using ulimit or to concatenate chunks of your PDBs into a multi-frame trajectory file.

I have used another library to concatenate all the pdb files into a single one, which solved my issue; however, it is very slow.

Any idea how to increase the hard limit?

I have tried:

import resource
resource.setrlimit(resource.RLIMIT_DATA, (1024,1024))

But now I get this error:

Exception ignored in: <function ReaderBase.__del__ at 0x7bd4b80331c0>
Traceback (most recent call last):
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/base.py", line 1532, in __del__
    self.close()
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py", line 597, in close
    self._apply('close')
  File "/home/murail/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py", line 504, in _apply
    return [reader.__getattribute__(method)(**kwargs) for reader in self.readers]
AttributeError: 'ChainReader' object has no attribute 'readers'

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[39], line 1
----> 1 clustering.clustering(my_data.df, 0.3, show_dendrogram=True, max_files=30000)

File ~/Documents/Code/af2_analysis/src/af2_analysis/clustering.py:117, in clustering(my_data_df, threshold, show_dendrogram, show_cluster_distribution, max_files)
    115 assert null_number == sum(pd.isnull(my_data_df[my_data_df['query'] == pdb]['pdb'][-null_number:])), f"Missing pdb data in the middle of the query {pdb}"
    116 print("Read all structures")
--> 117 u = get_univ(files, max_files=max_files)
    118 chain_pep_value=chain_pep(files[0])
    119 print("align structures")

File ~/Documents/Code/af2_analysis/src/af2_analysis/clustering.py:173, in get_univ(files, max_files)
    156 """Concatenating PDB trajectories.
    157 
    158 This function concatenates the different PDB files of a trajectory into a single PDB file.
   (...)
    169 None
    170 """
    172 if len(files) < max_files:
--> 173     return mda.Universe(files[0], files, in_memory=False)
    175 print("Start concat pdb files")
    176 first_coor = pdb_numpy.Coor(files[0])

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/core/universe.py:375, in Universe.__init__(self, topology, all_coordinates, format, topology_format, transformations, guess_bonds, vdwradii, fudge_factor, lower_bound, in_memory, in_memory_step, *coordinates, **kwargs)
    370 coordinates = _resolve_coordinates(self.filename, *coordinates,
    371                                    format=format,
    372                                    all_coordinates=all_coordinates)
    374 if coordinates:
--> 375     self.load_new(coordinates, format=format, in_memory=in_memory,
    376                 in_memory_step=in_memory_step, **kwargs)
    378 if transformations:
    379     if callable(transformations):

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/core/universe.py:580, in Universe.load_new(self, filename, format, in_memory, in_memory_step, **kwargs)
    577 # supply number of atoms for readers that cannot do it for themselves
    578 kwargs['n_atoms'] = self.atoms.n_atoms
--> 580 self.trajectory = reader(filename, format=format, **kwargs)
    581 if self.trajectory.n_atoms != len(self.atoms):
    582     raise ValueError("The topology and {form} trajectory files don't"
    583                      " have the same number of atoms!\n"
    584                      "Topology number of atoms {top_n_atoms}\n"
   (...)
    588                          fname=filename,
    589                          trj_n_atoms=self.trajectory.n_atoms))

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/lib/util.py:2553, in store_init_arguments.<locals>.wrapper(self, *args, **kwargs)
   2551             else:
   2552                 self._kwargs[key] = arg
-> 2553 return func(self, *args, **kwargs)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py:270, in ChainReader.__init__(self, filenames, skip, dt, continuous, convert_units, **kwargs)
    268 if dt is not None:
    269     kwargs['dt'] = dt
--> 270 self.readers = [core.reader(filename, convert_units=convert_units, **kwargs)
    271                 for filename in filenames]
    272 self.filenames = np.array([fn[0] if isinstance(fn, tuple) else fn
    273                                                 for fn in filenames])
    274 # pointer to "active" trajectory index into self.readers

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/chain.py:270, in <listcomp>(.0)
    268 if dt is not None:
    269     kwargs['dt'] = dt
--> 270 self.readers = [core.reader(filename, convert_units=convert_units, **kwargs)
    271                 for filename in filenames]
    272 self.filenames = np.array([fn[0] if isinstance(fn, tuple) else fn
    273                                                 for fn in filenames])
    274 # pointer to "active" trajectory index into self.readers

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/core.py:82, in reader(filename, format, **kwargs)
     80     Reader = get_reader_for(filename, format=format)
     81 try:
---> 82     return Reader(filename, **kwargs)
     83 except ValueError:
     84     errmsg = f'Unable to read {filename} with {Reader}.'

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/lib/util.py:2553, in store_init_arguments.<locals>.wrapper(self, *args, **kwargs)
   2551             else:
   2552                 self._kwargs[key] = arg
-> 2553 return func(self, *args, **kwargs)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/PDB.py:349, in PDBReader.__init__(self, filename, **kwargs)
    346 self._stop_offsets = offsets[1:] + [end]
    347 self.n_frames = len(offsets)
--> 349 self._read_frame(0)

File ~/miniforge3/envs/openmm/lib/python3.10/site-packages/MDAnalysis/coordinates/PDB.py:419, in PDBReader._read_frame(self, frame)
    417 # Seek to start and read until start of next frame
    418 self._pdbfile.seek(start)
--> 419 chunk = self._pdbfile.read(stop - start).decode()
    421 tmp_buf = []
    422 for line in chunk.splitlines():

MemoryError: 

@orbeckst added the Format-PDB and Format-ChainReader labels on May 9, 2024
@IAlibay
Member

IAlibay commented May 9, 2024

Sorry my capacity is a bit limited this week - maybe someone from @MDAnalysis/coredevs has some insights here?

@orbeckst
Member

orbeckst commented May 9, 2024

I can decrease my open file limit with RLIMIT_NOFILE

resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))

and this makes the chainreader fail at even fewer than 16 files, simply because Python already has a number of files open.

import MDAnalysis as mda; import MDAnalysisTests.datafiles as data
import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))
u = mda.Universe(data.PDB, 15 * [data.PDB])

fails with OSError: [Errno 24] Too many open files.

Note that on my macOS,

In [1]: import resource
In [2]: resource.getrlimit(resource.RLIMIT_NOFILE)
Out[2]: (256, 9223372036854775807)

the default hard limit for open files is RLIM_INFINITY.

When trying to load the same PDB 5000 times

u = mda.Universe(data.PDB, 5000 * [data.PDB])

I also get OSError: [Errno 24] Too many open files.

Trying with an increased soft limit

import MDAnalysis as mda; import MDAnalysisTests.datafiles as data
import resource
resource.setrlimit(resource.RLIMIT_NOFILE, (6000, resource.RLIM_INFINITY))

u = mda.Universe(data.PDB, 5000 * [data.PDB])

... that took a while but worked.

I don't have the time right now to run the mem profiler to see if the chainreader + PDB takes up memory for all files ... it might, and then that is probably the reason for the MemoryError. Something to check.
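
A quick sketch with the standard-library tracemalloc would be one way to check; this is only an illustration (not measured here), reusing the MDAnalysisTests data file from above:

import tracemalloc

import MDAnalysis as mda
import MDAnalysisTests.datafiles as data

tracemalloc.start()
for n in (10, 100, 500):
    before, _ = tracemalloc.get_traced_memory()
    u = mda.Universe(data.PDB, n * [data.PDB])
    after, _ = tracemalloc.get_traced_memory()
    # if each chained reader holds its own Timestep, this should grow roughly linearly with n
    print(f"{n:4d} files: ~{(after - before) / 1e6:.1f} MB allocated")
    del u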

@orbeckst
Member

orbeckst commented May 9, 2024

The issue with open file descriptors is likely related to #239.

@yuxuanzhuang
Contributor

To add more detail on why the memory usage of ChainReader increases with the number of files: during initialization, every file is opened and a Timestep is created and attached to it. This Timestep contains, for example, the positions of all atoms. Therefore, for 4000 single-frame PDB files, this essentially means reading 4000 frames into memory.

@samuelmurail, could you try converting the files into a temporary trajectory format first, e.g.

import MDAnalysis as mda
from MDAnalysisTests.datafiles import PDB

files = [PDB] * 10
u = mda.Universe(PDB)

# stream each PDB through the Universe and append it as one frame of a
# single multi-frame XTC; only one file is open at a time
with mda.Writer('test.xtc', u.atoms.n_atoms) as W:
    for file in files:
        u.load_new(file)
        W.write(u.atoms)

# reload the concatenated trajectory: one file handle instead of thousands
u = mda.Universe(PDB, 'test.xtc')

@orbeckst
Member

Thanks @yuxuanzhuang, makes sense.

I don't have a good idea of how one could make the ChainReader work with arbitrary numbers of files while avoiding the limitation of having so many readers open at the same time — this would likely require a redesign.

@samuelmurail
Author

Thanks for the feedback, here is my fix:

import MDAnalysis as mda
from MDAnalysis.coordinates.memory import MemoryReader
import numpy as np

def read_numerous_pdb(pdb_files, batch_size=1000):
    """
    Read a large number of PDB files in batches and combine them into a single MDAnalysis Universe.

    Parameters
    ----------
    pdb_files : list of str
        List of file paths to the PDB files to be read.
    batch_size : int, optional
        Number of PDB files to read in each batch. Default is 1000.

    Returns
    -------
    MDAnalysis.Universe
        A single MDAnalysis Universe containing the combined frames from all the PDB files.
        
    Notes
    -----
    - This function reads PDB files in batches to avoid memory issues.
    - Each batch of PDB files is loaded into a temporary Universe, and the positions of each frame
      are stored in a list.
    - The list of frames is then converted into a numpy array and used to create a new Universe with
      a MemoryReader, combining all the frames.
    
    Example
    -------
    >>> pdb_files = ['file1.pdb', 'file2.pdb', ...]
    >>> combined_universe = read_numerous_pdb(pdb_files, batch_size=1000)
    >>> print(combined_universe)
    """

    all_frames = []
    
    for i in range(0, len(pdb_files), batch_size):
        # print(f"Reading frames {i:5} to {i+batch_size:5}, total : {len(pdb_files[i:i+batch_size])} frames")
        local_u = mda.Universe(pdb_files[0], pdb_files[i:i+batch_size])
        for ts in local_u.trajectory:
            all_frames.append(ts.positions.copy())
        del local_u
    
    # print("Convert to numpy")
    frames_array = np.array(all_frames)
    del all_frames
    
    # print(frames_array.shape)
    return mda.Universe(pdb_files[0], frames_array, format=MemoryReader, order='fac')

Using this function, I could read 20,000+ files, probably more.

Cheers,
Samuel

@richardjgowers
Member

@orbeckst I think redesign is maybe too strong a word (implying API breaks?). I think you could refactor (implying no API breaks) ChainReader to open file handles just-in-time (some pattern of actively using .open()/.close() methods on the owned/child Readers as you walk the files) to allow this sort of usage. This would probably be a heavy lift.
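
As a rough sketch of that pattern (the LazyChain class and its _activate hook are hypothetical, not the actual ChainReader API; it only illustrates holding at most one child reader open at a time):

import MDAnalysis.coordinates.core as core

class LazyChain:
    """Hypothetical just-in-time chain: at most one child reader open at once."""

    def __init__(self, filenames, **kwargs):
        self.filenames = list(filenames)
        self.kwargs = kwargs
        self._active = None       # the one currently open child reader
        self._active_i = None

    def _activate(self, i):
        # close the previous reader before opening the next file so that
        # only a single file descriptor is held at any time
        if self._active is not None and self._active_i != i:
            self._active.close()
            self._active = None
        if self._active is None:
            self._active = core.reader(self.filenames[i], **self.kwargs)
            self._active_i = i
        return self._active

    def __iter__(self):
        # walking the files opens and closes readers lazily
        for i in range(len(self.filenames)):
            for ts in self._activate(i):
                yield ts

    def close(self):
        if self._active is not None:
            self._active.close()
            self._active = None

The trade-off is that random access to an arbitrary frame means reopening the owning file, and ChainReader currently needs every reader's frame count up front to build its virtual index, which is presumably where the heavy lift lies.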
