About
What is mirar? Good question!
Errors
In general, the philosophy is that DataBatch objects
should be processed, unless an error occurs.
If there is an error, then an ErrorReport
should be created for the error.
These ErrorReport objects should be collated in
a single ErrorStack object.
After processing is complete, the ErrorStack can
then be used to summarise these errors, and track which images failed.
Ideally, it should be understood why the processing failed for a given image. Therefore the code distinguishes between internal errors which were deliberately raised, and external errors which were not deliberately raised.
In general, all internal errors should inherit from the
mirar.errors.exceptions.BaseProcessorError class.
If the error is critical (i.e. the image should not be processed further), then an
error should be raised which inherits from the
mirar.errors.exceptions.ProcessorError class.
If the error is non-critical (so processing should continue), then an
error should be raised which inherits from the
mirar.errors.exceptions.NoncriticalProcessingError class. In that case,
processing will continue but the error will be logged.
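The hierarchy described above can be sketched roughly as follows. This is only an illustration, not mirar's actual implementation: the three exception classes are local stand-ins that mirror the names in mirar.errors.exceptions, while the ErrorReport and ErrorStack shown here, their attributes, and the run_step helper are simplified, hypothetical versions written for this example.

```python
from dataclasses import dataclass, field


# Local stand-ins mirroring the class names in mirar.errors.exceptions.
class BaseProcessorError(Exception):
    """Base class for all deliberately raised (internal) errors."""


class ProcessorError(BaseProcessorError):
    """Critical: the image should not be processed further."""


class NoncriticalProcessingError(BaseProcessorError):
    """Non-critical: record the error and keep processing."""


@dataclass
class ErrorReport:
    """Hypothetical, simplified record of one failure."""
    error: Exception
    image_name: str

    @property
    def known(self) -> bool:
        # Internal errors were raised on purpose; anything else is external.
        return isinstance(self.error, BaseProcessorError)


@dataclass
class ErrorStack:
    """Hypothetical, simplified collection of ErrorReport objects."""
    reports: list = field(default_factory=list)

    def add_report(self, report: ErrorReport) -> None:
        self.reports.append(report)

    def images_with_errors(self) -> list:
        return [report.image_name for report in self.reports]


def run_step(step, image_names, stack):
    """Apply `step` to each image name and return the names that survive."""
    surviving = []
    for name in image_names:
        try:
            step(name)
        except NoncriticalProcessingError as exc:
            stack.add_report(ErrorReport(exc, name))
            surviving.append(name)  # recorded, but processing continues
        except Exception as exc:  # critical internal error, or external error
            stack.add_report(ErrorReport(exc, name))
        else:
            surviving.append(name)
    return surviving


def example_step(name):
    if name == "bad.fits":
        raise ProcessorError("image failed a quality cut")
    if name == "odd.fits":
        raise NoncriticalProcessingError("optional header key missing")
    if name == "broken.fits":
        raise ValueError("unexpected failure")  # external, not deliberate


stack = ErrorStack()
kept = run_step(
    example_step, ["ok.fits", "bad.fits", "odd.fits", "broken.fits"], stack
)
```

In this sketch the image with a non-critical error stays in the list of surviving images, while the critical and external failures are dropped. All three end up in the stack, and the `known` flag separates deliberate errors from unexpected ones.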
Data Structure
This contains the base data classes for the :module:`wintedrp.processors`.
The smallest unit is a DataBlock object,
corresponding to a single image.
These DataBlock objects are grouped into
DataBatch objects.
Each BaseProcessor will operate on an individual
DataBatch object.
The DataBatch objects are stored within a larger
DataSet object.
A BaseProcessor will iterate over each
DataBatch in a
DataSet.
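The nesting described above can be pictured with a toy version of the classes. The class names follow the text, but everything else here (the fields, the apply and apply_to_batch methods, and the SuffixAdder processor) is a hypothetical simplification for illustration, and the real mirar classes will differ.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    """Smallest unit: stands in for a single image."""
    name: str


@dataclass
class DataBatch:
    """A group of DataBlock objects that are processed together."""
    blocks: list = field(default_factory=list)

    def __iter__(self):
        return iter(self.blocks)


@dataclass
class DataSet:
    """A collection of DataBatch objects."""
    batches: list = field(default_factory=list)

    def __iter__(self):
        return iter(self.batches)


class BaseProcessor:
    """Hypothetical processor: subclasses override apply_to_batch."""

    def apply_to_batch(self, batch: DataBatch) -> DataBatch:
        return batch

    def apply(self, dataset: DataSet) -> DataSet:
        # Iterate over each DataBatch in the DataSet, one batch at a time.
        return DataSet([self.apply_to_batch(batch) for batch in dataset])


class SuffixAdder(BaseProcessor):
    """Toy processor that renames every block in a batch."""

    def apply_to_batch(self, batch: DataBatch) -> DataBatch:
        return DataBatch([DataBlock(block.name + "_proc") for block in batch])


dataset = DataSet(
    [DataBatch([DataBlock("a"), DataBlock("b")]), DataBatch([DataBlock("c")])]
)
processed = SuffixAdder().apply(dataset)
names = [[block.name for block in batch] for batch in processed]
```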
How does the code actually work?
The code passes DataBlock objects
through a series of BaseProcessor objects.
Since a given image can easily be ~10-100 MB, and there may be several hundred raw images
from a typical survey in a given night, the total data volume for these processors
could be tens of GB or more. Storing all of this in RAM would be very
inefficient and slow on a typical laptop, and on many larger processing machines too.
To mitigate this, the code can be operated in cache mode. In that case, after raw images are loaded, only the header data is stored in memory. The image data itself is stored temporarily as a npy file in a dedicated cache directory, and only loaded into memory when needed. When the data is updated, the npy file is overwritten. The path of the file is a unique hash, and includes the read time of the file, so multiple copies of an image can be read and modified independently.
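A minimal sketch of this idea is shown below, using numpy's npy format. It is not mirar's actual Image class: CachedImage, its methods, and the exact way the file name is built (a hash of a time stamp plus a random token) are assumptions made for this example.

```python
import hashlib
import tempfile
import time
import uuid
from pathlib import Path

import numpy as np


class CachedImage:
    """Hypothetical image: header kept in memory, pixel data kept on disk."""

    def __init__(self, data, header, cache_dir):
        self.header = dict(header)
        # Assumed naming scheme: hash the read time together with a random
        # token, so that every copy of an image gets its own cache file.
        token = f"{time.time_ns()}-{uuid.uuid4().hex}"
        digest = hashlib.sha1(token.encode()).hexdigest()
        self.cache_path = Path(cache_dir) / f"{digest}.npy"
        self.set_data(data)

    def get_data(self):
        # Pixel data is loaded from disk only when it is needed.
        return np.load(self.cache_path)

    def set_data(self, data):
        # Updating the data overwrites the npy file.
        np.save(self.cache_path, np.asarray(data))


with tempfile.TemporaryDirectory() as tmp:
    original = CachedImage(np.zeros((4, 4)), {"OBJECT": "test"}, tmp)
    copy = CachedImage(original.get_data(), original.header, tmp)
    copy.set_data(copy.get_data() + 1.0)  # modify the copy only

    original_sum = float(original.get_data().sum())
    copy_sum = float(copy.get_data().sum())
    distinct_files = original.cache_path != copy.cache_path
```

Because each object writes to its own file, changing the copy leaves the original untouched.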
In cache mode, all of the image data is temporarily stored in a cache, and this cache can therefore reach a size of tens of GB. The cache is located in the configurable output data directory. Without cleanup, it would grow linearly with successive code executions. To mitigate that, and to avoid cleaning the cache by hand, the code tries to automatically delete cache files as needed.
Python calls an object's __del__() method, if one is defined, when the object is about to be destroyed. Image objects delete their cache files in this method. However, Python has a somewhat complicated method of ‘garbage collection’ (see the official description for more info), and it is not guaranteed that Image objects will clean up after themselves.
As a fallback, when you run the code from the command line (and therefore call __main__), we use the standard Python tempfile library <https://docs.python.org/3/library/tempfile.html> to create a temporary directory, and set this as the cache. We create the directory using a with context manager, ensuring that cleanup runs automatically before exiting, even if the code crashes or raises errors. We also use tempfile and careful cleaning
for the unit tests, as provided by the base test class. If you try to interact with the code in any other way, please be mindful of this behaviour, and ensure that you clean your cache in a responsible way!
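If you do call the code in some other way, the same pattern is easy to reproduce. The sketch below uses only the standard library and shows that a directory made with tempfile.TemporaryDirectory is removed even when the code inside the with block raises. The failing_task function and the file name are made up for illustration, and how you point mirar at the directory is not shown here.

```python
import tempfile
from pathlib import Path

seen = {}


def failing_task(cache_dir: Path) -> None:
    """Stand-in for a pipeline run that writes to the cache, then crashes."""
    (cache_dir / "image.npy").write_bytes(b"placeholder")
    seen["dir"] = cache_dir
    seen["existed_during_run"] = cache_dir.exists()
    raise RuntimeError("simulated crash")


try:
    # The directory and its contents are deleted when the block exits,
    # whether it exits normally or through an exception.
    with tempfile.TemporaryDirectory() as tmp:
        failing_task(Path(tmp))
except RuntimeError:
    pass

cache_removed = not seen["dir"].exists()
```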
If you don’t like this feature, you don’t need to use it. Cache mode is entirely optional, and can be disabled by setting the USE_WINTER_CACHE environment variable to false:
export USE_WINTER_CACHE=false
See Usage for more information about selecting cache mode, and setting the output data directory.