
Feature request, write_element does not materialize computation in in-memory SpatialData Object #866

Open
ArneDefauw opened this issue Feb 10, 2025 · 7 comments

Comments

@ArneDefauw
Contributor

When using sdata.write_element(), the Dask computation remains lazy in the in-memory SpatialData object when the SpatialData object is backed by disk.
So each time we call sdata.write_element(...), we need to call spatialdata.read_zarr(sdata.path) to get the materialized element back.

See example:


import os
from tempfile import gettempdir

import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel

# 1) creation of a SpatialData object backed by a Zarr store
sdata = SpatialData()

arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))

sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

sdata.write(os.path.join(gettempdir(), "sdata.zarr"))

# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)

# 3) do something with the data, and save the result to a new slot
arr = sdata["image"].data * 2

sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

# this triggers computation, similar to .to_zarr(...)
sdata.write_element("image_mul")

for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        # the 'mul-...' layer is not materialized, i.e. the computation is still
        # lazy in the in-memory sdata object, similar to `dask.array.to_zarr`
        print(layer)

# if we load back from the Zarr store, the 'mul-...' layer is indeed materialized,
# similar to `dask.array.from_zarr`
sdata = read_zarr(sdata.path)
for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)

Although this is in line with e.g. dask.array.to_zarr and dask.array.from_zarr, it would be a nice feature to have an extra parameter inplace, e.g. sdata.write_element(..., inplace=True), which takes care of reloading from the Zarr store. Ideally this would only reload the element that was written to disk, not the complete SpatialData Zarr store, which can be slow when there are a lot of tables or shapes, as these are always loaded into memory.
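
As a rough sketch of the intended behaviour, this can be wrapped manually today (hypothetical helper name; note that it re-reads the whole store, which is exactly the overhead inplace=True should avoid):

from spatialdata import SpatialData, read_zarr


def write_element_and_reload(sdata: SpatialData, name: str) -> None:
    # write the element to the backing Zarr store, then swap the lazy in-memory
    # element for the one read back from disk, whose Dask graph is materialized
    sdata.write_element(name)
    sdata_on_disk = read_zarr(sdata.path)  # currently re-reads every element
    del sdata[name]
    sdata[name] = sdata_on_disk[name]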

Also see saeyslab/harpy#90 for additional context.

@ArneDefauw ArneDefauw changed the title Feature request, write_element Does Not Materialize Computation in In-Memory SpatialData Object Feature request, write_element does not materialize computation in in-memory SpatialData Object Feb 10, 2025
@LucaMarconato
Member

Thanks for reporting. I also endorse the use case. I wonder if a more explicit approach, where the user loads the object completely in memory before calling write_element(), should be preferred.

@ArneDefauw
Contributor Author

Loading the object into memory before calling write_element() indeed solves the problem: the user could trigger the computation with .persist(), see the example workflow below. But for large images you don't want to do this; instead, you want to give Dask the option to write to the Zarr store without loading the image into memory.

import os
from tempfile import gettempdir

import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel

# 1) creation of a SpatialData object backed by a Zarr store
sdata = SpatialData()

arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))

sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

sdata.write(os.path.join(gettempdir(), "sdata.zarr"))

# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)

# 3) do something with the data, and save the result to a new slot
arr = sdata["image"].data * 2

sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

# trigger computation via persist -> pulls "image_mul" into memory
sdata["image_mul"] = sdata["image_mul"].persist()

sdata.write_element("image_mul")

for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # the 'mul-...' layer is materialized

I think pipelines that use a SpatialData object backed by disk should ideally be built from idempotent functions, e.g. operation(sdata, ..., inplace=True), where the user does not have to worry about loading back from the Zarr store or about layers not being materialized.
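
A rough sketch of the kind of pipeline step I have in mind (all names are hypothetical, and overwriting, error handling and the inplace flag are omitted):

from spatialdata import SpatialData
from spatialdata.models import Image2DModel


def operation(sdata: SpatialData, img_layer: str, output_layer: str) -> SpatialData:
    # lazily compute a new element from an existing image layer
    arr = sdata[img_layer].data * 2
    sdata[output_layer] = Image2DModel.parse(arr, dims=("c", "y", "x"))
    if sdata.is_backed():
        # persist the result to the Zarr store; ideally this call would also swap
        # in the materialized element, so the caller never has to reload manually
        sdata.write_element(output_layer)
    return sdata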

@ArneDefauw
Contributor Author

ArneDefauw commented Feb 11, 2025

I see this issue is already mentioned in #521.

I think reloading/materializing computation could be implemented as follows:

import os
from tempfile import gettempdir

import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel

# 1) creation of a SpatialData object backed by a Zarr store
sdata = SpatialData()

arr = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))

sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

sdata.write(os.path.join(gettempdir(), "sdata.zarr"))

# 2) read it back from the Zarr store
sdata = read_zarr(sdata.path)

# 3) do something with the data, and save the result to a new slot
arr = sdata["image"].data * 2

sdata["image_mul"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

sdata.write_element("image_mul")

for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # the 'mul-...' layer is not materialized


# waiting for support for reading particular elements
sdata_temp = read_zarr(sdata.path, selection=["images"])

del sdata["image_mul"]
# here we get a warning that 'image_mul' already exists; this warning should probably be suppressed
sdata["image_mul"] = sdata_temp["image_mul"]
del sdata_temp

for _, layer in sdata["image_mul"].data.__dask_graph__().layers.items():
    if not layer.is_materialized():
        print(layer)  # the 'mul-...' layer is materialized

@LucaMarconato
Member

Thanks for sharing!

I wonder if a more explicit approach, where the user loads the object completely in memory before calling write_element(), should be preferred.

I'll expand on the above: we would like to allow the data model to also accept non-lazy objects wherever we currently have a lazy option. This is tracked, for instance, in #153.

For instance, right now it is not possible to do
sdata['my_image'] = sdata['my_image'].compute()
but we want to allow this.
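
For reference, a sketch of what one currently has to do instead, I believe (assuming a ("c", "y", "x") image and glossing over coordinate transformations, which may need to be carried over explicitly):

import numpy as np
from spatialdata.models import Image2DModel

# compute the Dask-backed image into a NumPy array, then re-parse it through the
# model; what #153 would allow is assigning the computed object directly
arr: np.ndarray = sdata['my_image'].data.compute()
sdata['my_image'] = Image2DModel.parse(arr, dims=('c', 'y', 'x'))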

Do you think the approach you described based on .persist() could be replaced by what is shown above, or do you still see value in implementing both?

@ArneDefauw
Contributor Author

I think adding support for in-memory equivalents of elements would be a nice feature, but I would restrict it to conversions to the in-memory counterparts of the element types (Dask array -> NumPy array, Dask DataFrame -> pandas DataFrame, etc.).

If a .to_memory() method were implemented, it should be made clear that it converts Dask arrays to NumPy arrays (and so on), and does not just pull Dask arrays or Dask DataFrames into memory (via persist()).

In short, I don't think spatialdata should call persist() internally, because it would probably confuse users.
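
To make the distinction concrete, a rough sketch of the kind of per-element conversion I mean (not an actual spatialdata API; re-parsing through the models and the other element types are glossed over):

import dask.array as da
import dask.dataframe as dd
import numpy as np


def element_data_to_memory(data):
    # convert lazy containers to their in-memory counterparts
    if isinstance(data, da.Array):
        return np.asarray(data)   # Dask array -> NumPy array
    if isinstance(data, dd.DataFrame):
        return data.compute()     # Dask DataFrame -> pandas DataFrame
    # already in-memory objects are returned untouched; note this differs from
    # persist(), which keeps the Dask container but loads its chunks into memory
    return data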

@LucaMarconato
Member

OK, thanks for the info. Then I think a good way forward would be to implement and document to_memory(), and in addition explain the .persist() use case to users, so they can choose what fits them best.

@ArneDefauw
Contributor Author

ArneDefauw commented Feb 14, 2025

Indeed, to_memory() would be a nice feature, but I think it is somewhat unrelated to a possible inplace or reload parameter for write_element(). I see two separate use cases.

  1. sdata not backed by disk. Currently there are no issues for me here. Having to_memory() would be nice, but I would personally seldom use it, since Dask arrays and Dask DataFrames are already in memory in this case.
     Maybe .compute() would be a clearer description of what to_memory() is intended to do. One could argue that, for consistency with Dask, it would then also be a good idea to implement sdata.persist() (see the sketch after this list).

  2. sdata backed by disk. Having a reload or inplace option in write_element() would mean we do not have to read from the Zarr store each time we write to it. I would understand if you prefer not to implement this feature, since the current behaviour is in line with how Dask writes to Zarr stores (.to_zarr()), and we can always reload downstream. Having the option to read specific elements (e.g. read_zarr(element="images/image")) would already be a nice step forward for this use case.
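
For the first use case, a rough sketch of what a hypothetical sdata.persist() could look like (single-scale images and points only; multiscale images, labels, shapes and tables are glossed over):

from spatialdata import SpatialData


def persist_sdata(sdata: SpatialData) -> SpatialData:
    # mirror Dask's .persist(): keep the elements Dask-backed, but load their chunks into memory
    for name, image in list(sdata.images.items()):
        sdata[name] = image.persist()
    for name, points in list(sdata.points.items()):
        sdata[name] = points.persist()
    return sdata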
