Commit d08affc
Recursive exploration of remote datasets (#7912)
* WIP: Add recursive exploration of remote s3 layer
* WIP: finish first version of recursive exploration of remote s3 layer
* WIP: add gcs support
* WIP: run explorers in parallel on same subdirectory
* Code clean up (mainly extracted methods)
* add local file system exploration
* do not include mutableReport in requests regarding the local file system
* add missing override of listDirectory of MockDataVault
* some cleanup
* add command to build backend parts like in CI to be able to detect errors before pushing
* clean up code
* format backend code
* update docs to mention recursive exploration
* add changelog entry
* apply some feedback
* apply some feedback; mainly extract methods in ExploreRemoteLayerService.scala
* Only let explorers of simple dataset formats explore for additional layers in sibling folders
  - And remove exploration of sibling folders to find additional layers
* apply pr feedback
* restore accidentally deleted changelog entry
1 parent a0542e6 commit d08affc
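The core idea of the change, as described in the commit message: if a given URI does not directly point to a valid dataset, recursively try its subdirectories. A minimal sketch of that control flow in plain Scala, using a toy path tree; all names here (`exploreSingle`, `subPaths`, `exploreRecursively`) are illustrative, not the actual webknossos API, which lives in ExploreRemoteLayerService.scala:

```scala
object RecursiveExploreSketch {
  // A toy "remote storage": each path maps to its immediate subdirectories.
  val subPaths: Map[String, List[String]] = Map(
    "root" -> List("root/a", "root/b"),
    "root/a" -> List("root/a/color.zarr"),
    "root/b" -> List()
  )

  // Pretend only paths ending in ".zarr" are valid datasets.
  def exploreSingle(path: String): Option[String] =
    if (path.endsWith(".zarr")) Some(path) else None

  // Try the path itself first; only if nothing is found, descend into subdirectories.
  def exploreRecursively(path: String, maxDepth: Int): List[String] =
    exploreSingle(path) match {
      case Some(found) => List(found)
      case None if maxDepth > 0 =>
        subPaths.getOrElse(path, Nil).flatMap(exploreRecursively(_, maxDepth - 1))
      case None => Nil
    }
}
```

With this toy tree, `RecursiveExploreSketch.exploreRecursively("root", 3)` finds the dataset two levels down, while a depth limit of 0 disables the recursion entirely.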

File tree

13 files changed: +199 -41 lines


CHANGELOG.unreleased.md (+2 -1)

@@ -11,9 +11,10 @@ For upgrade instructions, please check the [migration guide](MIGRATIONS.released
 [Commits](https://github.com/scalableminds/webknossos/compare/24.07.0...HEAD)
 
 ### Added
+- WEBKNOSSOS now automatically searches in subfolder / sub-collection identifiers for valid datasets in case a provided link to a remote dataset does not directly point to a dataset. [#7912](https://github.com/scalableminds/webknossos/pull/7912)
 - Added the option to move a bounding box via dragging while pressing ctrl / meta. [#7892](https://github.com/scalableminds/webknossos/pull/7892)
 - Added route `/import?url=<url_to_datasource>` to automatically import and view remote datasets. [#7844](https://github.com/scalableminds/webknossos/pull/7844)
-- The context menu that is opened upon right-clicking a segment in the dataview port now contains the segment's name. [#7920](https://github.com/scalableminds/webknossos/pull/7920)
+- The context menu that is opened upon right-clicking a segment in the dataview port now contains the segment's name. [#7920](https://github.com/scalableminds/webknossos/pull/7920)
 - Upgraded backend dependencies for improved performance and stability. [#7922](https://github.com/scalableminds/webknossos/pull/7922)
 
 ### Changed

docs/datasets.md (+1 -1)

@@ -47,7 +47,7 @@ In particular, the following file formats are supported for uploading (and conve
 Once the data is uploaded (and potentially converted), you can further configure a dataset's [Settings](#configuring-datasets) and double-check layer properties, fine tune access rights & permissions, or set default values for rendering.
 
 ### Streaming from remote servers and the cloud
-WEBKNOSSOS supports loading and remotely streaming [Zarr](https://zarr.dev), [Neuroglancer precomputed format](https://github.com/google/neuroglancer/tree/master/src/neuroglancer/datasource/precomputed) and [N5](https://github.com/saalfeldlab/n5) datasets from a remote source, e.g. Cloud storage (S3) or HTTP server.
+WEBKNOSSOS supports loading and remotely streaming [Zarr](https://zarr.dev), [Neuroglancer precomputed format](https://github.com/google/neuroglancer/tree/master/src/neuroglancer/datasource/precomputed) and [N5](https://github.com/saalfeldlab/n5) datasets from a remote source, e.g. Cloud storage (S3 / GCS) or HTTP server.
 WEBKNOSSOS supports loading Zarr datasets according to the [OME NGFF v0.4 spec](https://ngff.openmicroscopy.org/latest/).
 
 WEBKNOSSOS can load several remote sources and assemble them into a WEBKNOSSOS dataset with several layers, e.g. one Zarr file/source for the `color` layer and one Zarr file/source for a `segmentation` layer.
frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx (+4 -2)

@@ -461,7 +461,7 @@ function AddRemoteLayer({
   const [showCredentialsFields, setShowCredentialsFields] = useState<boolean>(false);
   const [usernameOrAccessKey, setUsernameOrAccessKey] = useState<string>("");
   const [passwordOrSecretKey, setPasswordOrSecretKey] = useState<string>("");
-  const [selectedProtocol, setSelectedProtocol] = useState<"s3" | "https" | "gs">("https");
+  const [selectedProtocol, setSelectedProtocol] = useState<"s3" | "https" | "gs" | "file">("https");
   const [fileList, setFileList] = useState<FileList>([]);
 
   useEffect(() => {
@@ -489,9 +489,11 @@ function AddRemoteLayer({
       setSelectedProtocol("s3");
     } else if (userInput.startsWith("gs://")) {
       setSelectedProtocol("gs");
+    } else if (userInput.startsWith("file://")) {
+      setSelectedProtocol("file"); // Unused
     } else {
       throw new Error(
-        "Dataset URL must employ one of the following protocols: https://, http://, s3:// or gs://",
+        "Dataset URL must employ one of the following protocols: https://, http://, s3://, gs:// or file://",
       );
     }
   }

package.json (+5 -0)

@@ -76,6 +76,11 @@
   "scripts": {
     "start": "node tools/proxy/proxy.js",
     "build": "node --max-old-space-size=4096 node_modules/.bin/webpack --env production",
+    "@comment build-backend": "Only check for errors in the backend code like done by the CI. This command is not needed to run WEBKNOSSOS",
+    "build-backend": "yarn build-wk-backend && yarn build-wk-datastore && yarn build-wk-tracingstore",
+    "build-wk-backend": "sbt -no-colors -DfailOnWarning compile stage",
+    "build-wk-datastore": "sbt -no-colors -DfailOnWarning \"project webknossosDatastore\" copyMessages compile stage",
+    "build-wk-tracingstore": "sbt -no-colors -DfailOnWarning \"project webknossosTracingstore\" copyMessages compile stage",
     "build-dev": "node_modules/.bin/webpack",
     "build-watch": "node_modules/.bin/webpack -w",
     "listening": "lsof -i:5005,7155,9000,9001,9002",

test/backend/DataVaultTestSuite.scala (+3 -0)

@@ -134,6 +134,9 @@ class DataVaultTestSuite extends PlaySpec {
   class MockDataVault extends DataVault {
     override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
         implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = ???
+
+    override def listDirectory(path: VaultPath,
+                               maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] = ???
   }
 
   "Uri has no trailing slash" should {

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (+6 -1)

@@ -34,10 +34,11 @@ import play.api.libs.json.Json
 import play.api.mvc.{Action, AnyContent, MultipartFormData, PlayBodyParsers}
 
 import java.io.File
-import com.scalableminds.webknossos.datastore.storage.AgglomerateFileKey
+import com.scalableminds.webknossos.datastore.storage.{AgglomerateFileKey, DataVaultService}
 import net.liftweb.common.{Box, Empty, Failure, Full}
 import play.api.libs.Files
 
+import java.net.URI
 import scala.collection.mutable.ListBuffer
 import scala.concurrent.{ExecutionContext, Future}
 import scala.concurrent.duration._
@@ -721,10 +722,14 @@ class DataSourceController @Inject()(
     Action.async(validateJson[ExploreRemoteDatasetRequest]) { implicit request =>
       accessTokenService.validateAccess(UserAccessRequest.administrateDataSources(request.body.organizationName), token) {
         val reportMutable = ListBuffer[String]()
+        val hasLocalFilesystemRequest = request.body.layerParameters.exists(param =>
+          new URI(param.remoteUri).getScheme == DataVaultService.schemeFile)
         for {
          dataSourceBox: Box[GenericDataSource[DataLayer]] <- exploreRemoteLayerService
            .exploreRemoteDatasource(request.body.layerParameters, reportMutable)
            .futureBox
+          // Remove report of recursive exploration in case of exploring the local file system to avoid information exposure.
+          _ <- Fox.runIf(hasLocalFilesystemRequest)(Fox.successful(reportMutable.clear()))
          dataSourceOpt = dataSourceBox match {
            case Full(dataSource) if dataSource.dataLayers.nonEmpty =>
              reportMutable += s"Resulted in dataSource with ${dataSource.dataLayers.length} layers."
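The scheme check guarding the exploration report can be illustrated standalone. `isLocalFileScheme` is a hypothetical helper mirroring the `DataVaultService.schemeFile` comparison above:

```scala
import java.net.URI

// Hypothetical standalone version of the guard above: true iff the URI uses the
// file:// scheme. getScheme returns null for scheme-less strings, hence Option.
def isLocalFileScheme(remoteUri: String): Boolean =
  Option(new URI(remoteUri).getScheme).contains("file")
```

Clearing the report for `file://` requests keeps directory names of the datastore's local file system from leaking to the requesting user.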

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/DataVault.scala (+2 -0)

@@ -7,4 +7,6 @@ import scala.concurrent.ExecutionContext
 trait DataVault {
   def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
       implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)]
+
+  def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]]
 }

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/FileSystemDataVault.scala (+28 -7)

@@ -8,21 +8,18 @@ import org.apache.commons.lang3.builder.HashCodeBuilder
 
 import java.nio.ByteBuffer
 import java.nio.file.{Files, Path, Paths}
+import java.util.stream.Collectors
 import scala.concurrent.ExecutionContext
+import scala.jdk.CollectionConverters._
 
 class FileSystemDataVault extends DataVault {
 
   override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
-      implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = {
-    val uri = path.toUri
+      implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] =
     for {
-      _ <- bool2Fox(uri.getScheme == DataVaultService.schemeFile) ?~> "trying to read from FileSystemDataVault, but uri scheme is not file"
-      _ <- bool2Fox(uri.getHost == null || uri.getHost.isEmpty) ?~> s"trying to read from FileSystemDataVault, but hostname ${uri.getHost} is non-empty"
-      localPath = Paths.get(uri.getPath)
-      _ <- bool2Fox(localPath.isAbsolute) ?~> "trying to read from FileSystemDataVault, but hostname is non-empty"
+      localPath <- vaultPathToLocalPath(path)
       bytes <- readBytesLocal(localPath, range)
     } yield (bytes, Encoding.identity)
-  }
 
   private def readBytesLocal(localPath: Path, range: RangeSpecifier)(implicit ec: ExecutionContext): Fox[Array[Byte]] =
     if (Files.exists(localPath)) {
@@ -53,6 +50,30 @@ class FileSystemDataVault extends DataVault {
       }
     } else Fox.empty
 
+  override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
+    vaultPathToLocalPath(path).map(
+      localPath =>
+        if (Files.isDirectory(localPath))
+          Files
+            .list(localPath)
+            .filter(file => Files.isDirectory(file))
+            .collect(Collectors.toList())
+            .asScala
+            .toList
+            .map(dir => new VaultPath(dir.toUri, this))
+            .take(maxItems)
+        else List.empty)
+
+  private def vaultPathToLocalPath(path: VaultPath)(implicit ec: ExecutionContext): Fox[Path] = {
+    val uri = path.toUri
+    for {
+      _ <- bool2Fox(uri.getScheme == DataVaultService.schemeFile) ?~> "trying to read from FileSystemDataVault, but uri scheme is not file"
+      _ <- bool2Fox(uri.getHost == null || uri.getHost.isEmpty) ?~> s"trying to read from FileSystemDataVault, but hostname ${uri.getHost} is non-empty"
+      localPath = Paths.get(uri.getPath)
+      _ <- bool2Fox(localPath.isAbsolute) ?~> "trying to read from FileSystemDataVault, but path is not absolute"
+    } yield localPath
+  }
+
   override def hashCode(): Int =
     new HashCodeBuilder(19, 31).toHashCode
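The `Files.list` pattern used by the local-filesystem `listDirectory` can be sketched self-contained (note that production code should ideally close the stream returned by `Files.list`, e.g. via try-finally, which the sketch omits for brevity):

```scala
import java.nio.file.{Files, Path}
import java.util.stream.Collectors
import scala.jdk.CollectionConverters._

// List only the immediate subdirectories of a directory, capped at maxItems.
// Non-directories (and non-existent paths) yield an empty list.
def listSubDirectories(dir: Path, maxItems: Int): List[Path] =
  if (Files.isDirectory(dir))
    Files
      .list(dir)
      .filter(p => Files.isDirectory(p))
      .collect(Collectors.toList())
      .asScala
      .toList
      .take(maxItems)
  else List.empty
```

Filtering with `Files.isDirectory` before collecting keeps plain files (e.g. chunk data) out of the recursion candidates.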

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/GoogleCloudDataVault.scala (+12 -0)

@@ -11,6 +11,7 @@ import java.io.ByteArrayInputStream
 import java.net.URI
 import java.nio.ByteBuffer
 import scala.concurrent.ExecutionContext
+import scala.jdk.CollectionConverters.IterableHasAsScala
 
 class GoogleCloudDataVault(uri: URI, credential: Option[GoogleServiceAccountCredential]) extends DataVault {
 
@@ -72,6 +73,17 @@ class GoogleCloudDataVault(uri: URI, credential: Option[GoogleServiceAccountCred
     } yield (bytes, encoding)
   }
 
+  override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
+    tryo({
+      val objName = path.toUri.getPath.tail
+      val blobs =
+        storage.list(bucket, Storage.BlobListOption.prefix(objName), Storage.BlobListOption.currentDirectory())
+      val subDirectories = blobs.getValues.asScala.toList.filter(_.isDirectory).take(maxItems)
+      val paths = subDirectories.map(dirBlob =>
+        new VaultPath(new URI(s"${uri.getScheme}://$bucket/${dirBlob.getBlobId.getName}"), this))
+      paths
+    })
+
   private def getUri = uri
   private def getCredential = credential
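The GCS listing above derives the object-name prefix from the `VaultPath`'s URI path by dropping the leading slash (`.tail`), since GCS object names do not start with `/`. A hypothetical standalone mirror of that derivation:

```scala
import java.net.URI

// gs://bucket/some/dir/ -> object-name prefix "some/dir/".
// drop(1) behaves like .tail but is safe on an empty path.
def gcsObjectPrefix(pathUri: URI): String =
  pathUri.getPath.drop(1)
```

Combined with `BlobListOption.currentDirectory()`, this prefix makes the listing non-recursive, so only immediate "subdirectory" blobs are returned.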

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/HttpsDataVault.scala (+4 -0)

@@ -44,6 +44,10 @@ class HttpsDataVault(credential: Option[DataVaultCredential], ws: WSClient) exte
 
   }
 
+  override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
+    // HTTP file listing is currently not supported.
+    Fox.successful(List.empty)
+
   private val headerInfoCache: AlfuCache[URI, (Boolean, Long)] = AlfuCache()
 
   private def getHeaderInformation(uri: URI)(implicit ec: ExecutionContext): Fox[(Boolean, Long)] =

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/S3DataVault.scala (+35 -3)

@@ -11,21 +11,24 @@ import com.amazonaws.auth.{
 import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
 import com.amazonaws.regions.Regions
 import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
-import com.amazonaws.services.s3.model.{GetObjectRequest, S3Object}
+import com.amazonaws.services.s3.model.{GetObjectRequest, ListObjectsV2Request, S3Object}
 import com.amazonaws.util.AwsHostNameUtils
 import com.scalableminds.util.tools.Fox
+import com.scalableminds.util.tools.Fox.box2Fox
 import com.scalableminds.webknossos.datastore.storage.{
   LegacyDataVaultCredential,
   RemoteSourceDescriptor,
   S3AccessKeyCredential
 }
+import net.liftweb.common.Box.tryo
 import net.liftweb.common.{Box, Failure, Full}
 import org.apache.commons.io.IOUtils
 import org.apache.commons.lang3.builder.HashCodeBuilder
 
 import java.net.URI
 import scala.collection.immutable.NumericRange
 import scala.concurrent.ExecutionContext
+import scala.jdk.CollectionConverters._
 
 class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI) extends DataVault {
   private lazy val bucketName = S3DataVault.hostBucketFromUri(uri) match {
@@ -50,7 +53,8 @@ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI
 
   private def getRequest(bucketName: String, key: String): GetObjectRequest = new GetObjectRequest(bucketName, key)
 
-  private def performRequest(request: GetObjectRequest)(implicit ec: ExecutionContext): Fox[(Array[Byte], String)] = {
+  private def performGetObjectRequest(request: GetObjectRequest)(
+      implicit ec: ExecutionContext): Fox[(Array[Byte], String)] = {
     var s3objectRef: Option[S3Object] = None // Used for cleanup later (possession of a S3Object requires closing it)
     try {
       val s3object = client.getObject(request)
@@ -82,10 +86,38 @@ class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI
         case SuffixLength(l) => getSuffixRangeRequest(bucketName, objectKey, l)
         case Complete()      => getRequest(bucketName, objectKey)
       }
-      (bytes, encodingString) <- performRequest(request)
+      (bytes, encodingString) <- performGetObjectRequest(request)
       encoding <- Encoding.fromRfc7231String(encodingString)
     } yield (bytes, encoding)
 
+  override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
+    for {
+      prefixKey <- Fox.box2Fox(S3DataVault.objectKeyFromUri(path.toUri))
+      s3SubPrefixKeys <- getObjectSummaries(bucketName, prefixKey, maxItems)
+      vaultPaths <- tryo(
+        s3SubPrefixKeys.map(key => new VaultPath(new URI(s"${uri.getScheme}://$bucketName/$key"), this))).toFox
+    } yield vaultPaths
+
+  private def getObjectSummaries(bucketName: String, keyPrefix: String, maxItems: Int)(
+      implicit ec: ExecutionContext): Fox[List[String]] =
+    try {
+      val listObjectsRequest = new ListObjectsV2Request
+      listObjectsRequest.setBucketName(bucketName)
+      listObjectsRequest.setPrefix(keyPrefix)
+      listObjectsRequest.setDelimiter("/")
+      listObjectsRequest.setMaxKeys(maxItems)
+      val objectListing = client.listObjectsV2(listObjectsRequest)
+      val s3SubPrefixes = objectListing.getCommonPrefixes.asScala.toList
+      Fox.successful(s3SubPrefixes)
+    } catch {
+      case e: AmazonServiceException =>
+        e.getStatusCode match {
+          case 404 => Fox.empty
+          case _   => Fox.failure(e.getMessage)
+        }
+      case e: Exception => Fox.failure(e.getMessage)
+    }
+
   private def getUri = uri
   private def getCredential = s3AccessKeyCredential
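The `Delimiter("/")` / `getCommonPrefixes` mechanism that the S3 `listDirectory` relies on can be simulated without AWS: with a delimiter set, S3 rolls all keys below the prefix up into their first-level "common prefixes", which behave like subdirectories. A pure-Scala simulation (`commonPrefixes` is illustrative, not an AWS API):

```scala
// Simulates S3 ListObjectsV2 CommonPrefixes: keys sharing the same first path
// segment below `prefix` collapse into one "directory" entry.
def commonPrefixes(keys: List[String], prefix: String, delimiter: String = "/"): List[String] =
  keys
    .filter(_.startsWith(prefix))
    .flatMap { key =>
      val rest = key.drop(prefix.length)
      val idx = rest.indexOf(delimiter)
      // Keys with no further delimiter are plain objects, not sub-prefixes.
      if (idx >= 0) Some(prefix + rest.take(idx + 1)) else None
    }
    .distinct
```

This is why `setMaxKeys(maxItems)` on the request is a reasonable cap: each returned common prefix corresponds to at least one underlying key.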

webknossos-datastore/app/com/scalableminds/webknossos/datastore/datavault/VaultPath.scala (+3 -0)

@@ -38,6 +38,9 @@ class VaultPath(uri: URI, dataVault: DataVault) extends LazyLogging {
     }
   }
 
+  def listDirectory(maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
+    dataVault.listDirectory(this, maxItems)
+
   private def decodeBrotli(bytes: Array[Byte]) = {
     Brotli4jLoader.ensureAvailability()
     val brotliInputStream = new BrotliInputStream(new ByteArrayInputStream(bytes))
