Avoid full distance matrix within thin.algorithm() to allow for thin() on a large number of points? #24
Hi kschn111, How many points were you working with, and how long did the script take for you? I have just under 15,000, and my computer has been working on spatially thinning them for over 60 hours now. I feel like giving up on it, but I keep worrying that I will do that right before the script is about to complete. I did not think it would take this long, especially considering that Aiello-Lammens et al. 2015 describe the processing time as trivial; of course, they were only working with 201 points. All the best,
@kschn111 Sorry for never responding to your question. The short answer is 'I'm not sure'. The big thing I'm wondering is: does your suggestion return a similar matrix of logical values? If not, there would be quite a bit of re-tooling involved downstream.
@ThomasBaker96 This amount of time seems possible, but the last version (0.2.0) included a change to the algorithm that made it much faster with larger data sets. Did you test the algorithm on a smaller subset of your data before going to the full 15k? It could be some structural issue.
Hello @ThomasBaker96 and @mlammens, the study for which I had this issue is more or less on hold right now. However, I did end up drafting my own function that, I think, produces a similar output at a reduced time and memory cost. I thinned around 320,000 points overnight. For a subset of points, I visually compared the thinned results between thin() and my approach; they looked equivalent to me.

Regarding the output, my function returns a vector of indices for the points that should be kept. For rep > 1 this becomes a list of vectors. The function follows more or less the same steps as thin() but substitutes the aforementioned line (rdist.earth) with a call to RANN::nn2(). I had contacted the RANN developer because I was confused by the speed increase; he responded that nabor::knn() should be even faster. Notably, both functions return Euclidean rather than great-circle distances. However, as the thinning radius is usually compact, I did not judge this to be a serious drawback.

I have heard from several scholars now that thin() takes forever to compute even on small data sets. Having looked into it, I do not really see the point of first computing the distance to literally every point. If, for example, you want to keep only one point within a 5 km radius, you only need to look within a 5 km radius. If you only evaluate points within that radius, computing time and memory requirements will always go down significantly.
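For illustration, a minimal sketch of such a radius-based pass (the function name and the greedy keep/drop rule here are my own, not the poster's actual code or spThin's; it assumes coordinates are projected so that Euclidean distance is adequate within the compact thinning radius):

```r
library(RANN)

# Hypothetical radius-based thinning: returns indices of points to keep,
# matching the output shape described above. Not spThin's implementation.
thin_by_radius <- function(coords, thin.par, k = 50) {
  coords <- as.matrix(coords)
  n <- nrow(coords)
  # Only neighbours within thin.par are reported; index 0 means "none found"
  nn <- RANN::nn2(coords, coords, k = min(k, n),
                  searchtype = "radius", radius = thin.par)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    if (!keep[i]) next
    nbrs <- nn$nn.idx[i, ]
    nbrs <- nbrs[nbrs != 0L & nbrs != i]
    keep[nbrs] <- FALSE  # drop every neighbour of a retained point
  }
  which(keep)
}
```

Note that, unlike thin(), a greedy pass like this keeps the first point it visits in each cluster rather than searching for the arrangement that maximizes the number of points retained.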
Hi @mlammens, I have tested the algorithm on smaller subsets of my data and it worked fine, so I believe it should be working, just quite slowly. All the best,
@kschn111 My sense is that for moderately sized data sets and larger (a few hundred points and more), thinning by radius as you have done is perfectly fine, and addresses the goal of spatial bias reduction. spThin was written to solve a specific problem: how to both maximize the points retained and ensure they are all thinned. It's the maximization that doesn't always work with radius thinning, in my experience. The thought is that when you have, say, 50 occurrences you want to thin, keeping 40 vs. 37 might be meaningful. All that said, I haven't done a robust comparison of model results using the various different "random" data sets produced by spThin on larger thinned data sets, or on data sets thinned by other methods. I think that would be potentially interesting, but again, my hunch is that there won't be much difference.
As a follow-up @kschn111, if a robust analysis did show that the faster thinning method you used results in data sets that yield similar model results, a nice addition to spThin would be a function that implements your approach, maybe like a
Agreed. The whole reason my colleagues and I started trying to use spThin is that we need to reduce the number of points in our dataset to something more manageable so we can run pairwise-distance-type calculations and analyses. I was disappointed by how few points thin() can handle. Even with 1 TB of RAM, the process gets killed with more than a few tens of thousands of points (per species, i.e. per dataset). Many species in, e.g., GBIF have far more than 10,000 sample points, but I can't use thin() on them due to the memory issues. @kschn111 is your workaround posted publicly somewhere, or are you willing to share?
@kschn111 hello, the fields package has an RdistEarth() function as a more efficient alternative to rdist.earth().

SUGGESTION: Replace

rdist.earth(x1=rec.df.orig, miles=FALSE) < thin.par

with

RdistEarth(x1=as.matrix(rec.df.orig), miles=FALSE) < thin.par

Note: rec.df.orig must be converted to a matrix. RdistEarth() makes a call to as.numeric(rec.df.orig), which fails if rec.df.orig is a data frame.

LIMITATION:
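To make the swap concrete, here is a self-contained sketch; the data frame and radius below are dummy stand-ins for the rec.df.orig and thin.par objects that thin.algorithm() has in scope:

```r
library(fields)

# Dummy stand-ins for the objects inside thin.algorithm()
rec.df.orig <- data.frame(long = c(-80.1, -80.0, -75.0),
                          lat  = c( 25.5,  25.5,  40.0))
thin.par <- 50  # thinning radius in km

# Original line: rdist.earth() copes with a data frame directly
DistMat.old <- rdist.earth(x1 = rec.df.orig, miles = FALSE) < thin.par

# Suggested replacement: RdistEarth() coerces with as.numeric() internally,
# so the data frame must be converted to a matrix first
DistMat.new <- RdistEarth(x1 = as.matrix(rec.df.orig), miles = FALSE) < thin.par
```

Both variants still produce the full n-by-n logical matrix, so this speeds up the distance computation without addressing the memory problem discussed elsewhere in this thread.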
Hello,
I am running severely out of memory when using thin() on a very large number of occurrence points. I looked into thin.algorithm() and saw that you compute the full set of distances among all points. I experienced a similar problem when trying to compute distances during a data-processing step. The RANN::nn2() function solved my previous issues by (I guess) only looking at neighbors within a user-defined radius. In doing so, there is no need to compute the full distance matrix, which saves both time and memory.
I am writing to check whether you think it is possible to perform the same steps as thin() when substituting
rdist.earth(x1=rec.df.orig, miles=FALSE) < thin.par
with RANN::nn2() using the arguments searchtype="radius" and radius=thin.par.
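A minimal sketch of what I have in mind, with made-up example points (one caveat I am unsure about: nn2() works in the units of the coordinates, so the radius would need to be expressed in those units rather than in km as thin.par is):

```r
library(RANN)

pts <- cbind(x = runif(200), y = runif(200))
thin.par <- 0.05  # radius in the coordinate units of pts, NOT km

# k caps how many neighbours are reported per point; an index of 0 means
# "fewer than k neighbours were found within the radius"
nn <- RANN::nn2(pts, pts, k = 20, searchtype = "radius", radius = thin.par)

# nn$nn.idx[i, ] plays the role of row i of the full logical matrix:
# instead of (rdist.earth(...) < thin.par)[i, ], you get just the indices
# of the points lying within thin.par of point i
neighbours_of_1 <- setdiff(nn$nn.idx[1, ], c(0L, 1L))
```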
I am sorry for bugging you with this. I am an inexperienced student and do not really have anyone else to ask. I would of course not expect you to assist with the function; I just wanted to check whether the full distances are strictly required for steps I am not aware of.
Very much appreciate your time!