Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function for exporting to GUI format #1

Closed
daynefiler opened this issue Feb 11, 2022 · 14 comments
Closed

Add function for exporting to GUI format #1

daynefiler opened this issue Feb 11, 2022 · 14 comments
Assignees
Labels
enhancement New feature or request

Comments

@daynefiler
Copy link
Member

The difficult aspect of doing this -- and the reason for not implementing in v1.0 -- is the transformation functions. The simple fix would be to NOT include transformation functions (i.e. make them all linear) and apply any transformations to input prior to exporting; this breaks down when the same input is included in different slices that may have different transformation functions. While we would advise against duplicating inputs across the model (particularly using different transformation functions), and throw a warning to the user when duplicating inputs, this particular situation is present within the format_C.csv example file. The "simple" fix, therefore would break unless we detected this situation and created new columns with alternately transformed inputs. Off the top of my head, I cannot see an easy way to map the used transformation functions (if possible) to the GUI options. And what would we do if users supplied function(s) not available in the GUI? Thoughts?

@davidmreif
Copy link
Member

I agree that starting by NOT including transformation functions is the right move for a simple 'exportToGUI' function. R is really the better environment to handle transformations/scaling versus Java. I think that it may be possible to wrest custom user functions as GUI input, but that would be cumbersome.

From other discussions, the other issue that would need to be handled is the need for decimal weights to be rational fractions. Perhaps a warning could be added that decimal weights may be close approximations if input could be added?

@daynefiler
Copy link
Member Author

Function merged to dev branch. See #7 for further work.

@daynefiler
Copy link
Member Author

I merged the PR request (#6) prematurely. We can continue the work/discussion here.

@daynefiler
Copy link
Member Author

Per @SkylarMarvel:

When looking at the data file output, there is an issue with the slice metric indicators being applied to multiple copies of the transformed metrics. I think each metric will need something appended to the name if it occurs multiple times, e.g. append "_1a" for slice 1 component 1 and "_3d" for slice 3 component 4. I think appending something like that should just be done for all metrics even if not repeated just for consistency. However, the main issue I see is with transformed values that are now negative - they will be ignored in the GUI and treated as missing. One workaround might be to push all values up by a constant so everything is positive - I don't think this will change the results because all scaling functions are linear at this point. Also, the names in the output seem to have been replaced with the row number. I've attached the original "format_C.csv" data file and the initial txpExportGui output ("txpModel.csv") and a modified version ("txpModel_modified.csv") that has the adjustments I mention above.

  • The reason the row names were used is because the txpExportGui function requires the id.var variable. We don’t store ID values until the result stage, so there’s not a way currently to grab the ID column from the import function and pass it to the export. The following works:
tst <- txpImportGui("format_C.csv")
txpExportGui(fileName = "test.csv", 
             input = tst$input, 
             model = tst$model, 
             id.var = "CASRN", 
             fills = tst$fills)
  • As for the duplicate metrics, I think Skylar’s proposal works. We could also just say any duplicate inputs trigger the export as slices option. Thoughts?
  • I need a refresher on how we decided to handle the negative values discrepancy between the package and GUI previously.

@SkylarMarvel
Copy link
Contributor

SkylarMarvel commented Jul 13, 2022

Shifting negative values by adding a constant, c, to all values (to allow transformed negative values to work in the GUI) does not work when there are missing values in a slice that has multiple components/columns. This occurs because the slice scores are first computed as sum(x) before scaling to [0, 1], so a missing value would result in something like adding 2*c instead of 3*c. A possible solution for this is to process the transformations separately for each slice. If a slice has multiple components, negative values, and missing values, then those missing values are replaced with 0 before all values are shifted up. One caveat for this handling of transformed negative values in the case of a slice with multiple components and missing values, e.g. running "format_C" through the GUI directly or first through toxpiR txpImportGui() --> txpExportGui() and then the GUI - the slice/toxpi scores will match, but the bootstrapped confidence intervals will not. I anticipate this situation will rarely occur.

@daynefiler
Copy link
Member Author

We should probably setup a call to discuss how the package will handle negative values. I had coerced the negatives to 0 on importing, but txpCalculateScores does not do any special handling of negative values. Therefore, if someone were to create a model from scratch in R with data containing negative values, then export it to the GUI, the GUI would give different results. We can simplify the import and export functions if we handle the negatives in the calculate function -- I think the more important question is should we?

@daynefiler
Copy link
Member Author

I would make a sleep-deprived argument that because we scale the slice scores to 0-1, having negative values doesn't matter. Would it be better to update the GUI behavior?

@daynefiler
Copy link
Member Author

Another argument for keeping negative values -- it's not overly intuitive they would be removed. Especially when doing a standard normal transformation. I imagine users could be very surprised learning half their data is being thrown out of the model.

@davidmreif
Copy link
Member

davidmreif commented Jul 15, 2022 via email

@davidmreif
Copy link
Member

Add handle.neg parameter (or flag) to function (data structure)...

Testing notes (to verify):

  1. toxpiR creation of model with +/- values; check in GUI
  2. toxpiR creation of model with +/- AND missing values; check in GUI
  3. check distributions.csv file (incl neg/missing)

--> Be sure to annotate/justify departures from GUI (by design because different use cases) + document in vignette revisions.

@daynefiler
Copy link
Member Author

I added an initial implementation of negative.value.handling (4adbd5e). I am almost certain I did this incorrectly. @SkylarMarvel Suppose (in the GUI) you have an input with negative values, and the user indicates a 'zscore' transformation -- what is the order of operations?

@daynefiler
Copy link
Member Author

Shifting negative values by adding a constant, c, to all values (to allow transformed negative values to work in the GUI) does not work when there are missing values in a slice that has multiple components/columns. This occurs because the slice scores are first computed as sum(x) before scaling to [0, 1], so a missing value would result in something like adding 2c instead of 3c. A possible solution for this is to process the transformations separately for each slice. If a slice has multiple components, negative values, and missing values, then those missing values are replaced with 0 before all values are shifted up. One caveat for this handling of transformed negative values in the case of a slice with multiple components and missing values, e.g. running "format_C" through the GUI directly or first through toxpiR txpImportGui() --> txpExportGui() and then the GUI - the slice/toxpi scores will match, but the bootstrapped confidence intervals will not. I anticipate this situation will rarely occur.

This took me some time to wrap my head around. I thought shifting by a constant wouldn't matter, but it changes in the magnitude of the vector. So when I get confused about this in the future:

vec <- c(-100, 0, 10, 15, 40)
norm(vec, type = "2")
# [1] 109.2016
norm(vec + 100, type = "2")
# [1] 234.3608

@SkylarMarvel
Copy link
Contributor

SkylarMarvel commented Jul 18, 2022

@daynefiler In the GUI, negative values are ignored during the slice score computation. I've copied a code segment below where the score summation occurs. Slice scores are normalized to [0, 1] after this step.

scores = new double[pieSlices.size()][chemicals.size()];
for (int i=0; i<pieSlices.size(); i++) {
  pieSlice = pieSlices.get(i);
  for (int j=0; j<chemicals.size(); j++) {
    chemical = chemicals.get(j);
    score = Double.NaN;
    for (Metric metric : pieSlice.getMetrics()) {
      value = toxData.getData(chemical, metric);
      if (!Double.isNaN(value) && value>=0d) {
        scaledValue = Utility.scale(value, metricStats.get(metric), pieSlice.getScaling());
        if (!Double.isNaN(scaledValue)) {
          score = Double.isNaN(score) ? scaledValue : score+scaledValue;
        }
      }
    }
    scores[i][j] = score;
  }
}

@davidmreif
Copy link
Member

It looks like the checks in /dev/tests (https://github.com/ToxPi/toxpiR/tree/dev/tests) cover the intended cases with some wacky test data, and we can document the corner issues. I wanted to 3x check before closing this issue, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants