Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Def data enhancements #134

Merged
merged 9 commits into from
Jan 18, 2022
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ export(defMiss)
export(defRead)
export(defReadAdd)
export(defReadCond)
export(defRepeat)
export(defRepeatAdd)
export(defSurv)
export(delColumns)
export(gammaGetShapeRate)
Expand Down
131 changes: 130 additions & 1 deletion R/define_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,135 @@ defDataAdd <- function(dtDefs = NULL,
return(defNew[])
}

#' Add multiple (similar) rows to definitions table
#'
#' @param dtDefs Definition data.table to be modified
#' @param nvars Number of new variables to define
#' @param prefix Prefix (character) for new variables
#' @param formula An R expression for mean (string)
#' @param variance Number or formula
#' @param dist Distribution. For possibilities, see details
#' @param link The link function for the mean, see details
#' @param id A string indicating the field name for the unique record identifier
#' @return A data.table named dtName that is an updated data definitions table
#' @seealso [distributions]
#' @details The possible data distributions are: `r paste0(.getDists(),collapse = ", ")`.
#'
#' @examples
#' def <- defRepeat(nvars = 4, prefix = "g", formula = "1/3;1/3;1/3",
#' variance = 0, dist = "categorical")
#' def <- defData(def, varname = "a", formula = "1;1", dist = "trtAssign")
#' def <- defRepeat(def, 8, "b", formula = "5 + a", variance = 3, dist = "normal")
#' def <- defData(def, "y", formula = "0.10", dist = "binary")
#'
#' def
#' @export
#' @concept define_data
defRepeat <- function(dtDefs = NULL,
nvars,
prefix,
formula,
variance = 0,
dist = "normal",
link = "identity",
id = "id") {

assertNotMissing(nvars = missing(nvars),
prefix = missing(prefix),
formula = missing(formula))

varnames <- paste0(prefix, 1 : nvars)

if (is.null(dtDefs)) {

defNew <- defData(varname = varnames[1], formula = formula,
variance = variance, dist=dist, link = link, id = id)

for (i in (2:nvars) ) {
defNew <- defData(defNew, varname = varnames[i],
formula = formula, variance = variance,
dist=dist, link = link, id = id)
}

} else {

defNew <- data.table::copy(dtDefs)

for (i in 1:nvars) {
defNew <- defData(defNew, varname = varnames[i],
formula = formula, variance = variance,
dist=dist, link = link, id = id)
}
}

return(defNew[])
}

#' Add multiple (similar) rows to definitions table that will be used to add data to an
#' existing data.table
#'
#' @param dtDefs Definition data.table to be modified
#' @param nvars Number of new variables to define
#' @param prefix Prefix (character) for new variables
#' @param formula An R expression for mean (string)
#' @param variance Number or formula
#' @param dist Distribution. For possibilities, see details
#' @param link The link function for the mean, see details
#' @param id A string indicating the field name for the unique record identifier
#' @return A data.table named dtName that is an updated data definitions table
#' @seealso [distributions]
#' @details The possible data distributions are: `r paste0(.getDists(),collapse = ", ")`.
#'
#' @examples
#' def <- defRepeatAdd(nvars = 4, prefix = "g", formula = "1/3;1/3;1/3",
#' variance = 0, dist = "categorical")
#' def <- defDataAdd(def, varname = "a", formula = "1;1", dist = "trtAssign")
#' def <- defRepeatAdd(def, 8, "b", formula = "5 + a", variance = 3, dist = "normal")
#' def <- defDataAdd(def, "y", formula = "0.10", dist = "binary")
#'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need defRepeatAdd in addition to defRepeat as they do the same thing? Code duplication like this is usally a big no-no :D

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - they do the same thing, except for the variable checking, just like defData and defDataAdd. defRepeatAdd calls defDataAdd, whereas defRepeat calls defData.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I see. Ah that remindes me of #18
But that can still be done without complete duplication of code... I`ll try it later

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, defRepeat wouldn't be needed at all if defData had another argument nVars that would just indicate how many instances of the current variable. But, I don't mind the extra function, since it makes it explicit what the user is trying to do. Again, complete transparency at the expense of efficiency.

#' def
#' @export
#' @concept define_data
defRepeatAdd <- function(dtDefs = NULL,
nvars,
prefix,
formula,
variance = 0,
dist = "normal",
link = "identity",
id = "id") {

assertNotMissing(nvars = missing(nvars),
prefix = missing(prefix),
formula = missing(formula))

varnames <- paste0(prefix, 1 : nvars)

if (is.null(dtDefs)) {

defNew <- defDataAdd(varname = varnames[1], formula = formula,
variance = variance, dist=dist, link = link)

for (i in (2:nvars) ) {
defNew <- defDataAdd(defNew, varname = varnames[i],
formula = formula, variance = variance,
dist=dist, link = link)
}

} else {

defNew <- data.table::copy(dtDefs)

for (i in 1:nvars) {
defNew <- defDataAdd(defNew, varname = varnames[i],
formula = formula, variance = variance,
dist=dist, link = link)
}
}

return(defNew[])
}

#' Read external csv data set definitions
#'
#' @param filen String file name, including full path. Must be a csv file.
Expand Down Expand Up @@ -651,7 +780,7 @@ defSurv <- function(dtDefs = NULL,

#' Check uniform formula
#'
#' @description Unifom formulas must be of the form "min;max"
#' @description Uniform formulas must be of the form "min;max"
#' @param formula Formula as string.
#' @return Invisible, error if formula not valid.
#' @seealso distributions
Expand Down
56 changes: 56 additions & 0 deletions man/defRepeat.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

58 changes: 58 additions & 0 deletions man/defRepeatAdd.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

25 changes: 24 additions & 1 deletion tests/testthat/test-add_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -31,4 +31,27 @@ test_that("addColumns works.", {
def2 <- defDataAdd(varname = "y", formula = "2.3 * (1/x)", dist = "normal")

expect_silent(addColumns(def2, dt))
})
})

test_that("defRepeatAdd works", {
expect_silent(
defRepeatAdd(nvars = 4, prefix = "g", formula = "1/3;1/3;1/3", variance = 0, dist = "categorical")
)

def <- defDataAdd(varname = "a", formula = "1;1", dist = "trtAssign")
expect_silent(
defRepeatAdd(def, 8, "b", formula = "5 + a", variance = 3, dist = "normal")
)

expect_silent(defRepeatAdd(nvars = 4, prefix = "b", formula = "5 + a", variance = 3, dist = "normal"))

})

test_that("defRepeatAdd throws errors correctly.", {
expect_error(defRepeatAdd(prefix = "b", formula = 5, variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
expect_error(defRepeatAdd(nvars = 8, formula = 5, variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
expect_error(defRepeatAdd(nvars = 8, prefix = "b", variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
})
23 changes: 22 additions & 1 deletion tests/testthat/test-define_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ test_that("checks combine in .evalDef correctly", {
forall(gen_evalDef_call, function(args) expect_silent(do.call(.evalDef, args)))
})

test_that(".evalDef throws erros correctly.", {
test_that(".evalDef throws errors correctly.", {
expect_error(.evalDef(newvar = 1, "1 + 2", "normal", 0, "identiy", ""), class = "simstudy::wrongType")
expect_error(.evalDef(newvar = c("a", "b"), "1 + 2", "normal", 0, "identiy", ""), class = "simstudy::lengthMismatch")
expect_error(.evalDef(newvar = "varname", "1 + 2", "not valid", 0, "identiy", ""), class = "simstudy::optionInvalid")
Expand Down Expand Up @@ -151,4 +151,25 @@ test_that("utility functions work", {
expect_equal(.splitFormula(";split"), c("", "split"))
})

test_that("defRepeat works.", {
expect_silent(
defRepeat(nvars = 4, prefix = "g", formula = "1/3;1/3;1/3", variance = 0, dist = "categorical")
)

def <- defData(varname = "a", formula = "1;1", dist = "trtAssign")
expect_silent(
defRepeat(def, 8, "b", formula = "5 + a", variance = 3, dist = "normal")
)
})

test_that("defRepeat throws errors correctly.", {
expect_error(defRepeat(prefix = "b", formula = 5, variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
expect_error(defRepeat(nvars = 8, formula = 5, variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
expect_error(defRepeat(nvars = 8, prefix = "b", variance = 3, dist = "normal"),
class = "simstudy::missingArgument")
expect_error(defRepeat(nvars = 4, prefix = "b", formula = "5 + a", variance = 3, dist = "normal"))
})

rm(list = setdiff(names(.GlobalEnv), freeze_eval), pos = .GlobalEnv)
34 changes: 25 additions & 9 deletions vignettes/simstudy.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -263,19 +263,35 @@ A *uniform* distribution is a continuous data distribution that takes on values

A *uniform integer* distribution is a discrete data distribution that takes on values from $a$ to $b$, where $b$ > $a$, and they both lie anywhere on the integer number line. The `formula` is a string with the format "a;b", where *a* and *b* are scalars or functions of previously defined variables. The `variance` and `link` arguments do not apply to the *uniform integer* distribution.

## Generating multiple variables with a single definition

`defRepeat` allows us to specify multiple versions of a variable based on a single set of distribution assumptions. The function will add `nvar` variables to the *data definition* table, each of which will be specified with a single set of distribution assumptions. The names of the variables will be based on the `prefix` argument and the distribution assumptions are specified as they are in the `defData` function. Calls to `defRepeat` can be integrated with calls to `defData`.

```{r}
def <- defRepeat(nvars = 4, prefix = "g", formula = "1/3;1/3;1/3",
variance = 0, dist = "categorical")
def <- defData(def, varname = "a", formula = "1;1", dist = "trtAssign")
def <- defRepeat(def, 3, "b", formula = "5 + a", variance = 3, dist = "normal")
def <- defData(def, "y", formula = "0.10", dist = "binary")

def
```

## Adding data to an existing data table

Until this point, we have been generating new data sets, building them up from scratch. However, it is often necessary to generate the data in multiple stages so that we would need to add data as we go along. For example, we may have multi-level data with clusters that contain collections of individual observations. The data generation might begin with defining and generating cluster-level variables, followed by the definition and generation of the individual-level data; the individual-level data set would be adding to the cluster-level data set.

### defDataAdd/readDataAdd and addColumns
### defDataAdd/defRepeatAdd/readDataAdd and addColumns

There are several important functions that facilitate the augmentation of data sets. `defDataAdd` and `readDataAdd` are similar to their counterparts `defData` and `readData`; they create data definition tables that will be used by the function `addColumns`. The formulas in these "*add*-ing" functions are permitted to refer to fields that exist in the data set to be augmented, so all variables need not be defined in the current definition able.
There are several important functions that facilitate the augmentation of data sets. `defDataAdd`, `defRepeatAdd`, and `readDataAdd` are similar to their counterparts `defData`, `defRepeat`, and `readData`, respectively; they create data definition tables that will be used by the function `addColumns`. The formulas in these "*add*-ing" functions are permitted to refer to fields that exist in the data set to be augmented, so all variables need not be defined in the current definition able.

```{r}
d1 <- defData(varname = "x1", formula = 0, variance = 1, dist = "normal")
d1 <- defData(d1, varname = "x2", formula = 0.5, dist = "binary")

d2 <- defDataAdd(varname = "y", formula = "-2 + 0.5*x1 + 0.5*x2 + 1*rx",
d2 <- defRepeatAdd(nvars = 2, prefix = "q", formula = "5 + 3*rx",
variance = 4, dist = "normal")
d2 <- defDataAdd(d2, varname = "y", formula = "-2 + 0.5*x1 + 0.5*x2 + 1*rx",
dist = "binary", link = "logit")

dd <- genData(5, d1)
Expand All @@ -295,12 +311,12 @@ In this example, the slope of a regression line of $y$ on $x$ varies depending o
```{r}
d <- defData(varname = "x", formula = 0, variance = 9, dist = "normal")

dc <- defCondition(condition = "x <= -2", formula = "4 + 3*x", variance = 2,
dist = "normal")
dc <- defCondition(dc, condition = "x > -2 & x <= 2", formula = "0 + 1*x", variance = 4,
dist = "normal")
dc <- defCondition(dc, condition = "x > 2", formula = "-5 + 4*x", variance = 3,
dist = "normal")
dc <- defCondition(condition = "x <= -2", formula = "4 + 3*x",
variance = 2, dist = "normal")
dc <- defCondition(dc, condition = "x > -2 & x <= 2", formula = "0 + 1*x",
variance = 4, dist = "normal")
dc <- defCondition(dc, condition = "x > 2", formula = "-5 + 4*x",
variance = 3, dist = "normal")

dd <- genData(1000, d)
dd <- addCondition(dc, dd, newvar = "y")
Expand Down