02-data-manipulation-examples.Rmd

---
title: "Data Manipulation Examples"
author: "Joscelin Rocha Hidalgo"
date: "Tuesday, Oct 4th, 2022"
output: 
    html_document:
        css: slides/style.css
        toc: true
        toc_depth: 1
        toc_float: true
        df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Loading Packages

Load the `tidyverse` package.

```{r}
library(tidyverse)
```

# Import chds6162_data

```{r warning=FALSE}
data <- read_csv("data/chds6162_data.csv")

tail(data,12)
```

# rename

```{r}
#Let's clean that column with too many characters to simply be date rather than dateofvisit

data <- rename(data, date = dateofvisit)
head(data)


#make them all upper case
rename_with(data, toupper)

#make them all lower case
rename_with(data, tolower)
```

# Clean_names

```{r}
# clean_names is part of the function called janitor
library(janitor)

data %>% 
  clean_names(., "big_camel")

data %>% 
  clean_names(., "screaming_snake")
```

Art by @allison_horst

![](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/other-stats-artwork/coding_cases.png)

# Separate

```{r}
# If you want to split by any non-alphanumeric value (the default)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
head(df)

df %>% 
  separate(x, c("col w A", "col w B"))

# If you just want the second variable:
df %>% 
  separate(x, c(NA, "B"))

df <- data.frame(x = c("a", "a b", "a b c", NA))
head(df)

df %>% 
  separate(x, c("column with As", "column with Bs"))


df %>% 
  separate(x, c("column with As", "column with Bs")) %>%
  unite("new col",1:2, sep = " ")

```

#?

# Select

![](images/select.png)

With the function `select` we can select variables (columns) from the larger data frame.

Use `select` to show just the `gestation` variable.

```{r}
data %>%
  select(gestation)
```

We can also `select` a range of columns. `select` all the variables that belong to the father (they had a "d" in front of them) `drace` to `dwt`.

```{r}
data %>%
  select(drace:dwt)

#What about just the id column and everything after the father information?
data %>%
  select(id, marital:last_col())
```

We can drop variables using the -var format. Drop the `marital` variable.

```{r}
data %>%
  select(-(marital))
```

# Filter

![]images/filter.png) ![art by ](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_filter.jpg) art by @allison_horst

We use `filter` to choose a subset of cases.

Use `filter` to keep only those whose parents have been never married (value `5` in the `marital` variable. Then, use `select` to show only the `age` variable to let us explore the mothers' ages when they delivered.

```{r}
data %>%
  filter(marital == 5) %>%
  select(age)

```

Use `filter` to keep only those who are **not** divorced (value `3`).

```{r}
data %>%
  filter(marital != 3)
```

Use `%in%` within the `filter` function to keep only those who are divorced (3), legally separated (2), or widowed (4).

```{r}
data %>%
  filter(marital %in% c(2, 3,4))
```

Create a chain that keeps only those whose moms were in their 20s when they delivered and are college grads (val= 5). Use the `age` and `ed` variables.

```{r}
data %>%
  filter(ed == 5, 
         age %in% 20:30,
         ht >65)

```

We can use `<`, `>`, `<=`, and `>=` for numeric data. == equal, != not equal

#?

# Mutate

![](images/mutate.png)

![art by @allison_horst](images/dplyr_mutate.png)

We use `mutate` we make new variables or change existing ones.

Create a **new variable with a specific value**

Create a new variable called `data_decade`. Imagine that you will be merging this dataset from 61-62 to dataset from the 70's. To make it easier, you will create this variable with the value "60s."

```{r}
data %>%
  mutate(data_decade = "60s")
```

Create a **new variable based on other variables**

Create a new variable called `wt_k`. This variable will give you information about mom's weight pre-pregnancy(`wt.1`) in kilos (1 pound = .454 kilos).

```{r}
data %>% 
  mutate(wt_k = wt.1*.454)

# too many decimals? let's round things

data %>% 
  mutate(wt_k = round((wt.1*.454),0))

```

# case_when

![art by @allison_horst](images/dplyr_case_when.png)

Change an **existing variable**

For some reason you want to have the real labels rather than the values. Let's change the `marital`variable to its real values: 1 = married, 2 = legally separated, 3 = divorced, 4 = widowed, 5 = never married.

```{r}
data %>% 
  mutate(marital = case_when(
    marital == 1 ~ "married",
    marital == 2 ~ "legally separated",
    marital == 3 ~ "divorced",
    marital == 4 ~ "widowed",
    marital == 5 ~ "never married"
  ))
```

Fixing an ID

```{r}
data %>% 
  mutate(id = case_when(
    id == 15 ~ 19,
    TRUE ~ id
  ))
```

# summarize or summarise

![](images/summarize.png)

With `summarize`, as the name implies, you will get a summary of your dataset.

Get the mean days of gestation for this sample.

```{r}
mean<-data %>%
  summarize(gestation_length_mean = round(mean(gestation, na.rm = TRUE),2))
```

**Extra

```{r}
mean2<-data %>%
  summarize(gestation_length_mean = round(mean(gestation, na.rm = TRUE),2)) %>%
  pull()
```

We can have multiple arguments in each usage of `summarize`.

In addition to the mean gestation length you calculated earlier, calculate the minimum and maximum for the same variable.

```{r}
data %>%
  summarize(mean_gestation_length = mean(gestation, na.rm = TRUE),
            min_gestation_length = min(gestation, na.rm = TRUE),
            max_gestation_length = max(gestation, na.rm = TRUE))
```
# ?

# group_by

![](images/group-by.png)

![art by @allison_horst](images/group_by_ungroup.png)

`group_by`enables us to perform calculations on each of our groups.

Calculate the mean gestation length in days based on mother's smoking habits (0=never, 1=smokes now, 2=until current pregnancy, 3=once did, not now) using `group_by` and `summarize`.

```{r}
data %>% 
  group_by(smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE))

#Since we know that there are some NAs in the Smoke variable we can go ahead and drop those NAs before we summarize them
data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE))
```

We can use `group_by` with multiple groups.

```{r}
data %>% 
  group_by(ed,smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE))
```

# across

![art by @allison_horst](images/dplyr_across.png)

`across`enables us to perform calculations on each of our selected columns

In the example above, we calculated the mean for gestation but what about if I wanted the mean for gestation, mom age (age) and dad age (dage)

```{r}

#If we didn't know better, we would do this:
data %>% 
  group_by(smoke) %>%
  summarize(mean_gestation_length = mean(gestation,
                                    na.rm = TRUE),
            mean_m_age = mean(age,
                                    na.rm = TRUE),
            mean_d_age = mean(dage,
                                    na.rm = TRUE))
```

```{r}

#But we know better:
data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(across(c(gestation,age,dage),~ mean (., na.rm = TRUE)))

?across
```

# ?
# Create a new data frame

Sometimes you want to save the results of your work to a new data frame or replace the one you already have (not recommended).

```{r}
to_graph <- data %>% 
  drop_na(smoke) %>%
  group_by(smoke) %>%
  summarize(across(c(gestation,age,dage),~mean(.,na.rm = TRUE)))

to_graph
```

# relocate

![art by @allison_horst](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_relocate.png)

Sometimes you want to export the table as a csv but you don't really like the order of the columns. Relocate allows you to move them around without you having to go back to previous codes. Let's take the `to_graph` data frame and move the age column before the gestation column.

```{r}
?relocate

to_graph %>%
  relocate (age, .before = gestation)


#What if I wanted gestation to be the last column?
to_graph %>%
  relocate (gestation, .after = last_col())

# ok now I like it and want a CSV of it. 
# I make sure to save it as a dataframe and then export it
csv_to_export <- to_graph %>%
  relocate (gestation, .after = last_col())

# export csv
write_csv(csv_to_export, "exports/example_to_export.csv")
```

# pivot_longer/pivot_wider

```{r}
oops <- csv_to_export %>%
  pivot_longer(c(age,dage), names_to = "potato", values_to = "age_mean")


oops %>%
  pivot_wider(names_from = "potato", values_from = "age_mean")
```