Simulate individual-level data • regtools

This vignette describes the aim and potential use cases of the function synthetic_data(). In certain cases described in this vignette, the synthetic_data() also requires an internet connection.

Why?

In most research fields, individual-level data is regulated by strict confidentiality agreements. In Norway, sensitive health and administrative individual-level data cannot be shared or used outside of secure platforms, such as TSD (Services for sensitive data). While this type of disclosure control protects the privacy of study participants, it can also pose a significant challenge to reproducibility and transparency practices in research. Although code sharing has gained popularity, code cannot be appropriately evaluated (and potentially reused) if the original analytic data is unavailable.

One possible solution to the challenges associated with using confidential data is to create synthetic or simulated data that have similar structure and statistical characteristics as the original data. There are a number of excellent R packages dedicated to creating synthetic datasets from existing data, such as synthpop, faux and simpop. The resulting synthetic data are high-fidelity and aim to maintain the internal statistical characteristics and relationships between variables. However, many features of these packages rely on having full access to the original individual-based data. Additionally, it is crucial to remember that synthetic data based on existing datasets is not inherently private (source , Alan Turing Institute) and can be vulnerable to data leaks and attacks.

In line with the overarching objectives of regtools, the synthetic_data() function aims to help researchers navigate some of the challenges associated with working with confidential individual-level data in Norway by creating synthetic datasets without the need of pre-existing data. As this function does not require the use of pre-existing data, it also does not attempt to capture the internal statistical characteristics and relationships between variables. However, synthetic_data() is still particularly useful for researchers working with Norwegian health and sociodemographic datasets, as it produces synthetic datasets with a structure and semantics resembling those found in actual individual-level data (e.g NPR, KPR, SSB).

Consider the features the synthetic_data() function, we have identified three main use cases:

Easier code sharing: this type of synthetic datasets allow researchers to simultaneously share their analytically code and data without privacy concerns. In turn, reviewers and collaborators can successfully execute the code, facilitating its correct evaluation and potential reuse.
Educational purposes: the data generated by the synthetic_data() function also provides a low-barrier and low-risk way of exploring and manipulating data, making it ideal for training or onboarding new researchers.
Development without data access: if data availability is delayed or not possible to access for other reasons (e.g. preregistration), researchers can still prepare analytic scripts in advance.

How?

Broadly speaking, the synthetic_data() function has the ability to create three different types of datasets that meet certain minimum characteristics:

Diagnostic (health) data: This includes at least a unique personal identifier (ID), date of diagnostic event, and diagnosis code (such as ICD-10 or ICP-2). For instance, datasets from NPR (Norwegian Patient Registry) or KPR (Kommunalt pasient- og brukerregister).
Time-invariant data: Including at least a unique personal identifier (ID) and sociodemographic variables such as date of birth, immigration background, etc.
Time-varying data: Encompassing at least a unique personal identifier (ID), date, and sociodemographic variables such as place of residence or marital status. This type of information usually is updated quarterly or yearly in administrative registries.

Considering the complexity and variability of registry data, the datasets created by synthetic_data() will likely differ to a certain degree from the actual data delivered by NPR or SSB (Statistics Norway). Even when the structure of the synthetic datasets varies from the original data, the simulated datasets serve as a useful starting point that researchers can further modify and tailor to suit their specific needs. Furthermore, the synthetic_data() function also keeps track of important metadata associated with the data generation process. In this way, it is possible for researchers to produce consistent datasets in an efficient manner.

Practical example

To successfully generate the three different synthetic datasets describe in the section above, you will need to provide some information about the population size and classifications you want to include in the different datasets. For the whole description of each argument please consult the synthetic_data() function’s documentation.

The population_size parameter will be used to ensure that the size of the datasets is similar to the one you will encounter working with real data, while also generating the necessary number of diagnostic cases for the given prevalence or incidence rate. In the code below, the population size is specified as 25,000 with a period prevalence of .06 (6%). Therefore, there will be 1500 relevant cases in the data.

With the purpose of making the synthetic datasets as realistic as possible, they all include an additional number of non-relevant cases (to complete the specified population size). Additionally, all the individuals in the diagnostic dataset can have a random number of repeated diagnostic events either in the same year or different years.

The arguments family_codes, pattern, diag_years, sex_vector, y_birth, invariant_codes, varying_query are all used to populate the diagnostic, time-invariant and time-varying information of the relevant 1500 cases. The remainder arguments are used to generate the filler non-relevant cases.

In most cases you will want to specify your own column names and codes for the invariant and time-varying variables (invariant_codes, invariant_codes_filler, varying_codes, varying_codes_filler), however the function also supports looking for classification codes in SSB’s Statistical Classifications and Codelists (Klass). Both varying_query and invariant_queries options require an internet connection to retrieve the specified classifications and codelists from SSB.


dummy_data <- regtools::synthetic_data(
  population_size = 25000,
  prefix_ids = "P000",
  length_ids = 6,
  family_codes = c("F8"),
  pattern = "increase",
  prevalence = .060,
  diag_years  = c(2012:2020),
  sex_vector = c(0, 1),
  y_birth = c(2010:2018),
  filler_codes = c("F4", "F7"),
  filler_y_birth = c(2000:2009),
  invariant_codes = list("innvandringsgrunn" = c("ARB", "NRD", "UKJ")),
  invariant_codes_filler = list("innvandringsgrunn" = c("FAMM", "UTD")),
  varying_query = "fylke",
  seed = 123) 
#> ℹ Creating relevant cases with the following characteristics:
#> • Population size = 25000
#> • Prefix IDs = P000
#> • Length IDs = 6
#> • Diagnostic relevant codes = F8
#> • Pattern of incidence = increase
#> • Prevalence = 0.06
#> • Diagnostic years = 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, and 2020
#> • Incidence =
#> • Coding sex = 0 and 1
#> • Relevant years of birth = 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, and
#> 2018
#> ℹ Creating filler cases with the following characteristics:
#> • Filler diagnostic codes = F4 and F7
#> • Filler years of birth = 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
#> and 2009
#> • Pattern for filler incidence = 'random'
#> • Number of filler cases to generate = 23500
#> ! This process can take some minutes...
#> ✔ Succesfully generated diagnostic, time-varying and time-invariant datasets!

With the purpose of facilitating transparency and reproducibility, the synthetic_data() function outputs a list with two named lists: “datasets” and “metadata”. Within the first list (“datasets”), you will find the diagnostic, time-invariant and time-varying data as separate data frames.

str(dummy_data$datasets)
#> List of 3
#>  $ invar_df: tibble [25,000 × 4] (S3: tbl_df/tbl/data.frame)
#>   ..$ id               : chr [1:25000] "P000000037" "P000000052" "P000000059" "P000000111" ...
#>   ..$ sex              : Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 2 1 2 ...
#>   ..$ y_birth          : int [1:25000] 2002 2001 2011 2009 2007 2006 2009 2004 2007 2001 ...
#>   ..$ innvandringsgrunn: chr [1:25000] "FAMM" "FAMM" "NRD" "FAMM" ...
#>  $ var_df  : tibble [225,000 × 3] (S3: tbl_df/tbl/data.frame)
#>   ..$ id          : chr [1:225000] "P000000037" "P000000037" "P000000037" "P000000037" ...
#>   ..$ year_varying: int [1:225000] 2012 2013 2014 2015 2016 2017 2018 2019 2020 2012 ...
#>   ..$ varying_code: chr [1:225000] "03" "03" "03" "03" ...
#>  $ diag_df : tibble [99,731 × 3] (S3: tbl_df/tbl/data.frame)
#>   ..$ id       : chr [1:99731] "P000000059" "P000000059" "P000000059" "P000001221" ...
#>   ..$ code     : chr [1:99731] "F801" "F419" "F480" "F848" ...
#>   ..$ diag_year: int [1:99731] 2015 2017 2018 2020 2019 2017 2014 2019 2020 2020 ...
dummy_diag_df <- dummy_data$datasets$diag_df

The second list (“metadata”) includes useful metadata like the exact call used to generate the datasets, as well as the values of each argument used in the function call.

str(dummy_data$metadata)
#> List of 2
#>  $ call     : language regtools::synthetic_data(population_size = 25000, prefix_ids = "P000",      length_ids = 6, seed = 123, family_co| __truncated__ ...
#>  $ arguments:List of 15
#>   ..$ population_size       : num 25000
#>   ..$ prefix_ids            : chr "P000"
#>   ..$ length_ids            : num 6
#>   ..$ seed                  : num 123
#>   ..$ family_codes          : language c("F8")
#>   ..$ pattern               : chr "increase"
#>   ..$ prevalence            : num 0.06
#>   ..$ diag_years            : language c(2012:2020)
#>   ..$ sex_vector            : language c(0, 1)
#>   ..$ y_birth               : language c(2010:2018)
#>   ..$ filler_codes          : language c("F4", "F7")
#>   ..$ filler_y_birth        : language c(2000:2009)
#>   ..$ invariant_codes       : language list(innvandringsgrunn = c("ARB", "NRD", "UKJ"))
#>   ..$ invariant_codes_filler: language list(innvandringsgrunn = c("FAMM", "UTD"))
#>   ..$ varying_query         : chr "fylke"

Let’s say you are researcher looking into the co-occurrence of two particular family codes (ICD-10 F8 and F7) in your population of interest. However, there has been a delay in data access. However, using the diagnostic dataset generated in the previous step and some of the functions from regtools, it is possible to prepare some analytic scripts before you have access to the actual data in your project.

# Output in console has been silenced for this example 

dates <- as.character(c(2012:2018))

diag_df_first <- dummy_diag_df |> 
  regtools::curate_diag_data(
    code_col = "code", 
    date_col = "diag_year", 
    log_path = l_file)

diag_f8_year <- dates |> 
  purrr::map(\(x) regtools::filter_diag_data(
    diag_df_first, 
    pattern_codes = "F8", 
    code_col = "code", 
    date_col = "y_diagnosis_first", 
    diag_dates = x, 
    log_path = l_file)) |> 
  purrr::map(\(x) dplyr::select(x, "id"))


diag_f7 <- dummy_diag_df |> 
  regtools::filter_diag_data(
    pattern_codes = "F7", 
    code_col = "code", 
    log_path = l_file)

intersect_f8_f7 <- purrr::map(diag_f8_year, \(x) dplyr::intersect(x, diag_f7[1])) |> 
  purrr::map(\(x) nrow(x)) 

names(intersect_f8_f7) <- dates

intersect_f8_f7_df <- purrr::map_df(intersect_f8_f7, ~as.data.frame(.x), .id="year") |>
  dplyr::rename("count" = ".x")

regtools::plot_rates(
  intersect_f8_f7_df, 
  date_col = "year", 
  grouping_var = "year", 
  rate_col = "count", 
  plot_type = "lollipop",
  percent = FALSE, 
  palette = "viridis", 
  plot_title = "New yearly F8 diagnoses",  
  y_name = "Count", 
  coord_flip = TRUE) + 
  ggplot2::labs(subtitle = "Individuals with a previous or future F7 diagnosis") +
  ggplot2::theme(legend.position = "none")