Fetch data from many data sources

The fetch package allows you to retrieve data from many different data sources. The package retrieves data in a memory-efficient manner. You first identify the data by defining a data catalog, then fetch the data from that catalog. Catalogs can be defined for many popular data formats, including csv, rds, sas7bdat, and Excel.

The functions contained in the fetch package are as follows:

catalog: Creates a data catalog
fetch: Retrieves a dataset from a data catalog
import_spec: Defines an import spec for a specific dataset

The fetch function retrieves a dataset from a data catalog. The function accepts a catalog item as the first parameter. The catalog item is the only required parameter. The "select" parameter allows you to pull only some of the columns. The "where" and "top" parameters may be used to define a subset of the data to retrieve. The "import_specs" parameter accepts an import_spec object, which can be used to control how data is read into the data frame.

Usage

fetch(catalog, select = NULL, where = NULL, top = NULL, import_specs = NULL)

Arguments

catalog: The catalog item to fetch data for. Catalog items are created using the catalog function.
select: A vector of column names or column numbers to extract from the data item. Note that the column names can be easily obtained as a vector from the catalog item, and then manipulated to suit your needs.
where: An optional expression to be used to filter the fetched data. Use the base R expression function to define the expression. The expression allows logical operators and Base R functions. Column names can be unquoted.
top: The number of records to return from the head of the data item. The value must be an integer.
import_specs: The import specs to use for the fetch operation. Import specs can be used to control the data types of the fetched dataset. An import specification is created with the import_spec function. See the documentation of this function for additional details and an example.
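
The "where" argument takes an unevaluated expression in which column names appear unquoted. As a rough illustration of that idea, the base R sketch below builds such an expression and evaluates it against a data frame. The toy data frame and the variable names df, flt, keep, and res are hypothetical, and this is not a description of how fetch applies the filter internally.

```r
# Toy data frame standing in for a dataset (hypothetical values)
df <- data.frame(SUBJID = c("049", "050", "051", "051"),
                 AVAL = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Capture the filter without evaluating it; column names are unquoted
flt <- expression(SUBJID == "051" & AVAL > 30)

# Evaluate the captured call with the data frame's columns in scope,
# which yields one logical value per row
keep <- eval(flt[[1]], df)

# Keep only the rows for which the filter is TRUE
res <- df[keep, , drop = FALSE]
res
```

Because the expression is captured rather than evaluated immediately, it can be defined once and passed around before any data is read.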

Value

The desired dataset, returned as a tibble.

Author

Maintainer: David Bosak dbosak01@gmail.com

Other contributors:

Kevin Kramer kkrame02@amgen.com [contributor]
Archytas Clinical Solutions [copyright holder]

Examples

# Get data directory
pkg <- system.file("extdata", package = "fetch")

# Create catalog
ct <- catalog(pkg, engines$csv)

# View catalog
ct
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
  # data item 'ADAE': 56 cols 150 rows
  # data item 'ADEX': 17 cols 348 rows
  # data item 'ADPR': 37 cols 552 rows
  # data item 'ADPSGA': 42 cols 695 rows
  # data item 'ADSL': 56 cols 87 rows
  # data item 'ADVS': 37 cols 3617 rows

# Example 1: Fetch Entire Dataset

# Get data from the catalog
dat1 <- fetch(ct$ADEX)

# View Data
dat1
# A tibble: 348 × 17                                                                                      
#   STUDYID USUBJID   SUBJID SITEID TRTP  TRTPN TRTA  TRTAN RANDFL SAFFL
#   <chr>   <chr>     <chr>  <chr>  <chr> <dbl> <chr> <dbl> <chr>  <chr>
#  1 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  2 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  3 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  4 ABC     ABC-01-0… 049    01     ARM D     4 ARM D     4 Y      Y    
#  5 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  6 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  7 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  8 ABC     ABC-01-0… 050    01     ARM B     2 ARM B     2 Y      Y    
#  9 ABC     ABC-01-0… 051    01     ARM A     1 ARM A     1 Y      Y    
# 10 ABC     ABC-01-0… 051    01     ARM A     1 ARM A     1 Y      Y    
# ℹ 338 more rows
# ℹ 7 more variables: MITTFL <chr>, PPROTFL <chr>, PARAM <chr>,
#   PARAMCD <chr>, PARAMN <dbl>, AVAL <dbl>, AVALCAT1 <chr>
# ℹ Use `print(n = ...)` to see more rows

# Example 2: Fetch a Subset

# Get data with selected columns and where expression
dat2 <- fetch(ct$ADEX, select = c("SUBJID", "TRTA", "RANDFL", "SAFFL"),
              where = expression(SUBJID == '051'))

# View Data
dat2
# A tibble: 4 x 4
#   SUBJID TRTA  RANDFL SAFFL
#   <chr>  <chr> <chr>  <chr>
# 1 051    ARM A Y      Y    
# 2 051    ARM A Y      Y    
# 3 051    ARM A Y      Y    
# 4 051    ARM A Y      Y
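
The "top" parameter is described above but not shown in the examples. The sketch below is one way it might be combined with "select", using only the arguments listed in the Usage section. It assumes the fetch package is installed and that the sample data ships with it as in the examples above; the variable name dat3 is chosen here for illustration.

```r
library(fetch)

# Get data directory and create the catalog, as in the examples above
pkg <- system.file("extdata", package = "fetch")
ct <- catalog(pkg, engines$csv)

# Fetch only two columns and only the first five records of ADEX
dat3 <- fetch(ct$ADEX, select = c("SUBJID", "TRTA"), top = 5)

# View Data
dat3
```

Limiting the rows with "top" is a convenient way to inspect the structure of a large data item before fetching all of it.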