R package to link historical individual-level data. Optimized to run on many cores. Read documentation here or below.
Installation
Install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("eirikberger/hismatch")
Use the auth_token option to supply a personal access token while the repo is still private.
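As a minimal sketch, the token can be read from an environment variable rather than written into the script. The variable name GITHUB_PAT is an assumption; any string holding a valid token with access to the repo works.

```r
# install.packages("devtools")
# Assumes a personal access token with access to the repo is stored in
# the GITHUB_PAT environment variable (the variable name is an assumption).
devtools::install_github(
  "eirikberger/hismatch",
  auth_token = Sys.getenv("GITHUB_PAT")
)
```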
Usage
A brief example of how to set up the class and run it.
library(data.table)
library(furrr)
library(hismatch)
dta1 <- fread("~/Dropbox/shared_server/deathDataFromFMB/processed_data/dar_1928-1945.csv")[1:1000]
dta2 <- copy(dta1)
plan(multisession, workers = 1L)
linkingTest <- Hismatch$new(data1 = dta1,
data2 = dta2,
firstname='firstname',
surname='lastname',
blocks=c(),
dist_thr=0.5,
rel_thr=FALSE,
max_block_size=50000,
matching_method=c("jw"))
# define matching blocks
linkingTest$blocks <- c('l_first', 'l_sur', 'byear', 'bmonth', 'bday', 'male')
# execute fuzzy matching
linkingTest$runMatching()
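To illustrate the general idea behind blocking followed by fuzzy matching, here is a small base R sketch. It is not the package's implementation: the toy data, the max_dist threshold, and the use of normalized Levenshtein distance (utils::adist) in place of Jaro-Winkler are all assumptions made to keep the example dependency-free.

```r
# Toy data: two sources that share one blocking variable (birth year).
a <- data.frame(id_a = 1:3,
                name = c("anna berg", "ole hansen", "kari lie"),
                byear = c(1900, 1901, 1900),
                stringsAsFactors = FALSE)
b <- data.frame(id_b = 1:3,
                name = c("ana berg", "ole hanssen", "per dahl"),
                byear = c(1900, 1901, 1900),
                stringsAsFactors = FALSE)

# Step 1: blocking. Only pairs that agree exactly on the block variable
# are compared, which keeps the number of comparisons manageable.
pairs <- merge(a, b, by = "byear", suffixes = c("_a", "_b"))

# Step 2: string distance within blocks, scaled to [0, 1].
# Levenshtein distance stands in for Jaro-Winkler here.
lev <- mapply(function(x, y) utils::adist(x, y)[1, 1],
              pairs$name_a, pairs$name_b)
pairs$dist <- lev / pmax(nchar(pairs$name_a), nchar(pairs$name_b))

# Step 3: keep candidate pairs at or below a hypothetical distance threshold.
max_dist <- 0.2
matches <- pairs[pairs$dist <= max_dist,
                 c("id_a", "id_b", "name_a", "name_b", "dist")]
print(matches)
```

Adding more variables to the merge key makes the blocks smaller and the comparison cheaper, at the cost of missing true matches that disagree on any block variable.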
Inspect results
Examples of how to extract information and data from the fuzzy match, and how to save and read results.
# print basic matching statistics
linkingTest
# other information and data
linkingTest$getMatchingRatesByVariable_DT(variable_list="birth_year", data=linkingTest$data1_pros)
linkingTest$plotMatchingByGroup('birth_year>1850')
linkingTest$full_match
linkingTest$data1_pros
linkingTest$data2_pros
linkingTest$data1
linkingTest$data2
linkingTest$blocks
# save and read results
saveHismatch('linkingTest', 'linkingTest.RDS')
linkingTest <- readHismatchClass('linkingTest.RDS')
Iterative Matching
There are cases where you want to start with the most restrictive blocking strategy and then gradually lift blocks for observations that remain unmatched. The following demonstrates how this can be implemented, where block_list is a list of the vectors used for blocking, ordered from the strictest to the least strict.
linkingTest <- Hismatch$new(data1 = dta1,
data2 = dta2,
firstname = "firstname",
surname = "carryforward_surname",
dist_thr = 0.90,
rel_thr = FALSE,
max_block_size = 50000,
letters = 1,
matching_method = c("jw")
)
# Linking
bl1 <- c("l_first", "l_sur")
bl2 <- c("l_first", "l_sur", "komnr")
bl3 <- c("l_first", "l_sur", "komnr", "residence")
bl4 <- c("l_first", "l_sur", "komnr", "residence", "occupation")
block_list <- list(bl4, bl3, bl2, bl1)
linkingTest$iterative_link("iterative_linking", block_list, "Norway_to_random_source", "dataset_1", "dataset_2")
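The logic of lifting blocks step by step can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the package's iterative_link(): the toy data are hypothetical, and exact agreement on the block variables is treated as a link, with the fuzzy name comparison left out.

```r
# Toy data; values are illustrative only.
src <- data.frame(id = 1:4,
                  l_first = c("a", "o", "k", "p"),
                  komnr = c(301, 301, 1103, 5001),
                  stringsAsFactors = FALSE)
tgt <- data.frame(id = 11:14,
                  l_first = c("a", "o", "k", "x"),
                  komnr = c(301, 4601, 1103, 5001),
                  stringsAsFactors = FALSE)

# Blocking sets ordered from the strictest to the least strict.
block_list <- list(c("l_first", "komnr"), c("l_first"))

links <- NULL
for (step in seq_along(block_list)) {
  # Only records still unmatched on both sides enter this step.
  src_left <- src[!src$id %in% links$id_src, ]
  tgt_left <- tgt[!tgt$id %in% links$id_tgt, ]
  cand <- merge(src_left, tgt_left, by = block_list[[step]],
                suffixes = c("_src", "_tgt"))
  if (nrow(cand) > 0) {
    links <- rbind(links,
                   data.frame(id_src = cand$id_src,
                              id_tgt = cand$id_tgt,
                              step = step))
  }
}
print(links)
```

Recording the step at which each link was found makes it possible to treat links from looser blocking strategies with more caution later on.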
iterative_link_by_year is a wrapper around this function, and matches observations by year from the same data.table. This function matches observations in year t to t+2 as defined by the years vector.
linkingTest <- Hismatch$new(firstname = "firstname",
surname = "carryforward_surname",
dist_thr = 0.90,
rel_thr = FALSE,
max_block_size = 50000,
letters = 1,
matching_method = c("jw")
)
# Linking
bl1 <- c("l_first", "l_sur")
bl2 <- c("l_first", "l_sur", "komnr")
bl3 <- c("l_first", "l_sur", "komnr", "residence")
bl4 <- c("l_first", "l_sur", "komnr", "residence", "occupation")
block_list <- list(bl4, bl3, bl2, bl1)
linkingTest$iterative_link_by_year(dta, years, "Norway", "iterative_linking", block_list, 2)
Unifying Names
A frequent problem in name matching is that some sources include full middle names, some include only the first letter, and some omit them altogether. If two sources follow different conventions, matching errors can increase. The function unify_names() addresses this by checking the naming convention of the two candidate matches and making them consistent. See an example produced by ChatGPT below.
   name_from_dataset1  name_from_dataset2    new_name_1    new_name_2
1:   John Michael Doe          John M Doe    John M Doe    John M Doe
2:         Jane M Doe   Jane Michaela Doe    Jane M Doe    Jane M Doe
3:       Robert Smith   Robert Alan Smith  Robert Smith  Robert Smith
4:      Ann B Johnson Ann Bethany Johnson Ann B Johnson Ann B Johnson
5:     Chris G W Bush          Chris Bush    Chris Bush    Chris Bush
Set unify_middlenames = TRUE in Hismatch$new() to activate this functionality.
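The harmonization rules can be sketched in base R as below. The helper unify_pair() is hypothetical and is not the package's unify_names(); its rules are inferred from the example table above and assume each name has at least a first name and a surname.

```r
# Hypothetical helper, not the package's unify_names(). Rules inferred from
# the example table: drop middle names if either source lacks them, and
# reduce them to initials if either source uses initials.
unify_pair <- function(name1, name2) {
  # Split a name into first, middle (possibly empty), and last parts.
  # Assumes at least two words per name.
  split_name <- function(x) {
    parts <- strsplit(trimws(x), "\\s+")[[1]]
    n <- length(parts)
    list(first = parts[1],
         middle = if (n > 2) parts[2:(n - 1)] else character(0),
         last = parts[n])
  }
  join_name <- function(p) paste(c(p$first, p$middle, p$last), collapse = " ")

  p1 <- split_name(name1)
  p2 <- split_name(name2)

  if (length(p1$middle) == 0 || length(p2$middle) == 0) {
    # One source ignores middle names: drop them from both.
    p1$middle <- character(0)
    p2$middle <- character(0)
  } else if (any(nchar(c(p1$middle, p2$middle)) == 1)) {
    # One source uses initials: reduce all middle names to initials.
    p1$middle <- substr(p1$middle, 1, 1)
    p2$middle <- substr(p2$middle, 1, 1)
  }
  c(new_name_1 = join_name(p1), new_name_2 = join_name(p2))
}

print(unify_pair("John Michael Doe", "John M Doe"))
print(unify_pair("Chris G W Bush", "Chris Bush"))
```

Names that both carry full middle names are left unchanged by this sketch, since the two sources already follow the same convention.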