Methodology

The core of the Census Tree comes from information provided by users of FamilySearch.org, an online genealogy platform. Users can attach digitized historical records to the profiles of their ancestors, including the decennial censuses from 1850 to 1940. Whenever a user attaches two different census records to a single profile, the pair forms a census-to-census link. There are over 317 million of these user-provided links, which together constitute a dataset we call the Family Tree.
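As a rough illustration of how attachments turn into links, the sketch below pairs up the census records attached to each profile. The tuple layout, the function name `links_from_attachments`, and the sample identifiers are hypothetical and are not the FamilySearch data format. The sketch also assumes that only records from different census years are linked.

```python
from collections import defaultdict
from itertools import combinations

def links_from_attachments(attachments):
    """Build census-to-census links from (profile_id, census_year, record_id) tuples.

    The tuple layout is an assumption made for this illustration.
    """
    by_profile = defaultdict(set)
    for profile_id, year, record_id in attachments:
        by_profile[profile_id].add((year, record_id))

    links = set()
    for records in by_profile.values():
        # sorted() orders each pair so the earlier census comes first
        for (y1, r1), (y2, r2) in combinations(sorted(records), 2):
            if y1 != y2:  # assumption: link only records from different censuses
                links.add(((y1, r1), (y2, r2)))
    return links

# Hypothetical attachments: profile P1 has three census records, P2 has one.
attachments = [
    ("P1", 1900, "rec-a"),
    ("P1", 1910, "rec-b"),
    ("P1", 1920, "rec-c"),
    ("P2", 1900, "rec-d"),
]
links = links_from_attachments(attachments)
```

In this toy input, the three records on P1 yield three links (1900-1910, 1900-1920, 1910-1920), while P2 contributes none because it has only one census record attached.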

We build on the Family Tree in two ways. First, we use the Family Tree as training data for a machine learning algorithm to create additional census-to-census links. Second, we add links from the Census Linking Project and the IPUMS Multigenerational Longitudinal Panel, and hints from FamilySearch. After filtering the links for quality and adjudicating conflicts, we have the Census Tree.
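To make "adjudicating conflicts" concrete: a conflict arises when different sources link the same record in one census to different records in another. The sketch below applies one simple rule, dropping every link involved in a disagreement. This is an assumption made for illustration, not the procedure used for the Census Tree, which is described in the paper cited below and may, for example, prefer some sources over others. The source labels and record identifiers are hypothetical.

```python
from collections import defaultdict

def adjudicate(candidate_links):
    """Keep only links whose two records have a single partner across all sources.

    candidate_links: iterable of (source_name, record_a, record_b).
    Returns a dict mapping each kept (record_a, record_b) pair to the set of
    sources that proposed it. The drop-all-conflicts rule is illustrative only.
    """
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    sources = defaultdict(set)
    for source, a, b in candidate_links:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)
        sources[(a, b)].add(source)

    return {
        (a, b): srcs
        for (a, b), srcs in sources.items()
        if len(partners_of_a[a]) == 1 and len(partners_of_b[b]) == 1
    }

# Hypothetical candidates: two sources agree on a1 -> b1,
# while two sources disagree about the partner of a2.
candidates = [
    ("family_tree", "a1", "b1"),
    ("xgboost", "a1", "b1"),
    ("clp", "a2", "b2"),
    ("mlp", "a2", "b3"),
]
kept = adjudicate(candidates)
```

Here the link a1 to b1 survives with both of its sources recorded, and both candidate links for a2 are dropped because they disagree.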

For a more detailed description of the methodology behind the Census Tree, please see Buckles, Haws, Price, and Wilbert (forthcoming).

Training Data and Code

We have created a replication package that can be used to recreate the links for 1900-1910, including training data and code for the machine learning model.
The XGBoost models for all 36 crosswalks are available here.
The Census Tree Project also includes efforts to improve the technology used to digitize historical records. This paper describes the work. Download the training data, source code, and a trained model here.
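
The count of 36 crosswalks mentioned above is consistent with one crosswalk for every pair of censuses from 1850 to 1940, excluding 1890. This is an inference from the numbers, assuming 1890 is left out because most of its population schedules were destroyed. A short check of the arithmetic:

```python
from itertools import combinations

# Decennial censuses from 1850 to 1940, assuming 1890 is excluded
census_years = [year for year in range(1850, 1941, 10) if year != 1890]

# One crosswalk per unordered pair of census years
crosswalk_pairs = list(combinations(census_years, 2))

print(len(census_years))     # 9
print(len(crosswalk_pairs))  # 36
```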

Acknowledgements