This article was intended for a different forum. When that didn’t work out, I decided to park it on my personal blog and here.
–
Professor of Sociology William H. Sewell was deeply interested in social mobility. Do career aspirations affect career achievement? Do individual and social traits underlie those aspirations? Despite some preliminary research in the 1950s, Sewell lacked the data to answer these questions.
In 1962, Sewell made a lucky discovery: sitting unused in a University of Wisconsin administration building were the survey schedules and punch cards from a 1957 survey on the educational plans of all Wisconsin high school seniors. Now he had what he needed. He randomly selected 10,317 of these seniors and, in 1964, sent postcard surveys to their parents, asking about the seniors’ education, career aspirations, and socioeconomic status. Eighty-seven per cent of the parents responded. A 1975 telephone survey of the graduates themselves had a response rate of 89 per cent.
These steps began the Wisconsin Longitudinal Study; longitudinal studies observe subjects over time. In 1977, the survey expanded to include randomly selected siblings of the graduates. By 2020, the survey had accumulated over 60 years of data. The high school graduates were 81 years old.
A significant milestone came in 2006-07. Saliva sample kits for genetic analysis were posted to participants. Kits were mailed in 2011 to those who missed the first round. Overall, 64 per cent of graduates and 36 per cent of siblings provided samples, covering about half of the 18,000 participants.
As genetic data doesn’t change, these samples enriched 50 years of data, including that initial collection in 1957. Each new collection, such as those in 2011 and 2020, is also augmented by this genetic data.
In 2018, Daniel Belsky and colleagues published a paper in the Proceedings of the National Academy of Sciences using the Wisconsin Longitudinal Study’s genetic data. For each student, they calculated an education “polygenic score”, a measure of the genetic influence on educational attainment. Students with higher scores had higher career success and social rank than their parents did in 1957. They were upwardly mobile. This observation held within families: siblings with higher scores achieved more as well.
As genetics are the cause of many phenomena we study, genetic data can be of immense value. Studying the mechanisms of DNA transmission and recombination between generations can help policymakers investigating issues such as social mobility, poverty, and inequality. We could assess interventions, from education to tax reform to childcare. By learning which interventions work, we could allocate public resources more efficiently. Kathryn Paige Harden makes the case for the value of genetic analysis comprehensively in the excellent The Genetic Lottery: Why DNA Matters for Social Equality. In what follows, I will remake that case in only the most superficial way.
However, there is a major barrier to using genetic data in this way: most of the datasets we use to investigate socio-economic phenomena do not include genetic data, preventing us from including genetics in our analyses.
We should address this problem by collecting genetic data from the participants in longitudinal research. As our DNA is fixed for life, we should supplement old longitudinal data with genetic data, enriching decades of past work. The addition of genetic data to the Wisconsin Longitudinal Study provides a template.
We could also be bringing genetic data to bear in our experimental work. The addition of genetic data to experimental panels could provide rich insight into the heterogeneity of behaviour.
Importantly, we can do this now. While many discussions on genetic data in social science focus on growing sample size, new tools and future possibilities, existing tools can give us insight today and bring future possibilities to life.
In what follows, I will first touch on two topics of interest to economists (my own profession): using genetic transmission to infer causation and some examples of genetic data applied to questions of social mobility and inequality. That will take me to my main point, that we should be supplementing our core research datasets with genetic data now.
Causation
There is abundant data indicating the intergenerational persistence of educational outcomes and socioeconomic status. Here are two Australian examples (I’ll lean on material from my home country, which I know best): A Year 9 student (aged 14-15) whose parents have a bachelor’s degree or higher will, on average, have numeracy skills almost four years ahead of those of classmates whose parents do not have a bachelor’s degree. A child born in Australia to a family in the bottom 20 per cent of parental incomes has a 12 per cent chance of being in the top 20 per cent of incomes 30 years later.
For an economic policymaker, these statistics raise questions. Do the children of educated, wealthy parents have an unfair advantage? Would equalising wealth through taxes and transfers close the intergenerational gap? Designing robust solutions requires accurately assessing causality, but a correlation between child and parent outcomes does not prove that higher parental education or socioeconomic class causes outcomes in the next generation. We need to consider if a third factor might be “confounding” the result.
A likely third factor is genetics. Children and parents share DNA. If parents genetically transmitted traits like intelligence and conscientiousness to their children, we could see a correlation between the child and parent even if parental education or income had no direct effect on the child. It should not be controversial to say that genetics could underlie this result: the first law of behaviour genetics is that all human behavioural traits are heritable.
The genetic confound raises the question of how to infer the cause. Economists are infatuated with causation; absent a randomised controlled trial, they scour the world for interesting data sets and quasi-experiments to tease out causality. This pursuit led to innovative approaches to infer causation. Economists celebrate the “credibility revolution”, demanding rigorous study design.
An example quasi-experiment concerns protests in Paris in May 1968, which led authorities to be lenient in university entrance exams. Eric Maurin and Sandra McNally studied students who barely passed despite the increased leniency, and likely would not have been accepted in other years. These students earned higher future wages, and, in turn, their children obtained more education. As the riot did not affect the genetics of the parents, we can take the change in schooling as the cause.
Economists’ focus on causation does have a benefit. The social science literature is littered with studies for which genetics is an obvious confound. There are fewer examples in the economics literature (that is my impression, at least). Although economists rarely discuss genetics, they prefer experimental designs that avoid confounds. That, however, places a constraint on the questions that we can answer. Thankfully, overcoming the genetic confound does not always require a city to descend into riots.
Human genomes consist of 3 billion base pairs, with over 99 per cent of base pairs the same from person to person. Most of the variation between people takes the form of single nucleotide polymorphisms (SNPs), changes in just one base pair that are present in at least 1 per cent of the population. Genetic databases, like that of the Wisconsin Longitudinal Study, typically contain samples of SNPs. If you’ve taken a DNA test from companies like 23andMe (R.I.P.) or MyHeritage, they analysed your SNPs too.
Research using SNP data suggests that most traits are polygenic: that is, many genes underlie the traits’ heritability. For example, one study identified 3,952 SNPs linked to educational attainment. This reflects a proposed fourth law of behaviour genetics: ‘A typical human behavioural trait is associated with very many genetic variants, each of which accounts for a very small percentage of the behavioural variability.’
A long list of SNPs alone is not useful for examining social science outcomes. As a result, scientists developed polygenic scores, a weighted count of SNPs enhancing a trait. To give an example, one study found that a person in the 84th percentile of polygenic scores for educational attainment was 19 per cent more likely to complete a university degree than someone with an average score. It may not sound like much, but this is a similar effect size to that of many social and cultural factors, like family socioeconomic status.
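To make the "weighted count" idea concrete, here is a minimal sketch of how a polygenic score could be computed. The SNP names, weights and genotypes are invented for illustration and do not come from any study; treating a missing SNP as zero copies is also a simplifying assumption.

```python
from statistics import NormalDist

# Hypothetical per-SNP weights (illustrative numbers, not real estimates).
WEIGHTS = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.03}

def polygenic_score(genotype, weights=WEIGHTS):
    """Weighted count: sum over SNPs of (copies of the counted allele) * weight.

    genotype maps SNP name -> 0, 1 or 2 copies. Missing SNPs count as 0,
    which is a simplification made for this sketch.
    """
    return sum(w * genotype.get(snp, 0) for snp, w in weights.items())

def standardise(scores):
    """Rescale scores to mean 0 and standard deviation 1 within the sample."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]

# Three made-up people with made-up genotypes.
people = {
    "p1": {"snp_a": 2, "snp_b": 0, "snp_c": 1},
    "p2": {"snp_a": 0, "snp_b": 2, "snp_c": 0},
    "p3": {"snp_a": 1, "snp_b": 1, "snp_c": 2},
}
raw = {pid: polygenic_score(g) for pid, g in people.items()}
z = dict(zip(raw, standardise(list(raw.values()))))

# Share of a standard normal distribution lying below +1 standard deviation.
share_below_one_sd = NormalDist().cdf(1.0)
```

Scores are usually reported in standardised form. On a roughly normal score, one standard deviation above the mean sits at about the 84th percentile, which is what `share_below_one_sd` computes.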
Despite the apparent link between polygenic scores and outcomes, this is again correlation and not proof of causation. Does a link between polygenic score and outcome mean genetics caused the difference? We cannot jump to an answer of “yes”. Population stratification may be at play, where genetic differences arise between groups due to historical, cultural and social factors. For example, intermarriage among highly educated groups can concentrate particular genetic variants in those groups over time, so that an association between those variants and education reflects historical contingency rather than a causal effect of the genes. We can at least rule out reverse causation; education does not alter DNA.
However, the nature of DNA transmission from parent to child provides a mechanism by which we can get closer to the cause. Your genetic material is organized into 23 pairs of chromosomes, one-half of each pair coming from each of your father and mother. Each parent’s chromosomes in turn came from your grandparents, but you didn’t receive exact copies of your grandparents’ chromosomes. During the creation of your parents’ eggs and sperm, your grandparents’ chromosomes were spliced together, resulting in each egg or sperm having different combinations of your grandparents’ chromosomes. Women create about 45 splices, and men create about 26. The result is that siblings receive a random draw from each parent’s chromosomes. This draw involves a small number of chunks (around 23+45=68 from the mother and 23+26=49 from the father), not each of the three billion base pairs independently. Because of this small number of chunks, the genetic relatedness between siblings can vary significantly from the average of 50 per cent, with most siblings sharing between 43 per cent and 57 per cent of their genetic material. My identical twin sons share only 41 per cent of their DNA with their younger brother.
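The effect of inheriting DNA in a small number of chunks can be illustrated with a rough simulation. This is a sketch under strong simplifying assumptions that are mine rather than the source's: all 23 chromosomes are treated as the same length, and the 45 or 26 splices are scattered uniformly at random across them. Real recombination is not this tidy.

```python
import random
import statistics

N_CHROMOSOMES = 23
CROSSOVERS = {"mother": 45, "father": 26}  # approximate counts from the text

def make_gamete(n_crossovers, rng):
    """One egg or sperm: per chromosome, (starting grandparent copy, cut points).

    Assumption: equal-length chromosomes, cuts placed uniformly at random.
    """
    cuts = [[] for _ in range(N_CHROMOSOMES)]
    for _ in range(n_crossovers):
        cuts[rng.randrange(N_CHROMOSOMES)].append(rng.random())
    return [(rng.randrange(2), c) for c in cuts]

def shared_fraction(gamete_a, gamete_b):
    """Fraction of the genome where two gametes carry the same grandparental copy."""
    total = 0.0
    for (start_a, cuts_a), (start_b, cuts_b) in zip(gamete_a, gamete_b):
        same = start_a == start_b
        prev = 0.0
        for cut in sorted(cuts_a + cuts_b):
            if same:
                total += cut - prev
            same = not same  # a splice in either gamete flips match/mismatch
            prev = cut
        if same:
            total += 1.0 - prev
    return total / N_CHROMOSOMES

def sibling_relatedness(rng):
    """Average sharing over the maternal and paternal halves of two siblings."""
    halves = [
        shared_fraction(make_gamete(n, rng), make_gamete(n, rng))
        for n in CROSSOVERS.values()
    ]
    return sum(halves) / 2

rng = random.Random(0)
draws = [sibling_relatedness(rng) for _ in range(500)]
mean_sharing = statistics.mean(draws)
sd_sharing = statistics.stdev(draws)
```

In this toy model the average sharing should sit near 50 per cent, with a spread of a few percentage points around it. That is the variation the sibling-based methods described next rely on.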
Understanding how DNA is transmitted provides a fantastic opportunity. Historically, behaviour genetics relied on comparing identical and fraternal twins or examining adoptees. Relatedness data now allows an extension of the twin study methodology to families without twins. The lower variation in relatedness between siblings (as compared to identical versus fraternal twins) reduces the ability to link genes and outcomes, but this is offset by the larger samples available when you are no longer constrained to twin samples. Researchers have since expanded this methodology to include broader population samples, not just siblings, to estimate the heritability of traits like educational attainment.
Within-family variation in SNPs also allows us to calculate more robust polygenic scores. Since each sibling’s set of SNPs is the outcome of a lottery, causal estimates from within-family comparisons are less susceptible to bias.
Another opportunity arises because some genetic variants are not passed from parent to child. If these non-transmitted variants affect child outcomes, we can rule out genetic transmission and focus on environmental channels such as socioeconomic status. Kathryn Paige Harden and Philipp Koellinger call this a ‘virtual parent’ design, which mimics adoption studies in that the children are raised by someone with different genes. Kong and colleagues used this premise to show that the effect of the non-transmitted variants on child education was 30 per cent as strong as the transmitted polygenic score.
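The logic of the ‘virtual parent’ design can be sketched with simulated data. Everything here is an assumption made for illustration: the effect sizes `DIRECT` and `NURTURE` are invented, the scores are drawn as independent standard normal variables (which presumes random mating), and the two-variable regression is written out by hand. It is not the method or the data of any study cited above.

```python
import random

DIRECT = 0.7   # hypothetical effect of the child's own inherited variants
NURTURE = 0.3  # hypothetical effect of parental variants via the environment

def simulate(n, rng):
    """Simulate transmitted (t) and non-transmitted (nt) scores and an outcome.

    The parent's environment-shaping genotype is t + nt; only t reaches the
    child, so only t can act through the direct genetic channel.
    """
    t = [rng.gauss(0, 1) for _ in range(n)]
    nt = [rng.gauss(0, 1) for _ in range(n)]
    y = [DIRECT * a + NURTURE * (a + b) + rng.gauss(0, 1) for a, b in zip(t, nt)]
    return t, nt, y

def ols_two(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (variables centred first)."""
    def centre(v):
        m = sum(v) / len(v)
        return [a - m for a in v]
    x1, x2, y = centre(x1), centre(x2), centre(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(1)
t, nt, y = simulate(20000, rng)
b_transmitted, b_non_transmitted = ols_two(t, nt, y)
ratio = b_non_transmitted / b_transmitted
```

Because the non-transmitted variants never reach the child, a non-zero `b_non_transmitted` can only reflect the environmental channel. The transmitted coefficient bundles both channels together, so `ratio` recovers `NURTURE / (DIRECT + NURTURE)` in this toy setup.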
The methodologies underlying this research are still in development and subject to some interesting debates. Are we effectively controlling for population stratification when we don’t have family-based samples? However, that is not a barrier to building data now, and further data collection will support resolving such debates.
Illustrating the applications
For all of the evidence indicating their influence, genes are not destiny. An example is Arthur Goldberger’s thought experiment about eyeglasses. Poor eyesight due to genetics might be corrected by an environmental intervention.
Although Goldberger made this point to argue against using heritability in policy development, studies of intergenerational status have shown genes and environment are interconnected. Methodologies such as those described above can help us understand the causes of transmission, explore what policy interventions are most prospective and examine the distributional effects of those policies.
This article is not a comprehensive review of policy relevant research, but below are some brief illustrations related to social mobility and intergenerational transmission of socioeconomic status to give a flavour of the questions that can be illuminated with genetic data.
What is the optimal level of social mobility?
I don’t know the answer to that question, but any answer requires us to consider genetics. With genetic endowments, random sorting will not emerge in a society with equal environments. Consider the following. A study of Finnish twins found a substantial genetic effect on lifetime earnings: around 40 per cent of the variance in women’s earnings and around half for men’s. Twenty-one other studies in Australia, Sweden and the United States produced similar estimates for genetic contribution, with the effect of shared environment, comprising common environmental factors such as parental socioeconomic status, being around 9 per cent. This is the second law of behaviour genetics in action: the effect of being raised in the same family is smaller than the effect of genes. However, is heritability capturing an inherent characteristic of the child, the parental response to the child’s genotype, or the environment created due to the parental genotype that is also shared with the child?
Analyses using genetic data are nascent, but they can shed light on this question. Polygenic scores for education are linked to higher socioeconomic status and better labour outcomes, even among siblings. However, those who grow up in high-status households tend to have higher college completion and better socioeconomic outcomes independent of their score. Further, there is a stronger relationship between polygenic scores and college completion in higher socioeconomic groups. While this pattern may partly reflect unobserved genetic variation - polygenic scores capture only some of the heritability of traits - this evidence suggests an opportunity to improve outcomes for talented students in low-status families.
There is growing research into the transmission of skills, one of the pathways by which socioeconomic status might persist. One study identified three genetic pathways: the direct genetic effect, whereby both parent and child have genes that increase their skills; parental investment in children with higher polygenic scores; and parents with higher genetic factors themselves investing more in their children. This study indicated that ignoring genetics overestimates the effect of parental investment on child skills, but that the environment created by the parent, influenced by their genetics, also matters, at least for the children aged seven years or less examined in this study. Examining these investments may provide intervention ideas.
Another study used genetic data to provide insight into the accumulation of wealth. People with a higher polygenic score for years of schooling had greater household wealth at retirement. Those with scores in the 84th percentile had over $150,000 extra wealth. A polygenic score premium persisted even after controlling for education and income, suggesting the score captures other skills. One policy insight comes from a gene-environment interaction. The relationship between wealth and polygenic score was four times as large for those without defined benefit pensions, which involve a guaranteed income and require few decisions about money allocation. Those with lower polygenic scores tended to struggle with managing their retirement investments, indicating that freedom can harm those who find complex financial decisions difficult. This finding could inform policy. In Australia, compulsory defined accumulation plans divert a portion of income into tax-advantaged retirement accounts that cannot be accessed until retirement. This is an ‘eyeglasses’ solution to the problem, albeit everyone gets eyeglasses regardless of need. A more targeted response might be Australia’s relatively generous means-tested pension system.
Genetic factors may also illuminate the distributional effects of policy. Taxes on tobacco were intended to curb usage, but as Jason Fletcher argued in a speculative article, genetically disadvantaged populations might bear a higher burden. Variants of nicotine receptor genes may trigger different responses; those with higher reward responses in the brain didn’t reduce smoking despite the higher cost. This suggests that diversion through nicotine substitutes may be more effective than taxes for some populations.
Examples of heterogeneous genetic responses to policy interventions are growing. Raising the minimum school leaving age reduced the body mass index of those genetically prone to obesity. Students with low polygenic scores attending advantaged schools were less likely to drop out of math classes. Similarly, the relationship between polygenic scores and high school dropout was weaker in higher socioeconomic families, suggesting an ability to buffer against the worst outcomes. A useful heuristic to consider whether responses might vary with genetics is to consider whether they vary with education, income, or other socioeconomic factors. Genetics likely underlies many of these non-genetic variables.
The small but growing body of research provides a consistent picture: genetics influence social mobility and interact with environmental factors in non-obvious ways. Absent genetics, we struggle to assess the causes and overlook insights that can help us evaluate interventions.
The practical steps
Despite the above examples, there are many unexploited opportunities to use genetic data in economics and policy development. The literature is rich but small.
However, to exploit these opportunities, we need to enhance our datasets with genetic data. We need to build the genetic infrastructure. When economists plan research using longitudinal datasets or experimental panels, they shouldn’t have to think about how to collect saliva samples or the cost of genetic analysis. The genetic data should be there by default.
Enhancing longitudinal data
Longitudinal data sets are valuable for their multidimensionality over time, allowing researchers to track changes and examine the participants’ life paths. Ask an applied economist about their most valuable research resource, and they will often point to longitudinal data.
In Australia, our best-known longitudinal dataset is the Household, Income and Labour Dynamics in Australia (HILDA) Survey. It began in 2001 with a sample of over 7,000 households, and 22 years of data are now available to researchers. Over 1,000 papers have been published using HILDA data on topics ranging from how losing a job affects who does chores at home to the effect of stock market performance on well-being to intergenerational social mobility.
In the United States, perhaps the best-known longitudinal surveys are the National Longitudinal Surveys (NLS) sponsored by the U.S. Bureau of Labor Statistics. These include the National Longitudinal Survey of Youth 1979 (tracking subjects born between 1957 and 1964) and 1997 (tracking subjects born between 1980 and 1984). Named for their starting years, both surveys are still running today.
What makes these longitudinal resources valuable is the sheer depth of the surveys across topic areas and time. HILDA includes data on ancestry, children, education, finance, health, housing, labour force outcomes, skills and relationships. The challenge, however, is that even with the richness of the resource, it can take some ingenuity to infer what is causing the observed outcomes. Researchers use mechanisms such as the ordering of events, assuming causation must flow forward through time. But this typically has severe constraints.
Again, genetics provides a potential tool. If we seek to examine how parent behaviour affects child outcomes over time, genetic data could be used to disentangle environmental and genetic causes and, where the causes are genetic, to examine the pathways by which the genetics operate. Even where genetic data does not identify causation, it might provide insight.
Unfortunately, many of our most valuable longitudinal datasets do not have associated genetic data. There are a few exceptions, including the Wisconsin Longitudinal Study. The Twins Early Development Study, which has been tracking 15,000 twins born in the UK between 1994 and 1996, is now augmented with genetic data. The Dunedin study and the National Longitudinal Study of Adolescent to Adult Health (Add Health) also contain genetic data. There are others, some of which underlie the examples in this article.
For every longitudinal dataset, we should augment data collection with genetic data. This does not apply only to new surveys. We could augment existing surveys with that data. Genetic data obtained from participants today could be used to analyse Round 1 HILDA or National Longitudinal Survey data. In many long-running studies, people have been tracked for decades and are still available to take part. If genetic data collection is extended to families, many of the approaches to causation described in this article become available. The Wisconsin Longitudinal Study is an example of this occurring.
The nature of longitudinal surveys makes genetic sampling prospective. Longitudinal surveys have the benefit of a strong relationship with the participants. Participants already provide much sensitive information and may be willing to contribute genetic data. Most longitudinal surveys already have strong privacy protocols in place, controlling access to data based on sensitivity. Similar privacy measures can be applied to genetic data, addressing concerns such as the use of the genetic data to identify individuals. Instead of making the SNP data directly available, a set of polygenic scores and relatedness data within families could be released. If researchers need access to more detailed data, the existing access controls for the more sensitive longitudinal data provide prior art. The Wisconsin Longitudinal Study has such tiered access, with more stringent applications for SNP data than for aggregated polygenic scores.
The sheer volume of research from the major longitudinal surveys makes sampling highly cost-effective. With costs comfortably below $100 per person, the cost of genotyping for, say, the NLS Youth Surveys or HILDA is less than $1 million each. Sample collection can be done by post. The thousands of papers using this data, not to mention its use in government and by policymakers, make this a high-value step.
The cost-effectiveness will only increase. Genotyping and sequencing are becoming cheaper. Soon genetic samples will expand to the whole genome, capturing more rare variants and mitigating problems of population stratification. The polygenic scores that can be constructed with the genetic data will only increase in power.
Enhancing experimental data
While most of the above discussion has been focussed on research using longitudinal datasets, there is also an opportunity to use genetic data to increase insight from experimental work.
Historically, students have been the typical subjects in economics or psychology experiments. They’re cheap and widely available on university campuses. This reliance on a narrow population slice led to an inevitable (at least in hindsight) critique. Behaviour varies across populations. The standard experimental subject - someone Western, Educated, Industrialised, Rich and Democratic (WEIRD) - is not representative. The resulting data skew exists both across populations or societies and within them. College-educated students are not representative of those without a college education. The same holds for the rich and the poor.
One solution is to broaden subject pools. Today, experiments are less often conducted with students (although it’s unclear whether experimental participants sourced through Amazon’s crowdsourcing platform Mechanical Turk are more representative of humanity).
Another approach has been to examine how participants’ responses vary within experiments. Behavioural economists often point to heterogeneity as the future of applied behavioural science. People vary in capabilities, resources, goals and preferences. As a result, behavioural economists need to move beyond a one-size-fits-all philosophy and use personalised nudges. Practically, researchers address this by collecting data on gender, income, education and other demographic variables and studying how responses vary with demographic differences.
Genetics may enhance our understanding of heterogeneity. Per the first law of behavioural genetics, genetics may drive variability in experimental behaviour. Further, many characteristics that we identify as a source of variation may not capture the underlying cause. Genetic data can help us determine whether wealthy people respond differently because they are richer, or because they have characteristics that tend to lead to wealth. Genetic data offers a way to check that randomised controlled trials are balanced - that is, to test the assumption that each group is the same. It also allows us to increase the power of the analysis - our ability to detect an effect - by accounting for some of the variation between the experimental participants.
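Both uses, checking balance and gaining power, can be illustrated with a small simulated experiment. The sample size, treatment effect and strength of the polygenic score's association with the outcome are hypothetical numbers chosen for this sketch, and the adjustment is a simple residualisation rather than any particular published method.

```python
import random
import statistics

TREATMENT_EFFECT = 0.2  # hypothetical
PGS_EFFECT = 0.5        # hypothetical association between score and outcome

def diff_in_means(values, treated):
    """Treatment-minus-control difference in means and its standard error."""
    t = [v for v, d in zip(values, treated) if d]
    c = [v for v, d in zip(values, treated) if not d]
    diff = statistics.mean(t) - statistics.mean(c)
    se = (statistics.variance(t) / len(t) + statistics.variance(c) / len(c)) ** 0.5
    return diff, se

def residualise(y, x):
    """Residuals from a simple linear regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(2)
n = 1000
pgs = [rng.gauss(0, 1) for _ in range(n)]        # standardised polygenic score
treated = [rng.random() < 0.5 for _ in range(n)]  # random assignment
outcome = [TREATMENT_EFFECT * d + PGS_EFFECT * g + rng.gauss(0, 1)
           for d, g in zip(treated, pgs)]

# Balance check: the score should not differ much between the two arms.
balance_diff, balance_se = diff_in_means(pgs, treated)

# Power: compare the treatment estimate with and without the score removed.
raw_diff, raw_se = diff_in_means(outcome, treated)
adj_diff, adj_se = diff_in_means(residualise(outcome, pgs), treated)
```

Removing the variation explained by the score leaves less noise around the treatment effect, so `adj_se` should come out smaller than `raw_se`. A value of `balance_diff` that is large relative to `balance_se` would suggest the randomisation left the arms genetically unbalanced.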
Integrating genetic data into experiments can let us investigate associations discovered in observational data. For example, the link between polygenic scores for educational attainment and wealth at retirement may be due to differences in the ability to make complex decisions. We could investigate that hypothesis through experiments examining how complex decision-making skills vary with the polygenic score. Experiments could provide a test bed for interventions, or at least inform policy discussions.
Enabling the use of genetics in experimental work requires building panels - collections of people who have registered to participate in experiments - with genetic data available for each participant. Each panel member could be genotyped, with the experimenter provided with polygenic scores for a range of outcomes of interest for each participant. The aggregated nature of these scores makes reidentification near impossible. As the effect sizes associated with polygenic scores are typically as strong as or stronger than those for many social science interventions, genetic effects could be detected in experiments with as few participants as the typical lab experiment.
A version of this has been done in the past with panels of twins, although in that case the genetic data is typically limited to whether the twins are mono- or dizygotic. For example, Twins Research Australia maintains a panel of 35,000 twin pairs, and researchers may apply to access existing data or run a new study. One team of researchers developed the Australian Twins Economic Preferences Survey using that data.
Incorporating genetic data into experimental work may be a bigger challenge than for longitudinal data, as there are no ready-made data collection and access arrangements. These would require investment. However, genotyping or sequencing costs are unlikely to be prohibitive. As panel participants typically participate in many experiments, the cost of the genetic tests could be spread over them. Since genetic data doesn’t change - unlike other attributes measured in experimental studies - it only needs to be collected once. Further, genetic data is concrete and not self-reported, and so is more consistent and reliable.
Alternatively, existing genetic research resources could be expanded in purpose. The UK Biobank database contains genetic, lifestyle and health data for half a million UK participants (albeit it could be stronger with more family-based participants). Estonia, Iceland and the Scandinavian countries have large genetic databases. These could provide experimental participants together with polygenic scores. Experimental data then forms part of the growing data resource. Commercial providers such as 23andMe with large customer databases could even seek alternative revenue sources by facilitating the provision of participants for experimental studies.
Critical to the value of any future data sets is representative population sampling. However, polygenic scores are typically developed from homogeneous population groups, most commonly with European ancestry. As a result, polygenic scores cannot simply be plugged into analyses of diverse groups. Genetic data itself doesn’t solve the WEIRD problem if research panels remain WEIRD. These are the important but ultimately tractable issues we should be grappling with.
My PhD was on the link between human evolution and economic growth. When I presented a draft of my PhD research proposal, the first comment I received was that I should refuse any grants from men with funny little moustaches and straight-arm salutes. Although that commenter came around, this initial reaction is a typical response to a discussion of genetics in social science.
That ‘fear’ of the implications of genetics, however, is not the only obstacle to its use. Genetic data is simply not available for many studies. Its absence makes it easy to ignore. If the authors don’t mention genetics in their analysis, few peer reviewers will criticise them for overlooking an obvious confound. They can’t ask for further analysis when the data is not there.
The result is that, despite examples of the type I have discussed above, the potential for genetics to inform our thinking on important policy questions is untapped. I trawled policy-focused papers on social mobility in Australia, including Treasury policy briefs, reports by the Australian Institute of Health and Welfare and speeches by the head of the Australian Government’s major economic think tank. Genetics does not get a mention. It is possible to dig up the occasional left-of-centre politician who realises genetics can affect social mobility (albeit that politician was an economics professor who published on social mobility). However, even they are silent on genetics when they move into political mode.
That said, it is hard for policymakers to engage with genetic questions when the research comes from longitudinal datasets without genetic data. Insightful papers examining the genetics of social mobility or other economic questions are rare. When faced with a particular policy question, it is unlikely that genetic analysis is available, especially one that matches relevant populations or precise policy measures.
That is why enriching our economic datasets is so important. A robust genetic data foundation is crucial for advancing our understanding of policy questions such as social mobility, inequality, skill development, and the diversity of our responses to government interventions. By enabling more studies that incorporate genetics, we can expose the limitations of research that ignores this critical factor. Only then can we answer vital questions and develop more effective policies.