How to use the data
1. How can I get started analyzing PSID data?
The PSID user guide provides historical context and basic design features of the PSID. There are several tutorials which provide step-by-step instructions on downloading and analyzing the data in a variety of ways.

The Data Center is the most popular means for obtaining PSID data, and it delivers thousands of customized data files to researchers and quantitative social science students each year. The Data Center is fully automated and allows for user-specified subsetting criteria when downloading and merging data. Data can be generated in a variety of formats including ASCII, SAS, SPSS, and Stata.

2. How do I merge family- and individual-level files?
The Data Center provides automatic and customized merges of files. For the analyst who prefers to write their own programming code to merge data downloaded from our zip packages, sample SAS and SPSS programs have also been prepared to assist users with creating cross-year analysis files.
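
For readers who merge the files themselves, the following is a minimal sketch of the idea in plain Python rather than SAS or SPSS. It assumes each individual record carries that wave's Family Interview (ID) Number; the field names (`family_id`, `person`, `total_income`) and all values are hypothetical and are not actual PSID variable names.

```python
# Hypothetical single-wave records; field names are illustrative only.
family_file = [
    {"family_id": 1, "total_income": 52000},
    {"family_id": 2, "total_income": 31000},
]
individual_file = [
    {"family_id": 1, "person": "A"},
    {"family_id": 1, "person": "B"},
    {"family_id": 2, "person": "C"},
    {"family_id": 0, "person": "D"},  # zero-filled: not in a study family this wave
]


def merge_family_onto_individuals(individuals, families):
    """Attach each family's fields to the individuals sharing its wave ID."""
    by_id = {family["family_id"]: family for family in families}
    merged = []
    for person in individuals:
        family = by_id.get(person["family_id"])
        if family is not None:  # keep only individuals with a family record
            merged.append({**family, **person})
    return merged


merged = merge_family_onto_individuals(individual_file, family_file)
```

Each merged record holds the person's own fields plus those of the family interviewed in that wave; person "D" is dropped because no family record matches.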

3. How can I identify families from year to year?
Each family unit in a specific wave is assigned a unique "Family Interview (ID) Number" valid for that wave only. In addition, each family has a "1968 Family Identifier", also known as the "1968 ID". This is the Family Interview (ID) Number that was assigned to the original family in the 1968 interviewing wave. When sample members in any family move out and establish their own household, we interview them; these new families are called "splitoffs" in the first year they are formed. Splitoff families have the same 1968 ID as the family they moved out of, and keep that same 1968 ID each year. All families with the same 1968 ID contain at least one of the original members from the 1968 family or their lineal descendants born after 1968.

4. Do family ID numbers vary from year to year?
For each family, the family ID number will almost certainly vary from year to year. Yearly IDs are assigned based on the order in which interviews are received: the first interview in from the field is numbered 1, the second 2, and so on. This means it's very unlikely that a family with the Family ID Number 1234 in one year will get the same Family ID Number the next year, or any other year.

5. When is a new Reference Person ('Head' prior to 2017) selected?
A new Reference Person (the term ‘Reference Person’ replaced ‘Head’ starting with the 2017 wave) is selected if any of the following conditions applies:
  • last year's Reference Person moved out of the FU (family unit), died, or (in some cases) became incapacitated; or
  • a female Reference Person has gotten married to a male, or now has a male Partner; or
  • this is a splitoff family. (Note that this new Reference Person may have been the Reference Person of the family from which this new family split off.)

6. How can I tell the current Reference Person and Spouse/Partner from mover-out Reference Person and Spouse/Partner? Why is this important? (the term ‘Reference Person’ replaced ‘Head’ in 2017)
To tell the current Reference Person (starting with the 2017 wave, the term ‘Reference Person’ has replaced ‘Head’) and Spouse/Partner from a mover-out Reference Person and Spouse/Partner, use the Sequence Number (SN) from the individual file. The current Reference Person will always have SN=1, and the current Spouse/Partner (if there is one) will always have SN=2. A mover-out Reference Person or Spouse/Partner will have an SN in the range 50-89, depending on the move-out circumstances. The SN allows you to identify an individual's status with regard to the family unit and to determine family composition change. Understanding family composition change is important for avoiding spurious correlations in a longitudinal analysis that looks at variables pertinent to the same person(s) over time.
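
As a rough illustration, the Sequence Number rules above can be expressed as two small checks. The function names and the SN history below are hypothetical. An SN of 2 is deliberately not treated as proof of a Spouse/Partner here, since the text only states that a current Spouse/Partner has SN=2, not the reverse.

```python
def is_current_reference_person(sn):
    """The current Reference Person always has Sequence Number 1."""
    return sn == 1


def is_mover_out(sn):
    """Movers out have a Sequence Number in the range 50-89."""
    return 50 <= sn <= 89


# Hypothetical SN history for one person across three waves.
sn_by_wave = {2015: 1, 2017: 1, 2019: 81}

# Waves in which this person is recorded as having moved out of the family unit.
moved_out_waves = [wave for wave, sn in sn_by_wave.items() if is_mover_out(sn)]
```

A person whose SN changes from 1 to a value in the 50-89 range signals a family composition change that a longitudinal analysis should account for.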

7. How do I assemble a Reference Person/Spouse file from an individual file? (the term ‘Reference Person’ replaced ‘Head’ in 2017)
The easiest way to do this is to visit the PSID Data Center, which will create a customized dataset for you automatically.

Instructions for creating a Reference Person (starting with the 2017 wave, the term ‘Reference Person’ has replaced ‘Head’)/Spouse file from an individual file by writing your own programming code:

To create a single year Reference Person (‘Head’ prior to 2017)/Spouse file: Select individuals with Relationship to Reference Person of "Reference Person" (a code value of 1 for 1968-1982; code 10 from 1983 onward) and with Sequence Number=1. The Sequence Number is needed because non-response movers out keep their relationship to the PREVIOUS YEAR's Reference Person, so two individuals within one family may both have a relationship of Reference Person. One, however, is the real, current Reference Person; the other is a mover out. (The type of mover-out can be determined from the value of Sequence Number. Refer to the individual file codebook for details.)

To illustrate the importance of Sequence Number, assume that in the last wave we have an elderly married couple. He is the Reference Person (‘Head’ prior to 2017) and she is the Spouse: Sequence Number=1 and Relationship to Reference Person=10 for him, Sequence Number=2 and Relationship to Reference Person=20 for her. When we find them for the new interview, he has died and she has become the new Reference Person: his Sequence Number=81 and Relationship to Reference Person=10, her Sequence Number=1 and Relationship to Reference Person=10. All the family data items about the Reference Person in the current wave refer to HER, not to him. Information about his income, etc. is located in OFUM (other family unit members) variables only.

Similarly, to subset Spouses or Partners in a current wave, select Relationship to Reference Person=20 or 22 and Sequence Number=2.
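
The single-year selection can be sketched in Python as follows, using the elderly couple from the illustration as test data. The record layout and the field names (`sequence_number`, `relationship`, `person`) are assumptions made for this example; real files use wave-specific variable names listed in the codebook.

```python
def reference_person_code(year):
    """Relationship code for Reference Person: 1 for 1968-1982, 10 from 1983 on."""
    return 1 if year <= 1982 else 10


def current_reference_persons(records, year):
    """Keep records with the Reference Person code AND Sequence Number 1."""
    code = reference_person_code(year)
    return [
        r for r in records
        if r["relationship"] == code and r["sequence_number"] == 1
    ]


def current_spouses_or_partners(records):
    """Keep records with relationship 20 or 22 AND Sequence Number 2."""
    return [
        r for r in records
        if r["relationship"] in (20, 22) and r["sequence_number"] == 2
    ]


# The couple in the new wave: he has died (SN=81), she is now Reference Person.
new_wave = [
    {"person": "husband", "sequence_number": 81, "relationship": 10},
    {"person": "wife", "sequence_number": 1, "relationship": 10},
]

heads = current_reference_persons(new_wave, 2019)
spouses = current_spouses_or_partners(new_wave)
```

Filtering on the relationship code alone would return both records; adding the Sequence Number condition leaves only the wife.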

To create a cross-year Reference Person (‘Head’ prior to 2017)/Spouse file: These concepts can be expanded to subset persons who have been the Reference Person over a period of years. For each year, Sequence Number must be 1 and Relationship to Reference Person must be 1 or 10. As a corollary, to select individuals who have been either Reference Persons or Spouses/Partners, yearly Sequence Numbers must equal 1 or 2 and yearly Relationships to Reference Person must be one of 1, 2, 10, 20, or 22. Once that subset is made and family data are merged, information about an individual can be found in Reference Person variables (Reference Person's work hours, Reference Person's labor income, etc.) when his or her Relationship to Reference Person=1 or 10. When Relationship to Reference Person is 2, 20, or 22, his or her information is found in variables about the Spouse/Partner.
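
A minimal sketch of the cross-year conditions is given below. It assumes each person's history is available as a list of (Sequence Number, Relationship to Reference Person) pairs, one per wave; that layout and the function names are hypothetical.

```python
REFERENCE_PERSON_CODES = {1, 10}
SPOUSE_PARTNER_CODES = {2, 20, 22}
EITHER_CODES = REFERENCE_PERSON_CODES | SPOUSE_PARTNER_CODES


def always_reference_person(history):
    """True if the person was the current Reference Person in every wave."""
    return all(
        sn == 1 and rel in REFERENCE_PERSON_CODES for sn, rel in history
    )


def always_reference_person_or_spouse(history):
    """True if the person was Reference Person or Spouse/Partner in every wave."""
    return all(sn in (1, 2) and rel in EITHER_CODES for sn, rel in history)


def variable_block(relationship):
    """Which family-file variables describe this person in a given wave."""
    if relationship in REFERENCE_PERSON_CODES:
        return "Reference Person variables"
    if relationship in SPOUSE_PARTNER_CODES:
        return "Spouse/Partner variables"
    return None


# Hypothetical history: Spouse in the first wave, Reference Person in the second.
widow = [(2, 20), (1, 10)]
```

For the `widow` history, her first-wave information sits in the Spouse/Partner variables and her second-wave information in the Reference Person variables, so an analysis has to switch variable blocks between waves.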

8. How can I identify splitoffs from the main family?
Select only the current Reference Person (starting with the 2017 wave, the term ‘Reference Person’ has replaced ‘Head’), that is, Sequence Number=1 and Relationship to Head/Reference Person=10, from the individual file for the wave in question. Then, if the Reference Person's moved in/out indicator=1 and month moved in/out=0, it's a splitoff. Otherwise it's a main family.
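
The rule can be written as a short function. This is only a sketch: the field names (`moved_in_out_indicator`, `month_moved`, and so on) are placeholders for the wave-specific variables in the individual file.

```python
def family_type(reference_person):
    """Classify a family as 'splitoff' or 'main' from its current Reference Person."""
    is_current = (
        reference_person["sequence_number"] == 1
        and reference_person["relationship"] == 10
    )
    if not is_current:
        raise ValueError("record is not a current Reference Person")
    if (
        reference_person["moved_in_out_indicator"] == 1
        and reference_person["month_moved"] == 0
    ):
        return "splitoff"
    return "main"


# Hypothetical current Reference Person records for two families.
new_household = {
    "sequence_number": 1, "relationship": 10,
    "moved_in_out_indicator": 1, "month_moved": 0,
}
continuing_household = {
    "sequence_number": 1, "relationship": 10,
    "moved_in_out_indicator": 0, "month_moved": 0,
}
```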

9. How is an individual uniquely identified?
The combination of the 1968 ID and the Person Number uniquely identifies each individual.

To identify an individual across waves, use the 1968 ID and Person Number (Summary Variables ER30001 and ER30002). Although they can be combined into a unique identifier in many ways, we find that many researchers use the following method:

(ER30001 * 1000) + ER30002

(1968 ID multiplied by 1000) plus Person Number
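
In Python, the formula and its inverse might look like the sketch below. The function names are hypothetical, and the uniqueness of the result relies on the assumption that Person Numbers are always below 1000.

```python
def person_id(er30001, er30002):
    """Combine 1968 ID (ER30001) and Person Number (ER30002) into one ID."""
    if not 0 <= er30002 < 1000:  # assumption: Person Numbers fit in three digits
        raise ValueError("Person Number must be between 0 and 999")
    return er30001 * 1000 + er30002


def split_person_id(pid):
    """Recover (1968 ID, Person Number) from a combined ID."""
    return divmod(pid, 1000)


# Hypothetical individual: 1968 ID 1234, Person Number 30.
combined = person_id(1234, 30)
```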

10. Why is using the latest version of the Cross Year Individual File essential?
The Cross Year Individual File has records for every person who has ever lived in a PSID study family (including some who moved out just before the initial interviewing year for each part of the sample, or were institutional in the first interviewing year; see the codebook for ER30002). Each study individual has a record for each study year; years when that individual was not listed in a study family are zero-filled. The file is organized by ID68, PN, and year.

In addition to those variables, the file contains individual-level information such as YEARID, SN, Relationship to Head/Reference Person (starting with the 2017 wave, the term ‘Reference Person’ has replaced ‘Head'), AGE, SEX, birthdates, move-in and move-out dates, follow status, type of individual, why non-response, several health insurance variables, variables indicating eligibility for supplementary studies such as CDS and DUST, and individual level longitudinal and cross-sectional weights.

It is essential to use the latest version of the file for analysis. The entire file is regenerated for every wave’s release, not only to add the latest wave’s information but also to make corrections based on information received in the latest wave’s interviewing. These corrections often affect information such as birthdate or Relationship to Head/Reference Person, and may affect several waves of data.

Most importantly, however, we also make corrections to ID68 and PNs, based on new information. Sometimes we discover that a person we believed to be a separate individual is actually the same as a family member we already know about, with a different person number. If the two individuals were response in different waves, we can combine the information into one record, keeping the record for one PN and deleting the record for the other. In other cases, we might learn that someone we thought was the biological child of a sample member actually isn’t. This would necessitate a change in PN from the "born-in sample" range (030-169) to the 170-and-up range, and also a change in follow status, since, given the new information, the child would no longer be a sample member. In another instance, we found that a woman in the immigrant sample with PN 170 was actually a spouse who had moved out prior to the first year of immigrant interviewing (1997 for this case), and so should have had PN 227 (a PN with special meaning; see the codebook for ER30002).

Currently, we uncover about 100 of these person-number fixes each wave. So failure to use the latest cross-year individual file may result in errors in both data and analysis.

11. How can I determine if data will be collected about an individual who is not present in an interviewed family?

It depends on the situation.

  • For movers out who are deceased, some OFUM (other family unit member) information is collected for the wave in which they are reported to have died.
  • For persons moving out to an institution, some OFUM information is collected for the wave in which they are reported to have moved out of the family.
  • For persons who move out to another household where no interview is conducted with the new family unit, some OFUM information is collected for the wave in which they are reported to have moved out.
  • For persons already in institutions, no new information is collected.
  • For persons who attrited from the study, no new information is collected. However, a large recontact effort was initiated in 1992.
  • For persons not yet born or not yet appearing in the study, no information is collected that wave.

Beginning with the 1999 wave, when the PSID switched from annual to biennial interviews, the rules above for movers out apply only if the person moved out on or after January 1 of the calendar year before the interview. For example, in the 2007 data, there will be information for movers out on or after 1/1/2006, but no information for movers out before that date.
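
The cutoff for biennial waves can be checked with a small helper. This is a sketch with a hypothetical function name; it covers only waves from 1999 onward and says nothing about which OFUM items are actually collected.

```python
from datetime import date


def mover_out_has_data(interview_year, move_out_date):
    """True if a mover out falls on or after the cutoff for a biennial wave."""
    if interview_year < 1999:
        raise ValueError("the cutoff rule applies from the 1999 wave onward")
    cutoff = date(interview_year - 1, 1, 1)  # January 1 of the prior year
    return move_out_date >= cutoff


# The 2007 example from the text: the cutoff is 1/1/2006.
included = mover_out_has_data(2007, date(2006, 1, 1))
excluded = mover_out_has_data(2007, date(2005, 12, 31))
```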


12. How can I determine which variables are comparable across years?
The cross-year index can help you identify comparable variables across the years.

You can also look at the ‘Years Available’ section of each variable’s codebook entry for a year-by-year listing of when that variable is available in the Data Center.

13. How can I tell if a variable value is actual or imputed?
A missing data value is either identified as such (value=9) or an imputed value is assigned in lieu of a missing data code. If an imputed value is assigned, an associated "accuracy code" variable describes the nature of the assignment.

14. How can I identify the SEO (Survey of Economic Opportunity) sample and the SRC (Survey Research Center) sample?
You will need to look at the 1968 family interview number available in the individual-level files (variable ER30001).

SRC sample families have values less than 3000.
SEO sample families have values greater than 5000 and less than 7000.

Immigrant sample families have values greater than 3000 and less than 5000. (Values from 3001 to 3441 indicate that the original family was first interviewed in 1997; values from 3442 to 3511 indicate the original family was first interviewed in 1999; values from 4001 to 4851 indicate the original family was first interviewed in 2017; values from 4700 to 4851 indicate the original family was first interviewed in 2019.)

Latino sample families have values greater than 7000 and less than 9309. (Values from 7001 to 9043 indicate the original family was first interviewed in 1990; values from 9044 to 9308 indicate the original family was first interviewed in 1992.)
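
The top-level ranges above translate into a simple classifier. The sketch below uses a hypothetical function name and only the four main ranges; it does not try to assign the first interview year within the immigrant or Latino samples.

```python
def sample_of(er30001):
    """Classify a 1968 family interview number (ER30001) by sample."""
    if er30001 < 3000:
        return "SRC"
    if 3000 < er30001 < 5000:
        return "Immigrant"
    if 5000 < er30001 < 7000:
        return "SEO"
    if 7000 < er30001 < 9309:
        return "Latino"
    return "unclassified"  # boundary values such as 3000 are not covered by the text


# Hypothetical ER30001 values, one per sample.
labels = [sample_of(v) for v in (1234, 3442, 5001, 9308)]
```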

15. For what years are Latino data available? How do Latino data differ from immigrant data?
In 1990 the PSID added 2,000 Latino households consisting of families originally from Mexico, Puerto Rico, and Cuba. While this sample did represent three major groups of immigrants, it missed the full range of post-1968 immigrants, Asians in particular. Because of this crucial shortcoming, and a lack of sufficient funding, the Latino sample was dropped after 1995, and a sample of 441 post-1968 immigrant families was added in 1997. In 1999, an additional 70 families were added, for a total of 511 immigrant families as of 1999. These families are included on the files along with the core PSID families.

16. Where can I find information on the 1997-1999 Immigrant Sample?
Information on the Immigrant Sample is available in the 1997 and 1999 main interview documentation.

17. Codebook information for some variables does not show index information. Why?
Variables from supplemental files, with the exception of CDS and TAS, are not yet in the index. We plan to add these variables to the index in the future.

18. Why do some supplemental files have fewer observations than the main family files?
Some supplemental files were created only for sub-samples. For example, the Disability and Use of Time Supplement (DUST) collects information only on Heads ('Reference Persons' starting in the 2017 wave) and Spouses/Partners of a certain age.

19. What identification variables should I download?
The Data Center automatically includes all appropriate identification variables for your file.

20. How does one analyze data from families from one wave to the next?

Users often want to look at data from the "same" family in adjacent waves. It is important to understand that there is no absolute definition of "same" family. Families are made up of individuals who may move in or out of study families from wave to wave. It is up to the user to decide what he or she means by "same" family. Under option 1, the user restricts the definition to families with absolutely no change in composition since the previous wave: all the individuals who were in the prior wave are still in the current wave, no one has moved in, and no one has moved out. Under option 2, the user defines "same" family as those with the same Reference Person (starting with the 2017 wave, the term ‘Reference Person’ has replaced ‘Head’) in both waves.

To subset the cases that the user has defined as the "same" family, he or she will find the Family Composition Change variable most useful. This variable indicates the degree of change in the family since the prior wave's data collection. For option 1, the user would subset the families in the current wave where the Family Composition Change variable = 0. For option 2, the user would subset the families in the current wave where the Family Composition Change variable is in (0, 1, 2).

For 2019, for example, the Family Composition Change variable is ER72007.
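
Both options can be sketched as a filter on that variable. The family records, the `family_id` field, and the option labels below are hypothetical; only the variable name ER72007 and the code sets (0 for option 1; 0, 1, 2 for option 2) come from the text.

```python
FCC_VARIABLE = "ER72007"  # 2019 Family Composition Change variable

ALLOWED_CODES = {
    "no_change": {0},                    # option 1: identical composition
    "same_reference_person": {0, 1, 2},  # option 2: same Reference Person
}


def same_families(families, definition):
    """Keep families whose composition change code fits the chosen definition."""
    if definition not in ALLOWED_CODES:
        raise ValueError(f"unknown definition: {definition}")
    allowed = ALLOWED_CODES[definition]
    return [f for f in families if f[FCC_VARIABLE] in allowed]


# Hypothetical 2019 family records.
families_2019 = [
    {"family_id": 11, "ER72007": 0},
    {"family_id": 12, "ER72007": 2},
    {"family_id": 13, "ER72007": 5},
]

option_1 = same_families(families_2019, "no_change")
option_2 = same_families(families_2019, "same_reference_person")
```

For other waves, `FCC_VARIABLE` would need to be replaced with that wave's Family Composition Change variable name from the codebook.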