CONTENT OUTLINE

I. Introduction
II. Characteristics
   A. Variable Numbers, Positions and Generated Variables
   B. Documentation (or Paucity Thereof!) and Codes
   C. Problem Variables, Missing Variables and the 1993-1994 Family Index
   D. Files and Format
   E. Additional Notes: Sample Supplements in 1993 and 1994
III. A Concluding Note

PANEL STUDY OF INCOME DYNAMICS, 1994 PRELIMINARY FILES FOR 1994

I. Introduction

The more than two-year interval between the completion of interviewing on a given PSID wave and the public release of a fully-cleaned and documented data file has prompted demand for speedier release of a "preliminary" version of PSID data files. In response to this demand, the PSID staff has produced a "preliminary" version of the 1994 family and individual files. The files' preliminary nature (including, most notably, very incomplete documentation and virtually no PSID staff counselling) leads us to recommend these data files only to very experienced PSID data users. Relatively inexperienced users should wait for the regular release of the data files. We trust that the experienced research community will be able to make effective use of these data, despite their very preliminary form, without increasing the workload on PSID staff. This document is written for experienced users and will be the ONLY documentation of this version of these files released by PSID.

In a nutshell, the advantage of these preliminary files is that of quicker access to recent waves of data.
Disadvantages include:

a) no documentation other than this treatise and the 1994 questionnaire;
b) an incomplete set of family-level variables, with the most prominent missing variables being annualized work and income components and totals, weights and prorated poverty thresholds, and individual "summary variables" about marital and fertility histories;
c) zero values for most or all cases for a handful of variables (detailed below in Section II, Part C);
d) rather "dirty" data containing some wild codes and no imputations; and
e) extremely limited and very grumpy PSID staff counselling (nobody wants the release of these files to add to the time it takes us to release the fully-cleaned and documented versions of them).

Many of the 1994 early release variables are not directly equivalent to variables from 1992 and earlier waves (although they are equivalent to the 1993 early release data). See Section II below for further details.

II. Characteristics

Two data files are included as part of the "preliminary" data-file package: the 1994 single-year family-level data with almost 11,000 data records and the 1968-1994 early-release individual file with about 50,000 records, including both response and nonresponse individuals for all waves of the study. Data from the Latino sample are also included.

Beginning with the 1993 wave, the data were collected using CATI (Computer Assisted Telephone Interviewing). This means that information about each question is collected electronically by the interviewer and in effect is coded at the time of collection. Of course this data collection method replaces contingency and wild code checking, formerly performed as part of our coding operation. But because of the way in which questions are asked and responses are given, the data much more closely resemble answers to questionnaire questions.
For example, rent costs now exist as two variables: one for the dollar amount and one for the time unit, e.g., $400 per month and $100 per week are typical of responses to the question about rent payments. Unlike data collected through 1992, the family data have NOT been cleaned with our manual economic edit process (nor have imputations been made), so the user must convert these kinds of amounts into some sort of consistent unit for inter-case comparison and make decisions about handling missing data. In addition, we expect that values for quite a few cases will change when we do perform economic edit operations. For instance, time spent working, being laid off, unemployed, out of the labor force, etc. does not sum to 52 weeks per year in more than 10 per cent of the cases. As mentioned above, dollar amounts generally are associated with time units. All time unit questions include an "other" code, as well as options for missing data. Amounts associated with these "other" codes will have to be recoded from missing data or else imputed when the data are cleaned. Also beware that we are not intending to include these component amount and time unit data as part of the main final release file. Our current plans are to release final data that resemble as closely as possible our datasets from the past. We believe that the majority of users will not be interested in computing amounts differently than we have in the past. However, the amount-time unit (and similar) data collected in CATI but not generally part of our prior final files will probably be available as a separate, subsidiary file so that users who desire this detail can access it. Our usual inter-year consistency checking was performed for the 1968-1994 individual file, so we expect the individual-level data to remain quite stable for final release. 
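
As an illustration of the unit conversion described above, the following is a minimal sketch in Python, not a PSID procedure. The time-unit labels ("week", "month", "year") and the function name are hypothetical; the actual time-unit codes for each amount must be taken from the 1994 questionnaire, and missing-data codes in the amount field must be screened out by the user before calling it.

```python
# HYPOTHETICAL time-unit labels, for illustration only. The real numeric
# codes for each amount/time-unit pair are printed in the 1994 questionnaire.
PERIODS_PER_YEAR = {
    "week": 52,
    "month": 12,
    "year": 1,
}


def annualize(amount, unit, periods_per_year=PERIODS_PER_YEAR):
    """Scale a reported amount to an annual figure.

    Returns None when the amount is missing or the time unit is not in the
    mapping (for example an "other" or missing-data code), so that such
    cases stay visibly unresolved instead of being silently treated as zero.
    """
    if amount is None or unit not in periods_per_year:
        return None
    return amount * periods_per_year[unit]


# The two rent responses quoted in the text, put on a common annual basis.
monthly_rent = annualize(400, "month")
weekly_rent = annualize(100, "week")
unresolved = annualize(250, "other")
```

Returning None for "other" and missing time units keeps those cases separate, leaving the recoding or imputation decision to the analyst.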
Beginning with the 1990 data, we faced problems with our merged record formats, i.e., the cross-year distribution files would exceed a logical record length of 32,767, the maximum allowed on most systems (including our own). So, in contrast to the structure of the cross-year family and cross-year family-individual files issued prior to 1990, family-level data files have NOT been merged across waves to form a single data record. Thus the analyst must merge the necessary information from the appropriate files. This is not difficult since the needed family identification numbers appear both in the cross-year individual file and in the single-year family data files. Detailed instructions for the merging process are located in the 1990 through 1992 family documentation and are not repeated here.

A. Variable Numbers, Positions and Generated Variables

All variable numbers for both family and individual early release files are prefaced with "ER", rather than "V", to assist both users and study staff in the future to determine whether reference is to the early release file or to the final release version.

All 1994 early release family-level variables are in the range ER2001 through ER4016. Most of these variables will eventually be incorporated into the final version of the 1994 data, but their variable numbers will change and the data will be cleaner. Variable numbers and locations for the 1994 family file are not the same as those we intend for the final version.

In addition, the family file includes neither variable numbers nor positions for so-called "edited" and "generated" family-level variables. By "edited" variables we mean the first 300 or so variables usually present in each wave's family-level data, beginning with the state of residence and ending with income detail for other family unit members. The term "generated" variable refers to those variables traditionally located at the end of the raw data after the Head's background information.
Omitted variables include: annual mortgage and rent payments, annual food costs, annual work hours, annual hours of unemployment and the like, annual income of any sort for all family members, total family money income, Head's total labor income, pro-rated food and poverty thresholds, education of Head and Wife/"Wife", family income deciles, and average hourly earnings of Head and Wife/"Wife". Component items exist on the file, however, so that the user may generate these items. Needless to say, imputations have NOT been done for missing data.

Some other omitted variables cannot be generated from information available. These include: weights, state and region of residence, urbanicity, Head's geographic mobility, numbers of children in various age and sex categories, county unemployment rate, and variables linking related families. In short, all variables equivalent to the 1992 variable ranges V20303-V20620 and V21481-V21549 are absent.

Background information is not asked about Heads and Wives/"Wives" each and every year. We ask the questions about new Heads and new Wives/"Wives" only. If a female Head marries and becomes a Wife, then she is reasked the background information, and her new husband, the Head, is also asked. During processing, we have traditionally brought forward the background information from previous waves for Heads or Wives/"Wives" who are the same persons as in the prior year. In every wave, each set of background variables is preceded by a variable indicating whether data need to be brought forward.

The 1994 early release file, in keeping with our practice for other early release files, has not undergone this bringing forward. In addition, the process is somewhat less straightforward than for our previous early release files. The background data include questions about Head's father's occupation, state and county variables for the locations where Head and his parents grew up, and number of states and regions in which Head has lived.
These variables have NOT YET been created for the 1994 early releases, so the user must carefully compare the list and codes of background variables included in the 1994 early release data set before bringing forward prior-wave information. Background information is complete for 1992 on the 1992 final release file, but as of this writing, the 1993 family data are available only in early release form and therefore have not yet undergone the bringing-forward process. Only Heads and Wives/"Wives" who were new in 1993 have actual background data in the 1993 early release file, so the user must search both 1992 and 1993 data to complete 1994 background variables. There is another complicating factor in bringing forward background data: the absence of the 1992 and 1993 family ID numbers on the 1994 family file. Therefore, the user must check these variables from the individual data file in order to match with 1992 and 1993 background information. Below we detail the procedure for bringing forward Head's background information in a series of steps. COMPARE THE 1992 BACKGROUND VARIABLES ITEM FOR ITEM WITH 1993 AND 1994 DATA FOR COMPARABILITY OF CODES AND FOR IDENTICAL ITEMS; SOME 1992 BACKGROUND QUESTIONS ARE NOT INCLUDED IN THE 1993 AND 1994 SETS OF BACKGROUND DATA. IN ADDITION, THE 1993 AND 1994 VARIABLES ARE NOT COMPLETELY IDENTICAL TO EACH OTHER!!!!! Next, match the 1994 family file with the 1994 Head's record from the 1968-1994 individual file (1994 family variable ER2002 with individual variable ER33101 where ER33102=01). Copy the 1992 and 1993 family IDs from the individual file (ER30733 and ER33001) to the 1994 family file. Check values for 1994 family variable ER3917, the indicator for whether background information exists for Head on the 1994 file. If ER3917=1, then the appropriate background information is already part of the 1994 data, and this case needs no further processing. 
If 1994 ER3917=5, match the 1993 family ID that you attached to the 1994 family file with the 1993 family ID number from the 1993 file (1993 ER1850). Check the value for the 1993 Head's background indicator variable (1993 ER1850) on the 1993 family file. If ER1850=1, then the background data are located on the 1993 file. Copy the data from the 1993 family file (1993 ER1851-ER1944) to equivalent variables in the 1994 family file (1994 ER3918-ER3986), recalling that there is not a one-to-one match. If 1993 ER1850=5, then it is necessary to go back to the 1992 final release family file for the background information. Match the 1992 family ID from the 1994 family file with the 1992 family ID number from the final 1992 file (V20302). There is no need to check the value for the 1992 indicator (V21388), as all 1992 cases contain background information. Copy the data from the 1992 family file (1992 V21389-V21461) to the corresponding variables in the 1994 family file (ER3918-ER3986), again recalling that these variables do not match perfectly.

A similar procedure can be done for Wives/"Wives" using the 1993 and 1994 indicators (ER1777 and ER3863, respectively) to determine whether she is the same Wife/"Wife". We advise using the 1993 family ID from the 1994 Wife's/"Wife's" individual data record in place of the Head's. The values to be copied are 1993 ER1778-ER1849 or 1992 V21340-V21387 to the equivalent 1994 variables (ER3864-ER3916). These variables must also be checked for direct correspondence between the 1993 and 1994 early release files and the 1992 wave.

Individual-level data on recent PSID data files have consisted of annual measures and a set of "summary variables" that have appeared at the end of the individual data record. In the 1994 preliminary data, most of the annual measures (e.g., Sequence Number, Relationship to Head, Family Identification Numbers) are available, while virtually NONE of the "summary variables" are included.
With a single exception, the individual-level "summary variables" (i.e., V31996-V32049) are not included on these files. The exception is V32000, Sex of Individual, which is too important to omit.

Variables ER30001 through ER30794 will remain the same for the final release version (with the prefix change from "ER" to "V"), but a few more variables must be added to the 1992-1994 individual data, most notably the weights.

The order of the 1968-1994 early release data is as follows: 1968 through 1992 individual data are arranged as usual by wave; we jump to the summary variable ER32000, Sex of Individual, and then we include the 1993 individual data in ER33001-ER33018 and the 1994 individual data in ER33101-ER33118. For the final release version, the 1993 and 1994 variables will be moved to follow the completed 1992 individual data and ER32000 will appear in its usual place among the summary variables.

Some 1993 and 1994 equivalents of the annual individual-level variables are not included in this preliminary version. Variables with this treatment include individual income components and totals, linking measures for splitoffs, age generated from birth date (rather than respondent report), reason for nonresponse, and weights. These variables will be located near the end of the yearly data, just as in 1992 and earlier waves.

To create variables from early release data that resemble those on final files from 1992 and earlier waves, we suggest users consult the 1992 codebooks. Descriptions of "edited" and generated variables for 1992 include enough information to create many variables, for example, annual work hours of Head and Wife/"Wife" and Head's annual wages. Some other variables, such as total family money income, are not generatable because income components of individuals other than Head and Wife/"Wife" are not included in the 1993 and 1994 early release data.

B. Documentation (or Paucity Thereof!) and Codes

This document does not include codebooks.
However, a 1994 questionnaire is incorporated in the early release package. It is available on the Internet in a PDF format suitable for perusal with an Adobe Acrobat viewer. See our home page for further information. (The Acrobat viewer is available free of charge.) Use the variable labels from the SAS and SPSS data definition statements to match variables with the questionnaire. The questionnaire text contains codes for most data items. The codebook from Section II, Part 1 of the 1992 documentation can also be helpful in deciphering the "preliminary" data on this file.

An index of 1993 and 1994 early release family file variables is included in the following section. The index covers ONLY 1993 and 1994 variables; no attempt has been made to link the early release variables with equivalents from 1992 and earlier waves.

In general, codes follow our traditional structure, although "don't know" responses are now largely distinguished from other missing data responses. If the questionnaire does not indicate otherwise, code 8 (or 98 or 998, etc.) represents "don't know" and code 9 represents a refusal or other missing data. Inappropriate questions are padded with zeroes. A few fields contained non-numeric characters, and these have also been converted to zeroes for the early release file.

If a variable contains a code value that is neither included in the questionnaire nor one of the zero, eight or nine codes just mentioned, assume missing data for that value. We will clean such cases for final release, but time constraints do not permit this sort of cleaning for early release. The inevitable exception: codes 21 through 24 for month variables in event dating questions were not printed in the questionnaire but were used throughout the CATI application to indicate mentions of season only. These codes follow:

21. DK month, but season was winter
22. DK month, but season was spring
23. DK month, but season was summer
24. DK month, but season was autumn

For individual data, use the codebook in our 1992 documentation, Section II, Part 2. Similar variables for 1993 and 1994 are coded identically to those from earlier waves.

C. Problem Variables, Missing Variables and the 1993-1994 Family Index

Some variables included on the 1994 file are known to include bad or completely missing data. These will be corrected for the final version of the file, but in the meantime we want to inform users about them.

The 1994 file includes many series of variables concerning monthly dating of events during the prior calendar year. For example, ER2119-ER2130 indicate the months during which the Head worked on his or her present main job in 1993 (questionnaire question B39). The "strings" consist of a set of twelve dummy variables, one for each month. Essentially, a code value of 0 indicates that the activity did not occur during this month; a code value of 1 indicates that it did. The month of January in each monthly "string" is suspect because it can contain a value of 1 when the value should be 0. This implies that incomes could be miscalculated if the monthly string is used for computation. In addition, the series for question E66, months in which a nonworking Wife/"Wife" was unemployed in 1993, is missing the month of February entirely; there are only eleven variables in this set.

Several variables are included in the 1994 early release file, but PSID staff has found that code distributions are suspect or all cases contain missing data. These are ER2055 (question A36), reason why the family neither owns nor rents the HU; ER3718 (question G111), a checkpoint for number of dependents; and ER3924 and ER3926 (questions L14 and L16), education of Head's father and mother, respectively.

The individual-level variable ER33111, Employment Status, contains zeroes for every person on the file.
The employment statuses of Head and Wife/"Wife" are available on the family file (ER2068-ER2071 and ER2562-ER2565, respectively), but information for other individuals is missing.

Besides the above-mentioned omission of the month of February from the set of variables for question E66 (series ER2932-ER2942), some other variables are missing from the 1994 early release file: question G113, the number of persons dependent on this family for more than half of their support; and questions G9a-G9d, whether Head and Wife/"Wife" spent time working at a business and, if so, whether they reported those work hours.

D. Files and Format

The early release package for 1994 consists of the two data files mentioned above, i.e., the 1994 family-level data and the merged individual-level 1968-1994 data. These are ASCII data files. We have also included two other pairs of files with information about variables in the corresponding data files. These are SAS and SPSS data definition statements. The user is cautioned that neither of these contains missing data specifications as of this time, although we plan to include missing data information with the final release versions.

E. Additional Notes: Sample Supplements in 1993 and 1994

We had added a Latino sample of 2,043 families to the PSID for 1990. This sample is described in detail in the 1990 documentation, but briefly the Temple University Institute for Survey Research selected and interviewed this sample for the Latino National Political Survey (LNPS). The Latino addition was made congruent with our usual ID scheme and unique identifier formats, and these cases are easily identified at the family and individual levels by the code values for 1968 ID Number (V20302 for 1992 family data and V30001 for the individual file): the values for their 1968 IDs are in the range 7001 to 9043.

In 1992 several different kinds of recontacts were attempted.
These are described in detail in the 1992 family documentation, but briefly, three groups were followed: all 1991 nonresponse; a random subset of SRC and Census sample members who had become nonresponse in 1990 or earlier; and all of Temple University's Latino sample persons who were not successfully interviewed by us in 1990. The successfully recontacted Latino families have 1968 ID Numbers in the range 9244-9308.

The 1993 and 1994 waves included a change in PSID following rules. We now follow all sample persons who leave home, regardless of age. So, for example, when a sample male Head leaves his nonsample spouse with their children, we attempt an interview not only with him but also with her because her household contains sample members.

Our recontact effort for 1993 included the resurrection of many nonresponse sample persons who shared a 1968 ID number with families still responding in 1992, similar to the second group selected for 1992 as described above. But in contrast to this 1992 group, priority was given to families with connected individuals under age 18 at the time of nonresponse. All sample individuals within such a family were selected for recontact, even if they themselves were older.

The main thrust of the 1994 recontact effort was to follow some nonsample ex-spouses of sample members; these ex-spouses had had one or more children with the sample members, and at least one of those children was expected to be under age 18 by 1994. In addition, recontacts were attempted with 1992 and 1993 nonresponse and also with families with no remaining response individuals. Some of these latter families had become nonresponse as early as 1969.

III. A Concluding Note

We close by repeating our warning: THESE DATA SHOULD BE USED ONLY BY VERY EXPERIENCED PSID DATA ANALYSTS. The absence of complete documentation makes it difficult to determine the precise coding of a number of variables on the file.
And the absence of weight variables makes it impossible to use these files by themselves to produce any nationally-representative estimates from either the original or Latino samples. We expect that these preliminary versions of the data will be useful for experienced users who want to pull a handful of variables from the files so that they can be merged onto analysis files constructed from prior-wave data.
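
As one illustration of such a merge, the first steps of the Head's background procedure in Section II, Part A can be sketched as below. This is a minimal sketch in Python under stated assumptions, not PSID code: records are assumed to have already been read from the ASCII files into dictionaries keyed by variable name with integer values, and both function names are hypothetical. The variable numbers are those given in the text.

```python
def attach_prior_ids(family_1994, individuals):
    """Copy the 1992 and 1993 family IDs onto each 1994 family record.

    family_1994 and individuals are lists of dicts (an assumed layout).
    The 1994 Head's record is the individual record with ER33102 == 1
    ("01" in the text), matched on ER2002 == ER33101.
    """
    heads = {}
    for person in individuals:
        if person["ER33102"] == 1:
            heads[person["ER33101"]] = person

    merged = []
    for fam in family_1994:
        row = dict(fam)  # leave the caller's record untouched
        head = heads.get(fam["ER2002"])
        # ER30733 = 1992 family ID, ER33001 = 1993 family ID (per the text).
        row["ER30733"] = head["ER30733"] if head else None
        row["ER33001"] = head["ER33001"] if head else None
        merged.append(row)
    return merged


def head_background_source(er3917, er1850_1993=None):
    """Say which file holds the Head's background data.

    er3917 is the 1994 indicator; er1850_1993 is the 1993 indicator from
    the matched 1993 family record. Returns "1994", "1993", "1992", or
    None when the indicator values are not 1 or 5.
    """
    if er3917 == 1:
        return "1994"
    if er3917 == 5:
        if er1850_1993 == 1:
            return "1993"
        if er1850_1993 == 5:
            return "1992"
    return None
```

The sketch stops short of copying the background variables themselves, because the 1992, 1993 and 1994 sets do not correspond one-to-one and must be compared item by item first.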