User Guide Tutorial #1



Cross Sectional Files

Ngina Chiteji, Mohammad Mushtaq, and Frank Stafford

Revised: June 2010

Overview

Public access files from the PSID archive were first delivered 'in bulk' over the Internet in 1995. Since then there has been a growth in the functionality in data delivery via the PSID website. Correspondingly, there has been an emerging use of the PSID data sources for instructional as well as research and policy purposes. A key element in delivery functionality was added in the spring of 1996. Then the PSID began providing the capacity for users to download not just the full 'in bulk' archive, but to use our on-line system to create data subsets, customized to the researcher’s domain of interest as defined by year and level (family, individual), variables and value ranges of variables. Since then we have added subsetting functionality to all Public Release files. These extend from 1968 forward and the Data Center can now produce customized codebooks and other research resources. In addition, important generated variables (such as wealth, primary income components, work hours, ...), which have been created as an integral part of Public Release  files, are now part of the files accessible via the Data Center.  

Improved Data Center functionality has lead to increased use by the research community resulting in a much increased number and range of publications, but has also attracted a margin of graduate student ‘classroom’users, beyond the traditional use of PSID resources for Ph.D. dissertation research. Recently, there has been a trend toward some use of PSID files in courses by undergraduates. To facilitate that use, besides modifying our Data Center to have increased subsetting functionality and to allow the linking of Public Release files across different years, we have decided to try our hand at drafting 'User Guide Tutorials'. Because of the complexity of the PSID archive we have decided to use a building blocks approach, starting first with this tutorial on access to subsetted cross-sectional files. The user is here able to learn how to access both variables from the individual and family level files. This is here restricted to adults and does not include individuals age 0-12 from our 1997 Child Development Supplement (CDS) and the 2002/03  CDS (or CDSIII collected in2007-08 school year) - which are now incorporated into the Data Center. (See Tutorial #4 to create a customized subset from a selection of CDS data modules that highlights the use of relational data structures) Our tutorials assume that the user selects dBase as the output file format option and reads the dBase file into an Excel spreadsheet (or creates an Excel file directly from the Data Center) and performs transformations and computes basic descriptive statistics using Excel. We have an option to read directly into Excel and to carry labels into the Excel file from the subset selection off the Data Center. Others may select SPSS, SAS or STATA as output file formats.

Search Tools Available

Before getting started it is important to realize that the PSID Data Center holds a complex and extensive archive on economic, family, health and other variables. There are numerous on-line search tools that will help you navigate the archive. For starters suppose you are interested in non-market housework, the topic illustrated in this tutorial. What does the PSID archive hold? To check this out go to the 'Search' tab at the PSID homepage and select 'Search for variables'. The default is to search by all Data File Types and across all Data Years and for all Codebook Parts and All Words. If you do this for the subject in this tutorial - housework - by typing in the word: housework - you should come up with over 300 entries! If you type 'wealth' you will get over 90. If you type 'relig' you will get (over 200) variables related to religion such as participation in religious activity, religious preferences, and donating to and providing volunteer time to religious organizations. Typing 'income' or 'inc' is not advised! This is the Panel Study of Income Dynamics! Even 'labor income' will produce over 500 variables from which to choose.    

If 'housework' is chosen for the search, the variables range from 1968 forward for the family head, the wife, and also for the different types of housework from the time diaries in the child development supplement of 1997 (CDSI) and 2002/03 (CDSII). There are variables measuring the time in housework, but also whether there was paid housework (1969 V489 $ HSWRK NONFU). If you want the specific codes for a variable, click on the purple box. This will bring up the code values and information on other years in which the variable was collected. In the case of $ HWRK NONFU the codes indicate that the codes are in dollars per year up to $9,999 for work provided by those living outside the family unit, and that it was measured in other waves, 1970, 1971, 1972 and 1976, but evidently was not asked subsequently.

The code information also indicates that it is available in the family cross year variable index under HOUSEWORK. This is another avenue to seek out variables. If you go to the Data Center main page and select By Index you will see the choices of Individual, Family or CDS index. (There will be some of the variables at the Family Level, some at the Individual Level (more on this later in the tutorial) and some for CDS, the Child Development Supplements going back to 1997. Selecting Family and scrolling down to the topical index of Housework, you will see Help Received Outside, Cost, and Annual as you branch out by clicking on the "+" sign (for expand) next to the word "housework." There will be shaded boxes for the year in which the $ HWRK NONFU was collected - namely 1970, 1971, 1972 and 1976. In other years it was not collected, so the boxes are not shaded. If you click on the purple box you will have the codes brought up and the code information will also include any other years in which the question was asked and for whom the question was applicable. Note that as you use the search tools and identify a variable that you want in a customized data set, you have the option of adding it to your Data Cart at that point. More on Data Cart below.

Another way to locate a variable is to scroll through the variables by year in a By File search. So just to see this different routing, select 'By File'. If you then further select Family, a box will appear with Main Family Data as a choice. Selecting this will provide a list of the years from 1968 forward. Since you already know that $ HWRK NONFU was asked in 1969, if you select 1969 you can then scroll down to V489. Clicking on V489 will bring up the by now familiar code frames. These are the same ones that you saw under By Index selection, and are now seeing through a different view. As a general rule the By File variables that rather directly translate from the respondent responses in the data collection and are presented in the order in which they were asked in the data collection instrument. For those variables which are in the Data Center essentially 'as collected,' the codes, question text, and skip sequences can also be seen from the Computer Assisted Interview HTML version and the labels include the question section reference as we will illustrate below with housework hours. Generally, too, complicated generated variables such as total family money income, wealth, or average annual hours of (market) work are listed toward the end on the variable list in this type of By File access. For example in 1969-1972 there is a variable which modifies money income for various activities of home production, V840. For 2005 there is Wealth - variable S718 - the third to last variable in the Main Family Data group.

Using By File you can 'locate 'Summary Variables (sampling variables, family history variables, and sex)' under PSID Individual level data. Additionally, you can choose from Individual-level data for the most recent year available all the way back to the first year of the study, 1968; and Main Family Data for the most recent year also available back to 1968. Selecting a year (or years) in the Individual and Main Family Data (by clicking on the "+" sign next to the year of interest) will bring up scrollable variable boxes. Scrolling to the end of the lists of variables you will find displayed a number of generated variables. This tutorial will show you how to view their code values and background documentation of variables later. Or, you can check more detailed information under Supplemental Data Files. There will be different possible levels of codebook information available, depending on the variable. All will have Basic Information (Variable Label, whether numeric, width and decimals, and unweighted case counts). In addition, there can be detailed codes, explanation text/question text, notes on comparability of variables across the years, and possible links to other more extended documentation for complex variables such as income and occupation.

I. Using the PSID Data Center

A. The Nature of the Exercise

The exercise which we will illustrate is using Public Release files for 1999, but these tools can be applied to all PSID years from 1968 onward. The exercise involves subsetting to a sample of women, some of whom are female heads, and some of whom are spouses in families with a male head. Then we need to take some variables on housework which will be used to construct a variable, Annual Hours Spent on Housework for women. In the Excel exercise the sample is further restricted to those who are under age 65, and for those the sample needs to be subdivided further into groups based on the number of children: none, 1, 2, 3 and 4 or more. We want to apply individual weights and get simple weighted averages. To motivate the exercise, we are going to see if the long-term housework trends for women over the period 1968-1989 continued to hold going to 1999. PSID has asked questions on housework for both men and women in every year of data collection. There have been some minor wording changes from time to time, but the consistency of the series for women is such that review of very long-term trends is meaningful. The 1993 - 2003 questions from the Computer Assisted Telephone Interview (CATI) application are interactively viewable at this website. For the 1999 CATI housework questions, please refer to: 

ftp://ftp.isr.umich.edu/pub/src/psid/questionnaires/q1999.pdf#page=78

The trends in housework of women are highlighted by the following table derived from Luoh (1999 ) and Juster, Ono and Stafford (2003).

Table 1: Annual Hours Spent on Housework by Married Women Age 65 and Under, Selected Years, 1969 - 1999

YEAR

All Women

No Children

1 Child

2 Children

3 Children

4 or More

1969

   1802  

   1396

  1706

  2003

  2113

  2379

1979

   1492

   1279

  1492

  1579

  1880

  1917

1989

   1206

   1075

  1213

  1332

  1441

  1581

1999

   ?

    ?

    ?

    ?

   ?

   ?

As is evident from the table, there has been a downward trend in housework hours of adult women, 1969-1989. This downward trend can not simply be the consequence of a drift toward fewer children in the family, since the decline is observed conditional on a given number of children. That is, in each of the number of children categories, estimated annual housework hours declines through time. Another interesting aspect of the table is that the dependence of housework hours on number of children is weaker in each of the three successive decades. Consider a simple ratio of housework hours of those with three children to those with none in the household. In 1969 this ratio was, on average, 1.51 (2113 hours/1396 hours). In 1979 it was 1.47 (1880/1279) and in 1989 it fell still further to 1.34 (1441/1075).

Going forward to 1999, did the overall housework hours fall further and within each number of child category? Now we have to get to work.

B. Getting Started

First is to note that housework hours are measured from responses to the 1999 CATI question F2 at the following URL (for later years there will be either CATI documentation of a PDF file which portrays the paper questionnaire which would be the equivalent of the CATI application.):

ftp://ftp.isr.umich.edu/pub/src/psid/questionnaires/q1999.pdf#page=78

The question asked is:

F2. About how much time (does your wife/do you) spend on housework in an average week? I mean time spent cooking, cleaning and doing other activities around the house? [IF QUALIFIED ANSWER] What is your best estimate?

HOUSEWORK HOURS PER WEEK

0-111 ACTUAL HOURS

112 HOURS OR MORE

If a respondent claims to not know (don't know or D.K.), the numeric code is 998. If the respondent refuses to answer or the number of hours was not ascertained (N.A.), the numeric code is 999. In working with the data we will see below the need to decide what to do with these D.K. and N.A. values. If you mistakenly use them when you calculate arithmetic averages or other statistics, these two codes will present problems for your analysis.

A few brief remarks. This question asks a respondent for a quite complicated variable as a simple estimate, including potentially complex averaging to get to a value for an 'average' week. Hence, the answer is likely to be less than ideal. Notably, respondents generally give somewhat upward biased answers to such questions and occasionally may not have even thought about the fact that there are only 168 hours in a week! So any answer above 112 (two thirds of 168! -and one needs sleep) is automatically truncated at 112 as part of the interviewing process. In processing the data prior to 1994 a further restriction has been that for reported housework hours in excess of 84 (half of 168) the 'weekly average' is set to 84. Then the resulting housework hours are multiplied by 52 (weeks per year) to get the data reported on our table above. Those reporting 'zero' or no housework are assumed to be right. However, another reason for a value of 'zero' is that the question is inapplicable (INAP.) because the family has a single male or single female head. So, again one needs to be careful with the values of 'zero' on the wife's housework hours.

Methodology research using alternative time diary methods shows zero to be plausible in many cases and not predominated by misreporting. http://psidonline.isr.umich.edu/CDS/timediary.html

The last thing we will mention here is what to do about a respondent who gives a report of 'I don't know?' or 'refuses'. Such answers get codes of '998' and '999' as noted above. One may want to either exclude such cases or make some simple 'imputation' for these missing data. For example one can assign the median for those who do give answers in the range of 0-84. Or assign at random from quartile values. Or assign at random from quartile values conditional on basic variable values such as number of children. Or, .... We guess you get the picture. One last point (whew!). These housework hours are referred to as a numeric variable. This is distinct from a categorical variable. For example, if male (a category) is coded as '1' and female is coded as '2' or if 'yes' is coded as '1' and 'no' is coded as '5' (a PSID tradition of mysterious origin but alleged to avoid keystroke errors) one would not want to say that 'no' has a five times greater value than 'yes' or that women are (necessarily) twice men. So here the '1' the '2' and the '5' are simply index categories.

C. Using the Data Center

If you are a new PSID user, please note that the PSID requires users to register prior to downloading data in order to impart the Conditions of Use. Through your registration, it also becomes possible to convey the size of the user community to sponsors, allowing the PSID to continue to collect these important data. Registration is very easy, and it is required only once. If you do not already have a PSID login and password, you will want to register at the Data Center before you begin the next few sections of the tutorial. To do such, click on the word "login" that appears under today's date in the far right upper corner of the "Welcome to the Data Center" screen (screenshot 1). Once you have registered, you can return to the Data Center to get going on the tutorial.

Now you need to go to the PSID Data Center Web site ( http://simba.isr.umich.edu ) and its Set Theoretic Subsetting System. You need to select data for 1999 from the Public Release. The screen shot below in this document is not active - it's just there for convenience in this tutorial. This tutorial is designed to show you how to use the "by file" option for selecting variables. Accordingly, you need to click on that option. Next, note that we will be working with both family-level and individual-level data; therefore you want to click on the "+" sign next to the phrases "PSID family-level" and "PSID individual-level" to expand the list of options under these two categories. This will take you to a screen like the one illustrated below in screenshot 2. Here you have the option of expanding your choices even further! Here you have the option of expanding your choices even further, since both the "PSID Main Family Data" and the "PSID Individual Data by Years" options allow the user to select data from specific years (that is to say, from different waves of the PSID). We are studying the 1999 cross-sectional housework pattern, so you want to click on the "+" sign next to these phrases, and to then click on the + sign next to 1999 in each case. In each case, a box should pop up. These boxes list the specific variables that are available for the year 1999 (the first for the variables contained in the 1999 family file, and the second for the variables contained in the 1999 individual file). Leave these boxes visible, and scroll down a bit on your screen until you see the phrase "Summary Variables. You want to click on the "+" sign next to the phrase too, because we will want variables from this file as well.

Screen Shot 1: Selecting Data Groups

Screen Shot 2: Expanding the Family- and Individual-level file choices

 

Now where are those variables needed for completing the entries in Table 1? After getting to this point, you need to scroll through the the Family Data variables, and then Individual Data variables to make your needed selections. First though, the term 'Family' may benefit from some discussion right here. Fundamentally, the PSID interviewer asks the respondent questions that apply to the whole family of one or more persons, including children. The study goes way back (designed in 1966-1967) and it was common then for such household or family economic surveys to ask financial and income questions about the household head or 'breadwinner' who, back then, was commonly assumed to be the male partner in a married or cohabiting couple. Of course if no husband (or significant male other) was around, the household head or breadwinner would be the economically active adult female. This seemingly restrictive structure was carried forward even though the economic role of women has broadened and become far more influential, especially since about 1975, not long after the study began! And through time far more information about the wife or 'wife' in a cohabiting couple was asked. By now the same information is obtained for both adults in a married couple family. But the basic head-wife structure was necessarily maintained in terms of the data consistency. If not, one could not have the consistency over time to complete Table 1 and it turns out this is a key aspect of the relational data structure of the PSID. What happened to trends in household time allocation and intra-family life would be solely the province of speculation! But we can complete Table 1 through relational data base structures, and not be forced to rely on speculation.

Housework hours and many other PSID family level variables have this head-wife structure, giving rise to two sets of identical variables - one for the head and one for the wife (if present). So, here, there is a single variable in each set: housework of the head, housework of the wife. In other content areas there are numerous questions (and corresponding variables) in each set. To illustrate, on health conditions there are (for 2001) 108 health conditions asked of the head (ER19612 based on question H1, overall health status, to ER19719 based on question H23, inches of height) followed by the same 108 health questions asked about the wife (ER19720 based on H25, overall health status, to ER19827 based on question H47, inches of height). For market work activity there are entire CATI sections with one set (B, C) for the head and others (D, E) for the wife.

For housework hours consider a world with only two simple cases for the PSID to handle. Married couples and single adults 'heading' a household - which could be just that single person (or possibly with dependents, such as young children). The precise definition of 'heading' a household is known to the PSID staff but cannot be normally explained to others in a finite lifetime. So, let's just pretend that someone in Ann Arbor has done this for you. If you are really compulsive you can immediately turn to a discussion at the FAQ (Frequently Asked Questions) web page.

Given what we just said, how do you think one would get a good national sample of adult males as of 1999? Surprisingly simple. Just limit the subset to the families with a male head. Some will have a wife, but those who don't will be the single male family heads. No problem in concept, given the traditional male-head-by-default assumption. But, given what we just said, how do you think one would get a good national sample of adult females, married or otherwise, as of 1999? Still pretty simple in concept, but relational data structures are needed, which means ID links. That is, adult women are either single female heads or the wives of the married/cohabiting men.

As we have said, the PSID primarily collects data that are measures at the family level. What does that mean? Well, for some variables it is straightforward. The family (married couple or single adult with children) shares a residence and economic resources, by definition. So the respondent (commonly the head but not always) reports things such as the housing section information. Does the family rent or own? If it owns does the family have a mortgage on that home? In the household wealth section, does the family own equities? Such family level variables apply to the labor income or work hours of the head of the family or the wife of the head of the family. There are parallel sections with extensive detail for the head and the wife on these topics. Traditionally, and today, this has been questionnaire Section B (employment of those heads active in the labor market - primarily working now or actively looking for work - as of the interview date), and Section C - followed by Sections D and E - parallel to B and C, but for the wife/'wife'.

To create our file we want to scroll though the boxes identified above to make the following selections: From the box that came up when you expanded the"summary variables" options, you need to select SEX OF INDIVIDUAL[1 = male, 2 = female] ER32000. As we know from Table 1, we want the number of children in the family. This is ER13013 # CHILDREN IN FU from the variable list box containing the 1999 family-level variables. Continuing with choices from the 1999 family file, we also want to be able to separate out the housework hours of married women from those of all women so we need ER13021 HEAD MARITAL STATUS, and we need the following additional family level variables: ER13011 SEX OF HEAD[1 = male, 2 = female] ER14229 F2 HOUSEWORK HRS-WIFE [hours code 1-112 and 998 or possibly 999 from the CATI application] and ER14230 F3 HOUSEWORK HOURS-HEAD. [Technical note: to select non-adjacent = non-contiguous variables of the variable list box, hold down the 'control' key and skip those variables you do not want to select by scrolling to the desired variable and highlight (dark blue) with the mouse/cursor. This is mentioned in the text preceding the Selecting Variables box, but this reminder could be helpful.]

The other variables needed are from the individual data so they will come from the 1999 individual file (the box that popped up when you clicked on the 1999 under the "PSID Individual Data by Years" option). Select the sequence number, ER33502 SEQUENCE NUMBER 99 (code 1-20 = individuals who lived in the family at the time of the 1999 interview) and the relation to head, ER33503 RELATION TO HEAD 99 (code value 10 = head and code value 20 = wife). {Note that individuals living in a consensual relationship have a code of '22' for the 'wife'. We are analyzing only the cases with ER33503 = 10 and 20 here.} And since Table 1 is for those under 65 you will need the individual level variable, ER33504, AGE OF INDIVIDUAL 1999 (years of age). These relationship to head codes are in the Data Center codebook. Optional: to see them search for variable name ER33503 in the PSID Individuals data at the Data Center search page. Then, on the results page click [View]. This will familiarize you with the Data Center's search facility. What you will see is the full codebook detail about ER33503 "RELATION TO HEAD 99." Codes 1-20 for ER33502 are those 'individuals in the family at the time [date] of the interview', those 51-59 are those 'individuals in institutions at the time of the 1993 interview', and so on. We want to work with those in the code ranges of 1-20. One last item: weights. The PSID sample is representative of the U.S., but because of the long evolution, weights are needed to allow for the fact that certain groups of individuals are over or underrepresented relative to the U.S. population. For 1999 these are ER33546 INDIVIDUAL WEIGHT 99. More information about PSID weights, view the Frequently Asked Question about weights. (Optional: while we are in the data selection mode, maybe we would want to select a few other variables for later use. Maybe you want to go beyond Table 1 and look at housework hours for those who are actively in the labor market. If so, select from the individual variable list, ER33512 EMPLOYMENT STATUS 99. For the purposes of this tutorial, we assume that you have not selected any additional variables. And the Data Center allows you to save your cart selections and return to add or delete from the prior variable selection session.)

Screen Shot 3: Selecting Variables

 

After you have made your selections you should scroll back up to the top of the page and hit "add to cart." This instructs the Data Center to make a shopping cart for you that contains the variables you have identified in the steps above (a "data cart" in PSID parlance). This step is hugely important! If you forget to hit "add to cart" then there won't be any variables in your data cart. Completing the "add to cart" step takes you to a new screen, as shown below (Screenshot 4). As you can see, this screen allows you to check to make sure that your variables have been added to your cart. Hit "variables added to your cart" and then expand the list to see the full list of variables. (See screenshot 5.) Note that at this point it is possible to review the documentation for each individual variable (in case you've forgotten why you chose it). For example, if you click on the words "ER32000: Sex of Individual," you get an explanation of the variable and the coding for it (the values it can take on and what they mean). Next click on "ER13021: Head marital status" and you will see that documentation providing the text of the question asked to get this variable is displayed. Note that even if you had not selected them you will get 1999 interview number, 1968 interview number and person number for 1968. These are needed for the PSID record link syem based on the long-term genealogical nature of the PSID.

A quick note in case you find that the list does not include all the variables you want: If you are missing something, you can go back at this point to retrieve whatever variable you are missing. You would do this by moving your cursor to the upper left portion of the screen to the phrase "Data Center." Doing such you will see that the four options for selecting variables are displayed (by file, by index, by search and by cart). If you click on "by file" you will call up a new copy of the screen at which we started when we first wanted to select variables from the different PSID files (data groups). You would then call up the relevant files for year 1999 that contain the variable(s) that you are missing. When you select the missing variables they will be added to the existing data cart. This means you do not need to re-select all the variables you want (just the ones that were missing).

Screen Shot 4: Variable list

Now--once you are satisfied that your list contains all the variables you want--hit "check out" (which appears toward the top of your screen). Doing such will take you to a screen where you will be asked to login. Here you need to enter your login name and password. After doing that the Data Center will take you to a screen that allows you to select your output options. (Why does the Data Center ask for your login information? The Data Center will store your data cart for you in case you want to retrieve it at a later date. To do such, and so that only you can access it in the future, it needs some record of which user is creating the cart.) Once you log in, you are taken to a screen that allows you to stipulate the format in which you want your data and the accompanying codebook delivered to you. This screen is depicted below. (See screenshot 5.)

At this point you also want to focus on the group needed for Table 1. A nice feature of this output options screen (Screen 5) is that it allows you to restrict your dataset to a limited number of cases of interest. For example, because we want to look at housework done by women, it makes sense to tell the Data Center that we only want observations that are female here (instead of having the Data Center send us information about both men and women). To include (subset) only cases where the female individual resides with the family, you need to type into the subsetting box illustrated below in the 'Output Options' box: (ER32000=2) and (ER33502<21). (If you prefer to handle the gender restriction as part of analysis after data exploration however, you need not make that restriction here.) The format on these subsetting statements is important and there is on-line help if needed. A typo in this box will return the message an error when you go to create your analysis file. Your final subsetting criteria should be:

(ER32000=2) and (ER33502<21) and (ER33503 in (10,20))     

In line with our discussion of housework of females, you want those whose relation to head is either ((ER33503=10) or (ER33503=20)) [10= head, 20= wife]. You may want to have a customized codebook, even for this simple set of variables. Here we illustrate the choice of an 'HTML codebook'. You want 'all individuals' not just the family heads. Then you need to select Output Options. Since the balance of this exercise will be based on Excel, you should select the Excel choice, as illustrated below. You also want to click on the box instructing the Data Center to send you your data via e-mail (well, a link to the location where you can retrieve your data really). Note also that you may want to check the box instructing the Data Center to compress your files. Then, you can use a program like Winzip to unzip them once you receive them. (Should you make your data public, and should you name your datacart? That depends. These are simply options that can make one's life easier if one envisages using the Data Center on repeated occasions. For example, if you make your dataset public, then you will not need to login to retrieve your datacart if you want to work with it on a future date. Similarly, if you name the datacart then it will be easy to remember what kinds of data it includes if you want to use it again in the future.) IMPORTANT STEP: Hit the submit box when you have finished stipulating your output options.

Screen Shot 5: Censoring and Output Data File Options

 

You will see a notice indicating that the Data Center is creating your dataset, and when the task is finished a note appears indicating that your dataset has been sent to your e-mail address. When retrieving the dataset, PC users should right click (other users may be required to use alternatives to right click) on the blue text of Excel Data File and select open; the file will be transferred to you. You should do the same for the Variable Labels.

To understand how the effective sample size varies depending on the subset of cases selected in the Excel file we have prepared Table 2. Of the 5,725 rows of data, selecting those equal to or under age 65 produces 4,994 rows. Of the 5,725 rows of data, selecting those with housework hours of 0 -112 produces 5,655 rows. Limiting the analysis to those with positive weights, under age 65, and with valid housework hours (0 - 112 hours) produces 2,754 rows. These 2,754 rows are the basis for the analysis below.

Table 2. Case Counts from Specific Data Partitions

N=5725: All females (head or wife only) and living in the FU at the time of 1999 interview.

Age <= 65

N=4994

 

0 <= Hours <= 112

 Weight  >  0

0 <= Hours <= 112

4940

 

Weight  >  0

2754

2784

0 <= Hours <= 112

N=5655

 

Age <= 65

 Weight > 0

Age <= 65

4940

 

 Weight > 0 

2754

3421

Weight > 0

N=3465

 

Age <= 65

 0 <= Hours <= 112

Age <= 65

2784

 

 0 <= Hours <= 112

2754

3421

II. Using Excel (2003) on your Output Subset*

* Note: For other uses of the tutorial, keep in mind that Microsoft Excel has limitations on the number of rows and columns that are displayed on a spreadsheet. Please consult the documentation for your version of Excel for more details.

Instructions for using Excel come next!

You should have an Excel file with the variable numbers arrayed across the top row and the variable values running from row 2 to row 5726 (there should be 5725 observations in this subset from the PSID Data Center.) Steps to take:

1.      In order to make it easier to identify the variables represented in each column, you may want to add the name associated with each number. Instructions for doing this can be found in the tutorial section of the PSID web site. If you do not choose to add labels, you should add a blank row 1.

2.     Our dataset contains 5725 observations, and we will have occasion to scroll through it. A convenient Excel option that makes it efficient to scroll through the data is "freeze panes." To use this option highlight the far left non-data 'row button' at row 3. If you then click on the 'Window' menu and then 'freeze panes,' then the top two rows of labels will remain visible as you scroll down through the data rather than disappearing from sight.

3.     Weights! PSID weights are aligned to national totals. Many statistical packages take such weights and normalize them (for you) to 1 in statistical work. Here we will do our own normalizing, so the average weight = 1. Each time we create a new sub-sample in this exercise, it will be necessary to renormalize these weights. In order to save time (and especially to avoid the clutter of many new weight columns), we will incorporate the normalizing step into our steps for finding the average annual housework hours of each sub-sample. This will be done using the following equation: mean = {sum(Vi * Wi) / sum(Wi)}, where V i=housework hours, and Wi=weight.

4.      First, you can see the selected variables to be like this

A

B

C

D

E

F

G

H

I

J

K

L

M

ER30001

ER30002

ER32000

ER13011

ER13013

ER13021

ER14229

ER14230

ER33501

ER33502

ER33503

ER33504

ER33546

1968 Interview Number

Person Number 1968

Sex of Individual

Sex of Head

Number  of Children in Family Unit

Head Marital Status

Housework Hours Wife

Housework Hours Head

1999 Interview Number

Sequence Number 1999

Relation To Head 1999

Age of Individual 1999

Individual Weight # 1999

Next, let's think about housework of all women in our sample of 5725, not just the wives. To do this we need to use some Excel logic steps. First, we need to create a new column to list the housework hours of all women. So label column N, 'HOUSEWORK HOURS-WOMEN.' In column K, we know the values of '10' mean that the relation to head = head. In cellN3enterthecommand[=IF(K3=10,H3,G3)]. Then 'fill down' by placing the cursor on the lower right hand corner of cell N3 and highlighting downward by pulling the cursor down and stopping at N5727. This should automatically 'fill down', copying the formula from N3 into the rest of the cells in the column (although the cell number in each formula will change to mirror the row number). If not, and the cells are only highlighted and not filled in, click on 'edit, 'fill' and 'down'. This logic command puts into the N3 cell and later N cells the value of the head's housework hours (from H3) if the selected woman is head, the value of the wife's housework hours (from G3) if the selected woman is wife. Get it? OK, 'yes' or 'no', moving on,

5.     Now you want to focus on the group needed for Table 1. Remember Table 1 for those age 65 and under, so you need to create a dummy variable which can function as choosing just the cases equal to and under 65. To do this label column O, 'DUMMY 1. AGE' and enter the command [=IF(L3<=65,1,"")] in cell O3 then 'fill down'. Also these are cases where housework hours are less than 113 (avoiding the imputation or treatment of extreme cases). To do this, you should create a dummy variable. Label column P "DUMMY 2.HOURS"and in the cell P3 enter the command [=IF(AND(N3>=0,N3<=112),1,"")]. Additionally, you should add a dummy for the individual weight "DUMMY 3. WEIGHT" in column Q and enter [=IF(M3>0,1,"")] in cell Q3, so you only include those with positive individual weights. If you would like to see how many cases are selected withtherestrictionofage,hoursandweightsimultaneously,incellR3enterthecommand[=IF(AND(O3=1,P3=1,Q3=1),1,"")]. This is the dummy variable ‘DUMMY – COMBINED’. In the last row(R5728) of column R 'DUMMY-COMBINED', you can see the number of observation chosen by restriction, using the formula [=SUM(R3:R5727)]. This should be in line with the entries in Table 2, namely, 2,754.

6.    . Now we can create a restricted sample of only women by listing their weights in column S. To do this, enter the formula[=IF(AND(O3=1,P3=1,Q3=1),M3,"")] in cell S3 and 'fill down.' This command puts in cell S3 and later S cells the weight value from column M but only if there is a female under 65 with housework hours under 112 otherwise, the cell will be left blank. Label column S 'WEIGHTS-WOMEN.' So, how many housework hours do adult women put in per year as of 1999? To find the answer, enter [=52*(SUMPRODUCT(N3:N5727,S3:S5727))/(SUM(S3:S5727))] in an open cell (such as S5730). This command takes each row and multiplies the value in column N (weekly housework hours) by the value in column S (weight) and sums all the products. Then it divides this number by the sum of all the weights in column S which normalizes the weights. Multiplied by 52 this gives the weighted annual housework hours of all women. This should return a value of 863.02. To confirm this, why not open our ‘answer’ spreadsheet (Note: Netscape users must right-click on this link and select 'save link as' in order to view the answer spreadsheet). (Note: It will make it easier to check your answers against our 'answer' spreadsheet if you keep them together in a table with labels similar to ours.)

7.    To get housework hours of married women, go to column T and label it 'WEIGHTS-MARRIEDWOMEN'. IncellT3enterthecommand[=IF(AND(O3=1,P3=1,Q3=1,F3=1),S3,"")] and 'fill down.' These will be the weights of married (F3=1) women. By entering [=52*(SUMPRODUCT(N3:N5727,T3:T5727))/(SUM(T3:T5727))] in cell S5731 you will get 1005.15, a modest decline from the 1989 value (1206) in Table 1.

8.     To what extent do housework hours of married women depend on the number of children living in the family? Let’s make a few more column labels. Column U can be ‘WEIGHTS-WIVES K=0’ (Kids=0), Column V as ‘WEIGHTS- WIVES K=1’, W as ‘WEIGHTS- WIVES K= 2’, X as ‘WEIGHTS- WIVES K= 3’, and Y as ‘WEIGHTS-WIVES K>3’.To fill in these columns, start with U3. Then reenter the Excel command[=IF(E3=0,T3,"")]. Fill down, you’re expert by now. This will select the weights only for those married women with no children in the family.E3= 0 means no children. Otherwise, a ‘blank’ will be placed in the cell. In cell S5732, enter the command [= 52*(SUMPRODUCT(N3:N5727,U3:U5727))/(SUM(U3:U5727))]. This will return the weighted average annual housework hours for those women with no children in the family (873.92). Continue with Column V. InV3,entertheExcelcommand[=IF(E3=1,T3,"")]. Similar to Column U, E3=1 means one child. Similarly, proceed to complete columns W-Y. You can get the weighted annual averages, by number of children (E=1, E=2,…), by mirroring the 'sumproduct' command in cells S5733 – S5736. Now the impact of more children on housework hours can be seen and you can complete the entries in Table 1.

9.     That table was for married women. How about hours of adult men? You can retrace the steps in this tutorial with different subsetting criteria, SEX OF INDIVIDUAL (ER32000=1). To confirm this, why not open our ‘answer’ spreadsheet. The second sheet shows the housework hours for men.

10.     Further analysis. Excel has numerous statistical functions. One could see if there are significant differences across groups defined by numbers of children. Maybe the number of families with more than 3 children present is so small that the difference in housework hours of those with >3 children is not statistically different from those where there are 3 children or 2 children. Or one could look at the regression of married women’s housework hours as a function of number of children. On average, by how many hours does weekly housework rise as there is one additional child?