Instructor: Tony E. Smith
274 Towne (898-9647)
tesmith@seas.upenn.edu
Office Hours: By Appointment
Teaching Assistant:
Dafeng Xu
dafeng.xu.pku@gmail.com
Office Hours: Fridays: 10-11 AM
Towne Library (Group Study Area)
This course builds on ESE 301 (Engineering Probability), and introduces students to the basic methods of statistical estimation, hypothesis testing, and regression. The emphasis is on practical applications of these tools, including the analysis of a variety of real-world data sets using standard statistical software. The capstone of the course is a small-team project, typically involving pairs of individuals. Each team is expected to formulate a problem of interest, gather relevant data pertaining to the problem, and analyze this data using multiple regression techniques. The project culminates in a written report that is designed to strengthen the students’ technical writing skills.
EDUCATIONAL OBJECTIVES. This course will introduce students to:
| Data Representations | Ch.1[D] plus material in Ch.5 [JMP] |
| Review of Probability | Chs.2-5[D] plus JMP-Examples |
| Random Sampling | Ch.5[D] plus JMP-Examples |
| Statistical Estimation | Ch.6[D] plus JMP-Examples |
| Confidence Intervals | Ch.7[D] plus JMP-Examples |
| Hypothesis Testing | Chs.8-9[D] plus JMP-Examples |
| Regression Analysis | Chs.12-13[D] plus supplementary materials and JMP-Examples |
| Homework | 10% |
| First Exam | 25% |
| Second Exam | 25% |
| Project | 40% |
| Lectures | Day/Date | Topic | Homework |
| INTRO | Th/Jan. 10 | Introduction | |
| 1 | Tu/Jan. 15 | Data Representations | |
| 2 | Th/Jan. 17 | Discrete Random Variables | |
| 3 | Tu/Jan. 22 | Sums of Random Variables | |
| 4 | Th/Jan. 24 | Continuous Random Variables | PS1 due |
| 5 | Tu/Jan. 29 | Random Sampling | |
| 6 | Th/Jan. 31 | Central Limit Theorem | |
| 7 | Tu/Feb. 5 | Estimation | |
| 8 | Th/Feb. 7 | Regression Model | PS2 due |
| 9 | Tu/Feb. 12 | Regression Analysis | |
| 10 | Th/Feb. 14 | Multiple Regression Model | |
| 11 | Tu/Feb. 19 | Multiple Regression Analysis | |
| 12 | Th/Feb. 21 | Simple Confidence Intervals | PS3 due |
| | Tu/Feb. 26 | EXAM I | |
| 13 | Th/Feb. 28 | One-Sided Confidence Intervals | |
| | Tu/Mar. 5 | SPRING BREAK | |
| | Th/Mar. 7 | SPRING BREAK | |
| 14 | Tu/Mar. 12 | General Confidence Intervals | Project Proposal due |
| 15 | Th/Mar. 14 | Regression Applications | |
| 16 | Tu/Mar. 19 | Regression Applications | PS4 due |
| 17 | Th/Mar. 21 | Simple Tests of Hypotheses | |
| 18 | Tu/Mar. 26 | General Tests of Hypotheses | |
| 19 | Th/Mar. 28 | Two-Sample Tests | |
| 20 | Tu/Apr. 2 | Regression Applications | |
| 21 | Th/Apr. 4 | Regression Applications | PS5 due |
| | Tu/Apr. 9 | EXAM II | |
| 22 | Th/Apr. 11 | Additional Regression Topics | |
| 23 | Tu/Apr. 16 | Additional Regression Topics | |
| 24 | Th/Apr. 18 | Additional Regression Topics | |
| 25 | Tu/Apr. 23 | Additional Regression Topics | |
| | Mon/Apr. 29 | Final Projects Due | |
All class data sets can be downloaded from the web site: http://www.seas.upenn.edu/~ese302/lab-content/
In addition, the following homework data sets can be accessed directly:
1. PROJECT DESCRIPTION
During the first few weeks of class, you should choose a partner to work with. Projects are expected to involve teams of two individuals. Individual projects are permitted. Teams of three are also permitted, but not encouraged -- and are expected to do more work.
Each team is expected to undertake a case study involving a statistical analysis of some data set. The only substantive requirement is that your analysis should focus on multiple regression. This analysis should demonstrate a sound statistical knowledge of regression (including goodness of fit and significance tests of coefficients).
The report is to be typed double-spaced and is expected to be on the order of 15 to 20 pages in length (this is not a rigid requirement). The first page should contain an introduction which (i) motivates the problem, (ii) states all of the main assumptions [without mathematics], and (iii) briefly summarizes your findings. The main body of the report should contain a detailed development of your statistical analysis, including a mathematical formulation of both the problem studied and the analytical methods employed.
Use plots and graphs wherever possible to illustrate your results (preferably in JMP), but be sure to back these up with appropriate discussion and analyses. [Do not include graphs or tables that are not discussed in the text.] All source material (including software packages used) should be cited explicitly. The last page should summarize your findings and conclusions in detail. Finally, be sure to include page numbers in your report. (I write comments on every project, and am very unhappy when I have no page numbers to refer to!)
Along with the hard copy of your project, you must send me an email attachment (preferably on the same day you turn in your project) including the following items:
There are no constraints on the subject of your case study. You might start by looking through the set of projects that are included in this web page. (These projects are presented in their original form -- including possible errors. So don't assume that everything in them is correct. They are intended mainly to suggest possible topic areas and data sources.) With respect to data, it is preferable to use real data from an experiment or survey that you or someone else has performed. For example, sports fans may wish to consider published data on their favorite players or teams. (A variety of interesting data sources can also be found by ‘web surfing’.) In any case, you must clearly specify the source of your data.
Students often find it difficult to obtain the data sets they want to study. So it is advisable to start looking as soon as possible. There is a list of web sites given below where you can start to search for existing data.
The final grade will be based on several factors: the appropriateness and sophistication of the analytical methods employed, the correctness of the analysis carried out, the logic and perceptiveness of the conclusions drawn, and the overall clarity of the presentation.
2. DATA FOR REGRESSIONS
One key point to remember in gathering data is that your data must involve properties of well-defined sampling units. For example, to study the relation between income and years of education, it would be ideal to have data on individual workers (sampling units) with both the income and years of education for each individual. This is usually not possible. But often such data exists at, say, the state level. So here you could do a regression by taking states as sampling units and regressing per capita income of states against average years of education of state residents.
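To make the idea of a sampling unit concrete, here is a minimal sketch in Python (not JMP, which is the package used in class). It assumes one row per state, and the state labels and numbers are made up purely for illustration. The function name `simple_regression` is likewise hypothetical.

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1.0 - ss_res / ss_tot

# Hypothetical data: each tuple is ONE sampling unit (a state), with
# (label, average years of education, per capita income in $1000s).
states = [
    ("State A", 12.1, 38.0),
    ("State B", 12.8, 41.5),
    ("State C", 13.0, 43.0),
    ("State D", 13.4, 45.2),
    ("State E", 13.9, 49.1),
    ("State F", 14.2, 50.3),
]
education = [row[1] for row in states]
income = [row[2] for row in states]

b0, b1, r2 = simple_regression(education, income)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, R^2 = {r2:.3f}")
```

The point of the layout is that every row carries both the dependent and the explanatory variable for the same unit. Six rows is far too few for a real project; the sample is kept small only so the structure is easy to see.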
A particularly vexing problem here is the preponderance of data in the form of summary tables. For example, if you only have a summary table listing average income for various education categories in the US, it is very difficult to run a regression on this data -- because there is no clear sampling unit. Since most data you find will be in the form of summary tables, it is very difficult to use such data in regressions (without a host of additional assumptions). However, if you were able to find summary tables for each of a number of countries, then you could use 'country' as a meaningful sampling unit in regression, and examine the relation between education and income across countries.
So in short, you should try to find data for which the sampling unit is well defined, and hopefully for which there are sufficiently many samples to allow an interesting regression. A common rule-of-thumb here is to have at least 10 samples for every beta parameter estimated. So a simple regression (two beta parameters) should ideally have at least 20 samples. This does not mean that you shouldn't consider a wide range of possible explanatory variables. It only means that your final regression should have enough samples to allow reasonable estimation of each parameter. (See the notes on Stepwise Regression above for further discussion.)
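The rule-of-thumb arithmetic can be sketched as follows. This is only an illustration in Python: the function names are hypothetical, and it assumes (as in the simple-regression example above) that the intercept counts as one beta parameter alongside one beta per explanatory variable.

```python
def min_samples(num_explanatory_vars, per_parameter=10):
    """Minimum sample size suggested by the rule of thumb.

    Counts one beta for the intercept plus one per explanatory variable.
    """
    num_betas = num_explanatory_vars + 1
    return per_parameter * num_betas

def max_explanatory_vars(num_samples, per_parameter=10):
    """Largest number of explanatory variables the rule suggests for a sample.

    Returns 0 when the sample is too small to support any explanatory variable.
    """
    return max(num_samples // per_parameter - 1, 0)

print(min_samples(1))            # simple regression: 2 betas -> 20 samples
print(max_explanatory_vars(50))  # e.g. 50 states -> at most 4 explanatory variables
```

So with the 50 states as sampling units, the rule suggests keeping the final regression to about four explanatory variables, even if many more candidates were screened along the way.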
3. SELECTED WEB SITES FOR DATA SOURCES
PENN CAMPUS RESOURCES
http://data.library.upenn.edu/index.html
http://www.cml.upenn.edu/
GENERAL DATASET COLLECTIONS
http://lib.stat.cmu.edu/datasets/
http://www.stat.ucla.edu/cases/
https://www.cia.gov/library/publications/the-world-factbook/index.html/
http://genderstats.worldbank.org
http://www.icpsr.umich.edu/
http://web.lexis-nexis.com/statuniv/
http://www.lib.umich.edu/libhome/Documents.center/stats.html
CENSUS DATA
http://www.census.gov
http://www.census.gov/DES/www/welcome.html
http://dataferrett.census.gov/TheDataWeb/
http://www.census.gov/apsd/www/statbrief/
COMMODITIES
http://www.carprices.com/
http://www.consumerreports.org/
http://www.diamonds.com/
http://www.diamondfinder.com/
CRIME
http://www.albany.edu/sourcebook/
http://www.ojp.usdoj.gov/bjs/dtdata.htm#crime
http://bjsdata.ojp.usdoj.gov/dataonline/
http://www.fbi.gov/ucr/ucr.htm#nibrs
ENVIRONMENTAL
http://www.epa.gov/enviro/html/ef_overview.html
http://www.eia.doe.gov/
http://www.pasda.psu.edu/
INTERNATIONAL DATA
http://www.geographic.org/
http://www.un.org/databases/
http://unstats.un.org/unsd/default.htm
http://www.worldbank.org/data/
MEDICAL
http://www.cdc.gov/nchs/fastats/
http://www.cdc.gov/scientific.htm
http://www.nci.nih.gov/public/factbk95/index.htm
http://www.who.ch/hst/hsp/a/country.htm
http://www.lungusa.org/
http://seer.cancer.gov/
http://www.cdc.gov/brfss/smart/2002/summary_matrix_02.htm
NATIONAL DATA
http://www.bls.gov
http://www.stat-usa.gov/econtest.nsf
SPORTS
http://www.sportstalk.com
http://sportsillustrated.cnn.com/
http://www.baseballprospectus.com/
http://www.hockeyguide.com/
http://www.nhl.com
http://www.pgatour.com
SURVEYS AND POLLS
http://www.nua.ie/surveys/
http://www.gallup.com/poll/releases/
http://www.cnn.com/ALLPOLITICS/
TRANSPORTATION
http://www.bts.gov/ntda/
http://www.apta.com/research/stats/
http://www.njtide.org/links/index.html
http://www.nhtsa.dot.gov/people/ncsa/
http://www.ntsb.gov/Aviation/Stats.htm
http://www.nhtsa.dot.gov/people/ncsa/fars.html
The following projects have been selected as examples of the level and quality of analysis that I am looking for. However, please be aware that none of these projects is free of errors (i.e., none have been "corrected"). So please don't think that because something appears in an example project that it is automatically "OK". If you are not sure about something, please ask me.