Syllabus
This class is about scientific and statistical computing. It is
intended to provide you with a strong foundation in the computing skills
that are increasingly necessary for practicing statisticians and
scientists generally.
The main topics that we will learn about during the class are
- The R environment and programming language
- Basics, data structures, control flow,
graphics, writing functions,
simulation, sampling, exploratory data analysis.
- Data manipulation & Regular Expressions
- Basic input techniques for rectangular and
non-rectangular data,
text manipulation and regular expressions.
- Shell tools and programming, and working with other languages
- Web-related computing, Web services, XML
- Accessing data via the Web:
scraping data from HTML pages, HTML forms, REST services, SOAP, parsing XML.
- Creating graphics for the Web, e.g., Google Earth, SVG animations, ...
- Relational database management systems (RDBMS)
- Concepts of databases, the relational model,
Structured Query Language (SQL),
and accessing databases from R.
We will encounter these topics in the context of exploring real data.
Much of the work will involve manipulating and exploring data and
making sense of it through summaries and creative graphical displays.
We will also use the computer and programming to perform simulations
of stochastic processes, and we will do some statistical modeling,
covering statistical methods that you may not have seen in other
classes (e.g. k-nearest neighbors, cross-validation, bootstrapping).
We will cover these heuristically rather than with formal theory.
In this way, you will learn the computing topics by using them in actual settings.
The primary goals of the class are
- for you to become competent high-level programmers so that you
can approach data analysis and simulation problems confidently
and sensibly;
- for you to become aware of important available technologies and gain an
understanding of how they work;
- for you to learn how to find out about new computing topics,
e.g. R functions, packages, new technologies;
- for you to be able to explore, make sense of and summarize data and
understand the importance of this aspect of statistics (rather than
simply applying statistical methods to numbers).
See the detailed topics for more information.
Grading
- 70% 4 or 5 homework assignments
- 20% Final project.
- 10% Class participation
- This includes asking and answering questions in class,
on the class mailing list and in office hours, and generally
being engaged.
Policies
- You can discuss approaches to problems with other students.
- You cannot copy code from other students.
- You can look for hints, code and solutions on the Web, but you
must acknowledge them in your writeups.
- Reports:
- You are to hand in printed reports to Gabe Becker
and send an archive of the writeup and all the relevant code
to sta141@wald.ucdavis.edu.
- The reports should describe the context of the problem
and your approach to it, and provide details about
the more difficult, non-basic elements of the
programming involved in your work.
You should discuss and contrast alternative approaches, even if you
have not implemented them.
- For data analysis problems, you are to write your answer
as if for a scientist or journalist who is familiar with
the basic ideas. The focus is on the
discoveries or confirmation of conjectures and
hypotheses, not the programming. However,
do point out interesting programming tasks
or things that you learned about important functions
(e.g. lattice and legends).
Your writeup should therefore cover both the computational issues
and your commentary on, and analysis of, the primary problem/context.
- The data analysis problems are intended to focus
on exploratory data analysis and understanding the data.
At times, we will fit statistical models.
The rest of the time, use common sense and look for
interesting aspects of the data based on its
context. Do not apply "arbitrary" statistical methods
to data just for the sake of it!
Duncan Temple Lang
<duncan@wald.ucdavis.edu>
Last modified: Fri Sep 25 07:48:08 PDT 2009