Thursday, October 31, 2013

Controlling for Confounding Variables: Blocking Design

In an ideal experiment, confounding variables would not affect the response variable at all, yet we know that in practice they are going to show up anyway. How can we set up the experiment so that the effects of confounding variables are minimized? There are three ways...

1. Randomization. When we randomly place subjects into treatment groups, we are mixing up the members of our sample so that extraneous variables get spread roughly evenly across the groups.

2. Replication. This means performing an experiment many times. Ideally, when we replicate an experiment over and over, we are looking for similar results each time. What if we don't see similar results? Then there may be a confounding variable within our data, or perhaps the experiment needs to be re-designed.

3. Blocking. Once we identify a possible confounding variable (e.g. gender), we can create experimental groups, or blocks, based on that variable (e.g. one block for boys and one for girls). Then we randomize within those blocks. Perform the experiment, then compare your groups at the end to see if there are differences between boys and girls.

Note: The blocking variable does not have to be gender; I used that as an example because I thought it would be easy for everyone to understand. Your blocking variable should be something that could potentially confound your study. The whole process is sketched out below.
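
If it helps to see the steps spelled out, here is a minimal Python sketch (the subject labels and group sizes are made up) of randomizing within blocks:

    # A minimal sketch of a blocked design (hypothetical names and numbers):
    # split the subjects by the blocking variable, then randomize to
    # treatment/control *within* each block.
    import random

    random.seed(1)  # only so the example is reproducible

    subjects = {
        "boys":  ["B1", "B2", "B3", "B4", "B5", "B6"],
        "girls": ["G1", "G2", "G3", "G4", "G5", "G6"],
    }

    assignment = {}
    for block, members in subjects.items():
        shuffled = members[:]          # copy so the original list is untouched
        random.shuffle(shuffled)       # randomize within the block
        half = len(shuffled) // 2
        assignment[block] = {
            "treatment": shuffled[:half],
            "control":   shuffled[half:],
        }

    print(assignment)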

Extraneous and Confounding Variables

When we experiment, we are trying to determine a cause-and-effect relationship between the explanatory variable (factor) and the response variable (what we are measuring). Sometimes, this is easier said than done because there are extraneous variables that we don't consider when conducting the experiment. Extraneous variables are other factors, like gender, age, weight, etc. that are not directly addressed when the experiment is conducted, but still exist anyway.

All experiments have extraneous variables, but they become problematic whenever they affect the response variable. These are called confounding variables because they affect what we are measuring. Why don't we want them? Well, if confounding variables affect the response, AND the explanatory variable affects the response, it is difficult for us to determine exactly WHAT causes changes in the response variable: the confounding variable or the explanatory variable.

Monday, October 28, 2013

Experimental Design

Experiments try to find causal relationships between an explanatory variable and a response variable. In practice, this is very difficult to do, but we try to do this by giving a treatment to our subjects to see how they respond. Of course, it makes sense to have two groups: an experimental group and a control group (to which you don't give the treatment) so that you can observe the effects of the treatment.

We can model an experiment through an experiment diagram. Begin by labeling the subjects of your experiment. Then, through random assignment, place your subjects into experimental groups. Apply your treatment to your subjects and record the response variable. At the end, combine the results and compare between groups to observe the effects of the treatment.
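
If you like seeing the diagram as steps, here's a minimal Python sketch with made-up subject labels that randomly splits 20 subjects into the two groups:

    # A minimal sketch (hypothetical subject labels) of the diagram in code:
    # randomly split the subjects into an experimental group and a control group.
    import random

    random.seed(2)

    subjects = ["S" + str(i) for i in range(1, 21)]   # 20 hypothetical subjects
    random.shuffle(subjects)                          # random assignment

    experimental_group = subjects[:10]   # receives the treatment
    control_group      = subjects[10:]   # does not receive the treatment

    print("Treatment:", experimental_group)
    print("Control:  ", control_group)
    # ...apply the treatment, measure the response in both groups, then compare.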

Saturday, October 26, 2013

Sampling Bias

Bias makes our data inaccurate. We try to minimize or eliminate bias from sampling because we want the most accurate data possible. The three types of bias that we can find are:

1. Selection Bias: this is when we exclude a group (either on purpose or by accident) from having a chance to participate in the sample.

2. Measurement/Response Bias: when we have errors in the way our responses are measured. This could be from math mistakes, from people responding with the wrong answers, or from rounding error.

3. Nonresponse Bias: when people don't respond to a question because they 1) don't know the answer 2) don't feel like writing an answer or 3) may feel pressured or embarrassed to give an answer.

Creating a Survey

We created surveys in class to collect data on whether or not people at FCHS were satisfied with the homecoming activities this year. You turned in 3 things on Friday: a survey of 10 questions, a sampling plan of how you would go about surveying the FCHS population, and an explanation of why your survey was or was not biased. 10 possible points were given for each, giving the assignment a total of 30 points.

In case anyone is wondering, here is how I would have written and conducted this survey:

1. In which homecoming activities did you participate this year? Please check all that apply:
-Parade     
-Football Fest
-PinkOut Day
-Class Day
-Tacky Day
-Blue/White Day
-Character Day
-Pep Rally
-Helped Build a Float
-Helped Decorate a Hallway
-Helped Decorate a Door
-Homecoming Football Game
-Other (Please list) ___________________

2. For the activities listed above, please rate them 1-5 in terms of your overall satisfaction (1 - lowest, 5 - highest).
-Parade     
-Football Fest
-PinkOut Day
-Class Day
-Tacky Day
-Blue/White Day
-Character Day
-Pep Rally
-Helped Build a Float
-Helped Decorate a Hallway
-Helped Decorate a Door
-Homecoming Football Game
-Other (Please list) ___________________

3. Name 1 activity from this year that you would like to see continued next year____________

4. Name 1 activity from this year that could have been done better __________________

5. For your answer to #4, explain why the activity you chose could have been improved __________

6. Do you plan on attending the FCHS Homecoming Football Game tonight?       Yes     No

7. In reference to your answer to #6, if yes, what inspires you to attend? If no, what could be done to inspire you to attend?

8. On a scale of 1-5, 1 - lowest, 5 - highest, rate your own personal sense of school pride/spirit on a regular basis.

9. On a scale of 1-5, 1 - lowest, 5 - highest, rate your own personal sense of school pride/spirit during homecoming week.

10. Any other comments/questions/feedback for next year's homecoming organization team?

Sampling Plan:
I would first use a random number generator to assign each 2nd period classroom a number. I would then use the random number generator to select four classrooms in which to carry out my survey - approx. 100 students. I would not collect the surveys until all answers are filled in to prevent nonresponse bias. I chose 2nd period because some students come in late in the morning, so my reasoning suggests that most if not all students have made it to school by that time. Next, I would use the random number generator to assign each teacher a number, then randomly select 10 teachers to participate in the survey (out of 72 teachers, 10 sufficiently represents the sub-population of teachers).  I would label the counselors, secretaries, and administrators "Other," then use the same procedure as the teachers to randomly select 4 "other" workers to take the survey.
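
For anyone curious, here's a rough Python sketch of the selection steps above; the 72 teachers and the group sizes come from the plan, but the number of 2nd period classrooms and "other" staff members are guesses just for illustration:

    # A rough sketch of the sampling plan's selection steps.
    # Counts marked "assumed" are hypothetical, not from the plan.
    import random

    random.seed(3)

    classrooms = list(range(1, 41))   # assumed: 40 second-period classrooms
    teachers   = list(range(1, 73))   # 72 teachers, each assigned a number
    others     = list(range(1, 21))   # assumed: 20 "other" staff members

    chosen_classrooms = random.sample(classrooms, 4)   # 4 classrooms, approx. 100 students
    chosen_teachers   = random.sample(teachers, 10)    # 10 of the 72 teachers
    chosen_others     = random.sample(others, 4)       # 4 "other" workers

    print(chosen_classrooms, chosen_teachers, chosen_others)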

Bias:
My sampling plan is biased in several ways. Firstly, it is only administered in 2nd period, which excludes the population of students who did not attend second period that day. It also is administered in school, which excludes the homebound population of students. Additionally, I risk not receiving an accurate picture of the population if students from all grade levels were not randomly selected to be represented in my data (I did not purposefully do this, so it is not a source of bias, but I want to take note of it nonetheless).  I took several measures to minimize bias. I chose four clustered classrooms at random during a time of day when many students are present. I gave all teachers and "other" staff members a number so that teachers who do not have classes 2nd period were not excluded from the selection process. I gave specific numbers in my survey to measure satisfaction level of students to eliminate measurement bias.

10 extra points on the next problem set if you write your favorite homecoming-themed day under your answer to question 2!

Thursday, October 24, 2013

Stratified, Systematic, and Clustered Sampling

A stratified sample is broken up into groups of similar objects/people/whatever we are sampling. Then, we take a simple random sample from each group, or stratum. Combined, this is the sample. This is the MOST ACCURATE type of sampling (shout out to Megan Megee for getting this in class!) because we're ensuring that we have a representative from each group, which is not the case in any other type of sample.

Systematic sampling is based on a pattern. We choose a number at random (say, 7) and then take every 7th person/object in the population - that makes up our sample.

A clustered sample divides the population up into groups - but unlike the stratified sample, the groups are not made up of like people/objects. The clusters are filled at random. We use a random number generator to select a cluster, then select either everything in the cluster or take random samples from within that cluster.
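
If it helps to see all three side by side, here is a rough Python sketch with a made-up population of 100 units (the strata labels and cluster splits are invented purely for illustration):

    # Minimal sketches of stratified, systematic, and clustered sampling.
    import random

    random.seed(4)
    population = list(range(1, 101))            # 100 hypothetical units

    # Stratified: split into strata of like units, SRS from each, then combine.
    strata = {"freshmen": population[:25], "sophomores": population[25:50],
              "juniors": population[50:75], "seniors": population[75:]}
    stratified_sample = []
    for group in strata.values():
        stratified_sample += random.sample(group, 5)

    # Systematic: pick a random starting point, then take every 7th unit.
    start = random.randrange(7)
    systematic_sample = population[start::7]

    # Clustered: split into clusters, randomly pick clusters, keep everything in them.
    clusters = [population[i:i + 10] for i in range(0, 100, 10)]
    chosen = random.sample(clusters, 2)
    cluster_sample = [unit for cluster in chosen for unit in cluster]

    print(stratified_sample, systematic_sample, cluster_sample, sep="\n")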

Judgmental samples are biased because they are based on our opinions. The other types of samples are not biased because we use some type of random number generating device to pick them.

Tuesday, October 22, 2013

Sampling Design part 1

Today we began to look at sampling design but didn't make it all the way through due to the activities schedule. We covered the first two types of sampling: convenience/judgmental sampling and simple random sampling.

Samples represent the population, and thus, are smaller than the population. We take samples of data because there are many instances where we cannot, or don't have time to, collect data for the entire population. Convenience samples are based on what is nearby, or what we feel is "random." These are never actually random, since they come from within our minds, and our minds are not random devices. Simple random samples, on the other hand, come from a random number generator. They are random because there is, theoretically speaking, no way to determine who will be chosen for the sample before they are chosen.
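
In code form, a simple random sample is just a random number generator picking from a numbered list; here's a tiny Python sketch with a made-up population of 500 people:

    # A minimal sketch of a simple random sample: every unit gets a number,
    # and a random number generator picks the sample (population size is made up).
    import random

    random.seed(5)
    population = list(range(1, 501))            # 500 hypothetical people, numbered 1-500
    srs = random.sample(population, 25)         # an SRS of size 25
    print(sorted(srs))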

The three other types of samples that we will discuss tomorrow are stratified random samples, systematic random samples, and clustered random samples. Like today, we will do this with the smiley face activity. Bring those sheets tomorrow!

Monday, October 14, 2013

Hints for Problem Set #5

Question 1:
In general, parts A, B, C, and D are four separate questions. They do not all refer back to housing in the Bay Area - only the first one does. Sorry if that was unclear.
Part A: think about what a slope IS, conceptually
Part B: why does it make sense that the slope and the correlation coefficient would have the same sign?
Part C:  Think about what we talked about today in regards to extrapolation
Part D: What type of data must be measured in a scatterplot - and are we given that type of data?

Question Two:
Part C: Think about what must be true about B0 in order for it to not be possible, then examine whether or not this is true within your regression model
Part D: There are a few correct answers for this. My advice is to use statistical language, and use it correctly to justify your answers. BE SPECIFIC, and read your answer aloud to yourself when through to make sure you sound like you're making sense.

Question Three:
Part C: Remember, to calculate a residual: RAP. Residual = Actual - Predicted. To get the predicted values you have to plug your x's into your regression equation. Your actuals are given to you.
Part D: what must be true for an influential observation? where would you expect it to be located on the graph relative to the rest of the data?
Part E: This one comes down to whether or not you can articulate the difference between outliers and influential observations....if you draw your favorite fruit next to this problem I will give you five bonus points. Ten if it's pretty.

Question Four:
Part A: Just describe the correlation: direction and strength of association
Part B: Think about what we talked about in regards to causation. Can we justify a causal relationship here? What other factors may have led to forest fires besides the ones given?

Question Five:
Part A: what variables is the problem talking about? "Constant" is not a variable - it's your y-intercept. Think about the example we did like this in class.
Part B: can't solve this one without an equation...think what features of the chart show you this
Part C: what important stat concept is this describing? Check your notes - we talked about this specific interpretation.

Nonlinear Transformations

We like straight lines. Straight lines are predictable and easy to model. Ideally, we want our data to be linear, i.e. to resemble a straight line, but that doesn't always happen in practice. And if not, we can transform the data a few ways to make it look linear, so that way, we can interpret and make predictions with it more easily.

The four major AP Stat transformations (there are many, many more in college stat and beyond) are squaring x, taking the square root of x, taking log(x), and taking 1/x. Depending on the shape of the data, we pick one of those transformations and change our x-variable (you can do this in the calculator by using L3) so that the scatterplot appears linear.

The evidence that our transformation is a good one comes in three forms. Firstly, the scatterplot between the transformed x and y appears more linear than it did with the regular x and y. Secondly, the r-squared value improves. This is a good thing, because it means more of the variation in our data is explained by the model. More variation explained = more predictive power for our model = more accuracy. Finally, our residual plot should appear more randomly scattered, with a roughly even balance of points above and below the x-axis (although today in class this definitely was not the case!).

Remember: if you can improve the model by getting a higher r-squared, then transforming the data is probably a good idea.
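
If you want to try this outside the calculator, here is a minimal Python sketch with made-up curved data; it compares r-squared before and after a log(x) transformation (the data values are invented just for illustration):

    # A minimal sketch of checking whether a log(x) transformation straightens
    # things out: compute r-squared for y vs x, then for y vs log(x).
    import math

    def r_squared(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return (sxy * sxy) / (sxx * syy)

    x = [1, 2, 4, 8, 16, 32, 64]
    y = [2.1, 3.0, 3.9, 5.2, 5.9, 7.1, 8.0]       # roughly logarithmic in x

    print("r^2 with x:      ", round(r_squared(x, y), 3))
    print("r^2 with log(x): ", round(r_squared([math.log(a) for a in x], y), 3))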

Sunday, October 13, 2013

Influential Observations and Bivariate Outliers

Data will rarely show perfect correlation. We can almost expect that there will be variation in and amongst the data. The more spread there is, the lower our coefficient of determination (r-squared) will be.

Influential observations make our coefficient of determination much lower. They change the slope and the correlation because they fall far from the pattern of the rest of the data. We usually want to remove these from the data because they weaken the model's predictive power. Bivariate outliers are different - they are far from the bulk of the data, but they still lie close to the regression line. They hardly affect the slope of the regression line or the correlation. We can usually leave those within the model because they don't weaken the correlation.
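
If you want to see this numerically, here is a minimal Python sketch with made-up data: it fits the least squares line twice, once with a point that sits far from the pattern of the rest and once without it, and prints the slope and r-squared each time.

    # A small sketch of how one influential observation can drag the slope
    # and r-squared around: fit with and without the point (data is made up).
    def fit(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        slope = sxy / sxx
        r = sxy / (sxx * syy) ** 0.5
        return slope, r * r

    x = [1, 2, 3, 4, 5, 10]               # the last point is far out in x...
    y = [2.0, 4.1, 5.9, 8.2, 9.9, 5.0]    # ...and far below the pattern of the rest

    print("with the point:   ", fit(x, y))
    print("without the point:", fit(x[:-1], y[:-1]))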

Wednesday, October 9, 2013

2-Day Balloon Launching Experiment

For today and tomorrow, we are launching balloons (and popping them, if your name is Shalaunda Mosley) from various heights and testing whether or not there is a relationship between height and time for the balloon to make it to the ground. There should be 42 observations in your data - 6 from each height level.

Hint: Most likely, your data will not have particularly strong correlation. This does not mean that your experiment failed, but rather, that those two variables probably don't have a particularly strong linear relationship with each other. This is due to the variability in the way the balloons fall. Since the balloons don't fall straight to the ground, there's a lot more variability in the path each one takes (and thus, the time it takes to travel that path) to get to the ground...think about this when you write your responses tomorrow.

Tuesday, October 8, 2013

Coefficient of Determination

All data has variability. If the points on a scatterplot are closer together, this means that the variability is low, but if they are more spread out, then the variability is high. When variability is high, it's hard to use a model to make predictions, since we're less sure about where our data will be. When variability is low, it's easier to use a model to make predictions, since we know our points are more likely to end up closer together, and therefore become more accurate.

R-Squared (awkward, I can't make exponents on the blog) is called the coefficient of determination. While r measures the correlation of the data, r-squared measures how much of the variability is explained by the model. We interpret r-squared as "the proportion of variation in (y-variable) that can be explained by a linear relationship with (x-variable)." If r-squared is high, this serves as strong evidence that a least squares regression line is a good fit for the data. If r-squared is low, then a least squares regression line is not a good fit for the data (and we might have to transform it to a different type of model - more on that next week).
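
If it helps, here is a minimal Python sketch with made-up numbers (the "hours studied" and "test score" labels are purely for illustration) that computes r and r-squared by hand:

    # A minimal sketch of computing r and r-squared from raw data.
    x = [1, 2, 3, 4, 5, 6]                  # e.g. hours studied (hypothetical)
    y = [55, 61, 68, 70, 78, 83]            # e.g. test score (hypothetical)

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)

    r = sxy / (sxx * syy) ** 0.5
    print("r =", round(r, 3), " r^2 =", round(r * r, 3))
    # read r^2 as: "about (r^2 * 100)% of the variation in test score can be
    # explained by the linear relationship with hours studied"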

Residuals and Residual Plots

A least squares regression line is the "average" of all the data points - it goes through the middle of everything, after all, so it acts like a measure of center. There are points above and below the regression line - not everything is directly on the line (if it were, we would have a perfect correlation of 1 or -1). When we take the actual data values (the points) and subtract what the line predicts we will have (the predicted), we are left with the residual value. The residual value is the vertical distance (in the y-direction) from the data point to the regression line.

Residual = Actual - Predicted (RAP!)

We can find the residuals in the TI-84 by making a regression line. After you enter your data, hit stat - calc - LinReg, which will bring up your regression equation. Then go to make a scatterplot, but instead of L2 in your y-variable, go to 2nd - list and put in #7 - RESID. Then zoom9 to make the plot.

Note - if you don't make the regression line first, you won't have the residuals programmed in the calculator, and it won't work.

We look for 3 features in the residual plot to tell if the regression line is a good fit for the data. 1) Random scatter, 2) approx equal points above and below the x axis, and 3) no outliers (if we have one, we should consider removing it to improve the model). If we see patterns or unequal distribution of points, this is a sign that a linear regression model might not be the best fit to use for our data.
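
If you'd rather check this outside the calculator, here is a rough Python sketch with made-up data; it uses the outside matplotlib library for the plot, and the numbers are purely for illustration:

    # A minimal sketch of RAP and the residual plot, as an alternative to the
    # TI-84 steps above (made-up data; matplotlib is only needed for the plot).
    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5, 6]
    y = [2.3, 3.9, 6.2, 7.8, 10.1, 12.2]

    # least squares slope and intercept
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx

    residuals = [actual - (b0 + b1 * a) for a, actual in zip(x, y)]   # RAP

    plt.scatter(x, residuals)    # look for random scatter, balance, and no outliers
    plt.axhline(0)
    plt.show()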

Saturday, October 5, 2013

Least Squares Regression Lines and the Correlation Coefficient, r

Bivariate, quantitative data is displayed in a scatterplot. Scatterplots show us the extent to which the random variables x and y are related (or not related) to one another. We typically display this relationship through a least squares regression line.

The least squares regression line comes from the equation of a line from algebra 1: instead of y = mx + b, where m is the slope and b is the y-intercept, AP stats likes to use y = b1x + b0, where b1 is the slope and b0 is the y-intercept (this is because when you have more than 1 variable, they start labeling the coefficients b2x, b3x, b4x....etc. so that they can keep track of how many variables are in the regression equation).

b1 is interpreted as: "the change in the (y-variable) is (slope) given a one-unit (whatever x is measured in) change in (x variable)."
b0 is interpreted as: "the amount of (y-variable) is (y-intercept) whenever the (x-variable) is zero."

The correlation coefficient r tells us the extent to which x and y are linearly related to one another. If the absolute value of r is high, there is a strong relationship, and conversely, if the absolute value of r is low, there is a weak relationship.
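
If you like seeing the formulas worked out, here is a minimal Python sketch with made-up practice/score numbers that computes b1, b0, and r by hand:

    # A minimal sketch of fitting y-hat = b1*x + b0 and finding r from raw data.
    x = [2, 4, 6, 8, 10]             # e.g. hours of practice (hypothetical)
    y = [30, 38, 49, 56, 65]         # e.g. quiz score (hypothetical)

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)

    b1 = sxy / sxx                  # slope: change in y per one-unit change in x
    b0 = my - b1 * mx               # y-intercept: predicted y when x is zero
    r  = sxy / (sxx * syy) ** 0.5   # correlation coefficient

    print("b1 =", round(b1, 3), " b0 =", round(b0, 3), " r =", round(r, 3))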

Thursday, October 3, 2013

Probability Test

You took your probability test on Wednesday and the next unit (beginning Friday) will focus on linear regression modeling (think bivariate data, scatterplots, and lines of best fit).

The tests were...much better than I expected! Shout out to Candace Latham and Josh Reynolds for making 100% or above on the test - that RARELY happens so FANTASTIC WORK! Another Shout out to Gambia Mosby for being the only one to guess what age I will be turning on my birthday - 25 (I am currently 24, but the question was how old will I be turning....)

Quite a few people got 10/10 on the multiple choice which was very good as well. Highest score was a 118%, the lowest was a 39%.

Hope you all had fun on the field trip in Little Rock!

Tuesday, October 1, 2013

Test Tomorrow

We reviewed for the test tomorrow with a concept map and some great examples of probability questions. Answers to 1 and 2 on your review page were given in class. Here are the answers and some explanations for 3, 4, and 5. DEFINITELY take a look at 5 and make sure that you understand it...

Number 4...

And finally, Question 5.....