Anonymous:
Specific Question: What should we do to perform ANOVA or ANCOVA if our data violate the assumptions?
When I read the Wikipedia entries on ANOVA and ANCOVA, the articles state that the data need to fulfill certain requirements in order for us to do the analysis. The assumptions the articles state are:
1. Independence of cases - this is a requirement of the design.
2. Normality - the distributions in each of the groups are normal.
3. Equal variances or homoscedasticity - the variance of data in groups should be the same.
My question is: what should I do if a data set we have violates one of the three assumptions, or all three? One thing I can think of to cope with the normality issue is to transform the data so that it is “somehow” normal, but I have difficulty finding a way to “change” the data so that it fulfills the other assumptions. If nothing can be done, should we just discard the data set and not do the ANOVA or ANCOVA test, or are there other methods that we can use?
Assumptions are never satisfied exactly, for any statistical method, so if the requirement were that the assumptions be satisfied, we would never do anything. Rather, you should use a model (a set of assumptions) that produces data reasonably like the data you observe. If the data produced by the model are reasonably similar to the data that you observe, then you can use that model, since the conclusions resulting from the model will be reasonably accurate. If the data produced by the model are dramatically different from those you observe, then the conclusions obtained by using such a model are suspect.
You can assess how suspect the results are with a simulation study: simulate data from a model that allows, say, nonnormality or dependence, then perform the analysis (ANOVA or ANCOVA or anything at all) and see how far the results are from what you expect. For example, how far are the true Type I error rates from .05 when the model assumptions are violated? We did some of this in ISQS 5347.
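As a rough illustration of such a simulation study, here is a minimal Python sketch. It is not the code we used in ISQS 5347; the function names, sample sizes, standard deviations, and number of simulations are all assumptions chosen for illustration. To stay within the standard library it handles exactly three groups, where the F tail area has a simple closed form. Every scenario has equal group means, so any rejection is a Type I error.

```python
import random
from statistics import fmean


def anova_p_value_3_groups(groups):
    """One-way ANOVA p-value for exactly three groups.

    With 2 numerator degrees of freedom the F tail area is
    P(F > f) = (1 + 2*f/d2) ** (-d2/2), so no stats library is needed.
    """
    assert len(groups) == 3, "closed form below is valid only for 3 groups"
    n_total = sum(len(g) for g in groups)
    means = [fmean(g) for g in groups]
    grand_mean = fmean(x for g in groups for x in g)
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    d2 = n_total - 3
    f_stat = (ss_between / 2) / (ss_within / d2)
    return (1 + 2 * f_stat / d2) ** (-d2 / 2)


def type1_error_rate(draw_groups, n_sims=2000, alpha=0.05, seed=1):
    """Fraction of simulated null data sets for which ANOVA rejects at alpha."""
    rng = random.Random(seed)
    rejections = sum(
        anova_p_value_3_groups(draw_groups(rng)) < alpha for _ in range(n_sims)
    )
    return rejections / n_sims


# Hypothetical scenarios; all have equal population means (the null is true).
def normal_equal_var(rng):
    """All ANOVA assumptions hold."""
    return [[rng.gauss(0, 1) for _ in range(10)] for _ in range(3)]


def skewed_equal_var(rng):
    """Nonnormal (exponential) data with equal variances."""
    return [[rng.expovariate(1.0) for _ in range(10)] for _ in range(3)]


def unequal_var_unequal_n(rng):
    """The smallest group has the largest variance."""
    sizes_and_sds = [(5, 3.0), (25, 1.0), (25, 1.0)]
    return [[rng.gauss(0, sd) for _ in range(n)] for n, sd in sizes_and_sds]


for name, scenario in [
    ("normal, equal variances", normal_equal_var),
    ("exponential, equal variances", skewed_equal_var),
    ("unequal variances, unequal n", unequal_var_unequal_n),
]:
    print(f"{name}: estimated Type I error rate = {type1_error_rate(scenario):.3f}")
```

Compare each printed rate with the nominal .05. A rate far from .05 suggests that the usual ANOVA is not trustworthy for data like those in that scenario.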
We also know from the central limit theorem, which was discussed in ISQS 5347, that some tests are robust to non-normality; hence normality is not absolutely essential. Other assumptions, like constant variance and independence, are not absolutely essential either, but the usual procedures (e.g., ANOVA) can have poor operating characteristics (e.g., true Type I error rates far from .05, low power, true confidence levels far from .95) when the assumptions are badly violated, so in that case you need to use an alternative model.
This issue is covered in a lot more detail in the regression class, which is why I decided not to cover it in this class. This question is really more of a "general" question than a specific one, because it does not refer to specific details of what we covered in class.
100 100 30 70
General Question: How do we make the connection between tree splitting and the Bonferroni method?
When we create a decision tree in data mining, we can use the Bonferroni method to increase the accuracy of the tree. From my understanding, Bonferroni stops the tree from growing too big, because a branch will not make any split whose p-value is greater than the threshold p-value. However, it is hard for me to put the tree splitting concept and the Bonferroni method together to see how they make a better tree. Can you help me with this?
You need to give a little more context here for the benefit of the reader. What is tree splitting? Describe it. Others don't know.
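Anyway, with tree splitting you can choose to put the split point anywhere along the range of X values. If you have 1000 distinct values for an X variable, then there are about 1000 possible splits (999, strictly speaking), leading to about 1000 possible two-sample tests to compare the mean of the Y variable for the two groups. Here k=1000, so the Bonferroni method compares each split's p-value to the threshold divided by k (for example, .05/1000) rather than to the threshold itself.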
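To make the connection concrete, here is a minimal Python sketch of a Bonferroni-adjusted split search. It illustrates the idea only and is not how any particular data mining package does it. The function names, the `min_leaf` restriction, and the large-sample z approximation used for the two-sample test are all assumptions made to keep the example within the standard library. Note that k here is the number of tests actually performed.

```python
import math
import random
from statistics import fmean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Assumption: uses a large-sample z approximation (unequal variances),
    not an exact t test.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def best_split(x, y, alpha=0.05, min_leaf=10):
    """Try every split between distinct x values; Bonferroni-adjust the best one."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    candidates = []
    for i in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[i - 1] < xs[i]:  # only split between distinct x values
            p = two_sample_p_value(ys[:i], ys[i:])
            candidates.append((p, (xs[i - 1] + xs[i]) / 2))
    if not candidates:
        return None
    k = len(candidates)  # number of two-sample tests performed
    p_min, split_at = min(candidates)
    return {
        "k": k,
        "split_at": split_at,
        "p_min": p_min,
        "p_bonferroni": min(1.0, k * p_min),
        # Split only if the smallest p-value beats alpha / k.
        "split": k * p_min < alpha,
    }


# Hypothetical data: 200 evenly spaced x values.
rng = random.Random(0)
x = [i / 200 for i in range(200)]

# Case 1: Y is pure noise, unrelated to X.
y_noise = [rng.gauss(0, 1) for _ in x]
print("noise: ", best_split(x, y_noise))

# Case 2: the mean of Y really does jump at x = 0.5.
y_signal = [(5.0 if xi >= 0.5 else 0.0) + rng.gauss(0, 1) for xi in x]
print("signal:", best_split(x, y_signal))
```

In the noise case, compare `p_min` with `p_bonferroni`: the smallest of many p-values can look impressive purely by chance, and multiplying by k (equivalently, comparing to alpha/k) is what guards against splitting on that chance pattern.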
100 70 80 90