My learning journey in Capstone in Statistics and Data
Taking a Capstone course in Statistics and Data has been a challenging but rewarding experience for me. The course has provided me with a comprehensive understanding of statistical methods and their applications in data analysis.
At the beginning of the course, I was introduced to advanced statistical concepts such as multivariate analysis, time-series analysis, and Bayesian statistics. These concepts were new to me, and I found it challenging to understand the underlying theory and their practical applications. However, with the guidance of my instructor and the support of my peers, I was able to gain a deeper understanding of these concepts.
Throughout the course, I was required to apply my knowledge to real-world data sets. For example, in one assignment, I analyzed a dataset of hospital patients to investigate the relationship between patient satisfaction and hospital quality. I used multivariate analysis techniques such as regression and factor analysis to analyze the data and draw conclusions. Through this assignment, I learned how to identify and control for confounding variables and how to interpret the results of statistical analyses.
In another assignment, I analyzed a time-series dataset of the stock market to predict future market trends. I used time-series analysis techniques such as autoregression and moving averages to make predictions. This assignment taught me the importance of selecting appropriate models and how to validate and test these models to ensure their accuracy.
One of the highlights of the course was a group project where we had to design and execute our own research study. My group chose to investigate the impact of COVID-19 on mental health among college students. We collected data through surveys and used Bayesian statistics to analyze the results. Through this project, I gained valuable experience in designing research studies, collecting data, and using advanced statistical methods to analyze the results.
Throughout the course, I also learned how to use statistical software such as R and Python to analyze data. This is a valuable skill that I expect to rely on in future research projects.
I found some problems quite interesting:
Probability:
A fair coin is tossed repeatedly until 2 tails are observed. (The 2 tails need not be consecutive.) What is the probability that at least 2 heads were observed?
Solution:
Let H be the number of heads observed before the second tail.
The last toss is always the second tail, so we only need to count how the earlier tosses can be arranged.
P(H = 0) = P(TT) = (1/2)^2 = 1/4
P(H = 1) = P(HTT) + P(THT) = 2 x (1/2)^3 = 1/4
P = P(H >= 2) = 1 - 1/4 - 1/4 = 1/2 = 0.5
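A quick way to sanity-check an answer to this kind of problem is to compute it exactly and then compare against a simulation. The sketch below does both for the coin problem as stated (tails need not be consecutive). The function names, trial count, and seed are my own illustrative choices. The two results should agree closely, landing at 1/2.

```python
import random
from fractions import Fraction


def exact_prob_at_least_two_heads() -> Fraction:
    """Exact P(at least 2 heads are seen before the 2nd tail), fair coin."""
    half = Fraction(1, 2)
    p_zero_heads = half ** 2       # only sequence: TT
    p_one_head = 2 * half ** 3     # sequences: HTT and THT
    return 1 - p_zero_heads - p_one_head


def simulate(trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = tails = 0
        # Keep tossing until the second tail appears.
        while tails < 2:
            if rng.random() < 0.5:
                heads += 1
            else:
                tails += 1
        if heads >= 2:
            hits += 1
    return hits / trials


print("exact:", exact_prob_at_least_two_heads())
print("simulated:", simulate(20_000))
```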
Statistics Theory:
Assume we don’t actually get to observe X1,…,Xn. Instead, we only observe the random variables Yi = 1(Xi <= 32), which indicate whether lightbulb i has failed before 32 thousand hours.
Expectation of Y
Compute E [Y1]
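E[Y1] = P(X1 <= 32), because Y1 is an indicator variable. If the lifetimes Xi are exponential with rate λ, this equals 1 - e^(-32λ).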
Distribution of Y
What kind of distribution does Yi = 1(Xi <= 32) follow?
o The same distribution that X follows
o Bernoulli
o Binomial
o Normal
Identifiability
Is the parameter λ identifiable in this model?
o Yes
o No
o Not enough information to determine
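To build intuition for these three questions, here is a minimal simulation sketch. It assumes the lifetimes Xi are exponential with rate λ; that is my assumption, made because λ is the only parameter the problem names, and the rate 0.05 used below is purely illustrative. The sketch generates the 0/1 indicators, compares their sample mean with P(X <= 32), and shows that this probability can be inverted to get λ back.

```python
import math
import random


def prob_fail_by(threshold: float, lam: float) -> float:
    """P(X <= threshold) for X ~ Exponential(rate=lam) (assumed model)."""
    return 1.0 - math.exp(-lam * threshold)


def simulate_indicator_mean(lam: float, threshold: float, n: int, seed: int = 0) -> float:
    """Sample mean of Y_i = 1(X_i <= threshold) over n simulated lifetimes."""
    rng = random.Random(seed)
    ys = [1 if rng.expovariate(lam) <= threshold else 0 for _ in range(n)]
    return sum(ys) / n


def recover_lambda(p: float, threshold: float) -> float:
    """Invert p = 1 - exp(-lam * threshold) to solve for lam."""
    return -math.log(1.0 - p) / threshold


lam = 0.05  # hypothetical rate, per thousand hours
p = prob_fail_by(32, lam)
print("P(X <= 32):", p)
print("sample mean of Y:", simulate_indicator_mean(lam, 32, 20_000))
print("lambda recovered from p:", recover_lambda(p, 32))
```

Under this assumed model, each Yi takes only the values 0 and 1, and the map from λ to P(X <= 32) is one-to-one, which is what the last function relies on.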
Statistics:
In the problems in this exam, you will investigate various aspects of the following data set, and answer questions such as how different factors affect the health insurance cost of a person in the United States.
The description of the field is listed below:
age: the age of the person
sex: binary variable describing the sex of the person (1 if male, 0 if female)
bmi: body mass index (BMI) of the person, given by weight divided by height squared (in units of kg/m²)
children: number of children that person has
smoker: binary variable indicating whether the person is a smoker (1) or not (0)
region: the region where the person lives
charge: the amount of insurance cost charged to the person (in US dollars)
Preliminaries
Load the data set (insurance.csv) and answer the following questions about it.
1. How many observations are there in the data set?
Answer: 1338
2. How many people are smokers?
Answer: 274
3. What is the sample mean of the insurance charges?
(Enter numerical answers correct to the nearest integer)
Answer: 13270
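The three preliminary questions can be answered with a few lines of code. The sketch below uses only Python's standard csv module and assumes the column names and coding from the field description above (smoker coded as 1/0, cost stored in a column named charge). The tiny in-memory sample is made up so the block runs without the real file.

```python
import csv
import io
from typing import TextIO


def summarize(handle: TextIO) -> dict:
    """Count rows, count smokers, and compute the mean charge (nearest integer)."""
    rows = list(csv.DictReader(handle))
    n = len(rows)
    # Assumes smoker is coded "1" for smokers, as in the field description.
    smokers = sum(1 for r in rows if r["smoker"].strip() == "1")
    mean_charge = sum(float(r["charge"]) for r in rows) / n
    return {"observations": n, "smokers": smokers, "mean_charge": round(mean_charge)}


# Made-up three-row sample, not taken from insurance.csv.
sample = io.StringIO("age,smoker,charge\n30,1,100\n41,0,200\n52,0,301\n")
print(summarize(sample))
```

With the real file, the call would look like `with open("insurance.csv", newline="") as f: print(summarize(f))`. If your copy of the data uses different column names or codes smokers as text, the two lookups inside summarize need adjusting.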
Machine Learning:
For each of the questions below, select the best option.
1 (a)
Which of the following is/are difference(s) between supervised and unsupervised learning? (Choose all that apply.)
o Supervised learning uses linear models only, but unsupervised learning uses both linear and non-linear models.
o Supervised learning uses labeled data. Unsupervised learning uses unlabeled data.
o In supervised learning, we minimize or maximize an objective function. In unsupervised learning, we do not need to solve any minimization or maximization problems.
1 (b)
What is the purpose of regularization?
o To decrease training error
o To increase testing error
o To prevent overfitting
o To tune hyperparameters
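To make the idea of regularization concrete, here is a small sketch using ridge (L2) regression with a single feature and no intercept, which is just one common form of regularization. The data points and penalty values are made up for illustration. For this simple case, minimizing sum((y - w*x)^2) + alpha * w^2 gives the closed form w = sum(x*y) / (sum(x^2) + alpha), so a larger penalty alpha pulls the fitted slope toward zero.

```python
def ridge_slope(xs, ys, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Made-up data roughly following y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

# alpha = 0 is ordinary least squares; larger alpha shrinks the slope.
for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_slope(xs, ys, alpha))
```

The shrinkage makes the model less sensitive to noise in the training data, at the cost of a slightly worse fit to that data.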
Overall, taking Capstone in Statistics and Data has been a challenging but rewarding experience. I feel much more confident in my ability to analyze and interpret complex data sets, and I am excited to apply these skills in my future academic and professional endeavors.
To view the full journey, please visit: