Posts

Showing posts from 2019

Project: Building a Predictive Model

You are a data scientist working for the University of South Florida. Your boss wants to develop a predictive model that automatically predicts students' graduation rates from several factors (variables). You have the College dataset (College.csv), which is also available in the ISLR package.  R code Studio

Final Project

Step 1
Data set: College.csv. Statistics for a large number of US colleges from the 1995 issue of US News and World Report. This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. It was used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition.
Project goal: using the College data set from the ISLR package, I want to predict students' graduation rates from several factors (variables).
Step 2
Hypothesis: The fraction of new students from the top 10% of their high school class predicts the graduation rate better than the fraction from the top 25%.
Null hypothesis: The fraction of new students from the top 10% of their high school class does not predict the graduation rate better than the fraction from the top 25%.
Step 3
R code: I'm going to use the variables for public schools only from the College data set:...
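One way to put the Step 2 hypothesis to a first test is to fit two single-predictor linear models and compare how much of the variation in graduation rate each explains. The sketch below is only an illustration, not the code from the post: the helper name `compare_top_predictors` is hypothetical, and it assumes the column names `Grad.Rate`, `Top10perc`, `Top25perc`, and `Private` used by `ISLR::College`.

```r
# Hypothetical helper: fits Grad.Rate on each predictor separately
# and returns the R-squared of both models for comparison.
compare_top_predictors <- function(colleges) {
  fit10 <- lm(Grad.Rate ~ Top10perc, data = colleges)
  fit25 <- lm(Grad.Rate ~ Top25perc, data = colleges)
  c(Top10perc = summary(fit10)$r.squared,
    Top25perc = summary(fit25)$r.squared)
}

# Usage, assuming the ISLR package is installed:
# keep public schools only, then compare the two predictors.
if (requireNamespace("ISLR", quietly = TRUE)) {
  public <- subset(ISLR::College, Private == "No")
  print(compare_top_predictors(public))
}
```

A higher R-squared for `Top10perc` would be consistent with the hypothesis, although comparing R-squared values alone is not a formal test.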

Time Series

Time Series in R
Using the Tampa weather data set to create a time series.
R CODE:
## create data for the rainfall
rain2015 <- c(-3, 41, 33, 6, 14.6, 28.2, 21.4, 1.81, 15.60, 0.52, 2.90)
rain1995 <- c(0, 60, 46, 16, 21.2, 32.6, 26.9, 3.66, 24.20, 0.93, 5.60)
## store each vector as a time series and print it
rrain2015 <- ts(rain2015)
rrain1995 <- ts(rain1995)
rrain1995
rrain2015
## set up a monthly time series for the 2015 rainfall
rain2015.timeseries <- ts(rain2015, start = c(2015, 1), frequency = 12)
## print the 2015 rainfall series
print(rain2015.timeseries)
## plot the rainfall for 2015
plot.ts(rrain2015)
plot.ts(rain2015.timeseries)
## log() of the negative first value (-3) returns NaN with a warning
lograin2015 <- log(rain2015)
plot.ts(lograin2015)
## plot multiple time series: one column per year (11 values each)
combined.rainfall <- matrix(c(rain1995, rain2015), ncol = 2)
rainfall.timeseries <- ts(combined.rainfall, start = c(2015, 1), frequency = 12)
print(rainfall.timeseries) ...

Hypothesis Testing and Correlation Analysis

The director of manufacturing at a cookie company needs to determine whether a new machine is producing a particular type of cookie according to the manufacturer's specifications, which indicate that the cookies should have a mean breaking strength of 70 pounds and a standard deviation of 3.5 pounds. A sample of 49 cookies reveals a sample mean breaking strength of 69.1 pounds.
A. State the null and alternative hypotheses. H0: μ >= 70 and H1: μ < 70.
B. Is there evidence that the machine is not meeting the manufacturer's specifications for average strength? Use a 0.05 level of significance. Because the sample is random and reasonably large (n = 49), the sampling distribution of the mean is approximately normal.
C. Compute the p-value and interpret its meaning. z = (x̄ - μ) / (σ / sqrt(n)) = (69.1 - 70) / (3.5 / sqrt(49)) = -1.80. For the one-sided test stated in A, the critical value at the 0.05 level is -1.645. Since -1.80 < -1.645, the statistic falls in the rejection region and H0 is rejected; the p-value is P(Z < -1.80) ≈ 0.036.
D. What would be your answer in (B) if the standard deviation were specified...
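The z statistic and p-value in part C can be checked with a few lines of R. This is a minimal sketch, assuming the population standard deviation is known (σ = 3.5) so that a z test applies; the variable names are illustrative and do not come from the original post.

```r
# Values taken from the problem statement
xbar  <- 69.1   # sample mean breaking strength (pounds)
mu0   <- 70     # mean under the null hypothesis
sigma <- 3.5    # specified population standard deviation
n     <- 49     # sample size

# z statistic: distance of the sample mean from mu0 in standard errors
z <- (xbar - mu0) / (sigma / sqrt(n))

# One-sided (lower-tail) p-value for H1: mu < 70
p_lower <- pnorm(z)

# Two-sided p-value, for comparison, if H1 were mu != 70
p_two_sided <- 2 * pnorm(-abs(z))

round(c(z = z, p_lower = p_lower, p_two_sided = p_two_sided), 4)
```

The lower-tail p-value is below 0.05 while the two-sided one is not, so the conclusion depends on which alternative hypothesis is chosen.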

Confidence Interval Estimation and Introduction to the Fundamentals of Hypothesis Testing

1. If x̄ = 85, σ = 8, and n = 64, set up a 95% confidence interval estimate of the population mean μ.
For 95% confidence, z = 1.96 (the 1 - 0.05/2 = 0.975 quantile of the standard normal).
Sample mean = x̄ = 85
Margin of error = z*σ/sqrt(n) = (1.96*8)/sqrt(64) = 1.96
Lower limit = 85 - 1.96 = 83.04
Upper limit = 85 + 1.96 = 86.96
CI = (83.04, 86.96)
2. If x̄ = 125, σ = 24, and n = 36, set up a 99% confidence interval estimate of the population mean μ.
For 99% confidence, z = 2.576 (the 1 - 0.01/2 = 0.995 quantile of the standard normal).
Margin of error = z*σ/sqrt(n) = (2.576*24)/sqrt(36) = 10.30
Lower limit = 125 - 10.30 = 114.70
Upper limit = 125 + 10.30 = 135.30
CI = (114.70, 135.30)
3. The manager of a supply store wants to estimate the actual amount of paint contained in 1-gallon cans purchased from a nationally known manufacturer. It is known from the manufacturer's specification sheet that the standard deviation of the amount of paint is equal to 0.02 gallon. A random sample of 50 cans is selected, and the sample mean amount of paint per 1-gallon can is 0.99 gallon.
3a. Set up a 99% confidence inter...
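The interval calculations in problems 1 and 2 follow the same pattern, so they can be wrapped in a small function. This is a sketch under the assumption that σ is known and a z interval is appropriate; the function name `z_interval` is hypothetical, not part of the original assignment.

```r
# Hypothetical helper: z-based confidence interval for a mean
# when the population standard deviation sigma is known.
z_interval <- function(xbar, sigma, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # critical value, e.g. about 1.96 for 95%
  margin <- z * sigma / sqrt(n)     # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Problem 1: x-bar = 85, sigma = 8, n = 64, 95% confidence
ci1 <- round(z_interval(85, 8, 64, level = 0.95), 2)
ci1   # approximately 83.04 and 86.96

# Problem 2: x-bar = 125, sigma = 24, n = 36, 99% confidence
ci2 <- round(z_interval(125, 24, 36, level = 0.99), 2)
ci2   # approximately 114.70 and 135.30
```

Using `qnorm` instead of a table value avoids rounding the critical value by hand.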

Random variable(s) & Probability Distribution(s)

Assignment # 5
1) Variance and Standard Deviation of a Discrete Random Variable
We were given two probability distribution tables and asked to find the variance and standard deviation of each.
Table #1
x    p(x)
0    0.50
1    0.20
2    0.15
3    0.10
Table #2
x    p(x)
1    0.10
3    0.20
5    0.60
4    0.20
R Studio Coding
Table #1 is represented by x and Table #2 is represented by y.
x <- c(0.5, 0.2, 0.15, 0.10)
y <- c(0.10, 0.2, 0.6, 0.2)
Then simply find the variance and standard deviation of each vector by using the var and sd functions.
varianceX <- var(x)
standardX <- sd(x)
varianceY <- var(y)
standardY <- sd(y)
You can print the results:
varianceX
varianceY
standardX
standardY
R reports that the variance ...
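Note that `var()` and `sd()` applied to a vector of probabilities return the sample variance and standard deviation of those probability values themselves. The variance of a discrete random variable is usually defined instead as the probability-weighted sum Σ (x - μ)² p(x), with μ = Σ x p(x). The sketch below shows that calculation; the function name `discrete_moments` is hypothetical, and the example probabilities are made up for illustration so that they sum to 1 (the probabilities printed in Table #1 sum to 0.95).

```r
# Hypothetical helper: mean, variance, and standard deviation of a
# discrete random variable given its values and their probabilities.
discrete_moments <- function(values, probs) {
  stopifnot(length(values) == length(probs),
            abs(sum(probs) - 1) < 1e-8)      # probabilities must sum to 1
  mu <- sum(values * probs)                  # expected value
  variance <- sum((values - mu)^2 * probs)   # probability-weighted variance
  c(mean = mu, variance = variance, sd = sqrt(variance))
}

# Illustrative (hypothetical) distribution, not the assignment's table
x_vals  <- c(0, 1, 2, 3)
x_probs <- c(0.5, 0.2, 0.2, 0.1)
moments <- discrete_moments(x_vals, x_probs)
moments   # mean 0.9, variance 1.09, sd about 1.044
```

The `stopifnot` check is a design choice: it refuses tables whose probabilities do not form a valid distribution rather than silently computing a number.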

Probability Theory

A. Based on Table 1, what is the probability of:
        B    B1
A      10    20
A1     20    40
A1. Event A
A2. Event B?
A3. Event A or B
A4. P(A or B) = P(A) + P(B)
B. Applying Bayes' Theorem
Jane is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on the day of Jane's wedding?
Solution: The sample space is defined by two mutually exclusive events: it rains or it does not rain. Additionally, a third event occurs when the weatherman predicts rain. Notation for these events appears below.
Event A1. It rains on Jane's wedding.
Event A2. It does not rain on Jane's wedding.
Event B. The weatherman predicts rain.
In terms of probabilities, we know t...
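Both parts of this exercise reduce to a few arithmetic steps, which can be sketched in R. The sketch assumes Table 1 has rows A and A1 and columns B and B1, and that "5 days each year" means a prior of 5/365; the variable names are illustrative only.

```r
# Part A: counts from Table 1 (assumed layout: rows A, A1; columns B, B1)
counts <- matrix(c(10, 20,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "A1"), c("B", "B1")))
total <- sum(counts)

p_a       <- sum(counts["A", ]) / total    # P(A)
p_b       <- sum(counts[, "B"]) / total    # P(B)
p_a_and_b <- counts["A", "B"] / total      # P(A and B)
p_a_or_b  <- p_a + p_b - p_a_and_b         # general addition rule

# Part B: Bayes' theorem for rain given a rain forecast
p_rain                <- 5 / 365   # prior: assumes a 365-day year
p_forecast_given_rain <- 0.90      # forecast says rain when it rains
p_forecast_given_dry  <- 0.10      # forecast says rain when it is dry

# Total probability of a rain forecast
p_forecast <- p_forecast_given_rain * p_rain +
  p_forecast_given_dry * (1 - p_rain)

# Posterior probability of rain given the forecast
p_rain_given_forecast <- p_forecast_given_rain * p_rain / p_forecast

round(c(p_a = p_a, p_b = p_b, p_a_or_b = p_a_or_b,
        p_rain_given_forecast = p_rain_given_forecast), 4)
```

Under this layout A and B overlap, so P(A or B) is smaller than P(A) + P(B); the simple sum in A4 would hold only for mutually exclusive events.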

Bivariate Analysis

1. This analysis examines the association between boarding screeners and security violations. The sample size is n = 20, with a mean of 261.2 for boarding screeners and a mean of 252.5 for security violations.
boarding <- c(287, 243, 237, 227, 247, 264, 247, 247, 251, 254, 277, 303, 285, 254, 280, 264, 261, 292, 248, 253)
secruity <- c(271, 261, 230, 225, 236, 252, 243, 247, 238, 274, 256, 305, 273, 234, 261, 265, 241, 292, 228, 252)
cor.test(boarding, secruity)

        Pearson's product-moment correlation

data:  boarding and secruity
t = 6.5033, df = 18, p-value = 4.088e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6276251 0.9339189
sample estimates:
      cor
0.8375321

plot(boarding, secruity)
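The t statistic and p-value in the cor.test output can be reproduced from the reported correlation alone, using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below takes r and n from the output above; the variable names are illustrative.

```r
# Values taken from the cor.test output
r <- 0.8375321   # sample correlation
n <- 20          # sample size

# Test statistic for H0: true correlation equals 0
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

c(t = t_stat, df = df, p_value = p_value)
```

The result should agree with the reported t = 6.5033 on 18 degrees of freedom, up to rounding of r.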

Descriptive Statistics

In this week's assignment I was tasked with computing the mean, mode, median, range, interquartile range, variance, and standard deviation for two data sets, each with a sample size of n = 7. From what I observed, the Y data have a higher mean and median than the X data. The mode reported for both X and Y is "numeric"; this is the storage type that R's mode() function returns, not a statistical mode, so it does not by itself show whether either data set has a repeated value. Both data sets have the same variance, and therefore the same spread around their means. The quartiles of the Y data are much higher than those of the X data. Below is my code for the X and Y data.  R Code
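Base R has no built-in function for the statistical mode, so a small helper is needed to find the most frequent value. The sketch below is one possible approach, not the code from the original post; the function name `stat_mode` and the seven-value vector are hypothetical, since the assignment's data are not shown here.

```r
# Hypothetical helper: returns the most frequent value(s) of x,
# or NULL when no value occurs more than once.
stat_mode <- function(x) {
  counts <- table(x)
  if (max(counts) == 1) return(NULL)   # every value is unique: no mode
  as.numeric(names(counts)[counts == max(counts)])
}

# Illustrative data with n = 7 (made up for this example)
x <- c(10, 2, 3, 2, 4, 2, 5)

mode(x)        # storage type of the vector, "numeric"
stat_mode(x)   # statistical mode: the value 2 appears three times

# The other descriptive statistics use base R functions directly
stats <- c(mean = mean(x), median = median(x),
           range = diff(range(x)), iqr = IQR(x),
           variance = var(x), sd = sd(x))
stats
```

Returning NULL for data with no repeated value keeps "no mode" distinct from any numeric result.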