I. Introduction

In this research work we study the basic methods and principles of statistical data processing.

The data set we have chosen describes the basic parameters of 406 cars.

The parameters are:

– Miles per gallon

– Engine size

– Horsepower

– Car weight in lbs

– Time needed to accelerate to 60 miles per hour (in seconds)

– The year when the car was made

– Country of origin (1: American, 2: European, 3: Japanese)

We also include a filter parameter: an SPSS-generated variable that identifies how the data are filtered (0: not selected, 1: selected). It was created from the Data menu via Select Cases, by providing a condition for case selection.

We begin with descriptive statistics, then test the data for normality, and then move on to inferential statistics. Finally, we discuss the results.

II. Methods

Quantitative research is generally carried out using scientific methods, which can include:

– The generation of models, theories and hypotheses

– The development of instruments and methods for measurement

– Experimental control and manipulation of variables

– Collection of empirical data

– Modeling and analysis of data

The data are already collected, so in this work we use modeling and analysis of data. In particular, we test the data for normality, independence and heterogeneity. After this we use regression analysis to model miles per gallon.

III. Results and Discussion

1) Descriptive statistics. In this chapter we analyze the data step by step to understand the type of each variable, its minimum and maximum values, and statistical parameters such as the mean (the sample estimate of the expected value) and the sample standard deviation.
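The summary values reported in this chapter come from SPSS. As a minimal sketch of what those quantities are, the following Python function computes the valid count, minimum, maximum, mean and sample standard deviation of one column while skipping missing entries. The function name `describe` and the toy column are illustrative assumptions, not part of the original analysis or of the cars data.

```python
import math

def describe(values):
    """Summarize a numeric column, skipping missing entries (None)."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    mean = sum(valid) / n
    # The sample standard deviation divides by n - 1, not n.
    variance = sum((v - mean) ** 2 for v in valid) / (n - 1)
    return {
        "n_valid": n,
        "n_missing": len(values) - n,
        "min": min(valid),
        "max": max(valid),
        "mean": mean,
        "std": math.sqrt(variance),
    }

# Hypothetical toy column with two missing values (not the real data).
mpg_sample = [18.0, 15.0, None, 16.0, 17.0, 26.0, None, 24.0]
summary = describe(mpg_sample)
print(summary)
```

Reporting the number of valid cases next to each statistic matters here, because the variables in the data set have different numbers of missing values.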

There were 398 valid cases of miles per gallon, ranging from 9 to 47 mpg, with a mean of 23.51 miles per gallon.

Engine displacement is defined for all 406 cases and ranges from 4 to 455 cubic inches, with a mean of 194.04. The standard deviation of 105.207 is quite large, which means that the data set contains many different cars, from cars with small engines to big cars with large displacement.

For horsepower, there are 400 cases with a defined value, ranging from 46 to 230, with a mean of 104.83.

Vehicle weight in lbs is also defined for all 406 cases, with a minimum of 732 lbs and a maximum of 5140 lbs.

The acceleration time ranges from 8 seconds to 25 seconds, with a mean of 15.5 seconds and a fairly small standard deviation. If the distribution of this variable is normal, this means that most acceleration times lie quite close to the mean.

Model year is defined for 405 of our 406 cars and ranges from 1970 to 1982, with a mean of approximately December 1975 and a standard deviation of approximately 3 years and 9 months.

Country of origin is a nominal parameter, so the mean and standard deviation do not describe it meaningfully. For categorical (i.e., nominal) data we typically report the frequency of each value. According to the SPSS output, the frequencies of country of origin are as follows:

As we can see, most of our cars are American (253 cars). European and Japanese cars appear in almost equal numbers: 73 and 79 respectively. For one car the value is missing.
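A frequency table for a nominal variable can be sketched in a few lines. The counts below are the ones quoted in the text (253, 73, 79 and one missing case). The function name `frequency_table`, the use of `None` for the missing value, and the percentage of valid cases are illustrative assumptions.

```python
from collections import Counter

# Coding of the nominal variable as described in the introduction.
ORIGIN_LABELS = {1: "American", 2: "European", 3: "Japanese"}

def frequency_table(codes):
    """Return (label, count, percent of valid cases) rows plus a missing row."""
    counts = Counter(codes)
    total_valid = sum(c for code, c in counts.items() if code is not None)
    rows = []
    for code, label in ORIGIN_LABELS.items():
        count = counts.get(code, 0)
        rows.append((label, count, 100.0 * count / total_valid))
    # Missing values are counted but excluded from the valid percentages.
    rows.append(("Missing", counts.get(None, 0), None))
    return rows

# Rebuild the column from the counts reported in the text.
codes = [1] * 253 + [2] * 73 + [3] * 79 + [None]
table = frequency_table(codes)
for label, count, percent in table:
    print(label, count, "" if percent is None else round(percent, 1))
```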

The number of cylinders is defined for 405 of 406 cars and ranges from 3 to 8, with a sample mean of 5.47 and a standard deviation of 1.71.

2) Test for normality

## Kolmogorov-Smirnov test.

In the Kolmogorov-Smirnov test, the deviation from the normal distribution is considered significant at p < 0.05; in that case, nonparametric tests could be applied to the corresponding variables. According to the result in the last table, we can conclude that the time to accelerate from 0 to 60 mph has an approximately normal distribution, as its p-value = 0.326 > 0.05.

Now let’s check the other data:

We can see that none of the other variables has a distribution that can be considered normal (according to the Kolmogorov-Smirnov test). The country of origin variable was not tested, as it is a nominal variable.
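To make the decision rule concrete, here is a minimal sketch of a one-sample Kolmogorov-Smirnov test against a normal distribution. It assumes that the mean and standard deviation are estimated from the sample and that the p-value comes from the asymptotic Kolmogorov series. SPSS may apply a correction for estimated parameters, so its p-values can differ from this sketch. The function names and both toy samples are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_normal(values):
    """Return (D, approximate p-value) for a KS test against a normal
    distribution whose mean and sd are estimated from the data."""
    xs = sorted(values)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))

    # D is the largest gap between the empirical CDF and the normal CDF.
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)

    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here; the tail probability is ~1.
        return d, 1.0
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                     for k in range(1, 101))
    return d, min(1.0, max(0.0, tail))

# Hypothetical samples: one symmetric, one strongly skewed.
symmetric = list(range(-10, 11))
skewed = [1] * 40 + [100] * 10

for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
    d_stat, p_value = ks_normal(sample)
    verdict = "not rejected" if p_value > 0.05 else "rejected"
    print(name, round(d_stat, 3), "normality", verdict)
```

Because the parameters are estimated from the same sample, the asymptotic p-value tends to be too large, so this sketch is less likely to reject normality than a corrected test.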

Let’s analyze the distribution graphs for each parameter. We know that the graph of a normal distribution is a bell curve, for example:

For our variables the graphs are as follows:

Analyzing these graphs, we can see that only the time to accelerate from 0 to 60 mph has an approximately normal shape. The other distributions are far from the normal law.
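For reference, the bell curve is the graph of the normal density. The symbols $\mu$ (mean) and $\sigma$ (standard deviation) are standard notation introduced here, not taken from the original text:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The curve is symmetric about $\mu$, and about 68% of its area lies within one standard deviation of the mean.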

## Regression modeling.

Now we want to use regression analysis to model miles per gallon as a function of the other variables. We will try a simple linear model in which the outcome variable is miles per gallon.

The SPSS result is below:
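The linear model was fitted in SPSS. As a rough illustration of what fitting such a model involves, the sketch below estimates the coefficients by ordinary least squares through the normal equations. The function names, the two predictors (weight in thousands of lbs and years since 1970) and the toy values are all hypothetical and do not reproduce the SPSS output.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x

def fit_ols(rows, y):
    """Return [intercept, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # column of ones for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y_i for d, y_i in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(coefs, row):
    """Evaluate the fitted linear model for one row of predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))

# Hypothetical predictors: (weight in 1000 lbs, years since 1970).
rows = [(2.0, 0), (3.5, 2), (4.2, 5), (2.8, 8), (3.0, 10), (1.9, 12)]
# Toy outcome generated from an exact linear rule, so the fit recovers it.
y = [45.0 - 6.0 * w + 0.7 * t for w, t in rows]

coefs = fit_ols(rows, y)
print([round(c, 3) for c in coefs])
```

Solving the normal equations directly is fine for a handful of predictors; statistical packages generally use more numerically stable decompositions.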

Now let’s analyze the results of our model.

First of all, look at the model summary. The values of R-square and adjusted R-square are quite close to 1. This means that the model explains most of the variance in miles per gallon and represents the dependence between the variables well.
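The two summary measures can be written out explicitly. This is a minimal sketch using the standard definitions; the observed and fitted values are hypothetical, and `n` and `k` denote the number of cases and the number of predictors.

```python
def r_squared(y, y_hat):
    """Share of the variance of y explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R-square for the number of predictors k given n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical observed and fitted values.
observed = [10.0, 20.0, 30.0, 40.0]
fitted = [12.0, 18.0, 33.0, 37.0]

r2 = r_squared(observed, fitted)
adj = adjusted_r_squared(r2, n=len(observed), k=1)
print(round(r2, 3), round(adj, 3))
```

Adjusted R-square is never larger than R-square, and the gap grows as predictors are added without improving the fit.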

The next table is the ANOVA table. It indicates that the regression model predicts the outcome variable significantly well. How do we know this? Look at the "Regression" row and go to the Sig. column, which gives the statistical significance of the regression model that was applied. Here p < 0.05, which indicates that, overall, the model statistically significantly predicts the outcome variable.

The table below, Coefficients, provides information on each predictor variable. We can see that not all of them are equally significant. For example, horsepower, time to accelerate and number of cylinders are less significant than engine displacement, vehicle weight, model year and country of origin.

This is natural, because we know that the latter factors have more influence on the miles per gallon of a given car.
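The Sig. value in the ANOVA table is derived from an F statistic. The sketch below shows how that statistic is assembled from the regression and residual sums of squares; it does not compute the p-value, which requires the F distribution. The toy data and the function name `anova_f` are hypothetical, and the fitted values are assumed to come from a least-squares fit with an intercept.

```python
def anova_f(y, y_hat, k):
    """Return the ANOVA decomposition and F statistic for a regression
    with k predictors, given observed y and least-squares fitted y_hat."""
    n = len(y)
    mean_y = sum(y) / n
    ss_reg = sum((f - mean_y) ** 2 for f in y_hat)        # "Regression" row
    ss_res = sum((a - f) ** 2 for a, f in zip(y, y_hat))  # "Residual" row
    df_reg, df_res = k, n - k - 1
    ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
    return {
        "ss_reg": ss_reg, "df_reg": df_reg,
        "ss_res": ss_res, "df_res": df_res,
        "f": ms_reg / ms_res,
    }

# Hypothetical data with x = 1..5; the least-squares line is y = 2.2 + 0.6 x.
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_fit = [2.8, 3.4, 4.0, 4.6, 5.2]

result = anova_f(y_obs, y_fit, k=1)
print(round(result["f"], 3), result["df_reg"], result["df_res"])
```

A large F relative to the F distribution with `df_reg` and `df_res` degrees of freedom gives a small Sig. value, which is what the text reads off the "Regression" row.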

## Model limitations.

When you choose to analyze your data using linear regression, part of the process involves checking to make sure that the data you want to analyze can actually be analyzed using linear regression. You need to do this because it is only appropriate to use linear regression if your data "passes" several assumptions that are required for linear regression to give you a valid result. In practice, checking for these six assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS when performing your analysis, as well as think a little bit more about your data.

Assumption #1: Your variables should be measured at the interval or ratio level (i.e., they are continuous). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.

Assumption #2: There should be no significant outliers. Outliers are simply single data points within your data that do not follow the usual pattern.

Assumption #3: You should have independence of observations. (To check this assumption, we can use the Durbin-Watson statistic.)

Assumption #4: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

Assumption #6: We need to know that the residuals (errors) of our variables are approximately normally distributed. (To check this, generally we can use a normal P-P plot).
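Two of the checks named above, the Durbin-Watson statistic and the normal P-P plot, can be sketched from the residuals alone. The residual values are hypothetical, and the plotting position (i - 0.5)/n is one common convention assumed here; SPSS may use a different one.

```python
import math

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests no first-order
    autocorrelation; it always lies between 0 and 4."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def pp_plot_points(residuals):
    """Return (observed, expected) cumulative probabilities for a normal
    P-P plot; points near the diagonal suggest normal residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in residuals) / (n - 1))
    points = []
    for i, e in enumerate(sorted(residuals), start=1):
        observed = (i - 0.5) / n  # assumed plotting position
        expected = 0.5 * (1.0 + math.erf((e - mean) / (sd * math.sqrt(2.0))))
        points.append((observed, expected))
    return points

# Hypothetical residuals in observation order.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]

dw = durbin_watson(residuals)
points = pp_plot_points(residuals)
print(round(dw, 2))
for observed, expected in points:
    print(round(observed, 3), round(expected, 3))
```

The Durbin-Watson statistic depends on the order of the cases, so it is only informative when that order is meaningful, for example when cases are sorted by time.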
