Academic Data Projects

These academic projects explore datasets from a range of domains, combining preprocessing, modeling, visualization, and analysis to derive business insights.

The Zillow dataset describes homes by features such as the year a home was built and its square footage, alongside each home's tax-assessed value. This exercise gave my group the chance to practice machine learning techniques, such as linear regression, random forest regression, and gradient boosting, training on selected feature columns and known tax-assessed values to predict the tax-assessed values of new homes. We found that gradient boosting regression performed slightly better than linear and ridge regression, as evidenced by its having the lowest root mean squared error (RMSE).
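A minimal sketch of how models can be ranked by RMSE on a held-out split. It assumes synthetic stand-in data rather than the Zillow columns, and it compares a mean baseline with a single-feature least-squares fit instead of the models named above. The helper names `rmse` and `fit_line` are illustrative, not taken from the project.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept on one feature."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic stand-in data (NOT the Zillow data): value grows with square footage.
rng = random.Random(0)
sqft = [rng.uniform(600, 4000) for _ in range(200)]
value = [150 * s + rng.gauss(0, 20000) for s in sqft]

# Hold out the last 20% of rows for evaluation.
split = int(0.8 * len(sqft))
x_train, x_test = sqft[:split], sqft[split:]
y_train, y_test = value[:split], value[split:]

slope, intercept = fit_line(x_train, y_train)
train_mean = sum(y_train) / len(y_train)
predictions = {
    "mean baseline": [train_mean] * len(y_test),
    "linear regression": [slope * s + intercept for s in x_test],
}

# Score every candidate on the same held-out rows; lower RMSE is better.
scores = {name: rmse(y_test, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:,.0f}")
print("lowest RMSE:", best)
```

Scoring every candidate on the same held-out rows is what makes the RMSE values comparable; the same loop extends to any model that can produce predictions for the test rows.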

Zillow Predictions

Breast Cancer Wisconsin Dataset

For the final project in one of my data science courses, I was tasked with carrying out a comprehensive data analysis of the Body Fat Dataset. The analysis used the Pandas, NumPy, and Matplotlib Python libraries to select and preprocess data and to create visuals that surface exploratory data analysis insights from the Body Fat Dataset.

Steps that I took for this analysis include:

Choosing and loading the dataset.
Describing the dataset.
Performing exploratory data analysis (EDA) on the Body Fat Dataset.
Plotting a feature correlation matrix.
Experimenting with three unfamiliar regression modelling techniques.
Creating a basic implementation of each new regression modelling technique.
Running 5-fold cross-validation on the new regression modelling techniques.
Building pipelines for the new regression modelling techniques.
Explaining why more complex decision trees may not be better.
Choosing the best model.
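The 5-fold cross-validation step in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the data is synthetic (not the Body Fat Dataset), the model is a single-feature least-squares line rather than any of the three techniques tried, and the names `k_fold_indices`, `fit_line`, and `cross_val_rmse` are hypothetical.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept on one feature."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def cross_val_rmse(x, y, k=5, seed=0):
    """Return the held-out RMSE of the line fit for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        # Fit only on the training rows, then score on the held-out fold.
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - (slope * x[i] + intercept)) ** 2 for i in held_out]
        scores.append(math.sqrt(sum(errors) / len(errors)))
    return scores


# Synthetic stand-in data (NOT the Body Fat Dataset): one feature, one target.
rng = random.Random(1)
feature = [rng.uniform(70, 120) for _ in range(100)]
target = [0.6 * f - 35 + rng.gauss(0, 4) for f in feature]

fold_scores = cross_val_rmse(feature, target, k=5)
mean_score = sum(fold_scores) / len(fold_scores)
print("per-fold RMSE:", [round(s, 2) for s in fold_scores])
print("mean RMSE:", round(mean_score, 2))
```

Averaging the per-fold scores gives one number per model, which is a common basis for the final "choosing the best model" step; a model that fits the training rows closely but scores poorly on held-out folds is overfitting.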

Body Fat Dataset

For the final project in one of my data science courses, I was tasked with carrying out a comprehensive data analysis of the Breast Cancer Wisconsin Dataset (BCWD). The analysis used the Pandas, NumPy, and Matplotlib Python libraries to select and preprocess data and to create visuals that surface exploratory data analysis insights from the BCWD.

Steps that I took for this analysis include:

Choosing the dataset.
Describing the dataset.
Plotting the histograms.
Comparing the column pairs.
Performing ordinary least squares (OLS) regression to find the average loss over the entire dataset.
Finding the single best feature to predict a diagnosis.
Finding a pair of input columns with a visible dependency.
Performing principal components analysis (PCA) on all features.
Finding the most highly correlated feature pair.
Identifying an outlier of interest.
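One of the steps above, finding the most highly correlated feature pair, can be sketched as a search over all column pairs using the Pearson correlation coefficient. The columns below are synthetic stand-ins with hypothetical names, not the BCWD features, and `pearson` and `most_correlated_pair` are illustrative helpers rather than the project's code.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    var_x = sum((a - x_mean) ** 2 for a in x)
    var_y = sum((b - y_mean) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def most_correlated_pair(columns):
    """Return (name_a, name_b, r) for the column pair with the largest |r|."""
    name_a, name_b = max(
        combinations(columns, 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )
    return name_a, name_b, pearson(columns[name_a], columns[name_b])


# Synthetic stand-in columns (NOT the BCWD data); the names are hypothetical.
rng = random.Random(2)
radius = [rng.uniform(5, 25) for _ in range(200)]
columns = {
    "radius": radius,
    # Built to depend almost linearly on radius, plus a little noise.
    "perimeter": [2 * math.pi * r + rng.gauss(0, 1) for r in radius],
    # Drawn independently of the other two columns.
    "texture": [rng.uniform(10, 40) for _ in range(200)],
}

name_a, name_b, r = most_correlated_pair(columns)
print(f"most correlated pair: {name_a} / {name_b} (r = {r:.3f})")
```

Ranking by the absolute value of the coefficient means a strong negative relationship counts as much as a strong positive one.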

CFornesa