Criminal Rate Prediction

Predicting Crime Categories in San Francisco

This project analyzes nearly 12 years of crime reports from neighborhoods across San Francisco. The primary goal is to predict the category of a crime from temporal and spatial data using machine learning techniques.

Writing Report

Technology Stack

1. Data Wrangling:

The dataset was cleaned to remove inconsistencies or errors, ensuring its quality for analysis.
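The report does not say which inconsistencies were removed. As a minimal sketch only, two common cleaning steps for this kind of data are dropping exact duplicate rows and discarding rows whose coordinates fall outside the city. The field names (`when`, `category`, `x`, `y`), the sample rows, and the bounding box values are all assumptions made for illustration.

```python
# Hypothetical records; field names and values are assumptions, not project data.
# "x" is longitude and "y" is latitude.
records = [
    {"when": "2015-05-13 23:53:00", "category": "WARRANTS", "x": -122.4258, "y": 37.7745},
    {"when": "2015-05-13 23:53:00", "category": "WARRANTS", "x": -122.4258, "y": 37.7745},  # exact duplicate
    {"when": "2015-05-13 23:30:00", "category": "LARCENY/THEFT", "x": -120.5, "y": 90.0},   # impossible location
    {"when": "2015-05-13 22:45:00", "category": "ASSAULT", "x": -122.4100, "y": 37.7830},
]

# Rough bounding box around San Francisco (assumed values).
LON_MIN, LON_MAX = -122.52, -122.35
LAT_MIN, LAT_MAX = 37.70, 37.84


def clean(rows):
    """Drop exact duplicate rows and rows with coordinates outside the box."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue
        seen.add(key)
        if LON_MIN <= row["x"] <= LON_MAX and LAT_MIN <= row["y"] <= LAT_MAX:
            kept.append(row)
    return kept


cleaned = clean(records)
print(len(records), "->", len(cleaned))
```

With the sample rows above, the duplicate and the out-of-range row are dropped and two records remain.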

2. Data Exploration and Feature Engineering:

The dataset was explored to understand its variables, and additional features were created to enhance the predictive power of the model.
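The report does not list the engineered features. Since the goal is prediction from temporal data, one plausible step is deriving calendar features from each incident's timestamp. The sketch below assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS`; the function name `time_features` and the chosen features are hypothetical.

```python
from datetime import datetime


def time_features(timestamp):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' string."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "hour": dt.hour,
        "weekday": dt.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": dt.weekday() >= 5,  # Saturday or Sunday
    }


features = time_features("2015-05-13 23:53:00")  # a Wednesday
print(features)
```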

3. Two different models:

Random Forest Classifier
XGBoost Classifier

Hypotheses Tested

1. Weekday vs. Weekend Crimes:

It was hypothesized that crimes are more likely to occur on weekdays than on weekends. The analysis confirmed this, showing that weekday crimes are approximately 4% more frequent.

Average Daily Crime Incidents

Crime Incident Counts by Day

Pattern Analysis by Crime Category
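The report does not state how the 4% figure was computed. One plausible reading, given the "Average Daily Crime Incidents" figure, is a comparison of mean incidents per calendar day on weekdays versus weekends. The sketch below follows that assumption, using made-up dates and a hypothetical `average_daily_counts` helper.

```python
from collections import Counter
from datetime import date


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def average_daily_counts(incident_dates):
    """Return (weekday_avg, weekend_avg): mean incidents per calendar day.

    Only dates that appear in the data are counted, so days with zero
    incidents are ignored in this simplified version.
    """
    per_day = Counter(incident_dates)  # incidents on each distinct date
    weekday = [n for d, n in per_day.items() if d.weekday() < 5]
    weekend = [n for d, n in per_day.items() if d.weekday() >= 5]
    return _mean(weekday), _mean(weekend)


# Toy data (made up): one entry per incident.
incident_dates = (
    [date(2015, 5, 11)] * 3    # Monday
    + [date(2015, 5, 13)] * 5  # Wednesday
    + [date(2015, 5, 16)] * 2  # Saturday
    + [date(2015, 5, 17)] * 4  # Sunday
)

weekday_avg, weekend_avg = average_daily_counts(incident_dates)
# Relative difference; assumes the data contains at least one weekend day.
relative_gap = (weekday_avg - weekend_avg) / weekend_avg
print(weekday_avg, weekend_avg, round(relative_gap, 3))
```

Comparing per-day averages rather than raw totals matters here because a week has five weekdays and only two weekend days, so raw totals would favor weekdays regardless of the underlying rate.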

2. Crimes by Time of Day:

It was hypothesized that crimes are more likely to occur during late night/early morning hours. Contrary to this hypothesis, the data showed that crimes peak during the evening (18:00–22:00) and around noon (12:00).

Crimes by Hour of the Day
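A count of incidents per hour of the day, as in the figure above, could be produced along the following lines. This is a sketch with made-up timestamps; the `incidents_by_hour` helper and the timestamp format are assumptions.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Return a 24-element list: number of incidents in each hour (0-23)."""
    counts = Counter(
        datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


# Toy data (made up).
timestamps = [
    "2015-05-13 12:05:00",
    "2015-05-13 18:30:00",
    "2015-05-13 18:45:00",
    "2015-05-14 03:10:00",
    "2015-05-14 12:00:00",
    "2015-05-14 18:10:00",
]

by_hour = incidents_by_hour(timestamps)
peak_hour = max(range(24), key=lambda hour: by_hour[hour])
print(peak_hour, by_hour[peak_hour])
```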

3. Proximity to Crime Hotspots:

This hypothesis suggested that crimes of similar types are more likely to occur near each other, forming crime hotspots. The analysis showed that certain types of crimes are concentrated in specific police districts, supporting the hypothesis.

Crime Distribution by Police District
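One simple way to check for concentration is to compute, for each crime category, the share of its incidents that falls in each district. The sketch below uses made-up category and district labels; `category_share_by_district` is a hypothetical helper, and the report may have measured concentration differently.

```python
from collections import Counter, defaultdict


def category_share_by_district(rows):
    """For each category, the fraction of its incidents in each district.

    `rows` is an iterable of (category, district) pairs.
    """
    counts = defaultdict(Counter)
    for category, district in rows:
        counts[category][district] += 1
    return {
        category: {d: n / sum(c.values()) for d, n in c.items()}
        for category, c in counts.items()
    }


# Toy data (made up): labels are placeholders, not real findings.
rows = (
    [("DRUGS", "DISTRICT_A")] * 3
    + [("DRUGS", "DISTRICT_B")] * 1
    + [("THEFT", "DISTRICT_A")] * 2
    + [("THEFT", "DISTRICT_B")] * 2
)

shares = category_share_by_district(rows)
print(shares["DRUGS"])
```

In this toy data, "DRUGS" is concentrated in one district (a 0.75 share) while "THEFT" is evenly split, which is the kind of contrast the hotspot hypothesis predicts.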

Model Building

1. Random Forest Classifier:

This ensemble learning method constructs multiple decision trees and predicts the class chosen by the most trees (the mode of their predictions). It is robust and can handle large datasets.

It achieved an accuracy of 26.35% on the test set.
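The "mode of the classes" step can be illustrated in isolation. The sketch below is not a random forest: it only shows how per-tree predictions for a single incident would be combined by majority vote. The votes are made up and `majority_vote` is a hypothetical helper. Some libraries combine trees by averaging class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions.

    Ties are broken in favor of the class that appears first in the input.
    """
    if not tree_predictions:
        raise ValueError("need at least one prediction")
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions from five trees for one incident.
votes = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "VANDALISM", "LARCENY/THEFT"]
predicted = majority_vote(votes)
print(predicted)
```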

2. XGBoost Classifier:

Known for its efficiency and high performance, XGBoost is a scalable implementation of gradient boosting. It slightly outperformed the Random Forest model in terms of accuracy.

It improved the accuracy to 27.40%, about 1.05 percentage points above the Random Forest model's 26.35%.
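Assuming the reported figures use the usual definition of accuracy (the fraction of test incidents whose predicted category matches the true one), two models can be compared as sketched below. The labels and predictions are made up, and `accuracy` is a hypothetical helper rather than the project's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy test set and predictions (made up).
y_true = ["ASSAULT", "LARCENY/THEFT", "VANDALISM", "LARCENY/THEFT"]
model_a = ["ASSAULT", "LARCENY/THEFT", "ASSAULT", "ASSAULT"]
model_b = ["ASSAULT", "LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT"]

acc_a = accuracy(y_true, model_a)
acc_b = accuracy(y_true, model_b)
# Difference in percentage points between the two models.
gap_points = (acc_b - acc_a) * 100
print(acc_a, acc_b, gap_points)
```

Evaluating both models on the same held-out test set is what makes a small gap between them comparable.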