Technology Stack
1. Data Wrangling:
The dataset was cleaned to remove inconsistencies or errors, ensuring its
quality for analysis.
2. Data Exploration and Feature Engineering:
The dataset was explored to understand its variables, and additional
features were created to improve the predictive power of the model.
3. Two different models:
- Random Forest Classifier
- XGBoost Classifier
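
As an illustration of the feature engineering step, the following minimal sketch derives calendar features from an incident timestamp. The function name `time_features`, the feature names, and the `YYYY-MM-DD HH:MM:SS` timestamp format are assumptions made for this example; the features actually created for the project are not listed here.

```python
from datetime import datetime


def time_features(timestamp: str) -> dict:
    """Derive simple calendar features from a 'YYYY-MM-DD HH:MM:SS' string.

    The timestamp format and the feature names are assumptions for this sketch.
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,                  # 0-23
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "month": ts.month,                # 1-12
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


if __name__ == "__main__":
    print(time_features("2015-05-13 23:53:00"))
```

Features such as `hour` and `is_weekend` correspond directly to the time-based hypotheses tested in the next section.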
Hypotheses Tested
1. Weekday vs. Weekend Crimes:
The hypothesis was that crimes are more likely to occur on weekdays than on
weekends. The analysis confirmed this: crimes were approximately 4% more
frequent on weekdays.

Figure: Average Daily Crime Incidents

Figure: Crime Incident Counts by Day

Figure: Pattern Analysis by Crime Category
2. Crimes by Time of Day:
It was hypothesized that crimes are more likely to occur during late
night/early morning hours. Contrary to this hypothesis, the data showed that
crimes peak during the evening (18:00–22:00) and around noon (12:00).

Figure: Crimes by Hour of the Day
3. Proximity to Crime Hotspots:
The hypothesis was that crimes of similar types are more likely to occur
near each other, forming crime hotspots. The analysis showed that certain
types of crime are concentrated in specific police districts, which supports
the hypothesis.

Figure: Crime Distribution by Police District
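
The three checks above can be sketched with simple aggregations. This is a minimal illustration, not the analysis code used for the project: the records, the category and district labels, and all function names are hypothetical, and the numbers it produces say nothing about the real dataset.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records (timestamp, category, district); made-up values.
incidents = [
    ("2015-05-11 12:10:00", "THEFT", "DISTRICT_A"),    # Monday
    ("2015-05-11 19:30:00", "THEFT", "DISTRICT_A"),    # Monday
    ("2015-05-12 18:05:00", "ASSAULT", "DISTRICT_B"),  # Tuesday
    ("2015-05-16 20:45:00", "THEFT", "DISTRICT_A"),    # Saturday
    ("2015-05-17 02:15:00", "ASSAULT", "DISTRICT_B"),  # Sunday
]


def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def average_daily_incidents(records):
    """Mean incidents per calendar day, split into weekday and weekend.

    Only days that appear in the data are counted; days with zero
    incidents are not represented in this sketch.
    """
    per_day = Counter(parse(ts).date() for ts, _, _ in records)
    groups = defaultdict(list)
    for day, count in per_day.items():
        groups["weekend" if day.weekday() >= 5 else "weekday"].append(count)
    return {name: sum(counts) / len(counts) for name, counts in groups.items()}


def incidents_by_hour(records):
    """Number of incidents in each hour of the day (0-23)."""
    return Counter(parse(ts).hour for ts, _, _ in records)


def category_share_by_district(records):
    """For each category, the fraction of its incidents in each district."""
    counts = defaultdict(Counter)
    for _, category, district in records:
        counts[category][district] += 1
    return {
        category: {d: n / sum(by_district.values()) for d, n in by_district.items()}
        for category, by_district in counts.items()
    }


if __name__ == "__main__":
    print(average_daily_incidents(incidents))
    print(incidents_by_hour(incidents).most_common(3))
    print(category_share_by_district(incidents))
```

Comparing averages per day rather than raw totals matters for the first hypothesis, because a week has five weekdays but only two weekend days.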
Model Building
1. Random Forest Classifier:
This ensemble learning method constructs multiple decision trees and
outputs the mode of the classes predicted by the individual trees. It is
robust and can handle large datasets.
The model achieved an accuracy of 26.35% on the test set.
2. XGBoost Classifier:
Known for its efficiency and high performance, XGBoost is a scalable
implementation of gradient boosting. It slightly outperformed the Random
Forest model in terms of accuracy.
The model reached an accuracy of 27.40%, about 1.05 percentage points
higher than the Random Forest model.
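
To make "the mode of the classes" and the accuracy metric concrete, here is a minimal sketch using only the Python standard library. It is not the training code for either model; the per-tree votes, the labels, and the function names are hypothetical, and library implementations of random forests may combine tree outputs differently.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical votes from three trees for four samples (made-up values).
votes_per_sample = [
    ["THEFT", "THEFT", "ASSAULT"],
    ["ASSAULT", "ASSAULT", "ASSAULT"],
    ["THEFT", "ASSAULT", "THEFT"],
    ["ASSAULT", "THEFT", "THEFT"],
]
y_true = ["THEFT", "ASSAULT", "ASSAULT", "THEFT"]

# One ensemble prediction per sample, then the share of correct predictions.
y_pred = [majority_vote(votes) for votes in votes_per_sample]

if __name__ == "__main__":
    print(y_pred)
    print(accuracy(y_true, y_pred))
```

The accuracies reported above are fractions of this kind, computed over the evaluation samples and expressed as percentages.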
