Crime Category Prediction in San Francisco

This project analyzes nearly 12 years of crime reports from neighborhoods across San Francisco. The primary goal is to predict the category of a crime from temporal and spatial data using machine learning.

Technology Stack

1. Data Wrangling:

   The dataset was cleaned to remove inconsistencies and errors, ensuring its quality for analysis.

2. Data Exploration and Feature Engineering:

   The dataset was explored to understand its variables, and additional features were created to improve the model's predictive power.

3. Two different models:

  • Random Forest Classifier

  • XGBoost Classifier
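
As an illustration of the feature engineering step above, the sketch below derives a few time-based features from a report timestamp. The write-up does not list the features that were actually created, so the timestamp format, the function name `temporal_features`, and the feature names are all assumptions.

```python
from datetime import datetime

def temporal_features(timestamp: str) -> dict:
    """Derive simple time-based features from a 'YYYY-MM-DD HH:MM:SS' string.

    Both the input format and the feature names are assumptions for illustration.
    """
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "hour": dt.hour,
        "day_of_week": dt.weekday(),      # Monday=0 ... Sunday=6
        "is_weekend": dt.weekday() >= 5,  # Saturday or Sunday
    }

# Made-up timestamps, not taken from the dataset
features = temporal_features("2015-05-13 23:53:00")
saturday = temporal_features("2015-05-16 08:00:00")
```

Features like these give a tree-based model direct access to hour-of-day and day-of-week patterns that a raw timestamp string would hide.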

Hypotheses Tested

1. Weekday vs. Weekend Crimes:

   Hypothesis: crimes are more likely to occur on weekdays than on weekends. The analysis supported this, showing that weekday crimes are approximately 4% more frequent.

Figure: Average Daily Crime Incidents

Figure: Crime Incident Counts by Day

Figure: Pattern Analysis by Crime Category
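
One way to make the weekday versus weekend comparison concrete is to average incidents per calendar day in each group and take the relative gap. The write-up does not say how its roughly 4% figure was computed, so this is a sketch of one possible method, run on made-up dates. The name `average_daily_incidents` is hypothetical.

```python
from collections import Counter
from datetime import date

def average_daily_incidents(incident_dates):
    """Return (weekday_avg, weekend_avg): mean incidents per calendar day.

    Only dates with at least one incident are counted, which is an
    assumption that is harmless when every day has reports.
    """
    per_day = Counter(incident_dates)  # incidents on each distinct date
    weekday_counts = [n for d, n in per_day.items() if d.weekday() < 5]
    weekend_counts = [n for d, n in per_day.items() if d.weekday() >= 5]
    weekday_avg = sum(weekday_counts) / len(weekday_counts)
    weekend_avg = sum(weekend_counts) / len(weekend_counts)
    return weekday_avg, weekend_avg

# Made-up toy data: one entry per reported incident
incidents = (
    [date(2015, 5, 11)] * 4    # Monday
    + [date(2015, 5, 13)] * 6  # Wednesday
    + [date(2015, 5, 16)] * 4  # Saturday
    + [date(2015, 5, 17)] * 4  # Sunday
)
weekday_avg, weekend_avg = average_daily_incidents(incidents)
relative_gap = (weekday_avg - weekend_avg) / weekend_avg
```

Averaging per day matters here: a raw total would favor weekdays simply because there are five of them for every two weekend days.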

2. Crimes by Time of Day:

   Hypothesis: crimes are more likely to occur during late night and early morning hours. The data contradicted this: crimes peak during the evening (18:00–22:00) and around noon (12:00).

Figure: Crimes by Hour of the Day
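
The hour-of-day pattern can be checked with a simple count of incidents per hour. The sketch below assumes the hour has already been extracted from each report as an integer from 0 to 23. The helper names and the toy data are hypothetical.

```python
from collections import Counter

def hourly_histogram(hours):
    """Return a 24-element list: number of incidents in each hour 0..23."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]

def busiest_hours(hours, top_n=3):
    """Return the top_n (hour, count) pairs, most frequent first."""
    return Counter(hours).most_common(top_n)

# Made-up toy data: the hour of each reported incident
toy_hours = [12, 12, 12, 18, 18, 18, 18, 19, 19, 3]
histogram = hourly_histogram(toy_hours)
peaks = busiest_hours(toy_hours, top_n=2)
```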

3. Proximity to Crime Hotspots:

   Hypothesis: crimes of similar types are more likely to occur near each other, forming crime hotspots. The analysis showed that certain types of crime are concentrated in specific districts, which supports the hypothesis.

Figure: Crime Distribution by Police District
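
A rough way to quantify district-level concentration is to find, for each crime category, the district where it occurs most often and the share of that category's incidents that fall there. This is a minimal sketch, not the analysis used in the project. The function name and the placeholder category and district labels are made up.

```python
from collections import Counter, defaultdict

def top_district_share(records):
    """For each category, return (district, share) for its most common district.

    `records` is an iterable of (category, district) pairs.
    """
    by_category = defaultdict(Counter)
    for category, district in records:
        by_category[category][district] += 1

    result = {}
    for category, districts in by_category.items():
        district, count = districts.most_common(1)[0]
        result[category] = (district, count / sum(districts.values()))
    return result

# Made-up toy records with placeholder labels
toy_records = [
    ("category_a", "district_x"),
    ("category_a", "district_x"),
    ("category_a", "district_x"),
    ("category_a", "district_y"),
    ("category_b", "district_y"),
    ("category_b", "district_y"),
    ("category_b", "district_z"),
]
shares = top_district_share(toy_records)
```

A share well above what an even spread across districts would give suggests concentration. A police district is a coarse unit, so a finer hotspot analysis would work from coordinates instead.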

Model Building

1. Random Forest Classifier:

   This ensemble learning method builds multiple decision trees and predicts the class that most trees vote for (the mode of their predictions). It is robust and can handle large datasets. It achieved an accuracy of 26.35% on the test set.
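
The "mode of the classes" idea can be shown in a few lines: each tree casts one vote and the most common label wins. This sketch covers only the voting step, with made-up labels and a hypothetical `majority_vote` helper. It does not train any trees, and some libraries (scikit-learn among them) average per-tree class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions from five hypothetical trees for one sample
votes = ["class_a", "class_b", "class_a", "class_c", "class_a"]
predicted = majority_vote(votes)
```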

2. XGBoost Classifier:

   Known for its efficiency and high performance, XGBoost is a scalable implementation of gradient boosting. It slightly outperformed the Random Forest model, improving test accuracy to 27.40%.
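
For reference, accuracy is the fraction of test samples whose predicted category matches the true one. The sketch below shows the calculation on made-up labels, then the gap between the two reported accuracies in percentage points. The function name and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for four hypothetical test samples
y_true = ["a", "b", "a", "c"]
y_pred = ["a", "b", "c", "c"]
toy_accuracy = accuracy(y_true, y_pred)

# Gap between the accuracies reported above, in percentage points
random_forest_acc = 26.35
xgboost_acc = 27.40
gap_points = round(xgboost_acc - random_forest_acc, 2)
```

The reported improvement from Random Forest to XGBoost works out to 1.05 percentage points.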

© YuKai Huang, 2025
