Victimology, the study of victims in the context of criminal acts, plays a pivotal role in understanding crime patterns. By analyzing victim profiles, we can uncover critical insights into the nature of criminal activities, helping law enforcement agencies and policymakers in designing more effective crime prevention and response strategies. The study of victim profiles not only aids in revealing demographic vulnerabilities but also assists in understanding the societal and environmental factors contributing to crime.
In this study, we delve into the Supplementary Homicide Report (SHR) dataset, a rich compilation of data detailing homicide incidents across various demographics. Our focus is primarily on the victims’ profiles, encompassing their age, sex, race, and ethnicity. This exploration aims to identify patterns and trends within these victim profiles, leveraging the power of machine learning techniques. By doing so, we hope to provide a data-driven perspective on homicide cases, offering insights that could potentially guide future crime prevention and investigation efforts.
All data can be found at https://murderdata.org
The dataset in question, the Supplementary Homicide Report (SHR), is an extensive record of homicide incidents, meticulously compiled to offer detailed insights into each case. For our victimology study, we concentrate on several key variables: the age of the victim (VICAGE), the sex of the victim (VICSEX), the race of the victim (VICRACE), and the victim’s ethnicity (VICETHNIC). These variables offer a comprehensive view of the victim’s demographic profile, serving as crucial factors in our analysis.
The Supplementary Homicide Report (SHR) data is a detailed collection of homicide incidents, including various attributes related to each case. Here are some key fields in the dataset:
- ID: Unique identifier for each record.
- CNTYFIPS: County FIPS code.
- ORI: Originating Agency Identifier.
- STATE, STATENAME: State of the reporting agency.
- AGENCY: Name of the law enforcement agency.
- AGENTYPE: Type of law enforcement agency.
- SOURCE: Source of the record.
- SOLVED: Indicates if the offender was identified.
- YEAR, MONTH: Date of the incident.
- INCIDENT: Case number within the month.
- ACTIONTYPE: Nature of the report.
- HOMICIDE: Type of homicide.
- SITUATION: Victim and offender situation.
- VICAGE, VICSEX, VICRACE, VICETHNIC: Victim’s age, sex, race, and ethnicity.
- OFFAGE, OFFSEX, OFFRACE, OFFETHNIC: Offender’s age, sex, race, and ethnicity.
- WEAPON: Weapon used.
- RELATIONSHIP: Relationship between victim and offender.
- CIRCUMSTANCES: Circumstances of the crime.
- SUBCIRCUM: Conditions in which the victim was a criminal offender.
- VICCOUNT, OFFCOUNT: Number of additional victims and offenders.
- FILEDATE: Date the record was reported.
- FSTATE: State in which homicide was reported.
- MSA: Metropolitan Statistical Area code.
Given the nature of this data, several machine learning and deep learning approaches can be applied for various predictive analyses.
The initial step involves loading and preprocessing the data. We utilize Python, a powerful tool for data analysis, and pandas, a library providing easy-to-use data structures and data analysis tools. The following snippet demonstrates how to load the dataset into a pandas DataFrame:
import pandas as pd
# Load the dataset
file_path = '/content/SHR65_22.csv' # Replace with your file path
shr_data = pd.read_csv(file_path, encoding='utf-8')
# Display the first few rows of the DataFrame
shr_data.head()
Data quality is paramount. Therefore, our next step is to clean the dataset by handling missing values and any inconsistencies that may skew our analysis. We check for missing data and decide on an appropriate strategy — whether to fill in missing values with a statistical measure like the mean or median (for numerical data) or to impute with the most frequent category (for categorical data). In cases where the data is missing at random, and if the proportion of missing values is significant, we might consider removing those records from our dataset. The following code demonstrates a basic approach to handling missing data:
# Check for missing values
missing_values = shr_data.isnull().sum()
print("Missing values in each column:\n", missing_values)
# Fill missing numerical data with the median
shr_data['VicAge'] = shr_data['VicAge'].fillna(shr_data['VicAge'].median())
# Fill missing categorical data with the mode
shr_data['VicSex'] = shr_data['VicSex'].fillna(shr_data['VicSex'].mode()[0])
shr_data['VicRace'] = shr_data['VicRace'].fillna(shr_data['VicRace'].mode()[0])
shr_data['VicEthnic'] = shr_data['VicEthnic'].fillna(shr_data['VicEthnic'].mode()[0])
This preprocessing stage sets a solid foundation for the subsequent analytical and machine learning procedures, ensuring the reliability and accuracy of our findings.
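One data-quality wrinkle worth checking before imputation: SHR extracts often store unknown ages as a sentinel code rather than a blank, with 999 a common choice. A minimal sketch under that assumption (the 999 code and the toy values are illustrative, so verify against your own file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for shr_data; treating 999 as the unknown-age
# code is an assumption -- check your own SHR extract's codebook
df = pd.DataFrame({'VicAge': [24, 31, 999, 17, 999, 45]})

# Replace the sentinel with NaN so the 999s do not inflate the median
df['VicAge'] = df['VicAge'].replace(999, np.nan)
df['VicAge'] = df['VicAge'].fillna(df['VicAge'].median())
print(df['VicAge'].tolist())
```

If the sentinel is left in place, the median (and any mean) is pulled upward by the 999s, silently biasing every downstream statistic.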
In the realm of data science, univariate analysis is the simplest form of analyzing data. “Uni” means “one”: the data has only one variable. It doesn’t deal with causes or relationships (unlike regression); its major purpose is to describe. It takes data, summarizes it, and finds patterns within it.
The age of victims in homicide cases can provide significant insights into the demographics most affected by such crimes. A histogram can effectively illustrate the distribution of ages.
import matplotlib.pyplot as plt
import seaborn as sns
# Set the aesthetic style of the plots
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
# Assuming 'shr_data' is your dataframe and 'VicAge' is the column with victim ages
shr_data = shr_data[shr_data['VicAge'] <= 120] # Filter out unreasonable ages
sns.histplot(shr_data['VicAge'], bins=24, kde=False, color='dodgerblue') # 24 bins for 5-year age groups up to 120
plt.title('Age Distribution of Homicide Victims')
plt.xlabel('Age')
plt.ylabel('Number of Victims')
plt.show()
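The same distribution can be tabulated numerically with pd.cut, which is handy for quoting exact counts per age band; this sketch uses toy ages in place of shr_data['VicAge']:

```python
import pandas as pd

# Toy ages standing in for shr_data['VicAge']
ages = pd.Series([3, 18, 22, 25, 27, 34, 41, 58, 63])

# Count victims per 20-year band (right-open intervals, like histogram bins)
bands = pd.cut(ages, bins=[0, 20, 40, 60, 120], right=False)
counts = bands.value_counts(sort=False)
print(counts)
```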
- Age Group Concentration: The most noticeable peak occurs in the 20–30 age range, suggesting that individuals in their late teens to early adulthood are disproportionately represented among homicide victims. This is a critical age range often associated with certain risk factors such as involvement in gang activity, risky behaviors, or other situations that may increase vulnerability to violent crime.
- Decline in Victim Count with Age: There’s a clear decline in the number of victims as age increases beyond the 30-year mark. This trend continues progressively, with each successive age bracket having fewer victims than the preceding one. The drop is especially pronounced after age 50. This might reflect social, behavioral, or biological factors that reduce the risk of falling victim to homicide with increasing age.
- Early Adulthood Vulnerability: The relatively high number of victims in the 20–30 age range could also be indicative of social factors, such as economic instability, substance abuse, or high-risk lifestyles, which are more prevalent in this demographic. It may also reflect the age at which individuals are more likely to be involved in activities or relationships that could escalate to violence.
- Childhood and Adolescence: There’s a smaller, yet significant number of victims aged 0–20. The presence of victims in the 0–10 age range might indicate cases of child abuse, family violence, or other tragic family-related incidents. The 10–20 age range could involve juvenile delinquency, school violence, or domestic issues.
- Senior Citizens: The relatively low number of victims in the older age groups might suggest that senior citizens are less involved in activities that could lead to violent crimes or they may have lifestyle patterns that keep them out of harm’s way. However, it could also suggest that they are less likely to be targeted for violent crimes or that when they are victims, the crimes may be less likely to be classified as homicide.
- Public Health and Social Services: The concentration of victims in the younger age brackets could be a call to action for public health and social service organizations to focus their efforts on youth and young adults. It may be beneficial to invest in early intervention programs, youth services, and education to address the underlying causes that contribute to the increased homicide rate in this demographic.
- Policy Implications: From a policy-making perspective, these insights could be instrumental in shaping targeted crime prevention strategies. For instance, understanding that the highest risk age group is 20–30 can help direct resources towards initiatives that engage this demographic, such as job training programs, community-centered policing, and educational outreach.
Each of these insights lays the groundwork for more targeted, age-specific inquiries and interventions. Further analysis could also look into how age interacts with other factors like race, gender, socioeconomic status, and geographic location to yield even more nuanced understandings of victimology within homicide cases.
Understanding the gender distribution can shed light on potential gender-based trends in homicide incidents.
# Plotting the gender distribution of victims
plt.figure(figsize=(7, 5))
sns.countplot(x='VicSex', data=shr_data, palette='coolwarm')
plt.title('Gender Distribution of Homicide Victims')
plt.xlabel('Gender')
plt.ylabel('Number of Victims')
# The tick labels come from the category values themselves; override them only
# if your data stores codes (e.g., 'M'/'F') rather than full words
plt.show()
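Exact percentages complement the bar chart; a small sketch with toy values standing in for shr_data['VicSex']:

```python
import pandas as pd

# Toy column standing in for shr_data['VicSex']
sex = pd.Series(['Male', 'Male', 'Female', 'Male', 'Unknown', 'Male'])

# Share of victims in each gender category, as percentages
shares = sex.value_counts(normalize=True) * 100
print(shares.round(1))
```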
- Disproportionate Victimization: Males are substantially more likely to be victims of homicide compared to females. This pronounced discrepancy points towards gender-specific risk factors and could imply that men are more often in situations that elevate their risk of falling victim to homicide.
- Societal and Behavioral Factors: The over-representation of males may be associated with societal norms and behaviors, such as aggression and involvement in violent activities, which are traditionally more associated with men. It might also reflect involvement in certain types of crime or occupations that carry higher risks.
- Domestic Violence: While the numbers for female victims are lower, it does not negate the seriousness or frequency of homicides resulting from domestic violence, where women are often disproportionately affected. Further analysis could help in understanding the contexts in which female homicides occur.
- Public Health Approach: The data suggests a need for gender-specific violence prevention strategies. For men, this might include community-based programs that address violence, gang intervention strategies, and substance abuse treatment. For women, this might involve efforts to combat domestic violence and provide support for at-risk individuals.
- Targeted Interventions: Recognizing that the majority of victims are male could lead to targeted interventions aiming to reduce risk factors among young men, particularly in high-crime areas. Initiatives might include mentorship programs, educational and economic opportunities, and conflict resolution training.
- Cultural and Legal Implications: The stark gender difference also begs for cultural introspection on the value systems and legal frameworks that might inadvertently contribute to the high number of male victims. It could also open up discussions on masculinity and its associations with violence, both as perpetrators and victims.
- Intersectionality: While this chart presents a clear binary distribution, it’s also important to consider how gender intersects with other factors such as race, age, and socioeconomic status to affect homicide rates. For instance, men of certain racial backgrounds or in certain age groups may be at even higher risk.
This gender distribution serves as a vital statistic that could inform law enforcement agencies, policymakers, and community organizations about the demographic groups most at risk and help in developing focused crime prevention measures and support services. It also raises questions about the broader societal patterns that contribute to such a gender gap in victimization and how these patterns can be changed.
Analyzing the race and ethnicity of victims can highlight racial and ethnic disparities in homicide cases.
# Plotting the race distribution of victims
plt.figure(figsize=(10, 6))
race_order = shr_data['VicRace'].value_counts().index # Order bars by count
race_palette = sns.color_palette("Set2", len(race_order)) # Create a palette with a distinct color for each race
sns.countplot(x='VicRace', data=shr_data, order=race_order, palette=race_palette)
plt.title('Race Distribution of Homicide Victims')
plt.xlabel('Race')
plt.ylabel('Number of Victims')
plt.xticks(rotation=45) # Rotate labels to fit and avoid overlapping
plt.show()

# Plotting the ethnicity distribution of victims
plt.figure(figsize=(7, 5))
ethnic_order = shr_data['VicEthnic'].value_counts().index # Order bars by count
ethnic_palette = sns.color_palette("Set3", len(ethnic_order)) # Create a palette with a distinct color for each ethnicity
sns.countplot(x='VicEthnic', data=shr_data, order=ethnic_order, palette=ethnic_palette)
plt.title('Ethnicity of Homicide Victims')
plt.xlabel('Ethnicity')
plt.ylabel('Number of Victims')
plt.xticks(rotation=45) # Rotate labels to fit and avoid overlapping
plt.show()
- Racial Disparity: The almost equal numbers of White and Black victims, despite the demographic fact that Black individuals typically represent a smaller percentage of the total population, point towards a significant racial disparity. This indicates a higher rate of homicide victimization within the Black community, which could be due to a range of socio-economic factors, including poverty, inequality, and neighborhood crime rates.
- Minority Representation: Asian, American Indian or Alaskan Native, and Native Hawaiian or Pacific Islander victims are present in much smaller numbers. This could reflect their relative population sizes, but it also raises questions about underreporting, differences in crime rates, or perhaps different social dynamics that affect these groups.
- Socio-Economic Context: The high numbers for both White and Black victims may indicate the presence of socio-economic issues that transcend racial lines, such as income inequality, unemployment, and educational disparities that contribute to the likelihood of becoming a homicide victim.
- Unknown Racial Data: The ‘Unknown’ category signifies a considerable gap in data collection or reporting. The inability to classify a substantial number of victims by race could hinder efforts to fully understand and address the racial dynamics of homicide.
- Targeted Community Interventions: The data underscores the need for community interventions that are sensitive to the racial contexts of different groups. For Black communities, where the number of victims is disproportionately high, programs that address systemic issues like poverty, education, and job opportunities may be especially important.
- Policing and Prevention Policies: These statistics might influence policing strategies and crime prevention policies. Ensuring equitable allocation of resources and community policing efforts could help to mitigate the factors that lead to the observed racial disparities in homicide victimization.
- Cultural Competence: The insights suggest a need for culturally competent approaches to victim services and support, recognizing the specific needs and experiences of different racial groups. This is particularly important for racial minorities who may face additional barriers in accessing support services.
- Further Research Needed: This distribution also highlights the need for further research into the structural and systemic reasons behind the racial disparities in homicide victimization rates. Understanding the root causes is critical to developing effective solutions.
It’s important to approach the interpretation of this data with sensitivity to the historical and social contexts that shape the experiences of different racial groups with respect to crime and victimization. The insights from this analysis should be used to inform holistic, fair, and effective crime prevention and response strategies.
Bivariate analysis is the simultaneous analysis of two variables to understand the relationship between them. It helps uncover correlations, trends, and potential causal relationships, which can be crucial for informed decision-making and hypothesis testing. In this section of our victimology study, we will explore the relationships between age and gender, and race and gender, of homicide victims.
To explore the relationship between the age of homicide victims and their gender, we can use a box plot. This will allow us to see the distribution of ages for male and female victims and identify any notable differences.
# Create a boxplot to visualize the age distribution by gender
plt.figure(figsize=(10, 6))
sns.boxplot(x='VicSex', y='VicAge', data=shr_data)
plt.title('Age Distribution by Gender of Homicide Victims')
plt.xlabel('Gender')
plt.ylabel('Age')
plt.show()
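The box plot’s summaries can also be read off numerically; a sketch computing, per gender, the quartiles the plot draws (the toy table stands in for the real data):

```python
import pandas as pd

# Toy data standing in for the victim table
df = pd.DataFrame({
    'VicSex': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'VicAge': [19, 27, 33, 41, 52],
})

# Per-gender quartiles -- the same summaries the box plot draws
summary = df.groupby('VicSex')['VicAge'].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```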
The box plot will show the median age, the interquartile range, and any outliers for each gender. From this, we can make several interpretations:
- Median Age: The median age, indicated by the line within the box, seems similar for both male and female victims, suggesting that the central tendency for victim age is not significantly different between genders.
- Age Range: Both males and females have a wide interquartile range (IQR), which is the height of the box, indicating that victims come from a broad age range. However, the range for males extends slightly lower, indicating younger male victims, whereas the range for females is slightly more compressed and shifted upwards.
- Outliers: There are outliers present for both genders, indicating that there are victims at ages which are unusually low or high compared to the general distribution. These could represent particularly vulnerable groups or indicate data recording errors.
- Unknown Gender: The ‘Unknown’ category likely represents data that could not be categorized into male or female due to various reasons, which might include the state of the remains or lack of identification.
- The similarity in median ages suggests that both genders, when victimized by homicide, are subject to similar life stage risks.
- The broader age range for males and the presence of lower age outliers could indicate that younger males are at risk due to factors such as gang involvement, lifestyle risks, or other forms of social conflict.
- For females, the slightly older age distribution might relate to different societal roles or risk exposures, such as domestic violence, which can occur at any age but may be more prevalent in certain life stages.
- Youth Engagement: Given the younger age of male victims, youth engagement programs that offer alternatives to gang membership or criminal activity could be effective.
- Domestic Violence: The presence of female victims across all ages might necessitate continued support for domestic violence resources, including hotlines, shelters, and public awareness campaigns.
- Elder Abuse: The presence of outliers in the older age range suggests that policies addressing elder abuse and protections for senior citizens could also be critical.
- Data Quality: The ‘Unknown’ category emphasizes the need for accurate and thorough crime reporting and victim identification. Policies could be enacted to improve forensic capabilities and data management systems.
- Tailored Interventions: Prevention programs could be tailored for the most at-risk age groups within each gender, informed by the observed age ranges and distributions.
- Research: Further research into the specific contexts of these homicides could inform more precise interventions. For example, if younger males are more often victims of street violence, then community policing and after-school programs might be beneficial.
Next, we can investigate the relationship between race and gender of homicide victims using a stacked bar chart. This will display the proportion of male and female victims within each racial category.
# Create a dataframe to count the number of victims by race and gender
race_gender_counts = shr_data.groupby(['VicRace', 'VicSex']).size().unstack()
# Plot a stacked bar chart
race_gender_counts.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Gender Distribution by Race of Homicide Victims')
plt.xlabel('Race')
plt.ylabel('Number of Victims')
plt.legend(title='Gender')
plt.show()
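Because raw counts differ greatly between races, a normalized view is often easier to compare; a sketch with toy counts standing in for race_gender_counts:

```python
import pandas as pd

# Toy counts standing in for race_gender_counts
counts = pd.DataFrame(
    {'Female': [10, 8], 'Male': [40, 42]},
    index=['Black', 'White'],
)

# Within-race percentages, so groups of different sizes compare directly
pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(pct)
```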
- Predominance of Male Victims: Across all racial categories, male victims significantly outnumber female victims. This consistent pattern suggests that being male is a risk factor for homicide victimization regardless of race.
- Racial Disparities: The number of Black male victims stands out as particularly high, which reinforces the need to address the specific social and economic factors contributing to this risk. There is also a substantial number of White male victims, indicating that these two racial groups are the most affected by homicide.
- Less Representation of Minority Races: Asian and Native American or Alaskan Native victims, both male and female, are represented in much smaller numbers. This could reflect population proportions but also could indicate different societal dynamics, reporting practices, or perhaps lower rates of victimization.
- Unknown Gender and Race: The ‘Unknown’ category for both gender and race could be indicative of challenges in data collection, such as missing information or the inability to determine these attributes due to the condition of the remains.
- Socioeconomic Factors: Socioeconomic issues, including poverty and unemployment, which are often correlated with higher crime rates, may disproportionately affect certain racial groups and lead to higher rates of homicide victimization, especially among men.
- Community Dynamics: Community-level factors such as gang presence, neighborhood crime rates, and policing policies could also play a significant role in the observed disparities.
- Cultural Factors: Cultural norms regarding masculinity and conflict resolution may contribute to the high numbers of male victims in all racial categories.
- Violence Prevention Programs: The consistent over-representation of males in homicide statistics across racial groups suggests that violence prevention programs need to be sensitive to gender dynamics and specifically target risk factors that affect men.
- Targeted Interventions for High-risk Groups: For races with disproportionately high numbers of victims, such as Black and White males, interventions could be tailored to address the unique challenges these groups face.
- Data Quality Improvement: Efforts to improve the quality and completeness of homicide data collection will be important for understanding and addressing these issues.
- Community Engagement and Support: Initiatives to engage at-risk communities, provide support and resources, and address systemic inequalities could be instrumental in reducing homicide rates, particularly in heavily impacted racial groups.
- Cultural Awareness and Education: Educational programs that promote cultural awareness, non-violent conflict resolution, and positive expressions of masculinity could contribute to reducing homicide victimization among men.
It is crucial to note that correlation does not imply causation, and these visualizations can only suggest associations, not definitive reasons behind these associations. Further statistical analysis would be required to delve deeper into the causative factors. The findings from the bivariate analysis will form hypotheses for multivariate analysis, which can control for other variables and potentially provide a clearer picture of the interplay between different demographic factors in the context of homicide victimization.
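One such statistical follow-up is a chi-square test of independence between race and gender. A hedged sketch with a toy contingency table (a real analysis would build the table with pd.crosstab on the actual columns):

```python
from scipy.stats import chi2_contingency

# Toy contingency table of race (rows) by gender (columns); illustrative
# counts only -- use pd.crosstab(shr_data['VicRace'], shr_data['VicSex'])
table = [[40, 10],
         [42, 8]]

# Small p-values suggest the two variables are not independent
chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2={chi2:.3f}, p={p:.3f}, dof={dof}')
```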
Before we can apply machine learning algorithms to the Supplementary Homicide Report (SHR) data, we must prepare the data to ensure that it is in a suitable format for analysis. This includes encoding categorical variables, splitting the data into training and test sets, and normalizing or standardizing the data when necessary.
Many machine learning models cannot handle categorical variables unless they are converted into numerical form. For the SHR dataset, variables such as VicSex, VicRace, VicEthnic, and Weapon are categorical and need to be encoded.
from sklearn.preprocessing import LabelEncoder
# Initialize label encoder
label_encoder = LabelEncoder()
# Update the list of categorical columns to include all object dtype columns
categorical_cols = ['Agentype', 'OffSex', 'OffRace', 'OffEthnic', 'Relationship', 'Circumstance', 'Subcircum', 'VicSex', 'VicRace', 'VicEthnic', 'Weapon']
# Remove any columns not needed for the model or that are identifiers
shr_data = shr_data.drop(columns=['ID', 'CNTYFIPS', 'Ori', 'State', 'Agency', 'Source', 'ActionType', 'Homicide', 'Situation', 'Month', 'MSA', 'FileDate'])
# Apply label encoding to each categorical column
for col in categorical_cols:
    if col in shr_data.columns:  # Check if the column is in the DataFrame
        shr_data[col] = label_encoder.fit_transform(shr_data[col].astype(str))  # Convert to string type if not already
# Verify the encoding
shr_data.info()
Alternatively, for variables where the order is not important, one-hot encoding can be used:
# One-hot encode categorical variables
# shr_data = pd.get_dummies(shr_data, columns=categorical_cols, drop_first=True)
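On a toy column, one-hot encoding looks like this (the example frame is illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame; one-hot encoding expands each category into its own 0/1 column,
# and drop_first removes one redundant column per variable
df = pd.DataFrame({'VicSex': ['Male', 'Female', 'Male']})
encoded = pd.get_dummies(df, columns=['VicSex'], drop_first=True)
print(encoded.columns.tolist())
```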
To evaluate the performance of our machine learning model, we need to split the data into a training set and a test set.
from sklearn.model_selection import train_test_split
# Specify the target variable and the features
target = 'Solved'
features = shr_data.columns.drop([target])
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(shr_data[features], shr_data[target], test_size=0.2, random_state=42)
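With an imbalanced target such as Solved, passing stratify keeps the class ratio identical in both splits; a toy illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (80% positive); stratify=y preserves this ratio
# in both the training and the test split
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(y_tr.mean(), y_te.mean())
```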
Some algorithms, like neural networks and k-nearest neighbors, are sensitive to the scale of the input features and can benefit from normalization or standardization.
from sklearn.preprocessing import StandardScaler
# Initialize the standard scaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform both the training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Logistic regression is a statistical method for predicting binary outcomes based on one or more predictor variables (features). The output of logistic regression is a probability that the given input point belongs to a certain class, which is used to classify the input into one of two classes. It is a special case of linear regression where the outcome variable is categorical. Logistic regression is particularly suitable for binary classification problems, such as predicting whether an event will occur (e.g., pass/fail, win/lose, alive/dead, solved/unsolved).
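Concretely, logistic regression pushes a linear score through the sigmoid function to obtain that probability; a minimal numeric illustration:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# z = 0 sits exactly on the decision boundary (p = 0.5); large positive
# scores push the probability toward 1, large negative toward 0
print(sigmoid(0.0), round(float(sigmoid(4.0)), 3), round(float(sigmoid(-4.0)), 3))
```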
Let’s use logistic regression to predict whether a homicide case will be solved (Solved) based on features such as VicAge and VicRace. First, we must encode the Solved column if it’s categorical:
from sklearn.preprocessing import LabelEncoder
# Initialize label encoder and fit it to the 'Solved' column
label_encoder = LabelEncoder()
shr_data['Solved'] = label_encoder.fit_transform(shr_data['Solved'])
We want to note what the encoding looks like:
# Get the class labels and corresponding encoded values
class_labels = label_encoder.classes_
class_mappings = {label: idx for idx, label in enumerate(class_labels)}
# Print the mapping
print("Label mapping from LabelEncoder:")
print(class_mappings)
Now, we’ll implement the logistic regression model using scikit-learn:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Initialize the Logistic Regression model
log_reg = LogisticRegression()
# Fit the model to the training data
log_reg.fit(X_train_scaled, y_train)
# Predict on the test data
y_pred = log_reg.predict(X_test_scaled)
# After label encoding, the positive class 'Yes' is stored as an integer;
# look up its encoded value rather than hard-coding the string
positive_label = label_encoder.transform(['Yes'])[0]
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=positive_label)
recall = recall_score(y_test, y_pred, pos_label=positive_label)
f1 = f1_score(y_test, y_pred, pos_label=positive_label)
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
Scores this high look too good to be true, so let’s check the class balance:
# Calculate the percentage of each class in the 'Solved' column
solved_counts = shr_data['Solved'].value_counts(normalize=True) * 100
# Print out the percentages
print("Percentage of Solved Cases: {:.2f}%".format(solved_counts[1])) # Assuming 1 represents 'Solved'
print("Percentage of Unsolved Cases: {:.2f}%".format(solved_counts[0])) # Assuming 0 represents 'Not Solved'
This is itself insightful: roughly 71% of the murders in this dataset are recorded as solved.
The dataset has an imbalance in the target class ‘Solved’, with about 70.83% of cases being solved and 29.17% being unsolved. To balance this, you can use various techniques like undersampling the majority class, oversampling the minority class, or using synthetic data generation methods such as SMOTE (Synthetic Minority Over-sampling Technique).
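Undersampling is the simplest of these alternatives; a sketch with a toy imbalanced frame (the column names mirror the dataset but the values are made up):

```python
import pandas as pd

# Toy imbalanced frame: 1 = solved (majority), 0 = unsolved (minority)
df = pd.DataFrame({'Solved': [1] * 70 + [0] * 30, 'VicAge': range(100)})

# Randomly downsample the majority class to the size of the minority class
minority = df[df['Solved'] == 0]
majority = df[df['Solved'] == 1].sample(n=len(minority), random_state=42)
balanced = pd.concat([majority, minority])
print(balanced['Solved'].value_counts())
```

The trade-off versus SMOTE: undersampling discards real records, while SMOTE keeps them all but adds synthetic ones.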
Here’s how to apply SMOTE to balance your dataset:
First, you need to install the imbalanced-learn package if you haven’t already, which provides the SMOTE functionality:
!pip install imbalanced-learn
Then you can apply SMOTE to your training data:
from imblearn.over_sampling import SMOTE
# Initialize SMOTE
smote = SMOTE()
# Apply SMOTE to the training data
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
# Check the new class distribution
print("After SMOTE:")
print(pd.Series(y_train_smote).value_counts(normalize=True) * 100)
Now, with the balanced dataset, you can train your logistic regression model again:
# Reinitialize the Logistic Regression model
log_reg_smote = LogisticRegression()
# Fit the model to the SMOTE-balanced training data
log_reg_smote.fit(X_train_smote, y_train_smote)
# Predict on the original test data (it's important to test on the original distribution)
y_pred_smote = log_reg_smote.predict(X_test_scaled)
# The positive class 'Solved' is the integer the label encoder assigned to 'Yes'
positive_label = label_encoder.transform(['Yes'])[0]
# Calculate the metrics using the specified positive class label
accuracy_smote = accuracy_score(y_test, y_pred_smote)
precision_smote = precision_score(y_test, y_pred_smote, pos_label=positive_label)
recall_smote = recall_score(y_test, y_pred_smote, pos_label=positive_label)
f1_smote = f1_score(y_test, y_pred_smote, pos_label=positive_label)
# Print the performance metrics
print(f'Accuracy after SMOTE: {accuracy_smote:.4f}')
print(f'Precision after SMOTE: {precision_smote:.4f}')
print(f'Recall after SMOTE: {recall_smote:.4f}')
print(f'F1 Score after SMOTE: {f1_smote:.4f}')
Keep in mind that while SMOTE can help balance the classes, it creates synthetic samples that may not represent real-world data. Moreover, it’s crucial to maintain the original test set distribution to evaluate the model performance reliably, as this reflects the true class distribution you’d expect to see in practice.
Even after balancing, the scores remain suspiciously high, so on their own they are not very informative.
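When headline metrics look too good, a per-class breakdown is a useful sanity check; a sketch with toy labels standing in for y_test and y_pred_smote:

```python
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Toy labels standing in for (y_test, y_pred_smote); with imbalanced classes,
# the per-class breakdown reveals more than a single accuracy number
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_hat  = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)  # rows: true class 0, then true class 1
print(cm)
print(round(balanced_accuracy_score(y_true, y_hat), 3))
```

Balanced accuracy averages the per-class recalls, so it cannot be inflated simply by predicting the majority class everywhere.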
Throughout our analysis of the Supplementary Homicide Report (SHR) dataset, several key patterns and relationships within the victim data emerged:
- Age and Vulnerability: Young adults, particularly in the 20–30 age range, were identified as being disproportionately represented among homicide victims, suggesting a potential vulnerability of this demographic to violent crime.
- Gender Disparity: Males were significantly more likely to be homicide victims compared to females, indicating gender-related risk factors.
- Racial Disparities: Racial disparities were evident, with Black individuals representing the highest number of victims, highlighting potential systemic and societal issues affecting these communities.
Our machine learning efforts included applying logistic regression to predict case outcomes and victim gender. The logistic regression model exhibited high accuracy, precision, recall, and F1 scores, but these results should be interpreted with caution. The potential for class imbalance and overfitting was addressed by employing techniques such as SMOTE for re-sampling the dataset.
The performance of logistic regression was not directly compared to that of a neural network within the scope of this article, but such a comparison would be a valuable future investigation. Neural networks might uncover more complex patterns due to their ability to model non-linear relationships, and they can potentially provide improved predictive performance over logistic regression.
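Such a comparison could be sketched as follows, using scikit-learn's MLPClassifier as a small neural network on synthetic data; the dataset here is an illustrative stand-in, not the SHR data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the SHR features and target
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Linear baseline
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lr_acc = accuracy_score(y_test, log_reg.predict(X_test))

# Small feed-forward neural network
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                    random_state=42).fit(X_train, y_train)
mlp_acc = accuracy_score(y_test, mlp.predict(X_test))

print(f'Logistic regression accuracy: {lr_acc:.3f}')
print(f'Neural network accuracy:      {mlp_acc:.3f}')
```

On real crime data, the gap between the two would indicate how much non-linear structure the linear model is missing.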
The insights from this study could inform law enforcement and public policy in several ways:
- Targeted Interventions: The identification of high-risk demographics could lead to more targeted crime prevention and support measures, potentially reducing homicide rates.
- Community Programs: The age and gender data suggest a need for community programs focused on young adult males, which might include education, mentorship, and economic opportunities.
- Policy Reform: The racial disparity in homicide victimization rates suggests that broader policy reforms addressing socioeconomic inequality could have a positive impact on reducing crime.
This article has underscored the importance of data analysis in understanding crime patterns. Reliable, well-analyzed data can offer invaluable insights into the dynamics of crime, guiding more effective law enforcement strategies and informing public policy.
The potential for further research is vast. Future studies could employ more complex models, such as ensemble methods or deep learning, to handle the nuances of crime data better. The integration of additional data sources, such as socio-economic indicators or geographic information, could also provide a more detailed understanding of the factors contributing to homicide patterns.
- Murder Accountability Project. Available online: murderdata.org.
- Pedregosa, F. et al. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12 (2011): 2825–2830.
- Chawla, N.V. et al. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research 16 (2002): 321–357.
Data analysis is an ongoing, iterative process, and the work presented here forms a basis upon which deeper and more nuanced understanding can be built. It is the rigorous analysis and application of this data that empowers stakeholders to make informed decisions that can lead to the betterment of society.
In crime data analysis, understanding the profile of victims can aid in developing targeted preventive strategies and resource allocation. Predicting the gender of a homicide victim based on age and race could, for instance, help social services and law enforcement agencies identify and protect vulnerable demographics.
XGBoost (Extreme Gradient Boosting) is an advanced implementation of gradient boosting that is more efficient and potentially more accurate than standard gradient boosting. It is particularly well-suited for datasets with imbalanced classes and has become a popular choice for classification tasks due to its speed and performance.
In this addendum, we'll use XGBoost to predict the gender of homicide victims based on their age (VicAge) and race (VicRace). We will encode the categorical variables, split the data into training and test sets, train an XGBoost classifier, and evaluate its performance.
Here is the complete code block for this task:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

# Assuming 'shr_data' is your DataFrame and it has been properly loaded
# Prepare the target and feature columns
label_encoder = LabelEncoder()
shr_data['VicSex'] = label_encoder.fit_transform(shr_data['VicSex'])
# Encode the categorical features as well, since StandardScaler and
# XGBoost require numeric inputs
for col in ['VicRace', 'Weapon']:
    shr_data[col] = LabelEncoder().fit_transform(shr_data[col])
features = ['VicAge', 'VicRace', 'VicCount', 'Weapon']  # Age and race plus two additional features
# Extract features and target
X = shr_data[features]
y = shr_data['VicSex']
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Apply SMOTE to balance the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Initialize the XGBoost classifier
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
# Train the model on the balanced training data
xgb_model.fit(X_train_smote, y_train_smote)
# Predict on the test data
y_pred = xgb_model.predict(X_test)
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
# Output the performance
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(report)
Ensure you have XGBoost installed in your environment (!pip install xgboost if not), and adjust the feature names (VicAge and VicRace) to match exactly how they appear in your dataset. This code label-encodes VicSex; if the column contains only 'Male' and 'Female', the target will be binary (0 and 1), but if it also contains values such as 'Unknown', a third class will appear. The StandardScaler is used to put the features on a comparable scale; strictly speaking, tree-based models such as XGBoost are insensitive to feature scaling, but it does no harm and keeps the preprocessing consistent with the earlier logistic regression workflow.
The performance of the XGBoost model will be assessed using accuracy and a classification report, which includes precision, recall, and F1-score for each class. These metrics provide a comprehensive view of the model’s predictive capabilities.
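The per-class numbers in such a report come directly from the confusion counts. As a reminder of what each metric means, here is a minimal hand computation; the counts below are illustrative, not taken from the SHR model:

```python
# Illustrative confusion counts for one class (not from the SHR model)
tp, fp, fn = 34, 66, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of all positive predictions, how many were right
recall = tp / (tp + fn)     # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Note how the F1-score sits between precision and recall but is pulled toward the weaker of the two, which is why a low precision drags the F1 down even when recall looks acceptable.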
Here’s an analysis of the provided metrics assuming classes ‘0’ and ‘1’ correspond to male and female, respectively, and class ‘2’ is an anomaly that needs investigation:
- Accuracy (0.6410 or 64.10%): The overall accuracy of the model indicates that it correctly predicts the gender of the homicide victim about 64% of the time.
Class ‘0’ (Presumed Male):
- Precision (0.34): Of all the predictions the model made for the male class, only 34% were actually male.
- Recall (0.63): Of all the actual male cases, the model correctly identified 63% of them.
- F1-score (0.44): The F1-score, which balances precision and recall, is relatively low for males, indicating the model isn’t predicting this class very well.
Class ‘1’ (Presumed Female):
- Precision (0.86): The model is much better at precision for the female class, with 86% of female predictions being correct.
- Recall (0.64): It correctly identifies 64% of the actual female cases.
- F1-score (0.74): A higher F1-score for females suggests the model is better at predicting this class compared to males.
Class ‘2’ (Unknown):
- The metrics for this class are not meaningful since we don’t have a clear understanding of what this class represents. The model’s recall for this class is 67%, but its precision is very low.
- Support: The support value shows the number of true instances for each class in the test data. The significant imbalance (with very few instances in class ‘2’) might be affecting the model’s performance.
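One standard remedy for this kind of imbalance is to weight training samples inversely to their class frequency. The sketch below uses scikit-learn's compute_sample_weight on an illustrative label array; the resulting weights could then be passed to the classifier's fit method via its sample_weight argument (for a binary problem, XGBoost also offers the scale_pos_weight parameter):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative label array with a severe imbalance (class 2 is rare)
y_labels = np.array([0] * 500 + [1] * 480 + [2] * 20)

# 'balanced' assigns each sample the weight n_samples / (n_classes * count(class))
weights = compute_sample_weight(class_weight='balanced', y=y_labels)

# Rare-class samples receive much larger weights than common-class samples
print(f'class 0 weight: {weights[0]:.3f}')
print(f'class 2 weight: {weights[-1]:.3f}')
```

Weighting pushes the model to pay attention to the rare class during training without fabricating synthetic samples the way SMOTE does, so the two approaches are worth comparing on the same test set.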