From 10,000 Daily Alerts to a Simple 1–5 Score: A Machine Learning Approach to Enterprise Vulnerability Management
I built TAPS (ThreatSurface Analyzer with Predictive Scoring), a machine learning system that:
- Automates security risk assessment across thousands of assets
- Combines vulnerability data, configuration quality, and business context
- Generates simple 1–5 risk scores (like credit scores for cybersecurity)
- Achieved 89.9% accuracy using LARS regression
- Proves configuration hardening has higher ROI than just patching
If your security team drowns in vulnerability alerts and struggles with prioritization, this is for you.
Part 1: The Problem Nobody’s Solving
The 10,000-Alert-Per-Day Problem
Picture this: You’re a security analyst at a mid-sized enterprise. You arrive Monday morning, open your vulnerability scanner, and see 10,247 new alerts. Your SIEM has flagged 3,892 suspicious events. Your compliance dashboard shows 584 configuration deviations.
You have 8 hours. Where do you even start?
This isn’t hypothetical. According to Verizon’s 2024 Data Breach Investigations Report, the median enterprise generates over 10,000 security alerts daily. Even with a team of analysts, manual review is mathematically impossible.
The Broken Prioritization Model
Most teams rely on CVSS (Common Vulnerability Scoring System) scores. A CVSS 9.0 “critical” vulnerability gets immediate attention. A CVSS 6.5 “medium” goes to the backlog.
Here’s the problem: CVSS measures intrinsic vulnerability severity but ignores:
- Is the vulnerable system internet-facing or internal?
- Is it configured securely with compensating controls?
- Does it handle customer credit cards or test data?
- Is it production or development?
A CVSS 9.0 on a well-configured test server might be less risky than a CVSS 6.5 on a poorly-configured, internet-facing production system handling financial data.
The Business Translation Gap
Try explaining security to your CEO:
You: “We have 327 critical vulnerabilities, 1,203 high, and 4,891 medium.”
CEO: “Is that… good? Bad? Should I be worried? How much should we spend fixing this?”
You: “Well, it depends on…”
CEO: glazes over
Security metrics don’t translate to business impact. Executives need a number they can understand, track, and budget against.
The Consistency Problem
I ran an experiment: I gave three senior security analysts the same system profile and asked them to rate its risk on a 1–5 scale.
Results:
- Analyst A: 2.5 (low-medium risk)
- Analyst B: 3.8 (high risk)
- Analyst C: 4.2 (critical risk)
Same system. Three different assessments. How do you trend organizational risk when individual evaluations vary by 68%?
The industry needs:
- Scalability: Assess thousands of assets in minutes, not weeks
- Consistency: Same assessment regardless of who evaluates
- Context: Combine vulnerabilities, configuration, and business impact
- Simplicity: One number executives understand
- Actionability: Clear prioritization guidance
Enter TAPS.
Part 2: The Solution Architecture
The Core Concept: Security Credit Scores
Financial services solved a similar problem decades ago. How do you assess the creditworthiness of millions of people consistently? Credit scores.
Complex financial history → Single number (300–850) → Clear decision (approve/deny loan).
What if we applied this to security?
Complex security data → Single number (1–5) → Clear action (quarterly review / emergency patch).
The Three-Dimensional Risk Model
TAPS integrates three data dimensions that are typically kept separate:
Dimension 1: Vulnerability Intelligence (NIST NVD)
What we capture:
- Maximum CVSS score among all vulnerabilities
- Total vulnerability count
- Average vulnerability severity
- Presence of critical (CVSS ≥ 9.0) vulnerabilities
Why it matters:
This answers: “What’s broken on this system?”
Data source: NIST National Vulnerability Database (nvd.nist.gov)
Dimension 2: Configuration Quality (CIS Benchmarks)
What we capture:
- CIS Benchmark compliance percentage (0–100%)
- Automated security control implementation
- Manual security control implementation
- Web Application Firewall deployment
- Intrusion Detection System presence
- System age (patch currency proxy)
Why it matters:
This answers: “How well is it defended?”
Even high-severity vulnerabilities are less dangerous on well-configured systems with compensating controls.
Data source: Center for Internet Security Benchmarks (cisecurity.org)
Dimension 3: Business Context (FAIR Framework)
What we capture:
- Asset criticality rating (1–5)
- Financial exposure if compromised
- Threat frequency (attack likelihood)
- Environment (Production/Staging/Development)
- Business unit (Finance/E-Commerce/HR/etc.)
- Data classification (Public/Internal/Confidential/Restricted)
- Loss magnitude category
- Internet accessibility
- User base size
Why it matters:
This answers: “How much damage would a breach cause?”
A vulnerability on a development server isn’t the same threat as one on a production financial system.
Data source: FAIR (Factor Analysis of Information Risk) taxonomy
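To make the three dimensions concrete, here's how they might be bundled into a single per-asset record. This is an illustrative sketch, not the published TAPS code; the class and the field subset are drawn from the feature list in the appendix.

```python
from dataclasses import dataclass

@dataclass
class AssetProfile:
    # Dimension 1: vulnerability intelligence (NIST NVD)
    max_cvss: float             # highest CVSS v3.1 score, 0.0-10.0
    vuln_count: int             # total open vulnerabilities
    mean_cvss: float            # average CVSS score
    has_critical: bool          # any CVSS >= 9.0?
    # Dimension 2: configuration quality (CIS Benchmarks)
    cis_compliance_rate: float  # 0.0-1.0
    has_waf: bool
    has_ids: bool
    # Dimension 3: business context (FAIR)
    business_impact_score: int  # 1-5 criticality
    is_external_facing: bool
    environment: str            # "Production" / "Staging" / "Development"

profile = AssetProfile(
    max_cvss=9.1, vuln_count=7, mean_cvss=6.2, has_critical=True,
    cis_compliance_rate=0.55, has_waf=False, has_ids=True,
    business_impact_score=5, is_external_facing=True, environment="Production",
)
```

A record like this is what the scoring models consume downstream.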
The Output: TAPS Scores
Score Range: 1.0–5.0
1.0–2.5: LOW RISK 🟢
→ Action: Quarterly review cycle
→ Example: Development server, few vulnerabilities, good compliance
2.5–3.5: MEDIUM RISK 🟡
→ Action: Monthly monitoring, 30-day remediation window
→ Example: Staging environment, moderate vulnerabilities, acceptable configuration
3.5–4.5: HIGH RISK 🟠
→ Action: Weekly tracking, 7-day priority remediation
→ Example: Production system, high CVSS vulnerabilities, internet-facing
4.5–5.0: CRITICAL RISK 🔴
→ Action: Immediate response, executive escalation, possible isolation
→ Example: Production financial system, critical vulnerabilities, poor compliance, external-facing
Simple. Clear. Actionable.
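The band-to-action mapping is simple enough to express directly in code. A sketch; the helper and its thresholds just restate the table above:

```python
def taps_band(score: float) -> tuple[str, str]:
    """Map a 1.0-5.0 TAPS score to a risk band and recommended action."""
    if score < 1.0 or score > 5.0:
        raise ValueError("TAPS scores are defined on the 1.0-5.0 range")
    if score <= 2.5:
        return ("LOW", "Quarterly review cycle")
    if score <= 3.5:
        return ("MEDIUM", "Monthly monitoring, 30-day remediation window")
    if score <= 4.5:
        return ("HIGH", "Weekly tracking, 7-day priority remediation")
    return ("CRITICAL", "Immediate response, executive escalation")

print(taps_band(4.3))  # ('HIGH', 'Weekly tracking, 7-day priority remediation')
```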
Part 3: The Machine Learning Approach
Why Machine Learning?
Rule-based systems (if CVSS > 7 AND production THEN high_risk) are brittle and miss nuanced patterns. Machine learning discovers complex relationships automatically:
- Does vulnerability count matter more than max CVSS?
- At what compliance threshold does risk accelerate?
- How do vulnerabilities interact with business impact?
- Are there non-linear effects we’re missing?
The Fair Comparison Challenge
Most algorithm comparison studies are flawed. They create custom features for each algorithm:
- LARS gets linear interaction terms
- Neural networks get normalized features
- Decision trees get categorical bins
Then they claim Algorithm X “won.” But did the algorithm win, or did the feature engineering win?
My Approach: Identical Features for All
All three algorithms I tested received the exact same 19 base features:
- 4 vulnerability metrics
- 6 configuration quality metrics
- 9 business context metrics
No custom features. No algorithm-specific preprocessing. Level playing field.
This way, performance differences reflect genuine algorithmic capabilities, not feature engineering tricks.
The Three Algorithms
Algorithm 1: LARS (Least Angle Regression)
Type: Linear regression with automatic feature selection
How it works: Starts with every coefficient at zero, then incrementally brings in the features most correlated with the residual; the Lasso variant used here applies L1 regularization, which drives irrelevant feature coefficients to exactly zero.
Strengths:
- Extremely interpretable (clear coefficient values)
- Fast training and prediction (< 1 second)
- Automatic feature selection (eliminates noise)
Results:
- R² = 0.899 (explains 89.9% of variance)
- MAE = 0.163 (average error 0.16 points)
- RMSE = 0.209 (limited catastrophic errors)
Best use case: Executive reporting and resource allocation
Example insight from LARS:
cis_compliance_rate coefficient: -0.397
Interpretation: Each 10% compliance improvement reduces risk by 0.04 points.
Business value: Quantified ROI for security hardening investments.
Algorithm 2: M5 Decision Tree
Type: Rule-based hierarchical partitioning
How it works: Recursively splits data on feature thresholds to create if-then rules.
Strengths:
- Very high interpretability (human-readable rules)
- Captures feature interactions naturally
- No assumptions about functional form
Results:
- R² = 0.772 (explains 77.2% of variance)
- MAE = 0.241 (average error 0.24 points)
- RMSE = 0.314 (acceptable error distribution)
Best use case: SOC analyst playbooks and decision support
Example rule from M5:
IF vuln_count > 5
AND cis_compliance < 60%
AND environment = "Production"
THEN risk_score ≈ 4.3 (CRITICAL)
ACTION: Immediate patching + executive notification
Security analysts can follow this logic without ML expertise.
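Rules in that shape can be pulled straight out of a fitted tree with scikit-learn's export_text. A sketch on synthetic data; the feature names and target are illustrative stand-ins:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
vuln_count = rng.integers(0, 26, 300)      # 0-25, as in the appendix
cis_compliance = rng.random(300)           # 0.0-1.0
# Synthetic risk: more vulns and low compliance both push the score up
y = 2.0 + 0.08 * vuln_count + 1.5 * (cis_compliance < 0.6)

X = np.column_stack([vuln_count, cis_compliance])
tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=["vuln_count", "cis_compliance"]))
```

The printed output is the nested if-then structure analysts can follow directly.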
Algorithm 3: LOESS (Local Regression)
Type: Non-parametric local regression, implemented here as distance-weighted k-Nearest Neighbors
How it works: Predicts based on the 50 most similar assets in the training data, weighted by distance.
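A minimal sketch of that scheme using scikit-learn's KNeighborsRegressor on synthetic data (the non-linear target is illustrative; the hyperparameters match the appendix):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
X_train = rng.random((500, 2))              # stand-ins for scaled features
y_train = 1.0 + 3.0 * X_train[:, 0] ** 2    # deliberately non-linear target

# Each prediction is a distance-weighted average over the 50 nearest assets
knn = KNeighborsRegressor(n_neighbors=50, weights="distance")
knn.fit(X_train, y_train)

pred = knn.predict([[0.9, 0.5]])            # query: a new, unseen asset
print(round(float(pred[0]), 2))
```

Because the prediction is purely local, the model tracks curvature the linear LARS fit would flatten out.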
Strengths:
- Captures non-linear patterns and thresholds
- No functional form assumptions
- Flexible to complex relationships
Results:
- R² = 0.624 (explains 62.4% of variance)
- MAE = 0.330 (average error 0.33 points)
- RMSE = 0.404 (moderate error distribution)
Best use case: Threshold detection and specialized analysis
Key finding from LOESS: Risk doesn’t increase linearly with CVSS. It accelerates exponentially above CVSS 7.0. This non-linear pattern validates data-driven thresholds rather than arbitrary cutoffs.
The Ensemble Approach
Rather than pick a “winner,” I combined all three:
Ensemble Prediction:
TAPS_score = (LARS_pred + LOESS_pred + M5_pred) / 3
Benefits:
- Accuracy: Often matches or beats individual models
- Confidence Scoring: When all three agree (variance < 0.2), confidence is high. When they disagree (variance > 0.5), it’s an edge case needing human review.
- Robustness: If one model struggles with unusual data, others compensate.
Results:
- R² = 0.839
- MAE = 0.185
- RMSE = 0.209
Plus free confidence intervals.
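The averaging and variance-based confidence logic is only a few lines. A sketch; the 0.2 / 0.5 variance thresholds restate the bands above:

```python
import statistics

def ensemble_score(lars_pred, loess_pred, m5_pred):
    """Average three model outputs and flag disagreement via variance."""
    preds = [lars_pred, loess_pred, m5_pred]
    score = sum(preds) / 3
    spread = statistics.pvariance(preds)
    if spread < 0.2:
        confidence = "high"
    elif spread <= 0.5:
        confidence = "medium"
    else:
        confidence = "low: route to human review"
    return round(score, 2), confidence

print(ensemble_score(4.1, 3.9, 4.3))   # models agree -> high confidence
print(ensemble_score(2.0, 4.5, 3.0))   # models disagree -> human review
```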
Part 4: The Breakthrough Findings
Finding 1: Configuration Quality Beats Patching (Sort Of)
The data revealed something surprising:
CIS Compliance Coefficient: -0.397 (LARS)
Max CVSS Coefficient: +0.327 (LARS)
Translation: Improving configuration has higher marginal impact than reducing vulnerability severity.
Why this matters:
- You can’t eliminate vulnerabilities instantly (patching takes time, testing, deployment)
- You CAN improve configuration relatively quickly (enable WAF, harden settings, implement IDS)
Practical implication:
If you had $100K to spend on security, the data suggests investing in configuration management tools and compliance automation would reduce risk more than hiring more people to patch faster.
Important caveat: This doesn’t mean “don’t patch.” It means “hardening multiplies the value of patching.”
Finding 2: Algorithm Convergence Validates Risk Drivers
When I compared feature importance across LARS and M5 (two completely different algorithms), they agreed on the top 3 risk drivers:
Top 3 (Consensus):
- Configuration quality (cis_compliance_rate)
- Vulnerability severity (max_cvss)
- Business impact (business_impact_score)
When independent methods reach the same conclusion, confidence skyrockets. These aren’t statistical artifacts — they’re genuine causal factors.
Finding 3: Five Features Are Redundant
LARS drove 5 feature coefficients to exactly zero:
- uptime_days
- estimated_users
- manual_compliance_rate
- business_unit
- loss_magnitude
Interpretation: These provide no additional predictive value beyond the other 14 features.
Practical value: Simpler data collection. You can skip these features without losing accuracy, reducing operational burden.
Finding 4: No Universal “Best” Algorithm
LARS won on accuracy. M5 won on interpretability. LOESS won on threshold detection.
The lesson: Deploy multiple algorithms for different stakeholders:
- LARS for executives (clear coefficient insights)
- M5 for analysts (decision rules)
- LOESS for automation (pattern detection)
- Ensemble for comprehensive scoring
Different operational needs require different algorithmic properties.
Part 5: Real-World Deployment
Scenario 1: Automated Vulnerability Scanning
Integration Point: Post-processing after scanner completes
Workflow:
- Vulnerability scanner finishes weekly scan
- TAPS extracts 19 features for each asset
- LARS model scores all assets in < 5 seconds
- Dashboard updates with current risk posture
- Assets scoring > 4.0 auto-generate priority tickets
Algorithm Choice: LARS (speed + accuracy)
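The post-scan hook reduces to a loop over the inventory. A sketch; the asset records and the score_asset callable are hypothetical stand-ins for your scanner export and fitted LARS model:

```python
def triage(assets, score_asset):
    """Score every asset and return those needing priority tickets."""
    tickets = []
    for asset in assets:
        score = score_asset(asset)      # the fitted model's 1.0-5.0 prediction
        if score > 4.0:                 # the auto-ticket threshold above
            tickets.append({"host": asset["host"], "taps_score": score})
    return tickets

# Stand-in inventory and scores for illustration
inventory = [{"host": "web-01"}, {"host": "db-02"}, {"host": "dev-03"}]
fake_scores = {"web-01": 4.3, "db-02": 3.1, "dev-03": 1.8}
print(triage(inventory, lambda a: fake_scores[a["host"]]))
# [{'host': 'web-01', 'taps_score': 4.3}]
```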
Business Value:
- Manual assessment: 1,000 assets × 15 min = 250 hours
- TAPS assessment: 1,000 assets × 0.3 sec = 5 minutes
- ROI: 249.9 hours saved per scan cycle
Scenario 2: SOC Analyst Decision Support
Integration Point: SIEM alert enrichment
Workflow:
- SIEM generates security alert
- TAPS looks up asset’s current risk score
- M5 decision tree provides logic:
Asset: web-server-042
TAPS Score: 4.3
Rule Applied: "High vuln count + Low compliance + Production"
Recommendation: P1 response, notify senior analyst
Analyst follows playbook based on score
Algorithm Choice: M5 (interpretable rules)
Business Value:
- Consistent triage across all analysts
- Junior analysts make decisions like seniors
- Reduced mean time to respond (MTTR)
Scenario 3: Executive Dashboard
Integration Point: Daily batch reporting
Workflow:
- Nightly batch job scores entire asset inventory
- LARS feature importance shows top organizational risk drivers:
Q4 2025 Security Posture: 3.4 / 5.0
Top Risk Drivers:
• CIS Compliance: 67% average (target: 80%)
• Critical Vulnerabilities: 12% of assets
• Production Exposure: 45% lack WAF
Recommendation: Invest $500K in config automation
Expected Impact: Reduce org risk from 3.4 to 2.8
Board presentation shows trend line over time
Algorithm Choice: LARS (coefficient insights)
Business Value:
- Quantified, data-driven budget justification
- Clear ROI for security investments
- Risk communicated in business language
Scenario 4: Threshold-Based Alerting
Integration Point: Continuous monitoring
Workflow:
- LOESS identifies risk thresholds from training data
- System monitors assets for threshold crossings:
ALERT: web-server-128 crossed CVSS 7.0 threshold
Previous Score: 3.2 → Current Score: 4.1
Risk Acceleration: Detected by LOESS non-linear analysis
Action: Emergency patching authorized
Proactive tickets generated before incidents
Algorithm Choice: LOESS (threshold detection)
Business Value:
- Proactive vs. reactive posture
- Data-driven alert thresholds (not arbitrary)
- Reduced false positive alert fatigue
Part 6: Lessons Learned (The Hard Way)
Lesson 1: First Iteration Was Terrible
My initial models achieved R² around 0.45–0.55. Terrible. Below operational thresholds.
What went wrong:
- Target variable construction was oversimplified
- Hyperparameter search was too narrow
- Training data needed better balance
The fix:
- Consulted with domain experts on realistic risk scoring
- Expanded hyperparameter grids significantly
- Refined synthetic data generation to match real patterns
Final results: R² improved to 0.77–0.90 range.
Lesson: ML projects are iterative. First attempt teaches you what doesn’t work. Embrace the iteration.
Lesson 2: Interpretability Costs Accuracy (But It’s Worth It)
M5's R² was 12.7 points lower than LARS's (0.772 vs. 0.899). My first instinct: "M5 loses."
Wrong mindset.
M5 generates human-readable rules. Analysts can:
- Follow decision logic without ML training
- Explain assessments in audits
- Trust the system (transparency breeds trust)
That's worth the 12.7-point accuracy gap in security operations where humans remain in the loop.
Lesson: Optimize for operational value, not just accuracy metrics.
Lesson 3: Fair Comparison Is Harder Than It Sounds
Every instinct screamed “engineer better features for LOESS!” (Add polynomial terms! Transform features!)
I resisted. The whole point was identical features for fair comparison.
Discipline paid off. Results are scientifically valid because I controlled for confounding variables.
Lesson: Define your experimental goals early and don’t compromise methodology for marginal gains.
Lesson 4: Synthetic Data Is Both Blessing and Curse
Blessing:
- No confidentiality issues
- Perfect for academic/research
- Reproducible results
Curse:
- Doesn’t capture real-world messiness
- Edge cases are theoretical
- Needs validation on actual enterprise data
Lesson: Synthetic data proves feasibility. Real data proves value. Need both.
Lesson 5: Ensemble Is Usually The Answer
When in doubt, average multiple models.
- Errors cancel out
- Free confidence scoring via variance
- Robust to individual model failures
Unless you have a strong reason not to, ensemble.
Lesson: Don’t overthink algorithm selection. Deploy multiple and combine.
Part 7: What’s Next
Short-Term: Validation on Real Data
Synthetic data proves the concept works. But I need to validate on actual enterprise environments with:
- Real vulnerability scans
- Real configuration audits
- Real incident outcomes
Call to action: If your organization is interested in a pilot, let’s talk. I’m offering free implementation in exchange for feedback and anonymized validation data.
Medium-Term: Extend to Other Asset Types
TAPS currently focuses on Apache web servers. The methodology should generalize to:
- Databases (MySQL, PostgreSQL, Oracle)
- Network devices (routers, switches, firewalls)
- Cloud resources (EC2, S3, Lambda)
- Endpoints (laptops, desktops, mobile)
Challenge: Each asset type requires different feature engineering. Transfer learning might help.
Long-Term: Temporal Modeling
Current TAPS provides point-in-time assessment. The next evolution:
- Trend prediction: “This asset’s risk increased 0.8 points over 30 days — investigate degradation”
- Time series forecasting: “At current patching velocity, 23% of assets will enter critical range in Q2”
- Anomaly detection: “Risk spike detected — unusual vulnerability disclosure affecting 127 assets”
This requires recurrent neural networks (LSTM/GRU) and more complex architectures.
The Ultimate Vision: Closed-Loop Security
The dream:
- TAPS detects high risk →
- Triggers automated remediation (patch deployment, config hardening) →
- Re-scans and verifies risk reduction →
- Learns from successful remediations →
- Improves future predictions
Human oversight remains critical, but routine decisions become automated. Security teams focus on strategy, not spreadsheets.
Part 8: How You Can Use This
For Security Practitioners
If you’re drowning in vulnerability alerts:
- Start collecting the 19 features TAPS uses
- Build a simple baseline model (even basic linear regression helps)
- Use scores to triage your existing backlog
- Iterate and improve based on incident outcomes
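Step 2 really can be that small. A baseline sketch with plain linear regression on three stand-in features (synthetic data and made-up coefficients for illustration; swap in your own 19 features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
max_cvss = rng.random(n) * 10           # 0-10
compliance = rng.random(n)              # 0-1
is_prod = rng.integers(0, 2, n)         # 1 = Production
# Synthetic risk target for illustration
risk = 1.5 + 0.25 * max_cvss - 1.0 * compliance + 0.8 * is_prod

X = np.column_stack([max_cvss, compliance, is_prod])
baseline = LinearRegression().fit(X, risk)

# Score a hypothetical asset: high CVSS, weak compliance, production
new_asset = np.array([[9.1, 0.55, 1]])
print(round(float(baseline.predict(new_asset)[0]), 2))
```

Even a model this crude gives you a consistent ordering for triaging the backlog, which is the point of step 3.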
Resources you need:
- Vulnerability scanner (Nessus, Qualys, OpenVAS)
- Configuration auditor (InSpec, OpenSCAP)
- Asset management database (ServiceNow, CMDB)
- Basic Python + scikit-learn knowledge
For Security Leaders
If you’re tired of unclear security posture:
- Calculate your current assessment cost (time × hourly rate)
- Pilot TAPS on 100–500 assets
- Compare manual vs. automated assessment accuracy
- Measure time savings and consistency improvements
- Scale to full deployment if successful
Expected ROI: 95%+ time reduction in assessment, 40–60% improved consistency based on my experiments.
Part 9: Open Questions and Collaboration
Questions I’m Still Exploring
1. Does TAPS work across different industries? Finance, healthcare, and retail have different threat models. Does the same feature importance hold? Or do industry-specific models perform better?
2. How often should models retrain? Threat landscape evolves. How frequently does TAPS need retraining? Monthly? Quarterly? Continuous learning?
3. Can we incorporate threat intelligence feeds? If a vulnerability is actively exploited in the wild, should its risk score dynamically increase? How do we integrate real-time threat data?
4. What about false positives in vulnerability scanners? Garbage in, garbage out. How robust is TAPS to noisy input data?
5. Can adversaries game the system? If attackers know the TAPS algorithm, can they manipulate features to appear low-risk? What’s the adversarial robustness?
How You Can Contribute
I’m actively seeking:
🤝 Collaboration Partners
- Security teams willing to pilot TAPS
- Data scientists interested in security applications
- Researchers for academic publication
📊 Real-World Data (Anonymized)
- Vulnerability scan outputs
- Configuration audit results
- Incident outcomes for validation
💡 Feedback and Corrections
- Methodological improvements
- Alternative algorithms to test
- Edge cases I’m missing
💼 Industry Validation
- Does this solve your actual problems?
- What features are missing?
- What deployment blockers exist?
Contact:
- LinkedIn: https://linkedin.com/in/cyb3rle0
- Email: [email protected]
Conclusion: Making Security Measurable
Security has operated too long on gut feelings, incomplete data, and reactive firefighting. We measure everything else in business — sales conversion, customer satisfaction, operational efficiency — but security remains opaque.
TAPS demonstrates that security risk CAN be:
- Measured (1–5 numerical score)
- Automated (thousands of assets in minutes)
- Consistent (same assessment regardless of analyst)
- Explained (clear risk drivers with coefficients)
- Actioned (clear prioritization guidance)
Is TAPS perfect? Absolutely not. It needs real-world validation, extension to more asset types, temporal modeling, and continuous refinement.
But it proves the concept: Machine learning can transform security from art to science.
The vulnerability management problem is solvable. The data exists. The algorithms work. What’s missing is adoption, iteration, and collaboration between security and data science communities.
This is my contribution to that collaboration. What’s yours?
Appendix: Technical Deep-Dive
Feature List (All 19)
Vulnerability Metrics:
1. max_cvss - Highest CVSS v3.1 score (0.0-10.0)
2. vuln_count - Total vulnerability count (0-25)
3. mean_cvss - Average CVSS score (0.0-10.0)
4. has_critical - Any CVSS ≥ 9.0? (binary)
Configuration Quality:
5. cis_compliance_rate - Overall CIS compliance (0.0-1.0)
6. auto_compliance_rate - Automated controls (0.0-1.0)
7. manual_compliance_rate - Manual controls (0.0-1.0)
8. has_waf - WAF deployed? (binary)
9. has_ids - IDS active? (binary)
10. uptime_days - System age (1-500 days)
Business Context:
11. business_impact_score - Criticality (1-5)
12. financial_exposure - Potential loss (normalized)
13. threat_frequency - Attack likelihood (0.0-1.0)
14. environment - Prod/Staging/Dev (categorical)
15. business_unit - Organizational function (categorical)
16. data_classification - Public/Internal/Conf/Restricted (categorical)
17. loss_magnitude - Impact category (categorical)
18. is_external_facing - Internet accessible? (binary)
19. estimated_users - User base size (10-10,000)
Hyperparameters Used
LARS:
LassoLars(
    alpha=0.01,            # Regularization strength
    max_iter=1000,         # Convergence iterations
    random_state=42
)
LOESS (k-NN):
KNeighborsRegressor(
    n_neighbors=50,        # Local neighborhood size
    weights='distance',    # Inverse distance weighting
    metric='euclidean'
)
M5:
DecisionTreeRegressor(
    max_depth=5,           # Tree depth limit
    min_samples_split=20,  # Minimum split size
    min_samples_leaf=10,   # Minimum leaf size
    random_state=42
)
Performance Metrics Explained
R² (Coefficient of Determination):
R² = 1 - (SS_residual / SS_total)
where:
SS_residual = Σ(y_actual - y_predicted)²
SS_total = Σ(y_actual - y_mean)²
Interpretation: Proportion of variance explained by the model
MAE (Mean Absolute Error):
MAE = (1/n) × Σ|y_actual - y_predicted|
Interpretation: Average prediction error in TAPS score points
RMSE (Root Mean Squared Error):
RMSE = √[(1/n) × Σ(y_actual - y_predicted)²]
Interpretation: Penalizes large errors more than MAE
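All three metrics are a few lines of NumPy, matching the definitions above (the sample scores are made-up numbers for illustration):

```python
import numpy as np

def r2_mae_rmse(y_true, y_pred):
    """Compute R², MAE, and RMSE exactly as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                       # SS_residual
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # SS_total
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(np.mean(resid ** 2))
    return float(r2), float(mae), float(rmse)

# Four assets, actual vs. predicted TAPS scores
r2, mae, rmse = r2_mae_rmse([3.0, 4.2, 2.1, 4.8], [3.2, 4.0, 2.3, 4.6])
print(round(r2, 3), round(mae, 3), round(rmse, 3))
```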
Further Reading
Academic Papers:
- Allodi & Massacci (2014) — Comparing Vulnerability Severity and Exploits
- Efron et al. (2004) — Least Angle Regression
- Hastie et al. (2009) — Elements of Statistical Learning
Industry Frameworks:
- NIST National Vulnerability Database
- CIS Security Benchmarks
- FAIR Risk Taxonomy
If this article helped you think differently about security automation, please: 👏 Clap (up to 50 times!)
💬 Comment with your thoughts
🔗 Share with your security team
📧 Subscribe for future security data science posts
Let’s make vulnerability management less painful, together.
Tags: #Cybersecurity #MachineLearning #VulnerabilityManagement #DataScience #SecurityAutomation #AI #InfoSec #ThreatIntelligence #SOC #RiskManagement #Python #Scikitlearn #EnterpriseSecurity #DevSecOps #PredictiveAnalytics