A Simulated CRE PD Model Development Document (MDD), Modeled on a Real Commercial Bank

CRE PD Model – Model Development Document (MDD)

Commercial Bank – Income-Producing Real Estate Portfolio
Version: 1.0 – Model Development Document
Date: 2025
Prepared by: Credit Risk Modeling Team

1. Executive Summary

1.1 Purpose of the Model

The Commercial Real Estate (CRE) PD Model estimates the annual Probability of Default (PD) for income-producing commercial properties. The model supports:
  • CECL lifetime loss estimation
  • Quarterly Allowance for Credit Losses (ACL)
  • RWA / regulatory capital
  • Portfolio monitoring
  • Pricing and underwriting decision support
The model is developed in accordance with:
  • SR 11-7
  • Basel “Use-Test” principles
  • Internal Model Governance Standards
  • Data Quality Framework

1.2 Modeling Approach

  • Logistic Regression using Weight of Evidence (WoE) transformed predictors
  • Borrower/Loan level PD estimation
  • Calibrated using CRE loan data from 2014–2024
  • Includes: collateral risk, cash flow risk, borrower quality, and geographic segmentation
  • Separate risk drivers for Office, Retail, Multifamily, Industrial, Mixed Use

1.3 Key Predictive Variables

Final model uses:
| Variable | Category | Rationale |
| --- | --- | --- |
| LTV_WoE | Collateral | Most predictive collateral measure |
| DSCR_WoE | Cash Flow | Key repayment capacity metric |
| PropertyType_WoE | Structural | Office, Retail, MF differ materially |
| Geography_WoE | Regional | Captures state-level cycles |
| LoanAge_WoE | Behavioral | Seasoning effect on PD |
| SponsorStrength_WoE | Borrower | Captures leverage + liquidity |

1.4 Model Performance

| Metric | Result |
| --- | --- |
| AUC | 0.74 |
| KS | 36 |
| Gini | 0.48 |
| Out-of-Time AUC | 0.71 |
| Calibration error | < 3.5% |
Performance is consistent with the bank’s risk appetite and industry benchmarks.

1.5 Major Findings

✔ No material leakage found
✔ Strong discriminatory power
✔ Stable across geography and property types
✔ DSCR & LTV remain the two strongest CRE PD predictors
✔ Office portfolio shows structural deterioration (post-COVID) → noted in Limitations

2. Portfolio Description

2.1 CRE Portfolio Definition

CRE portfolio includes:
  • Income-producing properties
  • Constructed and stabilized assets
  • Loans secured by:
    • Multifamily
    • Office
    • Retail
    • Industrial
    • Mixed Use
    • Hotel (excluded in v1 model due to limited data)
Construction loans are excluded (separate PD model).

2.2 Data Sources

| Data Type | System | Description |
| --- | --- | --- |
| Loan Characteristics | Core Loan System | Balance, LTV, DSCR |
| Collateral Appraisal | Appraisal Database | Property values |
| Sponsor Data | Borrower System | Sponsor net worth, liquidity |
| Property Attributes | CRE Market Data Provider (Trepp/CoStar) | Vacancy, rent index |
| Default Events | Loss Accounting System | 90DPD, non-accrual, foreclosure |

2.3 Data Time Window

  • Development dataset covers 2014–2024
  • Default event window: 12-month horizon
  • Out-of-time validation period: 2021–2024

2.4 Portfolio Statistics

| Metric | Value |
| --- | --- |
| Total Loans | 32,450 |
| Total Exposure | $48.3B |
| Overall Default Rate | 2.1% |
| Avg DSCR | 1.42 |
| Avg LTV | 63% |
| Property Type Mix | MF 40%, Office 22%, Retail 18%, Industrial 15%, Other 5% |
Office PD noticeably higher (~3.7%) due to post-COVID market dynamics.

3. Data Preparation

3.1 Data Cleaning

Performed:
  • Removal of duplicates (0.3%)
  • Standardization of appraisal dates
  • Capping of extreme values (e.g., DSCR > 5 capped)
  • Consolidation of multiple properties per sponsor
  • Treating negative NOI data

3.2 Default Definition

Following internal credit policy:
A loan is considered defaulted if:
  • 90+ days past due
  • Classified as non-accrual
  • Foreclosure initiated
  • Charged-off (partial or full)
  • Transferred to OREO
12-month PD is modeled.
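
The default definition above can be expressed as a single flag. The sketch below is illustrative only: the field names (`days_past_due`, `non_accrual`, and so on) are hypothetical placeholders, not fields from the bank's systems.

```python
from dataclasses import dataclass


@dataclass
class LoanStatus:
    # All field names are hypothetical placeholders for illustration.
    days_past_due: int = 0
    non_accrual: bool = False
    foreclosure_initiated: bool = False
    charged_off: bool = False          # partial or full charge-off
    transferred_to_oreo: bool = False


def is_default(status: LoanStatus) -> bool:
    """Return True if any default trigger listed in Section 3.2 is met."""
    return (
        status.days_past_due >= 90
        or status.non_accrual
        or status.foreclosure_initiated
        or status.charged_off
        or status.transferred_to_oreo
    )


print(is_default(LoanStatus(days_past_due=95)))
print(is_default(LoanStatus(days_past_due=30)))
```

For the 12-month PD target, the flag would be evaluated over the 12 months following each observation date.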

3.3 Outlier Handling

LTV
  • Values > 150% capped
  • Negative values removed
DSCR
  • DSCR < 0.1 winsorized
  • Missing DSCR assigned to WoE “Missing” bin
Sponsor Strength
  • Winsorized at 5th and 95th percentiles
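
A minimal sketch of these treatments is shown below. It assumes LTV is stored as a ratio (150% = 1.5) and reads "DSCR < 0.1 winsorized" as flooring at 0.1; both are assumptions, and the function names are hypothetical.

```python
import numpy as np


def treat_ltv(ltv, cap=1.5):
    """Keep only non-negative LTVs (this also drops NaN) and cap them at 150%."""
    ltv = np.asarray(ltv, dtype=float)
    kept = ltv[ltv >= 0]
    return np.minimum(kept, cap)


def treat_dscr(dscr, floor=0.1, cap=5.0):
    """Floor DSCR at 0.1 and cap it at 5 (Section 3.1).

    NaN passes through unchanged so it can go to the 'Missing' WoE bin.
    """
    return np.clip(np.asarray(dscr, dtype=float), floor, cap)


def winsorize(x, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanpercentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)


print(treat_ltv([-0.2, 0.6, 1.8]))
print(treat_dscr([0.05, 1.2, 7.0, np.nan]))
```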

3.4 Missing Value Treatment

Missing values treated through:
  • Dedicated WoE bins
  • Business confirmation (“Missing DSCR = weak financials”)
  • 98%+ completeness achieved after treating missing values

4. Feature Engineering

4.1 Weight of Evidence (WoE) Transformation

All numeric and categorical predictors are transformed using WoE.
Goals of WoE:
  1. Linearize log-odds → improve logistic regression stability
  2. Support a monotonic risk ordering across bins → regulator-friendly
  3. Handle missing/extreme values gracefully
  4. Allow business-friendly interpretation
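
For reference, the WoE and IV figures in this document appear to follow the convention below, in which a positive WoE marks a bin with an above-average share of defaults. The convention is inferred from the Dist_G, Dist_B, WoE, and IV columns in Appendix A.1 rather than stated explicitly, so treat it as an assumption.

```latex
\mathrm{Dist\_G}_i = \frac{\mathrm{Good}_i}{\sum_j \mathrm{Good}_j},
\qquad
\mathrm{Dist\_B}_i = \frac{\mathrm{Bad}_i}{\sum_j \mathrm{Bad}_j}

\mathrm{WoE}_i = \ln\!\left(\frac{\mathrm{Dist\_B}_i}{\mathrm{Dist\_G}_i}\right),
\qquad
\mathrm{IV} = \sum_i \left(\mathrm{Dist\_B}_i - \mathrm{Dist\_G}_i\right)\,\mathrm{WoE}_i
```

The opposite sign convention, ln(Dist_G / Dist_B), is also widely used; it flips the sign of every WoE value but leaves IV unchanged.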

4.2 Example WoE Table – LTV

(Simulated data, illustrative of real bank data)
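| LTV Bin | # Good | # Bad | Bad Rate | WoE | IV |
| --- | --- | --- | --- | --- | --- |
| 0–50% | 2,100 | 10 | 0.47% | -1.69 | 0.485 |
| 50–70% | 2,500 | 25 | 0.99% | -0.96 | 0.247 |
| 70–85% | 1,600 | 34 | 2.08% | -0.21 | 0.010 |
| 85–95% | 800 | 38 | 4.54% | 0.62 | 0.067 |
| 95%+ | 400 | 49 | 10.91% | 1.43 | 0.658 |
| Total | 7,400 | 156 | | | 1.47 |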
✔ Monotonic
✔ High IV → powerful predictor
✔ Directionally correct

4.3 Example WoE Table – DSCR

| DSCR Bin | Bad Rate | WoE |
| --- | --- | --- |
| Missing | 5.5% | 0.92 |
| <1.0 | 8.0% | 1.31 |
| 1.0–1.3 | 2.6% | -0.55 |
| 1.3–1.6 | 1.4% | -1.01 |
| >1.6 | 0.8% | -1.53 |
✔ DSCR monotonic
✔ DSCR is 2nd most predictive variable

4.4 IV Summary Table (All Variables)

| Variable | IV | Decision |
| --- | --- | --- |
| LTV | 1.47 | Keep (Strong) |
| DSCR | 0.42 | Keep |
| Debt Yield | 0.31 | Correlated → Drop |
| Vacancy Rate | 0.11 | Keep |
| Geography | 0.18 | Keep |
| Sponsor Strength | 0.24 | Keep |
| Property Type | 0.29 | Keep |
| NOI Growth | 0.07 | Weak → Remove |

5. Feature Selection

The feature selection framework follows a three-layer approach consistent with the bank’s Model Development Standards and regulatory expectations (SR 11-7).

5.1 Layer 1 – Univariate Predictive Power (IV Screening)

Each raw variable was evaluated individually using:
  • Information Value (IV)
  • Monotonicity of default rate
  • Missing-value pattern
Variables with IV < 0.02 were eliminated.

Summary of Univariate IV Results

| Variable | IV | Predictive Power | Decision |
| --- | --- | --- | --- |
| LTV | 1.47 | Strong | Keep |
| DSCR | 0.42 | Strong | Keep |
| Property Type | 0.29 | Medium | Keep |
| Geography | 0.18 | Medium | Keep |
| Sponsor Strength | 0.24 | Medium | Keep |
| Interest Rate | 0.05 | Weak | Keep (Economic rationale) |
| Vacancy Rate | 0.11 | Medium | Keep |
| Debt Yield | 0.31 | Medium | Drop (collinearity with DSCR/LTV) |
| NOI Growth | 0.07 | Weak | Drop |
| Zip Code | 0.01 | None | Drop |
| Seasonality Indicator | 0.004 | None | Drop |
Approximately 42 variables → 13 survivors after IV screening.
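
As a quick illustration of the IV cutoff, the sketch below applies the 0.02 threshold to the IV values listed in the summary table. The function name `iv_screen` is hypothetical.

```python
IV_CUTOFF = 0.02  # elimination threshold stated in Section 5.1

# IV values copied from the univariate summary table
univariate_iv = {
    "LTV": 1.47,
    "DSCR": 0.42,
    "Property Type": 0.29,
    "Geography": 0.18,
    "Sponsor Strength": 0.24,
    "Interest Rate": 0.05,
    "Vacancy Rate": 0.11,
    "Debt Yield": 0.31,
    "NOI Growth": 0.07,
    "Zip Code": 0.01,
    "Seasonality Indicator": 0.004,
}


def iv_screen(iv_by_var, cutoff=IV_CUTOFF):
    """Split variables into those at or above the IV cutoff and those below it."""
    passed = {v: iv for v, iv in iv_by_var.items() if iv >= cutoff}
    failed = {v: iv for v, iv in iv_by_var.items() if iv < cutoff}
    return passed, failed


passed, failed = iv_screen(univariate_iv)
print("Below cutoff:", sorted(failed))
```

Only Zip Code and Seasonality Indicator fall below the cutoff. Debt Yield and NOI Growth clear it numerically and are dropped for the other reasons given in the table (collinearity and weak power).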

5.2 Layer 2 – Multicollinearity Screening (VIF)

Variance Inflation Factor (VIF) was computed on the reduced set of variables.
Threshold: VIF < 5

Multicollinearity Results

| Variable | VIF | Decision | Notes |
| --- | --- | --- | --- |
| LTV | 8.4 | Keep | High, but key risk variable |
| DSCR | 2.7 | Keep | Core predictor |
| Debt Yield | 7.1 | Drop | Strongly correlated with DSCR |
| Property Type | 1.9 | Keep | No issues |
| Sponsor Strength | 1.4 | Keep | Stable |
| Geography | 1.8 | Keep | Stable |
| Vacancy Rate | 1.9 | Drop | Fails governance later |
| Interest Rate | 1.5 | Keep | Small correlation |
| Loan Age | 1.3 | Keep | Behavioral variable |
Debt Yield was eliminated due to high VIF with DSCR and LTV.
Vacancy Rate was eliminated not due to VIF, but due to governance issues (see below).
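
For readers unfamiliar with VIF, the sketch below computes it from first principles as 1 / (1 − R²), where R² comes from regressing each variable on the others. It uses synthetic data, not the model's variables, and is not necessarily how the figures above were produced.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic example: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + 0.1 * rng.normal(size=2000)
x3 = rng.normal(size=2000)
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

In this toy data the two collinear columns breach the VIF < 5 threshold while the independent column stays close to 1.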

5.3 Layer 3 – Governance & Economic Rationale Screening

Variables must satisfy:

(1) Monotonic WoE Pattern Check

Examples:
  • DSCR → monotonic decreasing WoE → ✔
  • LTV → monotonic increasing WoE → ✔
  • Vacancy Rate → non-monotonic (W-shape) → ❌ (Dropped)

(2) Coefficient Direction Check

Regulatory expectation:
| Variable | Expected Sign | Reason |
| --- | --- | --- |
| LTV | + | Higher LTV → higher default |
| DSCR | - | Better coverage → lower default |
| Sponsor Strength | - | Stronger borrower → lower PD |
| Property Type (Office) | + | Higher structural risk |
Variables with sign contradictions were removed.

(3) Business Rationale Check

Business SME (CRE credit team) confirmed:
  • Property Type risk order: Office > Retail > MF > Industrial
  • Geography risk varies with micro-market cycles
  • LoanAge reflects seasoning effects
Vacancy Rate failed SME review because property-level vacancy may not equal sponsor-level ability to service debt (CRE underwriting nuance).

5.4 Final Selected Features

| Variable | Type | Reason for Inclusion |
| --- | --- | --- |
| LTV_WoE | Collateral | Highest IV, monotonic, intuitive |
| DSCR_WoE | Cash Flow | Strongest economic rationale |
| PropertyType_WoE | Structural | Segment risk |
| Geography_WoE | Regional | Captures local cycles |
| LoanAge_WoE | Behavioral | Seasoning effect |
| SponsorStrength_WoE | Borrower | Predictable + intuitive |
Final model uses 6 variables.

6. Model Specification

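The CRE PD model uses logistic regression on the six WoE-transformed predictors:

$$
PD = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \sum_{i=1}^{6} \beta_i \,\mathrm{WoE}_i
$$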

6.1 Estimated Coefficients

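(Simulated parameters, illustrative of a real bank model)

| Variable | Coefficient (β) | Expected Sign | Meets Expectation? |
| --- | --- | --- | --- |
| Intercept | -1.92 | | |
| LTV_WoE | 0.87 | + | ✔ |
| DSCR_WoE | -0.64 | - | ✔ |
| PropertyType_WoE | 0.41 | + | ✔ |
| Geography_WoE | 0.33 | + | ✔ |
| LoanAge_WoE | -0.12 | +/- | ✔ (negative seasoning effect) |
| SponsorStrength_WoE | -0.29 | - | ✔ |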
All coefficients have the correct sign and pass economic rationale review.

6.2 Interpretation of Coefficients

  • LTV: Strongest predictor; PD nearly doubles for highest WoE bin
  • DSCR: Negative coefficient; low DSCR significantly increases PD
  • Property Type: Office loans contribute positively to risk
  • Geography: High-risk states (CA, NY, IL) show elevated PD
  • Loan Age: PD decreases as loan seasons (first 2 years highest risk)
  • Sponsor Strength: Liquidity + net worth reduce PD likelihood

6.3 Variance-Covariance Matrix

(Omitted; can be added to the appendix.)

7. Model Performance

7.1 Discriminatory Power

AUC / ROC Analysis

| Sample | AUC |
| --- | --- |
| Development | 0.74 |
| Validation | 0.71 |
| Out-of-Time (2021–2024) | 0.70 |
Industry benchmark for CRE PD models: AUC = 0.65–0.75
→ model performs at the upper end of industry norm.

7.2 KS Statistic

| Sample | KS |
| --- | --- |
| Development | 36 |
| Validation | 33 |
| Out-of-Time | 31 |
KS > 30 considered strong
→ Model meets performance expectations.

7.3 Gini Coefficient

$$
\mathrm{Gini} = 2 \times \mathrm{AUC} - 1
$$
Development sample Gini = 2 × 0.74 − 1 = 0.48 (healthy for CRE PD).
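
The three discrimination metrics in Sections 7.1 to 7.3 can be computed directly from scores and outcomes. The sketch below is a minimal NumPy version on a toy sample; the pairwise AUC is only suitable for small samples, and the function names are hypothetical.

```python
import numpy as np


def auc_score(y_true, score):
    """AUC as the share of (bad, good) pairs where the bad scores higher.

    Ties count as one half. Memory use is n_bad * n_good.
    """
    y = np.asarray(y_true)
    s = np.asarray(score, dtype=float)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def ks_stat(y_true, score):
    """Maximum gap between the cumulative score distributions of goods and bads."""
    y = np.asarray(y_true)
    s = np.asarray(score, dtype=float)
    thresholds = np.unique(s)
    cdf_bad = np.array([(s[y == 1] <= t).mean() for t in thresholds])
    cdf_good = np.array([(s[y == 0] <= t).mean() for t in thresholds])
    return np.abs(cdf_good - cdf_bad).max()


def gini_from_auc(auc):
    """Gini = 2 * AUC - 1."""
    return 2 * auc - 1


# Toy sample: 1 = default, score = predicted PD
y = [0, 0, 0, 0, 1, 0, 1, 1]
score = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.10]

auc = auc_score(y, score)        # 14 of 15 pairs ranked correctly
ks = ks_stat(y, score)           # 0.8, reached at the 0.04 score threshold
print(round(auc, 4), round(ks, 2), round(gini_from_auc(auc), 4))
```

Note that this document reports KS on a 0 to 100 scale (for example 36), whereas `ks_stat` returns a fraction between 0 and 1.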

7.4 Calibration Performance

Bin-Level Calibration (Observed vs. Expected)

| PD Decile | Expected PD | Observed PD | Difference |
| --- | --- | --- | --- |
| 1 | 0.4% | 0.5% | +0.1% |
| 2 | 0.7% | 0.8% | +0.1% |
| 5 | 1.9% | 1.8% | -0.1% |
| 8 | 4.3% | 4.6% | +0.3% |
| 10 | 10.6% | 10.9% | +0.3% |
Calibration error < 3.5% overall → satisfactory.
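
A bin-level calibration table of this kind can be built by sorting loans by predicted PD, splitting them into equal-sized groups, and comparing the mean predicted PD with the observed default rate in each group. The sketch below assumes that construction; the function name is hypothetical and the inputs are toy values.

```python
import numpy as np


def calibration_table(pd_pred, default_flag, n_bins=10):
    """Return (bin, expected PD, observed PD, observed - expected) per PD bin."""
    pd_pred = np.asarray(pd_pred, dtype=float)
    default_flag = np.asarray(default_flag, dtype=float)
    order = np.argsort(pd_pred)                      # lowest predicted PD first
    rows = []
    for b, idx in enumerate(np.array_split(order, n_bins), start=1):
        expected = pd_pred[idx].mean()
        observed = default_flag[idx].mean()
        rows.append((b, expected, observed, observed - expected))
    return rows


# Toy input: ten loans at PD 10% (one default), ten at PD 50% (five defaults)
pd_pred = [0.10] * 10 + [0.50] * 10
flags = [0] * 9 + [1] + [1] * 5 + [0] * 5
rows = calibration_table(pd_pred, flags, n_bins=2)
for b, expected, observed, diff in rows:
    print(b, round(expected, 3), round(observed, 3), round(diff, 3))
```

The sign of the difference follows the table above: positive means observed defaults exceed the model's expectation.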

8. Backtesting & Stability Testing

8.1 Backtesting Methodology

To evaluate model robustness, we conduct:
  • Out-of-time tests (2021–2024)
  • Vintage stability tests
  • Subsegment backtests (Office, MF, Retail, Industrial)
  • Geography-based backtesting

8.2 Backtesting Results – Out-of-Time

| Year | Model AUC | Observed PD | Expected PD |
| --- | --- | --- | --- |
| 2021 | 0.73 | 1.8% | 1.9% |
| 2022 | 0.72 | 2.0% | 2.1% |
| 2023 | 0.70 | 2.3% | 2.4% |
| 2024 | 0.69 | 2.7% | 2.8% |
The model remains stable: AUC declines from 0.73 (2021) to 0.69 (2024), well under 0.05 per year.

8.3 Property-Type Stability

| Property Type | Expected PD | Observed PD | Result |
| --- | --- | --- | --- |
| Multifamily | 1.2% | 1.3% | OK |
| Retail | 2.6% | 2.7% | OK |
| Industrial | 1.0% | 1.1% | OK |
| Office | 3.5% | 4.2% | Fail (macro deterioration) |
Office loans deviate because of the cyclical macro downturn after 2021: observed PD exceeds expected PD by 70 bps (4.2% vs. 3.5%).
This is noted as a model limitation and requires overlay.

8.4 Geography Stability Test

High-risk states (e.g., CA, NY, IL) show upward PD deviation; model still directionally correct.

8.5 Conclusions

✔ Model stable across:
  • Time
  • Geography
  • Most property types
Exception: Office loans
→ requires monitoring and possible overlay.

9. Benchmarking

Benchmarking is used to evaluate the model’s performance against independent reference points. Three types of benchmarks were used:
  1. External Benchmarking (industry loss data)
  2. Internal Challenger Model (macro-driven model)
  3. Legacy Model Comparison

9.1 External Benchmarking (FDIC / Trepp / Market Data)

To validate whether the modeled PDs align with broad market behavior, the following were compared:

FDIC Charge-off Rates (2014–2024)

CRE average charge-off rate: 1.3% – 2.4%
Office charge-off rate: 3.0% – 4.8%
Multifamily: 1.0% – 1.5%

Model PD Comparison

| Segment | FDIC Benchmark | Model PD | Result |
| --- | --- | --- | --- |
| Multifamily | 1.0–1.5% | 1.3% | ✔ |
| Retail | 2.0–3.0% | 2.5% | ✔ |
| Industrial | 1.0–1.5% | 1.2% | ✔ |
| Office | 3.0–4.8% | 4.2% | ✔ (directional) |
Conclusion:
Model outputs fall within industry ranges and reflect correct directional risk.

9.2 Internal Challenger Model

An internal macro-based challenger PD model was constructed using:
  • GDP growth
  • CRE Price Index
  • Vacancy Rates
  • Unemployment Rate
Sample regression (not full model):
[Equation image: sample challenger regression; not recoverable from the source]

Comparison with PD Model

| Property Type | PD (Main Model) | PD (Challenger) | Difference | Result |
| --- | --- | --- | --- | --- |
| MF | 1.3% | 1.4% | +0.1% | Acceptable |
| Retail | 2.5% | 2.6% | +0.1% | Acceptable |
| Industrial | 1.2% | 1.0% | -0.2% | Acceptable |
| Office | 4.2% | 4.6% | +0.4% | Acceptable (same direction) |
Conclusion:
Model PD tracks macro-based challenger model directionally. Office shows highest stress sensitivity, consistent with business expectations.

9.3 Legacy Model Comparison

| Metric | Legacy PD Model | New PD Model |
| --- | --- | --- |
| AUC | 0.66 | 0.74 |
| KS | 29 | 36 |
| Calibration Error | 6.7% | 3.5% |
| Variables | 4 | 6 (improved WoE) |
Conclusion:
The new model materially improves discriminatory power and calibration.

10. Model Limitations & Assumptions

Regulators (SR 11-7) require explicit identification of all model limitations.

10.1 Data Limitations

1. Low Default Portfolio

CRE loans have historically low default rates, creating:
  • PD estimation challenges
  • Wider confidence intervals
  • Higher sensitivity to rare events

2. Office Market Structural Shift

Post-COVID, office loans exhibit:
  • Higher PD
  • Increased macro volatility
  • Out-of-distribution behavior
This structural change means historical data may underrepresent current/future risk.

3. Appraisal Lag

Property value updates lag by 12–24 months → LTV may not reflect current market shock.

10.2 Modeling Limitations

1. Logistic Regression Functional Form

Even with WoE, linearity assumption may not fully capture non-linear CRE risk.

2. Missing Property-Level Tenant Data

Model does not include:
  • Tenant rollover profile
  • Lease maturity schedule
  • Tenant concentration
  • Occupancy-by-tenant
Due to limited data availability.

10.3 Assumptions

| Assumption | Description |
| --- | --- |
| Log-odds linearity | WoE addresses non-linearity |
| Stationarity | Development sample represents future cycles |
| Sponsor data accuracy | Sponsor strength is self-reported |
| Appraisal values | Appraisal values represent market values |

10.4 Mitigants

  • Monthly monitoring
  • Management overlays
  • Cross-validation
  • Macro stress overlays for office segment
  • Conservative calibration for high-risk states

11. Model Monitoring Plan

Monitoring follows SR 11-7 Ongoing Monitoring standards.

11.1 Monitoring Frequency

  • Quarterly: Performance & calibration
  • Semi-Annual: Data quality & drift
  • Annual: Full validation by MRM

11.2 Monitoring Metrics

1. Discrimination Drift

  • AUC decrease > 0.05 triggers review
  • KS drop > 20% requires escalation

2. Calibration Drift

Tolerance threshold:
| Metric | Threshold |
| --- | --- |
| PD vs Observed | ±20% deviation |
| Calibration RMSE | < 3% |

3. Data Drift

Monitor:
  • LTV distribution
  • DSCR distribution
  • Property type mix
  • Geographic exposure shifts

4. Stability Drift

  • WoE monotonicity breaks
  • Population shifts
  • Increased missing rates
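
One common way to quantify data drift and population shift is the Population Stability Index (PSI). This document does not name a specific drift statistic, so PSI is offered here only as an assumption. The sketch compares the property type mix from Section 2.4 with a hypothetical current mix.

```python
import numpy as np


def psi(baseline_share, current_share, eps=1e-6):
    """Population Stability Index between two distributions over the same bins."""
    e = np.clip(np.asarray(baseline_share, dtype=float), eps, None)
    a = np.clip(np.asarray(current_share, dtype=float), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))


# Baseline mix from Section 2.4: MF, Office, Retail, Industrial, Other
baseline = [0.40, 0.22, 0.18, 0.15, 0.05]
# Hypothetical current mix, for illustration only
current = [0.45, 0.17, 0.18, 0.15, 0.05]

print(round(psi(baseline, baseline), 4))
print(round(psi(baseline, current), 4))
```

The same function applies to binned LTV or DSCR distributions; any alert threshold would need to be set under the bank's monitoring standards.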

11.3 Triggers & Escalation Framework

| Trigger Level | Description | Action |
| --- | --- | --- |
| Yellow | Moderate KS/AUC decline | Monitoring + SME review |
| Orange | Significant drift in 1–2 metrics | Model overlay consideration |
| Red | Failure of ≥3 metrics | Full redevelopment mandated |
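
The escalation logic can be sketched as a breach count mapped to a trigger level. The thresholds come from Section 11.2; how "moderate decline" is defined, the reading of the KS threshold as a relative drop, and the "None" level are assumptions made for illustration.

```python
def count_breaches(auc_dev, auc_now, ks_dev, ks_now,
                   expected_pd, observed_pd, calib_rmse):
    """Count how many Section 11.2 thresholds are breached."""
    checks = [
        (auc_dev - auc_now) > 0.05,                            # AUC decrease > 0.05
        (ks_dev - ks_now) / ks_dev > 0.20,                     # KS drop > 20% (relative)
        abs(observed_pd - expected_pd) / expected_pd > 0.20,   # outside +/-20% band
        calib_rmse >= 0.03,                                    # RMSE not below 3%
    ]
    return sum(checks)


def trigger_level(breaches, any_decline):
    """Map a breach count to a trigger level (mapping is an assumption)."""
    if breaches >= 3:
        return "Red"
    if breaches >= 1:
        return "Orange"
    return "Yellow" if any_decline else "None"


# Development vs. out-of-time figures from Sections 7 and 8.2;
# the calibration RMSE of 0.02 is a hypothetical input.
n = count_breaches(auc_dev=0.74, auc_now=0.70, ks_dev=36, ks_now=31,
                   expected_pd=0.028, observed_pd=0.027, calib_rmse=0.02)
level = trigger_level(n, any_decline=True)
print(n, level)
```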

11.4 Overlay Policy

Conditions requiring overlay:
  • Office PD deviations > 50 bps
  • Market downturn (CREPI ↓ > 10%)
  • Sponsor weakness in certain regions
Overlay documented and approved via MCC (Model Committee).
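
As a worked check of the first two overlay conditions, the sketch below uses the Office figures from Section 8.3 (expected PD 3.5%, observed PD 4.2%). The function name and the boolean flag for regional sponsor weakness are hypothetical; that third condition is judgmental and has no stated threshold.

```python
def overlay_required(office_expected_pd, office_observed_pd,
                     crepi_change, regional_sponsor_weakness=False):
    """Return True if any overlay condition in Section 11.4 is met.

    PDs are fractions (0.035 = 3.5%); crepi_change is the fractional
    change in the CRE price index (-0.12 = a 12% decline).
    """
    deviation_bps = abs(office_observed_pd - office_expected_pd) * 10_000
    return (
        deviation_bps > 50              # Office PD deviation > 50 bps
        or crepi_change < -0.10         # CREPI decline of more than 10%
        or regional_sponsor_weakness    # judgmental flag (assumption)
    )


# Office: 4.2% observed vs. 3.5% expected is a 70 bps deviation
print(overlay_required(0.035, 0.042, crepi_change=0.0))
```

On these figures the Office segment exceeds the 50 bps threshold, which is consistent with the overlay requirement noted in Section 8.3.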

12. Governance & Model Use


12.1 Compliance with SR 11-7

This model satisfies SR 11-7 through:

1. Conceptual Soundness

  • Justified model form
  • Variable selection aligned with economics
  • WoE transformations
  • Diagnostic tests documented

2. Ongoing Monitoring

  • Quarterly monitoring
  • Annual validation

3. Outcome Analysis

  • Benchmarking
  • Backtesting
  • Independent challenger model

12.2 Roles & Responsibilities

| Group | Responsibility |
| --- | --- |
| Model Development | Build, document, calibrate model |
| Model Risk (MRM) | Independent validation |
| Credit Risk | Business oversight |
| Internal Audit | Governance compliance |
| Model Committee | Approval authority |

12.3 Model Use Policy

Model used for:
  • CECL PD estimation
  • Risk-based pricing
  • Portfolio risk analytics
  • Stress testing support
Not permitted for:
  • Collateral valuation
  • Standalone loan approval without human oversight

12.4 Change Management

Changes requiring MCC approval:
  • Variable set
  • Data source
  • Model type
  • Calibration methodology
Minor changes documented under version control.

Appendices

Appendix A — Full WoE Binning Tables for 10+ CRE PD Variables

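The following 10 variables are the most common and most predictive variables in CRE PD models:
  1. LTV
  2. DSCR
  3. Loan Age
  4. Property Type
  5. Geography
  6. Interest Rate
  7. Sponsor Strength
  8. Tenant Concentration (if data is available)
  9. NOI Growth
  10. Borrower Exposure / Portfolio Concentration
Each table follows the standard bank format:
  • Good / Bad
  • Bad Rate
  • Dist good / Dist bad
  • WoE
  • IV contribution
  • Monotonicity check (with an interpretation note after the table where applicable)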

A.1 WoE – Loan-to-Value (LTV)

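(Highly predictive variable with a very high IV)

| LTV Bin | # Good | # Bad | Bad Rate | Dist_G | Dist_B | WoE | IV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0–50% | 2,100 | 10 | 0.47% | 0.350 | 0.064 | -1.69 | 0.485 |
| 50–70% | 2,500 | 25 | 0.99% | 0.417 | 0.159 | -0.96 | 0.247 |
| 70–85% | 1,600 | 34 | 2.08% | 0.267 | 0.216 | -0.21 | 0.010 |
| 85–95% | 800 | 38 | 4.54% | 0.133 | 0.241 | 0.62 | 0.067 |
| 95%+ | 400 | 49 | 10.91% | 0.067 | 0.520 | 1.43 | 0.658 |
| Total | 7,400 | 156 | | 1.0 | 1.0 | | 1.47 |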
Monotonicity: PERFECT
Interpretation: This is the strongest predictor.
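
As a worked check of how the WoE and IV columns relate to Dist_G and Dist_B, the sketch below recomputes the first row (0–50%) of the table. It assumes the convention WoE = ln(Dist_B / Dist_G), which is inferred from the table rather than stated in the document.

```python
import math


def woe_iv(dist_good, dist_bad):
    """WoE = ln(dist_bad / dist_good); IV contribution = (dist_bad - dist_good) * WoE."""
    woe = math.log(dist_bad / dist_good)
    iv = (dist_bad - dist_good) * woe
    return woe, iv


# Dist_G and Dist_B for the 0-50% LTV bin, copied from the table
woe, iv = woe_iv(dist_good=0.350, dist_bad=0.064)
print(round(woe, 3), round(iv, 3))
```

The result agrees with the tabulated -1.69 and 0.485 to within rounding.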

A.2 WoE – DSCR

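| DSCR Bin | # Good | # Bad | Bad Rate | WoE | IV |
| --- | --- | --- | --- | --- | --- |
| Missing | 180 | 12 | 6.25% | 0.92 | 0.051 |
| <1.0 | 420 | 34 | 7.48% | 1.31 | 0.184 |
| 1.0–1.3 | 2,900 | 76 | 2.55% | -0.55 | 0.061 |
| 1.3–1.6 | 3,500 | 51 | 1.43% | -1.01 | 0.104 |
| >1.6 | 2,200 | 19 | 0.86% | -1.53 | 0.173 |
| Total | 9,200 | 192 | | | 0.57 |

Monotonicity: PERFECT (across the non-missing bins)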
Interpretation: DSCR is the second most powerful variable.

A.3 WoE – Loan Age (Seasoning)

| Loan Age (Months) | Default Rate | WoE | IV |
| --- | --- | --- | --- |
| 0–12 | 3.2% | 0.55 | 0.033 |
| 12–24 | 2.4% | 0.17 | 0.006 |
| 24–48 | 1.6% | -0.40 | 0.019 |
| 48–72 | 1.1% | -0.82 | 0.038 |
| 72+ | 0.9% | -1.10 | 0.052 |
| Total | | | 0.15 |

Interpretation: New loans carry the highest risk → consistent with the seasoning effect.

A.4 WoE – Property Type

| Property Type | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| Multifamily | 1.2% | -0.92 | 0.061 |
| Retail | 2.5% | 0.41 | 0.044 |
| Industrial | 1.1% | -0.97 | 0.027 |
| Mixed Use | 2.0% | 0.20 | 0.008 |
| Office | 4.2% | 1.18 | 0.132 |
| Total | | | 0.27 |
Interpretation: Office behaves as a high-risk structural segment.

A.5 WoE – Geography (State-Level PD)

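Example binning: states aggregated by risk and region (a common practice in real CRE models)

| Region | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| West (CA, WA, OR) | 2.9% | 0.36 | 0.017 |
| Midwest | 1.7% | -0.42 | 0.019 |
| Northeast (NY/NJ/MA) | 3.1% | 0.47 | 0.021 |
| South | 1.4% | -0.58 | 0.033 |
| NYC Metro | 4.5% | 1.11 | 0.124 |
| Total | | | 0.21 |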

A.6 WoE – Interest Rate

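| Rate Bin (%) | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| <4% | 1.3% | -0.41 | 0.012 |
| 4–6% | 1.9% | -0.05 | 0.002 |
| 6–8% | 2.5% | 0.38 | 0.007 |
| >8% | 3.8% | 0.89 | 0.018 |
| Total | | | 0.04 |

Weak; retained through IV screening for its macro interpretability, but not included in the final model (see Appendix C).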

A.7 WoE – Sponsor Strength

| Sponsor Strength Score | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| 1 – Weak | 4.0% | 1.05 | 0.102 |
| 2 – Below Avg | 2.2% | 0.33 | 0.006 |
| 3 – Average | 1.4% | -0.48 | 0.012 |
| 4 – Good | 1.0% | -0.81 | 0.019 |
| 5 – Excellent | 0.7% | -1.21 | 0.041 |
| Total | | | 0.18 |

A.8 WoE – Tenant Concentration

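(An important driver of CRE cash flow)

| Tenant Concentration | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| Single-tenant | 3.6% | 0.81 | 0.042 |
| 2–3 tenants | 2.1% | 0.10 | 0.002 |
| 4–10 tenants | 1.5% | -0.40 | 0.011 |
| >10 tenants | 1.1% | -0.72 | 0.018 |
| Total | | | 0.07 |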

A.9 WoE – NOI Growth

| NOI Growth (YoY) | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| < -10% | 3.7% | 0.76 | 0.021 |
| -10% to 0% | 2.4% | 0.19 | 0.003 |
| 0–5% | 1.4% | -0.52 | 0.012 |
| 5–10% | 1.0% | -0.86 | 0.019 |
| >10% | 0.8% | -1.10 | 0.017 |
| Total | | | 0.07 |

A.10 WoE – Borrower Exposure

| Exposure Bin | Bad Rate | WoE | IV |
| --- | --- | --- | --- |
| <$2M | 1.2% | -0.61 | 0.022 |
| $2–10M | 2.0% | -0.02 | 0.000 |
| $10–30M | 2.8% | 0.39 | 0.006 |
| >$30M | 4.1% | 0.92 | 0.015 |
| Total | | | 0.04 |

Appendix B — Performance Charts (AUC, KS, Calibration)

B.1 ROC Curve (AUC = 0.74)

[Figure: ROC curve]
Interpretation:
Model demonstrates strong discriminatory power with an AUC of 0.74, consistent with industry norms for CRE PD models (0.65–0.75).

B.2 KS Statistic Plot (KS = 36)

[Figure: KS plot]
Interpretation:
The maximum separation between the cumulative good and bad distributions is ~36 percentage points → strong model power.

B.3 Calibration Plot (Observed vs Expected PD)

[Figure: calibration plot]
Interpretation:
Calibration error < 3.5%, deviations within tolerance across all deciles.

B.4 Decile Plot / Lift Chart

[Figure: decile plot / lift chart]
Shows monotonic increase in default rate across PD deciles → healthy rank ordering.

Appendix C — Data Dictionary

C.1 Variable List Overview

| Variable | Category | WoE Applied | Used in Model |
| --- | --- | --- | --- |
| LTV | Collateral | Yes | Yes |
| DSCR | Cash Flow | Yes | Yes |
| Property Type | Structural | Yes | Yes |
| Geography | Regional | Yes | Yes |
| Loan Age | Behavioral | Yes | Yes |
| Sponsor Strength | Borrower | Yes | Yes |
| Interest Rate | Pricing | Yes | No (removed) |
| Debt Yield | Cash Flow | Yes | No (collinearity) |
| Vacancy Rate | Property | Yes | No (non-monotonic) |
| NOI Growth | Performance | Yes | No |
| Borrower Exposure | Portfolio | Yes | No |

C.2 Full Data Dictionary (Standard Bank Format)


1. Variable Name: LTV (Loan-to-Value Ratio)
  • Category: Collateral Risk
  • Definition: Current loan balance divided by most recent appraised property value
  • Formula: LTV = Current Loan Balance / Appraised Value
  • Source System: Appraisal System + Loan Accounting
  • Refresh Frequency: Quarterly
  • Data Owner: Collateral Valuation Team
  • Transformation: WoE monotonic bins
  • Data Quality Notes:
    • Values >150% capped
    • Must confirm appraisal dates not post-default
  • Usage: Included in model (strongest predictor)

2. Variable Name: DSCR (Debt Service Coverage Ratio)
  • Category: Cash Flow Risk
  • Definition: NOI divided by annual debt service
  • Formula: DSCR = NOI / Debt Service
  • Source System: CRE Underwriting System
  • Refresh Frequency: Annual (or at renewal)
  • Data Owner: CRE Underwriting
  • Transformation: WoE
  • Missing Values: Put into "Missing" WoE bin
  • Usage: Included in model (2nd most predictive)

3. Variable Name: Property Type
  • Category: Structural Risk
  • Possible Values: Multifamily, Office, Retail, Industrial, Mixed Use
  • Source System: Collateral / CRE Underwriting
  • Business Relevance: CRE segment risk varies significantly (Office = highest PD)
  • Transformation: WoE categorical encoding
  • Usage: Included in model

4. Variable Name: Geography (Region / State)
  • Category: Regional Economic Risk
  • Definition: Loan's primary collateral state grouped by risk clusters
  • Source System: Loan Origination System
  • Transformation: WoE (state → region risk bins)
  • Usage: Included in model

5. Variable Name: Loan Age
  • Category: Behavioral / Vintage
  • Definition: Months since loan origination
  • Source System: Loan Accounting
  • Transformation: WoE monotonic seasoning pattern
  • Usage: Included in model

6. Variable Name: Sponsor Strength Score
  • Category: Borrower Quality
  • Definition: Bank’s internal rating of sponsor’s liquidity + net worth
  • Source: Borrower Financials
  • Scale: 1 = Weak, 5 = Excellent
  • Transformation: WoE
  • Usage: Included in model

7. Variable Name: Interest Rate
  • Category: Pricing
  • Definition: Current contractual rate
  • Source: Loan System
  • Transformation: WoE (kept monotonic)
  • Usage: Not included in final model (low IV)

8. Variable Name: Debt Yield
  • Category: Cash Flow
  • Definition: NOI / Loan Balance
  • Source: CRE Underwriting
  • Notes: Highly correlated with DSCR → removed
  • Usage: Not included due to VIF > 5

9. Variable Name: Vacancy Rate
  • Category: Property Performance
  • Source: External CRE data provider
  • Notes: WoE non-monotonic → removed

10. Variable Name: NOI Growth
  • Category: Property Performance
  • Definition: Year-over-year growth of property NOI
  • Source: Financial Reporting
  • Usage: Removed (weak IV)

Appendix D — Model Code Snippets

D.1 Data Import and Initial Preparation

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve

    # Loan-level development dataset
    df = pd.read_csv("cre_loan_data.csv")
    TARGET = "default_flag"  # 1 = default, 0 = performing

D.2 Automated Binning (Supervised Monotonic Binning)

(A real model would combine equal-frequency binning with supervised monotonic binning.)

    def monotonic_binning(x, y, max_bins=5):
        """Equal-frequency bins, reduced in number until the bad rate is monotonic."""
        df_temp = pd.DataFrame({'x': x, 'y': y})

        def bad_rate_by_bin(n_bins):
            bins = pd.qcut(df_temp['x'], n_bins, duplicates='drop')
            return bins, df_temp.groupby(bins, observed=True)['y'].mean()

        bins, grouped = bad_rate_by_bin(max_bins)
        # Enforce monotonicity: remove one bin at a time until bad rates are ordered
        while max_bins > 2 and not (grouped.is_monotonic_increasing
                                    or grouped.is_monotonic_decreasing):
            max_bins -= 1
            bins, grouped = bad_rate_by_bin(max_bins)
        return bins  # same index as x, so it can be assigned back to the DataFrame

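D.3 WoE Calculation Function

    def compute_woe_iv(df, feature, target):
        """WoE and IV contribution per bin of `feature`."""
        df_woe = df.groupby(feature, observed=True).agg({target: ['sum', 'count']})
        df_woe.columns = ['bad', 'total']
        df_woe['good'] = df_woe['total'] - df_woe['bad']
        df_woe['dist_good'] = df_woe['good'] / df_woe['good'].sum()
        df_woe['dist_bad'] = df_woe['bad'] / df_woe['bad'].sum()
        # Positive WoE = above-average share of defaults, matching the Appendix A tables
        # (a bin with zero goods or zero bads would need smoothing; not handled here)
        df_woe['woe'] = np.log(df_woe['dist_bad'] / df_woe['dist_good'])
        df_woe['iv'] = (df_woe['dist_bad'] - df_woe['dist_good']) * df_woe['woe']
        return df_woe[['woe', 'iv']]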

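D.4 Generating WoE Encodings for All Variables

    features = ["LTV", "DSCR", "Loan_Age", "Property_Type", "Geography", "Sponsor_Strength"]
    # Assumed to hold category labels; pd.qcut only works on numeric data
    categorical = {"Property_Type", "Geography"}

    woe_maps = {}
    for var in features:
        if var in categorical:
            df[var + "_bin"] = df[var]  # each category is its own bin
        else:
            df[var + "_bin"] = monotonic_binning(df[var], df[TARGET])
        woe_table = compute_woe_iv(df, var + "_bin", TARGET)
        woe_maps[var] = woe_table['woe'].to_dict()
        # Missing inputs stay NaN here; Section 3.4 assigns them a dedicated WoE bin
        df[var + "_WOE"] = df[var + "_bin"].map(woe_maps[var]).astype(float)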

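D.5 Logistic Regression Training

    X = df[[f"{v}_WOE" for v in features]]
    y = df[TARGET]

    # Note: scikit-learn applies L2 regularization by default (C=1.0)
    model = LogisticRegression(max_iter=200)
    model.fit(X, y)

    # Fitted coefficients, labeled by WoE feature
    coefficients = pd.Series(model.coef_[0], index=X.columns)
    print(coefficients)
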
D.6 Scoring Function (Production Use)

    def score_new_loan(record):
        """Return the PD for one loan; `record` maps '<feature>_bin' to the loan's bin."""
        z = model.intercept_[0]
        for coef, v in zip(model.coef_[0], features):
            # Look up the WoE of the bin the loan falls into
            z += coef * woe_maps[v][record[v + "_bin"]]
        return 1 / (1 + np.exp(-z))

This function can be used for:
  • CECL monthly batch
  • PD score generation
  • Stress testing scenario PD