When Logistic Regression Beats Deep Learning: A Practical AI Triage Case Study

What CFPB complaint classification taught me about data quality, model trade-offs, and AI delivery risk

July 1, 2026 · 15 min read


Financial service providers receive large volumes of complaints across products such as credit reporting, debt collection, credit cards, mortgages, bank accounts, loans, and money transfers.

In a manual workflow, someone has to read each complaint narrative, understand the issue, and route it to the right product team. That is slow, inconsistent, and difficult to scale. It also creates operational risk because the first classification step can become a bottleneck during high-volume periods.

This project explored a practical question:

Can a text classification system use customer complaint narratives to predict the financial product category reliably enough to support faster triage?

In this project, I built and evaluated a complaint narrative classifier using traditional machine learning and deep learning approaches. The strongest practical model was a tuned Logistic Regression model using TF-IDF features. It reached a Macro F1 score of 0.8493 and remained simpler to explain, retrain, and deploy.

End-to-End Workflow of CFPB Complaint Narrative Classification

The useful lesson went beyond the winning metric. The outcome depended on the full AI lifecycle: problem framing, data preparation, label design, feature engineering, evaluation, cost, explainability, and delivery risk.

Why this write-up exists
This started as a Deep Learning and Natural Language Processing project. I treated it as an AI delivery case study because a model is only useful when its data assumptions, evaluation choices, operating cost, and stakeholder value are clear.

What This Project Demonstrates

This project demonstrates the kind of work required to move AI from experimentation toward practical delivery:

framing a real operational problem,
preparing large unstructured text data,
designing a reproducible modelling dataset,
comparing machine learning and deep learning approaches,
selecting metrics that match the business problem,
evaluating model trade-offs beyond raw accuracy,
thinking ahead about governance, monitoring, explainability, and human review.

The Business Problem

Complaint classification is an operational workflow problem with a modelling layer.

When complaint routing is manual, organisations face several constraints:

response time depends on human review capacity,
classification quality may vary across reviewers,
high-volume periods can create backlogs,
reporting is harder when categories are inconsistent,
escalation teams may receive cases later than necessary.

A classifier can support triage by giving each complaint an initial product category, confidence score, and review priority. Governance, accountability, and human judgement remain necessary, especially for low-confidence or high-risk cases. The classifier gives operations teams a more consistent starting point.

The strategic value is to reduce avoidable routing friction and keep human review available where it is most useful.

The Dataset Was Large Enough to Force Engineering Discipline

The project used the CFPB Consumer Complaint Database, a public US government dataset containing structured complaint metadata and free-text consumer narratives.

The raw CSV was large enough to shape the workflow:

Stage	Size	Rows	Purpose
Raw CSV	8.12 GB	15,079,538	Full CFPB complaint export
Curated Parquet	1.87 GB	3,778,235	Non-empty narratives and selected working columns
Gold Parquet	100.05 MB	80,000	Balanced modelling dataset with cleaned narratives

That scale required engineering discipline. I used two Jupyter notebooks for this project.

In the first notebook (01_data_preparation.ipynb), I inspected the raw data with Polars lazy scans and reduced it into Parquet layers. In the second notebook (02_modelling_report.ipynb), I loaded the gold dataset directly for the final modelling, which kept reruns fast and made the report easier to reproduce.

This follows the same pattern I would expect in a production AI environment:

keep the raw data immutable
create a curated layer for repeatable preparation
create a gold layer for model-ready training and evaluation

Choosing the Right Target Label

The source data included several candidate label columns: Product, Sub-product, Issue, and Sub-issue.

Candidate Label Column Unique Value Counts

Product was selected as the target because it had a manageable number of unique values and no missing values in the raw dataset. The other columns were more granular. They may be useful for later workflows, with the trade-off that they increase label complexity and make the first classification problem harder.

That was an architecture decision. The most detailed label can create unnecessary complexity at the start. A useful AI system often begins at the level of granularity that stakeholders can trust, operate, and improve.

For this project, the final prediction task was:

input: narrative_cleaned
target: Product_normalised

Cleaning Was Part of the Model Design

The raw complaint narrative is sensitive, inconsistent, and messy. The data preparation notebook handled this before any model training.

The main preprocessing steps were:

remove empty complaint narratives,
clean redaction markers such as XXXX,
remove masked dates such as XX/XX/XXXX,
convert text to lowercase,
normalise whitespace and slashes,
remove duplicate cleaned narratives,
keep narratives with at least 20 words,
normalise overlapping product labels,
keep the top 8 normalised product classes,
cap each class at 10,000 rows.
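A minimal sketch of the label normalisation step, assuming a plain lookup table. The raw product names and the groupings in `PRODUCT_MAP` are illustrative assumptions, not the exact mapping from the notebook, and `normalise_product` is a hypothetical helper name.

```python
# Hypothetical mapping from raw CFPB Product values to merged classes.
# The groupings are illustrative; the notebook's real mapping may differ.
PRODUCT_MAP = {
    "Credit card": "Credit card / prepaid card",
    "Prepaid card": "Credit card / prepaid card",
    "Credit card or prepaid card": "Credit card / prepaid card",
    "Bank account or service": "Bank account / service",
    "Checking or savings account": "Bank account / service",
    "Money transfers": "Money transfer / virtual currency",
    "Virtual currency": "Money transfer / virtual currency",
}


def normalise_product(raw: str) -> str:
    """Map a raw product label to its merged class, or keep it unchanged."""
    key = raw.strip()
    return PRODUCT_MAP.get(key, key)
```

Keeping the mapping in one dictionary makes the label design reviewable: a stakeholder can read it without opening the modelling code.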

This preparation directly affected model quality. Duplicates, empty narratives, inconsistent labels, and very short text can distort evaluation. When those issues are left in the dataset, the model score may look better than the system actually is.

Here is the core cleaning idea from the notebook.

I have shortened the snippet for the article, while keeping the full reusable implementation in this GitHub Gist: TextCleaner for Large-Scale NLP Preprocessing.

python

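import re


class TextCleaner:
    """Text-cleaning rules used for the complaint narratives."""

    # Minimum narrative length, matching the 20-word filter listed above
    MIN_WORDS = 20
    # Word boundary
    WORD = r"\b\w+\b"
    # As discovered earlier in `head` & `tail`
    REDACTION = r"\b[xX]{2,}\b"
    MASKED_DATE = r"(?i)\b(?:XX|\d{1,2})/(?:XX|\d{1,2})/(?:XXXX|\d{2,4})\b"
    BROAD_MASKED_DATE = (
        r"(?i)\b(?:[xX]{1,4}|\d{1,4})/(?:[xX]{1,2}|\d{1,2})/"
        r"(?:[xX]{2,4}|\d{2,4})\b"
    )
    SLASH = r"[/\\]+"
    WHITESPACE = r"\s+"

    # The two helpers below are simplified stand-ins so the snippet runs;
    # the Gist holds the full versions.
    def remove_redactions(self, text: str) -> str:
        # Drop dates (masked or not) first, then leftover XXXX markers and slashes
        text = re.sub(self.BROAD_MASKED_DATE, " ", text)
        text = re.sub(self.REDACTION, " ", text)
        return re.sub(self.SLASH, " ", text)

    def normalise_whitespace(self, text: str) -> str:
        return re.sub(self.WHITESPACE, " ", text)

    def clean(self, text: str) -> str:
        text = text.lower()
        text = self.remove_redactions(text)
        text = self.normalise_whitespace(text)
        return text.strip()

    def is_valid(self, text: str) -> bool:
        return len(text.split()) >= self.MIN_WORDS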

The key design choice is the centralised cleaning logic. It is documented and reused, which reduces hidden notebook state and makes the pipeline easier to maintain.

From Raw Data to a Balanced Gold Dataset

The raw dataset contained 15,079,538 rows. Only 3,778,235 rows had non-empty complaint narratives.

After cleaning and duplicate resolution, the working pool became 2,342,844 unique cleaned narratives. Applying the 20-word minimum left 2,252,159 rows. Product normalisation kept that same row count, and the top 8 label filter retained 2,213,876 rows.

For modelling, the dataset was then capped to 10,000 rows per class, producing a balanced 80,000-row gold dataset.
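The funnel from cleaned narratives to a balanced gold set can be sketched with the standard library. This is an illustrative stand-in, not the notebook's Polars code: `build_gold` is a hypothetical name, and the defaults simply mirror the figures quoted in this article (20 words, 8 classes, 10,000 rows per class).

```python
import random
from collections import Counter, defaultdict


def build_gold(rows, min_words=20, top_k=8, cap=10_000, seed=42):
    """Deduplicate, drop short texts, keep the top-k labels, cap each class.

    `rows` is an iterable of (cleaned_text, label) pairs.
    """
    seen = set()
    kept = []
    for text, label in rows:
        # Skip exact duplicates and narratives below the word minimum.
        if text in seen or len(text.split()) < min_words:
            continue
        seen.add(text)
        kept.append((text, label))

    # Keep only the top_k most frequent labels.
    label_counts = Counter(label for _, label in kept)
    top_labels = {label for label, _ in label_counts.most_common(top_k)}

    by_label = defaultdict(list)
    for text, label in kept:
        if label in top_labels:
            by_label[label].append((text, label))

    # Sample at most `cap` rows per class with a fixed seed for reruns.
    rng = random.Random(seed)
    gold = []
    for label in sorted(by_label):
        group = by_label[label]
        gold.extend(rng.sample(group, min(cap, len(group))))
    return gold
```

Fixing the seed matters here: the cap discards most rows, so an unseeded sample would produce a different gold dataset on every run.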

The final labels were:

Bank account / service
Consumer / vehicle loan
Credit card / prepaid card
Credit reporting / consumer reports
Debt collection
Money transfer / virtual currency
Mortgage
Student loan

The class cap deliberately traded raw volume for evaluation clarity and rerun speed. In an operational system, I would keep the full distribution visible and evaluate real-world class imbalance separately. For this project, a balanced dataset made model comparison fairer because each class had equal representation.

Feature Engineering: Two Paths for the Same Problem

I used the same stratified train-test split for all models:

Split	Rows	Share
Train	64,000	80%
Test	16,000	20%

Every class had 8,000 training rows and 2,000 test rows. Sharing one split across all models meant that differences in scores could not come from differences in the data.
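A stratified split can be sketched in a few lines of standard-library Python. This is an illustration of the idea rather than the notebook's code; in practice scikit-learn's `train_test_split` with its `stratify` argument is the usual route. `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_share=0.2, seed=42):
    """Split so every class contributes the same share to the test set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_share)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    def pick(idxs):
        return [texts[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(test_idx)
```

With 10,000 rows per class and a 20% test share, each class contributes 2,000 test rows, which matches the table above.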

The machine learning models used TF-IDF features:

python

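from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=5,
    max_features=50_000,
    sublinear_tf=True,
)

# Learn vocabulary and IDF weights from the training text only
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
# Reuse the fitted vocabulary on the test text
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)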

The key detail here is fit_transform on training text and transform on test text. The vectoriser learns vocabulary from the training set only. That avoids leakage from the test set into feature preparation.
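To make the fit/transform separation concrete, here is a toy stand-in for a TF-IDF vectoriser, written with the standard library only. It is deliberately simplified (unigrams, no length normalisation, no `min_df`), and `ToyTfidf` is a hypothetical class, not part of scikit-learn. The weighting uses sublinear term frequency and a smoothed IDF.

```python
import math
from collections import Counter


class ToyTfidf:
    """Toy TF-IDF: vocabulary and IDF come from the fitting corpus only."""

    def fit(self, docs):
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.split()))
        # Smoothed IDF, learned from the fitting corpus only.
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            # Sublinear TF: 1 + ln(count). Terms unseen during fit are dropped.
            rows.append({
                term: (1 + math.log(count)) * self.idf_[term]
                for term, count in counts.items()
                if term in self.idf_
            })
        return rows


train_docs = ["late fee on credit card", "mortgage escrow payment late"]
test_docs = ["crypto wallet transfer late"]

vectoriser = ToyTfidf().fit(train_docs)
test_rows = vectoriser.transform(test_docs)
```

In this example the test document keeps only the term `late`, because the other three words never appeared in the training text. That is the behaviour that protects the test score: nothing about the test vocabulary or its document frequencies reaches the features.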

The deep learning model used token sequences instead:

vocabulary size: 20,000 tokens,
maximum sequence length: 250 words,
embedding layer,
Conv1D,
global max pooling,
softmax output across 8 classes.

This created a useful comparison between sparse TF-IDF features with linear models and learned word embeddings with local phrase patterns from a 1D Convolutional Neural Network (1D CNN).

Models Built

I compared three models.

Multinomial Naive Bayes

Naive Bayes is a fast baseline for sparse text features. It is simple, efficient, and commonly used for text classification.

Its role in this project was to establish a quick baseline. If a more complex model cannot beat a simple baseline, the additional complexity needs a strong justification.

Logistic Regression

Logistic Regression was the stronger traditional machine learning model. It used the same TF-IDF matrix, then learned weighted evidence for each product class.

The training flow was intentionally straightforward:

python

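from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(
    C=1.0,
    solver="saga",
    max_iter=1000,
    random_state=42,
)

# Fit on the training TF-IDF matrix, then predict the held-out test set
lr_model.fit(X_train_tfidf, y_train)
lr_test_pred = lr_model.predict(X_test_tfidf)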

For production triage, this model family has an important advantage: it is relatively easy to inspect, explain, deploy, and retrain.
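One reason a linear model is easy to inspect is that each class score is a sum of per-term contributions (feature value times class coefficient, plus an intercept). The sketch below ranks those contributions for one document. The weights and feature values are made-up numbers for illustration; with scikit-learn the coefficients would come from the fitted model's `coef_` attribute and the terms from the vectoriser's `get_feature_names_out()`. `top_contributions` is a hypothetical helper.

```python
def top_contributions(features, class_weights, k=3):
    """Rank the terms that push one document toward one class.

    features: {term: tfidf_value} for a single document.
    class_weights: {term: coefficient} for a single class of a linear model.
    """
    contributions = {
        term: value * class_weights.get(term, 0.0)
        for term, value in features.items()
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical values for illustration only.
doc = {"escrow": 0.6, "payment": 0.3, "late": 0.2}
mortgage_weights = {"escrow": 2.5, "payment": 0.4, "late": -0.1}

top_terms = top_contributions(doc, mortgage_weights, k=2)
```

Here `escrow` ranks first and `payment` second, which is the kind of evidence a reviewer can sanity-check against the complaint text.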

1D Convolutional Neural Network

The deep learning model used an embedding layer followed by Conv1D, GlobalMaxPooling1D, and a dense softmax layer.

The idea is that a convolution can learn local phrase patterns. In complaint text, phrases about payment disputes, credit report errors, debt collection calls, or mortgage servicing problems can carry strong product signals.

The baseline CNN had 2,643,080 trainable parameters. The tuned CNN reduced capacity to 1,957,736 trainable parameters by lowering the embedding dimension, number of filters, and kernel size. It was also trained for fewer epochs. Both changes were made because the baseline CNN showed signs of overfitting after epoch 3.
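The parameter totals can be checked with simple arithmetic for an Embedding, Conv1D, GlobalMaxPooling1D, Dense stack. The hyperparameters below are an assumption: the article does not state them, and they were chosen only because they reproduce the reported totals with a 20,000-token vocabulary and 8 output classes. The notebook's actual settings may differ.

```python
def cnn_trainable_params(vocab_size, embed_dim, filters, kernel_size, n_classes):
    """Count weights for Embedding -> Conv1D -> GlobalMaxPooling1D -> Dense."""
    embedding = vocab_size * embed_dim
    conv = kernel_size * embed_dim * filters + filters  # kernels + biases
    dense = filters * n_classes + n_classes  # pooling adds no weights
    return embedding + conv + dense


# Assumed hyperparameters, chosen because they reproduce the reported totals.
baseline = cnn_trainable_params(20_000, 128, 128, 5, 8)
tuned = cnn_trainable_params(20_000, 96, 96, 4, 8)
print(baseline, tuned)  # 2643080 1957736
```

Under these assumed settings the embedding table holds well over 90% of the weights in both models, so lowering the embedding dimension is the change that removes the most capacity.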

Baseline 1D CNN training history before tuning — The baseline 1D CNN continued improving on training loss while validation loss started rising after epoch 3, indicating overfitting.

Tuned 1D CNN training history after reducing capacity — The tuned 1D CNN produced a more stable validation curve after reducing model capacity and limiting training duration.

This is a practical lesson: deep learning tuning often starts with restraint. Training longer may reinforce overfitting. Reducing capacity can help the model generalise better.

Evaluation: Macro F1 Over Accuracy

I used Macro F1 as the main comparison metric.

Accuracy was still reported. Macro F1 was more appropriate because all 8 product labels were important. A model that performs well on common classes and poorly on smaller or more operationally sensitive classes can look acceptable under accuracy alone.
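Macro F1 is the unweighted mean of per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it from scratch and contrasts it with accuracy on a small imbalanced toy example (not project data). On the balanced gold test set the two metrics stay close, as the results table shows; the gap opens up when class sizes differ.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy example: always predicting the majority class.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 4))  # 0.8 0.4444
```

Accuracy rewards the majority-class guess with 0.8, while Macro F1 drops to about 0.44 because class `b` is never predicted.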

The final tuned comparison was:

Model	Test Accuracy	Test Macro Precision	Test Macro Recall	Test Macro F1
Multinomial Naive Bayes (Tuned)	0.7950	0.8070	0.7950	0.7950
Logistic Regression (Tuned)	0.8490	0.8499	0.8490	0.8493
1D CNN (Tuned)	0.8450	0.8454	0.8450	0.8450

The strongest model was tuned Logistic Regression with Macro F1 of 0.8493.

The tuned 1D CNN was close at 0.8450. Naive Bayes improved slightly after tuning and remained a useful baseline. It still trailed the other two models.

Looking Beyond the Score with Confusion Matrices

The metric table ranked the models. The confusion matrices explained the failure patterns.

The plot below comes from the baseline comparison on the same test set. It uses normalised values, so each row shows how complaints from one true product category were distributed across predicted categories.

Normalised confusion matrices for Naive Bayes, Logistic Regression, and 1D CNN complaint classifiers — Normalised confusion matrices helped reveal where each model confused product categories.
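Row normalisation divides each count by the total for its true class, so a row reads as "of all complaints that truly belong to this product, what share went to each prediction". A minimal standard-library sketch, with a hypothetical function name and toy labels:

```python
from collections import Counter


def normalised_confusion(y_true, y_pred, labels):
    """Return rows of P(predicted | true); each non-empty row sums to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row_total = sum(counts[(true_label, pred)] for pred in labels)
        matrix.append([
            counts[(true_label, pred)] / row_total if row_total else 0.0
            for pred in labels
        ])
    return matrix


# Toy example: one of four transfer complaints is predicted as bank.
y_true = ["bank", "bank", "transfer", "transfer", "transfer", "transfer"]
y_pred = ["bank", "bank", "bank", "transfer", "transfer", "transfer"]
matrix = normalised_confusion(y_true, y_pred, ["bank", "transfer"])
print(matrix)  # [[1.0, 0.0], [0.25, 0.75]]
```

The diagonal of each row is that class's recall, which is why the normalised view makes weak classes easy to spot.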

Naive Bayes was the weakest overall. It performed well on Mortgage and Student loan but struggled with Money transfer / virtual currency. Many of those complaints were predicted as Bank account / service, which is a plausible business confusion because both categories can involve account access, transfers, and transaction handling.

Logistic Regression improved performance across most classes, especially Money transfer / virtual currency. That was useful because this category exposed one of the clearer weaknesses in the Naive Bayes baseline.

The 1D CNN behaved similarly to Logistic Regression. It was slightly stronger for Money transfer / virtual currency, Mortgage, and Student loan, and slightly weaker for Bank account / service and Debt collection.

The class-level view supported the final model choice. Logistic Regression gave strong, balanced behaviour across the categories while staying simpler to explain, retrain, and operate.

Why Logistic Regression Was the Best Practical Choice

The final model choice came from score, complexity, explainability, and delivery effort.

Tuned Logistic Regression had the strongest Macro F1. It also achieved that performance with lower complexity than the CNN.

The practical comparison looked like this:

Criteria	Logistic Regression	1D CNN
Performance	Slightly stronger	Very close
Interpretability	Easier to explain	Less transparent
Training cost	Lower	Higher
Deployment ease	Simpler	More complex

This is the kind of trade-off that affects real AI delivery. A deep learning model may be more flexible. That flexibility can require more compute, more tuning, more monitoring, and more specialised deployment support.

For a complaint triage baseline, Logistic Regression with TF-IDF is a strong practical choice because it is:

accurate enough to be useful,
fast to train,
simpler to rerun,
easier to explain to stakeholders,
easier to deploy as a first production candidate.

The CNN remained valuable because it provided a fair deep learning benchmark. It showed that deep learning was competitive for this dataset and scope. The simpler model remained the better first production candidate.

What This Means for Production Readiness

A model notebook is an experiment artefact. A complaint triage system needs workflow design, controls, and observability.

Key production considerations:

Human review: route low-confidence or high-risk complaints to manual review before automation.
Confidence thresholds: use model confidence to decide whether to auto-route, suggest, or escalate.
Explainability: expose top contributing terms or class evidence for review teams.
Data privacy: preserve redaction handling and avoid storing unnecessary sensitive text in downstream systems.
Model monitoring: track class distribution, confidence drift, and complaint types that are repeatedly misrouted.
Retraining: schedule model refreshes when new complaint language, products, or policy categories appear.
Auditability: keep data version, model version, preprocessing version, and prediction timestamp together.
Cost control: prefer the simpler model unless a more complex model produces clear operational value.
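The confidence-threshold idea from the list above can be sketched as a small routing function. The thresholds (0.90 and 0.60) and action names are placeholders, not values from this project; real thresholds would need to be set from validation data and the cost of a misrouted complaint.

```python
def route(probabilities, auto_threshold=0.90, suggest_threshold=0.60):
    """Pick a routing action from class probabilities.

    The thresholds are placeholders; real values should come from
    validation data and the cost of a misrouted complaint.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence >= auto_threshold:
        action = "auto_route"
    elif confidence >= suggest_threshold:
        action = "suggest"
    else:
        action = "manual_review"
    return label, confidence, action
```

A high-risk product category could be given its own stricter threshold, or be sent to manual review regardless of confidence.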

The most important governance point is traceability. When a complaint is routed by AI, the organisation should be able to answer:

what data was used,
what preprocessing was applied,
which model version made the prediction,
what confidence or evidence supported the prediction,
when a human reviewed or overrode the result.
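One way to keep those answers together is a single immutable record per prediction. The sketch below is a hypothetical schema: the field names and version strings are illustrative assumptions, not an existing system.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PredictionRecord:
    """One auditable routing decision. Field names are hypothetical."""

    complaint_id: str
    data_version: str
    preprocessing_version: str
    model_version: str
    predicted_label: str
    confidence: float
    predicted_at: str
    # Filled in only when a human reviews or overrides the prediction.
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[str] = None
    override_label: Optional[str] = None


record = PredictionRecord(
    complaint_id="example-001",
    data_version="gold-v1",
    preprocessing_version="cleaner-v1",
    model_version="lr-tfidf-v1",
    predicted_label="Mortgage",
    confidence=0.93,
    predicted_at=datetime.now(timezone.utc).isoformat(),
)
print(asdict(record))
```

Because the record is frozen, a later human override would be written as a new record rather than an in-place edit, which keeps the original prediction available for audit.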

That is where AI architecture moves beyond model training. The system has to support operational trust. I explored the same production shift from model output to operational system design in my near-real-time rain risk ML pipeline on AWS, where orchestration, monitoring, serving cadence, and cost controls became part of the model decision.

Lessons for AI Delivery

This project reinforced a few lessons I consider central to practical AI work.

First, the target label is a product decision. Selecting Product created a useful first workflow. More detailed labels may come later, and starting at the right level of granularity improves adoption.

Second, preprocessing is part of governance. Removing empty narratives, duplicates, redaction artefacts, and inconsistent labels protects evaluation quality.

Third, leakage control is a delivery risk. Fitting TF-IDF only on training data is a small line of code, with a large effect on the integrity of the test score.

Fourth, deep learning should earn its complexity. The 1D CNN was close. Logistic Regression was slightly stronger and simpler. In a real delivery conversation, that affects the recommendation.

Finally, model comparison must include business constraints. Accuracy, Macro F1, interpretability, retraining cost, deployment effort, and stakeholder trust all belong in the same decision.

What I Would Improve Next

If I extended this project, I would focus on the production workflow before adding more model complexity.

The next iteration would include:

confidence-based routing rules,
a human review queue for ambiguous complaints,
class-level monitoring dashboards,
prediction explanations for reviewers,
periodic drift checks,
evaluation on the natural imbalanced class distribution,
a small inference API or batch scoring job,
a feedback loop from reviewer corrections back into training data.

I would also evaluate the value of a transformer model against its cost and complexity. That would be a sensible next experiment, and it should be judged against the same practical criteria: better class-level performance, acceptable latency, explainability options, cost, and maintainability.

Conclusion

The final result was clear: tuned Logistic Regression with TF-IDF was the strongest practical model for this complaint classification project, reaching Macro F1 of 0.8493 on the balanced test set.

The broader value of the project came from the delivery path. It showed how to move from raw unstructured text to a reproducible modelling dataset, compare traditional machine learning with deep learning, control leakage, evaluate with the right metric, and make a model selection decision that respects operational constraints.

A practical AI system needs more than a trained model. It needs sound data design, careful evaluation, governance, monitoring, and a workflow that stakeholders can trust.