AI Technical Debt in Software Repositories
Does AI-generated code create more technical debt? We analyzed 5,000+ GitHub repositories to find out.
Overview
As AI coding assistants become ubiquitous, a critical question emerges for software engineering: does code generated by AI accumulate technical debt faster than human-written code? This empirical study mined 5,000+ public GitHub repositories, classified AI-assisted vs. human-written commits, and ran static analysis to compare code quality metrics at scale.
The Problem
AI coding tools like GitHub Copilot are widely adopted, but their long-term impact on codebase maintainability is unknown. Anecdotal evidence suggests AI code may pass tests while introducing subtle smells: duplicated logic, overly complex methods, missing documentation. This study provides the first large-scale empirical measurement of these effects.
Questions Addressed
1. Are repositories with high AI-assisted commit rates associated with higher technical debt density (issues per KLOC)?
2. Do specific code smell categories (complexity, duplication, documentation) differ significantly between AI-assisted and human-written code?
3. Is there a threshold of AI usage beyond which code quality metrics deteriorate measurably?
Methodology
Data Collection
Used the GitHub API to identify 5,000+ repositories with AI tool fingerprints in commit messages and PR descriptions (keywords: "Copilot", "ChatGPT", "AI-generated"). Matched each with a control repository of similar size, language, and activity. Extracted commit histories, contributor counts, and issue-tracker data.
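As a minimal sketch of the fingerprinting step, the snippet below counts keyword matches in a repository's commit messages via GitHub's commit-search endpoint. The function name and error handling are illustrative, not the study's exact pipeline:

```python
import requests

# Keywords mirroring the AI-tool fingerprints described above.
AI_KEYWORDS = ["Copilot", "ChatGPT", "AI-generated"]

def count_fingerprinted_commits(repo: str, token: str, keyword: str) -> int:
    """Count commits in `repo` (e.g. "owner/name") whose messages
    contain `keyword`, using the GitHub commit-search API."""
    resp = requests.get(
        "https://api.github.com/search/commits",
        params={"q": f"repo:{repo} {keyword}"},
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]
```

Summing matches across keywords and dividing by a repository's total commit count gives a rough AI-assisted commit rate; since keywords can co-occur in one message, an actual classifier would presumably deduplicate by commit SHA, which raw counts alone do not do.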
Static Analysis
Ran SonarQube analysis across all repositories to measure: cyclomatic complexity, code duplication %, documentation coverage, and bug density. Classified findings by severity (blocker, critical, major, minor) and normalized by KLOC for fair comparison across project sizes.
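To illustrate the normalization, the sketch below pulls project-level measures from a SonarQube server's Web API and derives bug density per KLOC. The metric keys are standard SonarQube keys; the server URL, project key, and function name are placeholders:

```python
import requests

# SonarQube metric keys covering the four dimensions measured above.
METRICS = "complexity,duplicated_lines_density,comment_lines_density,bugs,ncloc"

def debt_metrics(sonar_url: str, project_key: str, token: str) -> dict:
    """Fetch project-level measures and normalize bug count by KLOC."""
    resp = requests.get(
        f"{sonar_url}/api/measures/component",
        params={"component": project_key, "metricKeys": METRICS},
        auth=(token, ""),  # SonarQube takes the token as the basic-auth user
        timeout=30,
    )
    resp.raise_for_status()
    measures = {m["metric"]: float(m["value"])
                for m in resp.json()["component"]["measures"]}
    kloc = measures["ncloc"] / 1000.0
    measures["bug_density_per_kloc"] = measures["bugs"] / kloc if kloc else 0.0
    return measures
```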
Statistical Analysis & Findings
Applied Mann-Whitney U tests (non-parametric, appropriate for non-normal distributions) to compare debt metrics between AI-assisted and control groups. Computed effect sizes using Cohen's d. Built regression models to identify which AI usage levels correlate with quality degradation.
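A sketch of the per-metric comparison, assuming one array of values per group; this mirrors the described tests but is not the study's exact analysis script:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_groups(ai_vals, control_vals) -> dict:
    """Mann-Whitney U test plus Cohen's d (pooled SD) for one debt metric."""
    ai = np.asarray(ai_vals, dtype=float)
    ctl = np.asarray(control_vals, dtype=float)
    stat, p = mannwhitneyu(ai, ctl, alternative="two-sided")
    pooled_sd = np.sqrt(
        ((len(ai) - 1) * ai.var(ddof=1) + (len(ctl) - 1) * ctl.var(ddof=1))
        / (len(ai) + len(ctl) - 2)
    )
    d = (ai.mean() - ctl.mean()) / pooled_sd
    return {"U": stat, "p": p, "cohens_d": d}
```

Reporting a rank-based test alongside a mean-based effect size, as here, hedges against the skewed, non-normal distributions typical of per-repository debt metrics.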
Key Findings
Repositories with >40% AI-assisted commits show significantly higher duplication rates (p < 0.01, Cohen's d = 0.42, a medium effect size).
Documentation coverage is 23% lower on average in AI-heavy repositories, suggesting AI tools generate functional code but skip docstrings and comments.
No significant difference was found in bug density between groups, challenging the assumption that AI code is inherently more bug-prone.
The relationship between AI usage and technical debt is non-linear: moderate AI use (20–40%) shows no degradation; only heavy use (>60%) triggers measurable quality drops (a banded comparison is sketched below).
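One simple way to surface that non-linearity, assuming paired per-repository arrays of AI-commit rate and debt density, is to compare medians across usage bands. The band edges follow the thresholds reported above; the function is illustrative, not the study's regression model:

```python
import numpy as np

def debt_by_usage_band(ai_rates, debt_densities) -> dict:
    """Median debt density within AI-usage bands, mirroring the
    moderate (20-40%) vs. heavy (>60%) comparison reported above."""
    rates = np.asarray(ai_rates, dtype=float)
    debt = np.asarray(debt_densities, dtype=float)
    bands = {
        "low (<20%)": rates < 0.20,
        "moderate (20-40%)": (rates >= 0.20) & (rates <= 0.40),
        "high (40-60%)": (rates > 0.40) & (rates <= 0.60),
        "heavy (>60%)": rates > 0.60,
    }
    return {name: float(np.median(debt[mask])) if mask.any() else None
            for name, mask in bands.items()}
```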
Conclusion
AI coding assistants are not inherently harmful to code quality — but unchecked, high-volume AI usage correlates with increased duplication and reduced documentation. Teams should integrate AI tools with code review policies that specifically check for documentation and duplication smells. The full dataset and analysis scripts are available for replication.