Balahari Vignesh Balu

General information about the courses of B. V. Balu

Further courses
While large closed-source LLMs perform reasonably well on widely used open-source programming languages and publicly available repositories, their effectiveness decreases in industrial environments. Companies rely on proprietary codebases, internal documentation, architectural models, and domain-specific artifacts that are largely absent from public training data. Consequently, directly applying generic LLMs often leads to limited reliability and contextual understanding. To make LLMs genuinely useful in such environments, smaller, adaptable models and agent-based workflows must be developed that can integrate structured and domain-specific knowledge in a controlled and reliable manner.
 
This project explores how structured software-engineering artifacts can enhance LLM-based systems for code understanding. Students will work with two datasets: a UML corpus containing heterogeneous diagrams and multilingual descriptions, and a dataset of vulnerability–fix pairs collected from programming-language repositories. The project focuses on two main challenges: (1) preparing high-quality structured datasets from noisy sources, including deduplication, normalization, multilingual processing, and semantic preservation, and (2) evaluating methods for integrating this information into LLM-based systems, such as parameter-efficient fine-tuning, retrieval-augmented generation, and agent-based approaches with external knowledge.
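
As a rough illustration of the deduplication and normalization step mentioned above, the following minimal sketch removes exact duplicates after canonicalizing text. It is not the project's actual pipeline: the record layout and the field names `id` and `description` are hypothetical, and the choice of NFKC normalization, case folding, and whitespace collapsing is an assumption about what "normalization" could mean for multilingual descriptions.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Return a canonical form used only for duplicate detection."""
    text = unicodedata.normalize("NFKC", text)  # unify Unicode variants
    text = text.casefold()                      # aggressive, language-aware lowercasing
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def deduplicate(records: list[dict], key: str = "description") -> list[dict]:
    """Keep the first record for each distinct normalized value of `key`.

    `key` is a hypothetical field name; adapt it to the real dataset schema.
    """
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)  # the original, unnormalized text is preserved
    return unique


# Hypothetical sample records, not taken from the project datasets.
records = [
    {"id": 1, "description": "Class  Diagram of the ORDER system"},
    {"id": 2, "description": "class diagram of the order system "},
    {"id": 3, "description": "Sequenzdiagramm für den Login"},
]
unique_records = deduplicate(records)
```

The normalized form is used only as a comparison key, so the stored text keeps its original wording, which is one simple way to respect semantic preservation. This sketch catches only duplicates that become identical after normalization; near-duplicates would need a fuzzy technique such as MinHash or embedding similarity.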
 
The project will follow an Agile/Scrum framework to ensure iterative development and integration. Students will be organized into self-managing sub-teams responsible for planning, tracking, and version control. The structure also includes weekly sprints, milestone reviews, and MVP demonstrations to manage and achieve the defined goals. Students will develop reproducible experimental pipelines and empirically analyze how structured knowledge improves the reliability and interpretability of LLM-supported analysis.