Machine-/Deep-Learning-based GitHub Project Classification

Description
Open-source software repositories are a valuable source of information for software engineering researchers, providing insights into coding patterns, development practices, and developer collaboration. A key challenge for such studies is selecting suitable repositories for analysis, which requires categorizing projects by type (e.g., libraries or applications). This classification is usually performed manually, which is time-consuming and limits dataset size.

We are therefore developing an automated classifier that categorizes software repositories into four types: application, library, framework, and plugin. The classifier is based on Graph-Convolutional Neural Networks (GCNN) and uses metamodels as a language-independent representation of source code. The goal of this thesis is to develop this classifier further by extending the training dataset, tuning the hyperparameters of the GCNN, and implementing parsers for additional programming languages.
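
To give a rough idea of the approach, the following is a minimal sketch of a graph-convolutional classifier over a small graph, written in plain Python. It is not the project's actual implementation: the node types, the toy graph, the layer sizes, and all function names are assumptions made for illustration, and the weights are random and untrained, so the predicted class carries no meaning. The sketch uses the common normalization D^-1/2 (A + I) D^-1/2, followed by mean pooling over all nodes and a softmax over the four repository types.

```python
import math
import random

CLASSES = ["application", "library", "framework", "plugin"]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph with n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat * H * W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, x) for x in row] for row in z]


def classify(edges, features, w1, w2):
    """Return class probabilities for one graph."""
    n = len(features)
    a_hat = normalized_adjacency(edges, n)
    h = gcn_layer(a_hat, features, w1)
    # Mean pooling turns per-node embeddings into one graph-level vector.
    pooled = [sum(col) / n for col in zip(*h)]
    logits = matmul([pooled], w2)[0]
    # Numerically stable softmax over the four classes.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy graph: 4 nodes with one-hot node types
# (package, class, method), chosen only for this example.
edges = [(0, 1), (0, 2), (2, 3)]
features = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]

rng = random.Random(0)
w1 = random_matrix(3, 8, rng)  # input features -> hidden
w2 = random_matrix(8, 4, rng)  # hidden -> four repository types

probs = classify(edges, features, w1, w2)
prediction = CLASSES[probs.index(max(probs))]
print(prediction, [round(p, 3) for p in probs])
```

In practice, such a model would typically be built with a deep learning library and trained on labeled repositories; the sketch only shows how graph structure and node features are combined into a single prediction per repository.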

Prerequisites: Programming experience with Python, basic knowledge of ML and DL
Contact: Yorick Sens (yorick.sens@rub.de)
Extent: B.Sc./M.Sc.