AI Projects 02 Jun 2024

Local Models and LangChain

While working on some badly documented GitHub repositories, I was curious what the state of local models was, and whether I could build an agent to help clear up some of the mess.

AI Placeholder - Chris
Auto Doc - PhpMailer

The Trials Begin

The process runs with access to the code files only; existing documentation is intentionally excluded. There is an option to use a language-specific preprocessor to strip out function headers and the like, but the test is to see how well code can be interpreted using a RAG process.

The demo below is potentially polluted, since PHPMailer is public and the model may already have seen it, but similar results were achieved with a recently created private repository.

Systems used: ChromaDB, LangChain, LCEL, Code Llama, PyGithub API.

Attempt 1:
The first attempt used the non-LCEL classes from LangChain. It was implemented as a ConversationalRetrievalChain with a refine-type chain. Essentially: process each file, get a synopsis, combine the synopses, and build a final answer with each summary as context.

This worked on smaller codebases, but as the repositories grew in size, context window limitations and cross-context references became more of a challenge. Eventually, the final summary was clearly only seeing part of the prompt, and the results were incoherent.
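To make the failure mode concrete, here is a minimal sketch in plain Python of the flow described above: one synopsis per file, then a single final prompt that carries every synopsis. It is not the LangChain code used for the project. The `LLM` callable, `fake_llm`, and the file names are hypothetical stand-ins so the sketch runs without a model.

```python
from typing import Callable, Sequence

# Hypothetical model interface (an assumption): prompt in, text out.
LLM = Callable[[str], str]


def summarize_then_combine(
    files: Sequence[tuple[str, str]], llm: LLM
) -> tuple[str, int]:
    """Summarize each file in turn, then ask for one final answer.

    Returns the final answer and the size of the final prompt in characters.
    """
    synopses = []
    for path, source in files:
        synopses.append(llm(f"Summarize the file {path}:\n{source}"))

    # Every synopsis is pasted into one prompt, so its size grows with the repo.
    final_prompt = (
        "Write project documentation from these synopses:\n\n"
        + "\n\n".join(synopses)
    )
    return llm(final_prompt), len(final_prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a local model: always returns a fixed-size synopsis."""
    return "x" * 200


# Hypothetical repositories of 5 and 500 files.
small_repo = [(f"file_{i}.php", "<?php // ...") for i in range(5)]
large_repo = [(f"file_{i}.php", "<?php // ...") for i in range(500)]

_, small_prompt_size = summarize_then_combine(small_repo, fake_llm)
_, large_prompt_size = summarize_then_combine(large_repo, fake_llm)
```

The final prompt grows linearly with the number of files, so at some repository size it no longer fits in the context window, whatever the size of each individual synopsis.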

So I moved on to a map-reduce style chain!

Combining this with the simplified coding style in LCEL allowed the source files to be processed in parallel and generate a reduced context for final processing. The last repository tested had over 500 source code files and still generated a reasonable amount of documentation.
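The map-reduce idea can be sketched without LangChain at all. The version below is an illustration under assumptions, not the LCEL chain from the project: it maps over files with a thread pool, then merges summaries in fixed-size batches until one remains. The batched reduce, `batch_size`, `max_workers`, and `fake_llm` are all choices made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical model interface (an assumption): prompt in, text out.
LLM = Callable[[str], str]


def map_reduce_docs(
    files: Sequence[tuple[str, str]],
    llm: LLM,
    batch_size: int = 10,
    max_workers: int = 4,
) -> str:
    """Summarize files in parallel, then merge summaries in bounded batches."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each pass shrinks the list")
    if not files:
        return ""

    def summarize(item: tuple[str, str]) -> str:
        path, source = item
        return llm(f"Summarize the file {path}:\n{source}")

    def combine(batch: Sequence[str]) -> str:
        return llm("Combine these summaries:\n\n" + "\n\n".join(batch))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map: one independent model call per file.
        summaries = list(pool.map(summarize, files))
        # Reduce: merge in batches until a single summary remains.
        while len(summaries) > 1:
            batches = [
                summaries[i:i + batch_size]
                for i in range(0, len(summaries), batch_size)
            ]
            summaries = list(pool.map(combine, batches))
    return summaries[0]


prompts_seen: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a local model that records every prompt it receives."""
    prompts_seen.append(prompt)  # list.append is thread-safe in CPython
    return "s" * 100


# Hypothetical 500-file repository.
repo = [(f"file_{i}.php", "<?php // ...") for i in range(500)]
result = map_reduce_docs(repo, fake_llm)
largest_prompt = max(len(p) for p in prompts_seen)
```

Because each reduce call sees at most `batch_size` summaries, the largest prompt stays roughly constant as the repository grows, at the cost of a few extra model calls.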

Re-reading the GitHub repository after each modification was a challenge as well. Since I was going to test small changes to the chain and prompt, it made more sense to create embeddings from the projects and store them as vectors in ChromaDB. This saved significant time, as I could re-query repositories without going through the source retrieval process unnecessarily.
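The caching pattern itself is simple: embed once, persist, and reload on later runs. The sketch below shows the pattern only. It swaps ChromaDB for a JSON file and a real embedding model for a hashed bag-of-words function (`toy_embed`), both stand-ins invented for this example, and the file paths are hypothetical.

```python
import hashlib
import json
import math
import tempfile
from pathlib import Path


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_or_load_index(index_path: Path, fetch_files) -> dict:
    """Reuse a saved index if present; otherwise fetch and embed sources once."""
    if index_path.exists():
        return json.loads(index_path.read_text())
    index = {
        path: {"text": text, "vector": toy_embed(text)}
        for path, text in fetch_files()
    }
    index_path.write_text(json.dumps(index))
    return index


def query(index: dict, question: str, k: int = 1) -> list[str]:
    """Return the k paths whose vectors score highest against the question."""
    q = toy_embed(question)

    def score(item) -> float:
        return sum(a * b for a, b in zip(q, item[1]["vector"]))

    ranked = sorted(index.items(), key=score, reverse=True)
    return [path for path, _ in ranked[:k]]


fetch_calls = []


def fetch_files():
    """Stand-in for the slow step of pulling sources from GitHub."""
    fetch_calls.append(1)
    return [
        ("src/Mailer.php", "send mail over smtp"),
        ("src/Address.php", "parse address list"),
    ]


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.json"
    first = build_or_load_index(index_path, fetch_files)   # fetches and embeds
    second = build_or_load_index(index_path, fetch_files)  # loads from disk
    top = query(second, "smtp", k=1)
```

The second call never touches `fetch_files`, which is the time saving described above: prompt and chain changes can be tested against the stored vectors without retrieving the sources again.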

One final step added was getting the result into Markdown. I added a chain specifically for this step to ensure that the results were consistent. Samples of the produced output are below:

Auto Doc - PhpMailer

PhP Mailer Auto Doc Features List