Why I ditched AI/LLMs almost completely for development

Jun 14

If this isn't for you, then I guess this blog post is more a place for me to dump my thoughts. EMerge is about one year old, so LLMs have been around since the beginning of its development, with the more advanced agentic ones arriving mostly in the latter half. I started mostly with ChatGPT and then, around December last year, moved to Gemini and Claude because ChatGPT kept gaslighting me with nonsense, insisting that stuff didn't work when it absolutely did. Last week I cancelled my Claude subscription. My development isn't completely LLM-free, but it mostly is. I want to discuss why.

What did I use it for?

I started off using LLMs mostly for just asking questions, sometimes implementing small functions and checking for bugs.

Simple function coding

In cases where you know what a function is supposed to do, the scope of the function is entirely decoupled from the rest of your project, and you know the solution but just don't feel like writing it, LLMs can be very useful. One of these functions is "align_rect_basis" in https://github.com/FennisRobert/EMerge/blob/main/src/emerge/_emerge/selection.py. The job is simple: you get a set of 3D coordinates which you know lie on the boundary of a rectangle with some arbitrary orientation in space (perpendicular to a provided normal vector). The goal is to find the center of the rectangle and the two basis vectors that align with its long and short edges. This function is used to determine a good coordinate system for the Rectangular Waveguide boundary condition, such that the port mode functions are correctly defined relative to the center of the waveguide.

The function is simple and I could write it myself, but it has some annoying details that LLMs are perfectly capable of handling.
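
For illustration, here is a minimal NumPy sketch of the rectangle task described above. It is not the actual "align_rect_basis" from EMerge: the name rect_basis_sketch and the approach (trying candidate edge directions from the first point and keeping the tightest bounding box) are assumptions made for this example. It also assumes the points are coplanar, include the corners, and that at least one other point lies on the same edge as the first point.

```python
import numpy as np

def rect_basis_sketch(points, normal):
    """Return (center, long_axis, short_axis) for points on a rectangle boundary.

    Hypothetical helper, not EMerge's implementation. Assumes coplanar points
    that include the corners, with another point on the same edge as points[0].
    """
    pts = np.asarray(points, dtype=float)
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)

    # Build an arbitrary orthonormal in-plane frame (u, v) perpendicular to n.
    helper = np.eye(3)[np.argmin(np.abs(n))]  # coordinate axis least aligned with n
    u = np.cross(n, helper)
    u /= np.linalg.norm(u)
    v = np.cross(n, u)

    # 2D coordinates of every point in that frame, relative to points[0].
    rel = pts - pts[0]
    xy = np.stack([rel @ u, rel @ v], axis=1)

    # Candidate edge directions: from points[0] to every other distinct point.
    lengths = np.linalg.norm(xy, axis=1)
    keep = lengths > 1e-12
    d = xy[keep] / lengths[keep, None]
    e = np.stack([-d[:, 1], d[:, 0]], axis=1)  # perpendicular directions

    # Bounding-box extents of all points along each candidate frame.
    pd, pe = xy @ d.T, xy @ e.T                # shape (N, M)
    d_min, d_max = pd.min(axis=0), pd.max(axis=0)
    e_min, e_max = pe.min(axis=0), pe.max(axis=0)
    k = np.argmin((d_max - d_min) * (e_max - e_min))  # tightest box wins

    # Midpoint of the tightest box in 2D, mapped back to 3D.
    c2 = 0.5 * (d_min[k] + d_max[k]) * d[k] + 0.5 * (e_min[k] + e_max[k]) * e[k]
    center = pts[0] + c2[0] * u + c2[1] * v
    axis_d = d[k, 0] * u + d[k, 1] * v
    axis_e = e[k, 0] * u + e[k, 1] * v

    # Order the two axes so the longer edge comes first.
    if (d_max[k] - d_min[k]) >= (e_max[k] - e_min[k]):
        return center, axis_d, axis_e
    return center, axis_e, axis_d
```

The sign of each returned axis is arbitrary in this sketch; a real port definition would presumably need a convention for that as well.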

Debugging

Another cool use case is quick debugging. A lot of the physics code, especially around the assembly of the sparse matrices, is extremely sensitive to typos such as indexing errors. In the generalized eigenvalue problem assembler, for example, I accidentally used the wrong index.

A bug like this can take me hours of searching to find on my own, so having an LLM spot it is very useful.
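
To give a feel for this class of bug, here is a purely hypothetical toy example. It is not EMerge code and not the actual bug; the names assemble_triplets and to_dense are made up for this sketch. It assembles a tiny 1D stiffness matrix from COO-style triplets, once correctly and once with a single wrong index. The buggy version runs without any error, which is exactly why this kind of mistake is so hard to spot.

```python
import numpy as np

def assemble_triplets(conn, k_local, buggy=False):
    """Build COO-style (rows, cols, vals) from identical element matrices."""
    rows, cols, vals = [], [], []
    for elem in conn:                      # elem holds the global DoF numbers
        for i in range(len(elem)):
            for j in range(len(elem)):
                rows.append(elem[i])
                # The "typo": reusing i where j was intended.
                cols.append(elem[i] if buggy else elem[j])
                vals.append(k_local[i, j])
    return np.array(rows), np.array(cols), np.array(vals)

def to_dense(rows, cols, vals, n):
    """Sum duplicate entries into a dense matrix, only for inspection."""
    K = np.zeros((n, n))
    np.add.at(K, (rows, cols), vals)       # accumulates repeated (row, col) pairs
    return K

conn = np.array([[0, 1], [1, 2], [2, 3]])        # three 1D line elements
k_local = np.array([[1.0, -1.0], [-1.0, 1.0]])   # unit-length element stiffness

K_good = to_dense(*assemble_triplets(conn, k_local), n=4)
K_bad = to_dense(*assemble_triplets(conn, k_local, buggy=True), n=4)

print(K_good)
print(K_bad)   # same shape, no exception, but the entries are wrong
```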

Module how-to's

It's also useful to ask AI how to do something in libraries like PyVista or NumPy. Often there is a specific function that does exactly what you need, but you don't know it exists. AI can be useful here.

Stuff I just don't know

The emerge-aasds module, which provides a Python binding for Apple's sparse direct solver, is completely vibe-coded. I simply don't know C-python and I can't program in C or C++, so making a module to use the sparse direct solver would take me a long time. Additionally, these sorts of interfaces can be highly abstracted and decoupled, so it's relatively safe.

So why did I ditch LLMs? Well, it wasn't because of these use cases. LLMs are mostly still useful for them, but these cases are relatively rare in the development process.

As I used them more, I started to become lazy and to use them as a crutch.

The Claude temptation

I have to admit that when I used Claude for the Accelerate interface I was highly impressed. It was really easy to make a quick interface and it saved me a tremendous amount of time. So naturally I started relying on it more and more.

When developing the thermal solver, especially when I had to explore the design space and figure out how to get the multi-physics to work, it was really easy to just give it my current assembly code and let it implement the basic Mass/Stiffness matrix assembler functions, essentially by taking the current assembly scripts from the Curl-curl solver and applying them to the thermal domain. The core assembler architecture was the same anyway; the matrix contributions were just different. This greatly accelerated the process at first, but it got worse and worse from there.

The code it wrote just sucked. It wasn't making efficient implementations at all: it computed matrix entries inside loops that could have been loop lifted, and so on. But it worked and I could test quickly. Afterwards, I ended up rewriting almost all of it myself. I'm sure that with more prompting I could have asked it to optimize the code, but that wasn't the point. I didn't use Claude to do the work for me; I used it to temporarily explore and see problems so I could fix or change things before overcommitting.
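
To illustrate what loop lifting means here, a toy sketch (this is neither the generated code nor EMerge's assembler). It assumes linear triangular elements, whose local mass matrix is the element area divided by 12 times a fixed pattern with 2 on the diagonal and 1 elsewhere. The function names are hypothetical.

```python
import numpy as np

def local_matrices_naive(areas, ref):
    """Recomputes the same scale factor inside the innermost loop."""
    out = np.zeros((len(areas), 3, 3))
    for e in range(len(areas)):
        for i in range(3):
            for j in range(3):
                scale = areas[e] / 12.0      # does not depend on i or j
                out[e, i, j] = scale * ref[i, j]
    return out

def local_matrices_hoisted(areas, ref):
    """Same result with the invariant work lifted out of the loops."""
    # One scale per element, broadcast against the fixed 3x3 pattern.
    return (areas / 12.0)[:, None, None] * ref[None, :, :]

ref = np.ones((3, 3)) + np.eye(3)    # 2 on the diagonal, 1 elsewhere
areas = np.array([0.5, 1.0, 2.0])    # made-up element areas

naive = local_matrices_naive(areas, ref)
hoisted = local_matrices_hoisted(areas, ref)
print(np.allclose(naive, hoisted))
```

Both versions give the same numbers; the second one simply avoids redoing work that does not change between iterations.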

It also made very odd design choices constantly.

I basically consider it an artist's rendition of a project that is yet to be. The point isn't for it to be the final product but to quickly get an impression of what is to come. A concept design, basically.

After finishing the thermal solver I went back to EMerge and started implementing the thin-conductor boundary condition and this is where everything went to garbage.

The realization

To implement the thin-conductor boundary condition in EM I needed DoF splitting, just like in the thermal domain. It was a good test case. Naturally I thought: let's just give Claude my code for the thermal solver and have it do the same for the EM case. But it just didn't work. I kept asking it to do exactly this but for this other formulation, and it kept doing its own thing, not listening, and injecting code that wasn't supposed to be there. After giving up and writing it myself, it was time to debug.

After finishing the BC, I wanted to test it by computing the total losses of a microstrip line. EMerge gave vastly wrong results. I asked a Discord user to simulate the same case in HFSS, and EMerge was just wrong.

Naturally, I asked Claude and Gemini and got all sorts of suggestions. I think I spent about a week testing all sorts of things and eventually completely rewrote the core assembly code to support all sorts of different second-order basis functions. A lot of it didn't work. Until at some point I started thinking:

Why am I asking AI about what is wrong? At least 9 out of 10 suggestions were wrong and just wasted my time. I had stopped thinking for myself. I thought back and realized that LLMs were right about issues maybe 5% of the time at most. They kept flagging code as erroneous that I knew was correct.

That is when I realized that this stuff wasn't actually useful at all. It just sent me on wild goose chases and wasted my time.

I cancelled my subscription and solved the problems myself, and in the end I discovered one or two more (self-inflicted) bugs in my code that the AI never found.

So am I not using them ever?

No, not really. I think the original use cases are still useful.

If I have a piece of code that I know may contain a stupid typo bug, I'll give it to Gemini (it's free). If it is a stupid indexing bug, variable shadowing or something like that, it'll usually spot it immediately. If it doesn't, I abandon LLMs completely for that problem. But for that simple first look it can be useful. It doesn't take more than a couple of seconds anyway.

Secondly, as a glorified Stack Exchange it can still at times be a useful tool to quickly figure out how to add something to PyVista when the documentation isn't exactly clear. PyVista has some weird idiosyncrasies sometimes. But beyond that, I stop using them. It's just a waste of time.

In conclusion

I don't believe these tools are as good as people claim they are, or even as good as they seem to us when we use them.

Remember, these tools are trained to become "good" at something, but it's humans telling them what they like and don't like. You can give them some objective standard tests, like solving generic problems, but I'm not sure how well that extrapolates to other areas. In the end, AI just becomes good at increasing its "goodness" score, and a lot of that score comes down to human subjectivity. So yes, they look very impressive, but in many cases they aren't really.

Once you start using them to actually do something with intention, they very often miss the boat.

A lot of online reviewers who are impressed also fall into the trap of giving them open-ended tasks. If you ask the new Claude models to make a website for something, yes, it is impressive. But you also haven't really put tight constraints on the end product, so whatever it produces looks nice. We see this constantly, also with cherry-picked end results. Sora is impressive if you see the photorealistic videos of people doing whatever. But the limitations become really apparent when you want something specific and it keeps ignoring your requests.

If you don't care about the details of the end result then sure, it can be fun to make a quick vibe-coded tool. But the moment you have any real constraints, it becomes harder and harder to get these tools to do what you want.