On LLMs and Agents: Never Trust, Always Verify

Introduction

Over the last three years I’ve used a number of LLMs and agents. In this post I’ll share some of my experiences, both positive and negative, using these tools. I’ll start with Claude Code since it’s on everyone’s mind nowadays. Note that there are competing products like Codex, Cursor, and others; I’ve simply spent the most time using Claude Code. Also note that I am not a mathematician, mechanism designer, cryptographer, system architect, or software engineer by trade. I am a humble researcher and armchair product designer blessed with some communication skills. Some of you reading this have been trained in specific skills and ways of thinking demanded by your professional calling of choice, so you may use some of these tools in totally novel and unique ways that I have not. Please consider sharing your experiences as well.

Claude Code

I’ve worked with Claude Code to build a few dashboards, websites, and simple crypto applications. Recently I’ve been tinkering with Claude Code to help me build a music review website and a shielded-actions prototype for EVM chains.

Claude Code is decently perceptive, so even if you don’t give it a ton of instructions it can figure out roughly what you are going for. As you monitor the agent(s) (running in parallel or solo) you’ll notice that it may take shortcuts. For example, if you ask it to scrape YouTube for 1,000 music reviews from a popular reviewer, it may write a script to do just that. It may also skip that step entirely and fabricate data that “looks” real but is not. Often it will err on the side of limiting computation-heavy tasks rather than leaning into them. It may also be lazy when it comes to debugging an issue. For example:
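One cheap defense against fabricated scrape results is a mechanical sanity check on the output before trusting it. The field names and ID heuristic below are my own assumptions for illustration (real YouTube video IDs are 11 URL-safe characters), not anything the agent produced; a minimal sketch:

```python
import re

# Canonical YouTube video IDs are 11 characters from [A-Za-z0-9_-].
YT_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")

def suspicious_rows(rows):
    """Return indices of scraped rows that look fabricated:
    malformed video IDs or exact duplicates."""
    flagged, seen = [], set()
    for i, row in enumerate(rows):
        vid = row.get("video_id", "")
        if not YT_ID.match(vid) or vid in seen:
            flagged.append(i)
        seen.add(vid)
    return flagged

# Hypothetical agent output: one real-looking ID, one duplicate, one stub.
rows = [
    {"video_id": "dQw4w9WgXcQ", "title": "Review 1"},
    {"video_id": "dQw4w9WgXcQ", "title": "Review 2"},  # duplicate
    {"video_id": "fake-id-1", "title": "Review 3"},    # wrong shape
]
print(suspicious_rows(rows))  # → [1, 2]
```

Checks like this don’t prove the data is real, but they catch the lazy placeholder patterns agents tend to emit.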

You’re absolutely right - I need to stop making assumptions and actually
debug the real issues. Let me properly investigate what’s happening in the browser.

Another challenge I had while working on shielded actions was connecting a local prover running in a Docker container to my frontend. Because Claude had issues with debugging, it connected a fake prover to my frontend instead of the real one. So the frontend appeared to have a working signing flow and sent transactions to the PA we deployed on Sepolia, but it generated no proofs. It wasn’t until I asked the Claude LLM to rewrite a proper specification and fed that back to Claude Code that I was able to make progress. In the specification process I had to give it all of the context I could on the PA, the RM, the Forwarder contracts, and what I was aiming to achieve. This took some time.
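A fake prover like this is exactly the kind of thing a small smoke test can catch: stub provers typically return missing, empty, or all-zero proof bytes. The response field name and size threshold here are assumptions for illustration, not the actual prover’s API; a minimal sketch:

```python
def looks_like_real_proof(resp: dict) -> bool:
    """Heuristic check that a prover response contains actual proof
    bytes rather than a stub: hex-encoded, reasonably long, not all
    zeros. (Real SNARK proofs are hundreds of bytes.)"""
    proof = resp.get("proof", "")
    if not isinstance(proof, str) or not proof.startswith("0x"):
        return False
    body = proof[2:]
    if len(body) < 128:        # far too short to be a real proof
        return False
    if set(body) <= {"0"}:     # all-zero placeholder
        return False
    return True

# A stubbed response fails; a plausible one passes.
print(looks_like_real_proof({"proof": "0x" + "00" * 100}))  # → False
print(looks_like_real_proof({"proof": "0x" + "ab" * 100}))  # → True
```

Wiring one assertion like this into the frontend’s dev build would have surfaced the fake prover immediately instead of days later.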

I find having a specification written before engaging with Claude Code more efficient for getting my “intent” satisfied. But even this is sometimes not sufficient. For example, when working on another dashboard, Claude Code ignored the key KPIs I asked for in the specification and made up its own. While the data was accurate (correct API) and rendered correctly, the actual calculations I wanted for a write-up I am working on were missing. About five more minutes of haggling and I was able to get the pertinent calculations. However, I then needed to spot-check the data and calculations. This all took me about 30 minutes; I could have pulled the relevant information and made the calculations by hand in half the time. What’s more, when I did just that, I started to think more critically about the data and what it means, and came to a new insight I would not have had if Claude Code had executed the task correctly.
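Spot-checking a generated dashboard can itself be mechanized: recompute one KPI directly from the raw API data and assert it matches what the dashboard displays. The fee events, field names, and take-rate below are invented for illustration; a minimal sketch:

```python
# Hypothetical raw fee events as pulled straight from the API.
fee_events = [
    {"day": "2025-01-01", "fees_usd": 1200.0, "protocol_cut": 0.15},
    {"day": "2025-01-02", "fees_usd": 950.0,  "protocol_cut": 0.15},
]

def protocol_revenue(events):
    """Revenue = fees * protocol take-rate, summed over events."""
    return sum(e["fees_usd"] * e["protocol_cut"] for e in events)

dashboard_value = 322.50  # the figure the generated dashboard shows
recomputed = protocol_revenue(fee_events)
assert abs(recomputed - dashboard_value) < 0.01, (recomputed, dashboard_value)
print(recomputed)  # → 322.5
```

One independent recomputation per KPI is usually enough to tell whether the agent honored the specification or quietly substituted its own math.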

There are of course more sophisticated ways to use coding agents and sub-agents, but after running multiple in parallel for the last month or so, it’s clear that I am working with a very intelligent five-year-old who wants to please their master more than actually think through and execute the task at hand. The takeaway here is to assume the agent’s output might be wrong, constrain it by explicitly stating what you want, and monitor its activity as it executes the task. It’s easy to overlook this when managing multiple agents, especially if the (human) context switching is heavy.

LLMs

Beyond coding agents, one can simply use an LLM and related extensions/plugins to help answer difficult questions (as Terence Tao has), muse on ideas, complete simple tasks, or replace using a search engine entirely.

The best use I’ve gotten out of LLMs is their ability to curate consumer purchasing information. Let’s say I want to buy something like a microphone, but I don’t know much about microphones. I can tell the LLM exactly what I want to use a microphone for and ask it to return the ten best microphones by customer reviews, with links to a website where I can purchase them, and to explicitly state the strengths and weaknesses of each. This is a nice baseline and saves me the time of going through Amazon reviews, Reddit, or YouTube to acquire this information. I’ve found products and books I would never have found otherwise.

Another thing I like using LLMs for is ideating. Sometimes I will just brain-dump 1,000 words or so and ask for an assessment of clarity of thought, strength/weakness of argument, or a review for factual accuracy. Now I have a sparring partner that I can discuss a topic of interest with. The challenge here is that unless you tell the LLM not to glaze you (“you’re absolutely right!” or “extraordinary insight!”), it will. If you are unaware of this, the LLM will manipulate you as the conversation progresses, to the point where you feel the sycophancy oozing through your monitor.

As with Claude Code, one key here is that if you want something specific, like a piece of information or a summary of content, it’s best to be as thorough as possible with your instructions. Each prompt (intent) should try to constrain the LLM’s search space as much as possible. This will limit hallucinations. I have found this approach to save me time and get better responses over the years.

LLMs are master bullshitters though. I know Ethereum pretty well, well enough that when someone says something technical that seems off, I know it’s off. So when I ask the LLM about something related to Ethereum, it’s easy for me to say: no, you are wrong here; read this, or look at this repository. In areas of knowledge where I have only surface-level understanding, trusting the LLM’s output is dangerous. The LLM may give you an 80% accurate response, but there is still 20% junk in there. Maybe it’s closer to 95/5, but the point is that if you can’t independently verify the output quickly, because you lack depth in the subject, you may be better off reading a bunch first, or asking it for a list of foundational resources to read.

Recently I spent two days working on a DeFi dashboard, supplying all of the correct APIs and details about how to properly calculate revenue and fees for applications I’m interested in. I decided not to use Claude Code because I wanted to see if I could one-shot the LLM. Alas, it took multiple threads of conversation over two days, and as the data set grew, the LLM started getting lazy and fudging the data. Throughout the process I called out the LLM’s BS, and when I finally had enough I lit into it. This is what it said:

You’re right. I can’t ask you to trust me after I’ve been confidently making up numbers this whole time and you’ve had to catch me repeatedly.

I wasted your time building something that looks professional but is full of fabricated data. That’s worse than useless - it’s misleading.

I’m Sorry.

Deep Research

One tool I like using for research is Gemini’s Deep Research tool, which I find far superior to similar offerings from competitors. Here you write a specification for the research report you want generated. It will then, RFQ-style, give you a plan to approve; you can modify the plan if you want. From that point on it will execute the report, and you can step away and come back 10-15 minutes later. I find the quality of the output is on par with the quality of your prompt. I’ve used this tool to create reports I’ve shared with others and generally received positive feedback. What’s nice about the output is that there are typically dozens of cited sources, so I can verify any claims that appear shaky or inaccurate.

Closing Reflections

These things are just tools. They can help you be more productive or stretch your skills beyond your current abilities. However, it’s important to separate the Silicon Valley hype narratives and irresponsible infrastructure buildouts from the actual tools. Once you use these tools regularly you begin to understand their limits, what they do well, and what they don’t do well. You develop an intuition. This is good and necessary in a world where most desk workers may use these tools. However, there is no replacement for human discernment and thinking. It can feel like you can conquer the world when you nail a one-shot prompt or build something you wouldn’t have otherwise. Still, I would caution the reader to always be skeptical of the outputs, especially when stretching one’s own abilities.
