People Were Never Told How AI Works, And That's a Problem

By the end of this article, you will realize there was still something about using AI that was never clearly explained to you.

Most people do not use AI badly because they are careless. They use it badly because the interface lies a little. Not intentionally. Not maliciously (maybe a bit, since companies profit from it). But it does.

Most people were never taught the mechanics.

A chat assistant looks like a text box. You type something. It answers. You type again. It answers again. The experience feels simple, almost casual, like messaging a very knowledgeable person.

But that is not what is really happening.

Underneath the friendly interface, a chat assistant is not just reading your latest message. It is operating inside a stateful, context-heavy, token-metered system. Every answer depends on what the model receives as input, what it is allowed to remember, what it can fit into context, what tools are active, what instructions are competing for priority, and how much output it has to produce.

That is the part most users were never taught.

And once you understand it, many “AI failures” start looking less like failures of intelligence and more like failures of setup.

A chat assistant is not a search box

The first mistake is treating AI like search.

Search is mostly stateless. You type a query. The engine returns links. If you type a new query, the old one usually does not matter much.

Google Search is the most knowledgeable search engine and now with integrated Gemini in it’s AI Mode, it may pass this idea. But Chat Assistants in general are not like Gemini on Google Search.

A chat has state. It carries prior messages, instructions, corrections, files, preferences, tool results, and sometimes memory. The assistant is not only answering what you just typed. It is answering what you typed inside the conversation that already exists. All of it.

That can be powerful. And it can also become a problem.

A short, focused chat can act like a well-briefed collaborator. A long, messy chat can become a room full of old instructions, half-corrections, abandoned plans, and irrelevant details. Cluttered with confusion.

Then the user says: “Why is the model getting worse?”

Often, it is not getting worse. It is getting overloaded.

Input and output are the meter

Every interaction has input and output.

Input is what the system sends to the model so it can answer. The visible input is your message. But the real input may be much larger.

It often includes the entire current conversation, system instructions, developer instructions, uploaded files, selected memory, tool definitions, retrieved documents, search results, schemas, and other context needed for the assistant to work. Chat Assistants are hungry: they are fed a lot of things.

Output is what the model produces back along with all it’s reasoning-chain of thought (hidden or not).

This matters because AI systems work with tokens.

A token is a small unit of text. Words, word parts, punctuation, and formatting can all become tokens. You do not need to count them manually to use AI well, but you do need to understand the basic idea: the more context you carry, the more the model has to process.

Breaking for some fun facts and to really explain tokens.

One token is not one word (in English - 1,000 tokens ≈ 750 words). Common words are often one token (cat, house, people, woman, and more). Unusual, long, technical, or misspelled words may split (beautiful is common, “beautifull” may cost more). Spaces matter as many tokenizers treat “_ apple” and “apple” differently because the leading space is part of the token pattern. Punctuation also counts. Other languages are usually less token-efficient than English, and that includes programming languages as code is really token-expensive (100 lines of code can cost far more tokens than 100 lines of plain prose). Tables are token-heavy (Markdown tables consume many tokens because pipes, separators, repeated headers, and spacing all count) and don’t even get me started on JSON (and all that structural repetition). Short words are not always cheaper (because it’s about commonness vs rarity). Numbers tokenize oddly as long numbers are usually split.

Talk about a plate full of misconceptions!

In a normal prompt, repeating the same instructions every call costs tokens every time (will get around to prompt caching later on).

**“Context window” is not memory and long chats become expensive as old messages stay in context. \ A 200k-token context window means the model can consider up to that much text in one request. It does not mean the model permanently remembers it.

In chat assistants, prior messages may be re-sent as context. A long conversation can make each new message more expensive.

So please, Please, PLEASE: Think twice before congratulating the model at the end! (a small thank you message will carry the input of all your conversation when you would be done with the model — it’s just mindless spending.)

This is one of the least explained parts of modern AI products: The user sees one text box. The model sees a whole package.

Long chats are useful until they become polluted

A long chat is not automatically bad.

If you are working on one coherent project, keeping the same conversation can be useful. The assistant can preserve terminology, constraints, decisions, and previous outputs. That continuity can improve quality.

But long chats become dangerous when they mix different goals.

A conversation that starts with “help me write an article”, moves into “debug this Python script”, then “summarize this PDF,” then “act as a lawyer”, then “use a different tone”, then “forget the previous instruction”.

And suddenly is no longer a clean workspace. It is polluted context.

The assistant may still answer, but now it has to infer what matters. It must decide whether an old instruction still applies, whether a previous correction replaced the original request, whether the current task belongs to the same project, and whether a detail from 50 messages ago is still relevant. That is the formula for ambiguity.

Practical rules:

Keep a chat long when the task is continuous.
Start a new chat when the task changes.
Summarize chat history and restart a new one when the chat has useful history but too much noise.
Clear the chat when the old context is now more harmful than helpful.

Reasoning is work. Interrupting it has a cost.

Many modern assistants have reasoning modes. They spend more time analyzing the task before answering. Depending on the product and model, that reasoning may consume compute, time, and token budget.

All of this is charged within your token output allowance even if you never saw it. And users often interrupt this process.

Sometimes that is reasonable. If the assistant misunderstood the task, stopping it early can save time. But many users interrupt because they have planned poorly for the task or are impatient.

Reasoning is part of the work. If you stop it, you may lose the benefit of the deeper analysis while still having spent part of the cost. In most cases, you also force the next message to repair the interrupted state.

A better habit is to give the assistant a clear target before it starts: planning and being precise is key.

Some practical examples:

“Analyze the trade-offs, then give me a short response with the 3 main recommendations.”
“Do not produce a long explanation unless there is a real uncertainty.”
“Check the code on lines “xxx” to “yyy” to address error “zzz”. Expected behavior was “abc”.”

Planning ahead is very important and sometimes letting it finish reasoning can be better than interrupting halfway through.

Better instructions reduce output

Now that you are picking up the pace, lets dive into instructions.

Many people think better prompting means writing longer prompts. Not always (perhaps not even most times).

Better prompting means reducing uncertainty.

A vague request forces the assistant to guess the goal, tone, audience, format, depth, and constraints. That usually produces longer, less useful output.

Compare this: “Explain this.”

***With this: ***“Explain this in 10 lines for a non-technical reader. Use plain language. Separate cause, consequence, and action. Do not repeat the question, no filler, no next steps, no emojis.”

The second instruction is longer, but it reduces the answer.

The same applies to writing: “Make this better.”

Is weaker than: “Rewrite for clarity. Preserve the meaning. Reduce by 30%. Keep the tone calm and professional. Do not add new facts.”

Good instructions are not decoration. They are usually about compression. And compression saves on tokens.

Do not fight the hidden reasoning

Some users try to control the assistant by demanding its full internal reasoning. This is usually the wrong target.

What you actually need is not the private mental scratchpad. You need a useful, verifiable answer.

Instead of asking for hidden reasoning, ask for visible reasoning artifacts:

“State your assumptions.” “List the steps briefly.” “Show the calculation.” “Explain the decision criteria.” “Give me a verification checklist.” “Point out what would change your conclusion.”

This works better because it turns reasoning into something useful for the task, instead of trying to expose internal processing that the system may not provide and that may not even be useful in raw form.

The goal is not to see every thought. The goal is to get a better answer.

Also, take note: models have high guards against tampering with their internal Chain of Thought. It is just not doable. You may waste as much tokens as you like trying to ask the model to think in caveman style (lemmatize, no stop-words, ultra compact token saving style), but you can only reach for the final output. At worst you will be forcing the model to work harder on understanding what you mean and wasting significantly more tokens (since reasoning tokens are output ones that cost a lot more).

Aim to reduce unnecessary reasoning and control the final output. That’s the smart goal. That’s where your wallet bleeds.

Tools make assistants more powerful, but not lighter

You might be ready for some advanced topics. Lets touch a bit on Code Assistants for IDE Softwares.

Much has been praised about MCP and all the amazing SKILLs that you can add. Many have made marvelous work of Markdown files to instruct their AGENTS, prevent AI Slop and try to manage memory context. Some people maintain .md files on every folder.

Modern assistants can use browsers, files, calendars, databases, code interpreters, APIs, IDE integrations, and MCP servers. This is useful. It turns the assistant from a text generator into an agent that can inspect and act on external systems.

But what most users rarely understand: skills and tools are not weightless.

In API and developer workflows, this may consume a lot of context. Tool definitions, schemas, retrieved data, and connector instructions can all become part of the working environment. Model Context Protocol (MCP) is a very good example.

It is popular because it makes tools easier to connect. But every active tool adds information that the model has to consider on most calls, just by being active. If you connect many tools casually, you may be increasing the amount of context the model has to carry before it even starts solving the actual problem.

The lesson is not “avoid tools”. The lesson is: tools should be intentional.

Use the tools needed for the task. Do not clutter your Markdown instruction files. Disable or avoid unnecessary tools. Do not turn the assistant into a crowded control room and then expect it to behave like a clean notebook.

Prompt cache is boring until you see the bill

And finally, for me, the Pièce de Résistance: prompt caching.

Prompt caching is one of the most important concepts for people using AI through APIs, IDEs, or automated workflows. Where you pay for use and ouch, that use isn’t cheap.

The idea is simple. If you send the same large instruction or context repeatedly, some systems can reuse that stable portion instead of processing it from scratch every time. This can reduce cost and latency.

For example, imagine you are processing 10,000 records. Each record uses the same long instruction, and only some minor specific input changes.

Without caching, you may be paying again and again for the same instruction. With caching, the repeated part will become very cheap after the first call. In today terms, it’s easy to find models that can reduce input token cost by 70% to almost 90%.

Prompt cache usually works best when the reusable part is stable.

That means:

Separate general instructions from per-record input and put those stable instructions in a fixed block.
Do not randomly edit the cached prompt.
Monitor cache creation and cache read tokens when the provider exposes them.

Most models will allow a max 5 minute interval between calls to sustain cache recovering and at higher prices it may reach 1 hour.

I know this is not something most casual users need every day. But for anyone using AI in scripts, terminals, IDEs, or production pipelines, it can matter a lot in the long run while tackling repetitive work.

Sometimes the difference between “AI is too expensive” and “AI is viable” is not the model. It is prompt structure.

As I take my leave and hope to have highlighted a new facet of AI for you: a practical checklist for better chats

Before asking:

Define the task.
Say what source of truth matters and what to ignore.
Say the desired output format.
Say whether the assistant should be concise or exploratory.

2. During the chat:

Do not mix unrelated tasks.
Correct the assistant explicitly.
Summarize long work before continuing.
Do not interrupt reasoning unless the direction is clearly wrong.

2.A) After the chat gets long (and it will):

Ask for a compact project summary.
Start a new chat with that summary.
Attach only the files that still matter.
Remove obsolete instructions.

3. For API, IDE, and terminal use:

Avoid loading unnecessary tools and be careful with MCP servers.
Keep it light on those Markdown files.
Keep stable system prompts separate and use prompt caching when available.
Log token usage.
Measure cost before scaling.

The inferface is simple. The system is not.

The uncomfortable part is that none of this is obvious from the interface.

The products are designed to feel effortless. That is part of their appeal. But the same simplicity also hides the mechanics that determine quality, latency, and cost.

The industry is very good at showing the wonder. It is less good at explaining the meter.

A chat assistant is not just a box where intelligence happens. It is a working context. The assistant does not magically solves tasks or degrade.

You simply guide it towards efficient or not so efficient paths. Once you understand that, the tool changes.