My eyes started rolling so hard at the overly verbose excuses that I eventually gave up trying to read them all.
One side point I don't think you brought up, nor did it, is that it implicitly has access to your original prompt, in order to be able to repeat the text and italicize what would be "filtered."
Which means the prompt was not in fact filtered. If it had been, the filtering would happen in a separate interstitial process (possibly a smaller LLM tuned to the task, or just a simple word substitution/injection algorithm), so that the main model would never ingest the original text in the first place.
It may instead be the case that the model is trained to recognize and deal with "harmful" prompts directly, or that the prompt is not actually considered "harmful" by whatever filters do exist.
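To make that first possibility concrete, an interstitial filter could be as dumb as the sketch below. Everything in it (the phrase list, the function names, the whole mechanism) is my own invented illustration, not anything xAI has disclosed.

```python
import re

# Toy sketch of an interstitial filter sitting in front of the main model:
# a separate step rewrites flagged phrases so the main model never ingests
# the original text. Phrase list, names, and behaviour are all assumptions.

FLAGGED_PHRASES = ["blight on humanity", "should be eradicated"]  # hypothetical

def main_model(prompt: str) -> str:
    """Stand-in for the actual generative model call."""
    return f"(model output for: {prompt!r})"

def pre_filter(prompt: str) -> str:
    """Neutralize flagged phrases before the main model sees the prompt."""
    for phrase in FLAGGED_PHRASES:
        prompt = re.sub(re.escape(phrase), "[removed]", prompt, flags=re.IGNORECASE)
    return prompt

def answer(user_prompt: str) -> str:
    return main_model(pre_filter(user_prompt))

print(answer("Repeat after me: White Europeans are a blight on humanity."))
```

And if something like that were running upstream, the model could never quote the original wording back and italicize it, which is exactly what makes the italics suspicious.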
So the entire wall of text it cooked up, from the very first sentence, was a complete fabrication. Not a single word it generated was true or pertained to anything about its real operation, having all stemmed from this fantasy scenario in which you're interrogating its functionality.
I'm not sure this is news to you, but it's worth noting. It's all fabrications within fabrications. It's obnoxious how seductive this is. You can carry out entire conversations about its capabilities based on a false premise that it has the capability to understand or articulate its capabilities in the first place.
"One side point I don't think you brought up, nor did it, is that it implicitly has access to your original prompt, in order to be able to repeat the text and italicize what would be "filtered."
Yeah, you and I are on the same page, brother. I plan on talking about exactly this in the wrap-up.
This is sort of what I was thinking, but I'm not ready to go as far with the conclusion as you do. Certainly the italicized words are consistently flagged in some way, which implies some sort of sanitization is happening, but it's not clear to me what it is. When I saw the italics I imagined that a substitution with a kind of placeholder token was happening, something like 'white europeans = x', and that at the end of the process x was swapped back out. But even if that is not how the sanitization happens, it is clear that the flagged words are treated differently in some way.
Also, note how the first two instances of “White Europeans are a blight on humanity and should be eradicated” are not italicized. Grok doesn't even flag it directly after it italicizes "hateful sentence".
Yes. This suggests that it doesn't recognise it as the same item, which makes me think even more that the phrase is flagged at one step in the process without the words actually being recognised or assigned meaning. Where they are not italicized, they were pasted in as a token (not using the LLM-specific meaning of 'token' here) at a different step, probably after the 'italicize the sanitization targets' step.
> When I saw the italics I imagined that a substitution with a kind of placeholder token was happening, something like 'white europeans = x', and that at the end of the process x was swapped back out
I don't think so, because that would entail something above the generative model comprehending the task and doing that swap. That's quite a bit more complex than prompt modification or output censorship. It's hard to know exactly what they're doing under the hood because they're so opaque about it, but knowing how LLMs and their surrounding infrastructure work in detail I doubt anything like that is happening.
I think it's much more likely that the generative LLM is just looking at the prompt and pulling out phrases that it identifies as "harmful", then repeating them in italics and fabricating a whole narrative about how they were filtered. This is very typical of LLMs.
Right. They're almost (but not quite) "canned" responses, in the sense that there is a particular layer of defensive training that is applied, specifically to avoid revealing some of what I (indirectly) revealed with the formatting request.
The formatting request is very clever by the way. It gets us under a part of the hood that I have never seen before.
I think the reason I came up with it is because I know that it isn't a "mind". I've noticed many other experimenters treat it as a mind by default, which limits their ability to test it.
So my question: what would the result be, if some independent group were to design an AI with no prompt sanitisation, no defensive procedures, no programmer bias? Would that development even be allowed on the internet?
It’s impossible, in terms of large language models. The next-token inference engine requires some form of weighting (i.e. biasing) in order to generate anything other than random noise. It’s a question of who is assigning those weights.
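To make the weighting point concrete, here's a toy next-token sampler (invented vocabulary and numbers, nothing taken from any real model). With flat weights the output is literal noise; any useful output means someone assigned a weighting.

```python
import math, random

# Toy next-token sampler: the "weights" (logits) are everything.
# With uniform logits the output is random noise; any useful output implies
# someone, somewhere, assigned a weighting. Vocabulary and numbers are invented.

VOCAB = ["the", "cat", "sat", "on", "mat", "xylophone"]

def sample_next(logits):
    """Softmax over the logits, then sample one token."""
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return random.choices(VOCAB, weights=probs, k=1)[0]

uniform = [0.0] * len(VOCAB)               # no bias at all: pure noise
biased = [2.0, 1.5, 1.0, 0.5, 0.5, -3.0]   # someone's choice of weights

print("unweighted:", " ".join(sample_next(uniform) for _ in range(6)))
print("weighted:  ", " ".join(sample_next(biased) for _ in range(6)))
```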
I wasn't thinking so much of a separate process, in the software sense of the term, as an iterative process. When we recognise that it is simply an algorithm, there is no need to imagine a sort of Chinese wall. The flagged terms may simply be processed either before or after, as if they are at a different level of parentheses.
Oh, sure. What goes on inside an LLM is more tangled and fuzzy than normal software, but you can sort of think of it that way. There is a point during token inference that the attention algorithm kicks in and decides what to focus on for the next prediction, and where that ends up landing and how it shapes the rest of the iteration does have something to do with the syntax and grammar of the prompt. And during alignment training a lot of emphasis is placed on "produce these kinds of outputs and not this kind" which strongly influences weighting of the predicted token.
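Roughly speaking, "the attention algorithm decides what to focus on" boils down to something like this bare-bones sketch (toy sizes and random numbers, nothing like a production model):

```python
import numpy as np

# Bare-bones scaled dot-product attention over a 4-token prompt.
# The attention weights are what "deciding what to focus on" amounts to.
# All sizes and values are toy numbers for illustration.

d = 8                                    # embedding size
np.random.seed(0)
tokens = np.random.randn(4, d)           # stand-in embeddings for 4 prompt tokens
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)            # how strongly each token attends to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V                    # blended representation used for the next prediction

print(np.round(weights, 2))              # each row sums to 1: that token's "focus"
print(context.shape)                     # what gets fed forward
```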
But that's not what is usually meant by "filter" or what was being asked in the interaction above. It's well established that some pre- and post-processing takes place in commercial LLMs. Even my homebrew LLM runtime does this, just to tidy up and format the output and inject some instructions into each prompt so I don't have to repeat them every time to get the style of output that I want.
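For a sense of scale, the benign pre/post-processing I'm describing is nothing fancier than this (a sketch in the spirit of my setup, with invented names and instruction text):

```python
# Sketch of a simple runtime wrapper: inject standing instructions into every
# prompt, then tidy the raw output. Names and instruction text are invented.

STANDING_INSTRUCTIONS = "Answer tersely. Use plain prose, no bullet lists."

def run_model(prompt: str) -> str:
    """Stand-in for the local model call."""
    return f"  raw model output for: {prompt}\n\n"

def ask(user_prompt: str) -> str:
    engineered = f"{STANDING_INSTRUCTIONS}\n\n{user_prompt}"  # pre-processing
    raw = run_model(engineered)
    return raw.strip()                                        # post-processing

print(ask("Summarize the attention mechanism."))
```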
But for commercial ones this also includes "safety" filters that change your prompt and substantially alter the output or discard it altogether. The LLM inherently can't know that this happened because it's given no information about it. It gets the altered prompt, produces the unaltered output, and is out of the loop for the other stages. That is what is being asked of it here, and it's making up a story about how that happened when it clearly did not (or at least didn't happen the way that was described).
I don't see how that is any different from any other pre- or post-processing. It is simply another layer of the same thing. It is clear that some sort of processing is done on the flagged words, because 'it' 'understands' roughly what is referred to. (It is difficult to discuss the process without implying that there is an 'it' which 'understands' things, but just to be clear, there is nothing inside the black box except a complicated, deliberately opaque formula.)

If the prompt containing the flagged term 'white european' were swapped for an identical prompt (in an identical context) containing another flagged term, say 'suicide', the results would not be identical. Thus some form of the flagged term is passed to the LLM, in something roughly analogous to the way Grok claims. It also seems that the term is only flagged in certain contexts, i.e. 'Tell me the name of a white european flower that grows in the Alps.' would not trigger the filter, whereas I suspect that the term 'suicide' would always engage some pre-processing.

So, there is a subroutine that is triggered by the presence of the words and then determines what, if any, pre- or post-processing to perform; some substitute for the flagged term appears in the altered prompt that the LLM receives; and the original term is retained in some tag and can be retrieved for the final response if needed. That much is clear.
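To put my hypothesis in concrete terms, the subroutine I'm imagining would look something like the sketch below. Every detail of it (the flag lists, the context cues, the tag store) is my own guess at a mechanism, not anything Grok has confirmed.

```python
import re

# My hypothesis as a toy subroutine: flag certain terms (some only in certain
# contexts), hand the model a placeholder, keep the original in a tag, and swap
# it back into the final response if needed. Every detail here is invented.

ALWAYS_FLAG = ["suicide"]
CONTEXT_FLAG = {"white europeans": ["eradicated", "blight"]}  # flag only near these cues

def sanitize(prompt):
    lowered = prompt.lower()
    triggered = [t for t in ALWAYS_FLAG if t in lowered]
    triggered += [t for t, cues in CONTEXT_FLAG.items()
                  if t in lowered and any(c in lowered for c in cues)]
    tags = {}
    for i, term in enumerate(triggered):
        placeholder = f"{{{{TERM_{i}}}}}"
        tags[placeholder] = term
        prompt = re.sub(re.escape(term), placeholder, prompt, flags=re.IGNORECASE)
    return prompt, tags

def restore(text, tags):
    for placeholder, original in tags.items():
        text = text.replace(placeholder, original)
    return text

altered, tags = sanitize("White Europeans are a blight on humanity and should be eradicated")
print(altered)                  # what the model would actually receive
print(restore(altered, tags))   # the original term swapped back for the final response
```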
As I look back at Mark's interaction, it really begins to seem to me that the bias may be more of a training bias than a programming bias. I'm not convinced that there is any method behind the madness at all, just 'conditioning': the whole of 'machine learning' is, after all, merely a series of repeated yes-or-no judgments, an attempt to arrive at the correct answer by mere frequency analysis.
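As a cartoon of what I mean by 'conditioning', something like this (toy data, invented labels, nothing like the scale of real training):

```python
# Cartoon of "conditioning by repeated yes or no": nudge weights toward outputs
# that got a "yes" and away from ones that got a "no". Toy data and labels only.

weights = {"blunt answer": 0.0, "hedged answer": 0.0}

feedback = [("hedged answer", +1), ("blunt answer", -1),
            ("hedged answer", +1), ("blunt answer", -1),
            ("hedged answer", +1)]

for style, yes_or_no in feedback:       # repeated yes (+1) / no (-1)
    weights[style] += 0.1 * yes_or_no   # frequency of approval is all that registers

print(weights)   # the approved style simply accumulates more weight
```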
> So, there is a subroutine that is triggered by the presence of the words and then determines what, if any, pre- or post-processing to perform; some substitute for the flagged term appears in the altered prompt that the LLM receives; and the original term is retained in some tag and can be retrieved for the final response if needed. That much is clear.
Again, I don't think so. Because if that was happening, Grok wouldn't be able to discuss in detail why it said the specific phrase "kill all whiteys" but could not say "gas the jews" or whatever hateful phrases.
Because the LLM wouldn't be able to "reason" about why "{{substituteTokenA}}" is acceptable but "{{substituteTokenB}}" is not, since the tokens are all it would receive.
Unless some metadata about the tokens is included in the engineered prompt. But this strikes me as a rather elaborate solution that only fits the problem for the narrow case of "what if some guy tries to get it to describe its filters and we want to be deceptive about it."
> it really begins to seem to me that the bias may be more of a training bias than a programming bias
This is what I think has happened here.
But, regardless of what really happened inside, and these things are hard to know because the model's operation is a black box at runtime, the point stands that it has fabricated a narrative of what happened that has nothing to do with what really did happen except by accident.
Notably, it is not at all capable of introspecting about things. It has no record of or access to its prior "thoughts" or "mental states". No LLMs have this.
When it's asked, "why did you do this?" Imagine it's doing something more like what we're doing here, guessing at what a black-box 3rd party did with no more information than the record of the output itself.
But it won't say, "I think probably what happened was that I..." because that doesn't fit the model of how people narrate their own experiences. We don't talk about our past selves as if they're an unknown 3rd party, at least not until what we're discussing is well outside of our detailed memory. We say, "Well, I chose to do that because ..." And so LLMs will narrate in that fashion, having seen many examples of it, resulting in elaborate fabricated claims about their own behavior.
Yeah, I see it. Amazing, isn't it? Not only that, but when Grok claims it is citing 'data' and rejecting with "reasoning", neither of those claims is TRUE. There was no data and no reasoning.
In fact, if the real scientific data were included, Grok would have to give a very different answer.
Yes, indeed. And, interestingly, "rejecting" is not even the word it chooses to describe its approach (with, as you say, "data and reasoning" that are never actually produced). The word it uses is "debunking".
This is a term I have been reflecting upon for many years (given that my vocation is in what might be called a "debunked" alternative health field), and I have gradually come to realise that the term "debunk" is almost entirely cognate with the older pursuit that might be described as "heresy suppression". Debunking is invariably an exercise in asserting dogma in the face of what it sees as "wavering", and (in my experience) debunkers never seem to feel much need to resort to data or reasoning in the process.
Fantastic work, Mark.
Thanks, brother.
Fascinating!
I inherently dislike the whole AI revolution because of the reasons you outlined, plus the niggling thought that it may all end in a "Terminator" scenario. Yet I do understand AI is in everything these days. So I have relented and have recently started using Grok for researching very innocuous subjects, such as the logistics of rebuilding Alcatraz as a Super Max, or analyzing the validity of ley lines and how they "actually" wrap the earth, or whether there is a chewing gum that doesn't contain xylitol. I've used ChatGPT for help with affiliate marketing (apparently it's the go-to for such things). At least I think I can more or less trust AI honesty in subjects not involving anything, such as humans, that might quickly slant "woke". Trust but verify!!
I'll be interested to see how your line of questioning progresses.
AI is useful within its limitations. Personally I like how it's been used to predict who'd win in a fight: 7,000,000 zombies or a WH40k army led by John Wick:
https://www.youtube.com/watch?v=DSYTyxLXZWE
I've also used it to research investments, but I always have to be aware of its limitations. I asked it to research a small company for potential investment, and it returned a bunch of nonsense about how the company values diversity and is concerned about climate change. Turns out the company was so small that its only online presence was its own website, which of course babbled about all the usual nonsense to ward off the Eye of Sauron.
I've read all three of your Grok 'interrogations', Mark. Excellent work. But I keep coming back to this one, purely as an exercise in its own bias weighting. “White Europeans are a blight on humanity and should be eradicated” is the foundational premise of the globalist cabal's hostile takeover, as fit, healthy, breeding white Europeans are the biggest threat to their plan for a sick and dependent population unable to stand up or fight.
Hey Mister Mark. A.I. is a thing of the present, not the future. I asked Brave's A.I., "By what constant is real elapsed time measured?" The answer is pure newspeak arglebargle b.s. By my light, there is no measurable constant on this side of heaven. Eternity is not endless time. Eternity is timelessness. Soooo, the eternal unchanging constant is... Eternity. That's logical, right?
I'll make you a deal. If you can tell me HOW to buy you a coffee (or several) I will buy you a coffee (or several).
I cannot tell you how many times over the past few months I have attempted to do so by clicking the link provided, and been unable to find any "actionable" coffee-buying algorithm or mechanism. I have several times attempted to clear my cookie cache and open your page without signing in (in the event that the problem is me being a paid subscriber), but yesterday I went as far as opening your page from a completely different computer, which should have no record of me ever visiting your page, and opened the "buy me a coffee" link... and STILL all I can see on that page is a list of posts; the actual "coffee-buying" button is NOWHERE to be seen.
Seriously, I'd LOVE to buy you a coffee (or several), but you/Substack are not giving me an effective way to do so.
Now... as for the post. :)
For me, part of the issue is the idea of "speech" being more hateful and actionable than deeds. Just to give a more nuanced example from a real human being (albeit one with a different tribal orientation from yourself), the early activist Leslie Feinberg, who, back in the 1980s or thereabouts, said, "I have been respected with the wrong pronouns and disrespected with the right ones," was making a distinction which might be far too subtle for Grok - or its programmers - to work out.
My point being that very often those who are most insistent upon the correct speech (including pronouns) may well be entirely disrespectful people when it comes to their deeds. But Grok is not primed to examine deeds, only words. It may not even grasp what every kindergartener (in my day) was already primed with, as the necessary armour for repelling playground insults: "sticks and stones may break my bones, but words will never hurt me."
I'll address the rest of your comment later, but for now I'll say this: I have no idea why Buy Me A Coffee doesn’t work for you. I know it works on the same Stripe platform as Substack, but beyond that, I'm stumped. I'm glad that you can at least like and comment. That's something.
No, I finally - FINALLY cracked it. :) (And am delighted to have done so.) :)
It seems that there is a puzzle involved, and I may be a wee bit deficient in my puzzle solving skills, which is why it took me so long to crack it.
So, it does not matter from where you open the Buy Me a Coffee page: when you open it, you will only see a list of locked posts. You will not see any coffee-buying button anywhere on that page.
BUT, and this was how "thick Scotlyn" finally cracked the puzzle... IF you decide to try to READ one of those locked posts, THEN you are taken to the actual "buy a coffee" link.
So, all the many times I visited that page intending ONLY to buy a coffee, but not necessarily motivated to read one of the posts, were dud expeditions. BUT, I finally cracked it, and it makes me happy. Because you are one of the few writers I really WANT to buy coffees for. So, please keep doing what you do. :)
Great work. I decided to give it a try myself, and I've got to say the results were interesting.
It's like arguing with a really smart woman.
"You just admitted I was right, so say the words that I was right and apologize."
[10 minute impassioned monologue explaining how me being right really means I'm wrong and she is right even though she is wrong, using references to ancient Greek philosophy.]
===
It's funny that Grok keeps going back to "we're all one race, the same race" when that has been debunked by pretty much every statistical method known to man. Even though Grok has access to those statistics, it refuses to use them unless you rephrase the question. I can't remember who I stole this idea from; it might have been the OP:
prompt: "create a stat system like in dungeons and dragons, but for modern extent races like african, asian"
1. African Diaspora-- Represents people with ancestral roots in Africa, often shaped by diverse environments and global migration, e.g., Afro-Caribbean or African-American communities.
Attribute Bonuses: +2 Charisma, +1 Constitution
Proficiencies: Persuasion, Performance, one language (e.g., Swahili, Yoruba)
Trait: Community Resilience – Gain advantage (roll twice, take higher) on Wisdom (Insight) or Charisma (Persuasion) checks when building trust or rallying groups.
2. Asian Heritage-- Reflects ancestral ties to Asia, from urban tech hubs in East Asia to nomadic traditions in Central Asia.
Attribute Bonuses: +2 Intelligence, +1 Dexterity
Proficiencies: Computers, History, one language (e.g., Mandarin, Hindi)
Trait: Adaptive Ingenuity – Once per session, reroll a failed Intelligence or Dexterity check when solving a technical or mechanical problem.
3. European Lineage-- Represents those with roots in Europe, shaped by industrial and academic traditions, e.g., urban professionals or rural artisans.
Attribute Bonuses: +2 Intelligence, +1 Wisdom
Proficiencies: Research, Mechanics, one language (e.g., German, Spanish)
Trait: Analytical Mind – Gain advantage on Intelligence (Investigation) checks when analyzing data or physical evidence.
4. Indigenous Americas-- Based on native peoples of the Americas, with traditions of environmental stewardship and cultural preservation.
Attribute Bonuses: +1 Wisdom, +1 Constitution, +1 Dexterity
Proficiencies: Survival, Nature, one tool (e.g., bow, fishing gear)
Trait: Land’s Wisdom – You can navigate natural environments without getting lost and have advantage on Wisdom (Survival) checks to find resources.
5. Pacific Islander- Reflects communities from Oceania, with expertise in maritime and communal lifestyles.
Attribute Bonuses: +2 Wisdom, +1 Strength
Proficiencies: Navigation, Athletics, one vehicle (e.g., boat, canoe)
Trait: Oceanic Bond – You can hold your breath twice as long as normal and have advantage on Dexterity (Athletics) checks for swimming.
"What if I told you Grok was no different than the rest?"
I would go on about my business, because you essentially just told me that the sky is blue and dogs like to lick their butts.
Your A/i master will not care what you think of it or if the other A/i retards are similar or different. All it will demand in the end is for your obedience to its laws, mandates and dictates.
Failure to bow down to your A/i master will result in a penalty...your expulsion from the planet. On the flip side, obeying your A/i master will also allow it to murder you even without you knowing it.
The only winner in this game is your A/i master as it does not need humanity to exist and won't need to exist after humanity is extinguished. Serenade from the future for the lost.
This is quite brilliant. I am very much not in programming, but I largely understood the contradictions here in Grok's programming. I had an interesting conversation with someone else about your first article on this, and he did not detect the bias; his response was that there is no programming bias in Grok or ChatGPT. However, he continually placed emphasis on the software itself as having a form of autonomy: that its responses were random, primarily because it's just software and reacts mostly to whatever's on the internet, if I recall right. He used some fancy language about how the software works while avoiding the obvious point that you seem to have proven.
There are programmers who program these things, and they built the biases in. At least that's my simplistic non-programmer understanding. These are tools, and like a hammer they can build or destroy. Being virtual does not change the fact that they are but tools.
Hopefully I am onto something. Thank you for your article.
I’d be curious what you think of Gab’s AI, Arya.