How ‘Based’ Is Grok 3?

Transcript of "Hard Fork" Episode

Kevin Roose: I’m having quite a morning. I was on the train today, got off, went up the escalator at Embarcadero, and someone bumped into me and knocked my phone out of my hand and onto the platform below. I thought, "Well, I’m midway up the escalator. Someone will have snatched it or kicked it onto the tracks. My phone is gone."

Casey Newton: So how far did it drop?

Kevin Roose: Probably 15 feet. A significant drop. I thought to myself, if I get to the end of this escalator and come back down, it’s going to be too late. Someone will have snatched it or accidentally kicked it onto the tracks. My phone is gone.

Casey Newton: You don’t have much faith in the citizens of San Francisco.

Kevin Roose: Have you visited San Francisco?

Casey Newton: Yes, I think a phone can generally survive 30 seconds on the ground, but I guess we’ll find out what happens.

Kevin Roose: Anyway, I had severe separation anxiety in the split second before I decided to do what I did, which was to try to run down the crowded up escalator. So I became that guy who was pushing through the commuters, saying, "I’m sorry, I’m sorry."

Casey Newton: You were a character in a bad comedy, running down the up escalator.

Kevin Roose: [Laughing] Yes.

Casey Newton: I was at that platform this morning and I heard a woman screaming. But now I’m realizing that was you. Did you get the phone?

Kevin Roose: I did. It’s safe. No cracks. It was retrieved. But yeah, that was a wild way to start my day.

Casey Newton: Well, thank you to all the good Samaritans of San Francisco who did not steal Kevin’s phone during the 30 seconds when it was on the floor. It kind of restores your faith in humanity a bit.

Kevin Roose: Oh, it does.

Casey Newton: So, this week, another upstart AI lab has the tech world talking with the release of a powerful new large language model. But unlike the others, this one might be running the federal government by springtime.

Kevin Roose: [Laughing]

Casey Newton: This week, xAI, which is Elon Musk’s AI company, released its latest model, Grok 3. And based on their own benchmark results and early reviews, it seems like it’s basically on par with the best models out there right now. And while it hasn’t been subjected to rigorous, independent testing, the early word from AI nerds is that it’s pretty good.

Kevin Roose: Well, Grok 3 is the new premium tier model of Grok, which is xAI’s AI model. It’s available to Premium+ subscribers on X, which is their $40 a month premium tier, which is cheaper than OpenAI’s most powerful plan, which is $200 a month. But it’s also built into X, the former Twitter app.

Casey Newton: Yeah, and I should say that I actually have used Grok 3 for this exact same reason, which is I have just been given free access to this thing for some reason. I guess the Department of X Efficiency or DOXE has not yet uncovered my account.

Kevin Roose: Right. [Laughing]

Kevin Roose: We both played around with it a little bit. What were your impressions of Grok 3?

Casey Newton: Well, like others who have commented, it seems like it’s about as good as some of the other models. When I asked Grok about itself, it said Grok 3 launch is a pivotal moment in AI. It seemed like a bit much. But I also asked it if it had an opinion about Platformer, my newsletter, and it actually said some really nice things – which I had to respect.

Casey Newton: And you?

Kevin Roose: So I put it through some of my proprietary evals. I actually do have things that I test AI models on. The Roose benchmarks. And yeah, I would say it did OK. It was not mind-blowingly good. It was not bad. It got some things that other models missed and vice versa. It did have access to X data, which is interesting. You can do things like tell it to analyze this person’s posts on X and tell me what they think about this topic.

Casey Newton: There’s this famous question that we always love to ask large language models. Can you count the Rs in strawberry? I asked Grok the equivalent question for X, which is, can you count Elon Musk’s children? Which is known to be very hard for large language models.

Kevin Roose: Well, a new one just dropped.

Casey Newton: Exactly. And that’s why it’s so hard for them to keep up.

Kevin Roose: And part of Elon Musk’s pitch for Grok for the past year or… are just certain decisions that it will prompt you to make where you’re like, "I don’t actually know what these terms mean or what the right decision is here." And you can ask the AI to just make the decision for you, but you might not be totally happy with the result.

Casey Newton: Now, during this process, I’m curious if you felt like you were learning something about the coding process. Like, if you spent the next year making these little one-off apps, do you feel like you would maybe be a decent junior software engineer? Or is the idea actually not to get into the details, to just let it build things? And if you don’t know what it’s doing, that’s none of your business.

Kevin Roose: Yeah, I think I’m more in the latter camp. I mean, this was the part that I found fascinating about what Andrej Karpathy said about vibe coding. He’s an extremely good programmer. But he says that he now can enter this mode where he basically just says, "OK, OK, OK, accept, accept, accept, and the computer will go off and do its thing."

Casey Newton: You know, I’ve been thinking about a blog post I read this week by a guy named Namanyay Goel. And his blog post was titled "New Junior Developers Can’t Actually Code". This post got a million views, according to the post that I’m looking at. And he is saying that when he talks to junior developers, they are having an experience very similar to you, which is that as they are building these systems, they are essentially just supervising an AI. They aren’t actually getting their hands dirty and understanding which mechanisms are leading to which results.

Kevin Roose: Yeah, I think this is a very real thing. I mean, the flip side of me, a non-coder being able to build stuff, is that if real coders are using these tools, there’s no incentive for them to learn the basic skills of programming and learn the syntax of the different languages. And yeah, I don’t know what to do about that.

Casey Newton: What do you think?

Kevin Roose: It seems like a version of what happened when we all got Google Maps on our phones, which is that people started losing their sense of direction. There’s this kind of skill atrophy issue that people worry about. But I think that the returns to knowing how to use these things effectively are still great enough, and still require enough knowledge of how the various pieces of code fit together, that it still makes sense for people to learn to code.

Casey Newton: All right.

Kevin Roose: What do you think?

Casey Newton: What I think is that as AI systems get more and more powerful, we need people who do understand them in a very detailed, technical, complex, down-to-the-metal kind of way. And that if we don’t do that, our only alternative will just be to trust the AI when we ask it, "Hey, how do you work?" And there are a lot of reasons why I don’t want to end up in that world. So I’m comfortable having fewer people in this world who know the code at that level of detail. And it’s fine with me if most software engineers don’t. But I want a solid core of people who do.

Kevin Roose: Yeah. And I’d like to continue with my vibe-coding experiments, trying to build increasingly more useful tools for myself and my friends.

Casey Newton: And I’m thinking about starting because if you can do it, surely I can.

Kevin Roose: [Laughing] Yes, anyone can. That is sort of the point. And I also would love to hear from our listeners, what they are vibe coding. What tools and apps are you building using AI that are solving your own personal, specific problems?

Casey Newton: Did you invent a novel bioweapon using ChatGPT? We’d love to hear from you.

Kevin Roose: Yeah, please email that one to tips@fbi.gov.

Casey Newton: [Laughing] But the others, hardfork@nytimes.com.

[Upbeat electronic music]

Kevin Roose: "Hard Fork" is produced by Whitney Jones and Rachel Cohn. We’re edited by Rachel Dry. We’re fact-checked by Caitlin Love. Today’s show was engineered by Alyssa Moxley. Original music by Elisheba Ittoop, Marion Lozano, Diane Wong, Rowan Niemisto, and Dan Powell. Our audience editor is Nell Gallogly. Video production by Chris Schott, Sawyer Roque, and Pat Gunther. You can watch this full episode on
