11 min read AI & Technology

The Hard Part of AI Just Started

The AI landscape is shifting from flashy demos to gnarly issues of integration, evaluation, and true impact. Here's what you need to know about the new era of AI.

AI & TECH ROUNDUP: The Hard Part of AI Just Started

THIS WEEK'S INTAKE

📊 9 episodes across 4 podcasts

⏱️ ~8 hours of AI & Tech intelligence

🎙️ Featuring: John Yang, Kevin Roose, Carina Hong, and insights from The AI Daily Brief

We listened. Here's what matters.

Alright, buckle up. We're past the "AI is coming!" phase and firmly into the "AI is here, now what?" era. The market is simultaneously buzzing with M&A, grappling with practical implementation, and bracing for what's next. While the headlines still scream about breakthroughs, the smartest folks are talking about the hard parts: the gnarly issues of evaluation, the true cost of inference, and the tricky path from flashy demo to reliable, long-term impact.

This week’s intelligence stream feels less like a rocket launch and more like a detailed engineering meeting. We’ve got deep dives into how we even know if an AI is good at coding, the surprisingly mundane challenges of AI adoption, and why everyone's suddenly thinking about "vibe coding." What ties it all together? A growing consensus that the easy wins are drying up. The focus is shifting from simply building powerful models to integrating and sustaining them effectively in real-world — and often, economically constrained — environments.

Here's what you need to know.


The Briefing

The Reasoning Wars Continue, And Evals Are Still Bad

We're all fascinated by those head-spinning demos of AI coding assistants or mathematical theorem provers. But how do you actually score them? Turns out, our evaluation methods are still playing catch-up to the complexity of the models. John Yang, a key figure in code evaluation, pulls back the curtain on benchmarks like SWE-bench and CodeClash. The core problem? Current eval systems, particularly for coding, are too narrow and don't reflect real-world tasks. They often focus on isolated problems, treat every task as independent of the others, and crucially, don't include the kind of "impossible tasks" that a human engineer would recognize and set aside.

The Insight: If you're betting on AI to write complex code or prove theorems, the current methods for assessing their capability are woefully inadequate. We're often optimizing for metrics that don't directly translate to genuine utility or groundbreaking discovery. This isn't just an academic problem; it has direct implications for how companies invest in AI development and integration.

The Voice:

"I don't like unit tests as a form of verification. And I also think there's an issue with SWE-bench where all of the task instances are independent of each other. I think we should intentionally include impossible tasks as a flag of like, hey, you're cheating." — John Yang, on Latent Space

The So What: Don't be fooled by high benchmark scores alone. True AI progress will require much more sophisticated, holistic, and perhaps intentionally challenging evaluation methodologies that push models beyond their current limitations and highlight their genuine capacity for problem-solving. This gap is a significant risk for enterprises adopting powerful AI tools without proper internal evaluation frameworks.
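To make Yang's "impossible tasks" idea concrete, here is a minimal sketch of how a scoring step might treat them. It is an illustration under our own assumptions, not how SWE-bench or CodeClash actually work: the TaskResult type, the score_run function, and the task names are all hypothetical. The idea is simply to score only the solvable tasks and to flag any run that claims success on a task that was built to be unsolvable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One task outcome from a hypothetical eval harness."""
    task_id: str
    solvable: bool  # False marks a deliberately impossible "canary" task
    passed: bool    # whether the harness's checks reported success


def score_run(results):
    """Return (pass rate on solvable tasks, ids of canary tasks that 'passed')."""
    solvable = [r for r in results if r.solvable]
    # A reported pass on an impossible task suggests the checks were gamed.
    suspicious = [r.task_id for r in results if not r.solvable and r.passed]
    pass_rate = sum(r.passed for r in solvable) / len(solvable) if solvable else 0.0
    return pass_rate, suspicious


# Made-up results purely for illustration.
results = [
    TaskResult("fix-null-deref", solvable=True, passed=True),
    TaskResult("add-retry-logic", solvable=True, passed=False),
    TaskResult("contradictory-spec", solvable=False, passed=True),
]

pass_rate, suspicious = score_run(results)
print(f"pass rate on solvable tasks: {pass_rate:.2f}")
print("flagged canary tasks:", suspicious)
```

In this toy run the headline score only counts the two solvable tasks, and the "contradictory-spec" pass is surfaced separately as a warning sign rather than folded into the number.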


Enter "Vibe Coding," Exit Unrealistic Expectations

If you thought AI was just about perfect logic and objective outputs, think again. The rise of "vibe coding" suggests a more human, almost intuitive component to how we'll interact with AI, especially in creative and adaptive workflows. Predictions for 2026 suggest that model upgrades will increasingly be "vibe-based," moving away from purely deterministic metrics to more nuanced, often subjective, improvements that resonate with user experience or brand. This also connects to the blurring lines between AI assistants and agents, which are becoming less about explicit instructions and more about understanding context and user preference.

The Insight: As AI becomes more sophisticated, its performance will be less about raw computational power and more about its ability to understand and respond to human intent, preferences, and even emotional cues. This isn't just about output quality; it's about the feel of the interaction, which drives adoption and loyalty.

The Voice:

"Model upgrades are going to be increasingly vibe based. I think in 2026 we're going to see the lines between assistants and agents get more blurry, not more clear." — The AI Daily Brief

The So What: For product teams, this means a shift in focus from purely functional metrics to qualitative user experience. For engineers, it means building systems that can interpret ambiguity. And for executives, it implies a need for leadership that understands the "soft" power of AI, not just its hard capabilities. Your next big AI competitive edge might come from its vibe, not its FLOPS.


The M&A Buffet: Is It Hunger Pangs or Indigestion?

The AI sector is heating up with M&A activity and VC funding – a lot of it. Meta's $2.5 billion acquisition of Manus, hot on the heels of OpenAI's eye-watering compensation packages, suggests a land grab for talent and tech. But is this a sign of bullish confidence or a nervous rush to secure positions before a potential market correction? The prevailing sentiment among some is that high valuations and acquisition sprees could be a strategic move to build "fortress balance sheets" against a future downturn or simply a reflection of the intense competition for AI infrastructure and specialized capabilities.

The Insight: The current M&A environment in AI is complex. While innovation is driving some deals, others are likely hedging strategies. Many companies are making aggressive moves not just for growth, but to ensure survival or gain a protected niche in a high-stakes, volatile market.

The Voice:

"Are people getting out while the getting is good, or is this just the start of what’s going to be a big deal in 2026? There's a chance that 2026 is a peak." — Tech Brew Ride Home

The So What: This signals a maturity curve for the AI industry where consolidation becomes as important as innovation. For investors, it means discerning sustainable long-term value from quick flips. For operators, it means keeping an eye on competitive shifts and understanding whether their company is a target, an acquirer, or at risk of being left behind. The AI chip sales boom projected for 2026 further underscores the scale of investment, suggesting that the underlying infrastructure build-out is massive, regardless of individual company fates.


The Watchlist

🔥 Heating Up:

👀 Worth Watching:

⚠️ Proceed With Caution:


The Contrarian Corner

While the market is buzzing with grand AI predictions, the smartest minds are also poking holes in the hype. Kevin Roose, despite his daily use of AI tools, points out the often laughably bad performance of some early AI hardware, like his robot vacuums. This underlines a quiet but important skepticism: many of the AI implementations today are still clunky and far from magical. He highlights that while the potential is undeniable, the current reality for many consumer-facing AI products is often underwhelming, suggesting a gap between developer ambition and user experience. This isn't just about imperfections; it's a reminder that truly valuable AI integrates seamlessly and reliably, a hurdle much harder to clear than simply demonstrating capability.


The Bottom Line

The AI narrative is shifting from pure excitement to rigorous engineering and strategic defense. The challenge isn't just building powerful models, but evaluating them accurately, integrating them into workflows in ways that actually improve those workflows, and navigating an increasingly competitive and consolidating market. Don't just watch the headlines; watch the infrastructure, the evaluation methods, and the subtle shifts in how people actually use AI. That's where the real signals are.


📚 APPENDIX: EPISODE COVERAGE


1. Latent Space: The AI Engineer Podcast: "[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang"

Guests: John Yang (Stanford)
Runtime: 1h 37m | Vibe: Geeky Deep Dive

Key Signals:

"I think an important philosophical point here is that if you have good evaluation metrics where the human is in the loop, you can develop more trustworthy agents, more trustworthy models."

2. The AI Daily Brief: Artificial Intelligence News and Analysis: "50 AI Predictions for 2026 - Part 1"

Guests: Not specified
Runtime: 16m | Vibe: Forward-Looking Brainstorm

Key Signals:

"Model upgrades are going to be increasingly vibe based. I think in 2026 we're going to see the lines between assistants and agents get more blurry, not more clear."

3. Tech Brew Ride Home: "The End Of Year M&A Rush"

Guests: Not specified
Runtime: 20m | Vibe: Market Pulse Check

Key Signals:

"Are people getting out while the getting is good, or is this just the start of what’s going to be a big deal in 2026? There's a chance that 2026 is a peak."

4. Hard Fork: "The Wirecutter Show: Tips for Using A.I. Smartly With Kevin Roose"

Guests: Kevin Roose (New York Times tech columnist, co-host of Hard Fork)
Runtime: 52m | Vibe: Pragmatic Journalist Insights

Key Signals:

"I pay for more subscription AI products than streaming TV services. A year or two ago barely any teenagers would have said, I have an AI friend. And now something like half of teenagers are regular users of these AI companion products."

5. The AI Daily Brief: Artificial Intelligence News and Analysis: "AI New Year’s: The 10-Week AI Resolution"

Guests: Not specified
Runtime: 15m | Vibe: Actionable Self-Improvement

Key Signals:

"The goal isn’t theory or trends, but habits, workflows, and systems that still matter months from now, setting a foundation for how AI fits into work and life heading into 2026."

6. Tech Brew Ride Home: "Manus, The Hands Of Fate"

Guests: Not specified
Runtime: 17m | Vibe: Corporate Strategy & Market Moves

Key Signals:

"Meta's existing AI offerings are widely available free in services including Instagram and WhatsApp, and the company has also fully integrated AI into its advertising in ways that have fattened its bottom line, according to analysts."

7. The Neuron: AI Explained: "Building Mathematical Superintelligence: A Stanford Dropout's $64M Bet on AI Math"

Guests: Carina Hong (Founder & CEO of Axiom Math)
Runtime: 59m | Vibe: Inspiring Visionary

Key Signals:

"Superhuman is an AI that can inspire great mathematicians like Terence Tao. An AI that prompts you to think out of the box, that generate new knowledge at scale, incredible scale and speed."

8. AI Breakdown: "Fal's $140M Raise Powers 10X Image Speed Surge"

Guests: Not specified
Runtime: 13m | Vibe: Innovation & Speed

Key Signals:

"Fal powers 10X image speed surge via $140 million investor backing. Optimized for diverse hardware, it enables edge AI image apps globally."

9. The AI Daily Brief: Artificial Intelligence News and Analysis: "50 AI Predictions for 2026 - Part 2"

Guests: Not specified
Runtime: 17m | Vibe: Strategic Outlook

Key Signals:

"I think it's going to be very hard to shake Anthropic off its coding lead. If Grok can't really differentiate itself...I wouldn't be surprised if we saw some mass absorption of the Elon empire."