👀 Putting GPT-4's new rivals to the test

Can Claude 3 and Gemini Ultra beat GPT-4 at everyday business tasks?

AI Genesis
April 06, 2024

Spotlight on Our Sponsor: MonsterONE

Dive into MonsterONE’s comprehensive subscription service, where unlimited access to a vast array of digital assets awaits. With over 386,200 high-quality items, your projects will shine like never before.

MonsterONE Offers:

Website Assets: Themes, templates, plugins.
Graphic Designs: Icons, fonts, logos.
Audiovisuals: Music tracks, video templates.
3D Elements: Models, textures.

Benefits Include:

Weekly Updates: Fresh assets continuously.
Full Support: Comprehensive assistance.
Flexible Subscriptions: Plans to suit every need.

Designers, developers, marketers, and freelancers, take note. Enjoy unlimited downloads with MonsterONE. Save 20% with the code “AIGenesis” until April 30.

Get it here

PROMPT 1: Summarise the key points, action items, and decisions made during the meeting: [MEETING TRANSCRIPT]

PROMPT 2: Provide a critique of this article: [ARTICLE]

PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.

And because I’m a lazy, biased human who thought it would be amusing to let the machines judge each other, I decided to have each model anonymously rate its own responses as well as those of its rivals.

The results below are the average scores the LLMs gave when rating each other’s (and their own) responses to each task. After all, why should I put in the effort to be objective when I can just sit back and let the systematic bias of internet data do the work for me?
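A minimal sketch of that averaging step, assuming each judge's scores are stored in a nested dictionary. The function names `average_scores` and `self_preference`, the model labels, and every score below are hypothetical placeholders for illustration, not the figures from this test.

```python
from statistics import mean


def average_scores(ratings):
    """Average the score each candidate received across all judges.

    `ratings[judge][candidate]` holds the score that `judge` gave `candidate`.
    """
    candidates = sorted({c for scores in ratings.values() for c in scores})
    return {
        c: mean(scores[c] for scores in ratings.values() if c in scores)
        for c in candidates
    }


def self_preference(ratings):
    """Per model: its own self-score minus the mean score from the other judges."""
    gaps = {}
    for judge, scores in ratings.items():
        others = [r[judge] for j, r in ratings.items() if j != judge and judge in r]
        if judge in scores and others:
            gaps[judge] = scores[judge] - mean(others)
    return gaps


# Hypothetical placeholder scores (out of 10), NOT the results from this post.
example_ratings = {
    "model_a": {"model_a": 8, "model_b": 7, "model_c": 6},
    "model_b": {"model_a": 9, "model_b": 7, "model_c": 5},
    "model_c": {"model_a": 7, "model_b": 7, "model_c": 7},
}

if __name__ == "__main__":
    print(average_scores(example_ratings))
    print(self_preference(example_ratings))
```

A positive value from `self_preference` would suggest a model scores itself more generously than its rivals do, which is one simple way to spot the kind of bias that comes with letting the machines judge each other.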

The verdict

The models have spoken, and the verdict is in: GPT-4 still reigns supreme in the AI realm, unless you need help with market sizing — or any similar analytical task — in which case you might want to look to Claude.

Yet, I can’t resist injecting my own subjective bias into the mix!

Let’s go through each task and analyse the models’ responses and whether their ratings match my own human judgement.

PROMPT 1: Summarise the key points, action items and decisions made during the meeting: [MEETING TRANSCRIPT]

For the summarising task, Claude 3 and Gemini’s ratings were generally consistent, favouring GPT-4; however, GPT-4 went against the grain and deemed Claude superior.

In my opinion, the summaries produced by Claude were far superior, allowing me to parse the information with greater ease. Gemini, however, fell short, omitting some crucial details, securing its place as the clear loser in this round. GPT-4’s summary, while comprehensive, was a little too long and unnecessarily detailed for my liking.

GPT-4 example: “Wheel Tax Increase: Commissioner Adkins moved to increase the wheel tax by $10 to compensate for state cuts in education funding. After debate and a motion for the previous question (which passed 17-2), the increase was approved on first passage with a vote of 17-2.”

Claude, on the other hand, struck the perfect balance, including all the essential points while wisely leaving out minor, irrelevant details like the Commission’s invitation to the ‘chilli supper’ (which was in both GPT-4 and Gemini’s summaries).

Claude 3 example: “Commissioner Adkins' resolution to increase the wheel tax by $10 to make up for the state cut in education funding passed on first passage with a vote of 17-2.”

So, in my humble and totally unbiased opinion, the winner of this showdown is none other than Claude 3, the Goldilocks of AI models. Not too much, not too little, but just right.

Tweet of the Week

Deepfake is getting out of hand⚠️
don't believe everything you see on social media
deepfakes can deceive you in ways
you can't even imagine🤯🤯
1. AI-generated Barack Obama has a message for you
— AI Genesis (@AIGenesis_)
12:14 AM • Mar 29, 2024

PROMPT 2: Provide a critique of this article: [ARTICLE]

The models all agreed on this one: GPT-4 is the best. And I’m inclined to agree. It provides counter-arguments to specific claims in the article with much more nuance than Gemini and Claude. GPT-4’s response demonstrates a deeper level of reflection and draws on outside evidence to support its points, setting it apart from its competitors.

GPT-4 example: Addressing the Coherence of Change Theory

“The critique that degrowth lacks a coherent theory of change dismisses the movement's contribution to expanding the discourse on sustainable living and economic models. Degrowth scholars and activists often highlight the importance of local initiatives, community-led projects, and the relocalization of economies as tangible steps toward their broader goals. These approaches encourage a bottom-up model of change that can coexist with broader political and economic reforms. Furthermore, the emphasis on reducing consumption and enhancing well-being can inspire innovative policy solutions focused on sustainability and equity.”

Claude and Gemini, while not quite reaching the heights of GPT-4’s nuanced analysis, still provide valuable and different paths of inquiry.

Claude 3 example: “Redefining growth: While the article advocates for reorienting growth patterns toward sustainability, degrowthers argue that the pursuit of endless economic growth is inherently unsustainable. They propose alternative measures of progress, such as the Genuine Progress Indicator, which account for social and environmental factors.”

Gemini Ultra example: “Focus on systemic change: The article criticizes degrowth for blaming "the system" and advocating for its abolition. However, degrowth might not solely target capitalism itself, but rather the specific features within the system (like hyper-consumerism and endless growth paradigms) that contribute to environmental degradation.”

All these models provide different flavours, and I could see myself attempting to use each for a brainstorming session or to receive feedback.

PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.

In the market sizing task, Claude 3 emerges victorious, and again I concur. It provides a more in-depth analysis by starting from the US population and building the estimate up from there, although it could have gone further by considering gluten-free, vegan, or health-conscious demographics. GPT-4, in contrast, arbitrarily picks a total market size for protein bars as its starting point. But props to both models: their maths was correct.

Claude 3 example: Step 1: Estimate the number of regular consumers of the specific protein bars.

Number of consumers = 5% of the U.S. population

Number of consumers = 0.05 × 330,000,000 = 16,500,000

GPT-4 example: Assume the total market size for protein bars in the U.S. is $1 billion annually.

Gemini, on the other hand, stubbornly insisted on using data from the web, despite my explicit instructions to avoid it for the sake of a fair comparison. Even with that data, its estimates didn’t make sense, likely because it was fixated on web sources that didn’t fit together cohesively. For example, it applied its 5% market-share assumption twice:

Gemini Ultra example: “Assuming a global protein bar market size of USD 4.54 billion in 2021, the projected plant-based segment share could be around USD 227 million (5% of the total market).

Applying the estimated 5% market share for your specific product category, the annual revenue through online channels in the US could be approximately USD 1.82 million (5% of USD 36.32 million).”

Ironically, Gemini rated itself the best performer, even though its overly conservative output caused more trouble than it was worth.
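The figures quoted in this section can be re-checked with a few lines of Python. This is only a sanity check on the arithmetic in the excerpts above, not a reconstruction of any model's full estimate. The helper name `pct` is made up for illustration, and the 330 million population is the figure Claude 3 used in its own working.

```python
def pct(value, percent):
    """Return `percent` percent of `value` (multiply first to limit float drift)."""
    return value * percent / 100


# Claude 3, step 1: 5% of the U.S. population figure it assumed.
claude_consumers = pct(330_000_000, 5)

# Gemini Ultra, first 5%: plant-based share of the USD 4.54 billion global market.
gemini_plant_based = pct(4_540_000_000, 5)

# Gemini Ultra, second 5%: applied to a USD 36.32 million base.
gemini_online_us = pct(36_320_000, 5)

# The quoted excerpt does not say how USD 36.32 million was derived;
# this ratio only shows how it relates to the plant-based figure.
implied_ratio = 36_320_000 / gemini_plant_based

if __name__ == "__main__":
    print(f"Claude 3 consumers:       {claude_consumers:,.0f}")
    print(f"Gemini plant-based (USD): {gemini_plant_based:,.0f}")
    print(f"Gemini online US (USD):   {gemini_online_us:,.0f}")
    print(f"36.32M / plant-based:     {implied_ratio:.2f}")
```

Each individual multiplication holds up. What the excerpt leaves unexplained is the jump from a global USD 227 million segment to the USD 36.32 million base that the second 5% is applied to.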

Final thoughts

Overall, it’s a toss-up between GPT-4 and Claude 3 for output quality, but other factors play a significant role. Firstly, GPT-4 is quite verbose in its responses, with an average word count of 437 compared to Claude’s 290 and Gemini’s 315. I have to read over 100 extra words when GPT-4 responds, which is made even worse by the fact that GPT-4, even the Turbo version, produces responses at a snail’s pace.

In contrast, Gemini is exceptionally quick, making it my go-to choice for immediate assistance when I’m stuck on a thought or phrasing. If I require a more thoughtful response or one that is more statistically representative of the internet data corpus, I turn to Claude 3.

Source: Artificial Analysis

Despite its shortcomings, GPT-4 still has its merits, particularly in its UX features. The code interpreter and image generator, which its competitors currently lack, give it an edge in certain use cases. But this hasn’t stopped me from slipping away to its rivals. The king is dead, at least for now.

Now, I want to hear from you:

Which models have you tried out so far? Do you use specific models for certain tasks?
What response biases have you noticed in the models? Are they different depending on the model?

AI Art of the Week

a woman with her head covered in water, in the style of art nouveau-inspired illustrations, colorful collage, intricate floral arrangements, made of crystals, susan seddon boulet, symmetrical composition --s 750 --ar 55:86 --v 6.0

Thank you for reading today’s edition.

If you enjoyed this, please help spread the love by forwarding this newsletter to a friend or colleague.

Here's how I can help you:

Get your product in front of 3500+ solopreneurs, business owners, and professionals. Sponsor this newsletter here.

I hope to see you in the next one!

AI Genesis