👀 Putting GPT-4's new rivals to the test

Can Claude 3 and Gemini Ultra beat GPT-4 at everyday business tasks?


Hey Genesis Residents!

For the past year, GPT-4 has reigned supreme, with no other model even coming close to its prowess.

However, in recent weeks, an oligarchy has emerged, challenging its dominance. Gemini Ultra and Claude 3 (Opus) have risen as formidable rivals, at least according to abstract benchmarks, threatening to dethrone the once-uncontested ruler of the AI landscape.

But benchmarks often rely on questions that don’t reflect how people actually use these models day to day.

So I decided to test the models on three tasks I perform most days: summarising a meeting, critiquing an article, and sizing a market. These are typical tasks for anyone in business.

To keep the evaluation fair and comparable, I avoided prompt engineering altogether and kept the prompts plain and simple, so that each model’s inherent capabilities could shine through.

Let’s get into it!

Read Time: 15 minutes

Spotlight on Our Sponsor: MonsterONE

Dive into MonsterONE’s comprehensive subscription service, where unlimited access to a vast array of digital assets awaits. With over 386,200 high-quality items, your projects will shine like never before.

MonsterONE Offers:

  • Website Assets: Themes, templates, plugins.

  • Graphic Designs: Icons, fonts, logos.

  • Audiovisuals: Music tracks, video templates.

  • 3D Elements: Models, textures.

Benefits Include:

  • Weekly Updates: Fresh assets continuously.

  • Full Support: Comprehensive assistance.

  • Flexible Subscriptions: Plans to suit every need.

Designers, developers, marketers, and freelancers, take note. Enjoy unlimited downloads with MonsterONE. Save 20% with “AIGenesis” until April 30.

PROMPT 1: Summarise the key points, action items, and decisions made during the meeting: [MEETING TRANSCRIPT]

PROMPT 2: Provide a critique of this article: [ARTICLE]

PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.

And because I’m a lazy, biased human who thought it would be amusing to let the machines judge each other, I decided to have each model anonymously rate its own responses as well as those of its rivals.

The results below are the average scores the LLMs gave to each other’s (and their own) responses to each task. After all, why should I put in the effort to be objective when I can just sit back and let the systematic bias of internet data do the work for me?
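The averaging itself is easy to reproduce. Here is a minimal Python sketch, assuming every judge gives each response a single numeric score. The model labels and numbers are made-up placeholders, not the ratings from this experiment.

```python
from statistics import mean

# Placeholder scores only: ratings[judge][author] = score the judge gave that author's response.
# The labels and numbers are hypothetical and are NOT the ratings from this experiment.
ratings = {
    "model_a": {"model_a": 7, "model_b": 9, "model_c": 5},
    "model_b": {"model_a": 8, "model_b": 8, "model_c": 6},
    "model_c": {"model_a": 9, "model_b": 7, "model_c": 7},
}

def average_scores(ratings):
    """Average the scores each response received across all judges, self-ratings included."""
    authors = list(next(iter(ratings.values())))
    return {a: mean(judge[a] for judge in ratings.values()) for a in authors}

averages = average_scores(ratings)
for author, score in averages.items():
    print(f"{author}: {score:.1f}")
```

Because self-ratings are included in the mean, a model that flatters itself pulls its own average up, which is worth keeping in mind when reading the scores.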

The verdict

The models have spoken, and the verdict is in: GPT-4 still reigns supreme in the AI realm, unless you need help with market sizing (or any similar analytical task), in which case you might want to look to Claude.

Yet, I can’t resist injecting my own subjective bias into the mix!

Let’s go through each task, analyse the models’ responses, and see whether their ratings match my own human judgement.

PROMPT 1: Summarise the key points, action items and decisions made during the meeting: [MEETING TRANSCRIPT]

For the summarising task, Claude 3 and Gemini’s ratings were generally consistent, favouring GPT-4; however, GPT-4 went against the grain and deemed Claude superior.

In my opinion, the summaries produced by Claude were far superior, allowing me to parse the information with greater ease. Gemini, however, fell short, omitting some crucial details, securing its place as the clear loser in this round. GPT-4’s summary, while comprehensive, was a little too long and unnecessarily detailed for my liking.

GPT-4 example: “Wheel Tax Increase: Commissioner Adkins moved to increase the wheel tax by $10 to compensate for state cuts in education funding. After debate and a motion for the previous question (which passed 17-2), the increase was approved on first passage with a vote of 17-2.”

Claude, on the other hand, struck the perfect balance, including all the essential points while wisely leaving out minor, irrelevant details like the Commission’s invitation to the ‘chilli supper’ (which was in both GPT-4 and Gemini’s summaries).

Claude 3 example: “Commissioner Adkins' resolution to increase the wheel tax by $10 to make up for the state cut in education funding passed on first passage with a vote of 17-2.”

So, in my humble and totally unbiased opinion, the winner of this showdown is none other than Claude 3, the Goldilocks of AI models. Not too much, not too little, but just right.


PROMPT 2: Provide a critique of this article: [ARTICLE]

The models all agreed on this one: GPT-4 is the best. I’m inclined to agree. It provides counter-arguments for specific claims in the article with much more nuance than Gemini and Claude. GPT-4’s response demonstrates a deeper level of reflection and draws upon outside evidence to support its points, setting it apart from its competitors.

GPT-4 example: Addressing the Coherence of Change Theory

“The critique that degrowth lacks a coherent theory of change dismisses the movement's contribution to expanding the discourse on sustainable living and economic models. Degrowth scholars and activists often highlight the importance of local initiatives, community-led projects, and the relocalization of economies as tangible steps toward their broader goals. These approaches encourage a bottom-up model of change that can coexist with broader political and economic reforms. Furthermore, the emphasis on reducing consumption and enhancing well-being can inspire innovative policy solutions focused on sustainability and equity.”

Claude and Gemini, while not quite reaching the heights of GPT-4’s nuanced analysis, still provide valuable and different paths of inquiry.

Claude 3 example: “Redefining growth: While the article advocates for reorienting growth patterns toward sustainability, degrowthers argue that the pursuit of endless economic growth is inherently unsustainable. They propose alternative measures of progress, such as the Genuine Progress Indicator, which account for social and environmental factors.”

Gemini Ultra example: “Focus on systemic change: The article criticizes degrowth for blaming "the system" and advocating for its abolition. However, degrowth might not solely target capitalism itself, but rather the specific features within the system (like hyper-consumerism and endless growth paradigms) that contribute to environmental degradation.”

All these models provide different flavours, and I could see myself attempting to use each for a brainstorming session or to receive feedback.

PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.

In the market sizing task, Claude 3 emerges victorious and, again, I concur that it is the best. It provides a more in-depth analysis by starting with the US population and working its way up, although it could have been improved further by considering gluten-free, vegan, or health-conscious demographics. GPT-4, in contrast, arbitrarily selects the total market size for protein bars as a starting point. But props to both models: their maths was correct.

Claude 3 example: Step 1: Estimate the number of regular consumers of the specific protein bars.

Number of consumers = 5% of the U.S. population

Number of consumers = 0.05 × 330,000,000 = 16,500,000
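To make the "start with the population and work up" approach concrete, here is a rough Python sketch. Only the population and the 5% consumer share come from Claude's response; the function name and every later factor (bars per year, price, online share) are hypothetical placeholders I have chosen for illustration, not figures from any model.

```python
US_POPULATION = 330_000_000   # from Claude's response
CONSUMER_SHARE = 0.05         # from Claude's response: 5% are regular consumers

def bottom_up_revenue(population, consumer_share, bars_per_year, price_per_bar, online_share):
    """Population -> consumers -> total revenue -> online revenue."""
    consumers = population * consumer_share
    total_revenue = consumers * bars_per_year * price_per_bar
    return consumers, total_revenue * online_share

# Everything after CONSUMER_SHARE is a hypothetical placeholder, not a figure from any model.
consumers, online_revenue = bottom_up_revenue(
    US_POPULATION, CONSUMER_SHARE, bars_per_year=12, price_per_bar=2.50, online_share=0.2
)
print(f"Consumers: {consumers:,.0f}")
print(f"Online revenue: USD {online_revenue:,.0f}")
```

The appeal of this structure is that each assumption is a separate, visible input, so it is easy to challenge one number without redoing the whole estimate.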

GPT-4 example: Assume the total market size for protein bars in the U.S. is $1 billion annually.

Gemini, on the other hand, stubbornly insisted on using data from the web, despite my explicit instruction not to, which I gave to keep the comparison fair. Even so, its estimates still didn’t make sense, likely because it was fixated on web sources that didn’t fit together cohesively. For example, it applied its 5% market share assumption twice:

Gemini Ultra example: “Assuming a global protein bar market size of USD 4.54 billion in 2021, the projected plant-based segment share could be around USD 227 million (5% of the total market).

Applying the estimated 5% market share for your specific product category, the annual revenue through online channels in the US could be approximately USD 1.82 million (5% of USD 36.32 million).”
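A quick sanity check in Python, using only the figures quoted above, shows where this goes wrong. Each multiplication is fine on its own, but the second step starts from USD 36.32 million, a base the quoted response never derives from the USD 227 million it has just calculated. The variable names are mine.

```python
# Figures quoted from Gemini's response (USD).
global_market = 4.54e9        # global protein bar market, 2021
intermediate_base = 36.32e6   # base used in the second step; the quote does not explain its origin
share = 0.05                  # the 5% assumption, applied in both steps

plant_based_segment = share * global_market
final_estimate = share * intermediate_base

print(f"Plant-based segment: USD {plant_based_segment / 1e6:.0f} million")
print(f"Final estimate: USD {final_estimate / 1e6:.2f} million")
```

The two results match the USD 227 million and USD 1.82 million in the quote, so the arithmetic holds; it is the chain of reasoning between the steps that breaks.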

Ironically, Gemini rated itself as the best performer, even though its overly conservative output caused more trouble than it was worth.

Final thoughts

Overall, it’s a toss-up between GPT-4 and Claude 3 for output quality, but other factors play a significant role. For one, GPT-4 is quite verbose in its responses, with an average word count of 437 compared to Claude’s 290 and Gemini’s 315. I have to read over 100 extra words whenever GPT-4 responds, which is made even worse by the fact that GPT-4, even the Turbo version, notoriously produces responses at a snail’s pace.

In contrast, Gemini is exceptionally quick, making it my go-to choice for immediate assistance when I’m stuck on a thought or phrasing. If I require a more thoughtful response or one that is more statistically representative of the internet data corpus, I turn to Claude 3.

Despite its shortcomings, GPT-4 still has its merits, particularly in its UX features. The code interpreter and image generator, which its competitors currently lack, give it an edge in certain use cases. But this hasn’t stopped me from slipping away to its rivals. The king is dead, at least for now.

Now, I want to hear from you:

  • Which models have you tried out so far? Do you use specific models for certain tasks?

  • What response biases have you noticed in the models? Are they different depending on the model?

AI Art of the Week

a woman with her head covered in water, in the style of art nouveau-inspired illustrations, colorful collage, intricate floral arrangements, made of crystals, susan seddon boulet, symmetrical composition --s 750 --ar 55:86 --v 6.0

Thank you for reading today’s edition.

If you enjoyed this, please help spread the love by forwarding this newsletter to a friend or colleague.

Here's how I can help you:

Get your product in front of 3500+ solopreneurs, business owners, and professionals. Sponsor this newsletter here.

I hope to see you in the next one!