Putting GPT-4's new rivals to the test
Can Claude 3 and Gemini Ultra beat GPT-4 at everyday business tasks?
Hey Genesis Residents!
For the past year, GPT-4 has reigned supreme, with no other model even coming close to its prowess.
However, in recent weeks, an oligarchy has emerged, challenging its dominance. Gemini Ultra and Claude 3 (Opus) have risen as formidable rivals, at least according to abstract benchmarks, threatening to dethrone the once-uncontested ruler of the AI landscape.
But benchmarks often rely on questions that don't reflect how people actually use these models day to day.
So I decided to test the models on tasks I perform regularly: summarising a meeting, critiquing an article, and sizing a market. These are typical tasks for anyone in business.
To ensure a fair and comparable evaluation, I refrained from any prompt engineering and kept the prompts plain and simple, allowing each model's inherent capabilities to shine through.
Let's get into it!
Read Time: 15 minutes
Spotlight on Our Sponsor: MonsterONE
Dive into MonsterONE's comprehensive subscription service, where unlimited access to a vast array of digital assets awaits. With over 386,200 high-quality items, your projects will shine like never before.
MonsterONE Offers:
Website Assets: Themes, templates, plugins.
Graphic Designs: Icons, fonts, logos.
Audiovisuals: Music tracks, video templates.
3D Elements: Models, textures.
Benefits Include:
Weekly Updates: Fresh assets continuously.
Full Support: Comprehensive assistance.
Flexible Subscriptions: Plans to suit every need.
Designers, developers, marketers, and freelancers, take note. Enjoy unlimited downloads with MonsterONE. Save 20% with “AIGenesis” until April 30.
PROMPT 1: Summarise the key points, action items, and decisions made during the meeting: [MEETING TRANSCRIPT]
PROMPT 2: Provide a critique of this article: [ARTICLE]
PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.
And because I'm a lazy, biased human who thought it would be amusing to let the machines judge each other, I had each model anonymously rate its own responses as well as those of its rivals.
The results below are the average scores the LLMs gave to each other's (and their own) responses to each task. After all, why should I put in the effort to be objective when I can just sit back and let the systematic bias of internet data do the work for me?
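For anyone who wants to repeat the exercise, the averaging step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script behind the results: the model labels A, B, and C and every score in `example_ratings` are made-up placeholders, and the `include_self` switch is a hypothetical extra that shows how much a model's rating of its own answer moves the average.

```python
from statistics import mean


def average_scores(ratings, include_self=True):
    """Average the score each candidate receives across all judges.

    ratings[judge][candidate] is the score `judge` gave to `candidate`.
    With include_self=False, a model's rating of its own response is skipped.
    """
    candidates = {c for scores in ratings.values() for c in scores}
    averages = {}
    for candidate in sorted(candidates):
        received = [
            scores[candidate]
            for judge, scores in ratings.items()
            if candidate in scores and (include_self or judge != candidate)
        ]
        averages[candidate] = mean(received)
    return averages


# Placeholder scores out of 10 (hypothetical, NOT the newsletter's results).
example_ratings = {
    "A": {"A": 8, "B": 9, "C": 6},
    "B": {"A": 9, "B": 8, "C": 7},
    "C": {"A": 9, "B": 7, "C": 8},
}

with_self = average_scores(example_ratings)
without_self = average_scores(example_ratings, include_self=False)
print(with_self)
print(without_self)
```

Comparing the two dictionaries gives a rough sense of how much each model flatters (or undersells) itself.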
The verdict
The models have spoken, and the verdict is in: GPT-4 still reigns supreme in the AI realm, unless you need help with market sizing (or any similar analytical task), in which case you might want to look to Claude.
Yet, I can't resist injecting my own subjective bias into the mix!
Let's go through each task and analyse the models' responses and whether their ratings match my own human judgement.
PROMPT 1: Summarise the key points, action items and decisions made during the meeting: [MEETING TRANSCRIPT]
For the summarising task, Claude 3 and Geminiās ratings were generally consistent, favouring GPT-4; however, GPT-4 went against the grain and deemed Claude superior.
In my opinion, the summaries produced by Claude were far superior, allowing me to parse the information with greater ease. Gemini, however, fell short, omitting some crucial details, securing its place as the clear loser in this round. GPT-4's summary, while comprehensive, was a little too long and unnecessarily detailed for my liking.
GPT-4 example: “Wheel Tax Increase: Commissioner Adkins moved to increase the wheel tax by $10 to compensate for state cuts in education funding. After debate and a motion for the previous question (which passed 17-2), the increase was approved on first passage with a vote of 17-2.”
Claude, on the other hand, struck the perfect balance, including all the essential points while wisely leaving out minor, irrelevant details like the Commission's invitation to the “chilli supper” (which was in both GPT-4's and Gemini's summaries).
Claude 3 example: “Commissioner Adkins' resolution to increase the wheel tax by $10 to make up for the state cut in education funding passed on first passage with a vote of 17-2.”
So, in my humble and totally unbiased opinion, the winner of this showdown is none other than Claude 3, the Goldilocks of AI models. Not too much, not too little, but just right.
Tweet of the Week
Deepfake is getting out of hand
don't believe everything you see on social media
deepfakes can deceive you in ways
you can't even imagine
1. AI-generated Barack Obama has a message for you
AI Genesis (@AIGenesis_)
12:14 AM • Mar 29, 2024
PROMPT 2: Provide a critique of this article: [ARTICLE]
The models all agreed on this one: GPT-4 is the best. And I'm inclined to agree. It provides counter-arguments for specific claims in the article with much more nuance than Gemini and Claude. GPT-4's response demonstrates a deeper level of reflection and draws upon outside evidence to support its points, setting it apart from its competitors.
GPT-4 example: Addressing the Coherence of Change Theory
“The critique that degrowth lacks a coherent theory of change dismisses the movement's contribution to expanding the discourse on sustainable living and economic models. Degrowth scholars and activists often highlight the importance of local initiatives, community-led projects, and the relocalization of economies as tangible steps toward their broader goals. These approaches encourage a bottom-up model of change that can coexist with broader political and economic reforms. Furthermore, the emphasis on reducing consumption and enhancing well-being can inspire innovative policy solutions focused on sustainability and equity.”
Claude and Gemini, while not quite reaching the heights of GPT-4's nuanced analysis, still provide valuable and different paths of inquiry.
Claude 3 example: “Redefining growth: While the article advocates for reorienting growth patterns toward sustainability, degrowthers argue that the pursuit of endless economic growth is inherently unsustainable. They propose alternative measures of progress, such as the Genuine Progress Indicator, which account for social and environmental factors.”
Gemini Ultra example: “Focus on systemic change: The article criticizes degrowth for blaming "the system" and advocating for its abolition. However, degrowth might not solely target capitalism itself, but rather the specific features within the system (like hyper-consumerism and endless growth paradigms) that contribute to environmental degradation.”
All these models provide different flavours, and I could see myself attempting to use each for a brainstorming session or to receive feedback.
PROMPT 3: Estimate the total annual revenue generated by the sale of organic, non-GMO, gluten-free, vegan protein bars in the United States through online channels.
In the market sizing task, Claude 3 emerges victorious, and again I agree that it is the best. It provides a more in-depth analysis by starting with the US population and working its way up, although it could have improved further by considering gluten-free, vegan, or health-conscious demographics. GPT-4, in contrast, arbitrarily selects the total market size for protein bars as a starting point. But props to both models: their maths was correct.
Claude 3 example: Step 1: Estimate the number of regular consumers of the specific protein bars.
Number of consumers = 5% of the U.S. population
Number of consumers = 0.05 × 330,000,000 = 16,500,000
GPT-4 example: Assume the total market size for protein bars in the U.S. is $1 billion annually.
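Claude's first step is easy to reproduce, and the same chain of multiplications extends to a revenue figure. The sketch below is only an illustration: the population and the 5% share come from the quoted response, while the function names and every value passed to `annual_online_revenue` (bars per year, price per bar, online share) are hypothetical placeholders, since the later steps of the response aren't quoted here.

```python
def estimate_consumers(population, share):
    """Step 1: regular consumers as a share of the population."""
    return round(population * share)


def annual_online_revenue(consumers, bars_per_year, price_per_bar, online_share):
    """Chain the remaining factors of a bottom-up estimate."""
    return consumers * bars_per_year * price_per_bar * online_share


US_POPULATION = 330_000_000  # from the quoted Claude 3 response
CONSUMER_SHARE = 0.05        # from the quoted Claude 3 response

consumers = estimate_consumers(US_POPULATION, CONSUMER_SHARE)
print(f"{consumers:,}")  # 16,500,000

# Placeholder inputs (hypothetical, not taken from any model's answer).
revenue = annual_online_revenue(
    consumers,
    bars_per_year=12,
    price_per_bar=2.50,
    online_share=0.20,
)
print(f"${revenue:,.0f}")
```

Swapping in your own assumptions for the placeholders makes it obvious which input the final number is most sensitive to.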
Gemini, on the other hand, stubbornly insisted on using data from the web, despite my explicit instructions to avoid using it for a fair comparison. Even so, its estimates still didn't make sense, likely because it was fixated on using web sources that didn't work together cohesively. For example, it applied the 5% market share assumption it was making twice:
Gemini Ultra example: “Assuming a global protein bar market size of USD 4.54 billion in 2021, the projected plant-based segment share could be around USD 227 million (5% of the total market).
Applying the estimated 5% market share for your specific product category, the annual revenue through online channels in the US could be approximately USD 1.82 million (5% of USD 36.32 million).”
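A quick arithmetic check makes the problem concrete. The sketch below uses only the figures quoted above; the function and variable names are hypothetical. Each percentage is computed correctly on its own, but the second 5% is applied to USD 36.32 million, a base that the excerpt never derives and that is not the USD 227 million produced by the first step.

```python
def share_of(base, share):
    """Return `share` (as a fraction) of `base`."""
    return base * share


GLOBAL_MARKET = 4.54e9  # USD, quoted global protein bar market (2021)
SHARE = 0.05            # the 5% assumption, applied twice in the excerpt
SECOND_BASE = 36.32e6   # USD, quoted base for the second step (derivation not shown)

plant_based = share_of(GLOBAL_MARKET, SHARE)   # first quoted step
final_quoted = share_of(SECOND_BASE, SHARE)    # second quoted step
chained = share_of(plant_based, SHARE)         # what chaining the two steps would give

print(f"5% of 4.54 billion:  {plant_based / 1e6:.2f} million")
print(f"5% of 36.32 million: {final_quoted / 1e6:.2f} million")
print(f"5% of the first result: {chained / 1e6:.2f} million")
```

The first two lines match Gemini's numbers, while the third shows that the two steps don't follow from one another.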
Ironically, Gemini rated itself the best performer, even though its overly conservative output did more harm than good.
Final thoughts
Overall, it's a toss-up between GPT-4 and Claude 3 for output quality, but other factors play a significant role. For one, ChatGPT is quite verbose in its responses, with an average word count of 437 compared to Claude's 290 and Gemini's 315. I have to read over 100 extra words whenever GPT-4 responds, which is made even worse by the fact that GPT-4, even the Turbo version, produces responses at a snail's pace.
In contrast, Gemini is exceptionally quick, making it my go-to choice for immediate assistance when I'm stuck on a thought or phrasing. If I require a more thoughtful response or one that is more statistically representative of the internet data corpus, I turn to Claude 3.
Source: Artificial Analysis
Despite its shortcomings, GPT-4 still has its merits, particularly in its UX features. The code interpreter and image generator, which its competitors currently lack, give it an edge in certain use cases. But this hasn't stopped me from slipping away to its rivals. The king is dead, at least for now.
Now, I want to hear from you:
Which models have you tried out so far? Do you use specific models for certain tasks?
What response biases have you noticed in the models? Are they different depending on the model?
AI Art of the Week
a woman with her head covered in water, in the style of art nouveau-inspired illustrations, colorful collage, intricate floral arrangements, made of crystals, susan seddon boulet, symmetrical composition --s 750 --ar 55:86 --v 6.0
Thank you for reading today's edition.
If you enjoyed this, please help spread the love by forwarding this newsletter to a friend or colleague.
Here's how I can help you:
Get your product in front of 3500+ solopreneurs, business owners, and professionals. Sponsor this newsletter here.
I hope to see you in the next one!