Tuesday, December 9, 2025

I Had Three Papers Rejected By A Conference And I Couldn't Be Happier!

The Agents4Science Conference was unique, and I was excited to be a part of it.  Put together by Dr. James Zou from Stanford's Computer Science department, this academic conference attracted worldwide attention.  While the vast majority of the papers came from the computer and data sciences, eight other primary research domains and 26 sub-domains, such as the social sciences, astronomy, economics, and psychology, were also represented at this first-of-its-kind event.

Three reviewers evaluated every paper on a scale from 1 (strong reject) to 6 (strong accept).  In the end, authors submitted a total of 247 papers, of which the conference organizers accepted 47 (19%).  That is actually more competitive than is typical of top computer science conferences like NeurIPS, where acceptance rates normally run closer to 25%.

But what was so special about this particular conference and why did it attract so much attention?  One main reason:  It was the first conference where artificial intelligences wrote all the papers.  

This, of course, was the organizers' intent.  As they explained on the conference's main website, "This inaugural conference explores if and how AI can independently generate novel scientific insights, hypotheses, and methodologies while maintaining quality through AI-driven peer review. Agents4Science is the first venue where AI authorship is not only allowed but required..."

In short, all of the papers were mostly written (more on that in a second) by AIs, and all had a "first pass" review by three AI "reviewers": Anthropic's Claude, Google's Gemini, and OpenAI's ChatGPT.  Each AI was given the same detailed prompt to act as an expert reviewer for the conference.  Human subject matter experts also reviewed 80 of the top papers (including all 47 of the accepted papers) to double-check the AI reviewers' assessments.

The conference organizers not only evaluated the submitted papers for quality, they also had an explicit method for evaluating where, and how much of, the work was done by the AIs.  Specifically, they evaluated levels of autonomy across four categories (hypothesis development, experimental design, data analysis, and writing) on a combined 0-12 scale.  Only 54 papers (22%) scored a perfect 12, i.e., done entirely by an AI, but the majority (71%) scored an 8 or above (unsurprising given the nature of the conference).

All the background and statistics aside, putting on a new conference like this is never easy, but the organizers of Agents4Science did an outstanding job.  I suspect from their point of view it was chaotic and, to paraphrase Wellington, "a near-run thing," but from the perspective of a contributor and attendee, it was virtually flawless.  I look forward to contributing again next year.

What were my three submissions about?

Eureka moments make for good science history.  Whether it is Archimedes jumping out of his tub and running naked through the streets, Einstein's teenage musings about what it might mean to catch a ray of light, or Franklin's Photograph 51 showing the unmistakable double helix of DNA, these sudden flashes of insight can change the world.

My insights?  Not so much.

What I have is probably best (and kindly) described as "shower thoughts" or (more derisively) as "being blessed by the Good Idea Fairy."  That's not what it feels like to me, though.  It feels like my brain has caught on fire, that I have to do something with an idea before I lose it.  

I saw this conference, therefore, as an opportunity to take three of my best brain-fires, polish them up, and submit them just for the hell of it.  Along the way, I wanted to test two ideas of my own.

First, I am a big believer in the Medici Effect, the idea that the most innovative thinking occurs when disciplines intersect.  Richard Feynman, for example, is not a hero to me for his work in physics but for the fact that he jumped feet first into biology and made significant early contributions to both biophysics and nanotechnology as a result.  Across my career, I have found that trying to mash together two disparate disciplines yields success if I am right and learning if I fail.

Thus, the three papers I submitted all tried this approach:

  • "Ramsey-Inspired Environmental Connectivity as a Driver of Early Universe Star Formation Efficiency: An AI-Led Theoretical Investigation."  I was vaguely aware that the James Webb telescope was seeing stars and galaxies in the early universe that weren't supposed to be there if our current theories were correct.  I was also vaguely aware of something called Ramsey Theory, which is a graph theory offshoot that proves that in a large enough network of anything, patterns will emerge "for free."  If you think of the early universe as a network of particles, couldn't Ramsey Theory explain at least some of the early clustering James Webb is seeing?

  • "From C. elegans to ChatGPT: Quantifying Variability Across Biological and Artificial Intelligence."  A really cool paper from Jason Moore at NYU highlighted "specific circuits and neurons dedicated to introducing noise and/or variability" and hypothesized that "there might exist an ideal noise variance level for optimal control performance."  This sounded to me a lot like the notion of "temperature" in Large Language Models.  Could LLMs and the brain both be using similar mechanisms to optimize variability?

  • "Fractal-ish Complexity for Regulations: A Practitioner-Ready, Agentic Benchmark."  I don't know a lot about fractals but I do know that they are self-similar, which means they have the same level of complexity at different scales.  This level of complexity, in turn, is measured using a "fractal dimension."  Regulations have, by design, a self similar structure with paragraphs, sub-paragraphs, etc.  Could you determine the complexity of a regulation by calculating its fractal dimension?

How did my submissions do?


Average scores across all 247 papers ranged from a low of 1 (strong reject) to a high of 5 (accept).  While some papers received a 6 (strong accept) from an individual AI reviewer, no paper received more than a 5 from a human reviewer or an average higher than 5 from all three AI reviewers.  This is unsurprising, as the standards for a 6 are quite high.  Agents4Science used the scoring criteria from the prestigious NeurIPS Conference, specifically:

  • 6: Strong Accept: Technically flawless paper with groundbreaking impact on one or more areas of AI, with exceptionally strong evaluation, reproducibility, and resources, and no unaddressed ethical considerations.

  • 5: Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.

  • 4: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.

  • 3: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.

  • 2: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.

  • 1: Strong Reject: For instance, a paper with well-known results or unaddressed ethical considerations.

The average AI reviewer score across all papers was 3.18; for accepted papers it was 4.26.  The human subject matter expert scores, with a few significant exceptions, tracked the AI reviewer scores pretty closely (the average human reviewer score for accepted papers was 3.88, versus 4.26 from the AI reviewers).  The cutoff for human review seemed to be an average AI reviewer score of 4 or above.


My scores (see table below) were tantalizingly close.  My top paper (tied for 81st place with many others) on the fractal dimension of regulations fell one point short of that 4 average.



My second paper, on AI and biological strategies for optimizing variation, was two points short of the magical 4 average but did receive a 6 from Gemini.  Gemini was the most liberal of the AI reviewers, with an average score of 4.25 across all papers.  ChatGPT was the most conservative, with an average score of 2.3 and no score higher than 4.  Claude was more or less in the middle, with an average of 3 and no 6's but a number of 5's.


The AI reviewers also provided narrative feedback.  The links in the table above take you directly to that feedback so you can review it for yourself.  In all, it was what I have come to expect from reviewers (both AI and human).  Some of the AIs loved one aspect of a paper, such as Gemini discussing the C. elegans paper: "This is an outstanding conceptual paper. It is elegant, insightful, and impeccably presented," while ChatGPT criticized the same paper for the exact same quality: "However, the work is almost entirely conceptual, lacking empirical analysis, with synthetic figures and coarse estimation methods on both the LLM and biological sides."  Go figure.


In general, though, the AI reviewers praised the hypotheses with language like "a creative and cross-disciplinary hypothesis" or "presents a novel and practical tool for a real-world problem," and then chided my AIs for the lack of empirical evidence in the papers with scolding language such as, "Limited experimental rigor: small N (120), no multi-seed robustness, no parameter sensitivity (θ, shell radii, background distribution), no ablation against alternative generative hypotheses."  Riiiiiight.  My AIs kept telling me we needed to go to CERN and run a few experiments before we submitted.  If only I'd listened...


On the whole, though, I thought the comments were fair.  From my experience with conferences and reviewers, a certain degree of disagreement is almost inevitable.  You almost never get a completely clean sheet, and sometimes the comments are all over the map.


The comments regarding the lack of empirical testing were particularly on point, though.  That lack was a conscious decision on my part.  I am not wired into any of the research communities I addressed in my papers and knew that getting any new evidence would be impossible in the time we had.  I told my AIs we had to repurpose data that was already available and do the best we could.  In total, then, this experience was not much different from what I might expect at any high-quality conference.  There were only two places that caused a (slightly) raised eyebrow: the autonomy scores and the Primary and Secondary Topic designations.


The autonomy scores were just weird (see the table above for mine).  Levels of autonomy were largely self-reported at submission, with the numerical scores assigned later by the conference organizers.  Moreover, there doesn't appear to be any real correlation between the level of autonomy and whether a paper was accepted or rejected, which seemed odd given the purpose of the conference, but maybe I just missed it.


I submitted the exact same thing regarding autonomy for all three papers (or so I thought), yet my scores, as you can see, came out differently.  Basically, I came up with the hypothesis in each paper but depended on the AIs to do an awful lot of the heavy lifting after that.  Working with the AIs (and I used several) felt similar to working with a graduate student on their thesis or dissertation.  Had the conference organizers offered another category in their "Autonomy Score" criteria for something like process management, coaching, or "stick and rudder guidance," I would have indicated maximum human involvement.


As for the topic designation, there is no question that this conference was dominated by computer and data science papers.  83% of the accepted papers came from the Computer and Data Sciences, and 72% of all papers had Computer and Data Science as their primary discipline.  This was sort of to be expected: AI lies squarely in the computer and data science field, the conference was sponsored by a computer science department at a major university, and the lead organizer was a computer science professor.  That said, the rest of the disciplines were scattered about like snack trays.  In my categories, for example, there were only 13 papers in the natural sciences (including my Ramsey Theory paper), only 11 "interdisciplinary" papers (the C. elegans paper), and mine was the one and only Law, Policy and Business paper (the fractal/regulations paper).


I am not alleging anything nefarious here, of course.  But I do think the prompt given to the AI reviewers may have contained some implicit bias towards computer science.  I don't have a lot of evidence but narrative comments like, "Overall, this is a competent but limited technical contribution, more suited to legal informatics than AI agents for science, with excellent transparency but falling short in impact and novelty for a top-tier venue" make me wonder.


In all, however, I have no regrets.  The hypotheses in each paper, my brain fires, were universally considered innovative even by the AI reviewers that pilloried the research itself.  My first idea, that AIs could be fair judges of novel hypotheses emerging from the cracks between disciplines, seemed supported.  As for the rest of it, I did not expect much from three papers covering six disciplines, none of which I knew much about.  


Which brings me to the second idea I wanted to test.


Why am I so happy about all this?


I attended the virtual conference where the top three papers received their awards and where 11 other "spotlight" papers had speaking slots.  Most of these papers were backed by teams of researchers, many with already lengthy lists of publications in their career fields.  While I did not have time to check every paper, my general impression was that these were papers submitted by people who certainly had the credentials to do the work themselves but, like me, were exploring how far they could go in getting the AI to do the work for them.


Unlike me, however, these people were experts (or at least knowledgeable) in their fields.  I was not.  I know nothing about advanced mathematics or deep space cosmology, neural spike activation in C. elegans or temperature settings in LLMs.  While some might say I have some experience with regulations given my background in law, anyone who knows me would go, "Regulations?  Him?  Not so much."  The same is true of complexity theory and fractals.  


No, I wanted to do a field test of another bit of research I have been working on for quite some time:  How do you ask a good question?


Since the AI revolution is driving down the cost of getting good (or good enough) answers, it seems to me that asking good questions--the right question at the right time--is going to become the essential human contribution to the research equation.  While we academics have always talked a good game about "teaching students to ask the right questions," our means of evaluating how well students have learned this skill has always been indirect.  In other words, we have always looked at the output the students produce and, if it (the test, the paper, the thesis) is good enough, we have assumed that the input (the questions the students asked to get that output) must have been the right one.  We almost never directly evaluate the questioning process itself.


I think those days are over.


This means we have to figure out how to examine, in detail, a student's questioning process itself, how to map the many equifinal paths to "right," and, finally, what to say about what went wrong, why it went wrong, and how to fix it.  It also means we have to come up with something more than a "because I said so" rubric.  That rubric needs, at a minimum, to find the intersections where questioning traditions as varied as the Socratic method, West African griots, and Zen koans find common ground.  It also needs to include the science of questions, including topics like erotetic theory, best practices in heutagogy, and, lest we forget, the scientific method itself.  And that is just a start.


This has been the subject of my sabbatical this year and I have found myself increasingly using my "Ecology of Questions" framework to help think through the Volatile, Uncertain, Complex, and Ambiguous (VUCA) environment we all live in these days.  


I think, in short, that the other contributors to Agents4Science were already capable of producing "5" level papers or better and wanted to show that they could get AIs to produce papers of a similar quality.  I, on the other hand, wanted to start with "0" quality papers and see how far up the ladder I could climb just using a new way of thinking about questions.  I'm happy because it worked, for these three papers at least, better than I had any right to expect.


My research is far from done and this field test doesn't prove anything definitively.  But it does give me hope, hope that I am onto something that will not only help us think through the wicked problems of a VUCA world but something that validates the essential contributions of humans in the rapidly advancing age of AI.


Tuesday, July 30, 2024

Center Of Mass (Or How To Think Strategically About Generative AI)

It may seem like generative AI is moving too fast right now for cogent strategic thinking.  At the edges of it, that is probably right.  Those "up in the high country," as Lloyd Bridges might put it (see clip below), are dealing with incalculably difficult technical and ethical challenges and opportunities as each new version of Claude, ChatGPT, Gemini, Llama, or other foundational large language model tries to outperform yesterday's release.

 

That said, while all this churn and hype is very real at the margins, I have seen a fairly stable center start to emerge since November 2022, when ChatGPT was first released.  What do I mean, then, by "a fairly stable center"?

For the last 20 months, my students, colleagues, and I have been using a wide variety of generative AI models on all sorts of problems.  Much of this effort has been exploratory, designed to test these tools against realistic, if not real, problems.  Some of it has been real, though: double-checked and verified products for real people.

It has never been standalone, however.  No one in the center of mass is ready or comfortable completely turning over anything but scut work to the AIs.  In short, anyone who uses a commercially available AI on a regular basis to do regular work rapidly comes to see these tools as useful assistants: unable to do most work unsupervised, but of enormous benefit otherwise.

What else have I learned over the last 20 months? 

As I look at much of what I have written recently, it has almost all been about generative AI and how to think about it.  My target audience has always been regular people looking for an edge in doing regular work--the center of mass.  My goal has been to find the universals--the things that I think are common to a "normal" experience with generative AI.  I don't want to trivialize the legitimate concerns about what generative AIs might be able to do in the future, nor to suggest I have some sort of deep technical insights into how it all works or how to make it better.  I do want to understand, at scale, what it might be good for today and how best to think about it strategically.

My sources of information include my own day-to-day experience of the grind with and without generative AI.  I can supplement that with the experiences of dozens of students and my faculty colleagues (as well as with what little research is currently available).  Altogether, we think we have learned a lot of "big picture" lessons.  Seven, to be exact:
  1. Generative AI is neither a savior nor Satan.  Most people start out in one of these two camps.  The more you play around with generative AIs, the more you realize that both points of view are wrong and that the truth is more nuanced.
  2. Generative AI is so fast it fools you into thinking it is better than it is.  Generative AI is blindingly fast.  A study done last year using writing tasks for midlevel professionals found that participants were 40% faster at completing the task when they used the then-current version of ChatGPT.  Once they got past the awe they felt at the speed of the response, however, most of my students said the quality of the output was little better than average.  The same study found a similar result: speed improved 40%, but the average quality of the writing improved only 18%.
  3. Generative AI is better at form than content.  Content is what you want to say; form is how you want to say it.  Form can be vastly more important than content if the goal is to communicate effectively.  You'd probably explain Keynesian economics to middle-schoolers differently than you would to PhD candidates, for example.  Generative AI generally excels at re-packaging content from one form to another.
  4. Generative AI works best if you already know your stuff.  Generative AI is pretty good and it is getting better fast.  But it does make mistakes.  Sometimes it is just plain wrong and sometimes it makes stuff up.  If you know your discipline already, most of these errors are easy to spot and correct.  If you don't know your discipline already, then you are swimming at your own risk.
  5. Good questions are becoming more valuable than good answers.  In terms of absolute cost to an individual user, generative AI is pretty cheap, and the cost of a good or good enough answer is plummeting as a result.  This, in turn, implies that the value of a good question is going up.  Figuring out how to ask better questions at scale is one largely unexplored way to get a lot more out of a generative AI investment.
  6. Yesterday's philosophy is tomorrow's AI safeguard.  AI is good at some ethical issues, lousy at others (and is a terrible forecaster).  A broad understanding of a couple thousand years of philosophical thinking about right and wrong can actually help you navigate these waters.
  7. There is a difference between intelligence and wisdom.  There is a growing body of researchers who are looking beyond the current fascination with artificial intelligence and towards what some of them are calling "artificial wisdom."  This difference--between intelligence and wisdom--is a useful distinction that captures much of the strategic unease with current generative AIs in a single word.
These "universals" have all held up pretty well since I first started formulating them a little over a year ago.  While I am certain they will change over time and that I might not be able to attest to any of them this time next year, right now they represent useful starting points for a wide variety of strategic thought exercises about generative AIs.

Monday, July 8, 2024

How Good AIs Make Tough Choices

Rushworth Kidder, the ethicist, died 12 years ago. I never met him, but his book "How Good People Make Tough Choices" left a mark. It was required reading in many of my classes, and I still think it is the best book available on the application of philosophy to the moral problems of today.  

Why?  For a start, it is well-organized and easy to read.  Most importantly, though, it doesn't get lost in the back-and-forth that plagues some philosophical discussions.  Instead, it tries to provide a modicum of useful structure to help normal people make hard decisions.  In the tradition of some of the earliest philosophers, it is about the application of philosophical thinking to everyday life, not about abstract theorizing.

Don't get me wrong.  I am not against abstract theorizing.  I'm a futurist; speculation masquerading as analysis is what I do for a living, after all.  It is just that, at some point, we are all faced with tough decisions, and we can either let the wisdom of hundreds of philosophers across thousands of years inform them or we can go on instinct.  William Irvine put the consequences even more directly:

"Why is it important to have such a philosophy? Because without one, there is a danger that you will mislive—that despite all your activity, despite all the pleasant diversions you might have enjoyed while alive, you will end up living a bad life. There is, in other words, a danger that when you are on your deathbed, you will look back and realize that you wasted your one chance at living."

One of the most common questions I get asked these days sits at the intersection of the "tough choices" Kidder was talking about and artificial intelligence.  There is a lot of (justifiable) hand-wringing over what we can, and should, turn over to AIs on the one hand, and what the consequences are of not turning over enough to the AIs on the other.

For me, these questions begin with another: What can AIs do already?  In other words, where can AIs clearly outperform humans today?  Fortunately, Stanford collates exactly these kinds of results in an annual AI Index (Note: They don't just collate them, they also put them in plain English with clear charts--well done, Stanford!).  The results are summarized in the table below:

Items in dark red are where AIs have already surpassed humans.  The light red is where there is evidence that AIs will surpass humans soon.  This table was put together with help from Claude 3, the AI I think does the best job of reading papers.  I spot-checked a number of the results and they were accurate, but your mileage may vary.  The estimated time to surpass humans is all Claude, but the time frames seem reasonable to me as well.  If you want the full details, you should check out the Stanford AI Index, which you should do even if you don't want the full details.

The most interesting row (for this post, at least) is the "Moral Reasoning" row.  Here there is a new benchmark, the MoCa benchmark for moral reasoning.  The index highlighted the emergence of harder benchmarks over the last year, stating, "AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more challenging ones."  In other words, AIs were getting so good, so fast that researchers had to come up with a whole slew of new tests for them to take, including the MoCa benchmark.

MoCa is a clever little benchmark that uses moral and causal challenges from existing cognitive science papers where humans tended to agree on factors and outcomes.  The authors of the paper then presented these same challenges to a wide variety of AIs and scored the AIs based on something called "discrete agreement" with human judges.  Discrete agreement appears, by the way, to be the scientific name for just plain "agreement"--go figure.  The chart below is from the AI Index, not the original paper, but summarizes the results:

From the Stanford AI Index.  Scores are from 0-100 with higher scores equaling higher agreement with human judgement.  
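
For the curious, here is a toy sketch of what an agreement-style score like MoCa's "discrete agreement" might compute: the fraction of scenarios where the model's discrete choice matches the majority human judgment.  The data below are invented, and the benchmark's actual scoring pipeline is more involved; this is only to make the idea concrete.

```python
# Toy illustration of an agreement-style score: the fraction of scenarios
# where the model's discrete choice matches the majority human judgment.
# (Hypothetical data; the real benchmark's scoring may differ.)

# 1 = "morally permissible", 0 = "not permissible".
human_majority = [1, 0, 0, 1, 1, 0, 1, 0]
model_choice   = [1, 0, 1, 1, 0, 0, 1, 0]

matches = sum(h == m for h, m in zip(human_majority, model_choice))
agreement = 100 * matches / len(human_majority)
print(f"Discrete agreement: {agreement:.0f} / 100")  # 75 / 100 with this data
```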

If you are scoring things at home, this chart makes the AIs look pretty good until you realize that the y-axis doesn't include the full range of possible values (a little data-viz sleight of hand there...).  This sort of professorial nit-picking might not matter, though.  That was a study published in late 2023, and there is already a 2024 study out of the University of North Carolina and the Allen Institute that shows significant improvement--albeit on a different benchmark and with a new LLM.  Specifically, the researchers found "that advice from GPT-4o is rated as more moral, trustworthy, thoughtful, and correct than that of the popular The New York Times advice column, The Ethicist."  See the full chart from the paper below:

Taken from "Large Language Models as Moral Experts? GPT-4o Outperforms Expert Ethicist in Providing Moral Guidance" in pre-print here:  https://europepmc.org/article/PPR/PPR859558 

While these results suggest improvement as models get larger and more sophisticated, I don't think I would be ready anytime soon to turn over to the AIs the moral authority for the kinds of complex, time-sensitive, and often deadly decisions that military professionals routinely have to make.

OK.  

Stop reading now.

Take a breath.

(I am trying to keep you from jumping to a conclusion.)  

As you read the paragraph above (the one that begins, "While these results..."), you probably thought one of two things.  Some of you may have thought, "Yeah, the AIs aren't ready now, but they will be and soon.  It's inevitable."  Others of you may have thought, "Never.  It will never happen.  AIs simply cannot replace humans for these kinds of complex moral decisions."  Both positions have good arguments in favor of them.  Both positions also suffer from some major weaknesses.  In classic Kidder-ian fashion, I want to offer you a third way--a more nuanced way--out of this dilemma.

Kidder called this "third way forward, a middle ground between two seemingly implacable alternatives" a trilemma.  He felt that taking the time to try to re-frame problems as trilemmas was an enormously useful way to help solve them. It was about stepping back long enough to imagine a new way forward.  The role of the process, he said, "is not always to determine which of two courses to take. It is sometimes to let the mind work long enough to uncover a third."

What is this third way? Once again, Kidder comes in handy.  He outlined three broad approaches to moral questions:

  • Rules-based thinking (e.g. Kant and the deontologists, etc.)
  • Ends-based thinking (e.g. Bentham and the utilitarians, etc.)
  • Care-based thinking (e.g. The Golden Rule and virtually every religion in the world)
Each of these ways of looking at moral dilemmas intersects with AIs and humans in different ways.

AI is already extremely good at rules-based thinking, for example.  We see this in instances as trivial as programs that play chess and Go, and we see it in military systems as sophisticated as Patriot and Phalanx.  If we can define a comprehensive rule set (a big "if") that reliably generates fair and good outcomes, then machines likely can and should be allowed to operate independently.

Ends-based thinking, on the other hand, requires machines to be able to reliably forecast outcomes derived from actions, including second, third, fourth, etc. order consequences.  Complexity Theory (specifically the concept of sensitive dependence on initial conditions) suggests that perfect forecasting is a mathematical impossibility, at least in complex scenarios.  Beyond the math, practical experience indicates that perfection in forecasting is an unrealistic standard.  All this, in turn, suggests that the standard for a machine cannot be perfection.  Rather, it should be “Can it do the job better than a human?”
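
A standard toy example (not from Kidder or the studies discussed here) makes the "sensitive dependence" point vivid: in the logistic map, two starting values that differ by one part in a billion diverge completely within a few dozen steps, which is why long-range point forecasts of complex systems break down.

```python
# Minimal illustration of sensitive dependence on initial conditions using
# the logistic map, a classic chaotic system (a toy example, not the post's).
def logistic(x, r=4.0):
    """One step of the logistic map with r = 4 (fully chaotic regime)."""
    return r * x * (1.0 - x)

x, y = 0.400000000, 0.400000001   # nearly identical starting conditions
for step in range(1, 51):
    x, y = logistic(x), logistic(y)
    if step % 10 == 0:
        print(f"step {step:2d}: x={x:.6f}  y={y:.6f}  gap={abs(x - y):.6f}")
# Within a few dozen steps the two trajectories are effectively unrelated.
```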

The “Can the machine do the job better than a human?” question is actually composed of at least three different sub-questions:
  • Can the machine do the job better than all humans?  An appropriate standard for zero-defect environments.
  • Can the machine do the job better than the best humans?  An appropriate standard for environments where there is irreducible uncertainty.
  • Can the machine do the job better than most humans?  A standard that is appropriate where solutions need to be implemented at scale.
If "the job" we are talking about is forecasting, in turns out that the answer, currently, is: Not so much. Philipp Schoenegger, from the London School of Economics, and Peter Park from MIT recently posted a paper to ArXiv where they showed the results of entering GPT-4 into a series of forecasting challenges on Metaculus. For those unfamiliar with Metaculus, it is a public prediction market that looks to crowdsource answers to questions such as Will the People's Republic of China control at least half of Taiwan before 2050? or Will there be Human-machine intelligence parity before 2040?

The results of the study? Here, I'll let them tell you:
"Our findings from entering GPT-4 into a real-world forecasting tournament on the Metaculus platform suggest that even this state-of-the-art LLM has unimpressive forecasting capabilities. Despite being prompted with established superforecasting techniques and best-practice prompting approaches, GPT-4 was heavily outperformed by the forecasts of the human crowd, and did not even outperform a no-information baseline of predicting 50% on every question."

Ouch.
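
To make "did not even outperform a no-information baseline of predicting 50% on every question" concrete, here is a hedged sketch using the Brier score, a standard way to score probabilistic forecasts (lower is better).  The paper itself may report different metrics, and the numbers below are invented purely for illustration.

```python
# Hedged sketch: comparing hypothetical forecasts against the "no-information"
# baseline of predicting 50% on every question, using the Brier score
# (mean squared error of the forecast probabilities; lower is better).
def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

outcomes        = [1, 0, 0, 1, 0]             # how the questions resolved
model_forecasts = [0.8, 0.7, 0.6, 0.3, 0.6]   # hypothetical probabilities
baseline        = [0.5] * len(outcomes)       # "predict 50% on everything"

print(f"Model Brier score:    {brier(model_forecasts, outcomes):.3f}")  # 0.348
print(f"Baseline Brier score: {brier(baseline, outcomes):.3f}")         # 0.250
# A forecaster who cannot beat the 0.25 baseline is adding no information.
```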

Ends-based thinking is very much a part of most military decisions.  If AIs don't forecast well, and ends-based thinking requires good forecasting skills, then it might be tempting to write AIs off, at least for now.  The trilemma approach helps us out in this situation as well, however.  Powerful stories of hybrid human/machine teams accomplishing more than machines or humans alone are starting to appear.  As more and more of these stories accumulate, it should be possible to detect the "golden threads," the key factors that allow humans and machines to optimally integrate.

Finally, Kidder defined care-based thinking as “putting love for others first.”  It is here that machines are at their weakest against humans.  There are no benchmarks (yet) for concepts such as “care” and “love.”  Furthermore, no one seems to expect these kinds of true feelings from an AI anytime soon.  Likewise, care-based thinking requires a deep and intuitive understanding of the multitude of networks in which all humans find themselves embedded.  

While the machines have no true ability to demonstrate love or compassion, they can simulate these emotions quite readily.  Whether it is because of anthropomorphic bias, the loneliness epidemic, or other factors, humans can and do fall in love with AIs regularly.  This tendency turns the AIs' weakness into a strength in the hands of a bad faith actor.  AIs optimized to elicit sensitive information from unsuspecting people are likely already available or will be soon.

Beyond the three ways of thinking about moral problems, Kidder went on to define four scenarios that are particularly difficult for humans and are likely to be equally challenging for AIs. Kidder refers to these as “right vs right” scenarios, “genuine dilemmas precisely because each side is firmly rooted in one of our basic, core values.” They include:
  • Truth vs. loyalty
  • Individual vs. community
  • Short-term vs. long-term
  • Justice vs. mercy
Resolving these kinds of dilemmas involves more than just intelligence.  These kinds of problems seem to require a different characteristic--wisdom--and wisdom, like intelligence, can, theoretically at least, be artificial.

Artificial Wisdom is a relatively new field (almost 75% of the articles in Google Scholar that mention Artificial Wisdom have been written since 2020).  The impetus behind this research seems to be a genuine concern that intelligence is not sufficient for the challenges that face humanity.  As Jeste et al. put it, "The term 'intelligence' does not best represent the technological needs of advancing society, because it is 'wisdom', rather than intelligence, that is associated with greater well-being, happiness, health, and perhaps even longevity of the individual and the society."

I have written about artificial wisdom elsewhere and I still think it is a useful way to think about the problem of morality and AIs. For leaders, "wisdom" is a useful shorthand for communicating many of the concerns they have about turning operations, particularly strategic operations, over to AIs. I think it is equally useful for software developers, however. Wisdom, conceptually, is very different from intelligence but no less desirable. Using the deep literature about wisdom to help reframe problems will likely lead to novel and useful solutions.