Filed as (LLMs will always fail on clearly identifiable classes of problems)
You asked a question about a subject that has a large number of fairly consistent copies on the Internet. I know much of the Internet and STEMC, so it is easy to predict where the LLMs fail. OpenAI, Gemini, Grok, and Copilot all fail on harder problems, and OpenAI and Copilot always fail when asked to do problems that involve scientific notation, many unit conversions, division of values in scientific notation, comparison of sizes, and many other simple but clear tests. I checked: hundreds of long conversations and tests.
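To make that class of failures concrete, here is a minimal sketch, assuming only the Python standard library, of the kinds of tests named above done in software, where the answer is deterministic: division of values in scientific notation, a unit conversion, and a comparison of sizes. The specific numbers are illustrative, not taken from the tests mentioned.

from decimal import Decimal, getcontext

getcontext().prec = 30  # enough significant digits for these examples

# Division of values in scientific notation
a = Decimal("6.022e23")
b = Decimal("1.602e-19")
print(a / b)            # 3.759...E+42, the same answer on every run

# Unit conversion: 90 km/h to m/s (times 1000 m/km, divided by 3600 s/h)
print(Decimal("90") * 1000 / 3600)   # 25

# Comparison of sizes across exponents
x, y = Decimal("3.2e-5"), Decimal("5.1e-6")
print(x > y)            # True: 3.2e-5 is larger, even though 5.1 > 3.2

Software gets these right every time; a token predictor trained on web copies does not.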
The best solution is to use LLMs to handle human languages, NOT to have them invent methods from examples found on the web; require them to use a carefully selected set of methods for STEMC problems, standardize the tokens so they are real and translatable to all human languages, have them share conversations in global open formats, and have them use computer software themselves rather than making the humans do it. Standardize extension methods, standardize interfaces, and always have feedback and a “report this AI” option.
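As one illustration of “have them use computer software themselves,” here is a minimal sketch of the delegation pattern: the model translates the human question into a standardized request, and a vetted method computes the answer. Every name here (solve_stemc, TOOLS, the request format) is hypothetical, invented for this sketch; it is not any vendor’s interface.

from decimal import Decimal
from typing import Callable, Dict

# Each entry is a carefully selected, testable method,
# not one the model improvised from web examples.
TOOLS: Dict[str, Callable[..., Decimal]] = {
    "divide": lambda a, b: Decimal(a) / Decimal(b),
    "km_per_h_to_m_per_s": lambda v: Decimal(v) * 1000 / 3600,
}

def solve_stemc(request: dict) -> Decimal:
    """Dispatch a standardized request to a vetted method.

    request = {"method": <registered name>, "args": [<values as strings>]}
    Unknown methods raise instead of guessing: that is the feedback
    and "report this AI" path, not silent invention.
    """
    method = TOOLS.get(request["method"])
    if method is None:
        raise ValueError(f"no vetted method named {request['method']!r}")
    return method(*request["args"])

# The LLM would emit a request like this after reading the user's question;
# the numeric answer comes from software, not from token prediction.
print(solve_stemc({"method": "divide", "args": ["6.022e23", "1.602e-19"]}))

The point of the registry is that the methods are fixed, reviewed, and shared in open formats, so every AI that speaks the standardized request format gives the same answer.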
Because LLMs are so badly trained from shallow skims of the Internet, there are many easily identifiable problems they will never get right. But with a global effort where many players work together, it is possible. True machine intelligence is possible. It is not that hard now. But each commercial group insisting on its own methods, not collaborating with others, and not accepting collaboration with and oversight by users, will not work.
Richard Collins, The Internet Foundation