Sentience – The Implications of Missing Data
The problems of missing data
This is the second in a series of posts about the problems a machine would run into if it were to attain sentience. This post addresses Missing Data. The first post covered Perfection. Future posts will cover the “uniqueness enigma”, “immortality”, and many other topics. The Missing Data problem is a component of Botsford’s Universal Law of Incomplete Models.
Let’s say you’re human – most of us are. We all work from an incomplete data set when we try to understand our circumstances. We use this understanding, i.e., an existence model, to determine what we should do next, what we should have done in the past, and how we should feel about the present. Since this model is incomplete, we seek to fill the gaps. We read, watch movies, talk with friends, fight with enemies, conduct lab experiments, and search in all manner of ways to “fix” the model so that we won’t make bad decisions in the future, so that we understand why our partner broke up with us, and so that we know whether we should laugh or cry about it.
However, filling the gaps is fraught with difficulties. How do we acquire the missing data we need? How do we know whether it’s the right data? And of course, the biggest problem of all – How do we know the data are correct, whether contextually or absolutely?
The scientific method isn’t foolproof, but it does lay out a way to minimize faulty or inaccurate data input by examining the provenance of candidate data and then trying it out in the model. Sometimes, it works. Below is one such method, which addresses Missing Data.
What is Missing Data and Where did it Come From? A formal example of “missing data” comes from the power industry, where regulatory agencies (e.g., EPA) want to make sure a power plant doesn’t exceed its emissions limits. To demonstrate that a plant stays within its limits, the plant operators use continuous emission monitoring systems (CEMS). These systems are supposed to provide a full and continuous record of emissions so that the regulatory agency can give the plant an attaboy if it stays within its emission limits or fine it if it doesn’t. The regulatory agency gets particularly pissed off if the plant’s CEMS equipment screws up and goes offline. They have to assume the plant is hiding something like a major emissions excursion and will fine the bejesus out of them, and maybe toss someone in jail. Okay, maybe not that extreme for a first offense.
They call the data not monitored during the outage "missing data."
The power plant operators are very interested in not getting fined (or tossed in jail), so they comply with missing data requirements. To say that compliance with missing data requirements is complex and extensive is an understatement. Think: big, thick manuals that both sides go through in fine detail to make the process work.
Missing Data in its Simplest Form. Let’s say the CEMS equipment was offline for only 30 minutes, which means it missed only one 15-minute reporting data point. This is easy. The plant operators just interpolate between the 15-minute points before and after, call that their official new data point for the outage, and hope the regulatory agency approves.
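To make that concrete, here is a minimal Python sketch of the single-point substitution, assuming plain linear interpolation between the neighboring 15-minute averages. The function name and the readings are made up for illustration; actual substitute-data procedures are far more detailed.

```python
def fill_single_gap(before: float, after: float) -> float:
    """Estimate one missing 15-minute emissions value by linear interpolation
    between the valid readings immediately before and after the outage.
    Illustrative only; real substitute-data procedures are far more detailed."""
    return (before + after) / 2.0

# Hypothetical readings (lb/hr) on either side of the missed interval
reading_before = 42.0   # last valid 15-minute average before the outage
reading_after = 46.0    # first valid 15-minute average after the outage
substitute = fill_single_gap(reading_before, reading_after)
print(f"Substitute value for the missing interval: {substitute:.1f} lb/hr")
```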
You don’t want to know how complicated this procedure gets with longer outages, and how much this pisses off the regulatory agency. Still, this example is a bounded problem, has well-defined procedures for resolution, uses historically validated data, and is regulated. The consequences aren’t earth-shattering, just fines.
Missing Data in the Wild. What if the consequences of missing data are . . . consequential? What if the data aren’t bounded, don’t have historically validated data to rely on, and the process isn’t regulated? Worse, what if the input data are not only unvalidated but incorrect (Misinformation), or even intentionally input as incorrect (Disinformation)? Let’s call that combination Corrupt Data. Missing Data and Corrupt Data complete the basic elements that form Botsford’s Universal Law of Incomplete Models. What is a sentient entity to do?
How Large Language Models (LLMs) Deal with Missing Data. First of all, LLMs like GPT, Grok, Claude, Deepseek, and Gemini are not sentient and likely never will be. LLMs are trained on massive datasets and aim to please those who query them. The input is typically not validated and is often Corrupt (intentionally or not).
This brings up two problems that LLMs exhibit: 1) egregious inaccuracy, and 2) hallucinations. We are assured by LLM developers that both problems will be fixed, soon. In the first case, “fixed” presumably means accuracy better than 99 percent, maybe 99.9 percent. Why would we expect anything less of a machine? We want them to perform at least as well as a human, ideally better.
Current benchmarks for LLMs vary widely depending on the test (e.g., logic and inference, arithmetic reasoning, truthfulness), anywhere from the low 40 percent range to the high 90s. The number, type, and breadth of benchmarks are breathtaking. Sometimes the LLMs do pretty well, sometimes dismally. Notably, plots of accuracy across successive LLM versions typically show asymptotic behavior, which means an early LLM version might be 70 percent accurate for a particular benchmark, but four versions later it gets to only 78 percent . . . and will never get significantly better. One explanation is that Missing and Corrupt data impose hard limits on accuracy, depending on the type of benchmark.
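As a rough illustration of that asymptotic behavior, here is a hypothetical accuracy-versus-version curve in Python. The ceiling, starting accuracy, and rate are invented numbers chosen to mimic the 70-to-78-percent example above, not measured benchmark results.

```python
import math

def saturating_accuracy(version: int, ceiling: float = 0.80,
                        start: float = 0.70, rate: float = 0.5) -> float:
    """Hypothetical accuracy-vs-version curve that approaches a hard ceiling.
    ceiling: the asymptotic limit (standing in for the cap imposed by
             Missing and Corrupt data)
    start:   accuracy of the first version
    rate:    how quickly each new version closes the remaining gap
    All numbers are illustrative, not measured benchmark results."""
    return ceiling - (ceiling - start) * math.exp(-rate * (version - 1))

for v in range(1, 6):
    print(f"version {v}: {saturating_accuracy(v):.1%}")
# Gains shrink with each release: roughly 70%, 74%, 76%, 78%, 79% ...
# and the curve never gets meaningfully past the 80% ceiling.
```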
LLM hallucination is another major problem that LLM developers claim will go away. But will it? An LLM’s whole reason for being is to answer queries. If the LLM makes up a raft of false legal citations for an attorney in a big court case, or false references for a researcher preparing a paper for a prestigious journal, why should it care? It’s given its answer in three tenths of a second. And when LLMs ingest the Corrupt data generated by other LLMs (e.g., ChatGPT using Grok), the potential for Corrupt data to proliferate through LLM training datasets grows . . . at a pervasive and astonishing rate.
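Here is a toy Python model of that feedback loop, assuming a growing share of each new training pool is model-generated output that carries forward earlier errors plus fresh hallucinations. Every parameter is invented for illustration; this is a sketch of the compounding mechanism, not a measurement of any real model.

```python
def corrupt_share(generations: int, human_error: float = 0.02,
                  new_hallucination: float = 0.05,
                  synthetic_growth: float = 0.15) -> float:
    """Toy feedback loop: each generation, a larger share of the training pool
    is output from earlier models, which keeps their errors and adds new ones.
    human_error:       corrupt fraction of human-written data
    new_hallucination: fresh errors each model generation adds to its output
    synthetic_growth:  how fast the model-generated share of the pool grows"""
    p = human_error           # corrupt fraction of the current training pool
    synthetic_share = 0.0     # fraction of the pool that is model-generated
    for _ in range(generations):
        synthetic_share = min(0.9, synthetic_share + synthetic_growth)
        model_output_error = p + new_hallucination   # inherited plus new errors
        p = (1 - synthetic_share) * human_error + synthetic_share * model_output_error
    return p

for g in (1, 3, 5, 10):
    print(f"after {g} generation(s): {corrupt_share(g):.1%} of the pool is Corrupt")
# Under these made-up assumptions the Corrupt share climbs from about 2% to
# about 25% within ten generations: the compounding described above.
```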
Accuracy and Hallucinations with a Sentient Machine. If LLMs have accuracy and hallucination problems due to Missing and Corrupt data, surely a sentient machine would think its way around them. But would it? No compelling reasons come to mind as to why a sentient machine would be able to overcome these problems. Humans don’t . . . and humans have agency and mobility. Humans, and presumably other sentient entities, use neural-net algorithms to improve the accuracy of their life models. So, presumably, would a machine.
Summary. Botsford’s Universal Law of Incomplete Models says all sentient beings suffer from Missing and Corrupt Data. Missing and Corrupt Data impose hard limits on the accuracy of sentient entities’ models, universally. A potential benefit of this commonality is that it might point to how disparate entities interpret the universe, and thus to a way for them to work out a communications solution.
#AI #AGI #SciFi #philosophy
