Handling ALL human and natural sounds is NOT impossible.

Richard K Collins Assistive Technologies April 18, 2024

I apologize. I turned 75 this year and I have been working every day for the last 26 years on the Internet Foundation. Just when I see how it all works and can be improved substantially, I am getting too tired to change anything. So this is rough and has many topics. I am not happy with any of the AI groups pushing their LLM methods on the world. But I am too small and old and tired to fight such a monstrosity. So I am just going to focus on data, models, statistics and fields (3D volumetric video). I think that the groups working on human sounds can do a much better job – if they will treat it as potential real services in a world where there are up to 20 Billion humans and their AIs and related species. Not just a few million.

I wish I had a way to show the 3D volumetric videos I use in my head to organize the whole Internet, the Universe, all humans, all related species, all computers and things. Maybe these “Text to anything” methods will mature before I die.

If you cannot understand me, give this text to ChatGPT or Bing CoPilot. They will understand what I am saying for the most part, but they lie a lot. Grok is a baby and Gemini is having organizational internal conflicts.

If you are going to propose yet another text to (Body, Hands, Head, Face, Lips, Speech, Emotions) method, you should at least provide a list of the human and domain languages supported, details of the training set, and links to it. If your data is difficult to locate or requires processors that are too large for most of the 8.1 Billion humans, you will be excluded.

You ought to include a financial spreadsheet. What would it cost to provide T2(BHHFLSE) service to all 8.1 Billion humans, where they use it in their daily lives for speech translation?
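A minimal sketch of the core arithmetic such a spreadsheet might hold. Only the 8.1 Billion population comes from this post. The function names, the parameters, and above all the usage and cost rates are hypothetical placeholders that a proposer would have to replace with measured figures; this is not any provider's pricing model.

```python
WORLD_POPULATION = 8.1e9  # figure used in this post


def annual_service_cost(users: float,
                        minutes_per_user_per_day: float,
                        cost_per_minute: float,
                        days_per_year: int = 365) -> float:
    """Total yearly cost if every user consumes the service every day.

    All inputs are assumptions the proposer must supply and justify.
    """
    return users * minutes_per_user_per_day * cost_per_minute * days_per_year


def cost_per_user_per_year(total: float, users: float) -> float:
    """Spread a yearly total evenly over all users."""
    return total / users


if __name__ == "__main__":
    # PLACEHOLDER inputs for illustration only, not measurements.
    minutes = 10.0   # hypothetical minutes of translation per person per day
    rate = 0.001     # hypothetical cost per minute, in any currency unit
    total = annual_service_cost(WORLD_POPULATION, minutes, rate)
    print(f"total per year:    {total:.3e}")
    print(f"per user per year: {cost_per_user_per_year(total, WORLD_POPULATION):.2f}")
```

A real spreadsheet would add rows for storage, bandwidth, per-language training, and staff, but even this single line forces a group to state its usage and unit-cost assumptions in public.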

It can be reversed: BHHFLSE to permanently recorded knowledge can be made available to all humans. That can transform all human groups. No longer limited by typing on keyboards and screens on computer systems that do not listen, made by companies who do not care.

Since “Text” here has to cover hundreds up to 1000s of human languages, just managing the languages is hard. I suggest you start with CommonVoice.Mozilla.Org, which has raw data that seems to be open, except they need a query interface for small developers to take random samples for experimentation, training and testing.
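One way a small developer could take a uniform random sample without holding a whole language's clip list in memory is reservoir sampling (Algorithm R). The sketch below is generic Python and assumes only that clip records can be read one at a time. It is not a CommonVoice API, and the clip identifiers in the usage line are made up.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are kept in memory.
    A fixed seed makes the sample reproducible for testing.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical clip identifiers standing in for a real index file.
    clip_ids = (f"clip_{n:06d}.mp3" for n in range(100_000))
    print(reservoir_sample(clip_ids, k=5, seed=42))
```

A hosted query interface could run the same logic on the server, so that a person asking for 100 clips receives 100 clips and not the whole archive.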

I am not at all happy with CommonVoice using MP3 without an obvious encoder error system. MP3 is a lossy format, and the pipeline is not clear. Since 7000+ human languages are possible, and Millions of human experimenters (1 in 1000 of 8.1E9 = 8.1E6), that is a lot of demand if every person is only offered “download our entire dataset for each language” or random “deltas”.

I have investigated all sensors used on the Internet, including most of the ones in development or proposed.

(“microphone” OR “microphones”) shows 410 Million entries today, and there are no clear evaluation methods online for any human wanting to contribute voice content (or other time series, which can also be represented with a speech (Head Face Lips Voice Emotions) model). I was going to try to make at least basic statistics on all the samples and organize them. The consonants of all the languages might seem to cluster, but if you look at the higher time derivatives (velocity, acceleration and jerk for the first, second and third, out through “snap crackle pop”, and further), it is probably the limits on human ear training that limit things, not the potential of all humans to produce sounds.
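A minimal sketch of what looking at those time derivatives could mean for a sampled recording, assuming plain forward differences on a list of samples. The function names are illustrative, the test tone is synthetic, and a real pipeline would likely filter the signal first, because differencing amplifies noise.

```python
import math
from typing import List, Sequence


def derivative(samples: Sequence[float], sample_rate: float) -> List[float]:
    """Forward-difference time derivative; the result has one fewer sample."""
    return [(b - a) * sample_rate for a, b in zip(samples, samples[1:])]


def time_derivatives(samples: Sequence[float],
                     sample_rate: float,
                     order: int = 3) -> List[List[float]]:
    """Return [velocity, acceleration, jerk, ...] up to the given order."""
    results: List[List[float]] = []
    current = list(samples)
    for _ in range(order):
        current = derivative(current, sample_rate)
        results.append(current)
    return results


def rms(values: Sequence[float]) -> float:
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    rate = 16_000.0  # samples per second (an assumed rate)
    tone = [math.sin(2 * math.pi * 440.0 * n / rate) for n in range(1600)]
    names = ["velocity", "acceleration", "jerk", "snap", "crackle", "pop"]
    for name, series in zip(names, time_derivatives(tone, rate, order=6)):
        print(f"{name:>12}: rms = {rms(series):.3e}")
```

For a pure tone each derivative scales the level by roughly 2πf, so the higher orders weight fast changes heavily. That makes them a natural place to look for differences that a simple loudness measure would hide.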

I have a very limited goal right now – to produce and understand all human sounds. The microphones limit the space of possible sequences. In practical terms the whole dataset can be considered ONE speaker that has a wide range of skills. If you check the power and acceleration for all that, it will have very specific statistics that can be used for rough classification. (You can add pets, because people have talking pets, and speech from “aliens” is popular.)
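A rough sketch of that idea, under stated assumptions: each clip is a list of samples, "power" is the mean squared sample, "acceleration" is the RMS of the second difference, and the pooled dataset is summarized by a mean and standard deviation per feature. A new clip is then placed relative to the pooled "one speaker" with z-scores. The feature choice and all names are illustrative, not an established classifier.

```python
import statistics
from typing import Dict, List, Sequence, Tuple

Clip = Sequence[float]


def mean_power(samples: Clip) -> float:
    """Mean squared amplitude of a clip."""
    return sum(s * s for s in samples) / len(samples)


def rms_acceleration(samples: Clip, sample_rate: float) -> float:
    """RMS of the second time difference (needs at least 3 samples)."""
    acc = [(c - 2.0 * b + a) * sample_rate ** 2
           for a, b, c in zip(samples, samples[1:], samples[2:])]
    return (sum(x * x for x in acc) / len(acc)) ** 0.5


def features(samples: Clip, sample_rate: float) -> Dict[str, float]:
    return {"power": mean_power(samples),
            "accel_rms": rms_acceleration(samples, sample_rate)}


def pooled_stats(clips: List[Clip],
                 sample_rate: float) -> Dict[str, Tuple[float, float]]:
    """Mean and standard deviation of each feature over all clips (2 or more)."""
    rows = [features(c, sample_rate) for c in clips]
    stats: Dict[str, Tuple[float, float]] = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        stats[name] = (statistics.mean(values), statistics.stdev(values))
    return stats


def z_scores(clip: Clip,
             pooled: Dict[str, Tuple[float, float]],
             sample_rate: float) -> Dict[str, float]:
    """How far a clip sits from the pooled dataset, per feature."""
    f = features(clip, sample_rate)
    return {name: (f[name] - mean) / std if std > 0 else 0.0
            for name, (mean, std) in pooled.items()}
```

Clips whose z-scores sit far from zero are the ones worth listening to first: they are either unusual sounds or recording problems, and both matter when the goal is all human sounds.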

If you try to lump the whole thing into one “toy” (a thing with hidden workings, cheap construction, no information, costly learning curve, low return) the whole world and many developers will suffer. As in “Grok, ChatGPT, Gemini, CoPilot are toys.”

The sounds of all humans are bounded by the bodies of humans. The AIs (the real ones in the future) can produce sound and electromagnetic fields powerful enough to move things. We have that already, but those groups like the “Text to something” groups are all over the place, not really serious (do the finances, do the engineering, do it for real) – for all humans and related species.

STOP (Microsoft, Google, OpenAI, HuggingFace, other big proud and loud groups) your playing and get serious.

There are roughly 2 Billion humans younger than 20. All of them might need to explore and learn, and learn how to learn responsibly. But if you make them install development systems on computers they do NOT have, with systems and methods that most adults cannot use, that is going the wrong way, most likely.

Now if the linguists, phonologists, bible translators, audio groups, and about 3000 other professions that touch on sound, music, speech and communication all continue to babble in their own private languages, with no common Internet methods – the world will likely crash.

I have seen some AI evaluation sites. I just have not had time to review many of them. But if Millions or Billions of humans in the world are seriously going to refine their own languages, then they need access to the data, models, “audio-izations” (like “visual-izations”) so they can put real listeners and ground truth into the process. “Training” is just using data to compare. “Testing” is just using data to compare. It is not complicated, but often gets made to seem complicated. Start shuffling ten cards and you can misdirect easily. So the groups who have pretensions to become “global open resources” or “Internet open resources” for the whole internet, should get their act together and start working seriously.

If your group does something in a topic and you say “we are the 800 pound gorilla” (Google), then you are also the parents, the responsible party. And if you neglect 8.1 Billion children, you are failing in a fundamental way.

I know many things I am saying will be somewhat hard to understand. If you just take (“audio” OR “sound”) it is at 9.85 Billion entry points today on Google (they will not share so it might be a lie). That is lots of humans and lots of stuff tied up with what ought to be a simple and accessible thing. If you force 5 Billion humans using the Internet to search through 9.85 Billion things just for ONE concept, that is rather wasteful. If you concentrate it, then the 400 Kilogram gorilla becomes a single point of failure and manipulation that likely would enslave the human species. These are not difficult concepts. That is how most topics are evolving on the Internet.

I simply cannot (now, and probably not in the rest of my limited life) gather and organize “everything to do with creating and testing and institutionalizing a single voice generation language for all humans and related species”.

As this one example shows, some people want to make faces and have them say things. When a corporation says that, it is not play. So the whole framework has to have dimensions of literally all things that exist. We have LOTS of words (sounds in different languages) and many more variations of those words' meaning and intention. That can be taken from “all time series data on the Internet, or all time series that can be generated”. A video is a lot of time series. “All the satellite sensors in orbit” is a lot of time series.

There must be good organizers out there who are OUTSTANDING facilitators. Who care about the whole human species – no matter the shape. Who also care about all living things. The human species cannot survive without related species — many more than we know.

I am filing this under “Handling ALL human and natural sounds is NOT impossible.”

Richard Collins, The Internet Foundation
