Monday, July 16, 2007

Word Fossils -- A Statistical Fantasy

We started getting cable television at our home just a few months ago. I never got much opportunity to watch any of the cable shows in the past, so I’ve watched it pretty intently for a while now. One series that fascinates me is Mythbusters on the Discovery Channel. These people are real scientists! They are amazingly resourceful, appropriately skeptical, knowledgeable and very funny. I particularly liked the show where they tried to read recorded sound off of a ceramic pot. The spectacle of the crew shouting through a phonograph horn into a lump of clay was so much better than anything on the Comedy Channel. The story behind the scene was even more intriguing.

The suggestion had been made that sound recordings would be naturally embedded by a potter applying tools to the surface of the wet clay as it rotates on the potter’s wheel, analogous in principle to an Edison phonograph, I suppose. The tool would vibrate and the vibration would leave a mark. The mark could theoretically be interpreted to reconstruct the original sound. Old pots could be made to give up the words of the dead. Even the audio track of the life of Jesus might be recovered if a potter had been nearby, in Jerusalem perhaps, inadvertently pressing His words into the stoneware.

Now, by my way of thinking, the physics of the situation is undeniable. Everything in the environment is going to affect that wet clay. We should be able to pick up that sound. Based on the historical analysis of potsherds, pebbles and ash we have reconstructed entire civilizations and chronicled the oft-repeated Fall of Troy. The wet clay should record the nature of the tools, the chemistry of glaze, the location, the temperature, the humidity, the firing fuel, the fact that the potter was a big left-handed galoot with a sore tooth and even note the presence of a nearby supernova in a northern constellation. Of course vibrations in the air will affect the clay! So why couldn’t the Mythbusters team retrieve the sound?

Well, maybe they weren’t trying hard enough. But the fact is that, assuming it’s possible, it’s not going to be easy. Edison himself was famous for his persistence, and he only succeeded because he was controlling all the variables. We forget how hard it is to do something for the first time.

--------------------------------------------

If I were going to attempt it, how would I approach the problem of interpreting a pot? First I would forget about all the mechanical nonsense and treat it as a data source. Laser scan the whole thing, CAT scan it, whatever, to get a data-rich digital map of the surface, and maybe the interior as well. I’m imagining that these scans come in as bit maps where each digital point is a 3D address with sensor measurements associated with that point. I imagine also that the preprocessing provided by the scanning equipment retains only the significant points, employing massive computing power to whittle down the 3D space into a 3D form. Nevertheless, from the resolution that these images have, I’m guessing that there would be a tremendous amount of data. I don’t think I’d manage it on my PC, so I’d have to have access to some specialized computing power. (Maybe I’m underestimating today’s PCs.) Then, being of a statistical bent, I would take a random sample of the surface points, something large but reasonable.

The first objective IMO would be to identify simplifying physical parameters – in particular, the axis of rotation. This seems very straightforward, but in fact it is not. The pot will have been cut from the wedge unevenly. The bottom will have been trimmed, probably, but not necessarily square with the pot. Furthermore, the shape will have warped in the kiln to a degree depending on imperfections, unevenness and non-symmetrical heating. Sore Tooth may have thumbed it a little hard in places. I’m guessing, though, that we could easily fit some fairly simple 3D curve as the axis using a least-squares process. Then I would do it again with another, completely independent, sample of the same data. … The results of the two tests will not be the same, I promise you.
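To make the two-sample point concrete, here is a minimal sketch in Python. It is far simpler than the 3D axis described above: it only estimates the center of one horizontal slice of the pot, using an algebraic least-squares circle fit (often called the Kasa fit). The slice is synthetic, and the center, radius, out-of-round term and noise level are all numbers I made up for illustration.

```python
import math
import random

def solve3(a, b):
    """Solve a 3x3 linear system a*x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        known = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - known) / m[r][r]
    return x

def fit_circle(points):
    """Algebraic least-squares circle fit; returns (center_x, center_y, radius)."""
    # Model: x^2 + y^2 + D*x + E*y + F = 0, which is linear in D, E, F.
    rows = [(x, y, 1.0) for x, y in points]
    rhs = [-(x * x + y * y) for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(3)]
    d, e, f = solve3(ata, atb)
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)

def make_slice(n, rng):
    """Synthetic slice (assumed values): center (0.3, -0.2), radius 10, slightly oval."""
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        r = 10.0 + 0.05 * math.cos(2.0 * t) + rng.gauss(0.0, 0.02)
        points.append((0.3 + r * math.cos(t), -0.2 + r * math.sin(t)))
    return points

rng = random.Random(2007)
scan = make_slice(20000, rng)
rng.shuffle(scan)
sample_a, sample_b = scan[:2000], scan[2000:4000]  # two disjoint random samples

fit_a = fit_circle(sample_a)
fit_b = fit_circle(sample_b)
print("sample A:", fit_a)
print("sample B:", fit_b)
```

Both fits land close to the center I planted, but they do not agree with each other exactly, and the size of that disagreement is a rough gauge of how far to trust either one.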

At that point, we have to make a decision. Do we try to straighten out the surface, or the axis? I’m betting we go for the surface first to remove major anomalies. NASA has some algorithms that I’m aware of (I’m talking 25 years ago) that are used for stretching maps and images to correct for the angle of viewing and atmospheric distortions. They do it by marking known positions and then spreading the distortions evenly in order to maintain neighborhood consistency. It’s as if they took a rubber sheet and pinned some of the points to an earthlike, rounded surface. The points in between would adjust themselves over the theoretical frame. In this case we would search for large explainable anomalies and try to remove them statistically. In the case of thumbjabs or other dents, we might look for and catalog local distortions that have an inverse curve and try to reverse the curve, preserving the local terrain, but not the original altitude.

In order to make this surface correction, we might have to develop a whole science of thumbjabs and the propagation of their impact on the remainder of the vessel. Learning how to correct these simple anomalies (akin to Oklahoma City in the previous post) would require us to experiment with real clay, pushing and prodding otherwise perfect pots to see what happens. Maybe we could exclude the anomalous data, but then we would lose the detail. Remember, we’re ultimately interested in identifying the sound vibrations. These are just steps we would need to trace before getting there.

Eventually, having removed as many dents and bulges from the virtual shape as seems appropriate, we will refit the axis to see if we can get more consistency. Finally we may decide on an axis that represents the statistical combination of numerous attempts. We may look at all sorts of possible axes including expanding helices. The variation among our calculations will give us clues and the closeness of the fit will be our interim measure of success.

We don’t really expect a close fit at this point, because we don’t really have a circular pot. Sore Tooth pulled the pot off the wedge a little too vigorously and apparently stored the greenware overnight, laying the pot on its side and thus warping the circular pattern that one would hope for. That pattern, described with mental reference to integral calculus, perhaps, as a stack of many stubby cylinders of varying diameter, is a mere dream, one of Plato’s Forms. In fact, the slight torque exerted in the process of throwing the pot would have produced elliptical shapes at best and probably something more complicated. All this could be corrected. It takes work, but it can be corrected. The helical axis could be unwound; the out-of-round could be identified, measured and removed. (These parameters might also help us discover that Sore Tooth was left-handed.)

The overall concept here is that we are extracting significant variables that help us generate an idealized formulation of the pot by means of statistical corrections and, at the same time, tell us stories about its making. The main measure of success will be the R-squared of the model. This is a statistical measure of the goodness-of-fit that is very widely used in scientific research. It is regarded as providing an intuitively helpful measure of the percentage of variance explained by the variables that have been incorporated in the model. The closer to 100 percent you can get, the better you understand the data. In the physical sciences, as opposed to the social sciences, we can sometimes get very high values for R-squared, and that is good. The whole purpose of this exercise is to find out what the surface should be, and subtract that value, leaving us with the so-called residual, or remaining unexplained variation, where we hope to find the magic vibrations.
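A toy version of "fit the form, then keep the residual" might look like the following Python sketch. The model here is just a straight-line profile (radius as a function of height), which is my own simplification; a real pot would need a far richer model. The data are synthetic and every number in them is an assumption.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, fitted):
    """Share of the variance in ys that the fitted values account for."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1.0 - ss_resid / ss_total

# Synthetic profile (assumed): the wall flares outward linearly with height,
# plus a small random wobble standing in for everything unexplained.
rng = random.Random(27)
heights = [rng.uniform(0.0, 20.0) for _ in range(5000)]
radii = [8.0 + 0.25 * h + rng.gauss(0.0, 0.02) for h in heights]

slope, intercept = fit_line(heights, radii)
fitted = [intercept + slope * h for h in heights]
residuals = [r - f for r, f in zip(radii, fitted)]  # what is left once the form is removed
r2 = r_squared(radii, fitted)

print(f"R-squared = {r2:.5f}")
print(f"mean residual = {sum(residuals) / len(residuals):.2e}")
```

The residuals average out to zero by construction, which is the sense in which the form has been zeroed out; whatever signal exists would have to live in their pattern, not their level.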

At this point, the form has effectively been removed and we are left with a giant tube-sock of data in the form of deviations from the mean. And what is the mean? The mean is the form, the surface in question, which is now zeroed out, and we are looking instead at small deviations away from that form. The purpose of removing the form is to allow us to find the sound track. If it were an LP disk, we would be looking at a single long track of numbers with a mean of zero, and we would be trying to avoid track repeats and track skipping. This case is a lot more complicated because the track might wander. Even if the pot were perfect, the presumed recording "head" would have produced a moving target.

Now that we have the relatively pure signal, we are going to mathematically scan across, trying to find a track, slightly up, slightly down, anything we can think of – looking for what? Sine waves! If our data density is good enough, we should be able to find horizontally connected amplitude signals that correspond to the classic curves. We don’t expect good sine waves, of course. We expect cluttered overlapping sine waves that we can break down with Fourier Analysis. We are looking for sound as opposed to noise. In order to find it, we have to make a guess at how fast each track was moving and maybe we have to know what kind of noises to expect in an ancient potter’s workshop. And then we have to get lucky.
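As a rough illustration of hunting for a sine wave in a noisy track, here is a small Python sketch using a plain discrete Fourier transform. The track is synthetic: I planted one tone at 40 cycles per revolution, with an amplitude about equal to the noise, and both of those numbers are assumptions. A real analysis would use a fast Fourier transform library; the slow textbook version is written out here only to keep the example self-contained.

```python
import cmath
import math
import random

def dft_power(samples):
    """Power at each whole-number frequency from 0 up to half the sample count."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        total = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        power.append(abs(total) ** 2 / n)
    return power

# Synthetic residual track (assumed): one revolution, 512 samples,
# a planted tone at 40 cycles per revolution plus Gaussian noise.
rng = random.Random(16)
n = 512
track = [math.sin(2.0 * math.pi * 40 * i / n) + rng.gauss(0.0, 1.0)
         for i in range(n)]

power = dft_power(track)
# Skip bin 0, which only measures the average level of the track.
peak = max(range(1, len(power)), key=power.__getitem__)
print("strongest frequency (cycles per revolution):", peak)
```

Point by point, the planted tone is hard to tell from the noise, but across the whole track it stands out as a single tall spike in the spectrum. The catch, of course, is that here I knew the track was straight and evenly sampled, which is exactly what we would not know about a pot.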

In all likelihood we will not be able to find anything. So then we give up, right? Not even close. First we look for combinations of adjacent tracks that might have more statistical power. Then we’re going to repeat the whole process looking at variations on the inner wall, or in the thickness of the walls, or linearity of the mineral crystals, or variable chemical states caused by changing pressure. Who knows what would actually work. And we could study the various interactions of all these measures until the Second Coming. In the end, lines of inquiry are only dropped because people run out of money or get tired of it all. If the research is important, people never give up. They just keep looking for new approaches.

The statistical mindset is just one of these approaches. The mindset of engineering would handle it like the Mythbusters team did. The first thing you try to do is record sound using the same medium under perfect conditions and see if you can learn from that. But I wouldn’t do it that way because it doesn’t suit my personality. For one thing, statistical analysis will find a lot of other things as well, which I like.

---------------------------------

The question returns to just how important this research might be. Think of it! To recover dead languages, to literally hear the words of Jesus, of Pericles, of Caesar, the roar of the crowd in the Colosseum. More likely, I suppose, we would hear the yelps of children chasing around the workshop or the cackle of hens. Maybe there are better ways to spend our time – providing electricity for third world countries or preventing malaria might seem to be better goals. Part of it always comes back to faith. Do you really believe that the research can pay off? And do you believe it’s worth the effort?

In the end, I have my doubts about the song in the pot. The real problem is that a medium can only hold just so many dimensions of information before the residue becomes random noise. There are just too many tiny factors affecting the outcome in small ways, and it doesn’t help that the signal is analog, as is everything else in the Universe. As a matter of fact, I think I’ve changed my mind. The Mythbusters approach may actually be better. We just have to build the perfect reading head for the given medium.

The many little things get washed out by a background of big factors. Theoretically, you could use the same approach to look at the stars in the daytime. Just subtract the statistical effects of the sun and sky. In practice, though, you just can’t know the background well enough to subtract it. And this is what you need to understand about the previous post concerning the impact of the death penalty on the homicide rate. There are some pretty strong factors that motivate the likelihood of murder and the decision to murder. Although you know the death penalty has to have some sort of effect in some cases, it’s really hard to identify and measure that effect. And when combined with other factors, the result might not be what you expect. For instance, if you correct for poverty, then the impact of the death penalty might be to increase the likelihood of murder rather than decrease it. The reason for this is that the deterrence effect will have differing impacts on different groups, an effect which is correlated with the direct effect, but stronger. This is one of many problems involved with extracting second and third order factors by means of statistical correction.

IMO, you can’t squeeze sound from stoneware, and tiny incentives have indiscernible effects.
