Professional audio technician, checking in. So there are a few big questions here, all layered on top of one another, and honestly, each one would be worthy of its own post. Let me break them down and answer them in order of complexity. I’ll gloss over some stuff and leave out a lot, but I’m trying to hit the basics so you understand the fundamentals.
First off, we need to understand what sound is. It’s a physical compression wave. Imagine stretching out a slinky, then gathering a few coils at one end and letting them go. You’d see a wave travel down the coils, a compression followed by an expansion. Sound works much the same way. When something vibrates, it pushes and pulls on the air molecules around it, creating teensy tiny pockets of high and low air pressure. These compressions travel through the air as a wave, radiating out from the source of the vibration. Higher-pitched sounds come from faster vibrations, and lower-pitched sounds from slower ones. How fast something vibrates is its frequency, measured in hertz (Hz), meaning vibrations per second. And physical waves carry energy. Sound waves don’t typically carry a lot of it, but it’s enough to wiggle sensitive things (like our eardrums) around just a little bit. And since sound has energy, we can capture that energy in various ways.
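To put rough numbers on frequency: one full compression-plus-expansion cycle takes 1/f seconds, and the distance between neighboring compressions in the air (the wavelength) is the speed of sound divided by f. Here’s a minimal Python sketch, assuming a speed of sound of about 343 m/s (air at roughly 20 °C) and an idealized pure tone; the function names are made up for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: speed of sound in air at about 20 °C


def period_s(frequency_hz: float) -> float:
    """Time in seconds for one full compression-plus-expansion cycle."""
    return 1.0 / frequency_hz


def wavelength_m(frequency_hz: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Distance in meters between neighboring compressions in the air."""
    return speed_m_s / frequency_hz


def pressure(t: float, frequency_hz: float, amplitude: float = 1.0) -> float:
    """Pressure deviation of an idealized pure tone at time t (arbitrary units).

    Positive values are compressions, negative values are expansions.
    """
    return amplitude * math.sin(2 * math.pi * frequency_hz * t)


for f in (20.0, 440.0, 20_000.0):
    print(f"{f:>8.0f} Hz: period {period_s(f) * 1000:.3f} ms, "
          f"wavelength {wavelength_m(f):.3f} m")
```

Low bass notes have wavelengths of many meters, while the highest audible tones are under two centimeters.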
If we can capture that energy, we can convert it into other forms. Some of the earliest records were actually made of soft wax. The recorder was basically a giant horn connected to a suuuuper tiny needle. As the needle was dragged across the wax, people shouted into the horn. (They had to shout because louder sounds carry more energy, and those primitive recorders weren’t sensitive enough to pick up quiet sounds.) The horn collected all of that sound energy and focused it down onto the tip of the needle. A trombone takes the vibration of a player’s lips and expands it into a massively loud sound, but the opposite also works: you can use a horn to focus sound onto a small receiver. The needle vibrated with the sound wave’s compressions and expansions and cut a wobbly groove into the wax.
Playback was the inverse operation: a needle was dragged along the groove again, it vibrated as the groove wobbled, and the horn projected those vibrations back out as sound. Later iterations improved the design and sensitivity, and records moved from wax to harder, more durable materials, first shellac and later vinyl. And now we have a record player. That’s a sound wave captured in physical form.
Next, let’s talk about analog electrical audio. We know sound waves have energy and can wiggle things around. So what if we could capture those vibrations and turn them into electrical energy instead of physical motion? That’s what a microphone does. The most basic microphone is just a magnet and a coil of copper wire, attached to a diaphragm. When you move a magnet through a copper coil (or move a coil past a magnet), it induces a voltage in the wire. Move it one direction and you get a positive voltage; move it the other direction and you get a negative one. So what if we find a way to move a magnet based on a sound wave?
Let’s take a really sensitive diaphragm, sensitive enough to wobble when sound waves hit it. Remember, sound is just a wave of high and low air pressure, so it pushes and pulls on a diaphragm the same way wind pushes on a sail. When a sound wave hits the diaphragm, it vibrates in response. Now we attach the magnet to it, so the diaphragm wiggles the magnet back and forth through the coil. When sound hits the diaphragm, it moves the magnet, creating positive and negative voltages in the copper wire that directly correspond to the sound wave. Congrats, we’ve just invented the dynamic microphone. (In most real dynamic mics it’s actually the coil that’s attached to the diaphragm and moves around a fixed magnet, but the principle is the same. There are also other, more complicated types of mics, but they all do the same basic job of converting the sound wave into an electrical signal.)
So now we have an analog electrical signal. And now that it’s on copper as electricity, we can use electronics to amplify it and send it to a speaker. A basic speaker does the exact same thing as a dynamic microphone, but in reverse. It has a magnet and a copper coil attached to a cone that can move back and forth. When you run a current through the coil, the coil and magnet push against each other, and the cone moves in response. So if we send that analog audio signal (amplified to be powerful enough to drive the speaker) to the speaker’s coil, it pushes the cone forwards and backwards and produces a vibration that matches the signal the mic captured. (Tangentially, you can actually use a speaker as a microphone in a pinch. Since they do the same basic thing in opposite directions, you can plug your headphones into a mic input and yell into them, and the tiny speaker drivers in your headphones will act as a mic diaphragm.)
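The magnet-and-coil effect behind both the mic and the speaker is Faraday’s law of induction: the voltage induced in a coil is proportional to how quickly the magnetic flux through it changes, ε = −N·dΦ/dt, where N is the number of turns. Here’s a minimal Python sketch under a deliberately simplified, hypothetical model in which the flux through the coil just grows in proportion to the magnet’s position; the turn count and flux constant are made-up illustrative values, not real mic specs.

```python
def induced_voltage(positions_m, dt_s, turns=100, flux_per_meter_wb=1e-3):
    """Approximate induced voltage (volts) between successive magnet positions.

    Hypothetical simplified model: flux through the coil is
    flux_per_meter_wb * position, so Faraday's law EMF = -N * dPhi/dt
    becomes -turns * flux_per_meter_wb * (change in position) / dt_s.
    """
    volts = []
    for previous, current in zip(positions_m, positions_m[1:]):
        d_flux_wb = flux_per_meter_wb * (current - previous)
        volts.append(-turns * d_flux_wb / dt_s)
    return volts


# Magnet pushed 2 mm into the coil, then pulled back out; one reading per millisecond.
push_then_pull = [0.000, 0.001, 0.002, 0.001, 0.000]
print(induced_voltage(push_then_pull, dt_s=0.001))
```

Pushing the magnet in gives one polarity, pulling it out gives the other, and a magnet that isn’t moving induces nothing. Which direction counts as “positive” just depends on how the coil is wired.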
But how do we capture that analog electrical signal and save it as a digital file? This is where the Nyquist–Shannon sampling theorem comes into play. Basically, the theorem states that a wave containing no frequencies above some maximum can be perfectly sampled and reproduced, as long as the sample rate is more than twice that maximum frequency.
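The flip side of the theorem is aliasing: a frequency above half the sample rate doesn’t simply get lost, it shows up disguised as a lower frequency. Here’s a small Python sketch of that, using cosine tones and a 40 kHz sample rate chosen for illustration. A 30 kHz tone sampled at 40 kHz lands on exactly the same sample values as a 10 kHz tone, so once it’s sampled there’s no way to tell them apart. (This is why real converters filter out everything above half the sample rate before sampling.)

```python
import math


def sample_tone(frequency_hz, sample_rate_hz, num_samples):
    """Sample a cosine tone at the evenly spaced instants n / sample_rate_hz."""
    return [math.cos(2 * math.pi * frequency_hz * n / sample_rate_hz)
            for n in range(num_samples)]


RATE_HZ = 40_000                         # half of this (20 kHz) is the limit
low = sample_tone(10_000, RATE_HZ, 16)   # below RATE_HZ / 2: captured correctly
high = sample_tone(30_000, RATE_HZ, 16)  # above RATE_HZ / 2: aliases

# Every sample of the 30 kHz tone matches the 10 kHz tone, so after
# sampling the two are indistinguishable.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(low, high)))  # prints True
```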
Let’s break that down. First off, what are a sample and a sample rate? The computer doesn’t just listen to the constant stream of analog audio and record it directly. Instead, it samples the analog wave at extremely precise, regular intervals. For each sample, it measures the signal’s voltage at that instant, recording the wave’s amplitude and polarity at that specific point in time, and saves just that number. The sample rate is how many of those measurements it takes per second.
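Here’s what “measure at precise, regular intervals” looks like as a minimal Python sketch. `mic_voltage` is a hypothetical stand-in for the analog signal (a 1 kHz tone peaking at 0.5 V), and the 8 kHz sample rate is just a convenient illustrative choice. Each recorded value’s sign is its polarity and its size is its amplitude.

```python
import math


def sample_signal(signal, sample_rate_hz, duration_s):
    """Measure signal(t) at evenly spaced instants t = n / sample_rate_hz.

    Returns (time_s, value) pairs. The sign of each value is the polarity
    at that instant and its magnitude is the amplitude.
    """
    count = round(duration_s * sample_rate_hz)
    return [(n / sample_rate_hz, signal(n / sample_rate_hz)) for n in range(count)]


def mic_voltage(t):
    """Hypothetical analog mic signal: a 1 kHz tone peaking at 0.5 V."""
    return 0.5 * math.sin(2 * math.pi * 1_000 * t)


# One millisecond of signal at 8,000 samples per second gives 8 samples.
for t, v in sample_signal(mic_voltage, 8_000, 0.001):
    print(f"t = {t * 1e6:6.1f} us   {v:+.3f} V")
```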
According to the theorem, it needs to do that more than twice as often as the highest frequency we expect in the wave. Generally, the human hearing range is considered to be 20 Hz (20 vibrations per second) to 20 kHz (20,000 vibrations per second). Some audiophiles say they can hear more than that, but we’re sticking to the basics here… So according to the theorem, as long as our sample rate is above 40 kHz, we should be able to accurately reproduce any audio wave in the human hearing range. That’s why common sample rates like 44.1 kHz (CD audio) and 48 kHz sit just above that line. So we’re sampling the wave more than 40,000 times per second. That sounds like a lot, but remember that each sample is relatively small, because we’re only saving a single point on a graph.
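The arithmetic behind “that sounds like a lot” is easy to check. Here’s a minimal Python sketch, assuming uncompressed 16-bit stereo at 44.1 kHz (the CD format; bit depth is covered next). The function names are illustrative.

```python
def nyquist_rate_hz(max_frequency_hz: float) -> float:
    """Twice the highest frequency; the sample rate must be *above* this."""
    return 2 * max_frequency_hz


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Uncompressed data rate: samples/s x bits per sample x channels, in bytes."""
    return sample_rate_hz * bit_depth * channels / 8


print(nyquist_rate_hz(20_000))                    # 40000
cd_rate = pcm_bytes_per_second(44_100, 16, 2)     # assumed: CD audio, 16-bit stereo
print(cd_rate)                                    # 176400.0
print(cd_rate * 60 / 1_000_000, "MB per minute")  # 10.584 MB per minute
```

Each individual sample is only 2 bytes per channel; it’s the sheer number of them that adds up.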
Lastly, let’s go over bit depth. Each sample is saved as a number, represented in binary, and every sample uses the same number of bits. That number is the bit depth. Ever seen how binary counting works? Each bit is either a 1 or a 0, so to reach higher numbers we need more bits. With 8 bits, we can count from 0 all the way up to 255: 00000000 is 0, 00000001 is 1, 00000010 is 2, 00000011 is 3, etc etc… So with a bit depth of 8 bits, there are 256 possible levels (0 through 255) we can record each sample at. That means we’re rounding each sample to the nearest level. A higher bit depth gives us more levels, so we can record each sample more accurately, but it also increases file size because each sample is now larger. Every extra bit doubles the number of levels; 16-bit audio, as used on CDs, has 65,536 of them. Think of each level as a “step” on a staircase, and you’re trying to measure a curve to the nearest step. The smaller the steps, the more accurately we can measure the curve.
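Here’s a minimal Python sketch of the staircase idea: map a value in the range −1.0 to 1.0 onto the nearest of 2^bits levels and see how far off the rounded value lands. The unsigned codes from 0 to 255 mirror the 8-bit counting example above; real audio formats typically store signed integers centered on zero (16-bit WAV does, for example), so treat this mapping as illustrative rather than how any particular file format stores samples.

```python
def quantize(value: float, bit_depth: int) -> int:
    """Round a value in [-1.0, 1.0] to the nearest of 2**bit_depth levels.

    Returns an unsigned integer code from 0 to 2**bit_depth - 1, like the
    8-bit counting example (00000000 up to 11111111). Out-of-range values
    are clamped to the nearest end.
    """
    levels = 2 ** bit_depth
    clamped = max(-1.0, min(1.0, value))
    return round((clamped + 1.0) / 2.0 * (levels - 1))


def dequantize(code: int, bit_depth: int) -> float:
    """Map an integer code back onto the [-1.0, 1.0] range."""
    levels = 2 ** bit_depth
    return code / (levels - 1) * 2.0 - 1.0


sample = 0.3141592  # an arbitrary point on the curve
for bits in (4, 8, 16):
    code = quantize(sample, bits)
    error = abs(dequantize(code, bits) - sample)
    print(f"{bits:>2} bits: {2 ** bits:>5} levels, code {code:>5}, error {error:.6f}")
```

The rounding error can never exceed half a step, and each extra bit roughly halves the step size.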
Finally, on playback, a digital-to-analog converter uses the same theorem in reverse to reconstruct the original analog wave from those samples, and that signal is amplified and sent to our speakers.
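The reconstruction step can be sketched with the Whittaker–Shannon interpolation formula, which rebuilds the value at any instant by summing every sample weighted by a sinc function centered on it. This is a minimal Python sketch assuming a finite window of samples (the ideal formula uses infinitely many, so there’s a small truncation error). Real DACs typically don’t compute this sum directly but approximate it with oversampling and filtering. The tone and rates are illustrative choices.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi * x) / (pi * x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def reconstruct(samples, sample_rate_hz, t, first_index=0):
    """Estimate the signal at time t (seconds) by Whittaker-Shannon interpolation.

    samples[k] is assumed to have been taken at time
    (first_index + k) / sample_rate_hz. Using a finite list of samples makes
    this an approximation of the ideal infinite sum.
    """
    total = 0.0
    for k, value in enumerate(samples):
        n = first_index + k
        total += value * sinc(t * sample_rate_hz - n)
    return total


RATE_HZ = 8_000   # illustrative sample rate
TONE_HZ = 1_000   # illustrative tone, well below RATE_HZ / 2
N = 1_000         # samples taken on each side of t = 0
samples = [math.cos(2 * math.pi * TONE_HZ * n / RATE_HZ) for n in range(-N, N + 1)]

t = 0.5 / RATE_HZ  # halfway between two samples: a moment that was never measured
print(reconstruct(samples, RATE_HZ, t, first_index=-N))  # reconstructed value
print(math.cos(2 * math.pi * TONE_HZ * t))               # true value, for comparison
```

The two printed values should agree closely, even though the point being evaluated sits halfway between two samples and was never actually measured.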
I had to copy and paste your reply into SpeechCentral to listen to it hehe. Thanks for that, 8 mins of good stuff :)
I actually just edited it, so you may want to redo that lol
Imma ask the Voyager guy if we can get a full text copy option