I’m sure you’ve all seen games that incorporate music and generate game play based on the music that’s playing. Hell, who hasn’t played Guitar Hero? I know I’ve contributed my fair share to my carpal tunnel while strumming away on those cheap plastic guitars. Music, as they say, soothes the soul and calms the beast, and it also makes for some awesome game play! So incorporating music into your game really is a no-brainer. Taking it one step further and having the music itself generate your game levels is just plain awesome.
So, how can we use audio files to procedurally generate game play elements in Unity 3?
That’s a good question, and for the short answer – not very easily as of Unity 3.0. So what the hell am I writing this post for? Well, just because it’s not easy doesn’t mean it can’t be done *cue evil laugh*.
First of all, Unity 3 provides us with 2 functions that return raw data that we can use to generate effects, content or entire levels. These 2 functions are..
GetSpectrumData() & GetOutputData()
Let’s take a look at each.
GetOutputData(), as described by Unity’s documentation, “Returns a block of the currently playing source’s output data”. Unity’s documentation is great but in this case it really sh*ts the bed. I’m no audiophile so I’m going to keep the terminology very basic. This function returns volume data: each value is a raw amplitude sample, and the further it sits from zero, the louder the track that is currently playing at that instant. Now that may not be a very technical explanation but at least it gives us something we can work with.
On another note, the Unity documentation says that this function takes 2 arguments, both integers and that it returns an array of float values. This method works but it is now deprecated and the documentation has not been updated.
The CORRECT method is to provide 2 arguments, the first being a pre-defined float array with the length being set to the number of samples you’d like to receive, and the second being an integer representing the channel. As follows..
//listener is a reference to the AudioListener in the scene
//define our array and set the length to 64
var sampleData = new float[64];

//the sampleData array will now be filled with the sample data from the GetOutputData function
listener.GetOutputData(sampleData, 0);

//cycle through the data
for (var i = 0; i < sampleData.Length; i++) {
    print("Sample Vol: " + sampleData[i]);
}
Again, Unity’s definition of GetSpectrumData() is somewhat cryptic. Unity’s documentation describes it as..
Returns a block of the currently playing source’s spectrum data
Number of values (numSamples) must be a power of 2. (ie 128/256/512 etc). Min = 64. Max = 8192. Use window to reduce leakage between frequency bins/bands. Note, the more complex window type, the better the quality, but reduced speed.
Ok, so there’s a few key points that need to be made here. First of all, the function takes 3 arguments and the Unity documentation is behind the times on this function as well. The first argument is again a pre-defined float array with the length set to the number of samples you would like to retrieve. The second argument is the channel, and the third argument is the spectrum analysis window type.
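The power-of-2 rule quoted above is easy to trip over, so here is a minimal sketch of how you might sanity check a sample count before sizing the array. It is plain Python rather than Unity code, and the function names are made up for illustration. The only thing taken from the documentation is the 64 to 8192 range and the power-of-2 requirement.

```python
def is_valid_sample_count(n):
    """True if n is a power of two inside the 64..8192 range the docs quote."""
    return 64 <= n <= 8192 and (n & (n - 1)) == 0


def nearest_valid_sample_count(n):
    """Round n to the closest power of two, then clamp it to 64..8192."""
    n = max(1, int(n))
    lower = 1 << (n.bit_length() - 1)  # largest power of two <= n
    upper = lower << 1                 # smallest power of two > n
    closest = lower if (n - lower) < (upper - n) else upper
    return min(max(closest, 64), 8192)


for requested in (51, 300, 100000):
    print(requested, "->", nearest_valid_sample_count(requested))
```

Because both ends of the clamp are themselves powers of two, the clamped result always stays a valid length.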
There are 6 different window types to choose from.
| Window | Formula |
| --- | --- |
| Rectangular | w[n] = 1.0 |
| Triangle | w[n] = TRI(2n/N) |
| Hamming | w[n] = 0.54 - (0.46 * COS(n/N) ) |
| Hanning | w[n] = 0.5 * (1.0 - COS(n/N) ) |
| Blackman | w[n] = 0.42 - (0.5 * COS(n/N) ) + (0.08 * COS(2.0 * n/N) ) |
| BlackmanHarris | w[n] = 0.35875 - (0.48829 * COS(1.0 * n/N)) + (0.14128 * COS(2.0 * n/N)) - (0.01168 * COS(3.0 * n/N)) |
Think of a window as a smoothing function. The window is applied to the block of audio samples before the spectrum is calculated and passed into your array. Apart from Rectangular, each one tapers the values at the edges of the block toward zero, which reduces leakage between frequency bins and lets us better “zoom in” on the data that we want to see: the frequencies. You can read more about windows and spectral analysis in the Wikipedia article on window functions.
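To make the tapering idea concrete, here is a small Python sketch of a Hann (Hanning) window applied to a block of samples. It uses the textbook formula with the 2π factor, 0.5 * (1 - cos(2πn / (N - 1))), which is an assumption on my part: the table above is how Unity's documentation writes it, and Unity's internal implementation may differ. The function names are hypothetical.

```python
import math


def hann_window(size):
    """Textbook Hann window: 0.5 * (1 - cos(2*pi*n / (size - 1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (size - 1)))
            for n in range(size)]


def apply_window(samples, window):
    """Multiply each sample by the matching window weight."""
    return [s * w for s, w in zip(samples, window)]


window = hann_window(64)
block = [1.0] * 64                      # a constant block, just to see the shape
shaped = apply_window(block, window)
print(shaped[0], shaped[31], shaped[63])  # edges near zero, middle near one
```

The weights start and end at (almost exactly) zero and peak in the middle, so the abrupt edges of the block no longer smear energy across neighbouring bins.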
Here is the CORRECT method to use GetSpectrumData: the first argument is a pre-defined float array with the length being set to the number of samples you’d like to receive, the second is the channel, and the third is the window function type.
//listener is a reference to the AudioListener in the scene
//define our array and set the length to 64
var sampleData = new float[64];

//the sampleData array will now be filled with the spectrum data from the GetSpectrumData function
listener.GetSpectrumData(sampleData, 1, FFTWindow.BlackmanHarris);

//cycle through the data
for (var i = 0; i < sampleData.Length; i++) {
    print("Sample Frequency: " + sampleData[i]);
}
Smoothing Out the Sample Data With RMS (Root Mean Square) to Make it Usable
In the above examples we are grabbing a sample size of 64. This means that our float array contains 64 different samples. We need to combine those 64 samples into 1 usable float value. To do this we will use a simple RMS function. RMS stands for root mean square. Essentially, it squares each of the 64 values, adds the squares together, divides that sum by the number of values (64), and returns the square root of the average.
Here’s an example..
function RMS(samples : float[]) : float {
    var result = 0.0;
    //get the sum of the squared samples
    for (var i = 0; i < samples.Length; i++) {
        result += samples[i] * samples[i];
    }
    //get the average of the squared sample values
    result /= samples.Length;
    //return the square root of the average
    return Mathf.Sqrt(result);
}
The RMS function returns one single number that we can now use in our game.
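If you want to check the arithmetic outside Unity, here is the same calculation as a quick Python sketch. The sample values are invented for illustration. It also shows why we square before averaging: raw samples swing above and below zero, so a plain average tends to cancel out.

```python
import math


def rms(samples):
    """Root mean square: square each value, average the squares, take the root."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square)


def plain_average(samples):
    return sum(samples) / len(samples)


quiet = [0.01, -0.02, 0.015, -0.01]   # made-up low-amplitude samples
loud = [0.5, -0.6, 0.55, -0.45]       # made-up high-amplitude samples

print("quiet:", rms(quiet), "loud:", rms(loud))
print("plain average of loud:", plain_average(loud))
```

The plain average of the loud block comes out close to zero even though it is clearly the louder one, while the RMS values separate the two blocks cleanly.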
Using Volume and Frequency Data to Procedurally Generate Game Elements
Ok, so we know we can get both the volume data and the frequency data from the audio source, and we know how to smooth out the returned data – now let’s put it all together and do something with this data. In this example, we’re simply going to place objects in the scene according to the volume and frequency data. This will produce an effect similar to how Guitar Hero works with the music notes. I should mention that since I’m not experienced in dealing with advanced audio techniques this is a very rudimentary example and won’t produce very accurate results. It’s neat nonetheless and the possibilities are endless if you take the time to experiment.
//the audio clip we want to play
var audioTrack : AudioClip;
//the listener (main camera for this example)
var listener : AudioListener;
//the number of samples we want to take per second
var sampleRate : float = 256;
//how often (in seconds) we want to take a sample
var timeSpace : float = 0.2;
//the game object we will be using to represent our samples
var visualPrefab : GameObject;

private var volData : float[];
private var freqData : float[];
private var numSamples : int;
//we'll move along the z axis when placing our objects
private var curZ : int = 0;
//we'll clamp our positions between these values to avoid erratic placements
private var maxX : float = 5.0;
private var minX : float = -5.0;

function Start() {
    //create audio player game object and position it at the same point as our audio listener
    var audioPlay = new GameObject("audioPlay");
    audioPlay.AddComponent("AudioSource");
    audioPlay.transform.position = listener.transform.position;
    audioPlay.audio.clip = audioTrack;
    audioPlay.audio.Play();

    //prep our number of samples: GetSpectrumData needs a power of 2, and we clamp it
    //between 64 and 8192 since this is the min and the max for the numSamples argument
    numSamples = Mathf.Clamp(Mathf.ClosestPowerOfTwo(Mathf.RoundToInt(sampleRate * timeSpace)), 64, 8192);

    //prep our float arrays
    volData = new float[numSamples];
    freqData = new float[numSamples];

    InvokeRepeating("PlaceNewObject", 0, timeSpace);
}

function PlaceNewObject() {
    //update z position
    curZ += 2;

    //get the output data from the listener
    listener.GetOutputData(volData, 0);

    //get the root mean square of the output data (this is the square root of the average of the squared samples)
    var curVol = RMS(volData);

    //amplify the volume, and maintain our range of minX and maxX
    var xPos = Mathf.Clamp(curVol * 100, minX, maxX);

    //only place a new object if we aren't at the extremes of our clamp values
    if (xPos != minX && xPos != maxX) {
        //get the spectrum data from the listener (we use the blackman harris window for maximum contrast)
        listener.GetSpectrumData(freqData, 1, FFTWindow.BlackmanHarris);

        //get the root mean square of the spectrum data
        var curFreq = RMS(freqData);

        //amplify the frequency for more visual impact
        var yPos = Mathf.Clamp(curFreq * 200, 0, maxX);

        //instantiate our new visual object, adjusting x, y and z position as we go
        var newVisual = Instantiate(visualPrefab, Vector3(xPos, yPos, curZ), transform.rotation);
    }
}

function RMS(samples : float[]) : float {
    var result = 0.0;
    //add the squared sample values together
    for (var i = 0; i < samples.Length; i++) {
        result += samples[i] * samples[i];
    }
    //get the average of the squared sample values
    result /= samples.Length;
    //return the square root of the average
    return Mathf.Sqrt(result);
}
Now this example is very simple and isn’t very practical but it should give you a good idea of what you can do with the audio data. To use this script just drag it onto an empty game object (or the main camera) and assign the necessary exposed variables in the inspector panel. I used a sphere for my visual prefab, but you can use whatever you want. When you click play you should see your prefab being instantiated along the Z axis in time with the audio track that you’re playing.
That was pretty easy, wasn’t it? So why did I say that this wasn’t an easy task to accomplish in Unity 3 at the beginning of this post?
Well, the problem is that when using audio to procedurally generate game elements you generally want to analyze the audio track *before* the level is loaded. You probably want the player to be able to use their own personal MP3s as well. That’s where things become complicated. Unity doesn’t offer any way to speed up the processing of the audio clip, and also doesn’t offer any good way of loading an MP3 or WAV file into the scene for processing. Some workarounds are required to accomplish these tasks.
How to Pre Analyze Audio Tracks in Unity 3
One method I’ve considered for web based games is communicating with Flash and/or PHP to upload and process audio tracks before the level is loaded. This way you could allow your users to upload files to your server, process them on your server, then return JSON data containing the required data to generate the level. The tracks could be processed in a matter of seconds instead of minutes, and would allow virtually any audio file to be used to create the level.
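As a rough idea of what the server-side step could look like, here is a minimal Python sketch that walks through already-decoded samples in fixed windows and writes the RMS volume of each window to JSON. Everything specific in it is an assumption for illustration: the function name, the JSON field names ("time", "volume", "points") and the synthetic sine wave standing in for a decoded upload. Decoding the MP3 or WAV itself is not shown.

```python
import json
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def analyze_track(samples, sample_rate, window_seconds=0.2):
    """Split decoded samples into fixed windows and record each window's RMS volume."""
    window_size = max(1, int(sample_rate * window_seconds))
    points = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        points.append({
            "time": round(start / sample_rate, 3),   # seconds from track start
            "volume": round(rms(window), 4),
        })
    return json.dumps({"window_seconds": window_seconds, "points": points})


# Stand-in for a decoded track: one second of a 440 Hz tone that gets louder
sample_rate = 8000
samples = [(n / sample_rate) * math.sin(2 * math.pi * 440 * n / sample_rate)
           for n in range(sample_rate)]

level_json = analyze_track(samples, sample_rate)
print(level_json)
```

Because nothing here has to wait for the audio to play back, the whole track is processed as fast as the server can loop over it, and the game only needs to parse the resulting JSON before building the level.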
Creating an integrated system like that is beyond the scope of this quick tutorial. If I receive enough requests for something like this to be demonstrated I will try and make the time to whip up a working example that can be expanded on. I know that it would definitely make for some very interactive game play.
That’s it for now, have fun!
1,064 Responses to “Audio Based Procedural Level Generation & Manipulation in Unity 3”


I’d love to see more on this topic… Seems like building a level based on audio data is out of the scope of the Unity web player, but what about for bundled projects? Can a true Unity project call an external program to handle the audio data extraction?
Alex,
With Unity Pro it is definitely possible as you can write a plugin to do all of the data processing for you.
It is possible in web players and in standalone projects to do it without using plugins, but it isn’t pretty and not very practical.
Why do people ALWAYS have to use RAR files? WHY OH WHY cant there be a zip version?
Hi,
Pretty neat tutorial !
I’m as well interested in a track preloading tutorial.
U r not getting frequency data at all. U like applying rms to stuff! (-: The size of array is used to define the number of frequency bins(there will be 64 in ur case) and ultimately the bandwidth of each bin. Given that ur source audio was 44.1khz the bandwidth of each bin would be (44100/2)/64 (we divide 44100 by 2 to keep Nyquist a happy chappy). Hence the first index in array represents the energy of the frequencies 0-344.5hz the second index represents energyband 344.5-689hz etc… all the way to 21705-22050hz. Applying RMS, whilst rather cute and all, is also rather inappropriate (-:
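The bin arithmetic in the comment above can be checked with a few lines of Python. This is only a sketch under that comment's assumption, namely that the bins evenly divide the range from 0 Hz up to half the sampling rate. The function names are made up for illustration.

```python
def bin_width_hz(sample_rate, num_bins):
    """Width of one bin when num_bins evenly cover 0 Hz to sample_rate / 2."""
    return (sample_rate / 2.0) / num_bins


def bin_range_hz(index, sample_rate, num_bins):
    """(start, end) frequency in Hz covered by the bin at the given index."""
    width = bin_width_hz(sample_rate, num_bins)
    return (index * width, (index + 1) * width)


print(bin_width_hz(44100, 64))       # width of each of the 64 bins
print(bin_range_hz(0, 44100, 64))    # first bin
print(bin_range_hz(63, 44100, 64))   # last bin
```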
Hey, thanks for the reply.
I mentioned in the post audio processing is not one of my strong points, so it’s good to have a second opinion.
I took the idea of using RMS from a developer at Unity.
Ryan
I can’t download the example file… Can somebody link it or reupload it?
I’ve updated the link in the post now so you can download the example file.
Enjoy!
Thanks for saving my life!
OD1TekNoBee is right. You are not getting frequency data at all. Hey do this to get Low Med High…
//define our array and set the length to 1024
var sampleData = new float[1024];
var freaq = new float[3];
//the sampleData array will now be filled with the spectrum data from the GetSpectrumData function
listener.GetSpectrumData(sampleData, 1, FFTWindow.BlackmanHarris);
//cycle through the data
for (i = 0; i < sampleData.Length; i++) {
/*print ("Sample Frequency: " + i + " " + sampleData[i]);*/
}
//Arithmic operation to get the first 0 to 172 Hz… This is typical Low.
for (i = 0; i < 4; i++) {
freaq[0] = sampleData[i] + freaq[0];
}
//Same as above except we get 172 to 3014 Hz… This is typical Med.
for (i = 0; i < 70; i++) {
freaq[1] = sampleData[i + 5] + freaq[1];
}
//Same as above except we get 3014 to 44100 Hz!!! This is typical high measure.
for (i = 0; i < 950; i++) {
freaq[2] = sampleData[i + 74] + freaq[2];
}
print("low: " + (freaq[0] * 100 * 10));
print("med: " + (freaq[1] * 100));
print("high: " + (freaq[2] * 100));
We don't need to use the whole array we get from GetSpectrumData… And you should use a minimum of 1024 samples, because this divides the HZ (44100 / 1024 = 43) into good parts we can use…
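Here is a small Python sketch of the same low/med/high idea, working from band edges in Hz instead of hard-coded index counts. It assumes, as an earlier comment describes, that the bins evenly cover 0 Hz up to half the sampling rate, which makes each of 1024 bins roughly 21.5 Hz wide at 44.1 kHz. The band edges (172 Hz and 3014 Hz) are taken from the snippet above, and the function name is hypothetical.

```python
import math


def band_energy(spectrum, sample_rate, low_hz, high_hz):
    """Sum the bins whose start frequency falls in [low_hz, high_hz)."""
    width = (sample_rate / 2.0) / len(spectrum)
    first = int(math.ceil(low_hz / width))
    last = min(len(spectrum), int(math.ceil(high_hz / width)))
    return sum(spectrum[first:last])


spectrum = [1.0] * 1024   # flat dummy spectrum, so each sum equals a bin count
sample_rate = 44100

low = band_energy(spectrum, sample_rate, 0, 172)
mid = band_energy(spectrum, sample_rate, 172, 3014)
high = band_energy(spectrum, sample_rate, 3014, sample_rate / 2.0)
print(low, mid, high)
```

Using the same rounding for both ends of every band means neighbouring bands never share a bin and never skip one.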
Yo Danny boy, don’t you hate it when the same word ‘frequency’ is used to refer to 2 different yet closely related things? Can make life kinda confusing huh? 44.1khz is the “SAMPLING frequency”, which means the highest frequency in the resultant AUDIO file (ie .wav etc) will be 22.05khz. This is because it takes at least two sample points to detect the particular representative sine waves in the audio (hence 44.1/2 = 22.05 khz. Looks like someone don’t care much for checking up on Nyquist’s happiness).
This dude Fourier had a theory about breaking up audio (or indeed any function) into its CONSTITUENT sine waves which is to say constituent frequencies, or frequency SPECTRUM. Wave interaction is such a fascinating thing and all… Hmm energy travels in waves and nrg makes up stuff, Rumour even has it that nrg and mass/matter itself have an interesting equivalence relationship, or at least that’s the type of relationship rumour mongering that Einstein was trying to spread, but i diverge and have no wish to partake of such rumour mongering, tho i fear it may be too late…
Ever wondered why 44.1khz is such a popular sampling rate? Well that’s because us humane humans hear frequencies in the range up to 20khz or so. If we were dastardly dogs the max sampling rate would be higher which would also entail more computational resource, which dog gone it, would be a bit of a bitch really… WOOF!