Please note: for an even more thorough discussion of using histogram, you can always check out dfanning.com. His page has a more in depth discussion, and even more examples. So why are you still reading this?
histogram is much more valuable than just building histograms. By default histogram simply bins your data and returns the number of data points in each bin. However you can also get histogram to look at each bin and return the indexes to the data points (in the original array) that fell in a given bin. This way you can perform whatever calculations you want to on a bin to bin basis. Because histogram runs natively on your processor and is highly optimized, using it to perform your binning is substantially faster than the standard for-where combination. The key is in the reverse_indices keyword to histogram. Unfortunately, the array returned by reverse_indices is in a very strange format. Here's how it looks:
data = randomu(seed,10)
Just looking at it ri doesn't seem to be terribly useful. What has happened is that reverse_indices returns to arrays that have been concatenated together. To use it, we must understand how to separate the two arrays. In the end we want to reconstruct which data points went into which bin (or vice-versa). So we want to ask the question, which data points fell in the first bin? For now let's answer that question just looking at the array.
print,data
; 0.0275016 0.487574 0.653317 0.362500 0.0742411 0.442709
; 0.218974 0.545769 0.842201 0.889318
hist = histogram( data )
print,hist
; 10
hist = histogram( data, binsize=.2 )
print,hist
; 3 1 3 1 2
hist = histogram( data, binsize=.2, reverse_indices=ri )
print,ri
; 6 9 10 13 14 16
; 0 4 6 3 1 5
; 7 2 8 9
histogram will automatically set the starting value of the first bin to be the minimum value in the array: 0.0275016. I specified a bin size of .2, so the first bin will be populated with values between 0.0275016 and 0.2275016. So three data points should fall in the first bin: 0.0275016, 0.0742411, and 0.218974. These correspond to indexes 0, 4, and 6. Now we just have to see how we can retreive this information from the ri array.
As I said, the ri array is two arrays concatentated together. The first array (often refered to as the 'i' vector) has nbins+1 elements, and can be found at the start of ri. In other words i = ri[0:nbins] where nbins = n_elements(hist). This 'i' vector tells you which indices (in the ri array) can be used to find the indices to the original array. Basically, the ri array is an array that indexes itself. Confusing, I know. Here's the 'i' vector in the above case:
nbins = n_elements(hist)
Our histogram has 5 bins, and there are 6 elements in the 'i' vector. If we want to know where we can find the indexes for the first bin (technically, the zeroth bin since IDL is zero-indexed) we would look at the values in the zeroth and first indexes in the 'i' vector. The zeroth index will tell us starting index in the
print,ri[0:nbins]
; 6 9 10 13 14 16
ri array, and the first index will tell us the ending index in the ri array. In this particular case ri[0] = 6 & ri[1] = 9. So if we want to know the indexes in the original array that contain the data points in the zeroth bin, we would extract them from the second part of the ri array like this: indexes = ri[6:9-1]. Since each value in the 'i' vector specifies the starting point for the index list of a bin, you have to subtract one to get the ending point for the previous bin (hence, ri[6:9-1] rather than ri[6:9]. We can go ahead and print out this value, and see that it is what we expect. Previously we looked at our array and determined that values 0, 4, and 6 fell in the first bin.
print,ri[6:9-1]
So, we've successfully used reverse_indices to calculate which data point went into which bin! Putting it all together, this is how you extract data from the ith bin:
; 0 4 6
; 0th bin:
Finally, if we index the originally array with the above indices, we can extract the actual data points in each bin:
print, ri[ ri[0]:ri[1]-1 ]
; 0 4 6
; 1st bin:
print, ri[ ri[1]:ri[2]-1 ]
; 3
; 2nd bin:
print, ri[ ri[2]:ri[3]-1 ]
; 1 5 7
; ith bin:
i = 3
print, ri[ ri[i]:ri[i+1]-1 ]
; 2
i = 0
One very important note. It is necessary to check before hand and actually make sure that there is data in a bin before trying to print it out. If no data points fell into a bin, then for that bin ri[i] will equal ri[i+1], and so ri[ ri[i]:ri[i+1]-1 ] will return an error. Here's an explicit example:
print, data[ ri[ ri[i]:ri[i+1]-1 ] ]
; 0.0275016 0.0742411 0.218974
i = 1
print, data[ ri[ ri[i]:ri[i+1]-1 ] ]
; 0.362500
i = 2
print, data[ ri[ ri[i]:ri[i+1]-1 ] ]
; 0.487574 0.442709 0.545769
i = 3
print, data[ ri[ ri[i]:ri[i+1]-1 ] ]
; 0.653317
i = 4
print, data[ ri[ ri[i]:ri[i+1]-1 ] ]
; 0.842201 0.889318
arr = [.25, .75, 2.5, 2.7] ; no data in the i=1 bin
hist = histogram(arr, reverse_indices = ri )
nbins = n_elements(hist)
print,ri[0:nbins]
; 4 6 6 8
i = 1
print,ri[ ri[i]:ri[i+1]-1 ]
;% Subscript range values of the form low:high must be >= 0, < size, with low <= high: RI.
;% Execution halted at: $MAIN$
if ri[i] eq ri[i+1] then print, 'No data points in bin ' + string(i)
;No data points in bin 1
Hopefully, that fully explains usage of the reverse_indices keyword. Of course at this point it's important to discuss the bottom line. Just how much faster is this anyway? I've written two simple programs that do the same thing. They both bin a data set and return an array giving the bin location, the mean value of the data points in a bin, and the standard deviation of the data points in the bin. You could imagine doing this if say you wanted to look for variation in stellar populations as a function of radius (in a star cluster?), and so you could bin stars accordging to radius and then calculate the mean magnitude of stars in each radius bin. Click here to download a program using for and where. Click here to download a program that does exactly the same thing using histogram. Comparing the execution times:
pro hist_where_time
You'll find that using histogram and reverse indices is about 15 times faster than for and where!
n = 100000
vals = randomu(seed,n)
t1 = systime(/seconds)
res1 = for_stats(vals,binsize=binsize,minval=minval,maxval=maxval)
t2 = systime(/seconds)
res2 = hist_stats(vals,binsize=binsize,minval=minval,maxval=maxval)
t3 = systime(/seconds)
print,(t2-t1)/(t3-t2)
end
Chunk Indexing
If you read through the dfanning.com tutorial, you'll see that there are a number of uses for histogram above and beyond just this. Out of those, the one I use the most is chunk indexing. Imagine you have an two arrays, one with values and another that lists how many times you want the values in the first array to be repeated:
values = [5,2,3,8]
You can easily do this with histogram:
repeated = [4,1,2,3]
; result = [5,5,5,5,2,3,3,8,8,8]
values = [5,2,3,8]
You could of course accomplish this same thing with for loops, but using histogram is typically about a factor of ~10 faster than an efficient implementation with for loops.
repeated = [4,1,2,3]
h = histogram( total(repeated,/cumulative)-1, /binsize, min=0, reverse_indices=ri )
result = values[ri[0:n_elements(h)-1]-ri[0]]
print,result
; 5 5 5 5 2 3 3 8 8 8