Jeffkillian.com started as a Geocities website when I was in 6th grade. Over the years, it has slowly evolved into a place where I can practice and keep up with web development as it progresses. I started out with Microsoft FrontPage, then moved to Dreamweaver. Initially there was very little coding; it was all WYSIWYG. As I learned more, though, I could make the site more interactive.

I incorporate drawings and blog posts to keep those who are interested updated. Through the evolution of the website, I was forced to learn HTML, PHP, CSS, XML, jQuery, JavaScript, and the handling of MySQL databases. I've put some code samples online.

Forcing Non-Normal Data Visually into a Normally Distributed Bar Chart

November 21st, 2012 (last edited November 27th, 2012, 17:53:33)

A friend recently came to me with a statistics-related request, and I figured I'd detail it here. He gave me about 4000 non-normally distributed data points in the range 0-1000. My task was to make a bar graph with five bars, each corresponding to a different interval within 0-1000 (the intervals didn't have to be the same length). The intervals were mutually exclusive and had to be chosen so that, when graphed from least to greatest, the bar heights looked normally distributed.

I had to turn this data:

[image: the original data distribution]

into this:

[image: the finished bar chart]

“Just make the graph look pretty,” they said.

My first approach was visual. Roughly 68% of normally distributed data falls within one standard deviation of the mean, so I aimed for the middle bar to hold 68% of the data, gave the outermost bars 5% each, and the second and fourth bars 11% each.
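That check is simple to sketch in code. Here's a rough Python version, with randomly generated data standing in for the real dataset (which wasn't published), assuming half-open intervals:

```python
import random

# Hypothetical stand-in for the friend's data: ~4000 skewed points in [0, 1000).
random.seed(42)
data = [min(999, int(random.expovariate(1 / 200))) for _ in range(4000)]

# Target percentages for the five bars: 5 / 11 / 68 / 11 / 5.
targets = [0.05, 0.11, 0.68, 0.11, 0.05]

def bar_counts(points, edges):
    """Count points in the half-open intervals [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for p in points:
        for i in range(len(edges) - 1):
            if edges[i] <= p < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Guessed interval boundaries (illustrative, not the ones actually used).
edges = [0, 30, 75, 400, 700, 1000]
counts = bar_counts(data, edges)
goals = [round(t * len(data)) for t in targets]
print(counts, goals)
```

Comparing `counts` against `goals` bar by bar is exactly the guess-and-check loop described below, just automated.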

I used Excel to guess and check, making a sheet that listed the goal count for each bar (calculated from its percentage of the 4000 points) and how many points currently fell in that column. I then started from the lowest interval, trying to get 5% of the total points into it. Because the top of the first interval is the bottom of the second, I then had to adjust the top of the second interval to try to match its 11%, and so on. This worked reasonably well, and I got a graph that looked like this:

[image: the resulting five-bar graph]
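That left-to-right sweep can also be sketched programmatically. This is a hypothetical helper (the original work was all done by hand in Excel) that places each cut so the cumulative count lands as close as possible to the cumulative target percentage:

```python
def sweep_edges(points, percents, lo=0, hi=1000):
    """Pick interval edges left to right so each running count lands as
    close as possible to the running target percentage.
    (Hypothetical helper; the post's version was guess-and-check in Excel.)"""
    pts = sorted(points)
    n = len(pts)
    edges = [lo]
    cum = 0.0
    for pct in percents[:-1]:
        cum += pct
        k = round(cum * n)          # index where the cumulative goal is hit
        edges.append(pts[min(k, n - 1)])
    edges.append(hi)
    return edges

# With the 5/11/68/11/5 split:
# sweep_edges(data, [0.05, 0.11, 0.68, 0.11, 0.05])
```

Note this greedy sweep hits the cumulative goals, but on lumpy data the individual bars can still end up far from their targets, which is exactly the problem described next.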

However, the overall counts were far from what I had wanted: I was a cumulative 2477 data points off of ideal (bar 3 was too low, and bars 2 and 4 were too high). Five bars simply weren't going to cut it, given how the data was distributed.

I asked and was told that 7 bars might also work. Onward.

Instead of guessing the z-scores and the percentage each bar should contain, I sat down and calculated how much should fall in each bar if the data were perfectly normally distributed (within reason; I didn't have the population SD or mean). My friend also mentioned that he really wanted the middle bar to be as close to ideal as possible, so I decided to start with that interval rather than the leftmost.
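The per-bar ideal fractions come straight from the normal CDF. A minimal sketch using Python's `statistics.NormalDist` (the z cut points below are assumptions for illustration, not the ones actually used):

```python
from statistics import NormalDist

def ideal_fractions(z_cuts):
    """Fractions of a standard normal falling between consecutive z cuts,
    with open tails on either end."""
    nd = NormalDist()  # standard normal: mean 0, SD 1
    cuts = [float("-inf")] + list(z_cuts) + [float("inf")]
    return [nd.cdf(cuts[i + 1]) - nd.cdf(cuts[i]) for i in range(len(cuts) - 1)]

# Example: seven bars cut at z = -2.5, -1.5, -0.5, 0.5, 1.5, 2.5.
fracs = ideal_fractions([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
print([round(f, 4) for f in fracs])
```

Multiplying each fraction by the total point count gives the goal count per bar, using the sample mean and SD to convert interval endpoints to z-scores.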

It was at this point I realized this was a dynamic programming problem: I was just trying to find the combination of intervals that minimized the difference between the ideal normal curve and the obtained “curve”. We had covered dynamic programming in my Algorithms class, but not at this scale, and I wasn't ready to go back and dig into it if all they wanted was to “make the graph look pretty”. Not yet.

But just for kicks, I called my friend at Google, and he agreed that a programming solution would take more time than it's worth for a “pretty graph”.
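That said, the dynamic programming version is short enough to sketch. Assuming a fixed, sorted list of candidate cut points and per-bar goal counts (both hypothetical here), it picks the cuts minimizing the total absolute deviation from the ideal counts:

```python
import bisect

def best_cuts(points, goals, candidates):
    """Choose interval cuts from `candidates` (sorted, including the overall
    min and max) minimizing sum(|bar count - goal|) over the bars."""
    pts = sorted(points)

    def count(a, b):  # number of points in [a, b)
        return bisect.bisect_left(pts, b) - bisect.bisect_left(pts, a)

    k = len(goals)
    c = candidates
    INF = float("inf")
    # dp[j][i] = best total error for the first j bars, with cut j at c[i]
    dp = [[INF] * len(c) for _ in range(k + 1)]
    prev = [[None] * len(c) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, len(c)):
            for h in range(i):
                if dp[j - 1][h] == INF:
                    continue
                err = dp[j - 1][h] + abs(count(c[h], c[i]) - goals[j - 1])
                if err < dp[j][i]:
                    dp[j][i], prev[j][i] = err, h
    # Walk the chosen cuts back from the top of the range.
    path, i = [len(c) - 1], len(c) - 1
    for j in range(k, 0, -1):
        i = prev[j][i]
        path.append(i)
    return dp[k][len(c) - 1], [c[i] for i in reversed(path)]
```

With cut points at multiples of 5 over 0-1000 and 7 bars, that's only a few hundred thousand transitions, so it would have run instantly; the expensive part is writing it, which is exactly the point my friend made.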

The first step, then, was to figure out what the middle interval should be. I created an Excel table mapping every possible interval (with endpoints rounded to multiples of 5 to keep the number of columns from getting out of hand).

Each cell in the image corresponds to an interval, with the bottom (start) of the interval as the row and the top of the interval as the column. In the image, the highlighted number indicates that an interval from 30 (inclusive) to 75 (exclusive) would contain 331 entries. I had calculated that my target middle interval would fall somewhere within the light orange square.
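In code, that table is just a mapping keyed by (start, end) pairs. A rough Python equivalent of the spreadsheet (the 5-unit step is from the post; everything else here is an assumption for illustration):

```python
import bisect

def interval_table(points, step=5, lo=0, hi=1000):
    """Map every (start, end) pair of multiples of `step` to the number of
    points falling in [start, end) -- the spreadsheet as a dictionary."""
    pts = sorted(points)
    marks = list(range(lo, hi + step, step))

    def count(a, b):
        return bisect.bisect_left(pts, b) - bisect.bisect_left(pts, a)

    return {(a, b): count(a, b) for a in marks for b in marks if a < b}

# table[(30, 75)] would be the count in [30, 75) -- the highlighted cell.
```

Sorting once and using `bisect` makes each cell a pair of binary searches rather than a scan over all 4000 points.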

Through a process of guess and check, I found the ideal interval for the middle column. I then expanded on either side, changing the interval sizes so that they best approximated a normal curve.

I ended up with this:

[image: the final seven-bar chart]

Of the 4066 data points, I had a cumulative error of 157, meaning the bars were over or under their ideal counts by a combined total of 157 points. That's around 4%, which qualified as “pretty” in their book.
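For clarity, the error metric here is just the sum of absolute per-bar deviations (the actual per-bar counts weren't published, so the example numbers below are made up):

```python
def cumulative_error(actual, ideal):
    """Sum of |obtained count - ideal count| across all bars."""
    return sum(abs(a - i) for a, i in zip(actual, ideal))

# Made-up example: two bars off by 2 and 5 give a cumulative error of 7.
print(cumulative_error([10, 20], [12, 15]))  # 7

# 157 out of 4066 points:
print(round(157 / 4066 * 100, 1))  # 3.9 percent
```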

It was a good way to learn about Excel, and I ended up having to write (aka copy from the internet) some functions to make things easier.