CSC 239: Personal Computing
Concept: Constructing a histogram

Jacob Furst
School of Computing
DePaul University
jfurst@cdm.depaul.edu

April 4, 2009

The construction of a histogram is an important first step in any statistical analysis. The visual appearance of the histogram can be used to determine if there are outliers and what the appropriate statistics are for reporting center and spread. For more complex statistical analysis, the shape of the histogram can be used to determine what analyses are appropriate and which are not. However, the creation of a histogram is not a simple process; it is iterative and benefits greatly from automation. The iterative process for constructing a histogram is an example of computational thinking that belongs in the Evaluation category.

Learning goal

Students can use a statistics package to generate multiple histograms from a single data
set and can choose the most meaningful visualization of the symmetry or skew of the data.
Discussion: Students are shown the algorithm for constructing a histogram:
1) Determine the minimum and maximum of the data
2) Create a number of bins for aggregating data
   a. Bin range must equal or exceed the data range
   b. Bins must not be too few or too many
3) Create a distribution table for the data based on the bins
4) Graph the distribution table
   a. If the resulting histogram is too spiky, return to step 2 and create fewer bins
   b. If the resulting histogram is too shapeless, return to step 2 and create more bins
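The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the statistics package used in the course: the function names and the exam scores are hypothetical, and the histogram is drawn with text characters so that no plotting library is needed.

```python
def distribution_table(data, num_bins):
    """Return (left_edge, right_edge, count) rows for equal-width bins."""
    lo, hi = min(data), max(data)          # step 1: minimum and maximum
    width = (hi - lo) / num_bins or 1.0    # step 2: equal widths (1.0 if all values are equal)
    counts = [0] * num_bins
    for x in data:                         # step 3: tally each value into its bin
        i = min(int((x - lo) / width), num_bins - 1)  # the maximum goes in the last bin
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_bins)]


def print_histogram(rows):
    """Step 4: draw one row of asterisks per bin."""
    for left, right, count in rows:
        print(f"{left:6.1f} to {right:6.1f} | {'*' * count}")


# Hypothetical exam scores, invented for this illustration.
scores = [52, 55, 61, 64, 65, 68, 70, 71, 72, 73,
          74, 75, 77, 78, 80, 82, 85, 88, 91, 97]

print_histogram(distribution_table(scores, 5))
```

Calling `distribution_table` again with a smaller or larger `num_bins` corresponds to returning to step 2; the judgment of "too spiky" or "too shapeless" is left to the person looking at the output.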
There are computer programs that completely automate the construction of a histogram. However,
since the histogram is a visual statistic, it is best constructed with a human guiding the process; in
particular, the choice of the number of bins is critical in generating a histogram that best provides
visual information about the distribution of the underlying data. In CSC 239, students are shown that
the histogram is critical in determining whether the mean and standard deviation or the median and
quartiles are better for summarizing the center and spread of the data.
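As a small illustration of why the shape of the data matters for this choice, the sketch below computes both kinds of summary (mean with standard deviation, and a quartile-based summary with the median as its center) using Python's standard `statistics` module. The waiting times are hypothetical, and `summarize` is a name chosen for this example.

```python
import statistics


def summarize(data):
    """Compute both candidate summaries of center and spread."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),         # sample standard deviation
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical waiting times in minutes, with one very large value.
waits = [2, 3, 3, 4, 4, 5, 5, 6, 7, 45]

for name, value in summarize(waits).items():
    print(f"{name:>6}: {value:.2f}")
```

In this data set the single large value pulls the mean well above the median, which is the kind of skew a histogram makes visible before any summary is chosen.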
Assessment: Students are given a data set and are asked to:
- generate histogram bins that include all the data
   - students will need to find the min and max of the data set and construct data bins that
     include both min and max.
- generate histogram bins that are all of equal width
   - the generated bins are all of equal width; in particular, there is no "more" bin at the top of
     the range, or "less" bin at the bottom of the range.
- generate the "best" number of bins
   - students generate multiple histograms and choose the one with the number of bins that
     best visualizes the data (i.e., patterns in the data set are made explicit and reasonable
     hypotheses can be made about the data set).
As this activity is done for many problem sets, the students have many opportunities to meet the
objectives, and also to see different visualizations for different data sets (e.g., small data sets versus
large ones; symmetric distributions versus skewed ones).
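The first two assessment criteria can be checked mechanically. The sketch below is one possible check, not a grading tool used in the course: `check_bins` and the sample values are hypothetical, and bins are assumed to be given as a sorted list of edges.

```python
import math


def check_bins(data, edges):
    """Return a list of problems with the proposed bin edges (empty if none)."""
    problems = []
    # Criterion 1: the bins must span both the minimum and the maximum.
    if edges[0] > min(data) or edges[-1] < max(data):
        problems.append("bins do not include all the data")
    # Criterion 2: every bin must have the same width.
    widths = [right - left for left, right in zip(edges, edges[1:])]
    if any(not math.isclose(w, widths[0]) for w in widths):
        problems.append("bins are not all of equal width")
    return problems


# Hypothetical data values, invented for this illustration.
values = [52, 61, 68, 70, 75, 78, 82, 88, 97]

print(check_bins(values, [50, 60, 70, 80, 90, 100]))
print(check_bins(values, [60, 70, 80, 90]))
print(check_bins(values, [50, 60, 80, 100]))
```

The third criterion, choosing the "best" number of bins, is deliberately not automated here, since it depends on what the student sees in the resulting histograms.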


Assessment: The following question will be added to an assignment:
The registrar’s office at a community college is short of people handling course
registrations due to the recent surge of student enrollment. This results in students waiting
in a line for hours and employees working for long hours. They decide to develop a Web-
based course registration system to automate this process. Please answer the following
questions:
1) Please use a flow chart and your own language to describe a common registration
process.
2) What characteristics of the registration process make it suitable for automation?
3) What are the benefits of using automation?

Learning Goal 2: Students are able to recognize characteristics of the coordination concept,
apply this concept in a realistic system, and understand the benefits.
Assessment: The following question will be added to an assignment:

The Operations Department at a community college is responsible for finding and
assigning classrooms, providing proper technical support, and working with different
academic units to assign instructors for all the classes. Recently the college has
dramatically increased the number of course offerings due to strong demand. This
increase has caused some glitches in the operations: some classrooms were too small to
fit the designed classes while some bigger rooms were assigned to small classes,
students/instructors went to the wrong classrooms, etc. To solve these problems, the
college decides to develop a computer application to organize the logistics among
different units by sharing real-time information and synchronizing various activities.
Please answer the following questions:
1) Please give some examples of the information needed to be coordinated among
different units.
2) What characteristics of this logistics system make it suitable for coordination?
3) What are the benefits of using coordination?

Learning Goal 3: Students understand that (large) data sets can be mined for patterns; they can
provide high-level explanations of what data is relevant to a particular question and
interpretations of data mining results.


Assessment: The following question will be added to an assignment:

A community college has been experiencing problems in finding instructors and
classrooms due to the fluctuating student enrollment. They must determine whether to
hire more faculty, how many more to hire, and whether to build more classrooms. They
decide to use a computer application to analyze their enrollment data for the last 30
years, which they have recently digitized, and to make forecasts of future enrollments of
different classes based on the findings from the historical data. The following are the data
they were able to collect:
1) Student demographic information (gender, age, education, occupation, etc.)
2) Faculty demographic information (gender, age, education, specialty, etc.)
3) Demographic information for the region
4) Regional employment statistics
5) Enrollments of all the classes
Please answer the following two questions:
A) Which data in the above list could be used in the data mining process to provide a useful and meaningful forecast, and why?
B) Suppose the data mining analysis finds that enrollment in classes in a certain area increased by a similar percentage whenever the demand for workers in the same area increased. If current regional employment statistics show that the job demand for software engineers has increased by 10% in the last six years, what recommendations would you make to the college?