# CSC 239: Personal Computing

Concept: Constructing a histogram

Jacob Furst

School of Computing

DePaul University

jfurst@cdm.depaul.edu

April 4, 2009

The construction of a histogram is an important first step in any statistical analysis. The visual appearance of the histogram can be used to determine if there are outliers and what the appropriate statistics are for reporting center and spread. For more complex statistical analysis, the shape of the histogram can be used to determine what analyses are appropriate and which are not. However, the creation of a histogram is not a simple process; it is iterative and benefits greatly from automation. The iterative process for constructing a histogram is an example of computational thinking that belongs in the Evaluation category.

## Learning goal

Students can use a statistics package to generate multiple histograms from a single dataset and can choose the most meaningful visualization of the symmetry or skew of the data.

Discussion: students are shown the algorithm for constructing a histogram:

1) Determine the minimum and maximum of the data

2) Create a number of bins for aggregating data

   a. Bin range must equal or exceed the data range

   b. Bins must not be too few or too many

3) Create a distribution table for the data based on the bins

4) Graph the distribution table

   a. If the resulting histogram is too spiky, return to step 2 and create fewer bins

   b. If the resulting histogram is too shapeless, return to step 2 and create more bins
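The steps above can be sketched in a few lines of Python. This is only an illustration, not the statistics package used in the course: the function names `build_histogram` and `print_histogram` and the `scores` data are made up for this example, and the "graph" is a plain text bar chart.

```python
def build_histogram(data, num_bins):
    """Return (edges, counts) for equal-width bins spanning min..max of data."""
    if not data:
        raise ValueError("data must not be empty")
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    # Step 1: determine the minimum and maximum of the data.
    lo, hi = min(data), max(data)
    # Step 2: create equal-width bins whose range covers the data range.
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    edges = [lo + i * width for i in range(num_bins + 1)]
    # Step 3: build the distribution table (number of values in each bin).
    counts = [0] * num_bins
    for x in data:
        # The maximum lies on the last edge, so clamp it into the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return edges, counts


def print_histogram(edges, counts):
    # Step 4: graph the distribution table as a text bar chart.
    for i, count in enumerate(counts):
        print(f"[{edges[i]:6.1f}, {edges[i + 1]:6.1f}) {'#' * count}")


# Hypothetical data, invented for this illustration.
scores = [52, 55, 61, 64, 66, 68, 70, 71, 73, 75, 76, 78, 80, 83, 85, 91, 97]

# Steps 4a and 4b: try several bin counts and compare the pictures by eye.
for bins in (3, 5, 9):
    print(f"--- {bins} bins ---")
    print_histogram(*build_histogram(scores, bins))
```

The loop at the end is the iterative part: the code produces candidate histograms, but deciding which one is too spiky or too shapeless is still left to the person looking at them.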

There are computer programs that completely automate the construction of a histogram. However, since the histogram is a visual statistic, it is best constructed with a human guiding the automation; in particular, the choice of the number of bins is critical in generating a histogram that best provides visual information about the distribution of the underlying data. In CSC 239, students are shown that the construction of a histogram is critical in determining whether the mean and standard deviation or the median and quartiles are better for summarizing the center and spread of the data.
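As a rough illustration of why the shape of the data matters for this choice, the sketch below computes both kinds of summary for a small skewed data set using Python's standard `statistics` module. The `summarize` helper and the `wait_minutes` values are hypothetical and are not taken from the course materials.

```python
import statistics


def summarize(data):
    """Return mean/standard deviation and quartile-based summaries of data."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical right-skewed data: one very long wait among short ones.
wait_minutes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

summary = summarize(wait_minutes)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this made-up example the single large value pulls the mean well above the median, which is the kind of situation a histogram makes visible before any summary statistic is reported.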

Assessment – Students are given a data set and are asked to:

- generate histogram bins that include all the data
  - students will need to find the min and max of the data set and construct data bins that include both min and max.
- generate histogram bins that are all of equal width
  - the generated bins are all of equal width; in particular, there is no “more” bin at the top of the range or “less” bin at the bottom of the range
- generate the “best” number of bins
  - students generate multiple histograms and choose the one with the number of bins that best visualizes the data (i.e. patterns in the data set are made explicit and reasonable hypotheses can be made about the data set).
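The first two criteria are mechanical and could be checked automatically. The following is a minimal sketch under assumptions: the `check_bins` function, its `tolerance` parameter, and the sample data are invented for illustration and are not part of any grading tool described here. The third criterion, the "best" number of bins, is a visual judgment that this sketch does not attempt to check.

```python
import math


def check_bins(edges, data, tolerance=1e-9):
    """Return a list of problems with the proposed bin edges (empty if none)."""
    if len(edges) < 2:
        return ["need at least two edges to form one bin"]
    problems = []
    # Criterion 1: the bins must include both the min and the max of the data.
    if min(data) < edges[0] or max(data) > edges[-1]:
        problems.append("bins do not include all the data")
    # Criterion 2: every bin must have the same width.
    widths = [right - left for left, right in zip(edges, edges[1:])]
    if any(w <= 0 for w in widths):
        problems.append("edges must be strictly increasing")
    elif not all(math.isclose(w, widths[0], abs_tol=tolerance) for w in widths):
        problems.append("bins are not all of equal width")
    return problems


# Hypothetical data and candidate bin edges.
data = [52, 55, 61, 64, 70, 73, 78, 85, 91, 97]
print(check_bins([50, 60, 70, 80, 90, 100], data))  # equal widths, covers the data
print(check_bins([50, 60, 70, 100], data))          # last bin is wider than the rest
print(check_bins([60, 70, 80, 90], data))           # misses the min and the max
```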

As this activity is done for many problem sets, the students have many opportunities to meet the objectives, and also to see different visualizations for different data sets (e.g. small data sets versus large ones; symmetric distributions versus skewed ones).

Assessment: The following question will be added to an assignment:

The registrar’s office at a community college is short of people handling course registrations due to the recent surge of student enrollment. This results in students waiting in a line for hours and employees working for long hours. They decide to develop a Web-based course registration system to automate this process. Please answer the following questions:

1) Please use a flow chart and your own language to describe a common registration process.

2) What characteristics of the registration process make it suitable for automation?

3) What are the benefits of using automation?

Learning Goal 2: Students are able to recognize characteristics of the coordination concept, apply this concept in a realistic system, and understand the benefits.

Assessment: The following question will be added to an assignment:

The Operations Department at a community college is responsible for finding and assigning classrooms, providing proper technical support, and working with different academic units to assign instructors for all the classes. Recently the college has dramatically increased the number of course offerings due to strong demand. This increase has caused some glitches in the operations: some classrooms were too small to fit the designed classes while some bigger rooms were assigned to small classes, students and instructors went to the wrong classrooms, etc. To solve these problems, the college decides to develop a computer application to organize the logistics among different units by sharing real-time information and synchronizing various activities. Please answer the following questions:

1) Please give some examples of the information that needs to be coordinated among different units.

2) What characteristics of this logistics system make it suitable for collaboration?

3) What are the benefits of using collaboration?

Learning Goal 3: Students understand that (large) data sets can be mined for patterns; they can provide high-level explanations on what data is relevant to a particular question and interpretations on data mining results.

Assessment: The following question will be added to an assignment:

A community college has been experiencing problems in finding instructors and classrooms due to the fluctuating student enrollment. They must determine whether to hire more faculty, how many more to hire, and whether to build more classrooms. They decide to apply a computer application to analyze their enrollment data for the last 30 years which they have recently digitized and to make forecasts of future enrollments of different classes based on the findings from the historical data. The following are the data they were able to collect:

1) Student demographic information (gender, age, education, occupation, etc.)

2) Faculty demographic information (gender, age, education, specialty, etc.)

3) Demographic information for the region

4) Regional employment statistics

5) Enrollments of all the classes

Please answer the following two questions:

A) Which data in the above list could be used in the data mining process to provide a useful and meaningful forecast, and why?

B) Suppose the data mining analysis finds that enrollment in classes in a certain area increased by a similar percentage whenever the demand for workers in the same area increased. Current regional employment statistics show that the job demand for software engineers has increased by 10% in the last six years. What recommendations would you make to the college?