Automatic optical music recognition is the automatic analysis of images of musical notation, either handwritten or printed. Music notation represents musical information in symbolic form. It carries much information that is important for other music information processing studies, so music notation has to be entered in most laboratories where music information processing is studied. In most cases, this is done manually with a computer keyboard or mouse. Automatic recognition would be preferable here.
Automatic optical music recognition can be roughly classified into two categories: online and offline.
In an online system, the machine analyses the musical score and generates the result almost instantaneously. Such a system can be attached to a device, such as robotic arms placed at a piano, and perform the piece in real time. In such a system, the machine must be able to carry out the analysis in a short time. This implies that the system may not have enough time to analyze the whole score before generating its output.
In an offline system, the score is first digitized as an image
file and stored. Usually optical scanners are used, with cameras as
an alternative. The stored image is then analyzed by the computer and converted
into a binary form, using a coding that should be suitable
both for performing the piece and for reprinting the score.
Since an offline system can analyze the whole score before generating the
output, the accuracy of recognition is improved. For example, a sophisticated
semantic checker can be developed to correct suspected mistakes made in
an earlier stage of recognition.
Text is printed by placing characters from several fonts onto a blank sheet, while a musical score lays musical symbols, which can be treated as characters in some special fonts, onto a sheet of staff lines. A musical score can be divided into groups, with symbols attached to the same staff being considered as a group. These groups are then analogous to the rows of characters in optical character recognition. These observations suggest that current optical character recognition systems may be adapted to perform optical music recognition. This would be desirable, since optical character recognition has been under development for many years and such a system could be modified to build an optical music recognition system.
Unfortunately, there are some major differences between text and musical scores that make it difficult to adapt current optical character recognition systems to recognize music.
Figure 1: Relative sizes of musical symbols vs. text characters.
Owing to these major differences, ordinary optical character recognition
techniques do not perform well on musical scores. Special techniques have
been developed to handle musical scores efficiently and effectively.
An automatic printed music recognition system would make it practical to convert large quantities of printed music into computer-readable form. This is similar to the need for automatic conversion of engineering drawings, circuit layouts and so on from existing documents. Once the music is stored electronically in some music representation language, it can be manipulated freely, enabling applications such as musicological analysis, point-of-sale printing, and production of new editions.
Some other applications for printed music recognition are stated below:
Above all, the image recognition of printed musical characters presents
us with a challenging problem in the field of image pattern matching and
semantic analysis.
staff line: On a musical score, a staff line is a long, thin horizontal
line which defines a coordinate system. Along the line from left to right
is the time axis. The scale on this axis is roughly linear. The direction
perpendicular to the staff lines is the dimension of pitch, with higher positions
denoting higher pitches.
staff space: This is the distance between adjacent staff lines of
the same stave.
stave (staff): A stave is also known as a staff. It is a group of
five staff lines.
bar line: A vertical line in a musical score that separates notes into
groups called "bar units".
ledger line: Ledger lines are additional horizontal lines added near
a note symbol when a note lies too far above or below the staff. They help
to clarify the positions of these notes.
bar unit: A small unit of a piece of music. Each bar unit occupies
the same length of time.
note symbol: A note symbol in a musical score is a symbol that represents
a musical note and its duration. The pitch of a note is determined by the
vertical position of the note symbol relative to the staff.
note head: This is the elliptical portion of the note symbol. For
whole notes and half notes, the note heads are hollow. For other notes,
the note head is a solid ellipse.
note stem: This is the vertical line segment of a note symbol. Except for
the whole note, all note symbols have a stem, with one end touching
the note head.
note flag: This is the tail part of a note, which determines the type
of the note. The flag is at the end of the note stem opposite to the end attached to the
note head. A whole note, half note or quarter note does not have flags.
An eighth note has one flag, a sixteenth note has two flags, and so on.
voice: A voice is a musical line. A voice may correspond to a single
instrument, though the piano part of a score is usually notated as two or
more voices.
slur: A slur is a thin, wide, curved line that spans a group
of note symbols. Slurs may span several bar units.
pedal marking: A pedal marking tells a pianist how to control the
foot pedals of a piano.
dynamic marking: Dynamic markings are present in a musical score
to indicate the loudness of subsequent notes.
Figure 2: Names of various basic components on a musical
score.
Here is a list of musical symbols, with the names given by their side.
Figure 3:
Musical symbols
Work on automatic recognition of printed music began in the late 1960s and early 1970s with the research of Pruslin and Prerau at MIT. The limitations of the hardware available at that time for acquiring and manipulating images restricted the possibilities of the work, but some progress was made using techniques including low-pass filtering and contour tracing. In the mid and late 1980s, Matsushima and Katayose at Waseda University built the WABOT-2 keyboard-playing robot, which has a vision system and uses mask matching implemented in hardware, in conjunction with localized measurements, to read nursery song sheets. Most of the work through the 1990s concentrated on locating staves and on isolating and recognizing symbols.
Pruslin preprocesses the music image by eliminating all thin horizontal and vertical lines, including many bare staff-line sections and stems. This results in an image of isolated symbols, such as note heads and beams, which are then recognized using contour-tracing methods. Prerau describes a "fragmentation and assemblage" method for treating staff lines and isolating music symbols.
Some automatic recognition systems for music notation have been developed. Nakamura and Fujinaga have proposed using projection profiles: the type and position of each symbol are recognized by extracting shape and position features from the horizontal or vertical projection, because certain points in the pattern of a symbol carry important features. The advantage of these methods is simplicity. If symbols are connected or drawn in vertical alignment, however, recognition is difficult. These methods can recognize only simple monophonic notation such as children's songs.
Tojo has proposed a recognition method that classifies symbols into large groups according to the shape of the bounding rectangle of each symbol and then discriminates between symbols by structural analysis. For this method to work, careful elimination of the staff lines as a preprocessing step and a fine segmentation of symbols are required. Matsushima has developed a high-speed recognition system for real-time musical performance with a robot. This system uses hardware to detect symbols in about 10 seconds, but it is too inflexible to handle complex notations.
Recognition systems face various difficulties. In simple notations,
the density of symbols is low, so connections between symbols do not
need to be considered. In complex notations, symbols are drawn with high
density, so connection, overlap and containment of symbols appear everywhere.
This, in addition to the overlap with the staff lines, makes the segmentation
of symbols very difficult. From the point of view
of semantics, simple notations can be interpreted uniquely by means of
simple rules. In complex notations there are ambiguous descriptions, so
proper knowledge is required to interpret them. Because of these
difficulties, existing recognition systems for music notation cannot deal
with complex notations.
Figure 4:
The processing flow
A common first operation in a music recognition system is thresholding
to convert a grey scale image into a binary image. Other forms of preprocessing
are sometimes used for noise reduction. The other important steps in the recognition
of printed music documents include staff line identification, symbol classification,
symbol recognition and analysis of the relative positions of the symbols.
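As a minimal sketch of the thresholding step (not the code used in this project), the grey scale image can be compared pixel by pixel against a cutoff. The function name `threshold`, the fixed cutoff of 128 and the convention that 0 is black and 255 is white are assumptions made for illustration.

```python
def threshold(gray, cutoff=128):
    """Convert a grey scale image (rows of 0..255 values) to a binary image.

    Dark pixels (below the cutoff) are treated as ink and become 1,
    everything else becomes 0. The cutoff value is an assumed default.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


gray = [
    [255, 250, 30],
    [20, 240, 10],
]
binary = threshold(gray)
print(binary)
```

A fixed cutoff is the simplest choice; a scan with uneven lighting would probably need a cutoff derived from the image itself.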
Brief Synopsis
In an offline optical music recognition system, the musical score is first
scanned with an optical scanner. The output of this process is a bitmap,
which is the input to the optical music recognition.
The bitmap is analyzed to detect the staff lines. Details of this will
be discussed in section 5.1. After this, there are two options.
One is to remove all staff line segments that do not overlap with other
musical symbols. This isolates the musical symbols that were connected
by the staff lines. The other option is to keep the staff lines. With this
option, recognition in the later stages must use techniques that can work
in the presence of staff lines. For example, template matching methods
can be used to perform recognition in the presence of staff lines. We use
the first approach. Figure 5 explains more.
Figure 5:
Removal of staff lines
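The first option, removing only those staff line segments that do not overlap another symbol, can be sketched as follows. This is an illustrative sketch rather than the project's code: the function name `remove_staff_line` and its parameters are hypothetical, the image is assumed to be a list of rows with 1 for black, and the overlap test simply checks the pixels directly above and below the line.

```python
def remove_staff_line(img, top, thickness):
    """Erase one horizontal staff line that occupies rows top..top+thickness-1.

    A column of the line is kept when a black pixel touches it from above
    or below, on the assumption that a symbol crosses the line there.
    """
    height = len(img)
    out = [row[:] for row in img]  # work on a copy, leave the input intact
    for x in range(len(img[0])):
        above = img[top - 1][x] if top > 0 else 0
        below = img[top + thickness][x] if top + thickness < height else 0
        if not above and not below:
            for y in range(top, top + thickness):
                out[y][x] = 0
    return out


# A one-pixel staff line in row 1, crossed by a vertical stroke in column 2.
img = [
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
cleaned = remove_staff_line(img, top=1, thickness=1)
```

In this example only the stroke in column 2 survives; the rest of the line is erased.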
With the staff lines detected, the next step is to detect the
bar lines. Detecting the bar lines helps to separate a musical score
into smaller bar units. By partitioning the image into smaller units,
the optical music recognition can work more quickly and
requires less
memory. Section 5.1 explains this in more detail.
With the preprocessing stage done, we come to the recognition of symbols.
The image is processed bar unit by bar unit. First, the note symbols are
recognized. Next comes the recognition of attributive symbols. Recognition
of symbols is done using neural nets. Section 5.4 explains this
in detail.
After recognising the symbols, the output is written in a predefined format.
This format is read by another program which then creates a MIDI
equivalent of the original musical score. This can now be played on any
sound card.
Staff lines play a major role in optical music recognition. Most musical
symbols of a musical score are laid out around the staff lines in a two-dimensional
manner. The horizontal axis is the time axis,
while the vertical axis gives the pitch of the note symbols.
To a human reader, staff lines are important because they help the reader
to find the precise vertical position of a note. From this information,
the human reader can tell the pitch of the note.
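The relation between vertical position and pitch can be made concrete with a small sketch. It assumes a treble clef, no key signature and no accidentals, and the common convention that MIDI note 60 is middle C; the function names and the position numbering (0 for the bottom staff line, one step per line or space) are hypothetical choices for illustration.

```python
# Semitone offsets of C, D, E, F, G, A, B above C.
DIATONIC_TO_SEMITONE = [0, 2, 4, 5, 7, 9, 11]


def staff_position_to_midi(position):
    """Map a staff position to a MIDI note number (treble clef assumed).

    Position 0 is the bottom staff line (E4); each line or space upwards
    adds 1, each one downwards subtracts 1.
    """
    diatonic = 4 * 7 + 2 + position  # E is the third degree of octave 4
    octave, degree = divmod(diatonic, 7)
    return 12 * (octave + 1) + DIATONIC_TO_SEMITONE[degree]


def pixel_row_to_position(y, bottom_line_y, staff_space):
    """Convert an image row to a staff position, measured in half staff spaces.

    Image rows grow downwards, so a smaller y means a higher position.
    """
    return round(2 * (bottom_line_y - y) / staff_space)


position = pixel_row_to_position(y=80, bottom_line_y=100, staff_space=10)
print(position, staff_position_to_midi(position))
```

Measuring the position in units of the staff space keeps the mapping independent of the scan resolution.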
Interestingly, the importance of the staff lines to an optical music recognition
system is quite different. Computers can often find the
position of a note symbol on the vertical axis accurately without the aid of auxiliary
lines. However, the staff lines carry other information that is very
important for optical music recognition.
The following information is important for various reasons.
1. The thickness of the staff lines
The thickness of the staff lines in pixels
tells the optical music recognition system about the
quality of printing of the original musical score and the resolution of the
scanning process used to convert the score to a bitmap.
Hence it is used to set many thresholds and acts as the tolerance
value for many measurements and comparisons.
2. Staff spacing
The amount of space between adjacent staff lines
gives the optical music recognition system a very important
hint about the resolution of the scanned bitmap, as well as the size of
the printed score. The staff spacing provides a size normalization
that is useful for the subsequent recognition stages.
Sizes and distances can be measured in units that are normalized to the
staff spacing. This avoids the inflexibility of absolute
measures and static threshold values.
3. The inclination of the staff lines.
Most of the time, the staff lines in the bitmap of the original musical
score are not exactly horizontal. This
is because -
The inclination of the staff lines tells the recognition
system the image skew. This can help to improve
the accuracy of the recognition system. For example, when the skew is too
large, the system may rotate the image before further
recognition.
Although the staff lines contain such useful information,
their presence makes optical music recognition
difficult.
Perhaps the most straightforward method to locate the staff lines is
to project the whole image onto the vertical axis of the image. This is
illustrated in Figure 4. Each group of five equally
spaced peaks in the projection reflects the presence of a staff.
The staff line thickness can be found from the width of a peak, and the
staff spacing is the distance between successive peaks within a group of
five. What about the staff line inclination?
The inclination is found using the Hough transform. This
method is described in detail in the next section.
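The projection method can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the project's implementation: the image is a list of rows with 1 for black, the image is already unskewed, a row counts as part of a staff line when at least half of its pixels are black (`min_fill`, an assumed threshold), and the names `find_staff_lines` and `staff_spacing` are hypothetical.

```python
def find_staff_lines(img, min_fill=0.5):
    """Return (top_row, thickness) for each peak of the horizontal projection."""
    width = len(img[0])
    profile = [sum(row) for row in img]  # black pixel count of every row
    lines = []
    start = None
    # The trailing 0 closes a peak that reaches the last image row.
    for y, count in enumerate(profile + [0]):
        is_line = count >= min_fill * width
        if is_line and start is None:
            start = y
        elif not is_line and start is not None:
            lines.append((start, y - start))
            start = None
    return lines


def staff_spacing(lines):
    """Median distance between the tops of successive peaks, or None."""
    gaps = sorted(b[0] - a[0] for a, b in zip(lines, lines[1:]))
    return gaps[len(gaps) // 2] if gaps else None


# A tiny staff: five one-pixel lines, two rows apart, plus one stray pixel.
img = [[0] * 10 for _ in range(12)]
for y in (1, 3, 5, 7, 9):
    img[y] = [1] * 10
img[4][2] = 1
lines = find_staff_lines(img)
print(lines, staff_spacing(lines))
```

Taking the median keeps the larger gap between two staves from distorting the spacing estimate, since each staff contributes four small gaps for every large one.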
The Hough transform, patented by Hough (1962), is a method commonly used
in image processing for locating straight lines in an image. It can find
lines in all orientations and positions. In short, the Hough transform
is a voting process in which each pixel of the image votes for the candidate
lines that it belongs to. Candidate lines that get higher vote counts correspond
to lines in the image.
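The voting process can be illustrated with a small sketch. It is not the implementation used in this project: the function name `hough_skew`, the parameterisation of a line by its angle to the horizontal and an offset `rho`, the one-degree step and the limit of 45 degrees are all assumptions made for the example.

```python
import math
from collections import Counter


def hough_skew(img, max_deg=45, step_deg=1):
    """Return (angle in degrees, rho, votes) of the strongest line.

    A candidate line satisfies y*cos(a) - x*sin(a) = rho, where a is its
    angle to the horizontal. Rows grow downwards, so a positive angle is a
    line that descends to the right. The image must contain a black pixel.
    """
    votes = Counter()
    steps = max_deg // step_deg
    angles = [(k * step_deg, math.radians(k * step_deg))
              for k in range(-steps, steps + 1)]
    for y, row in enumerate(img):
        for x, px in enumerate(row):
            if not px:
                continue
            # Each black pixel votes once for every candidate angle.
            for deg, a in angles:
                rho = round(y * math.cos(a) - x * math.sin(a))
                votes[(deg, rho)] += 1
    (deg, rho), count = max(votes.items(), key=lambda item: item[1])
    return deg, rho, count


# A horizontal line in row 5 of a 10 x 200 image.
img = [[0] * 200 for _ in range(10)]
img[5] = [1] * 200
print(hough_skew(img))
```

Every black pixel votes once per candidate angle, so the cost grows with the number of black pixels times the number of angles tried.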
Although staff lines are long, thin straight lines in the musical score
and the Hough transform can detect straight lines in an image, empirical results
showed that the Hough transform is not a robust method for staff line identification.
The following are some suggested reasons for the failure. Nonetheless,
we have used this method in our approach.
Figure 7: A thick line
can contain several thin lines.
Our work
Some modifications were tried to reduce the effect of (1) above.
One of them was to restrict the slope of the candidate lines to the range
[-1,1] (for a skew of +/- 45 degrees). This prevents bar lines and
note stems from being captured. Furthermore, long vertical runs of black
pixels are not considered by the transform, because they are most probably
not part of a staff line (remember that staff lines are thin). So only
short vertical runs of black pixels were allowed to vote for the candidate
lines.
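The restriction to short vertical runs can be sketched as a filter that is applied before voting. This is an illustrative sketch only: the function name `short_run_pixels` and the parameter `max_run` (which would typically be tied to the staff line thickness) are assumptions, and the image is taken to be a list of rows with 1 for black.

```python
def short_run_pixels(img, max_run):
    """Keep only black pixels that lie in vertical runs of at most max_run."""
    height, width = len(img), len(img[0])
    keep = [[0] * width for _ in range(height)]
    for x in range(width):
        y = 0
        while y < height:
            if img[y][x]:
                start = y
                while y < height and img[y][x]:
                    y += 1  # walk to the end of this vertical run
                if y - start <= max_run:
                    for yy in range(start, y):
                        keep[yy][x] = 1
            else:
                y += 1
    return keep


# A thin horizontal line in row 2, crossed by a tall stem in column 1.
img = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]
voters = short_run_pixels(img, max_run=1)
```

Here the stem is discarded as a whole, including the pixel where it crosses the line, so only the remaining parts of the thin line would be allowed to vote.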
To prevent the thickness of the staff lines from causing problems, we
took the Hough transform of the whole image and found the overall
angle by which the image is inclined; this cancels out the variations
due to the thickness of the staff lines. We then rotate the image accordingly
to get an unskewed image. This also helps in bar line recognition
and symbol identification.
The Hough transform is a rather computationally expensive operation, with a time
complexity of about O(n^3), where n is the maximum of the image height
and image width in pixels.
The next processing step is the detection of the bar lines. Bar lines are
thin, vertical lines in a musical score. Unlike
staff lines, bar lines are seldom crossed by other musical symbols.
This makes the detection of the bar lines easier.
Our Approach
We have not considered slurs that cross over bar lines in our
approach. Bar lines are revealed as sharp peaks in the projection onto the horizontal axis.
Moreover, the symbol density near the bar lines is low, which makes the
detection of bar lines easier. Our method is robust in the
sense that even if symbols are closely packed around the bar lines, or
slurs cross the bar lines, it is still able to detect the
bar lines.
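Bar line detection by projection can be sketched as follows. This is a simplified illustration, not the project's code: it assumes an unskewed binary image (1 for black), known top and bottom rows of the staff, and a hypothetical threshold `min_fill` stating what fraction of the staff height a column must cover to count as a bar line. A note stem that happens to span the whole staff would also pass this test.

```python
def find_bar_lines(img, staff_top, staff_bottom, min_fill=0.95):
    """Return (left_column, thickness) for each peak of the vertical projection.

    Only the rows between the top and bottom staff line are projected.
    """
    height = staff_bottom - staff_top + 1
    width = len(img[0])
    profile = [sum(img[y][x] for y in range(staff_top, staff_bottom + 1))
               for x in range(width)]
    bars = []
    start = None
    # The trailing 0 closes a peak that reaches the last image column.
    for x, count in enumerate(profile + [0]):
        is_bar = count >= min_fill * height
        if is_bar and start is None:
            start = x
        elif not is_bar and start is not None:
            bars.append((start, x - start))
            start = None
    return bars


# Five staff lines (rows 0, 2, 4, 6, 8) with one bar line in column 5.
img = [[1] * 12 if y % 2 == 0 else [0] * 12 for y in range(9)]
for y in range(9):
    img[y][5] = 1
print(find_bar_lines(img, staff_top=0, staff_bottom=8))
```

The columns between two detected bar lines then form one bar unit.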
The note symbols are recognised using a neural net.
What is a neural net?
A neural network is a computational model that shares some of the
properties of brains: it consists of many simple units working in parallel
with no central control. The connections between units have numeric weights
that can be modified by the learning element.
Why did we implement neural nets?
Initially we were in two minds whether to use neural nets or Huet's
Method to recognize the notes. In the end we decided upon the neural net
implementation.
Firstly, we are recognizing only printed music. Unlike text, printed
music doesn't come in different fonts; only the relative sizes of symbols
differ. Since our aim was not to recognize handwritten music, we
had only a small number of symbols to recognize.
Secondly, the recognition of the symbols by the neural net was fast enough.
This is an advantage when reading music online.
Pitfalls
The net takes an appreciable time to learn symbols. Since it behaves like a
black box, we don't have much control over the functioning of the net.
Also, if two symbols in the character set resemble each other too closely,
there might be some errors in recognising these symbols.
Specifications of the net
The neural net we are implementing consists of one input layer with
81 nodes and one hidden layer with 100 nodes. The output layer consists
of 10 nodes, each standing for a unique symbol.
Figure 8: A simple neural net.
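To make the layer sizes concrete, here is a minimal sketch of a forward pass through a net with 81 inputs, 100 hidden nodes and 10 outputs. It is not the net trained in this project: the weights are random and untrained, the sigmoid activation and the bias terms are assumptions, and treating the 81 inputs as a 9 x 9 bitmap of a symbol is only a plausible guess.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 81, 100, 10


def make_layer(n_in, n_out, rng):
    # One weight per input plus a trailing bias term for every unit.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layer, inputs):
    # zip stops at the last input, so the bias w[-1] is added separately.
    return [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, inputs)))
            for w in layer]


def classify(hidden_layer, output_layer, pixels):
    """Return (index of the strongest output node, all output activations)."""
    outputs = forward(output_layer, forward(hidden_layer, pixels))
    return outputs.index(max(outputs)), outputs


rng = random.Random(0)  # fixed seed so the untrained weights are repeatable
hidden_layer = make_layer(N_INPUT, N_HIDDEN, rng)
output_layer = make_layer(N_HIDDEN, N_OUTPUT, rng)
# A checkerboard stands in for a scaled 9 x 9 symbol bitmap.
pixels = [1.0 if (i // 9 + i % 9) % 2 == 0 else 0.0 for i in range(N_INPUT)]
symbol, outputs = classify(hidden_layer, output_layer, pixels)
```

With untrained weights the chosen output node is meaningless; training would adjust the weights so that each of the 10 output nodes responds to one symbol.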
What is MIDI
MIDI is to a sampled audio waveform as sheet music is to a compact
disc recording. MIDI stands for Musical Instrument Digital Interface
and is a standard for the digital communication of musical data. MIDI allows
you to connect various MIDI-compatible music devices (synthesizers, for
instance) together and control them from other MIDI-compatible equipment (a
computer, for instance).
MIDI file format
MIDI also defines a file format for the interchange and playback of what is,
effectively, binary sheet music. A MIDI file specifies which notes should
be played at what time and by which instruments in order to create a piece
of music. A piece of software on the host computer interprets the commands
in the MIDI file and causes the hardware to output the correct notes of the
musical instrument.
MIDI is binary data, and a MIDI file is therefore a binary file. You
can't load a MIDI file into a text editor and view it. (Well, you can,
but it will look like gibberish, since the data is not ASCII, i.e.
text. Of course, you can use my MIDI File Disassembler/Assembler
utility, available on this web site, to convert a MIDI file to
readable text.)
MIDI files are not specific to any particular computer platform or
product.
Our Work
After the symbols are recognised, an intermediate output is generated, which is read
by a routine and converted into the equivalent standard MIDI format.
Only single-track MIDI files are generated,
as the musical score is assumed to contain the voice of only one instrument,
i.e. guitar. The time specification, pitch etc. of the notes to be played are
taken from the intermediate format itself.
A binary MIDI file is written using some routines. The MIDI file thus generated
can be played on any sound card (assuming software is installed
that understands the MIDI format).
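As a rough sketch of what writing a single-track file involves (this is not the project's routine, and the intermediate format is not reproduced here), the following builds the bytes of a format 0 Standard MIDI File from a list of notes. The function names, the note list representation, the velocity of 64, channel 0 and the resolution of 480 ticks per quarter note are all assumptions for the example.

```python
import struct


def vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # continuation bit on all but the last byte
        value >>= 7
    return bytes(reversed(out))


def build_midi(notes, ticks_per_quarter=480):
    """notes: list of (midi_pitch, duration_in_ticks), played one after another."""
    track = b""
    for pitch, ticks in notes:
        track += vlq(0) + bytes([0x90, pitch, 64])     # note on, channel 0
        track += vlq(ticks) + bytes([0x80, pitch, 0])  # note off after the duration
    track += vlq(0) + b"\xff\x2f\x00"                  # end-of-track meta event
    # Header chunk: length 6, format 0, one track, time division.
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_quarter)
    return header + b"MTrk" + struct.pack(">I", len(track)) + track


# Two quarter notes and a half note: C4, D4, E4.
data = build_midi([(60, 480), (62, 480), (64, 960)])
print(len(data), data[:4])
```

Writing `data` to a file opened in binary mode would give a file that a MIDI player should accept; tempo and instrument selection are left at the player's defaults in this sketch.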
Here the input is given as a scanned image of the printed music and
the output is sent to the sound card of a system to play the recognised
music. Initially we plan to leave out recognition
of the key signature, which tells what key you are playing
in. Apart from that, we assume that symbols like staccato, portato
and the accent are absent from the scanned input.
Sample Input file.
Sample output file (MIDI file).
In this project, optical music recognition was investigated. The existence
of staff lines makes optical music recognition unique among
optical recognition topics. Since musical symbols are connected together
by staff lines, recognizing the symbols successfully requires special techniques
that are tailored for optical music recognition.
The staff lines are important since they tell the size of the symbols,
the quality of scanning and the image skew. However, they are at the same time
noise for the recognition of the musical symbols on the score. So, intuitively,
staff lines should be removed before further processing. However, there
are some optical music recognition systems that employ template matching
techniques to recognize the symbols in the presence of staff lines.
With the staff lines removed, the musical symbols can be isolated from each
other. A neural network approach was then used to recognize them. Converting
the musical score to a representation which captures the
information completely is also a tedious task. The final output was written
in the form of a MIDI file.
At present, the program can recognize only a few symbols, and only when they are not printed too closely. We are trying to train our neural net on as many symbols as possible. Up to now we have been able to identify the treble clef, the time signature, the whole note and a full note. The program ignores tempo, rests, accidentals and duration dots. Here is a list of the places where further improvements to the program are suggested:
Add the capability of recognising rests and duration dots
Duration dots should be easy to identify because of their size.
After staff line removal, any remaining symbol that has a square bounding
box with a side length smaller than half the staff space should be a duration
dot. Whole and half rests appear as short strokes whose height is approximately
half the staff spacing. They are rectangular, so they can be identified
by examining the size of the bounding box as well as the symbol area.
Whole rests and half rests differ in the position at which they appear on
the staff. Other rest symbols have approximately the same size. A check
on the size of the bounding box would be able to identify them.
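The bounding box test for duration dots could look roughly like this. It is a sketch of the suggestion above, not existing code in the program: the function names are hypothetical, symbols are found by a simple 4-connected flood fill on a binary image (1 for black) from which the staff lines have already been removed, and a tolerance of one pixel on squareness is an assumption.

```python
def bounding_boxes(img):
    """Return (left, top, right, bottom) for every 4-connected black region."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            stack = [(x0, y0)]
            seen[y0][x0] = True
            left = right = x0
            top = bottom = y0
            while stack:
                x, y = stack.pop()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and img[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def is_duration_dot(box, staff_space, tolerance=1):
    """Roughly square box with sides shorter than half the staff space."""
    left, top, right, bottom = box
    w, h = right - left + 1, bottom - top + 1
    return abs(w - h) <= tolerance and max(w, h) < staff_space / 2


# A 2 x 2 dot and a tall thin stroke, with an assumed staff space of 6 pixels.
img = [[0] * 8 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        img[y][x] = 1
for y in range(5):
    img[y][6] = 1
dots = [b for b in bounding_boxes(img) if is_duration_dot(b, staff_space=6)]
print(dots)
```

Because the size limit is expressed relative to the staff space, the same test works at different scan resolutions.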
Allow multiple instruments to be played simultaneously
This feature can be implemented using the MIDI routines, which allow the
voices of multiple instruments to be written to up to 16 channels,
giving more engrossing music.
Allow different time signatures
At present we have considered only the 4/4 time signature, as
it gives a uniform beat. Other time signatures like 3/4 could also
be allowed, in which case the beats will be more difficult to program using
MIDI.
1) Y. Nakamura et al., "Input Method of Note and Realization of Folk Music Database," TG PRL78-73, pp. 41-50, Institute of Electronics and Communication Engineers of Japan (IECE), 1978 (in Japanese).
2) I. Fujinaga et al., "Issues in the Design of Optical Music Recognition System," Proc. 1989 International Computer Music Conf., Columbus, Ohio.
3) A. Tojo and H. Aoyama, "Automatic Recognition of Music Score," Proc. 6th ICPR, Munich, W. Germany, p. 1223.
4) T. Matsushima et al., "Automatic High Speed Recognition of Printed Music (WABOT 2 Vision System)," Proc. Int'l Conf. on Advanced Robotics, Tokyo, pp. 477-482, 1985.