Although RapidMiner is a great tool for selecting and applying your chosen
data mining technique, it falls a bit short when it comes to visualizing
the data and/or the results. As an alternative, we are going to look
at Matlab, which is installed in Caslab as well.
Matlab is a commercial package for numerical
computation and visualization,
with a large number of add-on toolboxes for specialized tasks. As a
numerical tool, it does not handle categorical values directly, which is
something to keep in mind for the target classes. There is a way to
add text labels to graphs, which will be explained below.
Matlab has a command line interface, where you can switch from one
directory to another. To go up one level in the directory structure,
type cd .. and press return. To go to a subdirectory of the current
directory, type cd directoryname.
Matlab likes things to be in certain places. The easiest way to make
things work is to have everything in the same directory, including your
data file and any scripts you write.
The command whos shows you which variables are currently in use,
together with their size.
The Matlab command save saves the variables of your current session,
which spares you from loading the same values from your data files
again. To save a Matlab workspace, type save myMatlabspace.mat and to
load it again, type load myMatlabspace.mat
The command help help gets you started with Matlab's help feature.
Typing a name on the left side of an equal sign creates a variable with
that name and assigns whatever is on the right of the equal sign to it.
For example, the command myVariable = 4 creates a variable called
myVariable and assigns the value 4 to it. In Matlab, all variables are
considered matrices.
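The commands above can be tried together in a short session. The following is only a sketch: the variable names myVector and sessionFile are made up for illustration, and the workspace file is written to the system's temporary directory rather than the current directory.

```matlab
% Hypothetical variable names, chosen only for this sketch.
myVariable = 4;                 % a 1-by-1 matrix holding the value 4
myVector = [1 2 3];             % a 1-by-3 matrix

whos                            % list the variables in use and their sizes

% Save all workspace variables to a MAT-file in the temporary directory.
sessionFile = fullfile(tempdir, 'myMatlabspace.mat');
save(sessionFile);

clear myVariable myVector       % remove the two variables from the workspace
load(sessionFile);              % restore them from the MAT-file

disp(size(myVariable));         % even a single number has a matrix size
```

Running whos again after load should list myVariable and myVector once more.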
How to get the data into Matlab
Some of the data formats supported by Matlab are delimited text, tab
separated text and comma separated numbers. Matlab can also import Excel
worksheets, but might give you trouble if the values aren't numerical.
To get a list of valid file formats in Matlab, type help fileformats
from the Matlab command line. Since we have the data in a comma
separated file, we'll be using csvread.
Assuming that we're in the same directory as the data file, the
command myMatlabData = csvread('myData.csv'); will read the data
contained in the file called myData.csv into a Matlab variable called
myMatlabData.
The semicolon at the end of the command suppresses the output, so the
values contained in the file are not printed to the screen as the file
is read. To see the contents of a variable, type the variable name
without a semicolon, followed by return.
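To try csvread without the course data at hand, a small file can be written first. This is a minimal sketch: the numbers in exampleValues are made up, and the file is placed in the temporary directory. Newer Matlab releases mark csvread and csvwrite as not recommended and suggest readmatrix and writematrix instead, so treat the calls below as matching the version this tutorial assumes.

```matlab
% Made-up values standing in for a real data file.
exampleValues = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; 7.0 3.2 4.7 1.4];

% Write a small comma separated file to the temporary directory.
dataFile = fullfile(tempdir, 'myData.csv');
csvwrite(dataFile, exampleValues);

% Trailing semicolon: the values are read but not printed.
myMatlabData = csvread(dataFile);

% No semicolon: the result is printed to the screen.
size(myMatlabData)
```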
Working with matrices
Matlab allows you to access parts of a matrix. For example, a colon on
its own means everything. Assuming we have a matrix called myOldMatrix,
typing myNewMatrix = myOldMatrix is equivalent to typing myNewMatrix =
myOldMatrix(:,:). myOldMatrix(:,:) refers to the elements in all rows
and all columns of myOldMatrix.
For accessing parts of the matrix, you can specify row and column
numbers. Some examples are:
myOldMatrix(1,1) refers to the element in the first row and first column of myOldMatrix
myOldMatrix(1,:) refers to the elements in the first row and all columns of myOldMatrix
myOldMatrix(:,1) refers to the elements in all rows and the first column of myOldMatrix
myOldMatrix(:,1:4) refers to the elements in all rows and the first 4 columns of myOldMatrix (1:4 is the range of columns 1 through 4)
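The indexing forms listed above can be checked on a small matrix. The sketch below uses the built-in magic function only to get a 5-by-5 matrix to experiment with; the names on the left of each equal sign are made up. A range of columns is written with a colon, as in 1:4.

```matlab
myOldMatrix = magic(5);                  % built-in 5-by-5 example matrix

firstElement     = myOldMatrix(1,1);     % single element: row 1, column 1
firstRow         = myOldMatrix(1,:);     % 1-by-5: first row, all columns
firstColumn      = myOldMatrix(:,1);     % 5-by-1: all rows, first column
firstFourColumns = myOldMatrix(:,1:4);   % 5-by-4: all rows, columns 1 to 4

myNewMatrix = myOldMatrix(:,:);          % a full copy of the matrix
```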
Plotting the data in two dimensions
Choose which two of your columns you would like to visualize. To create
a plot of the first two columns of the values of the matrix
called
myMatlabData, which contains the Iris data, you can type plot(myMatlabData(:,1),myMatlabData(:,2)); which
produces the figure below.
By default, Matlab connects the data values with blue lines. This can
be changed by adding additional parameters to the plot command. By the
way, the Matlab command help plot gives you a lot of information about
plotting in two dimensions.
If you want to change the color in which the data values are plotted
from the default blue, you can use the following letters to specify
other colors:
g | green
r | red
k | black
c | cyan
y | yellow
m | magenta
b | blue
For example, the command plot(myMatlabData(:,1),myMatlabData(:,2),'r');
produces the same plot with red lines. Note that every time you execute
another plot command, the previous figure is cleared. You can change
this by typing hold on, so that later plot commands add to the existing
figure. This can be turned off again with hold off.
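As a sketch of how the color letters and hold on work together, the following draws two lines into the same figure. The values in exampleData are made up and merely stand in for myMatlabData.

```matlab
% Made-up values with four columns, standing in for myMatlabData.
exampleData = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; ...
               7.0 3.2 4.7 1.4; 6.4 3.2 4.5 1.5];

figure;                                          % open a new figure window
plot(exampleData(:,1), exampleData(:,2), 'r');   % columns 1 and 2 as a red line
hold on;                                         % keep the existing plot
plot(exampleData(:,3), exampleData(:,4), 'k');   % columns 3 and 4 as a black line
hold off;                                        % later plots replace the figure again
```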
You can also change the markers for the values.
The markers are
specified in the following way (this is a partial list, for more
information,
use help plot).
. | point
o | circle
x | x-mark
+ | plus sign
As an example, if you want to use green x-marks to plot the values in
all rows for the last two columns of the matrix, you would use
plot(myMatlabData(:,3),myMatlabData(:,4),'gx'); which
produces the following:
You
can also change the marker size, add legends, titles and labels to your
plots. Please refer to help plot
or to the Mathworks site (link is at bottom of page) for complete
information
on how to do that. There are a number of buttons on the top of the
figure,
which let you zoom in or out, or even rotate the picture, which is
especially
helpful in 3 dimensions.
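For orientation, here is a minimal sketch of a plot with a larger marker size, a title, axis labels and a legend. The data values and all label texts are made up for illustration.

```matlab
% Made-up values with four columns, standing in for myMatlabData.
exampleData = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; ...
               7.0 3.2 4.7 1.4; 6.4 3.2 4.5 1.5];

figure;
% Green x-marks, drawn larger than the default via the MarkerSize property.
plot(exampleData(:,3), exampleData(:,4), 'gx', 'MarkerSize', 10);

title('Columns 3 and 4 of the example data');   % text above the plot
xlabel('column 3');                              % label of the horizontal axis
ylabel('column 4');                              % label of the vertical axis
legend('all samples');                           % one entry per plotted series
```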
Plotting the data in three dimensions
Plotting the data in three dimensions is simply an extension of
plotting it in two dimensions. Assuming we want to plot the values
contained in the first three columns for all rows, we can use the
following command. Note that we now have to use the plot3 command
instead of the plot command.
plot3(myMatlabData(:,1),myMatlabData(:,2),myMatlabData(:,3),'rx');
This produces the figure:
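A self-contained version of the same idea, with made-up values in place of myMatlabData, might look like this. The grid and the axis labels are additions that make the rotated view easier to read.

```matlab
% Made-up values with four columns, standing in for myMatlabData.
exampleData = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; ...
               7.0 3.2 4.7 1.4; 6.4 3.2 4.5 1.5];

figure;
% Red x-marks at the positions given by the first three columns.
plot3(exampleData(:,1), exampleData(:,2), exampleData(:,3), 'rx');

grid on;                 % grid lines help when rotating the view
xlabel('column 1');
ylabel('column 2');
zlabel('column 3');
```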
How to add text labels
Note that this is shown below for two dimensions, but it extends to
3D in a straightforward manner.
Now that the data is plotted, we need to find a way to identify the
points in the plots. There are a number of ways to do this. Perhaps the
easiest way is to plot points in different colors according to the
different
classes they belong to. Of course, the data needs to be sorted to save
you a lot of typing. For the Iris data, if we assume the data is sorted
according to the target classes, then you can plot the first 50 values
in one color, the next 50 values in a second color and the last 50 in a
third color.
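The color-per-class approach can be sketched as follows. The data is made up and has only two rows per class; for the sorted Iris data the row ranges would be 1:50, 51:100 and 101:150 instead. The legend texts are placeholders.

```matlab
% Made-up data, sorted by class: rows 1-2, 3-4 and 5-6 form three classes.
sortedData = [1.0 1.1; 1.2 0.9; 2.0 2.1; 2.2 1.9; 3.0 3.1; 3.1 2.9];

figure;
plot(sortedData(1:2,1), sortedData(1:2,2), 'rx');   % first class in red
hold on;                                            % keep adding to the figure
plot(sortedData(3:4,1), sortedData(3:4,2), 'gx');   % second class in green
plot(sortedData(5:6,1), sortedData(5:6,2), 'bx');   % third class in blue
hold off;

legend('class 1', 'class 2', 'class 3');            % placeholder class names
```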
There is also a way to add the values in the matrix itself to the plot.
This can get really messy for even a small number of data points, but
it gives you some information about columns (remember, they are
attribute values) that aren't plotted. To do so for the first 20 rows of
your matrix, type the following commands (the order of the commands is
important, and you need to press enter after each command).
Plot the values of the first 20 rows in the first two columns, with
values marked by a red x:
plot(myMatlabData(1:20,1),myMatlabData(1:20,2),'rx');
Convert the numerical values in the matrix to strings:
n=num2str(myMatlabData,'%5.3f/');
Keep the first 20 rows and remove the trailing slash from the strings,
which is the last character in the character array (the Matlab keyword
end refers to the last column):
n=n(1:20,1:end-1);
Add the text to the plot (the text command can also be used for
pointing out interesting parts of the plot, try out help text from the
Matlab command line for further information):
text(myMatlabData(1:20,1),myMatlabData(1:20,2),n);
This produces the following picture for the Iris data, where next to
each of the 20 data points marked in red and plotted according to the
values in the first two columns, the complete attribute values are
added as labels.
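The same sequence of steps can be tried on a handful of made-up rows. In this sketch all four rows are labeled, so the row range 1:20 is replaced by a plain colon.

```matlab
% Made-up values with four columns, standing in for myMatlabData.
exampleData = [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; ...
               7.0 3.2 4.7 1.4; 6.4 3.2 4.5 1.5];

figure;
plot(exampleData(:,1), exampleData(:,2), 'rx');   % points from columns 1 and 2

% One string per row; every value is followed by a slash.
n = num2str(exampleData, '%5.3f/');

% Drop the last character of every row, which is the trailing slash.
n = n(:, 1:end-1);

% Place each row of n next to the corresponding point.
text(exampleData(:,1), exampleData(:,2), n);
```

Because n is a character array, every row has the same length; text treats each row as the label for one point.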
Of course the labels don't have to be the values themselves, they could
also be the target class names.
To create a variable holding the target classes you need to use
textread, since the names of the classes in general will not be numeric
like in the dwarf dataset. For the Iris dataset, this works as follows.
Let's assume we have the original target classes in a text file called
IrisClasses.txt (one target class per row for each of the 150 samples).
To read in the class values, we use textread, which returns them as a
cell array of strings. The second (and last) parameter in the
parentheses, enclosed in quotes, means that we read the first string on
each line and ignore everything following it up to the newline
character.
[IrisClasses]=textread('IrisClasses.txt','%s%*[^\n]')
Add a blank character to the front of the strings contained in
IrisClasses for easier readability.
[IrisClasses]=strcat({' '},IrisClasses);
Now plot the Iris data. This time we plot all the values for the first
two columns.
plot(myMatlabData(:,1),myMatlabData(:,2),'rx');
This time, the values we want as labels are already strings, so we can
skip the num2str conversion we performed before and add the labels to
the plot directly, using the strings in IrisClasses rather than n.
text(myMatlabData(:,1),myMatlabData(:,2),IrisClasses);
This produces the figure below (for readability, I changed the labels
in the original text file).
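The class-label steps can be sketched end to end on three made-up points. Several things here are assumptions: the class file is written to the temporary directory first so that the example runs on its own, the coordinates in examplePoints are invented, and the simpler format '%s' is used, which suffices when each line holds a single word. textread is marked as not recommended in newer Matlab releases, so this follows the version the tutorial assumes.

```matlab
% Write a small class file (one class name per line) to the temporary directory.
classFile = fullfile(tempdir, 'IrisClasses.txt');
fid = fopen(classFile, 'w');
fprintf(fid, '%s\n', 'setosa', 'versicolor', 'virginica');
fclose(fid);

% Read the class names into a cell array of strings, one cell per line.
IrisClasses = textread(classFile, '%s');

% Put a blank in front of every name so the label does not touch the marker.
IrisClasses = strcat({' '}, IrisClasses);

% Made-up coordinates, one point per class name.
examplePoints = [5.1 3.5; 7.0 3.2; 6.3 3.3];

figure;
plot(examplePoints(:,1), examplePoints(:,2), 'rx');
text(examplePoints(:,1), examplePoints(:,2), IrisClasses);
```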
Of course, you can then try to plot more than one class value at a
time, for example the predicted vs. the target class from the original
data file. To achieve this, you could use a combination of text labels,
colors and marker shapes to plot the values.
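One possible way to combine colors and marker shapes is sketched below, with the color encoding the target class and the marker shape encoding the predicted class. The points and both class vectors are made up, and the encoding itself is just one choice among many.

```matlab
% Made-up points with made-up target and predicted classes (values 1 to 3).
examplePoints  = [1.0 1.1; 1.2 0.9; 2.0 2.1; 2.2 1.9; 3.0 3.1; 3.1 2.9];
targetClass    = [1 1 2 2 3 3];
predictedClass = [1 1 2 3 3 3];   % the fourth point is misclassified

colors  = 'rgb';   % color letter for target class 1, 2, 3
markers = 'ox+';   % marker symbol for predicted class 1, 2, 3

figure;
hold on;           % every point is drawn by its own plot command
for i = 1:size(examplePoints, 1)
    % Build a style string such as 'ro' or 'g+' for this point.
    style = [colors(targetClass(i)) markers(predictedClass(i))];
    plot(examplePoints(i,1), examplePoints(i,2), style);
end
hold off;
```

With this encoding, a point whose color and marker shape do not match the usual pairing stands out as a misclassification.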
How to save your figures
If you look at the top of the figure you produced in Matlab, you will
see
a FILE button.
Clicking
on that will bring up a menu, from which you can choose EXPORT
to save the figure in various file formats. Some of the file formats
you
can choose are jpeg and bitmap as well as eps files. You can also print
your figures from the FILE menu.
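Figures can also be saved from the command line, which is convenient inside scripts. The sketch below uses made-up data and file names and writes to the temporary directory; saveas picks the format from the file extension, while print takes the format as an option.

```matlab
% A simple figure with made-up data.
figure;
plot(1:10, (1:10).^2, 'rx');

% Save the current figure as a JPEG file; gcf is the current figure.
jpegFile = fullfile(tempdir, 'myFigure.jpg');
saveas(gcf, jpegFile);

% Save the same figure as a color EPS file.
epsFile = fullfile(tempdir, 'myFigure.eps');
print('-depsc', epsFile);
```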
Where to get more help
One site to get help from is Matlab's home. For example, you can
search for keywords from the support page.
There are also a number of FAQs you can try. Some of these are:
comp.soft-sys.matlab
University of Texas