FULL PAPER International Journal of Recent Trends in Engineering, Vol 2, No. 1, November 2009
EVISTA – Interactive Visual Clustering System
K. Thangavel1, P. Alagambigai2
1 Department of Computer Science, Periyar University, Salem, Tamilnadu, India
Email: [email protected]
2 Department of Computer Applications, Easwari Engineering College, Chennai, Tamilnadu, India
Email: [email protected]
Abstract—Due to the enormous increase in data, exploring and analyzing it is increasingly important but difficult to achieve. Information visualization and visual data mining can help to deal with this. Visual data exploration has a high potential, and many applications such as fraud detection and data mining will use information visualization technology for improved data analysis. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. VISTA is an interactive visual cluster rendering system which invites humans into the clustering process, but it has some limitations in identifying the cluster distribution and in human-computer interaction. In this paper, we propose an Enhanced VISTA (EVISTA) which addresses these drawbacks. EVISTA improves the visualization in two ways. First, it uses weighted vector normalization instead of max-min normalization, which improves the data visualization such that the user can understand the underlying pattern without human intervention. Second, it completely eliminates the use of α tuning, which reduces the complexity of the visual distance computation and eases the human-computer interaction. The experimental results show that EVISTA explores the underlying pattern of the dataset effectively and greatly reduces the user's operation burden.
Index Terms—Clustering, EVISTA, Human-computer interaction, Information visualization, Visual data mining.
I. INTRODUCTION
Data visualization is essential for understanding the concept of multidimensional spaces [5]. It allows the user to explore the data in different ways at different levels of abstraction to find the right level of detail. Therefore, techniques are most useful if they are highly interactive, permit direct manipulation and include a rapid response time. Visualization is defined by Ware as "a graphical representation of data or concepts" which is either an "internal construct of the mind" or an "external artifact supporting decision making". Visualization provides valuable assistance to the human by representing information visually. This assistance may be called cognitive support. Visualization can provide cognitive support through a number of mechanisms, such as grouping related information for easy search and access, representing large volumes of data in a small space, imposing structure on data and tasks to reduce time complexity, and allowing interactive exploration through manipulation of parameter values [11].
Visualization techniques could enhance the current knowledge and data discovery methods by increasing the user involvement in the interactive process. More recently there has been a lot of discussion on visualization for data mining. Visual data mining can be viewed as an integration of data visualization and data mining [5, 15]. Considering visualization as a supporting technology in data mining, four possible approaches are stated in [1]. The first approach is the use of visualization techniques to present the results obtained from mining the data in the database. The second approach applies data mining techniques to visualization by capturing essential semantics visually. The third approach uses visualization techniques to complement the data mining techniques. The fourth approach uses visualization techniques to steer the mining process.
In general, visualization can be used to explore data, to confirm a hypothesis or to manipulate a view. Exploratory visualization creates a dynamic scenario in which interaction is critical. The user does not necessarily know what he/she is looking for, can search for structures or trends, and is attempting to arrive at some hypothesis. In confirmatory visualization, the system parameters are often predetermined and the visualization tools are used to confirm or refute the hypothesis. Manipulative visualization focuses on refining the visualization to optimize the presentation. Visualization has been categorized into two major areas: i) scientific visualization, which focuses primarily on physical data such as the human body, etc.; ii) information visualization, which focuses on abstract nonphysical data such as text, hierarchies and statistical data. Data mining techniques are primarily oriented towards information visualization [4]. Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data [10, 15]. Even though visualization techniques have advantages over automatic methods, they bring up some specific problems such as limitation in visibility, visual bias due to the mapping of the dataset to a 2D/3D representation, easy-to-use visual interface operations and reliable human-computer interaction. In most of the visualization methods the human-computer interaction costs more than automated methods [9]. In general, visual data mining is different from scientific visualization and it has the following characteristics:
Wide range of users
Wide choice of visualization techniques and
Important dialog function.
The users of scientific visualization are scientists and engineers who can tolerate some difficulty in using the system, whereas a visual data mining system must have the
2009 ACADEMY PUBLISHER
possibility that general users can use it widely and easily [16]. Considering this issue, this paper proposes a novel information visualization technique called the enhanced visual clustering system (EVISTA), an extended version of VISTA [8]. VISTA is a dynamic data visualization model which invites humans into the clustering process. Even though VISTA proved to be an efficient interactive visual cluster rendering system, it requires complete user interaction throughout the clustering process. When the number of dimensions increases, the human-computer interaction becomes tedious. EVISTA is designed to provide efficient data visualization such that the user is able to understand the underlying pattern of the given data set without human intervention.
The rest of the paper is organized as follows: Section 2 reviews related work in the domain of information visualization. Section 3 deals with EVISTA. Section 4 discusses the experimental analysis. Section 5 concludes the paper.
II. RELATED WORKS
Various efforts have been made to visualize multidimensional datasets [2, 10, 11, 13]. The early research on general plot-based data visualization is Grand Tour and Projection Pursuit [2]. The purpose of the Grand Tour and Projection Pursuit is to guide the user to find the interesting projections.
L. Yang [2] utilizes the Grand Tour technique to show projections of datasets in an animation. The dimensions are projected to coordinates in a 3D space. However, when the 3D space is shown on a 2D screen, some axes may be overlapped by other axes, which makes it hard to perform direct interactions on dimensions.
Star coordinates [7] is an interactive visualization model which treats dimensions uniformly, in which data are represented coarsely by simple and more space-efficient points, which results in less cluttered visualization for large datasets.
Interactive visual clustering (IVC) [10] combines spring-embedded graph layout techniques with user interaction and constrained clustering.
VISTA [8, 9] is a recent visualization model that utilizes the star coordinate system and provides a mapping function similar to that of star coordinate systems. There are two types of cluster rendering in the VISTA model: unguided rendering and guided rendering.
III. ENHANCED VISUAL CLUSTERING SYSTEM
Enhanced VISTA (EVISTA) is an information visualization framework that employs improved data visualization and reveals the hidden patterns in complex high-dimensional data sets without human intervention. The EVISTA model is designed based on star coordinates. The star coordinate system is a traditional multivariate data visualization technique in which the k axes are defined by an origin $O = (x, y)$ and k coordinates $S_1, S_2, \ldots, S_k$, which represent the k dimensions in 2D space. The k coordinates are equidistantly distributed on the circumference of the circle C, where the unit vectors are
$$S_i = \left( \cos\frac{2\pi i}{k},\ \sin\frac{2\pi i}{k} \right), \quad i = 1, 2, \ldots, k$$
and the 2D point $Q(x, y)$ is obtained by
$$Q(x, y) = \left( \ldots \sum_{i=1}^{k} y_i' \cos\frac{2\pi i}{k},\ \ldots \sum_{i=1}^{k} y_i' \sin\frac{2\pi i}{k} \right)$$
where $x_i$ represents the given data object and $y_i'$ is the normalized data value based on the weighted vector $wt$ [14].
EVISTA employs the design of the VISTA visual cluster rendering system proposed by Keke Chen and L. Liu [8], which provides an intuitive way to visualize clusters with interactive feedback to encourage domain experts to participate in the clustering revision and cluster validation process. It allows the user to interactively observe potential clusters in a series of continuously changing visualizations through α. More importantly, it can include algorithmic clustering results and serve as an effective validation and refinement tool for irregularly shaped clusters [9]. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multi-dimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge.
The VISTA visualization model consists of two linear mappings: max-min normalization followed by α-mapping. Max-min normalization, given in Equation (5), is used to normalize the columns in the dataset so as to eliminate the dominating effect of large-valued columns:
$$v' = \frac{2\,(v - \min)}{\max - \min} - 1 \qquad (5)$$
where $v$ is the original value and $v'$ is the normalized value. The α-mapping maps k-dimensional points onto the two-dimensional visual space with the convenience of visual parameter tuning.
The proposed visualization model EVISTA utilizes weighted vector normalization, which is performed on rows instead of columns, such that the visualization model defines a reliable position of $Q(x, y)$. EVISTA completely eliminates the use of α tuning, since α-mapping is tedious when the number of dimensions is high, and each change in the α values requires a fresh visual distance computation. As the number of dimensions increases, the visual distance computation becomes more time-consuming. Similar effects may occur when the number of data objects increases. This makes the human-computer interaction ineffective and limits the applicability of VISTA.
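As an illustration, the two normalization choices and the star-coordinate projection discussed in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the `scale` parameter are hypothetical, and `row_normalize` is only an assumed stand-in for weighted vector normalization (each row is multiplied by a weight vector `wt` and divided by its Euclidean norm), since the exact definition is deferred to [14].

```python
import math


def max_min_normalize(rows):
    """Column-wise max-min normalization to [-1, 1] (the VISTA-style step)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [2.0 * (v - l) / (h - l) - 1.0 if h > l else 0.0
         for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def row_normalize(rows, wt=None):
    """ASSUMED stand-in for weighted vector normalization (row-wise).

    Each row is scaled element-wise by the weights wt and then divided by
    its Euclidean norm. The real definition may differ; see [14].
    """
    out = []
    for row in rows:
        w = wt if wt is not None else [1.0] * len(row)
        scaled = [v * wi for v, wi in zip(row, w)]
        norm = math.sqrt(sum(s * s for s in scaled))
        out.append([s / norm if norm > 0 else 0.0 for s in scaled])
    return out


def star_project(rows, scale=1.0):
    """Map k-dimensional rows to 2D points on k equally spaced star axes.

    Axis i has the unit vector (cos(2*pi*i/k), sin(2*pi*i/k)), i = 1..k.
    The scale factor is a hypothetical parameter for display sizing.
    """
    k = len(rows[0])
    axes = [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(1, k + 1)]
    return [
        (scale * sum(v * ax for v, (ax, _) in zip(row, axes)),
         scale * sum(v * ay for v, (_, ay) in zip(row, axes)))
        for row in rows
    ]
```

A call such as `star_project(row_normalize(data))` gives one 2D point per data object with no α parameters involved, whereas a VISTA-style pipeline would start from `max_min_normalize(data)` and additionally weight each axis by a user-tuned α value.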
IV. EXPERIMENTAL ANALYSIS
To illustrate the efficiency of the proposed visualization, empirical analyses are conducted on a number of benchmark data sets available in the UCI machine learning data repository. The performance of EVISTA is compared against the VISTA system and the automatic clustering algorithm K-Means. The experiments in VISTA are conducted by setting the α value to 1. The detailed information of the data sets is shown in Table I.
TABLE I. DETAILS OF DATASETS
Column headings: Attributes, Classes
A. Cluster validation
Validation of clusters is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous datasets. The quality of clusters obtained through visual clustering is measured in terms of three classical methods proposed in [3]. The Rand index and Jaccard coefficient validations are based on the agreement between clustering results and the "ground truth".
The classical validity measures are heavily related to the geometry or density nature of clusters and they do not work well for arbitrarily shaped clusters [8]. In such cases, visual perception plays an important role in deciding the right clusters.
B. Results and Discussion
Iris Data: The Iris dataset is a benchmark dataset widely used in pattern recognition and clustering. It consists of 150 four-dimensional instances of three classes of plants, classified according to the sepal length and width and the petal length and width. The iris dataset consists of three clusters with equal distribution. One cluster is linearly separable from the other two; the latter two are not exactly linearly separable from each other. Figure 1 shows the initial visualization of the iris dataset in the VISTA model, where we observe the possibility of three clusters. It is observed from the figure that one cluster is completely separated from the other two, while the remaining two are found to overlap. After performing interactive visual clustering with suitable α tuning, the visual boundaries between the clusters become clearer. Figure 2 shows the visualization of the iris dataset after α tuning. As the literature on the iris dataset specifies, the two clusters are not linearly separable. In VISTA this could be observed only after the fine tuning of α. The small region which consists of the overlapping data points is also observed. More importantly, the separation of the two clusters is found to be difficult for the users.
Figure 1. Visualization of Iris Dataset using VISTA system
Figure 2. Visualization of Iris Dataset after α-tuning using VISTA system
Figure 3. Visualization of Iris Dataset using EVISTA system
In VISTA, domain knowledge plays a vital role in finding the optimum number of clusters. In general, domain knowledge in the form of labeled items obtained by traditional automatic clustering algorithms such as K-Means can be incorporated into the visual clustering process. A user without domain knowledge may fail to find the optimum clusters, since α tuning changes the data point distribution. Most of the automated clustering algorithms require the number of clusters to be specified in advance, which may not coincide with the real cluster distribution of the dataset. This increases the complexity of the clustering process. EVISTA reduces the complexity of clustering by eliminating the use of α. Figure 3 shows the iris dataset visualization based on EVISTA.
From the results, it is observed that one cluster is completely separated from the others and the visual boundaries between the other two clusters are clearly identified. It is also noticed that only two data points overlap. Since EVISTA doesn't use α tuning, the visual distance computation is completely eliminated, which reduces the time complexity. EVISTA doesn't require domain knowledge in any form, which eases the human-computer interaction, and it visualizes the exact pattern of the given dataset without human intervention.
Australian Data: The Australian dataset concerns credit card applications. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. This data set also has missing values. Suitable statistical computation is applied for finding the missing values. It has
two classes. The class distribution is 44.5% for class A and 55.5% for class B.
Figure 4 shows the visualization of the Australian data set in VISTA, where possibly one single cluster is observed. During α tuning, the user is able to identify the two clusters. If the α tuning is not performed carefully, the user may get a different pattern, which may lead to confusion. Figure 5 shows the process of α tuning, where a four-cluster distribution is observed. This leads to poor cluster quality. In such a case, domain knowledge is the only aid to identify the optimum number of clusters. Figure 6 shows the cluster distribution using EVISTA, where two potential clusters are observed. Since α tuning is not included in the EVISTA model, the cluster distribution can be clearly visualized. Even though the user doesn't have enough domain knowledge in any form, such as the number of clusters or the cluster distribution, the visualization model EVISTA suitably identifies the optimum number of clusters.
Figure 4. Visualization of Australian Dataset using VISTA system
Figure 5. Visualization of Australian Dataset using VISTA system with α-tuning
Figure 6. Visualization of Australian Dataset using EVISTA
Pima Data: The Pima dataset is the Pima Indian Diabetes Database with 768 data objects. It has two classes with a class distribution of 500 and 268. It consists of attributes such as number of times pregnant, plasma glucose concentration, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), diabetes pedigree function, etc. Figure 7 shows the VISTA visualization of the Pima Indian dataset. When the Pima dataset is visualized using VISTA, one possible cluster is observed. Even suitable α tuning doesn't distinguish the clusters. The boundary regions of the two clusters are possibly not distinguishable.
In contrast, the EVISTA visualization of the Pima dataset clearly shows two potential clusters. From Fig. 8 it is observed that the Pima dataset contains two potential clusters, and a few data objects are scattered around the potential area. Since EVISTA doesn't require α tuning, the user may find it very flexible in finding the underlying pattern of the dataset without human intervention. With suitable geometric transformations such as scaling and rotation, the user may be able to observe the cluster distribution according to their visual perception.
Figure 7. Visualization of Pima Dataset using VISTA system
Figure 8. Visualization of Pima Dataset using EVISTA system
C. Comparative Analysis
This part of the section compares the results of EVISTA with VISTA and the centroid-based automatic clustering algorithm K-Means. In EVISTA the cluster labeling is performed using free-hand drawing. The area with potential data points is covered by a convex hull and the data points in the convex hull are labeled as one single cluster. The cluster results, evaluated based on the Rand Index and Jaccard coefficients, are shown in Table II and Table III. The results of VISTA are obtained by conducting the experiments over several runs, and the average of them is taken for experimental analysis.
V. CONCLUSION
With the development of data collection technology, effective data visualization models are required to understand the pattern of multidimensional and multivariate data. In this paper Enhanced VISTA is proposed to gain improvement in data visualization. EVISTA is designed with weight vector normalization, which improves the data exploration. The elimination of α tuning in the visualization process reduces the complexity of human-computer interaction. More importantly, EVISTA doesn't require domain knowledge in any form, which improves the applicability of EVISTA. The experimental results show that EVISTA efficiently identifies the cluster distribution and reduces the complexity of the visual distance computation. Specifically, it eases the human-computer interaction.
ACKNOWLEDGMENT
The first author expresses his thanks to the University Grants Commission for financial support (F-No. 34-105/2008, SR).
REFERENCES
[1] Bhavani Thuraisingham, "Data Mining: Technologies, Techniques, Tools and Trends", CRC Press, London, New York, Washington, 1999.
[2] Cook, D. R., Buja, A., Cabrera, J., and Hurley, C., "Grand Tour and Projection Pursuit", J. Computational and Graphical Statistics, v23, 1995.
[3] Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster analysis for gene expression data: a survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, 2004.
[4] Daniel A. Keim and Hans-Peter Kriegel, "Visualization Techniques for Mining Large Databases: A Comparison", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 1996.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000, ISBN 1-55860-489-8.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: A Review", ACM Computing Surveys, 1999.
[7] E. Kandogan, "Visualizing Multi-dimensional Clusters, Trends and Outliers using Star Coordinates", Proc. of ACM KDD, 2001.
[8] Keke Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization", Information Visualization, Vol. 3, 4.
[9] Keke Chen and L. Liu, "iVIBRATE: Interactive Visualization-Based Framework for Clustering Large Datasets", ACM Transactions on Information Systems, Vol. 24, April 2006.
[10] Marie desJardins, James MacGlashan, Julia Ferraioli, "Interactive visual clustering", Intelligent User Interfaces 2007.
[11] Melanie Tory and Torsten Moller, "Human Factors in Visualization Research", IEEE Transactions on Visualization and Computer Graphics, 10(1), 2004.
[12] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, Boston, 2006.
[13] O. Sourina and D. Liu, "Visual interactive 3-dimensional clustering with implicit functions", Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, Volume 1, 1-3 Dec 2004, pp. 382-386.
[14] K. Thangavel and D. Ashok Kumar, "Optimization of code book in Vector Quantization", International Journal Annals of Operations Research, Vol. 143, No. 1, pp. 317-325, 2006.
[15] Ye N., "The Handbook of Data Mining", Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2003.
[16] Zhen Liu, Shinichi Kamohara, Minyi Guo, "A Scheme of Interactive Data Mining Support System in Parallel and Distributed Environment", ISPA 2003, LNCS 2745, Springer-Verlag, pp. 263-272, 2003.
TABLE II. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON RAND INDEX
Column headings: Visual Clustering; Without α; With α
Values: 68.00, 65.45, 61.71; 55.13; Australian: 63.46
TABLE III. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON JACCARD COEFFICIENT
Column headings: Visual Clustering; With α tuning
Values: 58.00, 64.31, 59.05; 45.84; Australian: 48.82
Figure 9. Comparison based on Rand Index
Figure 10. Comparison based on Jaccard coefficients
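For reference, the pair-counting form of the Rand index and the Jaccard coefficient behind comparisons of this kind can be sketched as follows. This is a minimal sketch using the standard textbook definitions, assuming every data object carries one ground-truth class label and one assigned cluster label; the function names are hypothetical and it is not claimed to be the code used for the reported experiments. Values are returned as fractions in [0, 1].

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Classify every pair of objects by agreement in truth and in pred.

    Returns (ss, sd, ds, dd):
      ss: same class and same cluster
      sd: same class, different cluster
      ds: different class, same cluster
      dd: different class and different cluster
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            ss += 1
        elif same_truth:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(truth, pred):
    """Fraction of pairs on which the clustering agrees with the ground truth."""
    ss, sd, ds, dd = pair_counts(truth, pred)
    total = ss + sd + ds + dd
    return (ss + dd) / total if total else 1.0


def jaccard_coefficient(truth, pred):
    """Like the Rand index, but ignoring pairs separated in both labelings."""
    ss, sd, ds, _ = pair_counts(truth, pred)
    denom = ss + sd + ds
    # With no co-clustered pair at all the ratio is undefined; 0.0 is
    # returned here as a convention of this sketch.
    return ss / denom if denom else 0.0
```

Both measures depend only on which objects are grouped together, so renaming the cluster labels does not change the score. The Jaccard coefficient is typically lower than the Rand index on the same result because it gives no credit for pairs that are correctly kept apart.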
Source: http://www.academypublisher.com/ijrte/vol02/no01/ijrte02018387.pdf