
FULL PAPER
International Journal of Recent Trends in Engineering, Vol 2, No. 1, November 2009
© 2009 ACADEMY PUBLISHER

EVISTA – Interactive Visual Clustering System

K. Thangavel1, P. Alagambigai2
1 Department of Computer Science, Periyar University, Salem, Tamilnadu, India
2 Department of Computer Applications, Easwari Engineering College, Chennai, Tamilnadu, India

Abstract: With the enormous growth in data volumes, exploring and analyzing data is increasingly important but difficult to achieve. Information visualization and visual data mining can help to deal with this. Visual data exploration has high potential, and many applications, such as fraud detection and data mining, use information visualization technology for improved data analysis. The advantage of visual data exploration is that the user is directly involved in the data mining process. A large number of information visualization techniques have been developed over the last decade to support the exploration of large data sets. VISTA is an interactive visual cluster rendering system that brings the human into the clustering process, but it has limitations in identifying the cluster distribution and in human-computer interaction. In this paper, we propose an Enhanced VISTA (EVISTA) that addresses these drawbacks. EVISTA improves the visualization in two ways. First, it uses weighted vector normalization instead of max-min normalization, which improves the data visualization so that the user can understand the underlying pattern without human intervention. Second, it completely eliminates α tuning, which reduces the complexity of visual distance computation and eases human-computer interaction. The experimental results show that EVISTA explores the underlying pattern of the dataset effectively and greatly reduces the user's operation burden.

Index Terms: Clustering, EVISTA, Human-computer interaction, Information visualization, Visual data mining.

I. INTRODUCTION

Data visualization is essential for understanding the concept of multidimensional spaces [5]. It allows the user to explore the data in different ways at different levels of abstraction to find the right level of detail. Therefore, techniques are most useful if they are highly interactive, permit direct manipulation, and have a rapid response time. Visualization is defined by Ware as "a graphical representation of data or concepts" which is either an "internal construct of the mind" or an "external artifact supporting decision making". Visualization provides valuable assistance to the human by representing information visually. This assistance may be called cognitive support. Visualization can provide cognitive support through a number of mechanisms, such as grouping related information for easy search and access, representing large volumes of data in a small space, imposing structure on data and tasks, which can reduce time complexity, and allowing interactive exploration through manipulation of parameter values [11].

Visualization techniques could enhance current knowledge and data discovery methods by increasing the user's involvement in the interactive process. More recently there has been much discussion of visualization for data mining. Visual data mining can be viewed as an integration of data visualization and data mining [5, 15]. Considering visualization as a supporting technology in data mining, four possible approaches are stated in [1]. The first approach is the use of visualization techniques to present the results obtained from mining the data in the database. The second approach applies data mining techniques to visualization by capturing essential semantics visually. The third approach uses visualization techniques to complement the data mining techniques. The fourth approach uses visualization techniques to steer the mining process.

In general, visualization can be used to explore data, to confirm a hypothesis, or to manipulate a view. Exploratory visualization creates a dynamic scenario in which interaction is critical. The user does not necessarily know what he or she is looking for, can search for structures or trends, and is attempting to arrive at some hypothesis. In confirmatory visualization, the system parameters are often predetermined and the visualization tools are used to confirm or refute the hypothesis. Manipulative visualization focuses on refining the visualization to optimize the presentation. Visualization has been categorized into two major areas: i) scientific visualization, which focuses primarily on physical data such as the human body, and ii) information visualization, which focuses on abstract nonphysical data such as text, hierarchies and statistical data. Data mining techniques are primarily oriented toward information visualization [4]. Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data [10, 15]. Even though visualization techniques have advantages over automatic methods, they bring up some specific problems, such as limited visibility, visual bias due to the mapping of the dataset to a 2D/3D representation, and the need for easy-to-use visual interface operations and reliable human-computer interaction. In most visualization methods the human-computer interaction costs more than in automated methods [9].

In general, visual data mining is different from scientific visualization and has the following characteristics: a wide range of users, a wide choice of visualization techniques, and an important dialog function. The users of scientific visualization are scientists and engineers who can tolerate some difficulty in using the system, whereas a visual data mining system must have the
possibility of being used widely and easily by general users [16]. Considering this issue, this paper proposes a novel information visualization technique called the enhanced visual clustering system (EVISTA), an extension of VISTA [8]. VISTA is a dynamic data visualization model that brings the human into the clustering process. Even though VISTA has proved to be an efficient interactive visual cluster rendering system, it requires user interaction throughout the clustering process. When the number of dimensions increases, the human-computer interaction becomes tedious. EVISTA is designed to provide efficient data visualization so that the user can understand the underlying pattern of the given data set without human intervention.

The rest of the paper is organized as follows: Section 2 reviews related work in the domain of information visualization. Section 3 describes EVISTA. Section 4 discusses the experimental analysis. Section 5 concludes the paper.

II. RELATED WORKS

Various efforts have been made to visualize multidimensional datasets [2, 10, 11, 13]. The early research on general plot-based data visualization is the Grand Tour and Projection Pursuit [2]. The purpose of the Grand Tour and Projection Pursuit is to guide the user to find interesting projections.

L. Yang [2] utilizes the Grand Tour technique to show projections of datasets in an animation. The dimensions are projected to coordinates in a 3D space. However, when the 3D space is shown on a 2D screen, some axes may be overlapped by other axes, which makes it hard to perform direct interactions on dimensions.

Star coordinates [7] is an interactive visualization model that treats dimensions uniformly. Data are represented coarsely by simple and more space-efficient points, which results in less cluttered visualization for large datasets.

Interactive visual clustering (IVC) [10] combines spring-embedded graph layout techniques with user interaction and constrained clustering.

VISTA [8, 9] is a recent visualization model that utilizes the star coordinate system and provides a mapping function similar to that of star coordinates. There are two types of cluster rendering in the VISTA model: unguided rendering and guided rendering.

III. ENHANCED VISUAL CLUSTERING SYSTEM

Enhanced VISTA (EVISTA) is an information visualization framework that employs improved data visualization and reveals the hidden patterns in complex high-dimensional data sets without human intervention. The EVISTA model is designed based on star coordinates. The star coordinate system is a traditional multivariate data visualization technique in which the k axes are defined by an origin O = (x, y) and k coordinates S_1, S_2, ..., S_k that represent the k dimensions in 2D space. The k coordinates are equidistantly distributed on the circumference of the circle C, where the unit vectors are

$$S_i = \left(\cos\frac{2\pi i}{k},\ \sin\frac{2\pi i}{k}\right), \quad i = 1, 2, \ldots, k.$$

The 2D point Q(x, y) is obtained by [equation garbled in the source; only the fragment "∑ y_i' sin" survives], where x represents the given data object and the primed value is the data value normalized with the weighted vector wt [14].

EVISTA employs the design of the VISTA visual cluster rendering system proposed by Keke Chen and L. Liu [8], which provides an intuitive way to visualize clusters with interactive feedback to encourage domain experts to participate in the clustering revision and cluster validation process. It allows the user to interactively observe potential clusters in a series of continuously changing visualizations through α. More importantly, it can include algorithmic clustering results and serve as an effective validation and refinement tool for irregularly shaped clusters [9]. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multi-dimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge.

The VISTA visualization model consists of two linear mappings: max-min normalization followed by α-mapping. Max-min normalization, given in Equation (5), is used to normalize the columns in the datasets so as to eliminate the dominating effect of large-valued columns:

$$v' = \frac{2\,(v - \min)}{\max - \min} - 1 \qquad (5)$$

where v is the original value, v' is the normalized value, and min and max are the bounds of the column. The α-mapping maps k-dimensional points onto a two-dimensional visual space with the convenience of visual parameter tuning.

The proposed visualization model EVISTA utilizes weighted vector normalization, which is performed on rows instead of columns, so that the visualization model defines a reliable position of Q(x, y). EVISTA completely eliminates α tuning, since α-mapping is tedious when the number of dimensions is high, and each change in the α values requires a fresh visual distance computation. As the number of dimensions increases, the visual distance computation becomes time-consuming. Similar effects may occur when the number of data objects increases. This makes
the human-computer interaction ineffective and affects the applicability of VISTA.

IV. EXPERIMENTAL ANALYSIS

To illustrate the efficiency of the proposed visualization, empirical analyses are conducted on a number of benchmark data sets available in the UCI machine learning repository. The performance of EVISTA is compared against the VISTA system and the automatic clustering algorithm K-Means. The experiments in VISTA are conducted by setting the α value to 1. The detailed information on the data sets is shown in Table I.

TABLE I. DETAILS OF DATASETS
(Table body lost in extraction; surviving column headers: Attributes, Classes.)

A. Cluster validation

Validation of clusters is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous datasets. The quality of clusters obtained through visual clustering is measured in terms of three classical methods proposed in [3]. The Rand index and Jaccard coefficient validations are based on the agreement between clustering results and the "ground truth".

The classical validity measures are closely tied to the geometry or density of clusters, and they do not work well for arbitrarily shaped clusters [8]. In such cases, visual perception plays an important role in deciding the right clusters.

B. Results and Discussion

Iris Data: The Iris dataset is a benchmark dataset widely used in pattern recognition and clustering. It consists of 150 four-dimensional instances of three classes of plants, described by sepal length and width and petal length and width. The Iris dataset consists of three clusters with equal distribution. One cluster is linearly separable from the other two; the latter two are not exactly linearly separable from each other. Figure 1 shows the initial visualization of the Iris dataset in the VISTA model, where we observe the possibility of three clusters. One cluster is completely separated from the other two, while the remaining two overlap. After interactive visual clustering with suitable α tuning, the visual boundaries between the clusters become clearer. Figure 2 shows the visualization of the Iris dataset after α tuning. As the literature on the Iris dataset specifies, the two clusters are not linearly separable. In VISTA this can be observed after fine tuning of α. The small region that contains the overlapping data points is also observed. More importantly, separating the two clusters is found to be difficult for users.

Figure 1. Visualization of Iris Dataset using VISTA system
Figure 2. Visualization of Iris Dataset after α-tuning using VISTA system
Figure 3. Visualization of Iris Dataset using EVISTA system

In VISTA, domain knowledge plays a vital role in finding the optimum number of clusters. In general, domain knowledge in the form of labeled items obtained by traditional automatic clustering algorithms such as K-Means can be incorporated into the visual clustering process. A user without domain knowledge may fail to find the optimum clusters, since α tuning changes the data point distribution. Most automated clustering algorithms require the number of clusters to be specified in advance, and that number may not coincide with the real cluster distribution of the dataset. This increases the complexity of the clustering process. EVISTA reduces the complexity of clustering by eliminating the use of α. Figure 3 shows the Iris dataset visualization based on EVISTA.

From the results, it is observed that one cluster is completely separated from the others and the visual boundaries between the other two clusters are clearly identified. Only two data points overlap. Since EVISTA has no α tuning, the visual distance computation is completely eliminated, which reduces the time complexity. EVISTA does not require domain knowledge in any form, which eases the human-computer interaction, and it visualizes the exact pattern of the given dataset without human intervention.

Australian Data: The Australian dataset concerns credit card applications. This dataset is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. The data set also has missing values, which are filled in using a suitable statistical computation. It has
two classes. The class distribution is 44.5% for class A and 55.5% for class B.

Figure 4 shows the visualization of the Australian data set in VISTA, where possibly one single cluster is observed. During α tuning, the user is able to identify the two clusters. If the α tuning is not performed carefully, the user may get a different pattern, which may lead to confusion. Figure 5 shows the process of α tuning, where a four-cluster distribution is observed. This leads to poor cluster quality. In such a case, domain knowledge is the only aid for identifying the optimum number of clusters. Figure 6 shows the cluster distribution using EVISTA, where two potential clusters are observed. Since α tuning is not included in the EVISTA model, the cluster distribution can be clearly visualized. Even when the user lacks domain knowledge such as the number of clusters or the cluster distribution, the EVISTA visualization model suitably identifies the optimum number of clusters.

Figure 4. Visualization of Australian Dataset using VISTA system
Figure 5. Visualization of Australian Dataset using VISTA system with α-tuning
Figure 6. Visualization of Australian Dataset using EVISTA

Pima Data: The Pima dataset is the Pima Indians Diabetes database with 768 data objects. It has two classes, with a class distribution of 500 and 268. It consists of attributes such as number of times pregnant, plasma glucose concentration, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), diabetes pedigree function, etc. Figure 7 shows the VISTA visualization of the Pima dataset. When the Pima dataset is visualized using VISTA, one possible cluster is observed. Even suitable α tuning does not distinguish the clusters, and the boundary regions of the two clusters are possibly not distinguishable. In contrast, the EVISTA visualization of the Pima dataset clearly shows two potential clusters. From Figure 8 it is observed that the Pima dataset contains two potential clusters, and a few data objects are scattered around the potential area. Since EVISTA does not require α tuning, the user may find it very flexible in finding the underlying pattern of the dataset without human intervention. With suitable geometric transformations such as scaling and rotation, the user may be able to observe the cluster distribution according to their visual perception.

Figure 7. Visualization of Pima Dataset using VISTA system
Figure 8. Visualization of Pima Dataset using EVISTA system

C. Comparative Analysis

This part of the section compares the results of EVISTA with VISTA and the centroid-based automatic clustering algorithm K-Means. In EVISTA the cluster labeling is performed using free-hand drawing. The area with potential data points is covered by a convex hull, and the data points in the convex hull are labeled as one single cluster. The cluster results, evaluated based on the Rand index and the Jaccard coefficient, are shown in Table II and Table III. The results of VISTA are obtained by conducting the experiments over several runs, and their average is taken for experimental analysis.

V. CONCLUSION

With the development of data collection technology, effective data visualization models are required to understand the pattern of multidimensional and multivariate data. In this paper Enhanced VISTA is proposed to improve data visualization. EVISTA is designed with weighted vector normalization, which improves the data exploration. The elimination of α tuning in the visualization process reduces the complexity of human-computer interaction. More importantly, EVISTA does not require domain knowledge in any form, which improves its applicability. The experimental results show that EVISTA efficiently identifies the cluster distribution and reduces the complexity of the visual distance computation. Specifically, it eases the human-computer interaction.
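As an illustration of the star-coordinate mapping and the max-min normalization described in Section III, the following is a minimal Python sketch. It is not the authors' implementation. The function names, the `alpha` parameter (all ones by default, matching the α = 1 setting used for the VISTA experiments), and the 1/k scale factor are assumptions made for this example. The weighted vector normalization of [14] is not specified in this paper and is therefore not reproduced here.

```python
import numpy as np


def max_min_normalize(data):
    """Column-wise max-min normalization to [-1, 1], following Equation (5)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Assumption: a constant column gets span 1 so the division is defined.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return 2.0 * (data - col_min) / span - 1.0


def star_coordinates(normalized, alpha=None):
    """Project n x k normalized data to 2D with k equally spaced unit axes.

    alpha holds one weight per dimension (hypothetical parameter);
    the default of all ones means no tuning is applied.
    """
    normalized = np.asarray(normalized, dtype=float)
    _, k = normalized.shape
    if alpha is None:
        alpha = np.ones(k)
    # Unit vectors S_i = (cos(2*pi*i/k), sin(2*pi*i/k)), i = 1..k.
    angles = 2.0 * np.pi * np.arange(1, k + 1) / k
    axes = np.column_stack((np.cos(angles), np.sin(angles)))
    # Each point is the weighted sum of its coordinates along the axes;
    # the 1/k scaling is an illustrative choice.
    return (normalized * alpha) @ axes / k


if __name__ == "__main__":
    raw = [[5.1, 3.5, 1.4, 0.2],
           [7.0, 3.2, 4.7, 1.4],
           [6.3, 3.3, 6.0, 2.5]]
    print(star_coordinates(max_min_normalize(raw)))
```

Because both steps are linear, changing one entry of `alpha` moves every point along the corresponding axis direction, which is why each α adjustment in VISTA requires the projection to be recomputed.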

ACKNOWLEDGMENT

The first author expresses his thanks to the University Grants Commission for financial support (F-No. 34-105/2008, SR).

REFERENCES

[1] Bhavani Thuraisingham, "Data Mining: Technologies, Techniques, Tools and Trends", CRC Press, London, New York, Washington, 1999.
[2] Cook, D.R., Buja, A., Cabrea, J., and Harley, H., "Grand Tour and Projection Pursuit", J. Computational and Graphical Statistics, v23, 1995.
[3] Daxin Jiang, Chun Tang, Aidong Zhang, "Cluster analysis for gene expression data: a survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, 2004.
[4] Daniel A. Keim and Hans-Peter, "Visualization Techniques for Mining Large Databases: A Comparison", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No., 1996.
[5] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000, ISBN 1-55860-489-8.
[6] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: A Review", ACM Computing Surveys, 1999.
[7] E. Kandogan, "Visualizing Multi-dimensional Clusters, Trends and Outliers using Star Coordinates", Proc. of ACM KDD, 2001.
[8] Keke Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization", Information Visualization, Vol. 3, 4,
[9] Keke Chen and L. Liu, "iVIBRATE: Interactive Visualization-Based Framework for Clustering Large Datasets", ACM Transactions on Information Systems, Vol. 24, April 2006.
[10] Marie desJardins, James MacGlashan, Julia Ferraioli, "Interactive visual clustering", Intelligent User Interfaces 2007,
[11] Melanie Tory and Torsten Moller, "Human Factors in Visualization Research", IEEE Transactions on Visualization and Computer Graphics, 10(1), 2004.
[12] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, Boston, 2006.
[13] O. Sourina, D. Liu, "Visual interactive 3-dimensional clustering with implicit functions", Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, Volume 1, 1-3 Dec 2004, pp. 382-386.
[14] K. Thangavel and D. Ashok Kumar, "Optimization of code book in Vector Quantization", International Journal Annals of Operations Research, Vol. 143, No. 1, 317-325, 2006.
[15] Ye N., "The Hand Book of Data Mining", Lawrence Erlbaum Associates, Publishers, Mahwah, New Jersey, 2003.
[16] Zhen Liu, Shinichi Kamohara, Minyi Guo, "A Scheme of Interactive Data Mining Support System in Parallel and Distributed Environment", ISPA 2003, LNCS 2745, Springer-Verlag, pp. 263-272, 2003.

TABLE II. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON RAND INDEX
(Table body lost in extraction. Surviving column headers: Visual Clustering, Without α tuning, With α tuning. Surviving values, row and column assignment unknown: 68.00, 65.45, 61.71, 55.13; Australian: 63.46.)

TABLE III. COMPARISON OF EVISTA WITH VISTA AND K-MEANS BASED ON JACCARD COEFFICIENT
(Table body lost in extraction. Surviving column headers: Visual Clustering, With α tuning. Surviving values, row and column assignment unknown: 58.00, 64.31, 59.05, 45.84; Australian: 48.82.)

Figure 9. Comparison based on Rand Index
Figure 10. Comparison based on Jaccard coefficients
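The Rand index and Jaccard coefficient used for the comparisons in Table II and Table III measure how well a clustering agrees with the ground-truth labels over all pairs of data objects. The following is a minimal Python sketch of the standard pair-counting definitions. It is not the authors' evaluation code, and the function names are hypothetical. The scores lie in [0, 1]; the surviving table values suggest the paper reports them as percentages, which is an assumption here.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Count object pairs by agreement between two labelings.

    Returns (ss, sd, ds, dd):
      ss: same cluster in both labelings
      sd: same in the ground truth, different in the prediction
      ds: different in the ground truth, same in the prediction
      dd: different in both labelings
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1
        elif same_true:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    total = ss + sd + ds + dd
    # Convention chosen for this sketch: fewer than two objects scores 1.0.
    return (ss + dd) / total if total else 1.0


def jaccard_coefficient(labels_true, labels_pred):
    """Like the Rand index, but pairs separated in both labelings are ignored."""
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    denom = ss + sd + ds
    return ss / denom if denom else 1.0


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    found = [0, 0, 1, 2]
    print(rand_index(truth, found), jaccard_coefficient(truth, found))
```

The Jaccard coefficient is never larger than the Rand index, because it discards the pairs that both labelings separate; those pairs dominate when there are many clusters and can make the Rand index look optimistic.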

Source: http://www.academypublisher.com/ijrte/vol02/no01/ijrte02018387.pdf
