VISUAL DATA MINING APPLICATION

IN

MATERIAL RESEARCH

Anand M., Bharath B.N., Chaitra, Kiran Kumar M.S., Vinay C.           * Mr.T.L.Bharatheesh*

* P.E.S.Institute of Technology, Bangalore University

Bangalore, India

Emails: anand_palm@yahoo.co.in, bharath_97@yahoo.com, chaitrabhat@mecheng.iisc.ernet.in, kiran_k18@yahoo.com, vinay_channakeshavarao@hotmail.com

 

Abstract. Advances in manufacturing technology and the proliferation of information from research laboratories have created a problem of information explosion. To understand a system's behavior, it helps to visualize the process as a mapping from inputs to outputs. In materials research too, hypotheses about material properties and the compositions that lead to them can be better understood through multivariate visualization and machine learning techniques.

In this project an attempt has been made to develop a novel method for multivariate visualization using parallel coordinate systems. The materials classification problem is addressed with two popular methods, clustering and the k-nearest neighbor technique. The prototype allows the user to discover hidden patterns interactively through semiautomatic generation and testing of hypotheses.

As a case study, the prototype has been tested on an aluminium alloy database. The data for the case study was taken from an international laboratory in HTML format and then transformed to a standard format for data mining. The cleaned data was then used for visual and analytic data mining.

The preliminary application of data mining to the aluminium alloy database indicates promising results. A few interesting combinations of materials that have wide industrial applications have been discovered. It is hoped that the prototype, with some additional features, can help knowledge discovery in materials research.

 

Keywords. Data mining, knowledge discovery, visual data mining, multivariate visualization, parallel coordinates, data mining process, manufacturing, materials research, clustering, k-nearest neighbor.

 

 

1. Introduction.

In manufacturing industries, the constant quest for success and progress has shown that many problems can be solved by analyzing large volumes of unorganized, scattered raw information, that is, data. A great deal of raw data is collected every day, and it needs to be understood. Industries realized that what they really wanted was the knowledge, trends and patterns within the data, not the data itself. The inability to discover valuable information hidden in the data prevented them from transforming data into knowledge. Hence the need arose to extract valid, previously unknown and comprehensible information from large databases and to use it for growth.

To fulfill these goals, industries need to take the following steps:

·        Capture and integrate both the internal and external data into a comprehensive view that encompasses the whole industry

·        "Mine" the integrated data for information

·        Organize and present the information and knowledge in ways that expedite complex decision-making.

 

1.1. Need for Data Visualization.

The driving force behind visualizing data mining models can be broken down into two key areas: understanding and trust. Understanding is undoubtedly the most fundamental motivation. The most valuable way to use a data-mining model is to get the user to actually understand what is going on, so that action can be taken directly. Visualization helps in analyzing n-dimensional data: a subset of the dimensions can be selected, and the evolution of the system in these emphasized variables can be visualized directly. To minimize the loss of information, the remaining variables can be mapped onto visual attributes such as color or texture. This helps in depicting the behavior of a dynamical system in more than one dimension.

Another reason for data visualization is the limited human capacity to absorb large amounts of information. The volumes of data are overwhelming, and the human visual system and brain are not equipped to work with data in raw form. Data visualization allows much faster processing of the data and makes the patterns in it visible.

Data visualization seeks to combat the data understanding problem by utilizing the tools of data mining. In brief, data visualization is a method of presenting the output so that the entire problem and its solution are clearly visible to domain experts.

      

1.2. Different methods available for Data Visualization.

Data visualization uses graphical and numerical tools to reveal the information contained in data. It is a more effective approach to understanding and communication than numerical tools alone. Graphical methods hold the key to visualization. The best visualization method is the one that supports the most insight into the phenomenon under study. There are different methods available for visualizing data.

Data can be:

  1. Univariate
  2. Bivariate
  3. Multivariate

1.2.1. Univariate Data.

            Univariate data consist of samples or measurements of a single quantitative variable. A fundamental task with this type of data is to characterize its distribution. Other important tasks are comparing the distributions of samples from two or more populations and comparing the data distributions to standard distributions, especially the normal distribution. Transformations of the data are sometimes used to achieve a more desirable distribution.

            Univariate data is represented using the following methods.

·        Histogram

·        Pie chart

Histogram: A histogram is the graphical representation generally employed for a continuous distribution. It is constructed by drawing bars or rectangles on equal class intervals, with the class intervals on the x-axis and the frequency on the y-axis. The bars are drawn so that their height is proportional to the class frequency; with equal intervals, the area of each bar is then also proportional to the class frequency. If the class intervals are not equal, the bars are drawn to represent frequency density, which is obtained by dividing each frequency by the width of the corresponding class interval. A histogram is one of the best ways to summarize data and the most useful way of getting a high-level understanding of a database. Looking at the histogram, it is also possible to build an intuition about the important factors. When there are many values for a given predictor, the histogram begins to look smoother and smoother.

                                                       

F = Frequency        C = Class

Figure 1 Histogram 

The univariate graph above is a histogram of a single attribute plotted against its frequency.
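The frequency-density rule for unequal class intervals can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the prototype; the function name `frequency_density` and the sample class edges and frequencies are assumptions made up for the example.

```python
def frequency_density(class_edges, frequencies):
    """Return bar heights for a histogram whose classes may have unequal widths.

    class_edges holds the class boundaries in increasing order, so it has
    one more entry than frequencies.
    """
    if len(class_edges) != len(frequencies) + 1:
        raise ValueError("need exactly one more edge than frequencies")
    densities = []
    for i, freq in enumerate(frequencies):
        width = class_edges[i + 1] - class_edges[i]
        if width <= 0:
            raise ValueError("class edges must be strictly increasing")
        # Height = frequency / width, so that bar AREA stays proportional
        # to the class frequency.
        densities.append(freq / width)
    return densities


# Made-up values for illustration (not taken from the alloy database).
edges = [0, 10, 20, 40]
freqs = [5, 8, 6]
heights = frequency_density(edges, freqs)
```

With these inputs the last class is twice as wide as the others, so its bar is drawn at half the height its raw frequency would suggest.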

 

Pie chart: A pie chart is also a pictorial representation of univariate data. A circle is drawn and divided into sectors in such a way that the areas of the sectors are proportional to the given magnitudes. The angle at the center of the circle, which is 360°, is divided in the ratio of the individual data values. Thus the chart is obtained.

 

           Figure 2 Pie Chart              
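The division of the 360° central angle in the ratio of the data can be written down directly. The following is a minimal sketch; the function name `sector_angles` and the input values are hypothetical and serve only to illustrate the rule.

```python
def sector_angles(values):
    """Split 360 degrees among the values in proportion to their magnitudes."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [360.0 * v / total for v in values]


# Made-up magnitudes for illustration.
angles = sector_angles([10, 20, 30, 40])
```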

 

1.2.2. Bivariate Data

Bivariate data consist of paired samples or measurements of two quantitative variables. The general task with this type of data is to determine how the variables are related. In some cases the variables are a factor (independent variable) and a response (dependent variable). In other cases the variables are not functionally related, but their distributions can be compared.

            Bivariate data is represented using the following methods          

·        Scatter plots

·        Line graph

Scatter plots: Scatter plots have almost become a field of research in their own right. A number of variables are graphed on two or three axes. The simplest case is where each variable has its own axis. Color can be used to encode additional variables, and other visual coding can be used as well. Consider a data matrix $X = (x_{ij})$ with m attributes, where $x_{ij}$ is the ith observation or value from the jth domain, of which two or three dimensions are to be projected. Some objective function can be defined on X. The scatter plot is thus a projection $\mathbb{R}^m \to \mathbb{R}^2$ or $\mathbb{R}^m \to \mathbb{R}^3$ giving $(X_a, X_b, (X_c))$, where a, b (and c) denote specific attribute types.[1]


Figure 3 Scatter Plot   

Line graph: The line graph involves a method of curve fitting. The workhorse method for curve fitting is least-squares regression, which estimates the curve that has the minimum sum of squared residuals. The most common practice is to attempt to fit a straight line to the data. This is the first-degree case of polynomial fitting.

  

  Figure 4 Line Graph

This is a representation of a bivariate graph, in which only two attributes can be represented at a time.
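A least-squares straight-line fit of the kind described for the line graph can be sketched as follows. This is a minimal illustration using the standard closed-form solution for slope and intercept; the function name `fit_line` and the sample points are assumptions, not part of the original prototype.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Minimizes the sum of squared residuals sum((y - (slope*x + intercept))**2).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```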

 

1.2.3. Multivariate Data:      

Since the techniques described above can represent only one or two attributes at a time, multidimensional representations are used for visualization of the data. Multivariate data consist of one quantitative variable, the response, and two or more categorical variables as factors. There is a value of the response for each combination of levels of the categorical variables. The general task with this type of data is to determine how the response depends on the factors. In the parallel coordinate system, a line in N-dimensional space is represented by N-1 points.

             Multivariate data is represented using the following methods      

·        Dynamic Parallel coordinate plots

·        Chernoff faces

·        Independence diagram

·        Parallel coordinate plots

 Dynamic Parallel coordinate system: It is a graphic device for visual examination of large multivariate data sets. The dynamic parallel coordinate system plot uses the novel technique of representing a series of variables as a series of parallel axes, rather than orthogonal axes. The dynamic plot extends the capabilities of existing versions of this graphic device by providing a suite of interactive capabilities including brushing, facing, color manipulation, classification scheme, customization and axis variable reassignment. [6]

 Chernoff faces: The method of Chernoff faces, as the name implies, was invented by Chernoff in 1973 for representing multivariate data. The procedure is simple but effective. Each facial feature denotes a particular variable. For example, x1 can be associated with the size of the mouth, x2 with the size of the nose, x3 with the size of the eyes and so on. [7]

Independence diagram: Independence diagrams are used to recognize complex dependencies between attributes. Each attribute is divided into ranges, and for each pair of attributes the combination of these ranges defines a two-dimensional grid. For each cell of this grid, the number of data items in it is stored. The grid is displayed by scaling each attribute axis so that the displayed width of a range is proportional to the total number of data items within the range. The brightness of a cell is proportional to the density of data items in it.[8]

Brushing: Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method allowing one to select on screen specific data points or subsets of data and identify their characteristics or to examine their effects on relations between relevant variables.
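Brushing, reduced to its core, is a range selection over one or more variables. The sketch below shows that selection step only, without any on-screen interaction; the function name `brush`, the record layout and the values are hypothetical and chosen purely for illustration.

```python
def brush(records, ranges):
    """Return the records whose values lie inside every given (low, high) range.

    records: list of dicts mapping variable name -> value
    ranges:  dict mapping variable name -> (low, high), bounds inclusive
    """
    return [
        rec
        for rec in records
        if all(low <= rec[name] <= high for name, (low, high) in ranges.items())
    ]


# Made-up records for illustration only.
records = [
    {"id": 1, "x1": 0.25, "x2": 10.0},
    {"id": 2, "x1": 0.50, "x2": 12.0},
    {"id": 3, "x1": 0.28, "x2": 30.0},
]
selected = brush(records, {"x1": (0.2, 0.3)})
narrowed = brush(records, {"x1": (0.2, 0.3), "x2": (0.0, 15.0)})
```

The selected subset can then be highlighted in a plot or summarized to examine its effect on the relations between the other variables.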

 

2. Visualizing Multivariate Data

2.1. Parallel Coordinate plots.

The classic scatter diagram is a fundamental tool in the construction of a model for data. It allows us to detect structures in data such as linear or non-linear features, clusters, outliers and similar things. Unfortunately, scatter diagrams do not generalize readily beyond three dimensions; for this reason, visualization of multivariate data remains an unsolved problem.[5]

In 1985, Prof. Alfred Inselberg proposed a method for visualizing multivariate data in which, instead of preserving the orthogonality of the n dimensions, all axes are drawn parallel to one another and equally spaced.

Each data dimension is represented as a horizontal or vertical axis, and the n axes are organized as uniformly spaced lines. A data element in an n-dimensional space is mapped to a polyline that traverses all of the horizontal or vertical axes.

A vector (x1, x2, x3, ..., xn) is plotted by marking x1 on axis one, x2 on axis two, and so on through xn on axis n. A broken line joins these points. The figure shows two points, one drawn solid and one broken, plotted in parallel coordinate representation, that agree in the fourth coordinate. The principal advantage of this plotting is that each vector (x1, x2, x3, ..., xn) is represented in a planar diagram, so each vector component has essentially the same representation.

It is important to note that parallel coordinates are really a generalization of the simple bar graph or one-to-one plots. Parallel coordinates represent a collection of data points as y-axis coordinate values arrayed along the x-axis. The name parallel coordinates derives from the fact that a specific point in n-dimensional Euclidean space can be represented by n y-axis values arrayed along the x-axis.[5]

To formalize the mathematical representation, two additional representational features are needed. First, the points arrayed along the x-axis have a fixed spacing, so that vertical lines through them form a collection of lines all parallel to the y-axis. Second, the values plotted for the points arrayed along the x-axis are connected by straight lines.

Dual property:

The parallel coordinate representation enjoys some elegant duality properties with the usual Cartesian orthogonal coordinate representation. A line in the Cartesian coordinate plane is given by y = mx + b, and (a, ma + b) and (c, mc + b) are two points lying on that line.

           

The xy Cartesian axes are mapped into the parallel axes as shown above. Superimpose Cartesian coordinate axes tu on the xy parallel axes so that the y parallel axis has the equation u = 1. The point (a, ma + b) in the xy Cartesian system maps into the line joining (a, 0) to (ma + b, 1) in the tu coordinate axes. Similarly, (c, mc + b) maps into the line joining (c, 0) to (mc + b, 1). It is a straightforward computation to show that these two lines intersect at a point in the tu plane given by $\bar{L} = \left( b(1-m)^{-1},\ (1-m)^{-1} \right)$, provided $m \neq 1$. This point in the parallel plot depends only on m and b, the parameters of the original line in the Cartesian plot. Thus $\bar{L}$ is the dual of L, which gives the interesting duality that points in Cartesian coordinates map into lines in parallel coordinates, and lines in Cartesian coordinates map into points in parallel coordinates.
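The point-line duality can be checked numerically. The sketch below maps several points of one Cartesian line y = mx + b into the tu plane and confirms that their image lines all pass through the point (b/(1-m), 1/(1-m)). The function names `dual_point` and `image_line` and the chosen values of m, b and a are assumptions for illustration only.

```python
def dual_point(m, b):
    """Dual of the Cartesian line y = m*x + b in the tu plane (requires m != 1)."""
    if m == 1:
        raise ValueError("a line with slope 1 has no finite dual point")
    return (b / (1 - m), 1 / (1 - m))


def image_line(x, y):
    """Image of the Cartesian point (x, y): the tu line through (x, 0) and (y, 1).

    Returned as a function t(u), since the x axis sits at u = 0 and the
    y axis at u = 1.
    """
    return lambda u: x + u * (y - x)


m, b = -2.0, 3.0
t_star, u_star = dual_point(m, b)

# Evaluate the image line of several points (a, m*a + b) at u = u_star.
# Every one of them should give t = t_star.
ts_at_dual = [image_line(a, m * a + b)(u_star) for a in (-1.0, 0.0, 2.5)]
```

For m = -2 and b = 3 the dual point is (1, 1/3); it lies between the two parallel axes, as expected for a line with negative slope.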

Parallel coordinates provide a means of visualizing higher-order geometry in an easily recognizable two-dimensional representation. As a tool for the exploratory data analysis of multidimensional systems, parallel coordinates can display the patterns, trends and correlations in the data while additionally revealing hyper-dimensional geometry.

Construction of parallel coordinates:

The construction of the parallel coordinate system is fairly simple. A single horizontal line is drawn, and a series of vertical axes, each representing a separate variable, is placed at equal distances along the line. The number of vertical axes equals the number of variables: if there are n variables, n vertical axes are drawn, and the spacing between adjacent axes is typically of unit length. The values of a given variable are represented on the vertical axis pertaining to that variable. Line segments between successive vertical axes connect the values on each of the n axes that correspond to an individual point in Euclidean space. The result is a graph of line segments connected between axes to form polygonal lines across the entire representation. Each polygonal line of (n-1) segments represents a distinct point in n-dimensional space. The collection of polygonal lines shares some important dualities with the Cartesian coordinate system, and these dualities provide a method of interpreting the parallel coordinate system through a transition into Cartesian coordinates and Euclidean space.[2], [3], [4], [5]

Advantages:

  1. Parallel coordinates are easy to implement and easy to comprehend.
  2. Multi dimensional data can be viewed with a simple two-dimensional graph.
  3. Potential masking of one data point by another in three-dimensional space is eliminated.
  4. Correlation patterns in multidimensional data can be visualized if present.
  5. The number of dimensions that can be visualized is only restricted by the horizontal resolution of the screen.

 

 

Figure 6 A five dimensional parallel coordinate plot

 

2.2. Chernoff faces.

In this method each facial feature denotes a particular variable. Because the data is related to facial features, Chernoff faces illustrate trends in multidimensional data very effectively: faces are something people are used to telling apart.[7]

For example, X1 can be associated with the size of the mouth, X2 with the size of the nose, X3 with the size of the eyes and so on.

Chernoff faces are represented as shown above for a given data set. They are used for condensation of data; this avoids viewing large tables of data and hence helps in data digestion. To make use of Chernoff faces, the values that the features represent must be clearly shown in addition to the plot list. The plot itself does not contain any information on the actual data values plotted, which is a limitation of Chernoff faces. Chernoff faces can be used where knowledge of the trends in the data determines which sections of the data are of particular interest.

Figure 7 Chernoff Faces

 

Advantages of Chernoff faces:

  1. Trends in data are easily identified.
  2. Animated Chernoff faces can be used to show multiple relationships, which is more effective.

Disadvantages of Chernoff faces:

  1. Subjective assignment of facial expressions causes an error rate as high as 25% when classifying faces into groups.
  2. The actual values are not shown with the plot.

 

2.3. Independence diagrams:

In data mining, the recognition of complex dependencies between attributes is a major issue. Earlier, correlation coefficients, scatter plots and equi-width histograms were used to identify such attribute dependencies. But these techniques are sensitive to outliers and are often not sufficiently informative to identify the kind of attribute dependence present. Independence diagrams were proposed to overcome these problems.[8]

In independence diagrams each attribute is divided into ranges, and for each pair of attributes the combination of these ranges defines a two-dimensional grid. For each cell of this grid, the number of data items in it is stored. The grid is displayed by scaling each attribute axis so that the displayed width of a range is proportional to the total number of data items within that range. The brightness of a cell is proportional to the density of data items in it.

As a result both attributes are independently normalized by frequency ensuring insensitivity to outliers, skew and allowing specific focus on attribute dependencies. Independence diagrams provide quantitative measures of the interaction between two attributes, and allow formal reasoning about issues such as statistical significance.

Independence diagrams enable the visual analysis of the dependence between any two attributes of a data set. The technique is not affected by data skew and outliers, and it does not require any transformation of the data to be specified by an expert. It can focus purely on the dependence between two attributes, stripping away effects due to the respective univariate distributions. The basic idea is to divide the attributes independently into slices (rows or columns) such that each slice contains roughly the same number of data items, and additionally to split slices that have a large extension. This can be seen as a combination of an equi-depth and an equi-width histogram. Each intersection of a row and a column defines a two-dimensional bucket, for which a count is stored. This kind of two-dimensional histogram is called an equi-slice histogram. The equi-slice histogram is mapped to the screen in such a way that the width of a slice is proportional to the number of data items in the slice. Finally, the brightness of a bucket is proportional to the count of the bucket divided by its area.

This kind of visualization is amenable to interpreting dimension dependence effects, because the one- dimensional distributions have already been “normalized" by depicting the slices in an equi-depth fashion. Equally populated slices occupy equal space in the image, meaning that resolution is spent in proportion to the data population, and that the image is not sensitive to outliers.

Given d attributes there are $(d^2 - d)/2$ images, one for each pair of attributes. These images could be shown together as thumbnails, lining up the images along the same attributes.

There are three steps to generate an independence diagram:

  1. Determine the boundaries of the slices used in each dimension, count the number of data items in each grid cell and determine its size.
  2. Scale each dimension and obtain a mapping from buckets to pixels.
  3. Determine the brightness for the pixels in each grid cell.
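The three steps can be sketched for a single pair of attributes as follows. This is a simplified illustration, not the method of [8] in full: it uses equi-depth slices only and omits the additional splitting of slices with a large extension, and it stops at a brightness value per bucket instead of mapping buckets to pixels. The function names `equi_depth_slices` and `independence_grid` and the sample data are assumptions.

```python
def equi_depth_slices(values, n_slices):
    """Assign each value a slice index so that slices hold roughly equal counts.

    Simplification: ties are broken by sort order, so equal values may land
    in neighbouring slices.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    slice_of = [0] * len(values)
    for rank, i in enumerate(order):
        slice_of[i] = rank * n_slices // len(values)
    return slice_of


def independence_grid(xs, ys, n_slices):
    """Return (counts, brightness) for the two attributes xs and ys."""
    sx = equi_depth_slices(xs, n_slices)
    sy = equi_depth_slices(ys, n_slices)

    # Step 1: count the data items in each grid cell (bucket).
    counts = [[0] * n_slices for _ in range(n_slices)]
    for i, j in zip(sx, sy):
        counts[i][j] += 1

    # Step 2: displayed extent of a slice is proportional to its item count.
    width = [sx.count(i) for i in range(n_slices)]
    height = [sy.count(j) for j in range(n_slices)]

    # Step 3: brightness = bucket count divided by bucket area.
    brightness = [
        [
            counts[i][j] / (width[i] * height[j]) if width[i] and height[j] else 0.0
            for j in range(n_slices)
        ]
        for i in range(n_slices)
    ]
    return counts, brightness


# Made-up, perfectly dependent attributes: all mass falls on the diagonal.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 16]
counts, brightness = independence_grid(xs, ys, 2)
```

For independent attributes the buckets would be roughly equally bright; here the off-diagonal buckets are empty, which is the visual signature of dependence.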

 

Figure 8

 

3. Visual Data Mining in Material Science. 

             This phase deals with the implementation of the techniques and functions of data mining for material classification, which involves different stages that are discussed below.

 3.1. System architecture

 The system architecture consists of five stages:

·        Data collection

·        Data modeling

·        Data transformation

·        Data visualization

·        Data analysis

Data visualization is done using the parallel coordinate system (for multivariate visualization) and the histogram (for univariate visualization), whereas data analysis is done using clustering and k-nearest neighbor techniques.

  Figure 9 System Architecture

 

3.1.1. Data collection

The first task is the collection of the data. Aluminum alloys were chosen for this purpose, and data describing the alloys and their properties was considered. The 1000, 2000, 3000, 4000 and 5000 series of aluminum were taken from matweb.com. An example of the data, in the exact form in which it is available on the web and in which it was collected, is shown below:

Table.2

Aluminum 1060-H12

 

Subcategory: Aluminum Alloy; Nonferrous Metal; 1000 Series Aluminum

Close Analogs: Four Other Tempers

Key Words: Aluminium 1060-H12; UNS A91060; AA1060-H12;

Composition:

 

                                                                                                                                   [11]

 

3.1.2. Data modeling

After the data has been collected it has to be modeled; that is, the data must be cleaned. This is an important stage in knowledge discovery. Aluminum alloys contain a number of other materials. These materials are identified, their status is determined from the available data, and they are entered in a table as shown below. This constitutes the data modeling stage. There are two main kinds of variables, predictor and descriptor variables. The properties of the material form the descriptor variables, and the materials themselves form the predictors. Not all properties are considered; only three are taken into account. The table below gives the complete details of the data collected.

  Table.3

 

 3.1.3.  Data transformation.

The data that has been modeled and cleaned is then transferred into a database that can be used for development of the software. The cleaned data, with all the compositions and property values, is put into the database, and a database of the materials is thus generated. This is the data transformation stage. An example of the materials database is shown below:

   Table.4

 

 

3.1.3.1.  Statistics

The statistics for all the materials are tabulated. The table gives the computed maximum and minimum values of the variables.

  Table.5

           

3.1.4.      Visualization.

3.1.4.1. Parallel coordinates.

The multivariate visualization is achieved by a parallel coordinate representation. The plot consists of vertical planes, one per variable, drawn parallel to each other. The planes are equally spaced on the screen at a distance

X = (screen width) / n,    where n = number of variables + 1.

The vertical coordinates of the planes are

Starts at    Ystart = (screen height) / 10

Ends at      Yend = 8 * (screen height) / 10

Each plane represents a variable. The Ystart point represents the maximum value and the Yend point the minimum value of that variable in the training dataset. Other values are mapped onto the line using a scaling factor:

Scaling factor = (Yend - Ystart) / (Range)

where Range is the difference between the maximum and minimum values of the variable.

One record is retrieved at a time from the dataset, the values of all its variables are mapped using the scaling factor, and the resulting points are joined with straight lines. All the records are mapped in the same way to obtain a raw plot, as shown in figure 10.
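The mapping of one record to screen coordinates can be sketched as below. The plane spacing, Ystart, Yend and the scaling factor follow the formulas given above; the exact form y = Ystart + (max - value) * scale, the handling of a variable with zero range, and the function name `record_to_polyline` are assumptions made for this illustration, since the prototype's source is not shown.

```python
def record_to_polyline(record, minima, maxima, screen_width, screen_height):
    """Map one record to the list of (x, y) screen points of its polyline."""
    n_vars = len(record)
    spacing = screen_width / (n_vars + 1)      # X = screen width / n
    y_start = screen_height / 10               # maximum value is drawn here
    y_end = 8 * screen_height / 10             # minimum value is drawn here

    points = []
    for i, (value, lo, hi) in enumerate(zip(record, minima, maxima), start=1):
        value_range = hi - lo
        if value_range == 0:
            # Assumption: a constant variable is drawn at mid-height.
            y = (y_start + y_end) / 2
        else:
            scale = (y_end - y_start) / value_range
            y = y_start + (hi - value) * scale
        points.append((i * spacing, y))
    return points


# Made-up record and ranges for illustration only.
points = record_to_polyline(
    record=[0.0, 50.0, 10.0],
    minima=[0.0, 0.0, 0.0],
    maxima=[1.0, 100.0, 10.0],
    screen_width=800,
    screen_height=600,
)
```

Joining consecutive points with straight lines gives the polyline for that record; repeating this for every record yields the raw plot.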

To generate hypotheses and to answer the user's queries, statistical analysis is necessary. The mean and mode are calculated with the following formulas.

Calculation of mean:

$\bar{X} = \dfrac{\text{sum of all values}}{\text{number of occurrences}}$

Calculation of mode:

$M = L_m + \dfrac{F_m - F_p}{2F_m - F_p - F_s} \cdot W$

where $L_m$ = lower limit of the modal class

$F_m$ = frequency of the modal class

$F_p$ = frequency of the class preceding the modal class

$F_s$ = frequency of the class succeeding the modal class

$W$ = width of the class

 3.1.5. Data analysis.

 3.1.5.1. Clustering.

Clustering is used for segmenting the dataset into groups of similar records. Clusters can be observed in the raw plot for every attribute, and this technique shows the relationship between the clusters of one attribute and the other attributes. The number of clusters k, as seen in the raw plot for any attribute, is entered. The first k records of the dataset are chosen as arbitrary initial cluster centers:

$C_1 = [C_{11}, C_{12}, C_{13}, \ldots, C_{1n}],\ \ldots,\ C_k = [C_{k1}, C_{k2}, \ldots, C_{kn}]$

The distance of every record $[X_{i1}, X_{i2}, \ldots, X_{in}]$ from each cluster center $C_j$ is calculated using the distance formula

$d_{ij} = |C_{j1} - X_{i1}| + |C_{j2} - X_{i2}| + \cdots + |C_{jn} - X_{in}|$

A record is assigned to the cluster whose center is at the minimum distance from it. New cluster centers are then calculated from the assigned clusters: the average of the records assigned to a cluster forms its new center. The distances of all the records from the new cluster centers are calculated and the records are assigned to clusters again. If the new assignment matches the previous one, the clusters are accepted and visualized. Otherwise, the process is repeated, recalculating cluster centers, distances and assignments, until the assignment matches the previous one.

Finally the clusters thus formed are visualized with different colors for different clusters (As shown in figure 14). This helps in ease of visualization and verification for further analysis.
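The clustering procedure described above can be sketched as follows: the first k records serve as initial centers, distances are sums of absolute differences, centers are updated to the average of their assigned records, and iteration stops when the assignment no longer changes. The function names, the `max_iter` safety limit and the rule of keeping the old center when a cluster becomes empty are assumptions added for this illustration, and the sample records are made up.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster(records, k, max_iter=100):
    """Return (assignment, centers) for the iterative procedure in the text."""
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    centers = [list(r) for r in records[:k]]  # first k records as centers
    assignment = None
    for _ in range(max_iter):
        # Assign every record to the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: manhattan(rec, centers[c]))
            for rec in records
        ]
        if new_assignment == assignment:
            break  # clusters match the previous ones: accept them
        assignment = new_assignment
        # Recompute each center as the average of its assigned records.
        for c in range(k):
            members = [rec for rec, a in zip(records, assignment) if a == c]
            if members:  # assumption: keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Made-up two-attribute records forming two obvious groups.
records = [[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [8.8, 9.2], [0.9, 1.1]]
labels, centers = cluster(records, k=2)
```

The returned labels can be used directly to pick a color per record when drawing the parallel coordinate plot.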

 3.1.5.2. k-Nearest Neighbor.

In k-NN, the predictors are used to find the descriptors. The predictors here are the composition of the alloy, and the descriptors are the mechanical properties, namely hardness, tensile strength and shear strength.

The known composition of the material (predictors) $[Q_1, Q_2, Q_3, \ldots, Q_n]$ and the number of closest cases required (k) are entered. The distance of this composition from each record $[X_{i1}, X_{i2}, \ldots, X_{in}]$ of the training dataset is calculated:

$D_i = |X_{i1} - Q_1| + |X_{i2} - Q_2| + \cdots + |X_{in} - Q_n|$

where i = index of the record, running from 1 to the number of records

n = number of attributes

The descriptors of the k nearest records are listed in ascending order of distance.
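A minimal sketch of this lookup is shown below: the entered composition is compared with every training record using the sum of absolute differences, and the descriptors of the k closest records are returned in ascending order of distance. The function names, the (composition, descriptors) record layout and all numeric values are hypothetical; they are not data from the alloy database.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two compositions."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_records(query, training, k):
    """Return the k nearest (distance, descriptors) pairs, closest first.

    training is a list of (composition, descriptors) pairs.
    """
    ranked = sorted(
        ((manhattan(query, composition), descriptors)
         for composition, descriptors in training),
        key=lambda pair: pair[0],
    )
    return ranked[:k]


# Made-up training set: (composition, (hardness, tensile, shear)).
training = [
    ([0.20, 0.10], (30, 100, 60)),
    ([0.50, 0.10], (40, 150, 90)),
    ([0.25, 0.15], (32, 110, 65)),
]
result = nearest_records([0.20, 0.10], training, k=2)
```

The prototype lists the descriptors of the neighbors rather than combining them, so the sketch does the same and leaves any averaging or voting to the user.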


 

4. Results.

 The process of obtaining the results in this project is divided into three stages:

  1. Generation of a hypothesis using the visual information available (histogram, a univariate visualization tool).
  2. Testing of the generated hypothesis on the experimental data collected (this is accomplished by querying).
  3. Verification of the hypothesis, for acceptance or rejection, by analyzing the parallel coordinate plot obtained in response to the input query.

 Among the results obtained, a few that are interesting and depict novel properties are explained below.

 

Result 1.

Ho: Increase in Bismuth concentration increases Tensile strength.

H1: Increase in Bismuth concentration has no effect on Tensile strength.

In result 1 the null hypothesis Ho states that an increase in Bismuth concentration increases tensile strength, and the alternate hypothesis H1 states that an increase in Bismuth has no effect on tensile strength. A univariate visualization tool, the histogram (figure 11a), was used to generate this hypothesis. When the hypothesis was tested on the experimental data, two plots were obtained, shown in figure 11b and figure 11c.

In figure 11b it can be seen that when the Bismuth concentration ranges from 0.2 to 0.3%, the tensile strength ranges from 262 to 270 MPa. In figure 11c, it can be seen that when the Bismuth concentration ranges from 0.4 to 0.6%, the tensile strength ranges from 276 to 310 MPa. This is a rather simple result showing a one-to-one relation between the concentration of Bismuth and the tensile strength, involving only two variables:

  1. Concentration of Bismuth, a predictor variable.
  2. Tensile strength, a descriptor variable.

On verifying the plots, the null hypothesis is accepted.

 

Result 2.

Ho: Increase in concentrations of Lithium and Zirconium increases values of mechanical properties like hardness, tensile strength, and shear strength.

H1: Increase in concentration of Lithium and Zirconium has no effect on hardness, tensile strength, and shear strength.

             In this result the null hypothesis states increase in concentrations of Lithium and Zirconium increases values of mechanical properties and alternate hypothesis states that increase in concentration of Lithium and Zirconium have no effect on the mechanical properties. For generation of the above stated hypothesis a univariate visualization tool, histograms (figure12a and 12b) is used.

This hypothesis involves 5 variables of which 2 are predictor variables and 3 descriptor variables. Testing this hypothesis and verifying the plot obtained can appreciate the real powers of parallel coordinate system.

When this hypothesis was tested on the experimental data, the corresponding plot obtained is shown in figure12c. From the plot it can be seen that when the concentration of Lithium and Zirconium are between (2 to 2.2%) and (0.08 to 0.1%) respectively, the hardness value ranges from 57 to 86 BHN, the tensile strength ranges from 190 to 210MPa and the shear strength ranges from 130 to 190MPa respectively. Whereas when the concentration of Lithium and zirconium are between (2.4 to 2.6%) and (0.13 to 0.16%) respectively, the hardness ranges from 140 to 150 BHN, the tensile strength ranges from 470 to 520MPa and the shear strength being a maximum of 320MPa.

For a small increase in the concentrations of Lithium and Zirconium, an increase of more than 100% in the properties is observed, and it follows that the null hypothesis is supported.
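The size of the increase can be checked with a short calculation on the ranges reported above. Comparing range midpoints is an assumption made here, since the text does not say which points of the ranges are compared. For shear strength only the maximum (320 MPa) is reported for the higher concentrations, so that single value is used.

```python
def midpoint(low, high):
    """Centre of a reported range."""
    return (low + high) / 2

def percent_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

# Ranges reported for the lower and higher Li/Zr concentrations.
low_window = {"hardness_BHN": (57, 86), "tensile_MPa": (190, 210), "shear_MPa": (130, 190)}
# Only the maximum shear strength is reported for the higher window.
high_window = {"hardness_BHN": (140, 150), "tensile_MPa": (470, 520), "shear_MPa": (320, 320)}

for name in low_window:
    before = midpoint(*low_window[name])
    after = midpoint(*high_window[name])
    print(f"{name}: {percent_increase(before, after):.1f}%")
# hardness_BHN: 102.8%
# tensile_MPa: 147.5%
# shear_MPa: 100.0%
```

On this assumption the hardness and tensile strength rise by more than 100% and the shear strength by about 100%. Comparing other points of the ranges would give different figures.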

The novel properties observed in this result are very useful in aeronautics, space research and other fields that demand light materials with high tensile strength and high shear strength.

 

Result 3.

Ho: Increase in Vanadium concentration improves tensile strength, hardness and shear strength.

H1: Increase in Vanadium concentration has no effect on tensile strength, hardness and shear strength.

             In result 3, the null hypothesis states that an increase in Vanadium concentration improves tensile strength, hardness and shear strength, and the alternate hypothesis states that such an increase has no effect on these properties. A univariate visualization tool, the histogram (figure 13a), was used to generate this hypothesis.

When this hypothesis was tested on the experimental data, two plots were obtained, as shown in figure 13b and figure 13c. From the parallel coordinate plot (figure 13b) it can be seen that when the concentration of Vanadium ranges from 0.01 to 0.03%, the hardness ranges from 17 to 50 BHN, the tensile strength ranges from 30 to 165 MPa and the shear strength ranges from 26 to 105 MPa. From figure 13c it can be seen that when the concentration of Vanadium is increased from 0.03 to 0.05%, no significant change in the values of the properties is observed. By this verification the null hypothesis is rejected, which means that Vanadium hardly affects the properties.

 

 

 

Figure 10 Parallel coordinates plot showing all records.

Figure 11a Histogram for Bismuth concentration

Figure 11b Plot for increasing Bi concentration from 0.2 to 0.3 %

Figure 11c Plot for increasing Bi concentration from 0.4 to 0.6 %

Figure 12a Histogram for Lithium

Figure 12b Histogram for Zirconium

Figure 12c Plot for increasing Li and Zr concentration.

Figure 13a Histogram for Vanadium

Figure 13b Plot for increasing V concentration from 0.04 to 0.05 %

Figure 13c Plot for increasing V concentration from 0.01 to 0.03 %

Figure 14 Parallel plot showing different clusters in Shear Strength

                                                             

 

5. Conclusions. 

             In this project a software prototype has been developed for visual and analytic data mining. The prototype uses the parallel coordinate system for multivariate data visualization. The k-NN technique is used to classify a specified Al alloy composition into the class it belongs to, and also to predict the properties of that composition. A clustering technique is used to group the different compositions and properties of the alloys. Together, the k-NN and clustering techniques constitute the analytic data mining.
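The k-NN step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's implementation: the training rows, the class labels "low" and "high", Euclidean distance on (Li %, Zr %), majority voting and neighbour averaging are all choices made here for the example.

```python
import math
from collections import Counter

# Hypothetical training rows: composition is (Li %, Zr %). Illustrative only.
train = [
    {"composition": (2.0, 0.08), "label": "low", "tensile_MPa": 190},
    {"composition": (2.1, 0.09), "label": "low", "tensile_MPa": 200},
    {"composition": (2.2, 0.10), "label": "low", "tensile_MPa": 210},
    {"composition": (2.4, 0.13), "label": "high", "tensile_MPa": 470},
    {"composition": (2.5, 0.15), "label": "high", "tensile_MPa": 500},
    {"composition": (2.6, 0.16), "label": "high", "tensile_MPa": 520},
]

def nearest(rows, query, k=3):
    """Return the k rows closest to `query` in Euclidean distance."""
    return sorted(rows, key=lambda r: math.dist(r["composition"], query))[:k]

def classify(rows, query, k=3):
    """Assign the majority class among the k nearest neighbours."""
    votes = Counter(r["label"] for r in nearest(rows, query, k))
    return votes.most_common(1)[0][0]

def predict_property(rows, query, prop, k=3):
    """Estimate a property as the mean over the k nearest neighbours."""
    neighbours = nearest(rows, query, k)
    return sum(r[prop] for r in neighbours) / len(neighbours)

query = (2.5, 0.14)  # hypothetical alloy composition to classify
print(classify(train, query))
print(predict_property(train, query, "tensile_MPa"))
```

With real compositions the variables would normally be scaled to comparable ranges first, so that an element present in larger amounts does not dominate the distance.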

            The developed prototype has been tested on standard data collected from an international materials research laboratory. Some novel properties, which are interesting but not explicitly available in the data, were discovered. These are explained in the Results.

 

 6. Scope for further work.

The following points outline the road ahead for further improvement of this project

 

