
Monday, April 1, 2019

CRISP methodology

Well, we got two data sets to analyze using SPSS PASW: 1) the Wine Quality data set and 2) the Poker Hand data set. We can do this using the CRISP methodology. Let us look at what CRISP is, according to Wikipedia: CRISP-DM stands for Cross Industry Standard Process for Data Mining. It is a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

PASW Modeler is a data mining workbench that enables you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the industry-standard CRISP-DM model, IBM SPSS PASW Modeler supports the entire data mining process, from data to better business results.

CRISP-DM, the methodology Clementine follows, has six stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

Business Understanding: Understanding the project requirements and objectives from a business perspective, and then converting this knowledge into a data mining problem definition.

Data Understanding: In this step the following activities take place: collecting initial data, then describing the data, exploring the data and lastly verifying data quality.

Data Preparation: Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools. The data is cleaned using appropriate cleaning and cleansing strategies and then integrated into a single point.

Modeling: Various modeling techniques are selected and applied in this phase, and their parameters are adjusted to optimal values. Typically, there is more than one technique for the same data mining problem type.
Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed. The steps consist of generating a test design, building the models and assessing the model.

Evaluation: By this phase a model (or models) has been built. Before proceeding to final deployment of the model, it is important to evaluate the model more thoroughly and review the steps executed to construct it.

Deployment: In the final stage the knowledge gained is organized and presented so that an end user can easily use it. Depending on the requirements this can be a report or a complex data mining process. Normally customers carry out the deployment step.

Wine quality data set

Wine quality is modeled under classification and regression approaches, which preserve the order of the grades. Explanatory knowledge is given in terms of a sensitivity analysis, which measures the response changes when a given input variable is varied through its domain. The red wine data set contains 1600 samples, out of which I selected 200 random samples for the analysis. (Data mining cannot discover patterns that may be present in the larger body of data if those patterns are not present in the sample being mined.) I selected the sample with this in mind, and the sample I selected has high confidence. It has measurements of 13 chemical constituents (e.g. alcohol, Mg), and the goal is to find the quality of red and white wine.

Input variables:
1 fixed acidity
2 volatile acidity
3 citric acid
4 residual sugar
5 chlorides
6 free sulfur dioxide
7 total sulfur dioxide
8 density
9 pH
10 sulphates
11 alcohol

The output variable is quality (a score between 0 and 10).

The CRISP methodology was followed throughout. By visiting the web site and other resources I learned about the wine domain. The next step was to check for incorrect, missing or abnormal values in the data set and so ensure the data quality. The data quality of the data set is very good.

PASW data stream: classification of red and white wines

The two data sets, red wine and white wine, were imported using Variable File nodes. The Type node is used here to describe the characteristics of the data. The Classification and Regression (CR) Tree node is a tree-based classification and prediction method. Similar to C5.0, this method uses recursive partitioning to split the training records into segments with similar output field values. The CR Tree node starts by examining the input fields to find the best split, measured by the reduction in an impurity index that results from the split. The split defines two subgroups, each of which is subsequently split into two more subgroups, and so on, until one of the stopping criteria is triggered. All splits are binary (only two subgroups).

Red wine variable importance. White wine variable importance.

From the variable importance diagrams we can tell that the most important attribute for determining red wine quality is pH. The variable importance is in the order pH, citric acid, chloride, as shown in figure 1. For determining white wine quality, however, the most contributing attribute is chloride and the second is alcohol.

Analysis and conclusion

The generated tree above consists of nodes and their children.
The top node represents the total number of wine samples and how many belong to each category (1 to 9). The first split is on chloride, at a level of 0.041, which implies that chloride level divides the wine samples most strongly. Good quality wine is tied to the chloride level, and the count vs. quality graph shows how many samples belong to the good quality categories. The alcohol concentration of the white wine samples is higher than that of the red wine samples. Good wines normally have a high alcohol concentration, so we can conclude that the white wine samples are good. In white wine the chloride level is normally high, which suggests a good aroma, whereas in red wine the citric acid level lies between certain levels, which suggests the red wine is very tasty.

PASW has a number of 2-D and 3-D charts such as bar, pie, histogram and scatter. For the time being I am using a line graph and a 3-D scatter graph. You can use any of the graphs as per the requirements; some graphs are easier to interpret than others.

Let us plot a 2-D graph between the most contributing variable, pH, and quality. From the graph it is clear that quality is good if pH is between 3.23 and 3.27, and very low at 3.38 and 3.50. We can plot a similar graph between quality and citric acid, or any other contributing variable, and then find the relationship between them.

Let us plot a graph between chloride and quality for the white wine. The figure below shows that quality is very good when the chloride level is below 0.036, and that quality is in the range 5 to 6 when the chloride level is above 0.048.
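
The same reading of the chloride graph can be approximated in code by grouping samples into chloride bands and averaging quality per band. This is only a sketch: the band limits 0.036 and 0.048 come from the text above, but the function names and the toy rows are invented for illustration and are not taken from the wine data set.

```python
def chloride_band(chloride, low=0.036, high=0.048):
    """Assign a chloride reading to one of three bands (limits from the text)."""
    if chloride < low:
        return "below 0.036"
    if chloride > high:
        return "above 0.048"
    return "0.036 to 0.048"


def mean_quality_by_band(rows):
    """Average the quality score of (chloride, quality) rows within each band."""
    totals = {}
    for chloride, quality in rows:
        band = chloride_band(chloride)
        total, count = totals.get(band, (0, 0))
        totals[band] = (total + quality, count + 1)
    return {band: total / count for band, (total, count) in totals.items()}


# Hypothetical (chloride, quality) pairs, NOT real wine measurements.
toy_rows = [(0.030, 8), (0.034, 7), (0.040, 6), (0.050, 5), (0.055, 6)]
band_means = mean_quality_by_band(toy_rows)
print(band_means)
```

A table like this is easier to compare across variables than a set of separate plots, although it hides the shape of the relationship inside each band.
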
Similarly, if we plot a graph between quality and alcohol, we see that quality is very good when the alcohol concentration is between 12.5 and 13 (as per the sample I have analyzed).

3-D graph showing the relationship between alcohol, quality and chloride level of white wine

The 2-D analysis showed how quality is affected by a single variable. If one variable does not explain how quality is affected, we can check the relationship between three variables using a 3-D graph, which has three axes.

How regression is useful

In this multiple regression, predictors such as (Constant), alcohol, fixed acidity, residual sugar, chlorides, volatile acidity, free sulfur dioxide, sulphates, pH, total sulfur dioxide, citric acid and density predict the value of quality. A PASW stream for regression is given below. By changing the values of the independent variables we can get the value of the dependent variable, quality. With the help of a hypothesis we need to understand and build a relationship among the variables. To predict the mean quality value for a given independent variable (say volatile acidity) we need a line which passes through the mean values of both quality and volatile acidity and which minimizes the sum of squared distances between each of the points and the predicted line. This gives the fitted line.

The Poker Hand Data Set

Each record is an example of a hand consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes (suit and rank), for a total of 10 predictive attributes. There is one Class attribute that describes the poker hand. The order of cards is important, which is why there are 480 possible Royal Flush hands. Below I discuss how to determine poker hands using data mining. I am considering classification only.
Clustering or regression would not make sense here.

PASW model: classification using the CRT algorithm

We have a training and a testing data set, and the model is first applied to the training data set. The source file is a comma-separated (CSV) file with 1 million rows. It is difficult to analyze the whole input data set, so I selected a sample data set for the analysis.

Problem faced

The given source data was not in a meaningful format, so I gave the attributes meaningful names and values using the VLOOKUP function in MS Excel. Now the data is more meaningful and looks like the table below. Data cleansing is very important and comes under the data preparation phase of the methodology.

Accuracy of the predictive model

The accuracy of the predictive model is checked by the Analysis node. The accuracy was found to be 90%. The algorithm needs to predict one of these classes:
0 Nothing in hand
1 One pair
2 Two pairs
3 Three of a kind
4 Straight
5 Flush
6 Full house
7 Four of a kind
8 Straight flush
9 Royal flush

Let me say what I understood from the diagram. Rank2 (the rank of card 2) is the most contributing variable for predicting poker hands. It is clear that the ranks of the 1st, 4th and 2nd cards contribute more than the suits of those cards. The different sections of the pie chart represent the number of hands in a particular poker category: blue represents no poker hand, red represents one pair, and green represents royal flush.

How PASW helps to do classification

PASW has a number of tree-constructing algorithms (CR, C5.0) for classification. I selected Classification and Regression (CR), although it is not a time-efficient algorithm (its time complexity is higher than that of C5.0). The data set I have is a simple one and I am not attempting a deep analysis. All I need to do is predict poker hands, so CR can do it.
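
To make the idea of a CR tree choosing a binary split by "reduction in an impurity index" more concrete, here is a minimal sketch. It assumes Gini impurity as the index (the post does not say which index PASW was configured with), and the function names and toy records are hypothetical. It is not PASW's implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, impurity_reduction) for the best split 'value <= threshold'."""
    parent = gini(labels)
    n = len(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy records: one numeric input field and a class label.
toy_values = [1, 2, 3, 10, 11, 12]
toy_labels = ["no pair", "no pair", "no pair", "one pair", "one pair", "one pair"]
threshold, gain = best_binary_split(toy_values, toy_labels)
print(threshold, gain)
```

A full tree repeats this search over every input field, takes the field with the largest reduction, and then recurses on the two subgroups until a stopping criterion is met.
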
The tree constructed using CR is shown below (a short interpretation of the tree is given above).

Analysis

The data has been divided into a training set and a testing set. Most of the data goes into the training set and a small portion is used for testing. After a model has been built using the training set, we can test it by making predictions against the test set. Since the data in the test set already contains known values for the attribute you want to predict, the predictions can be checked against them. The portion of the training set being used is given below.

Abstract

Nowadays, high-power computing and information technology make it possible to collect, store and process complex marketing data. Data mining is used to derive knowledge from this marketing data. This report discusses the data mining process, briefly covers different mining techniques such as classification trees, neural networks and regression, and describes their application in the marketing domain. It also covers the different types of analyses and tasks being used.

Introduction

From the given themes I selected the topic "Data mining and knowledge discovery for marketing", since my interest is business and computing. I would always like to do research in business analytics.

Let us look at what data mining is. Data mining is the process of discovering interesting, meaningful and actionable patterns hidden in large amounts of data. It is one of the tools used to transform data into information. It is widely used in almost all fields of science and in business profiling practices such as marketing, fraud detection and scientific discovery. The techniques used to uncover patterns can also be applied to sample data, so the sample should be a good representative of the larger data set. Data mining cannot find a pattern that is present in the larger body of data but absent from the small subset being mined.
So this is very useful when sufficiently representative data are collected. One well-known branch of data mining is knowledge discovery, or KDD. It derives knowledge from input data. The knowledge obtained from the process becomes additional data and can be used for further discovery in related fields; normally an analyst analyzes it and makes predictions. DM can generate thousands of patterns, but not all of these patterns are interesting and useful.

In this report I consider data mining in the marketing field. The data comes from different sources such as transactions, loyalty cards, discount coupons, customer complaint calls and public lifestyle studies. Using this data we can do target marketing, for example:
- recognise appropriate customer segments for new marketing initiatives
- determine customer purchasing patterns over time
- find associations and correlations between product sales and predict based on such associations (cross-market analysis)
- find what type of customer buys what type of product (customer profiling)
- predict the likelihood of customer churn and target those likely to leave with retention campaigns
- perform customer requirement analysis, such as identifying the best products for different groups of customers and predicting what factors will attract new customers
- provide summary information such as multidimensional summary reports and statistical summary information (data central tendency and variation)

Another question is: why can we not use traditional data analysis instead of data mining? The answer is that a field like marketing has a tremendous amount of data, with many dimensions and much complexity.

A marketing firm would like to segment its customers into homogeneous groups or clusters in order to better understand consumer behavior and market its products more effectively. In the past, small businesses had no trouble understanding their customers.
They knew what they had to do once a customer approached them. Today's business is more competitive, more customer oriented and more product oriented, so it is very difficult to understand customer behavior, wants and needs, and the hidden relationships between the data and preferences. With the help of data mining an analyst can deliver timely, personalized promotional offers. Normally, in a large data warehouse (DWH) data mining environment, data coming from various sources is integrated and put into the data warehouse. Data mining software such as Teradata intelligent miners is used to mine terabytes of data and make market predictions. As I mentioned, DM is a set of tools for developing predictive and descriptive models. Some are statistical methods such as regression. Others use non-statistical methods like neural networks and classification trees. Here I consider some important tools and their applications.

How classification trees are used in marketing data mining

Classification trees partition the data to maximize the differences in the dependent variable. A classification tree is also called a decision tree. The aim of a classification tree is to classify the data into distinct groups or branches that create the strongest separation in the values of the dependent variable. The tree can identify segments, which is helpful when a company is trying to understand what is driving market behavior. It also detects nonlinear relationships.

The tree grows through a series of steps and rules. Say, for example, that sales pieces were mailed to 100,000 names and yielded a response rate of 2.6%. The first split is on gender. This indicates that the greatest difference between responders and non-responders is gender. We see that males are much more responsive than females. If we stopped after one split, we would consider males the better target group. Our goal is to find groups within both genders that discriminate between responders and non-responders.
At the next level, the male and female groups are split one after another. The second-level split from the male node is on income, which implies that income level varies the most between responders and non-responders among the males. For females the greatest difference is among the age groups. It is very easy to identify the group with the highest response rate. Let us say that management decides to mail only to groups where the response rate is more than 3.5%. The offers would then be directed to males who make more than 30,000 a year and females over age 40.

Some typical classification tree algorithms are:
1) C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
2) CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.

Linear regression and its applicability in marketing

Knowledge of deviation from normal is very important for a marketer. In the past such deviations were very difficult to detect. Nowadays data mining tools give great flexibility to detect and classify these changes. Linear regression is a statistical technique that quantifies the relationship between a dependent variable and an independent variable, both of which are continuous. Consider the equation below, which shows the relationship between sales and advertising along the regression line. Our goal is to predict sales based on the amount spent on advertising. A plot of sales vs. advertising would be linear. A key measure of the strength of the relationship is the R-square. It measures the amount of overall variation in the data that is explained by the model. More than 70% of the variation in sales can be explained by variation in advertising. Sometimes the relationship between sales and advertising is nonlinear (it may be curvilinear). By using the square root of advertising we are able to find a better fit for the data. When building targeting models for marketing, risk and CRM, it is common to have many predictive variables.
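
The simple regression of sales on advertising, the R-square measure and the square-root transformation described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical and the toy figures are invented (they were constructed so that sales follow the square root of advertising exactly), so they say nothing about real marketing data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through both means
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_residual / ss_total


# Invented toy figures, NOT real data: sales = 10 + 2 * sqrt(advertising).
advertising = [1.0, 4.0, 9.0, 16.0, 25.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

# Fit on raw advertising spend.
raw_intercept, raw_slope = fit_line(advertising, sales)
r2_raw = r_squared(advertising, sales, raw_intercept, raw_slope)

# Fit on the square root of advertising spend.
sqrt_adv = [math.sqrt(a) for a in advertising]
sqrt_intercept, sqrt_slope = fit_line(sqrt_adv, sales)
r2_sqrt = r_squared(sqrt_adv, sales, sqrt_intercept, sqrt_slope)

print(round(r2_raw, 3), round(r2_sqrt, 3))
```

On these toy figures the raw fit already explains most of the variation, and the square-root fit explains all of it, which is the kind of improvement the transformation is meant to capture when the relationship is curvilinear.
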
Using multiple predictive (independent) continuous variables to predict a single continuous variable is called multiple linear regression. Targeting models created using linear regression are generally very robust. In marketing they can be used alone or in combination with other models.

Neural networks and their applicability in marketing

A neural network does not follow any statistical distribution. (Neural networks are a very vast topic, and a complete discussion is beyond the scope of this report.) It is modeled after the function of the human brain. The process is one of pattern recognition and error minimization. We can describe it as nodes that are arranged in layers. The figure shows a simple neural network with one hidden layer. The data is divided into training and testing sets before the process starts. Then a weight is assigned to each of the nodes in the first layer. During each iteration, the inputs are processed through the system and the output is compared to the actual value. The error is measured and fed back through the system to adjust the weights, so the weights get better at predicting the actual results. An error limit is defined, and the process finishes when the minimum error limit is reached.

One specific type of neural network commonly used in marketing uses sigmoidal functions to fit each node. This technique is very powerful in fitting a binary or two-level outcome such as response to an offer or default on a loan. Neural networks not only pick up linear relationships but also do a good job with nonlinear relationships in the data. This allows fitting data that cannot be fitted using regression. One disadvantage is that the result of a neural network is somewhat difficult to interpret.

A brief description of how clustering applies in data mining

Cluster analysis groups respondents with similar behaviors, preferences, or characteristics into segments.
By doing so we can understand important similarities and differences between the respondents. Analysts can use this information to develop targeted marketing strategies, or to provide subgroups for analysis. In market survey data, clustering enables market researchers to group respondents who provide similar responses on several questions. In clustering we use more than one variable, analyzing responses to several questions in order to find similar respondents. Clustering is based on the idea of creating groups based on their proximity to, or distance from, each other. Respondents within a cluster, therefore, are relatively homogeneous.

The most widely used algorithms are:
1) K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
2) BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD 96.

Let us look at some more major areas of application of data mining in marketing: customer profiling, deviation analysis and trend analysis. The patterns formed after mining the data help in analytics.

1) Customer profiling

This helps to predict several marketing decisions. A customer profile is a model of the customer, based on which the marketer decides on the right strategies and tactics to meet the needs of that customer. The data mining tasks used in customer profiling can be dependency analysis, class identification and concept description. Below is a set of transaction attributes that can help a marketer construct useful customer profiles.

Frequency of purchases: A marketing firm can build targeted promotional offers, such as frequent purchaser programs, by looking at how often its customers purchase products from its shop.

Recency of purchases: The term means: how long has it been since this customer last placed an order?
Suppose a customer used to visit the shop frequently, and it is found that this customer or customer group has not visited the firm over a long period of time. The marketer investigates the reason. By knowing it they can make an appropriate offer or take appropriate action.

Size of purchases: This tells how much he or she spends on a particular transaction. This information helps to allocate resources to those customer groups.

Identifying typical customer groups: This gives the characteristics of each group. For example, a profile indicating that the customer has purchased a Windows 7 software CD may lead the marketer to offer a special deal on a Microsoft Office software CD.

Prospecting: Customer profiles, like buying patterns, give clues to the marketer about prospective customers. Say, for example, that data mining discovers the pattern "purchase of Norton Anti Virus software with one year validity is followed by purchase of a Norton upgrade or new version within 11 months about 85% of the time by high income customers". An analyst who examines this pattern can identify the prospective customers for the upgraded or new version based on first-time purchase details and tailor the mail catalog accordingly, thus increasing the prospect of sales.

2) Deviation analysis

Deviation analysis is one of the important analyses. For example, a higher than normal credit purchase on a credit card can be fraud (an anomaly) or a genuine change in the customer's purchasing. Once a deviation has been identified as fraud, the marketer takes appropriate steps to prevent such frauds and initiates corrective action. If the deviation has been identified as a change, further information collection is necessary. For example, the change can be that a customer got a new job and moved to a new house. In this case, the marketer has to update the knowledge about the customer.

3) Trend analysis

Trends are patterns that persist over a period of time.
Trends could be short-term, like the immediate increase and subsequent slow decline of sales following a sales campaign. Or trends could be long-term, like the slow flattening of sales of a product over a few years. Data mining tools, such as visualization, help us detect trends, sometimes very subtle and hidden in the database, which would have been missed using traditional analysis tools like scatter plots. In marketing decisions, trends can be used for evaluating marketing programs or to forecast future sales.

Market basket analysis gives the relationship between different products purchased by a customer. Using this technique we can develop marketing strategies for promoting products that have a dependency relationship in the customer's mind.

Class identification

Class identification groups customers into classes which are defined in advance. Mathematical taxonomy and clustering are used for the class identification task. Mathematical taxonomy maximizes the similarity within classes and minimizes the similarity between classes. The clustering approach determines the clusters according to attribute similarity as well as conceptual cohesiveness as defined by domain knowledge (described above). For example, a firm doing business over the net can use the session log data of internet users to classify its web users into "email only" users, "surfers", "just for fun" surfers and so on.

Data visualization software allows the market research team or other concerned people to view complex 3-D and 2-D patterns. It also provides drill-down, drill-up and slice facilities. In the KDD (knowledge discovery from databases) process, data visualization is used in association with other tasks such as dependency analysis, class identification, deviation detection and clustering. IBM SPSS PASW has good data visualization techniques.
Some of them are explained in Part 1 of the report.

Conclusion

This report discussed the data mining process, briefly covered different mining techniques such as classification trees, neural networks and regression, and described their application in the marketing domain. It also covered the different types of analyses and tasks being used. Most of the big firms in the UK have already implemented a data mining environment for their business analytics. Some disadvantages are that data mining experts can be difficult to find and that building the environment is costly. Privacy is another data mining issue that organizations are most concerned about.
