Data Mining Orange program
The purpose of this project is to familiarize you with the process of data mining using a modern
programming toolkit to apply numerous data mining strategies.
This project uses Orange, a suite of data mining tools that can be used from C++, from Python, or through a GUI.
This project will require you to do five things:
1. Read and briefly summarize all documents on the reading list.
• Summarize the following
• What is data mining
• Orange as a Data mining tool
• Basic Data Manipulation and Preparation
• Data modeling
• Evaluation of model performance
2. Using the tutorial and documentation examples as a guide, complete your own data mining
process against a dataset of your choosing.
Your Own Data Mining Process
After you’ve completed step 1 above, you should have a good understanding of what tools are
available to you in Orange. Now it’s time to try some of these approaches on a dataset of your
own choosing. For this part of the project, you must:
• Choose and explore a dataset. You can use any of the ones provided in Orange or import
your own dataset.
• Select at least 3 data mining strategies and apply them to the dataset.
• Describe your intent, approach and results.
• Lastly, did you discover anything meaningful or surprising? If so, document your
findings. If not, describe what you might do next to refine your process or choose
different/improved mining strategies.
Your data mining process doesn’t have to be perfect, or even yield incredibly interesting results;
the important thing is the process. So don’t be afraid to try something fun even if it may not
yield amazing results.
3. Submit complete documentation of items 1 and 2 above.
Data Mining, Material for lectures on Data mining, at Kyoto University, Dept. of Health
A Data Mining Tutorial: http://maths-people.anu.edu.au/~steve/pdcn.pdf
Functional Genomics Workshop http://docs.orange.biolab.si/_downloads/bio-tutorial.pdf
The dataset used in this exercise is the heart disease dataset available in heart_disease.tab, obtained from the Orange datasets repository. This dataset describes risk factors for heart disease. The attribute diameter narrowing represents the (binary) class attribute: class 1 means there is diameter narrowing; class 0 indicates no diameter narrowing.
The main aim of this exercise is to predict heart disease in terms of diameter narrowing from the other attributes in the dataset. Obviously, this is a classification problem. The software to be used is Orange. However, feel free to try any ideas you may have to tackle the problem with any other software.
The description of this exercise is stepwise. Therefore, I hope you can get a better understanding of the various aspects and questions involved in the KDD (Knowledge Discovery in Databases) process.
The first step in approaching the problem is to get acquainted with the data. Answering the following questions will help you to better understand the data. The data file heart_disease.tab contains some information about the data stored in it.
Load the data file in Orange.
(a) The attribute type, e.g. nominal, ordinal, numeric.
(c) Max, min, mean, standard deviation.
(d) Are there any records that have a value for the attribute that no other record has?
(e) Study the histogram at the lower right and informally describe how the attribute seems to influence the risk for heart disease. What do the pop-up messages that appear when you drag the mouse over the graphic mean?
(f) Are there any outliers for the attribute under consideration?
i. Investigate the possibility of using the Orange widgets to detect outliers.
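To check what Orange reports for questions (c) and (d), it can help to compute the same quantities by hand for one attribute. The following is a minimal sketch in plain Python, not Orange's API; the list `ages` is made-up illustration data, not values taken from heart_disease.tab.

```python
import statistics
from collections import Counter

# Hypothetical values of one numeric attribute (NOT taken from heart_disease.tab).
ages = [63, 67, 67, 37, 41, 56, 62, 57, 63, 53]

# Question (c): max, min, mean, standard deviation.
summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

# Question (d): values that occur in exactly one record.
counts = Counter(ages)
unique_values = sorted(v for v, n in counts.items() if n == 1)

print(summary)
print("values held by a single record:", unique_values)
```

Note that `statistics.stdev` is the sample standard deviation; if your tool reports the population version, use `statistics.pstdev` when comparing.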
2. Use Visualize widgets to visualize 2D-scatter plots for each pair of attributes.
(a) Which attributes seem to be the most/least linked to heart disease? Summarize in a table your findings concerning the predictive value of each attribute.
(b) Does any pair of attributes seem to be correlated?
3. Investigate also possible multivariate associations of attributes with the class attribute, i.e. study scatter plots of two attributes X and Y and try to identify possible "dense" heart disease areas (if any).
(a) If you find "dense" heart disease areas in any scatter plot, then quantify the heart disease rate in these areas with respect to the entire dataset.
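One way to quantify a "dense" area is to compare the heart disease rate inside the area with the rate over the whole dataset. The sketch below assumes a rectangular region picked by eye from a scatter plot; the records, the attribute names and the region bounds are all hypothetical and only illustrate the arithmetic.

```python
# Hypothetical records: (age, max_heart_rate, diameter_narrowing).
# These values are made up for illustration, not read from heart_disease.tab.
records = [
    (63, 150, 1), (67, 108, 1), (67, 129, 1), (37, 187, 0),
    (41, 172, 0), (56, 178, 0), (62, 160, 1), (57, 163, 0),
    (63, 147, 1), (53, 155, 1),
]

def disease_rate(rows):
    """Fraction of rows whose class attribute equals 1."""
    return sum(r[2] for r in rows) / len(rows)

# Assumed region bounds: age >= 60 and max heart rate <= 160.
region = [r for r in records if r[0] >= 60 and r[1] <= 160]

overall = disease_rate(records)
local = disease_rate(region)

print(f"overall rate:   {overall:.2f}")
print(f"rate in region: {local:.2f} over {len(region)} records")
print(f"ratio to overall rate: {local / overall:.2f}")
```

Report the number of records in the region alongside the rate, since a high rate over a handful of records is weak evidence.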
1.2 Data Preprocessing
The second step is to preprocess the data such that the transformed data is in a more suitable form for the mining algorithms.
Investigate the possibility of using the widget AttributeSelection for selecting a subset of attributes with good predicting capability. Then, describe briefly the widget you used and compare the results you obtained with the conclusions you obtained in the previous section.
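To make "good predicting capability" concrete, one common scoring criterion is information gain: how much knowing an attribute's value reduces the entropy of the class. Whether the widget you use applies this exact score is something to check in its documentation. The sketch below is plain Python on made-up data, with hypothetical attribute names.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical class labels and two hypothetical categorical attributes.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
attributes = {
    "chest_pain": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],  # separates the classes
    "gender": ["m", "f", "m", "f", "m", "f", "m", "f"],                  # carries no class information
}

gains = {name: information_gain(vals, labels) for name, vals in attributes.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains)
print("ranking:", ranking)
```

Numeric attributes would need to be discretized before a score like this can be applied.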
2. Handling missing values.
Consider the following methods for handling missing values and investigate each possibility within Orange. Note that, as a rule of thumb, if an attribute has more than 5% missing values, then the records should not be deleted and it is advisable to impute values where data is missing, using a suitable method.
(a) Replace the missing values by the attribute mean, if the attribute is numeric. Otherwise, replace missing values by the attribute mode (if the attribute is categorical). Save the dataset you obtained without missing values in the file heart_disease2.tab.
(b) Investigate the possibility of using (linear) regression to estimate the missing values for each attribute. Save the dataset you obtained without missing values in the file heart_disease3.tab.
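Method (a) can be sketched in a few lines, which is useful for checking what Orange produces. This is plain Python rather than Orange's API; missing values are represented as `None`, and both columns are made-up examples with hypothetical names.

```python
import statistics

def missing_share(column):
    """Fraction of entries that are missing (None)."""
    return column.count(None) / len(column)

def impute(column, numeric):
    """Fill None with the mean (numeric) or the mode (categorical) of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed) if numeric else statistics.mode(observed)
    return [fill if v is None else v for v in column]

# Hypothetical columns; None marks a missing value.
cholesterol = [233, None, 250, 204, None, 229]
chest_pain = ["typical", "atypical", None, "typical", "typical", None]

cholesterol_filled = impute(cholesterol, numeric=True)
chest_pain_filled = impute(chest_pain, numeric=False)

print(f"missing share (cholesterol): {missing_share(cholesterol):.0%}")
print(cholesterol_filled)
print(chest_pain_filled)
```

The fill value is computed from the observed entries only, so the imputed column keeps the same mean (or mode) as the observed data.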
3. Eliminating outliers.
(a) Eliminate the outlier records and save the dataset you obtained without outliers in the file heart_disease4.tab.
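The exercise does not prescribe an outlier criterion. One simple and widely used choice for a single numeric attribute is the 1.5 × IQR rule, sketched below in plain Python. The rule itself, the factor 1.5 and the sample values are assumptions made for illustration; Orange's own outlier detection may work differently.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical resting blood pressure readings with one extreme value.
resting_bp = [120, 130, 125, 140, 135, 128, 132, 210, 118, 126]

low, high = iqr_bounds(resting_bp)
kept = [v for v in resting_bp if low <= v <= high]
removed = [v for v in resting_bp if not (low <= v <= high)]

print(f"fences: [{low}, {high}]")
print("removed:", removed)
```

When you apply a rule like this to whole records, decide whether a record is dropped if any attribute is out of range or only if several are, and state that choice in your report.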
1.3 Mining the Data
The third step is to use some of the classifier algorithms available in Orange to discover hidden patterns in the data. You should repeat the steps described below for each of the datasets you created during preprocessing, in addition to the original dataset (if possible).
1. Use more than one classifier (Decision Tree, SVM, K Nearest Neighbor).
(a) What can you conclude? Compare your conclusions with your previous conclusions obtained in section 1.1.
(b) Compare the accuracy of the classifier on the training set with the accuracy estimate obtained through 10-fold cross-validation. How do you explain the difference (if any)?
(b) Describe the patterns you obtained and compare with your previous conclusions.
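The gap between training accuracy and cross-validated accuracy is easiest to see with a classifier that memorizes its training data. The sketch below uses a 1-nearest-neighbour classifier on synthetic one-dimensional data; it is plain Python, not Orange, and the data generator, seed and helper names are all assumptions for illustration.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

def cross_val_accuracy(data, folds=10):
    """Average accuracy over folds; each fold is held out exactly once."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(accuracy(train, test))
    return sum(scores) / folds

random.seed(0)
# Synthetic data: one noisy feature, two overlapping classes.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(random.gauss(1.0, 1.0), 1) for _ in range(50)]
random.shuffle(data)

train_acc = accuracy(data, data)    # evaluated on the data it was fitted to
cv_acc = cross_val_accuracy(data)   # evaluated on held-out folds

print(f"training accuracy:       {train_acc:.2f}")
print(f"10-fold cross-validated: {cv_acc:.2f}")
```

On the training set every point is its own nearest neighbour, so the training accuracy says nothing about unseen records; the cross-validated figure is the one to compare across classifiers.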
1.4 Clustering Tendency
Investigate whether there is a clustering tendency in the dataset. You may start by clustering the data with the k-means clustering algorithm.
1. Do not use the class attribute, diameter narrowing, for clustering.
2. Find a suitable value for k, i.e. the number of clusters you are going to build. Justify your choice of k.
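One common way to justify k is the "elbow" heuristic: run k-means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The sketch below is a small plain-Python k-means on made-up 2D points. The farthest-first initialization is chosen here only to keep the result reproducible and is not necessarily what Orange does.

```python
import math

def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns the final centers and the WCSS."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return centers, wcss

# Hypothetical 2D points forming three visibly separate groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

wcss_by_k = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.3f}")
```

On real data, standardize the attributes first so that no single attribute dominates the distance, and remember to leave out the class attribute.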
1.5 Predicting Performance
In the previous step you have built several models. Finally, you need to compare the different models and describe your final conclusions.
1. Orange outputs several performance measures. Choose some of the performance measures and motivate your choice.
2. Summarize in a table the performance measures for each classifier and each dataset.
3. What can you conclude?
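When motivating your choice of measures, it helps to know how each one is derived from the confusion matrix. The sketch below computes accuracy, precision, recall and F1 using their standard definitions. The label vectors are made up, and the set of measures your version of Orange reports may differ.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def scores(actual, predicted):
    """Standard measures derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical true classes and predictions (1 = diameter narrowing).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

result = scores(actual, predicted)
print(result)
```

For a medical question such as this one, recall on class 1 (how many true cases are found) often matters more than overall accuracy, which is one possible argument to make in item 1.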
Describe your final conclusions and indicate which risk factors for heart disease you have found in the data.