diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/Solution.pdf b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/Solution.pdf new file mode 100644 index 00000000..4d222a4b Binary files /dev/null and b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/Solution.pdf differ diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Academic_journal b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Academic_journal new file mode 100644 index 00000000..899168a6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Academic_journal @@ -0,0 +1 @@ + Academic journal - Wikipedia, the free encyclopedia

Academic journal

From Wikipedia, the free encyclopedia

An academic journal is a peer-reviewed periodical in which scholarship relating to a particular academic discipline is published. Academic journals serve as forums for the introduction and presentation for scrutiny of new research, and the critique of existing research.[1] Content typically takes the form of articles presenting original research, review articles, and book reviews.

The term academic journal applies to scholarly publications in all fields; this article discusses the aspects common to all academic field journals. Scientific journals and journals of the quantitative social sciences vary in form and function from journals of the humanities and qualitative social sciences; their specific aspects are separately discussed.

Scholarly articles

There are two kinds of article or paper submissions in academia: solicited, where an individual has been invited to submit work either through direct contact or through a general submissions call, and unsolicited, where an individual submits a work for potential publication without directly being asked to do so.[2] Upon receipt of a submitted article, editors at the journal determine whether to reject the submission outright or begin the process of peer review. In the latter case, the submission becomes subject to review by outside scholars of the editor's choosing who typically remain anonymous. The number of these peer reviewers (or "referees") varies according to each journal's editorial practice — typically, no fewer than two, though sometimes three or more, experts in the subject matter of the article produce reports upon the content, style, and other factors, which inform the editors' publication decisions. Though these reports are generally confidential, some journals and publishers also practice public peer review. The editors either choose to reject the article, ask for a revision and resubmission, or accept the article for publication. Even accepted articles are often subjected to further (sometimes considerable) editing by journal editorial staff before they appear in print. The peer review can take from several weeks to several months.[3]

Reviewing

Review articles

Review articles, also called "reviews of progress," are checks on the research published in journals. Some journals are devoted entirely to review articles, others contain a few in each issue, but most do not publish review articles. Such reviews often cover the research from the preceding year, some for longer or shorter terms; some are devoted to specific topics, some to general surveys. Some journals are enumerative, listing all significant articles in a given subject, others are selective, including only what they think worthwhile. Yet others are evaluative, judging the state of progress in the subject field. Some journals are published in series, each covering a complete subject field year, or covering specific fields through several years. Unlike original research articles, review articles tend to be solicited submissions, sometimes planned years in advance. They are typically relied upon by students beginning a study in a given field, or for current awareness of those already in the field.[4]

Book reviews

Book reviews of scholarly books are checks upon the research books published by scholars; unlike articles, book reviews tend to be solicited. Journals typically have a separate book review editor determining which new books to review and by whom. If an outside scholar accepts the book review editor's request for a book review, he or she generally receives a free copy of the book from the journal in exchange for a timely review. Publishers send books to book review editors in the hope that their books will be reviewed. The length and depth of research book reviews varies much from journal to journal, as does the extent of textbook and trade book review.[5]

Prestige

Different types of peer-reviewed research journals; these specific publications are about economics

An academic journal's prestige is established over time, and can reflect many factors, some but not all of which are expressible quantitatively. In each academic discipline there are dominant journals that receive the largest number of submissions, and therefore can be selective in choosing their content. Yet the largest journals are not the only ones of excellent quality.[6]

Ranking

In the natural sciences and in the "hard" social sciences, the impact factor is a convenient proxy, measuring the number of later articles citing articles already published in the journal. There are other possible quantitative factors, such as the overall number of citations, how quickly articles are cited, and the average "half-life" of articles, i.e. when they are no longer cited. There also is the question of whether or not any quantitative factor can reflect true prestige; natural science journals are categorized and ranked in the Science Citation Index, social science journals in the Social Sciences Citation Index.[6]
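
For concreteness, the commonly used two-year impact factor can be written as follows (an illustrative addition; this is the standard definition, and the worked numbers are invented for this example):

\mathrm{IF}_y = \frac{C_y}{N_{y-1} + N_{y-2}} where

C_y is the number of citations received in year y by items the journal published in the two preceding years
N_{y-1} and N_{y-2} are the numbers of citable items the journal published in each of those two years

For example, a journal that published 200 citable items over 2011 and 2012 and received 500 citations to them in 2013 has a 2013 impact factor of 500 / 200 = 2.5.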

In the Anglo-American humanities, there is no tradition (as there is in the sciences) of giving impact-factors that could be used in establishing a journal's prestige. Recent moves have been made by the European Science Foundation to rectify the situation, resulting in the publication of preliminary lists for the ranking of academic journals in the Humanities.[6]

In some disciplines, such as knowledge management/intellectual capital, the lack of a well-established journal ranking system is perceived as "a major obstacle on the way to tenure, promotion and achievement recognition".[7]

The categorization of journal prestige in some subjects has been attempted, typically using letters to rank journals by their importance within the academic world.

We can distinguish three categories of techniques to assess journal quality and develop journal rankings:[8]

  • stated preference;
  • revealed preference; and
  • publication power approaches[9]

Publishing

Many academic journals are subsidized by universities or professional organizations, and do not exist to make a profit; however, they often accept advertising, as well as page and image charges from authors, to pay for production costs. On the other hand, some journals are produced by commercial publishers who do make a profit by charging subscriptions to individuals and libraries. They may also sell all of their journals in discipline-specific collections or a variety of other packages.[10]

Journal editors tend to have other professional responsibilities, most often as teaching professors. In the case of the largest journals, there are paid staff assisting in the editing. The production of the journals is almost always done by publisher-paid staff. Humanities and social science academic journals are usually subsidized by universities or professional organizations.[11]

New developments

The Internet has revolutionized the production of, and access to, academic journals, with their contents available online via services subscribed to by academic libraries. Individual articles are subject-indexed in databases such as Google Scholar. Some of the smallest, most specialized journals are prepared in-house, by an academic department, and published only online; such publication has sometimes taken the form of a blog. Currently, there is a movement in higher education encouraging open access, either via self-archiving, whereby the author deposits a paper in a repository where it can be searched for and read, or via publishing it in a free open access journal, which does not charge for subscriptions, being either subsidized or financed with author page charges. However, to date, open access has affected science journals more than humanities journals. Commercial publishers are now experimenting with open access models, but are trying to protect their subscription revenues.[12]

References

  1. ^ Gary Blake and Robert W. Bly, The Elements of Technical Writing, pg. 113. New York: Macmillan Publishers, 1993. ISBN 0020130856
  2. ^ Gwen Meyer Gregory (2005). The successful academic librarian: winning strategies from library leaders. Information Today. pp. 36–37. 
  3. ^ Michèle Lamont (2009). How professors think: inside the curious world of academic judgment. Harvard University Press. pp. 1–14. 
  4. ^ Deborah E. De Lange (2011). Research Companion to Green International Management Studies: A Guide for Future Research, Collaboration and Review Writing. Edward Elgar Publishing. pp. 1–5. 
  5. ^ Rita James Simon and Linda Mahan (October 1969). "A Note on the Role of Book Review Editor as Decision Maker". The Library Quarterly. p. 353-356. 
  6. ^ a b c Rowena Murray (2009). Writing for Academic Journals. McGraw-Hill International. pp. 42–45. 
  7. ^ Nick Bontis (2009). "A follow-up ranking of academic journals". Journal of Knowledge Management. p. 17. 
  8. ^ Lowry, P.B.; Humphreys, S.; Malwitz, J.; Nix, J. (2007). "A scientometric study of the perceived quality of business and technical communication journals". IEEE Transactions of Professional Communication. 
  9. ^ Alexander Serenko and Changquan Jiao (2011). "Investigating Information Systems Research in Canada". June 11, 2011. p. ff. 
  10. ^ Bergstrom, Theodore C. (2001). "Free Labor for Costly Journals?". Journal of Economic Perspectives 15 (3): 183–198. doi:10.1257/jep.15.4.183. 
  11. ^ Robert A. Day and Barbara Gastel (2011). How to Write and Publish a Scientific Paper. ABC-CLIO. pp. 122–124. 
  12. ^ James Hendler (2007). "Reinventing Academic Publishing-Part 1". IEEE Intelligent Systems. p. 2-3. 

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Accuracy_paradox b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Accuracy_paradox new file mode 100644 index 00000000..d525d303 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Accuracy_paradox @@ -0,0 +1 @@ + Accuracy paradox - Wikipedia, the free encyclopedia

Accuracy paradox

From Wikipedia, the free encyclopedia

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.

Accuracy is often the starting point for analyzing the quality of a predictive model, as well as an obvious criterion for prediction. Accuracy measures the ratio of correct predictions to the total number of cases evaluated. It may seem obvious that the ratio of correct predictions to cases should be a key metric. A predictive model may have high accuracy, but be useless.

In an example predictive model for an insurance fraud application, all cases that are predicted as high-risk by the model will be investigated. To evaluate the performance of the model, the insurance company has created a sample data set of 10,000 claims. All 10,000 cases in the validation sample have been carefully checked and it is known which cases are fraudulent. To analyze the quality of the model, the insurance company uses the table of confusion. The definition of accuracy, the table of confusion for model M1Fraud, and the calculation of accuracy for model M1Fraud are shown below.

\mathrm{A}(M) = \frac{TN + TP}{TN + FP + FN + TP} where

TN is the number of true negative cases
FP is the number of false positive cases
FN is the number of false negative cases
TP is the number of true positive cases

Formula 1: Definition of Accuracy

                 Predicted Negative   Predicted Positive
Negative Cases        9,700                  150
Positive Cases           50                  100

Table 1: Table of Confusion for Fraud Model M1Fraud.

\mathrm A (M) = \frac{9,700 + 100}{9,700 + 150 + 50 + 100} = 98.0%

Formula 2: Accuracy for model M1Fraud

With an accuracy of 98.0%, model M1Fraud appears to perform fairly well. The paradox lies in the fact that accuracy can be easily improved to 98.5% by always predicting "no fraud". The table of confusion and the accuracy for this trivial "always predict negative" model M2Fraud are shown below.

                 Predicted Negative   Predicted Positive
Negative Cases        9,850                    0
Positive Cases          150                    0

Table 2: Table of Confusion for Fraud Model M2Fraud.

\mathrm{A}(M) = \frac{9,850 + 0}{9,850 + 150 + 0 + 0} = 98.5%

Formula 3: Accuracy for model M2Fraud

Model M2Fraud reduces the rate of inaccurate predictions from 2% to 1.5%, an apparent improvement of 25%. The new model M2Fraud shows fewer incorrect predictions and markedly improved accuracy compared to the original model M1Fraud, but is obviously useless.

The alternative model M2Fraud does not offer any value to the company for preventing fraud. The less accurate model is more useful than the more accurate model.

Model improvements should not be measured in terms of accuracy gains. It may be going too far to say that accuracy is irrelevant, but caution is advised when using accuracy in the evaluation of predictive models.
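
To make the comparison concrete, here is a minimal Python sketch (an illustrative addition; the helper name and output formatting are chosen here) that computes accuracy together with precision and recall, the alternative metrics mentioned above, from the counts in Tables 1 and 2:

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # no positive predictions at all
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

m1 = metrics(tp=100, fp=150, fn=50, tn=9700)    # model M1Fraud (Table 1)
m2 = metrics(tp=0, fp=0, fn=150, tn=9850)       # model M2Fraud (Table 2), "always predict negative"
print("M1Fraud: accuracy=%.3f precision=%.3f recall=%.3f" % m1)   # 0.980 0.400 0.667
print("M2Fraud: accuracy=%.3f precision=%.3f recall=%.3f" % m2)   # 0.985 0.000 0.000

Although M2Fraud has the higher accuracy, its precision and recall on the fraud class are zero, which is the paradox expressed numerically.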

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Affinity_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Affinity_analysis new file mode 100644 index 00000000..b981c028 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Affinity_analysis @@ -0,0 +1 @@ + Affinity analysis - Wikipedia, the free encyclopedia

Affinity analysis

From Wikipedia, the free encyclopedia

Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships among activities performed by (or recorded about) specific individuals or groups. In general, this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans.[1]

Examples

Market basket analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in profit, while a promotion involving just one of the items would likely drive sales of the other.

Market basket analysis may provide the retailer with information to understand the purchase behavior of a buyer. This information will enable the retailer to understand the buyer's needs and rewrite the store's layout accordingly, develop cross-promotional programs, or even capture new buyers (much like the cross-selling concept). An apocryphal early illustrative example is that of a supermarket chain which, having discovered in its analysis that customers who bought diapers often bought beer as well, put the diapers close to the beer coolers and saw sales of both increase dramatically. Although this urban legend is only an example that professors use to illustrate the concept to students, the explanation of this imaginary phenomenon might be that fathers who are sent out to buy diapers often buy a beer as well, as a reward. This kind of analysis is supposedly an example of the use of data mining. A widely used example of cross-selling on the web with market basket analysis is Amazon.com's use of "customers who bought book A also bought book B", e.g. "People who read History of Portugal were also interested in Naval History".

Market basket analysis can be used to divide customers into groups. A company could look at what other items people purchase along with eggs, and classify them as baking a cake (if they are buying eggs along with flour and sugar) or making omelets (if they are buying eggs along with bacon and cheese). This identification could then be used to drive other programs.
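
As a rough sketch of how such co-occurrence relationships can be counted (an illustrative addition; the baskets, item names and threshold below are invented for this example):

from itertools import combinations
from collections import Counter

# Toy baskets; count how often pairs of items appear in the same basket
baskets = [
    {"eggs", "flour", "sugar"},      # looks like baking a cake
    {"eggs", "bacon", "cheese"},     # looks like making omelets
    {"shampoo", "conditioner"},
    {"eggs", "flour", "sugar", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least 2 baskets are candidate affinities
for pair, count in pair_counts.items():
    if count >= 2:
        print(pair, count)   # ('eggs', 'flour') 2, ('eggs', 'sugar') 2, ('flour', 'sugar') 2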

Business use

Business use of market basket analysis has significantly increased since the introduction of electronic point of sale.[1] Amazon uses affinity analysis for cross-selling when it recommends products to people based on their purchase history and the purchase history of other people who bought the same item. Family Dollar plans to use market basket analysis to help maintain sales growth while moving towards stocking more low-margin consumable goods.[2] A common urban legend highlighting the unexpected insights that can be found involves a chain (often incorrectly given as Wal-Mart) discovering that beer and diapers were often purchased together, and responding to that by moving the beer closer to the diapers to drive sales; however, while the relationship seems to have been noted, it is unclear whether any action was taken to promote selling them together.[3]

References

  1. ^ a b "Demystifying Market Basket Analysis". Retrieved 3 November 2009. 
  2. ^ "Family Dollar Supports Merchandising with IT". Retrieved 3 November 2009. 
  3. ^ "The parable of the beer and diapers". The Register. Retrieved 3 September 2009. 

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Alpha_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Alpha_algorithm new file mode 100644 index 00000000..8468f8be --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Alpha_algorithm @@ -0,0 +1 @@ + Alpha algorithm - Wikipedia, the free encyclopedia

Alpha algorithm

From Wikipedia, the free encyclopedia

The α-algorithm is an algorithm used in process mining, aimed at reconstructing causality from a set of sequences of events. It was first put forward by van der Aalst, Weijters and Măruşter.[1] Several extensions or modifications of it have since been presented; some are listed below.

It constructs P/T nets with special properties (workflow nets) from event logs (as might be collected by an ERP system). Each transition in the net corresponds to an observed task.

Short description

The algorithm takes a workflow log W\subseteq T^{*} as input and constructs a workflow net from it.

It does so by examining causal relationships observed between tasks. For example, one specific task might always precede another specific task in every execution trace, which would be useful information.

Definitions used

  • A workflow trace or execution trace is a string over an alphabet T of tasks.
  • A workflow log is a set of workflow traces.

Description

Declaratively, the algorithm can be presented as follows. Three sets of tasks are determined:

  • T_W is the set of all tasks which occur in at least one trace
  • T_I is the set of all tasks which occur trace-initially
  • T_O is the set of all tasks which occur trace-terminally

Basic ordering relations are determined (\succ_W first; the latter three can then be derived from it):

  • a \succ_W b iff a is directly followed by b in some trace
  • a\rightarrow_W b iff a\succ_Wb \wedge b\not\succ_Wa
  • a\#{}_Wb iff a\not\succ_Wb \wedge b\not\succ_Wa
  • a\Vert_Wb iff a\succ_Wb \wedge b\succ_Wa
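
A minimal Python sketch of how these four relations can be derived from a log (an illustrative addition; the two traces and the variable names are invented for this example):

# Toy workflow log: two traces over the tasks {a, b, c, d}
log = ["abcd", "acbd"]

succ = set()                          # the \succ_W relation (direct succession)
for trace in log:
    for x, y in zip(trace, trace[1:]):
        succ.add((x, y))

tasks = sorted({t for trace in log for t in trace})
causal    = {(a, b) for a in tasks for b in tasks if (a, b) in succ and (b, a) not in succ}      # \rightarrow_W
unrelated = {(a, b) for a in tasks for b in tasks if (a, b) not in succ and (b, a) not in succ}  # \#_W
parallel  = {(a, b) for a in tasks for b in tasks if (a, b) in succ and (b, a) in succ}          # \Vert_W

print(sorted(causal))     # [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]
print(sorted(parallel))   # [('b', 'c'), ('c', 'b')]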

Places are discovered. Each place is identified with a pair of sets of tasks, in order to keep the number of places low.

  • Y_W is the set of all pairs (A,B) of maximal sets of tasks such that
    • Neither A \times A nor B \times B contains any members of \succ_W, and
    • A \times B is a subset of \rightarrow_W
  • P_W contains one place p_{(A,B)} for every member of Y_W, plus the input place i_W and the output place o_W

The flow relation F_W is the union of the following:

  • \{(a,p_{(A,B)}) | (A,B) \in Y_W \wedge a \in A\}
  • \{(p_{(A,B)},b) | (A,B) \in Y_W \wedge b \in B\}
  • \{(i_W,t) | t\in T_I\}
  • \{(t,o_W) | t\in T_O\}

The result is

  • a Petri net structure \alpha(W) = (P_W,T_W,F_W)
  • with one input place i_W and one output place o_W
  • because every transition of T_W is on a F_W-path from i_W to o_W, it is indeed a workflow net.

Properties

It can be shown [2] that in the case of a complete workflow log generated by a sound SWF net, the net generating it can be reconstructed. Complete means that its \succ_W relation is maximal. It is not required that all possible traces be present (which would be countably infinite for a net with a loop).

Limitations

General workflow nets may contain several types of constructs [3] which the α-algorithm cannot rediscover.

Constructing Y_W takes exponential time in the number of tasks, since \succ_W is not constrained and arbitrary subsets of T_W must be considered.

Extensions

Extensions of the α-algorithm have been proposed, for example to mine short loops[4] and to mine process models with non-free-choice constructs.[5]

References

  1. ^ van der Aalst, W M P and Weijters, A J M M and Maruster, L (2003). "Workflow Mining: Discovering process models from event logs", IEEE Transactions on Knowledge and Data Engineering, vol 16
  2. ^ van der Aalst et al. 2003
  3. ^ A. de Medeiros, A K and van der Aalst, W M P and Weijters, A J M M (2003). "Workflow Mining: Current Status and Future Directions". in: "volume 2888 of Lecture Notes in Computer Science", Springer-Verlag
  4. ^ A. de Medeiros, A K and van Dongen, B F and van der Aalst, W M P and Weijters, A J M M (2004). "Process mining: extending the α-algorithm to mine short loops"
  5. ^ Wen, L and van der Aalst, W M P and Wang, J and Sun, J (2007). "Mining process models with non-free-choice constructs", "Data Mining and Knowledge Discovery" vol 15, p. 145--180, Springer-Verlag

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Analytics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Analytics new file mode 100644 index 00000000..acb4e7d7 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Analytics @@ -0,0 +1 @@ + Analytics - Wikipedia, the free encyclopedia

Analytics

From Wikipedia, the free encyclopedia
A sample Google Analytics dashboard. Tools like this help businesses identify trends and make decisions.

Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.

Firms may commonly apply analytics to business data, to describe, predict, and improve business performance. Specifically, arenas within analytics include enterprise decision management, retail analytics, store assortment and SKU optimization, marketing optimization and marketing mix analytics, web analytics, sales force sizing and optimization, price and promotion modeling, predictive science, credit risk analysis, and fraud analytics. Since analytics can require extensive computation (See Big Data), the algorithms and software used for analytics harness the most current methods in computer science, statistics, and mathematics.[1]

Analytics vs. analysis

Analytics is a two-sided coin. On one side, it uses descriptive and predictive models to gain valuable knowledge from data (data analysis). On the other, it uses this insight to recommend action or to guide decision making (communication). Thus, analytics is not so much concerned with individual analyses or analysis steps as with the entire methodology. There is a pronounced tendency to use the term analytics in business settings (e.g. text analytics vs. the more generic text mining) to emphasize this broader perspective.

Examples

Marketing optimization

Marketing has evolved from a creative process into a highly data-driven process. Marketing organizations use analytics to determine the outcomes of campaigns or efforts and to guide decisions for investment and consumer targeting. Demographic studies, customer segmentation, conjoint analysis and other techniques allow marketers to use large amounts of consumer purchase, survey and panel data to understand and communicate marketing strategy.

Web analytics allows marketers to collect session-level information about interactions on a website. Those interactions provide the web analytics information systems with the information to track the referrer, search keywords, IP address, and activities of the visitor. With this information, a marketer can improve the marketing campaigns, site creative content, and information architecture.

Analysis techniques frequently used in marketing include marketing mix modeling, pricing and promotion analyses, sales force optimization, and customer analytics (e.g. segmentation). Web analytics and the optimization of web sites and online campaigns now frequently work hand in hand with the more traditional marketing analysis techniques. A focus on digital media has slightly changed the vocabulary, so that marketing mix modeling is commonly referred to as attribution modeling in the digital or mixed-media context.

These tools and techniques support both strategic marketing decisions (such as how much overall to spend on marketing and how to allocate budgets across a portfolio of brands and the marketing mix) and more tactical campaign support in terms of targeting the best potential customer with the optimal message in the most cost effective medium at the ideal time. An example of the holistic approach required for this strategy is the Astronomy Model.

Portfolio analysis

A common application of business analytics is portfolio analysis. In this, a bank or lending agency has a collection of accounts of varying value and risk. The accounts may differ by the social status (wealthy, middle-class, poor, etc.) of the holder, the geographical location, its net value, and many other factors. The lender must balance the return on the loan with the risk of default for each loan. The question is then how to evaluate the portfolio as a whole.

The least risky loans may be to the very wealthy, but there are only a limited number of wealthy people. On the other hand, there are many poorer borrowers who can be lent to, but at greater risk. Some balance must be struck that maximizes return and minimizes risk. The analytics solution may combine time series analysis with many other issues in order to make decisions on when to lend money to these different borrower segments, or decisions on the interest rate charged to members of a portfolio segment to cover any losses among members in that segment.

Risk analytics

Predictive models in the banking industry are widely developed to bring certainty to the risk scores of individual customers. Credit scores are built to predict an individual's delinquency behaviour and are widely used to evaluate the creditworthiness of each applicant while processing loan applications.

Challenges

In the industry of commercial analytics software, an emphasis has emerged on solving the challenges of analyzing massive, complex data sets, often when such data is in a constant state of change. Such data sets are commonly referred to as big data. Whereas once the problems posed by big data were only found in the scientific community, today big data is a problem for many businesses that operate transactional systems online and, as a result, amass large volumes of data quickly.[2]

The analysis of unstructured data types is another challenge getting attention in the industry. Unstructured data differs from structured data in that its format varies widely and cannot be stored in traditional relational databases without significant effort at data transformation.[3] Sources of unstructured data, such as email, the contents of word processor documents, PDFs, geospatial data, etc., are rapidly becoming a relevant source of business intelligence for businesses, governments and universities.[4] For example, the discovery in Britain that one company was illegally selling fraudulent doctor's notes in order to assist people in defrauding employers and insurance companies[5] is an opportunity for insurance firms to increase the vigilance of their unstructured data analysis. The McKinsey Global Institute estimates that big data analysis could save the American health care system $300 billion per year and the European public sector €250 billion.[6]

These challenges are the current inspiration for much of the innovation in modern analytics information systems, giving birth to relatively new machine analysis concepts such as complex event processing, full text search and analysis, and even new ideas in presentation.[7] One such innovation is the introduction of grid-like architecture in machine analysis, allowing increases in the speed of massively parallel processing by distributing the workload to many computers all with equal access to the complete data set.[8]

Analytics is increasingly used in education, particularly at the district and government office levels. However, the complexity of student performance measures presents challenges when educators try to understand and use analytics to discern patterns in student performance, predict graduation likelihood, improve chances of student success, etc. For example, in a study involving districts known for strong data use, 48% of teachers had difficulty posing questions prompted by data, 36% did not comprehend given data, and 52% incorrectly interpreted data.[9] To combat this, some analytics tools for educators adhere to an over-the-counter data format (embedding labels, supplemental documentation, and a help system, and making key package/display and content decisions) to improve educators’ understanding and use of the analytics being displayed.[10]

One more emerging challenge is dynamic regulatory needs. For example, in the banking industry, Basel III and future capital adequacy needs are likely to make even smaller banks adopt internal risk models. In such incidents, cloud computing and open source R can help smaller banks to adopt risk analytics and support branch level monitoring by applying predictive analytics.[citation needed]

References

  1. ^ Kohavi, Rothleder and Simoudis (2002). "Emerging Trends in Business Analytics". Communications of the ACM 45 (8): 45–48. 
  2. ^ Naone, Erica. "The New Big Data". Technology Review, MIT. Retrieved August 22, 2011. 
  3. ^ Inmon, Bill (2007). Tapping Into Unstructured Data. Prentice-Hall. ISBN 978-0-13-236029-6. 
  4. ^ Wise, Lyndsay. "Data Analysis and Unstructured Data". Dashboard Insight. Retrieved February 14, 2011. 
  5. ^ "Fake doctors' sick notes for Sale for £25, NHS fraud squad warns". London: The Telegraph. Retrieved August 2008. 
  6. ^ "Big Data: The next frontier for innovation, competition and productivity as reported in Building with Big Data". The Economist. May 26, 2011. Archived from the original on 3 June 2011. Retrieved May 26, 2011. 
  7. ^ Ortega, Dan. "Mobililty: Fueling a Brainier Business Intelligence". IT Business Edge. Retrieved June 21, 2011. 
  8. ^ Khambadkone, Krish. "Are You Ready for Big Data?". InfoGain. Retrieved February 10, 2011. 
  9. ^ U.S. Department of Education Office of Planning, Evaluation and Policy Development (2009). Implementing data-informed decision making in schools: Teacher access, supports and use. United States Department of Education (ERIC Document Reproduction Service No. ED504191)
  10. ^ Rankin, J. (2013, March 28). How data Systems & reports can either fight or propagate the data analysis error epidemic, and how educator leaders can help. Presentation conducted from Technology Information Center for Administrative Leadership (TICAL) School Leadership Summit.

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_Detection_at_Multiple_Scales b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_Detection_at_Multiple_Scales new file mode 100644 index 00000000..10ad0d00 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_Detection_at_Multiple_Scales @@ -0,0 +1 @@ + Anomaly Detection at Multiple Scales - Wikipedia, the free encyclopedia

Anomaly Detection at Multiple Scales

From Wikipedia, the free encyclopedia
Anomaly Detection at Multiple Scales
Establishment: 2011
Sponsor: DARPA
Value: $35 million
Goal: Detect insider threats in defense and government networks
Website: www.darpa.mil

Anomaly Detection at Multiple Scales, or ADAMS, is a $35 million DARPA project designed to identify patterns and anomalies in very large data sets. It is under DARPA's Information Innovation office and began in 2011.[1][2][3][4]

The project is intended to detect and prevent insider threats such as "a soldier in good mental health becoming homicidal or suicidal", an "innocent insider becoming malicious", or "a government employee [who] abuses access privileges to share classified information".[2][5] Specific cases mentioned are Nidal Malik Hasan and alleged WikiLeaks source Bradley Manning.[6] Commercial applications may include finance.[6] The intended recipients of the system output are operators in the counterintelligence agencies.[2][5]

The Proactive Discovery of Insider Threats Using Graph Analysis and Learning is part of the ADAMS project.[5][7] The Georgia Tech team includes noted high-performance computing researcher David A. Bader.[8]

References

  1. ^ "ADAMS". DARPA Information Innovation Office. Retrieved 2011-12-05. 
  2. ^ a b c "Anomaly Detection at Multiple Scales (ADAMS) Broad Agency Announcement DARPA-BAA-11-04". General Services Administration. 2010-10-22. Retrieved 2011-12-05. 
  3. ^ Ackerman, Spencer (2010-10-11). "Darpa Starts Sleuthing Out Disloyal Troops". Wired. Retrieved 2011-12-06. 
  4. ^ Keyes, Charley (2010-10-27). "Military wants to scan communications to find internal threats". CNN. Retrieved 2011-12-06. 
  5. ^ a b c "Georgia Tech Helps to Develop System That Will Detect Insider Threats from Massive Data Sets". Georgia Institute of Technology. 2011-11-10. Retrieved 2011-12-06. 
  6. ^ a b "Video Interview: DARPA’s ADAMS Project Taps Big Data to Find the Breaking Bad". Inside HPC. 2011-11-29. Retrieved 2011-12-06. 
  7. ^ Brandon, John (2011-12-03). "Could the U.S. Government Start Reading Your Emails?". Fox News. Retrieved 2011-12-06. 
  8. ^ "Anomaly Detection at Multiple Scales". Georgia Tech College of Computing. Retrieved 2011-12-06. 


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_detection b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_detection new file mode 100644 index 00000000..9dbc457e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Anomaly_detection @@ -0,0 +1 @@ + Anomaly detection - Wikipedia, the free encyclopedia

Anomaly detection

From Wikipedia, the free encyclopedia

Anomaly detection, also referred to as outlier detection,[1] is the detection of patterns in a given data set that do not conform to an established normal behavior.[2] The patterns thus detected are called anomalies and often translate to critical and actionable information in several application domains. Anomalies are also referred to as outliers, change, deviation, surprise, aberrant, peculiarity, intrusion, etc.

In particular in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.[3]

Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference to many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.[citation needed]
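
To make the unsupervised case concrete, here is a minimal sketch of a simple statistical outlier detector based on z-scores (an illustrative addition; this particular method, its threshold and the sample data are chosen here and are not singled out by the article):

import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag points lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0   # guard against constant data
    return [x for x in values if abs(x - mean) / stdev > threshold]

# Mostly "normal" observations with one obvious anomaly; a modest threshold is
# used because the anomaly itself inflates the standard deviation of a small sample.
data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 42.0]
print(zscore_outliers(data))   # [42.0]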

Applications

Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, and detecting ecosystem disturbances. It is often used in preprocessing to remove anomalous data from the dataset. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy.[4][5]

Popular techniques

Several anomaly detection techniques have been proposed in the literature; among the popular techniques are density-based methods such as the local outlier factor (LOF).[6]

Application to data security

Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986.[7] Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing and inductive learning.[8] Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations.[9] The counterpart of anomaly detection in intrusion detection is misuse detection.

References

  1. ^ Hans-Peter Kriegel, Peer Kröger, Arthur Zimek (2009). "Outlier Detection Techniques (Tutorial)". 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009) (Bangkok, Thailand). Retrieved 2010-06-05. 
  2. ^ Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, Vol. 41(3), Article 15, July 2009
  3. ^ Dokas, Paul; Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, Pang-Ning Tan (2002). "Data mining for network intrusion detection". Proceedings NSF Workshop on Next Generation Data Mining. 
  4. ^ Ivan Tomek (1976). "An Experiment with the Edited Nearest-Neighbor Rule". IEEE Transactions on Systems, Man and Cybernetics 6. pp. 448–452. 
  5. ^ Michael R Smith and Tony Martinez (2011). "Improving Classification Accuracy by Identifying and Removing Instances that Should Be Misclassified". Proceedings of International Joint Conference on Neural Networks (IJCNN 2011). pp. 2690–2697. 
  6. ^ Breunig, M. M.; Kriegel, H. -P.; Ng, R. T.; Sander, J. (2000). "LOF: Identifying Density-based Local Outliers". ACM SIGMOD Record 29: 93. doi:10.1145/335191.335388.  edit
  7. ^ Denning, Dorothy, "An Intrusion Detection Model," Proceedings of the Seventh IEEE Symposium on Security and Privacy, May 1986, pages 119-131.
  8. ^ Teng, Henry S., Chen, Kaihu, and Lu, Stephen C-Y, "Adaptive Real-time Anomaly Detection Using Inductively Generated Sequential Patterns," 1990 IEEE Symposium on Security and Privacy
  9. ^ Jones, Anita K., and Sielken, Robert S., "Computer System Intrusion Detection: A Survey," Technical Report, Department of Computer Science, University of Virginia, Charlottesville, VA, 1999

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Apriori_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Apriori_algorithm new file mode 100644 index 00000000..e14d5a6a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Apriori_algorithm @@ -0,0 +1 @@ + Apriori algorithm - Wikipedia, the free encyclopedia

Apriori algorithm

From Wikipedia, the free encyclopedia

Apriori[1] is a classic algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

Setting

Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a Hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.

The pseudo code for the algorithm is given below for a transaction database T and a support threshold of \epsilon. Usual set theoretic notation is employed, though note that T is a multiset. C_k is the candidate set for level k. The candidate-generation step (the construction of C_k) is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate set c, which is initially assumed to be zero. Many details are omitted below; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.

\mathrm{Apriori}(T,\epsilon)
    L_1 \gets \{\mathrm{large~1\text{-}itemsets}\}
    k \gets 2
    \mathrm{\textbf{while}}~ L_{k-1} \neq \emptyset
        C_k \gets \{c = a \cup \{b\} \mid a \in L_{k-1} \wedge b \in \bigcup L_{k-1} \wedge b \notin a\}
        \mathrm{\textbf{for~transactions}}~ t \in T
            C_t \gets \{c \in C_k \mid c \subseteq t\}
            \mathrm{\textbf{for~candidates}}~ c \in C_t
                count[c] \gets count[c] + 1
        L_k \gets \{c \in C_k \mid count[c] \geq \epsilon\}
        k \gets k + 1
    \mathrm{\textbf{return}}~ \bigcup_k L_k
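
A compact Python sketch of the same procedure is given below (an illustrative addition, not part of the original article; the candidate-generation strategy is simplified and all function and variable names are chosen here):

from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([x]) for t in transactions for x in t}   # 1-item candidates
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once per level to count candidate support
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, keeping only those
        # whose k-subsets are all frequent (the downward closure lemma)
        prev = list(level)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(frozenset(s) in level for s in combinations(union, k)):
                    candidates.add(union)
        k += 1
    return frequent

The hash tree and other implementation details described above are omitted here for brevity.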

Examples

Example 1

Consider the following database, where each row is a transaction and each cell is an individual item of the transaction:

alpha beta gamma
alpha beta theta
alpha beta epsilon
alpha beta theta

The association rules that can be determined from this database are the following:

  1. 100% of sets with alpha also contain beta
  2. 25% of sets with alpha, beta also have gamma
  3. 50% of sets with alpha, beta also have theta

This can also be illustrated through a variety of examples.

Example 2

Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item: each item, such as "butter" or "bread", is identified by a numerical SKU. The supermarket has a database of transactions where each transaction is a set of SKUs that were bought together.

Let the database of transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, and {2,4}. We will use Apriori to determine the frequent item sets of this database. To do so, we will say that an item set is frequent if it appears in at least 3 transactions of the database: the value 3 is the support threshold.

The first step of Apriori is to count up the number of occurrences, called the support, of each member item separately, by scanning the database a first time. We obtain the following result

Item Support
{1} 3
{2} 6
{3} 4
{4} 5

All the itemsets of size 1 have a support of at least 3, so they are all frequent.

The next step is to generate a list of all pairs of the frequent items:

Item Support
{1,2} 3
{1,3} 1
{1,4} 2
{2,3} 3
{2,4} 4
{3,4} 3

The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3, so they are frequent. The pairs {1,3} and {1,4} are not. Now, because {1,3} and {1,4} are not frequent, any larger set which contains {1,3} or {1,4} cannot be frequent. In this way, we can prune sets: we will now look for frequent triples in the database, but we can already exclude all the triples that contain one of these two pairs:

Item Support
{2,3,4} 2

In the example, there are no frequent triplets: {2,3,4} is below the minimal threshold, and the other triplets were excluded because they were supersets of pairs that were already below the threshold.

We have thus determined the frequent sets of items in the database, and illustrated how some items were not counted because one of their subsets was already known to be below the threshold.
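
For illustration, running a sketch like the apriori function given after the pseudocode above on this database reproduces these counts (the function name is the one assumed in that sketch):

db = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
for itemset, support in sorted(apriori(db, min_support=3).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
# Prints the singletons {1},{2},{3},{4} and the pairs {1,2},{2,3},{2,4},{3,4}
# with the supports listed in the tables; no triple reaches the threshold of 3.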

Limitations

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation generates large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2^{|S|}-1 of its proper subsets.

Later algorithms such as Max-Miner[2] try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach.

References

  1. ^ Rakesh Agrawal and Ramakrishnan Srikant Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994.
  2. ^ Bayardo Jr, Roberto J. "Efficiently mining long patterns from databases." ACM Sigmod Record. Vol. 27. No. 2. ACM, 1998.

External links

  • "Implementation of the Apriori algorithm in C#"
  • ARtool, GPL Java association rule mining application with GUI, offering implementations of multiple algorithms for discovery of frequent patterns and extraction of association rules (includes Apriori)
  • SPMF: Open-source java implementations of more than 50 algorithms for frequent itemsets mining, association rule mining and sequential pattern mining. It offers Apriori and several variations such as AprioriClose, UApriori, AprioriInverse, AprioriRare, MSApriori, AprioriTID, etc., and other more efficient algorithms such as FPGrowth.

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Artificial_intelligence b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Artificial_intelligence new file mode 100644 index 00000000..24c84ddf --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Artificial_intelligence @@ -0,0 +1 @@ + Artificial intelligence - Wikipedia, the free encyclopedia

Artificial intelligence

From Wikipedia, the free encyclopedia

Artificial intelligence (AI) is technology and a branch of computer science that studies and develops intelligent machines and software. Major AI researchers and textbooks define the field as "the study and design of intelligent agents",[1] where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success.[2] John McCarthy, who coined the term in 1955,[3] defines it as "the science and engineering of making intelligent machines".[4]

AI research is highly technical and specialised, deeply divided into subfields that often fail to communicate with each other.[5] Some of the division is due to social and cultural factors: subfields have grown up around particular institutions and the work of individual researchers. AI research is also divided by several technical issues. There are subfields which are focused on the solution of specific problems, on one of several possible approaches, on the use of widely differing tools and towards the accomplishment of particular applications.

The central problems (or goals) of AI research include reasoning, knowledge, planning, learning, communication, perception and the ability to move and manipulate objects.[6] General intelligence (or "strong AI") is still among the field's long term goals.[7] Currently popular approaches include statistical methods, computational intelligence and traditional symbolic AI. There are an enormous number of tools used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others.

The field was founded on the claim that a central property of humans, intelligence—the sapience of Homo sapiens—can be so precisely described that it can be simulated by a machine.[8] This raises philosophical issues about the nature of the mind and the ethics of creating artificial beings, issues which have been addressed by myth, fiction and philosophy since antiquity.[9] Artificial intelligence has been the subject of tremendous optimism[10] but has also suffered stunning setbacks.[11] Today it has become an essential part of the technology industry and many of the most difficult problems in computer science.[12]

History

Thinking machines and artificial beings appear in Greek myths, such as Talos of Crete, the bronze robot of Hephaestus, and Pygmalion's Galatea.[13] Human likenesses believed to have intelligence were built in every major civilization: animated cult images were worshiped in Egypt and Greece[14] and humanoid automatons were built by Yan Shi, Hero of Alexandria and Al-Jazari.[15] It was also widely believed that artificial beings had been created by Jābir ibn Hayyān, Judah Loew and Paracelsus.[16] By the 19th and 20th centuries, artificial beings had become a common feature in fiction, as in Mary Shelley's Frankenstein or Karel Čapek's R.U.R. (Rossum's Universal Robots).[17] Pamela McCorduck argues that all of these are examples of an ancient urge, as she describes it, "to forge the gods".[9] Stories of these creatures and their fates discuss many of the same hopes, fears and ethical concerns that are presented by artificial intelligence.

Mechanical or "formal" reasoning has been developed by philosophers and mathematicians since antiquity. The study of logic led directly to the invention of the programmable digital electronic computer, based on the work of mathematician Alan Turing and others. Turing's theory of computation suggested that a machine, by shuffling symbols as simple as "0" and "1", could simulate any conceivable act of mathematical deduction.[18][19] This, along with concurrent discoveries in neurology, information theory and cybernetics, inspired a small group of researchers to begin to seriously consider the possibility of building an electronic brain.[20]

The field of AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956.[21] The attendees, including John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon, became the leaders of AI research for many decades.[22] They and their students wrote programs that were, to most people, simply astonishing:[23] Computers were solving word problems in algebra, proving logical theorems and speaking English.[24] By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense[25] and laboratories had been established around the world.[26] AI's founders were profoundly optimistic about the future of the new field: Herbert Simon predicted that "machines will be capable, within twenty years, of doing any work a man can do" and Marvin Minsky agreed, writing that "within a generation ... the problem of creating 'artificial intelligence' will substantially be solved".[27]

They had failed to recognize the difficulty of some of the problems they faced.[28] In 1974, in response to the criticism of Sir James Lighthill and ongoing pressure from the US Congress to fund more productive projects, both the U.S. and British governments cut off all undirected exploratory research in AI. The next few years would later be called an "AI winter",[29] a period when funding for AI projects was hard to find.

In the early 1980s, AI research was revived by the commercial success of expert systems,[30] a form of AI program that simulated the knowledge and analytical skills of one or more human experts. By 1985 the market for AI had reached over a billion dollars. At the same time, Japan's fifth generation computer project inspired the U.S and British governments to restore funding for academic research in the field.[31] However, beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, longer lasting AI winter began.[32]

In the 1990s and early 21st century, AI achieved its greatest successes, albeit somewhat behind the scenes. Artificial intelligence is used for logistics, data mining, medical diagnosis and many other areas throughout the technology industry.[12] The success was due to several factors: the increasing computational power of computers (see Moore's law), a greater emphasis on solving specific subproblems, the creation of new ties between AI and other fields working on similar problems, and a new commitment by researchers to solid mathematical methods and rigorous scientific standards.[33]

On 11 May 1997, Deep Blue became the first computer chess-playing system to beat a reigning world chess champion, Garry Kasparov.[34] In 2005, a Stanford robot won the DARPA Grand Challenge by driving autonomously for 131 miles along an unrehearsed desert trail.[35] Two years later, a team from CMU won the DARPA Urban Challenge when their vehicle autonomously navigated 55 miles in an urban environment while responding to traffic hazards and adhering to all traffic laws.[36] In February 2011, in a Jeopardy! quiz show exhibition match, IBM's question answering system, Watson, defeated the two greatest Jeopardy! champions, Brad Rutter and Ken Jennings, by a significant margin.[37] The Kinect, which provides a 3D body–motion interface for the Xbox 360, uses algorithms that emerged from lengthy AI research,[38] as does the iPhone's Siri.

Goals

The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention.[6]

Deduction, reasoning, problem solving

Early AI researchers developed algorithms that imitated the step-by-step reasoning that humans use when they solve puzzles or make logical deductions.[39] By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.[40]

For difficult problems, many of these algorithms require enormous computational resources – most experience a "combinatorial explosion": the amount of memory or computer time required becomes astronomical once the problem grows beyond a certain size. The search for more efficient problem-solving algorithms is a high priority for AI research.[41]
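
As a rough illustration of the combinatorial explosion (the example and numbers are not drawn from the article's sources), the following Python sketch counts how many candidate orderings a naive exhaustive search would have to examine as a problem grows:

    # Illustrative sketch: the number of orderings a brute-force search of n items
    # must examine grows factorially, which is the combinatorial explosion above.
    import math

    def count_candidate_orderings(n):
        """Orderings a naive exhaustive search would have to examine."""
        return math.factorial(n)

    for n in (5, 10, 15, 20):
        print(n, "items ->", count_candidate_orderings(n), "candidate orderings")
    # 20 items already yield roughly 2.4e18 orderings, far beyond exhaustive search.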

Human beings solve most of their problems using fast, intuitive judgements rather than the conscious, step-by-step deduction that early AI research was able to model.[42] AI has made some progress at imitating this kind of "sub-symbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning; neural net research attempts to simulate the structures inside the brain that give rise to this skill; statistical approaches to AI mimic the probabilistic nature of the human ability to guess.

Knowledge representation[edit]

An ontology represents knowledge as a set of concepts within a domain and the relationships between those concepts.

Knowledge representation[43] and knowledge engineering[44] are central to AI research. Many of the problems machines are expected to solve will require extensive knowledge about the world. Among the things that AI needs to represent are: objects, properties, categories and relations between objects;[45] situations, events, states and time;[46] causes and effects;[47] knowledge about knowledge (what we know about what other people know);[48] and many other, less well researched domains. A representation of "what exists" is an ontology: the set of objects, relations, concepts and so on that the machine knows about. The most general are called upper ontologies, which attempt to provide a foundation for all other knowledge.[49]

Among the most difficult problems in knowledge representation are:

Default reasoning and the qualification problem
Many of the things people know take the form of "working assumptions." For example, if a bird comes up in conversation, people typically picture an animal that is fist-sized, sings, and flies. None of these things are true about all birds. John McCarthy identified this problem in 1969[50] as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions. Almost nothing is simply true or false in the way that abstract logic requires. AI research has explored a number of solutions to this problem.[51] A small illustrative sketch of default reasoning with exceptions appears after this list.
The breadth of commonsense knowledge
The number of atomic facts that the average person knows is astronomical. Research projects that attempt to build a complete knowledge base of commonsense knowledge (e.g., Cyc) require enormous amounts of laborious ontological engineering — they must be built, by hand, one complicated concept at a time.[52] A major goal is to have the computer understand enough concepts to be able to learn by reading from sources like the internet, and thus be able to add to its own ontology.[citation needed]
The subsymbolic form of some commonsense knowledge
Much of what people know is not represented as "facts" or "statements" that they could express verbally. For example, a chess master will avoid a particular chess position because it "feels too exposed"[53] or an art critic can take one look at a statue and instantly realize that it is a fake.[54] These are intuitions or tendencies that are represented in the brain non-consciously and sub-symbolically.[55] Knowledge like this informs, supports and provides a context for symbolic, conscious knowledge. As with the related problem of sub-symbolic reasoning, it is hoped that situated AI, computational intelligence, or statistical AI will provide ways to represent this kind of knowledge.[55]
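
A minimal sketch of the default-reasoning idea from the first item above: a rule holds "by default" unless an explicit exception overrides it. The categories and traits below are invented for illustration.

    # Illustrative only: a default rule ("birds fly, are fist-sized, sing") plus an
    # explicit exception list, showing why commonsense rules need qualification.
    DEFAULTS = {"bird": {"can_fly": True, "size": "fist-sized", "sings": True}}
    EXCEPTIONS = {"penguin": {"can_fly": False},
                  "ostrich": {"can_fly": False, "size": "large"}}

    def describe(kind, base="bird"):
        """Start from the default assumptions and override with known exceptions."""
        traits = dict(DEFAULTS[base])
        traits.update(EXCEPTIONS.get(kind, {}))
        return traits

    print(describe("sparrow"))   # defaults apply: flies, fist-sized, sings
    print(describe("penguin"))   # the exception overrides the flying default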

Planning[edit]

A hierarchical control system is a form of control system in which a set of devices and governing software is arranged in a hierarchy.

Intelligent agents must be able to set goals and achieve them.[56] They need a way to visualize the future (they must have a representation of the state of the world and be able to make predictions about how their actions will change it) and be able to make choices that maximize the utility (or "value") of the available choices.[57]

In classical planning problems, the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be.[58] However, if the agent is not the only actor, it must periodically ascertain whether the world matches its predictions and it must change its plan as this becomes necessary, requiring the agent to reason under uncertainty.[59]
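
A minimal sketch of the classical planning setting just described, treated as search over world states. The toy domain, action names and preconditions are invented for illustration; real planners use far more sophisticated representations.

    # Classical planning as state-space search (toy domain): the agent assumes it
    # is the only actor and that actions have deterministic effects.
    from collections import deque

    ACTIONS = {
        "pick_up_key":  (frozenset(),              frozenset({"has_key"})),
        "unlock_door":  (frozenset({"has_key"}),   frozenset({"door_open"})),
        "walk_through": (frozenset({"door_open"}), frozenset({"in_room"})),
    }  # action -> (preconditions, effects)

    def plan(start, goal):
        """Breadth-first search for a sequence of actions reaching the goal."""
        frontier = deque([(frozenset(start), [])])
        seen = {frozenset(start)}
        while frontier:
            state, steps = frontier.popleft()
            if goal <= state:
                return steps
            for name, (pre, add) in ACTIONS.items():
                if pre <= state:
                    nxt = state | add
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, steps + [name]))
        return None

    print(plan(set(), {"in_room"}))  # ['pick_up_key', 'unlock_door', 'walk_through']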

Multi-agent planning uses the cooperation and competition of many agents to achieve a given goal. Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.[60]

Learning[edit]

Machine learning is the study of computer algorithms that improve automatically through experience[61][62] and has been central to AI research since the field's inception.[63]

Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change. In reinforcement learning[64] the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.[65]
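
As a minimal illustration of the reinforcement-learning setting described above (the action names and reward probabilities are invented for the example), the following Python sketch shows an agent that learns the value of each action purely from rewards and punishments:

    # Illustrative toy "bandit" problem: an epsilon-greedy agent estimates action
    # values from rewards alone, trading off exploration against exploitation.
    import random

    TRUE_REWARD_PROB = {"a": 0.2, "b": 0.5, "c": 0.8}   # hidden from the agent
    values = {a: 0.0 for a in TRUE_REWARD_PROB}          # learned value estimates
    counts = {a: 0 for a in TRUE_REWARD_PROB}
    epsilon = 0.1                                        # exploration rate

    for step in range(5000):
        if random.random() < epsilon:
            action = random.choice(list(values))         # explore
        else:
            action = max(values, key=values.get)         # exploit the best estimate
        reward = 1.0 if random.random() < TRUE_REWARD_PROB[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # running mean

    print(values)  # the estimate for "c" should approach 0.8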

Within developmental robotics, developmental learning approaches have been elaborated for the lifelong, cumulative acquisition of repertoires of novel skills by a robot, through autonomous self-exploration and social interaction with human teachers, and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.[66][67][68][69]

Natural language processing[edit]

A parse tree represents the syntactic structure of a sentence according to some formal grammar.

Natural language processing[70] gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural language processing system would enable natural language user interfaces and the acquisition of knowledge directly from human-written sources, such as Internet texts. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.[71]

A common method of processing and extracting meaning from natural language is semantic indexing. Increases in processing speed and the falling cost of data storage have made it much more efficient to index large volumes of abstractions of the user's input.
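
A highly simplified illustration of indexing text for retrieval follows; the documents are invented, and real semantic indexing uses much richer representations than the raw term counts shown here.

    # Toy inverted index: map each term to the documents containing it, then rank
    # documents by how strongly they match the query terms.
    from collections import Counter, defaultdict

    documents = {
        "doc1": "machines translate natural language",
        "doc2": "robots use sensors to perceive the world",
    }

    index = defaultdict(dict)                 # term -> {document: term frequency}
    for name, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][name] = freq

    def search(query):
        """Rank documents by the total frequency of the query terms they contain."""
        scores = Counter()
        for term in query.lower().split():
            for name, freq in index.get(term, {}).items():
                scores[name] += freq
        return scores.most_common()

    print(search("natural language"))         # [('doc1', 2)]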

Motion and manipulation[edit]

The field of robotics[72] is closely related to AI. Intelligence is required for robots to be able to handle such tasks as object manipulation[73] and navigation, with sub-problems of localization (knowing where you are, or finding out where other things are), mapping (learning what is around you, building a map of the environment), and motion planning (figuring out how to get there) or path planning (going from one point in space to another point, which may involve compliant motion - where the robot moves while maintaining physical contact with an object).[74][75]

Perception[edit]

Machine perception[76] is the ability to use input from sensors (such as cameras, microphones, sonar and other, more exotic sensors) to deduce aspects of the world. Computer vision[77] is the ability to analyze visual input. A few selected subproblems are speech recognition,[78] facial recognition and object recognition.[79]

Social intelligence[edit]

Kismet, a robot with rudimentary social skills[80]

Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects.[81][82] It is an interdisciplinary field spanning computer sciences, psychology, and cognitive science.[83] While the origins of the field may be traced as far back as to early philosophical inquiries into emotion,[84] the more modern branch of computer science originated with Rosalind Picard's 1995 paper[85] on affective computing.[86][87] A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behaviour to them, giving an appropriate response for those emotions.

Emotion and social skills[88] play two roles for an intelligent agent. First, the agent must be able to predict the actions of others, by understanding their motives and emotional states. (This involves elements of game theory and decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.) Second, in an effort to facilitate human-computer interaction, an intelligent machine might want to be able to display emotions—even if it does not actually experience them itself—in order to appear sensitive to the emotional dynamics of human interaction.

Creativity[edit]

A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative, or systems that identify and assess creativity). Related areas of computational research are Artificial intuition and Artificial imagination.

General intelligence[edit]

Most researchers think that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them.[7] A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project.[89][90]

Many of the problems above may require general intelligence to be considered solved. For example, even a straightforward, specific task like machine translation requires that the machine read and write in both languages (NLP), follow the author's argument (reason), know what is being talked about (knowledge), and faithfully reproduce the author's intention (social intelligence). A problem like machine translation is considered "AI-complete": to solve this particular problem, a machine must solve all of these problems.[91]

Approaches[edit]

There is no established unifying theory or paradigm that guides AI research. Researchers disagree about many issues.[92] A few of the most long-standing questions that have remained unanswered are these: should artificial intelligence simulate natural intelligence by studying psychology or neurology? Or is human biology as irrelevant to AI research as bird biology is to aeronautical engineering?[93] Can intelligent behavior be described using simple, elegant principles (such as logic or optimization)? Or does it necessarily require solving a large number of completely unrelated problems?[94] Can intelligence be reproduced using high-level symbols, similar to words and ideas? Or does it require "sub-symbolic" processing?[95] John Haugeland, who coined the term GOFAI (Good Old-Fashioned Artificial Intelligence), also proposed that AI should more properly be referred to as synthetic intelligence,[96] a term which has since been adopted by some non-GOFAI researchers.[97][98]

Cybernetics and brain simulation[edit]

In the 1940s and 1950s, a number of researchers explored the connection between neurology, information theory, and cybernetics. Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns Hopkins Beast. Many of these researchers gathered for meetings of the Teleological Society at Princeton University and the Ratio Club in England.[20] By 1960, this approach was largely abandoned, although elements of it would be revived in the 1980s.

Symbolic[edit]

When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. The research was centered in three institutions: Carnegie Mellon University, Stanford and MIT, and each one developed its own style of research. John Haugeland named these approaches to AI "good old fashioned AI" or "GOFAI".[99] During the 1960s, symbolic approaches achieved great success at simulating high-level thinking in small demonstration programs. Approaches based on cybernetics or neural networks were abandoned or pushed into the background.[100] Researchers in the 1960s and the 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the goal of their field.

Cognitive simulation
Economist Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques that people used to solve problems. This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the middle 1980s.[101][102]
Logic-based
Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms.[93] His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning.[103] Logic was also the focus of the work at the University of Edinburgh and elsewhere in Europe which led to the development of the programming language Prolog and the science of logic programming.[104]
"Anti-logic" or "scruffy"
Researchers at MIT (such as Marvin Minsky and Seymour Papert)[105] found that solving difficult problems in vision and natural language processing required ad-hoc solutions – they argued that there was no simple and general principle (like logic) that would capture all the aspects of intelligent behavior. Roger Schank described their "anti-logic" approaches as "scruffy" (as opposed to the "neat" paradigms at CMU and Stanford).[94] Commonsense knowledge bases (such as Doug Lenat's Cyc) are an example of "scruffy" AI, since they must be built by hand, one complicated concept at a time.[106]
Knowledge-based
When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications.[107] This "knowledge revolution" led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software.[30] The knowledge revolution was also driven by the realization that enormous amounts of knowledge would be required by many simple AI applications.

Sub-symbolic[edit]

By the 1980s progress in symbolic AI seemed to stall and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. A number of researchers began to look into "sub-symbolic" approaches to specific AI problems.[95]

Bottom-up, embodied, situated, behavior-based or nouvelle AI
Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focused on the basic engineering problems that would allow robots to move and survive.[108] Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 1950s and reintroduced the use of control theory in AI. This coincided with the development of the embodied mind thesis in the related field of cognitive science: the idea that aspects of the body (such as movement, perception and visualization) are required for higher intelligence.
Computational Intelligence
Interest in neural networks and "connectionism" was revived by David Rumelhart and others in the middle 1980s.[109] These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence.[110]

Statistical[edit]

In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes. The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research). Stuart Russell and Peter Norvig describe this movement as nothing less than a "revolution" and "the victory of the neats."[33] Critics argue that these techniques are too focused on particular problems and have failed to address the long term goal of general intelligence.[111] There is an ongoing debate about the relevance and validity of statistical approaches in AI, exemplified in part by exchanges between Peter Norvig and Noam Chomsky.[112][113]

Integrating the approaches[edit]

Intelligent agent paradigm
An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. The simplest intelligent agents are programs that solve specific problems. More complicated agents include human beings and organizations of human beings (such as firms). The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach. An agent that solves a specific problem can use any approach that works – some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches. The paradigm also gives researchers a common language to communicate with other fields—such as decision theory and economics—that also use concepts of abstract agents. The intelligent agent paradigm became widely accepted during the 1990s.[2]
Agent architectures and cognitive architectures
Researchers have designed systems to build intelligent systems out of interacting intelligent agents in a multi-agent system.[114] A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling.[115] Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system.[116]

Tools[edit]

In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. A few of the most general of these methods are discussed below.

Search and optimization[edit]

Many problems in AI can be solved in theory by intelligently searching through many possible solutions:[117] Reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule.[118] Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis.[119] Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.[73] Many learning algorithms use search algorithms based on optimization.

Simple exhaustive searches[120] are rarely sufficient for most real world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that eliminate choices that are unlikely to lead to the goal (called "pruning the search tree"). Heuristics supply the program with a "best guess" for the path on which the solution lies.[121] Heuristics limit the search for solutions to a smaller portion of the search space.[74]
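
A minimal sketch of heuristic ("informed") search on an invented grid world: the A* algorithm uses the Manhattan distance to the goal as its "best guess", so unpromising paths are expanded later or not at all. The grid, costs and coordinates are made up for the example.

    # A* search on a small grid; '#' marks an obstacle, every move costs 1.
    import heapq

    GRID = ["....#",
            ".##.#",
            "....."]

    def neighbors(cell):
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
                yield (nr, nc)

    def a_star(start, goal):
        heuristic = lambda cell: abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
        frontier = [(heuristic(start), 0, start, [start])]   # (estimate, cost, cell, path)
        best_cost = {start: 0}
        while frontier:
            _, cost, cell, path = heapq.heappop(frontier)
            if cell == goal:
                return path
            for nxt in neighbors(cell):
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    heapq.heappush(frontier,
                                   (new_cost + heuristic(nxt), new_cost, nxt, [*path, nxt]))
        return None

    print(a_star((0, 0), (2, 4)))   # a shortest obstacle-free path from corner to corner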

A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, we keep moving our guess uphill, until we reach the top. Other optimization algorithms are simulated annealing, beam search and random optimization.[122]
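
A minimal sketch of the hill-climbing idea just described, using an invented one-dimensional objective with a single peak; the step size and iteration limit are arbitrary choices for illustration.

    # Local search by hill climbing: start from a random guess and keep taking
    # uphill steps until no neighbouring guess improves the objective.
    import random

    def objective(x):
        return -(x - 3.0) ** 2        # a single peak at x = 3

    def hill_climb(step=0.1, iterations=1000):
        x = random.uniform(-10, 10)   # random starting point on the "landscape"
        for _ in range(iterations):
            best = max((x, x + step, x - step), key=objective)
            if best == x:             # no neighbour is better: a local optimum
                break
            x = best
        return x

    print(round(hill_climb(), 1))     # close to 3.0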

Evolutionary computation uses a form of optimization search. For example, an evolutionary algorithm may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization)[123] and evolutionary algorithms (such as genetic algorithms, gene expression programming, and genetic programming).[124]
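
A minimal sketch of a genetic algorithm on the standard "count the ones" toy problem; the genome length, population size, mutation rate and number of generations are arbitrary choices made for illustration.

    # A population of bit-strings mutates and recombines; only the fittest survive
    # each generation. Fitness simply counts the ones in the genome ("OneMax").
    import random

    LENGTH, POP_SIZE, GENERATIONS = 20, 30, 60

    def fitness(genome):
        return sum(genome)                       # more ones is fitter

    def crossover(a, b):
        cut = random.randrange(1, LENGTH)
        return a[:cut] + b[cut:]

    def mutate(genome, rate=0.02):
        return [1 - g if random.random() < rate else g for g in genome]

    population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 2]     # selection: only the fittest reproduce
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children

    print(max(fitness(g) for g in population))   # approaches 20 as evolution proceeds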

Logic[edit]

Logic[125] is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning[126] and inductive logic programming is a method for learning.[127]

Several different forms of logic are used in AI research. Propositional or sentential logic[128] is the logic of statements which can be true or false. First-order logic[129] also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic[130] is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. Subjective logic[131] models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence.
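
A small worked illustration of these two ideas follows. It uses the common min/max convention for fuzzy conjunction and disjunction (one choice among several), and invented degrees of truth and opinion values.

    # Fuzzy logic assigns truth values between 0 and 1; a subjective-logic binomial
    # opinion keeps belief, disbelief and uncertainty that sum to 1.
    def fuzzy_and(a, b):
        return min(a, b)

    def fuzzy_or(a, b):
        return max(a, b)

    tall = 0.7            # "the person is tall" to degree 0.7
    heavy = 0.4           # "the person is heavy" to degree 0.4
    print(fuzzy_and(tall, heavy), fuzzy_or(tall, heavy))    # 0.4 0.7

    opinion = {"belief": 0.6, "disbelief": 0.1, "uncertainty": 0.3}
    assert abs(sum(opinion.values()) - 1.0) < 1e-9          # the defining constraint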

Default logics, non-monotonic logics and circumscription[51] are forms of logic designed to help with default reasoning and the qualification problem. Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics;[45] situation calculus, event calculus and fluent calculus (for representing events and time);[46] causal calculus;[47] belief calculus; and modal logics.[48]

Probabilistic methods for uncertain reasoning[edit]

Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. AI researchers have devised a number of powerful tools to solve these problems using methods from probability theory and economics.[132]

Bayesian networks[133] are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm),[134] learning (using the expectation-maximization algorithm),[135] planning (using decision networks)[136] and perception (using dynamic Bayesian networks).[137] Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models or Kalman filters).[137]
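
The core calculation behind such probabilistic reasoning can be illustrated with a single application of Bayes' rule; the probabilities below are invented for the example and do not come from the article's sources.

    # Updating a belief from uncertain evidence with Bayes' rule.
    prior_disease = 0.01                 # P(disease)
    p_pos_given_disease = 0.95           # P(positive test | disease)
    p_pos_given_healthy = 0.05           # P(positive test | no disease)

    # P(positive) by the law of total probability
    p_positive = (p_pos_given_disease * prior_disease
                  + p_pos_given_healthy * (1 - prior_disease))

    # Bayes' rule: P(disease | positive)
    posterior = p_pos_given_disease * prior_disease / p_positive
    print(round(posterior, 3))           # about 0.161: the evidence shifts, but does not settle, the belief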

A key concept from the science of economics is "utility": a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory, decision analysis[138] and information value theory.[57] These tools include models such as Markov decision processes,[139] dynamic decision networks,[137] game theory and mechanism design.[140]
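
A minimal sketch of one of these models, a Markov decision process solved by value iteration; the states, actions, rewards and transition probabilities are invented for illustration.

    # Value iteration: repeatedly apply the Bellman update until the utility of
    # each state converges; the agent then prefers the action with highest value.
    states = ["cool", "overheated"]
    actions = ["fast", "slow"]
    # transitions[state][action] = list of (probability, next_state, reward)
    transitions = {
        "cool": {"slow": [(1.0, "cool", 1.0)],
                 "fast": [(0.5, "cool", 2.0), (0.5, "overheated", -10.0)]},
        "overheated": {"slow": [(1.0, "overheated", 0.0)],
                       "fast": [(1.0, "overheated", 0.0)]},
    }
    gamma = 0.9                                    # discount factor

    V = {s: 0.0 for s in states}
    for _ in range(100):                           # iterate the Bellman update
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
                    for a in actions)
             for s in states}

    print({s: round(v, 2) for s, v in V.items()})  # "slow" turns out to be optimal in "cool"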

Classifiers and statistical learning methods[edit]

The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.[141]

A classifier can be trained in various ways; there are many statistical and machine learning approaches. The most widely used classifiers are the neural network,[142] kernel methods such as the support vector machine,[143] k-nearest neighbor algorithm,[144] Gaussian mixture model,[145] naive Bayes classifier,[146] and decision tree.[147] The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. Determining a suitable classifier for a given problem is still more an art than a science.[148]
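
A minimal sketch of one of the classifiers listed above, the k-nearest neighbor algorithm, on a handful of made-up two-dimensional observations (the labels echo the "if shiny then diamond" example in the previous paragraph).

    # k-nearest-neighbour classification: label a new observation by majority vote
    # among its k closest labelled examples.
    from collections import Counter
    import math

    data_set = [((1.0, 1.0), "diamond"), ((1.2, 0.8), "diamond"),
                ((5.0, 5.0), "glass"),   ((4.8, 5.2), "glass")]

    def classify(observation, k=3):
        by_distance = sorted(data_set, key=lambda item: math.dist(observation, item[0]))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    print(classify((1.1, 0.9)))   # 'diamond': the nearest examples share that label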

Neural networks[edit]

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.

The study of artificial neural networks[142] began in the decade before the field of AI research was founded, in the work of Walter Pitts and Warren McCulloch. Other important early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who developed the backpropagation algorithm.[149]

The main categories of networks are acyclic or feedforward neural networks (where the signal passes in only one direction) and recurrent neural networks (which allow feedback). Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks.[150] Among recurrent networks, the most famous is the Hopfield net, a form of attractor network, which was first described by John Hopfield in 1982.[151] Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning and competitive learning.[152]
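
A minimal sketch of the simplest feedforward unit, a single perceptron trained with the classic perceptron learning rule on the linearly separable AND function; the learning rate and number of passes are arbitrary choices for illustration.

    # A single perceptron: a weighted sum followed by a threshold, with weights
    # nudged toward the target output whenever the prediction is wrong.
    examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    weights, bias, rate = [0.0, 0.0], 0.0, 0.1

    def predict(x):
        activation = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if activation > 0 else 0

    for _ in range(20):                       # a few passes over the data suffice here
        for x, target in examples:
            error = target - predict(x)
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error

    print([predict(x) for x, _ in examples])  # [0, 0, 0, 1]: the AND function is learned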

Hierarchical temporal memory is an approach that models some of the structural and algorithmic properties of the neocortex.[153]

Control theory[edit]

Control theory, the grandchild of cybernetics, has many important applications, especially in robotics.[154]

Languages[edit]

AI researchers have developed several specialized languages for AI research, including Lisp[155] and Prolog.[156]

Evaluating progress[edit]

In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent now known as the Turing test. This procedure allows almost all the major problems of artificial intelligence to be tested. However, it is a very difficult challenge and at present all agents fail.[157]

Artificial intelligence can also be evaluated on specific problems such as small problems in chemistry, handwriting recognition and game-playing. Such tests have been termed subject matter expert Turing tests. Smaller problems provide more achievable goals and there are an ever-increasing number of positive results.[158]

One classification for outcomes of an AI test is:[159]

  1. Optimal: it is not possible to perform better.
  2. Strong super-human: performs better than all humans.
  3. Super-human: performs better than most humans.
  4. Sub-human: performs worse than most humans.

For example, performance at draughts is optimal,[160] performance at chess is super-human and nearing strong super-human (see computer chess: computers versus human) and performance at many everyday tasks (such as recognizing a face or crossing a room without bumping into something) is sub-human.

A quite different approach measures machine intelligence through tests which are developed from mathematical definitions of intelligence. Examples of these kinds of tests began in the late 1990s, when intelligence tests were devised using notions from Kolmogorov complexity and data compression.[161] Two major advantages of mathematical definitions are their applicability to nonhuman intelligences and their absence of a requirement for human testers.

An area to which artificial intelligence has contributed greatly is intrusion detection.[162]

Applications[edit]

An automated online assistant providing customer service on a web page – one of many very primitive applications of artificial intelligence.

Artificial intelligence techniques are pervasive and are too numerous to list. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is described as the AI effect.[163]

Competitions and prizes[edit]

There are a number of competitions and prizes to promote research in artificial intelligence. The main areas promoted are: general machine intelligence, conversational behavior, data-mining, robotic cars, robot soccer and games.

Platforms[edit]

A platform (or "computing platform") is defined as "some sort of hardware architecture or software framework (including application frameworks), that allows software to run." As Rodney Brooks[164] pointed out many years ago, it is not just the artificial intelligence software that defines the AI features of the platform, but rather the actual platform itself that affects the AI that results, i.e., there needs to be work in AI problems on real-world platforms rather than in isolation.

A wide variety of platforms has allowed different aspects of AI to develop, ranging from expert systems (albeit PC-based, but still entire real-world systems) to various robot platforms such as the widely available Roomba with its open interface.[165]

Philosophy[edit]

Artificial intelligence, by claiming to be able to recreate the capabilities of the human mind, is both a challenge and an inspiration for philosophy. Are there limits to how intelligent machines can be? Is there an essential difference between human intelligence and artificial intelligence? Can a machine have a mind and consciousness? A few of the most influential answers to these questions are given below.[166]

Turing's "polite convention"
We need not decide if a machine can "think"; we need only decide if a machine can act as intelligently as a human being. This approach to the philosophical problems associated with artificial intelligence forms the basis of the Turing test.[157]
The Dartmouth proposal
"Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it." This conjecture was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers.[167]
Newell and Simon's physical symbol system hypothesis
"A physical symbol system has the necessary and sufficient means of general intelligent action." Newell and Simon argue that intelligences consist of formal operations on symbols.[168] Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a "feel" for the situation rather than explicit symbolic knowledge. (See Dreyfus' critique of AI.)[169][170]
Gödel's incompleteness theorem
A formal system (such as a computer program) cannot prove all true statements.[171] Roger Penrose is among those who claim that Gödel's theorem limits what machines can do. (See The Emperor's New Mind.)[172]
Searle's strong AI hypothesis
"The appropriately programmed computer with the right inputs and outputs would thereby have a mind in exactly the same sense human beings have minds."[173] John Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the "mind" might be.[174]
The artificial brain argument
The brain can be simulated. Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original.[90]

Predictions and ethics[edit]

Artificial Intelligence is a common topic in both science fiction and projections about the future of technology and society. The existence of an artificial intelligence that rivals human intelligence raises difficult ethical issues, and the potential power of the technology inspires both hopes and fears.

In fiction, artificial intelligence has appeared in many roles.

Mary Shelley's Frankenstein considers a key issue in the ethics of artificial intelligence: if a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? The idea also appears in modern science fiction, including the films I, Robot, Blade Runner and A.I.: Artificial Intelligence, in which humanoid machines have the ability to feel human emotions. This issue, now known as "robot rights", is currently being considered by, for example, California's Institute for the Future, although many critics believe that the discussion is premature.[175] The subject is discussed in depth in the 2010 documentary film Plug & Pray.[176]

Martin Ford, author of The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future,[177] and others argue that specialized artificial intelligence applications, robotics and other forms of automation will ultimately result in significant unemployment as machines begin to match and exceed the capability of workers to perform most routine and repetitive jobs. Ford predicts that many knowledge-based occupations—and in particular entry level jobs—will be increasingly susceptible to automation via expert systems, machine learning[178] and other AI-enhanced applications. AI-based applications may also be used to amplify the capabilities of low-wage offshore workers, making it more feasible to outsource knowledge work.[179]

Joseph Weizenbaum wrote that AI applications cannot, by definition, successfully simulate genuine human empathy and that the use of AI technology in fields such as customer service or psychotherapy[180] was deeply misguided. Weizenbaum was also bothered that AI researchers (and some philosophers) were willing to view the human mind as nothing more than a computer program (a position now known as computationalism). To Weizenbaum these points suggest that AI research devalues human life.[181]

Many futurists believe that artificial intelligence will ultimately transcend the limits of progress. Ray Kurzweil has used Moore's law (which describes the relentless exponential improvement in digital technology) to calculate that desktop computers will have the same processing power as human brains by the year 2029. He also predicts that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science fiction writer Vernor Vinge named the "singularity".[182]

Robot designer Hans Moravec, cyberneticist Kevin Warwick and inventor Ray Kurzweil have predicted that humans and machines will merge in the future into cyborgs that are more capable and powerful than either.[183] This idea, called transhumanism, which has roots in Aldous Huxley and Robert Ettinger, has been illustrated in fiction as well, for example in the manga Ghost in the Shell and the science-fiction series Dune. In the 1980s, artist Hajime Sorayama's Sexy Robots series was painted and published in Japan, depicting the actual organic human form with lifelike muscular metallic skins; his later book The Gynoids was used by or influenced movie makers including George Lucas and other creatives. Sorayama never considered these organic robots to be a real part of nature but always an unnatural product of the human mind, a fantasy existing in the mind even when realized in actual form. Almost 20 years later, the first AI robotic pet (AIBO) became available as a companion to people. AIBO grew out of Sony's Computer Science Laboratory (CSL). Famed engineer Dr. Toshitada Doi is credited as AIBO's original progenitor: in 1994 he had started work on robots with artificial intelligence expert Masahiro Fujita within CSL of Sony. Doi's friend, the artist Hajime Sorayama, was enlisted to create the initial designs for AIBO's body. Those designs are now part of the permanent collections of the Museum of Modern Art and the Smithsonian Institution, and later versions of AIBO have been used in studies at Carnegie Mellon University. In 2006, AIBO was added to Carnegie Mellon University's "Robot Hall of Fame".

Political scientist Charles T. Rubin believes that AI can be neither designed nor guaranteed to be friendly.[184] He argues that "any sufficiently advanced benevolence may be indistinguishable from malevolence." Humans should not assume machines or robots would treat us favorably, because there is no a priori reason to believe that they would be sympathetic to our system of morality, which has evolved along with our particular biology (which AIs would not share).

Edward Fredkin argues that "artificial intelligence is the next stage in evolution", an idea first proposed in Samuel Butler's "Darwin among the Machines" (1863), and expanded upon by George Dyson in his book of the same name in 1998.[185]

See also[edit]

References[edit]

Notes[edit]

  1. ^ Definition of AI as the study of intelligent agents:
  2. ^ a b The intelligent agent paradigm: The definition used in this article, in terms of goals, actions, perception and environment, is due to Russell & Norvig (2003). Other definitions also include knowledge and learning as additional criteria.
  3. ^ Although there is some controversy on this point (see Crevier (1993, p. 50)), McCarthy states unequivocally "I came up with the term" in a c|net interview. (Skillings 2006) McCarthy first used the term in the proposal for the Dartmouth conference, which appeared in 1955. (McCarthy et al. 1955)
  4. ^ McCarthy's definition of AI:
  5. ^ Pamela McCorduck (2004, pp. 424) writes of "the rough shattering of AI in subfields—vision, natural language, decision theory, genetic algorithms, robotics ... and these with their own sub-subfield—that would hardly have anything to say to each other."
  6. ^ a b This list of intelligent traits is based on the topics covered by the major AI textbooks, including:
  7. ^ a b General intelligence (strong AI) is discussed in popular introductions to AI:
  8. ^ See the Dartmouth proposal, under Philosophy, below.
  9. ^ a b This is a central idea of Pamela McCorduck's Machines Who Think. She writes: "I like to think of artificial intelligence as the scientific apotheosis of a venerable cultural tradition." (McCorduck 2004, p. 34) "Artificial intelligence in one form or another is an idea that has pervaded Western intellectual history, a dream in urgent need of being realized." (McCorduck 2004, p. xviii) "Our history is full of attempts—nutty, eerie, comical, earnest, legendary and real—to make artificial intelligences, to reproduce what is the essential us—bypassing the ordinary means. Back and forth between myth and reality, our imaginations supplying what our workshops couldn't, we have engaged for a long time in this odd form of self-reproduction." (McCorduck 2004, p. 3) She traces the desire back to its Hellenistic roots and calls it the urge to "forge the Gods." (McCorduck 2004, pp. 340–400)
  10. ^ The optimism referred to includes the predictions of early AI researchers (see optimism in the history of AI) as well as the ideas of modern transhumanists such as Ray Kurzweil.
  11. ^ The "setbacks" referred to include the ALPAC report of 1966, the abandonment of perceptrons in 1970, the Lighthill Report of 1973 and the collapse of the Lisp machine market in 1987.
  12. ^ a b AI applications widely used behind the scenes:
  13. ^ AI in myth:
  14. ^ Cult images as artificial intelligence: These were the first machines to be believed to have true intelligence and consciousness. Hermes Trismegistus expressed the common belief that with these statues, craftsman had reproduced "the true nature of the gods", their sensus and spiritus. McCorduck makes the connection between sacred automatons and Mosaic law (developed around the same time), which expressly forbids the worship of robots (McCorduck 2004, pp. 6–9)
  15. ^ Humanoid automata:
    Yan Shi:
    Hero of Alexandria: Al-Jazari: Wolfgang von Kempelen:
  16. ^ Artificial beings:
    Jābir ibn Hayyān's Takwin:
    Judah Loew's Golem: Paracelsus' Homunculus:
  17. ^ AI in early science fiction.
  18. ^ This insight, that digital computers can simulate any process of formal reasoning, is known as the Church–Turing thesis.
  19. ^ Formal reasoning:
  20. ^ a b AI's immediate precursors: See also Cybernetics and early neural networks (in History of artificial intelligence). Among the researchers who laid the foundations of AI were Alan Turing, John von Neumann, Norbert Wiener, Claude Shannon, Warren McCulloch, Walter Pitts and Donald Hebb.
  21. ^ Dartmouth conference:
    • McCorduck 2004, pp. 111–136
    • Crevier 1993, pp. 47–49, who writes "the conference is generally recognized as the official birthdate of the new science."
    • Russell & Norvig 2003, p. 17, who call the conference "the birth of artificial intelligence."
    • NRC 1999, pp. 200–201
  22. ^ Hegemony of the Dartmouth conference attendees:
  23. ^ Russell and Norvig write "it was astonishing whenever a computer did anything kind of smartish." Russell & Norvig 2003, p. 18
  24. ^ "Golden years" of AI (successful symbolic reasoning programs 1956–1973): The programs described are Daniel Bobrow's STUDENT, Newell and Simon's Logic Theorist and Terry Winograd's SHRDLU.
  25. ^ DARPA pours money into undirected pure research into AI during the 1960s:
  26. ^ AI in England:
  27. ^ Optimism of early AI:
  28. ^ See The problems (in History of artificial intelligence)
  29. ^ First AI Winter, Mansfield Amendment, Lighthill report
  30. ^ a b Expert systems:
  31. ^ Boom of the 1980s: rise of expert systems, Fifth Generation Project, Alvey, MCC, SCI:
  32. ^ Second AI winter:
  33. ^ a b Formal methods are now preferred ("Victory of the neats"):
  34. ^ McCorduck 2004, pp. 480–483
  35. ^ DARPA Grand Challenge – home page
  36. ^ "Welcome". Archive.darpa.mil. Retrieved 31 October 2011. 
  37. ^ Markoff, John (16 February 2011). "On 'Jeopardy!' Watson Win Is All but Trivial". The New York Times. 
  38. ^ Kinect's AI breakthrough explained
  39. ^ Problem solving, puzzle solving, game playing and deduction:
  40. ^ Uncertain reasoning:
  41. ^ Intractability and efficiency and the combinatorial explosion:
  42. ^ Psychological evidence of sub-symbolic reasoning:
  43. ^ Knowledge representation:
  44. ^ Knowledge engineering:
  45. ^ a b Representing categories and relations: Semantic networks, description logics, inheritance (including frames and scripts):
  46. ^ a b Representing events and time:Situation calculus, event calculus, fluent calculus (including solving the frame problem):
  47. ^ a b Causal calculus:
  48. ^ a b Representing knowledge about knowledge: Belief calculus, modal logics:
  49. ^ Ontology:
  50. ^ Qualification problem: While McCarthy was primarily concerned with issues in the logical representation of actions, Russell & Norvig 2003 apply the term to the more general issue of default reasoning in the vast network of assumptions underlying all our commonsense knowledge.
  51. ^ a b Default reasoning and default logic, non-monotonic logics, circumscription, closed world assumption, abduction (Poole et al. places abduction under "default reasoning". Luger et al. places this under "uncertain reasoning"):
  52. ^ Breadth of commonsense knowledge:
  53. ^ Dreyfus & Dreyfus 1986
  54. ^ Gladwell 2005
  55. ^ a b Expert knowledge as embodied intuition: Note, however, that recent work in cognitive science challenges the view that there is anything like sub-symbolic human information processing, i.e., human cognition is essentially symbolic regardless of the level and of the consciousness status of the processing:
    • Augusto, Luis M. (2013). "Unconscious representations 1: Belying the traditional model of human cognition". Axiomathes. doi:10.1007/s10516-012-9206-z. 
    • Augusto, Luis M. (2013). "Unconscious representations 2: Towards an integrated cognitive architecture". Axiomathes. doi:10.1007/s10516-012-9207-y. 
  56. ^ Planning:
  57. ^ a b Information value theory:
  58. ^ Classical planning:
  59. ^ Planning and acting in non-deterministic domains: conditional planning, execution monitoring, replanning and continuous planning:
  60. ^ Multi-agent planning and emergent behavior:
  61. ^ This is a form of Tom Mitchell's widely quoted definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E."
  62. ^ Learning:
  63. ^ Alan Turing discussed the centrality of learning as early as 1950, in his classic paper Computing Machinery and Intelligence.(Turing 1950) In 1956, at the original Dartmouth AI summer conference, Ray Solomonoff wrote a report on unsupervised probabilistic machine learning: "An Inductive Inference Machine".(pdf scanned copy of the original) (version published in 1957, An Inductive Inference Machine," IRE Convention Record, Section on Information Theory, Part 2, pp. 56–62)
  64. ^ Reinforcement learning:
  65. ^ Computational learning theory:
  66. ^ Weng, J., McClelland, Pentland, A.,Sporns, O., Stockman, I., Sur, M., and E. Thelen (2001) Autonomous mental development by robots and animals, Science, vol. 291, pp. 599–600.
  67. ^ Lungarella, M., Metta, G., Pfeifer, R. and G. Sandini (2003). Developmental robotics: a survey. Connection Science, 15:151–190.
  68. ^ Asada, M., Hosoda, K., Kuniyoshi, Y., Ishiguro, H., Inui, T., Yoshikawa, Y., Ogino, M. and C. Yoshida (2009) Cognitive developmental robotics: a survey. IEEE Transactions on Autonomous Mental Development, Vol. 1, No. 1, pp. 12–34.
  69. ^ Oudeyer, P-Y. (2010) On the impact of robotics in behavioral and cognitive sciences: from insect navigation to human cognitive development, IEEE Transactions on Autonomous Mental Development, 2(1), pp. 2–16.
  70. ^ Natural language processing:
  71. ^ Applications of natural language processing, including information retrieval (i.e. text mining) and machine translation:
  72. ^ Robotics:
  73. ^ a b Moving and configuration space:
  74. ^ a b Tecuci, G. (2012), Artificial intelligence. WIREs Comp Stat, 4: 168–180. doi: 10.1002/wics.200
  75. ^ Robotic mapping (localization, etc):
  76. ^ Machine perception:
  77. ^ Computer vision:
  78. ^ Speech recognition:
  79. ^ Object recognition:
  80. ^ "Kismet". MIT Artificial Intelligence Laboratory, Humanoid Robotics Group. 
  81. ^ Thro, Ellen (1993). Robotics. New York. 
  82. ^ Edelson, Edward (1991). The Nervous System. New York: Remmel Nunn. 
  83. ^ Tao, Jianhua; Tieniu Tan (2005). "Affective Computing: A Review". Affective Computing and Intelligent Interaction. LNCS 3784. Springer. pp. 981–995. doi:10.1007/11573548. 
  84. ^ James, William (1884). "What is Emotion". Mind 9: 188–205. doi:10.1093/mind/os-IX.34.188.  Cited by Tao and Tan.
  85. ^ "Affective Computing" MIT Technical Report #321 (Abstract), 1995
  86. ^ Kleine-Cosack, Christian (October 2006). "Recognition and Simulation of Emotions" (PDF). Archived from the original on 28 May 2008. Retrieved 13 May 2008. "The introduction of emotion to computer science was done by Pickard (sic) who created the field of affective computing." 
  87. ^ Diamond, David (December 2003). "The Love Machine; Building computers that care". Wired. Archived from the original on 18 May 2008. Retrieved 13 May 2008. "Rosalind Picard, a genial MIT professor, is the field's godmother; her 1997 book, Affective Computing, triggered an explosion of interest in the emotional side of computers and their users." 
  88. ^ Emotion and affective computing:
  89. ^ Gerald Edelman, Igor Aleksander and others have both argued that artificial consciousness is required for strong AI. (Aleksander 1995; Edelman 2007)
  90. ^ a b Artificial brain arguments: AI requires a simulation of the operation of the human brain A few of the people who make some form of the argument: The most extreme form of this argument (the brain replacement scenario) was put forward by Clark Glymour in the mid-1970s and was touched on by Zenon Pylyshyn and John Searle in 1980.
  91. ^ AI complete: Shapiro 1992, p. 9
  92. ^ Nils Nilsson writes: "Simply put, there is wide disagreement in the field about what AI is all about" (Nilsson 1983, p. 10).
  93. ^ a b Biological intelligence vs. intelligence in general:
    • Russell & Norvig 2003, pp. 2–3, who make the analogy with aeronautical engineering.
    • McCorduck 2004, pp. 100–101, who writes that there are "two major branches of artificial intelligence: one aimed at producing intelligent behavior regardless of how it was accomplished, and the other aimed at modeling intelligent processes found in nature, particularly human ones."
    • Kolata 1982, a paper in Science, which describes McCarthy's indifference to biological models. Kolata quotes McCarthy as writing: "This is AI, so we don't care if it's psychologically real"[1]. McCarthy recently reiterated his position at the AI@50 conference where he said "Artificial intelligence is not, by definition, simulation of human intelligence" (Maker 2006).
  94. ^ a b Neats vs. scruffies:
  95. ^ a b Symbolic vs. sub-symbolic AI:
  96. ^ Haugeland 1985, p. 255.
  97. ^ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.8384&rep=rep1&type=pdf
  98. ^ Pei Wang (2008). Artificial general intelligence, 2008: proceedings of the First AGI Conference. IOS Press. p. 63. ISBN 978-1-58603-833-5. Retrieved 31 October 2011. 
  99. ^ Haugeland 1985, pp. 112–117
  100. ^ The most dramatic case of sub-symbolic AI being pushed into the background was the devastating critique of perceptrons by Marvin Minsky and Seymour Papert in 1969. See History of AI, AI winter, or Frank Rosenblatt.
  101. ^ Cognitive simulation, Newell and Simon, AI at CMU (then called Carnegie Tech):
  102. ^ Soar (history):
  103. ^ McCarthy and AI research at SAIL and SRI International:
  104. ^ AI research at Edinburgh and in France, birth of Prolog:
  105. ^ AI at MIT under Marvin Minsky in the 1960s :
  106. ^ Cyc:
  107. ^ Knowledge revolution:
  108. ^ Embodied approaches to AI:
  109. ^ Revival of connectionism:
  110. ^ Computational intelligence
  111. ^ Pat Langley, "The changing science of machine learning", Machine Learning, Volume 82, Number 3, 275–279, doi:10.1007/s10994-011-5242-y
  112. ^ Yarden Katz, "Noam Chomsky on Where Artificial Intelligence Went Wrong", The Atlantic, November 1, 2012
  113. ^ Peter Norvig, "On Chomsky and the Two Cultures of Statistical Learning"
  114. ^ Agent architectures, hybrid intelligent systems:
  115. ^ Hierarchical control system:
  116. ^ Subsumption architecture:
  117. ^ Search algorithms:
  118. ^ Forward chaining, backward chaining, Horn clauses, and logical deduction as search:
  119. ^ State space search and planning:
  120. ^ Uninformed searches (breadth first search, depth first search and general state space search):
  121. ^ Heuristic or informed searches (e.g., greedy best first and A*):
  122. ^ Optimization searches:
  123. ^ Artificial life and society based learning:
  124. ^ Genetic programming and genetic algorithms:
  125. ^ Logic:
  126. ^ Satplan:
  127. ^ Explanation based learning, relevance based learning, inductive logic programming, case based reasoning:
  128. ^ Propositional logic:
  129. ^ First-order logic and features such as equality:
  130. ^ Fuzzy logic:
  131. ^ Subjective logic:
  132. ^ Stochastic methods for uncertain reasoning:
  133. ^ Bayesian networks:
  134. ^ Bayesian inference algorithm:
  135. ^ Bayesian learning and the expectation-maximization algorithm:
  136. ^ Bayesian decision theory and Bayesian decision networks:
  137. ^ a b c Stochastic temporal models: Dynamic Bayesian networks: Hidden Markov model: Kalman filters:
  138. ^ decision theory and decision analysis:
  139. ^ Markov decision processes and dynamic decision networks:
  140. ^ Game theory and mechanism design:
  141. ^ Statistical learning methods and classifiers:
  142. ^ a b Neural networks and connectionism:
  143. ^ kernel methods such as the support vector machine, Kernel methods:
  144. ^ K-nearest neighbor algorithm:
  145. ^ Gaussian mixture model:
  146. ^ Naive Bayes classifier:
  147. ^ Decision tree:
  148. ^ Classifier performance:
  149. ^ Backpropagation:
  150. ^ Feedforward neural networks, perceptrons and radial basis networks:
  151. ^ Recurrent neural networks, Hopfield nets:
  152. ^ Competitive learning, Hebbian coincidence learning, Hopfield networks and attractor networks:
  153. ^ Hierarchical temporal memory:
  154. ^ Control theory:
  155. ^ Lisp:
  156. ^ Prolog:
  157. ^ a b The Turing test:
    Turing's original publication:
    Historical influence and philosophical implications:
  158. ^ Subject matter expert Turing test:
  159. ^ Rajani, Sandeep (2011). "Artificial Intelligence - Man or Machine". International Journal of Information Technology and Knowledge Management 4 (1): 173–176. Retrieved 24 September 2012.
  160. ^ Game AI:
  161. ^ Mathematical definitions of intelligence:
  162. ^
  163. ^ "AI set to exceed human brain power" (web article). CNN. 26 July 2006. Archived from the original on 19 February 2008. Retrieved 26 February 2008. 
  164. ^ Brooks, R.A., "How to build complete creatures rather than isolated cognitive simulators," in K. VanLehn (ed.), Architectures for Intelligence, pp. 225–239, Lawrence Erlbaum Associates, Hillsdale, NJ, 1991.
  165. ^ Hacking Roomba » Search Results » atmel
  166. ^ Philosophy of AI. All of these positions in this section are mentioned in standard discussions of the subject, such as:
  167. ^ Dartmouth proposal:
  168. ^ The physical symbol systems hypothesis:
  169. ^ Dreyfus criticized the necessary condition of the physical symbol system hypothesis, which he called the "psychological assumption": "The mind can be viewed as a device operating on bits of information according to formal rules". (Dreyfus 1992, p. 156)
  170. ^ Dreyfus' critique of artificial intelligence:
  171. ^ This is a paraphrase of the relevant implication of Gödel's theorems.
  172. ^ The Mathematical Objection: Making the Mathematical Objection: Refuting Mathematical Objection: Background:
    • Gödel 1931, Church 1936, Kleene 1935, Turing 1937
  173. ^ This version is from Searle (1999), and is also quoted in Dennett 1991, p. 435. Searle's original formulation was "The appropriately programmed computer really is a mind, in the sense that computers given the right programs can be literally said to understand and have other cognitive states." (Searle 1980, p. 1). Strong AI is defined similarly by Russell & Norvig (2003, p. 947): "The assertion that machines could possibly act intelligently (or, perhaps better, act as if they were intelligent) is called the 'weak AI' hypothesis by philosophers, and the assertion that machines that do so are actually thinking (as opposed to simulating thinking) is called the 'strong AI' hypothesis."
  174. ^ Searle's Chinese room argument: Discussion:
  175. ^ Robot rights: Prematurity of: In fiction:
  176. ^ Independent documentary Plug & Pray, featuring Joseph Weizenbaum and Raymond Kurzweil
  177. ^ Ford, Martin R. (2009), The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future, Acculant Publishing, ISBN 978-1448659814. (e-book available free online.) 
  178. ^ "Machine Learning: A Job Killer?"
  179. ^ AI could decrease the demand for human labor:
  180. ^ In the early 1970s, Kenneth Colby presented a version of Weizenbaum's ELIZA known as DOCTOR which he promoted as a serious therapeutic tool. (Crevier 1993, pp. 132–144)
  181. ^ Joseph Weizenbaum's critique of AI: Weizenbaum (the AI researcher who developed the first chatterbot program, ELIZA) argued in 1976 that the misuse of artificial intelligence has the potential to devalue human life.
  182. ^ Technological singularity:
  183. ^ Transhumanism:
  184. ^ Rubin, Charles (Spring 2003). "Artificial Intelligence and Human Nature". The New Atlantis 1: 88–100. 
  185. ^ AI as evolution:


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_for_Computing_Machinery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_for_Computing_Machinery new file mode 100644 index 00000000..e5b4d02c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_for_Computing_Machinery @@ -0,0 +1 @@ + Association for Computing Machinery - Wikipedia, the free encyclopedia

Association for Computing Machinery

From Wikipedia, the free encyclopedia
Formation: 1947
Type: 501(c)(3) not-for-profit membership corporation
Headquarters: New York City
Membership: 100,000
President: Vint Cerf
Website: www.acm.org

The Association for Computing Machinery (ACM) is a U.S.-based international learned society for computing. It was founded in 1947 and is the world's largest and most prestigious[1] scientific and educational computing society. It is a not-for-profit professional membership group.[2] Its membership is more than 100,000 as of 2011. Its headquarters are in New York City.

The ACM and the IEEE Computer Society are the primary US umbrella organizations for academic and scholarly interests in computing. Unlike the IEEE, the ACM is solely dedicated to computing.

Contents

Activities[edit]

[Photo: Two Penn Plaza, site of the ACM headquarters in New York City]

ACM is organized into over 170 local chapters and 35 Special Interest Groups (SIGs), through which it conducts most of its activities. Additionally, there are over 500 college and university chapters. The first student chapter was founded in 1961 at the University of Louisiana at Lafayette.

Many of the SIGs, like SIGGRAPH, SIGPLAN, SIGCSE and SIGCOMM, sponsor regular conferences which have become famous as the dominant venue for presenting innovations in certain fields. The groups also publish a large number of specialized journals, magazines, and newsletters.

ACM also sponsors other computer-science-related events, such as the worldwide ACM International Collegiate Programming Contest (ICPC), and has sponsored further events such as the chess match between Garry Kasparov and the IBM Deep Blue computer.

Services[edit]

ACM Press publishes a prestigious academic journal, the Journal of the ACM, and general magazines for computer professionals, Communications of the ACM (also known as Communications or CACM) and Queue. The ACM also publishes a range of other journals, magazines, and newsletters.

Although Communications no longer publishes primary research, and is not considered a prestigious venue, many of the great debates and results in computing history have been published in its pages.

ACM has made almost all of its publications available to paid subscribers online at its Digital Library and also has a Guide to Computing Literature. Individual members additionally have access to Safari Books Online and Books24x7. The ACM also offers insurance, online courses, and other services to its members.

Digital Library[edit]

The ACM Digital Library, a part of the ACM Portal, contains a comprehensive archive of the organization's journals, magazines, and conference proceedings. Online services include a forum called Ubiquity and Tech News digest.

ACM requires the copyright of all submissions to be assigned to the organization as a condition of publishing the work.[4] Authors may post the documents on their own websites, but they are required to link back to the digital library's reference page for the paper. Though authors are not allowed to charge for access to copies of their work, downloading a copy from the ACM site requires a paid subscription.

Competition[edit]

ACM's primary historical competitor has been the IEEE Computer Society, which is the largest subgroup of the Institute of Electrical and Electronics Engineers. The IEEE focuses more on hardware and standardization issues than on theoretical computer science, but there is considerable overlap with ACM's agenda. They occasionally cooperate on projects such as developing computing curricula.[5] Some of the major awards in computer science are given jointly by ACM and the IEEE-CS.[6]

There is also a mounting challenge to the ACM's publication practices coming from the open access movement. Some authors see a centralized peer–review process as less relevant and publish on their home pages or on unreviewed sites like arXiv. Other organizations have sprung up which do their peer review entirely free and online, such as Journal of Artificial Intelligence Research (JAIR), Journal of Machine Learning Research (JMLR) and the Journal of Research and Practice in Information Technology.

Membership grades[edit]

In addition to student and regular members, ACM has several advanced membership grades to recognize those with multiple years of membership and "demonstrated performance that sets them apart from their peers".[7]

Fellows[edit]

The ACM Fellows Program was established by Council of the Association for Computing Machinery in 1993 "to recognize and honor outstanding ACM members for their achievements in computer science and information technology and for their significant contributions to the mission of the ACM."

There are presently about 500 Fellows[8] out of about 60,000 professional members.

Other membership grades[edit]

In 2006 ACM began recognizing two additional membership grades. Senior Members have ten or more years of professional experience and five years of continuous ACM membership. Distinguished Engineers and Distinguished Scientists have at least 15 years of professional experience and five years of continuous ACM membership, and "have made a significant impact on the computing field".

Chapters[edit]

ACM has three kinds of chapters: Special Interest Groups,[9] Professional Chapters, and Student Chapters.[10]

Special Interest Groups[edit]

  • SIGACCESS: Accessible Computing
  • SIGACT: Algorithms and Computation Theory
  • SIGAda: Ada Programming Language
  • SIGAPP: Applied Computing
  • SIGARCH: Computer Architecture
  • SIGART: Artificial Intelligence
  • SIGBED: Embedded Systems
  • SIGCAS: Computers and Society
  • SIGCHI: Computer–Human Interaction
  • SIGCOMM: Data Communication
  • SIGCSE: Computer Science Education
  • SIGDA: Design Automation
  • SIGDOC: Design of Communication
  • SIGecom: Electronic Commerce
  • SIGEVO: Genetic and Evolutionary Computation
  • SIGGRAPH: Computer Graphics and Interactive Techniques
  • SIGHPC: High Performance Computing
  • SIGIR: Information Retrieval
  • SIGITE: Information Technology Education
  • SIGKDD: Knowledge Discovery and Data Mining
  • SIGMETRICS: Measurement and Evaluation
  • SIGMICRO: Microarchitecture
  • SIGMIS: Management Information Systems
  • SIGMM: Multimedia
  • SIGMOBILE: Mobility of Systems, Users, Data and Computing
  • SIGMOD: Management of Data
  • SIGOPS: Operating Systems
  • SIGPLAN: Programming Languages
  • SIGSAC: Security, Audit, and Control
  • SIGSAM: Symbolic and Algebraic Manipulation
  • SIGSIM: Simulation and Modeling
  • SIGSOFT: Software Engineering
  • SIGSPATIAL: Spatial Information
  • SIGUCCS: University and College Computing Services
  • SIGWEB: Hypertext, Hypermedia, and Web

Professional Chapters[edit]

As of 2011, ACM has professional & SIG Chapters in 56 countries.[11]

Student chapters[edit]

As of 2011, there exist ACM student chapters in 38 different countries.[12]

These chapters include:

Conferences[edit]

The ACM sponsors numerous conferences, listed below. Most of the special interest groups also hold an annual conference. ACM conferences are often very popular publishing venues and are therefore very competitive; for example, the 2007 SIGGRAPH conference attracted about 30,000 attendees, and CIKM accepted only 15% of the long papers submitted in 2005.

The ACM is a co–presenter and founding partner of the Grace Hopper Celebration of Women in Computing (GHC) with the Anita Borg Institute for Women and Technology.[17]

Some conferences are hosted by ACM student branches; these include Reflections Projections, which is hosted by the UIUC ACM chapter.

Awards[edit]

The ACM presents or co–presents a number of awards for outstanding technical and professional achievements and contributions in computer science and information technology.[18]

Leadership[edit]

The President of the ACM for 2012–2014[19] is Vint Cerf, an American computer scientist who is recognized as one of "the fathers of the Internet". He succeeded Alain Chesnais (2010–2012[20]), a French citizen living in Toronto, where he runs his company Visual Transitions; Chesnais in turn succeeded Wendy Hall of the University of Southampton.

ACM is led by a Council consisting of the President, Vice–President, Treasurer, Past President, SIG Governing Board Chair, Publications Board Chair, three representatives of the SIG Governing Board, and seven Members–At–Large. This institution is often referred to simply as "Council" in Communications of the ACM.

Infrastructure[edit]

ACM has five “Boards” that make up various committees and subgroups, to help Headquarters staff maintain quality services and products. These boards are as follows:

  1. Publications Board
  2. SIG Governing Board
  3. Education Board
  4. Membership Services Board
  5. Professions Board

ACM–W: Association for Computing Machinery Committee on Women[edit]

ACM–W, the ACM's committee on women in computing, is set up to support, inform, celebrate, and work with women in computing. Dr. Anita Borg was a great supporter of ACM–W. ACM–W provides various resources for women in computing as well as high school girls interested in the field. ACM–W also reaches out internationally to those women who are involved and interested in computing.

Athena Lectures[edit]

Since 2006, ACM-W has held the annual Athena Lectures to honor outstanding women researchers who have made fundamental contributions to computer science. Speakers are nominated by SIG officers.[21]

Publications[edit]

In 1997, ACM Press published Wizards and Their Wonders: Portraits in Computing (ISBN 0897919602), written by Christopher Morgan, with new photographs by Louis Fabian Bachrach. The book is a collection of historic and current portrait photographs of figures from the computer industry.

See also[edit]

References[edit]

  1. ^ "Indiana University Media Relations". indiana.edu. Retrieved 2012–10–02. 
  2. ^ "ACM 501(c)3 Status as a group". irs.gov. Retrieved 2012–10–01. 
  3. ^ Wakkary, R.; Stolterman, E. (2011). "WELCOME: Our first interactions". Interactions 18: 5. doi:10.1145/1897239.1897240.  edit
  4. ^ "ACM Copyright Policy". Acm.org. 
  5. ^ Joint Task Force of Association for Computing Machinery (ACM), Association for Information Systems (AIS) and IEEE Computer Society (IEEE–CS). "Computing Curricula 2005: The Overview Report". 
  6. ^ See, e.g., Ken Kennedy Award
  7. ^ "ACM Senior Members–An Overview". Acm.org. 
  8. ^ "List of ACM Fellows". Fellows.acm.org. Retrieved 2012–06–07. 
  9. ^ "ACM Special Interest Groups". Archived from the original on July 27, 2010 <!––DASHBot––>. Retrieved August 7, 2010. 
  10. ^ "ACM Chapters". Retrieved August 7, 2010. 
  11. ^ "Worldwide Professional Chapters". Association for Computing Machinery (ACM). Retrieved 2012-12-27. 
  12. ^ Student Chapters http://campus.acm.org/public/chapters/geo_listing/index.cfm?ct=Student&inus=0
  13. ^ "Conference on Information and Knowledge Management (CIKM)". Cikmconference.org. 
  14. ^ "GECCO – 2009". Sigevo.org. 
  15. ^ "Hypertext 2009". Ht2009.org. 
  16. ^ "Joint Conference on Digital Library (JCDL)–Home". JCDL. 
  17. ^ "Grace Hopper Celebration of Women in Computing, Largest Gathering of Women in Computing, Attracts Researchers, Industry". Retrieved June 27, 2011. 
  18. ^ "ACM Awards". Retrieved April 26, 2012. 
  19. ^ "ACM Elects Vint Cerf as President". Acm.org. May 25, 2012. 
  20. ^ "ACM Elects New Leaders Committed to Expanding International Initiatives". Acm.org. June 9, 2010. 
  21. ^ "Athena talks at ACM-W". Retrieved 10 January 2013. 

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_learning new file mode 100644 index 00000000..338a10ac --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_learning @@ -0,0 +1 @@ + Association rule learning - Wikipedia, the free encyclopedia

Association rule learning

From Wikipedia, the free encyclopedia

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness.[1] Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule \{\mathrm{onions, potatoes}\} \Rightarrow \{\mathrm{burger}\} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

Contents

Definition[edit]

Example database with 4 items and 5 transactions
  transaction ID | milk | bread | butter | beer
               1 |    1 |     1 |      0 |    0
               2 |    0 |     0 |      1 |    0
               3 |    0 |     0 |      0 |    1
               4 |    1 |     1 |      1 |    0
               5 |    0 |     1 |      0 |    0

Following the original definition by Agrawal et al.[2] the problem of association rule mining is defined as: Let I=\{i_1, i_2,\ldots,i_n\} be a set of n binary attributes called items. Let D = \{t_1, t_2, \ldots, t_m\} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X \Rightarrow Y where X, Y \subseteq I and X \cap Y = \emptyset. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule respectively.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I= \{\mathrm{milk, bread, butter, beer}\} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule for the supermarket could be \{\mathrm{butter, bread}\} \Rightarrow \{\mathrm{milk}\} meaning that if butter and bread are bought, customers also buy milk.

Note: this example is extremely small. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Useful Concepts[edit]

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence; the short sketch following the list below computes each of these measures on the example database.

  • The support \mathrm{supp}(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset \{\mathrm{milk, bread, butter}\} has a support of 1/5=0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).
  • The confidence of a rule is defined as \mathrm{conf}(X\Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X). For example, the rule \{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\} has a confidence of 0.2/0.4=0.5 in the example database, meaning that the rule is correct for 50% of the transactions containing milk and bread (in 50% of the cases where a customer buys milk and bread, butter is bought as well). Note that \mathrm{supp}(X \cup Y) here denotes the support of the combined itemset X \cup Y, i.e. of transactions in which the items of X and Y all appear together, not of transactions containing either X or Y. The argument of \mathrm{supp}() is a set of preconditions and thus becomes more restrictive (not more inclusive) as it grows.
  • Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.[3]
  • The lift of a rule is defined as  \mathrm{lift}(X\Rightarrow Y) = \frac{ \mathrm{supp}(X \cup Y)}{ \mathrm{supp}(X) \times \mathrm{supp}(Y) } or the ratio of the observed support to that expected if X and Y were independent. The rule \{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\} has a lift of \frac{0.2}{0.4 \times 0.4} = 1.25 .
  • The conviction of a rule is defined as \mathrm{conv}(X\Rightarrow Y) = \frac{ 1 - \mathrm{supp}(Y) }{ 1 - \mathrm{conf}(X\Rightarrow Y)}. The rule \{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\} has a conviction of \frac{1 - 0.4}{1 - 0.5} = 1.2. Conviction is the ratio between the expected frequency with which X occurs without Y if X and Y were independent (that is, the frequency with which the rule would make an incorrect prediction) and the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 means that the rule \{\mathrm{milk, bread}\} \Rightarrow \{\mathrm{butter}\} would be incorrect 20% more often (1.2 times as often) if the association between X and Y were purely random chance.
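
The following minimal sketch (in Python, chosen here only for illustration since the article prescribes no implementation language; the helper names are hypothetical, not from any library) recomputes the support, confidence, lift and conviction values quoted above on the five-transaction example database:

    # Example database from the table above, one set of items per transaction.
    transactions = [
        {"milk", "bread"},
        {"butter"},
        {"beer"},
        {"milk", "bread", "butter"},
        {"bread"},
    ]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    def lift(lhs, rhs):
        return support(lhs | rhs) / (support(lhs) * support(rhs))

    def conviction(lhs, rhs):
        return (1 - support(rhs)) / (1 - confidence(lhs, rhs))

    lhs, rhs = {"milk", "bread"}, {"butter"}
    print(support(lhs | rhs))    # 0.2
    print(confidence(lhs, rhs))  # 0.5
    print(lift(lhs, rhs))        # 1.25
    print(conviction(lhs, rhs))  # 1.2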

Process[edit]

[Figure: Frequent itemset lattice, where the color of each box indicates how many transactions contain that combination of items. An itemset can occur in at most as many transactions as the least frequent of its subsets; e.g. {a,c} can occur in at most min(supp({a}), supp({c})) transactions. This is called the downward-closure property.[2]]

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

  1. First, minimum support is applied to find all frequent itemsets in a database.
  2. Second, these frequent itemsets and the minimum confidence constraint are used to form rules.

While the second step is straightforward, the first step needs more attention.
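
As an illustration of the second, straightforward step, the following Python sketch assumes the frequent itemsets have already been mined into a dictionary mapping each frozenset itemset to its support (the function name generate_rules is hypothetical): for every frequent itemset it emits all rules whose confidence meets the threshold. Downward closure guarantees that each antecedent's support is already present in the dictionary.

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        # frequent: dict mapping frozenset itemsets to their support, as produced
        # by a frequent-itemset miner; min_conf: minimum confidence threshold.
        rules = []
        for itemset, supp in frequent.items():
            if len(itemset) < 2:
                continue
            for size in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(itemset, size)):
                    conf = supp / frequent[lhs]  # supp(X ∪ Y) / supp(X)
                    if conf >= min_conf:
                        rules.append((lhs, itemset - lhs, conf))
        return rules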

Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n-1 (excluding the empty set which is not a valid itemset). Although the size of the powerset grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support[2][4] (also called anti-monotonicity[5]) which guarantees that for a frequent itemset, all its subsets are also frequent and thus for an infrequent itemset, all its supersets must also be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori[6] and Eclat[7]) can find all frequent itemsets.

History[edit]

The concept of association rules was popularised particularly by the 1993 article of Agrawal et al.,[2] which, according to Google Scholar, had acquired more than 6,000 citations as of March 2008 and is thus one of the most cited papers in the data mining field. However, it is possible that what is now called "association rules" is similar to what appears in the 1966 paper[8] on GUHA, a general data mining method developed by Petr Hájek et al.[9]

Alternative measures of interestingness[edit]

Besides confidence, a number of other measures of interestingness for rules have been proposed. Some popular measures are:

  • All-confidence[10]
  • Collective strength[11]
  • Conviction[12]
  • Leverage[13]
  • Lift (originally called interest)[14]

Definitions of these measures can be found in the works cited above. Several more measures are presented and compared by Tan et al.[15] Looking for techniques that can model what the user already knows (and using these models as interestingness measures) is currently an active research trend under the name of "subjective interestingness".

Statistically sound associations[edit]

One limitation of the standard approach to discovering associations is that by searching massive numbers of possible associations to look for collections of items that appear to be associated, there is a large risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing two items in the left-hand-side and 1 item in the right-hand-side. There are approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with a significance level of 0.05 it means there is only a 5% chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless expect to find 50,000,000,000 rules. Statistically sound association discovery[16][17] controls this risk, in most cases reducing the risk of finding any spurious associations to a user-specified significance level.
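
A rough back-of-the-envelope check of these figures (a sketch only; it counts unordered two-item antecedents, while the article's rounder numbers correspond to counting ordered choices, so the article's values are about twice as large, with the same order of magnitude and the same conclusion):

    from math import comb

    n_items, alpha = 10_000, 0.05
    n_rules = comb(n_items, 2) * (n_items - 2)  # two-item LHS, one-item RHS
    print(n_rules)          # 499,850,010,000 candidate rules
    print(alpha * n_rules)  # ~2.5e10 rules expected to pass a 5% test by chance alone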

Algorithms[edit]

Many algorithms for generating association rules have been proposed over time.

Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. A further step must then be performed to generate rules from the frequent itemsets found in a database.

Apriori algorithm[edit]

Apriori[6] is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.
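
A minimal sketch of the idea in Python (assuming transactions are given as a list of item sets; the function below is only illustrative, not the original formulation of Agrawal and Srikant): candidates of size k are generated breadth-first from the frequent (k-1)-itemsets, and any candidate with an infrequent subset is pruned before its support is counted.

    from itertools import combinations

    def apriori(transactions, min_support):
        # Returns {frozenset itemset: support}; min_support is a fraction of transactions.
        n = len(transactions)
        transactions = [frozenset(t) for t in transactions]

        def supported(candidates):
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

        # Level 1: frequent single items.
        items = {frozenset([i]) for t in transactions for i in t}
        frequent = supported(items)
        level = set(frequent)
        k = 2
        while level:
            # Join step: merge frequent (k-1)-itemsets that differ in one item.
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
            current = supported(candidates)
            frequent.update(current)
            level = set(current)
            k += 1
        return frequent

    # On the example database with min_support = 0.4, the frequent itemsets are
    # {bread}: 0.6, {milk}: 0.4, {butter}: 0.4 and {milk, bread}: 0.4.

Combined with a rule-generation step such as the one sketched earlier, this yields a complete, if naive, association rule miner.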

Eclat algorithm[edit]

Eclat[7] is a depth-first search algorithm using set intersection.
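
A sketch of that idea (again in Python, with illustrative names): each item is first mapped to its "tidset", the set of IDs of the transactions containing it, and the search then extends itemsets depth-first, intersecting tidsets to obtain supports.

    def eclat(transactions, min_support):
        # Returns {frozenset itemset: support}; min_support is a fraction of transactions.
        n = len(transactions)
        # Vertical layout: item -> set of transaction IDs containing it.
        tidsets = {}
        for tid, t in enumerate(transactions):
            for item in t:
                tidsets.setdefault(item, set()).add(tid)

        frequent = {}

        def recurse(prefix, prefix_tids, items):
            for i, (item, tids) in enumerate(items):
                new_tids = prefix_tids & tids if prefix else tids
                if len(new_tids) / n >= min_support:
                    itemset = prefix | {item}
                    frequent[frozenset(itemset)] = len(new_tids) / n
                    # Depth-first: only extend with items later in the ordering.
                    recurse(itemset, new_tids, items[i + 1:])

        recurse(set(), set(), sorted(tidsets.items()))
        return frequent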

FP-growth algorithm[edit]

FP stands for frequent pattern.

In the first pass, the algorithm counts the occurrences of items (attribute-value pairs) in the dataset and stores them in a "header table". In the second pass, it builds the FP-tree structure by inserting instances. Items in each instance have to be sorted in descending order of their frequency in the dataset so that the tree can be processed quickly. Items in an instance that do not meet the minimum coverage threshold are discarded. If many instances share their most frequent items, the FP-tree provides high compression close to the tree root.
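
The two construction passes just described can be sketched as follows (Python; the class and function names are illustrative, the threshold is an absolute occurrence count, and the recursive mining phase described next is deliberately omitted):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fp_tree(transactions, min_count):
        # Pass 1: count item frequencies and keep the frequent ones in the header table.
        counts = Counter(item for t in transactions for item in t)
        header = {item: c for item, c in counts.items() if c >= min_count}

        # Pass 2: insert each transaction, items sorted by descending global
        # frequency, infrequent items dropped; shared prefixes share tree nodes.
        root = FPNode(None, None)
        for t in transactions:
            items = sorted((i for i in t if i in header),
                           key=lambda i: (-header[i], i))
            node = root
            for item in items:
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1
        # A full implementation would additionally thread same-item nodes
        # together from the header table for the recursive mining phase.
        return root, header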

Recursive processing of this compressed version of the main dataset grows large item sets directly, instead of generating candidate items and testing them against the entire database. Growth starts from the bottom of the header table (the items with the longest branches), by finding all instances matching a given condition. A new tree is created, with counts projected from the original tree corresponding to the set of instances that are conditional on the attribute, and with each node getting the sum of its children's counts. Recursive growth ends when no individual items conditional on the attribute meet the minimum support threshold, and processing continues on the remaining header items of the original FP-tree.

Once the recursive process has completed, all large item sets with minimum coverage have been found, and association rule creation begins.[18]

GUHA procedure ASSOC[edit]

GUHA is a general method for exploratory data analysis that has theoretical foundations in observational calculi.[19]

The ASSOC procedure[20] is a GUHA method which mines for generalized association rules using fast bitstring operations. The association rules mined by this method are more general than those output by Apriori; for example, "items" can be connected with both conjunctions and disjunctions, and the relation between the antecedent and consequent of the rule is not restricted to setting minimum support and confidence as in Apriori: an arbitrary combination of supported interest measures can be used.

OPUS search[edit]

OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives, does not require either monotone or anti-monotone constraints such as minimum support.[21] Initially used to find rules for a fixed consequent[21][22] it has subsequently been extended to find rules with any item as a consequent.[23] OPUS search is the core technology in the popular Magnum Opus association discovery system.

Lore[edit]

A famous story about association rule mining is the "beer and diaper" story. A purported survey of behavior of supermarket shoppers discovered that customers (presumably young men) who buy diapers tend also to buy beer. This anecdote became popular as an example of how unexpected association rules might be found from everyday data. There are varying opinions as to how much of the story is true.[24] Daniel Powers says:[24]

In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves.

Other types of association mining[edit]

Contrast set learning is a form of associative learning. Contrast set learners use rules that differ meaningfully in their distribution across subsets.[25]

Weighted class learning is another form of associative learning in which weight may be assigned to classes to give focus to a particular issue of concern for the consumer of the data mining results.

High-order pattern discovery techniques facilitate the capture of high-order (polythetic) patterns or event associations that are intrinsic to complex real-world data. [26]

K-optimal pattern discovery provides an alternative to the standard approach to association rule learning that requires that each pattern appear frequently in the data.

Generalized Association Rules: hierarchical taxonomy (concept hierarchy)

Quantitative Association Rules: categorical and quantitative data[27]

Interval Data Association Rules: e.g. partitioning age into 5-year-increment ranges

Maximal Association Rules

Sequential pattern mining discovers subsequences that are common to more than minsup sequences in a sequence database, where minsup is set by the user. A sequence is an ordered list of transactions.[28]

Sequential rule mining discovers relationships between items while taking the time ordering into account. It is generally applied to a sequence database. For example, a sequential rule found in a database of sequences of customer transactions could be that customers who bought a computer and CD-ROMs later bought a webcam, with a given confidence and support.

See also[edit]

References[edit]

  1. ^ Piatetsky-Shapiro, Gregory (1991), Discovery, analysis, and presentation of strong rules, in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
  2. ^ a b c d e Agrawal, R.; Imieliński, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93. p. 207. doi:10.1145/170035.170072. ISBN 0897915925.  edit
  3. ^ Hipp, J.; Güntzer, U.; Nakhaeizadeh, G. (2000). "Algorithms for association rule mining --- a general survey and comparison". ACM SIGKDD Explorations Newsletter 2: 58. doi:10.1145/360402.360421.  edit
  4. ^ Tan, Pang-Ning; Michael, Steinbach; Kumar, Vipin (2005). "Chapter 6. Association Analysis: Basic Concepts and Algorithms". Introduction to Data Mining. Addison-Wesley. ISBN 0-321-32136-7. 
  5. ^ Pei, Jian; Han, Jiawei; and Lakshmanan, Laks V. S.; Mining frequent itemsets with convertible constraints, in Proceedings of the 17th International Conference on Data Engineering, April 2–6, 2001, Heidelberg, Germany, 2001, pages 433-442
  6. ^ a b Agrawal, Rakesh; and Srikant, Ramakrishnan; Fast algorithms for mining association rules in large databases, in Bocca, Jorge B.; Jarke, Matthias; and Zaniolo, Carlo; editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile, September 1994, pages 487-499
  7. ^ a b Zaki, M. J. (2000). "Scalable algorithms for association mining". IEEE Transactions on Knowledge and Data Engineering 12 (3): 372–390. doi:10.1109/69.846291.  edit
  8. ^ Hájek, Petr; Havel, Ivan; Chytil, Metoděj; The GUHA method of automatic hypotheses determination, Computing 1 (1966) 293-308
  9. ^ Hájek, Petr; Feglar, Tomas; Rauch, Jan; and Coufal, David; The GUHA method, data preprocessing and mining, Database Support for Data Mining Applications, Springer, 2004, ISBN 978-3-540-22479-2
  10. ^ Omiecinski, Edward R.; Alternative interest measures for mining associations in databases, IEEE Transactions on Knowledge and Data Engineering, 15(1):57-69, Jan/Feb 2003
  11. ^ Aggarwal, Charu C.; and Yu, Philip S.; A new framework for itemset generation, in PODS 98, Symposium on Principles of Database Systems, Seattle, WA, USA, 1998, pages 18-24
  12. ^ Brin, Sergey; Motwani, Rajeev; Ullman, Jeffrey D.; and Tsur, Shalom; Dynamic itemset counting and implication rules for market basket data, in SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona, USA, May 1997, pp. 255-264
  13. ^ Piatetsky-Shapiro, Gregory; Discovery, analysis, and presentation of strong rules, Knowledge Discovery in Databases, 1991, pp. 229-248
  14. ^ Brin, Sergey; Motwani, Rajeev; Ullman, Jeffrey D.; and Tsur, Shalom; Dynamic itemset counting and implication rules for market basket data, in SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona, USA, May 1997, pp. 265-276
  15. ^ Tan, Pang-Ning; Kumar, Vipin; and Srivastava, Jaideep; Selecting the right objective measure for association analysis, Information Systems, 29(4):293-313, 2004
  16. ^ Webb, Geoffrey I. (2007); Discovering Significant Patterns, Machine Learning 68(1), Netherlands: Springer, pp. 1-33 online access
  17. ^ Gionis, Aristides; Mannila, Heikki; Mielikäinen, Taneli; and Tsaparas, Panayiotis; Assessing Data Mining Results via Swap Randomization, ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 3 (December 2007), Article No. 14
  18. ^ Witten, Ian H.; Frank, Eibe; Hall, Mark A.; Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition
  19. ^ Rauch, Jan; Logical calculi for knowledge discovery in databases, in Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, Springer, 1997, pp. 47-57
  20. ^ Hájek, Petr; and Havránek, Tomáš (1978). Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory. Springer-Verlag. ISBN 3-540-08738-9. 
  21. ^ a b Webb, Geoffrey I. (1995); OPUS: An Efficient Admissible Algorithm for Unordered Search, Journal of Artificial Intelligence Research 3, Menlo Park, CA: AAAI Press, pp. 431-465 online access
  22. ^ Bayardo, Roberto J., Jr.; Agrawal, Rakesh; Gunopulos, Dimitrios (2000). "Constraint-based rule mining in large, dense databases". Data Mining and Knowledge Discovery 4 (2): 217–240. doi:10.1023/A:1009895914772. 
  23. ^ Webb, Geoffrey I. (2000); Efficient Search for Association Rules, in Ramakrishnan, Raghu; and Stolfo, Sal; eds.; Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, New York, NY: The Association for Computing Machinery, pp. 99-107 online access
  24. ^ a b http://www.dssresources.com/newsletters/66.php
  25. ^ Menzies, Tim; and Hu, Ying; Data Mining for Very Busy People, IEEE Computer, October 2003, pp. 18-25
  26. ^ Wong, Andrew K.C.; Wang, Yang (1997). "High-order pattern discovery from discrete-valued data". IEEE Transactions on Knowledge and Data Engineering (TKDE): 877–893. 
  27. ^ Salleb-Aouissi, Ansaf; Vrain, Christel; and Nortet, Cyril (2007). "QuantMiner: A Genetic Algorithm for Mining Quantitative Association Rules". International Joint Conference on Artificial Intelligence (IJCAI): 1035–1040. 
  28. ^ Zaki, Mohammed J. (2001); SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning Journal, 42, pp. 31–60

External links[edit]

Bibliographies[edit]

Implementations[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_mining new file mode 100644 index 00000000..f509edb6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Association_rule_mining @@ -0,0 +1 @@ + Association rule learning - Wikipedia, the free encyclopedia

Association rule learning

From Wikipedia, the free encyclopedia
  (Redirected from Association rule mining)

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_distillation_of_structure b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_distillation_of_structure new file mode 100644 index 00000000..18b84808 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_distillation_of_structure @@ -0,0 +1 @@ + Automatic distillation of structure - Wikipedia, the free encyclopedia

Automatic distillation of structure

From Wikipedia, the free encyclopedia

Automatic Distillation of Structure (ADIOS) is an algorithm that can analyse source material such as text and come up with meaningful information about the generative structures that gave rise to the source. One application of the algorithm is grammar induction: ADIOS can read a source text and infer grammatical rules based on structures and patterns found in the text. Using these, the system can then generate new well-structured sentences.

ADIOS was developed by Zach Solan, David Horn, and Eytan Ruppin from Tel Aviv University, Israel, and Shimon Edelman from Cornell University, New York, USA.

References[edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_summarization b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_summarization new file mode 100644 index 00000000..abe13f0d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Automatic_summarization @@ -0,0 +1 @@ + Automatic summarization - Wikipedia, the free encyclopedia

Automatic summarization

From Wikipedia, the free encyclopedia

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. An example of the use of summarization technology is search engines such as Google. Document summarization is another.

Generally, there are two approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary might contain words not explicitly present in the original. The state-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods.

Methods

Methods of automatic summarization include extraction-based, abstraction-based, maximum entropy-based, and aided summarization.

Extraction-based summarization

Two particular types of summarization often addressed in the literature are keyphrase extraction, where the goal is to select individual words or phrases to "tag" a document, and document summarization, where the goal is to select whole sentences to create a short paragraph summary.

Abstraction-based summarization

Extraction techniques merely copy the information deemed most important by the system to the summary (for example, key clauses, sentences or paragraphs), while abstraction involves paraphrasing sections of the source document. In general, abstraction can condense a text more strongly than extraction, but the programs that can do this are harder to develop as they require the use of natural language generation technology, which itself is a growing field.

While some work has been done in abstractive summarization (creating an abstract synopsis like that of a human), the majority of summarization systems are extractive (selecting a subset of sentences to place in a summary).

Maximum entropy-based summarization

Even though automating abstractive summarization is the ultimate goal of summarization research, most practical systems are based on some form of extractive summarization. Extracted sentences can form a valid summary in themselves or serve as a basis for further condensation operations. Furthermore, evaluation of extracted summaries can be automated, since it is essentially a classification task. During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain. The system was a hybrid that combined a naive Bayes classifier with statistical language models for modeling salience. Although this system exhibited good results, later work explored the effectiveness of a maximum entropy (ME) classifier for the meeting summarization task, as ME is known to be robust against feature dependencies. Maximum entropy has also been applied successfully to summarization in the broadcast news domain.
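
As a rough illustration of the maximum entropy approach, the sketch below trains a logistic regression model (the standard maximum entropy classifier for this kind of task) to score sentences for inclusion in an extractive summary. It uses scikit-learn; the feature columns and toy data are invented for illustration and are not taken from the TNO or broadcast-news systems mentioned above.

    # Hypothetical sketch: a maximum-entropy (logistic regression) sentence scorer.
    # Feature columns and training data are invented for illustration only.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Features per sentence: [position in document, length in words, salience score]
    X_train = np.array([[0, 25, 0.8], [5, 12, 0.2], [1, 30, 0.9], [9, 8, 0.1]])
    y_train = np.array([1, 0, 1, 0])          # 1 = sentence was extracted into the summary

    maxent = LogisticRegression()             # logistic regression = maximum entropy classifier
    maxent.fit(X_train, y_train)

    X_new = np.array([[2, 28, 0.7]])
    print(maxent.predict_proba(X_new)[0, 1])  # probability the new sentence is summary-worthy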

Aided summarization

Machine learning techniques from closely related fields such as information retrieval or text mining have been successfully adapted to help automatic summarization.

Apart from Fully Automated Summarizers (FAS), there are systems that aid users with the task of summarization (MAHS = Machine Aided Human Summarization), for example by highlighting candidate passages to be included in the summary, and there are systems that depend on post-processing by a human (HAMS = Human Aided Machine Summarization).

Applications

There are different types of summaries depending on what the summarization program focuses on to make the summary of the text, for example generic summaries or query-relevant summaries (sometimes called query-based summaries).

Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs. Summarization of multimedia documents, e.g. pictures or movies, is also possible.

Some systems will generate a summary based on a single source document, while others can use multiple source documents (for example, a cluster of news stories on the same topic). These systems are known as multi-document summarization systems.

Keyphrase extraction

Task description and example

The task is the following. You are given a piece of text, such as a journal article, and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text. In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to automatically do so for a number of applications discussed below. Consider the example text from a recent news article:

"The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press".

An extractive keyphrase extractor might select "Army Corps of Engineers", "President Bush", "New Orleans", and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that might be more descriptive and more like what a human would produce, such as "political negligence" or "inadequate protection from floods". Note that these terms do not appear in the text and require a deep understanding, which makes it difficult for a computer to produce such keyphrases. Keyphrases have many applications, such as to improve document browsing by providing a short summary. Also, keyphrases can improve information retrieval — if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search. Also, automatic keyphrase extraction can be useful in generating index entries for a large text corpus.

Keyphrase extraction as supervised learning

Beginning with the Turney paper, many researchers have approached keyphrase extraction as a supervised machine learning problem. Given a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a binary classification for a test example, while others assign a probability of being a keyphrase. For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases. After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases. Keyphrase extractors are generally evaluated using precision and recall. Precision measures how many of the proposed keyphrases are actually correct. Recall measures how many of the true keyphrases your system proposed. The two measures can be combined in an F-score, which is the harmonic mean of the two (F = 2PR/(P + R) ). Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization.
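
A minimal sketch of the evaluation step described above, computing precision, recall and the F-score for a set of proposed keyphrases against known keyphrases. The example phrases are made up, and stemming or other normalization is omitted for brevity.

    # Evaluate a keyphrase extractor with precision, recall and F-score.
    def evaluate(proposed, gold):
        proposed, gold = set(proposed), set(gold)
        tp = len(proposed & gold)                       # correctly proposed keyphrases
        precision = tp / len(proposed) if proposed else 0.0
        recall = tp / len(gold) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    print(evaluate(["new orleans", "army corps of engineers", "flood"],
                   ["new orleans", "flood-control pumps", "hurricane season"]))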

Design choices

Designing a supervised keyphrase extraction system involves deciding on several choices (some of these also apply to unsupervised systems):

What are the examples?

The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words. Thus, recall may suffer. However, generating too many examples can also lead to low precision.
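
The sketch below generates candidate examples in the spirit of the Turney-style setup described above: all unigrams, bigrams and trigrams that contain no stopwords. The tokenizer and stopword list are simplified assumptions, not the original implementation.

    # Hypothetical candidate generation: stopword-free unigrams, bigrams and trigrams.
    import re

    STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "by"}   # illustrative list only

    def candidates(text, max_n=3):
        tokens = re.findall(r"[a-z0-9-]+", text.lower())
        out = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(t in STOPWORDS for t in gram):
                    out.add(" ".join(gram))
        return out

    print(sorted(candidates("defective flood-control pumps installed by the Army Corps")))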

What are the features?

We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typically features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, relative position of the first occurrence, various boolean syntactic features (e.g., contains all caps), etc. The Turney paper used about 12 such features. Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney’s seminal paper.
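
As an illustration of the kinds of features mentioned above, the function below computes a few simple ones (term frequency, number of words, relative position of first occurrence, capitalization) for a candidate phrase. This is a sketch only; the exact feature sets used by Turney, Hulth and KEA differ from it.

    # Illustrative feature extraction for one candidate phrase (not the original feature set).
    def phrase_features(phrase, document):
        lowered = document.lower()
        first = lowered.find(phrase.lower())            # character offset of first occurrence
        return {
            "term_frequency": lowered.count(phrase.lower()),
            "num_words": len(phrase.split()),
            "relative_first_position": first / max(1, len(document)),
            "starts_with_capital": phrase[0].isupper(),
        }

    print(phrase_features("Army Corps of Engineers",
                          "The Army Corps of Engineers installed defective pumps."))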

How many keyphrases to return?

In the end, the system will need to return a list of keyphrases for a test document, so we need to have a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier, so the learning algorithm implicitly determines the appropriate number.

What learning algorithm?

Once examples and features are created, we need a way to learn to predict keyphrases. Virtually any supervised learning algorithm could be used, such as decision trees, Naive Bayes, and rule induction. In the case of Turney's GenEx algorithm, a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm. The extractor follows a series of heuristics to identify keyphrases. The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases.

Unsupervised keyphrase extraction: TextRank

While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data. Many documents with known keyphrases are needed. Furthermore, training on a specific domain tends to customize the extraction process to that domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate. Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[1] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from social networks. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or lexical similarity between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
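
The core ranking step can be sketched as a PageRank-style power iteration over a small word graph. The graph, damping factor and iteration count below are illustrative assumptions; real TextRank implementations build the graph from co-occurrence statistics as described in the following sections.

    # PageRank power iteration over a weighted, undirected toy word graph.
    import numpy as np

    words = ["natural", "language", "processing", "advanced"]
    W = np.array([[0, 1, 0, 1],                 # symmetric co-occurrence weights
                  [1, 0, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)

    d = 0.85                                    # damping factor ("random surfer model")
    n = len(words)
    P = W / W.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
    scores = np.full(n, 1.0 / n)
    for _ in range(100):                        # converges to the stationary distribution
        scores = (1 - d) / n + d * P.T @ scores

    print(dict(zip(words, scores.round(3))))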

Design choices
What should vertices be?

The vertices should correspond to what we want to rank. Potentially, we could do something similar to the supervised methods and create a vertex for each unigram, bigram, trigram, etc. However, to keep the graph small, the authors decide to rank individual unigrams in a first step, and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases. This has a nice side effect of allowing us to produce keyphrases of arbitrary length. For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together. Note that the unigrams placed in the graph can be filtered by part of speech. The authors found that adjectives and nouns were the best to include. Thus, some linguistic knowledge comes into play in this step.

How should we create edges?

Edges are created based on word co-occurrence in this application of TextRank. Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text. N is typically around 2–10. Thus, "natural" and "language" might be linked in a text about NLP. "Natural" and "processing" would also be linked because they would both appear in the same string of N words. These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.
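
A small sketch of this edge-construction step: connect two words with an undirected edge whenever they occur within a window of N tokens of each other. The tokenized sentence below is a toy example.

    # Build undirected co-occurrence edges within a token window of size N.
    def cooccurrence_edges(tokens, window=2):
        edges = set()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                if tokens[i] != tokens[j]:
                    edges.add(tuple(sorted((tokens[i], tokens[j]))))
        return edges

    print(cooccurrence_edges(["advanced", "natural", "language", "processing"], window=3))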

How are the final keyphrases formed?

Since this method simply ranks the individual vertices, we need a way to threshold or produce a limited number of keyphrases. The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph. Then the top T vertices/unigrams are selected based on their stationary probabilities. A post-processing step is then applied to merge adjacent instances of these T unigrams. As a result, potentially more or less than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text.
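
The post-processing step might look like the sketch below: keep the top T ranked unigrams (T chosen as a fraction of the vocabulary) and merge runs of adjacent selected words in the original token sequence into multi-word keyphrases. The scores are invented for illustration.

    # Merge adjacent top-T unigrams into multi-word keyphrases.
    def merge_keyphrases(tokens, scores, fraction=0.3):
        T = max(1, int(fraction * len(set(tokens))))
        selected = {w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:T]}
        phrases, current = set(), []
        for tok in tokens + [None]:              # sentinel flushes the last run
            if tok in selected:
                current.append(tok)
            else:
                if current:
                    phrases.add(" ".join(current))
                current = []
        return phrases

    tokens = ["advanced", "natural", "language", "processing", "is", "fun"]
    scores = {"advanced": 0.9, "natural": 0.8, "language": 0.85,
              "processing": 0.7, "is": 0.1, "fun": 0.2}
    print(merge_keyphrases(tokens, scores, fraction=0.5))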

Why it works

It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases. One way to think about it is the following. A word that appears multiple times throughout a text may have many different co-occurring neighbors. For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences. Thus, the "learning" vertex would be a central "hub" that connects to these other modifying words. Running PageRank/TextRank on the graph is likely to rank "learning" highly. Similarly, if the text contains the phrase "supervised classification", then there would be an edge between "supervised" and "classification". If "classification" appears in several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If it ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification". In the final post-processing step, we would then end up with keyphrases "supervised learning" and "supervised classification".

In short, the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts. A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters. This is similar to densely connected Web pages getting ranked highly by PageRank.

Document summarization

Like keyphrase extraction, document summarization hopes to identify the essence of a text. The only real difference is that now we are dealing with larger text units—whole sentences instead of words and phrases.

Before getting into the details of some summarization methods, we will mention how summarization systems are typically evaluated. The most common way is to use the so-called ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure. This is a recall-based measure that determines how well a system-generated summary covers the content present in one or more human-generated model summaries known as references. It is recall-based to encourage systems to include all the important topics in the text. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching, though ROUGE-1 (unigram matching) has been shown to correlate best with human assessments of system-generated summaries (i.e., the summaries with the highest ROUGE-1 values correlate with the summaries humans deemed the best). ROUGE-1 is computed as the number of unigrams in the reference summary that also appear in the system summary, divided by the total number of unigrams in the reference summary.
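
A minimal ROUGE-1 recall sketch, assuming simple whitespace tokenization and a single reference summary; real ROUGE implementations additionally handle multiple references, stemming and stopword options.

    # ROUGE-1 recall: fraction of reference unigrams that also appear in the system summary,
    # with counts clipped to the reference counts.
    from collections import Counter

    def rouge_1(system, reference):
        sys_counts = Counter(system.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(count, sys_counts[w]) for w, count in ref_counts.items())
        return overlap / sum(ref_counts.values())

    print(rouge_1("the pumps failed during the storm",
                  "defective pumps failed in the storm"))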

If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner. Higher-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.

A promising line in document summarization is adaptive document/text summarization.[2] The idea of adaptive summarization involves preliminary recognition of document/text genre and subsequent application of summarization algorithms optimized for this genre. The first summarizers that perform adaptive summarization have been created.[3]

Overview of supervised learning approaches

Supervised text summarization is very much like supervised keyphrase extraction. Basically, if you have a collection of documents and human-generated summaries for them, you can learn features of sentences that make them good candidates for inclusion in the summary. Features might include the position in the document (i.e., the first few sentences are probably important), the number of words in the sentence, etc. The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary". This is not typically how people create summaries, so simply using journal abstracts or existing summaries is usually not sufficient. The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training. Note, however, that these natural summaries can still be used for evaluation purposes, since ROUGE-1 only cares about unigrams.

Unsupervised approaches: TextRank and LexRank

The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data. Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document. Then the sentences can be ranked with regard to their similarity to this centroid sentence.
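
The centroid idea can be sketched as follows: build bag-of-words vectors for the sentences, average them into a centroid, and rank sentences by cosine similarity to that centroid. The sentences and raw-count vectors below are toy assumptions; real systems typically use TF-IDF weighting.

    # Centroid-based sentence ranking on bag-of-words vectors.
    import numpy as np

    sentences = ["the pumps failed during the storm",
                 "the storm flooded the city",
                 "officials promised an investigation"]
    vocab = sorted({w for s in sentences for w in s.split()})
    vectors = np.array([[s.split().count(w) for w in vocab] for s in sentences], dtype=float)

    centroid = vectors.mean(axis=0)
    cosine = vectors @ centroid / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid))
    for score, sentence in sorted(zip(cosine, sentences), reverse=True):
        print(round(float(score), 3), sentence)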

A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[4] is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.

Design choices
What are the vertices?

In both LexRank and TextRank, a graph is constructed by creating a vertex for each sentence in the document.

What are the edges?

The edges between sentences are based on some form of semantic similarity or content overlap. While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a very similar measure based on the number of words two sentences have in common (normalized by the sentences' lengths). The LexRank paper explored using unweighted edges after applying a threshold to the cosine values, but also experimented with using edges with weights equal to the similarity score. TextRank uses continuous similarity scores as weights.
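
For the TextRank-style edge weight, one common form of this normalization (the log-length form used in the original TextRank paper) divides the word overlap by the sum of the logarithms of the two sentence lengths. The sketch below assumes that form and plain whitespace tokenization.

    # Word-overlap sentence similarity normalized by log sentence lengths.
    import math

    def textrank_similarity(s1, s2):
        w1, w2 = s1.lower().split(), s2.lower().split()
        shared = len(set(w1) & set(w2))
        return shared / (math.log(len(w1)) + math.log(len(w2)))

    print(textrank_similarity("the pumps failed during the storm",
                              "defective pumps failed in the storm"))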

How are summaries formed?

In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.

TextRank and LexRank differences

It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights. In this case, some training documents might be needed, though the TextRank results show the additional features are not absolutely necessary.

Another important distinction is that TextRank was used for single document summarization, while LexRank has been applied to multi-document summarization. The task remains the same in both cases—only the number of sentences to choose from has grown. However, when summarizing multiple documents, there is a greater risk of selecting duplicate or highly redundant sentences to place in the same summary. Imagine you have a cluster of news articles on a particular event, and you want to produce one summary. Each article is likely to have many similar sentences, and you would only want to include distinct ideas in the summary. To address this issue, LexRank applies a heuristic post-processing step that builds up a summary by adding sentences in rank order, but discards any sentences that are too similar to ones already placed in the summary. The method used is called Cross-Sentence Information Subsumption (CSIS).
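
A hedged sketch of the redundancy-removal idea, in the spirit of CSIS or MMR rather than the exact MEAD implementation: walk the sentences in rank order and discard any candidate that is too similar to a sentence already placed in the summary. The similarity function and threshold below are illustrative choices.

    # Greedy summary construction that skips near-duplicate sentences.
    def build_summary(ranked_sentences, similarity, threshold=0.5, max_sentences=3):
        summary = []
        for sentence in ranked_sentences:
            if all(similarity(sentence, chosen) < threshold for chosen in summary):
                summary.append(sentence)
            if len(summary) == max_sentences:
                break
        return summary

    def overlap(s1, s2):                          # crude word-overlap similarity
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        return len(w1 & w2) / min(len(w1), len(w2))

    ranked = ["the pumps failed during the storm",
              "pumps failed during the storm, officials said",
              "the city was flooded for weeks"]
    print(build_summary(ranked, overlap))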

Why unsupervised summarization works

These methods work based on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. The importance of this sentence also stems from the importance of the sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text. The methods are domain-independent and easily portable. One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain. However, the unsupervised "recommendation"-based approach applies to any domain.

Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extracting information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. In this way, multi-document summarization systems complement news aggregators, taking the next step down the road of coping with information overload.

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify the information search and cut the time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, limiting the need to access the original files to cases where refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus helping to keep them unbiased.

Incorporating diversity: GRASSHOPPER algorithm

Multi-document extractive summarization faces a problem of potential redundancy. Ideally, we would like to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another). LexRank deals with diversity as a heuristic final stage using CSIS, and other systems have used similar methods, such as Maximal Marginal Relevance (MMR), in trying to eliminate redundancy in information retrieval results.

There is a general purpose graph-based ranking algorithm like Page/Lex/TextRank that handles both "centrality" and "diversity" in a unified mathematical framework based on absorbing Markov chain random walks. (An absorbing random walk is like a standard random walk, except some states are now absorbing states that act as "black holes" that cause the walk to end abruptly at that state.) The algorithm is called GRASSHOPPER. In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).
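
A simplified sketch of the absorbing-random-walk idea: after picking the highest-ranked item, turn it into an absorbing state and re-rank the remaining items by their expected number of visits before absorption, which automatically penalizes items close to those already selected. The transition matrix below is a toy example, and the prior-ranking and teleportation components of the actual GRASSHOPPER algorithm are omitted.

    # GRASSHOPPER-like selection via absorbing random walks (simplified sketch).
    import numpy as np

    def grasshopper_like(P, k):
        n = len(P)
        scores = np.linalg.matrix_power(P, 100)[0]     # approximate stationary distribution
        selected = [int(np.argmax(scores))]
        while len(selected) < k:
            remaining = [i for i in range(n) if i not in selected]
            Q = P[np.ix_(remaining, remaining)]        # transitions among non-absorbed items
            N = np.linalg.inv(np.eye(len(Q)) - Q)      # fundamental matrix: expected visits
            visits = N.mean(axis=0)                    # average expected visits per item
            selected.append(remaining[int(np.argmax(visits))])
        return selected

    P = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.5, 0.1, 0.3, 0.1],
                  [0.2, 0.3, 0.1, 0.4],
                  [0.1, 0.2, 0.5, 0.2]])
    print(grasshopper_like(P, k=3))                    # selected item indices, most central first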

Evaluation techniques

The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.

Evaluation techniques fall into intrinsic and extrinsic,[5] inter-textual and intra-textual.[6]

Intrinsic and extrinsic evaluation

An intrinsic evaluation tests the summarization system in and of itself, while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.

Inter-textual and intra-textual

Intra-textual methods assess the output of a specific summarization system, while inter-textual ones focus on contrastive analysis of the outputs of several summarization systems.

Human judgement often has wide variance on what is considered a "good" summary, which means that making the evaluation process automatic is particularly difficult. Manual evaluation can be used, but this is both time and labor intensive as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

One of the metrics used in NIST's annual Document Understanding Conferences, in which research groups submit their systems for both summarization and translation tasks, is the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation [3]). It essentially calculates n-gram overlaps between automatically generated summaries and previously-written human summaries. A high level of overlap should indicate a high level of shared concepts between the two summaries. Note that overlap metrics like this are unable to provide any feedback on a summary's coherence. Anaphor resolution remains another problem yet to be fully solved.

Current difficulties in evaluating summaries automatically

Evaluating summaries, either manually or automatically, is a hard task. The main difficulty in evaluation comes from the impossibility of building a fair gold standard against which the results of the systems can be compared. Furthermore, it is also very hard to determine what a correct summary is, because there is always the possibility that a system generates a good summary that is quite different from any human summary used as an approximation to the correct output.

Content selection is not a deterministic problem. People are subjective, and different authors would choose different sentences. And individuals may not be consistent: a particular person may choose different sentences at different times. Two distinct sentences expressed in different words can express the same meaning; this phenomenon is known as paraphrasing. One approach to automatically evaluating summaries uses paraphrases (ParaEval).

Most summarization systems perform an extractive approach, selecting and copying important sentences from the source documents. Although humans can also cut and paste relevant information from a text, most of the time they rephrase sentences when necessary, or join related pieces of information into one sentence.

Evaluating summaries qualitatively

The main drawback of the evaluation systems existing so far is that we need at least one reference summary, and for some methods more than one, to be able to compare automatic summaries with models. This is a hard and expensive task: much effort is required to build corpora of texts and their corresponding summaries. Furthermore, for some methods, not only do we need to have human-made summaries available for comparison, but manual annotation also has to be performed for some of them (e.g. SCU in the Pyramid Method). In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries. Moreover, they all perform a quantitative evaluation with regard to different similarity metrics. Because of these problems, quantitative evaluation might not be the only way to evaluate summaries; automatic qualitative evaluation would also be important.

See also

References

  1. ^ Rada Mihalcea and Paul Tarau, 2004: TextRank: Bringing Order into Texts, Department of Computer Science University of North Texas [1]
  2. ^ Yatsko, V. et al.; Automatic genre recognition and adaptive text summarization. In: Automatic Documentation and Mathematical Linguistics, 2010, Volume 44, Number 3, pp. 111-120.
  3. ^ UNIS (Universal Summarizer)
  4. ^ Güneş Erkan and Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization [2]
  5. ^ Mani, I. Summarization evaluation: an overview
  6. ^ Yatsko V. A., Vishnyakov T. N. A method for evaluating modern systems of automatic text summarization. In: Automatic Documentation and Mathematical Linguistics. - 2007. - V. 41. - No 3. - P. 93-103.

Further reading

  • Hercules, Dalianis (2003). Porting and evaluation of automatic summarization. 
  • Roxana, Angheluta (2002). The Use of Topic Segmentation for Automatic Summarization. 
  • Anne, Buist (2004). Automatic Summarization of Meeting Data: A Feasibility Study. 
  • Annie, Louis (2009). Performance Confidence Estimation for Automatic Summarization. 
  • Elena, Lloret and Manuel, Palomar (2009). Challenging Issues of Automatic Summarization: Relevance Detection and Quality-based Evaluation. 
  • Andrew, Goldberg (2007). Automatic Summarization. 
  • Endres-Niggemeyer, Brigitte (1998). Summarizing Information. ISBN 3-540-63735-4. 
  • Marcu, Daniel (2000). The Theory and Practice of Discourse Parsing and Summarization. ISBN 0-262-13372-5. 
  • Mani, Inderjeet (2001). Automatic Summarization. ISBN 1-58811-060-5. 
  • Huff, Jason (2010). AutoSummarize. Conceptual artwork using automatic summarization software in Microsoft Word 2008.
  • Lehmam, Abderrafih (2010). Essential summarizer: innovative automatic text summarization software in twenty languages. In: Proceedings of RIAO'10 Adaptivity, Personalization and Fusion of Heterogeneous Information, CID Paris, France (ACM Digital Library).
  • Xiaojin, Zhu; Andrew Goldberg; Jurgen Van Gael; and David Andrzejewski (2007). Improving diversity in ranking using absorbing random walks. The GRASSHOPPER algorithm.

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Bayes_theorem b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Bayes_theorem new file mode 100644 index 00000000..1fc093d0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Bayes_theorem @@ -0,0 +1 @@ + Bad title - Wikipedia, the free encyclopedia

Bad title


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Biomedical_text_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Biomedical_text_mining new file mode 100644 index 00000000..f5036de0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Biomedical_text_mining @@ -0,0 +1 @@ + Biomedical text mining - Wikipedia, the free encyclopedia

Biomedical text mining

From Wikipedia, the free encyclopedia

Biomedical text mining (also known as BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain. It is a rather recent research field on the edge of natural language processing, bioinformatics, medical informatics and computational linguistics.

There is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as PubMed.

Main applications

The main developments in this area have been related to the identification of biological entities (named entity recognition), such as protein and gene names in free text, the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature, automatic extraction of protein interactions and associations of proteins to functional concepts (e.g. gene ontology terms). Even the extraction of kinetic parameters from text and the subcellular location of proteins have been addressed by information extraction and text mining technology.

Examples

  • PIE - PIE (Protein Interaction information Extraction) is a configurable Web service to extract PPI-relevant articles from MEDLINE.
  • KLEIO - an advanced information retrieval system providing knowledge enriched searching for biomedicine.
  • FACTA+ - a MEDLINE search engine for finding associations between biomedical concepts. The FACTA+ Visualizer helps intuitive understanding of FACTA+ search results through graphical visualization of the results.[1]
  • U-Compare - U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework, with an emphasis on components for biomedical text mining.[2]
  • TerMine - a term management system that identifies key terms in biomedical and other text types.
  • PLAN2L — Extraction of gene regulation relations, protein-protein interactions, mutations, ranked associations and cellular and developmental process associations for genes and proteins of the plant Arabidopsis from abstracts and full text articles.
  • MEDIE - an intelligent search engine to retrieve biomedical correlations from MEDLINE, based on indexing by Natural Language Processing and Text Mining techniques [3]
  • AcroMine - an acronym dictionary which can be used to find distinct expanded forms of acronyms from MEDLINE.[4]
  • AcroMine Disambiguator - Disambiguates abbreviations in biomedical text with their correct full forms.[5]
  • GENIA tagger - Analyses biomedical text and outputs base forms, part-of-speech tags, chunk tags, and named entity tags
  • NEMine - Recognises gene/protein names in text
  • Yeast MetaboliNER - Recognizes yeast metabolite names in text.
  • Smart Dictionary Lookup - machine learning-based gene/protein name lookup.
  • TPX - A concept-assisted search and navigation tool for biomedical literature analyses - runs on PubMed/PMC and can be configured, on request, to run on local literature repositories too.[6]
  • Chilibot — A tool for finding relationships between genes or gene products.
  • EBIMed - EBIMed is a web application that combines Information Retrieval and Extraction from Medline.[7]
  • FABLE — A gene-centric text-mining search engine for MEDLINE
  • GOAnnotator, an online tool that uses Semantic similarity for verification of electronic protein annotations using GO terms automatically extracted from literature.
  • GoPubMed — retrieves PubMed abstracts for your search query, then detects ontology terms from the Gene Ontology and Medical Subject Headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms, their synonyms or descendants.
  • Anne O'Tate — Retrieves sets of PubMed records, using a standard PubMed interface, and analyzes them, arranging the content of PubMed record fields (MeSH, author, journal, words from titles and abstracts, and others) in order of frequency.
  • Information Hyperlinked Over Proteins (iHOP):[8] "A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of PubMed abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research."
  • LitInspector — Gene and signal transduction pathway data mining in PubMed abstracts.
  • NextBio — Life sciences search engine with a text mining functionality that utilizes PubMed abstracts (e.g. literature search) and clinical trials to return concepts relevant to the query based on a number of heuristics including ontology relationships, journal impact, publication date, and authorship.
  • The Neuroscience Information Framework (NIF) — A neuroscience research hub with a search engine specifically tailored for neuroscience, direct access to over 180 databases, and curated resources. Built as part of the NIH Blueprint for Neuroscience Research.
  • PubAnatomy — An interactive visual search engine that provides new ways to explore relationships among Medline literature, text mining results, anatomical structures, gene expression and other background information.
  • PubGene — Co-occurrence networks display of gene and protein symbols as well as MeSH, GO, PubChem and interaction terms (such as "binds" or "induces") as these appear in MEDLINE records (that is, PubMed titles and abstracts).
  • Whatizit - Whatizit identifies molecular biology terms and links them to publicly available databases.[9]
  • XTractor — Discovering Newer Scientific Relations Across PubMed Abstracts. A tool to obtain manually annotated, expert-curated relationships for Proteins, Diseases, Drugs and Biological Processes as they get published in PubMed.
  • Medical Abstract — Medical Abstract is an aggregator for medical abstract journals drawn from PubMed abstracts.
  • MuGeX — MuGeX is a tool for finding disease-specific mutation-gene pairs.
  • MedCase — MedCase is an experimental tool of the Faculties of Veterinary Medicine and Computer Science in Cluj-Napoca, designed as a homeostatic serving system with natural language support for medical applications.
  • BeCAS — BeCAS is a web application, API and widget for biomedical concept identification, able to annotate free text and PubMed abstracts.
  • @Note — A workbench for Biomedical Text Mining (Including Information Retrieval, Name Entity Recognition and Relation Extraction plugins)

Conferences at which BioNLP research is presented

BioNLP research is presented at a variety of meetings.

See also

External links

References

  1. ^ Tsuruoka Y, Tsujii J and Ananiadou S (2008). "FACTA: a text search engine for finding associated biomedical concepts". Bioinformatics 24 (21): 2559–2560. doi:10.1093/bioinformatics/btn469. PMC 2572701. PMID 18772154. 
  2. ^ Kano Y, Baumgartner Jr WA, McCrohon L, Ananiadou S, Cohen KB, Hunter L and Tsujii J (2009). "U-Compare: share and compare text mining tools with UIMA". Bioinformatics 25 (15): 1997–1998. doi:10.1093/bioinformatics/btp289. PMC 2712335. PMID 19414535. 
  3. ^ Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T and Tsujii J (2006). "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases". Proceedings of COLING-ACL 2006. pp. 1017–1024. 
  4. ^ Okazaki N and Ananiadou S (2006). "Building an abbreviation dictionary using a term recognition approach". Bioinformatics 22 (24): 3089–3095. doi:10.1093/bioinformatics/btl534. PMID 17050571. 
  5. ^ Okazaki N, Ananiadou S and Tsujii J (2010). "Building a high-quality sense inventory for improved abbreviation disambiguation". Bioinformatics 26 (9): 1246–1253. doi:10.1093/bioinformatics/btq129. PMC 2859134. PMID 20360059. 
  6. ^ Thomas Joseph, Vangala G Saipradeep, Ganesh Sekar Venkat Raghavan, Rajgopal Srinivasan, Aditya Rao, Sujatha Kotte & Naveen Sivadasan (2012). "TPX: Biomedical literature search made easy". Bioinformation 8 (12): 578–580. doi:10.6026/97320630008578. PMID 22829734. 
  7. ^ Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M and Stoehr P (2007). "EBIMed—text crunching to gather facts for proteins from Medline". Bioinformatics 23 (2): e237–e244. doi:10.1093/bioinformatics/btl302. PMID 17237098. 
  8. ^ Hoffmann R, Valencia A (September 2005). "Implementing the iHOP concept for navigation of biomedical literature". Bioinformatics 21 (Suppl 2): ii252–8. doi:10.1093/bioinformatics/bti1142. PMID 16204114. 
  9. ^ Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A (November 2008). "Text processing through Web services: calling Whatizit". Bioinformatics 24 (2): 296–298. doi:10.1093/bioinformatics/btm557. PMID 18006544. 

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Business_intelligence b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Business_intelligence new file mode 100644 index 00000000..82860570 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Business_intelligence @@ -0,0 +1 @@ + Business intelligence - Wikipedia, the free encyclopedia

Business intelligence

From Wikipedia, the free encyclopedia

Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle large amounts of information to help identify and develop new opportunities. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability.[1]

BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.

Though the term business intelligence is sometimes a synonym for competitive intelligence (because they both support decision making), BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. If understood broadly, business intelligence can include the subset of competitive intelligence.[2]

History

In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."[3]

Business intelligence as it is understood today is said to have evolved from the decision support systems that began in the 1960s and developed throughout the mid-1980s. DSS originated in the computer-aided models created to assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s.

In 1989, Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems."[4] It was not until the late 1990s that this usage was widespread.[5]

Business intelligence and data warehousing

Often BI applications use data gathered from a data warehouse or a data mart. A data warehouse is a copy of transactional data, structured specifically to facilitate decision support. However, not all data warehouses are used for business intelligence, nor do all business intelligence applications require a data warehouse.

To distinguish between the concepts of business intelligence and data warehouses, Forrester Research often defines business intelligence in one of two ways:

Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making."[6] When using this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master data management, text and content analytics, and many others that the market sometimes lumps into the Information Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely linked segments of the business intelligence architectural stack.

Forrester defines the latter, narrower business intelligence market as, "...referring to just the top layers of the BI architectural stack such as reporting, analytics and dashboards."[7]

Business intelligence and business analytics

Thomas Davenport argues that business intelligence should be divided into querying, reporting, OLAP, an "alerts" tool, and business analytics. In this definition, business analytics is the subset of BI based on statistics, prediction, and optimization.[8]

Applications in an enterprise

Business intelligence can be applied to the following business purposes, in order to drive business value.

  1. Measurement – program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).
  2. Analytics – program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, complex event processing and prescriptive analytics.
  3. Reporting/enterprise reporting – program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information system and OLAP.
  4. Collaboration/collaboration platform – program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
  5. Knowledge management – program to make the company data driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.

In addition to the above, business intelligence can also provide a proactive approach, such as an alarm function that immediately notifies the end user. There are many types of alerts: for example, if a business value exceeds a defined threshold, the corresponding figure in the report is highlighted and the business analyst is alerted, and sometimes an alert email is sent to the user as well. This end-to-end process requires data governance, which should be handled by an expert.

Prioritization of business intelligence projects

It is often difficult to provide a positive business case for business intelligence initiatives, so such projects frequently have to be prioritized through strategic initiatives. Here are some hints for increasing the benefits of a BI project.

  • As described by Kimball,[9] you must determine the tangible benefits, such as the eliminated cost of producing legacy reports.
  • Enforce access to data for the entire organization.[10] In this way even a small benefit, such as a few minutes saved, makes a difference when multiplied by the number of employees in the entire organization.
  • As described by Ross, Weil & Roberson for Enterprise Architecture,[11] consider letting the BI project be driven by other business initiatives with excellent business cases. To support this approach, the organization must have enterprise architects who can identify suitable business projects.
  • Use a structured and quantitative methodology to create defensible prioritization in line with the actual needs of the organization, such as a weighted decision matrix.[12]

Success factors of implementation

Before implementing a BI solution, it is worth taking several factors into consideration. According to Kimball et al., these are the three critical areas that you need to assess within your organization before getting ready to do a BI project:[13]

  1. The level of commitment and sponsorship of the project from senior management
  2. The level of business need for creating a BI implementation
  3. The amount and quality of business data available.

Business sponsorship

The commitment and sponsorship of senior management is, according to Kimball et al., the most important criterion for assessment.[14] This is because having strong management backing helps overcome shortcomings elsewhere in the project. However, as Kimball et al. state: “even the most elegantly designed DW/BI system cannot overcome a lack of business [management] sponsorship”.[15]

It is important that personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business sponsor should have organizational clout and should be well connected within the organization. It is ideal that the business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project. Support from multiple members of management ensures the project does not fail if one person leaves the steering group. However, having many managers work together on the project can also mean that several different interests attempt to pull the project in different directions, such as when different departments want to put more emphasis on their own usage. This issue can be countered by an early and specific analysis of the business areas that benefit the most from the implementation. All stakeholders in the project should participate in this analysis in order for them to feel ownership of the project and to find common ground.

Another management problem that should be addressed before implementation starts is an overly aggressive business sponsor. A sponsor who gets carried away by the possibilities of using BI may start wanting the DW or BI implementation to include several different sets of data that were not included in the original planning phase. Since such additional data sets may add many months to the original plan, it is wise to make sure the sponsor is aware of the consequences of expanding the scope.

Business needs

Because of the close relationship with senior management, another critical thing that must be assessed before the project begins is whether or not there is a business need and whether there is a clear business benefit in doing the implementation.[16] The needs and benefits of the implementation are sometimes driven by competition and the need to gain an advantage in the market. Another reason for a business-driven approach to implementation of BI is the acquisition of other organizations that enlarge the original organization; in such cases it can be beneficial to implement DW or BI in order to create more oversight.

Companies that implement BI are often large, multinational organizations with diverse subsidiaries.[17] A well-designed BI solution provides a consolidated view of key business data not available anywhere else in the organization, giving management visibility and control over measures that otherwise would not exist.

Amount and quality of available data

Without good data, it does not matter how good the management sponsorship or business-driven motivation is. Without proper data, or with too little quality data, any BI implementation fails. Before implementation it is a good idea to do data profiling. This analysis identifies the “content, consistency and structure [..]”[16] of the data. This should be done as early as possible in the process and if the analysis shows that data is lacking, put the project on the shelf temporarily while the IT department figures out how to properly collect data.

When planning for business data and business intelligence requirements, it is always advisable to consider specific scenarios that apply to a particular organization, and then select the business intelligence features best suited for the scenario.

Often, scenarios revolve around distinct business processes, each built on one or more data sources. These sources are used by features that present that data as information to knowledge workers, who subsequently act on that information. The business needs of the organization for each business process adopted correspond to the essential steps of business intelligence. These essential steps of business intelligence include, but are not limited to:

  1. Go through business data sources in order to collect needed data
  2. Convert business data to information and present appropriately
  3. Query and analyze data
  4. Act on the collected data

User aspect

Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.[18][19] If the usability of the system is poor, the users may become frustrated and spend a considerable amount of time figuring out how to use the system, or may not be able to really use the system. If the system does not add value to the users' mission, they simply will not use it.[19]

To increase user acceptance of a BI system, it can be advisable to consult business users at an early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[18] This can provide an insight into the business process and what the users need from the BI system. There are several methods for gathering this information, such as questionnaires and interview sessions.

When gathering the requirements from the business users, the local IT department should also be consulted in order to determine to which degree it is possible to fulfill the business's needs based on the available data.[18]

Taking on a user-centered approach throughout the design and development stage may further increase the chance of rapid user adoption of the BI system.[19]

Besides focusing on the user experience offered by the BI applications, adding an element of competition may also motivate users to utilize the system. Kimball[18] suggests implementing a function on the Business Intelligence portal website where reports on system usage can be found. Managers can then see how well their departments are doing and compare themselves to others, which may spur them to encourage their staff to utilize the BI system even more.

In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[20] Watson describes how a large call centre implemented performance dashboards for all call agents, with monthly incentive bonuses tied to performance metrics. Also, agents could compare their performance to other team members. The implementation of this type of performance measurement and competition significantly improved agent performance.

BI chances of success can be improved by involving senior management to help make BI a part of the organizational culture, and by providing the users with necessary tools, training, and support.[20] Training encourages more people to use the BI application.[18]

Providing user support is necessary to maintain the BI system and resolve user problems.[19] User support can be incorporated in many ways, for example by creating a website with relevant content and tools for finding the necessary information. Furthermore, helpdesk support can be used; the help desk can be staffed by power users or the DW/BI project team.[18]

BI Portals[edit]

A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications, and the user's first impression of the DW/BI system. It is typically a browser application from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The BI portal must be implemented in such a way that it is easy for the users of the DW/BI application to call on the functionality of the application.[21]

The BI portal's main functionality is to provide a navigation system of the DW/BI application. This means that the portal has to be implemented in a way that the user has access to all the functions of the DW/BI application.

The most common way to design the portal is to custom fit it to the business processes of the organization for which the DW/BI application is designed; in that way the portal can best fit the needs and requirements of its users.[22]

The BI portal needs to be easy to use and understand, and if possible have a look and feel similar to other applications or web content of the organization the DW/BI application is designed for (consistency).

The following is a list of desirable features for web portals in general and BI portals in particular:

Usable
Users should easily find what they need in the BI tool.
Content Rich
The portal is not just a report-printing tool; it should also contain functionality such as advice, help, support information and documentation.
Clean
The portal should be designed so it is easily understandable and not so complex that it confuses the users.
Current
The portal should be updated regularly.
Interactive
The portal should be implemented in a way that makes it easy for users to access its functionality and encourages them to use the portal. Scalability and customization give each user the means to fit the portal to his or her own needs.
Value Oriented
It is important that users feel that the DW/BI application is a valuable resource that is worth working with.

Marketplace[edit]

There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play" vendors and consolidated "megavendors" that have entered the market through a recent trend[when?] of acquisitions in the BI industry.[23]

Some companies adopting BI software decide to pick and choose from different product offerings (best-of-breed) rather than purchase one comprehensive integrated solution (full-service).[24]

Industry-specific[edit]

Specific considerations for business intelligence systems have to be taken in some sectors, such as banking, which is subject to governmental regulation. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully available to other groups or individuals. Therefore, BI solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing law.

Semi-structured or unstructured data[edit]

Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call centers, news, user groups, chats, reports, web pages, presentations, image files, video files, and marketing material. According to Merrill Lynch, more than 85% of all business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often only use these documents once.[25]

The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.[26] According to projections from Gartner (2003), white collar workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured data, but the former is easy to search, and the latter contains a large quantity of the information needed for analysis and decision making.[26][27] Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.[25]

Therefore, when designing a business intelligence/DW solution, the specific problems associated with semi-structured and unstructured data must be accommodated, as well as those for structured data.[27]

Unstructured data vs. semi-structured data[edit]

Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data cannot be stored in predictably ordered columns and rows. One type of unstructured data is typically stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems. Unstructured data may also refer to irregularly or randomly repeated column patterns that vary from row to row within each file or document.

Many of these data types, however, like e-mails, word-processing text files, PPTs, image files, and video files, conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore, it may be more accurate to talk about this as semi-structured documents or data,[26] but no specific consensus seems to have been reached.
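
As a purely illustrative sketch (not part of the source), the snippet below captures basic file-system metadata (path, size, modification time) into a relational table using Python's standard sqlite3 module; the documents directory and the table name are hypothetical, and richer content metadata (author, topics) would require format-specific parsers that are omitted here.

```python
# Sketch: capture basic file metadata into a relational table (SQLite).
# The "documents" directory and the table name are placeholders for this example.
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

conn = sqlite3.connect("metadata.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS doc_metadata ("
    " path TEXT PRIMARY KEY, size_bytes INTEGER, modified_utc TEXT)"
)

docs_root = Path("documents")                         # hypothetical directory of source documents
files = docs_root.rglob("*") if docs_root.is_dir() else []
for path in files:
    if path.is_file():
        stat = path.stat()
        conn.execute(
            "INSERT OR REPLACE INTO doc_metadata VALUES (?, ?, ?)",
            (str(path), stat.st_size,
             datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat()),
        )
conn.commit()
conn.close()
```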

Unstructured data can also simply be the knowledge that business users have about future business trends. Business forecasting naturally aligns with the BI system because business users think of their business in aggregate terms. Capturing the business knowledge that may only exist in the minds of business users provides some of the most important data points for a complete BI solution.

Problems with semi-structured or unstructured data[edit]

There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[28] some of those are:

  1. Physically accessing unstructured textual data – unstructured data is stored in a huge variety of formats.
  2. Terminology – Among researchers and analysts, there is a need to develop a standardized terminology.
  3. Volume of data – As stated earlier, up to 85% of all data exists as semi-structured data, and this volume is coupled with the need for word-to-word and semantic analysis.
  4. Searchability of unstructured textual data – A simple search on some data, e.g. apple, results in links where there is a reference to that precise search term. (Inmon & Nesavich, 2008)[28] gives an example: “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies.” A small illustrative sketch of such term expansion is shown below.
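
The following is a naive, illustrative sketch only: a keyword search that also matches narrower terms taken from a hand-made taxonomy, echoing the felony example above. The taxonomy and the sample documents are invented for the example.

```python
# Naive sketch of taxonomy-aware search: a query on a broad term also matches
# documents that mention narrower terms. Taxonomy and documents are illustrative.
TAXONOMY = {
    "felony": {"felony", "arson", "murder", "embezzlement", "vehicular homicide"},
}

def search(query, documents):
    """Return ids of documents containing the query term or any narrower term."""
    terms = TAXONOMY.get(query.lower(), {query.lower()})
    return [doc_id for doc_id, text in documents.items()
            if any(term in text.lower() for term in terms)]

docs = {
    "d1": "The defendant was charged with arson.",
    "d2": "Quarterly revenue grew by 4 percent.",
    "d3": "A felony conviction was recorded.",
}
print(search("felony", docs))   # ['d1', 'd3'] - a plain keyword search would miss d1
```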

The use of metadata[edit]

To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata.[25] Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content – e.g. summaries, topics, people or companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.

Future[edit]

A 2009 Gartner paper predicted[29] these developments in the business intelligence market:

  • Because of a lack of information, processes, and tools, through 2012 more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.
  • By 2012, business units will control at least 40 percent of the total budget for business intelligence.
  • By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.

A 2009 Information Management special report predicted the top BI trends: "green computing, social networking, data visualization, mobile BI, predictive analytics, composite applications, cloud computing and multitouch."[30]

Other business intelligence trends include the following:

  • Third party SOA-BI products increasingly address ETL issues of volume and throughput.
  • Cloud computing and Software-as-a-Service (SaaS) are ubiquitous.
  • Companies embrace in-memory processing, 64-bit processing, and pre-packaged analytic BI applications.
  • Operational applications have callable BI components, with improvements in response time, scaling, and concurrency.
  • Near or real time BI analytics is a baseline expectation.
  • Open source BI software replaces vendor offerings.

Other lines of research include the combined study of business intelligence and uncertain data.[31][32] In this context, the data used is not assumed to be precise, accurate and complete. Instead, data is considered uncertain and therefore this uncertainty is propagated to the results produced by BI.

According to a study by the Aberdeen Group, there has been increasing interest in Software-as-a-Service (SaaS) business intelligence over the past years, with twice as many organizations using this deployment approach as one year ago – 15% in 2009 compared to 7% in 2008.[citation needed]

An article by InfoWorld’s Chris Kanaracus points out similar growth data from research firm IDC, which predicts the SaaS BI market will grow 22 percent each year through 2013 thanks to increased product sophistication, strained IT budgets, and other factors.[33]

See also[edit]

References[edit]

  1. ^ Rud, Olivia (2009). Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy. Hoboken, N.J.: Wiley & Sons. ISBN 978-0-470-39240-9. 
  2. ^ Kobielus, James (30 April 2010). "What’s Not BI? Oh, Don’t Get Me Started....Oops Too Late...Here Goes....". "“Business” intelligence is a non-domain-specific catchall for all the types of analytic data that can be delivered to users in reports, dashboards, and the like. When you specify the subject domain for this intelligence, then you can refer to “competitive intelligence,” “market intelligence,” “social intelligence,” “financial intelligence,” “HR intelligence,” “supply chain intelligence,” and the like." 
  3. ^ H P Luhn (1958). "A Business Intelligence System". IBM Journal 2 (4): 314. doi:10.1147/rd.24.0314. 
  4. ^ D. J. Power (10 March 2007). "A Brief History of Decision Support Systems, version 4.0". DSSResources.COM. Retrieved 10 July 2008. 
  5. ^ Power, D. J. "A Brief History of Decision Support Systems". Retrieved 1 November 2010. 
  6. ^ Evelson, Boris (21 November 2008). "Topic Overview: Business Intelligence". 
  7. ^ Evelson, Boris (29 April 2010). "Want to know what Forrester's lead data analysts are thinking about BI and the data domain?". 
  8. ^ Henschen, Doug (4 January 2010). Analytics at Work: Q&A with Tom Davenport. (Interview). http://www.informationweek.com/news/software/bi/222200096.
  9. ^ Kimball et al., 2008: 29
  10. ^ "Are You Ready for the New Business Intelligence?". Dell.com. Retrieved 2012-06-19. 
  11. ^ Jeanne W. Ross, Peter Weil, David C. Robertson (2006) "Enterprise Architecture As Strategy", p. 117 ISBN 1-59139-839-8.
  12. ^ Krapohl, Donald. "A Structured Methodology for Group Decision Making". AugmentedIntel. Retrieved 22 April 2013. 
  13. ^ Kimball et al. 2008: p. 298
  14. ^ Kimball et al., 2008: 16
  15. ^ Kimball et al., 2008: 18
  16. ^ a b Kimball et al., 2008: 17
  17. ^ "How Companies Are Implementing Business Intelligence Competency Centers". Computer World. Retrieved April 2006. 
  18. ^ a b c d e f Kimball
  19. ^ a b c d Swain Scheps Business Intelligence for Dummies, 2008, ISBN 978-0-470-12723-0
  20. ^ a b Watson, Hugh J.; Wixom, Barbara H. (2007). "The Current State of Business Intelligence". Computer 40 (9): 96. doi:10.1109/MC.2007.331. 
  21. ^ The Data Warehouse Lifecycle Toolkit (2nd ed.). Ralph Kimball (2008).
  22. ^ Microsoft Data Warehouse Toolkit. Wiley Publishing. (2006)
  23. ^ Pendse, Nigel (7 March 2008). "Consolidations in the BI industry". The OLAP Report. 
  24. ^ Imhoff, Claudia (4 April 2006). "Three Trends in Business Intelligence Technology". 
  25. ^ a b c Rao, R. (2003). "From unstructured data to actionable intelligence". IT Professional 5 (6): 29. doi:10.1109/MITP.2003.1254966. 
  26. ^ a b c Blumberg, R. & S. Atre (2003). "The Problem with Unstructured Data". DM Review: 42–46. 
  27. ^ a b Negash, S (2004). "Business Intelligence". Communications of the Association of Information Systems 13: 177–195. 
  28. ^ a b Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13
  29. ^ Gartner Reveals Five Business Intelligence Predictions for 2009 and Beyond. gartner.com. 15 January 2009
  30. ^ Campbell, Don (23 June 2009). "10 Red Hot BI Trends". Information Management. 
  31. ^ Rodriguez, Carlos; Daniel, Florian; Casati, Fabio; Cappiello, Cinzia (2010). "Toward Uncertain Business Intelligence: The Case of Key Indicators". IEEE Internet Computing 14 (4): 32. doi:10.1109/MIC.2010.59. 
  32. ^ Rodriguez, C.; Daniel, F.; Casati, F.; Cappiello, C. (2009). "Computing Uncertain Key Indicators from Uncertain Data". Proceedings of ICIQ'09. pp. 106–120. 
  33. ^ SaaS BI growth will soar in 2010 | Cloud Computing. InfoWorld (2010-02-01). Retrieved on 17 January 2012.

Bibliography[edit]

  • Ralph Kimball et al. The Data Warehouse Lifecycle Toolkit (2nd ed.). Wiley. ISBN 0-470-47957-4
  • Peter Rausch, Alaa Sheta, Aladdin Ayesh : Business Intelligence and Performance Management: Theory, Systems, and Industrial Applications, Springer Verlag U.K., 2013, ISBN 978-1-4471-4865-4.

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Buzzword b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Buzzword new file mode 100644 index 00000000..2292f3eb --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Buzzword @@ -0,0 +1 @@ + Buzzword - Wikipedia, the free encyclopedia

Buzzword

From Wikipedia, the free encyclopedia

A buzzword is a word or phrase used to impress, or an expression which is fashionable. Buzzwords often originate in jargon. Buzzwords are often neologisms.[1]

The term was first used in 1946 as student slang.[2]

Contents

Examples[edit]

The following terms are, or were, examples of buzzwords (see also list of buzzwords):

See also[edit]

Footnotes[edit]

  1. ^ Grammar.About.com - definition of buzzword
  2. ^ Online Etymology Dictionary. Douglas Harper, Historian.
  3. ^ The Register: The Long Tail's maths begin to crumble
  4. ^ Evolt: Buzzword Bingo
  5. ^ "The Buzzword Bingo Book: The Complete, Definitive Guide to the Underground Workplace Game of Doublespeak", author: Benjamin Yoskovitz, publisher: Villard, ISBN 978-0-375-75348-0
  6. ^ Cnet.com's Top 10 Buzzwords

Further reading[edit]

  • Negus, K. Pickering, M. 2004. Creativity, Communication and Cultural Value. Sage Publications Ltd
  • Collins, David. 2000. Management fads and buzzwords : critical-practical perspectives. London ; New York : Routledge
  • Godin, B. 2006. The Knowledge-Based Economy: Conceptual Framework or Buzzword?. The Journal of technology transfer 31 (1): 17-.

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/CIKM_Conference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/CIKM_Conference new file mode 100644 index 00000000..99baf3bf --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/CIKM_Conference @@ -0,0 +1 @@ + Conference on Information and Knowledge Management - Wikipedia, the free encyclopedia

Conference on Information and Knowledge Management

From Wikipedia, the free encyclopedia
  (Redirected from CIKM Conference)

The ACM Conference on Information and Knowledge Management (CIKM, pronounced /ˈsikəm/) is an annual computer science research conference dedicated to information and knowledge management. Since the first event in 1992, the conference has evolved into one of the major forums for research on database management, information retrieval, and knowledge management.[1][2] The conference is noted for its interdisciplinarity, as it brings together communities that otherwise often publish at separate venues. Recent editions have attracted well beyond 500 participants.[3] In addition to the main research program, the conference also features a number of workshops, tutorials, and industry presentations.[4]

For many years, the conference was held in the USA. Since 2005, venues in other countries have been selected as well. Locations include:[5]

See also[edit]

References[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Category_Data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Category_Data_mining new file mode 100644 index 00000000..fde21417 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Category_Data_mining @@ -0,0 +1 @@ + Category:Data mining - Wikipedia, the free encyclopedia

Category:Data mining

From Wikipedia, the free encyclopedia

Data mining facilities are included in some of the Category:Data analysis software and Category:Statistical software products.

Subcategories

This category has the following 5 subcategories, out of 5 total.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cluster_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cluster_analysis new file mode 100644 index 00000000..59efc1b8 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cluster_analysis @@ -0,0 +1 @@ + Cluster analysis - Wikipedia, the free encyclopedia

Cluster analysis

From Wikipedia, the free encyclopedia
The result of a cluster analysis shown as the coloring of the squares into three clusters.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification primarily their discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

Contents

Clusters and clusterings[edit]

The notion of a "cluster" cannot be precisely defined,[1] which is one of the reasons why there are so many clustering algorithms. There of course is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:

  • Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
  • Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.
  • Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
  • Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
  • Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
  • Group models: some algorithms (unfortunately) do not provide a refined model for their results and just provide the grouping information.
  • Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques.

A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished in:

  • hard clustering: each object belongs to a cluster or not
  • soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

  • strict partitioning clustering: here each object belongs to exactly one cluster
  • strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers.
  • overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
  • hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
  • subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.

Clustering algorithms[edit]

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.

There is no objectively "correct" clustering algorithm, but, as has been noted, "clustering is in the eye of the beholder."[1] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model.[1] For example, k-means cannot find non-convex clusters.[1]

Connectivity based clustering (hierarchical clustering)[edit]

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to) to use. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

While these methods are fairly easy to understand, the results are not always easy to use, as they will not produce a unique partitioning of the data set, but a hierarchy the user still needs to choose appropriate clusters from. The methods are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as "chaining phenomenon", in particular with single-linkage clustering). In the general case, the complexity is \mathcal{O}(n^3), which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity \mathcal{O}(n^2)) are known: SLINK[2] for single-linkage and CLINK[3] for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but often considered obsolete. They did however provide inspiration for many later methods such as density based clustering.
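
As an illustration only (not from the source), the sketch below runs agglomerative clustering with SciPy on a small synthetic data set and cuts the resulting hierarchy into a flat three-cluster partitioning; the blobs and all parameter values are made up.

```python
# Sketch: agglomerative (hierarchical) clustering with SciPy on synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # three synthetic blobs
               rng.normal(3, 0.3, (20, 2)),
               rng.normal([0, 3], 0.3, (20, 2))])

Z_single = linkage(X, method="single")           # single linkage: minimum object distance
Z_complete = linkage(X, method="complete")       # complete linkage: maximum object distance

# Cut the complete-linkage hierarchy into a flat partitioning with three clusters.
labels = fcluster(Z_complete, t=3, criterion="maxclust")
print(np.bincount(labels))                       # cluster sizes
```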

Centroid-based clustering[edit]

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd's algorithm,[4] often actually referred to as "k-means algorithm". It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++) or allowing a fuzzy cluster assignment (Fuzzy c-means).

Most k-means-type algorithms require the number of clusters - k - to be specified in advance, which is considered one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders).

K-means has a number of interesting theoretical properties. On one hand, it partitions the data space into a structure known as Voronoi diagram. On the other hand, it is conceptually close to nearest neighbor classification and as such popular in machine learning. Third, it can be seen as a variation of model based classification, and Lloyd's algorithm as a variation of the Expectation-maximization algorithm for this model discussed below.
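
The following is a minimal, illustrative sketch of Lloyd-style k-means using scikit-learn, with k fixed in advance and several random restarts; the synthetic two-blob data set and all parameter values are assumptions made for the example.

```python
# Sketch: k-means via scikit-learn, with k fixed in advance and several random restarts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # keep the best of 10 random starts
labels = km.fit_predict(X)
print(km.cluster_centers_)      # central vectors, not necessarily members of the data set
print(km.inertia_)              # sum of squared distances to the nearest center
```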

Distribution-based clustering[edit]

The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution. A nice property of this approach is that this closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.

While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.

One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm). Here, the data set is usually modeled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit better to the data set. This will converge to a local optimum, so multiple runs may produce different results. In order to obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to, for soft clusterings this is not necessary.

Distribution-based clustering is a semantically strong method, as it not only provides clusters, but also produces complex models for the clusters that can capture correlation and dependence between attributes. However, these algorithms put an extra burden on the user: to choose appropriate data models to optimize, and for many real data sets, there may be no mathematical model available that the algorithm is able to optimize (e.g. assuming Gaussian distributions is a rather strong assumption on the data).
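
Purely as an illustration (not from the source), the sketch below fits a Gaussian mixture by expectation-maximization with scikit-learn and contrasts the resulting hard and soft cluster assignments; the synthetic data and the fixed number of components are assumptions.

```python
# Sketch: Gaussian mixture clustering via expectation-maximization (scikit-learn).
# The number of components is fixed in advance; the data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(5, 0.5, (100, 2))])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
hard = gmm.predict(X)          # hard clustering: most likely component per object
soft = gmm.predict_proba(X)    # soft clustering: membership probabilities
print(hard[:5])
print(soft[:5].round(3))
```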

Density-based clustering[edit]

In density-based clustering,[5] clusters are defined as areas of higher density than the remainder of the data set. Objects in these sparse areas - that are required to separate clusters - are usually considered to be noise and border points.

The most popular[6] density based clustering method is DBSCAN.[7] In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low - it requires a linear number of range queries on the database - and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times. OPTICS[8] is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter \varepsilon, and produces a hierarchical result related to that of linkage clustering. DeLi-Clu,[9] Density-Link-Clustering combines ideas from single-linkage clustering and OPTICS, eliminating the \varepsilon parameter entirely and offering performance improvements over OPTICS by using an R-tree index.

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. Moreover, they cannot detect intrinsic cluster structures, which are prevalent in the majority of real-life data. A variation of DBSCAN, EnDBSCAN,[10] efficiently detects such kinds of structures. On data sets with, for example, overlapping Gaussian distributions - a common use case in artificial data - the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data.
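
As a minimal, illustrative sketch (not from the source), the snippet below runs DBSCAN from scikit-learn on a synthetic data set with a few scattered points; the values of eps and min_samples and the data itself are arbitrary choices, and points labelled -1 are treated as noise.

```python
# Sketch: density-based clustering with DBSCAN (scikit-learn). eps is the radius,
# min_samples the density threshold; points labelled -1 are treated as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(3, 0.2, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])   # a few scattered points acting as noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")
```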

Newer developments[edit]

In recent years considerable effort has been put into improving algorithm performance of the existing algorithms.[11] Among them are CLARANS (Ng and Han, 1994),[12] and BIRCH (Zhang et al., 1996).[13] With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering. Various other approaches to clustering have been tried such as seed based clustering.[14]

For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrary rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes. Examples for such clustering algorithms are CLIQUE[15] and SUBCLU.[16]

Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adopted to subspace clustering (HiSC,[17] hierarchical subspace clustering and DiSH[18]) and correlation clustering (HiCO,[19] hierarchical correlation clustering, 4C[20] using "correlation connectivity" and ERiC[21] exploring hierarchical density-based correlation clusters).

Several different clustering systems based on mutual information have been proposed. One is Marina Meilă's variation of information metric;[22] another provides hierarchical clustering.[23] Using genetic algorithms, a wide range of different fit functions can be optimized, including mutual information.[24] Also, message-passing algorithms, a recent development in computer science and statistical physics, have led to the creation of new types of clustering algorithms.[25]

Evaluation of clustering results[edit]

Evaluation of clustering results sometimes is referred to as cluster validation.

There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method.

Internal evaluation[edit]

When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.[26] Additionally, this evaluation is biased towards algorithms that use the same cluster model. For example k-Means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.

Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this does not imply that one algorithm produces more valid results than another.[1] Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of model has no chance if the data set contains a radically different kind of model, or if the evaluation measures a radically different criterion.[1] For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters, neither the use of k-means, nor of an evaluation criterion that assumes convexity, is sound.

The following methods can be used to assess the quality of clustering algorithms based on internal criterion:

One internal measure builds on the vector space model, one of the most common models in information retrieval (IR), which represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, linear algebra can be applied to this model to obtain a metric for measuring cluster quality. The metric is based on the idea that cluster quality is proportional to the number of terms that are disjoint across the clusters: it compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms across clusters. Because the metric can be difficult to interpret, a standardized version is defined, which specifies the number of standard deviations a clustering of a document set is from an average, random clustering of that document set. Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.

The Davies–Bouldin index can be calculated by the following formula:
 DB = \frac {1} {n} \sum_{i=1}^{n} \max_{i\neq j}\left(\frac{\sigma_i + \sigma_j} {d(c_i,c_j)}\right)
where n is the number of clusters, c_x is the centroid of cluster x, \sigma_x is the average distance of all elements in cluster x to centroid c_x, and d(c_i,c_j) is the distance between centroids c_i and c_j. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:[27]
 D = \min_{1\leq i \leq n}\left\{\min_{1\leq j \leq n,i\neq j}\left\{\frac {d(i,j)}{\max_{1\leq k \leq n}{d^{'}(k)}}\right\}\right\}
where d(i,j) represents the distance between clusters i and j, and d^{'}(k) measures the intra-cluster distance of cluster k. The inter-cluster distance d(i,j) between two clusters may be any of a number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance d^{'}(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable. A small computational sketch of both indices follows below.
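
The following sketch is illustrative only: it evaluates a k-means result with scikit-learn's Davies-Bouldin score and a hand-rolled Dunn index. Here the inter-cluster distance is taken as the closest pair of members from different clusters and the intra-cluster distance as the cluster diameter, which is just one of the possible choices mentioned above; the data is synthetic.

```python
# Sketch: internal evaluation of a k-means result with the Davies-Bouldin index
# (scikit-learn) and a simple Dunn index computed from pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # minimal inter-cluster distance: closest pair of members from different clusters
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    # maximal intra-cluster distance: diameter of the widest cluster
    intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return inter / intra

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.4, (40, 2)), rng.normal(4, 0.4, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Davies-Bouldin (lower is better):", round(davies_bouldin_score(X, labels), 3))
print("Dunn (higher is better):", round(dunn_index(X, labels), 3))
```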

External evaluation[edit]

In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, often created by human experts. Thus, the benchmark sets can be thought of as a gold standard for evaluation. These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies.[28] Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result.[28]

Some of the measures of the quality of a cluster algorithm using external criteria include the following (a short computational sketch of the pair-counting measures appears after the list):

The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:
 RI = \frac {TP + TN} {TP + FP + FN + TN}
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern.
The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter \beta \geq 0. Let precision and recall be defined as follows:
 P = \frac {TP } {TP + FP }
 R = \frac {TP } {TP + FN}
where P is the precision rate and R is the recall rate. We can calculate the F-measure by using the following formula:[26]
 F_{\beta} = \frac {(\beta^2 + 1)\cdot P \cdot R } {\beta^2 \cdot P + R}
Notice that when \beta=0, F_{0}=P. In other words, recall has no impact on the F-measure when \beta=0, and increasing \beta allocates an increasing amount of weight to recall in the final F-measure.
  • Pair-counting F-Measure is the F-Measure applied to the set of object pairs, where objects are paired with each other when they are part of the same cluster. This measure is able to compare clusterings with different numbers of clusters.
  • Jaccard index
The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:
 J(A,B) = \frac {|A \cap B| } {|A \cup B|} = \frac{TP}{TP + FP + FN}
This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets.
The Fowlkes-Mallows index computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes-Mallows index, the more similar the clusters and the benchmark classifications are. It can be computed using the following formula:
 FM = \sqrt{ \frac {TP}{TP+FP} \cdot \frac{TP}{TP+FN}  }
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The FM index is the geometric mean of the precision and recall P and R, while the F-measure is their harmonic mean.[31] Moreover, precision and recall are also known as Wallace's indices B^I and B^{II}.[32]
A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold standard cluster.
  • The Mutual Information is an information theoretic measure of how much information is shared between a clustering and a ground-truth classification that can detect a non-linear similarity between two clusterings. Adjusted mutual information is the corrected-for-chance variant of this that has a reduced bias for varying cluster numbers.
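
As a purely illustrative sketch (not from the source), the code below counts the pair-level TP, FP, FN and TN for a clustering against benchmark labels and derives the Rand index, Jaccard index, F-measure and Fowlkes-Mallows index from them; the two labelings are invented for the example.

```python
# Sketch: pair-counting comparison of a clustering against benchmark labels.
# TP/FP/FN/TN count pairs of objects; Rand, Jaccard, F-measure and Fowlkes-Mallows
# follow directly from these counts. The labelings below are illustrative.
from itertools import combinations

def pair_counts(pred, truth):
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_truth = pred[i] == pred[j], truth[i] == truth[j]
        if same_pred and same_truth:
            tp += 1            # pair together in both
        elif same_pred:
            fp += 1            # together in the clustering only
        elif same_truth:
            fn += 1            # together in the benchmark only
        else:
            tn += 1            # separated in both
    return tp, fp, fn, tn

pred = [0, 0, 0, 1, 1, 1, 2, 2]
truth = [0, 0, 1, 1, 1, 1, 2, 2]
tp, fp, fn, tn = pair_counts(pred, truth)

rand = (tp + tn) / (tp + fp + fn + tn)
jaccard = tp / (tp + fp + fn)
p, r = tp / (tp + fp), tp / (tp + fn)
beta = 1.0
f_beta = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
fm = (p * r) ** 0.5            # geometric mean of precision and recall
print(rand, jaccard, f_beta, fm)
```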

Clustering Axioms[edit]

Given the myriad of clustering algorithms and objectives, it is helpful to reason about clustering independently of any particular algorithm, objective function, or generative data model. This can be achieved by defining a clustering function as one that satisfies a set of properties; this is often termed an axiomatic system. Functions that satisfy the basic axioms are called clustering functions.[33]

Formal Preliminaries[edit]

A partitioning function acts on a set S of n \ge 2 points, along with an integer k > 0 and pairwise distances among the points in S. The points in S are not assumed to belong to any specific set; the pairwise distances are the only data the partitioning function has about them. Since we wish to deal with point sets that do not necessarily belong to a specific set, we identify the points with the set S = \{1, 2, \ldots, n\}. We can then define a distance function to be any function d : S \times S \rightarrow R such that for distinct i, j \in S, we have d(i, j) \ge 0, d(i, j) = 0 if and only if i = j, and d(i, j) = d(j, i) - in other words, d must be symmetric, and two points have distance zero if and only if they are the same point.

A partitioning function is a function F that takes a distance function d on S \times S and an integer k \ge 1 and returns a k-partitioning of S. A k-partitioning of S is a collection of non-empty disjoint subsets of S whose union is S. The sets in F(d,k) will be called its "clusters". Two clustering functions are equivalent if and only if they output the same partitioning on all values of d and k - i.e. functionally equivalent.
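
As an illustration only, the sketch below represents such a distance function over S = {0, ..., n-1} as a symmetric matrix and checks the required properties (non-negativity, zero exactly on the diagonal, symmetry); the example matrix is arbitrary.

```python
# Sketch: a distance function over S = {0, ..., n-1} as a symmetric matrix, with a
# check of the required properties. The example matrix is arbitrary.
import numpy as np

def is_distance_function(d: np.ndarray) -> bool:
    n = d.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    return (
        np.allclose(d, d.T)                # symmetry: d(i, j) = d(j, i)
        and np.allclose(np.diag(d), 0.0)   # d(i, i) = 0
        and bool(np.all(d[off_diag] > 0))  # d(i, j) > 0 for distinct i, j
    )

d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.5],
              [4.0, 2.5, 0.0]])
print(is_distance_function(d))   # True
```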

Axioms[edit]

Now, in an effort to distinguish clustering functions from partitioning functions, we lay down some properties that one may like a clustering function to satisfy. Here is the first one: if d is a distance function, then we define \alpha \cdot d to be the same function with all distances multiplied by \alpha.

Scale-Invariance.

For any distance function d, number of clusters k, and scalar \alpha > 0, we have F(d, k) = F(\alpha \cdot d, k)

This property simply requires the function to be immune to stretching or shrinking the data points linearly. It effectively disallows clustering functions that are sensitive to changes in units of measurement - which is desirable. We would like clustering functions not to have any predefined, hard-coded distance values in their decision process.
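
As an informal, illustrative check only (not a proof and not from the source), the sketch below scales a precomputed distance matrix by an arbitrary \alpha and verifies that a single-linkage clustering cut into k clusters is unchanged; the data, the value of \alpha and the choice of single linkage are assumptions.

```python
# Sketch: empirical check of scale-invariance for single-linkage clustering on a
# precomputed (condensed) distance matrix. Data and alpha are arbitrary.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(2, 0.3, (15, 2))])
d = pdist(X)                     # condensed pairwise distance matrix

def cluster(dist, k):
    return fcluster(linkage(dist, method="single"), t=k, criterion="maxclust")

alpha = 7.3
print(np.array_equal(cluster(d, 2), cluster(alpha * d, 2)))   # expected: True
```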

The next property ensures that the clustering function is "rich" in the types of partitioning it can output. For a fixed S and k, let Range(F(\bullet , k)) be the set of all possible outputs while varying d.

k-Richness.

For any number of clusters k, Range(F(\bullet, k)) is equal to the set of all k-partitions of S

In other words, if we are given a set of points such that all we know about the points are pairwise distances, then for any partitioning \Gamma, there should exist a d such that F(d, k) = \Gamma. By varying distances amongst points, we should be able to obtain all possible k-partitionings.

The next property is more subtle. We call a partitioning function "consistent" if it satisfies the following: when we shrink distances between points in the same cluster and expand distances between points in different clusters, we get the same result. Formally, we say that d' is a \Gamma-transformation of d if (a) for all i,j \in S belonging to the same cluster of \Gamma, we have d'(i,j) \le d(i,j); and (b) for all i,j \in S belonging to different clusters of \Gamma, we have d'(i,j) \ge d(i,j). In other words, d' is a transformation of d such that points inside the same cluster are brought closer together and points not inside the same cluster are moved further away from one another.

Consistency.

Fix k. Let d be a distance function, and d' be a F(d, k)-transformation of d. Then F(d, k) = F(d', k)

In other words, suppose that we run the partitioning function F on d to get back a particular partitioning \Gamma. Now, with respect to \Gamma, if we shrink in-cluster distances or expand between-cluster distances and run F again, we should still get back the same result - namely \Gamma.

The partitioning function F is forced to return a fixed number of clusters: k. If this were not the case, then the above three properties could never be satisfied by any function.[34] In many popular clustering algorithms such as k-means, Single-Linkage, and spectral clustering, the number of clusters to be returned is determined beforehand – by the human user or other methods – and passed into the clustering function as a parameter.

Applications[edit]

Biology, computational biology and bioinformatics
Plant and animal ecology
cluster analysis is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes
Transcriptomics
clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
Sequence analysis
clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.
High-throughput genotyping platforms
clustering algorithms are used to automatically assign genotypes.
Human genetic clustering
The similarity of genetic data is used in clustering to infer population structures.
Medicine
Medical imaging
On PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.
IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Business and marketing
Market research
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, and for use in market segmentation, product positioning, new product development and selecting test markets.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU)
World wide web
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Search result grouping
In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web based clustering tools such as Clusty.
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This both speeds up map rendering and reduces the amount of visual clutter.
Computer science
Software evolution
Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence a means of direct preventative maintenance.
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Evolutionary algorithms
Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.
Recommender systems
Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.
Markov chain Monte Carlo methods
Clustering is often utilized to locate and characterize extrema in the target distribution.
Social science
Crime analysis
Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.
Educational data mining
Cluster analysis is for example used to identify groups of schools or students with similar properties.
Typologies
From poll data, projects such as those undertaken by the Pew Research Center use cluster analysis to discern typologies of opinions, habits, and demographics that may be useful in politics and marketing.
Others
Field robotics
Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data.[35]
Mathematical chemistry
To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the space of 90 topological indices.[36]
Climatology
To find weather regimes or preferred sea level pressure atmospheric patterns.[37]
Petroleum geology
Cluster analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.
Physical geography
The clustering of chemical properties in different sample locations.

See also[edit]

Related topics[edit]

Related methods[edit]

References[edit]

  1. ^ a b c d e f Estivill-Castro, V. (2002). "Why so many clustering algorithms". ACM SIGKDD Explorations Newsletter 4: 65. doi:10.1145/568574.568575.  edit
  2. ^ R. Sibson (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method". The Computer Journal (British Computer Society) 16 (1): 30–34. doi:10.1093/comjnl/16.1.30. 
  3. ^ D. Defays (1977). "An efficient algorithm for a complete link method". The Computer Journal (British Computer Society) 20 (4): 364–366. doi:10.1093/comjnl/20.4.364. 
  4. ^ Lloyd, S. (1982). "Least squares quantization in PCM". IEEE Transactions on Information Theory 28 (2): 129–137. doi:10.1109/TIT.1982.1056489.  edit
  5. ^ Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek (2011). "Density-based Clustering". WIREs Data Mining and Knowledge Discovery 1 (3): 231–240. doi:10.1002/widm.30. 
  6. ^ Microsoft academic search: most cited data mining articles: DBSCAN is on rank 24, when accessed on: 4/18/2010
  7. ^ Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In Evangelos Simoudis, Jiawei Han, Usama M. Fayyad. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9. 
  8. ^ Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. 
  9. ^ Achtert, E.; Böhm, C.; Kröger, P. (2006). "DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". LNCS: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 3918: 119–128. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.  edit
  10. ^ S Roy, D K Bhattacharyya (2005). "An Approach to find Embedded Clusters Using Density Based Techniques". LNCS Vol.3816. Springer Verlag. pp. 523–535. 
  11. ^ Z. Huang. "Extensions to the k-means algorithm for clustering large data sets with categorical values". Data Mining and Knowledge Discovery, 2:283–304, 1998.
  12. ^ R. Ng and J. Han. "Efficient and effective clustering method for spatial data mining". In: Proceedings of the 20th VLDB Conference, pages 144-155, Santiago, Chile, 1994.
  13. ^ Tian Zhang, Raghu Ramakrishnan, Miron Livny. "An Efficient Data Clustering Method for Very Large Databases." In: Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 103–114.
  14. ^ Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases". ACM Transactions on Database Systems 15 (4): 483. doi:10.1145/99935.99938.  edit
  15. ^ Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005). "Automatic Subspace Clustering of High Dimensional Data". Data Mining and Knowledge Discovery 11: 5. doi:10.1007/s10618-005-1396-1.  edit
  16. ^ Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.
  17. ^ Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2006). "Finding Hierarchies of Subspace Clusters". LNCS: Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science 4213: 446–453. doi:10.1007/11871637_42. ISBN 978-3-540-45374-1.  edit
  18. ^ Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2007). "Detection and Visualization of Subspace Cluster Hierarchies". LNCS: Advances in Databases: Concepts, Systems and Applications. Lecture Notes in Computer Science 4443: 152–163. doi:10.1007/978-3-540-71703-4_15. ISBN 978-3-540-71702-7.  edit
  19. ^ Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). "Mining Hierarchies of Correlation Clusters". Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119–128. doi:10.1109/SSDBM.2006.35. ISBN 0-7695-2590-3.  edit
  20. ^ Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). "Computing Clusters of Correlation Connected objects". Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04. p. 455. doi:10.1145/1007568.1007620. ISBN 1581138598.  edit
  21. ^ Achtert, E.; Bohm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A. (2007). "On Exploring Complex Relationships of Correlation Clusters". 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007). p. 7. doi:10.1109/SSDBM.2007.21. ISBN 0-7695-2868-6.  edit
  22. ^ Meilă, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777: 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1. 
  23. ^ Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger, "Hierarchical Clustering Based on Mutual Information", (2003) ArXiv q-bio/0311039
  24. ^ Auffarth, B. (2010). Clustering by a Genetic Algorithm with Biased Mutation Operator. WCCI CEC. IEEE, July 18–23, 2010. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.170.869
  25. ^ B.J. Frey and D. Dueck (2007). "Clustering by Passing Messages Between Data Points". Science 315 (5814): 972–976. doi:10.1126/science.1136800. PMID 17218491  Papercore summary Frey2007
  26. ^ a b Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5. 
  27. ^ Dunn, J. (1974). "Well separated clusters and optimal fuzzy partitions". Journal of Cybernetics 4: 95–104. doi:10.1080/01969727408546059. 
  28. ^ a b Ines Färber, Stephan Günnemann, Hans-Peter Kriegel, Peer Kröger, Emmanuel Müller, Erich Schubert, Thomas Seidl, Arthur Zimek (2010). "On Using Class-Labels in Evaluation of Clusterings". In Xiaoli Z. Fern, Ian Davidson, Jennifer Dy. MultiClust: Discovering, Summarizing, and Using Multiple Clusterings. ACM SIGKDD. 
  29. ^ W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association (American Statistical Association) 66 (336): 846–850. doi:10.2307/2284239. JSTOR 2284239. 
  30. ^ E. B. Fowlkes & C. L. Mallows (1983), "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association 78, 553–569.
  31. ^ L. Hubert et P. Arabie. Comparing partitions. J. of Classification, 2(1), 1985.
  32. ^ D. L. Wallace. Comment. Journal of the American Statistical Association, 78 :569– 579, 1983.
  33. ^ R. B. Zadeh, S Ben-David. "A Uniqueness Theorem for Clustering", in Proceedings of the Conference of Uncertainty in Artificial Intelligence, 2009.
  34. ^ J Kleinberg, "An Impossibility Theorem for Clustering", Proceedings of The Neural Information Processing Systems Conference 2002
  35. ^ Bewley A. et al. "Real-time volume estimation of a dragline payload". "IEEE International Conference on Robotics and Automation",2011: 1571-1576.
  36. ^ Basak S.C., Magnuson V.R., Niemi C.J., Regal R.R. "Determining Structural Similarity of Chemicals Using Graph Theoretic Indices". Discr. Appl. Math., 19, 1988: 17-44.
  37. ^ Huth R. et al. "Classifications of Atmospheric Circulation Patterns: Recent Advances and Applications". Ann. N.Y. Acad. Sci., 1146, 2008: 105-152

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Co_occurrence_networks b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Co_occurrence_networks new file mode 100644 index 00000000..445b59ac --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Co_occurrence_networks @@ -0,0 +1 @@ + Co-occurrence networks - Wikipedia, the free encyclopedia

Co-occurrence networks

From Wikipedia, the free encyclopedia
Jump to: navigation, search
A co-occurrence network created with KH Coder

Co-occurrence networks are generally used to provide a graphic visualization of potential relationships between people, organizations, concepts or other entities represented within written material. The generation and visualization of co-occurrence networks has become practical with the advent of electronically stored text amenable to text mining.

By way of definition, co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, a more stringent criterion for co-occurrence may require a pair of terms to appear in the same sentence.
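
A minimal sketch of this construction (plain Python, hypothetical toy documents and term list; the unit of text here is a whole document, matching the article-level criterion above):

    from itertools import combinations

    def cooccurrence_network(documents, terms):
        """Build an undirected co-occurrence network as a dict mapping
        term pairs to counts. Two terms co-occur if both appear in the
        same document; a stricter rule could use sentences instead."""
        edges = {}
        for doc in documents:
            tokens = set(doc.lower().split())
            present = [t for t in terms if t in tokens]
            for a, b in combinations(sorted(present), 2):
                edges[(a, b)] = edges.get((a, b), 0) + 1
        return edges

    # Toy corpus: A and B co-occur, B and C co-occur, so the network
    # links A-B and B-C but not A-C.
    docs = ["term_a appears with term_b here",
            "term_b appears with term_c here"]
    print(cooccurrence_network(docs, ["term_a", "term_b", "term_c"]))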

Contents

Methods and development[edit]

Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.

Individual terms are, within the context of text mining, symbolically represented as text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through NLP (natural language processing) algorithms that interrogate segments of text for possible alternatives such as word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an article).

Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplifications of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or more subtle criteria such as “probability” of co-occurrence or the presence of an intervening descriptive term.

Quantitative aspects of the underlying structure of a co-occurrence network might also be informative, such as the overall number of connections between entities, clustering of entities representing sub-domains, detecting synonyms,[1] etc.

Applications and use[edit]

Some working applications of the co-occurrence approach are available to the public through the internet. PubGene is an example of an application that addresses the interests of the biomedical community by presenting networks based on the co-occurrence of genetics-related terms as these appear in MEDLINE records.[2][3] The website NameBase is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al.[4]).

Networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes (so called "open source intelligence" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism[5]).

See also[edit]

  • Takada H, Saito K, Yamada T, Kimura M: “Analysis of Growing Co-occurrence Networks”. SIG-KBS, Vol. 73, 2006, pp. 117–122 (in Japanese).
  • Liu, Chua T-S: “Building semantic perceptron net for topic spotting”. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001, pp. 378–385.

References[edit]

  1. ^ Cohen AM, Hersh WR, Dubay C, Spackman, K: “Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts” BMC Bioinformatics 2005, 6:103
  2. ^ Jenssen TK, Laegreid A, Komorowski J, Hovig E: "A literature network of human genes for high-throughput analysis of gene expression. " Nature Genetics, 2001 May; 28(1):21-8. PMID 11326270
  3. ^ Grivell L: “Mining the bibliome: searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information.” EMBO reports 2001 Mar;3(3):200-3: doi:10.1093/embo-reports/kvf059 PMID 11882534
  4. ^ Ozgur A, Cetin B, Bingol H: “Co-occurrence Network of Reuters News” (15 Dec 2007) http://arxiv.org/abs/0712.2491
  5. ^ Zhou Y, Reid E, Qin J, Chen H, Lai G: "US Domestic Extremist Groups on the Web: Link and Content Analysis" http://doi.ieeecomputersociety.org/10.1109/MIS.2005.96

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computational_complexity_theory b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computational_complexity_theory new file mode 100644 index 00000000..3370fac6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computational_complexity_theory @@ -0,0 +1 @@ + Computational complexity theory - Wikipedia, the free encyclopedia

Computational complexity theory

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Computational complexity theory is a branch of the theory of computation in theoretical computer science and mathematics that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps.

A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do.

Closely related fields in theoretical computer science are analysis of algorithms and computability theory. A key distinction between analysis of algorithms and computational complexity theory is that the former is devoted to analyzing the amount of resources needed by a particular algorithm to solve a problem, whereas the latter asks a more general question about all possible algorithms that could be used to solve the same problem. More precisely, it tries to classify problems that can or cannot be solved with appropriately restricted resources. In turn, imposing restrictions on the available resources is what distinguishes computational complexity from computability theory: the latter theory asks what kind of problems can, in principle, be solved algorithmically.

Contents

Computational problems[edit]

A traveling salesperson tour through Germany’s 15 largest cities.

Problem instances[edit]

A computational problem can be viewed as an infinite collection of instances together with a solution for every instance. The input string for a computational problem is referred to as a problem instance, and should not be confused with the problem itself. In computational complexity theory, a problem refers to the abstract question to be solved. In contrast, an instance of this problem is a rather concrete utterance, which can serve as the input for a decision problem. For example, consider the problem of primality testing. The instance is a number (e.g. 15) and the solution is "yes" if the number is prime and "no" otherwise (in this case "no"). Stated another way, the instance is a particular input to the problem, and the solution is the output corresponding to the given input.
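
As an illustration, a decision problem can be modelled as a function from an instance to a yes/no answer; a minimal sketch for primality testing (naive trial division, chosen for brevity rather than efficiency) answers the instance 15 with "no":

    def is_prime(n):
        """Decision problem PRIMES: answer "yes" iff the instance n is prime."""
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    print("yes" if is_prime(15) else "no")  # the instance 15 -> "no"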

To further highlight the difference between a problem and an instance, consider the following instance of the decision version of the traveling salesman problem: Is there a route of at most 2000 kilometres passing through all of Germany's 15 largest cities? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem, such as asking for a round trip through all sites in Milan whose total length is at most 10 km. For this reason, complexity theory addresses computational problems and not particular problem instances.

Representing problem instances[edit]

When considering computational problems, a problem instance is a string over an alphabet. Usually, the alphabet is taken to be the binary alphabet (i.e., the set {0,1}), and thus the strings are bitstrings. As in a real-world computer, mathematical objects other than bitstrings must be suitably encoded. For example, integers can be represented in binary notation, and graphs can be encoded directly via their adjacency matrices, or by encoding their adjacency lists in binary.
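
For example, a small undirected graph can be encoded as a bitstring by concatenating the rows of its adjacency matrix; a minimal sketch of one such (non-unique) encoding:

    def graph_to_bitstring(n, edges):
        """Encode an undirected graph on vertices 0..n-1 as the row-major
        bitstring of its n x n adjacency matrix."""
        adj = [[0] * n for _ in range(n)]
        for u, v in edges:
            adj[u][v] = adj[v][u] = 1
        return "".join(str(adj[i][j]) for i in range(n) for j in range(n))

    # The path 0-1-2 becomes the 9-bit string "010101010".
    print(graph_to_bitstring(3, [(0, 1), (1, 2)]))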

Even though some proofs of complexity-theoretic theorems regularly assume some concrete choice of input encoding, one tries to keep the discussion abstract enough to be independent of the choice of encoding. This can be achieved by ensuring that different representations can be transformed into each other efficiently.

Decision problems as formal languages[edit]

A decision problem has only two possible outputs, yes or no (or alternately 1 or 0) on any input.

Decision problems are one of the central objects of study in computational complexity theory. A decision problem is a special type of computational problem whose answer is either yes or no, or alternately either 1 or 0. A decision problem can be viewed as a formal language, where the members of the language are instances whose output is yes, and the non-members are those instances whose output is no. The objective is to decide, with the aid of an algorithm, whether a given input string is a member of the formal language under consideration. If the algorithm deciding this problem returns the answer yes, the algorithm is said to accept the input string, otherwise it is said to reject the input.

An example of a decision problem is the following. The input is an arbitrary graph. The problem consists in deciding whether the given graph is connected, or not. The formal language associated with this decision problem is then the set of all connected graphs—of course, to obtain a precise definition of this language, one has to decide how graphs are encoded as binary strings.
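
A minimal sketch of an algorithm deciding this language, taking the graph as an adjacency list and accepting exactly the connected graphs (breadth-first search; the binary-encoding step is left out for brevity):

    from collections import deque

    def accepts_connected(adjacency):
        """Accept iff the graph (dict: vertex -> list of neighbours) is connected."""
        if not adjacency:
            return True
        start = next(iter(adjacency))
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return len(seen) == len(adjacency)

    print(accepts_connected({0: [1], 1: [0, 2], 2: [1]}))  # True: accept
    print(accepts_connected({0: [], 1: []}))               # False: reject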

Function problems[edit]

A function problem is a computational problem where a single output (of a total function) is expected for every input, but the output is more complex than that of a decision problem, that is, it isn't just yes or no. Notable examples include the traveling salesman problem and the integer factorization problem.

It is tempting to think that the notion of function problems is much richer than the notion of decision problems. However, this is not really the case, since function problems can be recast as decision problems. For example, the multiplication of two integers can be expressed as the set of triples (a, b, c) such that the relation a × b = c holds. Deciding whether a given triple is a member of this set corresponds to solving the problem of multiplying two numbers. Similarly, finding the minimum value of a mathematical function f(x) is equivalent to a search on k for the problem of determining whether a feasible point exists for f(x) ≤ k.
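
A minimal sketch of both recastings: membership in the multiplication relation, and recovering the minimum of a toy integer-valued function by asking only threshold questions of the form "is there a feasible point with f(x) ≤ k?" (the function, domain and bounds below are illustrative assumptions):

    def in_multiplication_relation(a, b, c):
        """Decision version of multiplication: is (a, b, c) in {(a, b, c) : a*b = c}?"""
        return a * b == c

    def feasible(f, domain, k):
        """Decision question: does some x in the domain satisfy f(x) <= k?"""
        return any(f(x) <= k for x in domain)

    def min_via_decisions(f, domain, lo, hi):
        """Find min f over the domain using only the yes/no oracle above,
        by binary search on the threshold k (integer values in [lo, hi])."""
        while lo < hi:
            mid = (lo + hi) // 2
            if feasible(f, domain, mid):
                hi = mid
            else:
                lo = mid + 1
        return lo

    print(in_multiplication_relation(6, 7, 42))                          # True
    print(min_via_decisions(lambda x: (x - 3) ** 2, range(10), 0, 100))  # 0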

Measuring the size of an instance[edit]

To measure the difficulty of solving a computational problem, one may wish to see how much time the best algorithm requires to solve the problem. However, the running time may, in general, depend on the instance. In particular, larger instances will require more time to solve. Thus the time required to solve a problem (or the space required, or any measure of complexity) is calculated as a function of the size of the instance. This is usually taken to be the size of the input in bits. Complexity theory is interested in how algorithms scale with an increase in the input size. For instance, in the problem of finding whether a graph is connected, how much more time does it take to solve a problem for a graph with 2n vertices compared to the time taken for a graph with n vertices?

If the input size is n, the time taken can be expressed as a function of n. Since the time taken on different inputs of the same size can be different, the worst-case time complexity T(n) is defined to be the maximum time taken over all inputs of size n. If T(n) is a polynomial in n, then the algorithm is said to be a polynomial time algorithm. Cobham's thesis says that a problem can be solved with a feasible amount of resources if it admits a polynomial time algorithm.
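
As a small empirical illustration of "the maximum time taken over all inputs of size n", the sketch below counts the comparisons made by insertion sort (chosen here as a simple stand-in algorithm) on every permutation of n elements and reports the maximum, which grows as n(n-1)/2, i.e. quadratically:

    from itertools import permutations

    def insertion_sort_comparisons(seq):
        """Count element comparisons made by a plain insertion sort."""
        a, comparisons = list(seq), 0
        for i in range(1, len(a)):
            j = i
            while j > 0:
                comparisons += 1
                if a[j - 1] > a[j]:
                    a[j - 1], a[j] = a[j], a[j - 1]
                    j -= 1
                else:
                    break
        return comparisons

    def worst_case_T(n):
        """T(n): the maximum cost over all inputs of size n."""
        return max(insertion_sort_comparisons(p) for p in permutations(range(n)))

    print([worst_case_T(n) for n in range(2, 7)])  # [1, 3, 6, 10, 15] = n(n-1)/2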

Machine models and complexity measures[edit]

Turing machine[edit]

An artistic representation of a Turing machine

A Turing machine is a mathematical model of a general computing machine. It is a theoretical device that manipulates symbols contained on a strip of tape. Turing machines are not intended as a practical computing technology, but rather as a thought experiment representing a computing machine—anything from an advanced supercomputer to a mathematician with a pencil and paper. It is believed that if a problem can be solved by an algorithm, there exists a Turing machine that solves the problem. Indeed, this is the statement of the Church–Turing thesis. Furthermore, it is known that everything that can be computed on other models of computation known to us today, such as a RAM machine, Conway's Game of Life, cellular automata or any programming language can be computed on a Turing machine. Since Turing machines are easy to analyze mathematically, and are believed to be as powerful as any other model of computation, the Turing machine is the most commonly used model in complexity theory.

Many types of Turing machines are used to define complexity classes, such as deterministic Turing machines, probabilistic Turing machines, non-deterministic Turing machines, quantum Turing machines, symmetric Turing machines and alternating Turing machines. They are all equally powerful in principle, but when resources (such as time or space) are bounded, some of these may be more powerful than others.

A deterministic Turing machine is the most basic Turing machine, which uses a fixed set of rules to determine its future actions. A probabilistic Turing machine is a deterministic Turing machine with an extra supply of random bits. The ability to make probabilistic decisions often helps algorithms solve problems more efficiently. Algorithms that use random bits are called randomized algorithms. A non-deterministic Turing machine is a deterministic Turing machine with an added feature of non-determinism, which allows a Turing machine to have multiple possible future actions from a given state. One way to view non-determinism is that the Turing machine branches into many possible computational paths at each step, and if it solves the problem in any of these branches, it is said to have solved the problem. Clearly, this model is not meant to be a physically realizable model, it is just a theoretically interesting abstract machine that gives rise to particularly interesting complexity classes. For examples, see nondeterministic algorithm.

Other machine models[edit]

Many machine models different from the standard multi-tape Turing machines have been proposed in the literature, for example random access machines. Perhaps surprisingly, each of these models can be converted to another without providing any extra computational power. The time and memory consumption of these alternate models may vary.[1] What all these models have in common is that the machines operate deterministically.

However, some computational problems are easier to analyze in terms of more unusual resources. For example, a nondeterministic Turing machine is a computational model that is allowed to branch out to check many different possibilities at once. The nondeterministic Turing machine has very little to do with how we physically want to compute algorithms, but its branching exactly captures many of the mathematical models we want to analyze, so that nondeterministic time is a very important resource in analyzing computational problems.

Complexity measures[edit]

For a precise definition of what it means to solve a problem using a given amount of time and space, a computational model such as the deterministic Turing machine is used. The time required by a deterministic Turing machine M on input x is the total number of state transitions, or steps, the machine makes before it halts and outputs the answer ("yes" or "no"). A Turing machine M is said to operate within time f(n), if the time required by M on each input of length n is at most f(n). A decision problem A can be solved in time f(n) if there exists a Turing machine operating in time f(n) that solves the problem. Since complexity theory is interested in classifying problems based on their difficulty, one defines sets of problems based on some criteria. For instance, the set of problems solvable within time f(n) on a deterministic Turing machine is then denoted by DTIME(f(n)).
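
As an illustration of this step counting, the sketch below simulates a toy deterministic Turing machine (a hypothetical machine accepting binary strings that end in 0) and reports the number of state transitions made before halting:

    def run_tm(transitions, start, accept, reject, tape_input):
        """Simulate a deterministic single-tape Turing machine and return
        (answer, steps), where steps counts the state transitions made."""
        tape = dict(enumerate(tape_input))  # unwritten cells read as blank "_"
        state, head, steps = start, 0, 0
        while state not in (accept, reject):
            symbol = tape.get(head, "_")
            state, write, move = transitions[(state, symbol)]
            tape[head] = write
            head += 1 if move == "R" else -1
            steps += 1
        return ("yes" if state == accept else "no"), steps

    # Toy machine: scan to the right end of the input, step back one cell,
    # and accept iff the last symbol is 0.
    delta = {
        ("scan", "0"): ("scan", "0", "R"),
        ("scan", "1"): ("scan", "1", "R"),
        ("scan", "_"): ("check", "_", "L"),
        ("check", "0"): ("yes", "0", "R"),
        ("check", "1"): ("no", "1", "R"),
        ("check", "_"): ("no", "_", "R"),
    }
    print(run_tm(delta, "scan", "yes", "no", "1010"))  # ('yes', 6): about n + 2 steps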

Analogous definitions can be made for space requirements. Although time and space are the most well-known complexity resources, any complexity measure can be viewed as a computational resource. Complexity measures are very generally defined by the Blum complexity axioms. Other complexity measures used in complexity theory include communication complexity, circuit complexity, and decision tree complexity.

The complexity of an algorithm is often expressed using big O notation.

Best, worst and average case complexity[edit]

Visualization of the quicksort algorithm, which has average-case performance Θ(n log n).

The best, worst and average case complexity refer to three different ways of measuring the time complexity (or any other complexity measure) of different inputs of the same size. Since some inputs of size n may be faster to solve than others, we define the following complexities:

  • Best-case complexity: This is the complexity of solving the problem for the best input of size n.
  • Worst-case complexity: This is the complexity of solving the problem for the worst input of size n.
  • Average-case complexity: This is the complexity of solving the problem on an average. This complexity is only defined with respect to a probability distribution over the inputs. For instance, if all inputs of the same size are assumed to be equally likely to appear, the average case complexity can be defined with respect to the uniform distribution over all inputs of size n.

For example, consider the deterministic sorting algorithm quicksort. This solves the problem of sorting a list of integers that is given as the input. The worst case occurs when the input is already sorted or sorted in reverse order, and the algorithm takes time O(n^2) in this case. If we assume that all possible permutations of the input list are equally likely, the average time taken for sorting is O(n log n). The best case occurs when each pivot divides the list in half, also needing O(n log n) time.
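
A minimal sketch of the variant under discussion (first-element pivot, list-comprehension partitioning; one of several common formulations): an already-sorted input makes every partition maximally unbalanced, which is the O(n^2) worst case, while a balanced split at every level gives O(n log n).

    def quicksort(a):
        """Quicksort with the first element as pivot. Average case O(n log n);
        worst case O(n^2), e.g. on an already-sorted list, because every
        partition then puts all remaining elements on one side."""
        if len(a) <= 1:
            return a
        pivot, rest = a[0], a[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]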

Upper and lower bounds on the complexity of problems[edit]

To classify the computation time (or similar resources, such as space consumption), one is interested in proving upper and lower bounds on the minimum amount of time required by the most efficient algorithm solving a given problem. The complexity of an algorithm is usually taken to be its worst-case complexity, unless specified otherwise. Analyzing a particular algorithm falls under the field of analysis of algorithms. To show an upper bound T(n) on the time complexity of a problem, one needs to show only that there is a particular algorithm with running time at most T(n). However, proving lower bounds is much more difficult, since lower bounds make a statement about all possible algorithms that solve a given problem. The phrase "all possible algorithms" includes not just the algorithms known today, but any algorithm that might be discovered in the future. To show a lower bound of T(n) for a problem requires showing that no algorithm can have time complexity lower than T(n).

Upper and lower bounds are usually stated using the big O notation, which hides constant factors and smaller terms. This makes the bounds independent of the specific details of the computational model used. For instance, if T(n) = 7n^2 + 15n + 40, in big O notation one would write T(n) = O(n^2).
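
Concretely, T(n) = O(n^2) here because a single constant multiple of n^2 dominates T(n) for all sufficiently large n; a quick numerical check of one witness pair (c = 8 and n0 = 18, values chosen for illustration):

    T = lambda n: 7 * n ** 2 + 15 * n + 40
    assert all(T(n) <= 8 * n ** 2 for n in range(18, 10000))  # c = 8, n0 = 18
    print(T(17) <= 8 * 17 ** 2, T(18) <= 8 * 18 ** 2)         # False True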

Complexity classes[edit]

Defining complexity classes[edit]

A complexity class is a set of problems of related complexity. Simpler complexity classes are defined by the following factors: the type of computational problem (typically decision problems), the model of computation (such as the deterministic Turing machine), and the resource being bounded together with the bound (for example polynomial time or logarithmic space).

Of course, some complexity classes have complex definitions that do not fit into this framework. Thus, a typical complexity class has a definition like the following:

The set of decision problems solvable by a deterministic Turing machine within time f(n). (This complexity class is known as DTIME(f(n)).)

But bounding the computation time above by some concrete function f(n) often yields complexity classes that depend on the chosen machine model. For instance, the language {xx | x is any binary string} can be solved in linear time on a multi-tape Turing machine, but necessarily requires quadratic time in the model of single-tape Turing machines. If we allow polynomial variations in running time, the Cobham–Edmonds thesis states that "the time complexities in any two reasonable and general models of computation are polynomially related" (Goldreich 2008, Chapter 1.2). This forms the basis for the complexity class P, which is the set of decision problems solvable by a deterministic Turing machine within polynomial time. The corresponding set of function problems is FP.
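
The language {xx | x is any binary string} itself is easy to state in code: membership simply means the input has even length and its two halves agree. The sketch below is a problem statement rather than a single-tape Turing machine, so it says nothing about the quadratic lower bound mentioned above.

    def in_xx_language(w):
        """Decide membership in {xx : x a binary string}: even length and
        identical halves."""
        half = len(w) // 2
        return len(w) % 2 == 0 and w[:half] == w[half:]

    print(in_xx_language("01100110"), in_xx_language("0110"))  # True False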

Important complexity classes[edit]

A representation of the relation among complexity classes

Many important complexity classes can be defined by bounding the time or space used by the algorithm. Some important complexity classes of decision problems defined in this manner are the following:

Complexity class Model of computation Resource constraint
DTIME(f(n)) Deterministic Turing machine Time f(n)
P Deterministic Turing machine Time poly(n)
EXPTIME Deterministic Turing machine Time 2^poly(n)
NTIME(f(n)) Non-deterministic Turing machine Time f(n)
NP Non-deterministic Turing machine Time poly(n)
NEXPTIME Non-deterministic Turing machine Time 2^poly(n)
DSPACE(f(n)) Deterministic Turing machine Space f(n)
L Deterministic Turing machine Space O(log n)
PSPACE Deterministic Turing machine Space poly(n)
EXPSPACE Deterministic Turing machine Space 2^poly(n)
NSPACE(f(n)) Non-deterministic Turing machine Space f(n)
NL Non-deterministic Turing machine Space O(log n)
NPSPACE Non-deterministic Turing machine Space poly(n)
NEXPSPACE Non-deterministic Turing machine Space 2^poly(n)

It turns out that PSPACE = NPSPACE and EXPSPACE = NEXPSPACE by Savitch's theorem.

Other important complexity classes include BPP, ZPP and RP, which are defined using probabilistic Turing machines; AC and NC, which are defined using Boolean circuits; and BQP and QMA, which are defined using quantum Turing machines. #P is an important complexity class of counting problems (not decision problems). Classes like IP and AM are defined using interactive proof systems. ALL is the class of all decision problems.

Hierarchy theorems[edit]

For the complexity classes defined in this way, it is desirable to prove that relaxing the requirements on (say) computation time indeed defines a bigger set of problems. In particular, although DTIME(n) is contained in DTIME(n^2), it would be interesting to know if the inclusion is strict. For time and space requirements, the answer to such questions is given by the time and space hierarchy theorems respectively. They are called hierarchy theorems because they induce a proper hierarchy on the classes defined by constraining the respective resources. Thus there are pairs of complexity classes such that one is properly included in the other. Having deduced such proper set inclusions, we can proceed to make quantitative statements about how much additional time or space is needed in order to increase the number of problems that can be solved.

More precisely, the time hierarchy theorem states that

\operatorname{DTIME}\big(f(n)\big) \subsetneq \operatorname{DTIME}\big(f(n) \cdot \log^{2}(f(n))\big).

The space hierarchy theorem states that

\operatorname{DSPACE}\big(f(n)\big) \subsetneq \operatorname{DSPACE}\big(f(n) \cdot \log(f(n))\big).

The time and space hierarchy theorems form the basis for most separation results of complexity classes. For instance, the time hierarchy theorem tells us that P is strictly contained in EXPTIME, and the space hierarchy theorem tells us that L is strictly contained in PSPACE.

Reduction[edit]

Many complexity classes are defined using the concept of a reduction. A reduction is a transformation of one problem into another problem. It captures the informal notion of a problem being at least as difficult as another problem. For instance, if a problem X can be solved using an algorithm for Y, X is no more difficult than Y, and we say that X reduces to Y. There are many different types of reductions, based on the method of reduction, such as Cook reductions, Karp reductions and Levin reductions, and the bound on the complexity of reductions, such as polynomial-time reductions or log-space reductions.

The most commonly used reduction is a polynomial-time reduction. This means that the reduction process takes polynomial time. For example, the problem of squaring an integer can be reduced to the problem of multiplying two integers. This means an algorithm for multiplying two integers can be used to square an integer. Indeed, this can be done by giving the same input to both inputs of the multiplication algorithm. Thus we see that squaring is not more difficult than multiplication, since squaring can be reduced to multiplication.
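
A minimal sketch of this particular reduction, with a stand-in multiplication routine treated as a black box:

    def multiply(a, b):
        # Stand-in for any algorithm that multiplies two integers.
        return a * b

    def square(x):
        """Squaring reduces to multiplication: feed the same input to both arguments."""
        return multiply(x, x)

    print(square(12))  # 144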

This motivates the concept of a problem being hard for a complexity class. A problem X is hard for a class of problems C if every problem in C can be reduced to X. Thus no problem in C is harder than X, since an algorithm for X allows us to solve any problem in C. Of course, the notion of hard problems depends on the type of reduction being used. For complexity classes larger than P, polynomial-time reductions are commonly used. In particular, the set of problems that are hard for NP is the set of NP-hard problems.

If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest problem in C. (Since many problems could be equally hard, one might say that X is one of the hardest problems in C.) Thus the class of NP-complete problems contains the most difficult problems in NP, in the sense that they are the ones most likely not to be in P. Because the problem P = NP is not solved, being able to reduce a known NP-complete problem, Π2, to another problem, Π1, would indicate that there is no known polynomial-time solution for Π1. This is because a polynomial-time solution to Π1 would yield a polynomial-time solution to Π2. Similarly, because all NP problems can be reduced to the set, finding an NP-complete problem that can be solved in polynomial time would mean that P = NP.[2]

Important open problems[edit]

Diagram of complexity classes provided that P ≠ NP. The existence of problems in NP outside both P and NP-complete in this case was established by Ladner.[3]

P versus NP problem[edit]

The complexity class P is often seen as a mathematical abstraction modeling those computational tasks that admit an efficient algorithm. This hypothesis is called the Cobham–Edmonds thesis. The complexity class NP, on the other hand, contains many problems that people would like to solve efficiently, but for which no efficient algorithm is known, such as the Boolean satisfiability problem, the Hamiltonian path problem and the vertex cover problem. Since deterministic Turing machines are special nondeterministic Turing machines, it is easily observed that each problem in P is also member of the class NP.

The question of whether P equals NP is one of the most important open questions in theoretical computer science because of the wide implications of a solution.[2] If the answer is yes, many important problems can be shown to have more efficient solutions. These include various types of integer programming problems in operations research, many problems in logistics, protein structure prediction in biology,[4] and the ability to find formal proofs of pure mathematics theorems.[5] The P versus NP problem is one of the Millennium Prize Problems proposed by the Clay Mathematics Institute. There is a US$1,000,000 prize for resolving the problem.[6]

Problems in NP not known to be in P or NP-complete[edit]

It was shown by Ladner that if P ≠ NP then there exist problems in NP that are neither in P nor NP-complete.[3] Such problems are called NP-intermediate problems. The graph isomorphism problem, the discrete logarithm problem and the integer factorization problem are examples of problems believed to be NP-intermediate. They are some of the very few NP problems not known to be in P or to be NP-complete.

The graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic. An important unsolved problem in complexity theory is whether the graph isomorphism problem is in P, NP-complete, or NP-intermediate. The answer is not known, but it is believed that the problem is at least not NP-complete.[7] If graph isomorphism is NP-complete, the polynomial time hierarchy collapses to its second level.[8] Since it is widely believed that the polynomial hierarchy does not collapse to any finite level, it is believed that graph isomorphism is not NP-complete. The best algorithm for this problem, due to László Babai and Eugene Luks, has run time 2^O(√(n log n)) for graphs with n vertices.

The integer factorization problem is the computational problem of determining the prime factorization of a given integer. Phrased as a decision problem, it is the problem of deciding whether the input has a factor less than k. No efficient integer factorization algorithm is known, and this fact forms the basis of several modern cryptographic systems, such as the RSA algorithm. The integer factorization problem is in NP and in co-NP (and even in UP and co-UP[9]). If the problem is NP-complete, the polynomial time hierarchy will collapse to its first level (i.e., NP will equal co-NP). The best known algorithm for integer factorization is the general number field sieve, which takes time O(e^((64/9)^(1/3) (n log 2)^(1/3) (log(n log 2))^(2/3))) to factor an n-bit integer. However, the best known quantum algorithm for this problem, Shor's algorithm, does run in polynomial time. Unfortunately, this fact doesn't say much about where the problem lies with respect to non-quantum complexity classes.
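
A sketch of the decision version only (naive trial division, exponential in the bit length of n, so it illustrates the problem statement rather than any efficient algorithm):

    def has_factor_below(n, k):
        """Factoring, decision version: does n have a nontrivial factor d with 1 < d < k?"""
        return any(n % d == 0 for d in range(2, min(k, n)))

    print(has_factor_below(91, 10))  # True: 7 divides 91
    print(has_factor_below(97, 97))  # False: 97 is prime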

Separations between other complexity classes[edit]

Many known complexity classes are suspected to be unequal, but this has not been proved. For instance P ⊆ NP ⊆ PP ⊆ PSPACE, but it is possible that P = PSPACE. If P is not equal to NP, then P is not equal to PSPACE either. Since there are many known complexity classes between P and PSPACE, such as RP, BPP, PP, BQP, MA, PH, etc., it is possible that all these complexity classes collapse to one class. Proving that any of these classes are unequal would be a major breakthrough in complexity theory.

Along the same lines, co-NP is the class containing the complement problems (i.e. problems with the yes/no answers reversed) of NP problems. It is believed[10] that NP is not equal to co-NP; however, it has not yet been proven. It has been shown that if these two complexity classes are not equal then P is not equal to NP.

Similarly, it is not known if L (the set of all problems that can be solved in logarithmic space) is strictly contained in P or equal to P. Again, there are many complexity classes between the two, such as NL and NC, and it is not known if they are distinct or equal classes.

It is suspected that P and BPP are equal. However, it is currently open if BPP = NEXP.

Intractability[edit]

Problems that can be solved in theory (e.g., given infinite time), but which in practice take too long for their solutions to be useful, are known as intractable problems.[11] In complexity theory, problems that lack polynomial-time solutions are considered to be intractable for more than the smallest inputs. In fact, the Cobham–Edmonds thesis states that only those problems that can be solved in polynomial time can be feasibly computed on some computational device. Problems that are known to be intractable in this sense include those that are EXPTIME-hard. If NP is not the same as P, then the NP-complete problems are also intractable in this sense. To see why exponential-time algorithms might be unusable in practice, consider a program that makes 2^n operations before halting. For small n, say 100, and assuming for the sake of example that the computer does 10^12 operations each second, the program would run for about 4 × 10^10 years, which is the same order of magnitude as the age of the universe. Even with a much faster computer, the program would only be useful for very small instances and in that sense the intractability of a problem is somewhat independent of technological progress. Nevertheless, a polynomial time algorithm is not always practical. If its running time is, say, n^15, it is unreasonable to consider it efficient and it is still useless except on small instances.
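
The back-of-the-envelope arithmetic behind that figure, under the stated assumption of 10^12 operations per second:

    operations = 2 ** 100
    ops_per_second = 10 ** 12
    seconds_per_year = 60 * 60 * 24 * 365
    years = operations / (ops_per_second * seconds_per_year)
    print(f"{years:.1e} years")  # roughly 4e+10 years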

What intractability means in practice is open to debate. Saying that a problem is not in P does not imply that all large cases of the problem are hard or even that most of them are. For example, the decision problem in Presburger arithmetic has been shown not to be in P, yet algorithms have been written that solve the problem in reasonable times in most cases. Similarly, algorithms can solve the NP-complete knapsack problem over a wide range of sizes in less than quadratic time and SAT solvers routinely handle large instances of the NP-complete Boolean satisfiability problem.

Continuous complexity theory[edit]

Continuous complexity theory can refer to complexity theory of problems that involve continuous functions that are approximated by discretizations, as studied in numerical analysis. One approach to complexity theory of numerical analysis[12] is information based complexity.

Continuous complexity theory can also refer to complexity theory of the use of analog computation, which uses continuous dynamical systems and differential equations.[13] Control theory can be considered a form of computation and differential equations are used in the modelling of continuous-time and hybrid discrete-continuous-time systems.[14]

History[edit]

The analysis of algorithms predates the invention of computers: Gabriel Lamé gave a running time analysis of the Euclidean algorithm in 1844.

Before the actual research explicitly devoted to the complexity of algorithmic problems started off, numerous foundations were laid out by various researchers. Most influential among these was the definition of Turing machines by Alan Turing in 1936, which turned out to be a very robust and flexible notion of computer.

Fortnow & Homer (2003) date the beginning of systematic studies in computational complexity to the seminal paper "On the Computational Complexity of Algorithms" by Juris Hartmanis and Richard Stearns (1965), which laid out the definitions of time and space complexity and proved the hierarchy theorems. Also, in 1965 Edmonds defined a "good" algorithm as one with running time bounded by a polynomial of the input size.[15]

According to Fortnow & Homer (2003), earlier papers studying problems solvable by Turing machines with specific bounded resources include John Myhill's definition of linear bounded automata (Myhill 1960), Raymond Smullyan's study of rudimentary sets (1961), as well as Hisao Yamada's paper[16] on real-time computations (1962). Somewhat earlier, Boris Trakhtenbrot (1956), a pioneer in the field from the USSR, studied another specific complexity measure.[17] As he remembers:

However, [my] initial interest [in automata theory] was increasingly set aside in favor of computational complexity, an exciting fusion of combinatorial methods, inherited from switching theory, with the conceptual arsenal of the theory of algorithms. These ideas had occurred to me earlier in 1955 when I coined the term "signalizing function", which is nowadays commonly known as "complexity measure".
—Boris Trakhtenbrot, From Logic to Theoretical Computer Science – An Update. In: Pillars of Computer Science, LNCS 4800, Springer 2008.

In 1967, Manuel Blum developed an axiomatic complexity theory based on what are now known as the Blum axioms and proved an important result, the so-called speed-up theorem. The field really began to flourish in 1971 when the US researcher Stephen Cook and, working independently, Leonid Levin in the USSR, proved that there exist practically relevant problems that are NP-complete. In 1972, Richard Karp took this idea a leap forward with his landmark paper, "Reducibility Among Combinatorial Problems", in which he showed that 21 diverse combinatorial and graph theoretical problems, each infamous for its computational intractability, are NP-complete.[18]

Relationship between computability theory, complexity theory and formal language theory.


See also[edit]

Notes[edit]


References[edit]

  1. ^ See Arora & Barak 2009, Chapter 1: The computational model and why it doesn't matter
  2. ^ a b See Sipser 2006, Chapter 7: Time complexity
  3. ^ a b Ladner, Richard E. (1975), "On the structure of polynomial time reducibility" (PDF), Journal of the ACM (JACM) 22 (1): 151–171, doi:10.1145/321864.321877. 
  4. ^ Berger, Bonnie A.; Leighton, T (1998), "Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete", Journal of Computational Biology 5 (1): 27–40, doi:10.1089/cmb.1998.5.27, PMID 9541869. 
  5. ^ Cook, Stephen (April 2000), The P versus NP Problem, Clay Mathematics Institute, retrieved 2006-10-18. 
  6. ^ Jaffe, Arthur M. (2006), "The Millennium Grand Challenge in Mathematics", Notices of the AMS 53 (6), retrieved 2006-10-18. 
  7. ^ Arvind, Vikraman; Kurur, Piyush P. (2006), "Graph isomorphism is in SPP", Information and Computation 204 (5): 835–852, doi:10.1016/j.ic.2006.02.002. 
  8. ^ Uwe Schöning, "Graph isomorphism is in the low hierarchy", Proceedings of the 4th Annual Symposium on Theoretical Aspects of Computer Science, 1987, 114–124; also: Journal of Computer and System Sciences, vol. 37 (1988), 312–323
  9. ^ Lance Fortnow. Computational Complexity Blog: Complexity Class of the Week: Factoring. September 13, 2002. http://weblog.fortnow.com/2002/09/complexity-class-of-week-factoring.html
  10. ^ Boaz Barak's course on Computational Complexity Lecture 2
  11. ^ Hopcroft, J.E., Motwani, R. and Ullman, J.D. (2007) Introduction to Automata Theory, Languages, and Computation, Addison Wesley, Boston/San Francisco/New York (page 368)
  12. ^ Smale, Steve (1997). "Complexity Theory and Numerical Analysis". Acta Numerica (Cambridge Univ Press). CiteSeerX: 10.1.1.33.4678. 
  13. ^ A Survey on Continuous Time Computations, Olivier Bournez, Manuel Campagnolo, New Computational Paradigms. Changing Conceptions of What is Computable. (Cooper, S.B. and L{\"o}we, B. and Sorbi, A., Eds.). New York, Springer-Verlag, pages 383-423. 2008
  14. ^ Tomlin, Claire J.; Mitchell, Ian; Bayen, Alexandre M.; Oishi, Meeko (July 2003). "Computational Techniques for the Verification of Hybrid Systems". Proceedings of the IEEE 91 (7). CiteSeerX: 10.1.1.70.4296. 
  15. ^ Richard M. Karp, "Combinatorics, Complexity, and Randomness", 1985 Turing Award Lecture
  16. ^ Yamada, H. (1962). "Real-Time Computation and Recursive Functions Not Real-Time Computable". IEEE Transactions on Electronic Computers. EC-11 (6): 753–760. doi:10.1109/TEC.1962.5219459.  edit
  17. ^ Trakhtenbrot, B.A.: Signalizing functions and tabular operators. Uchionnye Zapiski Penzenskogo Pedinstituta (Transactions of the Penza Pedagogoical Institute) 4, 75–87 (1956) (in Russian)
  18. ^ Richard M. Karp (1972), "Reducibility Among Combinatorial Problems", in R. E. Miller and J. W. Thatcher (editors), Complexity of Computer Computations, New York: Plenum, pp. 85–103 

Textbooks[edit]

Surveys[edit]

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computer_science b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computer_science new file mode 100644 index 00000000..51caea9f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Computer_science @@ -0,0 +1 @@ + Computer science - Wikipedia, the free encyclopedia

Computer science

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Computer science or computing science (abbreviated CS or CompSci) is the scientific and practical approach to computation and its applications. A computer scientist specializes in the theory of computation and the design of computational systems.[1]

Its subfields can be divided into a variety of theoretical and practical disciplines. Some fields, such as computational complexity theory (which explores the fundamental properties of computational problems), are highly abstract, whilst fields such as computer graphics emphasize real-world visual applications. Still other fields focus on the challenges in implementing computation. For example, programming language theory considers various approaches to the description of computation, whilst the study of computer programming itself investigates various aspects of the use of programming languages and complex systems. Human-computer interaction considers the challenges in making computers and computations useful, usable, and universally accessible to humans.

[Image collage: a capital lambda, a plot of a quicksort algorithm, the Utah teapot (computer graphics), and a computer mouse (human-computer interaction).]
Computer science deals with the theoretical foundations of information and computation, together with practical techniques for the implementation and application of these foundations.

Contents

History[edit]

Charles Babbage is credited with inventing the first mechanical computer.
Ada Lovelace is credited with writing the first algorithm intended for processing on a computer.

The earliest foundations of what would become computer science predate the invention of the modern digital computer. Machines for calculating fixed numerical tasks such as the abacus have existed since antiquity but they only supported the human mind, aiding in computations as complex as multiplication and division.

Blaise Pascal designed and constructed the first working mechanical calculator, Pascal's calculator, in 1642. Two hundred years later, Thomas de Colmar launched the mechanical calculator industry[2] when he released his simplified arithmometer, which was the first calculating machine strong enough and reliable enough to be used daily in an office environment. Charles Babbage started the design of the first automatic mechanical calculator, his difference engine, in 1822, which eventually gave him the idea of the first programmable mechanical calculator, his Analytical Engine.[3] He started developing this machine in 1834 and "in less than two years he had sketched out many of the salient features of the modern computer. A crucial step was the adoption of a punched card system derived from the Jacquard loom"[4] making it infinitely programmable.[5] In 1843, during the translation of a French article on the analytical engine, Ada Lovelace wrote, in one of the many notes she included, an algorithm to compute the Bernoulli numbers, which is considered to be the first computer program.[6] Around 1885, Herman Hollerith invented the tabulator which used punched cards to process statistical information; eventually his company became part of IBM. In 1937, one hundred years after Babbage's impossible dream, Howard Aiken convinced IBM, which was making all kinds of punched card equipment and was also in the calculator business[7] to develop his giant programmable calculator, the ASCC/Harvard Mark I, based on Babbage's analytical engine, which itself used cards and a central computing unit. When the machine was finished, some hailed it as "Babbage's dream come true".[8]

During the 1940s, as new and more powerful computing machines were developed, the term computer came to refer to the machines rather than their human predecessors.[9] As it became clear that computers could be used for more than just mathematical calculations, the field of computer science broadened to study computation in general. Computer science began to be established as a distinct academic discipline in the 1950s and early 1960s.[10][11] The world's first computer science degree program, the Cambridge Diploma in Computer Science, began at the University of Cambridge Computer Laboratory in 1953. The first computer science degree program in the United States was formed at Purdue University in 1962.[12] Since practical computers became available, many applications of computing have become distinct areas of study in their own right.

Although many initially believed it was impossible that computers themselves could actually be a scientific field of study, in the late fifties it gradually became accepted among the greater academic population.[13] It is the now well-known IBM brand that formed part of the computer science revolution during this time. IBM (short for International Business Machines) released the IBM 704[14] and later the IBM 709[15] computers, which were widely used during the exploration period of such devices. "Still, working with the IBM [computer] was frustrating...if you had misplaced as much as one letter in one instruction, the program would crash, and you would have to start the whole process over again".[13] During the late 1950s, the computer science discipline was very much in its developmental stages, and such issues were commonplace.

Time has seen significant improvements in the usability and effectiveness of computing technology. Modern society has seen a significant shift in the users of computer technology, from usage only by experts and professionals, to a near-ubiquitous user base. Initially, computers were quite costly, and some degree of human aid was needed for efficient use - in part from professional computer operators. As computer adoption became more widespread and affordable, less human assistance was needed for common usage.

Major achievements[edit]

The German military used the Enigma machine (shown here) during World War II for communication they thought to be secret. The large-scale decryption of Enigma traffic at Bletchley Park was an important factor that contributed to Allied victory in WWII.[16]

Despite its short history as a formal academic discipline, computer science has made a number of fundamental contributions to science and society - in fact, along with electronics, it is a founding science of the current epoch of human history called the Information Age and a driver of the Information Revolution, seen as the third major leap in human technological progress after the Industrial Revolution (1750-1850 CE) and the Agricultural Revolution (8000-5000 BCE).

These contributions include:

Philosophy[edit]

A number of computer scientists have argued for the distinction of three separate paradigms in computer science. Peter Wegner argued that those paradigms are science, technology, and mathematics.[22] Peter Denning's working group argued that they are theory, abstraction (modeling), and design.[23] Amnon H. Eden described them as the "rationalist paradigm" (which treats computer science as a branch of mathematics, which is prevalent in theoretical computer science, and mainly employs deductive reasoning), the "technocratic paradigm" (which might be found in engineering approaches, most prominently in software engineering), and the "scientific paradigm" (which approaches computer-related artifacts from the empirical perspective of natural sciences, identifiable in some branches of artificial intelligence).[24]

Name of the field[edit]

The term "computer science" appears in a 1959 article in Communications of the ACM,[25] in which Louis Fein argues for the creation of a Graduate School in Computer Sciences analogous to the creation of Harvard Business School in 1921, justifying the name by arguing that, like management science, it is applied and interdisciplinary in nature, yet at the same time, has all the characteristics of an academic discipline.[26] His efforts, and those of others such as numerical analyst George Forsythe, were rewarded: universities went on to create such programs, starting with Purdue in 1962.[27] Despite its name, a significant amount of computer science does not involve the study of computers themselves. Because of this, several alternative names have been proposed.[28] Certain departments of major universities prefer the term computing science, to emphasize precisely that difference. Danish scientist Peter Naur suggested the term datalogy,[29] to reflect the fact that the scientific discipline revolves around data and data treatment, while not necessarily involving computers. The first scientific institution to use the term was the Department of Datalogy at the University of Copenhagen, founded in 1969, with Peter Naur being the first professor in datalogy. The term is used mainly in the Scandinavian countries. Also, in the early days of computing, a number of terms for the practitioners of the field of computing were suggested in the Communications of the ACMturingineer, turologist, flow-charts-man, applied meta-mathematician, and applied epistemologist.[30] Three months later in the same journal, comptologist was suggested, followed next year by hypologist.[31] The term computics has also been suggested.[32] In Europe, terms derived from contracted translations of the expression "automatic information" (e.g. "informazione automatica" in Italian) or "information and mathematics" are often used, e.g. informatique (French), Informatik (German), informatica (Italy), informática (Spain, Portugal) or informatika (Slavic languages) are also used and have also been adopted in the UK (as in the School of Informatics of the University of Edinburgh).[33]

A folkloric quotation, often attributed to—but almost certainly not first formulated by—Edsger Dijkstra, states that "computer science is no more about computers than astronomy is about telescopes."[note 1] The design and deployment of computers and computer systems is generally considered the province of disciplines other than computer science. For example, the study of computer hardware is usually considered part of computer engineering, while the study of commercial computer systems and their deployment is often called information technology or information systems. However, there has been much cross-fertilization of ideas between the various computer-related disciplines. Computer science research also often intersects other disciplines, such as philosophy, cognitive science, linguistics, mathematics, physics, statistics, and logic.

Computer science is considered by some to have a much closer relationship with mathematics than many scientific disciplines, with some observers saying that computing is a mathematical science.[10] Early computer science was strongly influenced by the work of mathematicians such as Kurt Gödel and Alan Turing, and there continues to be a useful interchange of ideas between the two fields in areas such as mathematical logic, category theory, domain theory, and algebra.

The relationship between computer science and software engineering is a contentious issue, which is further muddied by disputes over what the term "software engineering" means, and how computer science is defined.[34] David Parnas, taking a cue from the relationship between other engineering and science disciplines, has claimed that the principal focus of computer science is studying the properties of computation in general, while the principal focus of software engineering is the design of specific computations to achieve practical goals, making the two separate but complementary disciplines.[35]

The academic, political, and funding aspects of computer science tend to depend on whether a department was formed with a mathematical emphasis or with an engineering emphasis. Departments with a mathematics emphasis and a numerical orientation tend to align themselves with computational science. Both types of departments tend to make efforts to bridge the field educationally, if not across all research.

Areas of computer science[edit]

As a discipline, computer science spans a range of topics from theoretical studies of algorithms and the limits of computation to the practical issues of implementing computing systems in hardware and software.[36][37] CSAB, formerly called Computing Sciences Accreditation Board – which is made up of representatives of the Association for Computing Machinery (ACM), and the IEEE Computer Society (IEEE-CS)[38] – identifies four areas that it considers crucial to the discipline of computer science: theory of computation, algorithms and data structures, programming methodology and languages, and computer elements and architecture. In addition to these four areas, CSAB also identifies fields such as software engineering, artificial intelligence, computer networking and communication, database systems, parallel computation, distributed computation, computer-human interaction, computer graphics, operating systems, and numerical and symbolic computation as being important areas of computer science.[36]

Theoretical computer science[edit]

The broader field of theoretical computer science encompasses both the classical theory of computation and a wide range of other topics that focus on the more abstract, logical, and mathematical aspects of computing.

Theory of computation[edit]

According to Peter J. Denning, the fundamental question underlying computer science is, "What can be (efficiently) automated?"[10] The study of the theory of computation is focused on answering fundamental questions about what can be computed and what amount of resources are required to perform those computations. In an effort to answer the first question, computability theory examines which computational problems are solvable on various theoretical models of computation. The second question is addressed by computational complexity theory, which studies the time and space costs associated with different approaches to solving a multitude of computational problems.

The famous "P=NP?" problem, one of the Millennium Prize Problems,[39] is an open problem in the theory of computation.

Automata theory · Computability theory · Computational complexity theory · Cryptography · Quantum computing theory

Information and coding theory[edit]

Information theory concerns the quantification of information. It was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and reliably storing and communicating data. Coding theory is the study of the properties of codes (systems for converting information from one form to another) and their fitness for a specific application. Codes are used for data compression, cryptography, error detection and correction, and more recently also for network coding. Codes are studied for the purpose of designing efficient and reliable data transmission methods.
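
As an illustration of the quantities information theory studies, the short Python sketch below computes the Shannon entropy of a discrete distribution in bits; the distribution itself is made up for the example.

  import math

  def shannon_entropy(probabilities):
      """Shannon entropy H(X) = -sum p * log2(p), in bits."""
      return -sum(p * math.log2(p) for p in probabilities if p > 0)

  # A hypothetical four-symbol source; a uniform source over four symbols would give 2.0 bits.
  print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits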

Algorithms and data structures[edit]

Analysis of algorithms · Algorithms · Data structures · Computational geometry

Programming language theory[edit]

Programming language theory (PLT) is a branch of computer science that deals with the design, implementation, analysis, characterization, and classification of programming languages and their individual features. It falls within the discipline of computer science, both depending on and affecting mathematics, software engineering and linguistics. It is an active research area, with numerous dedicated academic journals.

Type theory · Compiler design · Programming languages

Formal methods[edit]

Formal methods are a particular kind of mathematically based technique for the specification, development and verification of software and hardware systems. The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design. They form an important theoretical underpinning for software engineering, especially where safety or security is involved. Formal methods are a useful adjunct to software testing since they help avoid errors and can also give a framework for testing. For industrial use, tool support is required. However, the high cost of using formal methods means that they are usually only used in the development of high-integrity and life-critical systems, where safety or security is of utmost importance. Formal methods are best described as the application of a fairly broad variety of theoretical computer science fundamentals, in particular logic calculi, formal languages, automata theory, and program semantics, but also type systems and algebraic data types to problems in software and hardware specification and verification.

Applied computer science[edit]

Artificial intelligence[edit]

This branch of computer science aims to synthesise goal-orientated processes such as problem-solving, decision-making, environmental adaptation, learning, and communication, which are found in humans and animals. From its origins in cybernetics and in the Dartmouth Conference (1956), artificial intelligence (AI) research has been necessarily cross-disciplinary, drawing on areas of expertise such as applied mathematics, symbolic logic, semiotics, electrical engineering, philosophy of mind, neurophysiology, and social intelligence. AI is associated in the popular mind with robotic development, but the main field of practical application has been as an embedded component in areas of software development which require computational understanding and modeling, such as finance and economics, data mining, and the physical sciences. The starting point in the late 1940s was Alan Turing's question "Can computers think?", and the question remains effectively unanswered, although the "Turing Test" is still used to assess computer output on the scale of human intelligence. But the automation of evaluative and predictive tasks has been increasingly successful as a substitute for human monitoring and intervention in domains of computer application involving complex real-world data.

Machine learning · Computer vision · Image processing · Pattern recognition · Cognitive science · Data mining · Evolutionary computation · Information retrieval · Knowledge representation · Natural language processing · Robotics · Medical image computing

Computer architecture and engineering[edit]

Computer architecture, or digital computer organization, is the conceptual design and fundamental operational structure of a computer system. It focuses largely on how the central processing unit operates internally and accesses addresses in memory. The field often involves disciplines of computer engineering and electrical engineering, selecting and interconnecting hardware components to create computers that meet functional, performance, and cost goals.

Digital logic · Microarchitecture · Multiprocessing · Operating systems · Computer networks · Databases · Information security · Ubiquitous computing · Systems architecture · Compiler design · Programming languages

Computer graphics and visualization[edit]

Computer graphics is the study of digital visual content, and involves the synthesis and manipulation of image data. The field is connected to many other areas of computer science, including computer vision, image processing, and computational geometry, and is heavily applied in the fields of special effects and video games.

Computer security and cryptography[edit]

Computer security is a branch of computer technology whose objective includes the protection of information from unauthorized access, disruption, or modification while maintaining the accessibility and usability of the system for its intended users. Cryptography is the practice and study of hiding (encrypting) and recovering (decrypting) information. Modern cryptography is closely tied to computer science, because the security of many encryption and decryption algorithms rests on their computational complexity.
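
The point about computational complexity can be illustrated with a toy sketch: modular exponentiation is cheap to compute, while undoing it without the private key is believed to require factoring the modulus. The RSA-style numbers below are tiny textbook values chosen purely for illustration and are not secure.

  # Toy RSA-style illustration; these parameters are far too small to be secure.
  p, q = 61, 53
  n = p * q              # 3233, the public modulus
  e = 17                 # public exponent
  d = 2753               # private exponent: (e * d) % ((p - 1) * (q - 1)) == 1

  message = 65
  ciphertext = pow(message, e, n)    # fast modular exponentiation
  recovered = pow(ciphertext, d, n)  # easy only if d (i.e. the factorization) is known
  assert recovered == message
  print(ciphertext, recovered)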

Computational science[edit]

Computational science (or scientific computing) is the field of study concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems. In practical use, it is typically the application of computer simulation and other forms of computation to problems in various scientific disciplines.
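
A minimal example of the kind of computation scientific computing automates: the Python sketch below approximates the solution of the ordinary differential equation dy/dt = -y with the forward Euler method; the step size and interval are arbitrary illustrative choices.

  def euler(f, y0, t0, t1, steps):
      """Forward Euler integration of dy/dt = f(t, y) on [t0, t1]."""
      h = (t1 - t0) / steps
      t, y = t0, y0
      for _ in range(steps):
          y += h * f(t, y)
          t += h
      return y

  # dy/dt = -y with y(0) = 1 has the exact solution exp(-1) ≈ 0.3679 at t = 1.
  approx = euler(lambda t, y: -y, 1.0, 0.0, 1.0, 1000)
  print(approx)  # roughly 0.3677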

Numerical analysis · Computational physics · Computational chemistry · Bioinformatics

Computer Networks[edit]

This branch of computer science is concerned with building and managing the networks that connect computers worldwide.

Concurrent, parallel and distributed systems[edit]

Concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. A number of mathematical models have been developed for general concurrent computation including Petri nets, process calculi and the Parallel Random Access Machine model. A distributed system extends the idea of concurrency onto multiple computers connected through a network. Computers within the same distributed system have their own private memory, and information is often exchanged amongst themselves to achieve a common goal.
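
A minimal sketch of shared-memory concurrency using Python's standard threading and queue modules: several worker threads pull tasks from a shared queue and push results back; the squaring task is a placeholder chosen for illustration.

  import threading
  import queue

  tasks = queue.Queue()
  results = queue.Queue()

  def worker():
      # Each worker repeatedly takes a task, computes on it, and reports the result.
      while True:
          item = tasks.get()
          if item is None:          # sentinel value: no more work for this worker
              tasks.task_done()
              break
          results.put(item * item)  # placeholder computation
          tasks.task_done()

  threads = [threading.Thread(target=worker) for _ in range(4)]
  for t in threads:
      t.start()
  for n in range(10):
      tasks.put(n)
  for _ in threads:
      tasks.put(None)               # one sentinel per worker
  for t in threads:
      t.join()
  print(sorted(results.get() for _ in range(10)))   # squares of 0..9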

Databases and information retrieval[edit]

A database is intended to organize, store, and retrieve large amounts of data easily. Digital databases are managed using database management systems to store, create, maintain, and search data, through database models and query languages.
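
A minimal sketch of these ideas using Python's built-in sqlite3 module, with SQL as the query language; the table and rows are invented for illustration.

  import sqlite3

  conn = sqlite3.connect(":memory:")   # throwaway in-memory database
  conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
  conn.executemany(
      "INSERT INTO articles (title, year) VALUES (?, ?)",
      [("Concept drift", 2013), ("Concept mining", 2012), ("Contrast set learning", 2011)],
  )
  # Declarative query: the database engine decides how to locate matching rows.
  for (title,) in conn.execute("SELECT title FROM articles WHERE year >= 2012 ORDER BY title"):
      print(title)
  conn.close()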

Health Informatics[edit]

Health Informatics in computer science deals with computational techniques for solving problems in health care.

Information science[edit]

Information retrieval · Knowledge representation · Natural language processing · Human–computer interaction

Software engineering[edit]

Software engineering is the study of designing, implementing, and modifying software in order to ensure it is of high quality, affordable, maintainable, and fast to build. It is a systematic approach to software design, involving the application of engineering practices to software. Software engineering deals with the organizing and analyzing of software: it is concerned not just with the creation or manufacture of new software, but also with its internal maintenance and arrangement. Both computer applications software engineers and computer systems software engineers are projected to be among the fastest growing occupations from 2008 to 2018.

Academia[edit]

Conferences[edit]

Conferences are important events for academic research in computer science. During these conferences, researchers from the public and private sectors present their recent work and meet. Proceedings of these conferences are an important part of the computer science literature.

Journals[edit]

Education[edit]

Some universities teach computer science as a theoretical study of computation and algorithmic reasoning. These programs often feature the theory of computation, analysis of algorithms, formal methods, concurrency theory, databases, computer graphics, and systems analysis, among others. They typically also teach computer programming, but treat it as a vessel for the support of other fields of computer science rather than a central focus of high-level study. The ACM/IEEE-CS Joint Curriculum Task Force "Computing Curriculum 2005" (and 2008 update) [40] gives a guideline for university curriculum.

Other colleges and universities, as well as secondary schools and vocational programs that teach computer science, emphasize the practice of advanced programming rather than the theory of algorithms and computation in their computer science curricula. Such curricula tend to focus on those skills that are important to workers entering the software industry. The process aspects of computer programming are often referred to as software engineering.

While computer science professions increasingly drive the U.S. economy, computer science education is absent in most American K-12 curricula. A report entitled "Running on Empty: The Failure to Teach K-12 Computer Science in the Digital Age" was released in October 2010 by the Association for Computing Machinery (ACM) and the Computer Science Teachers Association (CSTA), and revealed that only 14 states have adopted significant education standards for high school computer science. The report also found that only nine states count high school computer science courses as a core academic subject in their graduation requirements. In tandem with "Running on Empty", a new non-partisan advocacy coalition - Computing in the Core (CinC) - was founded to influence federal and state policy, such as the Computer Science Education Act, which calls for grants to states to develop plans for improving computer science education and supporting computer science teachers.

Within the United States a gender gap in computer science education has been observed as well. Research conducted by the WGBH Educational Foundation and the Association for Computing Machinery (ACM) revealed that more than twice as many high school boys considered computer science to be a “very good” or “good” college major than high school girls.[41] In addition, the high school Advanced Placement (AP) exam for computer science has displayed a disparity in gender. Compared to other AP subjects it has the lowest number of female participants, with a composition of about 15 percent women.[42] This gender gap in computer science is further witnessed at the college level, where 31 percent of undergraduate computer science degrees are earned by women and only 8 percent of computer science faculty consists of women.[43] According to an article published by the Epistemic Games Group in August 2012, the number of women graduates in the computer science field has declined to 13 percent.

See also[edit]

Notes[edit]

  1. ^ See the entry "Computer science" on Wikiquote for the history of this quotation.

References[edit]

  1. ^ "WordNet Search - 3.1". Wordnetweb.princeton.edu. Retrieved 2012-05-14. 
  2. ^ In 1851
  3. ^ "Science Museum - Introduction to Babbage". Archived from the original on 2006-09-08. Retrieved 2006-09-24. 
  4. ^ Anthony Hyman, Charles Babbage, pioneer of the computer, 1982
  5. ^ "The introduction of punched cards into the new engine was important not only as a more convenient form of control than the drums, or because programs could now be of unlimited extent, and could be stored and repeated without the danger of introducing errors in setting the machine by hand; it was important also because it served to crystallize Babbage's feeling that he had invented something really new, something much more than a sophisticated calculating machine." Bruce Collier, 1970
  6. ^ "A Selection and Adaptation From Ada's Notes found in "Ada, The Enchantress of Numbers," by Betty Alexandra Toole Ed.D. Strawberry Press, Mill Valley, CA". Retrieved 2006-05-04. 
  7. ^ "In this sense Aiken needed IBM, whose technology included the use of punched cards, the accumulation of numerical data, and the transfer of numerical data from one register to another", Bernard Cohen, p.44 (2000)
  8. ^ Brian Randell, p.187, 1975
  9. ^ The Association for Computing Machinery (ACM) was founded in 1947.
  10. ^ a b c Denning, P.J. (2000). "Computer Science: The Discipline" (PDF). Encyclopedia of Computer Science. Archived from the original on 2006-05-25. 
  11. ^ "Some EDSAC statistics". Cl.cam.ac.uk. Retrieved 2011-11-19. 
  12. ^ Computer science pioneer Samuel D. Conte dies at 85 July 1, 2002
  13. ^ a b Levy, Steven (1984). Hackers: Heroes of the Computer Revolution. Doubleday. ISBN 0-385-19195-2. 
  14. ^ http://www.computerhistory.org/revolution/computer-graphics-music-and-art/15/222/633
  15. ^ http://archive.computerhistory.org/resources/text/IBM/IBM.709.1957.102646304.pdf
  16. ^ a b David Kahn, The Codebreakers, 1967, ISBN 0-684-83130-9.
  17. ^ a b http://www.cis.cornell.edu/Dean/Presentations/Slides/bgu.pdf
  18. ^ Constable, R.L. (March 2000). Computer Science: Achievements and Challenges circa 2000 (PDF). 
  19. ^ Abelson, H.; G.J. Sussman with J. Sussman (1996). Structure and Interpretation of Computer Programs (2nd ed.). MIT Press. ISBN 0-262-01153-0. "The computer revolution is a revolution in the way we think and in the way we express what we think. The essence of this change is the emergence of what might best be called procedural epistemology — the study of the structure of knowledge from an imperative point of view, as opposed to the more declarative point of view taken by classical mathematical subjects." 
  20. ^ Black box traders are on the march The Telegraph, August 26, 2006
  21. ^ "The Impact of High Frequency Trading on an Electronic Market". Papers.ssrn.com. doi:10.2139/ssrn.1686004. Retrieved 2012-05-14. 
  22. ^ Wegner, P. (October 13–15, 1976). "Research paradigms in computer science". Proceedings of the 2nd international Conference on Software Engineering. San Francisco, California, United States: IEEE Computer Society Press, Los Alamitos, CA. 
  23. ^ Denning, P. J.; Comer, D. E.; Gries, D.; Mulder, M. C.; Tucker, A.; Turner, A. J.; Young, P. R. (Jan 1989). "Computing as a discipline". Communications of the ACM 32: 9–23. doi:10.1145/63238.63239. 
  24. ^ Eden, A. H. (2007). "Three Paradigms of Computer Science". Minds and Machines 17 (2): 135–167. doi:10.1007/s11023-007-9060-8. 
  25. ^ Louis Fein (1959). "The Role of the University in Computers, Data Processing, and Related Fields". Communications of the ACM 2 (9): 7–14. doi:10.1145/368424.368427. 
  26. ^ id., p. 11
  27. ^ Donald Knuth (1972). "George Forsythe and the Development of Computer Science". Comms. ACM.
  28. ^ Matti Tedre (2006). The Development of Computer Science: A Sociocultural Perspective, p.260
  29. ^ Peter Naur (1966). "The science of datalogy". Communications of the ACM 9 (7): 485. doi:10.1145/365719.366510. 
  30. ^ Communications of the ACM 1(4):p.6
  31. ^ Communications of the ACM 2(1):p.4
  32. ^ IEEE Computer 28(12):p.136
  33. ^ P. Mounier-Kuhn, L’Informatique en France, de la seconde guerre mondiale au Plan Calcul. L’émergence d’une science, Paris, PUPS, 2010, ch. 3 & 4.
  34. ^ M. Tedre (2011) Computing as a Science: A Survey of Competing Viewpoints, Minds and Machines 21(3), 361-387
  35. ^ Parnas, D. L. (1998). Annals of Software Engineering 6: 19–37. doi:10.1023/A:1018949113292, p. 19: "Rather than treat software engineering as a subfield of computer science, I treat it as an element of the set, Civil Engineering, Mechanical Engineering, Chemical Engineering, Electrical Engineering, [...]"
  36. ^ a b Computing Sciences Accreditation Board (28 May 1997). "Computer Science as a Profession". Archived from the original on 2008-06-17. Retrieved 2010-05-23. 
  37. ^ Committee on the Fundamentals of Computer Science: Challenges and Opportunities, National Research Council (2004). Computer Science: Reflections on the Field, Reflections from the Field. National Academies Press. ISBN 978-0-309-09301-9. 
  38. ^ "Csab, Inc". Csab.org. 2011-08-03. Retrieved 2011-11-19. 
  39. ^ Clay Mathematics Institute P=NP
  40. ^ "ACM Curricula Recommendations". Retrieved 2012-11-18. 
  41. ^ http://www.acm.org/membership/NIC.pdf
  42. ^ Gilbert, Alorie. "Newsmaker: Computer science's gender gap". CNET News. 
  43. ^ Dovzan, Nicole. "Examining the Gender Gap in Technology". University of Michigan. 

"Computer Software Engineer." U.S. Bureau of Labor Statistics. U.S. Bureau of Labor Statistics, n.d. Web. 05 Feb. 2013.

Further reading[edit]

Overview
  • Tucker, Allen B. (2004). Computer Science Handbook (2nd ed.). Chapman and Hall/CRC. ISBN 1-58488-360-X. 
    • "Within more than 70 chapters, every one new or significantly revised, one can find any kind of information and references about computer science one can imagine. [...] all in all, there is absolute nothing about Computer Science that can not be found in the 2.5 kilogram-encyclopaedia with its 110 survey articles [...]." (Christoph Meinel, Zentralblatt MATH)
  • van Leeuwen, Jan (1994). Handbook of Theoretical Computer Science. The MIT Press. ISBN 0-262-72020-5. 
    • "[...] this set is the most unique and possibly the most useful to the [theoretical computer science] community, in support both of teaching and research [...]. The books can be used by anyone wanting simply to gain an understanding of one of these areas, or by someone desiring to be in research in a topic, or by instructors wishing to find timely information on a subject they are teaching outside their major areas of expertise." (Rocky Ross, SIGACT News)
  • Ralston, Anthony; Reilly, Edwin D.; Hemmendinger, David (2000). Encyclopedia of Computer Science (4th ed.). Grove's Dictionaries. ISBN 1-56159-248-X. 
    • "Since 1976, this has been the definitive reference work on computer, computing, and computer science. [...] Alphabetically arranged and classified into broad subject areas, the entries cover hardware, computer systems, information and data, software, the mathematics of computing, theory of computation, methodologies, applications, and computing milieu. The editors have done a commendable job of blending historical perspective and practical reference information. The encyclopedia remains essential for most public and academic library reference collections." (Joe Accardin, Northeastern Illinois Univ., Chicago)
  • Edwin D. Reilly (2003). Milestones in Computer Science and Information Technology. Greenwood Publishing Group. ISBN 978-1-57356-521-9. 
Selected papers
    • "Covering a period from 1966 to 1993, its interest lies not only in the content of each of these papers — still timely today — but also in their being put together so that ideas expressed at different times complement each other nicely." (N. Bernard, Zentralblatt MATH)
Articles
Curriculum and classification

External links[edit]

Bibliography and academic search engines
Professional organizations
Misc


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_drift b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_drift new file mode 100644 index 00000000..01d0e23f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_drift @@ -0,0 +1 @@ + Concept drift - Wikipedia, the free encyclopedia

Concept drift

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.

The term concept refers to the quantity to be predicted. More generally, it can also refer to other phenomena of interest besides the target concept, such as an input, but, in the context of concept drift, the term commonly refers to the target variable.

Contents

Examples[edit]

In a fraud detection application the target concept may be a binary attribute FRAUDULENT with values "yes" or "no" that indicates whether a given transaction is fraudulent. Or, in a weather prediction application, there may be several target concepts such as TEMPERATURE, PRESSURE, and HUMIDITY.

The behavior of the customers in an online shop may change over time. For example, suppose weekly merchandise sales are to be predicted and a predictive model has been developed that initially works satisfactorily. The model may use inputs such as the amount of money spent on advertising, promotions being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time - this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. Perhaps there will be higher sales in the winter holiday season than during the summer, for example.

Possible remedies[edit]

To prevent deterioration in prediction accuracy because of concept drift, both active and passive solutions can be adopted. Active solutions rely on triggering mechanisms, e.g., change-detection tests (Basseville and Nikiforov 1993; Alippi and Roveri, 2007), to explicitly detect concept drift as a change in the statistics of the data-generating process. In stationary conditions, any fresh information made available can be integrated to improve the model. When concept drift is detected, however, the current model is no longer up-to-date and must be replaced with a new one to maintain prediction accuracy (Gama et al., 2004; Alippi et al., 2011). In passive solutions, by contrast, the model is continuously updated, e.g., by retraining it on the most recently observed samples (Widmer and Kubat, 1996) or by maintaining an ensemble of classifiers (Elwell and Polikar 2011).
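
A minimal sketch of a passive strategy, assuming labeled data arrive in batches and that scikit-learn is available (both assumptions are for illustration): the model is simply retrained on a sliding window of the most recent observations, so older and possibly outdated examples are forgotten.

  from collections import deque
  import numpy as np
  from sklearn.linear_model import SGDClassifier  # assumes scikit-learn is installed

  window = deque(maxlen=1000)   # keep only the most recent labeled examples

  def update_model(new_X, new_y):
      """Passive adaptation: retrain from scratch on the current sliding window."""
      window.extend(zip(new_X, new_y))
      X = np.array([x for x, _ in window])
      y = np.array([label for _, label in window])
      model = SGDClassifier()
      model.fit(X, y)
      return model

  # Hypothetical stream: the decision boundary shifts halfway through.
  rng = np.random.default_rng(0)
  for t in range(10):
      X_batch = rng.normal(size=(200, 2))
      threshold = 0.0 if t < 5 else 1.0        # the "concept" drifts at t = 5
      y_batch = (X_batch[:, 0] > threshold).astype(int)
      model = update_model(X_batch, y_batch)
      print(t, round(model.score(X_batch, y_batch), 2))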

Contextual information, when available, can be used to better explain the causes of concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. Providing information about the time of year is likely to slow the rate at which the model deteriorates, but it is unlikely to eliminate concept drift altogether. This is because actual shopping behavior does not follow any static, finite model: new factors may arise at any time that influence shopping behavior, and the influence of the known factors or their interactions may change.

Concept drift cannot be avoided for complex phenomena that are not governed by fixed laws of nature. All processes that arise from human activity, such as socioeconomic processes, and biological processes are likely to experience concept drift. Therefore, periodic retraining, also known as refreshing, of any model is necessary.

Software[edit]

  • RapidMiner (formerly YALE (Yet Another Learning Environment)): free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin (formerly: concept drift plugin))
  • EDDM (Early Drift Detection Method): free open-source implementation of drift detection methods in Weka (machine learning).
  • MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka (machine learning).

Datasets[edit]

Real[edit]

  • Airline, approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E.Ikonomovska. Reference: Data Expo 2009 Competition [1]. Access
  • Chess.com (online games) and Luxembourg (social survey) datasets compiled by I.Zliobaite. Access
  • ECUE spam 2 datasets each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. Access from S.J.Delany webpage
  • Elec2, electricity demand, 2 classes, 45312 instances. Reference: M.Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. Access from J.Gama webpage. Comment on applicability.
  • PAKDD'09 competition data represents the credit evaluation task. It is collected over a five-year period. Unfortunately, the true labels are released only for the first part of the data. Access
  • Sensor stream and Power supply stream datasets are available from X. Zhu's Stream Data Mining Repository. Access
  • Text mining, a collection of text mining datasets with concept drift, maintained by I.Katakis. Access
  • Gas Sensor Array Drift Dataset, a collection of 13910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations.[2]

Other[edit]

  • KDD'99 competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. Access

Synthetic[edit]

  • Sine, Line, Plane, Circle and Boolean Data Sets, L.L.Minku, A.P.White, X.Yao, The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift, IEEE Transactions on Knowledge and Data Engineering, vol.22, no.5, pp. 730–742, 2010. Access from L.Minku webpage.
  • SEA concepts, N.W.Street, Y.Kim, A streaming ensemble algorithm (SEA) for large-scale classification, KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001. Access from J.Gama webpage.
  • STAGGER, J.C.Schlimmer, R.H.Granger, Incremental Learning from Noisy Data, Mach. Learn., vol.1, no.3, 1986.

Data generation frameworks[edit]

  • L.L.Minku, A.P.White, X.Yao, The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift, IEEE Transactions on Knowledge and Data Engineering, vol.22, no.5, pp. 730–742, 2010. Download from L.Minku webpage.
  • Lindstrom P, SJ Delany & B MacNamee (2008) Autopilot: Simulating Changing Concepts in Real Data In: Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science, D Bridge, K Brown, B O'Sullivan & H Sorensen (eds.) p272-263 PDF
  • Narasimhamurthy A., L.I. Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007, 384-389 PDF Code

Projects[edit]

  • INFER: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010 - 2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
  • HaCDAIS: Handling Concept Drift in Adaptive Information Systems (2008-2012), Eindhoven University of Technology (the Netherlands)
  • KDUS: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
  • ADEPT: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
  • ALADDIN: autonomous learning agents for decentralised data and information networks (2005-2010)

Meetings[edit]

  • 2011
    • LEE 2011 Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
    • HaCDAIS 2011 The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
    • ICAIS 2011 Track on Incremental Learning
    • IJCNN 2011 Special Session on Concept Drift and Learning Dynamic Environments
    • CIDUE 2011 Symposium on Computational Intelligence in Dynamic and Uncertain Environments
  • 2010
    • HaCDAIS 2010 International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
    • ICMLA10 Special Session on Dynamic learning in non-stationary environments
    • SAC 2010 Data Streams Track at ACM Symposium on Applied Computing
    • SensorKDD 2010 International Workshop on Knowledge Discovery from Sensor Data
    • StreamKDD 2010 Novel Data Stream Pattern Mining Techniques
    • Concept Drift and Learning in Nonstationary Environments at IEEE World Congress on Computational Intelligence
    • MLMDS’2010 Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Design and Applications, ISDA’10

Mailing list[edit]

Announcements, discussions, job postings related to the topic of concept drift in data mining / machine learning. Posts are moderated.

To subscribe go to the group home page: http://groups.google.com/group/conceptdrift

Bibliographic references[edit]

Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys, and overviews are listed here:

Reviews[edit]

  • Zliobaite, I., Learning under Concept Drift: an Overview. Technical Report. 2009, Faculty of Mathematics and Informatics, Vilnius University: Vilnius, Lithuania. PDF
  • Jiang, J., A Literature Survey on Domain Adaptation of Statistical Classifiers. 2008. PDF
  • Kuncheva L.I. Classifier ensembles for detecting concept change in streaming data: Overview and perspectives, Proc. 2nd Workshop SUEMA 2008 (ECAI 2008), Patras, Greece, 2008, 5-10, PDF
  • Gaber, M, M., Zaslavsky, A., and Krishnaswamy, S., Mining Data Streams: A Review, in ACM SIGMOD Record, Vol. 34, No. 1, June 2005, ISSN: 0163-5808
  • Kuncheva L.I., Classifier ensembles for changing environments, Proceedings 5th International Workshop on Multiple Classifier Systems, MCS2004, Cagliari, Italy, in F. Roli, J. Kittler and T. Windeatt (Eds.), Lecture Notes in Computer Science, Vol 3077, 2004, 1-15, PDF.
  • Tsymbal, A., The problem of concept drift: Definitions and related work. Technical Report. 2004, Department of Computer Science, Trinity College: Dublin, Ireland. PDF

See also[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_mining new file mode 100644 index 00000000..23fa5d83 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Concept_mining @@ -0,0 +1 @@ + Concept mining - Wikipedia, the free encyclopedia

Concept mining

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining.[1] Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

Contents

Methods[edit]

Traditionally, the conversion of words to concepts has been performed using a thesaurus,[2] and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet.

The mappings of words to concepts[3] are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. Machine translation systems cannot easily infer context.

For the purposes of concept mining, however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.

There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques based on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.
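
A minimal sketch of the word-to-concept mapping step, assuming NLTK and its WordNet corpus are installed (run nltk.download("wordnet") once beforehand); it lists the candidate concepts (synsets) for an ambiguous word, which a disambiguation step would then narrow down using context.

  from nltk.corpus import wordnet as wn  # assumes NLTK with the WordNet corpus

  word = "bank"
  for synset in wn.synsets(word):
      # Each synset is one candidate concept the surface word might map to.
      print(synset.name(), "-", synset.definition())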

Applications[edit]

Detecting and indexing similar documents in large corpora[edit]

One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to produce simple tree-membership statistics that locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space, then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus.

Clustering documents by topic[edit]

Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.
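
A minimal sketch, assuming scikit-learn is available and that each document has already been mapped to a vector of concept-tree membership statistics (the vectors and concept subtrees below are invented): standard k-means then groups the documents by inferred topic.

  import numpy as np
  from sklearn.cluster import KMeans  # assumes scikit-learn is installed

  # Hypothetical concept-space vectors: one row per document,
  # one column per concept subtree (e.g. "finance", "sport", "biology").
  doc_vectors = np.array([
      [0.9, 0.1, 0.0],
      [0.8, 0.2, 0.1],
      [0.1, 0.9, 0.0],
      [0.0, 0.8, 0.2],
      [0.1, 0.0, 0.9],
  ])

  kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(doc_vectors)
  print(kmeans.labels_)   # cluster index assigned to each document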

References[edit]

  1. ^ Yuen-Hsien Tseng, Chun-Yen Chang, Shu-Nu Chang Rundgren, and Carl-Johan Rundgren, " Mining Concept Maps from News Stories for Measuring Civic Scientific Literacy in Media", Computers and Education, Vol. 55, No. 1, August 2010, pp. 165-177.
  2. ^ Yuen-Hsien Tseng, " Automatic Thesaurus Generation for Chinese Documents", Journal of the American Society for Information Science and Technology, Vol. 53, No. 13, Nov. 2002, pp. 1130-1138.
  3. ^ Yuen-Hsien Tseng, " Generic Title Labeling for Clustered Documents", Expert Systems With Applications, Vol. 37, No. 3, 15 March 2010, pp. 2247-2254 .

See also[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Information_and_Knowledge_Management b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Information_and_Knowledge_Management new file mode 100644 index 00000000..955aca00 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Information_and_Knowledge_Management @@ -0,0 +1 @@ + Conference on Information and Knowledge Management - Wikipedia, the free encyclopedia

Conference on Information and Knowledge Management

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The ACM Conference on Information and Knowledge Management (CIKM, pronounced /ˈsikəm/) is an annual computer science research conference dedicated to information and knowledge management. Since the first event in 1992, the conference has evolved into one of the major forums for research on database management, information retrieval, and knowledge management.[1][2] The conference is noted for its interdisciplinarity, as it brings together communities that otherwise often publish at separate venues. Recent editions have attracted well beyond 500 participants.[3] In addition to the main research program, the conference also features a number of workshops, tutorials, and industry presentations.[4]

For many years, the conference was held in the USA. Since 2005, venues in other countries have been selected as well. Locations include:[5]

See also[edit]

References[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Knowledge_Discovery_and_Data_Mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Knowledge_Discovery_and_Data_Mining new file mode 100644 index 00000000..8823b71c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Conference_on_Knowledge_Discovery_and_Data_Mining @@ -0,0 +1 @@ + SIGKDD - Wikipedia, the free encyclopedia

SIGKDD

From Wikipedia, the free encyclopedia
Jump to: navigation, search

SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. It became an official ACM SIG in 1998. The official web page of SIGKDD can be found at www.KDD.org. The current Chairman of SIGKDD (since 2009) is Usama M. Fayyad, Ph.D.

Contents

Conferences[edit]

SIGKDD has hosted an annual conference - the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) - since 1995. The KDD conference grew from KDD (Knowledge Discovery and Data Mining) workshops at AAAI conferences, which were started by Gregory Piatetsky-Shapiro in 1989, 1991, and 1993, and by Usama Fayyad in 1994.[1] Conference papers of each Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining are published through the ACM.[2]

KDD-2012 took place in Beijing, China [3] and KDD-2013 will take place in Chicago, United States, Aug 11-14, 2013.

KDD-Cup[edit]

SIGKDD sponsors the KDD Cup competition every year in conjunction with the annual conference. It is aimed at members of the industry and academia, particularly students, interested in KDD.

Awards[edit]

The group also annually recognizes members of the KDD community with its Innovation Award and Service Award. Additionally, KDD presents a Best Paper Award [4] to recognize the highest quality paper at each conference.

SIGKDD Explorations[edit]

SIGKDD has also published a biannual academic journal titled SIGKDD Explorations since June, 1999. Editors in Chief

Current Executive Committee[edit]

Chair

Treasurer

Directors

Former Chairpersons

  • Gregory Piatetsky-Shapiro[8] (2005-2008)
  • Won Kim (1998-2004)

Information Directors[edit]

References[edit]

External links[edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Contrast_set_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Contrast_set_learning new file mode 100644 index 00000000..f9e28380 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Contrast_set_learning @@ -0,0 +1 @@ + Contrast set learning - Wikipedia, the free encyclopedia

Contrast set learning

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Contrast set learning is a form of association rule learning that seeks to identify meaningful differences between separate groups by reverse-engineering the key predictors that identify for each particular group. For example, given a set of attributes for a pool of students (labeled by degree type), a contrast set learner would identify the contrasting features between students seeking bachelor's degrees and those working toward PhD degrees.

Contents

Overview[edit]

A common practice in data mining is classification: looking at the attributes of an object or situation and making a guess at what category the observed item belongs to. As new evidence is examined (typically by feeding a training set to a learning algorithm), these guesses are refined and improved. Contrast set learning works in the opposite direction. While classifiers read a collection of data and collect information that is used to place new data into a series of discrete categories, contrast set learning takes the category that an item belongs to and attempts to reverse engineer the statistical evidence that identifies an item as a member of a class. That is, contrast set learners seek rules associating attribute values with changes to the class distribution.[1] They seek to identify the key predictors that contrast one classification from another.

For example, an aerospace engineer might record data on test launches of a new rocket. Measurements would be taken at regular intervals throughout the launch, noting factors such as the trajectory of the rocket, operating temperatures, external pressures, and so on. If the rocket launch fails after a number of successful tests, the engineer could use contrast set learning to distinguish between the successful and failed tests. A contrast set learner will produce a set of association rules that, when applied, will indicate the key predictors of the failed tests versus the successful ones (the temperature was too high, the wind pressure was too high, etc.).

Contrast set learning is a form of association rule learning. Association rule learners typically offer rules linking attributes commonly occurring together in a training set (for instance, people who are enrolled in four-year programs and take a full course load tend to also live near campus). Instead of finding rules that describe the current situation, contrast set learners seek rules that differ meaningfully in their distribution across groups (and thus, can be used as predictors for those groups).[2] For example, a contrast set learner could ask, “What are the key identifiers of a person with a bachelor's degree or a person with a PhD, and how do people with PhD's and bachelor’s degrees differ?”

Standard classifier algorithms, such as C4.5, have no concept of class importance (that is, they do not know if a class is "good" or "bad"). Such learners cannot bias or filter their predictions towards certain desired classes. As the goal of contrast set learning is to discover meaningful differences between groups, it is useful to be able to target the learned rules towards certain classifications. Several contrast set learners, such as MINWAL[3] or the family of TAR algorithms,[4][5][6] assign weights to each class in order to focus the learned theories toward outcomes that are of interest to a particular audience. Thus, contrast set learning can be thought of as a form of weighted class learning.[7]

Example: Supermarket Purchases[edit]

The differences between standard classification, association rule learning, and contrast set learning can be illustrated with a simple supermarket metaphor. In the following small dataset, each row is a supermarket transaction and each "1" indicates that the item was purchased (a "0" indicates that the item was not purchased):

Hamburger  Potatoes  Foie Gras  Onions  Champagne  Purpose of Purchases
    1         1          0        1         0      Cookout
    1         1          0        1         0      Cookout
    0         0          1        0         1      Anniversary
    1         1          0        1         0      Cookout
    1         1          0        0         1      Frat Party

Given this data,

  • Association rule learning may discover that customers that buy onions and potatoes together are likely to also purchase hamburger meat.
  • Classification may discover that customers that bought onions, potatoes, and hamburger meats were purchasing items for a cookout.
  • Contrast set learning may discover that the major difference between customers shopping for a cookout and those shopping for an anniversary dinner are that customers acquiring items for a cookout purchase onions, potatoes, and hamburger meat (and do not purchase foie gras or champagne).
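
A minimal sketch of the contrast idea on the toy table above: for each item, compare its purchase frequency (support) within each group and report items whose support differs sharply between two groups. This shows only the support-difference idea, not a full contrast set miner; the threshold of 0.5 is an arbitrary illustrative choice.

  # Rows from the toy supermarket table: item flags plus the purpose of purchase.
  transactions = [
      ({"hamburger": 1, "potatoes": 1, "foie_gras": 0, "onions": 1, "champagne": 0}, "cookout"),
      ({"hamburger": 1, "potatoes": 1, "foie_gras": 0, "onions": 1, "champagne": 0}, "cookout"),
      ({"hamburger": 0, "potatoes": 0, "foie_gras": 1, "onions": 0, "champagne": 1}, "anniversary"),
      ({"hamburger": 1, "potatoes": 1, "foie_gras": 0, "onions": 1, "champagne": 0}, "cookout"),
      ({"hamburger": 1, "potatoes": 1, "foie_gras": 0, "onions": 0, "champagne": 1}, "frat party"),
  ]

  def support(item, group):
      """Fraction of the group's transactions that include the item."""
      rows = [items for items, g in transactions if g == group]
      return sum(r[item] for r in rows) / len(rows)

  for item in ["hamburger", "potatoes", "foie_gras", "onions", "champagne"]:
      diff = support(item, "cookout") - support(item, "anniversary")
      if abs(diff) >= 0.5:   # arbitrary threshold for "meaningfully different"
          print(f"{item}: cookout vs. anniversary support differs by {diff:+.2f}")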

Treatment Learning[edit]

Treatment learning is a form of weighted contrast-set learning that takes a single desirable group and contrasts it against the remaining undesirable groups (the level of desirability is represented by weighted classes).[4] The resulting "treatment" suggests a set of rules that, when applied, will lead to the desired outcome.

Treatment learning differs from standard contrast set learning through the following constraints:

  • Rather than seeking the differences between all groups, treatment learning specifies a particular group to focus on, applies a weight to this desired grouping, and lumps the remaining groups into one "undesired" category.
  • Treatment learning has a stated focus on minimal theories. In practice, treatments are limited to a maximum of four constraints (i.e., rather than stating all of the reasons that a rocket differs from a skateboard, a treatment learner will state one to four major differences that predict for rockets at a high level of statistical significance).

This focus on simplicity is an important goal for treatment learners. Treatment learning seeks the smallest change that has the greatest impact on the class distribution.[7]

Conceptually, treatment learners explore all possible subsets of the range of values for all attributes. Such a search is often infeasible in practice, so treatment learning often focuses instead on quickly pruning and ignoring attribute ranges that, when applied, lead to a class distribution where the desired class is in the minority.[6]

Example: Boston Housing Data[edit]

The following example demonstrates the output of the treatment learner TAR3 on a dataset of housing data from the city of Boston (a nontrivial public dataset with over 500 examples). In this dataset, a number of factors are collected for each house, and each house is classified according to its quality (low, medium-low, medium-high, and high). The desired class is set to "high," and all other classes are lumped together as undesirable.

The output of the treatment learner is as follows:

 Baseline class distribution: low: 29% medlow: 29% medhigh: 21% high: 21%  
 Suggested Treatment: [PTRATIO=[12.6..16), RM=[6.7..9.78)]  
 New class distribution: low: 0% medlow: 0% medhigh: 3% high: 97%  

With no applied treatments (rules), the desired class represents only 21% of the class distribution. However, if we filter the data set for houses with 6.7 to 9.78 rooms and a neighborhood parent-teacher ratio of 12.6 to 16, then 97% of the remaining examples fall into the desired class (high quality houses).
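
A minimal sketch of what "applying a treatment" means for this example, assuming each row is a dictionary with PTRATIO, RM, and a quality class; the rows below are invented stand-ins, not the actual Boston data. The treatment is simply a conjunction of range constraints used as a filter, after which the class distribution is recomputed.

  from collections import Counter

  # Hypothetical rows standing in for the Boston housing data.
  rows = [
      {"PTRATIO": 14.0, "RM": 7.2, "quality": "high"},
      {"PTRATIO": 15.5, "RM": 6.9, "quality": "high"},
      {"PTRATIO": 20.0, "RM": 5.8, "quality": "low"},
      {"PTRATIO": 13.0, "RM": 6.1, "quality": "medlow"},
      {"PTRATIO": 18.0, "RM": 6.8, "quality": "medhigh"},
  ]

  def apply_treatment(rows):
      """Keep rows with PTRATIO in [12.6, 16) and RM in [6.7, 9.78)."""
      return [r for r in rows
              if 12.6 <= r["PTRATIO"] < 16 and 6.7 <= r["RM"] < 9.78]

  def distribution(rows):
      counts = Counter(r["quality"] for r in rows)
      total = sum(counts.values())
      return {cls: round(n / total, 2) for cls, n in counts.items()}

  print("baseline:", distribution(rows))
  print("treated: ", distribution(apply_treatment(rows)))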

Algorithms[edit]

There are a number of algorithms that perform contrast set learning. The following subsections describe two examples.

STUCCO[edit]

The STUCCO contrast set learner[1][2] treats the task of learning from contrast sets as a tree search problem where the root node of the tree is an empty contrast set. Children are added by specializing the set with additional items picked through a canonical ordering of attributes (to avoid visiting the same nodes twice). Children are formed by appending terms that follow all existing terms in a given ordering. The formed tree is searched in a breadth-first manner. Given the nodes at each level, the dataset is scanned and the support is counted for each group. Each node is then examined to determine if it is significant and large, if it should be pruned, and if new children should be generated. After all significant contrast sets are located, a post-processor selects a subset to show to the user - the low order, simpler results are shown first, followed by the higher order results, which are "surprising and significantly different".[2]

The support calculation comes from testing a null hypothesis that the contrast set support is equal across all groups (i.e., that contrast set support is independent of group membership). The support count for each group is a frequency value that can be analyzed in a contingency table where each row represents the truth value of the contrast set and each column variable indicates the group membership frequency. If there is a difference in proportions between the contrast set frequencies and those of the null hypothesis, the algorithm must then determine if the differences in proportions represent a relation between variables or if it can be attributed to random causes. This can be determined through a chi-square test comparing the observed frequency count to the expected count.
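
A minimal sketch of that significance test, using SciPy's chi-square test of independence on a hypothetical 2x2 contingency table (contrast set true/false versus two groups); the counts are invented for illustration.

  from scipy.stats import chi2_contingency  # assumes SciPy is installed

  # Rows: contrast set holds / does not hold; columns: group A / group B (made-up counts).
  observed = [
      [40, 10],
      [60, 90],
  ]

  chi2, p_value, dof, expected = chi2_contingency(observed)
  print("chi2 =", round(chi2, 2), "p =", round(p_value, 4))
  if p_value < 0.05:
      print("Support appears to differ significantly between the groups.")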

Nodes are pruned from the tree when all specializations of the node can never lead to a significant and large contrast set. The decision to prune is based on:

  • The minimum deviation size: The maximum difference between the support of any two groups must be greater than a user-specified threshold.
  • Expected cell frequencies: The expected cell frequencies of a contingency table can only decrease as the contrast set is specialized. When these frequencies are too small, the validity of the chi-square test is violated.
  • χ² bounds: An upper bound is kept on the distribution of the statistic calculated when the null hypothesis is true. Nodes are pruned when it is no longer possible to meet this cutoff.
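
A compact way to read these criteria is as a sequence of checks applied to each node. The following sketch only illustrates the logic described above; the threshold values are placeholders, not STUCCO's defaults.

 # Sketch of the three pruning checks described above (thresholds are illustrative).
 def prune_node(group_supports, expected_cells, best_possible_chi2,
                min_deviation=0.05, min_expected=5.0, chi2_cutoff=3.84):
     # 1. Minimum deviation size: the largest support difference between any two
     #    groups must exceed a user-specified threshold.
     if max(group_supports) - min(group_supports) <= min_deviation:
         return True
     # 2. Expected cell frequencies: once these fall too low, the chi-square test
     #    is no longer valid, and specialization only lowers them further.
     if min(expected_cells) < min_expected:
         return True
     # 3. Chi-square bound: prune when even the best statistic attainable by any
     #    specialization cannot reach the significance cutoff.
     if best_possible_chi2 < chi2_cutoff:
         return True
     return False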

TAR3[edit]

The TAR3[5][8] weighted contrast set learner is based on two fundamental concepts - the lift and support of a rule set.

The lift of a set of rules is the change that imposing those rules makes to a set of examples (i.e., how the class distribution shifts in response to the imposition of a rule). TAR3 seeks the smallest set of rules which induces the biggest change in the weighted class score, i.e., the sum over classes of the weight attached to each class multiplied by the frequency at which that class occurs. The lift is calculated by dividing the score of the set on which the rules are imposed by the score of the baseline set (i.e., no rules applied). Note that by reversing the lift scoring function, the TAR3 learner can also select for the remaining classes and reject the target class.
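
The lift computation can be sketched directly from this description. In the sketch below the class weights and the two class distributions are illustrative (the distributions mirror the Boston housing example above); the weights are not TAR3's built-in values.

 # Sketch of lift as the weighted class score of the treated subset divided by the
 # score of the baseline set (weights and counts are illustrative).
 def class_score(examples, weights):
     # examples: list of class labels; weights: {class: weight}, higher = more desired
     n = len(examples)
     return sum(w * examples.count(cls) / n for cls, w in weights.items())

 def lift(treated, baseline, weights):
     return class_score(treated, weights) / class_score(baseline, weights)

 weights  = {"low": 1, "medlow": 2, "medhigh": 4, "high": 8}
 baseline = ["low"] * 29 + ["medlow"] * 29 + ["medhigh"] * 21 + ["high"] * 21
 treated  = ["medhigh"] * 3 + ["high"] * 97
 print(lift(treated, baseline, weights))  # > 1: the treatment improves the distribution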

It is problematic to rely on the lift of a rule set alone. Noisy, incorrect, or misleading data, if correlated with failing examples, may result in an overfitted rule set. Such an overfitted model may have a large lift score, but it does not accurately reflect the prevailing conditions within the dataset. To avoid overfitting, TAR3 utilizes a support threshold and rejects all rules that fall on the wrong side of this threshold. Given a target class, the support threshold is a user-supplied value (usually 0.2) which is compared to the ratio of the frequency of the target class when the rule set has been applied to the frequency of that class in the overall dataset. TAR3 rejects all sets of rules with support lower than this threshold.

By requiring both a high lift and a high support value, TAR3 not only returns ideal rule sets, but also favors smaller sets of rules. The fewer rules adopted, the more evidence that will exist supporting those rules.

The TAR3 algorithm only builds sets of rules from attribute value ranges with a high heuristic value. The algorithm determines which ranges to use by first determining the lift score of each attribute’s value ranges. These individual scores are then sorted and converted into a cumulative probability distribution. TAR3 randomly selects values from this distribution, meaning that low-scoring ranges are unlikely to be selected. To build a candidate rule set, several ranges are selected and combined. These candidate rule sets are then scored and sorted. If no improvement is seen after a user-defined number of rounds, the algorithm terminates and returns the top-scoring rule sets.
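
The biased sampling of ranges can be sketched with a simple weighted random choice, which draws from the same cumulative distribution implicitly. The ranges and scores below are invented for illustration; they are not TAR3 output.

 # Sketch of TAR3-style candidate generation: ranges are sampled in proportion to
 # their individual lift scores, so low-scoring ranges are rarely chosen
 # (the ranges and scores are invented).
 import random

 range_scores = {
     "RM=[6.7..9.78)":     3.2,
     "PTRATIO=[12.6..16)": 2.5,
     "AGE=[2.9..45)":      1.1,
     "TAX=[187..400)":     0.4,
 }

 def sample_candidate(scores, k=2):
     ranges  = list(scores)
     weights = [scores[r] for r in ranges]        # proportional to individual lift
     return set(random.choices(ranges, weights=weights, k=k))

 candidate_rule_set = sample_candidate(range_scores)
 # The candidate would then be scored (lift and support) and kept only if it
 # improves on the best rule sets seen so far.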

References[edit]

  1. ^ a b Stephen Bay and Michael Pazzani (2001). "Detecting group differences: Mining contrast sets". Data Mining and Knowledge Discovery 5 (3): 213–246. 
  2. ^ a b c Stephen Bay and Michael Pazzani (1999). "Detecting change in categorical data: mining contrast sets". KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. 
  3. ^ C.H. Cai, A.W.C. Fu, C.H. Cheng, and W.W. Kwong (1998). "Mining association rules with weighted items". Proceedings of International Database Engineering and Applications Symposium (IDEAS 98). 
  4. ^ a b Y. Hu (2003). Treatment learning: Implementation and application. 
  5. ^ a b K. Gundy-Burlet, J. Schumann, T. Barrett, and T. Menzies (2007). "Parametric analysis of ANTARES re-entry guidance algorithms using advanced test generation and data analysis". In 9th International Symposium on Artificial Intelligence, Robotics and Automation in Space. 
  6. ^ a b Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet (2010). "Automatically Finding the Control Variables for Complex System Behavior". Automated Software Engineering 17 (4). 
  7. ^ a b T. Menzies and Y. Hu (2003). "Data Mining for Very Busy People". IEEE Computer 36 (11): 22–29. 
  8. ^ J. Schumann, K. Gundy-Burlet, C. Pasareanu, T. Menzies, and A. Barrett (2009). "Software V&V support by parametric analysis of large software simulation systems". Proceedings of the 2009 IEEE Aerospace Conference. 

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cross_Industry_Standard_Process_for_Data_Mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cross_Industry_Standard_Process_for_Data_Mining new file mode 100644 index 00000000..acef9941 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Cross_Industry_Standard_Process_for_Data_Mining @@ -0,0 +1 @@ + Cross Industry Standard Process for Data Mining - Wikipedia, the free encyclopedia

Cross Industry Standard Process for Data Mining

From Wikipedia, the free encyclopedia

CRISP-DM stands for Cross Industry Standard Process for Data Mining.[1] It is a data mining process model that describes commonly used approaches that expert data miners use to tackle problems. Polls conducted in 2002, 2004, and 2007 show that it is the leading methodology used by data miners.[2][3][4] The only other data mining standard named in these polls was SEMMA; however, three to four times as many people reported using CRISP-DM. A review and critique of data mining process models in 2009 called CRISP-DM the "de facto standard for developing data mining and knowledge discovery projects."[5] Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek's 2006 review,[6] and Azevedo and Santos' 2008 comparison of CRISP-DM and SEMMA.[7]

Contents

Major phases[edit]

CRISP-DM breaks the process of data mining into six major phases.[8]

The sequence of the phases is not strict and moving back and forth between different phases is always required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions and subsequent data mining processes will benefit from the experiences of previous ones.

Process diagram showing the relationship between the different phases of CRISP-DM
  • Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
  • Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
  • Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
  • Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
  • Evaluation
At this stage in the project you have built a model (or models) that appear to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
  • Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.

History[edit]

CRISP-DM was conceived in 1996. In 1997 it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company.

This core consortium brought different experiences to the project: ISL was later acquired by and merged into SPSS Inc.; the computer giant NCR Corporation produced the Teradata data warehouse and its own data mining software; Daimler-Benz had a significant data mining team; and OHRA was just starting to explore the potential use of data mining.

The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999,[9] and published as a step-by-step data mining guide later that year.[10]

Between 2006 and 2008 a CRISP-DM 2.0 SIG was formed and there were discussions about updating the CRISP-DM process model.[5][11] The current status of these efforts is not known. However, the original crisp-dm.org website cited in the reviews,[6][7] and the CRISP-DM 2.0 SIG website[5][11] are both no longer active.

While many non-IBM data mining practitioners use CRISP-DM,[2][3][4][5] IBM is the primary corporation that currently embraces the CRISP-DM process model. It makes some of the old CRISP-DM documents available for download[10] and has incorporated CRISP-DM into its SPSS Modeler product.

References[edit]

  1. ^ Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13—22.
  2. ^ a b Gregory Piatetsky-Shapiro (2002); KDnuggets Methodology Poll
  3. ^ a b Gregory Piatetsky-Shapiro (2004); KDnuggets Methodology Poll
  4. ^ a b Gregory Piatetsky-Shapiro (2007); KDnuggets Methodology Poll
  5. ^ a b c d Óscar Marbán, Gonzalo Mariscal and Javier Segovia (2009); A Data Mining & Knowledge Discovery Process Model. In Data Mining and Knowledge Discovery in Real Life Applications, Book edited by: Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438-453, February 2009, I-Tech, Vienna, Austria.
  6. ^ a b Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1 - 24, Cambridge University Press, New York, NY, USA doi: 10.1017/S0269888906000737.
  7. ^ a b Azevedo, A. and Santos, M. F. (2008);KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182-185.
  8. ^ Harper, Gavin; Stephen D. Pickett (August 2006). "Methods for mining HTS data". Drug Discovery Today 11 (15–16): 694–699. doi:10.1016/j.drudis.2006.06.006. PMID 16846796. 
  9. ^ Pete Chapman (1999); The CRISP-DM User Guide.
  10. ^ a b Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth (2000); CRISP-DM 1.0 Step-by-step data mining guide.
  11. ^ a b Colin Shearer (2006); First CRISP-DM 2.0 Workshop Held

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data new file mode 100644 index 00000000..525ee0be --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data @@ -0,0 +1 @@ + Data - Wikipedia, the free encyclopedia

Data

From Wikipedia, the free encyclopedia

Data (/ˈdeɪtə/ DAY-tə, /ˈdætə/ DA-tə, or /ˈdɑːtə/ DAH-tə) are values of qualitative or quantitative variables, belonging to a set of items. Data in computing (or data processing) are represented in a structure, often tabular (represented by rows and columns), a tree (a set of nodes with parent-children relationships) or a graph structure (a set of interconnected nodes). Data are typically the results of measurements and can be visualised using graphs or images. Data as an abstract concept can be viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e., unprocessed data, refers to a collection of numbers and characters; it is a relative term, since data processing commonly occurs in stages, and the "processed data" from one stage may be considered the "raw data" of the next. Field data refers to raw data collected in an uncontrolled in situ environment. Experimental data refers to data generated within the context of a scientific investigation by observation and recording.

The word data is the plural of datum, neuter past participle of the Latin dare, "to give", hence "something given". In discussions of problems in geometry, mathematics, engineering, and so on, the terms givens and data are used interchangeably. Such usage is the origin of data as a concept in computer science or data processing: data are numbers, words, images, etc., accepted as they stand.

Though data is also increasingly used in the humanities (particularly in the growing digital humanities), it has been suggested that the highly interpretive nature of the humanities might be at odds with the ethos of data as "given". Peter Checkland introduced the term capta (from the Latin capere, “to take”) to distinguish between an immense number of possible data and the sub-set of them to which attention is oriented.[1] Johanna Drucker has argued that since the humanities affirm knowledge production as “situated, partial, and constitutive,” using data may introduce assumptions that are counterproductive, for example that phenomena are discrete or observer-independent.[2] The term capta, which emphasizes the act of observation as constitutive, is offered as an alternative to data for visual representations in the humanities.

Contents

Usage in English[edit]

In English, the word datum is still used in the general sense of "an item given". In cartography, geography, nuclear magnetic resonance and technical drawing it is often used to refer to a single specific reference datum from which distances to all other data are measured. Any measurement or result is a datum, but data point is more usual,[3] albeit tautological or, more generously, pleonastic. In one sense, datum is a count noun with the plural datums (see usage in datum article) that can be used with cardinal numbers (e.g. "80 datums"); data (originally a Latin plural) is not used like a normal count noun with cardinal numbers, but it can be used as a plural with plural determiners such as these and many, in addition to its use as a singular abstract mass noun with a verb in the singular form.[4] Even when a very small quantity of data is referenced (one number, for example) the phrase piece of data is often used, as opposed to datum. The debate over appropriate usage is ongoing.[5][6][7]

The IEEE Computer Society allows usage of data as either a mass noun or plural based on author preference.[8] Some professional organizations and style guides[9][dead link] require that authors treat data as a plural noun. For example, the Air Force Flight Test Center specifically states that the word data is always plural, never singular.[10]

Data is most often used as a singular mass noun in educated everyday usage.[11][12] Some major newspapers such as The New York Times use it either in the singular or plural. In the New York Times the phrases "the survey data are still being analyzed" and "the first year for which data is available" have appeared within one day.[13] The Wall Street Journal explicitly allows this in its style guide.[14] The Associated Press style guide classifies data as a collective noun that takes the singular when treated as a unit but the plural when referring to individual items ("The data is sound.", but "The data have been carefully collected.").[15]

In scientific writing data is often treated as a plural, as in These data do not support the conclusions, but it is also used as a singular mass entity like information, for instance in computing and related disciplines.[16] British usage now widely accepts treating data as singular in standard English,[17] including everyday newspaper usage[18] at least in non-scientific use.[19] UK scientific publishing still prefers treating it as a plural.[20] Some UK university style guides recommend using data for both singular and plural use[21] and some recommend treating it only as a singular in connection with computers.[22]

Meaning of data, information and knowledge[edit]

The terms data, information and knowledge are frequently used for overlapping concepts. The main difference is in the level of abstraction being considered. Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three.[23] Data on its own carries no meaning. For data to become information, it must be interpreted and take on a meaning. For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".

Information as a concept bears a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.

Beynon-Davies uses the concept of a sign to distinguish between data and information; data are symbols while information occurs when the symbols are used to refer to something.[24][25]

It is people and computers who collect data and impose patterns on it. These patterns are seen as information which can be used to enhance knowledge. These patterns can be interpreted as truth, and are authorized as aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken.[26]

Mechanical computing devices are classified according to the means by which they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a datum as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.
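
As a small illustration of the last point, the sketch below shows how a single letter can be reduced to a sequence of symbols from the binary alphabet; the example is purely illustrative.

 # Sketch: the letter "A" represented as a sequence of symbols from the binary
 # alphabet {0, 1}, via its ASCII/Unicode code point.
 letter = "A"
 code_point = ord(letter)          # 65
 bits = format(code_point, "08b")  # "01000001"
 print(letter, code_point, bits)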

Some special forms of data are distinguished. A computer program is a collection of data, which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog, which is a description of the contents of books.
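
The sketch below illustrates both points in a few lines: a string is ordinary data until it is interpreted as a program, and a small dictionary acts as metadata describing other data. The examples are invented for illustration.

 # A string is data, but it can also be interpreted as instructions...
 source = "print(sum(range(10)))"
 exec(compile(source, "<data>", "exec"))  # ...which prints 45 when executed

 # ...and metadata is data that describes other data (the fields are illustrative).
 metadata = {
     "title":   "Boston housing measurements",
     "format":  "CSV",
     "columns": ["PTRATIO", "RM", "quality"],
 }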

See also[edit]

References[edit]

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.

  1. ^ P. Checkland and S. Holwell (1998). Information, Systems, and Information Systems: Making Sense of the Field. Chichester, West Sussex: John Wiley & Sons. pp. 86–89. ISBN 0-471-95820-4. 
  2. ^ Johanna Drucker (2011). "Humanities Approaches to Graphical Display". 
  3. ^ Matt Dye (2001). "Writing Reports". University of Bristol. 
  4. ^ "data, datum". Merriam–Webster's Dictionary of English Usage. Springfield, Massachusetts: Merriam-Webster. 2002. pp. 317–318. ISBN 978-0-87779-132-4. 
  5. ^ "Data is a singular noun". 
  6. ^ "Grammarist: Data". 
  7. ^ "Dictionary.com Data". 
  8. ^ "IEEE Computer Society Style Guide, DEF". IEEE Computer Society. 
  9. ^ "WHO Style Guide". Geneva: World Health Organization. 2004. p. 43. 
  10. ^ The Author's Guide to Writing Air Force Flight Test Center Technical Reports. Air Force Flight Test Center. 
  11. ^ New Oxford Dictionary of English, 1999
  12. ^ "...in educated everyday usage as represented by the Guardian newspaper, it is nowadays most often used as a singular." http://www.lexically.net/TimJohns/Kibbitzer/revis006.htm
  13. ^ "When Serving the Lord, Ministers Are Often Found to Neglect Themselves". New York Times. 2009. "Investment Tax Cuts Help Mostly the Rich". New York Times. 2009. 
  14. ^ "Is Data Is, or Is Data Ain’t, a Plural?". Wall Street Journal. 2012. 
  15. ^ The Associated Press (June 2002). "collective nouns". In Norm Goldstein. The Associated Press Stylebook and Briefing on Media Law. Cambridge, Massachusetts: Perseus. p. 52. ISBN 0-7382-0740-3. 
  16. ^ R.W. Burchfield, ed. (1996). "data". Fowler's Modern English Usage (3rd ed.). Oxford: Clarendon Press. pp. 197–198. ISBN 0-19-869126-2. 
  17. ^ New Oxford Dictionary of English. 1999. 
  18. ^ Tim Johns (1997). "Data: singular or plural?". "...in educated everyday usage as represented by The Guardian newspaper, it is nowadays most often used as a singular." 
  19. ^ "Data". Compact Oxford Dictionary. 
  20. ^ "Data: singular or plural?". Blair Wisconsin International University. 
  21. ^ "Singular or plural". University of Nottingham Style Book. University of Nottingham. [dead link]
  22. ^ "Computers and computer systems". OpenLearn. [dead link]
  23. ^ Akash Mitra (2011). "Classifying data for successful modeling". 
  24. ^ P. Beynon-Davies (2002). Information Systems: An introduction to informatics in organisations. Basingstoke, UK: Palgrave Macmillan. ISBN 0-333-96390-3. 
  25. ^ P. Beynon-Davies (2009). Business information systems. Basingstoke, UK: Palgrave. ISBN 978-0-230-20368-6. 
  26. ^ Sharon Daniel. The Database: An Aesthetics of Dignity. 

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_Mining_and_Knowledge_Discovery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_Mining_and_Knowledge_Discovery new file mode 100644 index 00000000..cd0f4642 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_Mining_and_Knowledge_Discovery @@ -0,0 +1 @@ + Data Mining and Knowledge Discovery - Wikipedia, the free encyclopedia

Data Mining and Knowledge Discovery

From Wikipedia, the free encyclopedia
Data Mining and Knowledge Discovery  
Abbreviated title (ISO 4) Data Min. Knowl. Discov.
Discipline Computer science
Language English
Publication details
Publisher Springer Science+Business Media
Publication history 1997-present
Frequency Triannually
Impact factor (2011) 1.545
Indexing
ISSN 1384-5810 (print), 1573-756X (web)
LCCN sn98038132
CODEN DMKDFD
OCLC number 38037443
Links

Data Mining and Knowledge Discovery is a triannual peer-reviewed scientific journal focusing on data mining. It is published by Springer Science+Business Media. As of 2012, the editor-in-chief is Geoffrey I. Webb.

External links [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_analysis new file mode 100644 index 00000000..a3e3dade --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_analysis @@ -0,0 +1 @@ + Data analysis - Wikipedia, the free encyclopedia

Data analysis

From Wikipedia, the free encyclopedia

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

Contents

The process of data analysis[edit]

Data analysis is a process, within which several phases can be distinguished:[1]

Data cleaning[edit]

Data cleaning is an important procedure during which the data are inspected, and erroneous data are—if necessary, preferable, and possible—corrected. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective decisions are made. The guiding principle provided by Adèr (ref) is: during subsequent manipulations of the data, information should always be cumulatively retrievable. In other words, it should always be possible to undo any data set alterations. Therefore, it is important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax or a log.[2]
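
Adèr's retrievability principle can be illustrated in a few lines: the original values stay untouched under their original name, the corrected values get a new name, and every alteration is logged. The data and column names below are invented.

 # Sketch of cumulative retrievability during cleaning (assumes pandas; the data
 # and column names are invented).
 import pandas as pd

 df = pd.DataFrame({"age": [23, 41, -1, 35]})
 log = []

 df["age_clean"] = df["age"].where(df["age"] >= 0)  # original 'age' column is kept as-is
 log.append("age_clean: negative values in 'age' set to missing (row index 2)")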

Initial data analysis[edit]

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:[3]

Quality of data[edit]

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analyses: frequency counts, descriptive statistics (mean, standard deviation, median), and checks of normality (skewness, kurtosis, frequency histograms). In addition, variables are compared with coding schemes of variables external to the data set, and possibly corrected if the coding schemes are not comparable.

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.[4]
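
The checks named above map directly onto a few standard calls. The sketch below assumes pandas; the data is invented and includes an obvious outlier.

 # Sketch of the quality checks listed above (assumes pandas; the data is invented).
 import pandas as pd

 df = pd.DataFrame({"score": [1.2, 2.4, 2.5, 3.1, 2.9, 40.0],   # note the outlier
                    "group": ["a", "a", "b", "b", "b", "a"]})

 print(df["group"].value_counts())                  # frequency counts
 print(df["score"].describe())                      # mean, std, median (50%), ...
 print(df["score"].skew(), df["score"].kurtosis())  # normality indicators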

Quality of measurements[edit]

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:

  • Confirmatory factor analysis
  • Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α when an item is deleted from a scale (a small computational sketch follows the list).[5]
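
For reference, Cronbach's α for a scale with k items is k/(k-1) multiplied by (1 - sum of item variances / variance of the item total). The sketch below computes it with NumPy; the responses are invented.

 # Sketch of Cronbach's alpha (rows = respondents, columns = items of one scale;
 # the responses are invented for illustration).
 import numpy as np

 items = np.array([[3, 4, 3, 5],
                   [2, 2, 3, 2],
                   [4, 5, 4, 4],
                   [3, 3, 2, 3]], dtype=float)

 k = items.shape[1]
 item_variances = items.var(axis=0, ddof=1)
 total_variance = items.sum(axis=1).var(ddof=1)
 alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
 print(round(alpha, 3))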

Initial transformations[edit]

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[6]
Possible transformations of variables are listed below (a small sketch applying them follows the list):[7]

  • Square root transformation (if the distribution differs moderately from normal)
  • Log-transformation (if the distribution differs substantially from normal)
  • Inverse transformation (if the distribution differs severely from normal)
  • Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
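
The transformations above correspond to one-line NumPy operations. In the sketch below, x is an invented, strictly positive skewed variable; real data may need shifting before the log or inverse transforms.

 # Sketch of the transformations listed above (assumes NumPy; x is invented).
 import numpy as np

 x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 32.0])

 x_sqrt    = np.sqrt(x)                       # moderate departure from normality
 x_log     = np.log(x)                        # substantial departure
 x_inverse = 1.0 / x                          # severe departure
 x_binned  = np.digitize(x, bins=[1.0, 4.0])  # make categorical: codes 0, 1, 2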

Did the implementation of the study fulfill the intentions of the research design?[edit]

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.
If the study did not need and/or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:

  • Dropout (this should be identified during the initial data analysis phase)
  • Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
  • Treatment quality (using manipulation checks).[8]

Characteristics of data sample[edit]

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be assessed by looking at:

  • Basic statistics of important variables
  • Scatter plots
  • Correlations and associations
  • Cross-tabulations[9]

Final stage of the initial data analysis[edit]

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:

  • In the case of non-normals: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
  • In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
  • In the case of outliers: should one use robust analysis techniques?
  • In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
  • In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
  • In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?[10]

Analyses[edit]

Several analyses can be used during the initial data analysis phase:[11]

  • Univariate statistics (single variable)
  • Bivariate associations (correlations)
  • Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:[12]

  • Nominal and ordinal variables
    • Frequency counts (numbers and percentages)
    • Associations
      • Cross-tabulations
      • hierarchical loglinear analysis (restricted to a maximum of 8 variables)
      • loglinear analysis (to identify relevant/important variables and possible confounders)
    • Exact tests or bootstrapping (in case subgroups are small)
    • Computation of new variables
  • Continuous variables
    • Distribution
      • Statistics (M, SD, variance, skewness, kurtosis)
      • Stem-and-leaf displays
      • Box plots

Main data analysis[edit]

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.[13]

Exploratory and confirmatory approaches[edit]

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance of finding at least one of them to be significant, but this can be due to a Type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found through exploratory analysis in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type I error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[14]

Stability of results[edit]

It is important to obtain some indication about how generalizable the results are.[15] While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:

  • Cross-validation: By splitting the data in multiple parts, we can check whether analyses (like a fitted model) based on one part of the data generalize to another part of the data as well.
  • Sensitivity analysis: A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do this is with bootstrapping.

Statistical methods[edit]

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

Free software for data analysis[edit]

  • ROOT - C++ data analysis framework developed at CERN
  • PAW - FORTRAN/C data analysis framework developed at CERN
  • SCaVis - Java (multi-platform) data analysis framework developed at ANL
  • KNIME - the Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
  • Data Applied - an online data mining and data visualization solution.
  • R - a programming language and software environment for statistical computing and graphics.
  • DevInfo - a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
  • Zeptoscope Basic[16] - Interactive Java-based plotter developed at Nanomix.
  • Lavastorm Analytics Engine Public Edition- Free desktop edition for organizations.
  • GeNIe - discovery of causal relationships from data, learning and inference with Bayesian networks, industrial quality software developed at the Decision Systems Laboratory, University of Pittsburgh.
  • ANTz - C realtime 3D data visualization, hierarchical object trees that combine multiple topologies with millions of nodes.

Commercial software for data analysis[edit]

  • Holsys One - Tool for the analysis of complex systems (sensors network, industrial plant) based on a reinterpretation of the IF-THEN clause in the sense of the theory of holons.
  • Infobright - a high performance analytic database designed for analyzing large volumes of machine-generated data

Education[edit]

In education, most educators have access to a data system for the purpose of analyzing student data.[17] These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system and making key package/display and content decisions) to improve the accuracy of educators’ data analyses.[18]

Nuclear and particle physics[edit]

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. They are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad-hoc. The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory.

The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4, in order to predict the response of the detector to a given theoretical event, thus producing simulated events which are then compared to experimental data.

See also[edit]

References[edit]

  1. ^ Adèr, 2008, p. 334-335.
  2. ^ Adèr, 2008, p. 336-337.
  3. ^ Adèr, 2008, p. 337.
  4. ^ Adèr, 2008, p. 338-341.
  5. ^ Adèr, 2008, p. 341-342.
  6. ^ Adèr, 2008, p. 344.
  7. ^ Tabachnick & Fidell, 2007, p. 87-88.
  8. ^ Adèr, 2008, p. 344-345.
  9. ^ Adèr, 2008, p. 345.
  10. ^ Adèr, 2008, p. 345-346.
  11. ^ Adèr, 2008, p. 346-347.
  12. ^ Adèr, 2008, p. 349-353.
  13. ^ Adèr, 2008, p. 363.
  14. ^ Adèr, 2008, p. 361-362.
  15. ^ Adèr, 2008, p. 368-371.
  16. ^ Zeptoscope.synopsia.net
  17. ^ Aarons, D. (2009). Report finds states on course to build pupil-data systems. Education Week, 29(13), 6.
  18. ^ Rankin, J. (2013, March 28). How data Systems & reports can either fight or propagate the data analysis error epidemic, and how educator leaders can help. Presentation conducted from Technology Information Center for Administrative Leadership (TICAL) School Leadership Summit.
  • Adèr, H.J. (2008). Chapter 14: Phases and initial steps in data analysis. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 333–356). Huizen, the Netherlands: Johannes van Kessel Publishing.
  • Adèr, H.J. (2008). Chapter 15: The main analysis phase. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 333–356). Huizen, the Netherlands: Johannes van Kessel Publishing.
  • Tabachnick, B.G. & Fidell, L.S. (2007). Chapter 4: Cleaning up your act. Screening data prior to analysis. In B.G. Tabachnick & L.S. Fidell (Eds.), Using Multivariate Statistics, Fifth Edition (pp. 60–116). Boston: Pearson Education, Inc. / Allyn and Bacon.

Further reading[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_classification_business_intelligence_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_classification_business_intelligence_ new file mode 100644 index 00000000..4cdcfb65 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_classification_business_intelligence_ @@ -0,0 +1 @@ + Data classification (business intelligence) - Wikipedia, the free encyclopedia

Data classification (business intelligence)

From Wikipedia, the free encyclopedia

In business intelligence, data classification has close ties to data clustering, but where data clustering is descriptive, data classification is predictive.[1][2] In essence data classification consists of using variables with known values to predict the unknown or future values of other variables. It can be used in e.g. direct marketing, insurance fraud detection or medical diagnosis.[2]

The first step in doing a data classification is to cluster the data set used for category training, to create the desired number of categories. An algorithm, called the classifier, is then used on the categories, creating a descriptive model for each. These models can then be used to categorize new items in the created classification system.[1]
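
The two-step workflow (cluster to create categories, then fit a classifier on those categories) can be sketched as follows. The sketch assumes scikit-learn and NumPy; the data, the number of categories, and the choice of K-means plus a decision tree are illustrative, not prescribed by the cited sources.

 # Sketch of the cluster-then-classify workflow described above (assumes
 # scikit-learn; the data and algorithm choices are illustrative).
 import numpy as np
 from sklearn.cluster import KMeans
 from sklearn.tree import DecisionTreeClassifier

 rng = np.random.default_rng(0)
 customers = rng.normal(size=(200, 3))         # e.g. demographics, lifestyle, spending

 categories = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
 classifier = DecisionTreeClassifier().fit(customers, categories)

 new_customer = rng.normal(size=(1, 3))
 print(classifier.predict(new_customer))       # category assigned to a new item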

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:[1]

  • Predictive accuracy: How well does it predict the categories for new observations?
  • Speed: What is the computational cost of using the classifier?
  • Robustness: How well do the models created perform if data quality is low?
  • Scalability: Does the classifier function efficiently with large amounts of data?
  • Interpretability: Are the results understandable to users?

Typical examples of input for data classification could be variables such as demographics, lifestyle information, or economic behaviour.

Challenges for data classification[edit]

There are several challenges in working with data classification. One in particular is that anyone using categories for e.g. customers or clients needs to do the modeling in an iterative process. This ensures that changes in the characteristics of customer groups do not go unnoticed, which would otherwise make the existing categories outdated and obsolete without anyone noticing.

This could be of special importance to insurance or banking companies, where fraud detection is extremely relevant. New fraud patterns may go unnoticed if methods to monitor these changes, and to raise alerts when categories are changing, disappearing, or newly emerging, are not developed and implemented.

References[edit]

  1. ^ a b c Golfarelli, M. & Rizzi, S. (2009). Data Warehouse Design : Modern Principles and Methodologies. McGraw-Hill Osburn. ISBN 0-07-161039-1
  2. ^ a b Kimball, R. et al. (2008). The Data Warehouse Lifecycle Toolkit. (2. Ed.). Wiley. ISBN 0-471-25547-5

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_collection b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_collection new file mode 100644 index 00000000..4fdae58d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_collection @@ -0,0 +1 @@ + Data collection - Wikipedia, the free encyclopedia

Data collection

From Wikipedia, the free encyclopedia


Data collection usually takes place early on in an improvement project, and is often formalised through a data collection plan[1] which often contains the following activities.

  1. Pre-collection activity — agree on goals, target data, definitions, methods
  2. Collection — data collection
  3. Present findings — usually involves some form of sorting,[2] analysis and/or presentation.

Prior to any data collection, pre-collection activity is one of the most crucial steps in the process. It is often discovered too late that the value of interview information is discounted as a consequence of poor sampling of both questions and informants and poor elicitation techniques.[3] After pre-collection activity is fully completed, data collection in the field, whether by interviewing or other methods, can be carried out in a structured, systematic and scientific way.

A formal data collection process is necessary as it ensures that data gathered are both defined and accurate and that subsequent decisions based on arguments embodied in the findings are valid.[4] The process provides both a baseline from which to measure and in certain cases a target on what to improve.

Other main types of collection include census, sample survey, and administrative by-product, each with its respective advantages and disadvantages. A census refers to data collection about everyone or everything in a group or statistical population and has advantages, such as accuracy and detail, and disadvantages, such as cost and time. A sample survey is a data collection method that includes only part of the total population and has advantages, such as cost and time, and disadvantages, such as accuracy and detail. Administrative by-product data are collected as a byproduct of an organization's day-to-day operations and have advantages, such as accuracy, time and simplicity, and disadvantages, such as lack of flexibility and lack of control.[5]

See also[edit]

References[edit]

  1. ^ LeanYourCompany.com, Establishing a data collection plan
  2. ^ Sorting Data: collection and analysis By Anthony Peter Macmillan Coxon ISBN 0-8039-7237-7
  3. ^ Weller, S., Romney, A. (1988). Systematic Data Collection (Qualitative Research Methods Series 10). Thousand Oaks, California: SAGE Publications, ISBN 0-8039-3074-7
  4. ^ Data Collection and Analysis By Dr. Roger Sapsford, Victor Jupp ISBN 0-7619-5046-X
  5. ^ Weimer, J. (ed.) (1995). Research Techniques in Human Engineering. Englewood Cliffs, NJ: Prentice Hall ISBN 0-13-097072-7

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_dredging b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_dredging new file mode 100644 index 00000000..79a61ba9 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_dredging @@ -0,0 +1 @@ + Data dredging - Wikipedia, the free encyclopedia

Data dredging

From Wikipedia, the free encyclopedia

Data dredging (data fishing, data snooping, equation fitting) is the inappropriate (sometimes deliberately so) use of data mining to uncover misleading relationships in data. Data-snooping bias is a form of statistical bias that arises from this misuse of statistics. Any relationships found might appear valid within the test set but they would have no statistical significance in the wider population.

Data dredging and data-snooping bias can occur when researchers either do not form a hypothesis in advance or narrow the data used to reduce the probability of the sample refuting a specific hypothesis. Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, which both heavily use data mining.

The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance. When large numbers of tests are performed, some produce false results by chance alone: 5% of randomly chosen hypotheses turn out to be significant at the 5% level, 1% turn out to be significant at the 1% significance level, and so on. A comic example (http://imgs.xkcd.com/comics/significant.png) illustrates this multiple-comparisons hazard: after many colours of jelly beans are tested for a link to acne, one colour appears "significant" even though there is no overall effect of jelly beans on acne. Also, subgroups are sometimes explored without alerting the reader to the number of questions at issue, which can lead to misinformed conclusions.[1]

When enough hypotheses are tested, it is virtually certain that some falsely appear statistically significant, since every data set with any degree of randomness contains some spurious correlations. Researchers using data mining techniques can be easily misled by these apparently significant results, even though they are mere artifacts of random variation.
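
A short simulation makes the effect concrete: testing many hypotheses on pure noise still yields a predictable share of "significant" results. The sketch assumes NumPy and SciPy; the sample sizes and the number of tests are arbitrary.

 # Simulation of the effect described above: run many tests where the null
 # hypothesis is true and count how many come out "significant" at the 5% level.
 import numpy as np
 from scipy import stats

 rng = np.random.default_rng(42)
 n_tests, false_positives = 1000, 0

 for _ in range(n_tests):
     a = rng.normal(size=30)
     b = rng.normal(size=30)          # drawn from the same distribution as a
     _, p = stats.ttest_ind(a, b)
     if p < 0.05:
         false_positives += 1

 print(false_positives / n_tests)     # close to 0.05 despite there being no real effect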

Circumventing the traditional scientific approach by conducting an experiment without a hypothesis can lead to premature conclusions. Data mining can be misused to seek more information from a data set than it actually contains. Failure to adjust existing statistical models when applying them to new datasets can also result in apparent new patterns between different attributes that would otherwise not have shown up. Overfitting, oversearching, overestimation, and attribute selection errors are all actions that can lead to data dredging.

Contents

Types of problem[edit]

Drawing conclusions from data[edit]

The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, followed by carrying out a statistical significance test to see whether the results could be due to the effects of chance. (The last step is called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine if the patterns found are chance patterns. See testing hypotheses suggested by the data.

Here is a simple example. Throwing a coin five times, with a result of 2 heads and 3 tails, might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then throw the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious – significance tests do not protect against data dredging.

Hypothesis suggested by non-representative data[edit]

In a list of 367 people, at least two have the same day and month of birth. Suppose Mary and John both celebrate birthdays on August 7.

Data snooping would, by design, try to find additional similarities between Mary and John, such as:

  • Are they the youngest and the oldest persons in the list?
  • Have they met in person once? Twice? Three times?
  • Do their fathers have the same first name, or mothers have the same maiden name?

By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we can almost certainly find some similarity between them. Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their lives' histories. Our hypothesis, biased by data-snooping, can then become "People born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college.

However, when we turn to the larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the public as a whole. See also Reproducible research.

Bias[edit]

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalised the abacavir, since its patients were more high-risk so more of them had heart attacks.[1] This problem can be very severe, for example, in the observational study.[1][2]

Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[1] By selecting papers with a significant p-value, negative studies are selected against—which is the publication bias.

Multiple modelling[edit]

Another aspect of the conditioning of statistical tests by knowledge of the data can be seen in linear regression, a method frequently used in data analysis. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see Stepwise regression) and substantive considerations that lead the authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net, in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, it may also introduce bias and alter the mean squared error in estimation.[3][4]

Examples in meteorology and epidemiology[edit]

In meteorology, dataset A is often weather data up to the present, which ensures that, even subconsciously, subset B of the data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a p-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is more likely than not to get at least one null hypothesis with a p-value less than 0.01.

Remedies[edit]

Looking for patterns in data is legitimate. Applying a statistical test of significance (hypothesis testing) to the same data the pattern was learned from is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid.
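
The random partition itself is a one-liner; the discipline lies in touching subset B only once. The sketch below uses invented data and an invented hypothesis purely to show the mechanics.

 # Sketch of the out-of-sample remedy described above (the data and the
 # hypothesis are invented).
 import numpy as np

 rng = np.random.default_rng(7)
 data = rng.normal(size=500)

 indices = rng.permutation(len(data))
 half = len(data) // 2
 subset_a, subset_b = data[indices[:half]], data[indices[half:]]

 # Hypothesis formed by exploring subset A only (e.g. "the mean exceeds 0.1")...
 threshold = 0.1
 formed_on_a = subset_a.mean() > threshold
 # ...then tested exactly once on subset B, which was never inspected before.
 confirmed_on_b = subset_b.mean() > threshold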

Another remedy for data dredging is to record the number of all significance tests conducted during the study and divide the significance level required for each test by this number (the Bonferroni correction); however, this is a very conservative approach. Methods particularly useful in analysis of variance and in constructing simultaneous confidence bands for regressions involving basis functions are Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. The use of a false discovery rate is a more sophisticated approach that has become a popular method for controlling multiple hypothesis tests.
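
As a hedged illustration (assuming the statsmodels package is installed; the p-values are made up), both corrections mentioned above can be applied to a list of p-values as follows.

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    p_values = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])

    # Bonferroni: very conservative, controls the family-wise error rate.
    reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
    # Benjamini-Hochberg: controls the false discovery rate instead.
    reject_fdr, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

    print("Bonferroni rejects:", int(reject_bonf.sum()), "of", len(p_values))
    print("FDR (BH) rejects:  ", int(reject_fdr.sum()), "of", len(p_values))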

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.[4]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and the method used to examine the data. Thus, if someone says that a certain event has a probability of 20% ± 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by looking only at the data, without due regard to the method used to assess them.
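
For illustration only (this back-calculation is not part of the article), the quoted "20% ± 2%, 19 times out of 20" corresponds, under a normal approximation for a proportion, to a sample of roughly 1,500 observations:

    import math

    p_hat, margin, z = 0.20, 0.02, 1.96     # 1.96 ~ two-sided 95% normal quantile
    n = (z ** 2) * p_hat * (1 - p_hat) / margin ** 2
    print(f"implied sample size: about {math.ceil(n)} observations")   # ~1537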

See also[edit]

References[edit]

  1. ^ a b c d Young, S. S., Karr, A. (2011). "Deming, data and observational studies". Significance 8 (3). 
  2. ^ Smith, G. D., Shah, E. (2002). "Data dredging, bias, or confounding". BMJ 325. PMC 1124898. 
  3. ^ Selvin, H.C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician 20 (3): 20–23. 
  4. ^ a b Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. doi:10.1007/s10940-009-9077-7. 

Further reading[edit]

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_management b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_management new file mode 100644 index 00000000..85621da5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_management @@ -0,0 +1 @@ + Data management - Wikipedia, the free encyclopedia

Data management

From Wikipedia, the free encyclopedia

Data management comprises all the disciplines related to managing data as a valuable resource.

Contents

Overview[edit]

The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions which may not have direct technical contact with lower-level aspects of data management, such as relational database management.

Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."[1]

The concept of "Data Management" arose in the 1980s as technology moved from sequential processing (first cards, then tape) to random access processing. Since it was now technically possible to store a single fact in a single place and access that using random access disk, those suggesting that "Data Management" was more important than "Process Management" used arguments such as "a customer's home address is stored in 75 (or some other large number) places in our computer systems." During this period, random access processing was not competitively fast, so those suggesting "Process Management" was more important than "Data Management" used batch processing time as their primary argument. As applications moved more and more into real-time, interactive applications, it became obvious to most practitioners that both management processes were important. If the data was not well defined, the data would be mis-used in applications. If the process wasn't well defined, it was impossible to meet user needs.

Corporate Data Quality Management[edit]

Corporate Data Quality Management (CDQM) is, according to the European Foundation for Quality Management and the Competence Center Corporate Data Quality (CC CDQ, University of St. Gallen), the whole set of activities intended to improve corporate data quality (both reactive and preventive). The main premise of CDQM is the business relevance of high-quality corporate data. CDQM comprises the following activity areas:[2]

  • Strategy for Corporate Data Quality: As CDQM is affected by various business drivers and requires the involvement of multiple divisions in an organization, it must be considered a company-wide endeavor.
  • Corporate Data Quality Controlling: Effective CDQM requires compliance with standards, policies, and procedures. Compliance is monitored according to previously defined metrics and performance indicators and reported to stakeholders.
  • Corporate Data Quality Organization: CDQM requires clear roles and responsibilities for the use of corporate data. The CDQM organization defines tasks and privileges for decision making for CDQM.
  • Corporate Data Quality Processes and Methods: In order to handle corporate data properly and in a standardized way across the entire organization and to ensure corporate data quality, standard procedures and guidelines must be embedded in the company's daily processes.
  • Data Architecture for Corporate Data Quality: The data architecture consists of the data object model - which comprises the unambiguous definition and the conceptual model of corporate data - and the data storage and distribution architecture.
  • Applications for Corporate Data Quality: Software applications support the activities of Corporate Data Quality Management. Their use must be planned, monitored, managed and continuously improved.

Topics in Data Management[edit]

Topics in Data Management, grouped by the DAMA DMBOK Framework,[3] include:

  1. Data governance
  2. Data Architecture, Analysis and Design
  3. Database Management
  4. Data Security Management
  5. Data Quality Management
  6. Reference and Master Data Management
  7. Data Warehousing and Business Intelligence Management
  8. Document, Record and Content Management
  9. Meta Data Management
  10. Contact Data Management

Body of Knowledge[edit]

The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK Guide) was produced under the guidance of a new DAMA-DMBOK Editorial Board. The publication has been available since April 5, 2009.

Usage[edit]

In modern management usage, one can easily discern a trend away from the term 'data' in composite expressions towards the terms 'information' or even 'knowledge' when speaking in a non-technical context. Thus there exists not only data management, but also information management and knowledge management. This trend is misleading, as it obscures the fact that, on closer inspection, it is still traditional data that is being managed or processed. The distinction between data and derived values can be seen in the information ladder. While data can exist as such, 'information' and 'knowledge' are always in the "eye" (or rather the brain) of the beholder and can only be measured in relative units.

See also[edit]

Notes[edit]

  1. ^ http://www.dama.org/files/public/DI_DAMA_DMBOK_Guide_Presentation_2007.pdf "DAMA-DMBOK Guide (Data Management Body of Knowledge) Introduction & Project Status"
  2. ^ EFQM ; IWI-HSG: EFQM Framework for Corporate Data Quality Management. Brussels : EFQM Press, 2011. - Forthcoming.
  3. ^ http://www.dama.org/i4a/pages/index.cfm?pageid=3364 "DAMA-DMBOK Functional Framework"

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_mining new file mode 100644 index 00000000..931c939f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_mining @@ -0,0 +1 @@ + Data mining - Wikipedia, the free encyclopedia

Data mining

From Wikipedia, the free encyclopedia

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The term is a buzzword,[5] and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) but is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence. In the proper use of the word, the key term is discovery[citation needed], commonly defined as "detecting something new". Even the popular book "Data mining: Practical machine learning tools and techniques with Java"[6] (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons.[7] Often the more general terms "(large scale) data analysis", or "analytics" – or when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. Data mining analyzes data that is typically stored in data warehouses. That data may come from all parts of a business, from production to management. Managers also use data mining to decide upon marketing strategies for their products, and can use the data to compare and contrast competitors. Data mining turns this data into near real-time analysis that can be used to increase sales, promote new products, or discontinue products that add no value to the company.

Contents

Etymology[edit]

In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "Data Mining" appeared around 1990 in the database community. For a short time in the 1980s, the phrase "database mining"™ was used, but since it was trademarked by HNC, a San Diego-based company (now merged into FICO), to pitch their Data Mining Workstation,[8] researchers consequently turned to "data mining". Other terms used include Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, etc. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the same topic (1989), and this term became more popular in the AI and machine learning community. However, the term data mining became more popular in the business and press communities.[9] Currently, the terms data mining and knowledge discovery are used interchangeably.

Background[edit]

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[10] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

Research and evolution[edit]

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[11] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[12]

Computer science conferences on data mining include:

Data mining topics are also present at many data management/database conferences such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.

Process[edit]

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.[1]

Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment

or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading methodology used by data miners.[13][14][15] The only other data mining standard named in these polls was SEMMA. However, 3-4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[16][17] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[18]

Pre-processing[edit]

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
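
A minimal pre-processing sketch, assuming pandas and using hypothetical column names, shows the kind of cleaning described above: observations with missing data are dropped and an obviously noisy value is removed before any mining step runs.

    import pandas as pd

    raw = pd.DataFrame({
        "age":    [23, 45, None, 31, 120],                 # 120 is treated as noise here
        "income": [30_000, 52_000, 41_000, None, 38_000],
    })

    target = raw.dropna()                                  # drop observations with missing data
    target = target[target["age"].between(0, 110)]         # drop the obviously noisy record
    print(target)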

Data mining[edit]

Data mining involves six common classes of tasks (a brief sketch of two of them follows this list):[1]

  • Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
  • Association rule learning (Dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
  • Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
  • Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
  • Regression – Attempts to find a function which models the data with the least error.
  • Summarization – providing a more compact representation of the data set, including visualization and report generation.
  • Sequential pattern mining – Sequential pattern mining finds sets of data items that occur together frequently in some sequences. Sequential pattern mining, which extracts frequent subsequences from a sequence database, has attracted a great deal of interest in recent data mining research because it is the basis of many applications, such as web user analysis, stock trend prediction, DNA sequence analysis, finding language or linguistic patterns in natural language texts, and using the history of symptoms to predict certain kinds of disease.
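
The following short sketch, referenced in the list introduction above, illustrates two of these task classes with scikit-learn (an assumption; the toy records are purely illustrative): clustering discovers groups without labels, while classification generalizes known labels to new data.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])   # four toy records

    # Clustering: discover groups without using known structure.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print("clusters:", clusters)

    # Classification: generalize known structure (labels) to new data.
    y = np.array([0, 0, 1, 1])
    model = DecisionTreeClassifier(random_state=0).fit(X, y)
    print("prediction for [7.0, 7.5]:", model.predict([[7.0, 7.5]]))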

Results validation[edit]

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
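
A hedged sketch of this validation step, assuming scikit-learn and using synthetic data in place of e-mails: the model is fitted on a training set, and accuracy and a ROC-based score are computed only on the held-out test set.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic two-class data stand in for "spam" vs. "legitimate" e-mails.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # training set only

    # Learned patterns are evaluated only on data the model has never seen.
    pred = model.predict(X_test)
    scores = model.predict_proba(X_test)[:, 1]
    print("test accuracy:", accuracy_score(y_test, pred))
    print("test ROC AUC: ", roc_auc_score(y_test, scores))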

If the learned patterns do not meet the desired standards, then it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

Standards[edit]

There have been some efforts to define standards for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006, but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[19]

Notable uses[edit]

Games[edit]

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex) with any beginning configuration, a new area for data mining has been opened: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully acquire the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases – combined with an intensive study of tablebase answers to well-designed problems, and with knowledge of prior art (i.e. pre-tablebase knowledge) – is used to yield insightful patterns. Berlekamp (in dots-and-boxes, etc.) and John Nunn (in chess endgames) are notable examples of researchers doing this work, though they were not – and are not – involved in tablebase generation.

Business[edit]

Data mining is the analysis of historical business activities, stored as static data in data warehouse databases, to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers with more accuracy.[20] Today, raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a centralized database, but would be useless without some type of data mining software to analyse it. If Walmart analyzed its point-of-sale data with data mining techniques, it would be able to determine sales trends, develop marketing campaigns, and more accurately predict customer loyalty.[21] Every time we use a credit card or a store loyalty card, or fill out a warranty card, data is collected about our purchasing behavior. Many people find the amount of information that companies such as Google, Facebook, and Amazon store about them disturbing, and are concerned about privacy. Although there is the potential for our personal data to be used in harmful or unwanted ways, it is also being used to make our lives better. For example, Ford and Audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions.[22]

Data mining in customer relationship management applications can contribute significantly to the bottom line.[citation needed] Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict to which channel and to which offer an individual is most likely to respond (across all potential offers). Additionally, sophisticated applications could be used to automate mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can either automatically send an e-mail or a regular mail. Finally, in cases where many people will take an action without an offer, "uplift modeling" can be used to determine which people have the greatest increase in response if given an offer. Uplift modeling thereby enables marketers to focus mailings and offers on persuadable people, and not to send offers to people who will buy the product without an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than using one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to loyal customers. Finally, the business may want to determine which customers are going to be profitable over a certain window in time, and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, it needs to manage model versions and move on to automated data mining.

Data mining can also be helpful to human resources (HR) departments in identifying the characteristics of their most successful employees. Information obtained – such as universities attended by highly successful employees – can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.[23]

Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of such relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.

Market basket analysis has also been used to identify the purchase patterns of the Alpha Consumer. Alpha Consumers are people that play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.[citation needed]

Data mining is a highly effective tool in the catalog marketing industry.[citation needed] Catalogers have a rich database of history of their customer transactions for millions of customers dating back a number of years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Data mining for business applications is a component that needs to be integrated into a complex modeling and decision making process. Reactive business intelligence (RBI) advocates a "holistic" approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.[24]

In the area of decision making, the RBI approach has been used to mine knowledge that is progressively acquired from the decision maker, and then self-tune the decision method accordingly.[25]

An example of data mining related to an integrated-circuit (IC) production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing."[26] In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. Experiments mentioned demonstrate the ability to apply a system of mining historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

Science and engineering[edit]

In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variations in human DNA sequence and the variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. The data mining method that is used to perform this task is known as multifactor dimensionality reduction.[27]

In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on, for example, the status of the insulation (or other important safety-related parameters). Data clustering techniques – such as the self-organizing map (SOM) – have been applied to vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.[28]

Data mining methods have also been applied to dissolved gas analysis (DGA) in power transformers. DGA, as a diagnostic for power transformers, has been available for many years. Methods such as SOM have been applied to analyze the generated data and to determine trends which are not obvious to the standard DGA ratio methods (such as the Duval Triangle).[28]

Another example of data mining in science and engineering is found in educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning,[29] and to understand factors influencing university student retention.[30] A similar example of social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of application of data mining methods are biomedical data facilitated by domain ontologies,[31] mining clinical trial data,[32] and traffic analysis using SOM.[33]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents.[34] Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.[35]

Data mining has also been applied to software artifacts within the realm of software engineering: see Mining Software Repositories.

Human rights[edit]

Data mining of government records – particularly records of the justice system (i.e. courts, prisons) – enables the discovery of systemic human rights violations in connection to generation and publication of invalid or fraudulent legal records by various government agencies.[36][37]

Medical data mining[edit]

In 2011, in the case of Sorrell v. IMS Health, Inc., the Supreme Court of the United States ruled that pharmacies may share information with outside companies, holding that the practice is protected by the First Amendment's guarantee of freedom of speech.[38]

Spatial data mining[edit]

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.

Data mining offers great potential benefits for GIS-based applied decision-making. Recently, the task of integrating these two technologies has become of critical importance, especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realize the huge potential of the information contained therein. Among those organizations are:

  • offices requiring analysis or dissemination of geo-referenced statistical data
  • public health services searching for explanations of disease clustering
  • environmental agencies assessing the impact of changing land-use patterns on climate change
  • geo-marketing companies doing customer segmentation based on spatial location.

Challenges in Spatial mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.[39] Related to this is the range and diversity of geographic data formats, which present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[40]

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[41] offer the following list of emerging research topics in the field:

  • Developing and supporting geographic data warehouses (GDW's): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability – including differences in semantics, referencing systems, geometry, accuracy, and position.
  • Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e. lines and polygons) and relationships (i.e. non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.
  • Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

Sensor data mining[edit]

Wireless sensor networks can be used to facilitate the collection of data for spatial data mining for a variety of applications such as air pollution monitoring.[42] A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy due to the spatial correlation between sensor observations inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be designed to perform spatial data mining more efficiently.[43]

Visual data mining[edit]

In the transition from analog to digital, large data sets have been generated, collected, and stored; discovering the statistical patterns, trends, and information hidden in these data makes it possible to build predictive models. Studies suggest visual data mining is faster and much more intuitive than traditional data mining.[44][45][46] See also Computer vision.

Music data mining[edit]

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for the purpose of classifying music into genres in a more objective manner.[47]

Surveillance[edit]

Data mining has been used by the U.S. government in programs intended to stop terrorism, including the Total Information Awareness (TIA) program, Secure Flight (formerly known as the Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE),[48] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[49] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[50]

In the context of combating terrorism, two particularly plausible methods of data mining are "pattern mining" and "subject-based data mining".

Pattern mining[edit]

"Pattern mining" is a data mining method that involves finding existing patterns in data. In this context patterns often means association rules. The original motivation for searching association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise."[51][52][53] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen in both the temporal and non-temporal domains are imported into classical knowledge discovery search methods.

Subject-based data mining[edit]

"Subject-based data mining" is a data mining method involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[52]

Knowledge grid[edit]

Knowledge discovery "On the Grid" generally refers to conducting knowledge discovery in an open environment using grid computing concepts, allowing users to integrate data from various online data sources, as well as make use of remote resources, for executing their data mining tasks. The earliest example was the Discovery Net,[54][55] developed at Imperial College London, which won the "Most Innovative Data-Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed knowledge discovery application for bioinformatics. Other examples include work conducted by researchers at the University of Calabria, who developed a Knowledge Grid architecture for distributed knowledge discovery, based on grid computing.[56][57]

Reliability / Validity[edit]

Data mining can be misused, and can also unintentionally produce results which appear significant but which do not actually predict future behavior and cannot be reproduced on a new sample of data. See Data dredging.

Privacy concerns and ethics[edit]

Some people believe that data mining itself is ethically neutral.[58] While the term "data mining" has no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise). To be precise, data mining is a statistical method that is applied to a set of information (i.e. a data set). Associating these data sets with people is an extreme narrowing of the types of data that are available. Examples could range from a set of crash test data for passenger vehicles, to the performance of a group of stocks. These types of data sets make up a great proportion of the information available to be acted on by data mining methods, and rarely have ethical concerns associated with them. However, the ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[59] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[60][61]

Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).[62] This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.

It is recommended that an individual is made aware of the following before data are collected:[62]

  • the purpose of the data collection and any (known) data mining projects
  • how the data will be used
  • who will be able to mine the data and use the data and their derivatives
  • the status of security surrounding access to the data
  • how collected data can be updated.

In America, privacy concerns have been addressed to some extent by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[63] This underscores the necessity for data anonymity in data aggregation and mining practices.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[62] However, even "de-identified"/"anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[64]

Software[edit]

Free open-source data mining software and applications[edit]

  • Carrot2: Text and search results clustering framework.
  • Chemicalize.org: A chemical structure miner and web search engine.
  • ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
  • GATE: a natural language processing and language engineering tool.
  • SCaViS: Java cross-platform data analysis framework developed at Argonne National Laboratory.
  • KNIME: The Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
  • ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
  • NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
  • SenticNet API: A semantic and affective resource for opinion mining and sentiment analysis.
  • Orange: A component-based data mining and machine learning software suite written in the Python language.
  • R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
  • RapidMiner: An environment for machine learning and data mining experiments.
  • UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM.
  • Weka: A suite of machine learning software applications written in the Java programming language.

Commercial data-mining software and applications[edit]

Marketplace surveys[edit]

Several researchers and organizations have conducted reviews of data mining tools and surveys of data miners. These identify some of the strengths and weaknesses of the software packages. They also provide an overview of the behaviors, preferences and views of data miners. Some of these reports include:

See also[edit]

Methods
Application domains
Application examples
Related topics

Data mining is about analyzing data; for information about extracting information out of data, see:

References[edit]

  1. ^ a b c Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases". Retrieved 17 December 2008. 
  2. ^ a b c d "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2011-10-28. 
  3. ^ Clifton, Christopher (2010). "Encyclopædia Britannica: Definition of Data Mining". Retrieved 2010-12-09. 
  4. ^ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Retrieved 2012-08-07. 
  5. ^ See e.g. OKAIRP 2005 Fall Conference, Arizona State University, About.com: Datamining
  6. ^ Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0. 
  7. ^ Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA Experiences with a Java open-source project". Journal of Machine Learning Research 11: 2533–2541. "the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons." 
  8. ^ Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4. 
  9. ^ Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012. 
  10. ^ Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336. 
  11. ^ Proceedings, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
  12. ^ SIGKDD Explorations, ACM, New York.
  13. ^ Gregory Piatetsky-Shapiro (2002) KDnuggets Methodology Poll
  14. ^ Gregory Piatetsky-Shapiro (2004) KDnuggets Methodology Poll
  15. ^ Gregory Piatetsky-Shapiro (2007) KDnuggets Methodology Poll
  16. ^ Óscar Marbán, Gonzalo Mariscal and Javier Segovia (2009); A Data Mining & Knowledge Discovery Process Model. In Data Mining and Knowledge Discovery in Real Life Applications, Book edited by: Julio Ponce and Adem Karahoca, ISBN 978-3-902613-53-0, pp. 438–453, February 2009, I-Tech, Vienna, Austria.
  17. ^ Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, NY, USA doi: 10.1017/S0269888906000737.
  18. ^ Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
  19. ^ Günnemann, Stephan; Kremer, Hardy; Seidl, Thomas (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11. p. 48. doi:10.1145/2023598.2023605. ISBN 9781450308373.  edit
  20. ^ O'Brien, J. A., & Marakas, G. M. (2011). Management Information Systems. New York, NY: McGraw-Hill/Irwin.
  21. ^ Alexander, D. (n.d.). Data Mining. Retrieved from The University of Texas at Austin: College of Liberal Arts: http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/
  22. ^ Goss, S. (2013, April 10). Data-mining and our personal privacy. Retrieved from The Telegraph: http://www.macon.com/2013/04/10/2429775/data-mining-and-our-personal-privacy.html
  23. ^ Monk, Ellen; Wagner, Bret (2006). Concepts in Enterprise Resource Planning, Second Edition. Boston, MA: Thomson Course Technology. ISBN 0-619-21663-8. OCLC 224465825. 
  24. ^ Battiti, Roberto; and Brunato, Mauro; Reactive Business Intelligence. From Data to Models to Insight, Reactive Search Srl, Italy, February 2011. ISBN 978-88-905795-0-9.
  25. ^ Battiti, Roberto; Passerini, Andrea (2010). "Brain-Computer Evolutionary Multi-Objective Optimization (BC-EMO): a genetic algorithm adapting to the decision maker". IEEE Transactions on Evolutionary Computation 14 (15): 671–687. doi:10.1109/TEVC.2010.2058118. 
  26. ^ Fountain, Tony; Dietterich, Thomas; and Sudyka, Bill (2000); Mining IC Test Data to Optimize VLSI Testing, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM Press, pp. 18–25
  27. ^ Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. p. 18. ISBN 978-1-59904-252-7. 
  28. ^ a b McGrail, Anthony J.; Gulski, Edward; Allan, David; Birtwhistle, David; Blackburn, Trevor R.; Groot, Edwin R. S. "Data Mining Techniques to Assess the Condition of High Voltage Electrical Plant". CIGRÉ WG 15.11 of Study Committee 15. 
  29. ^ Baker, Ryan S. J. d. "Is Gaming the System State-or-Trait? Educational Data Mining Through the Multi-Contextual Application of a Validated Behavioral Model". Workshop on Data Mining for User Modeling 2007. 
  30. ^ Superby Aguirre, Juan Francisco; Vandamme, Jean-Philippe; Meskens, Nadine. "Determination of factors influencing the achievement of the first-year university students using data mining methods". Workshop on Educational Data Mining 2006. 
  31. ^ Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. pp. 163–189. ISBN 978-1-59904-252-7. 
  32. ^ Zhu, Xingquan; Davidson, Ian (2007). Knowledge Discovery and Data Mining: Challenges and Realities. New York, NY: Hershey. pp. 31–48. ISBN 978-1-59904-252-7. 
  33. ^ Chen, Yudong; Zhang, Yi; Hu, Jianming; Li, Xiang (2006). "Traffic Data Analysis Using Kernel PCA and Self-Organizing Map". IEEE Intelligent Vehicles Symposium. 
  34. ^ Bate, Andrew; Lindquist, Marie; Edwards, I. Ralph; Olsson, Sten; Orre, Roland; Lansner, Anders; and de Freitas, Rogelio Melhado; A Bayesian neural network method for adverse drug reaction signal generation, European Journal of Clinical Pharmacology 1998 Jun; 54(4):315–21
  35. ^ Norén, G. Niklas; Bate, Andrew; Hopstadius, Johan; Star, Kristina; and Edwards, I. Ralph (2008); Temporal Pattern Discovery for Trends and Transient Effects: Its Application to Patient Records. Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining (SIGKDD 2008), Las Vegas, NV, pp. 963–971.
  36. ^ Zernik, Joseph; Data Mining as a Civic Duty – Online Public Prisoners' Registration Systems, International Journal on Social Media: Monitoring, Measurement, Mining, 1: 84–96 (2010)
  37. ^ Zernik, Joseph; Data Mining of Online Judicial Records of the Networked US Federal Courts, International Journal on Social Media: Monitoring, Measurement, Mining, 1:69–83 (2010)
  38. ^ David G. Savage (2011-06-24). "Pharmaceutical industry: Supreme Court sides with pharmaceutical industry in two decisions". Los Angeles Times. Retrieved 2012-11-07. 
  39. ^ Healey, Richard G. (1991); Database Management Systems, in Maguire, David J.; Goodchild, Michael F.; and Rhind, David W., (eds.), Geographic Information Systems: Principles and Applications, London, GB: Longman
  40. ^ Camara, Antonio S.; and Raper, Jonathan (eds.) (1999); Spatial Multimedia and Virtual Reality, London, GB: Taylor and Francis
  41. ^ Miller, Harvey J.; and Han, Jiawei (eds.) (2001); Geographic Data Mining and Knowledge Discovery, London, GB: Taylor & Francis
  42. ^ Ma, Y.; Richards, M.; Ghanem, M.; Guo, Y.; Hassard, J. (2008). "Air Pollution Monitoring and Mining Based on Sensor Grid in London". Sensors 8 (6): 3601. doi:10.3390/s8063601.  edit
  43. ^ Ma, Y.; Guo, Y.; Tian, X.; Ghanem, M. (2011). "Distributed Clustering-Based Aggregation Algorithm for Spatial Correlated Sensor Networks". IEEE Sensors Journal 11 (3): 641. doi:10.1109/JSEN.2010.2056916.  edit
  44. ^ Zhao, Kaidi; and Liu, Bing; Tirpark, Thomas M.; and Weimin, Xiao; A Visual Data Mining Framework for Convenient Identification of Useful Knowledge
  45. ^ Keim, Daniel A.; Information Visualization and Visual Data Mining
  46. ^ Burch, Michael; Diehl, Stephan; Weißgerber, Peter; Visual Data Mining in Software Archives
  47. ^ Pachet, François; Westermann, Gert; and Laigre, Damien; Musical Data Mining for Electronic Music Distribution, Proceedings of the 1st WedelMusic Conference,Firenze, Italy, 2001, pp. 101–106.
  48. ^ Government Accountability Office, Data Mining: Early Attention to Privacy in Developing a Key DHS Program Could Reduce Risks, GAO-07-293 (February 2007), Washington, DC
  49. ^ Secure Flight Program report, MSNBC
  50. ^ "Total/Terrorism Information Awareness (TIA): Is It Truly Dead?". Electronic Frontier Foundation (official website). 2003. Retrieved 2009-03-15. 
  51. ^ Agrawal, Rakesh; Mannila, Heikki; Srikant, Ramakrishnan; Toivonen, Hannu; and Verkamo, A. Inkeri; Fast discovery of association rules, in Advances in knowledge discovery and data mining, MIT Press, 1996, pp. 307–328
  52. ^ a b National Research Council, Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment, Washington, DC: National Academies Press, 2008
  53. ^ Haag, Stephen; Cummings, Maeve; Phillips, Amy (2006). Management Information Systems for the information age. Toronto: McGraw-Hill Ryerson. p. 28. ISBN 0-07-095569-7. OCLC 63194770. 
  54. ^ Ghanem, Moustafa; Guo, Yike; Rowe, Anthony; Wendel, Patrick (2002). "Grid-based knowledge discovery services for high throughput informatics". Proceedings 11th IEEE International Symposium on High Performance Distributed Computing. p. 416. doi:10.1109/HPDC.2002.1029946. ISBN 0-7695-1686-6.  edit
  55. ^ Ghanem, Moustafa; Curcin, Vasa; Wendel, Patrick; Guo, Yike (2009). "Building and Using Analytical Workflows in Discovery Net". Data Mining Techniques in Grid Computing Environments. p. 119. doi:10.1002/9780470699904.ch8. ISBN 9780470699904.  edit
  56. ^ Cannataro, Mario; Talia, Domenico (January 2003). "The Knowledge Grid: An Architecture for Distributed Knowledge Discovery". Communications of the ACM 46 (1): 89–93. doi:10.1145/602421.602425. Retrieved 17 October 2011. 
  57. ^ Talia, Domenico; Trunfio, Paolo (July 2010). "How distributed data mining tasks can thrive as knowledge services". Communications of the ACM 53 (7): 132–137. doi:10.1145/1785414.1785451. Retrieved 17 October 2011. 
  58. ^ Seltzer, William. The Promise and Pitfalls of Data Mining: Ethical Issues. 
  59. ^ Pitts, Chip (15 March 2007). "The End of Illegal Domestic Spying? Don't Count on It". Washington Spectator. 
  60. ^ Taipale, Kim A. (15 December 2003). "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data". Columbia Science and Technology Law Review 5 (2). OCLC 45263753. SSRN 546782. 
  61. ^ Resig, John; and Teredesai, Ankur (2004). "A Framework for Mining Instant Messaging Services". Proceedings of the 2004 SIAM DM Conference. 
  62. ^ a b c Think Before You Dig: Privacy Implications of Data Mining & Aggregation, NASCIO Research Brief, September 2004
  63. ^ Biotech Business Week Editors (June 30, 2008); BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research, Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic
  64. ^ AOL search data identified individuals, SecurityFocus, August 2006
  65. ^ Mikut, Ralf; Reischl, Markus (September/October 2011). "Data Mining Tools". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1 (5): 431–445. doi:10.1002/widm.24. Retrieved October 21, 2011. 
  66. ^ Karl Rexer, Heather Allen, & Paul Gearan (2011); Understanding Data Miners, Analytics Magazine, May/June 2011 (INFORMS: Institute for Operations Research and the Management Sciences).
  67. ^ Kobielus, James; The Forrester Wave: Predictive Analytics and Data Mining Solutions, Q1 2010, Forrester Research, 1 July 2008
  68. ^ Herschel, Gareth; Magic Quadrant for Customer Data-Mining Applications, Gartner Inc., 1 July 2008
  69. ^ Nisbet, Robert A. (2006); Data Mining Tools: Which One is Best for CRM? Part 1, Information Management Special Reports, January 2006
  70. ^ Haughton, Dominique; Deichmann, Joel; Eshghi, Abdolreza; Sayek, Selin; Teebagy, Nicholas; and Topi, Heikki (2003); A Review of Software Packages for Data Mining, The American Statistician, Vol. 57, No. 4, pp. 290–309

Further reading[edit]

  • Cabena, Peter; Hadjnian, Pablo; Stadler, Rolf; Verhees, Jaap; and Zanasi, Alessandro (1997); Discovering Data Mining: From Concept to Implementation, Prentice Hall, ISBN 0-13-743980-6
  • Feldman, Ronen; and Sanger, James; The Text Mining Handbook, Cambridge University Press, ISBN 978-0-521-83657-9
  • Guo, Yike; and Grossman, Robert (editors) (1999); High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers
  • Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome (2001); The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, ISBN 0-387-95284-5
  • Liu, Bing (2007); Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer, ISBN 3-540-37881-2
  • Murphy, Chris (16 May 2011). "Is Data Mining Free Speech?". InformationWeek (UMB): 12. 
  • Nisbet, Robert; Elder, John; Miner, Gary (2009); Handbook of Statistical Analysis & Data Mining Applications, Academic Press/Elsevier, ISBN 978-0-12-374765-5
  • Poncelet, Pascal; Masseglia, Florent; and Teisseire, Maguelonne (editors) (October 2007); "Data Mining Patterns: New Methods and Applications", Information Science Reference, ISBN 978-1-59904-162-9
  • Tan, Pang-Ning; Steinbach, Michael; and Kumar, Vipin (2005); Introduction to Data Mining, ISBN 0-321-32136-7
  • Theodoridis, Sergios; and Koutroumbas, Konstantinos (2009); Pattern Recognition, 4th Edition, Academic Press, ISBN 978-1-59749-272-0
  • Weiss, Sholom M.; and Indurkhya, Nitin (1998); Predictive Data Mining, Morgan Kaufmann
  • Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0.  (See also Free Weka software)
  • Ye, Nong (2003); The Handbook of Data Mining, Mahwah, NJ: Lawrence Erlbaum

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_pre_processing b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_pre_processing new file mode 100644 index 00000000..01692cf3 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_pre_processing @@ -0,0 +1 @@ + Data pre-processing - Wikipedia, the free encyclopedia

Data pre-processing

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. The representation and quality of the data must therefore be checked before any analysis is run.[1]

If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, and so on. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.[2]
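
A minimal sketch of such checks in plain Python is shown below; the field names, validation rules, and values are illustrative only, and a real project would normally rely on dedicated tooling rather than hand-written rules:

  # Hypothetical records with the kinds of problems described above.
  records = [
      {"income": 52000, "sex": "male",   "pregnant": "no"},
      {"income": 61000, "sex": "female", "pregnant": "no"},
      {"income": -100,  "sex": "female", "pregnant": "no"},    # out-of-range value
      {"income": 48000, "sex": "male",   "pregnant": "yes"},   # impossible combination
      {"income": None,  "sex": "female", "pregnant": "yes"},   # missing value
  ]

  def is_valid(r):
      if r["income"] is None or r["income"] < 0:
          return False
      if r["sex"] == "male" and r["pregnant"] == "yes":
          return False
      return True

  cleaned = [r for r in records if is_valid(r)]

  # Simple min-max normalization of the surviving numeric attribute to [0, 1].
  incomes = [r["income"] for r in cleaned]
  lo, hi = min(incomes), max(incomes)
  for r in cleaned:
      r["income_norm"] = (r["income"] - lo) / (hi - lo) if hi > lo else 0.0

  print(cleaned)  # the "final training set" after cleaning and normalization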

References[edit]

  1. ^ Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Los Altos, California.
  2. ^ S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Leaning", International Journal of Computer Science, 2006, Vol. 1, No. 2, pp. 111–117.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_set b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_set new file mode 100644 index 00000000..2031606a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_set @@ -0,0 +1 @@ + Data set - Wikipedia, the free encyclopedia

Data set

From Wikipedia, the free encyclopedia
Jump to: navigation, search

A dataset (or data set) is a collection of data.

Most commonly a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where each column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question. The dataset lists values for each of the variables, such as height and weight of an object, for each member of the dataset. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows.

The term dataset may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event.
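
A minimal sketch, using illustrative values, of this tabular view of a dataset in plain Python:

  # Columns are variables, rows are members, and each individual value is a datum.
  columns = ["height_cm", "weight_kg"]
  rows = [
      [172, 68],   # member 1
      [181, 77],   # member 2
      [165, 59],   # member 3
  ]

  # The datum for the variable "weight_kg" of member 2 (row index 1):
  print(rows[1][columns.index("weight_kg")])   # 77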

Contents

History[edit]

Historically, the term originated in the mainframe field, where it had a well-defined meaning, very close to that of a contemporary computer file[citation needed].

Properties[edit]

Several characteristics define a dataset's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.[1]

In the simplest case, there is only one variable, and the dataset consists of a single column of values, often represented as a list. In spite of the name, such a univariate dataset is not a set in the usual mathematical sense, since a given value may occur multiple times. Usually the order does not matter, and then the collection of values may be considered a multiset rather than an (ordered) list[original research?].

The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. However, there may also be missing values, which must be indicated in some way.
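
A minimal sketch, with illustrative values, of a univariate dataset viewed as a multiset, with a missing value marked explicitly:

  from collections import Counter

  # One variable, several members; 170 occurs twice, None marks a missing value.
  heights_cm = [170, 165, 170, 181, None, 159]

  observed = [v for v in heights_cm if v is not None]
  print(Counter(observed))           # Counter({170: 2, 165: 1, 181: 1, 159: 1})
  print(heights_cm.count(None))      # 1 missing value that must be handled somehow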

In statistics, datasets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Datasets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents its data in the classical dataset fashion.

Classic datasets[edit]

Several classic datasets have been used extensively in the statistical literature:

  • Anscombe's quartet: a small dataset illustrating the importance of graphing the data to avoid statistical fallacies

See also[edit]


Notes[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_stream_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_stream_mining new file mode 100644 index 00000000..207d63ea --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_stream_mining @@ -0,0 +1 @@ + Data stream mining - Wikipedia, the free encyclopedia

Data stream mining

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e., the class or target value to be predicted may change over time. This problem is referred to as concept drift.
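
A minimal sketch of this setting in plain Python is shown below; it is a toy one-pass classifier over a synthetic one-dimensional stream, not a real data stream mining system, and it uses a bounded sliding window so that it can recover after an artificial concept drift:

  import random
  from collections import deque, defaultdict

  window = deque(maxlen=200)   # only a bounded part of the stream is retained

  def predict(x):
      # Nearest class mean over the current window; default guess if it is empty.
      sums, counts = defaultdict(float), defaultdict(int)
      for xi, yi in window:
          sums[yi] += xi
          counts[yi] += 1
      if not counts:
          return 0
      return min(counts, key=lambda c: abs(x - sums[c] / counts[c]))

  correct = 0
  for t in range(10000):
      drifted = t >= 5000                  # the concept changes halfway through
      y = random.randint(0, 1)
      center = (1 - y if drifted else y) * 3.0
      x = random.gauss(center, 1.0)

      correct += predict(x) == y           # prequential: test on the instance first,
      window.append((x, y))                # then learn from it; it is never re-read

  print("accuracy:", correct / 10000)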

Contents

Software for data stream mining [edit]

  • RapidMiner: free open-source software for knowledge discovery, data mining, and machine learning that also features data stream mining, learning time-varying concepts, and tracking drifting concepts (when used in combination with its data stream mining plugin, formerly the concept drift plugin)
  • MOA (Massive Online Analysis): free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi-directional interaction with Weka (machine learning).

Events [edit]

Researchers working on data stream mining [edit]

Master References [edit]

Bibliographic References [edit]

  • Minku and Yao. "DDD: A New Ensemble Approach For Dealing With Concept Drift.", IEEE Transactions on Knowledge and Data Engineering, 24:(4), p. 619-633, 2012.
  • Hahsler, Michael and Dunham, Margaret H. Temporal structure learning for clustering massive data streams in real-time. In SIAM Conference on Data Mining (SDM11), pages 664-675. SIAM, April 2011.
  • Minku, White and Yao. "The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift.", IEEE Transactions on Knowledge and Data Engineering, 22:(5), p. 730-742, 2010.
  • Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams. ECML/PKDD (2) 2009: 79-94 (extended version will appear in TKDE journal).
  • Scholz, Martin and Klinkenberg, Ralf: Boosting Classifiers for Drifting Concepts. In Intelligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data Streams, Vol. 11, No. 1, pages 3–28, March 2007.
  • Nasraoui O. , Cerwinske J., Rojas C., and Gonzalez F., "Collaborative Filtering in Dynamic Usage Environments", in Proc. of CIKM 2006 – Conference on Information and Knowledge Management, Arlington VA , Nov. 2006
  • Nasraoui O. , Rojas C., and Cardona C., “ A Framework for Mining Evolving Trends in Web Data Streams using Dynamic Learning and Retrospective Validation ”, Journal of Computer Networks- Special Issue on Web Dynamics, 50(10), 1425-1652, July 2006
  • Scholz, Martin and Klinkenberg, Ralf: An Ensemble Classifier for Drifting Concepts. In Gama, J. and Aguilar-Ruiz, J. S. (editors), Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, pages 53–64, Porto, Portugal, 2005.
  • Klinkenberg, Ralf: Learning Drifting Concepts: Example Selection vs. Example Weighting. In Intelligent Data Analysis (IDA), Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, Vol. 8, No. 3, pages 281—300, 2004.
  • Klinkenberg, Ralf: Using Labeled and Unlabeled Data to Learn Drifting Concepts. In Kubat, Miroslav and Morik, Katharina (editors), Workshop notes of the IJCAI-01 Workshop on \em Learning from Temporal and Spatial Data, pages 16–24, IJCAI, Menlo Park, CA, USA, AAAI Press, 2001.
  • Maloof M. and Michalski R. Selecting examples for partial memory learning. Machine Learning, 41(11), 2000, pp. 27–52.
  • Koychev I. Gradual Forgetting for Adaptation to Concept Drift. In Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning. Berlin, Germany, 2000, pp. 101–106
  • Klinkenberg, Ralf and Joachims, Thorsten: Detecting Concept Drift with Support Vector Machines. In Langley, Pat (editor), Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 487—494, San Francisco, CA, USA, Morgan Kaufmann, 2000.
  • Koychev I. and Schwab I., Adaptation to Drifting User’s Interests, Proc. of ECML 2000 Workshop: Machine Learning in New Information Age, Barcelona, Spain, 2000, pp. 39–45
  • Schwab I., Pohl W. and Koychev I. Learning to Recommend from Positive Evidence, Proceedings of Intelligent User Interfaces 2000, ACM Press, 241 - 247.
  • Klinkenberg, Ralf and Renz, Ingrid: Adaptive Information Filtering: Learning in the Presence of Concept Drifts. In Sahami, Mehran and Craven, Mark and Joachims, Thorsten and McCallum, Andrew (editors), Workshop Notes of the ICML/AAAI-98 Workshop \em Learning for Text Categorization, pages 33–40, Menlo Park, CA, USA, AAAI Press, 1998.
  • Grabtree I. Soltysiak S. Identifying and Tracking Changing Interests. International Journal of Digital Libraries, Springer Verlag, vol. 2, 38-53.
  • Widmer G. Tracking Context Changes through Meta-Learning, Machine Learning 27, 1997, pp. 256–286.
  • Maloof, M.A. and Michalski, R.S. Learning Evolving Concepts Using Partial Memory Approach. Working Notes of the 1995 AAAI Fall Symposium on Active Learning, Boston, MA, pp. 70–73, 1995
  • Mitchell T., Caruana R., Freitag D., McDermott, J. and Zabowski D. Experience with a Learning Personal Assistant. Communications of the ACM 37(7), 1994, pp. 81–91.
  • Widmer G. and Kubat M. Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 1996, pp. 69–101.
  • Schlimmer J., and Granger R. Incremental Learning from Noisy Data, Machine Learning, 1(3), 1986, 317-357.

Books [edit]

  • João Gama and Mohamed Medhat Gaber (Eds.), Learning from Data Streams: Processing Techniques in Sensor Networks, Springer, 2007.
  • Auroop R. Ganguly, João Gama, Olufemi A. Omitaomu, Mohamed M. Gaber, and Ranga R. Vatsavai (Eds), Knowledge Discovery from Sensor Data, CRC Press, 2008.
  • João Gama, Knowledge Discovery from Data Streams, Chapman and Hall/CRC, 2010.

See also [edit]

External references [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_visualization b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_visualization new file mode 100644 index 00000000..c6bdd514 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_visualization @@ -0,0 +1 @@ + Data visualization - Wikipedia, the free encyclopedia

Data visualization

From Wikipedia, the free encyclopedia
Jump to: navigation, search
A data visualization of Wikipedia as part of the World Wide Web, demonstrating hyperlinks

Data visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".[1]

Contents

Overview[edit]

A data visualization from social media

According to Friedman (2008) the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose — to communicate information".[2]

Indeed, Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention.[3]

Data visualization is closely related to information graphics, information visualization, scientific visualization, and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united scientific and information visualization.[4] Brian Willison has demonstrated that data visualization has also been linked to enhancing agile software development and customer engagement.[5]

KPI Library has developed the “Periodic Table of Visualization Methods,” an interactive chart displaying various data visualization methods. It includes six types of data visualization methods: data, information, concept, strategy, metaphor and compound.[6]

Data visualization scope[edit]

There are different approaches to the scope of data visualization. One common focus is on information presentation, as Friedman (2008) presents it. Along these lines, Friendly (2008) presumes two main parts of data visualization: statistical graphics and thematic cartography.[1] In the same vein, the "Data Visualization: Modern Approaches" (2007) article gives an overview of seven subjects of data visualization:[7]

All these subjects are closely related to graphic design and information representation.

On the other hand, from a computer science perspective, Frits H. Post (2002) categorized the field into a number of sub-fields:[4]

For different types of visualizations and their connection to infographics, see infographics.

Related fields[edit]

Data acquisition[edit]

Data acquisition is the sampling of the real world to generate data that can be manipulated by a computer. Sometimes abbreviated DAQ or DAS, data acquisition typically involves acquisition of signals and waveforms and processing the signals to obtain desired information. The components of data acquisition systems include appropriate sensors that convert any measurement parameter to an electrical signal, which is acquired by data acquisition hardware.

Data analysis[edit]

Data analysis is the process of studying and summarizing data with the intent to extract useful information and develop conclusions. Data analysis is closely related to data mining, but data mining tends to focus on larger data sets with less emphasis on making inference, and often uses data that was originally collected for a different purpose. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis, and inferential statistics (or confirmatory data analysis), where the EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.

Types of data analysis are:

Data governance[edit]

Data governance encompasses the people, processes and technology required to create a consistent, enterprise view of an organisation's data in order to:

  • Increase consistency & confidence in decision making
  • Decrease the risk of regulatory fines
  • Improve data security
  • Maximize the income generation potential of data
  • Designate accountability for information quality

Data management[edit]

Data management comprises all the academic disciplines related to managing data as a valuable resource. The official definition provided by DAMA is that "Data Resource Management is the development and execution of architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions that may not have direct technical contact with lower-level aspects of data management, such as relational database management.

Data mining[edit]

Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods.

It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"[8] and "the science of extracting useful information from large data sets or databases."[9] In relation to enterprise resource planning, according to Monk (2006), data mining is "the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making".[10]

Data transforms[edit]

Data transforms is the process of automating the transformation of both real-time and offline data from one format to another. Standards and protocols provide the specifications and rules, and the transformation usually occurs in the processing pipeline for aggregation, consolidation, or interoperability. The primary use cases are in systems-integration organizations and for compliance personnel.

Data visualization software[edit]

Software | Type | Targeted Users | License
ANTz | Realtime 3D Data Visualization | Analysts, Scientists, Programmers, VR | Public Domain
Amira | GUI/Code Data Visualisation | Scientists | Proprietary
Avizo | GUI/Code Data Visualisation | Engineers and Scientists | Proprietary
Cave5D | Virtual Reality Data Visualization | Scientists | Open Source
curios.IT | Interactive 3D Data Visualization | Business Managers | Proprietary
Data Desk | GUI Data Visualisation | Statistician | Proprietary
DAVIX | Operating System with data tools | Security Consultant | Various
Dundas Data Visualization, Inc. | GUI Data Visualisation | Business Managers | Proprietary
ELKI | Data mining visualizations | Scientists and Teachers | Open Source
Eye-Sys | GUI/Code Data Visualisation | Engineers and Scientists | Proprietary
Ferret Data Visualization and Analysis | Gridded Datasets Visualisation | Oceanographers and meteorologists | Open Source
FusionCharts | Component | Programmers | Proprietary
Geoscape | Geographic Data Visualisation | Business Users | Proprietary
TreeMap | GUI Data Visualisation | Business Managers | Proprietary
Trendalyzer | Data Visualisation | Teachers | Proprietary
Tulip | GUI Data Visualization | Researchers and Engineers | Open Source
Gephi | GUI Data Visualisation | Statistician | Open Source
GGobi | GUI Data Visualisation | Statistician | Open Source
Grapheur | GUI Data Visualisation | Business Users, Project Managers, Coaches | Proprietary
ggplot2 | Data visualization package for R | Programmers | Open Source
Mondrian | GUI Data Visualisation | Statistician | Open Source
IBM OpenDX | GUI/Code Data Visualisation | Engineers and Scientists | Open Source
IDL (programming language) | Code Data Visualisation | Programmer | Many
IDL (programming language) | Programming Language | Programmer | Open Source
InetSoft | GUI Data Visualization | Business Users, Developers, Academics | Proprietary
Infogr.am | Online Infographic tool | Journalists, Bloggers, Education, Business Users | Proprietary
Instantatlas | GIS Data Visualisation | Analysts, researchers, statisticians and GIS professionals | Proprietary
MeVisLab | GUI/Code Data Visualisation | Engineers and Scientists | Proprietary
MindView | Mind Map Graphic Visualisation | Business Users and Project Managers | Proprietary
Kumu | Web-Based Relationship Visualization | Social Impact, Business, Government & Policy | Proprietary
Panopticon Software | Enterprise application, SDK, Rapid Development Kit (RDK) | Capital Markets, Telecommunications, Energy, Government | Proprietary
Panorama Software | GUI Data Visualisation | Business Users | Proprietary
PanXpan | GUI Data Visualisation | Business Users | Proprietary
ParaView | GUI/Code Data Visualisation | Engineers and Scientists | BSD
Processing (programming language) | Programming Language | Programmers | GPL
ProfilePlot | GUI Data Visualisation | Engineers and Scientists | Proprietary
protovis | Library / Toolkit | Programmers | BSD
qunb | GUI Data Visualisation | Non-Expert Business Users | Proprietary
SAS Institute | GUI Data Visualisation | Business Users, Analysts | Proprietary
ScienceGL | Components / Solutions | OEM, Scientists, Engineers, Analysts | Proprietary
Smile (software) | GUI/Code Data Visualisation | Engineers and Scientists | Proprietary
Spotfire | GUI Data Visualisation | Business Users | Proprietary
StatSoft | Company of GUI/Code Data Visualisation Software | Engineers and Scientists | Proprietary
Tableau Software | GUI Data Visualisation | Business Users | Proprietary
PowerPanels | GUI Data Visualisation | Business Users | Proprietary
The Hive Group: Honeycomb | GUI Data Visualisation | Energy, Financial Services, Manufacturers, Government, Military | Proprietary
The Hive Group: HiveOnDemand | GUI Data Visualisation | Business Users, Academic Users | Proprietary
TinkerPlots | GUI Data Visualisation | Students | Proprietary
Tom Sawyer Software | Data Visualization and Social Network Analysis Applications | Capital Markets, Telecommunications, Energy, Government; Business Users, Engineers, and Scientists | Proprietary
Trade Space Visualizer | GUI/Code Data Visualisation | Engineers and Scientists | Proprietary
Visifire | Library | Programmers | Was Open Source, now Proprietary
Vis5D | GUI Data Visualization | Scientists | Open Source
VisAD | Java/Jython Library | Programmers | Open Source
VisIt | GUI/Code Data Visualisation | Engineers and Scientists | BSD
VTK | C++ Library | Programmers | Open Source
Weave | Web-based data visualization | Many | Open Source[11]
Yoix | Programming Language | Programmers | Open Source
Visual.ly | Company | Creative Tools: Data curation and visualization | Proprietary
Holsys One | Show the algorithms inside a data GUI Data Visualisation | Engineers and Scientists | Proprietary

Data presentation architecture[edit]

Data presentation architecture (DPA) is a skill-set that seeks to identify, locate, manipulate, format and present data in such a way as to optimally communicate meaning and proffer knowledge.

Historically, the term data presentation architecture is attributed to Kelly Lautt:[12] "Data Presentation Architecture (DPA) is a rarely applied skill set critical for the success and value of Business Intelligence. Data presentation architecture weds the science of numbers, data and statistics in discovering valuable information from data and making it usable, relevant and actionable with the arts of data visualization, communications, organizational psychology and change management in order to provide business intelligence solutions with the data scope, delivery timing, format and visualizations that will most effectively support and drive operational, tactical and strategic behaviour toward understood business (or organizational) goals. DPA is neither an IT nor a business skill set but exists as a separate field of expertise. Often confused with data visualization, data presentation architecture is a much broader skill set that includes determining what data on what schedule and in what exact format is to be presented, not just the best way to present data that has already been chosen (which is data visualization). Data visualization skills are one element of DPA."

Objectives[edit]

DPA has two main objectives:

  • To use data to provide knowledge in the most effective manner possible (provide relevant, timely and complete data to each audience member in a clear and understandable manner that conveys important meaning, is actionable and can affect understanding, behavior and decisions)
  • To use data to provide knowledge in the most efficient manner possible (minimize noise, complexity, and unnecessary data or detail given each audience's needs and roles)

Scope[edit]

With the above objectives in mind, the actual work of data presentation architecture consists of:

  • Defining important meaning (relevant knowledge) that is needed by each audience member in each context
  • Finding the right data (subject area, historical reach, breadth, level of detail, etc.)
  • Determining the required periodicity of data updates (the currency of the data)
  • Determining the right timing for data presentation (when and how often the user needs to see the data)
  • Utilizing appropriate analysis, grouping, visualization, and other presentation formats
  • Creating effective delivery mechanisms for each audience member depending on their role, tasks, locations and access to technology

Related fields[edit]

DPA work has some commonalities with several other fields, including:

  • Business analysis in determining business goals, collecting requirements, mapping processes.
  • Solution architecture in determining the optimal detailed solution, including the scope of data to include, given the business goals
  • Business process improvement in that its goal is to improve and streamline actions and decisions in furtherance of business goals
  • Statistical analysis or data analysis in that it creates information and knowledge out of data
  • Data visualization in that it uses well-established theories of visualization to add or highlight meaning or importance in data presentation.
  • Information architecture, but information architecture's focus is on unstructured data and therefore excludes both analysis (in the statistical/data sense) and direct transformation of the actual content (data, for DPA) into new entities and combinations.
  • Graphic or user design: As the term DPA is used, it falls just short of design in that it does not consider such detail as colour palettes, styling, branding and other aesthetic concerns, unless these design elements are specifically required or beneficial for communication of meaning, impact, severity or other information of business value. For example:
    • choosing to provide a specific colour in graphical elements that represent data of specific meaning or concern is part of the DPA skill-set
    • choosing locations for various data presentation elements on a presentation page (such as in a company portal, in a report or on a web page) in order to convey hierarchy, priority, importance or a rational progression for the user is part of the DPA skill-set.

See also[edit]

References[edit]

  1. ^ a b Michael Friendly (2008). "Milestones in the history of thematic cartography, statistical graphics, and data visualization".
  2. ^ Vitaly Friedman (2008) "Data Visualization and Infographics" in: Graphics, Monday Inspiration, January 14th, 2008.
  3. ^ Fernanda Viegas and Martin Wattenberg, "How To Make Data Look Sexy", CNN.com, April 19, 2011. http://articles.cnn.com/2011-04-19/opinion/sexy.data_1_visualization-21st-century-engagement?_s=PM:OPINION
  4. ^ a b Frits H. Post, Gregory M. Nielson and Georges-Pierre Bonneau (2002). Data Visualization: The State of the Art. Research paper, TU Delft, 2002.
  5. ^ Brian Willison, "Visualization Driven Rapid Prototyping", Parsons Institute for Information Mapping, 2008
  6. ^ Lengler, Ralph. "Periodic Table of Visualization Methods". www.visual-literacy.org. Retrieved 15 March 2013. 
  7. ^ "Data Visualization: Modern Approaches". in: Graphics, August 2nd, 2007
  8. ^ W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213–228. ISSN 0738-4602. 
  9. ^ D. Hand, H. Mannila, P. Smyth (2001). Principles of Data Mining. MIT Press, Cambridge, MA. ISBN 0-262-08290-X. 
  10. ^ Ellen Monk, Bret Wagner (2006). Concepts in Enterprise Resource Planning, Second Edition. Thomson Course Technology, Boston, MA. ISBN 0-619-21663-8. 
  11. ^ http://oicweave.org/
  12. ^ The first formal, recorded, public usages of the term data presentation architecture were at the three formal Microsoft Office 2007 Launch events in Dec, Jan and Feb of 2007-08 in Edmonton, Calgary and Vancouver (Canada) in a presentation by Kelly Lautt describing a business intelligence system designed to improve service quality in a pulp and paper company. The term was further used and recorded in public usage on December 16, 2009 in a Microsoft Canada presentation on the value of merging Business Intelligence with corporate collaboration processes.

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_warehouse b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_warehouse new file mode 100644 index 00000000..cdc5583f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Data_warehouse @@ -0,0 +1 @@ + Data warehouse - Wikipedia, the free encyclopedia

Data warehouse

From Wikipedia, the free encyclopedia
Jump to: navigation, search
Data Warehouse Overview

In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW) is a database used for reporting and data analysis. It is a central repository of data which is created by integrating data from one or more disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales etc., shown in the figure to the right). The data may pass through an operational data store for additional operations before they are used in the DW for reporting.

The typical ETL-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.[1]
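
A minimal sketch of this layering, using Python's built-in sqlite3 module; the table and column names are illustrative only and do not follow any particular product or standard:

  import sqlite3

  con = sqlite3.connect(":memory:")
  cur = con.cursor()

  # Staging layer: raw rows extracted from a source system.
  cur.execute("CREATE TABLE staging_sales (sold_on TEXT, product TEXT, amount REAL)")
  cur.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)",
                  [("2013-05-01", "Widget", 9.99), ("2013-05-01", "Gadget", 24.50),
                   ("2013-05-02", "Widget", 9.99)])

  # Warehouse layer: one dimension and one fact table (a tiny star schema).
  cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT UNIQUE)")
  cur.execute("CREATE TABLE fact_sales (date TEXT, product_key INTEGER, amount REAL)")

  # Integration step: transform staged rows into dimension keys plus facts.
  for sold_on, product, amount in cur.execute("SELECT * FROM staging_sales").fetchall():
      cur.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (product,))
      key = cur.execute("SELECT product_key FROM dim_product WHERE name = ?", (product,)).fetchone()[0]
      cur.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", (sold_on, key, amount))

  # Access layer: an aggregate query a reporting user might run.
  print(cur.execute("""SELECT p.name, SUM(f.amount)
                       FROM fact_sales f JOIN dim_product p USING (product_key)
                       GROUP BY p.name""").fetchall())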

A data warehouse constructed from integrated data source systems does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered to be a part of a distributed operational data store layer. Data federation methods or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems.

Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse.

This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, cataloged and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

Contents

Benefits of a data warehouse[edit]

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

  • Congregate data from multiple sources into a single database so a single query engine can be used to present data.
  • Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long running, analysis queries in transaction processing databases.
  • Maintain data history, even if the source transaction systems do not.
  • Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
  • Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
  • Present the organization's information consistently.
  • Provide a single common data model for all data of interest regardless of the data's source.
  • Restructure the data so that it makes sense to the business users.
  • Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
  • Add value to operational business applications, notably customer relationship management (CRM) systems.

Generic data warehouse environment[edit]

The environment for data warehouses and marts includes the following:

  • Source systems that provide data to the warehouse or mart;
  • Data integration technology and processes that are needed to prepare the data for use;
  • Different architectures for storing data in an organization's data warehouse or data marts;
  • Different tools and applications for the variety of users;
  • Metadata, data quality, and governance processes must be in place to ensure that the warehouse or mart meets its purposes.

In regards to source systems listed above, Rainer states, “A common source for the data in data warehouses is the company’s operational databases, which can be relational databases” (130).

Regarding data integration, Rainer states, “It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse” (131).

Rainer discusses storing data in an organization’s data warehouse or data marts. “There are a variety of possible architectures to store decision-support data” (131).

Metadata are data about data. “IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures” (133).

Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers (Rainer, 127). A “data warehouse” is a repository of historical data that are organized by subject to support decision makers in the organization (128). Once data are stored in a data mart or warehouse, they can be accessed.

Rainer, R. Kelly (2012-05-01). Introduction to Information Systems: Enabling and Transforming Business, 4th Edition (Page 129). Wiley. Kindle Edition.

History[edit]

The concept of data warehousing dates back to the late 1980s[2] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

Key developments in early years of data warehousing were:

  • 1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3]
  • 1970s — ACNielsen and IRI provide dimensional data marts for retail sales.[3]
  • 1970s — Bill Inmon begins to define and discuss the term: Data Warehouse
  • 1975 — Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary Enterprise Data Warehousing platforms)
  • 1983 — Teradata introduces a database management system specifically designed for decision support.
  • 1983 — Martyn Richard Jones of Sperry Corporation defines the Sperry Information Center approach, which, while not a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales
  • 1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
  • 1988 — Barry Devlin and Paul Murphy publish the article An architecture for a business and information system in IBM Systems Journal where they introduce the term "business data warehouse".
  • 1990 — Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
  • 1991 — Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
  • 1992 — Bill Inmon publishes the book Building the Data Warehouse.[4]
  • 1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
  • 1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.[5]
  • 2000 — Daniel Linstedt releases the Data Vault, enabling real-time auditable data warehouses

Information storage[edit]

Facts[edit]

A fact is a value or measurement, which represents a fact about the managed entity or system.

Facts as reported by the reporting entity are said to be at raw level.

E.g. if a BTS (base transceiver station) receives 1,000 requests for traffic channel allocation, allocates 820 of them, and rejects the remaining, then it would report 3 facts or measurements to a management system:

  • tch_req_total = 1000
  • tch_req_success = 820
  • tch_req_fail = 180

Facts at raw level are further aggregated to higher levels in various dimensions to extract more service or business-relevant information out of it. These are called aggregates or summaries or aggregated facts.

For example, if there are 3 BTSs in a city, then the facts above can be aggregated from BTS to city level along the network dimension (see the sketch after this list):

  • tch_req_success_city = tch_req_success_bts1 + tch_req_success_bts2 + tch_req_success_bts3
  • avg_tch_req_success_city = (tch_req_success_bts1 + tch_req_success_bts2 + tch_req_success_bts3) / 3
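
A minimal sketch of this roll-up in plain Python, with illustrative per-BTS numbers:

  # Raw BTS-level facts for three BTSs in one city (numbers are illustrative).
  tch_req_success = {"bts1": 820, "bts2": 790, "bts3": 805}

  # Aggregated facts at city level along the network dimension.
  tch_req_success_city = sum(tch_req_success.values())
  avg_tch_req_success_city = tch_req_success_city / len(tch_req_success)

  print(tch_req_success_city)        # 2415
  print(avg_tch_req_success_city)    # 805.0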

Dimensional vs. normalized approach for storage of data[edit]

There are two leading approaches to storing data in a data warehouse — the dimensional approach and the normalized approach.

The dimensional approach, whose supporters are referred to as “Kimballites”, follows Ralph Kimball’s view that the data warehouse should be modeled using a Dimensional Model/star schema. The normalized approach, also called the 3NF model, whose supporters are referred to as “Inmonites”, follows Bill Inmon's view that the data warehouse should be modeled using an E-R model/normalized model.

In a dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
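
A minimal sketch, with illustrative keys and values, of how such a transaction splits into a numeric fact row and the dimension rows that give it context:

  # Dimension rows give context; the fact row holds the numeric measures
  # plus foreign keys into the dimensions (all keys and values illustrative).
  dim_date     = {1: {"order_date": "2013-05-01"}}
  dim_customer = {7: {"customer_name": "ACME Corp."}}
  dim_product  = {3: {"product_number": "WID-42"}}

  fact_sales = [
      {"date_key": 1, "customer_key": 7, "product_key": 3,
       "units_ordered": 10, "price_paid": 99.90},
  ]

  # Reconstructing the business view of the transaction from fact + dimensions:
  f = fact_sales[0]
  print(dim_customer[f["customer_key"]]["customer_name"], "ordered",
        f["units_ordered"], "of", dim_product[f["product_key"]]["product_number"],
        "on", dim_date[f["date_key"]]["order_date"])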

A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. Dimensional structures are easy to understand for business users, because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization’s business processes and operational system whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).

The main disadvantages of the dimensional approach are:

  1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and
  2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to:

  1. join data from different sources into meaningful information and then
  2. access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

Both normalized and dimensional models can be represented in entity-relationship diagrams, as both contain joined relational tables. The difference between the two models is the degree of normalization.

These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).

In Information-Driven Business (Wiley 2010),[6] Robert Hillard proposes an approach to comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models) but this extra information comes at the cost of usability. The technique measures information quantity in terms of Information Entropy and usability in terms of the Small Worlds data transformation measure.[7]

Top-down versus bottom-up design methodologies[edit]

Bottom-up design[edit]

Ralph Kimball, a well-known author on data warehousing,[8] is a proponent of an approach to data warehouse design which he describes as bottom-up.[9]

In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. It is important to note that in Kimball methodology, the bottom-up process is the result of an initial business-oriented top-down analysis of the relevant business processes to be modelled.

Data marts contain, primarily, dimensions and facts. Facts can contain atomic data and, if necessary, summarized data. The single data mart often models a specific business area such as "Sales" or "Production." These data marts can eventually be integrated to create a comprehensive data warehouse. The integration of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture".[10] The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.

The integration of the data marts in the data warehouse is centered on the conformed dimensions (residing in "the bus") that define the possible integration "points" between data marts. The actual integration of two or more data marts is then done by a process known as "Drill across". A drill-across works by grouping (summarizing) the data along the keys of the (shared) conformed dimensions of each fact participating in the "drill across" followed by a join on the keys of these grouped (summarized) facts.
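
A minimal sketch of such a drill-across in plain Python, with illustrative data and a single conformed product dimension shared by two fact tables:

  from collections import defaultdict

  fact_sales      = [("widget", 10), ("widget", 5), ("gadget", 2)]   # (product, units sold)
  fact_production = [("widget", 12), ("gadget", 4), ("gadget", 1)]   # (product, units built)

  def summarize(fact_rows):
      # Step 1: group (summarize) each fact along the conformed dimension key.
      totals = defaultdict(int)
      for product, units in fact_rows:
          totals[product] += units
      return totals

  sold, built = summarize(fact_sales), summarize(fact_production)

  # Step 2: join the two summarized results on the shared dimension key.
  drill_across = {p: (sold.get(p, 0), built.get(p, 0)) for p in set(sold) | set(built)}
  print(drill_across)   # e.g. {'widget': (15, 12), 'gadget': (2, 5)}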

Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".

Some consider it an advantage of the Kimball method that the data warehouse ends up being "segmented" into a number of logically self-contained (up to and including the Bus) and consistent data marts, rather than a big and often complex centralized model. Business value can be returned as quickly as the first data marts can be created, and the method lends itself well to an exploratory and iterative approach to building data warehouses. For example, the data warehousing effort might start in the "Sales" department, by building a Sales data mart. Upon completion of the Sales data mart, the business might then decide to expand the warehousing activities into, say, the "Production" department, resulting in a Production data mart. The requirement for the Sales data mart and the Production data mart to be integrable is that they share the same Bus; that is, the data warehousing team has made the effort to identify and implement the conformed dimensions in the Bus, and the individual data marts link to that information from the Bus. Note that this does not require 100% awareness from the outset of the data warehousing effort; no master plan is required upfront. The Sales data mart is good as it is (assuming that the Bus is complete), and the Production data mart can be constructed virtually independently of the Sales data mart (but not independently of the Bus).

If integration via the bus is achieved, the data warehouse, through its two data marts, will not only be able to deliver the specific information that the individual data marts are designed to do, in this example either "Sales" or "Production" information, but can deliver integrated Sales-Production information, which, often, is of critical business value.

Top-down design[edit]

Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise.[10] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision, the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities.

Inmon states that the data warehouse is:

Subject-oriented
The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Non-volatile
Data in the data warehouse are never over-written or deleted — once committed, the data are static, read-only, and retained for future reporting.
Integrated
The data warehouse contains data from most or all of an organization's operational systems and these data are made consistent.
Time-variant
For an operational system, the stored data contain the current value. The data warehouse, however, contains the history of data values.

The top-down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage to the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.[10]

Hybrid design[edit]

Data warehouse (DW) solutions often resemble the hub and spokes architecture. Legacy systems feeding the DW/BI solution often include customer relationship management (CRM) and enterprise resource planning (ERP) solutions, which generate large amounts of data. To consolidate these various data models and facilitate the extract transform load (ETL) process, DW solutions often make use of an operational data store (ODS). The information from the ODS is then parsed into the actual DW. To reduce data redundancy, larger systems will often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW solution.

It is important to note that the DW database in a hybrid solution is kept in third normal form to eliminate data redundancy. A normal relational database, however, is not efficient for business intelligence reports where dimensional modelling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required. The DW effectively provides a single source of information from which the data marts can read, creating a highly flexible solution from a BI point of view. The hybrid architecture allows a DW to be replaced with a master data management solution where operational, not static, information could reside.

The Data Vault modeling components follow the hub-and-spokes architecture. This modeling style is a hybrid design, consisting of the best practices from both third normal form and the star schema. The Data Vault model is not a true third normal form, and breaks some of the rules that 3NF dictates be followed. It is, however, a top-down architecture with a bottom-up design. The Data Vault model is geared to be strictly a data warehouse. It is not geared to be end-user accessible; when built, it still requires the use of a data mart or star-schema-based release area for business purposes.

Data warehouses versus operational systems[edit]

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.
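
As a rough illustration (not taken from the article), the following Python sketch uses the sqlite3 module to show a normalized operational schema in which one business transaction touches several small tables; the schema and names are assumptions made for the example.

    import sqlite3

    # Sketch of an operational (OLTP-style) schema: one order is spread across
    # several normalized tables, and each transaction writes only a few rows.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                             order_date TEXT,
                             FOREIGN KEY (customer_id) REFERENCES customer(customer_id));
    CREATE TABLE order_line (order_id INTEGER, line_no INTEGER, item TEXT, qty INTEGER,
                             PRIMARY KEY (order_id, line_no));
    """)

    with con:  # one business transaction, committed atomically
        con.execute("INSERT INTO customer VALUES (1, 'Acme')")
        con.execute("INSERT INTO orders   VALUES (10, 1, '2013-05-01')")
        con.execute("INSERT INTO order_line VALUES (10, 1, 'Widget', 3)")
    print(con.execute("SELECT COUNT(*) FROM order_line").fetchone())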

Evolution in organization use[edit]

These terms refer to the level of sophistication of a data warehouse:

Offline operational data warehouse
Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data are stored in an integrated, reporting-oriented data store.
Offline data warehouse
Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data are stored in a data structure designed to facilitate reporting.
On-time data warehouse
Online integrated data warehousing represents the real-time stage: data in the warehouse is updated for every transaction performed on the source data.
Integrated data warehouse
These data warehouses assemble data from different areas of business, so users can look up the information they need across other systems.[11]

Sample applications[edit]

Some of the applications of data warehousing include:

  • Agriculture[12]
  • Biological data analysis
  • Call record analysis
  • Churn Prediction for Telecom subscribers, Credit Card users etc.
  • Decision support
  • Financial forecasting
  • Insurance fraud analysis
  • Logistics and Inventory management
  • Trend analysis

See also[edit]

References[edit]

  1. ^ Patil, Preeti S.; Srikantha Rao; Suryakant B. Patil (2011). "Optimization of Data Warehousing System: Simplification in Reporting and Analysis". International Journal of Computer Applications (Foundation of Computer Science) 9 (6): 33–37.
  2. ^ "The Story So Far". 2002-04-15. Retrieved 2008-09-21. 
  3. ^ a b Kimball 2002, pg. 16
  4. ^ Inmon, Bill (1992). Building the Data Warehouse. Wiley. ISBN 0-471-56960-7. 
  5. ^ Kimball, Ralph (1996). The Data Warehouse Toolkit. Wiley. ISBN 0-471-15337-0. 
  6. ^ Hillard, Robert (2010). Information-Driven Business. Wiley. ISBN 978-0-470-62577-4. 
  7. ^ "Information Theory & Business Intelligence Strategy - Small Worlds Data Transformation Measure - MIKE2.0, the open source methodology for Information Development". Mike2.openmethodology.org. Retrieved 2013-06-14. 
  8. ^ Kimball 2002, pg. 310
  9. ^ "The Bottom-Up Misnomer". 2003-09-17. Retrieved 2012-02-14. 
  10. ^ a b c Ericsson 2004, pp. 28–29
  11. ^ "Data Warehouse". 
  12. ^ Abdullah, Ahsan (2009). "Analysis of mealybug incidence on the cotton crop using ADSS-OLAP (Online Analytical Processing) tool, Volume 69, Issue 1". Computers and Electronics in Agriculture 69: 59–72. doi:10.1016/j.compag.2009.07.003. 

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Database_system b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Database_system new file mode 100644 index 00000000..d18f8fb0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Database_system @@ -0,0 +1 @@ + Database - Wikipedia, the free encyclopedia

Database

From Wikipedia, the free encyclopedia
  (Redirected from Database system)

A database is an organized collection of data. The data is typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies).

Database management systems (DBMSs) are specially designed applications that interact with the user, other applications, and the database itself to capture and analyze data. A general-purpose database management system (DBMS) is a software system designed to allow the definition, creation, querying, update, and administration of databases. Well-known DBMSs include MySQL, PostgreSQL, SQLite, Microsoft SQL Server, Microsoft Access, Oracle, SAP, dBASE, FoxPro, IBM DB2 and FileMaker Pro. A database is not generally portable across different DBMSs, but different DBMSs can interoperate by using standards such as SQL and ODBC or JDBC to allow a single application to work with more than one database.

Contents

Terminology and overview[edit]

Formally, the term "database" refers to the data itself and the supporting data structures. Databases are created to handle large quantities of information by inputting, storing, retrieving, and managing that information. Databases are set up so that one set of software programs provides all users with access to all the data. Many databases use a table format made up of rows and columns: each piece of information is entered as a row, which forms a record. Once records are created in the database, they can be organized and operated on in a variety of ways that are limited mainly by the software being used. Databases are somewhat similar to spreadsheets, but databases are more powerful than spreadsheets because of their ability to manipulate the stored data; a number of operations that are easy with a database would be much harder with a spreadsheet. The word data is normally defined as facts from which information can be derived. A database may contain millions of such facts; from these facts the database management system (DBMS) can derive information.

A "database management system" (DBMS) is a suite of computer software providing the interface between users and a database or databases. Because they are so closely related, the term "database" when used casually often refers to both a DBMS and the data it manipulates.

Outside the world of professional information technology, the term database is sometimes used casually to refer to any collection of data (perhaps a spreadsheet, maybe even a card index). This article is concerned only with databases where the size and usage requirements necessitate use of a database management system.[1]

The interactions catered for by most existing DBMSs fall into four main groups (a minimal sketch follows the list):

  • Data definition. Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data.
  • Update. Inserting, modifying, and deleting data.
  • Retrieval. Obtaining information either for end-user queries and reports or for processing by applications.
  • Administration. Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.
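
The sketch referred to above uses Python's built-in sqlite3 module to illustrate the first three groups; administration (user management, monitoring, recovery) is handled by the DBMS itself and is only noted in a comment, and all object names are invented for the example.

    import sqlite3

    con = sqlite3.connect(":memory:")

    # Data definition: create (and later alter or drop) data structures.
    con.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

    # Update: insert, modify, and delete data.
    con.execute("INSERT INTO employee VALUES (1, 'Ada', 52000)")
    con.execute("UPDATE employee SET salary = 55000 WHERE id = 1")

    # Retrieval: obtain information for queries, reports, or applications.
    for row in con.execute("SELECT name, salary FROM employee"):
        print(row)

    # Administration (security, monitoring, recovery) is provided by the DBMS
    # itself; SQLite has no user management, so it is omitted here.
    con.close()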

A DBMS is responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails.

Both a database and its DBMS conform to the principles of a particular database model.[2] "Database system" refers collectively to the database model, database management system, and database.[3]

Physically, database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.[citation needed] Since DBMSs comprise a significant economic market, computer and storage vendors often take into account DBMS requirements in their own development plans.[citation needed]

Databases and DBMSs can be categorized according to the database model(s) that they support (such as relational or XML), the type(s) of computer they run on (from a server cluster to a mobile phone), the query language(s) used to access the database (such as SQL or XQuery), and their internal engineering, which affects performance, scalability, resilience, and security.

Applications and roles[edit]

Most organizations in developed countries today depend on databases for their business operations. Increasingly, databases are not only used to support the internal operations of the organization, but also to underpin its online interactions with customers and suppliers (see Enterprise software). Databases are not used only to hold administrative information, but are often embedded within applications to hold more specialized data: for example engineering data or economic models. Examples of database applications include computerized library systems, flight reservation systems, and computerized parts inventory systems.

Client-server or transactional DBMSs are often complex because they must maintain high performance, availability, and security when many users are querying and updating the database at the same time. Personal, desktop-based database systems tend to be less complex. For example, FileMaker and Microsoft Access come with built-in graphical user interfaces.

General-purpose and special-purpose DBMSs[edit]

A general-purpose DBMS is a complex software system whose development typically requires thousands of person-years of effort.[4] Some general-purpose DBMSs such as Adabas, Oracle and DB2 have been undergoing upgrades since the 1970s. General-purpose DBMSs aim to meet the needs of as many applications as possible, which adds to the complexity. However, because their development cost can be spread over a large number of users, they are often the most cost-effective approach. Still, a general-purpose DBMS is not always the optimal solution: in some cases it may introduce unnecessary overhead. Therefore, there are many examples of systems that use special-purpose databases. A common example is an email system: email systems are designed to optimize the handling of email messages and do not need significant portions of general-purpose DBMS functionality.

Many databases have application software that accesses the database on behalf of end-users, without exposing the DBMS interface directly. Application programmers may use a wire protocol directly, or more likely through an application programming interface. Database designers and database administrators interact with the DBMS through dedicated interfaces to build and maintain the applications' databases, and thus need more knowledge and understanding of how DBMSs operate and of the DBMSs' external interfaces and tuning parameters.

General-purpose databases are usually developed by one organization or community of programmers, while a different group builds the applications that use it. In many companies, specialized database administrators maintain databases, run reports, and may work on code that runs on the databases themselves (rather than in the client application).

History[edit]

With the progress in technology in the areas of processors, computer memory, computer storage, and computer networks, the sizes, capabilities, and performance of databases and their respective DBMSs have grown by orders of magnitude.

The development of database technology can be divided into three eras based on data model or structure: navigational,[5] SQL/relational, and post-relational. The two main early navigational data models were the hierarchical model, epitomized by IBM's IMS system, and the Codasyl model (Network model), implemented in a number of products such as IDMS.

The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content, rather than by following links. The relational model is made up of ledger-style tables, each used for a different type of entity. It was not until the mid-1980s that computing hardware became powerful enough to allow relational systems (DBMSs plus applications) to be widely deployed. By the early 1990s, however, relational systems were dominant for all large-scale data processing applications, and they remain dominant today (2012) except in niche areas. The dominant database language is the standard SQL for the relational model, which has influenced database languages for other data models.[citation needed]

Object databases were invented in the 1980s to overcome the inconvenience of object-relational impedance mismatch, which led to the coining of the term "post-relational" but also development of hybrid object-relational databases.

The next generation of post-relational databases in the 2000s became known as NoSQL databases, introducing fast key-value stores and document-oriented databases. A competing "next generation" known as NewSQL databases attempted new implementations that retained the relational/SQL model while aiming to match the high performance of NoSQL compared to commercially available relational DBMSs.

1960s navigational DBMS[edit]

Basic structure of navigational CODASYL database model.

The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing. The Oxford English dictionary cites[citation needed] a 1962 technical report as the first to use the term "data-base."

As computers grew in speed and capability, a number of general-purpose database systems emerged; by the mid-1960s there were a number of such systems in commercial use. Interest in a standard began to grow, and Charles Bachman, author of one such product, the Integrated Data Store (IDS), founded the "Database Task Group" within CODASYL, the group responsible for the creation and standardization of COBOL. In 1971 they delivered their standard, which generally became known as the "Codasyl approach", and soon a number of commercial products based on this approach were made available.

The Codasyl approach was based on the "manual" navigation of a linked data set which was formed into a large network. Records could be found either by use of a primary key (known as a CALC key, typically implemented by hashing), by navigating relationships (called sets) from one record to another, or by scanning all the records in sequential order. Later systems added B-Trees to provide alternate access paths. Many Codasyl databases also added a query language that was very straightforward. However, in the final tally, CODASYL was very complex and required significant training and effort to produce useful applications.

IBM also had their own DBMS system in 1968, known as IMS. IMS was a development of software written for the Apollo program on the System/360. IMS was generally similar in concept to Codasyl, but used a strict hierarchy for its model of data navigation instead of Codasyl's network model. Both concepts later became known as navigational databases due to the way data was accessed, and Bachman's 1973 Turing Award presentation was The Programmer as Navigator. IMS is classified as a hierarchical database. IDMS and Cincom Systems' TOTAL database are classified as network databases.

1970s relational DBMS[edit]

Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was primarily involved in the development of hard disk systems. He was unhappy with the navigational model of the Codasyl approach, notably the lack of a "search" facility. In 1970, he wrote a number of papers that outlined a new approach to database construction that eventually culminated in the groundbreaking A Relational Model of Data for Large Shared Data Banks.[6]

In this paper, he described a new system for storing and working with large databases. Instead of records being stored in some sort of linked list of free-form records as in Codasyl, Codd's idea was to use a "table" of fixed-length records, with each table used for a different type of entity. A linked-list system would be very inefficient when storing "sparse" databases where some of the data for any one record could be left empty. The relational model solved this by splitting the data into a series of normalized tables (or relations), with optional elements being moved out of the main table to where they would take up room only if needed. Data may be freely inserted, deleted and edited in these tables, with the DBMS doing whatever maintenance is needed to present a table view to the application/user.

In the relational model, related records are linked together with a "key"

The relational model also allowed the content of the database to evolve without constant rewriting of links and pointers. The relational part comes from entities referencing other entities in what is known as a one-to-many relationship, like a traditional hierarchical model, and a many-to-many relationship, like a navigational (network) model. Thus, a relational model can express both hierarchical and navigational models, as well as its native tabular model, allowing for pure or combined modeling in terms of these three models, as the application requires.

For instance, a common use of a database system is to track information about users, their name, login information, various addresses and phone numbers. In the navigational approach all of these data would be placed in a single record, and unused items would simply not be placed in the database. In the relational approach, the data would be normalized into a user table, an address table and a phone number table (for instance). Records would be created in these optional tables only if the address or phone numbers were actually provided.

Linking the information back together is the key to this system. In the relational model, some bit of information was used as a "key", uniquely defining a particular record. When information was being collected about a user, information stored in the optional tables would be found by searching for this key. For instance, if the login name of a user is unique, addresses and phone numbers for that user would be recorded with the login name as its key. This simple "re-linking" of related data back into a single collection is something that traditional computer languages are not designed for.

Just as the navigational approach would require programs to loop in order to collect records, the relational approach would require loops to collect information about any one record. Codd's solution to the necessary looping was a set-oriented language, a suggestion that would later spawn the ubiquitous SQL. Using a branch of mathematics known as tuple calculus, he demonstrated that such a system could support all the operations of normal databases (inserting, updating etc.) as well as providing a simple system for finding and returning sets of data in a single operation.
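
A toy sketch of this re-linking, using Python's sqlite3 module: the login name serves as the key, and a single set-oriented SQL statement joins the optional tables back together. The table and column names are assumptions made for the example.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE user_account (login TEXT PRIMARY KEY, full_name TEXT);
    CREATE TABLE address      (login TEXT, street TEXT, city TEXT);
    CREATE TABLE phone        (login TEXT, number TEXT);
    """)
    con.execute("INSERT INTO user_account VALUES ('alice', 'Alice Example')")
    con.execute("INSERT INTO address VALUES ('alice', '1 Main St', 'Springfield')")
    con.execute("INSERT INTO phone   VALUES ('alice', '555-0100')")
    # Rows in the optional tables exist only when the data was actually provided.

    # One set-oriented statement re-links the data instead of looping by hand.
    rows = con.execute("""
        SELECT u.full_name, a.city, p.number
        FROM user_account u
        LEFT JOIN address a ON a.login = u.login
        LEFT JOIN phone   p ON p.login = u.login
        WHERE u.login = 'alice'
    """).fetchall()
    print(rows)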

Codd's paper was picked up by two people at Berkeley, Eugene Wong and Michael Stonebraker. They started a project known as INGRES using funding that had already been allocated for a geographical database project and student programmers to produce code. Beginning in 1973, INGRES delivered its first test products which were generally ready for widespread use in 1979. INGRES was similar to System R in a number of ways, including the use of a "language" for data access, known as QUEL. Over time, INGRES moved to the emerging SQL standard.

IBM itself did one test implementation of the relational model, PRTV, and a production one, Business System 12, both now discontinued. Honeywell wrote MRDS for Multics, and now there are two new implementations: Alphora Dataphor and Rel. Most other DBMS implementations usually called relational are actually SQL DBMSs.

In 1970, the University of Michigan began development of the MICRO Information Management System[7] based on D.L. Childs' Set-Theoretic Data model.[8][9][10] Micro was used to manage very large data sets by the US Department of Labor, the U.S. Environmental Protection Agency, and researchers from the University of Alberta, the University of Michigan, and Wayne State University. It ran on IBM mainframe computers using the Michigan Terminal System.[11] The system remained in production until 1998.

Database machines and appliances[edit]

In the 1970s and 1980s attempts were made to build database systems with integrated hardware and software. The underlying philosophy was that such integration would provide higher performance at lower cost. Examples were IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine.

Another approach to hardware support for database management was ICL's CAFS accelerator, a hardware disk controller with programmable search capabilities. In the long term, these efforts were generally unsuccessful because specialized database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus most database systems nowadays are software systems running on general-purpose hardware, using general-purpose computer data storage. However this idea is still pursued for certain applications by some companies like Netezza and Oracle (Exadata).

Late-1970s SQL DBMS[edit]

IBM started working on a prototype system loosely based on Codd's concepts as System R in the early 1970s. The first version was ready in 1974/5, and work then started on multi-table systems in which the data could be split so that all of the data for a record (some of which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user versions were tested by customers in 1978 and 1979, by which time a standardized query language – SQL[citation needed] – had been added. Codd's ideas were establishing themselves as both workable and superior to Codasyl, pushing IBM to develop a true production version of System R, known as SQL/DS, and, later, Database 2 (DB2).

Larry Ellison's Oracle started from a different chain, based on IBM's papers on System R, and beat IBM to market when the first version was released in 1978.[citation needed]

Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres, which is now known as PostgreSQL. PostgreSQL is often used for global mission critical applications (the .org and .info domain name registries use it as their primary data store, as do many large companies and financial institutions).

In Sweden, Codd's paper was also read, and Mimer SQL was developed from the mid-1970s at Uppsala University. In 1984, this project was consolidated into an independent enterprise. In the early 1980s, Mimer introduced transaction handling for high robustness in applications, an idea that was subsequently implemented in most other DBMSs.

Another data model, the entity-relationship model, emerged in 1976 and gained popularity for database design as it emphasized a more familiar description than the earlier relational model. Later on, entity-relationship constructs were retrofitted as a data modeling construct for the relational model, and the difference between the two has become irrelevant.[citation needed]

1980s desktop databases[edit]

The 1980s ushered in the age of desktop computing. The new computers empowered their users with spreadsheets like Lotus 1-2-3 and database software like dBASE. The dBASE product was lightweight and easy for any computer user to understand out of the box. C. Wayne Ratliff, the creator of dBASE, stated: "dBASE was different from programs like BASIC, C, FORTRAN, and COBOL in that a lot of the dirty work had already been done. The data manipulation is done by dBASE instead of by the user, so the user can concentrate on what he is doing, rather than having to mess with the dirty details of opening, reading, and closing files, and managing space allocation."[12] dBASE was one of the top-selling software titles in the 1980s and early 1990s.

1980s object-oriented databases[edit]

The 1980s, along with a rise in object-oriented programming, saw a change in how data in various databases were handled. Programmers and designers began to treat the data in their databases as objects. That is to say that if a person's data were in a database, that person's attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. This allows relations between data to be relations to objects and their attributes and not to individual fields.[13] The term "object-relational impedance mismatch" described the inconvenience of translating between programmed objects and database tables. Object databases and object-relational databases attempt to solve this problem by providing an object-oriented language (sometimes as extensions to SQL) that programmers can use as an alternative to purely relational SQL. On the programming side, libraries known as object-relational mappings (ORMs) attempt to solve the same problem.

2000s NoSQL and NewSQL databases[edit]

The next generation of post-relational databases in the 2000s became known as NoSQL databases, including fast key-value stores and document-oriented databases. XML databases are a type of structured document-oriented database that allows querying based on XML document attributes.

NoSQL databases are often very fast, do not require fixed table schemas, avoid join operations by storing denormalized data, and are designed to scale horizontally.

In recent years there has been a strong demand for massively distributed databases with high partition tolerance, but according to the CAP theorem it is impossible for a distributed system to simultaneously provide consistency, availability, and partition-tolerance guarantees. A distributed system can satisfy any two of these guarantees at the same time, but not all three. For that reason, many NoSQL databases use what is called eventual consistency to provide both availability and partition tolerance, with the maximum level of data consistency achievable under those constraints.
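
The following toy Python sketch illustrates the idea of eventual consistency with a last-write-wins merge of two replicas; it is a didactic simplification and not the conflict-resolution strategy of any particular NoSQL product.

    # Two replicas of a key-value store accept writes independently (e.g. during
    # a network partition) and later converge by keeping the newest timestamp.
    def merge(replica_a, replica_b):
        merged = {}
        for key in set(replica_a) | set(replica_b):
            candidates = [r[key] for r in (replica_a, replica_b) if key in r]
            merged[key] = max(candidates, key=lambda entry: entry[0])  # (ts, value)
        return merged

    a = {"user:1": (10, "alice@old.example")}
    b = {"user:1": (12, "alice@new.example"), "user:2": (11, "bob@example")}

    converged = merge(a, b)
    print(converged)  # both replicas adopt this state once the partition heals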

The most popular NoSQL systems include MongoDB, memcached, Redis, CouchDB, Hazelcast, Apache Cassandra, and HBase,[14] all of which are open-source software products.

A number of new relational databases that continue to use SQL but aim for performance comparable to NoSQL are known as NewSQL.

Database research[edit]

Database technology has been an active research topic since the 1960s, both in academia and in the research and development groups of companies (for example IBM Research). Research activity includes theory and development of prototypes. Notable research topics have included models, the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more.

The database research area has several dedicated academic journals (for example, ACM Transactions on Database Systems-TODS, Data and Knowledge Engineering-DKE) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE).

Database type examples[edit]

One way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance. A third way is by some technical aspect, such as the database structure or interface type. This section lists a few of the adjectives used to characterize different kinds of databases.

  • An active database includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering and authorization. Many databases provide active database features in the form of database triggers.
  • A cloud database relies on cloud technology. Both the database and most of its DBMS reside remotely, "in the cloud", while its applications are developed by programmers and later maintained and used by the application's end-users through a web browser and open APIs.
  • Data warehouses archive data from operational databases and often from external sources such as market research firms. The warehouse becomes the central source of data for use by managers and other end-users who may not have access to operational data. For example, sales data might be aggregated to weekly totals and converted from internal product codes to use UPCs so that they can be compared with ACNielsen data. Some basic and essential components of data warehousing include retrieving, analyzing, and mining data, transforming, loading and managing data so as to make them available for further use.
  • A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are one of the main categories of NoSQL databases.
  • An embedded database system is a DBMS which is tightly integrated with an application software that requires access to stored data in such a way that the DBMS is hidden from the application’s end-users and requires little or no ongoing maintenance.[15]
  • End-user databases consist of data developed by individual end-users. Examples of these are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support such databases. Some of them are much simpler than full fledged DBMSs, with more elementary DBMS functionality.
  • A federated database system comprises several distinct databases, each with its own DBMS. It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (in which case it would also be a heterogeneous database system), and provides them with an integrated conceptual view.
  • Sometimes the term multi-database is used as a synonym to federated database, though it may refer to a less integrated (e.g., without an FDBMS and a managed integrated schema) group of databases that cooperate in a single application. In this case typically middleware is used for distribution, which typically includes an atomic commit protocol (ACP), e.g., the two-phase commit protocol, to allow distributed (global) transactions across the participating databases.
  • In a hypertext or hypermedia database, any word or a piece of text representing an object, e.g., another piece of text, an article, a picture, or a film, can be hyperlinked to that object. Hypertext databases are particularly useful for organizing large amounts of disparate information. For example, they are useful for organizing online encyclopedias, where users can conveniently jump around the text. The World Wide Web is thus a large distributed hypertext database.
  • An in-memory database is a database that primarily resides in main memory, but is typically backed-up by non-volatile computer data storage. Main memory databases are faster than disk databases, and so are often used where response time is critical, such as in telecommunications network equipment.[16]
  • Operational databases store detailed data about the operations of an organization. They typically process relatively high volumes of updates using transactions. Examples include customer databases that record contact, credit, and demographic information about a business' customers, personnel databases that hold information such as salary, benefits, skills data about employees, enterprise resource planning systems that record details about product components, parts inventory, and financial databases that keep track of the organization's money, accounting and financial dealings.
  • A parallel database seeks to improve performance through parallelization (for example, when loading data, building indexes, or evaluating queries). The major parallel DBMS architectures, which are induced by the underlying hardware architecture, are:
  • Shared memory architecture, where multiple processors share the main memory space, as well as other data storage.
  • Shared disk architecture, where each processing unit (typically consisting of multiple processors) has its own main memory, but all units share the other storage.
  • Shared nothing architecture, where each processing unit has its own main memory and other storage.
  • Real-time databases process transactions fast enough for the result to come back and be acted on right away.
  • A spatial database can store data with multidimensional features. The queries on such data include location-based queries, like "Where is the closest hotel in my area?".
  • A temporal database has built-in time aspects, for example a temporal data model and a temporal version of SQL. More specifically the temporal aspects usually include valid-time and transaction-time.
  • An unstructured data database is intended to store in a manageable and protected way diverse objects that do not fit naturally and conveniently in common databases. It may include email messages, documents, journals, multimedia objects, etc. The name may be misleading since some objects can be highly structured. However, the entire possible object collection does not fit into a predefined structured framework. Most established DBMSs now support unstructured data in various ways, and new dedicated DBMSs are emerging.

Database design and modeling[edit]

The first task of a database designer is to produce a conceptual data model that reflects the structure of the information to be held in the database. A common approach to this is to develop an entity-relationship model, often with the aid of drawing tools. Another popular approach is the Unified Modeling Language. A successful data model will accurately reflect the possible state of the external world being modeled: for example, if people can have more than one phone number, it will allow this information to be captured. Designing a good conceptual data model requires a good understanding of the application domain; it typically involves asking deep questions about the things of interest to an organisation, like "can a customer also be a supplier?", or "if a product is sold with two different forms of packaging, are those the same product or different products?", or "if a plane flies from New York to Dubai via Frankfurt, is that one flight or two (or maybe even three)?". The answers to these questions establish definitions of the terminology used for entities (customers, products, flights, flight segments) and their relationships and attributes.

Producing the conceptual data model sometimes involves input from business processes, or the analysis of workflow in the organization. This can help to establish what information is needed in the database, and what can be left out. For example, it can help when deciding whether the database needs to hold historic data as well as current data.

Having produced a conceptual data model that users are happy with, the next stage is to translate this into a schema that implements the relevant data structures within the database. This process is often called logical database design, and the output is a logical data model expressed in the form of a schema. Whereas the conceptual data model is (in theory at least) independent of the choice of database technology, the logical data model will be expressed in terms of a particular database model supported by the chosen DBMS. (The terms data model and database model are often used interchangeably, but in this article we use data model for the design of a specific database, and database model for the modelling notation used to express that design.)

The most popular database model for general-purpose databases is the relational model, or more precisely, the relational model as represented by the SQL language. The process of creating a logical database design using this model uses a methodical approach known as normalization. The goal of normalization is to ensure that each elementary "fact" is only recorded in one place, so that insertions, updates, and deletions automatically maintain consistency.

The final stage of database design is to make the decisions that affect performance, scalability, recovery, security, and the like. This is often called physical database design. A key goal during this stage is data independence, meaning that the decisions made for performance optimization purposes should be invisible to end-users and applications. Physical design is driven mainly by performance requirements, and requires a good knowledge of the expected workload and access patterns, and a deep understanding of the features offered by the chosen DBMS.

Another aspect of physical database design is security. It involves both defining access control to database objects as well as defining security levels and methods for the data itself.

Database models[edit]

Collage of five types of database models.

A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model (or the SQL approximation of relational), which uses a table-based format.

Common logical data models for databases include:

An object-relational database combines the object-oriented and relational structures.

Physical data models include:

Other models include:

External, conceptual, and internal views[edit]

Traditional view of data[19]

A database management system provides three views of the database data:

  • The external level defines how each group of end-users sees the organization of data in the database. A single database can have any number of views at the external level.
  • The conceptual level unifies the various external views into a coherent global view.[20] It provides the synthesis of all the external views. It is outside the scope of the various database end-users, and is rather of interest to database application developers and database administrators.
  • The internal level (or physical level) is the internal organization of data inside a DBMS (see Implementation section below). It is concerned with cost, performance, scalability and other operational matters. It deals with storage layout of the data, using storage structures such as indexes to enhance performance. Occasionally it stores data of individual views (materialized views), computed from generic data, if performance justification exists for such redundancy. It balances all the external views' performance requirements, possibly conflicting, in an attempt to optimize overall performance across all activities.

While there is typically only one conceptual (or logical) and one physical (or internal) view of the data, there can be any number of different external views. This allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. For example, the financial department of a company needs the payment details of all employees as part of the company's expenses, but does not need the details about employees that are of interest to the human resources department. Thus different departments need different views of the company's database.
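
A minimal sketch of such an external view, using Python's sqlite3 module: a SQL view exposes only the payment-related columns to the finance department while hiding the rest of the employee record. The schema and names are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT,
                           salary REAL, medical_notes TEXT);
    INSERT INTO employee VALUES (1, 'Ada', 52000, 'confidential');

    -- External view for the finance department: payment details only.
    CREATE VIEW payroll_view AS
    SELECT id, name, salary FROM employee;
    """)
    print(con.execute("SELECT * FROM payroll_view").fetchall())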

The three-level database architecture relates to the concept of data independence which was one of the major initial driving forces of the relational model. The idea is that changes made at a certain level do not affect the view at a higher level. For example, changes in the internal level do not affect application programs written using conceptual level interfaces, which reduces the impact of making physical changes to improve performance.

The conceptual view provides a level of indirection between internal and external. On the one hand it provides a common view of the database, independent of different external view structures, and on the other hand it abstracts away details of how the data are stored or managed (internal level). In principle every level, and even every external view, can be presented by a different data model. In practice usually a given DBMS uses the same data model for both the external and the conceptual levels (e.g., relational model). The internal level, which is hidden inside the DBMS and depends on its implementation (see Implementation section below), requires a different level of detail and uses its own types of data structures.

Separating the external, conceptual and internal levels was a major feature of the relational database model implementations that dominate 21st century databases.[20]

Database languages[edit]

Database languages are special-purpose languages, which do one or more of the following:

Database languages are specific to a particular data model. Notable examples include:

A database language may also incorporate features like:

  • DBMS-specific configuration and storage engine management
  • Computations to modify query results, like counting, summing, averaging, sorting, grouping, and cross-referencing
  • Constraint enforcement (e.g. in an automotive database, only allowing one engine type per car)
  • Application programming interface version of the query language, for programmer convenience

Performance, security, and availability[edit]

Because of the critical importance of database technology to the smooth running of an enterprise, database systems include complex mechanisms to deliver the required performance, security, and availability, and allow database administrators to control the use of these features.

Database storage[edit]

Database storage is the container of the physical materialization of a database. It comprises the internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data", and internal data structures) to reconstruct the conceptual level and external level from the internal level when needed. Putting data into permanent storage is generally the responsibility of the database engine, also known as the "storage engine". Though typically accessed by a DBMS through the underlying operating system (often using the operating system's file system as an intermediate for storage layout), storage properties and configuration settings are extremely important for the efficient operation of the DBMS, and thus are closely maintained by database administrators. A DBMS, while in operation, always has its database residing in several types of storage (e.g., memory and external storage). The database data and the additional needed information, possibly in very large amounts, are coded into bits. Data typically reside in the storage in structures that look completely different from the way the data look at the conceptual and external levels, but in ways that attempt to optimize the reconstruction of these levels when needed by users and programs, as well as the computation of additional needed information from the data (e.g., when querying the database).

Some DBMSs support specifying which character encoding was used to store data, so multiple encodings can be used in the same database.

Various low-level database storage structures are used by the storage engine to serialize the data model so it can be written to the medium of choice. Techniques such as indexing may be used to improve performance. Conventional storage is row-oriented, but there are also column-oriented and correlation databases.
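
As a rough, engine-independent illustration, the short Python sketch below lays the same three records out row-wise and column-wise; it only demonstrates the idea of the two layouts, not any real storage engine.

    # The same logical table laid out in row order and in column order.
    records = [("Ada", 36, "EU"), ("Bob", 29, "US"), ("Eve", 41, "EU")]
    columns = ("name", "age", "region")

    row_store = list(records)                          # one entry per record
    column_store = {col: [rec[i] for rec in records]   # one entry per column
                    for i, col in enumerate(columns)}

    # A scan over a single attribute touches one list in the column layout...
    print(sum(column_store["age"]) / len(records))
    # ...whereas the row layout keeps each whole record together for OLTP access.
    print(row_store[0])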

Database materialized views[edit]

Often storage redundancy is employed to increase performance. A common example is storing materialized views, which consist of frequently needed external views or query results. Storing such views saves the expensive computing of them each time they are needed. The downsides of materialized views are the overhead incurred when updating them to keep them synchronized with their original updated database data, and the cost of storage redundancy.
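
SQLite, used here through Python purely for illustration, has no built-in materialized views, so the sketch below emulates one by storing a query result in a table and refreshing it on demand; the table names are assumptions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE sale (region TEXT, amount REAL);
    INSERT INTO sale VALUES ('EU', 100), ('EU', 50), ('US', 200);
    """)

    def refresh_summary(con):
        # Re-materialize the frequently needed aggregate (the "view").
        con.executescript("""
        DROP TABLE IF EXISTS sales_by_region;
        CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sale GROUP BY region;
        """)

    refresh_summary(con)                       # pay the computation cost once
    print(con.execute("SELECT * FROM sales_by_region").fetchall())

    con.execute("INSERT INTO sale VALUES ('US', 25)")
    refresh_summary(con)                       # keep the stored copy synchronized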

Database and database object replication[edit]

Occasionally a database employs storage redundancy by replicating database objects (with one or more copies) to increase data availability (both to improve performance of simultaneous multiple end-user accesses to the same database object, and to provide resiliency in the case of partial failure of a distributed database). Updates of a replicated object need to be synchronized across the object copies. In many cases the entire database is replicated.

Database security[edit]

Database security deals with all aspects of protecting the database content, its owners, and its users. It ranges from protection against intentional unauthorized database uses to unintentional database accesses by unauthorized entities (e.g., a person or a computer program).

Database access control deals with controlling who (a person or a certain computer program) is allowed to access what information in the database. The information may comprise specific database objects (e.g., record types, specific records, data structures), certain computations over certain objects (e.g., query types, or specific queries), or the use of specific access paths to the former (e.g., using specific indexes or other data structures to access information). Database access controls are set by specially authorized personnel (designated by the database owner), using dedicated, protected DBMS security interfaces.

This may be managed directly on an individual basis, or by the assignment of individuals and privileges to groups, or (in the most elaborate models) through the assignment of individuals and groups to roles which are then granted entitlements. Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or subsets of it called "subschemas". For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access to only work history and medical data. If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases.
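
A toy Python sketch of this kind of role-based access control, modeled at the application level because SQLite has no user management; all users, groups, roles, and subschemas are invented for the example.

    # Users belong to groups, groups are mapped to roles, and roles carry
    # entitlements on subschemas such as "payroll" or "medical".
    GROUP_ROLES = {"hr": {"payroll_reader"}, "medics": {"medical_reader"}}
    ROLE_GRANTS = {"payroll_reader": {"payroll"}, "medical_reader": {"medical"}}
    USER_GROUPS = {"carol": {"hr"}, "dave": {"medics"}}

    def can_read(user, subschema):
        groups = USER_GROUPS.get(user, set())
        roles = set().union(*(GROUP_ROLES[g] for g in groups)) if groups else set()
        grants = set().union(*(ROLE_GRANTS[r] for r in roles)) if roles else set()
        return subschema in grants

    print(can_read("carol", "payroll"))  # True: hr -> payroll_reader -> payroll
    print(can_read("carol", "medical"))  # False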

Data security in general deals with protecting specific chunks of data, both physically (i.e., from corruption, destruction, or removal; e.g., see physical security) and in terms of their interpretation, or that of parts of them, as meaningful information (e.g., inferring valid credit-card numbers from the strings of bits that they comprise; e.g., see data encryption).

Change and access logging records who accessed which attributes, what was changed, and when it was changed. Logging services allow for a forensic database audit later by keeping a record of access occurrences and changes. Sometimes application-level code is used to record changes rather than leaving this to the database. Monitoring can be set up to attempt to detect security breaches.

Transactions and concurrency[edit]

Database transactions can be used to introduce some level of fault tolerance and data integrity after recovery from a crash. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring a lock, etc.), an abstraction supported in databases and also in other systems. Each transaction has well-defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands).

The acronym ACID describes some ideal properties of a database transaction: Atomicity, Consistency, Isolation, and Durability.
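
A minimal sketch of atomicity and consistency with Python's sqlite3 module: the connection's context manager commits the transaction on success and rolls it back if any statement raises an error. The account-transfer scenario is invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
    con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    con.commit()

    def transfer(con, src, dst, amount):
        # Both updates succeed together or not at all (atomicity); the balance
        # check keeps the data consistent.
        with con:
            con.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            con.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            (balance,) = con.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers rollback

    transfer(con, 1, 2, 30.0)
    try:
        transfer(con, 1, 2, 1000.0)   # fails and is rolled back
    except ValueError:
        pass
    print(con.execute("SELECT * FROM account ORDER BY id").fetchall())  # [(1, 70.0), (2, 80.0)]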

Migration[edit]

See also the section Database migration in the article Data migration

A database built with one DBMS is not portable to another DBMS (i.e., the other DBMS cannot run it). However, in some situations it is desirable to migrate a database from one DBMS to another. The reasons are primarily economical (different DBMSs may have different total costs of ownership, or TCOs), functional, and operational (different DBMSs may have different capabilities). The migration involves the database's transformation from one DBMS type to another. The transformation should, if possible, leave the database-related applications (i.e., all related application programs) intact. Thus, the database's conceptual and external architectural levels should be maintained in the transformation. It may also be desirable that some aspects of the internal architectural level are maintained. A complex or large database migration may be a complicated and costly (one-time) project by itself, which should be factored into the decision to migrate, even though tools may exist to help migration between specific DBMSs. Typically a DBMS vendor provides tools to help importing databases from other popular DBMSs.

Database building, maintaining, and tuning[edit]

After a database has been designed for an application, the next stage is building the database. Typically, an appropriate general-purpose DBMS can be selected for this purpose. A DBMS provides the user interfaces needed by database administrators to define the application's data structures within the DBMS's respective data model. Other user interfaces are used to select needed DBMS parameters (such as security-related and storage allocation parameters).

When the database is ready (all its data structures and other needed components are defined), it is typically populated with the application's initial data (database initialization, which is typically a distinct project, in many cases using specialized DBMS interfaces that support bulk insertion) before it is made operational. In some cases the database becomes operational while empty of application data, and data are accumulated during its operation.

After the database has been built and made operational, the database maintenance stage begins: various database parameters may need changes and tuning for better performance, the application's data structures may be changed or added to, new related application programs may be written to add to the application's functionality, and so on.

Databases are often confused with spreadsheets such as Microsoft Excel, which differ from database software such as Microsoft Access. Both can be used to store information, but a database serves this purpose better. A comparison of spreadsheets and databases:

  • Spreadsheet strengths: very simple data storage; relatively easy to use; require less planning.
  • Spreadsheet weaknesses: data integrity problems, including inaccurate, inconsistent, and out-of-date data; formulas can be incorrect.
  • Database strengths: methods for keeping data up to date and consistent; data quality is higher than data stored in spreadsheets; good for storing and organizing information.
  • Database weaknesses: require more planning and design effort.

Backup and restore[edit]

Sometimes it is desired to bring a database back to a previous state (for many reasons, e.g., when the database is found to be corrupted due to a software error, or when it has been updated with erroneous data). To achieve this, a backup operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their embedding in the database's data structures) is kept within dedicated backup files (many techniques exist to do this effectively). When a database administrator decides to bring the database back to a saved state (e.g., by specifying a desired point in time when the database was in that state), these files are used to restore it.
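
A minimal sketch using Python's sqlite3 module, whose Connection.backup() method copies a live database into another database; restoring simply copies the backup back. File names are illustrative.

    import sqlite3

    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE t (x INTEGER)")
    live.execute("INSERT INTO t VALUES (1)")
    live.commit()

    # Back up the current state into a separate database file.
    bak = sqlite3.connect("backup_copy.db")
    live.backup(bak)

    # ...later, after erroneous updates, restore that saved state.
    live.execute("DELETE FROM t")             # simulate unwanted changes
    live.commit()
    bak.backup(live)                          # bring the database back
    print(live.execute("SELECT * FROM t").fetchall())   # [(1,)]
    bak.close()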

Other[edit]

Other DBMS features might include:

  • Database logs
  • Graphics component for producing graphs and charts, especially in a data warehouse system
  • Query optimizer - Performs query optimization on every query to choose for it the most efficient query plan (a partial order (tree) of operations) to be executed to compute the query result. May be specific to a particular storage engine.
  • Tools or hooks for database design, application programming, application program maintenance, database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a DBMS and related database may span computers, networks, and storage units) and related database mapping (especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, etc.

See also[edit]

References[edit]

  1. ^ Jeffrey Ullman 1997: First course in database systems, Prentice-Hall Inc., Simon & Schuster, Page 1, ISBN 0-13-861337-0.
  2. ^ Tsitchizris, D. C. and F. H. Lochovsky (1982). Data Models. Englewood-Cliffs, Prentice-Hall.
  3. ^ Beynon-Davies P. (2004). Database Systems 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2
  4. ^ Raul F. Chong, Michael Dang, Dwaine R. Snow, Xiaomei Wang (3 July 2008). "Introduction to DB2". Retrieved 17 March 2013. . This article quotes a development time of 5 years involving 750 people for DB2 release 9 alone
  5. ^ C. W. Bachmann (November 1973), "The Programmer as Navigator", CACM  (Turing Award Lecture 1973)
  6. ^ Codd, E.F. (1970)."A Relational Model of Data for Large Shared Data Banks". In: Communications of the ACM 13 (6): 377–387.
  7. ^ William Hershey and Carol Easthope, "A set theoretic data structure and retrieval language", Spring Joint Computer Conference, May 1972 in ACM SIGIR Forum, Volume 7, Issue 4 (December 1972), pp. 45-55, DOI=10.1145/1095495.1095500
  8. ^ Ken North, "Sets, Data Models and Data Independence", Dr. Dobb's, 10 March 2010
  9. ^ Description of a set-theoretic data structure, D. L. Childs, 1968, Technical Report 3 of the CONCOMP (Research in Conversational Use of Computers) Project, University of Michigan, Ann Arbor, Michigan, USA
  10. ^ Feasibility of a Set-Theoretic Data Structure : A General Structure Based on a Reconstituted Definition of Relation, D. L. Childs, 1968, Technical Report 6 of the CONCOMP (Research in Conversational Use of Computers) Project, University of Michigan, Ann Arbor, Michigan, USA
  11. ^ MICRO Information Management System (Version 5.0) Reference Manual, M.A. Kahn, D.L. Rumelhart, and B.L. Bronson, October 1977, Institute of Labor and Industrial Relations (ILIR), University of Michigan and Wayne State University
  12. ^ http://www.foxprohistory.org/interview_wayne_ratliff.htm
  13. ^ Development of an object-oriented DBMS; Portland, Oregon, United States; Pages: 472 – 482; 1986; ISBN 0-89791-204-7
  14. ^ "DB-Engines Ranking". January 2013. Retrieved 22 January 2013. 
  15. ^ Graves, Steve. "COTS Databases For Embedded Systems", Embedded Computing Design magazine, January, 2007. Retrieved on August 13, 2008.
  16. ^ "TeleCommunication Systems Signs up as a Reseller of TimesTen; Mobile Operators and Carriers Gain Real-Time Platform for Location-Based Services". Business Wire. 2002-06-24. 
  17. ^ Argumentation in Artificial Intelligence by Iyad Rahwan, Guillermo R. Simari
  18. ^ "OWL DL Semantics". Retrieved 10 December 2010. 
  19. ^ itl.nist.gov (1993) Integration Definition for Information Modeling (IDEFIX). 21 December 1993.
  20. ^ a b Date 1990, pp. 31–32
  21. ^ Chapple, Mike. "SQL Fundamentals". Databases. About.com. Retrieved 2009-01-28. 
  22. ^ "Structured Query Language (SQL)". International Business Machines. October 27, 2006. Retrieved 2007-06-10. 
  23. ^ Wagner, Michael (2010), "1. Auflage", SQL/XML:2006 - Evaluierung der Standardkonformität ausgewählter Datenbanksysteme, Diplomica Verlag, ISBN 3-8366-9609-6 

Further reading[edit]

  • Ling Liu and Tamer M. Özsu (Eds.) (2009). Encyclopedia of Database Systems, 4100 p., 60 illus. ISBN 978-0-387-49616-0.
  • Beynon-Davies, P. (2004). Database Systems. 3rd Edition. Palgrave, Houndmills, Basingstoke.
  • Connolly, Thomas and Carolyn Begg. Database Systems. New York: Harlow, 2002.
  • Date, C. J. (2003). An Introduction to Database Systems, Fifth Edition. Addison Wesley. ISBN 0-201-51381-1. 
  • Gray, J. and Reuter, A. Transaction Processing: Concepts and Techniques, 1st edition, Morgan Kaufmann Publishers, 1992.
  • Kroenke, David M. and David J. Auer. Database Concepts. 3rd ed. New York: Prentice, 2007.
  • Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems
  • Abraham Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts
  • Lightstone, S.; Teorey, T.; Nadeau, T. (2007). Physical Database Design: the database professional's guide to exploiting indexes, views, storage, and more. Morgan Kaufmann Press. ISBN 0-12-369389-6. 
  • Teorey, T.; Lightstone, S. and Nadeau, T. Database Modeling & Design: Logical Design, 4th edition, Morgan Kaufmann Press, 2005. ISBN 0-12-685352-5

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_support_system b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_support_system new file mode 100644 index 00000000..1527bd92 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_support_system @@ -0,0 +1 @@ + Decision support system - Wikipedia, the free encyclopedia

Decision support system

From Wikipedia, the free encyclopedia
Example of a Decision Support System for John Day Reservoir.

A decision support system (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization and help to make decisions, which may be rapidly changing and not easily specified in advance. Decision support systems can be either fully computerized, human or a combination of both.

DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, and personal knowledge, or business models to identify and solve problems and make decisions.

Typical information that a decision support application might gather and present includes:

  • inventories of information assets (including legacy and relational data sources, cubes, data warehouses, and data marts),
  • comparative sales figures between one period and the next,
  • projected revenue figures based on product sales assumptions.

Contents

History[edit]

According to Keen (1978),[1] the concept of decision support has evolved from two main areas of research: the theoretical studies of organizational decision making done at the Carnegie Institute of Technology during the late 1950s and early 1960s, and the technical work on interactive computer systems, carried out mainly at the Massachusetts Institute of Technology, in the 1960s. The concept of DSS is considered to have become an area of research in its own right in the middle of the 1970s, before gaining in intensity during the 1980s. In the middle and late 1980s, executive information systems (EIS), group decision support systems (GDSS), and organizational decision support systems (ODSS) evolved from the single-user and model-oriented DSS.

According to Sol (1987),[2] the definition and scope of DSS have been migrating over the years. In the 1970s DSS was described as "a computer-based system to aid decision making". In the late 1970s the DSS movement started focusing on "interactive computer-based systems which help decision-makers utilize data bases and models to solve ill-structured problems". In the 1980s DSS was expected to provide systems "using suitable and available technology to improve effectiveness of managerial and professional activities", and by the end of the 1980s DSS faced a new challenge towards the design of intelligent workstations.[2]

In 1987, Texas Instruments completed development of the Gate Assignment Display System (GADS) for United Airlines. This decision support system is credited with significantly reducing travel delays by aiding the management of ground operations at various airports, beginning with O'Hare International Airport in Chicago and Stapleton Airport in Denver, Colorado.[3][4]

Beginning in about 1990, data warehousing and on-line analytical processing (OLAP) began broadening the realm of DSS. As the turn of the millennium approached, new Web-based analytical applications were introduced.

The advent of more capable reporting technologies has seen DSS begin to emerge as a critical component of management design. Examples of this can be seen in the intense amount of discussion of DSS in the education environment.

DSS also have a weak connection to the user interface paradigm of hypertext. Both the University of Vermont PROMIS system (for medical decision making) and the Carnegie Mellon ZOG/KMS system (for military and business decision making) were decision support systems which were also major breakthroughs in user interface research. Furthermore, although hypertext researchers have generally been concerned with information overload, certain researchers, notably Douglas Engelbart, have focused on decision makers in particular.

Taxonomies[edit]

As with the definition, there is no universally accepted taxonomy of DSS either. Different authors propose different classifications. Using the relationship with the user as the criterion, Haettenschwiler[5] differentiates passive, active, and cooperative DSS. A passive DSS is a system that aids the process of decision making but cannot bring out explicit decision suggestions or solutions. An active DSS can bring out such decision suggestions or solutions. A cooperative DSS allows the decision maker (or their advisor) to modify, complete, or refine the decision suggestions provided by the system before sending them back to the system for validation. The system in turn improves, completes, and refines the suggestions of the decision maker and sends them back for validation. The whole process then starts again, until a consolidated solution is generated.

Another taxonomy for DSS has been created by Daniel Power. Using the mode of assistance as the criterion, Power differentiates communication-driven DSS, data-driven DSS, document-driven DSS, knowledge-driven DSS, and model-driven DSS.[6]

  • A communication-driven DSS supports more than one person working on a shared task; examples include integrated tools like Microsoft's NetMeeting or Groove[7]
  • A data-driven DSS or data-oriented DSS emphasizes access to and manipulation of a time series of internal company data and, sometimes, external data.
  • A document-driven DSS manages, retrieves, and manipulates unstructured information in a variety of electronic formats.
  • A knowledge-driven DSS provides specialized problem-solving expertise stored as facts, rules, procedures, or in similar structures.[6]
  • A model-driven DSS emphasizes access to and manipulation of a statistical, financial, optimization, or simulation model. Model-driven DSS use data and parameters provided by users to assist decision makers in analyzing a situation; they are not necessarily data-intensive. Dicodess is an example of an open source model-driven DSS generator.[8]

Using scope as the criterion, Power[9] differentiates enterprise-wide DSS and desktop DSS. An enterprise-wide DSS is linked to large data warehouses and serves many managers in the company. A desktop, single-user DSS is a small system that runs on an individual manager's PC.

Components[edit]

Design of a Drought Mitigation Decision Support System.

Three fundamental components of a DSS architecture are:[5][6][10][11][12]

  1. the database (or knowledge base),
  2. the model (i.e., the decision context and user criteria), and
  3. the user interface.

The users themselves are also important components of the architecture.[5][12]
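
As a deliberately simplified illustration of this three-part architecture, the following Python sketch wires a "database", a "model" and a minimal text-only "user interface" together. All class, function and field names are hypothetical and invented for this example; they do not come from any real DSS product.

    # A minimal sketch of the three DSS components listed above: a database
    # (here just a list of records), a model (a function encoding the decision
    # criteria), and a user interface (reduced to a plain-text report).
    # All names are illustrative and not taken from any real DSS product.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class DecisionSupportSystem:
        database: List[Dict]                    # knowledge base: raw records
        model: Callable[[List[Dict]], Dict]     # decision context / user criteria

        def report(self) -> str:                # stand-in for the user interface
            result = self.model(self.database)
            return "\n".join(f"{k}: {v}" for k, v in result.items())

    # Example model: flag products whose sales dropped between two periods.
    def sales_drop_model(records):
        return {r["product"]: "investigate" if r["q2"] < r["q1"] else "ok"
                for r in records}

    dss = DecisionSupportSystem(
        database=[{"product": "A", "q1": 120, "q2": 90},
                  {"product": "B", "q1": 80, "q2": 95}],
        model=sales_drop_model)
    print(dss.report())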

Development Frameworks[edit]

DSS are not entirely different from other information systems and require a structured approach to development. Such a framework includes people, technology, and the development approach.[10]

DSS technology levels (of hardware and software) may include:

  1. The actual application that will be used by the user. This is the part of the application that allows the decision maker to make decisions in a particular problem area and act upon that problem.
  2. The generator: a hardware/software environment that allows people to easily develop specific DSS applications. This level makes use of CASE tools or systems such as Crystal, Analytica and iThink.
  3. Tools: lower-level hardware and software, such as special languages, function libraries and linking modules, from which DSS generators are built.

An iterative developmental approach allows for the DSS to be changed and redesigned at various intervals. Once the system is designed, it will need to be tested and revised where necessary for the desired outcome.

Classification[edit]

There are several ways to classify DSS applications. Not every DSS fits neatly into one category; a DSS may be a mix of two or more architectures.

Holsapple and Whinston[13] classify DSS into the following six frameworks: Text-oriented DSS, Database-oriented DSS, Spreadsheet-oriented DSS, Solver-oriented DSS, Rule-oriented DSS, and Compound DSS.

A compound DSS is the most popular classification for a DSS. It is a hybrid system that includes two or more of the five basic structures described by Holsapple and Whinston.[13]

The support given by DSS can be separated into three distinct, interrelated categories:[14] Personal Support, Group Support, and Organizational Support.

DSS components may be classified as:

  1. Inputs: Factors, numbers, and characteristics to analyze
  2. User Knowledge and Expertise: Inputs requiring manual analysis by the user
  3. Outputs: Transformed data from which DSS "decisions" are generated
  4. Decisions: Results generated by the DSS based on user criteria

DSSs which perform selected cognitive decision-making functions and are based on artificial intelligence or intelligent agents technologies are called Intelligent Decision Support Systems (IDSS).[citation needed]

The nascent field of Decision engineering treats the decision itself as an engineered object, and applies engineering principles such as Design and Quality assurance to an explicit representation of the elements that make up a decision.

Applications[edit]

As mentioned above, such systems can in principle be built in any knowledge domain.

One example is the clinical decision support system for medical diagnosis. Other examples include a bank loan officer verifying the credit of a loan applicant, or an engineering firm that has bid on several projects and wants to know whether it can be competitive with its costs.

DSS is extensively used in business and management. Executive dashboards and other business performance software allow faster decision making, identification of negative trends, and better allocation of business resources. With a DSS, an organization's information can be presented in summarized form, such as charts and graphs, which helps management make strategic decisions.

A growing area of application for DSS concepts, principles, and techniques is agricultural production and marketing for sustainable development. For example, the DSSAT4 package,[15][16] developed with financial support from USAID during the 1980s and 1990s, has allowed rapid assessment of several agricultural production systems around the world to facilitate decision-making at the farm and policy levels. There are, however, many constraints to the successful adoption of DSS in agriculture.[17]

DSS are also prevalent in forest management, where the long planning time frame imposes specific requirements. All aspects of forest management, from log transportation and harvest scheduling to sustainability and ecosystem protection, have been addressed by modern DSSs.

A specific example concerns the Canadian National Railway system, which tests its equipment on a regular basis using a decision support system. A problem faced by any railroad is worn-out or defective rails, which can result in hundreds of derailments per year. Under a DSS, CN managed to decrease the incidence of derailments at the same time other companies were experiencing an increase.

Benefits[edit]

  1. Improves personal efficiency
  2. Speeds up the process of decision making
  3. Increases organizational control
  4. Encourages exploration and discovery on the part of the decision maker
  5. Speeds up problem solving in an organization
  6. Facilitates interpersonal communication
  7. Promotes learning or training
  8. Generates new evidence in support of a decision
  9. Creates a competitive advantage over the competition
  10. Reveals new approaches to thinking about the problem space
  11. Helps automate managerial processes
  12. Generates innovative ideas to improve performance

DSS Characteristics and capabilities[edit]

  1. Solves semi-structured and unstructured problems
  2. Supports managers at all levels
  3. Supports individuals and groups
  4. Supports interdependent and sequential decisions
  5. Supports the intelligence, design, and choice phases of decision making
  6. Adaptable and flexible
  7. Interactive and easy to use
  8. Interactive and efficient
  9. Humans control the process
  10. Ease of development by end users
  11. Modeling and analysis capabilities
  12. Data access
  13. Standalone, integrated, and web-based
  14. Supports a variety of decision processes
  15. Supports a variety of decision styles
  16. Quick response

See also[edit]

References[edit]

  1. ^ Keen, P. G. W. (1978). Decision support systems: an organizational perspective. Reading, Mass., Addison-Wesley Pub. Co. ISBN 0-201-03667-3
  2. ^ a b Henk G. Sol et al. (1987). Expert systems and artificial intelligence in decision support systems: proceedings of the Second Mini Euroconference, Lunteren, The Netherlands, 17–20 November 1985. Springer, 1987. ISBN 90-277-2437-7. p.1-2.
  3. ^ Efraim Turban, Jay E. Aronson, Ting-Peng Liang (2008). Decision Support Systems and Intelligent Systems. p. 574. 
  4. ^ "Gate Delays at Airports Are Minimised for United by Texas Instruments' Explorer". Computer Business Review. 1987-11-26. 
  5. ^ a b c Haettenschwiler, P. (1999). Neues anwenderfreundliches Konzept der Entscheidungsunterstützung. Gutes Entscheiden in Wirtschaft, Politik und Gesellschaft. Zurich, vdf Hochschulverlag AG: 189-208.
  6. ^ a b c Power, D. J. (2002). Decision support systems: concepts and resources for managers. Westport, Conn., Quorum Books.
  7. ^ Stanhope, P. (2002). Get in the Groove: building tools and peer-to-peer solutions with the Groove platform. New York, Hungry Minds
  8. ^ Gachet, A. (2004). Building Model-Driven Decision Support Systems with Dicodess. Zurich, VDF.
  9. ^ Power, D. J. (1996). What is a DSS? The On-Line Executive Journal for Data-Intensive Decision Support 1(3).
  10. ^ a b Sprague, R. H. and E. D. Carlson (1982). Building effective decision support systems. Englewood Cliffs, N.J., Prentice-Hall. ISBN 0-13-086215-0
  11. ^ Haag, Cummings, McCubbrey, Pinsonneault, Donovan (2000). Management Information Systems: For The Information Age. McGraw-Hill Ryerson Limited: 136-140. ISBN 0-07-281947-2
  12. ^ a b Marakas, G. M. (1999). Decision support systems in the twenty-first century. Upper Saddle River, N.J., Prentice Hall.
  13. ^ a b Holsapple, C.W., and A. B. Whinston. (1996). Decision Support Systems: A Knowledge-Based Approach. St. Paul: West Publishing. ISBN 0-324-03578-0
  14. ^ Hackathorn, R. D., and P. G. W. Keen. (1981, September). "Organizational Strategies for Personal Computing in Decision Support Systems." MIS Quarterly, Vol. 5, No. 3.
  15. ^ DSSAT4 (pdf)
  16. ^ The Decision Support System for Agrotechnology Transfer
  17. ^ Stephens, W. and Middleton, T. (2002). Why has the uptake of Decision Support Systems been so poor? In: Crop-soil simulation models in developing countries. 129-148 (Eds R.B. Matthews and William Stephens). Wallingford:CABI.

Further reading[edit]

  • Delic, K.A., Douillet,L. and Dayal, U. (2001) "Towards an architecture for real-time decision support systems:challenges and solutions.
  • Diasio, S., Agell, N. (2009) "The evolution of expertise in decision support technologies: A challenge for organizations," cscwd, pp. 692–697, 13th International Conference on Computer Supported Cooperative Work in Design, 2009. http://www.computer.org/portal/web/csdl/doi/10.1109/CSCWD.2009.4968139
  • Gadomski, A.M. et al.(2001) "An Approach to the Intelligent Decision Advisor (IDA) for Emergency Managers", Int. J. Risk Assessment and Management, Vol. 2, Nos. 3/4.
  • Gomes da Silva, Carlos; Clímaco, João; Figueira, José. European Journal of Operational Research.
  • Ender, Gabriela; E-Book (2005–2011) about the OpenSpace-Online Real-Time Methodology: Knowledge-sharing, problem solving, results-oriented group dialogs about topics that matter with extensive conference documentation in real-time. Download http://www.openspace-online.com/OpenSpace-Online_eBook_en.pdf
  • Jiménez, Antonio; Ríos-Insua, Sixto; Mateos, Alfonso. Computers & Operations Research.
  • Jintrawet, Attachai (1995). A Decision Support System for Rapid Assessment of Lowland Rice-based Cropping Alternatives in Thailand. Agricultural Systems 47: 245-258.
  • Matsatsinis, N.F. and Y. Siskos (2002), Intelligent support systems for marketing decisions, Kluwer Academic Publishers.
  • Power, D. J. (2000). Web-based and model-driven decision support systems: concepts and issues. in proceedings of the Americas Conference on Information Systems, Long Beach, California.
  • Reich, Yoram; Kapeliuk, Adi. Decision Support Systems., Nov2005, Vol. 41 Issue 1, p1-19, 19p.
  • Sauter, V. L. (1997). Decision support systems: an applied managerial approach. New York, John Wiley.
  • Silver, M. (1991). Systems that support decision makers: description and analysis. Chichester ; New York, Wiley.
  • Sprague, R. H. and H. J. Watson (1993). Decision support systems: putting theory into practice. Englewood Clifts, N.J., Prentice Hall.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_tree_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_tree_learning new file mode 100644 index 00000000..fb5bb25a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Decision_tree_learning @@ -0,0 +1 @@ + Decision tree learning - Wikipedia, the free encyclopedia

Decision tree learning

From Wikipedia, the free encyclopedia

Decision tree learning, used in statistics, data mining and machine learning, uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. This page deals with decision trees in data mining.

Contents

General[edit]

A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf.

Decision tree learning is a method commonly used in data mining.[1] The goal is to create a model that predicts the value of a target variable based on several input variables. An example is shown on the right. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) [2] is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data, but it is not the only strategy. In fact, some approaches have been developed recently allowing tree induction to be performed in a bottom-up fashion.[3]
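
To make the recursive partitioning idea concrete, here is a minimal Python sketch of top-down induction on categorical attributes. It splits on attributes in a fixed order instead of choosing the "best" split (real learners use an impurity criterion such as those in the Formulae section below); all names and the toy dataset are invented for the example.

    # Minimal sketch of top-down induction of a decision tree on categorical
    # attributes. It splits on attributes in a fixed order rather than picking
    # the "best" split; names and the toy data are illustrative.
    from collections import Counter

    def majority_label(rows):
        return Counter(label for _, label in rows).most_common(1)[0][0]

    def build_tree(rows, attributes):
        labels = {label for _, label in rows}
        if len(labels) == 1 or not attributes:   # pure node, or nothing left to split on
            return majority_label(rows)
        attr = attributes[0]                     # a real learner would choose the best split here
        branches = {}
        for value in {r[attr] for r, _ in rows}:
            subset = [(r, y) for r, y in rows if r[attr] == value]
            branches[value] = build_tree(subset, attributes[1:])
        return (attr, branches)

    def predict(tree, row):
        while isinstance(tree, tuple):
            attr, branches = tree
            tree = branches.get(row[attr], next(iter(branches.values())))
        return tree

    data = [({"outlook": "sunny", "windy": "no"},  "play"),
            ({"outlook": "rain",  "windy": "yes"}, "stay"),
            ({"outlook": "sunny", "windy": "yes"}, "play")]
    model = build_tree(data, ["outlook", "windy"])
    print(predict(model, {"outlook": "rain", "windy": "yes"}))   # -> stay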

In data mining, decision trees can be described also as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data.

Data comes in records of the form:

(\textbf{x},Y) = (x_1, x_2, x_3, ..., x_k, Y)

The dependent variable, Y, is the target variable that we are trying to understand, classify or generalize. The vector \textbf{x} is composed of the input variables x_1, x_2, x_3, etc., that are used for that task.

Types[edit]

Decision trees used in data mining are of two main types:

  • Classification tree analysis is when the predicted outcome is the class to which the data belongs.
  • Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient’s length of stay in a hospital).

The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.[4] Trees used for regression and trees used for classification have some similarities - but also some differences, such as the procedure used to determine where to split.[4]

Some techniques, often called ensemble methods, construct more than one decision tree:

  • Bagging decision trees, an early ensemble method, builds multiple decision trees by repeatedly resampling the training data with replacement and voting the trees for a consensus prediction[5] (a minimal sketch follows this list).
  • A Random Forest classifier uses a number of decision trees, in order to improve the classification rate.
  • Boosted Trees can be used for regression-type and classification-type problems.[6][7]
  • Rotation forest - in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.[8]
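
A minimal sketch of the bagging idea from the list above, assuming numpy and scikit-learn are installed: bootstrap-resample the training data, fit one tree per resample, and take a majority vote over the trees. The dataset choice and parameters are illustrative only.

    # Minimal bagging sketch (see the bagging item above): bootstrap-resample
    # the training data, fit one tree per resample, take a majority vote.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(0)

    trees = []
    for _ in range(25):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
    print("ensemble training accuracy:", (majority == y).mean())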

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

There are many specific decision-tree algorithms. Notable ones include:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (successor of ID3)
  • CART (Classification And Regression Tree)
  • CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.[9]
  • MARS: extends decision trees to better handle numerical data.

ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning a decision tree from training tuples.
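
As a brief illustration, scikit-learn (assumed installed) ships a CART-style learner in its DecisionTreeClassifier; export_text prints the induced splits as nested rules. Parameters below are chosen only to keep the printed tree small.

    # A CART-style classification tree fit with scikit-learn; export_text
    # prints the induced splits as nested rules.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)
    print(export_text(clf, feature_names=list(iris.feature_names)))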

Formulae[edit]

Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items.[10] Different algorithms use different metrics for measuring "best". These generally measure the homogeneity of the target variable within the subsets. Some examples are given below. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.

Gini impurity[edit]

Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

To compute Gini impurity for a set of items, suppose i takes on values in {1, 2, ..., m}, and let fi be the fraction of items labeled with value i in the set.

I_{G}(f) = \sum_{i=1}^{m} f_i (1-f_i) = \sum_{i=1}^{m} (f_i - {f_i}^2) = \sum_{i=1}^m f_i - \sum_{i=1}^{m} {f_i}^2 = 1 - \sum^{m}_{i=1} {f_i}^{2}
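
The formula translates directly into a few lines of Python; here the fractions f_i are computed from a plain list of class labels (a toy example, not tied to any particular library).

    # Gini impurity I_G(f) = 1 - sum_i f_i^2, with the fractions f_i computed
    # from a plain list of class labels.
    from collections import Counter

    def gini_impurity(labels):
        n = len(labels)
        fractions = [count / n for count in Counter(labels).values()]
        return 1.0 - sum(f * f for f in fractions)

    print(gini_impurity(["a", "a", "a", "a"]))   # 0.0  (pure node)
    print(gini_impurity(["a", "a", "b", "b"]))   # 0.5  (two equally frequent classes)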

Information gain[edit]

Used by the ID3, C4.5 and C5.0 tree-generation algorithms. Information gain is based on the concept of entropy from information theory.

I_{E}(f) = - \sum^{m}_{i=1} f_i \log_2 f_i
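
The entropy formula, together with the information gain of a candidate split (conventionally the parent's entropy minus the size-weighted entropy of the child subsets), can likewise be written directly in Python; the toy labels below are invented for the example.

    # Entropy I_E(f) = -sum_i f_i log2 f_i, and the information gain of a split,
    # conventionally the parent entropy minus the size-weighted child entropies.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        n = len(parent)
        weighted = sum(len(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    parent = ["yes", "yes", "no", "no"]
    print(entropy(parent))                                          # 1.0 bit
    print(information_gain(parent, [["yes", "yes"], ["no", "no"]])) # 1.0 (perfect split)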

Decision tree advantages[edit]

Amongst other data mining methods, decision trees have various advantages:

  • Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
  • Requires little data preparation. Other techniques often require data normalisation, the creation of dummy variables, and the removal of blank values.
  • Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables while neural networks can be used only with numerical variables.)
  • Uses a white box model. If a given situation is observable in a model the explanation for the condition is easily explained by boolean logic. (An example of a black box model is an artificial neural network since the explanation for the results is difficult to understand.)
  • Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
  • Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
  • Performs well with large datasets. Large amounts of data can be analysed using standard computing resources in reasonable time.

Limitations[edit]

  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts.[11][12] Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally-optimal decision tree.
  • Decision-tree learners can create over-complex trees that do not generalise well from the training data. (This is known as overfitting.[13]) Mechanisms such as pruning are necessary to avoid this problem.
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large. Approaches to solve the problem involve either changing the representation of the problem domain (known as propositionalisation)[14] or using learning algorithms based on more expressive representations (such as statistical relational learning or inductive logic programming).
  • For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.[15]

Extensions[edit]

Decision graphs[edit]

In a decision tree, all paths from the root node to the leaf node proceed by way of conjunction, or AND. In a decision graph, it is possible to use disjunctions (ORs) to join two or more paths together using Minimum message length (MML).[16] Decision graphs have been further extended to allow for previously unstated new attributes to be learnt dynamically and used at different places within the graph.[17] The more general coding scheme results in better predictive accuracy and log-loss probabilistic scoring.[citation needed] In general, decision graphs infer models with fewer leaves than decision trees.

Search through Evolutionary Algorithms[edit]

Evolutionary algorithms have been used to avoid local optimal decisions and search the decision tree space with little a priori bias.[18][19]

See also[edit]

Implementations[edit]

References[edit]

  1. ^ Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711. 
  2. ^ Quinlan, J. R., (1986). Induction of Decision Trees. Machine Learning 1: 81-106, Kluwer Academic Publishers
  3. ^ Barros R. C., Cerri R., Jaskowiak P. A., Carvalho, A. C. P. L. F., A bottom-up oblique decision tree induction algorithm. Proceedings of the 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011).
  4. ^ a b Breiman, Leo; Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8. 
  5. ^ Breiman, L. (1996). Bagging Predictors. "Machine Learning, 24": pp. 123-140.
  6. ^ Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.
  7. ^ Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer Verlag.
  8. ^ Rodriguez, J.J. and Kuncheva, L.I. and Alonso, C.J. (2006), Rotation forest: A new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619-1630.
  9. ^ Kass, G. V. (1980). "An exploratory technique for investigating large quantities of categorical data". Applied Statistics 29 (2): 119–127. doi:10.2307/2986296. JSTOR 2986296. 
  10. ^ Rokach, L.; Maimon, O. (2005). "Top-down induction of decision trees classifiers-a survey". IEEE Transactions on Systems, Man, and Cybernetics, Part C 35 (4): 476–487. doi:10.1109/TSMCC.2004.843247. 
  11. ^ Hyafil, Laurent; Rivest, RL (1976). "Constructing Optimal Binary Decision Trees is NP-complete". Information Processing Letters 5 (1): 15–17. doi:10.1016/0020-0190(76)90095-8. 
  12. ^ Murthy S. (1998). Automatic construction of decision trees from data: A multidisciplinary survey. Data Mining and Knowledge Discovery
  13. ^ Principles of Data Mining. 2007. doi:10.1007/978-1-84628-766-4. ISBN 978-1-84628-765-7.  edit
  14. ^ Horváth, Tamás; Yamamoto, Akihiro, eds. (2003). Inductive Logic Programming. Lecture Notes in Computer Science 2835. doi:10.1007/b13700. ISBN 978-3-540-20144-1.  edit
  15. ^ Deng,H.; Runger, G.; Tuv, E. (2011). "Bias of importance measures for multi-valued attributes and solutions". Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). pp. 293–300. 
  16. ^ http://citeseer.ist.psu.edu/oliver93decision.html
  17. ^ Tan & Dowe (2003)
  18. ^ Papagelis A., Kalles D.(2001). Breeding Decision Trees Using Evolutionary Techniques, Proceedings of the Eighteenth International Conference on Machine Learning, p.393-400, June 28-July 01, 2001
  19. ^ Barros, Rodrigo C., Basgalupp, M. P., Carvalho, A. C. P. L. F., Freitas, Alex A. (2011). A Survey of Evolutionary Algorithms for Decision-Tree Induction. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 42, n. 3, p. 291-312, May 2012.

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Discovery_observation_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Discovery_observation_ new file mode 100644 index 00000000..c8cea95b --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Discovery_observation_ @@ -0,0 +1 @@ + Discovery (observation) - Wikipedia, the free encyclopedia

Discovery (observation)

From Wikipedia, the free encyclopedia

Discovery is the act of detecting something new, or something "old" that had been unknown. With reference to science and academic disciplines, discovery is the observation of new phenomena, new actions, or new events, and the provision of new reasoning to explain the knowledge gathered through such observations with previously acquired knowledge from abstract thought and everyday experiences. Visual discoveries are often called sightings.[citation needed]

Contents

Description[edit]

New discoveries are acquired through various senses and are usually assimilated, merging with pre-existing knowledge and actions. Questioning is a major form of human thought and interpersonal communication, and plays a key role in discovery.[citation needed] Discoveries are often made due to questions. Some discoveries lead to the invention of objects, processes, or techniques. A discovery may sometimes be based on earlier discoveries, collaborations or ideas, and the process of discovery requires at least the awareness that an existing concept or method can be modified or transformed.[citation needed] However, some discoveries also represent a radical breakthrough in knowledge.

Within science[edit]

Within scientific disciplines, discovery is the observation of new phenomena, actions, or events which helps explain knowledge gathered through previously acquired scientific evidence. In science, exploration is one of three purposes of research,[citation needed] the other two being description and explanation. Discovery is made by providing observational evidence and attempts to develop an initial, rough understanding of some phenomenon.

Discovery within the field of particle physics has an accepted definition for what constitutes a discovery: a five-sigma level of certainty.[1] Such a level defines statistically how unlikely it is that an experimental result is due to chance. The combination of a five-sigma level of certainty, and independent confirmation by other experiments, turns findings into accepted discoveries.[1]
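
For intuition, the one-sided tail probability corresponding to five standard deviations of a normal distribution can be computed with scipy (assumed installed); it comes out to roughly 3 x 10^-7, i.e. about one chance in 3.5 million that a pure statistical fluctuation looks at least that strong.

    # One-sided tail probability beyond 5 sigma of a standard normal distribution.
    from scipy.stats import norm

    p_value = norm.sf(5)      # survival function: P(Z > 5)
    print(p_value)            # roughly 2.9e-07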

Exploration[edit]

Discovery can also be used to describe the first incursions of peoples from one culture into the geographical and cultural environment of others. Western culture has used the term "discovery" in its histories to subtly emphasize the importance of "exploration" in the history of the world, such as in the "Age of Exploration". Since the European exploration of the world, the "discovery" of every continent, island, and geographical feature, for the European traveler, led to the notion that the native people were "discovered" (though many were there centuries or even millennia before). In that way, the term has a Eurocentric and ethnocentric meaning often overlooked by westerners.[citation needed]

See also[edit]

References[edit]

General references
Specific references
  1. ^ a b Rincon, Paul (12 December 2011). "Higgs boson: Excitement builds over 'glimpses' at LHC". BBC News. Retrieved 2011-12-12. 

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Document_classification b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Document_classification new file mode 100644 index 00000000..11ebc600 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Document_classification @@ -0,0 +1 @@ + Document classification - Wikipedia, the free encyclopedia

Document classification

From Wikipedia, the free encyclopedia

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach.

Contents

"Content based" versus "request based" classification[edit]

Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned.[1] In automatic classification this could be the number of times given words appear in a document.
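
As a toy sketch of the automatic, content based variant, the following Python snippet assigns a document to the class whose keywords account for the largest share of its tokens, and only if that share reaches a 20% threshold echoing the library rule quoted above. The keyword lists and the threshold are purely illustrative.

    # Toy content based classifier: assign the class whose keywords carry the
    # largest share of the document's tokens, requiring at least a 20% share.
    CLASS_KEYWORDS = {
        "databases": {"database", "sql", "query", "table"},
        "machine_learning": {"classifier", "training", "model", "tree"},
    }

    def classify(text, threshold=0.20):
        tokens = text.lower().split()
        weights = {c: sum(t in kws for t in tokens) / len(tokens)
                   for c, kws in CLASS_KEYWORDS.items()}
        best = max(weights, key=weights.get)
        return best if weights[best] >= threshold else None

    print(classify("the query planner rewrites each sql query over a table"))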

Request oriented classification (or indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks: “Under which descriptors should this entity be found?” and “think[s] of all the possible queries and decide[s] for which ones the entity at hand is relevant” (Soergel, 1985, p. 230[2]).

Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently than a historical library would. It is probably better, however, to understand request oriented classification as policy-based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.

Classification versus indexing[edit]

Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. “These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21[3]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Therefore, labeling a document (say, by assigning a term from a controlled vocabulary to it) is at the same time assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).

Automatic document classification[edit]

Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism.
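
A compact illustration of the supervised and unsupervised variants, assuming scikit-learn is installed: a Naive Bayes text classifier trained on labelled documents versus k-means clustering of the same documents without labels. The tiny corpus and labels are invented for the example.

    # Supervised vs. unsupervised document classification with scikit-learn;
    # the corpus and labels are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.cluster import KMeans

    docs   = ["stock markets fell sharply", "the striker scored twice",
              "bond yields rose again", "the match ended in a draw"]
    labels = ["finance", "sport", "finance", "sport"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    # Supervised: the labels guide the classifier.
    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vec.transform(["bond markets and yields"])))   # likely "finance"

    # Unsupervised (document clustering): no labels are used at all.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))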

Techniques[edit]

Automatic document classification techniques include:

Applications[edit]

Classification techniques have been applied to

See also[edit]

References[edit]

  1. ^ Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")
  2. ^ Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.
  3. ^ Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.
  4. ^ Aitchison, J. (1986). “A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure.” Journal of Documentation, Vol. 42 No. 3, pp. 160-181.
  5. ^ Aitchison, J. (2004). “Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule.” Bliss Classification Bulletin, Vol. 46, pp. 20-26.
  6. ^ Broughton, V. (2008). “A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.).” Axiomathes, Vol. 18 No.2, pp. 193-210.
  7. ^ Riesthuis, G. J. A., & Bliedung, St. (1991). “Thesaurification of the UDC.” Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.
  8. ^ Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158-165, ACL.
  9. ^ Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment, BCS IRSG Symposium: Future Directions in Information Access, London, UK, pp. 54–63 

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ECML_PKDD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ECML_PKDD new file mode 100644 index 00000000..2884bc73 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ECML_PKDD @@ -0,0 +1 @@ + ECML PKDD - Wikipedia, the free encyclopedia

ECML PKDD

From Wikipedia, the free encyclopedia

ECML PKDD, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, is one of the leading[1][2] academic conferences on machine learning and knowledge discovery, held in Europe every year.

Contents

History [edit]

ECML PKDD is a merger of two European conferences, the European Conference on Machine Learning (ECML) and the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). ECML and PKDD have been co-located since 2001;[3] however, both ECML and PKDD retained their own identity until 2007. For example, the 2007 conference was known as “the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)”, or in brief, “ECML/PKDD 2007”, and both ECML and PKDD had their own conference proceedings. In 2008 the conferences were merged into one, and the division into traditional ECML topics and traditional PKDD topics was removed.[4]

The history of ECML dates back to 1986, when the European Working Session on Learning was first held. In 1993 the name of the conference was changed to European Conference on Machine Learning.

PKDD was first organised in 1997. Originally PKDD stood for the European Symposium on Principles of Data Mining and Knowledge Discovery from Databases.[5] The name European Conference on Principles and Practice of Knowledge Discovery in Databases has been used since 1999.[6]

List of past conferences [edit]

Conference Year City Country Date
ECML PKDD 2012 Bristol United Kingdom September 24-28
ECML PKDD 2011 Athens Greece September 5-9
ECML PKDD 2010 Barcelona Spain September 20–24
ECML PKDD 2009 Bled Slovenia September 7–11
ECML PKDD 2008 Antwerp Belgium September 15–19
18th ECML/11th PKDD 2007 Warsaw Poland September 17–21
17th ECML/10th PKDD 2006 Berlin Germany September 18–22
16th ECML/9th PKDD 2005 Porto Portugal October 3–7
15th ECML/8th PKDD 2004 Pisa Italy September 20–24
14th ECML/7th PKDD 2003 Cavtat/Dubrovnik Croatia September 22–26
13th ECML/6th PKDD 2002 Helsinki Finland August 19–23
12th ECML/5th PKDD 2001 Freiburg Germany September 3–7
Conference Year City Country Date
11th ECML 2000 Barcelona Spain May 30–June 2
10th ECML 1998 Chemnitz Germany April 21–24
9th ECML 1997 Prague Czech Republic April 23–26
8th ECML 1995 Heraclion Crete, Greece April 25–27
7th ECML 1994 Catania Italy April 6–8
6th ECML 1993 Vienna Austria April 5–7
5th EWSL 1991 Porto Portugal March 6–8
4th EWSL 1989 Montpellier France December 4–6
3rd EWSL 1988 Glasgow Scotland, UK October 3–5
2nd EWSL 1987 Bled Yugoslavia May 13–15
1st EWSL 1986 Orsay France February 3–4
Conference Year City Country Date
4th PKDD 2000 Lyon France September 13–16
3rd PKDD 1999 Prague Czech Republic September 15–18
2nd PKDD 1998 Nantes France September 23–26
1st PKDD 1997 Trondheim Norway June 24–27

References [edit]

  1. ^ "Machine Learning and Pattern Recognition". Libra. Retrieved 2009-07-04.  ECML is number 4 on the list.
  2. ^ "2007 Australian Ranking of ICT Conferences".  Both ECML and PKDD are ranked on “tier A”.
  3. ^ "Past conferences". ECML PKDD. Retrieved 2009-07-04. 
  4. ^ Daelemans, Walter; Goethals, Bart; Morik, Katharina (2008). "Preface". Proceedings of ECML PKDD 2008. Lecture Notes in Artificial Intelligence 5211. Springer. pp. V–VI. doi:10.1007/978-3-540-87479-9. ISBN 978-3-540-87478-2. .
  5. ^ Komorowski, Jan; Zytkow, Jan (1997). "Preface". Proceedings of PKDD 1997. Lecture Notes in Artificial Intelligence 1263. Springer. pp. V–VI. doi:10.1007/3-540-63223-9. ISBN 978-3-540-63223-8. .
  6. ^ Zytkow, Jan; Rauch, Jan (1999). "Preface". Proceedings of PKDD 1999. Lecture Notes in Artificial Intelligence 1704. Springer. pp. V–VII. doi:10.1007/b72280. ISBN 978-3-540-66490-1. .

External links [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Elastic_map b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Elastic_map new file mode 100644 index 00000000..252b5cb1 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Elastic_map @@ -0,0 +1 @@ + Elastic map - Wikipedia, the free encyclopedia

Elastic map

From Wikipedia, the free encyclopedia
Linear PCA versus nonlinear Principal Manifolds[1] for visualization of breast cancer microarray data: a) Configuration of nodes and 2D Principal Surface in the 3D PCA linear manifold. The dataset is curved and can not be mapped adequately on a 2D principal plane; b) The distribution in the internal 2D non-linear principal surface coordinates (ELMap2D) together with an estimation of the density of points; c) The same as b), but for the linear 2D PCA manifold (PCA2D). The “basal” breast cancer subtype is visualized more adequately with ELMap2D and some features of the distribution become better resolved in comparison to PCA2D. Principal manifolds are produced by the elastic maps algorithm. Data are available for public competition.[2] Software is available for free non-commercial use.[3][4]

Elastic maps provide a tool for nonlinear dimensionality reduction. By construction, they are a system of elastic springs embedded in the data space.[1] This system approximates a low-dimensional manifold. The elastic coefficients of this system allow a switch from completely unstructured k-means clustering (zero elasticity) to estimators located close to linear PCA manifolds (for high bending and low stretching moduli). With intermediate values of the elasticity coefficients, the system effectively approximates non-linear principal manifolds. This approach is based on a mechanical analogy between principal manifolds, which pass through "the middle" of the data distribution, and elastic membranes and plates. The method was developed by A.N. Gorban, A.Y. Zinovyev and A.A. Pitenko in 1996–1998.

Contents

Energy of elastic map[edit]

Let the data set be a set S of vectors in a finite-dimensional Euclidean space. The elastic map is represented by a set of nodes W_j in the same space. Each data point s \in S has a host node, namely the closest node W_j (if there are several closest nodes, one takes the node with the smallest number). The data set S is divided into classes K_j=\{s \ | \ W_j \mbox{ is a host of } s\}.

The approximation energy D is the distortion

D=\frac{1}{2}\sum_{j=1}^k \sum_{s \in K_j}\|s-W_j\|^2,

this is the energy of the springs with unit elasticity which connect each data point with its host node. It is possible to apply weighting factors to the terms of this sum, for example to reflect the standard deviation of the probability density function of any subset of data points \{s_i\}.

On the set of nodes an additional structure is defined. Some pairs of nodes, (W_i,W_j), are connected by elastic edges. Call this set of pairs E. Some triplets of nodes, (W_i,W_j,W_l), form bending ribs. Call this set of triplets G.

The stretching energy is U_{E}=\frac{1}{2}\lambda \sum_{(W_i,W_j) \in E} \|W_i -W_j\|^2 ,
The bending energy is U_G=\frac{1}{2}\mu \sum_{(W_i,W_j,W_l) \in G} \|W_i -2W_j+W_l\|^2 ,

where \lambda and \mu are the stretching and bending moduli respectively. The stretching energy is sometimes referred to as the "membrane" term, while the bending energy is referred to as the "thin plate" term.[5]

For example, on the 2D rectangular grid the elastic edges are just vertical and horizontal edges (pairs of closest vertices) and the bending ribs are the vertical or horizontal triplets of consecutive (closest) vertices.

The total energy of the elastic map is thus U=D+U_E+U_G.

The position of the nodes \{W_j\} is determined by the mechanical equilibrium of the elastic map, i.e. its location is such that it minimizes the total energy U.
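
The three energy terms can be written out directly with numpy (assumed available); the sketch below mirrors the notation above (nodes W_j, edge set E, rib set G) and uses an invented toy dataset and grid.

    # Elastic map energy U = D + U_E + U_G, written out from the definitions
    # above with numpy; the toy data, edges and ribs are illustrative.
    import numpy as np

    def elastic_energy(data, nodes, edges, ribs, lam, mu):
        # D: each data point is tied to its closest node (its "host") by a unit spring
        d2 = ((data[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)
        D = 0.5 * d2.min(axis=1).sum()
        # U_E: stretching energy over the elastic edges (W_i, W_j)
        U_E = 0.5 * lam * sum(((nodes[i] - nodes[j]) ** 2).sum() for i, j in edges)
        # U_G: bending energy over the ribs (W_i, W_j, W_l)
        U_G = 0.5 * mu * sum(((nodes[i] - 2 * nodes[j] + nodes[l]) ** 2).sum()
                             for i, j, l in ribs)
        return D + U_E + U_G

    data = np.random.default_rng(0).normal(size=(50, 2))
    nodes = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])   # a tiny 1-D "grid"
    print(elastic_energy(data, nodes, edges=[(0, 1), (1, 2)], ribs=[(0, 1, 2)],
                         lam=1.0, mu=1.0))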

Expectation-maximization algorithm[edit]

For a given splitting of the dataset S into classes K_j, minimization of the quadratic functional U is a linear problem with a sparse matrix of coefficients. Therefore, similarly to PCA or k-means, a splitting method is used:

  • For given \{W_j\} find \{K_j\};
  • For given \{K_j\} minimize U and find \{W_j\};
  • If no change, terminate.

This expectation-maximization algorithm guarantees a local minimum of U. Various additional methods have been proposed to improve the approximation. For example, the softening strategy starts with rigid grids (short edges, large bending and stretching moduli \lambda and \mu) and finishes with soft grids (small \lambda and \mu). The training goes in several epochs, each epoch with its own grid rigidity. Another adaptive strategy is the growing net: one starts with a small number of nodes and gradually adds new ones, each epoch having its own number of nodes.
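
A simplified sketch of this splitting procedure, reusing the energy terms above: reassign hosts, then minimize U over the node positions with the assignments held fixed. The original method solves a sparse linear system in the second step; a general-purpose scipy optimizer (assumed installed) is substituted here purely to keep the example short, and all names and data are illustrative.

    # Simplified splitting loop: (1) assign each data point to its closest node,
    # (2) minimize U over node positions with those assignments fixed. The
    # original method solves a sparse linear system in step (2); a generic
    # scipy optimizer is substituted here purely to keep the sketch short.
    import numpy as np
    from scipy.optimize import minimize

    def fit_elastic_map(data, nodes, edges, ribs, lam=1.0, mu=1.0, epochs=10):
        nodes = nodes.copy()
        for _ in range(epochs):
            hosts = ((data[:, None, :] - nodes[None, :, :]) ** 2).sum(-1).argmin(1)

            def U(flat, W_shape=nodes.shape, hosts=hosts):
                W = flat.reshape(W_shape)
                D = 0.5 * ((data - W[hosts]) ** 2).sum()
                UE = 0.5 * lam * sum(((W[i] - W[j]) ** 2).sum() for i, j in edges)
                UG = 0.5 * mu * sum(((W[i] - 2 * W[j] + W[l]) ** 2).sum()
                                    for i, j, l in ribs)
                return D + UE + UG

            nodes = minimize(U, nodes.ravel()).x.reshape(nodes.shape)
        return nodes

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 2))
    grid = np.linspace(-1, 1, 5)[:, None] * np.array([[1.0, 0.0]])   # 5 nodes on a line
    print(fit_elastic_map(data, grid, edges=[(i, i + 1) for i in range(4)],
                          ribs=[(i, i + 1, i + 2) for i in range(3)]))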

Applications[edit]

Application of principal curves built by the elastic maps method: nonlinear quality of life index.[6] Points represent data for 171 UN countries in a 4-dimensional space formed by the values of 4 indicators: gross product per capita, life expectancy, infant mortality, and tuberculosis incidence. Different forms and colors correspond to various geographical locations and years. The red bold line represents the principal curve approximating the dataset.

The most important applications are in bioinformatics,[7][8] for exploratory data analysis and visualisation of multidimensional data; for data visualisation in economics, social and political sciences;[9] as an auxiliary tool for data mapping in geographic information systems; and for the visualisation of data of various kinds.

Recently, the method has been adapted as a support tool in the decision process underlying the selection, optimization, and management of financial portfolios.[10]

References[edit]

  1. ^ a b A. N. Gorban, A. Y. Zinovyev, Principal Graphs and Manifolds, In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, Olivas E.S. et al Eds. Information Science Reference, IGI Global: Hershey, PA, USA, 2009. 28–59.
  2. ^ Wang, Y., Klijn, J.G., Zhang, Y., Sieuwerts, A.M., Look, M.P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M.E., Yu, J. et al.: Gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671–679 (2005); Data online
  3. ^ A. Zinovyev, ViDaExpert - Multidimensional Data Visualization Tool (free for non-commercial use). Institut Curie, Paris.
  4. ^ A. Zinovyev, ViDaExpert overview, IHES (Institut des Hautes Études Scientifiques), Bures-Sur-Yvette, Île-de-France.
  5. ^ Michael Kass, Andrew Witkin, Demetri Terzopoulos, Snakes: Active contour models, Int.J. Computer Vision, 1988 vol 1-4 pp.321-331
  6. ^ A. N. Gorban, A. Zinovyev, Principal manifolds and graphs in practice: from molecular biology to dynamical systems, International Journal of Neural Systems, Vol. 20, No. 3 (2010) 219–232.
  7. ^ A.N. Gorban, B. Kegl, D. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualisation and Dimension Reduction, LNCSE 58, Springer: Berlin – Heidelberg – New York, 2007. ISBN 978-3-540-73749-0
  8. ^ M. Chacón, M. Lévano, H. Allende, H. Nowak, Detection of Gene Expressions in Microarrays by Applying Iteratively Elastic Neural Net, In: B. Beliczynski et al. (Eds.), Lecture Notes in Computer Sciences, Vol. 4432, Springer: Berlin – Heidelberg 2007, 355–363.
  9. ^ A. Zinovyev, Data visualization in political and social sciences, In: SAGE "International Encyclopedia of Political Science", Badie, B., Berg-Schlosser, D., Morlino, L. A. (Eds.), 2011.
  10. ^ M. Resta, Portfolio optimization through elastic maps: Some evidence from the Italian stock exchange, Knowledge-Based Intelligent Information and Engineering Systems, B. Apolloni, R.J. Howlett and L. Jain (eds.), Lecture Notes in Computer Science, Vol. 4693, Springer: Berlin – Heidelberg, 2010, 635-641.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases new file mode 100644 index 00000000..c6dc87e0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases @@ -0,0 +1 @@ + ECML PKDD - Wikipedia, the free encyclopedia

ECML PKDD

From Wikipedia, the free encyclopedia

ECML PKDD, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, is one of the leading[1][2] academic conferences on machine learning and knowledge discovery, held in Europe every year.

Contents

History [edit]

ECML PKDD is a merger of two European conferences, the European Conference on Machine Learning (ECML) and the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). ECML and PKDD have been co-located since 2001;[3] however, both ECML and PKDD retained their own identity until 2007. For example, the 2007 conference was known as “the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)”, or in brief, “ECML/PKDD 2007”, and both ECML and PKDD had their own conference proceedings. In 2008 the conferences were merged into one, and the division into traditional ECML topics and traditional PKDD topics was removed.[4]

The history of ECML dates back to 1986, when the European Working Session on Learning was first held. In 1993 the name of the conference was changed to European Conference on Machine Learning.

PKDD was first organised in 1997. Originally PKDD stood for the European Symposium on Principles of Data Mining and Knowledge Discovery from Databases.[5] The name European Conference on Principles and Practice of Knowledge Discovery in Databases has been used since 1999.[6]

List of past conferences [edit]

Conference Year City Country Date
ECML PKDD 2012 Bristol United Kingdom September 24–28
ECML PKDD 2011 Athens Greece September 5–9
ECML PKDD 2010 Barcelona Spain September 20–24
ECML PKDD 2009 Bled Slovenia September 7–11
ECML PKDD 2008 Antwerp Belgium September 15–19
18th ECML/11th PKDD 2007 Warsaw Poland September 17–21
17th ECML/10th PKDD 2006 Berlin Germany September 18–22
16th ECML/9th PKDD 2005 Porto Portugal October 3–7
15th ECML/8th PKDD 2004 Pisa Italy September 20–24
14th ECML/7th PKDD 2003 Cavtat/Dubrovnik Croatia September 22–26
13th ECML/6th PKDD 2002 Helsinki Finland August 19–23
12th ECML/5th PKDD 2001 Freiburg Germany September 3–7
Conference Year City Country Date
11th ECML 2000 Barcelona Spain May 30–June 2
10th ECML 1998 Chemnitz Germany April 21–24
9th ECML 1997 Prague Czech Republic April 23–26
8th ECML 1995 Heraclion Crete, Greece April 25–27
7th ECML 1994 Catania Italy April 6–8
6th ECML 1993 Vienna Austria April 5–7
5th EWSL 1991 Porto Portugal March 6–8
4th EWSL 1989 Montpellier France December 4–6
3rd EWSL 1988 Glasgow Scotland, UK October 3–5
2nd EWSL 1987 Bled Yugoslavia May 13–15
1st EWSL 1986 Orsay France February 3–4
Conference Year City Country Date
4th PKDD 2000 Lyon France September 13–16
3rd PKDD 1999 Prague Czech Republic September 15–18
2nd PKDD 1998 Nantes France September 23–26
1st PKDD 1997 Trondheim Norway June 24–27

References [edit]

  1. ^ "Machine Learning and Pattern Recognition". Libra. Retrieved 2009-07-04.  ECML is number 4 on the list.
  2. ^ "2007 Australian Ranking of ICT Conferences".  Both ECML and PKDD are ranked on “tier A”.
  3. ^ "Past conferences". ECML PKDD. Retrieved 2009-07-04. 
  4. ^ Daelemans, Walter; Goethals, Bart; Morik, Katharina (2008). "Preface". Proceedings of ECML PKDD 2008. Lecture Notes in Artificial Intelligence 5211. Springer. pp. V–VI. doi:10.1007/978-3-540-87479-9. ISBN 978-3-540-87478-2.
  5. ^ Komorowski, Jan; Zytkow, Jan (1997). "Preface". Proceedings of PKDD 1997. Lecture Notes in Artificial Intelligence 1263. Springer. pp. V–VI. doi:10.1007/3-540-63223-9. ISBN 978-3-540-63223-8.
  6. ^ Zytkow, Jan; Rauch, Jan (1999). "Preface". Proceedings of PKDD 1999. Lecture Notes in Artificial Intelligence 1704. Springer. pp. V–VII. doi:10.1007/b72280. ISBN 978-3-540-66490-1.

External links [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Evolutionary_data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Evolutionary_data_mining new file mode 100644 index 00000000..0251343a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Evolutionary_data_mining @@ -0,0 +1 @@ + Evolutionary data mining - Wikipedia, the free encyclopedia

Evolutionary data mining

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Evolutionary data mining, or genetic data mining, is an umbrella term for any data mining using evolutionary algorithms. While it can be used for mining data from DNA sequences,[1] it is not limited to biological contexts and can be used in any classification-based prediction scenario, which helps "predict the value ... of a user-specified goal attribute based on the values of other attributes."[2] For instance, a banking institution might want to predict whether a customer's credit would be "good" or "bad" based on their age, income and current savings.[2] Evolutionary algorithms for data mining work by creating a series of random rules to be checked against a training dataset.[3] The rules which most closely fit the data are selected and mutated.[3] The process is iterated many times and eventually a rule will arise that approaches 100% similarity with the training data.[2] This rule is then checked against a test dataset, which was previously invisible to the genetic algorithm.[2]

Contents

Process [edit]

Data preparation [edit]

Before a database can be mined for data using evolutionary algorithms, it first has to be cleaned,[2] which means incomplete, noisy or inconsistent data should be repaired. It is imperative that this be done before the mining takes place, as it will help the algorithms produce more accurate results.[3]

If the data come from more than one database, they can be integrated, or combined, at this point.[3] When dealing with large datasets, it might be beneficial to also reduce the amount of data being handled.[3] One common method of data reduction works by getting a normalized sample of the data from the database, resulting in much faster, yet statistically equivalent, results.[3]

At this point, the data is split into two equal but mutually exclusive elements, a test and a training dataset.[2] The training dataset will be used to let rules evolve which match it closely.[2] The test dataset will then either confirm or deny these rules.[2]

Data mining [edit]

Evolutionary algorithms work by trying to emulate natural evolution.[3] First, a random series of "rules" are set on the training dataset, which try to generalize the data into formulas.[3] The rules are checked, and the ones that fit the data best are kept, the rules that do not fit the data are discarded.[3] The rules that were kept are then mutated, and multiplied to create new rules.[3]

This process iterates as necessary in order to produce a rule that matches the dataset as closely as possible.[3] When this rule is obtained, it is then checked against the test dataset.[2] If the rule still matches the data, then the rule is valid and is kept.[2] If it does not match the data, then it is discarded and the process begins by selecting random rules again.[2]
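
As a rough illustration of this loop, the sketch below evolves simple threshold rules on a made-up banking-style dataset; the data, the rule encoding and all parameters are invented for the example and are not taken from any particular evolutionary data mining system.

import random

random.seed(0)

# Toy records: (age, income, savings) -> credit "good" (1) or "bad" (0); values are made up.
data = [((25, 20, 1), 0), ((40, 55, 10), 1), ((35, 48, 7), 1), ((22, 18, 0), 0),
        ((50, 70, 20), 1), ((28, 25, 2), 0), ((45, 60, 15), 1), ((30, 22, 1), 0)]
random.shuffle(data)
train, test = data[:4], data[4:]          # two equal, mutually exclusive parts

def random_rule():
    # A rule is a list of thresholds: predict "good" if every attribute exceeds its threshold.
    return [random.uniform(15, 60), random.uniform(10, 80), random.uniform(0, 25)]

def predict(rule, x):
    return int(all(xi >= ti for xi, ti in zip(x, rule)))

def fitness(rule, records):
    # Fraction of records the rule classifies correctly.
    return sum(predict(rule, x) == y for x, y in records) / len(records)

def mutate(rule):
    return [t + random.gauss(0, 3) for t in rule]

population = [random_rule() for _ in range(20)]
for generation in range(50):
    population.sort(key=lambda r: fitness(r, train), reverse=True)
    best = population[:5]                                        # keep the closest-fitting rules
    population = best + [mutate(random.choice(best)) for _ in range(15)]

best_rule = max(population, key=lambda r: fitness(r, train))
print("training accuracy:", fitness(best_rule, train))
print("test accuracy:    ", fitness(best_rule, test))            # previously unseen data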

See also [edit]

References [edit]

  1. ^ Wai-Ho Au, Keith C. C. Chan, and Xin Yao. "A Novel Evolutionary Data Mining Algorithm With Applications to Churn Prediction", IEEE, retrieved on 2008-12-4.
  2. ^ a b c d e f g h i j k Freitas, Alex A. "A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery", Pontifícia Universidade Católica do Paraná, Retrieved on 2008-12-4.
  3. ^ a b c d e f g h i j k Jiawei Han, Micheline Kamber Data Mining: Concepts and Techniques (2006), Morgan Kaufmann, ISBN 1-55860-901-6


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FICO b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FICO new file mode 100644 index 00000000..7f18384e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FICO @@ -0,0 +1 @@ + FICO - Wikipedia, the free encyclopedia

FICO

From Wikipedia, the free encyclopedia
Jump to: navigation, search
FICO
Type Public company
Traded as NYSEFICO
Founded 1956
Headquarters San Jose, California
Key people William J Lansing CEO
Products FICO Score, TRIAD, Blaze Advisor, Falcon Fraud, Debt Manager, Model Builder
Website www.fico.com

FICO is a public company that provides analytics and decision making services—including credit scoring[1]—intended to help financial services companies make complex, high-volume decisions.[1]

Contents

History[edit]

A pioneer credit score company, FICO was founded in 1956 as Fair, Isaac and Company by engineer Bill Fair and mathematician Earl Isaac.[2] FICO was first headquartered in San Rafael, CA, United States.[3]

FICO sold its first credit scoring system two years after the company's creation,[2] and sales of similar systems soon followed. In 1987 FICO went public.[2] The first general-purpose FICO score was introduced in 1989, when BEACON debuted at Equifax.[2]

Originally called Fair, Isaac and Company, it was renamed Fair Isaac Corporation in 2003.[2] The company rebranded again in 2009, changing its name and ticker symbol to FICO.[4] [5]

FICO also sells a product called Falcon Fraud Manager for banks and corporations, which is a neural network-based application designed to fight fraud by proactively detecting unusual transaction patterns.[6]

Location[edit]

FICO has its headquarters in San Jose, California, United States[7] and has offices in Asia Pacific, Australia, Brazil, Canada, China, India, Korea, Malaysia, Russia, Singapore, Spain, Turkey, United Kingdom, and the USA. Its main offices are located in San Jose, CA, San Rafael, CA, and San Diego, CA.[8]

FICO score[edit]

A measure of credit risk, FICO scores are available through all of the major consumer reporting agencies in the United States and Canada: Equifax,[9] Experian,[9] TransUnion,[9] and PRBC.[10]

Clients[edit]

FICO provides its products and services to very large businesses and corporations across a number of fields, notably banking.

See also[edit]

References[edit]

  1. ^ a b About Us FICO Official Site
  2. ^ a b c d e History FICO Official Site
  3. ^ FICO score, Retrieved December 01, 2011.
  4. ^ Fair Isaac is Now FICO™ FICO Official Site
  5. ^ FICO Unveils New Ticker Symbol on New York Stock Exchange FICO Official Site
  6. ^ Kathryn Balint, "Fraud fighters," San Diego Union-Tribune, 18 February 2005.
  7. ^ Worldwide Locations FICO Official Site
  8. ^ FICO.com
  9. ^ a b c Credit Reporting Agencies FICO Official Site
  10. ^ Fair Isaac and PRBC Team Up to Enhance Credit Risk Tools Used by Mortgage Industry FICO Official Site

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FSA_Red_Algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FSA_Red_Algorithm new file mode 100644 index 00000000..84a10057 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/FSA_Red_Algorithm @@ -0,0 +1 @@ + FSA-Red Algorithm - Wikipedia, the free encyclopedia

FSA-Red Algorithm

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The FSA-Red Algorithm[1] is a data-reduction algorithm suited to building strong association rules with data mining methods such as the Apriori algorithm.

Contents

Setting[edit]

The FSA-Red Algorithm was introduced by Feri Sulianta at the International Conference of Information and Communication Technology (ICOICT) in Bandung, Indonesia, on Wednesday, March 20, 2013,[2] in a presentation titled "Mining food industry's multidimensional data to produce association rules using the Apriori algorithm as a basis of business strategy".[3] The algorithm is used for data reduction, or preprocessing, to minimize the set of attributes to be analyzed. The goal is to build strong association rules on the reduced data using data mining techniques. The preprocessing in FSA-Red applies several reduction techniques: attribute selection, row selection and feature selection. Row selection is done by deleting all flagged records related to the attribute under analysis; feature selection removes all unwanted attributes; and attribute selection eliminates non-valued attributes that do not need to be included. The underlying idea is that, however the reduction is done, the procedure has to take the other information in the dataset into account, so the reduction should be carried out systematically, considering the linkages between attributes. After the reduction process only a small-scale set of instances remains, with integrity preserved in the sense that no information is lost among the attributes of each selected instance.

Flowchart of the FSA-Red Algorithm combined with the association rule method using the Apriori algorithm.

Benefit[edit]

The flexibility of the FSA-Red Algorithm lies in the way the attribute is chosen: there is no restriction on which attributes may be excluded, meaning any attribute can be chosen as the basis of the reduction process, even one that is not the best compared to the others. This is the benefit of the reduction procedure, which may yield rich association patterns in the data.

See also[edit]

References[edit]

  1. ^ " FSA-Red Algorithm"ferisulianta.com (retrieved 24 April 2013)
  2. ^ " ICOICT 2013 " icoict.org (retrieved 24 April 2013)
  3. ^ " ICOICT 2013 Agenda and Presenter " icoict.org (retrieved 24 April 2013)


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Feature_vector b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Feature_vector new file mode 100644 index 00000000..737a2213 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Feature_vector @@ -0,0 +1 @@ + Feature vector - Wikipedia, the free encyclopedia

Feature vector

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image; when representing text, they might correspond to term occurrence frequencies. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression. Feature vectors are often combined with weights using a dot product in order to construct a linear predictor function that is used to determine a score for making a prediction.
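
As a minimal illustration of the last point, the snippet below combines a feature vector with a weight vector (the values are made up) via a dot product to obtain the score of a linear predictor function.

# Illustrative only: a feature vector combined with a weight vector via a dot
# product to produce the score of a linear predictor function.
features = [0.0, 1.0, 3.5, 2.0]      # n-dimensional numerical representation of an object
weights  = [0.4, -1.2, 0.8, 0.05]    # assumed to have been learned elsewhere

score = sum(w * x for w, x in zip(weights, features))
prediction = 1 if score > 0 else 0   # e.g. a binary decision derived from the score
print(score, prediction)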

The vector space associated with these vectors is often called the feature space. In order to reduce the dimensionality of the feature space, a number of dimensionality reduction techniques can be employed.

Higher-level features can be obtained from already available features and added to the feature vector; for example, in the study of diseases the feature 'Age' is useful and is defined as Age = 'Year of death' − 'Year of birth'. This process is referred to as feature construction.[1][2] Feature construction is the application of a set of constructive operators to a set of existing features, resulting in the construction of new features. Examples of such constructive operators include checking for the equality conditions {=, ≠}, the arithmetic operators {+, −, ×, /}, the array operators {max(S), min(S), average(S)}, as well as other more sophisticated operators, for example count(S,C)[3] that counts the number of features in the feature vector S satisfying some condition C, or, for example, distances to other recognition classes generalized by some accepting device. Feature construction has long been considered a powerful tool for increasing both accuracy and understanding of structure, particularly in high-dimensional problems.[4] Applications include studies of disease and emotion recognition from speech.[5]
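
A small sketch of such constructed features is given below; the record fields and the condition used inside count(S, C) are hypothetical.

# Hypothetical record; fields and values are made up for the example.
record = {"year_of_birth": 1921, "year_of_death": 1996,
          "measurements": [4.1, 7.3, 5.0, 9.8]}

# Arithmetic operator: the constructed feature 'Age'.
age = record["year_of_death"] - record["year_of_birth"]

# Array operators over a set S of existing features.
S = record["measurements"]
constructed = {
    "age": age,
    "max": max(S),
    "min": min(S),
    "average": sum(S) / len(S),
    # count(S, C): number of features in S satisfying a condition C (here: value > 5).
    "count_above_5": sum(1 for v in S if v > 5),
}
print(constructed)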

References[edit]

  1. ^ Liu, H., Motoda H. (1998) Feature Selection for Knowledge Discovery and Data Mining., Kluwer Academic Publishers. Norwell, MA, USA. 1998.
  2. ^ Piramuthu, S., Sikora R. T. Iterative feature construction for improving inductive learning algorithms. In Journal of Expert Systems with Applications. Vol. 36 , Iss. 2 (March 2009), pp. 3401-3406, 2009
  3. ^ Bloedorn, E., Michalski, R. Data-driven constructive induction: a methodology and its applications. IEEE Intelligent Systems, Special issue on Feature Transformation and Subset Selection, pp. 30-37, March/April, 1998
  4. ^ Breiman, L. Friedman, T., Olshen, R., Stone, C. (1984) Classification and regression trees, Wadsworth
  5. ^ Sidorova, J., Badia T. Syntactic learning for ESEDA.1, tool for enhanced speech emotion detection and analysis. Internet Technology and Secured Transactions Conference 2009 (ICITST-2009), London, November 9–12. IEEE

See also[edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Formal_concept_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Formal_concept_analysis new file mode 100644 index 00000000..fcbf05db --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Formal_concept_analysis @@ -0,0 +1 @@ + Formal concept analysis - Wikipedia, the free encyclopedia

Formal concept analysis

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In information science, formal concept analysis is a principled way of deriving a concept hierarchy or formal ontology from a collection of objects and their properties. Each concept in the hierarchy represents the set of objects sharing the same values for a certain set of properties; and each sub-concept in the hierarchy contains a subset of the objects in the concepts above it. The term was introduced by Rudolf Wille in 1984, and builds on applied lattice and order theory that was developed by Birkhoff and others in the 1930s.

Formal concept analysis finds practical application in fields including data mining, text mining, machine learning, knowledge management, semantic web, software development, and biology.

Contents

Overview and history[edit]

The original motivation of formal concept analysis was the concrete representation of complete lattices and their properties by means of formal contexts, data tables that represent binary relations between objects and attributes. In this theory, a formal concept is defined to be a pair consisting of a set of objects (the "extent") and a set of attributes (the "intent") such that the extent consists of all objects that share the given attributes, and the intent consists of all attributes shared by the given objects. In this way, formal concept analysis formalizes the notions of extension and intension.

Pairs of formal concepts may be partially ordered by the subset relation between their sets of objects, or equivalently by the superset relation between their sets of attributes. This ordering results in a graded system of sub- and superconcepts, a concept hierarchy, which can be displayed as a line diagram. The family of these concepts obeys the mathematical axioms defining a lattice, and is called more formally a concept lattice. In French this is called a treillis de Galois (Galois lattice) because the relation between the sets of objects and attributes is a Galois connection.

The theory in its present form goes back to the Darmstadt research group led by Rudolf Wille, Bernhard Ganter and Peter Burmeister, where formal concept analysis originated in the early 1980s. The mathematical basis, however, was already created by Garrett Birkhoff in the 1930s as part of the general lattice theory. Before the work of the Darmstadt group, there were already approaches in various French groups. Philosophical foundations of formal concept analysis refer in particular to Charles S. Peirce and the educationalist Hartmut von Hentig.

Motivation and philosophical background[edit]

In his article Restructuring Lattice Theory (1982), which initiated formal concept analysis as a mathematical discipline, Rudolf Wille starts from a discontent with the current lattice theory and pure mathematics in general: the production of theoretical results - often achieved by "elaborate mental gymnastics" - was impressive, but the connections between neighbouring domains, even between parts of a theory, were getting weaker.

Restructuring lattice theory is an attempt to reinvigorate connections with our general culture by interpreting the theory as concretely as possible, and in this way to promote better communication between lattice theorists and potential users of lattice theory.[1]

This aim traces back to Hartmut von Hentig, who in 1972 pleaded for restructuring the sciences in view of better teaching and in order to make the sciences mutually available and more generally (i.e. also without specialized knowledge) open to criticism.[2] Hence, by its origins formal concept analysis aims at interdisciplinarity and democratic control of research.[3]

It corrects the starting point of lattice theory during the development of formal logic in the 19th century. Then - and later in model theory - a concept as a unary predicate had been reduced to its extent. Now again, the philosophy of concepts should become less abstract by considering the intent. Hence, formal concept analysis is oriented towards the categories extension and intension of linguistics and classical conceptual logic.[4]

FCA aims at the clarity of concepts according to Charles S. Peirce's pragmatic maxim by unfolding observable, elementary properties of the subsumed objects.[3] In his late philosophy, Peirce assumed that logical thinking aims at perceiving reality, via the triad of concept, judgement and conclusion. Mathematics is an abstraction of logic; it develops patterns of possible realities and therefore may support rational communication. On this background, Wille defines:

The aim and meaning of Formal Concept Analysis as mathematical theory of concepts and concept hierarchies is to support the rational communication of humans by mathematically developing appropriate conceptual structures which can be logically activated.[5]

Example[edit]

A concept lattice for objects consisting of the integers from 1 to 10, and attributes composite (c), square (s), even (e), odd (o) and prime (p).

Consider O = {1,2,3,4,5,6,7,8,9,10}, and A = {composite, even, odd, prime, square}. The smallest concept including the number 3 is the one with objects {3,5,7} and attributes {odd, prime}, for 3 has both of those attributes and {3,5,7} is the set of objects having that set of attributes. The largest concept involving the attribute of being square is the one with objects {1,4,9} and attributes {square}, for 1, 4 and 9 are all the square numbers and all three of them have that set of attributes. It can readily be seen that both of these example concepts satisfy the formal definitions below.

The full set of concepts for these objects and attributes is shown in the illustration. It includes a concept for each of the original attributes: the composite numbers, square numbers, even numbers, odd numbers, and prime numbers. Additionally it includes concepts for the even composite numbers, composite square numbers (that is, all square numbers except 1), even composite squares, odd squares, odd composite squares, even primes, and odd primes.

Contexts and concepts[edit]

A (formal) context consists of a set of objects O, a set of unary attributes A, and an indication of which objects have which attributes. Formally it can be regarded as a bipartite graph I ⊆ O × A.

     composite  even  odd  prime  square
 1                     ✓            ✓
 2               ✓           ✓
 3                     ✓     ✓
 4       ✓       ✓                  ✓
 5                     ✓     ✓
 6       ✓       ✓
 7                     ✓     ✓
 8       ✓       ✓
 9       ✓             ✓            ✓
10       ✓       ✓

A (formal) concept for a context is defined to be a pair (Oi, Ai) such that

  1. Oi ⊆ O
  2. Ai ⊆ A
  3. every object in Oi has every attribute in Ai
  4. for every object in O that is not in Oi, there is an attribute in Ai that the object does not have
  5. for every attribute in A that is not in Ai, there is an object in Oi that does not have that attribute

Oi is called the extent of the concept, Ai the intent.

A context may be described as a table, with the objects corresponding to the rows of the table, the attributes corresponding to the columns of the table, and a Boolean value (in the example represented graphically as a checkmark) in cell (x, y) whenever object x has attribute y.

A concept, in this representation, forms a maximal subarray (not necessarily contiguous) such that all cells within the subarray are checked. For instance, the concept highlighted with a different background color in the example table is the one describing the odd prime numbers, and forms a 3 × 2 subarray in which all cells are checked.[6]
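
The sketch below spells this out for the 1-to-10 example context: it implements the two derivation operators (objects to shared attributes, attributes to common objects), verifies that ({3,5,7}, {odd, prime}) is a formal concept, and enumerates all concepts by closing every attribute subset, which is feasible at this tiny size.

from itertools import combinations

# The 1-to-10 context from the example, encoded as attribute -> set of objects.
objects = set(range(1, 11))
attributes = {
    "composite": {4, 6, 8, 9, 10},
    "even":      {2, 4, 6, 8, 10},
    "odd":       {1, 3, 5, 7, 9},
    "prime":     {2, 3, 5, 7},
    "square":    {1, 4, 9},
}

def intent(objs):            # derivation: attributes shared by all given objects
    return {a for a, ext in attributes.items() if objs <= ext}

def extent(attrs):           # derivation: objects that have all given attributes
    objs = set(objects)
    for a in attrs:
        objs &= attributes[a]
    return objs

Oi, Ai = {3, 5, 7}, {"odd", "prime"}
print(intent(Oi) == Ai and extent(Ai) == Oi)      # True: (Oi, Ai) is a formal concept

# Enumerate all concepts by closing every subset of attributes.
concepts = {(frozenset(extent(set(c))), frozenset(intent(extent(set(c)))))
            for r in range(len(attributes) + 1)
            for c in combinations(attributes, r)}
print(len(concepts), "formal concepts in this context")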

Concept lattice of a context[edit]

The concepts (Oi, Ai) defined above can be partially ordered by inclusion: if (Oi, Ai) and (Oj, Aj) are concepts, we define a partial order ≤ by saying that (Oi, Ai) ≤ (Oj, Aj) whenever Oi ⊆ Oj. Equivalently, (Oi, Ai) ≤ (Oj, Aj) whenever Aj ⊆ Ai.

Every pair of concepts in this partial order has a unique greatest lower bound (meet). The greatest lower bound of (Oi, Ai) and (Oj, Aj) is the concept with objects Oi ∩ Oj; it has as its attributes the union of Ai, Aj, and any additional attributes held by all objects in Oi ∩ Oj. Symmetrically, every pair of concepts in this partial order has a unique least upper bound (join). The least upper bound of (Oi, Ai) and (Oj, Aj) is the concept with attributes Ai ∩ Aj; it has as its objects the union of Oi, Oj, and any additional objects that have all attributes in Ai ∩ Aj.

These meet and join operations satisfy the axioms defining a lattice. In fact, by considering infinite meets and joins, analogously to the binary meets and joins defined above, one sees that this is a complete lattice. It may be viewed as the Dedekind–MacNeille completion of a partially ordered set of height two in which the elements of the partial order are the objects and attributes of the context and in which two elements x and y satisfy x ≤ y exactly when x is an object that has attribute y.

Any finite lattice may be generated as the concept lattice for some context. For, let L be a finite lattice, and form a context in which the objects and the attributes both correspond to elements of L. In this context, let object x have attribute y exactly when x and y are ordered as x ≤ y in the lattice. Then, the concept lattice of this context is isomorphic to L itself.[7] This construction may be interpreted as forming the Dedekind–MacNeille completion of L, which is known to produce an isomorphic lattice from any finite lattice.

Concept algebra of a context[edit]

Modelling negation in a formal context is somewhat problematic because the complement (O\Oi, A\Ai) of a concept (Oi, Ai) is in general not a concept. However, since the concept lattice is complete, one can consider the join (Oi, Ai)Δ of all concepts (Oj, Aj) that satisfy Oj ⊆ O\Oi; or dually the meet (Oi, Ai)𝛁 of all concepts satisfying Aj ⊆ A\Ai. These two operations are known as weak negation and weak opposition, respectively.

This can be expressed in terms of the derivative functions. The derivative of a set Oi ⊆ O of objects is the set Oi' ⊆ A of all attributes that hold for all objects in Oi. The derivative of a set Ai ⊆ A of attributes is the set Ai' ⊆ O of all objects that have all attributes in Ai. A pair (Oi, Ai) is a concept if and only if Oi' = Ai and Ai' = Oi. Using these functions, weak negation can be written as

(Oi, Ai)Δ = ((O\Oi)'', (O\Oi)'),

and weak opposition can be written as

(Oi, Ai)𝛁 = ((A\Ai)', (A\Ai)'').

The concept lattice equipped with the two additional operations Δ and 𝛁 is known as the concept algebra of a context. Concept algebras are a generalization of power sets.

Weak negation on a concept lattice L is a weak complementation, i.e. an order-reversing map Δ: L → L which satisfies the axioms xΔΔ ≤ x and (x ⋀ y) ⋁ (x ⋀ yΔ) = x. Weak opposition is a dual weak complementation. A (bounded) lattice such as a concept algebra, which is equipped with a weak complementation and a dual weak complementation, is called a weakly dicomplemented lattice. Weakly dicomplemented lattices generalize distributive orthocomplemented lattices, i.e. Boolean algebras.[8][9]

Recovering the context from the line diagram[edit]

The line diagram of the concept lattice encodes enough information to recover the original context from which it was formed. Each object of the context corresponds to a lattice element, the element with the minimal object set that contains that object, and with an attribute set consisting of all attributes of the object. Symmetrically, each attribute of the context corresponds to a lattice element, the one with the minimal attribute set containing that attribute, and with an object set consisting of all objects with that attribute. We may label the nodes of the line diagram with the objects and attributes they correspond to; with this labeling, object x has attribute y if and only if there exists a monotonic path from x to y in the diagram.[10]

Efficient construction[edit]

Kuznetsov & Obiedkov (2001) survey the many algorithms that have been developed for constructing concept lattices. These algorithms vary in many details, but are in general based on the idea that each edge of the line diagram of the concept lattice connects some concept C to the concept formed by the join of C with a single object. Thus, one can build up the concept lattice one concept at a time, by finding the neighbors in the line diagram of known concepts, starting from the concept with an empty set of objects. The amount of time spent to traverse the entire concept lattice in this way is polynomial in the number of input objects and attributes per generated concept.

Tools[edit]

Many FCA software applications are available today. The main purpose of these tools varies from formal context creation to formal concept mining and generating the concept lattice of a given formal context and the corresponding association rules. Most of these tools are academic and still under active development. A non-exhaustive list of FCA tools can be found on the FCA software website. Most of these tools are open-source applications like ConExp, ToscanaJ, Lattice Miner,[11] Coron, FcaBedrock, etc.

See also[edit]

Notes[edit]

  1. ^ Rudolf Wille: Restructuring lattice theory: An approach based on hierarchies of concepts. Reprint in: ICFCA '09: Proceedings of the 7th International Conference on Formal Concept Analysis, Berlin, Heidelberg, 2009, p. 314.
  2. ^ Hartmut von Hentig: Magier oder Magister? Über die Einheit der Wissenschaft im Verständigungsprozeß. Klett 1972 / Suhrkamp 1974. Cited after Karl Erich Wolff: Ordnung, Wille und Begriff, Ernst Schröder Zentrum für Begriffliche Wissensverarbeitung, Darmstadt 2003.
  3. ^ a b Johannes Wollbold: Attribute Exploration of Gene Regulatory Processes. PhD thesis, University of Jena 2011, p. 9
  4. ^ Bernhard Ganter, Bernhard and Rudolf Wille: Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, ISBN 3-540-62771-5, p. 1
  5. ^ Rudolf Wille: Formal Concept Analysis as Mathematical Theory of Concepts and Concept Hierarchies. In: B. Ganter et al.: Formal Concept Analysis. Foundations and Applications, Springer, 2005, p. 1f.
  6. ^ Wolff, section 2.
  7. ^ Stumme, Theorem 1.
  8. ^ Wille, Rudolf (2000), "Boolean Concept Logic", in Ganter, B.; Mineau, G. W., ICCS 2000 Conceptual Structures: Logical, Linguistic and Computational Issues, LNAI 1867, Springer, pp. 317–331, ISBN 978-3-540-67859-5 .
  9. ^ Kwuida, Léonard (2004), Dicomplemented Lattices. A contextual generalization of Boolean algebras, Shaker Verlag, ISBN 978-3-8322-3350-1 
  10. ^ Wolff, section 3.
  11. ^ Boumedjout Lahcen and Leonard Kwuida. Lattice Miner: A Tool for Concept Lattice Construction and Exploration. In Suplementary Proceeding of International Conference on Formal concept analysis (ICFCA'10), 2010

References[edit]

  • Ganter, Bernhard; Stumme, Gerd; Wille, Rudolf, eds. (2005), Formal Concept Analysis: Foundations and Applications, Lecture Notes in Artificial Intelligence, no. 3626, Springer-Verlag, ISBN 3-540-27891-5 
  • Ganter, Bernhard; Wille, Rudolf (1998), Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin, ISBN 3-540-62771-5 . Translated by C. Franzke.
  • Carpineto, Claudio; Romano, Giovanni (2004), Concept Data Analysis: Theory and Applications, Wiley, ISBN 978-0-470-85055-8 .
  • Kuznetsov, Sergei O.; Obiedkov, Sergei A. (2001), "Algorithms for the Construction of Concept Lattices and Their Diagram Graphs", Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science 2168, Springer-Verlag, pp. 289–300, doi:10.1007/3-540-44794-6_24, ISBN 978-3-540-42534-2 .

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/GSP_Algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/GSP_Algorithm new file mode 100644 index 00000000..ebe60440 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/GSP_Algorithm @@ -0,0 +1 @@ + GSP Algorithm - Wikipedia, the free encyclopedia

GSP Algorithm

From Wikipedia, the free encyclopedia
Jump to: navigation, search

GSP Algorithm (Generalized Sequential Pattern algorithm) is an algorithm used for sequence mining. The algorithms for solving sequence mining problems are mostly based on the a priori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion. It simply means counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end of this step, each transaction consists of only the frequent elements it originally contained. This modified database becomes an input to the GSP algorithm. This process requires one pass over the whole database.

GSP Algorithm makes multiple database passes. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences are formed, and another pass is made to identify their frequency. The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found. There are two main steps in the algorithm.

  • Candidate Generation. Given the set of frequent (k−1)-sequences F(k−1), the candidates for the next pass are generated by joining F(k−1) with itself. A pruning phase eliminates any sequence at least one of whose subsequences is not frequent.
  • Support Counting. Normally, a hash tree–based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.

Algorithm[edit]

F1 = the set of frequent 1-sequences
k = 2
While F(k−1) is not empty do
    Generate the candidate set Ck (the set of candidate k-sequences) from F(k−1)
    For each input sequence s in the database D do
        Increment the count of every candidate a in Ck that s supports
    End do
    Fk = {a Є Ck such that its frequency exceeds the threshold}
    k = k + 1
End do
Result = the set of all frequent sequences, i.e. the union of all Fk

The above algorithm looks like the Apriori algorithm. One main difference, however, is the generation of candidate sets. Let us assume that:

A → B and A → C

are two frequent 2-sequences. The items involved in these sequences are (A, B) and (A, C) respectively. Candidate generation in the usual Apriori style would give (A, B, C) as a 3-itemset, but in the present context we get the following 3-sequences as a result of joining the above 2-sequences:

A → B → C, A → C → B and A → BC

The candidate–generation phase takes this into account. The GSP algorithm discovers frequent sequences, allowing for time constraints such as maximum gap and minimum gap among the sequence elements. Moreover, it supports the notion of a sliding window, i.e., of a time interval within which items are observed as belonging to the same event, even if they originate from different events.
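
The following rough sketch reproduces the joining example above. Sequences are represented as tuples of events (each event a tuple of items); the prefix-based join used here is an illustrative simplification rather than a full GSP implementation, and the pruning and time-constraint checks are omitted.

def drop_last_item(seq):
    # Remove the final item of a sequence (a tuple of events, each a tuple of items).
    last_event = seq[-1]
    if len(last_event) == 1:
        return seq[:-1], last_event[0]
    return seq[:-1] + (last_event[:-1],), last_event[-1]

def join_on_common_prefix(s1, s2):
    # Join two (k-1)-sequences that agree on everything but their last item.
    prefix1, _ = drop_last_item(s1)
    prefix2, last2 = drop_last_item(s2)
    if s1 == s2 or prefix1 != prefix2:
        return []
    seq_ext = s1 + ((last2,),)                                  # e.g. A -> B -> C
    set_ext = s1[:-1] + (tuple(sorted(s1[-1] + (last2,))),)     # e.g. A -> BC
    return [seq_ext, set_ext]

f2 = [(("A",), ("B",)), (("A",), ("C",))]     # the two frequent 2-sequences above
candidates = set()
for s1 in f2:
    for s2 in f2:
        candidates.update(join_on_common_prefix(s1, s2))
for c in sorted(candidates):
    print(" -> ".join("".join(event) for event in c))   # the three candidate 3-sequences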

See Also[edit]

Sequence mining

References[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Gene_expression_programming b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Gene_expression_programming new file mode 100644 index 00000000..91865fd2 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Gene_expression_programming @@ -0,0 +1 @@ + Gene expression programming - Wikipedia, the free encyclopedia

Gene expression programming

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Gene expression programming (GEP) is an evolutionary algorithm that creates computer programs or models. These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and composition, much like a living organism. And like living organisms, the computer programs of GEP are also encoded in simple linear chromosomes of fixed length. Thus, GEP is a genotype-phenotype system, benefiting from a simple genome to keep and transmit the genetic information and a complex phenotype to explore the environment and adapt to it.

Contents

Background[edit]

Evolutionary algorithms use populations of individuals, select individuals according to fitness, and introduce genetic variation using one or more genetic operators. Their use in artificial computational systems dates back to the 1950s where they were used to solve optimization problems (e.g. Box 1957[1] and Friedman 1959[2]). But it was with the introduction of evolution strategies by Rechenberg in 1965[3] that evolutionary algorithms gained popularity. A good overview text on evolutionary algorithms is the book “An Introduction to Genetic Algorithms” by Mitchell (1996).[4]

Gene expression programming[5] belongs to the family of evolutionary algorithms and is closely related to genetic algorithms and genetic programming. From genetic algorithms it inherited the linear chromosomes of fixed length; and from genetic programming it inherited the expressive parse trees of varied sizes and shapes.

In gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype, creating a genotype/phenotype system. This genotype/phenotype system is multigenic, thus encoding multiple parse trees in each chromosome. This means that the computer programs created by GEP are composed of multiple parse trees. Because these parse trees are the result of gene expression, in GEP they are called expression trees.

Encoding: the genotype[edit]

The genome of gene expression programming consists of a linear, symbolic string or chromosome of fixed length composed of one or more genes of equal size. These genes, despite their fixed length, code for expression trees of different sizes and shapes. An example of a chromosome with two genes, each of size 9, is the string (position zero indicates the start of each gene):

012345678012345678
L+a-baccd**cLabacd

where “L” represents the natural logarithm function and “a”, “b”, “c”, and “d” represent the variables and constants used in a problem.

Expression trees: the phenotype[edit]

As shown above, the genes of gene expression programming all have the same size. However, these fixed-length strings code for expression trees of different sizes. This means that the size of the coding regions varies from gene to gene, allowing for adaptation and evolution to occur smoothly.

For example, the mathematical expression:

√((a − b)(c + d))

can also be represented as an expression tree:

[Figure: expression tree for the k-expression Q*-+abcd]

where “Q” represents the square root function.

This kind of expression tree consists of the phenotypic expression of GEP genes, whereas the genes are linear strings encoding these complex structures. For this particular example, the linear string corresponds to:

01234567
Q*-+abcd

which is the straightforward reading of the expression tree from top to bottom and from left to right. These linear strings are called k-expressions (from Karva notation).

Going from k-expressions to expression trees is also very simple. For example, the following k-expression:

01234567890
Q*b**+baQba

is composed of two different terminals (the variables “a” and “b”), two different functions of two arguments (“*” and “+”), and a function of one argument (“Q”). Its expression gives:

[Figure: expression tree for the k-expression Q*b**+baQba]
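
A minimal sketch of this breadth-first reading is given below: each level of the tree consumes as many symbols from the k-expression as the total arity of the previous level. The arity table and the toy variable values are assumptions made for the example.

import math

ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}   # terminals such as a, b, c, d have arity 0

def parse_kexpression(kexpr):
    # Build the tree level by level; each level consumes as many symbols as the
    # total arity of the previous level. Unused trailing symbols are ignored.
    nodes = [[sym, []] for sym in kexpr]           # each node: [symbol, children]
    level, pos = [nodes[0]], 1
    while level and pos < len(nodes):
        next_level = []
        for node in level:
            for _ in range(ARITY.get(node[0], 0)):
                child = nodes[pos]
                pos += 1
                node[1].append(child)
                next_level.append(child)
        level = next_level
    return nodes[0]                                # the root node

def evaluate(node, env):
    sym, children = node
    if sym == "Q":
        return math.sqrt(evaluate(children[0], env))
    if sym in "+-*/":
        a, b = (evaluate(c, env) for c in children)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[sym]
    return env[sym]                                # a terminal: look up its value

root = parse_kexpression("Q*-+abcd")
print(evaluate(root, {"a": 5.0, "b": 2.0, "c": 1.0, "d": 6.0}))   # sqrt((a - b) * (c + d))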

K-expressions and genes[edit]

The k-expressions of gene expression programming correspond to the region of genes that gets expressed. This means that there might be sequences in the genes that are not expressed, which is indeed true for most genes. The reason for these noncoding regions is to provide a buffer of terminals so that all k-expressions encoded in GEP genes always correspond to valid programs or expressions.

The genes of gene expression programming are therefore composed of two different domains – a head and a tail – each with different properties and functions. The head is used mainly to encode the functions and variables chosen to solve the problem at hand, whereas the tail, while also used to encode the variables, provides essentially a reservoir of terminals to ensure that all programs are error-free.

For GEP genes the length of the tail is given by the formula:

t = h (nmax − 1) + 1

where h is the head’s length and nmax is maximum arity. For example, for a gene created using the set of functions F = {Q, +, −, *, /} and the set of terminals T = {a, b}, nmax = 2. And if we choose a head length of 15, then t = 15 (2 − 1) + 1 = 16, which gives a gene length g of 15 + 16 = 31. The randomly generated string below is an example of one such gene:

0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa

It encodes the expression tree:

[Figure: expression tree for the k-expression *b+a-aQa]

which, in this case, only uses 8 of the 31 elements that constitute the gene.
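
The small check below recomputes these numbers, assuming the arities used in the examples (Q is unary; +, −, *, / are binary; a and b are terminals).

ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}   # terminals a, b have arity 0

h, n_max = 15, 2
t = h * (n_max - 1) + 1          # tail length: 16
g = h + t                        # gene length: 31
print(t, g)

gene = "*b+a-aQab+//+b+babbabbbababbaaa"
assert len(gene) == g

# Walk the gene breadth-first, counting only the symbols that get expressed.
needed, used = 1, 0
while needed:
    level = gene[used:used + needed]
    used += needed
    needed = sum(ARITY.get(s, 0) for s in level)
print(used)                      # 8 of the 31 elements are expressed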

It’s not hard to see that, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, with the simplest composed of only one node (when the first element of a gene is a terminal) and the largest composed of as many nodes as there are elements in the gene (when all the elements in the head are functions with maximum arity).

It’s also not hard to see that it is trivial to implement all kinds of genetic modification (mutation, inversion, insertion, recombination, and so on) with the guarantee that all resulting offspring encode correct, error-free programs.

Multigenic chromosomes[edit]

The chromosomes of gene expression programming are usually composed of more than one gene of equal length. Each gene codes for a sub-expression tree (sub-ET) or sub-program. Then the sub-ETs can interact with one another in different ways, forming a more complex program. The figure shows an example of a program composed of three sub-ETs.

Expression of GEP genes as sub-ETs. a) A three-genic chromosome with the tails shown in bold. b) The sub-ETs encoded by each gene.

In the final program the sub-ETs could be linked by addition or some other function, as there are no restrictions to the kind of linking function one might choose. Some examples of more complex linkers include taking the average, the median, the midrange, thresholding their sum to make a binomial classification, applying the sigmoid function to compute a probability, and so on. These linking functions are usually chosen a priori for each problem, but they can also be evolved elegantly and efficiently by the cellular system[6][7] of gene expression programming.

Cells and code reuse[edit]

In gene expression programming, homeotic genes control the interactions of the different sub-ETs or modules of the main program. The expression of such genes results in different main programs or cells, that is, they determine which genes are expressed in each cell and how the sub-ETs of each cell interact with one another. In other words, homeotic genes determine which sub-ETs are called upon and how often in which main program or cell and what kind of connections they establish with one another.

Homeotic genes and the cellular system[edit]

Homeotic genes have exactly the same kind of structural organization as normal genes and they are built using an identical process. They also contain a head domain and a tail domain, with the difference that the heads contain now linking functions and a special kind of terminals – genic terminals – that represent the normal genes. The expression of the normal genes results as usual in different sub-ETs, which in the cellular system are called ADFs (automatically defined functions). As for the tails, they contain only genic terminals, that is, derived features generated on the fly by the algorithm.

For example, the chromosome in the figure has three normal genes and one homeotic gene and encodes a main program that invokes three different functions a total of four times, linking them in a particular way.

Expression of a unicellular system with three ADFs. a) The chromosome composed of three conventional genes and one homeotic gene (shown in bold). b) The ADFs encoded by each conventional gene. c) The main program or cell.

From this example it is clear that the cellular system not only allows the unconstrained evolution of linking functions but also code reuse. And it shouldn't be hard to implement recursion in this system.

Multiple main programs and multicellular systems[edit]

Multicellular systems are composed of more than one homeotic gene. Each homeotic gene in this system puts together a different combination of sub-expression trees or ADFs, creating multiple cells or main programs.

For example, the program shown in the figure was created using a cellular system with two cells and three normal genes.

Expression of a multicellular system with three ADFs and two main programs. a) The chromosome composed of three conventional genes and two homeotic genes (shown in bold). b) The ADFs encoded by each conventional gene. c) Two different main programs expressed in two different cells.

The applications of these multicellular systems are multiple and varied and, like the multigenic systems, they can be used both in problems with just one output and in problems with multiple outputs.

Other levels of complexity[edit]

The head/tail domain of GEP genes (both normal and homeotic) is the basic building block of all GEP algorithms. However, gene expression programming also explores other chromosomal organizations that are more complex than the head/tail structure. Essentially these complex structures consist of functional units or genes with a basic head/tail domain plus one or more extra domains. These extra domains usually encode random numerical constants that the algorithm relentlessly fine-tunes in order to find a good solution. For instance, these numerical constants may be the weights or factors in a function approximation problem (see the GEP-RNC algorithm below); they may be the weights and thresholds of a neural network (see the GEP-NN algorithm below); the numerical constants needed for the design of decision trees (see the GEP-DT algorithm below); the weights needed for polynomial induction; or the random numerical constants used to discover the parameter values in a parameter optimization task.

The basic gene expression algorithm[edit]

The fundamental steps of the basic gene expression algorithm are listed below in pseudocode:

1. Select function set;
2. Select terminal set;
3. Load dataset for fitness evaluation;
4. Create chromosomes of initial population randomly;
5. For each program in population:
a) Express chromosome;
b) Execute program;
c) Evaluate fitness;
6. Verify stop condition;
7. Select programs;
8. Replicate selected programs to form the next population;
9. Modify chromosomes using genetic operators;
10. Go to step 5.

The first four steps prepare all the ingredients that are needed for the iterative loop of the algorithm (steps 5 through 10). Of these preparative steps, the crucial one is the creation of the initial population, which is created randomly using the elements of the function and terminal sets.

Populations of programs[edit]

Like all evolutionary algorithms, gene expression programming works with populations of individuals, which in this case are computer programs. Therefore some kind of initial population must be created to get things started. Subsequent populations are descendants, via selection and genetic modification, of the initial population.

In the genotype/phenotype system of gene expression programming, it is only necessary to create the simple linear chromosomes of the individuals without worrying about the structural soundness of the programs they code for, as their expression always results in syntactically correct programs.

Fitness functions and the selection environment[edit]

Fitness functions and selection environments (called training datasets in machine learning) are the two facets of fitness and are therefore intricately connected. Indeed, the fitness of a program depends not only on the cost function used to measure its performance but also on the training data chosen to evaluate fitness.

The selection environment or training data[edit]

The selection environment consists of the set of training records, which are also called fitness cases. These fitness cases could be a set of observations or measurements concerning some problem, and they form what is called the training dataset.

The quality of the training data is essential for the evolution of good solutions. A good training set should be representative of the problem at hand and also well-balanced, otherwise the algorithm might get stuck at some local optimum. In addition, it is also important to avoid using unnecessarily large datasets for training as this will slow things down unnecessarily. A good rule of thumb is to choose enough records for training to enable a good generalization in the validation data and leave the remaining records for validation and testing.

Fitness functions[edit]

Broadly speaking, there are essentially three different kinds of problems based on the kind of prediction being made:

1. Problems involving numeric (continuous) predictions;
2. Problems involving categorical or nominal predictions, both binomial and multinomial;
3. Problems involving binary or Boolean predictions.

The first type of problem goes by the name of regression; the second is known as classification, with logistic regression as a special case where, besides the crisp classifications like “Yes” or “No”, a probability is also attached to each outcome; and the last one is related to Boolean algebra and logic synthesis.

Fitness functions for regression[edit]

In regression, the response or dependent variable is numeric (usually continuous) and therefore the output of a regression model is also continuous. So it’s quite straightforward to evaluate the fitness of the evolving models by comparing the output of the model to the value of the response in the training data.

There are several basic fitness functions for evaluating model performance, with the most common being based on the error or residual between the model output and the actual value. Such functions include the mean squared error, root mean squared error, mean absolute error, relative squared error, root relative squared error, relative absolute error, and others.
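
As an illustration, the snippet below computes several of these error measures for a small made-up set of predictions and targets.

import math

predicted = [2.1, 3.9, 5.2, 7.8]     # model outputs (made up)
actual    = [2.0, 4.0, 5.0, 8.0]     # target values (made up)
n = len(actual)
mean_actual = sum(actual) / n

errors = [p - a for p, a in zip(predicted, actual)]
mse  = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)
mae  = sum(abs(e) for e in errors) / n
# The relative variants compare against always predicting the mean of the targets.
rse  = sum(e * e for e in errors) / sum((a - mean_actual) ** 2 for a in actual)
rae  = sum(abs(e) for e in errors) / sum(abs(a - mean_actual) for a in actual)
print(mse, rmse, mae, rse, rae)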

All these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications. But some problems might require a coarser evolution, such as determining if a prediction is within a certain interval, for instance less than 10% of the actual value. However, even if one is only interested in counting the hits (that is, a prediction that is within the chosen interval), making populations of models evolve based on just the number of hits each program scores is usually not very efficient due to the coarse granularity of the fitness landscape. Thus the solution usually involves combining these coarse measures with some kind of smooth function such as the standard error measures listed above.

Fitness functions based on the correlation coefficient and R-square are also very smooth. For regression problems, these functions work best by combining them with other measures because, by themselves, they only tend to measure correlation, not caring for the range of values of the model output. So by combining them with functions that work at approximating the range of the target values, they form very efficient fitness functions for finding models with good correlation and good fit between predicted and actual values.

Fitness functions for classification and logistic regression[edit]

The design of fitness functions for classification and logistic regression takes advantage of three different characteristics of classification models. The most obvious is just counting the hits, that is, if a record is classified correctly it is counted as a hit. This fitness function is very simple and works well for simple problems, but for more complex problems or highly unbalanced datasets it gives poor results.

One way to improve this type of hits-based fitness function consists of expanding the notion of correct and incorrect classifications. In a binary classification task, correct classifications can be 00 or 11. The “00” representation means that a negative case (represented by “0”) was correctly classified, whereas the “11” means that a positive case (represented by “1”) was correctly classified. Classifications of the type “00” are called true negatives (TN) and “11” true positives (TP).

There are also two types of incorrect classifications and they are represented by 01 and 10. They are called false positives (FP) when the actual value is 0 and the model predicts a 1; and false negatives (FN) when the target is 1 and the model predicts a 0. The counts of TP, TN, FP, and FN are usually kept on a table known as the confusion matrix.

Confusion matrix for a binomial classification task.

So by counting the TP, TN, FP, and FN and further assigning different weights to these four types of classifications, it is possible to create smoother and therefore more efficient fitness functions. Some popular fitness functions based on the confusion matrix include sensitivity/specificity, recall/precision, F-measure, Jaccard similarity, Matthews correlation coefficient, and cost/gain matrix which combines the costs and gains assigned to the 4 different types of classifications.
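
The snippet below computes a few of the listed measures from assumed confusion-matrix counts.

import math

TP, TN, FP, FN = 40, 45, 5, 10       # assumed counts for illustration

sensitivity = TP / (TP + FN)         # also called recall
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
jaccard     = TP / (TP + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))   # Matthews correlation coefficient
print(sensitivity, specificity, precision, f_measure, jaccard, mcc)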

These functions based on the confusion matrix are quite sophisticated and are adequate to solve most problems efficiently. But there is another dimension to classification models which is key to exploring more efficiently the solution space and therefore results in the discovery of better classifiers. This new dimension involves exploring the structure of the model itself, which includes not only the domain and range, but also the distribution of the model output and the classifier margin.

By exploring this other dimension of classification models and then combining the information about the model with the confusion matrix, it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space. For instance, one can combine some measure based on the confusion matrix with the mean squared error evaluated between the raw model outputs and the actual values. Or combine the F-measure with the R-square evaluated for the raw model output and the target; or the cost/gain matrix with the correlation coefficient, and so on. More exotic fitness functions that explore model granularity include the area under the ROC curve and rank measure.

Also related to this new dimension of classification models is the idea of assigning probabilities to the model output, which is what is done in logistic regression. Then it is also possible to use these probabilities and evaluate the mean squared error (or some other similar measure) between the probabilities and the actual values, then combine this with the confusion matrix to create very efficient fitness functions for logistic regression. Popular examples of fitness functions based on the probabilities include maximum likelihood estimation and hinge loss.

Fitness functions for Boolean problems[edit]

In logic there is no model structure (as defined above for classification and logistic regression) to explore: the domain and range of logical functions comprise only 0's and 1's or false and true. So, the fitness functions available for Boolean algebra can only be based on the hits or on the confusion matrix as explained in the section above.

Selection and elitism[edit]

Roulette-wheel selection is perhaps the most popular selection scheme used in evolutionary computation. It involves mapping the fitness of each program to a slice of the roulette wheel proportional to its fitness. Then the roulette is spun as many times as there are programs in the population in order to keep the population size constant. So, with roulette-wheel selection programs are selected both according to fitness and the luck of the draw, which means that sometimes the best traits might be lost. However, by combining roulette-wheel selection with the cloning of the best program of each generation, one guarantees that at least the very best traits are not lost. This technique of cloning the best-of-generation program is known as simple elitism and is used by most stochastic selection schemes.
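
A minimal sketch of roulette-wheel selection with simple elitism is shown below; the toy population and fitness values are placeholders.

import random

population = ["prog1", "prog2", "prog3", "prog4"]     # placeholders for programs
fitness    = {"prog1": 1.0, "prog2": 4.0, "prog3": 2.0, "prog4": 3.0}

def roulette_select(pop, fit):
    # Each program gets a slice of the wheel proportional to its fitness.
    total = sum(fit[p] for p in pop)
    r = random.uniform(0, total)
    running = 0.0
    for p in pop:
        running += fit[p]
        if running >= r:
            return p
    return pop[-1]

best = max(population, key=fitness.get)
next_generation = [best]                              # simple elitism: clone the best program
while len(next_generation) < len(population):
    next_generation.append(roulette_select(population, fitness))
print(next_generation)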

Reproduction with modification[edit]

The reproduction of programs involves first the selection and then the reproduction of their genomes. Genome modification is not required for reproduction, but without it adaptation and evolution won’t take place.

Replication and selection[edit]

The selection operator selects the programs for the replication operator to copy. Depending on the selection scheme, the number of copies a program gives rise to may vary, with some programs being copied more than once while others are copied only once or not at all. In addition, selection is usually set up so that the population size remains constant from one generation to the next.

The replication of genomes in nature is very complex and it took scientists a long time to discover the DNA double helix and propose a mechanism for its replication. But the replication of strings is trivial in artificial evolutionary systems, where only an instruction to copy strings is required to pass all the information in the genome from generation to generation.

The replication of the selected programs is a fundamental piece of all artificial evolutionary systems, but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction, but rather with a few errors thrown in. Indeed, genetic diversity is created with genetic operators such as mutation, recombination, transposition, inversion, and many others.

Mutation[edit]

In gene expression programming mutation is the most important genetic operator.[8] It changes genomes by replacing one element with another. The accumulation of many small changes over time can create great diversity.

In gene expression programming mutation is totally unconstrained, which means that in each gene domain any domain symbol can be replaced by another. For example, in the heads of genes any function can be replaced by a terminal or another function, regardless of the number of arguments in this new function; and a terminal can be replaced by a function or another terminal.
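
A sketch of such a point-mutation operator for a single head/tail gene (the function and terminal sets here are illustrative placeholders):

  import random

  FUNCTIONS = ['+', '-', '*', '/']   # illustrative function set
  TERMINALS = ['a', 'b']             # illustrative terminal set

  def mutate(gene, head, rate, rng=random):
      # In the head any symbol may become a function or a terminal; in the
      # tail only terminals are allowed, so the gene always stays valid.
      symbols = list(gene)
      for i in range(len(symbols)):
          if rng.random() < rate:
              pool = FUNCTIONS + TERMINALS if i < head else TERMINALS
              symbols[i] = rng.choice(pool)
      return ''.join(symbols)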

Recombination[edit]

Recombination usually involves two parent chromosomes to create two new chromosomes by combining different parts from the parent chromosomes. And as long as the parent chromosomes are aligned and the exchanged fragments are homologous (that is, occupy the same position in the chromosome), the new chromosomes created by recombination will always encode syntactically correct programs.

Different kinds of crossover are easily implemented either by changing the number of parents involved (there’s no reason for choosing only two); the number of split points; or the way one chooses to exchange the fragments, for example, either randomly or in some orderly fashion. For example, gene recombination, which is a special case of recombination, can be done by exchanging homologous genes (genes that occupy the same position in the chromosome) or by exchanging genes chosen at random from any position in the chromosome.
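
Two of these variants, sketched for chromosomes represented as strings of equal length and identical organisation (gene_length is the full length of one gene, including all its domains):

  import random

  def one_point_recombination(parent1, parent2, rng=random):
      # Swap everything downstream of a randomly chosen split point; because
      # both parents share the same chromosome organisation, the offspring
      # always encode syntactically correct programs.
      point = rng.randrange(1, len(parent1))
      return (parent1[:point] + parent2[point:],
              parent2[:point] + parent1[point:])

  def gene_recombination(parent1, parent2, gene_length, rng=random):
      # Exchange one entire homologous gene (same position in both parents).
      g = rng.randrange(len(parent1) // gene_length)
      lo, hi = g * gene_length, (g + 1) * gene_length
      return (parent1[:lo] + parent2[lo:hi] + parent1[hi:],
              parent2[:lo] + parent1[lo:hi] + parent2[hi:])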

Transposition[edit]

Transposition involves the introduction of an insertion sequence somewhere in a chromosome. In gene expression programming insertion sequences may originate anywhere in the chromosome, but they are inserted only in the heads of genes. This method guarantees that even insertion sequences taken from the tails result in error-free programs.

For transposition to work properly, it must preserve chromosome length and gene structure. So, in gene expression programming transposition can be implemented using two different methods: the first creates a shift at the insertion site, followed by a deletion at the end of the head; the second overwrites the local sequence at the target site and therefore is easier to implement. Both methods can be implemented to operate between chromosomes or within a chromosome or even within a single gene.
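
A sketch of the first (shift-and-delete) variant for a plain head/tail gene with no extra domains; the insertion site is assumed here to be any head position after the first one:

  import random

  def is_transposition(gene, head, max_length=3, rng=random):
      # Copy a short fragment from anywhere in the gene into the head,
      # shifting the head at the insertion site and deleting symbols at the
      # end of the head so that gene length and structure are preserved.
      symbols = list(gene)
      length = rng.randint(1, max_length)
      start = rng.randrange(0, len(symbols) - length + 1)
      fragment = symbols[start:start + length]
      target = rng.randrange(1, head)
      new_head = (symbols[:target] + fragment + symbols[target:head])[:head]
      return ''.join(new_head + symbols[head:])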

Inversion[edit]

Inversion is an interesting operator, especially powerful for combinatorial optimization.[9] It consists of inverting a small sequence within a chromosome.

In gene expression programming it can be easily implemented in all gene domains and, in all cases, the offspring produced is always syntactically correct. For any gene domain, a sequence (ranging from at least two elements to as big as the domain itself) is chosen at random within that domain and then inverted.
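
A sketch for one domain of a gene, given as a half-open index range [domain_start, domain_end):

  import random

  def invert(gene, domain_start, domain_end, rng=random):
      # Reverse a randomly chosen sequence of at least two symbols that lies
      # entirely inside the chosen domain; the rest of the gene is untouched.
      symbols = list(gene)
      start = rng.randrange(domain_start, domain_end - 1)
      end = rng.randrange(start + 2, domain_end + 1)
      symbols[start:end] = reversed(symbols[start:end])
      return ''.join(symbols)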

Other genetic operators[edit]

Several other genetic operators exist and in gene expression programming, with its different genes and gene domains, the possibilities are endless. For example, genetic operators such as one-point recombination, two-point recombination, gene recombination, uniform recombination, gene transposition, root transposition, domain-specific mutation, domain-specific inversion, domain-specific transposition, and so on, are easily implemented and widely used.

The GEP-RNC algorithm[edit]

Numerical constants are essential elements of mathematical and statistical models and therefore it is important to allow their integration in the models designed by evolutionary algorithms.

Gene expression programming solves this problem very elegantly through the use of an extra gene domain – the Dc – for handling random numerical constants (RNC). By combining this domain with a special terminal placeholder for the RNCs, a richly expressive system can be created.

Structurally, the Dc comes after the tail, has a length equal to the size of the tail t, and is composed of the symbols used to represent the RNCs.

For example, below is shown a simple chromosome composed of only one gene with a head size of 7 (the Dc stretches over positions 15–22):

01234567890123456789012
+?*+?**aaa??aaa68083295

where the terminal “?” represents the placeholder for the RNCs. This kind of chromosome is expressed exactly as shown above, giving:

[Figure: GEP expression tree with placeholder for RNCs]

Then the ?’s in the expression tree are replaced from left to right and from top to bottom by the symbols (for simplicity represented by numerals) in the Dc, giving:

[Figure: GEP expression tree with symbols (numerals) for RNCs]

The values corresponding to these symbols are kept in an array. (For simplicity, the number represented by the numeral indicates the order in the array.) For instance, for the following 10 element array of RNCs:

C = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}

the expression tree above gives:

[Figure: GEP expression tree with RNCs]

This elegant structure for handling random numerical constants is at the heart of different GEP systems, such as GEP neural networks and GEP decision trees.
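
The expression mechanism just illustrated can be sketched as follows (the function set and its arities are assumptions matching the example; the tree is built breadth-first from the K-expression, and the '?' placeholders are then bound to constants in the order described above):

  from operator import add, sub, mul, truediv

  ARITY = {'+': 2, '-': 2, '*': 2, '/': 2}                    # assumed function set
  FUNC  = {'+': add, '-': sub, '*': mul, '/': truediv}

  def parse_kexpression(kexpr):
      # Build the expression tree by reading the K-expression breadth-first
      # (level by level); surplus tail symbols are simply never read.
      root, pos = [kexpr[0], []], 1
      level = [root]
      while level:
          nxt = []
          for node in level:
              for _ in range(ARITY.get(node[0], 0)):
                  child = [kexpr[pos], []]
                  pos += 1
                  node[1].append(child)
                  nxt.append(child)
          level = nxt
      return root

  def bind_rncs(root, dc, constants):
      # Replace the '?' placeholders, top to bottom and left to right,
      # by the constants addressed by the Dc symbols (single digits here).
      queue, k = [root], 0
      while queue:
          node = queue.pop(0)
          if node[0] == '?':
              node[0] = constants[int(dc[k])]
              k += 1
          queue.extend(node[1])

  def evaluate(node, inputs):
      symbol, children = node
      if not children:                                        # terminal
          return inputs.get(symbol, symbol)                   # variable or bound constant
      return FUNC[symbol](*(evaluate(c, inputs) for c in children))

  gene = "+?*+?**aaa??aaa68083295"                            # head 7, tail 8, Dc 8
  C = [0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755]
  tree = parse_kexpression(gene[:15])
  bind_rncs(tree, gene[15:], C)
  print(evaluate(tree, {'a': 1.0}))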

Like the basic gene expression algorithm, the GEP-RNC algorithm is also multigenic and its chromosomes are decoded as usual by expressing one gene after another and then linking them all together by the same kind of linking process.

The genetic operators used in the GEP-RNC system are an extension of the genetic operators of the basic GEP algorithm (see above), and they can all be straightforwardly implemented in these new chromosomes. The basic operators of mutation, inversion, transposition, and recombination are thus also used in the GEP-RNC algorithm. Furthermore, special Dc-specific operators such as mutation, inversion, and transposition are used to promote a more efficient circulation of the RNCs among individual programs. In addition, there is a special mutation operator that allows the permanent introduction of variation into the set of RNCs. The initial set of RNCs is randomly created at the beginning of a run, which means that, for each gene in the initial population, a specified number of numerical constants, chosen from a certain range, are randomly generated. Their circulation and mutation are then enabled by the genetic operators.

Neural networks[edit]

An artificial neural network (ANN or NN) is a computational device that consists of many simple connected units or neurons. The connections between the units are usually weighted by real-valued weights. These weights are the primary means of learning in neural networks and a learning algorithm is usually used to adjust them.

Structurally, a neural network has three different classes of units: input units, hidden units, and output units. An activation pattern is presented at the input units and then spreads in a forward direction from the input units through one or more layers of hidden units to the output units. The activation coming into one unit from another unit is multiplied by the weights on the links over which it spreads. All incoming activation is then added together, and the unit becomes activated only if the incoming result is above the unit’s threshold.

In summary, the basic components of a neural network are the units, the connections between the units, the weights, and the thresholds. So, in order to fully simulate an artificial neural network one must somehow encode these components in a linear chromosome and then be able to express them in a meaningful way.

In GEP neural networks (GEP-NN or GEP nets), the network architecture is encoded in the usual structure of a head/tail domain.[10] The head contains special functions/neurons that activate the hidden and output units (in the GEP context, all these units are more appropriately called functional units) and terminals that represent the input units. The tail, as usual, contains only terminals/input units.

Besides the head and the tail, these neural network genes contain two additional domains, Dw and Dt, for encoding the weights and thresholds of the neural network. Structurally, the Dw comes after the tail and its length dw depends on the head size h and maximum arity nmax and is evaluated by the formula:

d_w = h \, n_\max \,

The Dt comes after Dw and has a length dt equal to t. Both domains are composed of symbols representing the weights and thresholds of the neural network.

For each NN-gene, the weights and thresholds are created at the beginning of each run, but their circulation and adaptation are guaranteed by the usual genetic operators of mutation, transposition, inversion, and recombination. In addition, special operators are also used to allow a constant flow of genetic variation in the set of weights and thresholds.

For example, below is shown a neural network with two input units (i1 and i2), two hidden units (h1 and h2), and one output unit (o1). It has a total of six connections with six corresponding weights represented by the numerals 1–6 (for simplicity, the thresholds are all equal to 1 and are omitted):

[Figure: Neural network with 5 units]

This representation is the canonical neural network representation, but neural networks can also be represented by a tree, which, in this case, corresponds to:

[Figure: GEP neural network with 7 nodes]

where “a” and “b” represent the two inputs i1 and i2 and “D” represents a function with connectivity two. This function adds all its weighted arguments and then thresholds this activation in order to determine the forwarded output. This output (zero or one in this simple case) depends on the threshold of each unit: if the total incoming activation is equal to or greater than the threshold, the output is one; otherwise it is zero.

The above NN-tree can be linearized as follows:

0123456789012
DDDabab654321

where the structure in positions 7–12 (Dw) encodes the weights. The values of each weight are kept in an array and retrieved as necessary for expression.

As a more concrete example, below is shown a neural net gene for the exclusive-or problem. It has a head size of 3 and Dw size of 6:

0123456789012
DDDabab393257

Its expression results in the following neural network:

[Figure: Expression of a GEP neural network for the exclusive-or]

which, for the set of weights:

W = {−1.978, 0.514, −0.465, 1.22, −1.686, −1.797, 0.197, 1.606, 0, 1.753}

gives:

[Figure: GEP neural network solution for the exclusive-or]

which is a perfect solution to the exclusive-or function.
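
A sketch of how such a gene can be decoded and evaluated; the exact order in which the Dw symbols are assigned to connections is an assumption here (the same breadth-first order in which the tree is built), but with that assumption the gene above does reproduce the exclusive-or truth table:

  ARITY = {'D': 2}   # 'D': a unit with two weighted inputs and a threshold

  def express_nn_gene(gene, head, weights, threshold=1.0):
      # Split the gene into its K-expression (head + tail) and Dw domain,
      # build the tree breadth-first, and attach a weight to every connection.
      tail = head * (max(ARITY.values()) - 1) + 1
      kexpr, dw = gene[:head + tail], gene[head + tail:]
      w = iter(int(d) for d in dw)
      root, pos = [kexpr[0], []], 1
      level = [root]
      while level:
          nxt = []
          for node in level:
              for _ in range(ARITY.get(node[0], 0)):
                  child = [kexpr[pos], []]
                  pos += 1
                  node[1].append((child, weights[next(w)]))
                  nxt.append(child)
          level = nxt

      def activate(node, inputs):
          symbol, children = node
          if not children:
              return inputs[symbol]                      # input unit
          total = sum(wt * activate(c, inputs) for c, wt in children)
          return 1.0 if total >= threshold else 0.0      # threshold unit
      return lambda inputs: activate(root, inputs)

  W = [-1.978, 0.514, -0.465, 1.22, -1.686, -1.797, 0.197, 1.606, 0, 1.753]
  net = express_nn_gene("DDDabab393257", head=3, weights=W)
  for a in (0, 1):
      for b in (0, 1):
          print(a, b, int(net({'a': a, 'b': b})))        # prints the XOR table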

Besides simple Boolean functions with binary inputs and binary outputs, the GEP-nets algorithm can handle all kinds of functions or neurons (linear neurons, tanh neurons, atan neurons, logistic neurons, limit neurons, radial basis and triangular basis neurons, all kinds of step neurons, and so on). Also interesting is that the GEP-nets algorithm can use all these neurons together and let evolution decide which ones work best to solve the problem at hand. So GEP-nets can be used not only in Boolean problems but also in logistic regression, classification, and regression. In all cases, GEP-nets can be implemented not only with multigenic systems but also with cellular systems, both unicellular and multicellular. Furthermore, multinomial classification problems can also be tackled in one go by GEP-nets, both with multigenic systems and with multicellular systems.

Decision trees[edit]

Decision trees (DT) are classification models in which a series of questions and answers is mapped using nodes and directed edges.

Decision trees have three types of nodes: a root node, internal nodes, and leaf or terminal nodes. The root node and all internal nodes represent test conditions for different attributes or variables in a dataset. Leaf nodes specify the class label for all different paths in the tree.

Most decision tree induction algorithms involve selecting an attribute for the root node and then making the same kind of informed decision for all the other nodes in the tree.

Decision trees can also be created by gene expression programming,[11] with the advantage that all the decisions concerning the growth of the tree are made by the algorithm itself without any kind of human input.

There are basically two different types of DT algorithms: one for inducing decision trees with only nominal attributes and another for inducing decision trees with both numeric and nominal attributes. This aspect of decision tree induction also carries over to gene expression programming, and there are two GEP algorithms for decision tree induction: the evolvable decision trees (EDT) algorithm for dealing exclusively with nominal attributes, and the EDT-RNC algorithm (EDT with random numerical constants) for handling both nominal and numeric attributes.

In the decision trees induced by gene expression programming, the attributes behave as function nodes in the basic gene expression algorithm, whereas the class labels behave as terminals. This means that attribute nodes also have associated with them a specific arity, or number of branches, that determines their growth and, ultimately, the growth of the tree. Class labels behave like terminals, which means that for a k-class classification task a terminal set with k terminals is used, representing the k different classes.

The rules for encoding a decision tree in a linear genome are very similar to the rules used to encode mathematical expressions (see above). So, for decision tree induction the genes also have a head and a tail, with the head containing attributes and terminals and the tail containing only terminals. This again ensures that all decision trees designed by GEP are always valid programs. Furthermore, the size of the tail t is also dictated by the head size h and the number of branches of the attribute with the most branches, nmax, and is evaluated by the equation:

t = h(n_\max-1)+1 \,
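
This is the same sizing rule used for the genes seen earlier; for instance, the single-gene chromosome in the RNC example above (head size 7, maximum arity 2) has a tail of length 8, which is why its Dc occupies positions 15–22:

  def tail_length(head, n_max):
      # t = h(n_max - 1) + 1 guarantees that every head symbol can be
      # supplied with enough terminals, whatever the gene encodes.
      return head * (n_max - 1) + 1

  assert tail_length(7, 2) == 8    # head 7 + tail 8 = 15 symbols before the Dc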

For example, consider the decision tree below to decide whether to play outside:

[Figure: Decision tree for playing outside]

It can be linearly encoded as:

01234567
HOWbaaba

where “H” represents the attribute Humidity, “O” the attribute Outlook, “W” represents Windy, and “a” and “b” the class labels "Yes" and "No" respectively. Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore don’t have to be encoded.
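A sketch of how such a linear string can be expanded back into a tree; the branching factors below (Humidity 2, Outlook 3, Windy 2) are assumptions consistent with the encoded example:

  BRANCHES = {'H': 2, 'O': 3, 'W': 2}    # assumed branching factor per attribute

  def decode_decision_tree(kexpr):
      # Breadth-first decoding: attributes spawn as many children as they
      # have branches; class labels ('a', 'b') are leaves and spawn none.
      root, pos = [kexpr[0], []], 1
      level = [root]
      while level:
          nxt = []
          for node in level:
              for _ in range(BRANCHES.get(node[0], 0)):
                  child = [kexpr[pos], []]
                  pos += 1
                  node[1].append(child)
                  nxt.append(child)
          level = nxt
      return root

  print(decode_decision_tree("HOWbaaba"))
  # [['H', [['O', [['b', []], ['a', []], ['a', []]]], ['W', [['b', []], ['a', []]]]]]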

The process of decision tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes. Then the chromosomes are expressed as decision trees and their fitness is evaluated against a training dataset. According to fitness, they are then selected to reproduce with modification. The genetic operators are exactly the same as those used in a conventional unigenic system, for example mutation, inversion, transposition, and recombination.

Decision trees with both nominal and numeric attributes are also easily induced with gene expression programming using the framework described above for dealing with random numerical constants. The chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node. For example, the gene below with a head size of 5 (the Dc starts at position 16):

012345678901234567890
WOTHabababbbabba46336

encodes the decision tree shown below:

[Figure: GEP decision tree, k-expression WOTHababab]

In this system, every node in the head, irrespective of its type (numeric attribute, nominal attribute, or terminal), has associated with it a random numerical constant, which for simplicity in the example above is represented by a numeral 0–9. These random numerical constants are encoded in the Dc domain and their expression follows a very simple scheme: from top to bottom and from left to right, the elements in Dc are assigned one-by-one to the elements in the decision tree. So, for the following array of RNCs:

C = {62, 51, 68, 83, 86, 41, 43, 44, 9, 67}

the decision tree above results in:

[Figure: GEP decision tree with numeric and nominal attributes, k-expression WOTHababab]

which can also be represented more colorfully as a conventional decision tree:

[Figure: GEP decision tree with numeric and nominal attributes]

Criticism[edit]

GEP has been criticized for not being a major improvement over other genetic programming techniques. In many experiments, it did not perform better than existing methods.[12]

Software[edit]

Commercial applications[edit]

GeneXproTools
GeneXproTools is a predictive analytics suite developed by Gepsoft. Its modeling frameworks include logistic regression, classification, regression, time series prediction, and logic synthesis. GeneXproTools implements the basic gene expression algorithm and the GEP-RNC algorithm, both of which are used in all of its modeling frameworks.

Open source libraries[edit]

GEP4J – GEP for Java Project
Created by Jason Thomas, GEP4J is an open-source implementation of gene expression programming in Java. It implements different GEP algorithms, including evolving decision trees (with nominal, numeric, or mixed attributes) and automatically defined functions. GEP4J is hosted at Google Code.
PyGEP – Gene Expression Programming for Python
Created by Ryan O'Neil with the goal of providing a simple library suitable for the academic study of gene expression programming in Python, aiming for ease of use and rapid implementation. It implements standard multigenic chromosomes and the genetic operators mutation, crossover, and transposition. PyGEP is hosted at Google Code.
jGEP – Java GEP toolkit
Created by Matthew Sottile to rapidly build Java prototype codes that use GEP, which can then be rewritten in a language such as C or Fortran for speed. jGEP is hosted at SourceForge.

Further reading[edit]

See also[edit]

References[edit]

  1. ^ Box, G. E. P., 1957. Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, 6, 81–101.
  2. ^ Friedman, G. J., 1959. Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171–184.
  3. ^ Rechenberg, Ingo (1973). Evolutionsstrategie. Stuttgart: Holzmann-Froboog. ISBN 3-7728-0373-3. 
  4. ^ Mitchell, Melanie (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press. 
  5. ^ Ferreira, C. (2001). "Gene Expression Programming: A New Adaptive Algorithm for Solving Problems". Complex Systems, Vol. 13, issue 2: 87–129. 
  6. ^ Ferreira, C. (2002). "Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence". Portugal: Angra do Heroismo. ISBN 972-95890-5-4. 
  7. ^ Ferreira, C. (2006). "Automatically Defined Functions in Gene Expression Programming". In N. Nedjah, L. de M. Mourelle, A. Abraham, eds., Genetic Systems Programming: Theory and Experiences, Studies in Computational Intelligence, Vol. 13, pp. 21–56, Springer-Verlag. 
  8. ^ Ferreira, C. (2002). "Mutation, Transposition, and Recombination: An Analysis of the Evolutionary Dynamics". In H. J. Caulfield, S.-H. Chen, H.-D. Cheng, R. Duro, V. Honavar, E. E. Kerre, M. Lu, M. G. Romay, T. K. Shih, D. Ventura, P. P. Wang, Y. Yang, eds., Proceedings of the 6th Joint Conference on Information Sciences, 4th International Workshop on Frontiers in Evolutionary Algorithms, pages 614–617, Research Triangle Park, North Carolina, USA. 
  9. ^ Ferreira, C. (2002). "Combinatorial Optimization by Gene Expression Programming: Inversion Revisited". In J. M. Santos and A. Zapico, eds., Proceedings of the Argentine Symposium on Artificial Intelligence, pages 160–174, Santa Fe, Argentina. 
  10. ^ Ferreira, C. (2006). "Designing Neural Networks Using Gene Expression Programming". In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages 517–536, Springer-Verlag. 
  11. ^ Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7. 
  12. ^ Oltean, M.; Grosan, C. (2003), "A comparison of several linear genetic programming techniques", Complex Systems 14 (4): 285–314 

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Genetic_algorithms b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Genetic_algorithms new file mode 100644 index 00000000..27b9f487 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Genetic_algorithms @@ -0,0 +1 @@ + Genetic algorithm - Wikipedia, the free encyclopedia

Genetic algorithm

From Wikipedia, the free encyclopedia
  (Redirected from Genetic algorithms)

In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems.[1] Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.

Genetic algorithms find application in bioinformatics, phylogenetics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics, pharmacometrics and other fields.

Contents

Methodology[edit]

In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]

The evolution usually starts from a population of randomly generated individuals and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

A typical genetic algorithm requires:

  1. a genetic representation of the solution domain,
  2. a fitness function to evaluate the solution domain.

A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.

Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
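
A bare-bones sketch of this loop over bit strings (tournament selection and a single elite survivor are used here for brevity; real implementations vary in every one of these choices):

  import random

  def genetic_algorithm(fitness, n_bits, pop_size=100, generations=200,
                        p_crossover=0.9, p_mutation=0.01, rng=random):
      # Initialise a random population, then repeatedly select, recombine
      # and mutate until the generation budget is exhausted.
      pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
      for _ in range(generations):
          elite = max(pop, key=fitness)                    # simple elitism

          def tournament():
              a, b = rng.choice(pop), rng.choice(pop)
              return a if fitness(a) >= fitness(b) else b

          new_pop = [elite[:]]
          while len(new_pop) < pop_size:
              p1, p2 = tournament(), tournament()
              if rng.random() < p_crossover:               # one-point crossover
                  cut = rng.randrange(1, n_bits)
                  child = p1[:cut] + p2[cut:]
              else:
                  child = p1[:]
              for i in range(n_bits):                      # bit-flip mutation
                  if rng.random() < p_mutation:
                      child[i] = 1 - child[i]
              new_pop.append(child)
          pop = new_pop
      return max(pop, key=fitness)

  # example: maximise the number of ones in a 20-bit string
  best = genetic_algorithm(fitness=sum, n_bits=20)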

Initialization[edit]

Initially, many individual solutions are (usually) randomly generated to form an initial population. The population size depends on the nature of the problem, but typically contains several hundred or several thousand possible solutions. Traditionally, the population is generated randomly, allowing the entire range of possible solutions (the search space). Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.

Selection[edit]

During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of each solution and preferentially select the best solutions. Other methods rate only a random sample of the population, as the former process may be very time-consuming.

The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem dependent. For instance, in the knapsack problem one wants to maximize the total value of objects that can be put in a knapsack of some fixed capacity. A representation of a solution might be an array of bits, where each bit represents a different object, and the value of the bit (0 or 1) represents whether or not the object is in the knapsack. Not every such representation is valid, as the size of objects may exceed the capacity of the knapsack. The fitness of the solution is the sum of values of all objects in the knapsack if the representation is valid, or 0 otherwise.
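
For the knapsack example just described, such a fitness function might be sketched as:

  def knapsack_fitness(bits, values, weights, capacity):
      # Total value of the selected objects if they fit into the knapsack,
      # and 0 for invalid (overweight) selections.
      total_weight = sum(w for b, w in zip(bits, weights) if b)
      if total_weight > capacity:
          return 0
      return sum(v for b, v in zip(bits, values) if b)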

In some problems, it is hard or even impossible to define the fitness expression; in these cases, a simulation may be used to determine the fitness function value of a phenotype (e.g. computational fluid dynamics is used to determine the air resistance of a vehicle whose shape is encoded as the phenotype), or even interactive genetic algorithms are used.

Genetic operators[edit]

The next step is to generate a second generation population of solutions from those selected through genetic operators: crossover (also called recombination), and/or mutation.

For each new solution to be produced, a pair of "parent" solutions is selected for breeding from the pool selected previously. By producing a "child" solution using the above methods of crossover and mutation, a new solution is created which typically shares many of the characteristics of its "parents". New parents are selected for each new child, and the process continues until a new population of solutions of appropriate size is generated. Although reproduction methods that are based on the use of two parents are more "biology inspired", some research[3][4] suggests that more than two "parents" generate higher quality chromosomes.

These processes ultimately result in the next generation population of chromosomes that is different from the initial generation. Generally the average fitness will have increased by this procedure for the population, since only the best organisms from the first generation are selected for breeding, along with a small proportion of less fit solutions. These less fit solutions ensure genetic diversity within the genetic pool of the parents and therefore ensure the genetic diversity of the subsequent generation of children.

Opinion is divided over the importance of crossover versus mutation. There are many references in Fogel (2006) that support the importance of mutation-based search.

Although crossover and mutation are known as the main genetic operators, it is possible to use other operators such as regrouping, colonization-extinction, or migration in genetic algorithms.[5]

It is worth tuning parameters such as the mutation probability, crossover probability and population size to find reasonable settings for the problem class being worked on. A very small mutation rate may lead to genetic drift (which is non-ergodic in nature). A recombination rate that is too high may lead to premature convergence of the genetic algorithm. A mutation rate that is too high may lead to loss of good solutions unless there is elitist selection. There are theoretical but not yet practical upper and lower bounds for these parameters that can help guide selection.[citation needed]

Termination[edit]

This generational process is repeated until a termination condition has been reached. Common terminating conditions are:

  • A solution is found that satisfies minimum criteria
  • Fixed number of generations reached
  • Allocated budget (computation time/money) reached
  • The highest ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results
  • Manual inspection
  • Combinations of the above

The building block hypothesis[edit]

Genetic algorithms are simple to implement, but their behavior is difficult to understand. In particular it is difficult to understand why these algorithms frequently succeed at generating solutions of high fitness when applied to practical problems. The building block hypothesis (BBH) consists of:

  1. A description of a heuristic that performs adaptation by identifying and recombining "building blocks", i.e. low order, low defining-length schemata with above average fitness.
  2. A hypothesis that a genetic algorithm performs adaptation by implicitly and efficiently implementing this heuristic.

Goldberg describes the heuristic as follows:

"Short, low order, and highly fit schemata are sampled, recombined [crossed over], and resampled to form strings of potentially higher fitness. In a way, by working with these particular schemata [the building blocks], we have reduced the complexity of our problem; instead of building high-performance strings by trying every conceivable combination, we construct better and better strings from the best partial solutions of past samplings.
"Because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms, we have already given them a special name: building blocks. Just as a child creates magnificent fortresses through the arrangement of simple blocks of wood, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks."[6]

Limitations[edit]

There are several limitations of the use of a genetic algorithm compared to alternative optimization algorithms:

  • Repeated fitness function evaluation for complex problems is often the most prohibitive and limiting segment of artificial evolutionary algorithms. Finding the optimal solution to complex high-dimensional, multimodal problems often requires very expensive fitness function evaluations. In real-world problems such as structural optimization problems, a single function evaluation may require several hours to several days of complete simulation. Typical optimization methods cannot deal with such problems. In this case, it may be necessary to forgo an exact evaluation and use an approximated fitness that is computationally efficient. Amalgamation of approximate models may be one of the most promising approaches to convincingly using GAs to solve complex real-life problems.
  • Genetic algorithms do not scale well with complexity. That is, where the number of elements exposed to mutation is large, there is often an exponential increase in search space size. This makes it extremely difficult to use the technique on problems such as designing an engine, a house, or a plane. In order to make such problems tractable to evolutionary search, they must be broken down into the simplest representation possible. Hence we typically see evolutionary algorithms encoding designs for fan blades instead of engines, building shapes instead of detailed construction plans, and airfoils instead of whole aircraft designs. The second problem of complexity is the issue of how to protect parts that have evolved to represent good solutions from further destructive mutation, particularly when their fitness assessment requires them to combine well with other parts. It has been suggested by some[citation needed] in the community that a developmental approach to evolved solutions could overcome some of the issues of protection, but this remains an open research question.
  • The "better" solution is only in comparison to other solutions. As a result, the stop criterion is not clear in every problem.
  • In many problems, GAs may have a tendency to converge towards local optima or even arbitrary points rather than the global optimum of the problem. This means that they do not "know how" to sacrifice short-term fitness to gain longer-term fitness. The likelihood of this occurring depends on the shape of the fitness landscape: certain problems may provide an easy ascent towards a global optimum, while others may make it easier to find only local optima. This problem may be alleviated by using a different fitness function, increasing the rate of mutation, or by using selection techniques that maintain a diverse population of solutions,[7] although the No Free Lunch theorem[8] proves[citation needed] that there is no general solution to this problem. A common technique to maintain diversity is to impose a "niche penalty", wherein any group of individuals of sufficient similarity (niche radius) has a penalty added, which will reduce the representation of that group in subsequent generations, permitting other (less similar) individuals to be maintained in the population. This trick, however, may not be effective, depending on the landscape of the problem. Another possible technique would be to simply replace part of the population with randomly generated individuals when most of the population is too similar to each other. Diversity is important in genetic algorithms (and genetic programming) because crossing over a homogeneous population does not yield new solutions. In evolution strategies and evolutionary programming, diversity is not essential because of a greater reliance on mutation.
  • Operating on dynamic data sets is difficult, as genomes begin to converge early on towards solutions which may no longer be valid for later data. Several methods have been proposed to remedy this by increasing genetic diversity somehow and preventing early convergence, either by increasing the probability of mutation when the solution quality drops (called triggered hypermutation), or by occasionally introducing entirely new, randomly generated elements into the gene pool (called random immigrants). Again, evolution strategies and evolutionary programming can be implemented with a so-called "comma strategy" in which parents are not maintained and new parents are selected only from offspring. This can be more effective on dynamic problems.
  • GAs cannot effectively solve problems in which the only fitness measure is a single right/wrong measure (like decision problems), as there is no way to converge on the solution (no hill to climb). In these cases, a random search may find a solution as quickly as a GA. However, if the situation allows the success/failure trial to be repeated giving (possibly) different results, then the ratio of successes to failures provides a suitable fitness measure.

Variants[edit]

The simplest algorithm represents each chromosome as a bit string. Typically, numeric parameters can be represented by integers, though it is possible to use floating-point representations. The floating-point representation is natural to evolution strategies and evolutionary programming. The notion of real-valued genetic algorithms has been offered but is arguably a misnomer because it does not really reflect the building-block theory proposed by John Henry Holland in the 1970s. This theory is not without support, though, based on theoretical and experimental results (see below). The basic algorithm performs crossover and mutation at the bit level. Other variants treat the chromosome as a list of numbers which are indexes into an instruction table, nodes in a linked list, hashes, objects, or any other imaginable data structure. Crossover and mutation are performed so as to respect data element boundaries. For most data types, specific variation operators can be designed. Different chromosomal data types seem to work better or worse for different specific problem domains.

When bit-string representations of integers are used, Gray coding is often employed. In this way, small changes in the integer can be readily effected through mutations or crossovers. This has been found to help prevent premature convergence at so-called Hamming walls, in which too many simultaneous mutations (or crossover events) must occur in order to change the chromosome to a better solution.
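
For instance, with the binary-reflected Gray code, consecutive integers always differ in a single bit, so a single mutation suffices to step across what would otherwise be a Hamming wall:

  def to_gray(n):
      # Binary-reflected Gray code of a non-negative integer.
      return n ^ (n >> 1)

  def from_gray(g):
      # Inverse transform: fold the bits back down with XOR.
      n = 0
      while g:
          n ^= g
          g >>= 1
      return n

  # 7 -> 0100 and 8 -> 1100 in Gray code: a single bit apart,
  # whereas plain binary 0111 and 1000 differ in all four bits.
  assert bin(to_gray(7) ^ to_gray(8)).count('1') == 1
  assert from_gray(to_gray(8)) == 8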

Other approaches involve using arrays of real-valued numbers instead of bit strings to represent chromosomes. Theoretically, the smaller the alphabet, the better the performance, but paradoxically, good results have been obtained from using real-valued chromosomes.

A very successful (slight) variant of the general process of constructing a new population is to allow some of the better organisms from the current generation to carry over to the next, unaltered. This strategy is known as elitist selection.

Parallel implementations of genetic algorithms come in two flavours. Coarse-grained parallel genetic algorithms assume a population on each of the computer nodes and migration of individuals among the nodes. Fine-grained parallel genetic algorithms assume an individual on each processor node which acts with neighboring individuals for selection and reproduction. Other variants, like genetic algorithms for online optimization problems, introduce time-dependence or noise in the fitness function.

Genetic algorithms with adaptive parameters (adaptive genetic algorithms, AGAs) are another significant and promising variant of genetic algorithms. The probabilities of crossover (pc) and mutation (pm) greatly determine the degree of solution accuracy and the convergence speed that genetic algorithms can obtain. Instead of using fixed values of pc and pm, AGAs utilize the population information in each generation and adaptively adjust pc and pm in order to maintain the population diversity as well as to sustain the convergence capacity. In AGA (adaptive genetic algorithm),[9] the adjustment of pc and pm depends on the fitness values of the solutions. In CAGA (clustering-based adaptive genetic algorithm),[10] the adjustment of pc and pm depends on the optimization states of the population, which are judged through clustering analysis.

It can be quite effective to combine GA with other optimization methods. GA tends to be quite good at finding generally good global solutions, but quite inefficient at finding the last few mutations needed to reach the absolute optimum. Other techniques (such as simple hill climbing) are quite efficient at finding the absolute optimum in a limited region. Alternating GA and hill climbing can improve the efficiency of GA while overcoming the lack of robustness of hill climbing.

This means that the rules of genetic variation may have a different meaning in the natural case. For instance, provided that steps are stored in consecutive order, crossing over may sum a number of steps from maternal DNA, adding a number of steps from paternal DNA, and so on. This is like adding vectors that are more likely to follow a ridge in the phenotypic landscape. Thus, the efficiency of the process may be increased by many orders of magnitude. Moreover, the inversion operator has the opportunity to place steps in consecutive order, or any other suitable order, in favour of survival or efficiency. (See for instance [11] or the example in the travelling salesman problem, in particular the use of an edge recombination operator.)

A variation, where the population as a whole is evolved rather than its individual members, is known as gene pool recombination.

A number of variations have been developed to attempt to improve performance of GAs on problems with a high degree of fitness epistasis, i.e. where the fitness of a solution consists of interacting subsets of its variables. Such algorithms aim to learn (before exploiting) these beneficial phenotypic interactions. As such, they are aligned with the Building Block Hypothesis in adaptively reducing disruptive recombination. Prominent examples of this approach include the mGA,[12] GEMGA[13] and LLGA.[14]

Problem domains[edit]

Problems which appear to be particularly appropriate for solution by genetic algorithms include timetabling and scheduling problems, and many scheduling software packages are based on GAs[citation needed]. GAs have also been applied to engineering. Genetic algorithms are often applied as an approach to solve global optimization problems.

As a general rule of thumb, genetic algorithms might be useful in problem domains that have a complex fitness landscape, as mixing, i.e. mutation in combination with crossover, is designed to move the population away from local optima in which a traditional hill-climbing algorithm might get stuck. Observe that commonly used crossover operators cannot change any uniform population. Mutation alone can provide ergodicity of the overall genetic algorithm process (seen as a Markov chain).

Examples of problems solved by genetic algorithms include: mirrors designed to funnel sunlight to a solar collector, antennae designed to pick up radio signals in space, and walking methods for computer figures. Many of their solutions have been highly effective, unlike anything a human engineer would have produced, and inscrutable as to how they arrived at that solution.

History[edit]

Computer simulations of evolution started as early as 1954 with the work of Nils Aall Barricelli, who was using the computer at the Institute for Advanced Study in Princeton, New Jersey.[15][16] His 1954 publication was not widely noticed. Starting in 1957,[17] the Australian quantitative geneticist Alex Fraser published a series of papers on the simulation of artificial selection of organisms with multiple loci controlling a measurable trait. From these beginnings, computer simulation of evolution by biologists became more common in the early 1960s, and the methods were described in books by Fraser and Burnell (1970)[18] and Crosby (1973).[19] Fraser's simulations included all of the essential elements of modern genetic algorithms. In addition, Hans-Joachim Bremermann published a series of papers in the 1960s that also adopted a population of solutions to optimization problems, undergoing recombination, mutation, and selection. Bremermann's research also included the elements of modern genetic algorithms.[20] Other noteworthy early pioneers include Richard Friedberg, George Friedman, and Michael Conrad. Many early papers are reprinted by Fogel (1998).[21]

Although Barricelli, in work he reported in 1963, had simulated the evolution of ability to play a simple game,[22] artificial evolution became a widely recognized optimization method as a result of the work of Ingo Rechenberg and Hans-Paul Schwefel in the 1960s and early 1970s – Rechenberg's group was able to solve complex engineering problems through evolution strategies.[23][24][25][26] Another approach was the evolutionary programming technique of Lawrence J. Fogel, which was proposed for generating artificial intelligence. Evolutionary programming originally used finite state machines for predicting environments, and used variation and selection to optimize the predictive logics. Genetic algorithms in particular became popular through the work of John Holland in the early 1970s, and particularly his book Adaptation in Natural and Artificial Systems (1975). His work originated with studies of cellular automata, conducted by Holland and his students at the University of Michigan. Holland introduced a formalized framework for predicting the quality of the next generation, known as Holland's Schema Theorem. Research in GAs remained largely theoretical until the mid-1980s, when The First International Conference on Genetic Algorithms was held in Pittsburgh, Pennsylvania.

As academic interest grew, the dramatic increase in desktop computational power allowed for practical application of the new technique. In the late 1980s, General Electric started selling the world's first genetic algorithm product, a mainframe-based toolkit designed for industrial processes. In 1989, Axcelis, Inc. released Evolver, the world's first commercial GA product for desktop computers. The New York Times technology writer John Markoff wrote[27] about Evolver in 1990.

Related techniques[edit]

Parent fields[edit]

Genetic algorithms are a sub-field of:

Related fields[edit]

Evolutionary algorithms[edit]

Evolutionary algorithms are a sub-field of evolutionary computing.

  • Evolution strategies (ES, see Rechenberg, 1994) evolve individuals by means of mutation and intermediate or discrete recombination. ES algorithms are designed particularly to solve problems in the real-value domain. They use self-adaptation to adjust control parameters of the search. De-randomization of self-adaptation has led to the contemporary Covariance Matrix Adaptation Evolution Strategy (CMA-ES).
  • Evolutionary programming (EP) involves populations of solutions with primarily mutation and selection and arbitrary representations. They use self-adaptation to adjust parameters, and can include other variation operations such as combining information from multiple parents.
  • Gene expression programming (GEP) also uses populations of computer programs. These complex computer programs are encoded in simpler linear chromosomes of fixed length, which are afterwards expressed as expression trees. Expression trees or computer programs evolve because the chromosomes undergo mutation and recombination in a manner similar to the canonical GA. But thanks to the special organization of GEP chromosomes, these genetic modifications always result in valid computer programs.[28]
  • Genetic programming (GP) is a related technique popularized by John Koza in which computer programs, rather than function parameters, are optimized. Genetic programming often uses tree-based internal data structures to represent the computer programs for adaptation instead of the list structures typical of genetic algorithms.
  • Grouping genetic algorithm (GGA) is an evolution of the GA where the focus is shifted from individual items, as in classical GAs, to groups or subsets of items.[29] The idea behind this GA evolution proposed by Emanuel Falkenauer is that solving some complex problems, a.k.a. clustering or partitioning problems, where a set of items must be split into disjoint groups of items in an optimal way, would better be achieved by making characteristics of the groups of items equivalent to genes. These kinds of problems include bin packing, line balancing, clustering with respect to a distance measure, equal piles, etc., on which classic GAs proved to perform poorly. Making genes equivalent to groups implies chromosomes that are in general of variable length, and special genetic operators that manipulate whole groups of items. For bin packing in particular, a GGA hybridized with the Dominance Criterion of Martello and Toth is arguably the best technique to date.
  • Interactive evolutionary algorithms are evolutionary algorithms that use human evaluation. They are usually applied to domains where it is hard to design a computational fitness function, for example, evolving images, music, artistic designs and forms to fit users' aesthetic preference.

Swarm intelligence[edit]

Swarm intelligence is a sub-field of evolutionary computing.

  • Ant colony optimization (ACO) uses many ants (or agents) to traverse the solution space and find locally productive areas. While usually inferior to genetic algorithms and other forms of local search, it is able to produce results in problems where no global or up-to-date perspective can be obtained, and thus the other methods cannot be applied.[citation needed]
  • Particle swarm optimization (PSO) is a computational method for multi-parameter optimization which also uses a population-based approach. A population (swarm) of candidate solutions (particles) moves in the search space, and the movement of the particles is influenced both by their own best known position and by the swarm's global best known position. Like genetic algorithms, the PSO method depends on information sharing among population members. In some problems the PSO is often more computationally efficient than GAs, especially in unconstrained problems with continuous variables.[30]
  • Intelligent Water Drops or the IWD algorithm[31] is a nature-inspired optimization algorithm inspired by natural water drops, which change their environment to find the near-optimal or optimal path to their destination. The memory is the river's bed, and what is modified by the water drops is the amount of soil on the river's bed.

Other evolutionary computing algorithms[edit]

Evolutionary computation is a sub-field of the metaheuristic methods.

  • Harmony search (HS) is an algorithm mimicking the behaviour of musicians in the process of improvisation.
  • Memetic algorithm (MA), also called hybrid genetic algorithm among others, is a relatively new evolutionary method where local search is applied during the evolutionary cycle. The idea of memetic algorithms comes from memes, which unlike genes, can adapt themselves. In some problem areas they are shown to be more efficient than traditional evolutionary algorithms.
  • Bacteriologic algorithms (BA) are inspired by evolutionary ecology and, more particularly, bacteriologic adaptation. Evolutionary ecology is the study of living organisms in the context of their environment, with the aim of discovering how they adapt. Its basic concept is that in a heterogeneous environment no single individual fits the whole environment, so one needs to reason at the population level. It is also believed BAs could be successfully applied to complex positioning problems (antennas for cell phones, urban planning, and so on) or data mining.[32]
  • Cultural algorithm (CA) consists of the population component almost identical to that of the genetic algorithm and, in addition, a knowledge component called the belief space.
  • Gaussian adaptation (normal or natural adaptation, abbreviated NA to avoid confusion with GA) is intended for the maximisation of manufacturing yield of signal processing systems. It may also be used for ordinary parametric optimisation. It relies on a certain theorem valid for all regions of acceptability and all Gaussian distributions. The efficiency of NA relies on information theory and a certain theorem of efficiency. Its efficiency is defined as information divided by the work needed to get the information.[34] Because NA maximises mean fitness rather than the fitness of the individual, the landscape is smoothed such that valleys between peaks may disappear. Therefore it has a certain “ambition” to avoid local peaks in the fitness landscape. NA is also good at climbing sharp crests by adaptation of the moment matrix, because NA may maximise the disorder (average information) of the Gaussian simultaneously keeping the mean fitness constant.

Other metaheuristic methods[edit]

Metaheuristic methods broadly fall within stochastic optimisation methods.

  • Simulated annealing (SA) is a related global optimization technique that traverses the search space by testing random mutations on an individual solution. A mutation that increases fitness is always accepted. A mutation that lowers fitness is accepted probabilistically based on the difference in fitness and a decreasing temperature parameter. In SA parlance, one speaks of seeking the lowest energy instead of the maximum fitness. SA can also be used within a standard GA algorithm by starting with a relatively high rate of mutation and decreasing it over time along a given schedule.
  • Tabu search (TS) is similar to simulated annealing in that both traverse the solution space by testing mutations of an individual solution. While simulated annealing generates only one mutated solution, tabu search generates many mutated solutions and moves to the solution with the lowest energy of those generated. In order to prevent cycling and encourage greater movement through the solution space, a tabu list is maintained of partial or complete solutions. It is forbidden to move to a solution that contains elements of the tabu list, which is updated as the solution traverses the solution space.
  • Extremal optimization (EO), unlike GAs, which work with a population of candidate solutions, evolves a single solution and makes local modifications to the worst components. This requires that a suitable representation be selected which permits individual solution components to be assigned a quality measure ("fitness"). The governing principle behind this algorithm is that of emergent improvement through selectively removing low-quality components and replacing them with a randomly selected component. This is decidedly at odds with a GA that selects good solutions in an attempt to make better solutions.

Other stochastic optimisation methods[edit]

  • The cross-entropy (CE) method generates candidate solutions via a parameterized probability distribution. The parameters are updated via cross-entropy minimization, so as to generate better samples in the next iteration.
  • Reactive search optimization (RSO) advocates the integration of sub-symbolic machine learning techniques into search heuristics for solving complex optimization problems. The word reactive hints at a ready response to events during the search through an internal online feedback loop for the self-tuning of critical parameters. Methodologies of interest for Reactive Search include machine learning and statistics, in particular reinforcement learning, active or query learning, neural networks, and meta-heuristics.

See also[edit]

References[edit]

  1. ^ Mitchell 1996, p. 2.
  2. ^ a b Whitley 1994, p. 66.
  3. ^ Eiben, A. E. et al (1994). "Genetic algorithms with multi-parent recombination". PPSN III: Proceedings of the International Conference on Evolutionary Computation. The Third Conference on Parallel Problem Solving from Nature: 78–87. ISBN 3-540-58484-6.
  4. ^ Ting, Chuan-Kang (2005). "On the Mean Convergence Time of Multi-parent Genetic Algorithms Without Selection". Advances in Artificial Life: 403–412. ISBN 978-3-540-28848-0.
  5. ^ Akbari, Ziarati (2010). "A multilevel evolutionary algorithm for optimizing numerical functions" IJIEC 2 (2011): 419–430 [1]
  6. ^ Goldberg 1989, p. 41.
  7. ^ Taherdangkoo, Mohammad; Paziresh, Mahsa; Yazdi, Mehran; Bagheri, Mohammad Hadi (19 November 2012). "An efficient algorithm for function optimization: modified stem cells algorithm". Central European Journal of Engineering 3 (1): 36–50. doi:10.2478/s13531-012-0047-8. 
  8. ^ Wolpert, D.H., Macready, W.G., 1995. No Free Lunch Theorems for Optimisation. Santa Fe Institute, SFI-TR-05-010, Santa Fe.
  9. ^ Srinivas. M and Patnaik. L, "Adaptive probabilities of crossover and mutation in genetic algorithms," IEEE Transactions on System, Man and Cybernetics, vol.24, no.4, pp.656–667, 1994.
  10. ^ ZHANG. J, Chung. H and Lo. W. L, “Clustering-Based Adaptive Crossover and Mutation Probabilities for Genetic Algorithms”, IEEE Transactions on Evolutionary Computation vol.11, no.3, pp. 326–335, 2007.
  11. ^ Evolution-in-a-nutshell
  12. ^ D.E. Goldberg, B. Korb, and K. Deb. "Messy genetic algorithms: Motivation, analysis, and first results". Complex Systems, 5(3):493–530, October 1989.
  13. ^ Gene expression: The missing link in evolutionary computation
  14. ^ G. Harik. Learning linkage to efficiently solve problems of bounded difficulty using genetic algorithms. PhD thesis, Dept. Computer Science, University of Michigan, Ann Arbour, 1997
  15. ^ Barricelli, Nils Aall (1954). "Esempi numerici di processi di evoluzione". Methodos: 45–68. 
  16. ^ Barricelli, Nils Aall (1957). "Symbiogenetic evolution processes realized by artificial methods". Methodos: 143–182. 
  17. ^ Fraser, Alex (1957). "Simulation of genetic systems by automatic digital computers. I. Introduction". Aust. J. Biol. Sci. 10: 484–491. 
  18. ^ Fraser, Alex; Donald Burnell (1970). Computer Models in Genetics. New York: McGraw-Hill. ISBN 0-07-021904-4. 
  19. ^ Crosby, Jack L. (1973). Computer Simulation in Genetics. London: John Wiley & Sons. ISBN 0-471-18880-8. 
  20. ^ 02.27.96 - UC Berkeley's Hans Bremermann, professor emeritus and pioneer in mathematical biology, has died at 69
  21. ^ Fogel, David B. (editor) (1998). Evolutionary Computation: The Fossil Record. New York: IEEE Press. ISBN 0-7803-3481-7. 
  22. ^ Barricelli, Nils Aall (1963). "Numerical testing of evolution theories. Part II. Preliminary tests of performance, symbiogenesis and terrestrial life". Acta Biotheoretica (16): 99–126. 
  23. ^ Rechenberg, Ingo (1973). Evolutionsstrategie. Stuttgart: Holzmann-Froboog. ISBN 3-7728-0373-3. 
  24. ^ Schwefel, Hans-Paul (1974). Numerische Optimierung von Computer-Modellen (PhD thesis). 
  25. ^ Schwefel, Hans-Paul (1977). Numerische Optimierung von Computor-Modellen mittels der Evolutionsstrategie : mit einer vergleichenden Einführung in die Hill-Climbing- und Zufallsstrategie. Basel; Stuttgart: Birkhäuser. ISBN 3-7643-0876-1. 
  26. ^ Schwefel, Hans-Paul (1981). Numerical optimization of computer models (Translation of 1977 Numerische Optimierung von Computor-Modellen mittels der Evolutionsstrategie). Chichester; New York: Wiley. ISBN 0-471-09988-0. 
  27. ^ Markoff, John (1990-08-29). "What's the Best Answer? It's Survival of the Fittest". New York Times. Retrieved 2009-08-09. 
  28. ^ Ferreira, C. "Gene Expression Programming: A New Adaptive Algorithm for Solving Problems". Complex Systems, Vol. 13, issue 2: 87-129. 
  29. ^ Falkenauer, Emanuel (1997). Genetic Algorithms and Grouping Problems. Chichester, England: John Wiley & Sons Ltd. ISBN 978-0-471-97150-4. 
  30. ^ Rania Hassan, Babak Cohanim, Olivier de Weck, Gerhard Venter (2005). A comparison of particle swarm optimization and the genetic algorithm.
  31. ^ Hamed Shah-Hosseini, The intelligent water drops algorithm: a nature-inspired swarm-based optimization algorithm, International Journal of Bio-Inspired Computation (IJBIC), vol. 1, no. ½, 2009, [2][dead link]
  32. ^ Baudry, Benoit; Franck Fleurey, Jean-Marc Jézéquel, and Yves Le Traon (March/April 2005). "Automatic Test Case Optimization: A Bacteriologic Algorithm" (PDF). IEEE Software (IEEE Computer Society) 22 (2): 76–82. doi:10.1109/MS.2005.30. Retrieved 2009-08-09. 
  33. ^ Civicioglu, P. (2012). "Transforming Geocentric Cartesian Coordinates to Geodetic Coordinates by Using Differential Search Algorithm". Computers &Geosciences 46: 229–247. doi:10.1016/j.cageo.2011.12.011. 
  34. ^ Kjellström, G. (December 1991). "On the Efficiency of Gaussian Adaptation". Journal of Optimization Theory and Applications 71 (3): 589–597. doi:10.1007/BF00941405. 

Bibliography[edit]

  • Banzhaf, Wolfgang; Nordin, Peter; Keller, Robert; Francone, Frank (1998). Genetic Programming – An Introduction. San Francisco, CA: Morgan Kaufmann. ISBN 978-1558605107. 
  • Bies, Robert R; Muldoon, Matthew F; Pollock, Bruce G; Manuck, Steven; Smith, Gwenn and Sale, Mark E (2006). "A Genetic Algorithm-Based, Hybrid Machine Learning Approach to Model Selection". Journal of Pharmacokinetics and Pharmacodynamics (Netherlands: Springer): 196–221. 
  • Cha, Sung-Hyuk; Tappert, Charles C (2009). "A Genetic Algorithm for Constructing Compact Binary Decision Trees". Journal of Pattern Recognition Research 4 (1): 1–13. 
  • Fraser, Alex S. (1957). "Simulation of Genetic Systems by Automatic Digital Computers. I. Introduction". Australian Journal of Biological Sciences 10: 484–491. 
  • Goldberg, David (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley Professional. ISBN 978-0201157673. 
  • Goldberg, David (2002). The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Norwell, MA: Kluwer Academic Publishers. ISBN 978-1402070983. 
  • Fogel, David. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (3rd ed.). Piscataway, NJ: IEEE Press. ISBN 978-0471669517. 
  • Holland, John (1992). Adaptation in Natural and Artificial Systems. Cambridge, MA: MIT Press. ISBN 978-0262581110. 
  • Koza, John (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press. ISBN 978-0262111706. 
  • Michalewicz, Zbigniew (1996). Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag. ISBN 978-3540606765. 
  • Mitchell, Melanie (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press. ISBN 9780585030944. 
  • Poli, R., Langdon, W. B., McPhee, N. F. (2008). A Field Guide to Genetic Programming. Lulu.com, freely available from the internet. ISBN 978-1-4092-0073-4. 
  • Rechenberg, Ingo (1994): Evolutionsstrategie '94, Stuttgart: Fromman-Holzboog.
  • Schmitt, Lothar M; Nehaniv, Chrystopher L; Fujii, Robert H (1998), Linear analysis of genetic algorithms, Theoretical Computer Science 208: 111–148
  • Schmitt, Lothar M (2001), Theory of Genetic Algorithms, Theoretical Computer Science 259: 1–61
  • Schmitt, Lothar M (2004), Theory of Genetic Algorithms II: models for genetic operators over the string-tensor representation of populations and convergence to global optima for arbitrary fitness function under scaling, Theoretical Computer Science 310: 181–231
  • Schwefel, Hans-Paul (1974): Numerische Optimierung von Computer-Modellen (PhD thesis). Reprinted by Birkhäuser (1977).
  • Vose, Michael (1999). The Simple Genetic Algorithm: Foundations and Theory. Cambridge, MA: MIT Press. ISBN 978-0262220583. 
  • Whitley, Darrell (1994). "A genetic algorithm tutorial". Statistics and Computing 4 (2): 65–85. doi:10.1007/BF00175354. 
  • Hingston, Philip; Barone, Luigi; Michalewicz, Zbigniew (2008). Design by Evolution: Advances in Evolutionary Design. Springer. ISBN 978-3540741091. 
  • Eiben, Agoston; Smith, James (2003). Introduction to Evolutionary Computing. Springer. ISBN 978-3540401841. 

External links[edit]

Resources[edit]

  • Genetic Algorithms Index – the Genetic Programming Notebook site provides a structured set of pointers to web pages in the genetic algorithms field.

Tutorials[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/IEEE b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/IEEE new file mode 100644 index 00000000..de7d5446 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/IEEE @@ -0,0 +1 @@ + Institute of Electrical and Electronics Engineers - Wikipedia, the free encyclopedia

Institute of Electrical and Electronics Engineers

From Wikipedia, the free encyclopedia
  (Redirected from IEEE)
IEEE
IEEE logo.svg
Type Professional Organization
Founded January 1, 1963
Headquarters New York City, New York, United States
Origins Merger of the American Institute of Electrical Engineers and the Institute of Radio Engineers
Key people Gordon Day, President and CEO
Area served Worldwide
Focus Electrical, Electronics, Communications, Computer Engineering, Computer Science and Information Technology[1]
Method Industry standards, Conferences, Publications
Revenue US$330 million
Members 425,000+
Website www.ieee.org

The Institute of Electrical and Electronics Engineers (IEEE, read I-Triple-E) is a professional association headquartered in New York City that is dedicated to advancing technological innovation and excellence. It has more than 425,000 members in more than 160 countries, about 51.4% of whom reside in the United States.[2][3]

Contents

History

The IEEE corporate office is on the 17th floor of 3 Park Avenue in New York City

The IEEE is incorporated under the Not-for-Profit Corporation Law of the state of New York in the United States.[4] It was formed in 1963 by the merger of the Institute of Radio Engineers (IRE, founded 1912) and the American Institute of Electrical Engineers (AIEE, founded 1884).

The major interests of the AIEE were wire communications (telegraphy and telephony) and light and power systems. The IRE was concerned mostly with radio engineering, and was formed from two smaller organizations, the Society of Wireless and Telegraph Engineers and the Wireless Institute. With the rise of electronics in the 1930s, electronics engineers usually became members of the IRE, but the applications of electron tube technology became so extensive that the technical boundaries differentiating the IRE and the AIEE became difficult to distinguish. After World War II, the two organizations became increasingly competitive, and in 1961, the leadership of both the IRE and the AIEE resolved to consolidate the two organizations. The two organizations formally merged as the IEEE on January 1, 1963.

Notable presidents of IEEE and its founding organizations include Elihu Thomson (AIEE, 1889–1890), Alexander Graham Bell (AIEE, 1891–1892), Charles Proteus Steinmetz (AIEE, 1901–1902), Lee De Forest (IRE, 1930), Frederick E. Terman (IRE, 1941), William R. Hewlett (IRE, 1954), Ernst Weber (IRE, 1959; IEEE, 1963), and Ivan Getting (IEEE, 1978).

IEEE's Constitution defines the purposes of the organization as "scientific and educational, directed toward the advancement of the theory and practice of Electrical, Electronics, Communications and Computer Engineering, as well as Computer Science, the allied branches of engineering and the related arts and sciences."[1] In pursuing these goals, the IEEE serves as a major publisher of scientific journals and organizer of conferences, workshops, and symposia (many of which have associated published proceedings). It is also a leading standards development organization for the development of industrial standards (having developed over 900 active industry technical standards) in a broad range of disciplines, including electric power and energy, biomedical technology and healthcare, information technology, information assurance, telecommunications, consumer electronics, transportation, aerospace, and nanotechnology. IEEE develops and participates in educational activities such as accreditation of electrical engineering programs in institutes of higher learning. The IEEE logo is a diamond-shaped design which illustrates the right hand grip rule embedded in Benjamin Franklin's kite, and it was created at the time of the 1963 merger.[5]

IEEE has a dual complementary regional and technical structure – with organizational units based on geography (e.g., the IEEE Philadelphia Section, IEEE South Africa Section [1]) and technical focus (e.g., the IEEE Computer Society). It manages a separate organizational unit (IEEE-USA) which recommends policies and implements programs specifically intended to benefit the members, the profession and the public in the United States.

The IEEE includes 38 technical Societies, organized around specialized technical fields, with more than 300 local organizations that hold regular meetings.

The IEEE Standards Association is in charge of the standardization activities of the IEEE.

Publications

IEEE produces 30% of the world's literature in the electrical and electronics engineering and computer science fields, publishing well over 100 peer-reviewed journals.[6]

The published content in these journals as well as the content from several hundred annual conferences sponsored by the IEEE are available in the IEEE online digital library for subscription-based access and individual publication purchases.[7]

In addition to journals and conference proceedings, the IEEE also publishes tutorials and the standards that are produced by its standardization committees.

Educational activities

Picture of the place where an office of IEEE works in the District University of Bogotá, Colombia.

The IEEE provides learning opportunities within the engineering sciences, research, and technology. The goal of the IEEE education programs is to ensure the growth of skill and knowledge in the electricity-related technical professions and to foster individual commitment to continuing education among IEEE members, the engineering and scientific communities, and the general public.

IEEE offers educational opportunities such as IEEE eLearning Library,[8] the Education Partners Program,[9] Standards in Education[10] and Continuing Education Units (CEUs).[11]

IEEE eLearning Library is a collection of online educational courses designed for self-paced learning. Education Partners, exclusive for IEEE members, offers on-line degree programs, certifications and courses at a 10% discount. The Standards in Education website explains what standards are and the importance of developing and using them. The site includes tutorial modules and case illustrations to introduce the history of standards, the basic terminology, their applications and impact on products, as well as news related to standards, book reviews and links to other sites that contain information on standards. Currently, twenty-nine states in the United States require Professional Development Hours (PDH) to maintain a Professional Engineering license, encouraging engineers to seek Continuing Education Units (CEUs) for their participation in continuing education programs. CEUs readily translate into Professional Development Hours (PDHs), with 1 CEU being equivalent to 10 PDHs. Countries outside the United States, such as South Africa, similarly require continuing professional development (CPD) credits, and it is anticipated that IEEE Expert Now courses will feature in the CPD listing for South Africa.

IEEE also sponsors a website[12] designed to help young people understand better what engineering means, and how an engineering career can be made part of their future. Students of age 8–18, parents, and teachers can explore the site to prepare for an engineering career, ask experts engineering-related questions, play interactive games, explore curriculum links, and review lesson plans. This website also allows students to search for accredited engineering degree programs in Canada and the United States; visitors are able to search by state/province/territory, country, degree field, tuition ranges, room and board ranges, size of student body, and location (rural, suburban, or urban).

Standards and development process

IEEE is one of the leading standards-making organizations in the world. IEEE performs its standards making and maintaining functions through the IEEE Standards Association (IEEE-SA). IEEE standards affect a wide range of industries including: power and energy, biomedical and healthcare, Information Technology (IT), telecommunications, transportation, nanotechnology, information assurance, and many more. In 2005, IEEE had close to 900 active standards, with 500 standards under development. One of the more notable IEEE standards is the IEEE 802 LAN/MAN group of standards which includes the IEEE 802.3 Ethernet standard and the IEEE 802.11 Wireless Networking standard.

Membership and member grades

Most IEEE members are electrical and electronics engineers, but the organization's wide scope of interests has attracted people in other disciplines as well (e.g., computer science, mechanical engineering, civil engineering, biology, physics, and mathematics).

An individual can join the IEEE as a student member, professional member, or associate member. In order to qualify for membership, the individual must fulfil certain academic or professional criteria and abide by the code of ethics and bylaws of the organization. There are several categories and levels of IEEE membership and affiliation:

  • Student Members: Student membership is available for a reduced fee to those who are enrolled in an accredited institution of higher education as undergraduate or graduate students in technology or engineering.
  • Members: Ordinary or professional Membership requires that the individual have graduated from a technology or engineering program of an appropriately accredited institution of higher education or have demonstrated professional competence in technology or engineering through at least six years of professional work experience. An associate membership is available to individuals whose area of expertise falls outside the scope of the IEEE or who do not, at the time of enrollment, meet all the requirements for full membership. Students and Associates have all the privileges of members, except the right to vote and hold certain offices.
  • Society Affiliates: Some IEEE Societies also allow a person who is not an IEEE member to become a Society Affiliate of a particular Society within the IEEE, which allows a limited form of participation in the work of a particular IEEE Society.
  • Senior Members: Upon meeting certain requirements, a professional member can apply for Senior Membership, which is the highest level of recognition that a professional member can directly apply for. Applicants for Senior Member must have at least three letters of recommendation from Senior, Fellow, or Honorary members and fulfill other rigorous requirements of education, achievement, remarkable contribution, and experience in the field. The Senior Members are a selected group, and certain IEEE officer positions are available only to Senior (and Fellow) Members. Senior Membership is also one of the requirements for those who are nominated and elevated to the grade IEEE Fellow, a distinctive honor.
  • Fellow Members: The Fellow grade of membership is the highest level of membership, and cannot be applied for directly by the member – instead the candidate must be nominated by others. This grade of membership is conferred by the IEEE Board of Directors in recognition of a high level of demonstrated extraordinary accomplishment.
  • Honorary Members: Individuals who are not IEEE members but have demonstrated exceptional contributions, such as being a recipient of an IEEE Medal of Honor, may receive Honorary Membership from the IEEE Board of Directors.
  • Life Members and Life Fellows: Members who have reached the age of 65 and whose number of years of membership plus their age in years adds up to at least 100 are recognized as Life Members – and, in the case of Fellow members, as Life Fellows.

Awards

Through its awards program, the IEEE recognizes contributions that advance the fields of interest to the IEEE. For nearly a century, the IEEE Awards Program has paid tribute to technical professionals whose exceptional achievements and outstanding contributions have made a lasting impact on technology, society and the engineering profession.

Funds for the awards program, other than those provided by corporate sponsors for some awards, are administered by the IEEE Foundation.

Medals

Technical field awards

Recognitions

Prize paper awards

Scholarships

  • IEEE Life Members Graduate Study Fellowship in Electrical Engineering was established by the IEEE in 2000. The fellowship is awarded annually to a first-year, full-time graduate student pursuing a master's degree in electrical engineering at an engineering school or program of recognized standing worldwide.[13]
  • IEEE Charles LeGeyt Fortescue Graduate Scholarship was established by the IRE in 1939 to commemorate Charles LeGeyt Fortescue's contributions to electrical engineering. The scholarship is awarded for one year of full-time graduate work toward a master's degree in electrical engineering at an engineering school of recognized standing in the United States.[14]

Societies

IEEE is supported by 38 societies, each one focused on a certain knowledge area. They provide specialized publications, conferences, business networking and sometimes other services.[15][16]

Technical councils

IEEE technical councils are collaborations of several IEEE societies on a broader knowledge area. There are currently seven technical councils:[15][17]

Technical committees

To allow a quick response to new innovations, IEEE can also organize technical committees on top of their societies and technical councils. There are currently two such technical committees:[15]

Organizational units

IEEE Foundation

The IEEE Foundation is a charitable foundation established in 1973 to support and promote technology education, innovation and excellence.[18] It is incorporated separately from the IEEE, although it has a close relationship to it. Members of the Board of Directors of the foundation are required to be active members of IEEE, and one third of them must be current or former members of the IEEE Board of Directors.

Initially, the IEEE Foundation's role was to accept and administer donations for the IEEE Awards program, but donations increased beyond what was necessary for this purpose, and the scope was broadened. In addition to soliciting and administering unrestricted funds, the foundation also administers donor-designated funds supporting particular educational, humanitarian, historical preservation, and peer recognition programs of the IEEE.[18] As of the end of 2009, the foundation's total assets were $27 million, split equally between unrestricted and donor-designated funds.[19]

Copyright policy

The IEEE generally does not create its own research. It is a professional organization that coordinates journal peer-review activities and holds subject-specific conferences in which authors present their research. The IEEE then publishes the authors' papers in journals and other proceedings, and authors are required to transfer their copyright for works they submit for publication.[20][21]

Section 6.3.1 IEEE Copyright Policies – subsections 7 and 8 – states that "all authors…shall transfer to the IEEE in writing any copyright they hold for their individual papers", but that the IEEE will grant the authors permission to make copies and use the papers they originally authored, so long as such use is permitted by the Board of Directors. The guidelines for what the Board considers a "permitted" use are not entirely clear, although posting a copy on a personally controlled website is allowed. The author is also not allowed to change the work absent explicit approval from the organization. The IEEE justifies this practice in the first paragraph of that section, by stating that they will "serve and protect the interests of its authors and their employers".[20][21]

The IEEE places research papers and other publications such as IEEE standards behind a "paywall",[20] although the IEEE explicitly allows authors to make a copy of the papers that they authored freely available on their own website. As of September 2011, the IEEE also provides authors for most new journal papers with the option to pay to allow free download of their papers by the public from the IEEE publication website.[22]

IEEE publications have received a Green[23] rating from the SHERPA/RoMEO guide[24] for affirming that "authors and/or their companies shall have the right to post their IEEE-copyrighted material on their own servers without permission" (IEEE Publication Policy 8.1.9.D[25]). This open access policy effectively allows authors, at their choice, to make their articles openly available. Roughly 1/3 of IEEE authors take this route[citation needed].

Some other professional associations use different copyright policies. For example, the USENIX association[20] requires that the author only give up the right to publish the paper elsewhere for 12 months (in addition to allowing authors to post copies of the paper on their own website during that time). The organization operates successfully even though all of its publications are freely available online.[20]

See also

References

  1. ^ a b "IEEE Technical Activities Board Operations Manual". IEEE. Retrieved December 7, 2010 (2010-12-07). , section 1.3 Technical activities objectives
  2. ^ "IEEE at a Glance > IEEE Quick Facts". IEEE. December 31, 2010 (2010-12-31). Retrieved May 5, 2013 (2013-05-05). 
  3. ^ "IEEE 2012 Annual Report". IEEE. October 2011 (2011-10). Retrieved May 5, 2013 (2013-05-05). 
  4. ^ "IEEE Technical Activities Board Operations Manual". IEEE. Retrieved November 10, 2010 (2010-11-10). , section 1.1 IEEE Incorporation
  5. ^ "IEEE – Master Brand and Logos". www.ieee.org. Retrieved 2011-01-28. 
  6. ^ About IEEE
  7. ^ IEEE's online digital library
  8. ^ IEEE – IEEE Expert Now
  9. ^ IEEE – IEEE Education Partners Program
  10. ^ IEEE – The IEEE Standards Education pages have moved
  11. ^ IEEE – IEEE Continuing Education Units
  12. ^ Welcome to TryEngineering.org
  13. ^ IEEE Life Member Graduate Study Fellowship. Retrieved on 2010-01-23.
  14. ^ Charles LeGeyt Fortescue Graduate Scholarship. Retrieved on 2010-01-23.
  15. ^ a b c "IEEE Societies & Communities". IEEE. Retrieved November 7, 2010 (2010-11-07). 
  16. ^ "IEEE Society Memberships". IEEE. Retrieved November 7, 2010 (2010-11-07). 
  17. ^ "IEEE Technical Councils". IEEE. Retrieved November 8, 2010 (2010-11-08). 
  18. ^ a b IEEE Foundation Home page
  19. ^ IEEE Foundation Overview page
  20. ^ a b c d e Johns, Chris (March 12, 2011). "Matt Blaze’s criticism of the ACM and the IEEE". Washington College of Law Intellectual Property Brief (American University). Retrieved 2011-04-17.  This section uses content available under the CC-BY-SA 3.0 License.
  21. ^ a b "6.3.1 IEEE Copyright Policies" (Available online). IEEE. 2011. Retrieved 2011-04-17. 
  22. ^ Davis, Amanda, Most IEEE Journals are now Open Access, The Institute, October 7, 2011.
  23. ^ Sherpa Romeo color code
  24. ^ Sherpa Romeo site
  25. ^ IEEE Publication Policy 8.1.9.D[dead link]

External links

  • Official IEEE website
  • IEEE Global History Network – a wiki-based website containing information about the history of IEEE, its members, their professions, and their technologies.
  • IEEE Xplore – the IEEE Xplore Digital Library, with over 2.6 million technical documents available online for purchase.
  • IEEE.tv – a video content website operated by the IEEE.
  • IEEE eLearning Library – an online library of more than 200 self-study multimedia short courses and tutorials in technical fields of interest to the IEEE.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Information_extraction b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Information_extraction new file mode 100644 index 00000000..6f796a3a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Information_extraction @@ -0,0 +1 @@ + Information extraction - Wikipedia, the free encyclopedia

Information extraction

From Wikipedia, the free encyclopedia

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction of corporate mergers from newswire reports, as denoted by the formal relation:

MergerBetween(company_1, company_2, date),

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
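
As a purely illustrative sketch of this kind of extraction, the following Python snippet pulls the MergerBetween-style relation out of the example sentence above with a hand-written pattern; the regular expression and the crude company-name heuristic are assumptions made for the example, not a real IE system.

    import re

    # Toy pattern for sentences of the form "<Buyer> ... acquisition of <Target>".
    # The company-name heuristic (capitalized words ending in Inc./Corp.) is far
    # weaker than real named entity recognition and is used only for illustration.
    PATTERN = re.compile(
        r"(?P<buyer>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*\s(?:Inc|Corp)\.)"
        r".*?acquisition of\s"
        r"(?P<target>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*\s(?:Inc|Corp)\.?)"
    )

    def extract_merger(sentence, date):
        m = PATTERN.search(sentence)
        if m:
            return ("MergerBetween", m.group("buyer"), m.group("target"), date)
        return None

    print(extract_merger(
        "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.",
        date="yesterday"))
    # -> ('MergerBetween', 'Foo Inc.', 'Bar Corp.', 'yesterday')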

Contents

History [edit]

Information extraction dates back to the late 1970s in the early days of NLP.[1] An early commercial system from the mid-1980s was JASPER built for Reuters by the Carnegie Group with the aim of providing real-time financial news to financial traders.[2]

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences. MUC is a competition-based conference that focused on the following domains:

  • MUC-1 (1987), MUC-2 (1989): Naval operations messages.
  • MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
  • MUC-5 (1993): Joint ventures and microelectronics domain.
  • MUC-6 (1995): News articles on management changes.
  • MUC-7 (1998): Satellite launch reports.

Considerable support came from DARPA, the US defense agency, who wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.

Present significance [edit]

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the world wide web, refers to the existing Internet as the web of documents [3] and advocates that more of the content be made available as a web of data.[4] Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking-up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.[5]

Tasks and subtasks [edit]

Applying information extraction to text is linked to the problem of text simplification: the goal is to create a structured view of the information present in free text, and thus a more easily machine-readable representation of its sentences. Typical subtasks of IE include:

  • Named entity extraction which could include:
    • Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without having any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing", named entity detection would denote detecting that the phrase "M. Smith" does refer to a person, but without necessarily having (or using) any knowledge about a certain M. Smith who is (/or, "might be") the specific person whom that sentence is talking about.
    • Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously-extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" is referring to the previously detected person "M. Smith".
    • Relationship extraction: identification of relations between entities, such as:
      • PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
      • PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
  • Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
    • Table extraction: finding and extracting tables from documents.
    • Comment extraction: extracting comments from the actual content of an article in order to restore the link between each comment and its author.
  • Language and vocabulary analysis
  • Audio extraction
    • Template-based music extraction: finding relevant characteristics in an audio signal drawn from a given repertoire; for instance, the time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.[6]

Note that this list is not exhaustive, that the exact meaning of IE activities is not universally agreed upon, and that many approaches combine multiple IE sub-tasks in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.

IE on non-text documents is becoming an increasingly important research topic, and information extracted from multimedia documents can now be expressed in a high-level structure just as it is for text. This naturally leads to the fusion of information extracted from multiple kinds of documents and sources.

World Wide Web applications [edit]

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and layout format that are available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.

Wrappers typically handle highly structured collections of web pages, such as product catalogues and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent efforts on adaptive information extraction motivate the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured text.

Approaches [edit]

Three standard approaches are now widely accepted

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.

Free or open source software and services [edit]

See also [edit]

Lists

References [edit]

  1. ^ Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. "Automatic Extraction of Facts from Press Releases to Generate News Stories". CiteSeerX: 10.1.1.14.7943. 
  2. ^ Cowie, Jim; Wilks, Yorick. "Information Extraction". CiteSeerX: 10.1.1.61.6480. 
  3. ^ "Linked Data - The Story So Far". 
  4. ^ "Tim Berners-Lee on the next Web". 
  5. ^ R. K. Srihari, W. Li, C. Niu and T. Cornell,"InfoXtract: A Customizable Intermediate Level Information Extraction Engine",Journal of Natural Language Engineering, Cambridge U. Press , 14(1), 2008, pp.33-69.
  6. ^ A.Zils, F.Pachet, O.Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from Polyphonic Music Signals, Proceedings of WedelMusic, Darmstadt, Germany, 2002.
  7. ^ Peng, F.; McCallum, A. (2006). "Information extraction from research papers using conditional random fields☆". Information Processing & Management 42: 963. doi:10.1016/j.ipm.2005.09.002.  edit
  8. ^ Shimizu, Nobuyuki; Hass, Andrew (2006). "Extracting Frame-based Knowledge Representation from Route Instructions". 

External links [edit]

Enterprise Search


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/International_Conference_on_Very_Large_Data_Bases b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/International_Conference_on_Very_Large_Data_Bases new file mode 100644 index 00000000..aceeb76f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/International_Conference_on_Very_Large_Data_Bases @@ -0,0 +1 @@ + VLDB - Wikipedia, the free encyclopedia

VLDB

From Wikipedia, the free encyclopedia
VLDB
Abbreviation VLDB
Discipline Database
Publication details
Publisher VLDB Endowment Inc.
History 1975–
Frequency annual

VLDB is an annual conference held by the non-profit Very Large Data Base Endowment Inc. The mission of VLDB is to promote and exchange scholarly work in databases and related fields throughout the world. The VLDB conference began in 1975 and is closely associated with SIGMOD and SIGKDD.

The acceptance rate of VLDB, averaged from 1993 to 2007, is 16%,[1] and the rate for the Core Database Technology track was 16.7% in 2009 and 18.4% in 2010.[2]

Venues[edit]

Year City Country Link
2014 Hangzhou China http://www.vldb.org/2014/
2013 Riva del Garda Italy http://www.vldb.org/2013/
2012 Istanbul Turkey http://www.vldb2012.org
2011 Seattle United States http://www.vldb.org/2011/
2010 Singapore http://vldb2010.org/
2009 Lyon France http://vldb2009.org/
2008 Auckland New Zealand VLDB at cs.auckland.ac.nz
2007 Vienna Austria http://www.vldb2007.org/
2006 Seoul South Korea dblp
2005 Trondheim Norway dblp
2004 Toronto Canada dblp
2003 Berlin Germany dblp
2002 Hong Kong China dblp
2001 Rome Italy dblp
2000 Cairo Egypt dblp
1999 Edinburgh Scotland uni-trier.de
1998 New York USA uni-trier.de
1997 Athens Greece uni-trier.de

References[edit]

  1. ^ Apers, Peter (2007). "Acceptance rates major database conferences". Retrieved 2009-06-12. 
  2. ^ "VLDB Statistics". 2010. Retrieved 2012-09-17. 

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/KDD_Conference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/KDD_Conference new file mode 100644 index 00000000..24c864dd --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/KDD_Conference @@ -0,0 +1 @@ + SIGKDD - Wikipedia, the free encyclopedia

SIGKDD

From Wikipedia, the free encyclopedia
  (Redirected from KDD Conference)

SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. It became an official ACM SIG in 1998. The official web page of SIGKDD can be found at www.KDD.org. The current Chairman of SIGKDD (since 2009) is Usama M. Fayyad, Ph.D.

Contents

Conferences[edit]

SIGKDD has hosted an annual conference - the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) - since 1995. The KDD conferences grew out of KDD (Knowledge Discovery and Data Mining) workshops at AAAI conferences, which were started by Gregory Piatetsky-Shapiro in 1989, 1991, and 1993, and by Usama Fayyad in 1994.[1] The conference papers of each Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining are published through the ACM.[2]

KDD-2012 took place in Beijing, China,[3] and KDD-2013 will take place in Chicago, United States, on August 11–14, 2013.

KDD-Cup[edit]

SIGKDD sponsors the KDD Cup competition every year in conjunction with the annual conference. It is aimed at members of the industry and academia, particularly students, interested in KDD.

Awards[edit]

The group also annually recognizes members of the KDD community with its Innovation Award and Service Award. Additionally, KDD presents a Best Paper Award [4] to recognize the highest quality paper at each conference.

SIGKDD Explorations[edit]

SIGKDD has also published a biannual academic journal titled SIGKDD Explorations since June 1999.

Editors in Chief

Current Executive Committee[edit]

Chair

Treasurer

Directors

Former Chairpersons

  • Gregory Piatetsky-Shapiro[8] (2005-2008)
  • Won Kim (1998-2004)

Information Directors[edit]

References[edit]

External links[edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/K_optimal_pattern_discovery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/K_optimal_pattern_discovery new file mode 100644 index 00000000..31eea5d4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/K_optimal_pattern_discovery @@ -0,0 +1 @@ + K-optimal pattern discovery - Wikipedia, the free encyclopedia

K-optimal pattern discovery

From Wikipedia, the free encyclopedia

K-optimal pattern discovery is a data mining technique that provides an alternative to the frequent pattern discovery approach that underlies most association rule learning techniques.

Frequent pattern discovery techniques find all patterns for which there are sufficiently frequent examples in the sample data. In contrast, k-optimal pattern discovery techniques find the k patterns that optimize a user-specified measure of interest. The parameter k is also specified by the user.

Examples of k-optimal pattern discovery techniques include:

  • k-optimal classification rule discovery.[1]
  • k-optimal subgroup discovery.[2]
  • finding k most interesting patterns using sequential sampling.[3]
  • mining top.k frequent closed patterns without minimum support.[4]
  • k-optimal rule discovery.[5]

In contrast to k-optimal rule discovery and frequent pattern mining techniques, subgroup discovery focuses on mining interesting patterns with respect to a specified target property of interest. This includes, for example, binary, nominal, or numeric attributes,[6] but also more complex target concepts such as correlations between several variables. Background knowledge[7] like constraints and ontological relations can often be successfully applied for focusing and improving the discovery results.
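
To make the contrast with frequent pattern mining concrete, the following Python sketch enumerates simple single-item rules and returns the k rules that maximize a user-chosen interest measure (leverage here) rather than all rules above a frequency threshold. The function, the measure, and the toy data are illustrative assumptions and do not correspond to any of the specific algorithms cited above.

    from itertools import permutations
    import heapq

    def k_optimal_rules(transactions, k=3):
        """Toy k-optimal pattern discovery: score every single-item rule X -> Y
        by leverage, P(X and Y) - P(X)P(Y), and keep only the k best rules."""
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        scored = []
        for x, y in permutations(items, 2):
            p_x = sum(x in t for t in transactions) / n
            p_y = sum(y in t for t in transactions) / n
            p_xy = sum(x in t and y in t for t in transactions) / n
            scored.append((p_xy - p_x * p_y, f"{x} -> {y}"))
        return heapq.nlargest(k, scored)  # keep only the k best rules, not all frequent ones

    # Toy usage on a few market baskets.
    baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
               {"milk"}, {"bread", "jam"}, {"butter", "milk"}]
    for score, rule in k_optimal_rules(baskets, k=3):
        print(f"{rule}: leverage={score:.2f}")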

References[edit]

  1. ^ Webb, G. I. (1995). OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 431-465.
  2. ^ Wrobel, Stefan (1997) An algorithm for multi-relational discovery of subgroups. In Proceedings First European Symposium on Principles of Data Mining and Knowledge Discovery. Springer.
  3. ^ Scheffer, T., & Wrobel, S. (2002). Finding the most interesting patterns in a database quickly by using sequential sampling. Journal of Machine Learning Research, 3, 833-862.
  4. ^ Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002) Mining top-k frequent closed patterns without minimum support. In Proceedings of the International Conference on Data Mining, pp. 211-218.
  5. ^ Webb, G. I., & Zhang, S. (2005). K-optimal rule discovery. Data Mining and Knowledge Discovery, 10(1), 39-79.
  6. ^ Kloesgen, W. (1996). Explora: A multipattern and multistrategy discovery assistant. Advances in Knowledge Discovery and Data Mining, pp. 249-271.
  7. ^ Atzmueller, M., Puppe, F., Buscher HP. (2005). Exploiting background knowledge for knowledge-intensive subgroup discovery. Proc. IJCAI'05: 19th International Joint Conference on Artificial Intelligence. Morgan Kaufmann

External links[edit]

Software[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Lift_data_mining_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Lift_data_mining_ new file mode 100644 index 00000000..83754b4a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Lift_data_mining_ @@ -0,0 +1 @@ + Lift (data mining) - Wikipedia, the free encyclopedia

Lift (data mining)

From Wikipedia, the free encyclopedia

In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.

For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).

Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift. Organizations can then consider each quantile, and by weighing the predicted response rate (and associated financial benefit) against the cost, they can decide whether to market to that quantile or not.

Lift is analogous to information retrieval's average precision metric, if one treats the precision (the fraction of predicted positives that are true positives) as the target response probability.

The Lift curve can also be considered a variation on the Receiver operating characteristic (ROC) curve, and is also known in econometrics as the Lorenz or power curve.[1]

Example [edit]

Assume the data set being mined is:

Antecedent Consequent
A 0
A 0
A 1
A 0
B 1
B 0
B 1

where the antecedent is the input variable that we can control, and the consequent is the variable we are trying to predict. Real mining problems would typically have more complex antecedents, but usually focus on single-value consequents.

Most mining algorithms would determine the following rules (targeting models):

  • Rule 1: A implies 0
  • Rule 2: B implies 1

because these are simply the most common patterns found in the data. A simple review of the above table should make these rules obvious.

The support for Rule 1 is 3/7 because three of the seven records have antecedent A and consequent 0. The support for Rule 2 is 2/7 because two of the seven records meet the antecedent of B and the consequent of 1. The supports can be written as:

supp(A \Rightarrow 0) = P(A \wedge 0) = P(A)P(0|A) = P(0)P(A|0)
supp(B \Rightarrow 1) = P(B \wedge 1) = P(B)P(1|B) = P(1)P(B|1)

The confidence for Rule 1 is 3/4 because three of the four records that meet the antecedent of A meet the consequent of 0. The confidence for Rule 2 is 2/3 because two of the three records that meet the antecedent of B meet the consequent of 1. The confidences can be written as:

conf(A \Rightarrow 0) = P(0|A)
conf(B \Rightarrow 1) = P(1|B)

Lift can be found by dividing the confidence by the unconditional probability of the consequent, or by dividing the support by the probability of the antecedent times the probability of the consequent, so:

  • The lift for Rule 1 is (3/4)/(4/7) = 21/16 = 1.3125.
  • The lift for Rule 2 is (2/3)/(3/7) = (2/3) × (7/3) = 14/9 ≈ 1.56.

lift(A \Rightarrow 0) = \frac{P(0|A)}{P(0)} = \frac{P(A \wedge 0)}{P(A)P(0)}
lift(B \Rightarrow 1) = \frac{P(1|B)}{P(1)} = \frac{P(B \wedge 1)}{P(B)P(1)}

If some rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.

If the lift is greater than 1, as it is here for Rules 1 and 2, it tells us the degree to which those two occurrences depend on one another, and makes those rules potentially useful for predicting the consequent in future data sets.

Observe that even though Rule 1 has higher confidence, it has lower lift. Intuitively, it would seem that Rule 1 is more valuable because of its higher confidence—it seems more accurate (better supported). But accuracy of the rule independent of the data set can be misleading. The value of lift is that it considers both the confidence of the rule and the overall data set.
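
The following short Python sketch recomputes the support, confidence, and lift values derived above from the same toy data set; the helper function names are illustrative.

    from fractions import Fraction

    # The toy data set from the table above: (antecedent, consequent) pairs.
    data = [("A", 0), ("A", 0), ("A", 1), ("A", 0), ("B", 1), ("B", 0), ("B", 1)]

    def support(ante, cons):
        return Fraction(sum(a == ante and c == cons for a, c in data), len(data))

    def confidence(ante, cons):
        return Fraction(sum(a == ante and c == cons for a, c in data),
                        sum(a == ante for a, _ in data))

    def lift(ante, cons):
        p_cons = Fraction(sum(c == cons for _, c in data), len(data))
        return confidence(ante, cons) / p_cons

    for ante, cons in [("A", 0), ("B", 1)]:
        print(f"{ante} => {cons}: supp={support(ante, cons)}, "
              f"conf={confidence(ante, cons)}, lift={lift(ante, cons)} "
              f"({float(lift(ante, cons)):.4f})")
    # A => 0: supp=3/7, conf=3/4, lift=21/16 (1.3125)
    # B => 1: supp=2/7, conf=2/3, lift=14/9 (1.5556)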

References [edit]

  1. ^ Tufféry, Stéphane (2011); Data Mining and Statistics for Decision Making, Chichester, GB: John Wiley & Sons, translated from the French Data Mining et statistique décisionnelle (Éditions Technip, 2008)

See also [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/List_of_machine_learning_algorithms b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/List_of_machine_learning_algorithms new file mode 100644 index 00000000..421ec434 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/List_of_machine_learning_algorithms @@ -0,0 +1 @@ + List of machine learning algorithms - Wikipedia, the free encyclopedia

List of machine learning algorithms

From Wikipedia, the free encyclopedia

Contents

Supervised learning[edit]

Statistical classification[edit]

Unsupervised learning[edit]

Association rule learning[edit]

Hierarchical clustering[edit]

Partitional clustering[edit]

Reinforcement learning[edit]

Others[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Local_outlier_factor b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Local_outlier_factor new file mode 100644 index 00000000..fa37434b --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Local_outlier_factor @@ -0,0 +1 @@ + Local outlier factor - Wikipedia, the free encyclopedia

Local outlier factor

From Wikipedia, the free encyclopedia

Local outlier factor (LOF) is an anomaly detection algorithm presented as "LOF: Identifying Density-based Local Outliers" by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander.[1] The key idea of LOF is comparing the local density of a point's neighborhood with the local density of its neighbors.

LOF shares some concepts with DBSCAN and OPTICS such as the concepts of "core distance" and "reachability distance", which are used for local density estimation.

Contents

Basic idea[edit]

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.

As indicated by the title, the local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters.

Formal[edit]

Let \mbox{k-distance}(A) be the distance of the object A to its k-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which can in the case of a "tie" be more than k objects. We denote the set of the k nearest neighbors as N_k(A).

Illustration of the reachability distance. Objects B and C have the same reachability distance (k=3), while D is not a k nearest neighbor

This distance is used to define what is called reachability distance:

\mbox{reachability-distance}_k(A,B)=\max\{\mbox{k-distance}(B), d(A,B)\}

In words, the reachability distance of an object A from B is the true distance of the two objects, but at least the \mbox{k-distance} of B. Objects that belong to the k nearest neighbors of B (the "core" of B, see DBSCAN cluster analysis) are considered to be equally distant. The reason for this distance is to get more stable results. Note that this is not a distance in the mathematical definition, since it is not symmetric.

The local reachability density of an object A is defined by

\mbox{lrd}(A):=1/\left(\frac{\sum_{B\in N_k(A)}\mbox{reachability-distance}_k(A, B)}{|N_k(A)|}\right)

This is the inverse of the average reachability distance of the object A from its neighbors. Note that it is not the average reachability of the neighbors from A (which by definition would be the \mbox{k-distance}(A)), but the distance at which A can be "reached" from its neighbors. With duplicate points, this value can become infinite.

The local reachability densities are then compared with those of the neighbors using

 \mbox{LOF}_k(A):=\frac{\sum_{B\in N_k(A)}\frac{\mbox{lrd}(B)}{\mbox{lrd}(A)}}{|N_k(A)|} = \frac{\sum_{B\in N_k(A)}\mbox{lrd}(B)}{|N_k(A)|} / \mbox{lrd}(A)

This is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.
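
A compact brute-force Python implementation of the definitions above, suitable only for small data sets; the function names mirror the formulas rather than any particular library, and ties at the k-distance are ignored for simplicity.

    import numpy as np

    def lof_scores(X, k=3):
        """Brute-force LOF: k-distance, reachability distance, local reachability
        density (lrd), and finally the LOF ratio for every point in X."""
        X = np.asarray(X, dtype=float)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
        np.fill_diagonal(D, np.inf)                                 # exclude each point itself
        knn = np.argsort(D, axis=1)[:, :k]                          # indices of the k nearest neighbors
        k_dist = D[np.arange(len(X)), knn[:, -1]]                   # k-distance of each point

        def lrd(a):
            # reachability-distance_k(a, b) = max(k-distance(b), d(a, b)), averaged over neighbors
            reach = np.maximum(k_dist[knn[a]], D[a, knn[a]])
            return 1.0 / reach.mean()

        lrds = np.array([lrd(a) for a in range(len(X))])
        return np.array([lrds[knn[a]].mean() / lrds[a] for a in range(len(X))])

    # Toy usage: one obvious outlier far away from a small cluster.
    points = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [8, 8]]
    print(np.round(lof_scores(points, k=3), 2))  # the last score is clearly larger than 1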

Advantages[edit]

LOF scores as visualized by ELKI. While the upper right cluster has a comparable density to the outliers close to the bottom left cluster, they are detected correctly.

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.

While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in network intrusion detection.[2]

Disadvantages and Extensions[edit]

The resulting values are quotients and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another data set and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a data set due to the locality of the method. There exist extensions of LOF that try to improve over LOF in these aspects:

  • Feature Bagging for Outlier Detection [3] runs LOF on multiple projections and combines the results for improved detection qualities in high dimensions.
  • Local Outlier Probability (LoOP)[4] is a method derived from LOF but using inexpensive local statistics to become less sensitive to the choice of the parameter k. In addition, the resulting values are scaled to a value range of [0:1].
  • Interpreting and Unifying Outlier Scores [5] proposes a normalization of the LOF outlier scores to the interval [0:1] using statistical scaling to increase usability; it can be seen as an improved version of the LoOP ideas.
  • On Evaluation of Outlier Rankings and Outlier Scores [6] proposes methods for measuring similarity and diversity of methods for building advanced outlier detection ensembles using LOF variants and other algorithms and improving on the Feature Bagging approach discussed above.

References[edit]

  1. ^ Breunig, M. M.; Kriegel, H. -P.; Ng, R. T.; Sander, J. (2000). "LOF: Identifying Density-based Local Outliers". ACM SIGMOD Record 29: 93. doi:10.1145/335191.335388.  edit
  2. ^ Ar Lazarevic, Aysel Ozgur, Levent Ertoz, Jaideep Srivastava, Vipin Kumar (2003). "A comparative study of anomaly detection schemes in network intrusion detection". Proc. 3rd SIAM International Conference on Data Mining: 25–36. 
  3. ^ Lazarevic, A.; Kumar, V. (2005). "Feature bagging for outlier detection". Proc. 11th ACM SIGKDD international conference on Knowledge Discovery in Data Mining: 157–166. doi:10.1145/1081870.1081891.  edit
  4. ^ Kriegel, H. -P.; Kröger, P.; Schubert, E.; Zimek, A. (2009). "LoOP: Local Outlier Probabilities". Proc. 18th ACM Conference on Information and Knowledge Management (CIKM): 1649. doi:10.1145/1645953.1646195.  edit
  5. ^ Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek (2011). "Interpreting and Unifying Outlier Scores". Proc. 11th SIAM International Conference on Data Mining. 
  6. ^ Erich Schubert, Remigius Wojdanowski, Hans-Peter Kriegel, Arthur Zimek (2012). "On Evaluation of Outlier Rankings and Outlier Scores". Proc. 12th SIAM International Conference on Data Mining. 


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Machine_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Machine_learning new file mode 100644 index 00000000..b58ff4ef --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Machine_learning @@ -0,0 +1 @@ + Machine learning - Wikipedia, the free encyclopedia

Machine learning

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.
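
A minimal sketch of that spam example, assuming scikit-learn is available; the four toy messages and the choice of a bag-of-words naive Bayes model are illustrative assumptions, not a prescribed method:

    # Train a toy spam/non-spam classifier on a handful of example messages.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    messages = ["win money now", "cheap pills offer", "meeting at noon",
                "project report attached"]
    labels = ["spam", "spam", "ham", "ham"]

    vectorizer = CountVectorizer()             # bag-of-words representation
    X = vectorizer.fit_transform(messages)

    model = MultinomialNB().fit(X, labels)     # learn from the labeled examples

    new_mail = ["cheap money offer"]
    print(model.predict(vectorizer.transform(new_mail)))   # expected: ['spam']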

The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory.

There is a wide variety of machine learning tasks and successful applications. Optical character recognition, in which printed characters are recognized automatically based on previous examples, is a classic example of machine learning.[1]

Contents

Definition[edit]

In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".[2]

Tom M. Mitchell provided a widely quoted, more formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E".[3] This definition is notable for its defining machine learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's proposal in Turing's paper "Computing Machinery and Intelligence" that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"[4]

Generalization[edit]

Generalization in this context is the ability of an algorithm to perform accurately on new, unseen examples after having trained on a learning data set. The core objective of a learner is to generalize from its experience.[5][6] The training examples come from some generally unknown probability distribution and the learner has to extract from them something more general, something about that distribution, that allows it to produce useful predictions in new cases.

Machine learning and data mining[edit]

These two terms are commonly confused, as they often employ the same methods and overlap significantly. They can be roughly defined as follows:

  • Machine learning focuses on prediction, based on known properties learned from the training data.
  • Data mining (which is the analysis step of Knowledge Discovery in Databases) focuses on the discovery of (previously) unknown properties of the data.

The two areas overlap in many ways: data mining uses many machine learning methods, but often with a slightly different goal in mind. On the other hand, machine learning also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.

Human interaction[edit]

Some machine learning systems attempt to eliminate the need for human intuition in data analysis, while others adopt a collaborative approach between human and machine. Human intuition cannot, however, be entirely eliminated, since the system's designer must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data.

Algorithm types[edit]

Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or on the type of input available during training[citation needed].

  • Supervised learning generates a function that maps inputs to desired outputs (also called labels, because they are often provided by human experts labeling the training examples). For example, in a classification problem, the learner approximates a function mapping a vector into classes by looking at input-output examples of the function.
  • Unsupervised learning models a set of inputs, like clustering. See also data mining and knowledge discovery. Here, labels are not known during training.
  • Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier. Transduction, or transductive inference, tries to predict new outputs on specific and fixed (test) cases from observed, specific (training) cases.
  • Reinforcement learning learns how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback in the form of rewards that guides the learning algorithm.
  • Learning to learn learns its own inductive bias based on previous experience.
  • Developmental learning, elaborated for Robot learning, generates its own sequences (also called curriculum) of learning situations to cumulatively acquire repertoires of novel skills through autonomous self-exploration and social interaction with human teachers, and using guidance mechanisms such as active learning, maturation, motor synergies, and imitation.

Theory[edit]

The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common.

In addition to performance bounds, computational learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

There are many similarities between machine learning theory and statistical inference, although they use different terms.

Approaches[edit]

Decision tree learning[edit]

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value.
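
A minimal sketch, assuming scikit-learn (0.21 or later for export_text); the AND-shaped toy data is an assumption chosen only to make the learned rules readable:

    # Fit a small decision tree that maps observations to a target value.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]       # two binary attributes
    y = [0, 0, 0, 1]                           # target: logical AND of the attributes

    tree = DecisionTreeClassifier().fit(X, y)
    print(export_text(tree, feature_names=["x1", "x2"]))   # human-readable split rules
    print(tree.predict([[1, 1]]))              # -> [1]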

Association rule learning[edit]

Association rule learning is a method for discovering interesting relations between variables in large databases.
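
As a hedged illustration of the underlying measures, the support and confidence of a single candidate rule can be counted directly; the transactions and the rule {bread} -> {butter} below are made up for the example:

    # Count support and confidence for the rule {bread} -> {butter}
    # over a toy list of market-basket transactions.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    antecedent, consequent = {"bread"}, {"butter"}
    supp = support(antecedent | consequent)     # fraction of transactions with both items
    conf = supp / support(antecedent)           # estimate of P(butter | bread)
    print("support = %.2f, confidence = %.2f" % (supp, conf))   # 0.50 and 0.67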

Artificial neural networks[edit]

An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is a learning algorithm that is inspired by the structure and functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.

Genetic programming[edit]

Genetic programming (GP) is an evolutionary algorithm-based methodology inspired by biological evolution to find computer programs that perform a user-defined task. It is a specialization of genetic algorithms (GA) where each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task.

Inductive logic programming[edit]

Inductive logic programming (ILP) is an approach to rule learning using logic programming as a uniform representation for examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program which entails all the positive and none of the negative examples.

Support vector machines[edit]

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
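
A minimal sketch, assuming scikit-learn; the two point clouds and the linear kernel are illustrative assumptions:

    # Train a linear SVM on two labeled groups of points and classify new examples.
    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [1, 0], [8, 8], [9, 8], [8, 9]]
    y = [0, 0, 0, 1, 1, 1]                     # the two categories

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([[7, 7], [0, 1]]))       # -> [1 0]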

Clustering[edit]

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some predesignated criterion or criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated for example by internal compactness (similarity between members of the same cluster) and separation between different clusters. Other methods are based on estimated density and graph connectivity. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis.
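
A minimal k-means sketch, assuming scikit-learn; the two toy groups of points are an assumption, and other clustering techniques would make different assumptions about the data:

    # Group unlabeled points into two clusters with k-means.
    from sklearn.cluster import KMeans

    X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 5.0], [5.1, 4.9]]

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(km.fit_predict(X))                   # e.g. [0 0 0 1 1 1] (label order may swap)
    print(km.cluster_centers_)                 # one centroid per cluster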

Bayesian networks[edit]

A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning.

Reinforcement learning[edit]

Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.

Representation learning[edit]

Several learning algorithms, mostly unsupervised learning algorithms, aim at discovering better representations of the inputs provided during training. Classical examples include principal components analysis and cluster analysis. Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or prediction: they allow the inputs coming from the unknown data-generating distribution to be reconstructed, while not necessarily being faithful to configurations that are implausible under that distribution. Manifold learning algorithms attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse (has many zeros). Multilinear subspace learning algorithms aim to learn low-dimensional representations directly from tensor representations of multidimensional data, without reshaping them into (high-dimensional) vectors.[7] Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.[8]
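
A minimal sketch of the principal components analysis example mentioned above, assuming scikit-learn and NumPy; the synthetic data with one redundant feature is an assumption:

    # Learn a lower-dimensional representation of the inputs with PCA,
    # then map data into that representation for downstream use.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 5))
    X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)   # a nearly redundant feature

    pca = PCA(n_components=2)
    Z = pca.fit_transform(X)                   # 2-D representation used downstream
    print(Z.shape, pca.explained_variance_ratio_)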

Similarity and metric learning[edit]

In this problem, the learning machine is given pairs of examples that are considered similar and pairs of less similar objects. It then needs to learn a similarity function (or a distance metric function) that can predict if new objects are similar. It is sometimes used in Recommendation systems.

Sparse Dictionary Learning[edit]

In this method, a datum is represented as a linear combination of basis functions, with sparse coefficients. Let x be a d-dimensional datum and D a d-by-n matrix in which each column represents a basis function; r is the coefficient vector used to represent x with D. Mathematically, sparse dictionary learning means finding a coefficient vector r such that x ≈ D r with r sparse. Generally speaking, n is assumed to be larger than d to allow the freedom for a sparse representation.

Sparse dictionary learning has been applied in several contexts. In classification, the problem is to determine which class a previously unseen datum belongs to. Suppose a dictionary for each class has already been built. Then a new datum is associated with the class whose dictionary represents it best in a sparse way. Sparse dictionary learning has also been applied in image de-noising. The key idea is that a clean image patch can be sparsely represented by an image dictionary, but the noise cannot.[9]
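
A minimal sketch, assuming scikit-learn's DictionaryLearning; the synthetic data, the overcomplete size n = 16 > d = 8 and the limit of three non-zero coefficients per datum are assumptions for illustration:

    # Learn a dictionary D and sparse codes r so that each datum x ≈ D r.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.RandomState(0)
    X = rng.normal(size=(30, 8))               # 30 data points, d = 8

    dico = DictionaryLearning(n_components=16,               # n > d: overcomplete dictionary
                              transform_algorithm="omp",
                              transform_n_nonzero_coefs=3,
                              random_state=0)
    R = dico.fit(X).transform(X)               # sparse codes, one row per datum
    X_hat = R @ dico.components_               # reconstruction from the sparse codes
    print((R != 0).sum(axis=1))                # at most 3 non-zero coefficients per datum
    print(np.abs(X - X_hat).mean())            # average reconstruction error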

Applications[edit]

Applications for machine learning include:

In 2006, the online movie company Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[10]

Software[edit]

Ayasdi, Angoss KnowledgeSTUDIO, Apache Mahout, Gesture Recognition Toolkit, IBM SPSS Modeler, KNIME, KXEN Modeler, LIONsolver, MATLAB, mlpy, MCMLL, OpenCV, dlib, Oracle Data Mining, Orange, Python scikit-learn, R, RapidMiner, Salford Predictive Modeler, SAS Enterprise Miner, Shogun toolbox, STATISTICA Data Miner, and Weka are software suites containing a variety of machine learning algorithms.

Journals and conferences[edit]

See also[edit]

References[edit]

  1. ^ Wernick, Yang, Brankov, Yourganov and Strother, Machine Learning in Medical Imaging, IEEE Signal Processing Magazine, vol. 27, no. 4, July 2010, pp. 25-38
  2. ^ Phil Simon (March 18, 2013). Too Big to Ignore: The Business Case for Big Data. Wiley. p. 89. ISBN 978-1118638170. 
  3. ^ * Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7, p.2.
  4. ^ Harnad, Stevan (2008), "The Annotation Game: On Turing (1950) on Computing, Machinery, and Intelligence", in Epstein, Robert; Peters, Grace, The Turing Test Sourcebook: Philosophical and Methodological Issues in the Quest for the Thinking Computer, Kluwer 
  5. ^ Christopher M. Bishop (2006) Pattern Recognition and Machine Learning, Springer ISBN 0-387-31073-8.
  6. ^ Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012) Foundations of Machine Learning, The MIT Press ISBN 9780262018258.
  7. ^ Lu, Haiping; Plataniotis, K.N.; Venetsanopoulos, A.N. (2011). "A Survey of Multilinear Subspace Learning for Tensor Data". Pattern Recognition 44 (7): 1540–1551. doi:10.1016/j.patcog.2011.01.004. 
  8. ^ Yoshua Bengio (2009). Learning Deep Architectures for AI. Now Publishers Inc. pp. 1–3. ISBN 978-1-60198-294-0. 
  9. ^ Aharon, M, M Elad, and A Bruckstein. 2006. “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation.” Signal Processing, IEEE Transactions on 54 (11): 4311-4322
  10. ^ "BelKor Home Page" research.att.com

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Mining_Software_Repositories b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Mining_Software_Repositories new file mode 100644 index 00000000..49dbd226 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Mining_Software_Repositories @@ -0,0 +1 @@ + Mining Software Repositories - Wikipedia, the free encyclopedia

Mining Software Repositories

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories, such as version control repositories, mailing list archives, bug tracking systems, issue tracking systems, etc. to uncover interesting and actionable information about software systems, projects and software engineering.

Contents

Data Repositories [edit]

Metrics [edit]

  • Floss Mole [1]

Defect Prediction [edit]

  • Promise Software Repository [2]

Collection of Open Source Code [edit]

Techniques [edit]

Tools [edit]

Experimentation Tools [edit]

Trace lab.

Metric Extraction Tools [edit]

Mining Tools [edit]

  • rapidminer [7]

Contradictory Findings [edit]

Software Metrics [edit]

See also [edit]

External links [edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Molecule_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Molecule_mining new file mode 100644 index 00000000..fceaf220 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Molecule_mining @@ -0,0 +1 @@ + Molecule mining - Wikipedia, the free encyclopedia

Molecule mining

From Wikipedia, the free encyclopedia
Jump to: navigation, search

This page describes mining for molecules. Since molecules may be represented by molecular graphs, this is strongly related to graph mining and structured data mining. The main problem is how to represent molecules while discriminating between the data instances. One way to do this is chemical similarity metrics, which have a long tradition in the field of cheminformatics.

Typical approaches to calculating chemical similarities use chemical fingerprints, but this loses the underlying information about the molecule's topology. Mining the molecular graphs directly avoids this problem, as does the inverse QSAR problem, which is preferable for vectorial mappings.

Contents

Coding(Molecule_i, Molecule_{j≠i})[edit]

Kernel methods[edit]

  • Marginalized graph kernel[1]
  • Optimal assignment kernel[2][3][4]
  • Pharmacophore kernel[5]
  • C++ (and R) implementation combining
    • the marginalized graph kernel between labeled graphs
    • extensions of the marginalized kernel
    • Tanimoto kernels
    • graph kernels based on tree patterns
    • kernels based on pharmacophores for 3D structure of molecules

Maximum Common Graph methods[edit]

  • MCS-HSCS[6] (Highest Scoring Common Substructure (HSCS) ranking strategy for single MCS)
  • Small Molecule Subgraph Detector (SMSD)[7] is a Java-based software library for calculating the Maximum Common Subgraph (MCS) between small molecules, which helps to quantify the similarity/distance between two molecules. MCS is also used for screening drug-like compounds by hitting molecules which share a common subgraph (substructure).[8]

Coding(Molecule_i)[edit]

Molecular query methods[edit]

Methods based on special architectures of neural networks[edit]

See also[edit]

References[edit]

  1. ^ H. Kashima, K. Tsuda, A. Inokuchi, Marginalized Kernels Between Labeled Graphs, The 20th International Conference on Machine Learning (ICML2003), 2003. PDF
  2. ^ H. Fröhlich, J. K. Wegner, A. Zell, Optimal Assignment Kernels For Attributed Molecular Graphs, The 22nd International Conference on Machine Learning (ICML 2005), Omnipress, Madison, WI, USA, 2005, 225-232. PDF
  3. ^ H. Fröhlich, J. K. Wegner, A. Zell, Kernel Functions for Attributed Molecular Graphs - A New Similarity Based Approach To ADME Prediction in Classification and Regression, QSAR Comb. Sci., 2006, 25, 317-326. doi:10.1002/qsar.200510135
  4. ^ H. Fröhlich, J. K. Wegner, A. Zell, Assignment Kernels For Chemical Compounds, International Joint Conference on Neural Networks 2005 (IJCNN'05), 2005, 913-918. CiteSeer
  5. ^ P. Mahe, L. Ralaivola, V. Stoven, J. Vert, The pharmacophore kernel for virtual screening with support vector machines, J Chem Inf Model, 2006, 46, 2003-2014. doi:10.1021/ci060138m
  6. ^ J. K. Wegner, H. Fröhlich, H. Mielenz, A. Zell, Data and Graph Mining in Chemical Space for ADME and Activity Data Sets, QSAR Comb. Sci., 2006, 25, 205-220. doi:10.1002/qsar.200510009
  7. ^ S. A. Rahman, M. Bashton, G. L. Holliday, R. Schrader and J. M. Thornton, Small Molecule Subgraph Detector (SMSD) toolkit, Journal of Cheminformatics 2009, 1:12. doi:10.1186/1758-2946-1-12
  8. ^ http://www.ebi.ac.uk/thornton-srv/software/SMSD/
  9. ^ R. D. King, A. Srinivasan, L. Dehaspe, Wamr: a data mining tool for chemical data, J. Comput.-Aid. Mol. Des., 2001, 15, 173-181. doi:10.1023/A:1008171016861
  10. ^ L. Dehaspe, H. Toivonen, King, Finding frequent substructures in chemical compounds, 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press., 1998, 30-36.
  11. ^ A. Inokuchi, T. Washio, T. Okada, H. Motoda, Applying the Apriori-based Graph Mining Method to Mutagenesis Data Analysis, Journal of Computer Aided Chemistry, 2001, 2, 87-92.
  12. ^ A. Inokuchi, T. Washio, K. Nishimura, H. Motoda, A Fast Algorithm for Mining Frequent Connected Subgraphs, IBM Research, Tokyo Research Laboratory, 2002.
  13. ^ A. Clare, R. D. King, Data mining the yeast genome in a lazy functional language, Practical Aspects of Declarative Languages (PADL2003), 2003.
  14. ^ M. Kuramochi, G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, IEEE Transactions on Knowledge and Data Engineering, 2004, 16(9), 1038-1051.
  15. ^ M. Deshpande, M. Kuramochi, N. Wale, G. Karypis, Frequent Substructure-Based Approaches for Classifying Chemical Compounds, IEEE Transactions on Knowledge and Data Engineering, 2005, 17(8), 1036-1050.
  16. ^ C. Helma, T. Cramer, S. Kramer, L. de Raedt, Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructures and Structure Activity Relationships of Noncongeneric Compounds, J. Chem. Inf. Comput. Sci., 2004, 44, 1402-1411. doi:10.1021/ci034254q
  17. ^ T. Meinl, C. Borgelt, M. R. Berthold, Discriminative Closed Fragment Mining and Perfect Extensions in MoFa, Proceedings of the Second Starting AI Researchers Symposium (STAIRS 2004), 2004.
  18. ^ T. Meinl, C. Borgelt, M. R. Berthold, M. Philippsen, Mining Fragments with Fuzzy Chains in Molecular Databases, Second International Workshop on Mining Graphs, Trees and Sequences (MGTS2004), 2004.
  19. ^ T. Meinl, M. R. Berthold, Hybrid Fragment Mining with MoFa and FSG, Proceedings of the 2004 IEEE Conference on Systems, Man & Cybernetics (SMC2004), 2004.
  20. ^ S. Nijssen, J. N. Kok. Frequent Graph Mining and its Application to Molecular Databases, Proceedings of the 2004 IEEE Conference on Systems, Man & Cybernetics (SMC2004), 2004.
  21. ^ C. Helma, Predictive Toxicology, CRC Press, 2005.
  22. ^ M. Wörlein, Extension and parallelization of a graph-mining-algorithm, Friedrich-Alexander-Universität, 2006. PDF
  23. ^ K. Jahn, S. Kramer, Optimizing gSpan for Molecular Datasets, Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS-2005), 2005.
  24. ^ X. Yan, J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), IEEE Computer Society, 2002, 721-724.
  25. ^ A. Karwath, L. D. Raedt, SMIREP: predicting chemical activity from SMILES, J Chem Inf Model, 2006, 46, 2432-2444. doi:10.1021/ci060159g
  26. ^ H. Ando, L. Dehaspe, W. Luyten, E. Craenenbroeck, H. Vandecasteele, L. Meervelt, Discovering H-Bonding Rules in Crystals with Inductive Logic Programming, Mol Pharm, 2006, 3, 665-674 . doi:10.1021/mp060034z
  27. ^ P. Mazzatorta, L. Tran, B. Schilter, M. Grigorov, Integration of Structure-Activity Relationship and Artificial Intelligence Systems To Improve in Silico Prediction of Ames Test Mutagenicity, J. Chem. Inf. Model., 2006, ASAP alert. doi:10.1021/ci600411v
  28. ^ N. Wale, G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM, 2006, 678-689.
  29. ^ A. Gago Alonso, J.E. Medina Pagola, J.A. Carrasco-Ochoa and J.F. Martínez-Trinidad Mining Connected Subgraph Mining Reducing the Number of Candidates, In Proc. of ECML--PKDD, pp. 365–376, 2008.
  30. ^ Xiaohong Wang, Jun Huan , Aaron Smalter, Gerald Lushington, Application of Kernel Functions for Accurate Similarity Search in Large Chemical Databases , in BMC Bioinformatics Vol. 11 (Suppl 3):S8 2010.
  31. ^ Baskin, I. I.; V. A. Palyulin and N. S. Zefirov (1993). "A methodology for searching direct correlations between structures and properties of organic compounds by using computational neural networks". Doklady Akademii Nauk SSSR 333 (2): 176–179. 
  32. ^ I. I. Baskin, V. A. Palyulin, N. S. Zefirov (1997). "A Neural Device for Searching Direct Correlations between Structures and Properties of Organic Compounds". J. Chem. Inf. Comput. Sci. 37 (4): 715–721. doi:10.1021/ci940128y. 
  33. ^ D. B. Kireev (1995). "ChemNet: A Novel Neural Network Based Method for Graph/Property Mapping". J. Chem. Inf. Comput. Sci. 35 (2): 175–180. doi:10.1021/ci00024a001. 
  34. ^ A. M. Bianucci; Micheli, Alessio; Sperduti, Alessandro; Starita, Antonina (2000). "Application of Cascade Correlation Networks for Structures to Chemistry". Applied Intelligence 12 (1-2): 117–146. doi:10.1023/A:1008368105614. 
  35. ^ A. Micheli, A. Sperduti, A. Starita, A. M. Bianucci (2001). "Analysis of the Internal Representations Developed by Neural Networks for Structures Applied to Quantitative Structure-Activity Relationship Studies of Benzodiazepines". J. Chem. Inf. Comput. Sci. 41 (1): 202–218. doi:10.1021/ci9903399. PMID 11206375. 
  36. ^ O. Ivanciuc (2001). "Molecular Structure Encoding into Artificial Neural Networks Topology". Roumanian Chemical Quarterly Reviews 8: 197–220. 
  37. ^ A. Goulon, T. Picot, A. Duprat, G. Dreyfus (2007). "Predicting activities without computing descriptors: Graph machines for QSAR". SAR and QSAR in Environmental Research 18 (1-2): 141–153. doi:10.1080/10629360601054313. PMID 17365965. 

Further reading[edit]

  • Schölkopf, B., K. Tsuda and J. P. Vert: Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004.
  • R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2001. ISBN 0-471-05669-3
  • Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997. ISBN 0-521-58519-8
  • R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, 2000. ISBN 3-527-29913-0

See also[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Multifactor_dimensionality_reduction b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Multifactor_dimensionality_reduction new file mode 100644 index 00000000..2390c0f8 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Multifactor_dimensionality_reduction @@ -0,0 +1 @@ + Multifactor dimensionality reduction - Wikipedia, the free encyclopedia

Multifactor dimensionality reduction

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression.

The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Contents

Illustrative example[edit]

Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

A data mining algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 is examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1, which is less than our fixed threshold of 1. Since 0/1 < 1, we encode a new attribute (Z) as 0. When the ratio is greater than one, we encode Z as 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.

Table 2

Z Y
0 0
1 1
1 1
0 0

The data mining algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees, neural networks, or a naive Bayes classifier could be used.
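
A minimal sketch of the constructive induction step described above, written directly from Table 1; treating a zero denominator as an infinite ratio is an assumption of this sketch:

    # Constructive induction step of MDR on the XOR table above:
    # pool each (X1, X2) combination into a single new attribute Z.
    from collections import defaultdict

    rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (X1, X2, Y) from Table 1
    threshold = 1.0

    counts = defaultdict(lambda: [0, 0])        # combination -> [#Y=0, #Y=1]
    for x1, x2, y in rows:
        counts[(x1, x2)][y] += 1

    def encode(x1, x2):
        n0, n1 = counts[(x1, x2)]
        ratio = n1 / n0 if n0 else float("inf")  # assumption: empty denominator counts as "high"
        return 1 if ratio > threshold else 0

    table2 = [(encode(x1, x2), y) for x1, x2, y in rows]
    print(table2)                               # [(0, 0), (1, 1), (1, 1), (0, 0)], as in Table 2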

Data mining with MDR[edit]

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any data mining algorithm there is always concern about overfitting. That is, data mining algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. These approaches have all been shown to be useful for choosing and evaluating MDR models.

Applications[edit]

MDR has mostly been applied[citation needed] to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, prostate cancer, schizophrenia, and type II diabetes. However, it can be applied to other domains such as economics, engineering, meteorology, etc. where interactions among discrete attributes might be important for predicting a binary outcome.[citation needed]

Software[edit]

www.epistasis.org provides an open-source and freely-available MDR software package.

See also[edit]

References[edit]

  • Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47. PubMed
  • Moore JH, Williams SM. New strategies for identifying gene-gene interactions in hypertension. Ann Med. 2002;34(2):88-95. PubMed
  • Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7. PubMed
  • Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82. PubMed
  • Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82. PubMed
  • Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54. PubMed
  • Tsai CT, Lai LP, Lin JL, Chiang FT, Hwang JJ, Ritchie MD, Moore JH, Hsu KL, Tseng CD, Liau CS, Tseng YZ. Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation. 2004 Apr 6;109(13):1640-6. PubMed
  • Hahn LW, Moore JH. Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol. 2004;4(2):183-94. PubMed
  • Coffey CS, Hebert PR, Ritchie MD, Krumholz HM, Gaziano JM, Ridker PM, Brown NJ, Vaughan DE, Moore JH. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics. 2004 Apr 30;5:49. PubMed
  • Moore JH. Computational analysis of gene-gene interactions using multifactor dimensionality reduction. Expert Rev Mol Diagn. 2004 Nov;4(6):795-803. PubMed
  • Williams SM, Ritchie MD, Phillips JA 3rd, Dawson E, Prince M, Dzhura E, Willis A, Semenya A, Summar M, White BC, Addy JH, Kpodonu J, Wong LJ, Felder RA, Jose PA, Moore JH. Multilocus analysis of hypertension: a hierarchical approach. Hum Hered. 2004;57(1):28-38. PubMed

  • Bastone L, Reilly M, Rader DJ, Foulkes AS. MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered. 2004;58(2):82-92. PubMed
  • Ma DQ, Whitehead PL, Menold MM, Martin ER, Ashley-Koch AE, Mei H, Ritchie MD, Delong GR, Abramson RK, Wright HH, Cuccaro ML, Hussman JP, Gilbert JR, Pericak-Vance MA. Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. Am J Hum Genet. 2005 Sep;77(3):377-88. PubMed
  • Soares ML, Coelho T, Sousa A, Batalov S, Conceicao I, Sales-Luis ML, Ritchie MD, Williams SM, Nievergelt CM, Schork NJ, Saraiva MJ, Buxbaum JN. Susceptibility and modifier genes in Portuguese transthyretin V30M amyloid polyneuropathy: complexity in a single-gene disease. Hum Mol Genet. 2005 Feb 15;14(4):543-53. PubMed
  • Qin S, Zhao X, Pan Y, Liu J, Feng G, Fu J, Bao J, Zhang Z, He L. An association study of the N-methyl-D-aspartate receptor NR1 subunit gene (GRIN1) and NR2B subunit gene (GRIN2B) in schizophrenia with universal DNA microarray. Eur J Hum Genet. 2005 Jul;13(7):807-14. PubMed

  • Wilke RA, Moore JH, Burmester JK. Relative impact of CYP3A genotype and concomitant medication on the severity of atorvastatin-induced muscle damage. Pharmacogenet Genomics. 2005 Jun;15(6):415-21. PubMed
  • Xu J, Lowey J, Wiklund F, Sun J, Lindmark F, Hsu FC, Dimitrov L, Chang B, Turner AR, Liu W, Adami HO, Suh E, Moore JH, Zheng SL, Isaacs WB, Trent JM, Gronberg H. The interaction of four genes in the inflammation pathway significantly predicts prostate cancer risk. Cancer Epidemiol Biomarkers Prev. 2005 Nov;14 (11 Pt 1):2563-8. PubMed
  • Wilke RA, Reif DM, Moore JH. Combinatorial pharmacogenetics. Nat Rev Drug Discov. 2005 Nov;4(11):911-8. PubMed
  • Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34. PubMed
  • Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7. PubMed
  • Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61. PubMed

Further reading[edit]

  • R. S. Michalski, "Pattern Recognition as Knowledge-Guided Computer Induction," Department of Computer Science Reports, No. 927, University of Illinois, Urbana, June 1978.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nearest_neighbor_search b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nearest_neighbor_search new file mode 100644 index 00000000..4ceb6d20 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nearest_neighbor_search @@ -0,0 +1 @@ + Nearest neighbor search - Wikipedia, the free encyclopedia

Nearest neighbor search

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Nearest neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q. In many cases, M is taken to be d-dimensional Euclidean space and distance is measured by Euclidean distance, Manhattan distance or other distance metric.

Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office.

Contents

Applications[edit]

The nearest neighbor search problem arises in numerous fields of application, including:

Methods[edit]

Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

Linear search[edit]

The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(Nd) where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces.[1]
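
A minimal sketch of this naive approach in Python with NumPy; the random point set and the query are illustrative:

    # Naive O(N*d) nearest neighbor search: scan every point, keep the "best so far".
    import numpy as np

    def linear_nn(points, query):
        best_idx, best_dist = None, float("inf")
        for i, p in enumerate(points):
            dist = np.linalg.norm(p - query)    # Euclidean distance
            if dist < best_dist:
                best_idx, best_dist = i, dist
        return best_idx, best_dist

    S = np.random.RandomState(0).rand(1000, 3)  # N = 1000 points in 3-D
    q = np.array([0.5, 0.5, 0.5])
    print(linear_nn(S, q))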

Space partitioning[edit]

Since the 1970s, branch and bound methodology has been applied to the problem. In the case of Euclidean space this approach is known as a spatial index or spatial access method. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension, the average query time is O(log N)[2] in the case of randomly distributed points; worst-case complexity analyses have also been performed.[3] Alternatively, the R-tree data structure was designed to support nearest neighbor search in a dynamic context, as it has efficient algorithms for insertions and deletions.
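
For comparison with the linear scan above, a minimal sketch of querying a k-d tree, assuming SciPy's cKDTree is available; the data are the same kind of illustrative random points:

    # Query a k-d tree instead of scanning linearly.
    import numpy as np
    from scipy.spatial import cKDTree

    S = np.random.RandomState(0).rand(1000, 3)
    tree = cKDTree(S)                           # build the space-partitioning index once
    dist, idx = tree.query([0.5, 0.5, 0.5], k=1)   # nearest neighbor of the query point
    print(idx, dist)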

In the case of a general metric space, the branch-and-bound approach is known under the name of metric trees. Particular examples include the vp-tree and the BK-tree.

Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, the nearest point may not be unique; in practice we usually only care about finding any one of the point-cloud points that lie at the shortest distance to the given query point.) The idea is, for each branching of the tree, to guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively solved the problem for the guessed half-space, compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned by the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then the problem must also be solved for the other half-space, its result compared to the former result, and the better of the two returned. The performance of this algorithm is closer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only to perform a look-up using the query point as a key to get the correct result.

Locality sensitive hashing[edit]

Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.[4]
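
A minimal sketch of one LSH family, random-hyperplane hashing for the cosine/angular metric, written with NumPy; the number of hyperplanes and the toy data are assumptions:

    # Minimal random-hyperplane LSH: points whose sign patterns agree land
    # in the same bucket with high probability.
    import numpy as np
    from collections import defaultdict

    rng = np.random.RandomState(0)
    planes = rng.normal(size=(8, 3))            # 8 random hyperplanes in 3-D -> 8-bit keys

    def bucket_key(x):
        return tuple((planes @ x > 0).astype(int))

    points = rng.normal(size=(100, 3))
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[bucket_key(p)].append(i)

    query = points[0] + 0.01 * rng.normal(size=3)   # a point very close to points[0]
    print(0 in buckets[bucket_key(query)])          # usually True: same bucket

In practice several such hash tables with independent hyperplanes are combined, so that a true near neighbor is found in at least one of them with high probability.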

Nearest neighbor search in spaces with small intrinsic dimension[edit]

The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c^12 log n), where c is the expansion constant of the dataset.

Vector Approximation Files[edit]

In high dimensional spaces tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.[5]

Compression/Clustering Based Search[edit]

The VA-File approach is a special case of compression-based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge gains over VA-File, tree-based indexes and sequential scan have been observed.[6][7] Also note the parallels between clustering and LSH.

Variants[edit]

There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search and the ε-approximate nearest neighbor search.

k-nearest neighbor [edit]

k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

Approximate nearest neighbor[edit]

In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.

Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search.[8]

ε-approximate nearest neighbor search is becoming an increasingly popular tool for fighting the curse of dimensionality.[citation needed]

Nearest neighbor distance ratio[edit]

The nearest neighbor distance ratio does not apply the threshold to the direct distance from the original point to the challenger neighbor, but to a ratio of it that depends on the distance to the previous neighbor. It is used in CBIR to retrieve pictures through a "query by example" using the similarity between local features. More generally, it is involved in several matching problems.

Fixed-radius near neighbors[edit]

Fixed-radius near neighbors is the problem of efficiently finding all points in Euclidean space within a given fixed distance from a specified point. The data structure is built for that fixed distance, while the query point is arbitrary.

All nearest neighbors[edit]

For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could of course be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.

Given a fixed dimension, a semi-definite positive norm (thereby including every Lp norm), and n points in this space, the nearest neighbour of every point can be found in O(n log n) time and the m nearest neighbours of every point can be found in O(mn log n) time.[9][10]

See also[edit]

Notes[edit]

  1. ^ Weber, Schek, Blott. "A quantitative analysis and performance study for similarity search methods in high dimensional spaces". 
  2. ^ Andrew Moore. "An introductory tutorial on KD trees". 
  3. ^ Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica 9 (1): 23–29. doi:10.1007/BF00263763. 
  4. ^ A. Rajaraman and J. Ullman (2010). "Mining of Massive Datasets, Ch. 3.". 
  5. ^ Weber, Blott. "An Approximation-Based Data Structure for Similarity Search". 
  6. ^ Ramaswamy, Rose, ICIP 2007. "Adaptive cluster-distance bounding for similarity search in image databases". 
  7. ^ Ramaswamy, Rose, TKDE 2001. "Adaptive cluster-distance bounding for high-dimensional indexing". 
  8. ^ S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman and A. Wu, An optimal algorithm for approximate nearest neighbor searching, Journal of the ACM, 45(6):891-923, 1998. [1]
  9. ^ Clarkson, Kenneth L. (1983), "Fast algorithms for the all nearest neighbors problem", 24th IEEE Symp. Foundations of Computer Science, (FOCS '83), pp. 226–232, doi:10.1109/SFCS.1983.16 .
  10. ^ Vaidya, P. M. (1989). "An O(n log n) Algorithm for the All-Nearest-Neighbors Problem". Discrete and Computational Geometry 4 (1): 101–115. doi:10.1007/BF02187718. 

References[edit]

  • Andrews, L.. A template for the nearest neighbor problem. C/C++ Users Journal, vol. 19, no 11 (November 2001), 40 - 49, 2001, ISSN:1075-2838, www.ddj.com/architect/184401449
  • Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, vol. 45, no. 6, pp. 891–923
  • Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is nearest neighbor meaningful? In Proceedings of the 7th ICDT, Jerusalem, Israel.
  • Chung-Min Chen and Yibei Ling - A Sampling-Based Estimator for Top-k Query. ICDE 2002: 617-627
  • Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0-12-369446-9
  • Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity Search - The Metric Space Approach. Springer, 2006. ISBN 0-387-29146-6

Further reading[edit]

  • Shasha, Dennis (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 0-387-00857-8. 

External links[edit]

  • Nearest Neighbors and Similarity Search - a website dedicated to educational materials, software, literature, researchers, open problems and events related to NN searching. Maintained by Yury Lifshits.
  • Similarity Search Wiki a collection of links, people, ideas, keywords, papers, slides, code and data sets on nearest neighbours.
  • Metric Spaces Library - An open-source C-based library for metric space indexing by Karina Figueroa, Gonzalo Navarro, Edgar Chávez.
  • ANN - A Library for Approximate Nearest Neighbor Searching by David M. Mount and Sunil Arya.
  • FLANN - Fast Approximate Nearest Neighbor search library by Marius Muja and David G. Lowe
  • Product Quantization Matlab implementation of approximate nearest neighbor search in the compressed domain by Herve Jegou.
  • MESSIF - Metric Similarity Search Implementation Framework by Michal Batko and David Novak.
  • OBSearch - Similarity Search engine for Java (GPL). Implementation by Arnoldo Muller, developed during Google Summer of Code 2007.
  • KNNLSB - K Nearest Neighbors Linear Scan Baseline (distributed, LGPL). Implementation by Georges Quénot (LIG-CNRS).
  • NearTree - An API for finding nearest neighbors among points in spaces of arbitrary dimensions by Lawrence C. Andrews and Herbert J. Bernstein.
  • NearPy - Python framework for fast approximated nearest neighbor search by Ole Krause-Sparmann.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Neural_networks b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Neural_networks new file mode 100644 index 00000000..90ff0a91 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Neural_networks @@ -0,0 +1 @@ + Neural network - Wikipedia, the free encyclopedia

Neural network

From Wikipedia, the free encyclopedia
  (Redirected from Neural networks)
Jump to: navigation, search
Simplified view of a feedforward artificial neural network

The term neural network was traditionally used to refer to a network or circuit of biological neurons.[1] The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes. Thus the term may refer to either biological neural networks, made up of real biological neurons, or artificial neural networks, for solving artificial intelligence problems.

Unlike von Neumann model computations, artificial neural networks do not separate memory and processing and operate via the flow of signals through the net connections, somewhat akin to biological networks.

These artificial networks may be used for predictive modeling, adaptive control and applications where they can be trained via a dataset.

Contents

Overview[edit]

A biological neural network is composed of a group or groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic microcircuits[2] and other connections are possible. Apart from the electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion.

Artificial intelligence, cognitive modelling, and neural networks are information processing paradigms inspired by the way biological neural systems process data. Artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks. In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.

Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions via access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing.

Neural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence.

History[edit]

The preliminary theoretical base for contemporary neural networks was independently proposed by Alexander Bain[3] (1873) and William James[4] (1890). In their work, both thoughts and body activity resulted from interactions among neurons within the brain.

Computer simulation of the branching architecture of the dendrites of pyramidal neurons.[5]

For Bain,[3] every activity led to the firing of a certain set of neurons. When activities were repeated, the connections between those neurons strengthened. According to his theory, this repetition was what led to the formation of memory. The general scientific community at the time was skeptical of Bain’s[3] theory because it required what appeared to be an inordinate number of neural connections within the brain. It is now apparent that the brain is exceedingly complex and that the same brain “wiring” can handle multiple problems and inputs.

James’s[4] theory was similar to Bain’s,[3] however, he suggested that memories and actions resulted from electrical currents flowing among the neurons in the brain. His model, by focusing on the flow of electrical currents, did not require individual neural connections for each memory or action.

C. S. Sherrington[6] (1898) conducted experiments to test James’s theory. He ran electrical currents down the spinal cords of rats. However, instead of demonstrating an increase in electrical current as projected by James, Sherrington found that the electrical current strength decreased as the testing continued over time. Importantly, this work led to the discovery of the concept of habituation.

McCulloch and Pitts[7] (1943) created a computational model for neural networks based on mathematics and algorithms. They called this model threshold logic. The model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence.

In the late 1940s psychologist Donald Hebb[8] created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a 'typical' unsupervised learning rule and its later variants were early models for long term potentiation. These ideas started being applied to computational models in 1948 with Turing's B-type machines.

Farley and Clark[9] (1954) first used computational machines, then called calculators, to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda[10] (1956).

Rosenblatt[11] (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Werbos[12] (1975).
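
A minimal sketch of a Rosenblatt-style perceptron learning rule in Python may make the preceding description concrete; the function name, learning rate and example data are assumptions made for illustration. The rule learns linearly separable functions such as AND by simple addition and subtraction of the inputs, but it cannot learn the exclusive-or function discussed above.

    def train_perceptron(samples, epochs=20, lr=1.0):
        """Train a single-layer perceptron on (inputs, label) pairs with labels in {0, 1}."""
        n = len(samples[0][0])
        w = [0.0] * n          # connection weights
        b = 0.0                # bias (threshold)
        for _ in range(epochs):
            for x, y in samples:
                prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
                error = y - prediction                              # -1, 0 or +1
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]  # add or subtract the inputs
                b += lr * error
        return w, b

    and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable: converges
    xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # exclusive-or: a single layer never converges
    print(train_perceptron(and_data))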

Neural network research stagnated after the publication of machine learning research by Minsky and Papert[13] (1969). They discovered two key issues with the computational machines that processed neural networks. The first issue was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks. Neural network research slowed until computers achieved greater processing power. Also key in later advances was the backpropagation algorithm which effectively solved the exclusive-or problem (Werbos 1975).[12]

The parallel distributed processing of the mid-1980s became popular under the name connectionism. The text by Rumelhart and McClelland[14] (1986) provided a full exposition on the use of connectionism in computers to simulate neural processes.

Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and brain biological architecture is debated, as it is not clear to what degree artificial neural networks mirror brain function.[15]

Neural networks and artificial intelligence[edit]

A neural network (NN), in the case of artificial neurons called artificial neural network (ANN) or simulated neural network (SNN), is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network.

In more practical terms neural networks are non-linear statistical data modeling or decision making tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.

However, the neural network paradigm, in which implicit rather than explicit learning is stressed, seems to correspond more closely to some kind of natural intelligence than does traditional symbol-based artificial intelligence, which instead stresses rule-based learning.

An artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. Artificial neurons were first proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, who first collaborated at the University of Chicago.[16]

One classical type of artificial neural network is the recurrent Hopfield net.

In a neural network model simple nodes (which can be called by a number of names, including "neurons", "neurodes", "Processing Elements" (PE) and "units"), are connected together to form a network of nodes — hence the term "neural network". While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.[citation needed]

The concept of a neural network appears to have first been proposed by Alan Turing in his 1948 paper Intelligent Machinery, in which he called them "B-type unorganised machines".[17]

The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it. Unsupervised neural networks can also be used to learn representations of the input that capture the salient characteristics of the input distribution, e.g., see the Boltzmann machine (1983), and more recently, deep learning algorithms, which can implicitly learn the distribution function of the observed data. Learning in neural networks is particularly useful in applications where the complexity of the data or task makes the design of such functions by hand impractical.

The tasks to which artificial neural networks are applied tend to fall within a few broad categories.

Application areas of ANNs include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.

Neural networks and neuroscience[edit]

Theoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems. Since neural systems are intimately related to cognitive processes and behaviour, the field is closely related to cognitive and behavioural modeling.

The aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory).

Types of models[edit]

Many models are used, defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behaviour of individual neurons, through models of the dynamics of neural circuitry arising from interactions between individual neurons, to models of behaviour arising from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and its relation to learning and memory, from the individual neuron to the system level.

Criticism[edit]

A common criticism of neural networks, particularly in robotics, is that they require a large diversity of training for real-world operation. This is not surprising, since any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases. Dean Pomerleau, in his research presented in the paper "Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving," uses a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.). A large amount of his research is devoted to (1) extrapolating multiple training scenarios from a single training experience, and (2) preserving past training diversity so that the system does not become overtrained (if, for example, it is presented with a series of right turns – it should not learn to always turn right). These issues are common in neural networks that must decide from amongst a wide variety of responses, but can be dealt with in several ways, for example by randomly shuffling the training examples, by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example, or by grouping examples in so-called mini-batches.

A. K. Dewdney, a former Scientific American columnist, wrote in 1997, "Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool." (Dewdney, p. 82)

Arguments for Dewdney's position are that implementing large and effective software neural networks requires considerable processing and storage resources. While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a highly simplified form on von Neumann technology may compel a NN designer to fill many millions of database rows for its connections, which can consume vast amounts of computer memory and hard disk space. Furthermore, the designer of NN systems will often need to simulate the transmission of signals through many of these connections and their associated neurons, which demands a great deal of CPU processing power and time. While neural networks often yield effective programs, they too often do so at the cost of efficiency, consuming considerable amounts of time and money.

Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft [2] to detecting credit card fraud[citation needed].

Technology writer Roger Bridgman commented on Dewdney's statements about neural nets:

Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource". In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. An unreadable table that a useful machine could read would still be well worth having.[18]

In response to this kind of criticism, one should note that although it is true that analyzing what has been learned by an artificial neural network is difficult, it is much easier to do so than to analyze what has been learned by a biological neural network. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering generic principles which allow a learning machine to be successful. For example, Bengio and LeCun (2007) wrote an article regarding local vs non-local learning, as well as shallow vs deep architecture [3].

Other criticisms have come from advocates of hybrid models (combining neural networks and symbolic approaches), who believe that intermixing these two approaches can better capture the mechanisms of the human mind (Sun and Bookman, 1990).

Recent improvements[edit]

While research was initially concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as dopamine, acetylcholine, and serotonin in behaviour and learning.

Biophysical models, such as BCM theory, have been important in understanding mechanisms for synaptic plasticity, and have had applications in both computer science and neuroscience. Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for radial basis networks and neural backpropagation as mechanisms for processing data.

Computational devices have been created in CMOS for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices[19] for very large scale principal components analyses and convolution. If successful, these efforts could usher in a new era of neural computing[20] that is a step beyond digital computing, because it depends on learning rather than programming and because it is fundamentally analog rather than digital even though the first instantiations may in fact be with CMOS digital devices.

Between 2009 and 2012, the recurrent neural networks and deep feedforward neural networks developed in the research group of Jürgen Schmidhuber at the Swiss AI Lab IDSIA have won eight international competitions in pattern recognition and machine learning.[21] For example, multi-dimensional long short term memory (LSTM)[22][23] won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three different languages to be learned.

Variants of the back-propagation algorithm as well as unsupervised methods by Geoff Hinton and colleagues at the University of Toronto[24][25] can be used to train deep, highly nonlinear neural architectures similar to the 1980 Neocognitron by Kunihiko Fukushima,[26] and the "standard architecture of vision",[27] inspired by the simple and complex cells identified by David H. Hubel and Torsten Wiesel in the primary visual cortex.

Deep learning feedforward networks alternate convolutional layers and max-pooling layers, topped by several pure classification[disambiguation needed] layers. Fast GPU-based implementations of this approach have won several pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition[28] and the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge.[29] Such neural networks also were the first artificial pattern recognizers to achieve human-competitive or even superhuman performance[30] on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem of Yann LeCun and colleagues at NYU.
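
As a small illustration of the max-pooling step that such networks alternate with convolutional layers, the sketch below downsamples a 2-D feature map over non-overlapping 2x2 blocks. This is only a sketch of the pooling operation itself, assuming NumPy; the helper name is made up for this example.

    import numpy as np

    def max_pool_2x2(feature_map):
        """Downsample a 2-D feature map by taking the maximum over non-overlapping 2x2 blocks."""
        h, w = feature_map.shape
        trimmed = feature_map[:h - h % 2, :w - w % 2]     # drop an odd trailing row/column if present
        return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]]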

See also[edit]

References[edit]

  1. ^ Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". Proc. Natl. Acad. Sci. USA 79: 2554–2558. [1]
  2. ^ Arbib, p.666
  3. ^ a b c d Bain (1873). Mind and Body: The Theories of Their Relation. New York: D. Appleton and Company. 
  4. ^ a b James (1890). The Principles of Psychology. New York: H. Holt and Company. 
  5. ^ "PLoS Computational Biology Issue Image | Vol. 6(8) August 2010". PLoS Computational Biology 6 (8): ev06.ei08. 2010. doi:10.1371/image.pcbi.v06.i08.  edit
  6. ^ Sherrington, C.S. "Experiments in Examination of the Peripheral Distribution of the Fibers of the Posterior Roots of Some Spinal Nerves". Proceedings of the Royal Society of London 190: 45–186. 
  7. ^ McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics 5 (4): 115–133. doi:10.1007/BF02478259. 
  8. ^ Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. 
  9. ^ Farley, B; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory 4 (4): 76–84. doi:10.1109/TIT.1954.1057468. 
  10. ^ Rochester, N.; J.H. Holland, L.H. Habit, and W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory 2 (3): 80–93. doi:10.1109/TIT.1956.1056810. 
  11. ^ Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review 65 (6): 386–408. doi:10.1037/h0042519. PMID 13602029.
  12. ^ a b Werbos, P.J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. 
  13. ^ Minsky, M.; S. Papert (1969). An Introduction to Computational Geometry. MIT Press. ISBN 0-262-63022-2. 
  14. ^ Rumelhart, D.E; James McClelland (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press. 
  15. ^ Russell, Ingrid. "Neural Networks Module". Retrieved 2012. 
  16. ^ McCulloch, Warren; Pitts, Walter, "A Logical Calculus of Ideas Immanent in Nervous Activity", 1943, Bulletin of Mathematical Biophysics 5:115-133.
  17. ^ The Essential Turing by Alan M. Turing and B. Jack Copeland (Nov 18, 2004) ISBN 0198250800 page 403
  18. ^ Roger Bridgman's defence of neural networks
  19. ^ Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. Nat. Nanotechnol. 2008, 3, 429–433.
  20. ^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. Nature 2008, 453, 80–83.
  21. ^ http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions 2012 Kurzweil AI Interview with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009-2012
  22. ^ Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552
  23. ^ A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
  24. ^ http://www.scholarpedia.org/article/Deep_belief_networks /
  25. ^ Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets". Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. 
  26. ^ K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 93-202, 1980.
  27. ^ M Riesenhuber, T Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 1999.
  28. ^ D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.
  29. ^ D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.
  30. ^ D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nothing_to_hide_argument b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nothing_to_hide_argument new file mode 100644 index 00000000..e4e74fbc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Nothing_to_hide_argument @@ -0,0 +1 @@ + Nothing to hide argument - Wikipedia, the free encyclopedia

Nothing to hide argument

From Wikipedia, the free encyclopedia

The nothing to hide argument is an argument which states that government data mining and surveillance programs do not threaten privacy unless they uncover illegal activities, and that if they do uncover illegal activities, the person committing these activities does not have the right to keep them private. Hence, a person who favors this argument may state "I've got nothing to hide" and therefore does not express opposition to government data mining and surveillance.[1] An individual using this argument may say that a person should not have worries about government data mining or surveillance if he or she has "nothing to hide."[2]

This argument is commonly used in discussions regarding privacy. Geoffrey Stone, a legal scholar, said that the use of the argument is "all-too-common".[3] Bruce Schneier, a data security expert and cryptographer, described it as the "most common retort against privacy advocates."[3]

The motto "If you've got nothing to hide, you've got nothing to fear" has been used in the closed-circuit television program practiced in cities in the United Kingdom.[3]


Ethnography[edit]

An ethnographic study by Ana Viseu, Andrew Clement, and Jane Aspinal of the integration of online services into everyday life was published as "Situating Privacy Online: Complex Perceptions and Everyday Practices" in the Information, Communication & Society journal in 2004. It found that, in the words of Kirsty Best, author of "Living in the control society Surveillance, users and digital screen technologies", "fully employed, middle to middle-upper income earners articulated similar beliefs about not being targeted for surveillance" compared to other respondents who did not show concern, and that "In these cases, respondents expressed the view that they were not doing anything wrong, or that they had nothing to hide."[4] Of the participant sample in Viseu's study, one reported using privacy-enhancing technology,[5] and Viseu et al. said "One of the clearest features of our subjects’ privacy perceptions and practices was their passivity towards the issue."[6] Viseu et al. said the passivity originated from the "nothing to hide" argument.[7]

Effect on privacy protection[edit]

Viseu et al. said that the argument "has been well documented in the privacy literature as a stumbling block to the development of pragmatic privacy protection strategies, and it, too, is related to the ambiguous and symbolic nature of the term ‘privacy’ itself."[7] They explained that privacy is an abstract concept and that people only become concerned with it once their privacy is gone, and they compare a loss of privacy with people knowing that ozone depletion and global warming are negative developments but that "the immediate gains of driving the car to work or putting on hairspray outweigh the often invisible losses of polluting the environment."[7]

Arguments in favor and against the nothing to hide argument[edit]

Daniel J. Solove stated in an article for The Chronicle of Higher Education that he opposes the argument; he stated that a government can leak information about a person and cause damage to that person, or use information about a person to deny access to services even if a person did not actually engage in wrongdoing, and that a government can cause damage to one's personal life through making errors.[3] Solove wrote "When engaged directly, the nothing-to-hide argument can ensnare, for it forces the debate to focus on its narrow understanding of privacy. But when confronted with the plurality of privacy problems implicated by government data collection and use beyond surveillance and disclosure, the nothing-to-hide argument, in the end, has nothing to say."[3]

danah boyd, a social media researcher, opposes the argument. She said that even though "[p]eople often feel immune from state surveillance because they’ve done nothing wrong" an entity or group can distort a person's image and harm one's reputation, or guilt by association can be used to defame a person.[8]

Bruce Schneier, a computer security expert and cryptographer, expressed opposition, citing Cardinal Richelieu's statement "If one would give me six lines written by the hand of the most honest man, I would find something in them to have him hanged", referring to how a state government can find aspects in a person's life in order to prosecute or blackmail that individual.[9] Schneier also argued: "Too many wrongly characterize the debate as 'security versus privacy.' The real choice is liberty versus control."[9]

Johann Hari, a British writer, argued that the "nothing to hide" argument is irrelevant to the placement of CCTV cameras in public places in the United Kingdom, because the cameras observe public areas where one is seen by many unfamiliar people, not "places where you hide".[10]

See also[edit]

References[edit]

Notes[edit]

  1. ^ Mordini, p. 252.
  2. ^ Solove, Nothing to Hide: The False Tradeoff Between Privacy and Security, p. 1. "If you've got nothing to hide, you shouldn't worry about government surveillance."
  3. ^ a b c d e Solove, Daniel J. "Why Privacy Matters Even if You Have 'Nothing to Hide'." The Chronicle of Higher Education. May 15, 2011. Retrieved on June 25, 2013. "The nothing-to-hide argument pervades discussions about privacy. The data-security expert Bruce Schneier calls it the "most common retort against privacy advocates." The legal scholar Geoffrey Stone refers to it as an "all-too-common refrain." In its most compelling form, it is an argument that the privacy interest is generally minimal, thus making the contest with security concerns a foreordained victory for security."
  4. ^ Best, p. 12.
  5. ^ Viseu, et al. p. 102-103.
  6. ^ Viseu, et al. p. 102.
  7. ^ a b c Viseu, et al. p. 103.
  8. ^ boyd, danah. "Danah Boyd: The problem with the ‘I have nothing to hide’ argument." (Opinion) The Dallas Morning News. June 14, 2013. Retrieved on June 25, 2013. "It’s disturbing to me how often I watch as someone’s likeness is constructed in ways[...]"
  9. ^ a b Schneier, Bruce. "The Eternal Value of Privacy." Wired. May 18, 2006. Retrieved on June 25, 2013. - Also available from Schneier's personal website
  10. ^ Hari, Johann. "Johann Hari: This strange backlash against CCTV." The Independent. Monday March 17, 2008. Retrieved on June 26, 2013.

Further reading[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Online_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Online_algorithm new file mode 100644 index 00000000..c9d1d553 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Online_algorithm @@ -0,0 +1 @@ + Online algorithm - Wikipedia, the free encyclopedia

Online algorithm

From Wikipedia, the free encyclopedia

In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand. (For example, selection sort requires that the entire list be given before it can sort it, while insertion sort doesn't.)
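
A short sketch of insertion sort used in this online fashion: each arriving element is placed into the already-sorted prefix immediately, so a correct answer for the input seen so far is available after every step. The generator name is an assumption made for this example.

    import bisect

    def online_insertion_sort(stream):
        """Consume an input stream item by item, keeping everything seen so far sorted."""
        seen = []
        for x in stream:
            bisect.insort(seen, x)   # place the new item without looking at any future input
            yield list(seen)         # a valid answer for the prefix processed so far

    for prefix in online_insertion_sort([5, 2, 8, 1]):
        print(prefix)                # [5], [2, 5], [2, 5, 8], [1, 2, 5, 8]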

Because it does not know the whole input, an online algorithm is forced to make decisions that may later turn out not to be optimal, and the study of online algorithms has focused on the quality of decision-making that is possible in this setting. Competitive analysis formalizes this idea by comparing the relative performance of an online and offline algorithm for the same problem instance. For other points of view on online inputs to algorithms, see streaming algorithm (focusing on the amount of memory needed to accurately represent past inputs), dynamic algorithm (focusing on the time complexity of maintaining solutions to problems with online inputs) and online machine learning.

A problem exemplifying the concepts of online algorithms is the Canadian Traveller Problem. The goal of this problem is to minimize the cost of reaching a target in a weighted graph where some of the edges are unreliable and may have been removed from the graph. However, that an edge has been removed (failed) is only revealed to the traveller when she/he reaches one of the edge's endpoints. The worst case for this problem is simply that all of the unreliable edges fail and the problem reduces to the usual Shortest Path Problem. An alternative analysis of the problem can be made with the help of competitive analysis. For this method of analysis, the offline algorithm knows in advance which edges will fail and the goal is to minimize the ratio between the online and offline algorithms' performance. This problem is PSPACE-complete.


Online algorithms[edit]

Online algorithms are conventionally referred to by capitalized names, as these are the names under which they appear in the literature.

See also[edit]

References[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Optimal_matching b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Optimal_matching new file mode 100644 index 00000000..1894b276 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Optimal_matching @@ -0,0 +1 @@ + Optimal matching - Wikipedia, the free encyclopedia

Optimal matching

From Wikipedia, the free encyclopedia

Optimal matching is a sequence analysis method used in social science to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort), classical tools (such as cluster analysis) can be used. The method was tailored to the social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.


Algorithm[edit]

Let S = (s_1, s_2, s_3, \ldots s_T) be a sequence of states s_i belonging to a finite set of possible states. Let us denote by {\mathbf S} the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators a_i: {\mathbf S} \rightarrow {\mathbf S}. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

  • one state s' is inserted into the sequence: a^{\rm Ins}_{s'} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T)
  • one state is deleted from the sequence: a^{\rm Del}_{s_2} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots  s_T)
  • a state s_1 is replaced (substituted) by state s'_1: a^{\rm Sub}_{s_1,s'_1} (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T)

Imagine now that a cost c(a_i) \in {\mathbf R}^+_0 is associated with each operator. Given two sequences S_1 and S_2, the idea is to measure the cost of obtaining S_2 from S_1 using operators from the algebra. Let A = a_1 \circ a_2 \circ \ldots \circ a_n be a composition of operators that transforms the first sequence into the second, S_2 = A(S_1), where a_1 \circ a_2 denotes the compound operator. To this composition we associate the cost c(A) = \sum_{i=1}^n c(a_i), which represents the total cost of the transformation. There may exist several such compositions A that transform S_1 into S_2; a reasonable choice is to select the cheapest one. We thus define the distance
d(S_1,S_2) = \min_A \left \{ c(A)~{\rm such~that}~S_2 = A (S_1) \right \},
that is, the cost of the least expensive set of transformations that turn S_1 into S_2. Notice that d(S_1,S_2) is by definition nonnegative, and that d(S_1,S_2)=0 when S_1=S_2, since no operation is needed; when all operation costs are strictly positive, the converse also holds. The distance function is symmetric if insertion and deletion costs are equal, c(a^{\rm Ins}) = c(a^{\rm Del}); the term indel cost usually refers to this common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangle inequality. Transitivity, however, depends on the definition of the set of elementary operations.
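
The distance defined above can be computed with a Needleman-Wunsch-style dynamic program. The sketch below assumes a constant indel cost and a constant substitution cost, the simplest possible cost specification; the function name, the cost values and the example sequences are assumptions made for illustration.

    def optimal_matching_distance(s1, s2, indel=1.0, sub=2.0):
        """Minimum total cost of turning sequence s1 into s2 with insert, delete and substitute operations."""
        n, m = len(s1), len(s2)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i * indel                       # delete everything from s1
        for j in range(1, m + 1):
            d[0][j] = j * indel                       # insert everything from s2
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0.0 if s1[i - 1] == s2[j - 1] else sub
                d[i][j] = min(d[i - 1][j] + indel,    # delete s1[i-1]
                              d[i][j - 1] + indel,    # insert s2[j-1]
                              d[i - 1][j - 1] + cost) # keep or substitute
        return d[n][m]

    # two example state sequences, e.g. yearly employment (E) / unemployment (U) states
    print(optimal_matching_distance("EEUUE", "EEEUE"))   # 2.0 with these costs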

Criticism[edit]

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs c(a_i).

Optimal matching in causal modelling[edit]

Optimal matching is also a term used in statistical modelling of causal effects. In this context it refers to matching "cases" with "controls", and is completely separate from the sequence-analytic sense.

Software[edit]

  • TDA is a powerful program, offering access to some of the latest developments in transition data analysis.
  • STATA has implemented a package to run optimal matching analysis.
  • TraMineR is an open source R-package for analysing and visualizing states and events sequences, including optimal matching analysis.

References and notes[edit]

  1. ^ A. Abbott and A. Tsay (2000). Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect. Sociological Methods & Research, Vol. 29, 3-33. doi:10.1177/0049124100029001001
  2. ^ L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Predictive_analytics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Predictive_analytics new file mode 100644 index 00000000..fd051404 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Predictive_analytics @@ -0,0 +1 @@ + Predictive analytics - Wikipedia, the free encyclopedia

Predictive analytics

From Wikipedia, the free encyclopedia

Predictive analytics encompasses a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.[1][2]

In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.

Predictive analytics is used in actuarial science,[3] marketing,[4] financial services,[5] insurance, telecommunications,[6] retail,[7] travel,[8] healthcare,[9] pharmaceuticals[10] and other fields.

One of the most well known applications is credit scoring,[1] which is used throughout financial services. Scoring models process a customer's credit history, loan application, customer data, etc., in order to rank-order individuals by their likelihood of making future credit payments on time. A well-known example is the FICO score.


Definition[edit]

Predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict trends and behavior patterns. Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether it be in the past, present or future: for example, identifying the suspects in a crime that has already been committed, or detecting credit card fraud as it occurs. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. It is important to note, however, that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of assumptions.

Types[edit]

Generally, the term predictive analytics is used to mean predictive modeling, "scoring" data with predictive models, and forecasting. However, people are increasingly using the term to refer to related analytical disciplines, such as descriptive modeling and decision modeling or optimization. These disciplines also involve rigorous data analysis, and are widely used in business for segmentation and decision making, but have different purposes and the statistical techniques underlying them vary.

Predictive models[edit]

Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in order to improve marketing effectiveness. This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models. Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision. With advancements in computing speed, individual agent modeling systems have become capable of simulating human behaviors or reactions to given stimuli or scenarios. The new term for animating data specifically linked to an individual in a simulated environment is avatar analytics.

Descriptive models[edit]

Descriptive models quantify relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products. Descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do. Instead, descriptive models can be used, for example, to categorize customers by their product preferences and life stage. Descriptive modeling tools can be utilized to develop further models that can simulate a large number of individualized agents and make predictions.

Decision models[edit]

Decision models describe the relationship between all the elements of a decision — the known data (including results of predictive models), the decision, and the forecast results of the decision — in order to predict the results of decisions involving many variables. These models can be used in optimization, maximizing certain outcomes while minimizing others. Decision models are generally used to develop decision logic or a set of business rules that will produce the desired action for every customer or circumstance.

Applications[edit]

Although predictive analytics can be put to use in many applications, we outline a few examples where predictive analytics has shown positive impact in recent years.

Analytical customer relationship management (CRM)[edit]

Analytical Customer Relationship Management is a frequent commercial application of predictive analysis. Methods of predictive analysis are applied to customer data to pursue CRM objectives, which involve constructing a holistic view of the customer no matter where the customer's information resides in the company or which department is involved. CRM uses predictive analysis in applications for marketing campaigns, sales, and customer services, to name a few. These tools are required in order for a company to posture and focus its efforts effectively across the breadth of its customer base. A company must analyze and understand which products are in demand or have the potential for high demand, predict customers' buying habits in order to promote relevant products at multiple touch points, and proactively identify and mitigate issues that have the potential to lose customers or reduce the company's ability to gain new ones. Analytical Customer Relationship Management can be applied throughout the customer lifecycle (acquisition, relationship growth, retention, and win-back). Several of the application areas described below (direct marketing, cross-sell, customer retention) are part of Customer Relationship Management.

Clinical decision support systems[edit]

Experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions, like diabetes, asthma, heart disease, and other lifetime illnesses. Additionally, sophisticated clinical decision support systems incorporate predictive analytics to support medical decision making at the point of care. A working definition has been proposed by Robert Hayward of the Centre for Health Evidence: "Clinical Decision Support Systems link health observations with health knowledge to influence health choices by clinicians for improved health care."[citation needed]

Collection analytics[edit]

Every portfolio has a set of delinquent customers who do not make their payments on time. The financial institution has to undertake collection activities on these customers to recover the amounts due. A lot of collection resources are wasted on customers who are difficult or impossible to recover. Predictive analytics can help optimize the allocation of collection resources by identifying the most effective collection agencies, contact strategies and legal actions for each customer, thus significantly increasing recovery while at the same time reducing collection costs.

Cross-sell[edit]

Corporate organizations often collect and maintain abundant data (e.g. customer records, sales transactions), and exploiting hidden relationships in the data can provide a competitive advantage. For an organization that offers multiple products, predictive analytics can help analyze customers' spending, usage and other behavior, leading to efficient cross sales, or selling additional products to current customers.[2] This directly leads to higher profitability per customer and stronger customer relationships.

Customer retention[edit]

With the number of competing services available, businesses need to focus efforts on maintaining continuous consumer satisfaction, rewarding consumer loyalty and minimizing customer attrition. Businesses tend to respond to customer attrition on a reactive basis, acting only after the customer has initiated the process to terminate service. At this stage, the chance of changing the customer's decision is very small. Proper application of predictive analytics can lead to a more proactive retention strategy. By frequently examining a customer’s past service usage, service performance, spending and other behavior patterns, predictive models can determine the likelihood of a customer terminating service sometime in the near future.[6] An intervention with lucrative offers can increase the chance of retaining the customer. Silent attrition, the behavior of a customer to slowly but steadily reduce usage, is another problem that many companies face. Predictive analytics can also predict this behavior, so that the company can take proper actions to increase customer activity.

Direct marketing[edit]

When marketing consumer products and services, there is the challenge of keeping up with competing products and consumer behavior. Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of product versions, marketing material, communication channels and timing that should be used to target a given consumer. The goal of predictive analytics is typically to lower the cost per order or cost per action.

Fraud detection[edit]

Fraud is a big problem for many businesses and can be of various types: inaccurate credit applications, fraudulent transactions (both offline and online), identity theft and false insurance claims. These problems plague firms of all sizes in many industries. Some examples of likely victims are credit card issuers, insurance companies,[11] retail merchants, manufacturers, business-to-business suppliers and even service providers. A predictive model can help weed out the "bads" and reduce a business's exposure to fraud.

Predictive modeling can also be used to identify high-risk fraud candidates in business or the public sector. Nigrini developed a risk-scoring method to identify audit targets. He describes the use of this approach to detect fraud in the franchisee sales reports of an international fast-food chain. Each location is scored using 10 predictors. The 10 scores are then weighted to give one final overall risk score for each location. The same scoring approach was also used to identify high-risk check kiting accounts, potentially fraudulent travel agents, and questionable vendors. A reasonably complex model was used to identify fraudulent monthly reports submitted by divisional controllers.[12]

The Internal Revenue Service (IRS) of the United States also uses predictive analytics to mine tax returns and identify tax fraud.[11]

Recent[when?] advancements in technology have also introduced predictive behavior analysis for web fraud detection. This type of solution utilizes heuristics in order to study normal web user behavior and detect anomalies indicating fraud attempts.

Portfolio, product or economy-level prediction[edit]

Often the focus of analysis is not the consumer but the product, portfolio, firm, industry or even the economy. For example, a retailer might be interested in predicting store-level demand for inventory management purposes. Or the Federal Reserve Board might be interested in predicting the unemployment rate for the next year. These types of problems can be addressed by predictive analytics using time series techniques (see below). They can also be addressed via machine learning approaches which transform the original time series into a feature vector space, where the learning algorithm finds patterns that have predictive power.[13][14]

Risk management[edit]

Risk management techniques are employed to predict, and benefit from, future scenarios. The capital asset pricing model (CAP-M) "predicts" the best portfolio to maximize return; probabilistic risk assessment (PRA), when combined with mini-Delphi techniques and statistical approaches, yields accurate forecasts; and RiskAoA is a stand-alone predictive tool.[15] These are three examples of approaches that can extend from project to market, and from near to long term. Underwriting (see below) and other business approaches identify risk management as a predictive method.

Underwriting[edit]

Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk. For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver. A financial company needs to assess a borrower's potential and ability to pay before granting a loan. For a health insurance provider, predictive analytics can analyze a few years of past medical claims data, as well as lab, pharmacy and other records where available, to predict how expensive an enrollee is likely to be in the future. Predictive analytics can help underwrite these quantities by predicting the chances of illness, default, bankruptcy, etc. Predictive analytics can streamline the process of customer acquisition by predicting the future risk behavior of a customer using application level data.[3] Predictive analytics in the form of credit scores have reduced the amount of time it takes for loan approvals, especially in the mortgage market where lending decisions are now made in a matter of hours rather than days or even weeks. Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default.

Technology and Big Data influences on Predictive Analytics[edit]

Big Data is a collection of data sets that are so large and complex that they become awkward to work with using traditional database management tools. The volume, variety and velocity of Big Data have introduced challenges across the board for capture, storage, search, sharing, analysis, and visualization. Examples of big data sources include web logs, RFID and sensor data, social networks, Internet search indexing, call detail records, military surveillance, and complex data in astronomic, biogeochemical, genomics, and atmospheric sciences. Thanks to technological advances in computer hardware—faster CPUs, cheaper memory, and MPP architectures—and new technologies such as Hadoop, MapReduce, and in-database and text analytics for processing Big Data, it is now feasible to collect, analyze, and mine massive amounts of structured and unstructured data for new insights.[11] Today, exploring Big Data and using predictive analytics is within reach of more organizations than ever before.

Analytical Techniques[edit]

The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.

Regression techniques[edit]

Regression models are the mainstay of predictive analytics. The focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration. Depending on the situation, there is a wide variety of models that can be applied while performing predictive analytics. Some of them are briefly discussed below.

Linear regression model[edit]

The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters. These parameters are adjusted so that a measure of fit is optimized. Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions.

The goal of regression is to select the parameters of the model so as to minimize the sum of the squared residuals. This is referred to as ordinary least squares (OLS) estimation and results in best linear unbiased estimates (BLUE) of the parameters if and only if the Gauss-Markov assumptions are satisfied.

Once the model has been estimated, we would be interested to know whether the predictor variables belong in the model, i.e. whether the estimate of each variable's contribution is reliable. To do this we can check the statistical significance of the model's coefficients, which can be measured using the t-statistic. This amounts to testing whether a coefficient is significantly different from zero. How well the model predicts the dependent variable based on the values of the independent variables can be assessed using the R² statistic. It measures the predictive power of the model, i.e. the proportion of the total variation in the dependent variable that is "explained" (accounted for) by variation in the independent variables.
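
A minimal sketch of OLS estimation on simulated data, recovering the coefficients by least squares and computing the R² statistic as the share of variation in the dependent variable accounted for by the predictors. It assumes NumPy, and the data-generating process is an arbitrary choice made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                # two predictor variables
    y = 1.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

    Xd = np.column_stack([np.ones(len(X)), X])                   # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)                # minimise the sum of squared residuals

    residuals = y - Xd @ beta
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot                            # proportion of variation "explained"
    print(beta)                                                  # close to the true parameters (1.5, 2.0, -1.0)
    print(r_squared)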

Discrete choice models[edit]

Multivariate regression (above) is generally used when the response variable is continuous and has an unbounded range. Often the response variable may not be continuous but rather discrete. While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, some of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques such as discrete choice models which are better suited for this type of analysis. If the dependent variable is discrete, some of those superior methods are logistic regression, multinomial logit and probit models. Logistic regression and probit models are used when the dependent variable is binary.

Logistic regression[edit]

In a classification setting, assigning outcome probabilities to observations can be achieved through the use of a logistic model, a method that transforms information about the binary dependent variable into an unbounded continuous variable and estimates a regular multivariate model (see Allison's Logistic Regression for more information on the theory of logistic regression).

The Wald and likelihood-ratio test are used to test the statistical significance of each coefficient b in the model (analogous to the t tests used in OLS regression; see above). A test assessing the goodness-of-fit of a classification model is the "percentage correctly predicted".
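
A brief sketch of fitting a logistic model and reporting the "percentage correctly predicted" on held-out data; it assumes the scikit-learn library, which is not mentioned in the text, and uses synthetic data purely for illustration.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LogisticRegression().fit(X_tr, y_tr)
    print("percentage correctly predicted:", model.score(X_te, y_te))  # simple goodness-of-fit check
    print("coefficients:", model.coef_)  # their significance would be assessed with Wald or likelihood-ratio tests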

Multinomial logistic regression[edit]

An extension of the binary logit model to cases where the dependent variable has more than two categories is the multinomial logit model. In such cases collapsing the data into two categories might not make good sense or may lead to a loss in the richness of the data. The multinomial logit model is the appropriate technique in these cases, especially when the dependent variable categories are not ordered (for example, colors like red, blue and green). Some authors have extended multinomial regression to include feature selection/importance methods such as Random multinomial logit.

Probit regression[edit]

Probit models offer an alternative to logistic regression for modeling categorical dependent variables. Even though the outcomes tend to be similar, the underlying distributions are different. Probit models are popular in social sciences like economics.

A good way to understand the key difference between probit and logit models is to assume that there is a latent variable z.

We do not observe z but instead observe y, which takes the value 0 or 1. In the logit model we assume that the error term of the latent variable z follows a logistic distribution; in the probit model we assume that it follows a standard normal distribution. Note that in social sciences (e.g. economics), probit is often used to model situations where the observed variable y is continuous but takes values between 0 and 1.

Logit versus probit[edit]

The probit model has been around longer than the logit model. They behave similarly, except that the logistic distribution has slightly heavier tails. One of the reasons the logit model was formulated was that the probit model was computationally difficult because it required numerically calculating integrals; modern computing has since made this computation fairly simple. The coefficients obtained from the logit and probit models are fairly close. However, the odds ratio is easier to interpret in the logit model.
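
The difference between the two links can be seen by evaluating both cumulative distribution functions at the same values of the latent index; the sketch below assumes SciPy, which the text does not mention.

    import numpy as np
    from scipy.stats import logistic, norm

    z = np.linspace(-4, 4, 9)          # values of the latent index
    p_logit = logistic.cdf(z)          # logit link: standard logistic CDF (slightly heavier tails)
    p_probit = norm.cdf(z)             # probit link: standard normal CDF

    for zi, pl, pp in zip(z, p_logit, p_probit):
        print(f"z = {zi:+.1f}   logit: {pl:.3f}   probit: {pp:.3f}")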

Practical reasons for choosing the probit model over the logistic model would be:

  • There is a strong belief that the underlying distribution is normal
  • The actual event is not a binary outcome (e.g., bankruptcy status) but a proportion (e.g., proportion of population at different debt levels).

Time series models[edit]

Time series models are used for predicting or forecasting the future behavior of variables. These models account for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for. As a result standard regression techniques cannot be applied to time series data and methodology has been developed to decompose the trend, seasonal and cyclical component of the series. Modeling the dynamic path of a variable can improve forecasts since the predictable component of the series can be projected into the future.

Time series models estimate difference equations containing stochastic components. Two commonly used forms of these models are autoregressive (AR) models and moving average (MA) models. The Box-Jenkins methodology (1976), developed by George Box and G.M. Jenkins, combines the AR and MA models to produce the ARMA (autoregressive moving average) model, which is the cornerstone of stationary time series analysis. ARIMA (autoregressive integrated moving average) models, on the other hand, are used to describe non-stationary time series. Box and Jenkins suggest differencing a non-stationary time series to obtain a stationary series to which an ARMA model can be applied. Non-stationary time series have a pronounced trend and do not have a constant long-run mean or variance.

Box and Jenkins proposed a three stage methodology which includes: model identification, estimation and validation. The identification stage involves identifying if the series is stationary or not and the presence of seasonality by examining plots of the series, autocorrelation and partial autocorrelation functions. In the estimation stage, models are estimated using non-linear time series or maximum likelihood estimation procedures. Finally the validation stage involves diagnostic checking such as plotting the residuals to detect outliers and evidence of model fit.
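
A minimal sketch of the estimation step on a simulated AR(1) series, assuming the statsmodels library (not mentioned in the text); in practice the identification and validation stages would also involve inspecting autocorrelation plots and residual diagnostics as described above.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.zeros(300)
    for t in range(1, 300):                   # simulate a stationary AR(1) process
        y[t] = 0.7 * y[t - 1] + rng.normal()

    fitted = ARIMA(y, order=(1, 0, 0)).fit()  # estimate an AR(1) model (p=1, d=0, q=0)
    print(fitted.params)                      # the ar.L1 coefficient should be close to the true value 0.7
    print(fitted.forecast(steps=5))           # project the predictable component forward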

In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity, with models such as ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) frequently used for financial time series. In addition, time series models are used to understand inter-relationships among economic variables represented by systems of equations, using VAR (vector autoregression) and structural VAR models.

Survival or duration analysis[edit]

Survival analysis is another name for time to event analysis. These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social sciences like economics, as well as in engineering (reliability and failure time analysis).

Censoring and non-normality, which are characteristic of survival data, generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression. The normal distribution, being symmetric, takes negative as well as positive values, but duration by its very nature cannot be negative; the normality assumption of standard regression models is therefore violated.

The assumption is that if the data were not censored it would be representative of the population of interest. In survival analysis, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time.

An important concept in survival analysis is the hazard rate h(t), the instantaneous rate at which the event occurs at time t, conditional on survival up to time t. A closely related concept is the survival function S(t), the probability of surviving beyond time t; the two are linked by S(t) = exp(−∫_0^t h(u) du).

Most models try to model the hazard rate by choosing an underlying distribution according to the shape of the hazard function. A distribution whose hazard function slopes upward is said to have positive duration dependence, a decreasing hazard shows negative duration dependence, and a constant hazard is a memoryless process, usually characterized by the exponential distribution. Some of the distributional choices in survival models are the F, gamma, Weibull, log-normal, inverse Gaussian (inverse normal) and exponential distributions. All of these distributions are for a non-negative random variable.

Duration models can be parametric, non-parametric or semi-parametric. Commonly used models include the non-parametric Kaplan-Meier estimator and the semi-parametric Cox proportional hazards model.
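As an illustrative sketch (not from the original text, with made-up right-censored data), the Kaplan-Meier product-limit estimate can be computed in a few lines:

    # Kaplan-Meier estimate; event=1 means observed event, event=0 means censored.
    import numpy as np

    time  = np.array([2, 3, 3, 5, 6, 8, 9, 12])
    event = np.array([1, 1, 0, 1, 0, 1, 1, 0])

    surv = 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)                   # still under observation just before t
        deaths  = np.sum((time == t) & (event == 1))  # observed events at t
        surv *= 1 - deaths / at_risk                  # product-limit update
        print(f"t={t:2d}  at risk={at_risk}  events={deaths}  S(t)={surv:.3f}")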

Classification and regression trees[edit]

Classification and regression trees (CART) is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.

Decision trees are formed by a collection of rules based on variables in the modeling data set:

  • Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable
  • Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure)
  • Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.)

Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.
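The core of the splitting step can be illustrated with a short sketch (not from the original text, using invented data): for a single numeric predictor, a CART-style split searches for the threshold that minimizes the weighted Gini impurity of the two child nodes; a full tree applies this search recursively to each child.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([0, 0, 0, 1, 0, 1, 1, 1])            # binary class labels

    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels, minlength=2) / len(labels)
        return 1.0 - np.sum(p ** 2)

    best = None
    for thr in (x[:-1] + x[1:]) / 2:                  # candidate split points (midpoints)
        left, right = y[x <= thr], y[x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[1]:
            best = (thr, score)

    print("best split: x <=", best[0], " weighted Gini:", round(best[1], 3))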

A very popular method for predictive analytics is Leo Breiman's random forests, or derived versions of this technique such as random multinomial logit.

Multivariate adaptive regression splines[edit]

Multivariate adaptive regression splines (MARS) is a non-parametric technique that builds flexible models by fitting piecewise linear regressions.

An important concept associated with regression splines is that of a knot. A knot is where one local regression model gives way to another; it is thus the point of intersection between two splines.

In multivariate adaptive regression splines, basis functions are the tool used for generalizing the search for knots. Basis functions are a set of functions used to represent the information contained in one or more variables. The MARS model almost always creates the basis functions in pairs.

The multivariate adaptive regression splines approach deliberately overfits the model and then prunes it back to reach the optimal model. The algorithm is computationally intensive, and in practice an upper limit on the number of basis functions must be specified.
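A minimal sketch of the basic building block (not from the original text, with simulated data): a pair of mirrored hinge functions around a single, hand-chosen knot, fitted by ordinary least squares. Real MARS implementations search over candidate knots, add basis functions in pairs and then prune.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 100)
    y = np.where(x < 5, x, 5 + 2 * (x - 5)) + rng.normal(0, 0.3, x.size)   # kink at x = 5

    knot = 5.0
    h_plus  = np.maximum(0, x - knot)      # hinge active to the right of the knot
    h_minus = np.maximum(0, knot - x)      # its mirrored pair, active to the left

    X = np.column_stack([np.ones_like(x), h_plus, h_minus])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("intercept and hinge coefficients:", np.round(coef, 2))   # roughly [5, 2, -1]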

Machine learning techniques[edit]

Machine learning, a branch of artificial intelligence, was originally employed to develop techniques to enable computers to learn. Today, since it includes a number of advanced statistical methods for regression and classification, it finds application in a wide variety of fields including medical diagnostics, credit card fraud detection, face and speech recognition and analysis of the stock market. In certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables. In other cases, the underlying relationships can be very complex and the mathematical form of the dependencies unknown. For such cases, machine learning techniques emulate human cognition and learn from training examples to predict future events.

A brief discussion of some of these methods used commonly for predictive analytics is provided below. A detailed study of machine learning can be found in Mitchell (1997).

Neural networks[edit]

Neural networks are sophisticated nonlinear modeling techniques that are able to model complex functions. They can be applied to problems of prediction, classification or control in a wide spectrum of fields such as finance, cognitive psychology/neuroscience, medicine, engineering, and physics.

Neural networks are used when the exact nature of the relationship between inputs and output is not known. A key feature of neural networks is that they learn this relationship through training. There are three broad types of training: supervised learning, unsupervised learning, and reinforcement learning, with supervised training being the most common.

Some examples of neural network training techniques are backpropagation, quick propagation, conjugate gradient descent, projection operator, Delta-Bar-Delta, etc. Common supervised architectures include the multilayer perceptron, while Kohonen self-organizing maps and Hopfield networks are examples of unsupervised architectures.
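As a bare-bones illustration of supervised training by backpropagation (not from the original text; the XOR data, layer sizes and learning rate are chosen purely for demonstration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

    W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros((1, 8))     # one hidden layer of 8 units
    W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros((1, 1))
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for step in range(5000):
        h = sigmoid(X @ W1 + b1)              # forward pass: hidden layer
        out = sigmoid(h @ W2 + b2)            # forward pass: output layer
        d_out = (out - y) * out * (1 - out)   # backprop through squared error and output sigmoid
        d_h = (d_out @ W2.T) * h * (1 - h)    # backprop through hidden sigmoid
        W2 -= 1.0 * h.T @ d_out; b2 -= 1.0 * d_out.sum(0, keepdims=True)
        W1 -= 1.0 * X.T @ d_h;   b1 -= 1.0 * d_h.sum(0, keepdims=True)

    print("predictions:", out.ravel().round(2))   # should approach [0, 1, 1, 0]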

Radial basis functions[edit]

A radial basis function (RBF) is a function whose value depends on the distance from a centre. Such functions can be used very efficiently for interpolation and for smoothing of data. Radial basis functions have been applied in neural networks, where they are used as a replacement for the sigmoidal transfer function. Such networks have three layers: the input layer, a hidden layer with the RBF non-linearity, and a linear output layer. The most popular choice for the non-linearity is the Gaussian. RBF networks have the advantage of not being as prone to getting locked into local minima as feed-forward networks such as the multilayer perceptron.
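A small interpolation sketch (not from the original text, with invented centres) shows the idea: each Gaussian basis function depends only on the distance to its centre, and the output weights are obtained from a linear system, mirroring the linear output layer of an RBF network.

    import numpy as np

    centres = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    values  = np.sin(centres)                  # function values to be interpolated
    width = 1.0

    def phi(a, b):                             # Gaussian RBF kernel matrix
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * width ** 2))

    weights = np.linalg.solve(phi(centres, centres), values)

    x_new = np.array([0.5, 1.5, 2.5])
    approx = phi(x_new, centres) @ weights
    print("interpolated:", np.round(approx, 3), " true:", np.round(np.sin(x_new), 3))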

Support vector machines[edit]

Support vector machines (SVM) are used to detect and exploit complex patterns in data by clustering, classifying and ranking the data. They are learning machines used to perform binary classification and regression estimation. They commonly use kernel-based methods to apply linear classification techniques to non-linear classification problems. There are a number of SVM variants, distinguished mainly by their kernel, such as linear, polynomial and sigmoid kernels.
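A short sketch (not from the original text, assuming the scikit-learn library is available) illustrates the role of the kernel: a linear SVM cannot separate two concentric rings, while an RBF-kernel SVM can.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X, y)
        acc = np.mean(clf.predict(X) == y)    # training accuracy, for illustration only
        print(f"{kernel:6s} kernel, training accuracy: {acc:.2f}")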

Naïve Bayes[edit]

Naïve Bayes, based on Bayes' rule of conditional probability, is used for performing classification tasks. It assumes the predictors are statistically independent given the class, which makes it an effective classification tool that is easy to interpret. It is best employed when faced with the 'curse of dimensionality', i.e. when the number of predictors is very high.
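The independence assumption can be made explicit in a from-scratch Gaussian naïve Bayes sketch (not from the original text, with simulated two-class data): the class-conditional likelihood is simply the product of per-feature Gaussian densities.

    import numpy as np

    rng = np.random.default_rng(0)
    X0 = rng.normal([0, 0], 1.0, (50, 2))              # class 0 samples
    X1 = rng.normal([2, 2], 1.0, (50, 2))              # class 1 samples
    X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)

    def log_likelihood(x, mean, var):                  # sum of per-feature Gaussian log-densities
        return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

    stats = {c: (X[y == c].mean(0), X[y == c].var(0), np.mean(y == c)) for c in (0, 1)}

    def predict(x):
        scores = {c: np.log(prior) + log_likelihood(x, mean, var)
                  for c, (mean, var, prior) in stats.items()}
        return max(scores, key=scores.get)

    print("prediction for (1.8, 1.9):", predict(np.array([1.8, 1.9])))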

k-nearest neighbours[edit]

The k-nearest neighbour algorithm (kNN) belongs to the class of pattern recognition statistical methods. The method does not impose any a priori assumptions about the distribution from which the modeling sample is drawn. It involves a training set with both positive and negative cases. A new sample is classified by calculating the distance to the nearest training case; the sign of that point then determines the classification of the sample. In the k-nearest neighbour classifier, the k nearest points are considered and the sign of the majority is used to classify the sample. The performance of the kNN algorithm is influenced by three main factors: (1) the distance measure used to locate the nearest neighbours; (2) the decision rule used to derive a classification from the k nearest neighbours; and (3) the number of neighbours used to classify the new sample. It can be proved that, unlike other methods, this method is universally asymptotically convergent: as the size of the training set increases, if the observations are independent and identically distributed (i.i.d.), then regardless of the distribution from which the sample is drawn, the predicted class converges to the class assignment that minimizes misclassification error. See Devroye et al. (1996).
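The three factors above map directly onto a minimal kNN sketch (not from the original text, with invented training points): the distance measure, the majority-vote decision rule, and the choice of k.

    import numpy as np

    train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.1, 0.9], [1.0, 1.2]])
    train_y = np.array([0, 0, 1, 1, 1])

    def knn_predict(x, k=3):
        dists = np.linalg.norm(train_X - x, axis=1)     # (1) Euclidean distance measure
        nearest = train_y[np.argsort(dists)[:k]]        # (3) the k closest training cases
        return np.bincount(nearest).argmax()            # (2) majority-vote decision rule

    print("predicted class for (0.8, 0.8):", knn_predict(np.array([0.8, 0.8])))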

Geospatial predictive modeling[edit]

Conceptually, geospatial predictive modeling is rooted in the principle that the occurrences of events being modeled are limited in distribution. Occurrences of events are neither uniform nor random in distribution – there are spatial environment factors (infrastructure, sociocultural, topographic, etc.) that constrain and influence where the locations of events occur. Geospatial predictive modeling attempts to describe those constraints and influences by spatially correlating occurrences of historical geospatial locations with environmental factors that represent those constraints and influences. Geospatial predictive modeling is a process for analyzing events through a geographic filter in order to make statements of likelihood for event occurrence or emergence.

Tools[edit]

Historically, using predictive analytics tools—as well as understanding the results they delivered—required advanced skills. However, modern predictive analytics tools are no longer restricted to IT specialists. As more organizations adopt predictive analytics into decision-making processes and integrate it into their operations, they are creating a shift in the market toward business users as the primary consumers of the information. Business users want tools they can use on their own. Vendors are responding by creating new software that removes the mathematical complexity, provides user-friendly graphic interfaces and/or builds in short cuts that can, for example, recognize the kind of data available and suggest an appropriate predictive model.[16] Predictive analytics tools have become sophisticated enough to adequately present and dissect data problems, so that any data-savvy information worker can utilize them to analyze data and retrieve meaningful, useful results.[2] For example, modern tools present findings using simple charts, graphs, and scores that indicate the likelihood of possible outcomes.[17]

There are numerous tools available in the marketplace that help with the execution of predictive analytics. These range from those that need very little user sophistication to those that are designed for the expert practitioner. The difference between these tools is often in the level of customization and heavy data lifting allowed.

Notable open source predictive analytic tools include:

Notable commercial predictive analytic tools include:

PMML[edit]

In an attempt to provide a standard language for expressing predictive models, the Predictive Model Markup Language (PMML) has been proposed. This XML-based language provides a way for different tools to define predictive models and to share them between PMML-compliant applications. PMML 4.0 was released in June 2009.

See also[edit]

References[edit]

  1. ^ a b Nyce, Charles (2007), Predictive Analytics White Paper, American Institute for Chartered Property Casualty Underwriters/Insurance Institute of America, p. 1 
  2. ^ a b c Eckerson, Wayne (May 10, 2007), Extending the Value of Your Data Warehousing Investment, The Data Warehouse Institute 
  3. ^ a b Conz, Nathan (September 2, 2008), "Insurers Shift to Customer-focused Predictive Analytics Technologies", Insurance & Technology 
  4. ^ Fletcher, Heather (March 2, 2011), "The 7 Best Uses for Predictive Analytics in Multichannel Marketing", Target Marketing 
  5. ^ Korn, Sue (April 21, 2011), "The Opportunity for Predictive Analytics in Finance", HPC Wire 
  6. ^ a b Barkin, Eric (May 2011), "CRM + Predictive Analytics: Why It All Adds Up", Destination CRM 
  7. ^ Das, Krantik; Vidyashankar, G.S. (July 1, 2006), "Competitive Advantage in Retail Through Analytics: Developing Insights, Creating Value", Information Management 
  8. ^ McDonald, Michèle (September 2, 2010), "New Technology Taps 'Predictive Analytics' to Target Travel Recommendations", Travel Market Report 
  9. ^ Stevenson, Erin (December 16, 2011), "Tech Beat: Can you pronounce health care predictive analytics?", Times-Standard 
  10. ^ McKay, Lauren (August 2009), "The New Prescription for Pharma", Destination CRM 
  11. ^ a b c Schiff, Mike (March 6, 2012), BI Experts: Why Predictive Analytics Will Continue to Grow, The Data Warehouse Institute 
  12. ^ Nigrini, Mark (June 2011). "Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations". Hoboken, NJ: John Wiley & Sons Inc. ISBN 978-0-470-89046-2. 
  13. ^ Dhar, Vasant (April 2011). "Prediction in Financial Markets: The Case for Small Disjuncts". ACM Transactions on Intelligent Systems and Technologies 2 (3). 
  14. ^ Dhar, Vasant; Chou, Dashin and Provost Foster (October 2000). "Discovering Interesting Patterns in Investment Decision Making with GLOWER – A Genetic Learning Algorithm Overlaid With Entropy Reduction". Data Mining and Knowledge Discovery 4 (4). 
  15. ^ https://acc.dau.mil/CommunityBrowser.aspx?id=126070
  16. ^ Halper, Fran (November 1, 2011), "The Top 5 Trends in Predictive Analytics", Information Management 
  17. ^ MacLennan, Jamie (May 1, 2012), 5 Myths about Predictive Analytics, The Data Warehouse Institute 

Further reading[edit]

  • Agresti, Alan (2002). Categorical Data Analysis. Hoboken: John Wiley and Sons. ISBN 0-471-36093-7. 
  • Coggeshall, Stephen; Davies, John; Jones, Roger; and Schutzer, Daniel, "Intelligent Security Systems," in Freedman, Roy S.; Klein, Robert A.; and Lederman, Jess, eds. (1995). Artificial Intelligence in the Capital Markets. Chicago: Irwin. ISBN 1-55738-811-3. 
  • L. Devroye, L. Györfi, G. Lugosi (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag. 
  • Enders, Walter (2004). Applied Econometric Time Series. Hoboken: John Wiley and Sons. ISBN 0-521-83919-X. 
  • Greene, William (2000). Econometric Analysis. Prentice Hall. ISBN 0-13-013297-7. 
  • Guidère, Mathieu; Howard, N.; Argamon, Sh. (2009). Rich Language Analysis for Counterterrorism. Berlin, London, New York: Springer-Verlag. ISBN 978-3-642-01140-5. 
  • Mitchell, Tom (1997). Machine Learning. New York: McGraw-Hill. ISBN 0-07-042807-7. 
  • Siegel, Eric (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley. ISBN 978-1-1183-5685-2. 
  • Tukey, John (1977). Exploratory Data Analysis. New York: Addison-Wesley. ISBN 0-201-07616-0. 
  • Finlay, Steven (2012). Credit Scoring, Response Modeling and Insurance Rating. A Practical Guide to Forecasting Customer Behavior. Basingstoke: Palgrave Macmillan. ISBN 0-230-34776-2. 


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning new file mode 100644 index 00000000..1960acb5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning @@ -0,0 +1 @@ + Proactive Discovery of Insider Threats Using Graph Analysis and Learning - Wikipedia, the free encyclopedia

Proactive Discovery of Insider Threats Using Graph Analysis and Learning

From Wikipedia, the free encyclopedia
Proactive Discovery of Insider Threats Using Graph Analysis and Learning
Establishment: 2011
Sponsor: DARPA
Value: $9 million
Goal: Rapidly data mine large sets to discover anomalies

Proactive Discovery of Insider Threats Using Graph Analysis and Learning, or PRODIGAL, is a computer system for predicting anomalous behavior among humans by data mining network traffic such as emails, text messages and log entries.[1] It is part of DARPA's Anomaly Detection at Multiple Scales (ADAMS) project.[2] The initial schedule is for two years, and the budget is $9 million.[3]

It uses graph theory, machine learning, statistical anomaly detection, and high-performance computing to scan larger sets of data more quickly than in past systems. The amount of data analyzed is in the range of terabytes per day.[3] The targets of the analysis are employees within the government or defense contracting organizations; specific examples of behavior the system is intended to detect include the actions of Nidal Malik Hasan and Wikileaks alleged source Bradley Manning.[1] Commercial applications may include finance.[1] The results of the analysis, the five most serious threats per day, go to agents, analysts, and operators working in counterintelligence.[1][3][4]

Primary participants [edit]

See also [edit]

References [edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Profiling_practices b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Profiling_practices new file mode 100644 index 00000000..d917dcbc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Profiling_practices @@ -0,0 +1 @@ + Profiling practices - Wikipedia, the free encyclopedia

Profiling practices

From Wikipedia, the free encyclopedia

Profiling (information science) refers to the whole process of constructing and applying profiles generated by computerized profiling technologies. What characterizes profiling technologies is the use of algorithms or other mathematical techniques to discover patterns or correlations in large quantities of data, aggregated in databases. When these patterns or correlations are used to identify or represent people, they can be called profiles. Unlike a discussion of profiling technologies or population profiling as such, the notion of profiling practices concerns not just the construction of profiles, but also the application of group profiles to individuals, e.g. in the case of credit scoring, price discrimination, or identification of security risks (Hildebrandt & Gutwirth 2008) (Elmer 2004).

Profiling is not simply a matter of computerized pattern recognition; it enables refined price-discrimination, targeted servicing, detection of fraud, and extensive social sorting. Real-time machine profiling constitutes the precondition for emerging socio-technical infrastructures envisioned by advocates of ambient intelligence,[1] Autonomic Computing (Kephart & Chess 2003) and ubiquitous computing (Weiser 1991).

One of the most challenging problems of the information society is dealing with the increasing data overload. With the digitizing of all sorts of content as well as the improvement and drop in cost of recording technologies, the amount of available information has become enormous and is increasing exponentially. It has thus become important for companies, governments, and individuals to be able to discriminate information from noise, detecting those data that are useful or interesting. The development of profiling technologies must be seen against this background. These technologies are thought to efficiently collect and analyse data in order to find or test knowledge in the form of statistical patterns between data. This process is called Knowledge Discovery in Databases (KDD) (Fayyad, Piatetsky-Shapiro & Smyth 1996), which provides the profiler with sets of correlated data that are used as "profiles".

Contents

The profiling process [edit]

The technical process of profiling can be separated into several steps:

  • Preliminary grounding: The profiling process starts with a specification of the applicable problem domain and the identification of the goals of analysis.
  • Data collection: The target dataset or database for analysis is formed by selecting the relevant data in the light of existing domain knowledge and data understanding.
  • Data preparation: The data are preprocessed for removing noise and reducing complexity by eliminating attributes.
  • Data mining: The data are analysed with the algorithm or heuristics developed to suit the data, model and goals.
  • Interpretation: The mined patterns are evaluated on their relevance and validity by specialists and/or professionals in the application domain (e.g. excluding spurious correlations).
  • Application: The constructed profiles are applied, e.g. to categories of persons, to test and fine-tune the algorithms.
  • Institutional decision: The institution decides what actions or policies to apply to groups or individuals whose data match a relevant profile.

Data collection, preparation and mining all belong to the phase in which the profile is under construction. However, profiling also refers to the application of profiles, meaning the usage of profiles for the identification or categorization of groups or individual persons. As can be seen in step six (application), the process is circular. There is a feedback loop between the construction and the application of profiles. The interpretation of profiles can lead to the reiterant – possibly real-time – fine-tuning of specific previous steps in the profiling process. The application of profiles to people whose data were not used to construct the profile is based on data matching, which provides new data that allows for further adjustments. The process of profiling is both dynamic and adaptive. A good illustration of the dynamic and adaptive nature of profiling is the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Types of profiling practices [edit]

In order to clarify the nature of profiling technologies some crucial distinctions have to be made between different types of profiling practices, apart from the distinction between the construction and the application of profiles. The main distinctions are those between bottom-up and top-down profiling (or supervised and unsupervised learning), and between individual and group profiles.

Supervised and unsupervised learning [edit]

Profiles can be classified according to the way they have been generated (Fayyad, Piatetsky-Shapiro & Smyth 1996) (Zarsky 2002-3). On the one hand, profiles can be generated by testing a hypothesized correlation. This is called top-down profiling or supervised learning. It is similar to the methodology of traditional scientific research in that it starts with a hypothesis and consists of testing its validity. The result of this type of profiling is the verification or refutation of the hypothesis. One could also speak of deductive profiling. On the other hand, profiles can be generated by exploring a database, using the data mining process to detect patterns in the database that were not previously hypothesized. In a way, this is a matter of generating hypotheses: finding correlations one did not expect or even think of. Once the patterns have been mined, they will enter the loop – described above – and will be tested with the use of new data. This is called unsupervised learning.

Two things are important with regard to this distinction. First, unsupervised learning algorithms seem to allow the construction of a new type of knowledge, not based on hypotheses developed by a researcher and not based on causal or motivational relations but exclusively based on stochastic correlations. Second, unsupervised learning algorithms thus seem to allow for an inductive type of knowledge construction that does not require theoretical justification or causal explanation (Custers 2004).

Some authors claim that if the application of profiles based on computerized stochastic pattern recognition 'works', i.e. allows for reliable predictions of future behaviours, the theoretical or causal explanation of these patterns no longer matters (Anderson 2008). However, the idea that 'blind' algorithms provide reliable information does not imply that the information is neutral. In the process of collecting and aggregating data into a database (the first three steps of the process of profile construction), translations are made from real-life events to machine-readable data. These data are then prepared and cleansed to allow for initial computability. Potential bias will have to be located at these points, as well as in the choice of the algorithms that are developed. It is not possible to mine a database for all possible linear and non-linear correlations, meaning that the mathematical techniques developed to search for patterns will determine which patterns can be found. In the case of machine profiling, potential bias is not informed by common-sense prejudice or what psychologists call stereotyping, but by the computer techniques employed in the initial steps of the process. These techniques are mostly invisible to those to whom profiles are applied (because their data match the relevant group profiles).

Individual and group profiles [edit]

Profiles must also be classified according to the kind of subject they refer to. This subject can either be an individual or a group of people. When a profile is constructed with the data of a single person, this is called individual profiling (Jaquet-Chiffelle 2008). This kind of profiling is used to discover the particular characteristics of a certain individual, to enable unique identification or the provision of personalized services. However, personalized servicing is most often also based on group profiling, which allows categorisation of a person as a certain type of person, based on the fact that her profile matches with a profile that has been constructed on the basis of massive amounts of data about massive numbers of other people. A group profile can refer to the result of data mining in data sets that refer to an existing community that considers itself as such, like a religious group, a tennis club, a university, a political party etc. In that case it can describe previously unknown patterns of behaviour or other characteristics of such a group (community). A group profile can also refer to a category of people that do not form a community, but are found to share previously unknown patterns of behaviour or other characteristics (Custers 2004). In that case the group profile describes specific behaviours or other characteristics of a category of people, like for instance women with blue eyes and red hair, or adults with relatively short arms and legs. These categories may be found to correlate with health risks, earning capacity, mortality rates, credit risks, etc.

If an individual profile is applied to the individual that it was mined from, then that is direct individual profiling. If a group profile is applied to an individual whose data match the profile, then that is indirect individual profiling, because the profile was generated using data of other people. Similarly, if a group profile is applied to the group that it was mined from, then that is direct group profiling (Jaquet-Chiffelle 2008). However, in as far as the application of a group profile to a group implies the application of the group profile to individual members of the group, it makes sense to speak of indirect group profiling, especially if the group profile is non-distributive.

Distributive and non-distributive profiling [edit]

Group profiles can also be divided in terms of their distributive character (Vedder 1999). A group profile is distributive when its properties apply equally to all the members of its group: all bachelors are unmarried, or all persons with a specific gene have an 80% chance of contracting a specific disease. A profile is non-distributive when it does not necessarily apply to all the members of the group: the group of persons with a specific postal code has an average earning capacity of XX, or the category of persons with blue eyes has an average chance of 37% of contracting a specific disease. Note that in this case an individual's chance of having a particular earning capacity or of contracting the specific disease will depend on other factors, e.g. sex, age, background of parents, previous health, education. It should be obvious that, apart from tautological profiles like that of bachelors, most group profiles generated by means of computer techniques are non-distributive. This has far-reaching implications for the accuracy of indirect individual profiling based on data matching with non-distributive group profiles. Quite apart from the fact that the application of accurate profiles may be unfair or cause undue stigmatisation, most group profiles will not be accurate.

Application domains [edit]

Profiling technologies can be applied in a variety of different domains and for a variety of purposes. These profiling practices will all have different effects and raise different issues.

Knowledge about the behaviour and preferences of customers is of great interest to the commercial sector. On the basis of profiling technologies, companies can predict the behaviour of different types of customers. Marketing strategies can then be tailored to the people fitting these types. Examples of profiling practices in marketing are customer loyalty cards, customer relationship management in general, and personalized advertising.[1][2][3]

In the financial sector, institutions use profiling technologies for fraud prevention and credit scoring. Banks want to minimise the risks in giving credit to their customers. On the basis of extensive group profiling customers are assigned a certain scoring value that indicates their creditworthiness. Financial institutions like banks and insurance companies also use group profiling to detect fraud or money-laundering. Databases with transactions are searched with algorithms to find behaviours that deviate from the standard, indicating potentially suspicious transactions.[2]

In the context of employment, profiles can be of use for tracking employees by monitoring their online behaviour, for detecting fraud by them, and for deploying human resources by pooling and ranking their skills (Leopold & Meints 2008).[4]

Profiling can also be used to support people at work, and also for learning, by intervening in the design of adaptive hypermedia systems personalising the interaction. For instance, this can be useful for supporting the management of attention (Nabeth 2008).

In forensic science, the possibility exists of linking different databases of cases and suspects and mining these for common patterns. This could be used for solving existing cases or for the purpose of establishing risk profiles of potential suspects (Geradts & Sommer 2008) (Harcourt 2006).

Risks and issues [edit]

Profiling technologies have raised a host of ethical, legal and other issues including privacy, equality, due process, security and liability. Numerous authors have warned against the affordances of a new technological infrastructure that could emerge on the basis of semi-autonomic profiling technologies (Lessig 2006)(Solove 2004)(Schwartz 2000).

Privacy is one of the principal issues raised. Profiling technologies make possible a far-reaching monitoring of an individual's behaviour and preferences. Profiles may reveal personal or private information about individuals that they might not even be aware of themselves (Hildebrandt & Gutwirth 2008).

Profiling technologies are by their very nature discriminatory tools. They allow unparalleled kinds of social sorting and segmentation which could have unfair effects. The people who are profiled may have to pay higher prices,[3] they could miss out on important offers or opportunities, and they may run increased risks because catering to their needs is less profitable (Lyon 2003). In most cases they will not be aware of this, since profiling practices are mostly invisible and the profiles themselves are often protected by intellectual property or trade secrets. This poses a threat to the equality and solidarity of citizens. On a larger scale, it might cause the segmentation of society.[4]

One of the problems underlying potential violations of privacy and non-discrimination is that the process of profiling is more often than not invisible for those that are being profiled. This creates difficulties in that it becomes hard, if not impossible, to contest the application of a particular group profile. This disturbs principles of due process: if a person has no access to information on the basis of which she is withheld benefits or attributed certain risks, she cannot contest the way she is being treated (Steinbock 2005).

Profiles can be used against people when they end up in the hands of people who are not entitled to access or use them. An important issue related to these breaches of security is identity theft.

When the application of profiles causes harm, it has to be determined who is to be held accountable: the software programmer, the profiling service provider, or the profiled user? This issue of liability is especially complex when the application of profiles and the decisions based on them have themselves become automated, as in Autonomic Computing or ambient intelligence.

See also [edit]

References [edit]

  • Anderson, Chris (2008). "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete". Wired Magazine 16 (7). 
  • Custers, B.H.M. (2004). The Power of Knowledge. Tilburg:Wolf Legal Publishers 
  • Elmer, G. (2004). Profiling Machines. Mapping the Personal Information Economy. MIT Press 
  • Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. (1996). "From Data Mining to Knowledge Discovery in Databases". AI Magazine 17 (3): 37–54. 
  • Geradts, Zeno; Sommer, Peter (2008). "D6.7c: Forensic Profiling". FIDIS Deliverables 6 (7c). 
  • Harcourt, B. E. (2006). Against Prediction. Profiling, Policing, and Punishing in an Actuarial Age. The University of Chicago Press, Chicago and London 
  • Hildebrandt, Mireille; Gutwirth, Serge (2008). Profiling the European Citizen. Cross Disciplinary Perspectives. Springer, Dordrecht. doi:10.1007/978-1-4020-6914-7. ISBN 978-1-4020-6913-0 
  • Jaquet-Chiffelle, David-Olivier (2008). "Reply: Direct and Indirect Profiling in the Light of Virtual Persons. To: Defining Profiling: A New Type of Knowledge?". In Hildebrandt, Mireille; Gutwirth, Serge. Profiling the European Citizen. Springer Netherlands. pp. 17–45. doi:10.1007/978-1-4020-6914-7_2. 
  • Kephart, J. O.; Chess, D. M. (2003). "The Vision of Autonomic Computing". Computer 36 (1 January): 96–104. doi:10.1109/MC.2003.1160055. 
  • Leopold, N.; Meints, M. (2008). "Profiling in Employment Situations (Fraud)". In Hildebrandt, Mireille; Gutwirth, Serge. Profiling the European Citizen. Springer Netherlands. pp. 217–237. doi:10.1007/978-1-4020-6914-7_12. 
  • Lessig, L. (2006). Code 2.0. Basic Books, New York 
  • Lyon, D. (2003). Surveillance as Social Sorting: Privacy, Risk, and Digital Discrimination. Routledge 
  • Nabeth, Thierry (2008). "User Profiling for Attention Support for School and Work". In Hildebrandt, Mireille; Gutwirth, Serge. Profiling the European Citizen. Springer Netherlands. pp. 185–200. doi:10.1007/978-1-4020-6914-7_10. 
  • Schwartz, P. (2000). "Beyond Lessig's Code for the Internet Privacy: Cyberspace Filters, Privacy-Control and Fair Information Practices". Wis. Law Review 743: 743–788. 
  • Solove, D.J. (2004). The Digital Person. Technology and Privacy in the Information Age. New York, New York University Press. 
  • Steinbock, D. (2005). "Data Matching, Data Mining, and Due Process". Georgia Law Review 40 (1): 1–84. 
  • Vedder, A. (1999). "KDD: The Challenge to Individualism". Ethics and Information Technology 1 (4): 275–281. doi:10.1023/A:1010016102284. 
  • Weiser, M. (1991). "The Computer for the Twenty-First Century". Scientific American 265 (3): 94–104. 
  • Zarsky, T. (2002-3). ""Mine Your Own Business!": Making the Case for the Implications of the Data Mining or Personal Information in the Forum of Public Opinion". Yale Journal of Law and Technology 5 (4): 17–47. 

Notes and other references [edit]

  1. ^ ISTAG (2001), Scenarios for Ambient Intelligence in 2010, Information Society Technology Advisory Group
  2. ^ Canhoto, A.I. (2007) Profiling behaviour: the social construction of categories in the detection of financial crime, dissertation at London School of Economics, at http://www.lse.ac.uk/collections/informationSystems/pdf/theses/canhoto.pdf
  3. ^ Odlyzko, A. (2003), Privacy, economics, and price discrimination on the Internet, A. M. Odlyzko. ICEC2003: Fifth International Conference on Electronic Commerce, N. Sadeh, ed., ACM, pp. 355–366, available at http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf
  4. ^ Gandy, O. (2002) Data Mining and Surveillance in the post 9/11 environment, Presentation at IAMCR, Barcelona, at http://www.asc.upenn.edu/usr/ogandy/IAMCRdatamining.pdf


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ROUGE_metric_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ROUGE_metric_ new file mode 100644 index 00000000..da9699f4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/ROUGE_metric_ @@ -0,0 +1 @@ + ROUGE (metric) - Wikipedia, the free encyclopedia

ROUGE (metric)

From Wikipedia, the free encyclopedia

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation,[1] is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced summaries or translations).

Contents

Metrics[edit]

The following five evaluation metrics[2] are available.

  • ROUGE-N: N-gram[3] based co-occurrence statistics.
  • ROUGE-L: Longest Common Subsequence (LCS)[4] based statistics. The longest common subsequence naturally takes sentence-level structural similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.
  • ROUGE-W: Weighted LCS-based statistics that favor consecutive LCSes.
  • ROUGE-S: Skip-bigram[5] based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
  • ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.

ROUGE can be downloaded from berouge download link.
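As a simplified illustration of the ROUGE-N idea (not from the original text; the official package adds stemming, multiple references and further options), ROUGE-1 recall is just the clipped unigram overlap divided by the number of reference unigrams:

    from collections import Counter

    reference = "the cat sat on the mat".split()
    candidate = "the cat lay on the mat".split()

    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum(min(cand_counts[w], c) for w, c in ref_counts.items())   # clipped matches
    print("ROUGE-1 recall:", overlap / sum(ref_counts.values()))           # 5/6 here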

See also[edit]

References[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Receiver_operating_characteristic b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Receiver_operating_characteristic new file mode 100644 index 00000000..b0f8d2b4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Receiver_operating_characteristic @@ -0,0 +1 @@ + Receiver operating characteristic - Wikipedia, the free encyclopedia

Receiver operating characteristic

From Wikipedia, the free encyclopedia
ROC curve of three epitope predictors

In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity (also called recall in some fields), and FPR is one minus the specificity or true negative rate. In general, if both of the probability distributions for detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from −∞ up to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.

ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.

The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields and was soon introduced to psychology to account for perceptual detection of stimuli. ROC analysis since then has been used in medicine, radiology, biometrics, and other areas for many decades and is increasingly used in machine learning and data mining research.

The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.[1]

Contents

Basic concept[edit]

Terminology and derivations from a confusion matrix:

  • true positive (TP): eqv. with hit
  • true negative (TN): eqv. with correct rejection
  • false positive (FP): eqv. with false alarm, Type I error
  • false negative (FN): eqv. with miss, Type II error
  • sensitivity or true positive rate (TPR): eqv. with hit rate, recall; TPR = TP / P = TP / (TP + FN)
  • false positive rate (FPR): eqv. with fall-out; FPR = FP / N = FP / (FP + TN)
  • accuracy (ACC): ACC = (TP + TN) / (P + N)
  • specificity (SPC) or true negative rate: SPC = TN / N = TN / (FP + TN) = 1 − FPR
  • positive predictive value (PPV): eqv. with precision; PPV = TP / (TP + FP)
  • negative predictive value (NPV): NPV = TN / (TN + FN)
  • false discovery rate (FDR): FDR = FP / (FP + TP)
  • Matthews correlation coefficient (MCC): MCC = (TP × TN − FP × FN) / √(P N P' N')
  • F1 score (the harmonic mean of precision and recall): F1 = 2 TP / (P + P') = 2 TP / (2 TP + FP + FN)

Source: Fawcett (2006).

A classification model (classifier or diagnosis) is a mapping of instances between certain classes/groups. The classifier or diagnosis result can be a real value (continuous output), in which case the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measure). Or it can be a discrete class label, indicating one of the classes.

Let us consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (p) or negative (n). There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however if the actual value is n then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are n, and false negative (FN) is when the prediction outcome is n while the actual value is p.

To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but actually does not have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.

Let us define an experiment from P positive instances and N negative instances. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:

                        actual value
                     p                 n                total
  prediction   p'    True Positive     False Positive   P'
  outcome      n'    False Negative    True Negative    N'
  total              P                 N
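The derived metrics of the infobox can be computed directly from such a table; the short sketch below (not from the original text) uses the counts of example A from the worked comparison further below (TP=63, FP=28, FN=37, TN=72).

    def metrics(tp, fp, fn, tn):
        p, n = tp + fn, fp + tn
        tpr = tp / p                       # sensitivity / recall
        fpr = fp / n                       # fall-out
        ppv = tp / (tp + fp)               # precision
        acc = (tp + tn) / (p + n)
        f1 = 2 * tp / (2 * tp + fp + fn)
        return tpr, fpr, ppv, acc, f1

    tpr, fpr, ppv, acc, f1 = metrics(63, 28, 37, 72)
    print(f"TPR={tpr:.2f} FPR={fpr:.2f} PPV={ppv:.2f} ACC={acc:.3f} F1={f1:.2f}")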

ROC space[edit]

The ROC space and plots of the four prediction examples.

Several evaluation "metrics" can be derived from the contingency table (see infobox). To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR defines how many correct positive results occur among all positive samples available during the test. FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.

A ROC space is defined by FPR and TPR as x and y axes, respectively, which depicts relative trade-offs between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result or instance of a confusion matrix represents one point in the ROC space.

The best possible prediction method would yield a point in the upper left corner or coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a perfect classification. A completely random guess would give a point along a diagonal line (the so-called line of no-discrimination) from the left bottom to the top right corners (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping coins (heads or tails). As the size of the sample increases, a random classifier's ROC point migrates towards (0.5,0.5).

The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random), points below the line poor results (worse than random). Note that the output of a consistently poor predictor could simply be inverted to obtain a good predictor.

Let us look into four prediction results from 100 positive and 100 negative instances:

  • A:  TP=63, FP=28, FN=37, TN=72  →  TPR = 0.63, FPR = 0.28, PPV = 0.69, F1 = 0.66, ACC = 0.68
  • B:  TP=77, FP=77, FN=23, TN=23  →  TPR = 0.77, FPR = 0.77, PPV = 0.50, F1 = 0.61, ACC = 0.50
  • C:  TP=24, FP=88, FN=76, TN=12  →  TPR = 0.24, FPR = 0.88, PPV = 0.21, F1 = 0.22, ACC = 0.18
  • C′: TP=76, FP=12, FN=24, TN=88  →  TPR = 0.76, FPR = 0.12, PPV = 0.86, F1 = 0.81, ACC = 0.82

Plots of the four results above in the ROC space are given in the figure. The result of method A clearly shows the best predictive power among A, B, and C. The result of B lies on the random guess line (the diagonal line), and it can be seen in the table that the accuracy of B is 50%. However, when C is mirrored across the center point (0.5,0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method or test produced the C contingency table. Although the original C method has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When the C method predicts p or n, the C′ method would predict n or p, respectively. In this manner, the C′ test would perform the best. The closer a result from a contingency table is to the upper left corner, the better it predicts, but the distance from the random guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random guess line.

Curves in ROC space[edit]


Objects are often classified based on a continuous random variable. For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
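For this example the ROC curve can be written down analytically with the normal CDF; the sketch below (not from the original text) sweeps the threshold over the two distributions assumed above (diseased ~ N(2, 1), healthy ~ N(1, 1)).

    import numpy as np
    from scipy.stats import norm

    thresholds = np.linspace(-2, 5, 8)
    tpr = 1 - norm.cdf(thresholds, loc=2, scale=1)   # P(diseased sample exceeds threshold)
    fpr = 1 - norm.cdf(thresholds, loc=1, scale=1)   # P(healthy sample exceeds threshold)

    for t, tp, fp in zip(thresholds, tpr, fpr):
        print(f"threshold={t:+.1f}  TPR={tp:.3f}  FPR={fp:.3f}")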

Further interpretations[edit]

Sometimes, the ROC is used to generate a summary statistic. Common versions are:

  • the intercept of the ROC curve with the line at 90 degrees to the no-discrimination line (also called Youden's J statistic)
  • the area between the ROC curve and the no-discrimination line[citation needed]
  • the area under the ROC curve, or "AUC" ("Area Under Curve"), or A' (pronounced "a-prime"),[2] or "c-statistic".[3]
  • d' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-alone conditions, divided by their standard deviation, under the assumption that both these distributions are normal with the same standard deviation. Under these assumptions, it can be proved that the shape of the ROC depends only on d'.
  • C (Concordance) Statistic: This is a rank order statistic related to Somers' D statistic. It is commonly used in the medical literature to quantify the capacity of the estimated risk score in discriminating among subjects with different event times. It varies between 0.5 and 1.0 with higher values indicating a better predictive model. For binary outcomes C is identical to the area under the receiver operating characteristic curve. Although bootstrapping to generate confidence intervals is possible, the power of testing the differences between two (or more) C statistics is low and alternative methods such as logistic regression should probably be used.[4] The C statistic has been generalized for use in survival analysis[5] and it is also possible to combine this with statistical weighting systems. Other extensions have been proposed.[6][7]

However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm.

Area under the curve[edit]

When using normalized units, the area under the curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').[8] It can be shown that the area under the ROC curve (often referred to as simply the AUROC) is closely related to the Mann–Whitney U,[9][10] which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks.[10] The AUC is related to the Gini coefficient (G_1) by the formula G_1 = 2 AUC - 1, where:

G_1 = 1 - \sum_{k=1}^n (X_{k} - X_{k-1}) (Y_k + Y_{k-1})[11]

In this way, it is possible to calculate the AUC by using an average of a number of trapezoidal approximations.
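Both views of the AUC can be checked on a handful of made-up scores (a sketch, not from the original text): trapezoidal integration of the empirical ROC curve and the rank-based probability that a random positive outscores a random negative give the same number.

    import numpy as np

    pos = np.array([0.9, 0.8, 0.7, 0.6, 0.55])      # scores of positive instances
    neg = np.array([0.65, 0.5, 0.45, 0.4, 0.3])     # scores of negative instances

    # (a) empirical ROC curve integrated with the trapezoid rule
    thresholds = np.sort(np.concatenate([pos, neg, [np.inf, -np.inf]]))[::-1]
    tpr = [np.mean(pos >= t) for t in thresholds]
    fpr = [np.mean(neg >= t) for t in thresholds]
    auc_trap = np.trapz(tpr, fpr)

    # (b) Mann-Whitney style pairwise comparison (ties count one half)
    diff = pos[:, None] - neg[None, :]
    auc_rank = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

    print(round(auc_trap, 3), round(auc_rank, 3))   # the two estimates agree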

It is also common to calculate the Area Under the ROC Convex Hull (ROC AUCH = ROCH AUC) as any point on the line segment between two prediction results can be achieved by randomly using one or other system with probabilities proportional to the relative length of the opposite component of the segment.[12] Interestingly, it is also possible to invert concavities – just as in the figure the worse solution can be reflected to become a better solution; concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data.[13]

The machine learning community most often uses the ROC AUC statistic for model comparison.[14] However, this practice has recently been questioned based upon new machine learning research that shows that the AUC is quite noisy as a classification measure[15] and has some other significant problems in model comparison.[16][17] A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. However, the critical research[15][16] suggests frequent failures in obtaining reliable and valid AUC estimates. Thus, the practical value of the AUC measure has been called into question,[17] raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system, as well as ignoring the possibility of concavity repair, so that related alternative measures such as Informedness[18] or DeltaP are recommended.[19] These measures are essentially equivalent to the Gini for a single prediction point with DeltaP' = Informedness = 2AUC-1, whilst DeltaP = Markedness represents the dual (viz. predicting the prediction from the real class) and their geometric mean is Matthews correlation coefficient.[18] Alternatively ROC AUC may be divided into two components: its Certainty (ROC-Cert) which corresponds to the single point AUC and its Consistency (ROC-Con) which corresponds to multipoint AUC − singlepoint AUC, with the pair of measures (ROC-ConCert) being argued to capture some of the additional information that ROC adds to the single point measures (noting that it can also be applied to ROCH, and should be if it is to capture the real potential of the system whose parameterization is being investigated).[20]

Other measures[edit]

In engineering, the area between the ROC curve and the no-discrimination line is often preferred, due to its useful mathematical properties as a non-parametric statistic.[citation needed] This area is often simply known as the discrimination. In psychophysics, the Sensitivity Index d', ΔP' or DeltaP' is the most commonly used measure[21] and is equivalent to twice the discrimination, being equal also to Informedness, deskewed WRAcc and Gini Coefficient in the single point case (single parameterization or single system).[18] These measures all have the advantage that 0 represents chance performance whilst Informedness=1 represents perfect performance, and -1 represents the "perverse" case of full informedness used to always give the wrong response, with Informedness being proven to be the probability of making an informed decision (rather than guessing).[22] ROC AUC and AUCH have a related property that chance performance has a fixed value, but it is 0.5, and the normalization to 2AUC-1 brings this to 0 and allows Informedness and Gini to be interpreted as Kappa statistics, but Informedness has been shown to have desirable characteristics for Machine Learning versus other common definitions of Kappa such as Cohen Kappa and Fleiss Kappa.[18][23]

The illustration at the top right of the page shows the use of ROC graphs for the discrimination between the quality of different algorithms for predicting epitopes. The graph shows that if one detects at least 60% of the epitopes in a virus protein, at least 30% of the output is falsely marked as epitopes.

Sometimes it can be more useful to look at a specific region of the ROC Curve rather than at the whole curve. It is possible to compute partial AUC.[24] For example, one could focus on the region of the curve with low false positive rate, which is often of prime interest for population screening tests.[25] Another common approach for classification problems in which P ≪ N (common in bioinformatics applications) is to use a logarithmic scale for the x-axis.[26]

Detection error tradeoff graph[edit]

Example DET graph

An alternative to the ROC curve is the detection error tradeoff (DET) graph, which plots the false negative rate (missed detections) vs. the false positive rate (false alarms) on non-linearly transformed x- and y-axes. The transformation function is the quantile function of the normal distribution, i.e., the inverse of the cumulative normal distribution. It is, in fact, the same transformation as zROC, below, except that the complement of the hit rate, the miss rate or false negative rate, is used. This alternative spends more graph area on the region of interest. Most of the ROC area is of little interest; one primarily cares about the region tight against the y-axis and the top left corner – which, because of using miss rate instead of its complement, the hit rate, is the lower left corner in a DET plot. The DET plot is used extensively in the automatic speaker recognition community, where the name DET was first used. The analysis of the ROC performance in graphs with this warping of the axes was used by psychologists in perception studies around the middle of the 20th century, where this was dubbed "double probability paper".

Z-transformation[edit]

If a z-transformation is applied to the ROC curve, the curve will be transformed into a straight line.[27] This z-transformation is based on a normal distribution with a mean of zero and a standard deviation of one. In memory strength theory, one must assume that the zROC is not only linear, but has a slope of 1.0. The normal distributions of targets (studied objects that the subjects need to recall) and lures (non-studied objects that the subjects attempt to recall) are the factor causing the zROC to be linear.

The linearity of the zROC curve depends on the standard deviations of the target and lure strength distributions. If the standard deviations are equal, the slope will be 1.0. If the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution, then the slope will be smaller than 1.0. In most studies, it has been found that the zROC curve slopes constantly fall below 1, usually between 0.5 and 0.9.[28] Many experiments yielded a zROC slope of 0.8. A slope of 0.8 implies that the variability of the target strength distribution is 25% larger than the variability of the lure strength distribution.[29]

Another variable used is d'. d' is a measure of sensitivity for yes-no recognition that can easily be expressed in terms of z-values. d' measures sensitivity, in that it measures the degree of overlap between target and lure distributions. It is calculated as the mean of the target distribution minus the mean of the lure distribution, expressed in standard deviation units. For a given hit rate and false alarm rate, d' can be calculated with the following equation: d' = z(hit rate) − z(false alarm rate). Although d' is a commonly used parameter, it must be recognized that it is only relevant when strictly adhering to the very strong assumptions of strength theory made above.[30]
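
A minimal sketch of the d' computation above, assuming SciPy is available; the hit and false-alarm rates are hypothetical.

```python
# Hedged sketch: d' from a hit rate and a false-alarm rate via the inverse
# normal CDF (the z-transformation in the equation above).
from scipy.stats import norm

hit_rate = 0.82          # hypothetical proportion of "yes" responses to targets
false_alarm_rate = 0.27  # hypothetical proportion of "yes" responses to lures

d_prime = norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)
print(d_prime)  # larger values mean less overlap between target and lure distributions
```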

The z-transformation of a ROC curve is linear under the assumptions above, except in special situations. The Yonelinas familiarity-recollection model is a two-dimensional account of recognition memory. Instead of the subject simply answering yes or no to a specific input, the subject gives the input a feeling of familiarity, which operates like the original ROC curve. What changes, though, is a parameter for Recollection (R). Recollection is assumed to be all-or-none, and it trumps familiarity. If there were no recollection component, zROC would have a predicted slope of 1. However, when adding the recollection component, the zROC curve will be concave up, with a decreased slope. This difference in shape and slope results from an added element of variability due to some items being recollected. Patients with anterograde amnesia are unable to recollect, so their Yonelinas zROC curve would have a slope close to 1.0.[31]

History[edit]

The ROC curve was first used during World War II for the analysis of radar signals before it was employed in signal detection theory.[32] Following the attack on Pearl Harbor in 1941, the United States Army began new research aimed at improving the correct detection of Japanese aircraft from their radar signals.

In the 1950s, ROC curves were employed in psychophysics to assess human (and occasionally non-human animal) detection of weak signals.[32] In medicine, ROC analysis has been extensively used in the evaluation of diagnostic tests.[33][34] ROC curves are also used extensively in epidemiology and medical research and are frequently mentioned in conjunction with evidence-based medicine. In radiology, ROC analysis is commonly used to evaluate new imaging techniques.[35] In the social sciences, ROC analysis is often called the ROC Accuracy Ratio, a common technique for judging the accuracy of default probability models.

ROC curves also proved useful for the evaluation of machine learning techniques. The first application of ROC in machine learning was by Spackman who demonstrated the value of ROC curves in comparing and evaluating different classification algorithms.[36]

See also[edit]

References[edit]

  1. ^ Swets, John A.; Signal detection theory and ROC analysis in psychology and diagnostics : collected papers, Lawrence Erlbaum Associates, Mahwah, NJ, 1996
  2. ^ Fogarty, James; Baker, Ryan S.; Hudson, Scott E. (2005). "Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction". ACM International Conference Proceeding Series, Proceedings of Graphics Interface 2005. Waterloo, ON: Canadian Human-Computer Communications Society. 
  3. ^ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). 
  4. ^ LaValley MP (2008) Logistic Regression. Circulation 117: 2395-2399 doi: 10.1161/CIRCULATIONAHA.106.682658
  5. ^ Heagerty PJ, Zheng Y (2005) Survival model predictive accuracy and ROC curves. Biometrics 61:92–105
  6. ^ Gonen M, Heller G (2005) Concordance probability and discriminatory power in proportional hazards regression. Biometrika 92:965–970
  7. ^ Chambless LE, Diao G (2006) Estimation of time-dependent area under the ROC curve for long-term risk prediction. Stat Med 25:3474 –3486.
  8. ^ Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
  9. ^ Hanley, James A.; McNeil, Barbara J. (1982). "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve". Radiology 143 (1): 29–36. PMID 7063747. 
  10. ^ a b Mason, Simon J.; Graham, Nicholas E. (2002). "Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation". Quarterly Journal of the Royal Meteorological Society (128): 2145–2166. 
  11. ^ Hand, David J.; and Till, Robert J. (2001); A simple generalization of the area under the ROC curve for multiple class classification problems, Machine Learning, 45, 171–186.
  12. ^ Provost, F.; Fawcett, T. (2001). "Robust classification for imprecise environments.". Machine Learning, 44: 203–231. 
  13. ^ "Repairing concavities in ROC curves.". 19th International Joint Conference on Artificial Intelligence (IJCAI'05),. 2005. pp. 702–707. 
  14. ^ Hanley, James A.; McNeil, Barbara J. (1983-09-01). "A method of comparing the areas under receiver operating characteristic curves derived from the same cases". Radiology 148 (3): 839–43. PMID 6878708. Retrieved 2008-12-03. 
  15. ^ a b Hanczar, Blaise; Hua, Jianping; Sima, Chao; Weinstein, John; Bittner, Michael; and Dougherty, Edward R. (2010); Small-sample precision of ROC-related estimates, Bioinformatics 26 (6): 822–830
  16. ^ a b Lobo, Jorge M.; Jiménez-Valverde, Alberto; and Real, Raimundo (2008), AUC: a misleading measure of the performance of predictive distribution models, Global Ecology and Biogeography, 17: 145–151
  17. ^ a b Hand, David J. (2009); Measuring classifier performance: A coherent alternative to the area under the ROC curve, Machine Learning, 77: 103–123
  18. ^ a b c d Powers, David M W (2007/2011). "Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies 2 (1): 37–63. 
  19. ^ Powers, David M.W. (2012). "The Problem of Area Under the Curve". International Conference on Information Science and Technology. 
  20. ^ Powers, David M.W. (2012). "ROC-ConCert". Spring Conference on Engineering Technology. 
  21. ^ Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics 17: 97−119. 
  22. ^ Powers, David M. W. (2003). "Recall and Precision versus the Bookmaker". Proceedings of the International Conference on Cognitive Science (ICSC- 2003), Sydney Australia, 2003, pp.529-534. 
  23. ^ Powers, David M. W. (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. 
  24. ^ McClish, Donna Katzman (1989-08-01). "Analyzing a Portion of the ROC Curve". Medical Decision Making 9 (3): 190–195. doi:10.1177/0272989X8900900307. PMID 2668680. Retrieved 2008-09-29. 
  25. ^ Dodd, Lori E.; Pepe, Margaret S. (2003). "Partial AUC Estimation and Regression". Biometrics 59 (3): 614–623. doi:10.1111/1541-0420.00071. PMID 14601762. Retrieved 2007-12-18. 
  26. ^ Karplus, Kevin (2011); Better than Chance: the importance of null models, University of California, Santa Cruz, in Proceedings of the First International Workshop on Pattern Recognition in Proteomics, Structural Biology and Bioinformatics (PR PS BB 2011)
  27. ^ MacMillan, Neil A.; Creelman, C. Douglas (2005). Detection Theory: A User's Guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. ISBN 1-4106-1114-0. 
  28. ^ Glanzer, Murray; Kisok, Kim; Hilford, Andy; Adams, John K. (1999). "Slope of the receiver-operating characteristic in recognition memory". Journal of Experimental Psychology: Learning, Memory, and Cognition 25 (2): 500–513. 
  29. ^ Ratcliff, Roger; McCoon, Gail; Tindall, Michael (1994). "Empirical generality of data from recognition memory ROC functions and implications for GMMs". Journal of Experimental Psychology: Learning, Memory, and Cognition 20: 763–785. 
  30. ^ Zhang, Jun; Mueller, Shane T. (2005). "A note on ROC analysis and non-parametric estimate of sensitivity". Psychometrika 70 (203-212). 
  31. ^ Yonelinas, Andrew P.; Kroll, Neal E. A.; Dobbins, Ian G.; Lazzara, Michele; Knight, Robert T. (1998). "Recollection and familiarity deficits in amnesia: Convergence of remember-know, process dissociation, and receiver operating characteristic data". Neuropsychology 12: 323–339. 
  32. ^ a b Green, David M.; Swets, John A. (1966). Signal detection theory and psychophysics. New York, NY: John Wiley and Sons Inc. ISBN 0-471-32420-5. 
  33. ^ Zweig, Mark H.; Campbell, Gregory (1993). "Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine". Clinical Chemistry 39 (8): 561–577. PMID 8472349. 
  34. ^ Pepe, Margaret S. (2003). The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford. ISBN 0-19-856582-8. 
  35. ^ Obuchowski, Nancy A. (2003). "Receiver operating characteristic curves and their use in radiology". Radiology 229 (1): 3–8. doi:10.1148/radiol.2291010898. PMID 14519861. 
  36. ^ Spackman, Kent A. (1989). "Signal detection theory: Valuable tools for evaluating inductive learning". Proceedings of the Sixth International Workshop on Machine Learning. San Mateo, CA: Morgan Kaufmann. pp. 160–163. 

General references[edit]

  • Zhou, Xiao-Hua; Obuchowski, Nancy A.; McClish, Donna K. (2002). Statistical Methods in Diagnostic Medicine. New York, NY: Wiley & Sons. ISBN 978-0-471-34772-9. 

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Regression_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Regression_analysis new file mode 100644 index 00000000..b3faa808 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Regression_analysis @@ -0,0 +1 @@ + Regression analysis - Wikipedia, the free encyclopedia

Regression analysis

From Wikipedia, the free encyclopedia

In statistics, regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable;[1] for example, correlation does not imply causation.

A large body of techniques for carrying out regression analysis has been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.

The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if many data are available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.[2][3]

Contents

History[edit]

The earliest form of regression was the method of least squares, which was published by Legendre in 1805,[4] and by Gauss in 1809.[5] Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but also later the then newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821,[6] including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean).[7][8] For Galton, regression had only this biological meaning,[9][10] but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context.[11][12] In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925.[13][14][15] Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

In the 1950s and 1960s, economists used electromechanical desk calculators to calculate regressions. Before 1970, it sometimes took up to 24 hours to receive the result from one regression.[16]

Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression, regression involving correlated responses such as time series and growth curves, regression in which the predictor or response variables are curves, images, graphs, or other complex data objects, regression methods accommodating various types of missing data, nonparametric regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.

Regression models[edit]

Regression models involve the following variables:

  • The unknown parameters, denoted as β, which may represent a scalar or a vector.
  • The independent variables, X.
  • The dependent variable, Y.

In various fields of application, different terminologies are used in place of dependent and independent variables.

A regression model relates Y to a function of X and β.

Y \approx f (\mathbf {X}, \boldsymbol{\beta} )

The approximation is usually formalized as E(Y | X) = f(X, β). To carry out regression analysis, the form of the function f must be specified. Sometimes the form of this function is based on knowledge about the relationship between Y and X that does not rely on the data. If no such knowledge is available, a flexible or convenient form for f is chosen.

Assume now that the vector of unknown parameters β is of length k. In order to perform a regression analysis the user must provide information about the dependent variable Y:

  • If N data points of the form (Y,X) are observed, where N < k, most classical approaches to regression analysis cannot be performed: since the system of equations defining the regression model is underdetermined, there are not enough data to recover β.
  • If exactly N = k data points are observed, and the function f is linear, the equations Y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the X are linearly independent. If f is nonlinear, a solution may not exist, or many solutions may exist.
  • The most common situation is where N > k data points are observed. In this case, there is enough information in the data to estimate a unique value for β that best fits the data in some sense, and the regression model when applied to the data can be viewed as an overdetermined system in β.

In the last case, the regression analysis provides the tools for:

  1. Finding a solution for unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y (the method of least squares), as sketched in the example after this list.
  2. Under certain statistical assumptions, the regression analysis uses the surplus of information to provide statistical information about the unknown parameters β and predicted values of the dependent variable Y.
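
A minimal sketch of the common N > k case solved as an overdetermined least-squares problem, assuming NumPy is available; the data are hypothetical.

```python
# Hedged sketch: overdetermined least squares (N = 5 observations, k = 2 parameters).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])          # hypothetical observations

X = np.column_stack([np.ones_like(x), x])         # design matrix with intercept column
beta_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                   # estimated intercept and slope
```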

Necessary number of independent measurements[edit]

Consider a regression model which has three unknown parameters, β0, β1, and β2. Suppose an experimenter performs 10 measurements all at exactly the same value of independent variable vector X (which contains the independent variables X1, X2, and X3). In this case, regression analysis fails to give a unique set of estimated values for the three unknown parameters; the experimenter did not provide enough information. The best one can do is to estimate the average value and the standard deviation of the dependent variable Y. Similarly, measuring at two different values of X would give enough data for a regression with two unknowns, but not for three or more unknowns.

If the experimenter had performed measurements at three different values of the independent variable vector X, then regression analysis would provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix X^T X is invertible.

Statistical assumptions[edit]

When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors ε_i are normally distributed, then the excess of information contained in (N − k) measurements is used to make statistical predictions about the unknown parameters. This excess of information is referred to as the degrees of freedom of the regression.

Underlying assumptions[edit]

Classical assumptions for regression analysis include:

  • The sample is representative of the population for the inference prediction.
  • The error is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured with no error. (Note: If this is not so, modeling may be done instead using errors-in-variables model techniques).
  • The predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others.
  • The errors are uncorrelated, that is, the variance–covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
  • The variance of the error is constant across observations (homoscedasticity). If not, weighted least squares or other methods might instead be used.

These are sufficient conditions for the least-squares estimator to possess desirable properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient in the class of linear unbiased estimators. It is important to note that actual data rarely satisfies the assumptions. That is, the method is used even though the assumptions are not true. Variation from the assumptions can sometimes be used as a measure of how far the model is from being useful. Many of these assumptions may be relaxed in more advanced treatments. Reports of statistical analyses usually include analyses of tests on the sample data and methodology for the fit and usefulness of the model.

Assumptions include the geometrical support of the variables.[17][clarification needed] Independent and dependent variables often refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions of regression. Geographically weighted regression is one technique to deal with such data.[18] Also, variables may include values aggregated by areas. With aggregated data the modifiable areal unit problem can cause extreme variation in regression parameters.[19] When analyzing data aggregated by political boundaries, postal codes or census areas, results may be very different for a different choice of units.

Linear regression[edit]

In linear regression, the model specification is that the dependent variable, y_i, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling n data points there is one independent variable, x_i, and two parameters, \beta_0 and \beta_1:

straight line: y_i=\beta_0 +\beta_1 x_i +\varepsilon_i,\quad i=1,\dots,n.\!

In multiple linear regression, there are several independent variables or functions of independent variables.

Adding a term in x_i^2 to the preceding regression gives:

parabola: y_i=\beta_0 +\beta_1 x_i +\beta_2 x_i^2+\varepsilon_i,\ i=1,\dots,n.\!

This is still linear regression; although the expression on the right hand side is quadratic in the independent variable x_i, it is linear in the parameters \beta_0, \beta_1 and \beta_2.

In both cases, \varepsilon_i is an error term and the subscript i indexes a particular observation.
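
A minimal sketch, assuming NumPy is available, showing that the parabola model above is still fitted by linear least squares, because the design-matrix columns 1, x and x^2 enter linearly in the parameters; the data are hypothetical.

```python
# Hedged sketch: "linear in the parameters" even though the model is quadratic in x.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([4.2, 1.1, 0.3, 0.9, 4.1, 9.2])       # hypothetical noisy quadratic data

X = np.column_stack([np.ones_like(x), x, x**2])    # columns for beta_0, beta_1, beta_2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                    # roughly [0, 0, 1] for these data
```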

Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

 \widehat{y_i} = \widehat{\beta}_0 + \widehat{\beta}_1 x_i.

The residual, e_i = y_i - \widehat{y}_i, is the difference between the true value of the dependent variable, y_i, and the value of the dependent variable predicted by the model, \widehat{y_i}. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE,[20][21] also sometimes denoted RSS:

SSE=\sum_{i=1}^n e_i^2. \,

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, \widehat{\beta}_0, \widehat{\beta}_1.

Illustration of linear regression on a data set.

In the case of simple regression, the formulas for the least squares estimates are

\widehat{\beta_1}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}\text{ and }\hat{\beta_0}=\bar{y}-\widehat{\beta_1}\bar{x}

where \bar{x} is the mean (average) of the x values and \bar{y} is the mean of the y values.
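
A minimal sketch of the closed-form estimates above, assuming NumPy is available and using hypothetical data.

```python
# Hedged sketch: closed-form least-squares estimates for simple regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # hypothetical data

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)
```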

Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

 \hat{\sigma}^2_\varepsilon = \frac{SSE}{n-2}.\,

This is called the mean square error (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n-p) for p regressors or (n-p-1) if an intercept is used.[22] In this case, p=1 so the denominator is n-2.

The standard errors of the parameter estimates are given by

\hat\sigma_{\beta_0}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar x)^2}}
\hat\sigma_{\beta_1}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{\sum(x_i-\bar x)^2}}.

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
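
A minimal sketch of the residual-variance and standard-error formulas above, assuming NumPy is available and using the same hypothetical data as before.

```python
# Hedged sketch: SSE, mean square error and standard errors for simple regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])    # hypothetical data
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

residuals = y - (beta0_hat + beta1_hat * x)
sse = np.sum(residuals ** 2)
sigma_hat = np.sqrt(sse / (n - 2))                      # p = 1 regressor plus intercept

sxx = np.sum((x - x_bar) ** 2)
se_beta0 = sigma_hat * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_beta1 = sigma_hat * np.sqrt(1.0 / sxx)
print(sigma_hat ** 2, se_beta0, se_beta1)               # MSE and standard errors
```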

General linear model[edit]

In the more general multiple regression model, there are p independent variables:

 y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \,

where x_{ij} is the ith observation on the jth independent variable, and where the first independent variable takes the value 1 for all i (so \beta_1 is the regression intercept).

The least squares parameter estimates are obtained from p normal equations. The residual can be written as

\varepsilon_i=y_i -  \hat\beta_1 x_{i1} - \cdots - \hat\beta_p x_{ip}.

The normal equations are

\sum_{i=1}^n \sum_{k=1}^p X_{ij}X_{ik}\hat \beta_k=\sum_{i=1}^n X_{ij}y_i,\  j=1,\dots,p.\,

In matrix notation, the normal equations are written as

\mathbf{(X^\top X )\hat{\boldsymbol{\beta}}= {}X^\top Y},\,

where the ij element of X is x_{ij}, the i element of the column vector Y is y_i, and the j element of \hat \beta is \hat \beta_j. Thus X is n×p, Y is n×1, and \hat \beta is p×1. The solution is

\mathbf{\hat{\boldsymbol{\beta}}= {}(X^\top X )^{-1}X^\top Y}.\,
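
A minimal sketch of solving the normal equations with NumPy on simulated data; np.linalg.solve is used rather than forming the explicit inverse.

```python
# Hedged sketch: solving (X^T X) beta = X^T Y numerically.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # first column = 1 (intercept)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)                      # simulated data

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                                   # close to beta_true
```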

Diagnostics[edit]

Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analyses of the pattern of residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.

Interpretations of these diagnostic tests rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.

"Limited dependent" variables[edit]

The phrase "limited dependent" is used in econometric statistics for categorical and constrained variables.

The response variable may be non-continuous ("limited" to lie on some subset of the real line). For binary (zero or one) variables, if analysis proceeds with least-squares linear regression, the model is called the linear probability model. Nonlinear models for binary dependent variables include the probit and logit model. The multivariate probit model is a standard method of estimating a joint relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. Censored regression models may be used when the dependent variable is only sometimes observed, and Heckman correction type models may be used when the sample is not randomly selected from the population of interest. An alternative to such procedures is linear regression based on polychoric correlation (or polyserial correlations) between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is positive with low values and represents the repetition of the occurrence of an event, then count models like the Poisson regression or the negative binomial model may be used instead.
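
A minimal sketch of a logit model for a binary dependent variable, assuming scikit-learn is available (its LogisticRegression applies an L2 penalty by default, so the estimates are slightly shrunk; an econometric package such as statsmodels would give unpenalized maximum-likelihood estimates). The data are simulated.

```python
# Hedged sketch: logit model for a binary (zero or one) response.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                       # one independent variable
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))    # true success probability
y = rng.binomial(1, p)                              # simulated binary response

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)                # roughly 0.5 and 2.0, shrunk by the default penalty
```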

Interpolation and extrapolation[edit]

Regression models predict a value of the Y variable given known values of the X variables. Prediction within the range of values in the dataset used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation. Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.

It is generally advised[citation needed] that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.

For such reasons and others, some tend to say that it might be unwise to undertake extrapolation.[23]

However, this does not cover the full set of modelling errors that may be made: in particular, the assumption of a particular form for the relation between Y and X. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available. This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship. Best-practice advice here[citation needed] is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model. If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be made use of in selecting the model – even if the observed dataset has no values particularly near such bounds. The implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered. At a minimum, it can ensure that any extrapolation arising from a fitted model is "realistic" (or in accord with what is known).

Nonlinear regression[edit]

When the model function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which are summarized in Differences between linear and non-linear least squares.

Power and sample size calculations[edit]

There are no generally agreed methods for relating the number of observations to the number of independent variables in the model. One rule of thumb suggested by Good and Hardin is N=m^n, where N is the sample size, n is the number of independent variables and m is the number of observations needed to reach the desired precision if the model had only one independent variable.[24] For example, suppose a researcher is building a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations are needed to precisely define a straight line (m), then the maximum number of independent variables the model can support is 4, because

\frac{\log{1000}}{\log{5}}=4.29.
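
A minimal sketch of this rule-of-thumb calculation, using only the Python standard library.

```python
# Hedged sketch: n = log(N) / log(m) from the Good-and-Hardin rule of thumb N = m^n.
import math

N = 1000   # sample size
m = 5      # observations needed per independent variable for the desired precision

n = math.log(N) / math.log(m)
print(n, math.floor(n))   # 4.29..., so at most 4 independent variables
```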

Other methods[edit]

Although the parameters of a regression model are usually estimated using the method of least squares, other methods which have been used include:

Software[edit]

All major statistical software packages perform least squares regression analysis and inference. Simple linear regression and multiple regression using least squares can be done in some spreadsheet applications and on some calculators. While many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods, and a method with a given name may be implemented differently in different packages. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

See also[edit]

References[edit]

  1. ^ Armstrong, J. Scott (2012). "Illusions in Regression Analysis". International Journal of Forecasting (forthcoming) 28 (3): 689. doi:10.1016/j.ijforecast.2012.02.001. 
  2. ^ David A. Freedman, Statistical Models: Theory and Practice, Cambridge University Press (2005)
  3. ^ R. Dennis Cook; Sanford Weisberg Criticism and Influence Analysis in Regression, Sociological Methodology, Vol. 13. (1982), pp. 313–361
  4. ^ A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes, Firmin Didot, Paris, 1805. “Sur la Méthode des moindres quarrés” appears as an appendix.
  5. ^ C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum. (1809)
  6. ^ C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. (1821/1823)
  7. ^ Mogull, Robert G. (2004). Second-Semester Applied Statistics. Kendall/Hunt Publishing Company. p. 59. ISBN 0-7575-1181-3. 
  8. ^ Galton, Francis (1989). "Kinship and Correlation (reprinted 1989)". Statistical Science (Institute of Mathematical Statistics) 4 (2): 80–86. doi:10.1214/ss/1177012581. JSTOR 2245330. 
  9. ^ Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492–495, 512–514, 532–533. (Galton uses the term "reversion" in this paper, which discusses the size of peas.)
  10. ^ Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this paper, which discusses the height of humans.)
  11. ^ Yule, G. Udny (1897). "On the Theory of Correlation". Journal of the Royal Statistical Society (Blackwell Publishing) 60 (4): 812–54. doi:10.2307/2979746. JSTOR 2979746. 
  12. ^ Pearson, Karl; Yule, G.U.; Blanchard, Norman; Lee,Alice (1903). "The Law of Ancestral Heredity". Biometrika (Biometrika Trust) 2 (2): 211–236. doi:10.1093/biomet/2.2.211. JSTOR 2331683. 
  13. ^ Fisher, R.A. (1922). "The goodness of fit of regression formulae, and the distribution of regression coefficients". Journal of the Royal Statistical Society (Blackwell Publishing) 85 (4): 597–612. doi:10.2307/2341124. JSTOR 2341124. 
  14. ^ Ronald A. Fisher (1954). Statistical Methods for Research Workers (Twelfth ed.). Edinburgh: Oliver and Boyd. ISBN 0-05-002170-2. 
  15. ^ Aldrich, John (2005). "Fisher and Regression". Statistical Science 20 (4): 401–417. doi:10.1214/088342305000000331. JSTOR 20061201. 
  16. ^ Rodney Ramcharan. Regressions: Why Are Economists Obsessed with Them? March 2006. Accessed 2011-12-03.
  17. ^ N. Cressie (1996) Change of Support and the Modifiable Areal Unit Problem. Geographical Systems 3:159–180.
  18. ^ Fotheringham, A. Stewart; Brunsdon, Chris; Charlton, Martin (2002). Geographically weighted regression: the analysis of spatially varying relationships (Reprint ed.). Chichester, England: John Wiley. ISBN 978-0-471-49616-8. 
  19. ^ Fotheringham, AS; Wong, DWS (1 January 1991). "The modifiable areal unit problem in multivariate statistical analysis". Environment and Planning A 23 (7): 1025–1044. doi:10.1068/a231025. 
  20. ^ M. H. Kutner, C. J. Nachtsheim, and J. Neter (2004), "Applied Linear Regression Models", 4th ed., McGraw-Hill/Irwin, Boston (p. 25)
  21. ^ N. Ravishankar and D. K. Dey (2002), "A First Course in Linear Model Theory", Chapman and Hall/CRC, Boca Raton (p. 101)
  22. ^ Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 288.
  23. ^ Chiang, C.L, (2003) Statistical methods of analysis, World Scientific. ISBN 981-238-310-7 - page 274 section 9.7.4 "interpolation vs extrapolation"
  24. ^ Good, P. I.; Hardin, J. W. (2009). Common Errors in Statistics (And How to Avoid Them) (3rd ed.). Hoboken, New Jersey: Wiley. p. 211. ISBN 978-0-470-45798-6. 
  25. ^ Tofallis, C. (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods 7: 526–534. doi:10.2139/ssrn.1406472. 
  26. ^ YangJing Long (2009). "Human age estimation by metric learning for regression problems". Proc. International Conference on Computer Analysis of Images and Patterns: 74–82. 

Further reading[edit]

  • William H. Kruskal and Judith M. Tanur, ed. (1978), "Linear Hypotheses," International Encyclopedia of Statistics. Free Press, v. 1,
Evan J. Williams, "I. Regression," pp. 523–41.
Julian C. Stanley, "II. Analysis of Variance," pp. 541–554.
  • Lindley, D.V. (1987). "Regression and correlation analysis," New Palgrave: A Dictionary of Economics, v. 4, pp. 120–23.
  • Birkes, David and Dodge, Y., Alternative Methods of Regression. ISBN 0-471-56881-3
  • Chatfield, C. (1993) "Calculating Interval Forecasts," Journal of Business and Economic Statistics, 11. pp. 121–135.
  • Draper, N.R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 0-471-17082-8. 
  • Fox, J. (1997). Applied Regression Analysis, Linear Models and Related Methods. Sage
  • Hardle, W., Applied Nonparametric Regression (1990), ISBN 0-521-42950-1
  • Meade, N. and T. Islam (1995) "Prediction Intervals for Growth Curve Forecasts" Journal of Forecasting, 14, pp. 413–430.
  • A. Sen, M. Srivastava, Regression Analysis — Theory, Methods, and Applications, Springer-Verlag, Berlin, 2011 (4th printing).
  • T. Strutz: Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Vieweg+Teubner, ISBN 978-3-8348-1022-9.

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ren_rou b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ren_rou new file mode 100644 index 00000000..2856ea0f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ren_rou @@ -0,0 +1 @@ + Ren-rou - Wikipedia, the free encyclopedia

Ren-rou

From Wikipedia, the free encyclopedia

Ren-rou (Chinese: 人肉, pinyin: rén ròu), or ren rou sou suo (人肉搜索), refers to mining specific information about a person or group of people, usually carried out by a group of non-professional participants working in a coordinated effort with the help of modern technologies, especially the internet. It is usually conducted without the permission of the subject(s) being ren-roued.

Contents

Purpose of ren-rou [edit]

  • Gathering information in order to learn more about someone.
  • Promoting hatred, by exposing actions or behaviors that are likely to be opposed by the public or by a certain group of people.
  • Promoting assault against someone, by exposing the subject's private information (home address, phone number, email, workplace) so that the subject can be targeted.
  • Holding people accountable for what they say online. By exposing the true identity of an online commentator, the subject is tied to his or her statements and risks being punished (by the public or by law) for saying something irresponsible online.
  • Exposing someone's bad or illegal actions, such as fraud.

Triggers for individuals becoming ren-roued [edit]

  • An online post or comment (usually in a forum) that strongly angers its viewers.
  • Evidence or traces revealed against someone that suggest a potentially serious wrongdoing, making people want to find out more.

Methods of conducting information mining [edit]

  • Going after IP addresses. People who comment anonymously online sometimes leave their IP address visible to the public or to certain groups. An IP address can be looked up on some sites to find the physical location it is assigned to, which can leak further information to pursue.
  • Posting existing information about the subject online, so that other people who know the subject can recognize him or her and contribute more information.
  • Hacking: breaking into email inboxes or into the subject's computers to mine more information.
  • Taking photos and recording voices, usually covertly.
  • Coordinated tracking on the web.
  • Using search engines to find information; for instance, the subject may have posted a résumé on certain sites that can be searched.

Pros and cons[1] [edit]

Pro:

  • It is a great way to uncover fraud or illegal actions.
  • It makes people aware of good and evil (through learning why the subject is being ren-roued).
  • Privacy laws exist, so subjects being ren-roued can still protect themselves against any illegal conduct of ren-rou that harms them.

Con:

  • It invades people's privacy.
  • It makes people fearful and nervous, even when this is unnecessary.
  • There is no legal guideline on how to conduct it properly.

References [edit]

[1] 人肉搜索谁是责任主体 ("Who is the responsible party in a human-flesh search?"), http://tech.sina.com.cn/it/2008-12-30/10182704437.shtml


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SEMMA b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SEMMA new file mode 100644 index 00000000..8da866f4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SEMMA @@ -0,0 +1 @@ + SEMMA - Wikipedia, the free encyclopedia

SEMMA

From Wikipedia, the free encyclopedia

SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute Inc., one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications.[1] Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is "rather a logical organisation of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining".[2]

Contents

Background[edit]

In the expanding field of data mining, there has been a call for a standard methodology, or a simple list of best practices, for the diversified and iterative process of data mining that users can apply to their data mining projects regardless of industry. While the Cross Industry Standard Process for Data Mining, or CRISP-DM, founded by the European Strategic Program on Research in Information Technology initiative, aimed to create a neutral methodology, SAS also offered a pattern to follow in its data mining tools.

Phases of SEMMA[edit]

The phases of SEMMA and related tasks are the following:[2]

  • Sample. The process starts with data sampling, e.g., selecting the data set for modeling. The data set should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
  • Explore. This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, and also abnormalities, with the help of data visualization.
  • Modify. The Modify phase contains methods to select, create and transform variables in preparation for data modeling.
  • Model. In the Model phase the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.
  • Assess. The last phase is Assess. The evaluation of the modeling results shows the reliability and usefulness of the created models.

Criticism[edit]

SEMMA mainly focuses on the modeling tasks of data mining projects, leaving the business aspects out (unlike, e.g., CRISP-DM and its Business Understanding phase). Additionally, SEMMA is designed to help the users of the SAS Enterprise Miner software. Therefore, applying it outside Enterprise Miner can be ambiguous.[3]

See also[edit]

References[edit]

  1. ^ Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182-185.
  2. ^ a b SAS Enterprise Miner website
  3. ^ Rohanizadeh, S. S. and Moghadam, M. B. A Proposed Data Mining Methodology and its Application to Industrial Procedures Journal of Industrial Engineering 4 (2009) pp 37-50.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGKDD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGKDD new file mode 100644 index 00000000..cc984dbf --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGKDD @@ -0,0 +1 @@ + SIGKDD - Wikipedia, the free encyclopedia

SIGKDD

From Wikipedia, the free encyclopedia

SIGKDD is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining. It became an official ACM SIG in 1998. The official web page of SIGKDD can be found at www.KDD.org. The current Chairman of SIGKDD (since 2009) is Usama M. Fayyad, Ph.D.

Contents

Conferences[edit]

SIGKDD has hosted an annual conference, the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), since 1995. The KDD conferences grew out of KDD (Knowledge Discovery and Data Mining) workshops at AAAI conferences, which were started by Gregory Piatetsky-Shapiro in 1989, 1991, and 1993, and by Usama Fayyad in 1994.[1] The papers of each Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining are published through the ACM.[2]

KDD-2012 took place in Beijing, China,[3] and KDD-2013 will take place in Chicago, United States, August 11–14, 2013.

KDD-Cup[edit]

SIGKDD sponsors the KDD Cup competition every year in conjunction with the annual conference. It is aimed at members of the industry and academia, particularly students, interested in KDD.

Awards[edit]

The group also annually recognizes members of the KDD community with its Innovation Award and Service Award. Additionally, KDD presents a Best Paper Award [4] to recognize the highest quality paper at each conference.

SIGKDD Explorations[edit]

SIGKDD has also published a biannual academic journal titled SIGKDD Explorations since June 1999.

Editors in Chief

Current Executive Committee[edit]

Chair

Treasurer

Directors

Former Chairpersons

  • Gregory Piatetsky-Shapiro[8] (2005-2008)
  • Won Kim (1998-2004)

Information Directors[edit]

References[edit]

External links[edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGMOD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGMOD new file mode 100644 index 00000000..e8b7b916 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SIGMOD @@ -0,0 +1 @@ + SIGMOD - Wikipedia, the free encyclopedia

SIGMOD

From Wikipedia, the free encyclopedia

SIGMOD is the Association for Computing Machinery's Special Interest Group on Management of Data, which specializes in large-scale data management problems and databases.

The annual ACM SIGMOD Conference, which began in 1975, is considered one of the most important in the field. While traditionally this conference had always been held within North America, it recently took place in Europe (in 2004) and Asia (in 2007). The acceptance rate of the ACM SIGMOD Conference, averaged from 1996 to 2012, is 18%; in 2012 it was 17%.[1]

In association with SIGACT and SIGART, SIGMOD also sponsors the annual ACM Symposium on Principles of Database Systems (PODS) conference on the theoretical aspects of database systems. PODS began in 1982, and has been held jointly with the SIGMOD conference since 1991.

Each year, the group gives out several awards to contributions to the field of data management. The most important of these is the SIGMOD Edgar F. Codd Innovations Award (named after the computer scientist Edgar F. Codd), which is awarded to "innovative and highly significant contributions of enduring value to the development, understanding, or use of database systems and databases". Additionally, SIGMOD presents a Best Paper Award[2] to recognize the highest quality paper at each conference.

Contents

Venues of SIGMOD conferences[edit]

Year Place Link
2013 New York [1]
2012 Scottsdale [2]
2011 Athens [3]
2010 Indianapolis [4]
2009 Providence [5]
2008 Vancouver [6]
2007 Beijing [7]
2006 Chicago [8]
2005 Baltimore [9]
2004 Paris [10]
2003 San Diego [11]
2002 Madison [12]
2001 Santa Barbara [13]
2000 Dallas [14]
1999 Philadelphia
1998 Seattle
1997 Tucson
1996 Montreal
1995 San Jose
1994 Minneapolis
1993 Washington, DC
1992 San Diego
1991 Denver
1990 Atlantic City
1989 Portland
1988 Chicago
1987 San Francisco
1986 Washington, DC
1985 Austin
1984 Boston
1983 San Jose, California
1982 Orlando, Florida
1981 Ann Arbor
1980 Santa Monica
1979 Boston
1978 Austin
1977 Toronto
1976 Washington, DC
1975 San Jose

See also[edit]

External links[edit]

References[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SPSS_Modeler b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SPSS_Modeler new file mode 100644 index 00000000..a749d4e8 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/SPSS_Modeler @@ -0,0 +1 @@ + SPSS Modeler - Wikipedia, the free encyclopedia

SPSS Modeler

From Wikipedia, the free encyclopedia
IBM SPSS Modeler

Data mining tool
Developer(s) IBM Corp.
Stable release 15.0 (Win / Unix / Linux) / June 2012
Operating system Windows, Linux, UNIX
Type Data mining and Predictive analytics
License Proprietary software
Website http://www-01.ibm.com/software/analytics/spss/products/modeler/

IBM SPSS Modeler is a data mining software application from IBM. It is a data mining and text analytics workbench used to build predictive models. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming. SPSS Modeler has been used in these and other industries:

SPSS Modeler was originally named SPSS Clementine by SPSS Inc., and was renamed PASW Modeler by SPSS in 2009.[8] Following IBM's 2009 acquisition of SPSS Inc., the product was subsequently renamed IBM SPSS Modeler, its current name.

Contents

Editions[edit]

IBM sells the current version of SPSS Modeler (version 15) in two separate bundles of features. These two bundles are called "editions" by IBM:

  • SPSS Modeler Professional: used for structured data, such as databases, mainframe data systems, flat files or BI systems
  • SPSS Modeler Premium: Includes all the features of Modeler Professional, with the addition of:
Text analytics
Entity analytics
Social network analysis

Both editions are available in desktop and server configurations.

Architecture[edit]

SPSS Modeler has a three-tier design. Users manipulate icons and options in the front-end application on Windows operating systems. This front-end client application then communicates with the Modeler Server, or directly with a database or dataset. The most common configuration in large corporations is to house the Modeler Server software on a powerful analytical server (Windows, UNIX, Linux), which then connects to the corporate data warehouse. Data processing commands are automatically converted from the icon-based user interface into command code (which is not visible) and sent to the Modeler Server for processing. Where possible, this command code is further compiled into SQL and processed in the data warehouse.

Features[edit]

Modeling Algorithms included

Release history[edit]

  • Clementine 1.0 – June 1994 by ISL[9]
  • Clementine 5.1 – Jan 2000
  • Clementine 12.0 – Jan 2008
  • PASW Modeler 13 (formerly Clementine) – April 2009
  • IBM SPSS Modeler 14.0 – 2010
  • IBM SPSS Modeler 14.2 – 2011
  • IBM SPSS Modeler 15.0 – June 2012

Product history[edit]

Early versions of the software were called Clementine and were Unix-based, designed as a consulting tool and not for sale to customers. Originally developed by a UK company named Integral Solutions Limited (ISL),[9] the tool quickly garnered the attention of the data mining community (at that time in its infancy). Original in many respects, it was the first data mining tool to use an icon-based graphical user interface rather than requiring users to write in a programming language.

In 1998 ISL was acquired by SPSS Inc., who saw the potential for extended development as a commercial data mining tool. In early 2000 the software was developed into a client / server architecture, and shortly afterward the client front-end interface component was completely re-written and replaced with a superior Java front-end.

SPSS Clementine version 12.0
The client front-end runs under Windows. The server back-end runs under Unix variants (Sun, HP-UX, AIX), Linux, and Windows. The graphical user interface is written in Java.

IBM SPSS Modeler 14.2 was the first release of Modeler by IBM.

IBM SPSS Modeler 15, released in June 2012, introduced significant new functionality for Social Network Analysis and Entity Analytics.

Competitors[edit]

See also[edit]

References[edit]

  1. ^ Forrester Research, Inc. (2012); The Forrester Wave™: Customer Analytics Solutions, http://www.forrester.com/pimages/rws/reprints/document/80281/oid/1-KRB1C8
  2. ^ http://www-01.ibm.com/software/success/cssdb.nsf/CS/KKMH-88U29V?OpenDocument&Site=default&cty=en_us
  3. ^ http://www-01.ibm.com/software/analytics/spss/12/patient-outcomes/
  4. ^ http://www-01.ibm.com/software/success/cssdb.nsf/cs/STRD-8LJJGH?OpenDocument&Site=spss&cty=en_us
  5. ^ http://public.dhe.ibm.com/common/ssi/ecm/en/imw14303usen/IMW14303USEN.PDF
  6. ^ http://public.dhe.ibm.com/common/ssi/ecm/en/ytw03085usen/YTW03085USEN.PDF
  7. ^ Delen, Dursun (2009); Predicting Movie Box-Office Receipts Using SPSS Clementine Data Mining Software, in Nisbet, Robert; Elder, John; & Miner, Gary (2009). Handbook of Statistical Analysis and Data Mining Applications. Elsevier. pp. 391–415. ISBN 978-0-12-374765-5. 
  8. ^ Oh My Darling! SPSS Says Goodbye Clementine, Hello 'PASW' – Intelligent Enterprise
  9. ^ a b Colin Shearer (1994); Mining the data-lode, Times Higher Education, November 18, 1994.

Further reading[edit]

  • Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., et al. (2000). CRISP-DM 1.0, Chicago, IL: SPSS.
  • Nisbet, R., Elder, J., and Miner, G. (2009). Handbook of Statistical Analysis and Data Mining Applications. Burlington, MA: Academic Press (Elsevier).

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Sequence_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Sequence_mining new file mode 100644 index 00000000..224c4c7c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Sequence_mining @@ -0,0 +1 @@ + Sequence mining - Wikipedia, the free encyclopedia

Sequence mining

From Wikipedia, the free encyclopedia

Sequence mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.[1] It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequence mining is a special case of structured data mining.

There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning.

Contents

String Mining[edit]

String mining typically deals with a limited alphabet for the items that appear in a sequence, but the sequence itself is typically very long. Examples of such alphabets are the ASCII character set used in natural-language text, the nucleotide bases 'A', 'G', 'C' and 'T' in DNA sequences, or the amino acids in protein sequences. In biology applications, analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences and determine their properties. Knowing the sequence of letters of a DNA or a protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence in terms of its structure and biological function. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between strings becomes complicated when insertions, deletions and mutations occur in a string.
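
A minimal Python sketch (purely illustrative, not tied to any tool or method cited in this article) of such a comparison is the classic dynamic-programming edit distance, which counts the insertions, deletions and substitutions needed to turn one string into another:

    # Levenshtein (edit) distance between two sequences a and b.
    def edit_distance(a, b):
        m, n = len(a), len(b)
        # dp[i][j] = edit distance between a[:i] and b[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # delete all of a[:i]
        for j in range(n + 1):
            dp[0][j] = j                      # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # match or mutation
        return dp[m][n]

    print(edit_distance("GATTACA", "GACTATA"))  # 2 (two substitutions)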

A survey and taxonomy of the key algorithms for sequence comparison in bioinformatics is presented by Abouelhoda & Ghanem (2010); it covers:[2]

  • Repeat-related problems: these deal with operations on single sequences and can be based on exact or approximate string matching methods for finding dispersed fixed-length and maximal-length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
  • Alignment problems: these deal with comparison between strings by first aligning one or more sequences; examples of popular methods include BLAST for comparing a single sequence with multiple sequences in a database, and ClustalW for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global, semi-global and local alignments. See sequence alignment.

Itemset Mining[edit]

Some problems in sequence mining lend themselves to discovering frequent itemsets and the order in which they appear. For example, one may seek rules of the form "if a {customer buys a car}, he or she is likely to {buy insurance} within 1 week", or, in the context of stock prices, "if {Nokia up and Ericsson up}, it is likely that {Motorola up and Samsung up} within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".

A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).[3]

The two common techniques that are applied to sequence databases for frequent itemset mining are the influential Apriori algorithm and the more recent FP-growth technique.
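
The following minimal Python sketch illustrates the Apriori idea on a toy set of market-basket transactions (the transactions and the support threshold are invented for illustration; real implementations add candidate pruning and more efficient data structures):

    # Toy Apriori-style frequent itemset mining.
    transactions = [
        {"onions", "potatoes", "hamburger"},
        {"onions", "potatoes", "beer"},
        {"onions", "potatoes", "hamburger", "beer"},
        {"milk", "beer"},
    ]
    min_support = 2  # minimum number of transactions an itemset must appear in

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    frequent = [{frozenset([item]) for t in transactions for item in t
                 if support(frozenset([item])) >= min_support}]

    # Level k+1: join frequent k-itemsets and keep candidates with enough support.
    k = 1
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                      if len(a | b) == k + 1}
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1

    for level in frequent:
        for itemset in sorted(level, key=sorted):
            print(sorted(itemset), "support =", support(itemset))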

Variants[edit]

Traditional sequential pattern mining has been modified to incorporate additional constraints and behaviour. George and Binu (2012) integrated three significant marketing scenarios for mining promotion-oriented sequential patterns.[4] The promotion-based market scenarios considered in their research are 1) product downturn, 2) product revision and 3) product launch (DRL). Based on these, they developed a DRL-PrefixSpan algorithm (tailored from the PrefixSpan algorithm) for mining DRL patterns of all lengths.
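
For comparison, a minimal Python sketch of plain PrefixSpan-style mining is given below; it handles only sequences of single items and none of the DRL-specific constraints from the cited paper, so it should be read as an illustration of the prefix-projection idea rather than as the authors' algorithm:

    # Minimal PrefixSpan-style sequential pattern mining (illustrative only;
    # single items per sequence element, no DRL constraints).
    def prefixspan(sequences, min_support, prefix=None):
        prefix = prefix or []
        patterns = []
        # Support of each item that could extend the current prefix:
        # count each item at most once per sequence.
        counts = {}
        for seq in sequences:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, count))
            # Project each sequence: keep the suffix after the first
            # occurrence of `item`, then recurse on the projected database.
            projected = [seq[seq.index(item) + 1:] for seq in sequences
                         if item in seq and seq[seq.index(item) + 1:]]
            patterns.extend(prefixspan(projected, min_support, new_prefix))
        return patterns

    # Toy purchase sequences: each inner list is one customer's ordered purchases.
    data = [["car", "insurance", "tyres"],
            ["car", "insurance"],
            ["bike", "car", "insurance"],
            ["car", "tyres"]]
    for pattern, support in prefixspan(data, min_support=2):
        print(pattern, "support =", support)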

Application[edit]

With a great variety of products and user buying behaviors, the shelf space on which products are displayed is one of the most important resources in a retail environment. Retailers can not only increase their profit but also decrease costs through proper management of shelf-space allocation and product display. To address this problem, George and Binu (2013) proposed an approach that mines user buying patterns using the PrefixSpan algorithm and places products on shelves based on the order of the mined purchasing patterns.[5]

Algorithms[edit]

Commonly used algorithms include:

See also[edit]

References[edit]

  1. ^ Mabroukeh, N. R.; Ezeife, C. I. (2010). "A taxonomy of sequential pattern mining algorithms". ACM Computing Surveys 43: 1. doi:10.1145/1824795.1824798.
  2. ^ Abouelhoda, M.; Ghanem, M. (2010). "String Mining in Bioinformatics". In Gaber, M. M. Scientific Data Mining and Knowledge Discovery. Springer. doi:10.1007/978-3-642-02788-8_9. ISBN 978-3-642-02787-1. 
  3. ^ Han, J.; Cheng, H.; Xin, D.; Yan, X. (2007). "Frequent pattern mining: current status and future directions". Data Mining and Knowledge Discovery 15 (1): 55–86. doi:10.1007/s10618-006-0059-1. 
  4. ^ George, Aloysius; Binu, D. (2012). "DRL-PREFIXSPAN A Novel Pattern Growth Algorithm for Discovering Downturn, Revision and Launch (DRL) Sequential Patterns". Central European Journal of Computer Science 2 (4): 426–439. doi:10.2478/s13537-012-0030-8. 
  5. ^ George, A.; Binu, D. (2013). "An Approach to Products Placement in Supermarkets Using PrefixSpan Algorithm". Journal of King Saud University-Computer and Information Sciences 25 (1): 77–87. doi:10.1016/j.jksuci.2012.07.001. 
  6. ^ Ahmad, Ishtiaq; Qazi, Wajahat M.; Khurshid, Ahmed; Ahmad, Munir; Hoessli, Daniel C.; Khawaja, Iffat; Choudhary, M. Iqbal; Shakoori, Abdul R.; Nasir-ud-Din (1 May 2008). "MAPRes: Mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post-translational modifications". PROTEOMICS 8 (10): 1954–1958. doi:10.1002/pmic.200700657.

External links[edit]

Implementations
  • SPMF, a free, open-source data mining platform, written in Java, offering more than 45 algorithms for sequential pattern mining, sequential rule mining, itemset mining and association rule mining.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Society_for_Industrial_and_Applied_Mathematics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Society_for_Industrial_and_Applied_Mathematics new file mode 100644 index 00000000..bfc142ee --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Society_for_Industrial_and_Applied_Mathematics @@ -0,0 +1 @@ + Society for Industrial and Applied Mathematics - Wikipedia, the free encyclopedia

Society for Industrial and Applied Mathematics

From Wikipedia, the free encyclopedia
Society for Industrial and Applied Mathematics

SIAM logo
Formation 1951
Headquarters Philadelphia, Pennsylvania, United States
Membership >12,000
President Lloyd N. Trefethen
Website www.siam.org

The Society for Industrial and Applied Mathematics (SIAM) was founded by a small group of mathematicians from academia and industry who met in Philadelphia in 1951 to start an organization whose members would meet periodically to exchange ideas about the uses of mathematics in industry; this meeting led to the organization of SIAM. The membership of SIAM has grown from a few hundred in the early 1950s to more than 12,000 as of 2009. SIAM retains its North American influence, but it also has East Asian, Argentinian, Bulgarian, and UK & Ireland sections.

SIAM is one of the four parts of the Joint Policy Board for Mathematics.

Contents

Members [edit]

Membership is open to both individuals and organizations.

Focus [edit]

The focus for the society is applied, computational and industrial mathematics, and the society often promotes its acronym as "Science and Industry Advance with Mathematics". It is composed of a combination of people from a wide variety of vocations. Members include engineers, scientists, industrial mathematicians, and academic mathematicians. The society is active in promoting the use of analysis and modeling in all settings. The society also strives to support and provide guidance to educational institutions wishing to promote applied mathematics.

Activity groups (SIAGs) [edit]

The society includes a number of activity groups to allow for more focused group discussions and collaborations:

Journals [edit]

As of 2012, SIAM publishes 16 research journals:[1]

Books [edit]

SIAM publishes 20-25 books each year.

Conferences [edit]

SIAM organizes conferences and meetings throughout the year focused on various topics in applied math and computational science.

SIAM News [edit]

SIAM News is a newsletter focused on the applied math and computational science community and is published ten times per year.

Prizes and recognition [edit]

SIAM recognizes applied mathematicians and computational scientists for their contributions to these fields. Prizes include:[2]

  • Germund Dahlquist Prize: Awarded to a young scientist (normally under 45) for original contributions to fields associated with Germund Dahlquist (numerical solution of differential equations and numerical methods for scientific computing).[3]
  • Ralph E. Kleinman Prize: Awarded for "outstanding research, or other contributions, that bridge the gap between mathematics and applications...Each prize may be given either for a single notable achievement or for a collection of such achievements."[4]
  • J.D. Crawford Prize: Awarded to "one individual for recent outstanding work on a topic in nonlinear science, as evidenced by a publication in English in a peer-reviewed journal within the four calendar years preceding the meeting at which the prize is awarded"[5]
  • Richard C. DiPrima Prize: Awarded to "a young scientist who has done outstanding research in applied mathematics (defined as those topics covered by SIAM journals) and who has completed his/her doctoral dissertation and completed all other requirements for his/her doctorate during the period running from three years prior to the award date to one year prior to the award date".[6]
  • George Pólya Prize: "is given every two years, alternately in two categories: (1) for a notable application of combinatorial theory; (2) for a notable contribution in another area of interest to George Pólya such as approximation theory, complex analysis, number theory, orthogonal polynomials, probability theory, or mathematical discovery and learning."[7]
  • W.T. and Idalia Reid Prize: Awarded for research in and contributions to areas of differential equations and control theory.[8]
  • Theodore von Kármán Prize: Awarded for "notable application of mathematics to mechanics and/or the engineering sciences made during the five to ten years preceding the award".[9]
  • James H. Wilkinson Prize: Awarded for "research in, or other contributions to, numerical analysis and scientific computing during the six years preceding the award".[10]

SIAM Fellows [edit]

  • In 2009 SIAM instituted a Fellows program to recognize certain members who have made outstanding contributions to the fields SIAM serves.[11]

Moody's Mega Math (M3) Challenge [edit]

Funded by The Moody's Foundation and organized by SIAM, the Moody's Mega Math Challenge is an applied mathematics modeling competition for high school students along the entire East Coast, from Maine through Florida. Scholarship prizes total $100,000.

Students [edit]

  • SIAM Undergraduate Research Online publishes outstanding undergraduate research in applied and computational mathematics

  • Student memberships are generally discounted or free
  • SIAM has career and job resources for students and other applied mathematicians and computational scientists

See also [edit]

References [edit]

  1. ^ "Journals". SIAM. Retrieved 2012-12-04. 
  2. ^ "Prizes, Awards, Lectures and Fellows". SIAM. Retrieved 2012-12-04. 
  3. ^ "Germund Dahlquist Prize". SIAM. Retrieved 2012-12-04. 
  4. ^ "Ralph E. Kleinman Prize". SIAM. Retrieved 2012-12-04. 
  5. ^ "J.D. Crawford Prize (SIAG/Dynamical Systems)". SIAM. Retrieved 2012-12-04. 
  6. ^ "The Richard C. DiPrima Prize". SIAM. Retrieved 2012-12-04. 
  7. ^ "George Pólya Prize". SIAM. Retrieved 2012-12-04. 
  8. ^ "W.T. and Idalia Reid Prize in Mathematics". SIAM. Retrieved 2012-12-04. 
  9. ^ "Theodore von Kármán Prize". SIAM. Retrieved 2012-12-04. 
  10. ^ "James H. Wilkinson Prize in Numerical Analysis and Scientific Computing". SIAM. Retrieved 2012-12-04. 
  11. ^ "Fellows Program". SIAM. Retrieved 2012-12-04. 

External links [edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Software_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Software_mining new file mode 100644 index 00000000..37652080 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Software_mining @@ -0,0 +1 @@ + Software mining - Wikipedia, the free encyclopedia

Software mining

From Wikipedia, the free encyclopedia

Software mining is an application of knowledge discovery in the area of software modernization, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models against which specific queries can be made when necessary. An entity-relationship model is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery of existing code.

Contents

Software mining and data mining [edit]

Software mining is closely related to data mining, since existing software artifacts contain enormous business value that is key for the evolution of software systems. Knowledge discovery from software systems addresses structure and behavior as well as the data processed by the software system. Instead of mining individual data sets, software mining focuses on metadata, such as database schemas. The OMG Knowledge Discovery Metamodel provides an integrated representation for capturing application metadata as part of a holistic metamodel of the existing system. Another OMG specification, the Common Warehouse Metamodel, focuses entirely on mining enterprise metadata.

Text-Mining Software Tools [edit]

Text-mining software tools enable easy handling of text documents for the purpose of data analysis, including automatic model generation and document classification, document clustering, document visualization, handling of Web documents, and crawling the Web.

Levels of software mining [edit]

Knowledge discovery in software is related to the concept of reverse engineering. Software mining addresses structure and behavior as well as the data processed by the software system.

Mining software systems may happen at various levels:

  • program level (individual statements and variables)
  • design pattern level
  • call graph level (individual procedures and their relationships)
  • architectural level (subsystems and their interfaces)
  • data level (individual columns and attributes of data stores)
  • application level (key data items and their flow through the applications)
  • business level (domain concepts, business rules and their implementation in code)

Forms of representing the results of Software Mining [edit]

See also [edit]

Mining Software Repositories

References [edit]



\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Spatial_index b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Spatial_index new file mode 100644 index 00000000..187a0b2c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Spatial_index @@ -0,0 +1 @@ + Spatial database - Wikipedia, the free encyclopedia

Spatial database

From Wikipedia, the free encyclopedia
  (Redirected from Spatial index)

A spatial database is a database that is optimized to store and query data that represents objects defined in a geometric space. Most spatial databases allow representing simple geometric objects such as points, lines and polygons. Some spatial databases handle more complex structures such as 3D objects, topological coverages, linear networks, and TINs. While typical databases are designed to manage various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types efficiently. These data types are typically called geometry or feature types. The Open Geospatial Consortium created the Simple Features specification and sets standards for adding spatial functionality to database systems.[1]

Contents

Features of spatial databases[edit]

Database systems use indexes to quickly look up values and the way that most databases index data is not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations.

In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety of spatial operations. The following operations and many more are specified by the Open Geospatial Consortium standard:

  • Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
  • Spatial Functions: Modify existing features to create new ones, for example by providing a buffer around them, intersecting features, etc.
  • Spatial Predicates: Allows true/false queries about spatial relationships between geometries. Examples include "do two polygons overlap?" or "is there a residence located within a mile of the area where we are planning to build the landfill?" (see DE-9IM).
  • Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes) which define the shape.
  • Observer Functions: Queries which return specific information about a feature, such as the location of the center of a circle.

Some databases support only simplified or modified sets of these operations, especially in cases of NoSQL systems like MongoDB and CouchDB.
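
As a concrete illustration of these operations outside a database server, the following Python sketch uses the third-party Shapely library (an assumption made here; it is not part of the OGC standard itself) to construct geometries and apply measurements, functions, predicates and observer functions:

    # Minimal sketch of OGC-style spatial operations with Shapely
    # (assumed installed, e.g. via `pip install shapely`).
    from shapely.geometry import Point, Polygon

    # Geometry constructors: build geometries from vertices.
    house = Point(2.0, 3.0)
    parcel = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])

    # Spatial measurements: area, length, distance between geometries.
    print(parcel.area)                    # 16.0
    print(house.distance(Point(6, 3)))    # 4.0

    # Spatial functions: derive new geometries, e.g. a buffer around a point.
    safety_zone = house.buffer(1.5)

    # Spatial predicates: true/false relationships between geometries.
    print(parcel.contains(house))          # True
    print(safety_zone.intersects(parcel))  # True

    # Observer functions: inspect a feature, e.g. the centroid of the parcel.
    print(parcel.centroid)                # POINT (2 2)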

Spatial index[edit]

Spatial indices are used by spatial databases (databases which store information related to objects in space) to optimize spatial queries. Conventional index types do not efficiently handle spatial queries such as how far two points differ, or whether points fall within a spatial area of interest. Common spatial index methods include grid indexes, quadtrees, k-d trees, R-trees, and space-filling curves such as the Hilbert curve.
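
A minimal Python sketch of the idea, using the third-party rtree package (assumed to be installed) as an example R-tree implementation, might look as follows:

    # Bounding-box queries against an R-tree index (illustrative data).
    from rtree import index

    idx = index.Index()
    # Insert objects by id with their bounding boxes (minx, miny, maxx, maxy).
    idx.insert(1, (0.0, 0.0, 1.0, 1.0))   # object 1
    idx.insert(2, (5.0, 5.0, 6.0, 6.0))   # object 2
    idx.insert(3, (0.5, 0.5, 2.0, 2.0))   # object 3

    # Window query: which objects intersect the box (0, 0, 1.5, 1.5)?
    print(list(idx.intersection((0.0, 0.0, 1.5, 1.5))))  # [1, 3] (order may vary)

    # Nearest-neighbour query: the 2 objects closest to the point (5, 5).
    print(list(idx.nearest((5.0, 5.0, 5.0, 5.0), 2)))    # [2, 3] (order may vary)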

Spatial database systems[edit]

  • All OpenGIS Specifications compliant products[2]
  • Open source spatial databases and APIs, some of which are OpenGIS compliant[3]
  • Boeing's Spatial Query Server spatially enables Sybase ASE.
  • Smallworld VMDS, the native GE Smallworld GIS database
  • SpatiaLite extends Sqlite with spatial datatypes, functions, and utilities.
  • IBM DB2 Spatial Extender can be used to enable any edition of DB2, including the free DB2 Express-C, with support for spatial types
  • Oracle Spatial
  • Microsoft SQL Server has support for spatial types since version 2008
  • PostgreSQL DBMS (database management system) uses the spatial extension PostGIS to implement the standardized datatype geometry and corresponding functions.
  • MySQL DBMS implements the datatype geometry plus some spatial functions that have been implemented according to the OpenGIS specifications.[4] However, in MySQL version 5.5 and earlier, functions that test spatial relationships are limited to working with minimum bounding rectangles rather than the actual geometries. MySQL versions earlier than 5.0.16 only supported spatial data in MyISAM tables. As of MySQL 5.0.16, InnoDB, NDB, BDB, and ARCHIVE also support spatial features.
  • Neo4j - a graph database that can build 1D and 2D indexes as B-tree, quadtree and Hilbert curve indexes directly in the graph
  • AllegroGraph - a graph database that provides a novel mechanism for efficient storage and retrieval of two-dimensional geospatial coordinates for Resource Description Framework data. It includes an extension syntax for SPARQL queries
  • MongoDB supports geospatial indexes in 2D
  • Esri has a number of both single-user and multiuser geodatabases.
  • SpaceBase is a real-time spatial database.[5]
  • CouchDB a document based database system that can be spatially enabled by a plugin called Geocouch
  • CartoDB is a cloud based geospatial database on top of PostgreSQL with PostGIS.
  • StormDB is an upcoming cloud based database on top of PostgreSQL with geospatial capabilities.

See also[edit]

References[edit]

Further reading[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_inference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_inference new file mode 100644 index 00000000..1a60307b --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_inference @@ -0,0 +1 @@ + Statistical inference - Wikipedia, the free encyclopedia

Statistical inference

From Wikipedia, the free encyclopedia

In statistics, statistical inference is the process of drawing conclusions from data that is subject to random variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation,[2] such as observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations.

The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

Contents

Introduction[edit]

Scope[edit]

For the most part, statistical inference makes propositions about populations, using data drawn from the population of interest via some form of random sampling. More generally, data about a random process is obtained from its observed behavior during a finite period of time. Given a parameter or hypothesis about which one wishes to make inference, statistical inference most often uses:

  • a statistical model of the random process that is supposed to generate the data, which is known when randomization has been used, and
  • a particular realization of the random process; i.e., a set of data.

The conclusion of a statistical inference is a statistical proposition.[citation needed] Some common forms of statistical proposition are:

Comparison to descriptive statistics[edit]

Statistical inference is generally distinguished from descriptive statistics. In simple terms, descriptive statistics can be thought of as being just a straightforward presentation of facts, in which modeling decisions made by a data analyst have had minimal influence.

Models and assumptions[edit]

Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference.[4] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[5]

Degree of models/assumptions[edit]

Statisticians distinguish between three levels of modeling assumptions:

  • Fully parametric: The probability distributions describing the data-generation process are assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters.[4] For example, one may assume that the distribution of population values is truly Normal, with unknown mean and variance, and that datasets are generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
  • Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal.[6] For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen estimator, which has good properties when the data arise from simple random sampling.
  • Semi-parametric: This term typically implies assumptions 'in between' fully and non-parametric approaches. For example, one may assume that a population distribution has a finite mean. Furthermore, one may assume that the mean response level in the population depends in a truly linear manner on some covariate (a parametric assumption) but not make any parametric assumption describing the variance around that mean (i.e., about the presence or possible form of any heteroscedasticity). More generally, semi-parametric models can often be separated into 'structural' and 'random variation' components. One component is treated parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.

Importance of valid models/assumptions[edit]

Whatever level of assumption is made, correctly calibrated inference in general requires these assumptions to be correct; i.e., that the data-generating mechanism really has been correctly specified.

Incorrect assumptions of 'simple' random sampling can invalidate statistical inference.[7] More complex semi- and fully parametric assumptions are also cause for concern. For example, incorrectly assuming the Cox model can in some cases lead to faulty conclusions.[8] Incorrect assumptions of Normality in the population also invalidate some forms of regression-based inference.[9] The use of any parametric model is viewed skeptically by most experts in sampling human populations: "most sampling statisticians, when they deal with confidence intervals at all, limit themselves to statements about [estimators] based on very large samples, where the central limit theorem ensures that these [estimators] will have distributions that are nearly normal."[10] In particular, a normal distribution "would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any kind of economic population."[10] Here, the central limit theorem states that the distribution of the sample mean "for very large samples" is approximately normally distributed, if the distribution is not heavy-tailed.

Approximate distributions[edit]

Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these.

With finite samples, approximation results measure how close a limiting distribution approaches the statistic's sample distribution: For example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry–Esseen theorem.[11] Yet for many practical purposes, the normal approximation provides a good approximation to the sample-mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience.[11] Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler distance, Bregman divergence, and the Hellinger distance.[12][13][14]

With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution, if one exists. Limiting results are not statements about finite samples, and indeed are irrelevant to finite samples.[15][16][17] However, the asymptotic theory of limiting distributions is often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics. The magnitude of the difference between the limiting distribution and the true distribution (formally, the 'error' of the approximation) can be assessed using simulation.[18] The heuristic application of limiting results to finite samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
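
As a small illustration of assessing approximation error by simulation, the following Python sketch (NumPy assumed; the population, sample size and tail cut-off are arbitrary choices for illustration) compares an empirical tail probability of the standardized sample mean with its limiting normal value of about 0.025:

    # How close is the sample mean's finite-sample distribution to its
    # normal limit when the population is exponential and n = 10?
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 10, 100_000                  # sample size and number of replications
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    # Standardize using the known mean (1) and standard error (1 / sqrt(n)).
    z = (means - 1.0) * np.sqrt(n)

    # Empirical tail probability versus the limiting normal value of ~0.025.
    print((z > 1.96).mean())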

Randomization-based models[edit]

For a given dataset that was produced by a randomization design, the randomization distribution of a statistic (under the null-hypothesis) is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design. In frequentist inference, randomization allows inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of experiments.[19][20] Statistical inference from randomized studies is also more straightforward than many other situations.[21][22][23] In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.[24]
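
A minimal Python sketch of a randomization distribution for a two-group comparison (the data and group sizes are invented for illustration) evaluates the difference in means over every assignment the design could have produced:

    # Randomization (permutation) test for a difference in group means.
    import itertools

    treatment = [12.1, 14.3, 13.8]
    control = [10.9, 11.5, 12.0]
    observed = sum(treatment) / 3 - sum(control) / 3

    pooled = treatment + control
    count = 0
    total = 0
    # Enumerate every way the randomization could have assigned 3 of the
    # 6 units to treatment (all "plans" the design could have generated).
    for idx in itertools.combinations(range(6), 3):
        t = [pooled[i] for i in idx]
        c = [pooled[i] for i in range(6) if i not in idx]
        diff = sum(t) / 3 - sum(c) / 3
        total += 1
        if diff >= observed:
            count += 1
    print("randomization p-value:", count / total)  # 1/20 = 0.05 for these toy data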

Objective randomization allows properly inductive procedures.[25][26][27][28] Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures.[29] (However, it is true that in fields of science with developed theoretical knowledge and experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences.[30][31]) Similarly, results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of the same phenomena.[32] However, a good observational study may be better than a bad randomized experiment.

The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model.[33][34]

However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or random samples. In some cases, such randomized studies are uneconomical or unethical.

Model-based analysis of randomized experiments[edit]

It is standard practice to refer to a statistical model, often a linear model, when analyzing data from randomized experiments. However, the randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization scheme.[20] Seriously misleading results can be obtained analyzing data from randomized experiments while ignoring the experimental protocol; common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units.[35]

Modes of inference[edit]

Different schools of statistical inference have become established. These schools (or 'paradigms') are not mutually exclusive, and methods which work well under one paradigm often have attractive interpretations under other paradigms. The two main paradigms in use are frequentist and Bayesian inference, which are both summarized below.

Frequentist inference[edit]

This paradigm calibrates the production of propositions[clarification needed (complicated jargon)] by considering (notional) repeated sampling of datasets similar to the one at hand. By considering its characteristics under repeated sampling, the frequentist properties of any statistical inference procedure can be described — although in practice this quantification may be challenging.

Examples of frequentist inference[edit]

Frequentist inference, objectivity, and decision theory[edit]

One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman[36] develops these procedures in terms of pre-experiment probabilities. That is, before undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way: such a probability need not have a frequentist or repeated sampling interpretation. In contrast, Bayesian inference works in terms of conditional probabilities (i.e., probabilities conditional on the observed data), compared to the marginal (but conditioned on unknown parameters) probabilities used in the frequentist approach.

The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions.[citation needed] In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions, which play the role of (negative) utility functions. Loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality property.[37] However, loss-functions are often useful for stating optimality properties: for example, median-unbiased estimators are optimal under absolute value loss functions, in that they minimize expected loss, and least squares estimators are optimal under squared error loss functions, in that they minimize expected loss.
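
A small numerical illustration of this point (toy data and a simple grid search, purely for illustration) shows that the value minimizing average squared error loss is the sample mean, while the value minimizing average absolute error loss is the sample median:

    # The mean minimizes expected squared error loss; the median minimizes
    # expected absolute error loss (toy data).
    data = [1.0, 2.0, 2.0, 3.0, 10.0]

    def expected_loss(estimate, loss):
        return sum(loss(x - estimate) for x in data) / len(data)

    candidates = [x / 10 for x in range(0, 121)]   # grid search over [0, 12]
    best_sq = min(candidates, key=lambda c: expected_loss(c, lambda e: e * e))
    best_abs = min(candidates, key=lambda c: expected_loss(c, abs))

    print(best_sq)   # 3.6, the sample mean
    print(best_abs)  # 2.0, the sample median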

While statisticians using frequentist inference must choose for themselves the parameters of interest, and the estimators/test statistic to be used, the absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as 'objective'.[citation needed]

Bayesian inference[edit]

The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate to one, and obey probability axioms. Bayesian inference uses the available posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.

Examples of Bayesian inference[edit]

Bayesian inference, subjectivity and decision theory[edit]

Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and mode, highest posterior density intervals, and Bayes Factors can all be motivated in this way. While a user's utility function need not be stated for this sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods of prior construction which do not require external input have been proposed but not yet fully developed.)
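
For a concrete example of such posterior summaries, the following Python sketch (SciPy assumed; the prior and data are invented for illustration) uses the conjugate Beta-Binomial model for an unknown success probability:

    # Posterior summaries for 7 successes in 10 trials under a Beta(1, 1) prior.
    from scipy.stats import beta

    successes, failures = 7, 3
    posterior = beta(1 + successes, 1 + failures)   # Beta(8, 4)

    print(posterior.mean())                  # posterior mean = 8 / 12, about 0.667
    print(posterior.median())                # posterior median
    print((8 - 1) / (8 + 4 - 2))             # posterior mode of Beta(8, 4) = 0.7
    print(posterior.interval(0.95))          # central 95% credible interval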

Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function; the 'Bayes rule' is the one which maximizes expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e., those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.

Other modes of inference (besides frequentist and Bayesian)[edit]

Information and computational complexity[edit]

Other forms of statistical inference have been developed from ideas in information theory[38] and the theory of Kolmogorov complexity.[39] For example, the minimum description length (MDL) principle selects statistical models that maximally compress the data; inference proceeds without assuming counterfactual or non-falsifiable 'data-generating mechanisms' or probability models for the data, as might be done in frequentist or Bayesian approaches.

However, if a 'data generating mechanism' does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of the data, on average and asymptotically.[40] In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors). However, MDL avoids assuming that the underlying probability model is known; the MDL principle can also be applied without assumptions that e.g. the data arose from independent sampling.[40][41] The MDL principle has been applied in communication-coding theory in information theory, in linear regression, and in time-series analysis (particularly for choosing the degrees of the polynomials in Autoregressive moving average (ARMA) models).[41]
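
As a rough illustration only (this is a crude BIC-like two-part code, not Rissanen's exact MDL codes), the following Python sketch (NumPy assumed; the data are simulated) selects a polynomial degree by minimizing a description length made up of a data term and a parameter-cost term:

    # Crude MDL-style model selection for the degree of a polynomial fit.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x = np.linspace(-1, 1, n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.2, size=n)  # true degree 2

    def description_length(degree):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        sigma2 = residuals.var()
        data_bits = 0.5 * n * np.log(sigma2)          # code length of the residuals
        model_bits = 0.5 * (degree + 1) * np.log(n)   # code length of the parameters
        return data_bits + model_bits

    for d in range(0, 6):
        print(d, round(description_length(d), 2))
    # The minimum is expected at (or near) the true degree 2.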

Information-theoretic statistical inference has been popular in data mining, which has become a common approach for very large observational and heterogeneous datasets made possible by the computer revolution and internet.[39]

The evaluation of statistical inferential procedures often uses techniques or criteria from computational complexity theory or numerical analysis.[42][43]

Fiducial inference[edit]

Fiducial inference was an approach to statistical inference based on fiducial probability, also known as a "fiducial distribution". In subsequent work, this approach has been called ill-defined, extremely limited in applicability, and even fallacious.[44][45] However this argument is the same as that which shows[46] that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals, it does not necessarily invalidate conclusions drawn from fiducial arguments.

Structural inference[edit]

Developing ideas of Fisher and of Pitman from 1938 to 1939,[47] George A. Barnard developed "structural inference" or "pivotal inference",[48] an approach using invariant probabilities on group families. Barnard reformulated the arguments behind fiducial inference on a restricted class of models on which "fiducial" procedures would be well-defined and useful.

Inference topics[edit]

The topics below are usually included in the area of statistical inference.

  1. Statistical assumptions
  2. Statistical decision theory
  3. Estimation theory
  4. Statistical hypothesis testing
  5. Revising opinions in statistics
  6. Design of experiments, the analysis of variance, and regression
  7. Survey sampling
  8. Summarizing statistical data

See also[edit]

Notes[edit]

  1. ^ a b Upton, G., Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4
  2. ^ Dodge, Y. (2003) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-920613-9 (entry for "inferential statistics")
  3. ^ According to Peirce, acceptance means that inquiry on this question ceases for the time being. In science, all scientific theories are revisable
  4. ^ a b Cox (2006) page 2
  5. ^ Evans et al., Michael (2004). Probability and Statistics: The Science of Uncertainty. Freeman and Company. p. 267. 
  6. ^ van der Vaart, A.W. (1998) Asymptotic Statistics Cambridge University Press. ISBN 0-521-78450-6 (page 341)
  7. ^ Kruskal, William (December 1988). "Miracles and Statistics: The Casual Assumption of Independence (ASA Presidential address)". Journal of the American Statistical Association 83 (404): 929–940. JSTOR 2290117. 
  8. ^ Freedman, D.A. (2008) "Survival analysis: An Epidemiological hazard?". The American Statistician (2008) 62: 110-119. (Reprinted as Chapter 11 (pages 169–192) of: Freedman, D.A. (2010) Statistical Models and Causal Inferences: A Dialogue with the Social Sciences (Edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark.) Cambridge University Press. ISBN 978-0-521-12390-7)
  9. ^ Berk, R. (2003) Regression Analysis: A Constructive Critique (Advanced Quantitative Techniques in the Social Sciences) (v. 11) Sage Publications. ISBN 0-7619-2904-5
  10. ^ a b Brewer, Ken (2002). Combined Survey Sampling Inference: Weighing of Basu's Elephants. Hodder Arnold. p. 6. ISBN 0-340-69229-4, 978-0340692295 Check |isbn= value (help). 
  11. ^ a b Jörgen Hoffman-Jörgensen's Probability With a View Towards Statistics, Volume I. Page 399[full citation needed]
  12. ^ Le Cam (1986)[page needed]
  13. ^ Erik Torgerson (1991) Comparison of Statistical Experiments, volume 36 of Encyclopedia of Mathematics. Cambridge University Press.[full citation needed]
  14. ^ Liese, Friedrich and Miescke, Klaus-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer. ISBN 0-387-73193-8. 
  15. ^ Kolmogorov (1963a) (Page 369): "The frequency concept, based on the notion of limiting frequency as the number of trials increases to infinity, does not contribute anything to substantiate the applicability of the results of probability theory to real practical problems where we have always to deal with a finite number of trials". (page 369)
  16. ^ "Indeed, limit theorems 'as n tends to infinity' are logically devoid of content about what happens at any particular n. All they can do is suggest certain approaches whose performance must then be checked on the case at hand." — Le Cam (1986) (page xiv)
  17. ^ Pfanzagl (1994): "The crucial drawback of asymptotic theory: What we expect from asymptotic theory are results which hold approximately . . . . What asymptotic theory has to offer are limit theorems."(page ix) "What counts for applications are approximations, not limits." (page 188)
  18. ^ Pfanzagl (1994) : "By taking a limit theorem as being approximately true for large sample sizes, we commit an error the size of which is unknown. [. . .] Realistic information about the remaining errors may be obtained by simulations." (page ix)
  19. ^ Neyman, J.(1934) "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection", Journal of the Royal Statistical Society, 97 (4), 557–625 JSTOR 2342192
  20. ^ a b Hinkelmann and Kempthorne(2008)[page needed]
  21. ^ ASA Guidelines for a first course in statistics for non-statisticians. (available at the ASA website)
  22. ^ David A. Freedman et alia's Statistics.
  23. ^ David S. Moore and George McCabe. Introduction to the Practice of Statistics.
  24. ^ Gelman, Rubin. Bayesian Data Analysis.
  25. ^ Peirce (1877-1878)
  26. ^ Peirce (1883)
  27. ^ David Freedman et alia Statistics and David A. Freedman Statistical Models.
  28. ^ Rao, C.R. (1997) Statistics and Truth: Putting Chance to Work, World Scientific. ISBN 981-02-3111-3
  29. ^ Peirce, Freedman, Moore and McCabe.[citation needed]
  30. ^ Box, G.E.P. and Friends (2006) Improving Almost Anything: Ideas and Essays, Revised Edition, Wiley. ISBN 978-0-471-72755-2
  31. ^ Cox (2006), page 196
  32. ^ ASA Guidelines for a first course in statistics for non-statisticians. (available at the ASA website)
    • David A. Freedman et alia's Statistics.
    • David S. Moore and George McCabe. Introduction to the Practice of Statistics.
  33. ^ Neyman, Jerzy. 1923 [1990]. "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9." Statistical Science 5 (4): 465–472. Trans. Dorota M. Dabrowska and Terence P. Speed.
  34. ^ Hinkelmann & Kempthorne (2008)[page needed]
  35. ^ Hinkelmann and Kempthorne (2008) Chapter 6.
  36. ^ Neyman, J. (1937) "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability", Philosophical Transactions of the Royal Society of London A, 236, 333–380.
  37. ^ Preface to Pfanzagl.
  38. ^ Soofi (2000)
  39. ^ a b Hansen & Yu (2001)
  40. ^ a b Hansen and Yu (2001), page 747.
  41. ^ a b Rissanen (1989), page 84
  42. ^ Joseph F. Traub, G. W. Wasilkowski, and H. Wozniakowski. (1988)[page needed]
  43. ^ Judin and Nemirovski.
  44. ^ Neyman (1956)
  45. ^ Zabell (1992)
  46. ^ Cox (2006) page 66
  47. ^ Davison, page 12.[full citation needed]
  48. ^ Barnard, G.A. (1995) "Pivotal Models and the Fiducial Argument", International Statistical Review, 63 (3), 309–323. JSTOR 1403482

References[edit]

Further reading[edit]

  • Casella, G., Berger, R.L. (2001). Statistical Inference. Duxbury Press. ISBN 0-534-24312-6
  • David A. Freedman. "Statistical Models and Shoe Leather" (1991). Sociological Methodology, vol. 21, pp. 291–313.
  • David A. Freedman. Statistical Models and Causal Inferences: A Dialogue with the Social Sciences. 2010. Edited by David Collier, Jasjeet S. Sekhon, and Philip B. Stark. Cambridge University Press.
  • Kruskal, William (December 1988). "Miracles and Statistics: The Casual Assumption of Independence (ASA Presidential address)". Journal of the American Statistical Association 83 (404): 929–940. JSTOR 2290117. 
  • Lenhard, Johannes (2006). "Models and Statistical Inference: The Controversy between Fisher and Neyman—Pearson," British Journal for the Philosophy of Science, Vol. 57 Issue 1, pp. 69–91.
  • Lindley, D. (1958). "Fiducial distribution and Bayes' theorem", Journal of the Royal Statistical Society, Series B, 20, 102–7
  • Sudderth, William D. (1994). "Coherent Inference and Prediction in Statistics," in Dag Prawitz, Bryan Skyrms, and Westerstahl (eds.), Logic, Methodology and Philosophy of Science IX: Proceedings of the Ninth International Congress of Logic, Methodology and Philosophy of Science, Uppsala, Sweden, August 7–14, 1991, Amsterdam: Elsevier.
  • Trusted, Jennifer (1979). The Logic of Scientific Inference: An Introduction, London: The Macmillan Press, Ltd.
  • Young, G.A., Smith, R.L. (2005) Essentials of Statistical Inference, CUP. ISBN 0-521-83971-8

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_model b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_model new file mode 100644 index 00000000..b8462577 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistical_model @@ -0,0 +1 @@ + Statistical model - Wikipedia, the free encyclopedia

Statistical model

From Wikipedia, the free encyclopedia

A statistical model is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related. In mathematical terms, a statistical model is frequently thought of as a pair $(Y, P)$, where $Y$ is the set of possible observations and $P$ the set of possible probability distributions on $Y$. It is assumed that there is a distinct element of $P$ which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one.

Most statistical tests can be described in the form of a statistical model. For example, the Student's t-test for comparing the means of two groups can be formulated as seeing if an estimated parameter in the model is different from 0. Another similarity between tests and models is that there are assumptions involved. Error is assumed to be normally distributed in most models.[1]

Contents

Formal definition [edit]

A statistical model is a collection of probability distribution functions or probability density functions (collectively referred to as distributions for brevity). A parametric model is a collection of distributions, each of which is indexed by a unique finite-dimensional parameter: $\mathcal{P}=\{\mathbb{P}_{\theta} : \theta \in \Theta\}$, where $\theta$ is a parameter and $\Theta \subseteq \mathbb{R}^d$ is the feasible region of parameters, a subset of d-dimensional Euclidean space. A statistical model may be used to describe the set of distributions from which one assumes that a particular data set is sampled. For example, if one assumes that data arise from a univariate Gaussian distribution, then one has assumed a Gaussian model: $\mathcal{P}=\{\mathbb{P}(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2\right\} : \mu \in \mathbb{R}, \sigma > 0\}$.
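
A minimal Python sketch (NumPy assumed; the data are simulated purely for illustration) of fitting this Gaussian model by maximum likelihood, i.e. picking the member of the family that best matches an observed sample:

    # Maximum likelihood estimates of mu and sigma for a Gaussian model.
    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.normal(loc=5.0, scale=2.0, size=1_000)   # pretend this is observed

    mu_hat = data.mean()                 # MLE of mu
    sigma_hat = data.std(ddof=0)         # MLE of sigma (no Bessel correction)
    print(mu_hat, sigma_hat)             # close to the true values 5.0 and 2.0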

A non-parametric model is a set of probability distributions with infinite-dimensional parameters, and might be written as $\mathcal{P}=\{\text{all distributions}\}$. A semi-parametric model also has infinite-dimensional parameters, but is not dense in the space of distributions. For example, a mixture of Gaussians with one Gaussian at each data point is dense in the space of distributions. Formally, if $d$ is the dimension of the parameter and $n$ is the number of samples, and if $d \rightarrow \infty$ as $n \rightarrow \infty$ and $d/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semi-parametric.

Model comparison [edit]

Models can be compared to each other, either in an exploratory data analysis or in a confirmatory data analysis. In an exploratory analysis, you formulate all models you can think of and see which of them describes your data best. In a confirmatory analysis, you test which of the models that you specified before the data were collected fits the data best, or you test whether your only model fits the data. In linear regression analysis you can compare the amount of variance explained by the independent variables, $R^2$, across the different models. In general, you can compare models that are nested by using a likelihood-ratio test. Nested models are models that can be obtained by restricting a parameter in a more complex model to be zero.

An example [edit]

Height and age are probabilistically distributed over humans. They are stochastically related; knowing that a person is of age 7 influences the chance of this person being 6 feet tall. You could formalize this relationship in a linear regression model of the following form: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ indexes the subject. This means that height starts at some value (there is a minimum height when someone is born) and is predicted by age to some amount. The prediction is not perfect, as error is included in the model. This error contains variance that stems from sex and other variables. When sex is included in the model, the error term becomes smaller, because you have a better idea of the chance that a particular 16-year-old is 6 feet tall when you know this 16-year-old is a girl. The model would become $\text{height}_i = b_0 + b_1 \text{age}_i + b_2 \text{sex}_i + \varepsilon_i$, where the variable sex is dichotomous. This model would presumably have a higher $R^2$. The first model is nested in the second model: the first model is obtained from the second when $b_2$ is restricted to zero.
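
A minimal Python sketch (NumPy assumed; the data and coefficients are invented for illustration) of fitting the two nested models by least squares and comparing their $R^2$ values:

    # Nested regression models: height ~ age versus height ~ age + sex.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    age = rng.uniform(2, 18, size=n)
    sex = rng.integers(0, 2, size=n)                    # 0 = girl, 1 = boy
    height = 75 + 5.5 * age + 6.0 * sex + rng.normal(scale=5.0, size=n)

    def r_squared(X, y):
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coeffs
        return 1 - residuals.var() / y.var()

    ones = np.ones(n)
    model_1 = np.column_stack([ones, age])              # height ~ b0 + b1*age
    model_2 = np.column_stack([ones, age, sex])         # height ~ b0 + b1*age + b2*sex

    print(r_squared(model_1, height))   # smaller
    print(r_squared(model_2, height))   # larger: model 1 is nested in model 2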

Classification [edit]

According to the number of the endogenous variables and the number of equations, models can be classified as complete models (the number of equations equal to the number of endogenous variables) and incomplete models. Some other statistical models are the general linear model (restricted to continuous dependent variables), the generalized linear model (for example, logistic regression), the multilevel model, and the structural equation model.[2]

See also [edit]

References [edit]

  1. ^ Field, A. (2005). Discovering statistics using SPSS. Sage, London.
  2. ^ Adèr, H.J. (2008). Chapter 12: Modelling. In H.J. Adèr & G.J. Mellenbergh (Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A consultant's companion (pp. 271-304). Huizen, The Netherlands: Johannes van Kessel Publishing.


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistics new file mode 100644 index 00000000..b8f13a07 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Statistics @@ -0,0 +1 @@ + Statistics - Wikipedia, the free encyclopedia

Statistics

From Wikipedia, the free encyclopedia

Statistics is the study of the collection, organization, analysis, interpretation and presentation of data.[1][2] It deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.[1]

The word statistics, when referring to the scientific discipline, is singular, as in "Statistics is an art."[3] This should not be confused with the word statistic, referring to a quantity (such as mean or median) calculated from a set of data,[4] whose plural is statistics ("this statistic seems wrong" or "these statistics are misleading").

[Figure: a normal distribution, in which more probability density is found the closer one gets to the expected (mean) value. The figure shows statistics used in standardized testing assessment; the scales include standard deviations, cumulative percentages, percentile equivalents, Z-scores, T-scores, standard nines, and percentages in standard nines.]

Scope[edit]

Some consider statistics a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data,[5] while others consider it a branch of mathematics[6] concerned with collecting and interpreting data. Because of its empirical roots and its focus on applications, statistics is usually considered a distinct mathematical science rather than a branch of mathematics.[7][8] Much of statistics is non-mathematical: ensuring that data collection is undertaken in a way that produces valid conclusions; coding and archiving data so that information is retained and made useful for international comparisons of official statistics; reporting of results and summarised data (tables and graphs) in ways comprehensible to those who must use them; implementing procedures that ensure the privacy of census information.

Statisticians improve data quality by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through the use of data and statistical models. Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions.

Statistical methods can summarize or describe a collection of data. This is called descriptive statistics. This is particularly useful in communicating the results of experiments and research. In addition, data patterns may be modeled in a way that accounts for randomness and uncertainty in the observations.

These models can be used to draw inferences about the process or population under study—a practice called inferential statistics. Inference is a vital element of scientific advance, since it provides a way to draw conclusions from data that are subject to random variation. To prove the propositions being investigated further, the conclusions are tested as well, as part of the scientific method. Descriptive statistics and analysis of the new data tend to provide more information as to the truth of the proposition.

"Applied statistics" comprises descriptive statistics and the application of inferential statistics.[9][verification needed] Theoretical statistics concerns both the logical arguments underlying justification of approaches to statistical inference, as well encompassing mathematical statistics. Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments.

Statistics is closely related to probability theory, with which it is often grouped. The difference is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction—inductively inferring from samples to the parameters of a larger or total population.

History[edit]

Statistical methods date back at least to the 5th century BC. The earliest known writing on statistics appears in a 9th-century book entitled Manuscript on Deciphering Cryptographic Messages, written by Al-Kindi. In this book, Al-Kindi provides a detailed description of how to use statistics and frequency analysis to decipher encrypted messages. This was the birth of both statistics and cryptanalysis, according to the Saudi engineer Ibrahim Al-Kadi.[10][11]

The Nuova Cronica, a 14th-century history of Florence by the Florentine banker and official Giovanni Villani, includes much statistical information on population, ordinances, commerce, education, and religious facilities, and has been described as the first introduction of statistics as a positive element in history.[12]

Some scholars pinpoint the origin of statistics to 1663, with the publication of Natural and Political Observations upon the Bills of Mortality by John Graunt.[13] Early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data, hence its stat- etymology. The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general. Today, statistics is widely employed in government, business, and natural and social sciences.

Its mathematical foundations were laid in the 17th century with the development of probability theory by Blaise Pascal and Pierre de Fermat. Probability theory arose from the study of games of chance. The method of least squares was first described by Carl Friedrich Gauss around 1794. The use of modern computers has expedited large-scale statistical computation, and has also made possible new methods that are impractical to perform manually.

Overview[edit]

In applying statistics to a scientific, industrial, or societal problem, it is necessary to begin with a population or process to be studied. Populations can be diverse topics such as "all persons living in a country" or "every atom composing a crystal". A population can also be composed of observations of a process at various times, with the data from each observation serving as a different member of the overall group. Data collected about this kind of "population" constitutes what is called a time series.

For practical reasons, a chosen subset of the population called a sample is studied—as opposed to compiling data about the entire group (an operation called census). Once a sample that is representative of the population is determined, data is collected for the sample members in an observational or experimental setting. This data can then be subjected to statistical analysis, serving two related purposes: description and inference.

"... it is only the manipulation of uncertainty that interests us. We are not concerned with the matter that is uncertain. Thus we do not study the mechanism of rain; only whether it will rain."

Dennis Lindley, 2000[15]

The concept of correlation is particularly noteworthy for the potential confusion it can cause. Statistical analysis of a data set often reveals that two variables (properties) of the population under consideration tend to vary together, as if they were connected. For example, a study of annual income that also looks at age of death might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated; however, they may or may not be the cause of one another. The correlation phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. For this reason, there is no way to immediately infer the existence of a causal relationship between the two variables. (See Correlation does not imply causation.)

To use a sample as a guide to an entire population, it is important that it truly represent the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative. Statistics offers methods to estimate and correct for any random trending within the sample and data collection procedures. There are also methods of experimental design for experiments that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.

Randomness is studied using the mathematical discipline of probability theory. Probability is used in "mathematical statistics" (alternatively, "statistical theory") to study the sampling distributions of sample statistics and, more generally, the properties of statistical procedures. The use of any statistical method is valid when the system or population under consideration satisfies the assumptions of the method.

Misuse of statistics can produce subtle, but serious errors in description and interpretation—subtle in the sense that even experienced professionals make such errors, and serious in the sense that they can lead to devastating decision errors. For instance, social policy, medical practice, and the reliability of structures like bridges all rely on the proper use of statistics. See below for further discussion.

Even when statistical techniques are correctly applied, the results can be difficult to interpret for those lacking expertise. The statistical significance of a trend in the data—which measures the extent to which a trend could be caused by random variation in the sample—may or may not agree with an intuitive sense of its significance. The set of basic statistical skills (and skepticism) that people need to deal with information in their everyday lives properly is referred to as statistical literacy.

Statistical methods[edit]

Experimental and observational studies[edit]

A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables or response. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated.

Experiments[edit]

The basic steps of a statistical experiment are:

  1. Planning the research, including finding the number of replicates of the study, using the following information: preliminary estimates regarding the size of treatment effects, alternative hypotheses, and the estimated experimental variability. Consideration of the selection of experimental subjects and the ethics of research is necessary. Statisticians recommend that experiments compare (at least) one new treatment with a standard treatment or control, to allow an unbiased estimate of the difference in treatment effects.
  2. Design of experiments, using blocking to reduce the influence of confounding variables, and randomized assignment of treatments to subjects to allow unbiased estimates of treatment effects and experimental error. At this stage, the experimenters and statisticians write the experimental protocol that shall guide the performance of the experiment and that specifies the primary analysis of the experimental data.
  3. Performing the experiment following the experimental protocol and analyzing the data following the experimental protocol.
  4. Further examining the data set in secondary analyses, to suggest new hypotheses for future study.
  5. Documenting and presenting the results of the study.

Experiments on human behavior have special concerns. The famous Hawthorne study examined changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity. It turned out that productivity indeed improved (under the experimental conditions). However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and blindness. The Hawthorne effect refers to the finding that an outcome (in this case, worker productivity) changed due to observation itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed.[citation needed]

Observational study[edit]

An example of an observational study is one that explores the correlation between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a case-control study, and then look for the number of cases of lung cancer in each group.

Levels of measurement[edit]

There are four main levels of measurement used in statistics: nominal, ordinal, interval, and ratio.[16] Each of these has a different degree of usefulness in statistical research. Ratio measurements have both a meaningful zero value and the distances between different measurements defined; they provide the greatest flexibility in statistical methods that can be used for analyzing the data.[citation needed] Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit). Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values. Nominal measurements have no meaningful rank order among values.

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature.

Key terms used in statistics[edit]

Null hypothesis[edit]

Interpretation of statistical information often involves the development of a null hypothesis: the assumption that whatever is proposed as a cause has no effect on the variable being measured.

The best illustration for a novice is the predicament encountered in a jury trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative hypothesis, H1, asserts that the defendant is guilty. The indictment comes because of suspicion of guilt. The H0 (status quo) stands in opposition to H1 and is maintained unless H1 is supported by evidence "beyond a reasonable doubt". However, "failure to reject H0" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0. While one cannot "prove" a null hypothesis, one can test how close it is to being true with a power test, which tests for type II errors.
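
As a rough illustration of this logic in code (a minimal sketch, assuming NumPy and SciPy; the two samples and the 0.05 threshold are invented for the example), a two-sample t-test either rejects H0 or fails to reject it, and failing to reject is not the same as proving H0:

    # Minimal sketch: testing H0 "no difference in means" on two hypothetical samples.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    control = rng.normal(loc=10.0, scale=2.0, size=50)    # baseline group
    treatment = rng.normal(loc=10.8, scale=2.0, size=50)  # group whose true mean differs

    t_stat, p_value = stats.ttest_ind(treatment, control)
    alpha = 0.05
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    if p_value < alpha:
        print("Reject H0: the data are unlikely if there is truly no effect.")
    else:
        print("Fail to reject H0: the evidence is insufficient, not proof of H0.")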

Error[edit]

Working from a null hypothesis, two basic forms of error are recognized:

  • Type I errors, where the null hypothesis is falsely rejected, giving a "false positive".
  • Type II errors, where the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a "false negative".

Error also refers to the extent to which individual observations in a sample differ from a central value, such as the sample or population mean. Many statistical methods seek to minimize the mean-squared error, and these are called "methods of least squares."

Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of error (e.g., blunders, such as when an analyst reports incorrect units) can also be important.
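
The two error types can be made concrete by simulation (a hedged sketch, again assuming NumPy and SciPy; the sample size, effect size, and number of trials are arbitrary choices): running many experiments with H0 true estimates the Type I error rate, and running them with a real effect estimates the power, i.e. one minus the Type II error rate.

    # Hypothetical simulation of Type I and Type II error rates for a two-sample t-test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, n, trials = 0.05, 30, 2000

    def rejects_h0(effect):
        a = rng.normal(0.0, 1.0, n)          # group with true mean 0
        b = rng.normal(effect, 1.0, n)       # group with true mean "effect"
        return stats.ttest_ind(a, b).pvalue < alpha

    type_i = np.mean([rejects_h0(0.0) for _ in range(trials)])  # H0 true: false positives
    power = np.mean([rejects_h0(0.5) for _ in range(trials)])   # H0 false: true positives
    print(f"estimated Type I error rate: {type_i:.3f} (close to alpha = {alpha})")
    print(f"estimated power: {power:.3f}, so the Type II error rate is about {1 - power:.3f}")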

Interval estimation[edit]

Most studies only sample part of a population, so results don't fully represent the whole population. Any estimates obtained from the sample only approximate the population value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95% confidence intervals. Formally, a 95% confidence interval for a value is a range where, if the sampling and analysis were repeated under the same conditions (yielding a different dataset), the interval would include the true (population) value 95% of the time. This does not imply that the probability that the true value is in the confidence interval is 95%. From the frequentist perspective, such a claim does not even make sense, as the true value is not a random variable. Either the true value is or is not within the given interval. However, it is true that, before any data are sampled and given a plan for how to construct the confidence interval, the probability is 95% that the yet-to-be-calculated interval will cover the true value: at this point, the limits of the interval are yet-to-be-observed random variables. One approach that does yield an interval that can be interpreted as having a given probability of containing the true value is to use a credible interval from Bayesian statistics: this approach depends on a different way of interpreting what is meant by "probability", that is as a Bayesian probability.
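
The repeated-sampling reading of a 95% interval can be checked directly by simulation (a minimal sketch under assumed normal data; the true mean, sample size, and number of repetitions are invented for the example): roughly 95% of the intervals constructed this way cover the fixed true value.

    # Sketch: coverage of 95% t-based confidence intervals over repeated sampling.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    true_mean, sigma, n, reps = 50.0, 10.0, 25, 1000

    covered = 0
    for _ in range(reps):
        sample = rng.normal(true_mean, sigma, n)
        se = sample.std(ddof=1) / np.sqrt(n)                  # estimated standard error
        lo, hi = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=se)
        covered += (lo <= true_mean <= hi)

    print(f"coverage over {reps} repetitions: {covered / reps:.3f}")  # expect about 0.95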

Significance[edit]

Statistics rarely give a simple yes/no answer to the question asked of them. Interpretation often comes down to the level of statistical significance applied to the numbers; this is commonly assessed with the p-value, the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed.

Referring to statistical significance does not necessarily mean that the overall result is significant in real world terms. For example, in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect, such that the drug is unlikely to help the patient noticeably.

Criticisms arise because the hypothesis-testing approach forces one hypothesis (the null hypothesis) to be "favored," and it can also seem to exaggerate the importance of minor differences in large studies. A difference that is highly statistically significant can still be of no practical significance, but it is possible to formulate tests properly to account for this. (See also criticism of hypothesis testing.)

One response involves going beyond reporting only the significance level to include the p-value when reporting whether a hypothesis is rejected or accepted. The p-value, however, does not indicate the size of the effect. A better and increasingly common approach is to report confidence intervals. Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both the size of the effect and the uncertainty surrounding it.

Examples[edit]

Some well-known statistical tests and procedures are:

Specialized disciplines[edit]

Statistical techniques are used in a wide range of types of scientific and social research, including: biostatistics, computational biology, computational sociology, network biology, social science, sociology and social research. Some fields of inquiry use applied statistics so extensively that they have specialized terminology. These disciplines include:

In addition, there are particular types of statistical analysis that have also developed their own specialised terminology and methodology:

Statistics forms a key tool in business and manufacturing as well. It is used to understand the variability of measurement systems, to control processes (as in statistical process control or SPC), to summarize data, and to make data-driven decisions. In these roles, it is a key tool, and perhaps the only reliable tool.

Statistical computing[edit]

[Image: gretl, an example of an open-source statistical package]

The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest in nonlinear models (such as neural networks) as well as the creation of new types, such as generalized linear models and multilevel models.

Increased computing power has also led to the growing popularity of computationally intensive methods based on resampling, such as permutation tests and the bootstrap, while techniques such as Gibbs sampling have made the use of Bayesian models more feasible. The computer revolution has implications for the future of statistics, with a new emphasis on "experimental" and "empirical" statistics. A large number of both general- and special-purpose statistical software packages are now available.

Misuse[edit]

There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter.[17] A mistrust and misunderstanding of statistics is associated with the quotation, "There are three kinds of lies: lies, damned lies, and statistics". Misuse of statistics can be both inadvertent and intentional, and the book How to Lie With Statistics[17] outlines a range of considerations. In an attempt to shed light on the use and misuse of statistics, reviews of statistical techniques used in particular fields are conducted (e.g. Warne, Lazo, Ramos, and Ritter (2012)).[18]

Ways to avoid misuse of statistics include using proper diagrams and avoiding bias.[19] Misuse can occur when conclusions are overgeneralized and claimed to be representative of more than they really are, often by either deliberately or unconsciously overlooking sampling bias.[20] Bar graphs are arguably the easiest diagrams to use and understand, and they can be made either by hand or with simple computer programs.[19] Unfortunately, most people do not look for bias or errors, so they are not noticed. Thus, people may often believe that something is true even if it is not well represented.[20] To make data gathered from statistics believable and accurate, the sample taken must be representative of the whole.[21] According to Huff, "The dependability of a sample can be destroyed by [bias]... allow yourself some degree of skepticism."[22]

To assist in the understanding of statistics Huff proposed a series of questions to be asked in each case:[22]

  • Who says so? (Does he/she have an axe to grind?)
  • How does he/she know? (Does he/she have the resources to know the facts?)
  • What’s missing? (Does he/she give us a complete picture?)
  • Did someone change the subject? (Does he/she offer us the right answer to the wrong problem?)
  • Does it make sense? (Is his/her conclusion logical and consistent with what we already know?)

Statistics applied to mathematics or the arts[edit]

Traditionally, statistics was concerned with drawing inferences using a semi-standardized methodology that was "required learning" in most sciences. This has changed with use of statistics in non-inferential contexts. What was once considered a dry subject, taken in many fields as a degree-requirement, is now viewed enthusiastically. Initially derided by some mathematical purists, it is now considered essential methodology in certain areas.

  • In number theory, scatter plots of data generated by a distribution function may be transformed with familiar tools used in statistics to reveal underlying patterns, which may then lead to hypotheses.
  • Methods of statistics including predictive methods in forecasting are combined with chaos theory and fractal geometry to create video works that are considered to have great beauty.
  • The process art of Jackson Pollock relied on artistic experiments whereby underlying distributions in nature were artistically revealed.[citation needed] With the advent of computers, statistical methods were applied to formalize such distribution-driven natural processes to make and analyze moving video art.[citation needed]
  • Methods of statistics may be used predictively in performance art, as in a card trick based on a Markov process that only works some of the time, the occasion of which can be predicted using statistical methodology.
  • Statistics can be used to predictively create art, as in the statistical or stochastic music invented by Iannis Xenakis, where the music is performance-specific. Though this type of artistry does not always come out as expected, it does behave in ways that are predictable and tunable using statistics.

See also[edit]

References[edit]

  1. ^ a b Dodge, Y. (2006) The Oxford Dictionary of Statistical Terms, OUP. ISBN 0-19-920613-9
  2. ^ The Free Online Dictionary
  3. ^ "Statistics". Merriam-Webster Online Dictionary. 
  4. ^ "Statistic". Merriam-Webster Online Dictionary. 
  5. ^ Moses, Lincoln E. (1986) Think and Explain with Statistics, Addison-Wesley, ISBN 978-0-201-15619-5 . pp. 1–3
  6. ^ Hays, William Lee, (1973) Statistics for the Social Sciences, Holt, Rinehart and Winston, p.xii, ISBN 978-0-03-077945-9
  7. ^ Moore, David (1992). "Teaching Statistics as a Respectable Subject". In F. Gordon and S. Gordon. Statistics for the Twenty-First Century. Washington, DC: The Mathematical Association of America. pp. 14–25. ISBN 978-0-88385-078-7. 
  8. ^ Chance, Beth L.; Rossman, Allan J. (2005). "Preface". Investigating Statistical Concepts, Applications, and Methods. Duxbury Press. ISBN 978-0-495-05064-3. 
  9. ^ Anderson, D.R.; Sweeney, D.J.; Williams, T.A.. (1994) Introduction to Statistics: Concepts and Applications, pp. 5–9. West Group. ISBN 978-0-314-03309-3
  10. ^ Al-Kadi, Ibrahim A. (1992) "The origins of cryptology: The Arab contributions”, Cryptologia, 16(2) 97–126. doi:10.1080/0161-119291866801
  11. ^ Singh, Simon (2000). The code book : the science of secrecy from ancient Egypt to quantum cryptography (1st Anchor Books ed.). New York: Anchor Books. ISBN 0-385-49532-3. [page needed]
  12. ^ Villani, Giovanni. Encyclopædia Britannica. Encyclopædia Britannica 2006 Ultimate Reference Suite DVD. Retrieved on 2008-03-04.
  13. ^ Willcox, Walter (1938) "The Founder of Statistics". Review of the International Statistical Institute 5(4):321–328. JSTOR 1400906
  14. ^ Breiman, Leo (2001). "Statistical Modelling: the two cultures". Statistical Science 16 (3): 199–231. doi:10.1214/ss/1009213726. MR 1874152. CiteSeerX: 10.1.1.156.4933. 
  15. ^ Lindley, D. (2000). "The Philosophy of Statistics". Journal of the Royal Statistical Society, Series D 49 (3): 293–337. doi:10.1111/1467-9884.00238. JSTOR 2681060. 
  16. ^ Thompson, B. (2006). Foundations of behavioral statistics. New York, NY: Guilford Press.
  17. ^ a b Huff, Darrell (1954) How to Lie With Statistics, WW Norton & Company, Inc. New York, NY. ISBN 0-393-31072-8
  18. ^ Warne, R. Lazo, M., Ramos, T. and Ritter, N. (2012). Statistical Methods Used in Gifted Education Journals, 2006–2010. Gifted Child Quarterly, 56(3) 134–149. doi:10.1177/0016986212444122
  19. ^ a b Drennan, Robert D. (2008). "Statistics in archaeology". In Pearsall, Deborah M. Encyclopedia of Archaeology. Elsevier Inc. pp. 2093–2100. ISBN 978-0-12-373962-9. 
  20. ^ a b Cohen, Jerome B. (December 1938). "Misuse of Statistics". Journal of the American Statistical Association (JSTOR) 33 (204): 657–674. doi:10.1080/01621459.1938.10502344. 
  21. ^ Freund, J. F. (1988). "Modern Elementary Statistics". Credo Reference. 
  22. ^ a b Huff, Darrell; Irving Geis (1954). How to Lie with Statistics. New York: Norton. "The dependability of a sample can be destroyed by [bias]... allow yourself some degree of skepticism." 


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Structure_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Structure_mining new file mode 100644 index 00000000..68e982e9 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Structure_mining @@ -0,0 +1 @@ + Structure mining - Wikipedia, the free encyclopedia

Structure mining

From Wikipedia, the free encyclopedia

Structure mining or structured data mining is the process of finding and extracting useful information from semi-structured data sets. Graph mining is a special case of structured data mining.[citation needed]


Description[edit]

The growth of the use of semi-structured data has created new opportunities for data mining, which has traditionally been concerned with tabular data sets, reflecting the strong association between data mining and relational databases. Much of the world's interesting and mineable data does not easily fold into relational databases, though a generation of software engineers have been trained to believe this was the only way to handle data, and data mining algorithms have generally been developed only to cope with tabular data.

XML, being the most frequent way of representing semi-structured data, is able to represent both tabular data and arbitrary trees. Any particular representation of data to be exchanged between two applications in XML is normally described by a schema, often written in XSD. Practical examples of such schemata, for instance NewsML, are normally very sophisticated, containing multiple optional subtrees used for representing special-case data. Frequently around 90% of a schema is concerned with the definition of these optional data items and sub-trees.

Messages and data, therefore, that are transmitted or encoded using XML and that conform to the same Schema are liable to contain very different data depending on what is being transmitted.

Such data presents large problems for conventional data mining. Two messages that conform to the same schema may have little data in common. Building a training set from such data means that if one were to try to format it as tabular data for conventional data mining, large sections of the tables would be empty.

There is a tacit assumption made in the design of most data mining algorithms that the data presented will be complete. The other desideratum is that the actual mining algorithms employed, whether supervised or unsupervised, must be able to handle sparse data. That is, machine learning algorithms perform badly with incomplete data sets where only part of the information is supplied. For instance, methods based on neural networks[citation needed] or Ross Quinlan's ID3 algorithm[citation needed] are highly accurate with good and representative samples of the problem, but perform badly with biased data. In most cases, a more careful and unbiased representation of the input and output is enough. A particularly relevant area where finding the appropriate structure and model is the key issue is text mining.

XPath is the standard mechanism used to refer to nodes and data items within XML. It has similarities to standard techniques for navigating directory hierarchies used in operating systems' user interfaces. To mine the data and structure of XML in any form, at least two extensions to conventional data mining are required: the ability to associate an XPath statement with any data pattern and sub-statements with each data node in the data pattern, and the ability to mine the presence and count of any node or set of nodes within the document.

As an example, if one were to represent a family tree in XML, using these extensions one could create a data set containing all the individuals in the tree, data items such as name and age at death, and counts of related nodes, such as number of children. More sophisticated searches could extract data such as grandparents' lifespans etc.
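
A small sketch of this family-tree idea using Python's standard library (the XML layout, element names, and attributes here are hypothetical, and ElementTree supports only a limited subset of XPath): each person record is paired with a path expression, its data items, and a count of related nodes.

    # Hypothetical family tree: extract per-person data items and child counts.
    import xml.etree.ElementTree as ET

    xml_doc = """
    <family>
      <person name="Ada" ageAtDeath="79">
        <person name="Ben" ageAtDeath="85">
          <person name="Cara"/>
          <person name="Dan"/>
        </person>
      </person>
    </family>
    """

    root = ET.fromstring(xml_doc)
    for person in root.iter("person"):
        children = person.findall("./person")   # presence/count of related sub-nodes
        print(person.get("name"),
              "age at death:", person.get("ageAtDeath"),
              "children:", len(children))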

The addition of these data types related to the structure of a document or message facilitates structure mining.

See also[edit]

References[edit]

External links[edit]


\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Support_vector_machines b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Support_vector_machines new file mode 100644 index 00000000..11c1a9de --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Support_vector_machines @@ -0,0 +1 @@ + Support vector machine - Wikipedia, the free encyclopedia

Support vector machine

From Wikipedia, the free encyclopedia
  (Redirected from Support vector machines)

In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
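
As a hedged, concrete illustration (scikit-learn and the toy dataset are assumptions, not part of the article): on concentric circles no linear separator works well, while an RBF-kernel SVM separates the classes by implicitly working in a higher-dimensional feature space.

    # Sketch: linear vs. RBF-kernel SVM on data that no straight line can separate.
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

    print("linear kernel training accuracy:", linear_svm.score(X, y))  # near chance level
    print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))     # close to 1.0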


Formal definition[edit]

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function K(x,y) selected to suit the problem.[2] The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters \alpha_i of images of feature vectors that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation: \textstyle\sum_i \alpha_i K(x_i,x) = \mathrm{constant}. Note that if K(x,y) becomes small as y grows further away from x, each element in the sum measures the degree of closeness of the test point x to the corresponding data base point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note the fact that the set of points x mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space.

History[edit]

The original SVM algorithm was invented by Vladimir N. Vapnik and the current standard incarnation (soft margin) was proposed by Vapnik and Corinna Cortes in 1995.[1]

Motivation[edit]

[Figure: H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximum margin.]

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier; or equivalently, the perceptron of optimal stability.
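
A minimal sketch of the maximum-margin idea (assuming scikit-learn and an artificial, linearly separable dataset; note that scikit-learn's decision function uses w·x + b rather than the w·x − b convention used below): a hard-margin-like fit exposes the normal vector w, the support vectors, and the margin width 2/||w|| derived in the next section.

    # Sketch: linear maximum-margin classifier on separable toy data.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=6)

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]        # decision function here is w.x + b

    print("normal vector w:", w, " offset:", b)
    print("margin width 2/||w||:", 2.0 / np.linalg.norm(w))
    print("number of support vectors:", len(clf.support_vectors_))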

Linear SVM[edit]

Given some training data \mathcal{D}, a set of n points of the form

\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n

where y_i is either 1 or −1, indicating the class to which the point \mathbf{x}_i belongs. Each  \mathbf{x}_i is a p-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having y_i=1 from those having y_i=-1. Any hyperplane can be written as the set of points \mathbf{x} satisfying

[Figure: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.]
\mathbf{w}\cdot\mathbf{x} - b=0,\,

where \cdot denotes the dot product and {\mathbf{w}} the normal vector to the hyperplane. The parameter \tfrac{b}{\|\mathbf{w}\|} determines the offset of the hyperplane from the origin along the normal vector {\mathbf{w}}.

If the training data are linearly separable, we can select two hyperplanes in a way that they separate the data and there are no points between them, and then try to maximize their distance. The region bounded by them is called "the margin". These hyperplanes can be described by the equations

\mathbf{w}\cdot\mathbf{x} - b=1\,

and

\mathbf{w}\cdot\mathbf{x} - b=-1.\,

By using geometry, we find the distance between these two hyperplanes is \tfrac{2}{\|\mathbf{w}\|}, so we want to minimize \|\mathbf{w}\|. As we also have to prevent data points from falling into the margin, we add the following constraint: for each i either

\mathbf{w}\cdot\mathbf{x}_i - b \ge 1\qquad\text{ for }\mathbf{x}_i of the first class

or

\mathbf{w}\cdot\mathbf{x}_i - b \le -1\qquad\text{ for }\mathbf{x}_i of the second.

This can be rewritten as:

y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1, \quad \text{ for all } 1 \le i \le n.\qquad\qquad(1)

We can put this together to get the optimization problem:

Minimize (in {\mathbf{w},b})

\|\mathbf{w}\|

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1. \,

Primal form[edit]

The optimization problem presented in the preceding section is difficult to solve because it depends on ||w||, the norm of w, which involves a square root. Fortunately it is possible to alter the equation by substituting ||w|| with \tfrac{1}{2}\|\mathbf{w}\|^2 (the factor of 1/2 being used for mathematical convenience) without changing the solution (the minimum of the original and the modified equation have the same w and b). This is a quadratic programming optimization problem. More clearly:

Minimize (in {\mathbf{w},b})

\frac{1}{2}\|\mathbf{w}\|^2

subject to (for any i = 1, \dots, n)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1.

By introducing Lagrange multipliers \boldsymbol{\alpha}, the previous constrained problem can be expressed as

\min_{\mathbf{w},b } \max_{\boldsymbol{\alpha}\geq 0 } \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b)-1]} \right\}

that is, we look for a saddle point. In doing so, all the points which can be separated as y_i(\mathbf{w}\cdot\mathbf{x_i} - b) - 1 > 0 do not matter since we must set the corresponding \alpha_i to zero.

This problem can now be solved by standard quadratic programming techniques and programs. The "stationary" Karush–Kuhn–Tucker condition implies that the solution can be expressed as a linear combination of the training vectors

\mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}.

Only a few \alpha_i will be greater than zero. The corresponding \mathbf{x_i} are exactly the support vectors, which lie on the margin and satisfy y_i(\mathbf{w}\cdot\mathbf{x_i} - b) = 1. From this one can derive that the support vectors also satisfy

\mathbf{w}\cdot\mathbf{x_i} - b = 1 / y_i = y_i \iff b = \mathbf{w}\cdot\mathbf{x_i} - y_i

which allows one to define the offset b. In practice, it is more robust to average over all N_{SV} support vectors:

b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}}{(\mathbf{w}\cdot\mathbf{x_i} - y_i)}

Dual form[edit]

Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane and therefore the classification task is only a function of the support vectors, the subset of the training data that lie on the margin.

Using the fact that \|\mathbf{w}\|^2 = w\cdot w and substituting \mathbf{w} = \sum_{i=1}^n{\alpha_i y_i\mathbf{x_i}}, one can show that the dual of the SVM reduces to the following optimization problem:

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

\alpha_i \geq 0,\,

and to the constraint from the minimization in  b

 \sum_{i=1}^n \alpha_i y_i = 0.

Here the kernel is defined by k(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i\cdot\mathbf{x}_j.

\mathbf{w} can be computed from the \alpha terms:

\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.

Biased and unbiased hyperplanes[edit]

For simplicity reasons, sometimes it is required that the hyperplane pass through the origin of the coordinate system. Such hyperplanes are called unbiased, whereas general hyperplanes not necessarily passing through the origin are called biased. An unbiased hyperplane can be enforced by setting b = 0 in the primal optimization problem. The corresponding dual is identical to the dual given above without the equality constraint

\sum_{i=1}^n \alpha_i y_i = 0

Soft margin[edit]

In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples.[1] If there exists no hyperplane that can split the "yes" and "no" examples, the Soft Margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables, \xi_i, which measure the degree of misclassification of the data x_i

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i \quad 1 \le i \le n. \quad\quad(2)

The objective function is then increased by a function which penalizes non-zero \xi_i, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:

\min_{\mathbf{w},\mathbf{\xi}, b } \left\{\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \right\}

subject to (for any  i=1,\dots n)

 y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1 - \xi_i, ~~~~\xi_i \ge 0

This constraint in (2) along with the objective of minimizing \|\mathbf{w}\| can be solved using Lagrange multipliers as done above. One has then to solve the following problem:

\min_{\mathbf{w},\mathbf{\xi}, b } \max_{\boldsymbol{\alpha},\boldsymbol{\beta} } \left \{ \frac{1}{2}\|\mathbf{w}\|^2 +C \sum_{i=1}^n \xi_i - \sum_{i=1}^{n}{\alpha_i[y_i(\mathbf{w}\cdot \mathbf{x_i} - b) -1 + \xi_i]} - \sum_{i=1}^{n} \beta_i \xi_i \right \}

with  \alpha_i, \beta_i \ge 0.

Dual form[edit]

Maximize (in \alpha_i )

\tilde{L}(\mathbf{\alpha})=\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i, j} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)

subject to (for any i = 1, \dots, n)

0 \leq \alpha_i \leq C,\,

and

 \sum_{i=1}^n \alpha_i y_i = 0.

The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers. For the above formulation and its huge impact in practice, Cortes and Vapnik received the 2008 ACM Paris Kanellakis Award.[3] Nonlinear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken the problem becomes non-convex, and thus it is considerably more difficult to find a global solution.

Nonlinear classification[edit]

[Figure: Kernel machine]

The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.[4]) to maximum-margin hyperplanes.[5] The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high dimensional; thus though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions. Maximum margin classifiers are well regularized, so the infinite dimensions do not spoil the results. Some common kernels include:

The kernel is related to the transform \varphi(\mathbf{x_i}) by the equation k(\mathbf{x_i}, \mathbf{x_j}) = \varphi(\mathbf{x_i})\cdot \varphi(\mathbf{x_j}). The value w is also in the transformed space, with \textstyle\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i). Dot products with w for classification can again be computed by the kernel trick, i.e. \textstyle \mathbf{w}\cdot\varphi(\mathbf{x}) = \sum_i \alpha_i y_i k(\mathbf{x}_i, \mathbf{x}). However, there does not in general exist a value w' such that \mathbf{w}\cdot\varphi(\mathbf{x}) = k(\mathbf{w'}, \mathbf{x}).
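
As a hedged numeric check of the identity k(\mathbf{x_i}, \mathbf{x_j}) = \varphi(\mathbf{x_i})\cdot \varphi(\mathbf{x_j}) (NumPy and the specific degree-2 polynomial kernel are assumptions chosen for illustration): in two dimensions the homogeneous quadratic kernel k(x, y) = (x\cdot y)^2 corresponds to the explicit feature map \varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2).

    # Verify that the degree-2 polynomial kernel equals a dot product in feature space.
    import numpy as np

    def phi(x):                                  # explicit feature map for 2-D inputs
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k(x, y):                                 # kernel evaluated in the original space
        return float(np.dot(x, y)) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])
    print("k(x, y)       =", k(x, y))                        # 16.0
    print("phi(x).phi(y) =", float(np.dot(phi(x), phi(y))))  # 16.0 as well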

Properties[edit]

SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.[6]

Parameter selection[edit]

The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C.

A common choice is a Gaussian kernel, which has a single parameter γ. The best combination of C and γ is often selected by a grid search with exponentially growing sequences of C and γ, for example, C \in \{ 2^{-5}, 2^{-3}, \dots, 2^{13},2^{15} \}; \gamma \in \{ 2^{-15},2^{-13}, \dots, 2^{1},2^{3} \}. Typically, each combination of parameter choices is checked using cross validation, and the parameters with best cross-validation accuracy are picked. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.[7]
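
A minimal sketch of this grid search (assuming scikit-learn; the dataset is an arbitrary stand-in, and the grids follow the exponential ranges quoted above):

    # Sketch: cross-validated grid search over C and gamma for an RBF-kernel SVM.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {
        "C": [2.0**k for k in range(-5, 16, 2)],      # 2^-5, 2^-3, ..., 2^15
        "gamma": [2.0**k for k in range(-15, 4, 2)],  # 2^-15, 2^-13, ..., 2^3
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

    print("best (C, gamma):", search.best_params_)
    print("best cross-validation accuracy:", round(search.best_score_, 3))
    # search.best_estimator_ has already been refit on the whole set with these values.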

Issues[edit]

Potential drawbacks of the SVM are the following three aspects:

  • Uncalibrated class membership probabilities
  • The SVM is only directly applicable for two-class tasks. Therefore, algorithms that reduce the multi-class task to several binary problems have to be applied; see the multi-class SVM section.
  • Parameters of a solved model are difficult to interpret.

Extensions[edit]

Multiclass SVM[edit]

Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements.

The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8] [9]

  • Building binary classifiers which distinguish between (i) one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
  • Directed Acyclic Graph SVM (DAGSVM)[10]
  • error-correcting output codes[11]

Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14]
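
A hedged sketch of the two reduction strategies described above (scikit-learn's wrappers and the digits dataset are assumptions chosen for illustration): one-versus-rest trains one binary SVM per class, while one-versus-one trains one per pair of classes.

    # Sketch: multiclass classification via one-vs-rest and one-vs-one reductions.
    from sklearn.datasets import load_digits
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)            # 10 classes

    ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.001)).fit(X, y)
    ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma=0.001)).fit(X, y)

    print("one-vs-rest binary classifiers:", len(ovr.estimators_))  # one per class (10)
    print("one-vs-one binary classifiers: ", len(ovo.estimators_))  # one per pair (45)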

Transductive support vector machines[edit]

Transductive support vector machines extend SVMs in that they can also treat partially labeled data in semi-supervised learning by following the principles of transduction. Here, in addition to the training set \mathcal{D}, the learner is also given a set

\mathcal{D}^\star = \{ \mathbf{x}^\star_i | \mathbf{x}^\star_i \in \mathbb{R}^p\}_{i=1}^k \,

of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:[15]

Minimize (in {\mathbf{w}, b, \mathbf{y^\star}})

\frac{1}{2}\|\mathbf{w}\|^2

subject to (for any i = 1, \dots, n and any j = 1, \dots, k)

y_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1,\,
y^\star_j(\mathbf{w}\cdot\mathbf{x^\star_j} - b) \ge 1,

and

y^\star_j \in \{-1, 1\}.\,

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.

Structured SVM[edit]

SVMs have been generalized to structured SVMs, where the label space is structured and of possibly infinite size.

Regression[edit]

A version of SVM for regression was proposed in 1996 by Vladimir N. Vapnik, Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola.[16] This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction (within a threshold \epsilon). Another SVM version known as least squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle.[17]
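
A minimal SVR sketch (scikit-learn and the noisy sine data are assumptions; epsilon is chosen arbitrarily): training points that fall inside the epsilon tube around the prediction are ignored by the cost function, so only the remaining points become support vectors.

    # Sketch: epsilon-insensitive support vector regression on noisy 1-D data.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(3)
    X = np.sort(rng.uniform(0.0, 5.0, 80)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 80)

    svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)
    print("training points:", len(X))
    print("support vectors (points outside the epsilon tube):", len(svr.support_))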

Implementation[edit]

The parameters of the maximum-margin hyperplane are derived by solving the optimization problem. There exist several specialized algorithms for quickly solving the QP problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks.

A common method is Platt's Sequential Minimal Optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that may be solved analytically, eliminating the need for a numerical optimization algorithm.

Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems.[18] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used in the kernel trick.

Applications[edit]

SVM can be used to solve various real world problems:

  • SVM is helpful in text and hypertext categorization as its application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
  • Classification of images can also be performed using SVM. Experimental results show that SVM achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
  • SVMs are also useful in medical science, for example to classify proteins, with up to 90% of the compounds classified correctly.
  • Hand-written characters can be recognized using SVM.

See also[edit]

References[edit]

  1. ^ a b c Cortes, Corinna; and Vapnik, Vladimir N.; "Support-Vector Networks", Machine Learning, 20, 1995. http://www.springerlink.com/content/k238jx04hm87j80g/
  2. ^ *Press, William H.; Teukolsky, Saul A.; Vetterling, William T.; Flannery, B. P. (2007). "Section 16.5. Support Vector Machines". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8. 
  3. ^ ACM Website, Press release of March 17th 2009. http://www.acm.org/press-room/news-releases/awards-08-groupa
  4. ^ Aizerman, Mark A.; Braverman, Emmanuel M.; and Rozonoer, Lev I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control 25: 821–837. 
  5. ^ Boser, Bernhard E.; Guyon, Isabelle M.; and Vapnik, Vladimir N.; A training algorithm for optimal margin classifiers. In Haussler, David (editor); 5th Annual ACM Workshop on COLT, pages 144–152, Pittsburgh, PA, 1992. ACM Press
  6. ^ Meyer, David; Leisch, Friedrich; and Hornik, Kurt; The support vector machine under test, Neurocomputing 55(1–2): 169–186, 2003 http://dx.doi.org/10.1016/S0925-2312(03)00431-4
  7. ^ Hsu, Chih-Wei; Chang, Chih-Chung; and Lin, Chih-Jen (2003). A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, National Taiwan University. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
  8. ^ a b Duan, Kai-Bo; and Keerthi, S. Sathiya (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Proceedings of the Sixth International Workshop on Multiple Classifier Systems. Lecture Notes in Computer Science 3541: 278. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7. 
  9. ^ Hsu, Chih-Wei; and Lin, Chih-Jen (2002). "A Comparison of Methods for Multiclass Support Vector Machines". IEEE Transactions on Neural Networks. 
  10. ^ Platt, John; Cristianini, N.; and Shawe-Taylor, J. (2000). "Large margin DAGs for multiclass classification". In Solla, Sara A.; Leen, Todd K.; and Müller, Klaus-Robert; eds. Advances in Neural Information Processing Systems. MIT Press. pp. 547–553. 
  11. ^ Dietterich, Thomas G.; and Bakiri, Ghulum (1995). "Solving Multiclass Learning Problems via Error-Correcting Output Codes". Journal of Artificial Intelligence Research 2: 263–286. arXiv:cs/9501101. Bibcode:1995cs........1101D.
  12. ^ Crammer, Koby; and Singer, Yoram (2001). "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines". J. of Machine Learning Research 2: 265–292. 
  13. ^ Lee, Y.; Lin, Y.; and Wahba, G. (2001). "Multicategory Support Vector Machines". Computing Science and Statistics 33. 
  14. ^ Lee, Y.; Lin, Y.; and Wahba, G. (2004). "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data". Journal of the American Statistical Association 99 (465): 67–81. doi:10.1198/016214504000000098. 
  15. ^ Joachims, Thorsten; "Transductive Inference for Text Classification using Support Vector Machines", Proceedings of the 1999 International Conference on Machine Learning (ICML 1999), pp. 200-209.
  16. ^ Drucker, Harris; Burges, Christopher J. C.; Kaufman, Linda; Smola, Alexander J.; and Vapnik, Vladimir N. (1997); "Support Vector Regression Machines", in Advances in Neural Information Processing Systems 9, NIPS 1996, 155–161, MIT Press.
  17. ^ Suykens, Johan A. K.; Vandewalle, Joos P. L.; Least squares support vector machine classifiers, Neural Processing Letters, vol. 9, no. 3, Jun. 1999, pp. 293–300.
  18. ^ Ferris, Michael C.; and Munson, Todd S. (2002). "Interior-point methods for massive support vector machines". SIAM Journal on Optimization 13 (3): 783–804. doi:10.1137/S1052623400374379. 

External links[edit]

  • Burges, Christopher J. C.; A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2:121–167, 1998
  • www.kernel-machines.org (general information and collection of research papers)
  • Teknomo, K. SVM tutorial using spreadsheet Visual Introduction to SVM.
  • www.support-vector-machines.org (Literature, Review, Software, Links related to Support Vector Machines — Academic Site)
  • videolectures.net (SVM-related video lectures)
  • Animation clip: SVM with polynomial kernel visualization
  • Fletcher, Tristan; A very basic SVM tutorial for complete beginners
  • Karatzoglou, Alexandros et al.; Support Vector Machines in R, Journal of Statistical Software April 2006, Volume 15, Issue 9.
  • Shogun (toolbox) contains about 20 different implementations of SVMs, written in C++ with MATLAB, Octave, Python, R, Java, Lua, Ruby and C# interfaces
  • libsvm libsvm is a library of SVMs which is actively patched
  • liblinear liblinear is a library for large linear classification including some SVMs
  • flssvm flssvm is a least squares svm implementation written in fortran
  • Shark Shark is a C++ machine learning library implementing various types of SVMs
  • dlib dlib is a C++ library for working with kernel methods and SVMs
  • SVM light is a collection of software tools for learning and classification using SVM.
  • SVMJS live demo is a GUI demo for Javascript implementation of SVMs
  • Stanford University Andrew Ng Video on SVM
  • Byvatov, E.; and Schneider, G.; Support vector machine applications in bioinformatics, Applied Bioinformatics, 2003; 2(2): 67–77
  • Tong, Simon; and Chang, Edward; Support vector machine active learning for image retrieval, Proceedings of the ninth ACM international conference on Multimedia (MULTIMEDIA '01), pp. 107–118
  • Tong, Simon; and Koller, Daphne; Support Vector Machine Active Learning with Applications to Text Classification, Journal of Machine Learning Research (2001)

Bibliography[edit]

  • Theodoridis, Sergios; and Koutroumbas, Konstantinos; "Pattern Recognition", 4th Edition, Academic Press, 2009, ISBN 978-1-59749-272-0
  • Cristianini, Nello; and Shawe-Taylor, John; An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000. ISBN 0-521-78019-5 ([1] SVM Book)
  • Huang, Te-Ming; Kecman, Vojislav; and Kopriva, Ivica (2006); Kernel Based Algorithms for Mining Huge Data Sets, in Supervised, Semi-supervised, and Unsupervised Learning, Springer-Verlag, Berlin, Heidelberg, 260 pp. 96 illus., Hardcover, ISBN 3-540-31681-7 [2]
  • Kecman, Vojislav; Learning and Soft Computing — Support Vector Machines, Neural Networks, Fuzzy Logic Systems, The MIT Press, Cambridge, MA, 2001.[3]
  • Schölkopf, Bernhard; and Smola, Alexander J.; Learning with Kernels, MIT Press, Cambridge, MA, 2002. ISBN 0-262-19475-9
  • Schölkopf, Bernhard; Burges, Christopher J. C.; and Smola, Alexander J. (editors); Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999. ISBN 0-262-19416-3. [4]
  • Shawe-Taylor, John; and Cristianini, Nello; Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. ISBN 0-521-81397-2 ([5] Kernel Methods Book)
  • Steinwart, Ingo; and Christmann, Andreas; Support Vector Machines, Springer-Verlag, New York, 2008. ISBN 978-0-387-77241-7 ([6] SVM Book)
  • Tan, Peter Jing; and Dowe, David L. (2004); MML Inference of Oblique Decision Trees, Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer-Verlag, pp. 1082–1088. (This paper uses minimum message length (MML) and actually incorporates probabilistic support vector machines in the leaves of decision trees.)
  • Vapnik, Vladimir N.; The Nature of Statistical Learning Theory, Springer-Verlag, 1995. ISBN 0-387-98780-0
  • Vapnik, Vladimir N.; and Kotz, Samuel; Estimation of Dependences Based on Empirical Data, Springer, 2006. ISBN 0-387-30865-2, 510 pages. [This is a reprint of Vapnik's early book describing the philosophy behind the SVM approach; the 2006 appendix describes recent developments.]
  • Fradkin, Dmitriy; and Muchnik, Ilya; Support Vector Machines for Classification in Abello, J.; and Carmode, G. (Eds); Discrete Methods in Epidemiology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 70, pp. 13–20, 2006. [7]. Succinctly describes theoretical ideas behind SVM.
  • Bennett, Kristin P.; and Campbell, Colin; Support Vector Machines: Hype or Hallelujah?, SIGKDD Explorations, 2, 2, 2000, 1–13. [8]. Excellent introduction to SVMs with helpful figures.
  • Ivanciuc, Ovidiu; Applications of Support Vector Machines in Chemistry, in Reviews in Computational Chemistry, Volume 23, 2007, pp. 291–400. Reprint available: [9]
  • Catanzaro, Bryan; Sundaram, Narayanan; and Keutzer, Kurt; Fast Support Vector Machine Training and Classification on Graphics Processors, in International Conference on Machine Learning, 2008 [10]
  • Campbell, Colin; and Ying, Yiming; Learning with Support Vector Machines, 2011, Morgan and Claypool. ISBN 978-1-60845-616-1. [11]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Text_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Text_mining new file mode 100644 index 00000000..f109213a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Text_mining @@ -0,0 +1 @@ + Text mining - Wikipedia, the free encyclopedia

Text mining

From Wikipedia, the free encyclopedia

Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
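
As an informal illustration of the structure/derive/evaluate loop just described, the following Python sketch structures a few toy documents into tokens, derives a simple co-occurrence pattern, and keeps only the results above a frequency threshold. The documents and the threshold are invented for this example and are not taken from the article.

    # Minimal sketch of the text mining loop described above: structure the input
    # text, derive simple patterns, and evaluate/filter the output.
    # The toy documents and the threshold of 2 are illustrative assumptions.
    import re
    from collections import Counter
    from itertools import combinations

    documents = [
        "Text mining derives high-quality information from text.",
        "Patterns and trends are derived through statistical pattern learning.",
        "Text mining structures input text and evaluates the derived patterns.",
    ]

    def tokenize(text):
        """Structuring step: lowercase the text and split it into word tokens."""
        return re.findall(r"[a-z]+", text.lower())

    # Pattern derivation: which word pairs co-occur within documents?
    pair_counts = Counter()
    for doc in documents:
        tokens = set(tokenize(doc))
        pair_counts.update(combinations(sorted(tokens), 2))

    # Evaluation/interpretation: keep only pairs that appear in at least two documents.
    frequent_pairs = [pair for pair, n in pair_counts.items() if n >= 2]
    print(frequent_pairs)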

Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.

Contents

Text mining and text analytics[edit]

The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[1] The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of "text mining"[2] in 2004 to describe "text analytics."[3] The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

History[edit]

Labor-intensive manual text mining approaches first surfaced in the mid-1980s,[6] but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%)[5] is currently stored as text, text mining is believed to have a high commercial potential value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades.[7] It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:[8]

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.

Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.

Text analysis processes[edit]

Subtasks — components of a larger text-analytics effort — typically include:

  • Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content management system, for analysis.
  • Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part of speech tagging, syntactic parsing, and other types of linguistic analysis.[citation needed]
  • Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation — the use of contextual clues — may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star (Glenn or Harrison?), a river crossing, or some other entity.
  • Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expressions or other pattern matches (a minimal sketch follows this list).
  • Coreference: identification of noun phrases and other terms that refer to the same object.
  • Relationship, fact, and event extraction: identification of associations among entities and other information in text.
  • Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.[9]
  • Quantitative text analysis is a set of techniques stemming from the social sciences in which either a human judge or a computer extracts semantic or grammatical relationships between words in order to uncover the meaning or stylistic patterns of, usually, casual personal texts, for purposes such as psychological profiling.[10]
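
As a concrete illustration of the pattern-identified-entities subtask referenced above, the following Python sketch extracts e-mail addresses and phone-number-like strings from raw text with regular expressions. The patterns are deliberately simplified stand-ins, not the expressions used by any particular system.

    # Simplified sketch of recognizing pattern-identified entities (e-mail addresses
    # and phone-number-like strings) via regular expressions. The patterns below are
    # illustrative and far less robust than production-grade ones.
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

    def extract_pattern_entities(text):
        return {
            "emails": EMAIL_RE.findall(text),
            "phones": PHONE_RE.findall(text),
        }

    sample = "Contact sales@example.org or call +1 (555) 123-4567 for details."
    print(extract_pattern_entities(sample))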

Applications[edit]

The technology is now broadly applied for a wide variety of government, research, and business needs. Applications can be sorted into a number of categories by analysis type or by business function. Using this approach to classifying solutions, application categories include:

Security applications[edit]

Many text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes.[11] It is also involved in the study of text encryption/decryption.

Biomedical applications[edit]

A range of text mining applications in the biomedical literature has been described.[12]

One online text mining application in the biomedical literature is GoPubMed.[13] GoPubmed was the first semantic search engine on the Web.[citation needed] Another example is PubGene that combines biomedical text mining with network visualization as an Internet service.[14][15] TPX is a concept-assisted search and navigation tool for biomedical literature analyses[16] - it runs on PubMed/PMC and can be configured, on request, to run on local literature repositories too.

Software applications[edit]

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.[17]

Online media applications[edit]

Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit from being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications[edit]

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008)[18][19] apply it to improve predictive analytics models for customer churn (customer attrition).[18]

Sentiment analysis[edit]

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.[20] Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for affectivity of words and concepts have been made for WordNet[21] and ConceptNet,[22] respectively.
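
A minimal, lexicon-based sketch of the kind of review scoring described above: count positive and negative words and report the difference. The tiny word lists are invented placeholders for affect resources such as the WordNet- and ConceptNet-based lexicons cited in the paragraph.

    # Lexicon-based sentiment scoring sketch for review text. The word lists are
    # toy stand-ins for real affect resources and are for illustration only.
    import re

    POSITIVE = {"good", "great", "excellent", "favorable", "enjoyable"}
    NEGATIVE = {"bad", "poor", "terrible", "boring", "weak"}

    def sentiment_score(review):
        tokens = re.findall(r"[a-z]+", review.lower())
        return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

    # Two positive words outweigh the single negative one, so the score is +1.
    print(sentiment_score("A great cast and an excellent script, only slightly boring."))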

Text has been used to detect emotions in the related area of affective computing.[23] Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories, and news stories.

Academic applications[edit]

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

Further, private initiatives also offer tools for academic text mining:

  • Newsanalytics.net provides researchers with a free scalable solution for keyword-based text analysis. The initiative's research apps were developed to support news analytics, but are equally useful for regular text analysis applications.

Software and applications[edit]

Text mining computer programs are available from many commercial and open source companies and sources.

Commercial[edit]

  • AeroText – a suite of text mining applications for content analysis. Content used can be in multiple languages.
  • Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded Lexalytics Salience Engine. The software provides the unique capability of merging the output of unstructured, text-based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis.
  • Attensity – hosted, integrated and stand-alone text mining (analytics) software that uses natural language processing technology to address collective intelligence in social media and forums; the voice of the customer in surveys and emails; customer relationship management; e-services; research and e-discovery; risk and compliance; and intelligence analysis.
  • Autonomy – text mining, clustering and categorization software
  • Basis Technology – provides a suite of text analysis modules to identify language, enable search in more than 20 languages, extract entities, and efficiently search for and translate entities.
  • Clarabridge – text analytics (text mining) software, including natural language (NLP), machine learning, clustering and categorization. Provides SaaS, hosted and on-premise text and sentiment analytics that enables companies to collect, listen to, analyze, and act on the Voice of the Customer (VOC) from both external (Twitter, Facebook, Yelp!, product forums, etc.) and internal sources (call center notes, CRM, Enterprise Data Warehouse, BI, surveys, emails, etc.).
  • Endeca Technologies – provides software to analyze and cluster unstructured text.
  • Expert System S.p.A. – suite of semantic technologies and products for developers and knowledge managers.
  • Fair Isaac – leading provider of decision management solutions powered by advanced analytics (includes text analytics).
  • General Sentiment - Social Intelligence platform that uses natural language processing to discover affinities between the fans of brands and the fans of traditional television shows in social media. Stand-alone text analytics captures a social knowledge base on billions of topics dating back to 2004.
  • IBM LanguageWare - the IBM suite for text analytics (tools and Runtime).
  • IBM SPSS - provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting.
  • Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects that was bought by SAP AG in 2008).
  • LanguageWare – text analysis libraries and customization software from IBM.
  • Language Computer Corporation – text extraction and analysis tools, available in multiple languages.
  • Lexalytics - provider of a text analytics engine used in Social Media Monitoring, Voice of Customer, Survey Analysis, and other applications.
  • LexisNexis – provider of business intelligence solutions based on an extensive news and company information content set. LexisNexis acquired DataOps to pursue search
  • Mathematica – provides built in tools for text alignment, pattern matching, clustering and semantic analysis.
  • Medallia - offers one system of record for survey, social, text, written and online feedback.
  • Omniviz from Instem Scientific - Data mining and visual analytics tool.[27]
  • SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management.
  • Smartlogic – Semaphore; Content Intelligence platform containing commercial text analytics, natural language processing, rule-based classification, ontology/taxonomy modelling and information visualization software used for Information Management.
  • StatSoft – provides STATISTICA Text Miner as an optional extension to STATISTICA Data Miner, for Predictive Analytics Solutions.
  • Sysomos - provider of a social media analytics software platform, including text analytics and sentiment analysis of online consumer conversations.
  • WordStat - Content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data.
  • Xpresso - an engine developed by Abzooba's core technology group, focused on the automated distillation of expressions in social media conversations.[28]
  • Thomson Data Analyzer – enables complex analysis on patent information, scientific publications and news.

Open source[edit]

  • QueryTermAnalyzer - query term weight analyzer
  • Carrot2 – text and search results clustering framework.
  • GATE – General Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering
  • OpenNLP - natural language processing
  • Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
  • RapidMiner with its Text Processing Extension – data and text mining software.
  • Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
  • The programming language R provides a framework for text mining applications in the package tm
  • The KNIME Text Processing extension.
  • KH Coder - For content analysis, text mining or corpus linguistics.
  • The PLOS Text Mining Collection[29]

Implications[edit]

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word).

Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social networks analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.

Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
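
The sketch below illustrates that idea in a naive-Bayes-like fashion: estimate from labeled examples how characteristic each word is of spam versus legitimate mail, then score new messages by summed log-likelihood ratios. The toy messages and the scoring details are assumptions made for illustration, not a description of any specific filter.

    # Sketch of text-mining-based spam scoring: words that are more frequent in the
    # spam examples than in the legitimate ones push the score up. Toy data only.
    import math
    from collections import Counter

    spam = ["win money now", "cheap money offer"]
    ham = ["meeting notes attached", "lunch at noon"]

    def word_counts(messages):
        counts = Counter()
        for message in messages:
            counts.update(message.split())
        return counts

    spam_counts, ham_counts = word_counts(spam), word_counts(ham)
    vocabulary = set(spam_counts) | set(ham_counts)

    def spam_score(message):
        # Sum of per-word log-likelihood ratios with add-one smoothing.
        score = 0.0
        for word in message.split():
            p_spam = (spam_counts[word] + 1) / (sum(spam_counts.values()) + len(vocabulary))
            p_ham = (ham_counts[word] + 1) / (sum(ham_counts.values()) + len(vocabulary))
            score += math.log(p_spam / p_ham)
        return score

    print(spam_score("cheap money offer"), spam_score("meeting at noon"))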

See also[edit]

Notes[edit]

  1. ^ Defining Text Analytics[dead link]
  2. ^ KDD-2000 Workshop on Text Mining
  3. ^ Text Analytics: Theory and Practice[dead link]
  4. ^ Hobbs, Jerry R.; Walker, Donald E.; Amsler, Robert A. (1982). "Natural language access to structured text". Proceedings of the 9th conference on Computational linguistics 1. pp. 127–32. doi:10.3115/991813.991833. 
  5. ^ a b Unstructured Data and the 80 Percent Rule[dead link]
  6. ^ Content Analysis of Verbatim Explanations
  7. ^ http://www.b-eye-network.com/view/6311[full citation needed]
  8. ^ Hearst, Marti A. (1999). "Untangling text data mining". Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. pp. 3–10. doi:10.3115/1034678.1034679. ISBN 1-55860-609-2. 
  9. ^ http://www.clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=722[dead link]
  10. ^ Mehl, Matthias R. (2006). "Quantitative Text Analysis.". Handbook of multimethod measurement in psychology. p. 141. doi:10.1037/11383-011. ISBN 1-59147-318-7. 
  11. ^ Zanasi, Alessandro (2009). "Virtual Weapons for Real Wars: Text Mining for National Security". Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS'08. Advances in Soft Computing 53. p. 53. doi:10.1007/978-3-540-88181-0_7. ISBN 978-3-540-88180-3. 
  12. ^ Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946. 
  13. ^ Doms, A.; Schroeder, M. (2005). "GoPubMed: Exploring PubMed with the Gene Ontology". Nucleic Acids Research 33 (Web Server issue): W783–6. doi:10.1093/nar/gki470. PMC 1160231. PMID 15980585. 
  14. ^ Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind (2001). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics 28 (1): 21–8. doi:10.1038/ng0501-21. PMID 11326270. 
  15. ^ Masys, Daniel R. (2001). "Linking microarray data to the literature". Nature Genetics 28 (1): 9–10. doi:10.1038/ng0501-9. PMID 11326264. 
  16. ^ Joseph, Thomas; Saipradeep, Vangala G; Venkat Raghavan, Ganesh Sekar; Srinivasan, Rajgopal; Rao, Aditya; Kotte, Sujatha; Sivadasan, Naveen (2012). "TPX: Biomedical literature search made easy". Bioinformation 8 (12): 578–80. doi:10.6026/97320630008578. PMC 3398782. PMID 22829734. 
  17. ^ Texor
  18. ^ a b Coussement, Kristof; Van Den Poel, Dirk (2008). "Integrating the voice of customers through call center emails into a decision support system for churn prediction". Information & Management 45 (3): 164–74. doi:10.1016/j.im.2008.01.005. 
  19. ^ Coussement, Kristof; Van Den Poel, Dirk (2008). "Improving customer complaint management by automatic email classification using linguistic style features as predictors". Decision Support Systems 44 (4): 870–82. doi:10.1016/j.dss.2007.10.010. 
  20. ^ Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). "Thumbs up?". Proceedings of the ACL-02 conference on Empirical methods in natural language processing 10. pp. 79–86. doi:10.3115/1118693.1118704. 
  21. ^ Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005). "Developing Affective Lexical Resources". Psychology Journal 2 (1): 61–83. 
  22. ^ Erik Cambria; Robert Speer, Catherine Havasi and Amir Hussain (2010). "SenticNet: a Publicly Available Semantic Resource for Opinion Mining". Proceedings of AAAI CSK. pp. 14–18. 
  23. ^ Calvo, Rafael A; d'Mello, Sidney (2010). "Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications". IEEE Transactions on Affective Computing 1 (1): 18–37. doi:10.1109/T-AFFC.2010.1. 
  24. ^ The University of Manchester
  25. ^ Tsujii Laboratory
  26. ^ The University of Tokyo
  27. ^ Yang, Yunyun; Akers, Lucy; Klose, Thomas; Barcelon Yang, Cynthia (2008). "Text mining and visualization tools – Impressions of emerging capabilities". World Patent Information 30 (4): 280. doi:10.1016/j.wpi.2008.01.007. 
  28. ^ http://www.abzooba.com/product.html
  29. ^ "Table of Contents: Text Mining". PLOS. 

References[edit]

  • Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine. Artech House Books. ISBN 978-1-58053-984-5
  • Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons. ISBN 978-0-470-17643-6
  • Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge University Press. ISBN 978-0-521-83657-9
  • Indurkhya, N., and Damerau, F. (2010). Handbook Of Natural Language Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1
  • Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN 1-84628-175-X
  • Konchady, M. Text Mining Application Programming (Programming Series). Charles River Media. ISBN 1-58450-460-9
  • Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9
  • Miner, G., Elder, J., Hill. T, Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN 978-0-12-386979-1
  • McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". DM Review, 21-22.
  • Srivastava, A., and Sahami. M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3

External links[edit]

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Uncertain_data b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Uncertain_data new file mode 100644 index 00000000..f43322bc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Uncertain_data @@ -0,0 +1 @@ + Uncertain data - Wikipedia, the free encyclopedia

Uncertain data

From Wikipedia, the free encyclopedia

In computer science, uncertain data is the notion of data whose values are known only with some quantifiable uncertainty. Uncertain data is typically found in the area of sensor networks. When representing such data in a database, some indication of the probability of the various values must also be stored.

There are three main models of uncertain data in databases. In attribute uncertainty, each uncertain attribute in a tuple is subject to its own independent probability distribution.[1] For example, if readings are taken of temperature and wind speed, each would be described by its own probability distribution, as knowing the reading for one measurement would not provide any information about the other.

In correlated uncertainty, multiple attributes may be described by a joint probability distribution.[1] For example, if readings are taken of the position of an object, and the x- and y-coordinates stored, the probability of different values may depend on the distance from the recorded coordinates. As distance depends on both coordinates, it may be appropriate to use a joint distribution for these coordinates, as they are not independent.

In tuple uncertainty, all the attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty, but also includes the case where there is a probability of the tuple not belonging in the relevant relation at all, which is indicated by the probabilities not summing to one.[1] For example, assume we have the following tuple from a probabilistic database:

(a, 0.4) | (b, 0.5)

Then the tuple has a 10% chance of not existing in the database, since the listed probabilities sum to only 0.9.
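
A small sketch of this example, assuming (purely for illustration) that an uncertain attribute is stored as a mapping from possible values to their probabilities; whatever probability mass is not assigned to any value is the probability that the tuple is absent from the relation.

    # The (a, 0.4) | (b, 0.5) tuple from the text: 0.4 + 0.5 = 0.9, so the
    # remaining 0.1 is the probability that the tuple does not exist at all.
    uncertain_tuple = {"a": 0.4, "b": 0.5}

    def probability_of_absence(value_distribution):
        total = sum(value_distribution.values())
        assert 0.0 <= total <= 1.0, "probabilities must not exceed 1"
        return 1.0 - total

    print(round(probability_of_absence(uncertain_tuple), 2))  # 0.1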

References[edit]

  1. ^ a b c Prabhakar, Sunil. ORION: Managing Uncertain (Sensor) Data. 
  • "Error-Aware Density-Based Clustering of Imprecise Measurement Values". Seventh IEEE International Conference on Data Mining Workshops, 2007. ICDM Workshops 2007. IEEE.  Unknown parameter |later= ignored (help);
  • "Clustering Uncertain Data With Possible Worlds". Proceedings of the 1st Workshop on Management and mining Of UNcertain Data in conjunction with the 25th International Conference on Data Engineering, 2009. IEEE.  Unknown parameter |later= ignored (help);

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ward_s_method b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ward_s_method new file mode 100644 index 00000000..04987b78 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Ward_s_method @@ -0,0 +1 @@ + Bad title - Wikipedia, the free encyclopedia

Bad title


Return to Main Page.

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Web_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Web_mining new file mode 100644 index 00000000..b084ebbe --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_html/Web_mining @@ -0,0 +1 @@ + Web mining - Wikipedia, the free encyclopedia

Web mining

From Wikipedia, the free encyclopedia

Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.

Contents

Web usage mining [edit]

Web usage mining is the process of extracting useful information from server logs, i.e. users' browsing history; it is the process of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:

  • Web Server Data: user logs collected by the Web server; typical data includes the IP address, page reference, and access time (a minimal log-parsing sketch follows this list).
  • Application Server Data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.
  • Application Level Data: new kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these specially defined events. Many end applications, however, require a combination of one or more of the techniques applied in the categories above.
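
The sketch below illustrates the Web Server Data case from the list above: it parses Common Log Format lines and counts how often each client IP requested each page. The sample log lines and the regular expression are simplified assumptions for illustration.

    # Parse Common Log Format entries and count requests per (client IP, page).
    # Only GET requests are matched; the log lines are made up.
    import re
    from collections import Counter

    LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+) [^"]*"')

    log_lines = [
        '192.0.2.1 - - [05/Jun/2013:10:00:01 +0200] "GET /index.html HTTP/1.1" 200 512',
        '192.0.2.1 - - [05/Jun/2013:10:00:05 +0200] "GET /products.html HTTP/1.1" 200 1024',
        '198.51.100.7 - - [05/Jun/2013:10:01:00 +0200] "GET /index.html HTTP/1.1" 200 512',
    ]

    requests = Counter()
    for line in log_lines:
        match = LOG_RE.match(line)
        if match:
            ip, timestamp, page = match.groups()
            requests[(ip, page)] += 1

    print(requests.most_common())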

Web structure mining [edit]

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location (see the sketch below).

2. Mining the document structure: analysis of the tree-like structure of individual pages to describe HTML or XML tag usage.
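
As an illustration of the first kind of structure mining listed above, the sketch below extracts the outgoing hyperlinks of a page with Python's standard html.parser module; the HTML snippet stands in for a crawled page, and a full system would feed these links into a web graph as edges.

    # Extract hyperlink targets (the outgoing edges of a page in the web graph)
    # using only the standard library. The HTML snippet is illustrative.
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    page = '<html><body><a href="/about">About</a> <a href="http://example.org">Ext</a></body></html>'
    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)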

Web content mining [edit]

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery, organization, and search difficult. Search and indexing tools for the Internet and the World Wide Web, such as Lycos, AltaVista, WebCrawler, ALIWEB, MetaCrawler, and others, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

Web content mining is differentiated from two points of view:[1] the information retrieval view and the database view. R. Kosala et al.[2] summarized the research done on unstructured and semi-structured data from the information retrieval view. It shows that most of this research uses a bag-of-words model, which is based on statistics about single words in isolation, to represent unstructured text, taking the single words found in the training corpus as features. For semi-structured data, all the works utilize the HTML structures inside the documents, and some also utilize the hyperlink structure between documents, for document representation. As for the database view, in order to achieve better information management and querying on the web, the mining tries to infer the structure of a web site so as to transform it into a database.

There are several ways to represent documents; the vector space model is typically used. The documents constitute the whole vector space. If a term t occurs n(D, t) times in document D, the t-th coordinate of D is simply n(D, t); the coordinates are often normalized, for example by the maximum term count max_t n(D, t). This raw term-frequency representation does not capture how important a word is to a document. To resolve this, tf-idf (term frequency times inverse document frequency) weighting is introduced.
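
A minimal sketch of the vector space model with tf-idf weighting as just described; the toy corpus and the unsmoothed idf formula are simplifying assumptions.

    # tf-idf weighting: the weight of term t in document D is n(D, t) times the
    # log of (number of documents / number of documents containing t).
    import math
    from collections import Counter

    corpus = [
        "web mining applies data mining to the web",
        "text mining derives information from text",
        "structure mining analyzes the link graph of the web",
    ]
    tokenized = [doc.split() for doc in corpus]
    N = len(tokenized)

    def tf_idf(term, tokens):
        tf = Counter(tokens)[term]                  # n(D, t)
        df = sum(term in doc for doc in tokenized)  # documents containing the term
        idf = math.log(N / df) if df else 0.0
        return tf * idf

    # "mining" occurs in every document, so its idf (and weight) is zero;
    # "text" is concentrated in one document and gets a high weight there.
    for term in ("mining", "text"):
        print(term, [round(tf_idf(term, doc), 3) for doc in tokenized])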

By scanning the documents multiple times, feature selection can be implemented: a feature subset is extracted under the condition that the classification result is barely affected. The general approach is to construct an evaluating function that scores the features; information gain, cross entropy, mutual information, and odds ratio are commonly used. The classifier and pattern analysis methods of text data mining are very similar to traditional data mining techniques. The usual evaluation measures are classification accuracy, precision, recall, and information score.
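
A small sketch of feature evaluation by information gain, one of the evaluating functions named above; the labeled toy documents are invented for illustration.

    # Information gain of a term: entropy of the class labels minus the expected
    # entropy after splitting the documents on whether they contain the term.
    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(term, docs):
        labels = [label for _, label in docs]
        with_term = [label for tokens, label in docs if term in tokens]
        without_term = [label for tokens, label in docs if term not in tokens]
        remainder = sum(
            (len(part) / len(docs)) * entropy(part)
            for part in (with_term, without_term) if part
        )
        return entropy(labels) - remainder

    docs = [
        ({"goal", "match", "team"}, "sports"),
        ({"election", "vote"}, "politics"),
        ({"team", "league"}, "sports"),
        ({"vote", "party"}, "politics"),
    ]
    # "team" separates the two classes perfectly, so it scores higher than "match".
    print(information_gain("team", docs), information_gain("match", docs))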

Web mining in foreign languages [edit]

The character encoding of Chinese text is considerably more complicated than that of English. The GB, Big5, and HZ encodings are all common for Chinese words in web documents. Before text mining, one needs to identify the encoding standard of the HTML documents and convert it to an internal encoding, and then use other data mining techniques to find useful knowledge and patterns.
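
A minimal sketch of the encoding-identification step, assuming a simple try-in-order strategy over a few common encodings; the candidate list and its ordering are illustrative, and real systems typically use statistical encoding detection instead.

    # Try the common encodings until the raw bytes of a fetched document decode
    # cleanly, then continue working with the resulting Unicode text.
    CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5", "hz")

    def decode_document(raw_bytes):
        for encoding in CANDIDATE_ENCODINGS:
            try:
                return raw_bytes.decode(encoding), encoding
            except UnicodeDecodeError:
                continue
        raise ValueError("could not identify the encoding")

    text, used = decode_document("数据挖掘".encode("gb18030"))
    print(used, text)  # gb18030 数据挖掘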

Web Usage mining Pros and Cons [edit]

Pros [edit]

Web usage mining has many advantages that make the technology attractive to corporations and government agencies. It has enabled e-commerce companies to do personalized marketing, which eventually results in higher trade volumes. Government agencies use the technology to classify threats and fight terrorism, and its predictive capability can benefit society by identifying criminal activities. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer needs better and react to them faster. They can find, attract, and retain customers, save on production costs by using the acquired insight into customer requirements, and increase profitability through target pricing based on the profiles created. They can even identify customers who might defect to a competitor and try to retain them with promotional offers, reducing the risk of losing those customers.

Cons [edit]

Web usage mining by itself does not create issues, but when this technology is used on data of a personal nature it might cause concerns. The most criticized ethical issue involving web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent.[3] The obtained data will be analyzed and clustered to form profiles; the data will be made anonymous before clustering so that no personal profiles arise.[3] Thus these applications de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of their own individual characteristics and merits.[3]

Another important concern is that the companies collecting the data for a specific purpose might use the data for a totally different purpose, and this essentially violates the user’s interests.

The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their site. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies that buy the data are obliged to make it anonymous, and these companies are considered authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release will result in serious lawsuits, but there is no law preventing them from trading the data.

Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals. These practices might violate anti-discrimination legislation.[4] The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the use of such algorithms with such attributes. This process could result in the denial of a service or a privilege to an individual based on race, religion, or sexual orientation; at present this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the data and the obtained patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy; in reality, much additional information can be inferred by combining two separate pieces of data about the user.

Resources [edit]

External links [edit]

Books [edit]

  • Jesus Mena, "Data Mining Your Website", Digital Press, 1999
  • Soumen Chakrabarti, "Mining the Web: Analysis of Hypertext and Semi Structured Data", Morgan Kaufmann, 2002
  • Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data", Springer, 2007
  • Advances in Web Mining and Web Usage Analysis 2005 - revised papers from the 7th Workshop on Knowledge Discovery on the Web, Olfa Nasraoui, Osmar Zaiane, Myra Spiliopoulou, Bamshad Mobasher, Philip Yu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, LNAI 4198, 2006
  • Web Mining and Web Usage Analysis 2004 - revised papers from the 6th Workshop on Knowledge Discovery on the Web, Bamshad Mobasher, Olfa Nasraoui, Bing Liu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, 2006
  • Mike Thelwall, "Link Analysis: An Information Science Approach", 2004, Academic Press

[5]

Bibliographic references [edit]

  • Baraglia, R. Silvestri, F. (2007) "Dynamic personalization of web sites without user intervention", In Communication of the ACM 50(2): 63-67
  • Cooley, R., Mobasher, B. and Srivastava, J. (1997) "Web Mining: Information and Pattern Discovery on the World Wide Web", in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence
  • Cooley, R., Mobasher, B. and Srivastava, J. “Data Preparation for Mining World Wide Web Browsing Patterns”, Journal of Knowledge and Information System, Vol.1, Issue. 1, pp. 5–32, 1999
  • Kohavi, R., Mason, L. and Zheng, Z. (2004) “Lessons and Challenges from Mining Retail E-commerce Data” Machine Learning, Vol 57, pp. 83–113
  • Lillian Clark, I-Hsien Ting, Chris Kimble, Peter Wright, Daniel Kudenko (2006)"Combining ethnographic and clickstream data to identify user Web browsing strategies" Journal of Information Research, Vol. 11 No. 2, January 2006
  • Eirinaki, M., Vazirgiannis, M. (2003) "Web Mining for Web Personalization", ACM Transactions on Internet Technology, Vol.3, No.1, February 2003
  • Mobasher, B., Cooley, R. and Srivastava, J. (2000) “Automatic Personalization based on web usage Mining” Communications of the ACM, Vol. 43, No.8, pp. 142–151
  • Mobasher, B., Dai, H., Kuo, T. and Nakagawa, M. (2001) “Effective Personalization Based on Association Rule Discover from Web Usage Data” In Proceedings of WIDM 2001, Atlanta, GA, USA, pp. 9–15
  • Nasraoui O., Petenes C., "Combining Web Usage Mining and Fuzzy Inference for Website Personalization", in Proc. of WebKDD 2003 – KDD Workshop on Web mining as a Premise to Effective and Intelligent Web Applications, Washington DC, August 2003, p. 37
  • Nasraoui O., Frigui H., Joshi A., and Krishnapuram R., “Mining Web Access Logs Using Relational Competitive Fuzzy Clustering”, Proceedings of the Eighth International Fuzzy Systems Association Congress, Hsinchu, Taiwan, August 1999
  • Nasraoui O., “World Wide Web Personalization,” Invited chapter in “Encyclopedia of Data Mining and Data Warehousing”, J. Wang, Ed, Idea Group, 2005
  • Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos C. D. (2003) “Web usage mining as a tool for personalization: a survey”, User modelling and user adapted interaction journal, Vol.13, Issue 4, pp. 311–372
  • I-Hsien Ting, Chris Kimble, Daniel Kudenko (2005)"A Pattern Restore Method for Restoring Missing Patterns in Server Side Clickstream Data"
  • I-Hsien Ting, Chris Kimble, Daniel Kudenko (2006)"UBB Mining: Finding Unexpected Browsing Behaviour in Clickstream Data to improve a Web Site’s Design"

References [edit]

  1. ^ Wang, Yan. "Web Mining and Knowledge Discovery of Usage Patterns". 
  2. ^ Kosala, Raymond; Hendrik Blockeel (July 2000). "Web Mining Research: A Survey". SIGKDD Explorations 2 (1). 
  3. ^ a b c Lita van Wel and Lambèr Royakkers (2004). "Ethical issues in web data mining".
  4. ^ Kirsten Wahlstrom, John F. Roddick, Vladimir Estivill-Castro, Denise de Vries (2007). "Legal and Technical Issues of Privacy Preservation in Data Mining".
  5. ^ Data Mining By Korth

\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Academic_journal b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Academic_journal new file mode 100644 index 00000000..b66d03bd --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Academic_journal @@ -0,0 +1 @@ +academic journal wikipedia the free encyclopedia academic journal from wikipedia the free encyclopedia jump to navigation search an academic journal is a peer reviewed periodical in which scholarship relating to a particular academic discipline is published academic journals serve as forums for the introduction and presentation for scrutiny of new research and the critique of existing research 1 content typically takes the form of articles presenting original research review articles and book reviews the term academic journal applies to scholarly publications in all fields this article discusses the aspects common to all academic field journals scientific journals and journals of the quantitative social sciences vary in form and function from journals of the humanities and qualitative social sciences their specific aspects are separately discussed contents 1 scholarly articles 2 reviewing 2 1 review articles 2 2 book reviews 3 prestige 3 1 ranking 4 publishing 5 new developments 6 see also 7 references 8 further reading 9 external links scholarly articles edit there are two kinds of article or paper submissions in academia solicited where an individual has been invited to submit work either through direct contact or through a general submissions call and unsolicited where an individual submits a work for potential publication without directly being asked to do so 2 upon receipt of a submitted article editors at the journal determine whether to reject the submission outright or begin the process of peer review in the latter case the submission becomes subject to review by outside scholars of the editor s choosing who typically remain anonymous the number of these peer reviewers or referees varies according to each journal s editorial practice typically no fewer than two though sometimes three or more experts in the subject matter of the article produce reports upon the content style and other factors which inform the editors publication decisions though these reports are generally confidential some journals and publishers also practice public peer review the editors either choose to reject the article ask for a revision and resubmission or accept the article for publication even accepted articles are often subjected to further sometimes considerable editing by journal editorial staff before they appear in print the peer review can take from several weeks to several months 3 reviewing edit review articles edit main article review article review articles also called reviews of progress are checks on the research published in journals some journals are devoted entirely to review articles others contain a few in each issue but most do not publish review articles such reviews often cover the research from the preceding year some for longer or shorter terms some are devoted to specific topics some to general surveys some journals are enumerative listing all significant articles in a given subject others are selective including only what they think worthwhile yet others are evaluative judging the state of progress in the subject field some journals are published in series each covering a complete subject field year or covering specific fields through 
several years unlike original research articles review articles tend to be solicited submissions sometimes planned years in advance they are typically relied upon by students beginning a study in a given field or for current awareness of those already in the field 4 book reviews edit book reviews of scholarly books are checks upon the research books published by scholars unlike articles book reviews tend to be solicited journals typically have a separate book review editor determining which new books to review and by whom if an outside scholar accepts the book review editor s request for a book review he or she generally receives a free copy of the book from the journal in exchange for a timely review publishers send books to book review editors in the hope that their books will be reviewed the length and depth of research book reviews varies much from journal to journal as does the extent of textbook and trade book review 5 prestige edit different types of peer reviewed research journals these specific publications are about economics an academic journal s prestige is established over time and can reflect many factors some but not all of which are expressible quantitatively in each academic discipline there are dominant journals that receive the largest number of submissions and therefore can be selective in choosing their content yet not only the largest journals are of excellent quality 6 ranking edit in the natural sciences and in the hard social sciences the impact factor is a convenient proxy measuring the number of later articles citing articles already published in the journal there are other possible quantitative factors such as the overall number of citations how quickly articles are cited and the average half life of articles i e when they are no longer cited there also is the question of whether or not any quantitative factor can reflect true prestige natural science journals are categorized and ranked in the science citation index social science journals in the social sciences citation index 6 in the anglo american humanities there is no tradition as there is in the sciences of giving impact factors that could be used in establishing a journal s prestige recent moves have been made by the european science foundation to rectify the situation resulting in the publication of preliminary lists for the ranking of academic journals in the humanities 6 in some disciplines such as knowledge management intellectual capital the lack of a well established journal ranking system is perceived as a major obstacle on the way to tenure promotion and achievement recognition 7 the categorization of journal prestige in some subjects has been attempted typically using letters to rank their academic world importance we can distinguish three categories of techniques to assess journal quality and develop journal rankings 8 stated preference revealed preference and publication power approaches 9 publishing edit many academic journals are subsidized by universities or professional organizations and do not exist to make a profit however they often accept advertising page and image charges from authors to pay for production costs on the other hand some journals are produced by commercial publishers who do make a profit by charging subscriptions to individuals and libraries they may also sell all of their journals in discipline specific collections or a variety of other packages 10 journal editors tend to have other professional responsibilities most often as teaching professors in the case of the 
largest journals there are paid staff assisting in the editing the production of the journals is almost always done by publisher paid staff humanities and social science academic journals are usually subsidized by universities or professional organization 11 new developments edit the internet has revolutionized the production of and access to academic journals with their contents available online via services subscribed to by academic libraries individual articles are subject indexed in databases such as google scholar some of the smallest most specialized journals are prepared in house by an academic department and published only online such form of publication has sometimes been in the blog format currently there is a movement in higher education encouraging open access either via self archiving whereby the author deposits a paper in a repository where it can be searched for and read or via publishing it in a free open access journal which does not charge for subscriptions being either subsidized or financed with author page charges however to date open access has affected science journals more than humanities journals commercial publishers are now experimenting with open access models but are trying to protect their subscription revenues 12 see also edit academic authorship academic library academic publishing academic writing healthcare journal journal citation reports list of academic databases and search engines list of academic journals scientific journal journal ranking arxiv imrad references edit gary blake and robert w bly the elements of technical writing pg 113 new york macmillan publishers 1993 isbn 0020130856 gwen meyer gregory 2005 the successful academic librarian winning strategies from library leaders information today pp 160 36 37 160 mich le lamont 2009 how professors think inside the curious world of academic judgment harvard university press pp 160 1 14 160 deborah e de lange 2011 research companion to green international management studies a guide for future research collaboration and review writing edward elgar publishing pp 160 1 5 160 rita james simon and linda mahan october 1969 a note on the role of book review editor as decision maker the library quarterly p 160 353 356 160 a b c rowena murray 2009 writing for academic journals mcgraw hill international pp 160 42 45 160 nick bontis 2009 a follow up ranking of academic journals journal of knowledge management p 160 17 160 lowry p b humphreys s malwitz j nix j 2007 a scientometric study of the perceived quality of business and technical communication journals ieee transactions of professional communication 160 alexander serenko and changquan jiao 2011 investigating information systems research in canada june 11 2011 p 160 ff 160 bergstrom theodore c 2001 free labor for costly journals journal of economic perspectives 15 3 183 198 doi 10 1257 jep 15 4 183 160 robert a day and barbara gastel 2011 how to write and publish a scientific paper abc clio pp 160 122 124 160 james hendler 2007 reinventing academic publishing part 1 ieee intelligent systems p 160 2 3 160 further reading edit bakkalbasi n bauer k glover j wang l jun 2006 three options for citation tracking google scholar scopus and web of science free full text biomedical digital libraries 3 7 doi 10 1186 1742 5581 3 7 pmc 160 1533854 pmid 160 16805916 160 bontis nick serenko a 2009 a follow up ranking of academic journals journal of knowledge management 13 1 16 26 doi 10 1108 13673270910931134 160 deis amp goodman update on scopus and web of science 
charleston advisor hendler james 2007 reinventing academic publishing part 1 ieee intelligent systems 22 5 doi 10 1109 mis 2007 93 160 lowry p b humphreys s malwitz j nix j 2007 a scientometric study of the perceived quality of business and technical communication journals ieee transactions of professional communication 50 4 352 78 doi 10 1109 tpc 2007 908733 160 waller a c editorial peer review its strengths and weaknesses asist monograph series information today 2001 isbn 1 57387 100 1 serenko alexander jiao c 2011 investigating information systems research in canada canadian journal of administrative sciences in press doi 10 1002 cjas 214 160 external links edit wikisource has original text related to this article research articles erih initial lists european science foundation journalseek a searchable database of online scholarly journals master journal list thomson reuters a list of selected and notable academic journals in the arts humanities sciences and social sciences links to electronic journals jurn directory of arts amp humanities ejournals academic journals what are they and academic journals compared to magazines academic writing dennis g jerz seton hill university 2001 08 2001 v t e academic publishing journals academic journal scientific journal open access journal public health journal papers working paper survey article research paper position paper literature review other types of publication thesis compilation thesis monograph specialized patent biological chemical book book chapter technical report pamphlet essay white paper preprint poster session lab notes abstract open source software search engines google scholar scirus citeseer getcited scopus web of knowledge espacenet pubmed impact metrics impact factor h index eigenfactor scimago journal rank citation index related topics scientific writing proceedings peer review gray literature scientific literature learned society open access open science data open research electronic publishing lists academic journals scientific journals university presses open access journals style formatting guides category scientific documents category academic publishing retrieved from http en wikipedia org w index php title academic_journal amp oldid 561240782 categories academic publishingacademic journalstechnical communicationpeer review navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky dansk deutsch espa ol fran ais galego hrvatski ido bahasa indonesia italiano nederlands norsk bokm l polski portugus sloven ina suomi t rk e edit links this page was last modified on 25 june 2013 at 23 00 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Accuracy_paradox b/ss2013/1_Web 
Mining/Uebungen/5_Uebung/abgabe/articles_text/Accuracy_paradox new file mode 100644 index 00000000..948ce604 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Accuracy_paradox @@ -0,0 +1 @@ +Accuracy paradox. The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy; it may be better to avoid the accuracy metric in favor of other metrics such as precision and recall. Accuracy is often the starting point for analyzing the quality of a predictive model, as well as an obvious criterion for prediction: it measures the ratio of correct predictions to the total number of cases evaluated. It may seem obvious that this ratio should be a key metric, yet a predictive model may have high accuracy but still be useless. In an example predictive model for an insurance fraud application, all cases that are predicted as high risk by the model will be investigated. To evaluate the performance of the model, the insurance company has created a sample data set of 10,000 claims. All 10,000 cases in the validation sample have been carefully checked, and it is known which cases are fraudulent. To analyze the quality of the model, the insurance company uses the table of confusion. With TN the number of true negative cases, FP the number of false positive cases, FN the number of false negative cases and TP the number of true positive cases, accuracy is defined as
Formula 1 (definition of accuracy): accuracy = (TP + TN) / (TP + TN + FP + FN)
Table 1 (table of confusion for fraud model M1_fraud):
                 predicted negative   predicted positive
negative cases        9,700                 150
positive cases           50                 100
Formula 2 (accuracy for model M1_fraud): (100 + 9,700) / 10,000 = 98.0%
With an accuracy of 98.0%, model M1_fraud appears to perform fairly well. The paradox lies in the fact that accuracy can easily be improved to 98.5% by always predicting "no fraud". The table of confusion and the accuracy for this trivial always-predict-negative model M2_fraud are shown below.
Table 2 (table of confusion for fraud model M2_fraud):
                 predicted negative   predicted positive
negative cases        9,850                   0
positive cases          150                   0
Formula 3 (accuracy for model M2_fraud): (0 + 9,850) / 10,000 = 98.5%
Model M2_fraud reduces the rate of inaccurate predictions from 2% to 1.5%, an apparent improvement of 25%. The new model M2_fraud shows fewer incorrect predictions and markedly improved accuracy compared to the original model M1_fraud, but it is obviously useless: it offers no value to the company for preventing fraud. The less accurate model is the more useful one, so model improvements should not be measured in terms of accuracy gains alone. It may be going too far to say that accuracy is irrelevant, but caution is advised when using accuracy in the evaluation of predictive models. See also: receiver operating characteristic, for other measures of how good model predictions are. Bibliography: Zhu, Xingquan (2007), Knowledge Discovery and Data Mining: Challenges and Realities, IGI Global, pp. 118-119, ISBN 978-1-59904-252-7; doi:10.1117/12.785623, pp. 86-87 of this master
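The two confusion tables above make the paradox easy to verify directly. A minimal sketch in Python (standard library only; the counts are read off Tables 1 and 2 above, while the precision/recall comparison is an added illustration rather than part of the original text) recomputes accuracy and shows that the apparently better model has zero recall:

    # Recompute the metrics behind the accuracy paradox for the two fraud models
    # described above; counts (TP, FP, FN, TN) are read off the tables of confusion.

    def metrics(tp, fp, fn, tn):
        """Return accuracy, precision and recall for one table of confusion."""
        total = tp + fp + fn + tn
        accuracy = (tp + tn) / total
        precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions at all
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return accuracy, precision, recall

    # M1_fraud flags 250 claims and catches 100 of the 150 frauds (Table 1).
    # M2_fraud always predicts "no fraud" (Table 2).
    models = {
        "M1_fraud": dict(tp=100, fp=150, fn=50, tn=9700),
        "M2_fraud": dict(tp=0, fp=0, fn=150, tn=9850),
    }
    for name, counts in models.items():
        acc, prec, rec = metrics(**counts)
        print(f"{name}: accuracy={acc:.3f}  precision={prec:.3f}  recall={rec:.3f}")
    # M1_fraud: accuracy=0.980  precision=0.400  recall=0.667
    # M2_fraud: accuracy=0.985  precision=0.000  recall=0.000

Accuracy improves from 98.0% to 98.5% while recall collapses from 0.667 to 0, which is exactly why the text recommends looking at precision and recall rather than raw accuracy.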
s thesis retrieved from http en wikipedia org w index php title accuracy_paradox amp oldid 461158461 categories statistical paradoxesmachine learningdata mininghidden categories articles needing additional references from december 2009all articles needing additional references navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 17 november 2011 at 19 41 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Affinity_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Affinity_analysis new file mode 100644 index 00000000..9e60461e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Affinity_analysis @@ -0,0 +1 @@ +affinity analysis wikipedia the free encyclopedia affinity analysis from wikipedia the free encyclopedia jump to navigation search affinity analysis is a data analysis and data mining technique that discovers co occurrence relationships among activities performed by or recorded about specific individuals or groups in general this can be applied to any process where agents can be uniquely identified and information about their activities can be recorded in retail affinity analysis is used to perform market basket analysis in which retailers seek to understand the purchase behavior of customers this information can then be used for purposes of cross selling and up selling in addition to influencing sales promotions loyalty programs store design and discount plans 1 contents 1 examples 2 business use 3 see also 4 references 5 further reading 6 external links examples edit market basket analysis might tell a retailer that customers is often purchase shampoo and conditioner together so putting both items on promotion at the same time would not create a significant increase in profit while a promotion involving just one of the items would likely drive sales of the other market basket analysis may provide the retailer with information to understand the purchase behavior of a buyer this information will enable the retailer to understand the buyer s needs and rewrite the store s layout accordingly develop cross promotional programs or even capture new buyers much like the cross selling concept an apocryphal early illustrative example for this was when one super market chain discovered in its analysis that customers that bought diapers often bought beer as well have put the diapers close to beer coolers and their sales increased dramatically although this urban legend is only an example that professors use to illustrate the concept to students the explanation of this imaginary phenomenon might be that fathers that are sent out to buy diapers often buy a beer as well as a reward this kind of 
analysis is supposedly an example of the use of data mining a widely used example of cross selling on the web with market basket analysis is amazon com s use of customers who bought book a also bought book b e g people who read history of portugal were also interested in naval history market basket analysis can be used to divide customers into groups a company could look at what other items people purchase along with eggs and classify them as baking a cake if they are buying eggs along with flour and sugar or making omelets if they are buying eggs along with bacon and cheese this identification could then be used to drive other programs business use edit business use of market basket analysis has significantly increased since the introduction of electronic point of sale 1 amazon uses affinity analysis for cross selling when it recommends products to people based on their purchase history and the purchase history of other people who bought the same item family dollar plans to use market basket analysis to help maintain sales growth while moving towards stocking more low margin consumable goods 2 a common urban legend highlighting the unexpected insights that can be found involves a chain often incorrectly given as wal mart discovering that beer and diapers were often purchased together and responding to that by moving the beer closer to the diapers to drive sales however while the relationship seems to have been noted it is unclear whether any action was taken to promote selling them together 3 see also edit association rule learning references edit a b demystifying market basket analysis retrieved 3 november 2009 160 family dollar supports merchandising with it retrieved 3 november 2009 160 the parable of the beer and diapers the register retrieved 3 september 2009 160 further reading edit j han et al 2006 data mining concepts and techniques isbn 978 1 55860 901 3 v kumar et al 2005 introduction to data mining isbn 978 0 321 32136 7 u fayyad et al 1996 advances in knowledge discovery and data mining isbn 978 0 262 56097 9 external links edit examples of basic market basket analysis using excel an overview of market basket analysis analytics concepts product affinity or market basket analysis retrieved from http en wikipedia org w index php title affinity_analysis amp oldid 555974700 categories data mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol italiano polski portugus edit links this page was last modified on 20 may 2013 at 17 25 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Alpha_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Alpha_algorithm new file mode 100644 index 00000000..9d22e95d --- /dev/null +++ b/ss2013/1_Web 
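The market-basket examples above (shampoo and conditioner, the diapers-and-beer legend) reduce to counting how often items occur in the same transaction. A small sketch in Python (standard library only; the toy baskets and item names are invented for illustration and are not data from the article) computes the support of an item pair and the confidence of one rule:

    # Minimal market-basket sketch: count item and item-pair occurrences in a toy
    # transaction log, then derive support and confidence for one association rule.
    from collections import Counter
    from itertools import combinations

    # Hypothetical baskets, one set of purchased items per transaction.
    baskets = [
        {"shampoo", "conditioner", "soap"},
        {"shampoo", "conditioner"},
        {"shampoo", "toothpaste"},
        {"conditioner", "soap"},
        {"shampoo", "conditioner", "toothpaste"},
    ]

    item_counts, pair_counts = Counter(), Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(frozenset(p) for p in combinations(basket, 2))

    pair = frozenset({"shampoo", "conditioner"})
    support = pair_counts[pair] / len(baskets)               # P(shampoo and conditioner)
    confidence = pair_counts[pair] / item_counts["shampoo"]  # P(conditioner | shampoo)
    print(f"support={support:.2f}  confidence(shampoo -> conditioner)={confidence:.2f}")
    # support=0.60  confidence(shampoo -> conditioner)=0.75

A strong rule between items that already sell together, like shampoo and conditioner, is the case the article says adds little to a joint promotion; the interesting findings are strong rules between otherwise unrelated items.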
Mining/Uebungen/5_Uebung/abgabe/articles_text/Alpha_algorithm @@ -0,0 +1 @@ +alpha algorithm wikipedia the free encyclopedia alpha algorithm from wikipedia the free encyclopedia jump to navigation search the algorithm is an algorithm used in process mining aimed at reconstructing causality from a set of sequences of events it was first put forward by van der aalst weijter and m ru ter 1 several extensions or modifications of it have since been presented which will be listed below it constructs p t nets with special properties workflow nets from event logs as might be collected by an erp system each transition in the net corresponds to an observed task contents 1 short description 1 1 definitions used 2 description 3 properties 4 limitations 5 extensions 6 references short description edit the algorithm takes a workflow log as input and results in a workflow net being constructed it does so by examining causal relationships observed between tasks for example one specific task might always precede another specific task in every execution trace which would be useful information definitions used edit a workflow trace or execution trace is a string over an alphabet of tasks a workflow log is a set of workflow traces description edit declaratively the algorithm can be presented as follows three sets of tasks are determined is the set of all tasks which occur in at least one trace is the set of all tasks which occur trace initially is the set of all tasks which occur trace terminally basic ordering relations are determined first the latter three can be constructed therefrom iff b directly precedes a in some trace iff iff iff places are discovered each place is identified with a pair of sets of tasks in order to keep the number of places low is the set of all pairs of maximal sets of tasks such that neither and contain any members of and is a subset of contains one place for every member of plus the input place and the output place the flow relation is the union of the following the result is a petri net structure with one input place and one output place because every transition of is on a path from to it is indeed a workflow net properties edit it can be shown 2 that in the case of a complete workflow log generated by a sound swf net the net generating it can be reconstructed complete means that its relation is maximal it is not required that all possible traces be present which would be countably infinite for a net with a loop limitations edit general workflow nets may contain several types of constructs 3 which the algorithm cannot rediscover this section requires expansion may 2010 constructing takes exponential time in the number of tasks since is not constrained and arbitrary subsets of must be considered extensions edit this section requires expansion may 2010 for example 4 5 references edit van der aalst w m p and weijter a j m m and maruster l 2003 workflow mining discovering process models from event logs ieee transactions on knowledge and data engineering vol 16 van der aalst et al 2003 a de medeiros a k and van der aalst w m p and weijters a j m m 2003 workflow mining current status and future directions in volume 2888 of lecture notes in computer science springer verlag a de medeiros a k and van dongen b f and van der aalst w m p and weijters a j m m 2004 process mining extending the algorithm to mine short loops wen l and van der aalst w m p and wang j and sun j 2007 mining process models with non free choice constructs data mining and knowledge discovery vol 15 p 145 180 springer 
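The ordering relations referred to above (direct succession, causality, parallelism, unrelatedness) can be extracted from a workflow log with a few set operations. The sketch below (Python, standard library only; the three-trace toy log is an assumed example, and only these footprint relations are derived, not the subsequent place-construction step of the alpha-algorithm) illustrates the first part of the procedure:

    # Derive the basic ordering relations of the alpha-algorithm from a toy workflow log.
    #   a > b   (direct succession)  iff a is immediately followed by b in some trace
    #   a -> b  (causality)          iff a > b and not b > a
    #   a || b  (parallelism)        iff a > b and b > a
    #   a # b   (unrelatedness)      iff neither a > b nor b > a
    from itertools import product

    # Hypothetical event log: each trace is a sequence of task names.
    log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]

    tasks = {t for trace in log for t in trace}
    direct = {(trace[i], trace[i + 1]) for trace in log for i in range(len(trace) - 1)}

    causal = {(x, y) for (x, y) in direct if (y, x) not in direct}
    parallel = {(x, y) for (x, y) in direct if (y, x) in direct}
    unrelated = {(x, y) for x, y in product(tasks, repeat=2)
                 if (x, y) not in direct and (y, x) not in direct}

    print("direct succession:", sorted(direct))
    print("causality:        ", sorted(causal))
    print("parallelism:      ", sorted(parallel))
    print("unrelatedness:    ", sorted(unrelated))

On this log, b and c come out as parallel tasks between a and d, which is the kind of information the place-construction step then uses to build the workflow net.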
verlag retrieved from http en wikipedia org w index php title alpha_algorithm amp oldid 548333215 categories data mininghidden categories articles to be expanded from may 2010all articles to be expanded navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 2 april 2013 at 15 38 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Analytics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Analytics new file mode 100644 index 00000000..149fd4dd --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Analytics @@ -0,0 +1 @@ +analytics wikipedia the free encyclopedia analytics from wikipedia the free encyclopedia jump to navigation search a sample google analytics dashboard tools like this help businesses identify trends and make decisions analytics is the discovery and communication of meaningful patterns in data especially valuable in areas rich with recorded information analytics relies on the simultaneous application of statistics computer programming and operations research to quantify performance analytics often favors data visualization to communicate insight firms may commonly apply analytics to business data to describe predict and improve business performance specifically arenas within analytics include enterprise decision management retail analytics store assortment and sku optimization marketing optimization and marketing mix analytics web analytics sales force sizing and optimization price and promotion modeling predictive science credit risk analysis and fraud analytics since analytics can require extensive computation see big data the algorithms and software used for analytics harness the most current methods in computer science statistics and mathematics 1 contents 1 analytics vs analysis 2 examples 2 1 marketing optimization 2 2 portfolio analysis 2 3 risk analytics 3 challenges 4 see also 5 references 6 external links analytics vs analysis edit analytics is a two sided coin on one side it uses descriptive and predictive models to gain valuable knowledge from data data analysis on the other analytics uses this insight to recommend action or to guide decision making communication thus analytics is not so much concerned with individual analyses or analysis steps but with the entire methodology there is a pronounced tendency to use the term analytics in business settings e g text analytics vs the more generic text mining to emphasize this broader perspective examples edit marketing optimization edit marketing has evolved from a creative process into a highly data driven process marketing organizations use analytics to determine the outcomes of campaigns or efforts and to guide decisions for 
investment and consumer targeting demographic studies customer segmentation conjoint analysis and other techniques allow marketers to use large amounts of consumer purchase survey and panel data to understand and communicate marketing strategy web analytics allows marketers to collect session level information about interactions on a website those interactions provide the web analytics information systems with the information to track the referrer search keywords ip address and activities of the visitor with this information a marketer can improve the marketing campaigns site creative content and information architecture analysis techniques frequently used in marketing include marketing mix modeling pricing and promotion analyses sales force optimization customer analytics e g segmentation web analytics and optimization of web sites and online campaigns now frequently works hand in hand with the more traditional marketing analysis techniques a focus on digital media has slightly changed the vocabulary so that marketing mix modeling is commonly referred to as attribution modeling in the digital or mixed media context these tools and techniques support both strategic marketing decisions such as how much overall to spend on marketing and how to allocate budgets across a portfolio of brands and the marketing mix and more tactical campaign support in terms of targeting the best potential customer with the optimal message in the most cost effective medium at the ideal time an example of the holistic approach required for this strategy is the astronomy model portfolio analysis edit a common application of business analytics is portfolio analysis in this a bank or lending agency has a collection of accounts of varying value and risk the accounts may differ by the social status wealthy middle class poor etc of the holder the geographical location its net value and many other factors the lender must balance the return on the loan with the risk of default for each loan the question is then how to evaluate the portfolio as a whole the least risk loan may be to the very wealthy but there are a very limited number of wealthy people on the other hand there are many poor that can be lent to but at greater risk some balance must be struck that maximizes return and minimizes risk the analytics solution may combine time series analysis with many other issues in order to make decisions on when to lend money to these different borrower segments or decisions on the interest rate charged to members of a portfolio segment to cover any losses among members in that segment risk analytics edit predictive models in banking industry is widely developed to bring certainty across the risk scores for individual customers credit scores are built to predict individual s delinquency behaviour and also scores are widely used to evaluate the credit worthiness of each applicant and rated while processing loan applications challenges edit in the industry of commercial analytics software an emphasis has emerged on solving the challenges of analyzing massive complex data sets often when such data is in a constant state of change such data sets are commonly referred to as big data whereas once the problems posed by big data were only found in the scientific community today big data is a problem for many businesses that operate transactional systems online and as a result amass large volumes of data quickly 2 the analysis of unstructured data types is another challenge getting attention in the industry unstructured data differs 
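The portfolio discussion above is, at its core, an expected-value trade-off: the interest earned on a borrower segment must cover the expected default losses within that segment. A toy sketch (Python; the segment names, interest rates, default probabilities and the total-loss-on-default assumption are all invented for illustration, not taken from the article) of that comparison:

    # Toy portfolio trade-off: expected profit per unit lent to each borrower segment,
    # assuming the full interest is earned when the borrower repays and the whole
    # principal is lost on default.
    segments = {
        # name: (interest rate, probability of default)
        "low_risk":  (0.04, 0.01),
        "mid_risk":  (0.09, 0.05),
        "high_risk": (0.18, 0.15),
    }

    for name, (rate, p_default) in segments.items():
        expected_profit = (1 - p_default) * rate - p_default * 1.0
        print(f"{name:9s}  expected profit per unit lent: {expected_profit:+.3f}")

Under these made-up numbers the high-risk segment is barely profitable despite its much higher rate, which is the balance between return and default risk the paragraph describes.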
from structured data in that its format varies widely and cannot be stored in traditional relational databases without significant effort at data transformation 3 sources of unstructured data such as email the contents of word processor documents pdfs geospatial data etc are rapidly becoming a relevant source of business intelligence for businesses governments and universities 4 for example in britain the discovery that one company was illegally selling fraudulent doctor s notes in order to assist people in defrauding employers and insurance companies 5 is an opportunity for insurance firms to increase the vigilance of their unstructured data analysis the mckinsey global institute estimates that big data analysis could save the american health care system 300 billion per year and the european public sector 250 billion 6 these challenges are the current inspiration for much of the innovation in modern analytics information systems giving birth to relatively new machine analysis concepts such as complex event processing full text search and analysis and even new ideas in presentation 7 one such innovation is the introduction of grid like architecture in machine analysis allowing increases in the speed of massively parallel processing by distributing the workload to many computers all with equal access to the complete data set 8 analytics is increasingly used in education particularly at the district and government office levels however the complexity of student performance measures presents challenges when educators try to understand and use analytics to discern patterns in student performance predict graduation likelihood improve chances of student success etc for example in a study involving districts known for strong data use 48 of teachers had difficulty posing questions prompted by data 36 did not comprehend given data and 52 incorrectly interpreted data 9 to combat this some analytics tools for educators adhere to an over the counter data format embedding labels supplemental documentation and a help system and making key package display and content decisions to improve educators understanding and use of the analytics being displayed 10 one more emerging challenge is dynamic regulatory needs for example in the banking industry basel iii and future capital adequacy needs are likely to make even smaller banks adopt internal risk models in such incidents cloud computing and open source r can help smaller banks to adopt risk analytics and support branch level monitoring by applying predictive analytics citation needed see also edit analysis big data business analytics business intelligence complex event processing data mining data presentation architecture learning analytics list of software engineering topics machine learning online analytical processing online video analytics operations research predictive analytics prescriptive analytics statistics web analytics references edit kohavi rothleder and simoudis 2002 emerging trends in business analytics communications of the acm 45 8 45 48 160 naone erica the new big data technology review mit retrieved august 22 2011 160 inmon bill 2007 tapping into unstructured data prentice hall isbn 160 978 0 13 236029 6 160 more than one of author and last specified help wise lyndsay data analysis and unstructured data dashboard insight retrieved february 14 2011 160 fake doctors sick notes for sale for 25 nhs fraud squad warns london the telegraph retrieved august 2008 160 big data the next frontier for innovation competition and productivity as 
reported in building with big data the economist may 26 2011 archived from the original on 3 june 2011 retrieved may 26 2011 160 ortega dan mobililty fueling a brainier business intelligence it business edge retrieved june 21 2011 160 khambadkone krish are you ready for big data infogain retrieved february 10 2011 160 u s department of education office of planning evaluation and policy development 2009 implementing data informed decision making in schools teacher access supports and use united states department of education eric document reproduction service no ed504191 rankin j 2013 march 28 how data systems amp reports can either fight or propagate the data analysis error epidemic and how educator leaders can help presentation conducted from technology information center for administrative leadership tical school leadership summit external links edit informs bi monthly digital magazine on the analytics profession look up analytics in wiktionary the free dictionary retrieved from http en wikipedia org w index php title analytics amp oldid 561470785 categories analyticsfinancial data analysismathematical financeformal sciencesbusiness termsbusiness intelligencebig datahidden categories pages with citations having redundant parametersall articles with unsourced statementsarticles with unsourced statements from november 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch fran ais bahasa indonesia edit links this page was last modified on 25 june 2013 at 05 47 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_Detection_at_Multiple_Scales b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_Detection_at_Multiple_Scales new file mode 100644 index 00000000..5db8a7bd --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_Detection_at_Multiple_Scales @@ -0,0 +1 @@ +anomaly detection at multiple scales wikipedia the free encyclopedia anomaly detection at multiple scales from wikipedia the free encyclopedia jump to navigation search anomaly detection at multiple scales establishment 2011 sponsor darpa value 35 million goal detect insider threats in defense and government networks website www darpa mil anomaly detection at multiple scales or adams is a 35 million darpa project designed to identify patterns and anomalies in very large data sets it is under darpa s information innovation office and began in 2011 1 2 3 4 the project is intended to detect and prevent insider threats such as a soldier in good mental health becoming homicidal or suicidal an innocent insider becoming malicious or a a government employee whom abuses access privileges to share classified information 2 5 specific cases mentioned are nidal malik 
hasan and wikileaks alleged source bradley manning 6 commercial applications may include finance 6 the intended recipients of the system output are operators in the counterintelligence agencies 2 5 the proactive discovery of insider threats using graph analysis and learning is part of the adams project 5 7 the georgia tech team includes noted high performance computing researcher david a bader 8 see also edit cyber insider threat einstein us cert program threat computer intrusion detection references edit adams darpa information innovation office retrieved 2011 12 05 160 a b c anomaly detection at multiple scales adams broad agency announcement darpa baa 11 04 general services administration 2010 10 22 retrieved 2011 12 05 160 ackerman spencer 2010 10 11 darpa starts sleuthing out disloyal troops wired retrieved 2011 12 06 160 keyes charley 2010 10 27 military wants to scan communications to find internal threats cnn retrieved 2011 12 06 160 a b c georgia tech helps to develop system that will detect insider threats from massive data sets georgia institute of technology 2011 11 10 retrieved 2011 12 06 160 a b video interview darpa s adams project taps big data to find the breaking bad inside hpc 2011 11 29 retrieved 2011 12 06 160 brandon john 2011 12 03 could the u s government start reading your emails fox news retrieved 2011 12 06 160 anomaly detection at multiple scales georgia tech college of computing retrieved 2011 12 06 160 this military related article is a stub you can help wikipedia by expanding it v t e this software article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title anomaly_detection_at_multiple_scales amp oldid 528260378 categories military stubssoftware stubsdata miningcomputer securitygeorgia tech research institutedarpa navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 16 december 2012 at 04 39 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_detection b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_detection new file mode 100644 index 00000000..524b46e5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Anomaly_detection @@ -0,0 +1 @@ +anomaly detection wikipedia the free encyclopedia anomaly detection from wikipedia the free encyclopedia jump to navigation search anomaly detection also referred to as outlier detection 1 refers to detecting patterns in a given data set that do not conform to an established normal behavior 2 the patterns thus detected are called anomalies and often translate to critical and actionable information in several application domains anomalies are also referred to as 
outliers change deviation surprise aberrant peculiarity intrusion etc in particular in the context of abuse and network intrusion detection the interesting objects are often not rare objects but unexpected bursts in activity this pattern does not adhere to the common statistical definition of an outlier as a rare object and many outlier detection methods in particular unsupervised methods will fail on such data unless it has been aggregated appropriately instead a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns 3 three broad categories of anomaly detection techniques exist unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set supervised anomaly detection techniques require a data set that has been labeled as normal and abnormal and involves training a classifier the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection semi supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set and then testing the likelihood of a test instance to be generated by the learnt model citation needed contents 1 applications 2 popular techniques 3 application to data security 4 see also 5 references applications edit anomaly detection is applicable in a variety of domains such as intrusion detection fraud detection fault detection system health monitoring event detection in sensor networks and detecting eco system disturbances it is often used in preprocessing to remove anomalous data from the dataset in supervised learning removing the anomalous data from the dataset often results in a statistically significant increase in accuracy 4 5 popular techniques edit several anomaly detection techniques have been proposed in literature some of the popular techniques are distance based techniques k nearest neighbor local outlier factor 6 one class support vector machines replicator neural networks cluster analysis based outlier detection pointing at records that deviate from learned association rules application to data security edit anomaly detection was proposed for intrusion detection systems ids by dorothy denning in 1986 7 anomaly detection for ids is normally accomplished with thresholds and statistics but can also be done with soft computing and inductive learning 8 types of statistics proposed by 1999 included profiles of users workstations networks remote hosts groups of users and programs based on frequencies means variances covariances and standard deviations 9 the counterpart of anomaly detection in intrusion detection is misuse detection see also edit outliers in statistics change detection intrusion detection system references edit hans peter kriegel peer kr ger arthur zimek 2009 outlier detection techniques tutorial 13th pacific asia conference on knowledge discovery and data mining pakdd 2009 bangkok thailand retrieved 2010 06 05 160 varun chandola arindam banerjee and vipin kumar anomaly detection a survey acm computing surveys vol 41 3 article 15 july 2009 dokas paul levent ertoz vipin kumar aleksandar lazarevic jaideep srivastava pang ning tan 2002 data mining for network intrusion detection proceedings nsf workshop on next generation data mining 160 ivan tomek 1976 an experiment with the edited nearest neighbor rule ieee 
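Of the techniques listed above, the distance-based k-nearest-neighbour score is the simplest to sketch: a point whose k-th nearest neighbour is far away is flagged as an outlier. A minimal version (Python, standard library only; the one-dimensional toy data, the choice k=2 and the threshold are illustrative assumptions):

    # Minimal distance-based outlier scoring: score each point by the distance to its
    # k-th nearest neighbour; points with large scores are candidate anomalies.

    def knn_outlier_scores(points, k=2):
        scores = []
        for i, p in enumerate(points):
            dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
            scores.append(dists[k - 1])  # distance to the k-th nearest neighbour
        return scores

    data = [1.0, 1.1, 0.9, 1.2, 1.05, 8.0]  # 8.0 is the planted outlier
    for x, score in zip(data, knn_outlier_scores(data, k=2)):
        flag = "  <- anomaly" if score > 1.0 else ""
        print(f"value={x:5.2f}  score={score:.2f}{flag}")

A real application would choose k and the threshold from the data, or switch to a density-based score such as the local outlier factor also listed above, but the principle is the same.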
transactions on systems man and cybernetics 6 pp 160 448 452 160 michael r smith and tony martinez 2011 improving classification accuracy by identifying and removing instances that should be misclassified proceedings of international joint conference on neural networks ijcnn 2011 pp 160 2690 2697 160 breunig m m kriegel h p ng r t sander j 2000 lof identifying density based local outliers acm sigmod record 29 93 doi 10 1145 335191 335388 160 edit denning dorothy an intrusion detection model proceedings of the seventh ieee symposium on security and privacy may 1986 pages 119 131 teng henry s chen kaihu and lu stephen c y adaptive real time anomaly detection using inductively generated sequential patterns 1990 ieee symposium on security and privacy jones anita k and sielken robert s computer system intrusion detection a survey technical report department of computer science university of virginia charlottesville va 1999 retrieved from http en wikipedia org w index php title anomaly_detection amp oldid 556544258 categories data miningdata securitystatistical outliershidden categories all articles with unsourced statementsarticles with unsourced statements from october 2011 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 24 may 2013 at 06 45 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Apriori_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Apriori_algorithm new file mode 100644 index 00000000..8f6657d7 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Apriori_algorithm @@ -0,0 +1 @@ +apriori algorithm wikipedia the free encyclopedia apriori algorithm from wikipedia the free encyclopedia jump to navigation search this article may be confusing or unclear to readers please help us clarify the article suggestions may be found on the talk page december 2006 apriori 1 is a classic algorithm for frequent item set mining and association rule learning over transactional databases it proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database the frequent item sets determined by apriori can be used to determine association rules which highlight general trends in the database this has applications in domains such as market basket analysis contents 1 setting 2 examples 2 1 example 1 2 2 example 2 3 limitations 4 see also 5 references 6 external links setting edit apriori is designed to operate on databases containing transactions for example collections of items bought by customers or details of a website frequentation other algorithms are designed for finding 
association rules in data having no transactions winepi and minepi or having no timestamps dna sequencing each transaction is seen as a set of items an itemset given a threshold the apriori algorithm identifies the item sets which are subsets of at least transactions in the database apriori uses a bottom up approach where frequent subsets are extended one item at a time a step known as candidate generation and groups of candidates are tested against the data the algorithm terminates when no further successful extensions are found apriori uses breadth first search and a hash tree structure to count candidate item sets efficiently it generates candidate item sets of length from item sets of length then it prunes the candidates which have an infrequent sub pattern according to the downward closure lemma the candidate set contains all frequent length item sets after that it scans the transaction database to determine frequent item sets among the candidates the pseudo code for the algorithm is given below for a transaction database and a support threshold of usual set theoretic notation is employed though note that is a multiset is the candidate set for level generate algorithm is assumed to generate the candidate sets from the large item sets of the preceding level heeding the downward closure lemma accesses a field of the data structure that represents candidate set which is initially assumed to be zero many details are omitted below usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies examples edit example 1 edit consider the following database where each row is a transaction and each cell is an individual item of the transaction alpha beta gamma alpha beta theta alpha beta epsilon alpha beta theta the association rules that can be determined from this database are the following 100 of sets with alpha also contain beta 25 of sets with alpha beta also have gamma 50 of sets with alpha beta also have theta we can also illustrate this through variety of examples example 2 edit assume that a large supermarket tracks sales data by stock keeping unit sku for each item each item such as butter or bread is identified by a numerical sku the supermarket has a database of transactions where each transaction is a set of skus that were bought together let the database of transactions consist of the sets 1 2 3 4 1 2 2 3 4 2 3 1 2 4 3 4 and 2 4 we will use apriori to determine the frequent item sets of this database to do so we will say that an item set is frequent if it appears in at least 3 transactions of the database the value 3 is the support threshold the first step of apriori is to count up the number of occurrences called the support of each member item separately by scanning the database a first time we obtain the following result item support 1 3 2 6 3 4 4 5 all the itemsets of size 1 have a support of at least 3 so they are all frequent the next step is to generate a list of all pairs of the frequent items item support 1 2 3 1 3 1 1 4 2 2 3 3 2 4 4 3 4 3 the pairs 1 2 2 3 2 4 and 3 4 all meet or exceed the minimum support of 3 so they are frequent the pairs 1 3 and 1 4 are not now because 1 3 and 1 4 are not frequent any larger set which contains 1 3 or 1 4 cannot be frequent in this way we can prune sets we will now look for frequent triples in the database but we can already exclude all the triples that contain one of these two pairs item support 2 3 4 2 in the example there are no frequent triplets 2 3 4 
is below the minimal threshold and the other triplets were excluded because they were super sets of pairs that were already below the threshold we have thus determined the frequent sets of items in the database and illustrated how some items were not counted because one of their subsets was already known to be below the threshold limitations edit apriori while historically significant suffers from a number of inefficiencies or trade offs which have spawned other algorithms candidate generation generates large numbers of subsets the algorithm attempts to load up the candidate set with as many as possible before each scan bottom up subset exploration essentially a breadth first traversal of the subset lattice finds any maximal subset s only after all of its proper subsets later algorithms such as max miner 2 try to identify the maximal frequent item sets without enumerating their subsets and perform jumps in the search space rather than a purely bottom up approach see also edit association rule learning fsa red algorithm dynamic item set counting references edit rakesh agrawal and ramakrishnan srikant fast algorithms for mining association rules in large databases proceedings of the 20th international conference on very large data bases vldb pages 487 499 santiago chile september 1994 bayardo jr roberto j efficiently mining long patterns from databases acm sigmod record vol 27 no 2 acm 1998 external links edit implementation of the apriori algorithm in c artool gpl java association rule mining application with gui offering implementations of multiple algorithms for discovery of frequent patterns and extraction of association rules includes apriori spmf open source java implementations of more than 50 algorithms for frequent itemsets mining association rule mining and sequential pattern mining it offers apriori and several variations such as aprioriclose uapriori aprioriinverse apriorirare msapriori aprioritid etc and other more efficient algorithms such as fpgrowth retrieved from http en wikipedia org w index php title apriori_algorithm amp oldid 557813883 categories data mininghidden categories wikipedia articles needing clarification from december 2006all wikipedia articles needing clarification navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol fran ais italiano nederlands portugus edit links this page was last modified on 17 june 2013 at 08 30 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Artificial_intelligence b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Artificial_intelligence new file mode 100644 index 00000000..0f9e18f6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Artificial_intelligence @@ -0,0 +1 
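The level-wise candidate generation and pruning described above fits in a few lines. The sketch below (Python, standard library only; it implements only the join, prune and count steps, not the hash-tree optimisation mentioned in the article) uses the seven transactions and the support threshold of 3 from Example 2 and reproduces the frequent itemsets derived by hand there:

    # Level-wise Apriori over the Example 2 transactions: generate candidates of size k
    # from frequent itemsets of size k-1, prune with the downward-closure property,
    # then count supports against the database.
    from itertools import combinations

    transactions = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
    min_support = 3

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset({i}) for i in items if support(frozenset({i})) >= min_support}]

    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: unions of (k-1)-itemsets that yield a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates if support(c) >= min_support})
        k += 1

    for level, sets in enumerate(frequent, start=1):
        for s in sorted(sets, key=sorted):
            print(f"size {level}: {sorted(s)} support={support(s)}")

The prune step is where the downward-closure lemma does its work: {1,2,3} and {1,2,4} are discarded without a database scan because {1,3} and {1,4} were already infrequent, leaving only {2,3,4} to be counted and then rejected, exactly as in the worked example above.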
@@ +artificial intelligence wikipedia the free encyclopedia artificial intelligence from wikipedia the free encyclopedia jump to navigation search ai redirects here for other uses see ai for other uses see artificial intelligence disambiguation artificial intelligence ai is technology and a branch of computer science that studies and develops intelligent machines and software major ai researchers and textbooks define the field as the study and design of intelligent agents 1 where an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of success 2 john mccarthy who coined the term in 1955 3 defines it as the science and engineering of making intelligent machines 4 ai research is highly technical and specialised deeply divided into subfields that often fail to communicate with each other 5 some of the division is due to social and cultural factors subfields have grown up around particular institutions and the work of individual researchers ai research is also divided by several technical issues there are subfields which are focused on the solution of specific problems on one of several possible approaches on the use of widely differing tools and towards the accomplishment of particular applications the central problems or goals of ai research include reasoning knowledge planning learning communication perception and the ability to move and manipulate objects 6 general intelligence or strong ai is still among the field s long term goals 7 currently popular approaches include statistical methods computational intelligence and traditional symbolic ai there are an enormous number of tools used in ai including versions of search and mathematical optimization logic methods based on probability and economics and many others the field was founded on the claim that a central property of humans intelligence the sapience of homo sapiens can be so precisely described that it can be simulated by a machine 8 this raises philosophical issues about the nature of the mind and the ethics of creating artificial beings issues which have been addressed by myth fiction and philosophy since antiquity 9 artificial intelligence has been the subject of tremendous optimism 10 but has also suffered stunning setbacks 11 today it has become an essential part of the technology industry and many of the most difficult problems in computer science 12 contents 1 history 2 goals 2 1 deduction reasoning problem solving 2 2 knowledge representation 2 3 planning 2 4 learning 2 5 natural language processing 2 6 motion and manipulation 2 7 perception 2 8 social intelligence 2 9 creativity 2 10 general intelligence 3 approaches 3 1 cybernetics and brain simulation 3 2 symbolic 3 3 sub symbolic 3 4 statistical 3 5 integrating the approaches 4 tools 4 1 search and optimization 4 2 logic 4 3 probabilistic methods for uncertain reasoning 4 4 classifiers and statistical learning methods 4 5 neural networks 4 6 control theory 4 7 languages 5 evaluating progress 6 applications 6 1 competitions and prizes 6 2 platforms 7 philosophy 8 predictions and ethics 9 see also 10 references 10 1 notes 11 references 11 1 ai textbooks 11 2 history of ai 11 3 other sources 12 further reading 13 external links history edit main articles history of artificial intelligence and timeline of artificial intelligence thinking machines and artificial beings appear in greek myths such as talos of crete the bronze robot of hephaestus and pygmalion s galatea 13 human likenesses believed to have intelligence were 
built in every major civilization animated cult images were worshiped in egypt and greece 14 and humanoid automatons were built by yan shi hero of alexandria and al jazari 15 it was also widely believed that artificial beings had been created by j bir ibn hayy n judah loew and paracelsus 16 by the 19th and 20th centuries artificial beings had become a common feature in fiction as in mary shelley s frankenstein or karel apek s r u r rossum s universal robots 17 pamela mccorduck argues that all of these are examples of an ancient urge as she describes it to forge the gods 9 stories of these creatures and their fates discuss many of the same hopes fears and ethical concerns that are presented by artificial intelligence mechanical or formal reasoning has been developed by philosophers and mathematicians since antiquity the study of logic led directly to the invention of the programmable digital electronic computer based on the work of mathematician alan turing and others turing s theory of computation suggested that a machine by shuffling symbols as simple as 0 and 1 could simulate any conceivable act of mathematical deduction 18 19 this along with concurrent discoveries in neurology information theory and cybernetics inspired a small group of researchers to begin to seriously consider the possibility of building an electronic brain 20 the field of ai research was founded at a conference on the campus of dartmouth college in the summer of 1956 21 the attendees including john mccarthy marvin minsky allen newell and herbert simon became the leaders of ai research for many decades 22 they and their students wrote programs that were to most people simply astonishing 23 computers were solving word problems in algebra proving logical theorems and speaking english 24 by the middle of the 1960s research in the u s was heavily funded by the department of defense 25 and laboratories had been established around the world 26 ai s founders were profoundly optimistic about the future of the new field herbert simon predicted that machines will be capable within twenty years of doing any work a man can do and marvin minsky agreed writing that within a generation 160 the problem of creating artificial intelligence will substantially be solved 27 they had failed to recognize the difficulty of some of the problems they faced 28 in 1974 in response to the criticism of sir james lighthill and ongoing pressure from the us congress to fund more productive projects both the u s and british governments cut off all undirected exploratory research in ai the next few years would later be called an ai winter 29 a period when funding for ai projects was hard to find in the early 1980s ai research was revived by the commercial success of expert systems 30 a form of ai program that simulated the knowledge and analytical skills of one or more human experts by 1985 the market for ai had reached over a billion dollars at the same time japan s fifth generation computer project inspired the u s and british governments to restore funding for academic research in the field 31 however beginning with the collapse of the lisp machine market in 1987 ai once again fell into disrepute and a second longer lasting ai winter began 32 in the 1990s and early 21st century ai achieved its greatest successes albeit somewhat behind the scenes artificial intelligence is used for logistics data mining medical diagnosis and many other areas throughout the technology industry 12 the success was due to several factors the increasing computational 
power of computers see moore s law a greater emphasis on solving specific subproblems the creation of new ties between ai and other fields working on similar problems and a new commitment by researchers to solid mathematical methods and rigorous scientific standards 33 on 11 may 1997 deep blue became the first computer chess playing system to beat a reigning world chess champion garry kasparov 34 in 2005 a stanford robot won the darpa grand challenge by driving autonomously for 131 miles along an unrehearsed desert trail 35 two years later a team from cmu won the darpa urban challenge when their vehicle autonomously navigated 55 miles in an urban environment while adhering to traffic hazards and all traffic laws 36 in february 2011 in a jeopardy quiz show exhibition match ibm s question answering system watson defeated the two greatest jeopardy champions brad rutter and ken jennings by a significant margin 37 the kinect which provides a 3d body motion interface for the xbox 360 uses algorithms that emerged from lengthy ai research 38 as does the iphones s siri goals edit the general problem of simulating or creating intelligence has been broken down into a number of specific sub problems these consist of particular traits or capabilities that researchers would like an intelligent system to display the traits described below have received the most attention 6 deduction reasoning problem solving edit early ai researchers developed algorithms that imitated the step by step reasoning that humans use when they solve puzzles or make logical deductions 39 by the late 1980s and 1990s ai research had also developed highly successful methods for dealing with uncertain or incomplete information employing concepts from probability and economics 40 for difficult problems most of these algorithms can require enormous computational resources most experience a combinatorial explosion the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size the search for more efficient problem solving algorithms is a high priority for ai research 41 human beings solve most of their problems using fast intuitive judgements rather than the conscious step by step deduction that early ai research was able to model 42 ai has made some progress at imitating this kind of sub symbolic problem solving embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning neural net research attempts to simulate the structures inside the brain that give rise to this skill statistical approaches to ai mimic the probabilistic nature of the human ability to guess knowledge representation edit an ontology represents knowledge as a set of concepts within a domain and the relationships between those concepts main articles knowledge representation and commonsense knowledge knowledge representation 43 and knowledge engineering 44 are central to ai research many of the problems machines are expected to solve will require extensive knowledge about the world among the things that ai needs to represent are objects properties categories and relations between objects 45 situations events states and time 46 causes and effects 47 knowledge about knowledge what we know about what other people know 48 and many other less well researched domains a representation of what exists is an ontology the set of objects relations concepts and so on that the machine knows about the most general are called upper ontologies which attempt to provide a foundation for all other knowledge 
49 among the most difficult problems in knowledge representation are default reasoning and the qualification problem many of the things people know take the form of working assumptions for example if a bird comes up in conversation people typically picture an animal that is fist sized sings and flies none of these things are true about all birds john mccarthy identified this problem in 1969 50 as the qualification problem for any commonsense rule that ai researchers care to represent there tend to be a huge number of exceptions almost nothing is simply true or false in the way that abstract logic requires ai research has explored a number of solutions to this problem 51 the breadth of commonsense knowledge the number of atomic facts that the average person knows is astronomical research projects that attempt to build a complete knowledge base of commonsense knowledge e g cyc require enormous amounts of laborious ontological engineering they must be built by hand one complicated concept at a time 52 a major goal is to have the computer understand enough concepts to be able to learn by reading from sources like the internet and thus be able to add to its own ontology citation needed the subsymbolic form of some commonsense knowledge much of what people know is not represented as facts or statements that they could express verbally for example a chess master will avoid a particular chess position because it feels too exposed 53 or an art critic can take one look at a statue and instantly realize that it is a fake 54 these are intuitions or tendencies that are represented in the brain non consciously and sub symbolically 55 knowledge like this informs supports and provides a context for symbolic conscious knowledge as with the related problem of sub symbolic reasoning it is hoped that situated ai computational intelligence or statistical ai will provide ways to represent this kind of knowledge 55 planning edit a hierarchical control system is a form of control system in which a set of devices and governing software is arranged in a hierarchy main article automated planning and scheduling intelligent agents must be able to set goals and achieve them 56 they need a way to visualize the future they must have a representation of the state of the world and be able to make predictions about how their actions will change it and be able to make choices that maximize the utility or value of the available choices 57 in classical planning problems the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be 58 however if the agent is not the only actor it must periodically ascertain whether the world matches its predictions and it must change its plan as this becomes necessary requiring the agent to reason under uncertainty 59 multi agent planning uses the cooperation and competition of many agents to achieve a given goal emergent behavior such as this is used by evolutionary algorithms and swarm intelligence 60 learning edit main article machine learning machine learning is the study of computer algorithms that improve automatically through experience 61 62 and has been central to ai research since the field s inception 63 unsupervised learning is the ability to find patterns in a stream of input supervised learning includes both classification and numerical regression classification is used to determine what category something belongs in after seeing a number of examples of things from several categories regression is the attempt 
In reinforcement learning[64] the agent is rewarded for good responses and punished for bad ones; these rewards can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine-learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.[65] Within developmental robotics, developmental learning approaches have been elaborated for the lifelong, cumulative acquisition of repertoires of novel skills by a robot, through autonomous self-exploration and social interaction with human teachers, using guidance mechanisms such as active learning, maturation, motor synergies and imitation.[66][67][68][69]

Natural language processing

A parse tree represents the syntactic structure of a sentence according to some formal grammar. Natural language processing[70] gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural-language-processing system would enable natural-language user interfaces and the acquisition of knowledge directly from human-written sources, such as internet texts. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.[71] A common method of processing and extracting meaning from natural language is semantic indexing; increases in processing speed and the falling cost of data storage make indexing large volumes of abstractions of the user's input much more efficient.

Motion and manipulation

The field of robotics[72] is closely related to AI. Intelligence is required for robots to be able to handle tasks such as object manipulation[73] and navigation, with sub-problems of localization (knowing where you are, or finding out where other things are), mapping (learning what is around you, building a map of the environment), and motion planning (figuring out how to get there) or path planning (going from one point in space to another, which may involve compliant motion, where the robot moves while maintaining physical contact with an object).[74][75]

Perception

Machine perception[76] is the ability to use input from sensors (such as cameras, microphones, sonar and others, more exotic) to deduce aspects of the world. Computer vision[77] is the ability to analyze visual input. A few selected sub-problems are speech recognition,[78] facial recognition and object recognition.[79]
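As a small, hedged illustration of the kind of low-level operation computer-vision systems build on (the article does not prescribe any particular one), the sketch below convolves a synthetic grayscale patch with a Sobel-style kernel to highlight an edge; the image and kernel values are invented for this example.

```python
import numpy as np

# Minimal sketch (synthetic image): detecting a vertical edge by
# convolving a tiny grayscale patch with a Sobel-style kernel, the kind
# of low-level operation computer-vision pipelines build on.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)  # dark left, bright right

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)  # responds to horizontal gradients

h, w = image.shape
response = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(np.abs(response).max())  # strongest response sits on the dark/bright boundary
```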
Social intelligence

Kismet is a robot with rudimentary social skills.[80] Affective computing is the study and development of systems and devices that can recognize, interpret, process and simulate human affects.[81][82] It is an interdisciplinary field spanning computer science, psychology and cognitive science.[83] While the origins of the field may be traced as far back as early philosophical inquiries into emotion,[84] the more modern branch of computer science originated with Rosalind Picard's 1995 paper[85] on affective computing.[86][87] A motivation for the research is the ability to simulate empathy: the machine should interpret the emotional state of humans and adapt its behaviour to them, giving an appropriate response to those emotions. Emotion and social skills[88] play two roles for an intelligent agent. First, it must be able to predict the actions of others by understanding their motives and emotional states; this involves elements of game theory and decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions. Second, in an effort to facilitate human-computer interaction, an intelligent machine might want to be able to display emotions, even if it does not actually experience them itself, in order to appear sensitive to the emotional dynamics of human interaction.

Creativity

A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically, via specific implementations of systems that generate outputs that can be considered creative, or systems that identify and assess creativity. Related areas of computational research are artificial intuition and artificial imagination.

General intelligence

Most researchers think that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them.[7] A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project.[89][90] Many of the problems above may require general intelligence to be considered solved. For example, even a straightforward, specific task like machine translation requires that the machine read and write in both languages (NLP), follow the author's argument (reason), know what is being talked about (knowledge), and faithfully reproduce the author's intention (social intelligence). A problem like machine translation is therefore considered "AI-complete": to solve this particular problem, you must solve all the problems.[91]

Approaches

There is no established unifying theory or paradigm that guides AI research, and researchers disagree about many issues.[92] A few of the longest-standing questions that have remained unanswered are these: Should artificial intelligence simulate natural intelligence by studying psychology or neurology, or is human biology as irrelevant to AI research as bird biology is to aeronautical engineering?[93] Can intelligent behavior be described using simple, elegant principles (such as logic or optimization), or does it necessarily require solving a large number of completely unrelated problems?[94] Can intelligence be reproduced using high-level symbols, similar to words and ideas, or does it require "sub-symbolic" processing?[95] John Haugeland, who coined the term GOFAI (good old-fashioned artificial intelligence), also proposed that AI should more properly be referred to as synthetic intelligence,[96] a term which has since been adopted by some non-GOFAI researchers.[97][98]

Cybernetics and brain simulation

In the 1940s and 1950s, a number of researchers explored the connection between neurology, information theory and cybernetics. Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns Hopkins Beast. Many of these researchers gathered for meetings of the Teleological Society at Princeton University and the Ratio Club in England.[20] By 1960 this approach was largely abandoned, although elements of it would be revived in the 1980s.
Symbolic

When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. The research was centered at three institutions (Carnegie Mellon University, Stanford and MIT), and each developed its own style of research. John Haugeland named these approaches to AI "good old-fashioned AI" or GOFAI.[99] During the 1960s, symbolic approaches achieved great success at simulating high-level thinking in small demonstration programs; approaches based on cybernetics or neural networks were abandoned or pushed into the background.[100] Researchers in the 1960s and 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the goal of their field.

Cognitive simulation: Economist Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques people used to solve problems. This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the middle 1980s.[101][102]

Logic-based: Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms.[93] His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning.[103] Logic was also the focus of work at the University of Edinburgh and elsewhere in Europe, which led to the development of the programming language Prolog and the science of logic programming.[104]

Anti-logic or "scruffy": Researchers at MIT, such as Marvin Minsky and Seymour Papert,[105] found that solving difficult problems in vision and natural language processing required ad hoc solutions; they argued that there was no simple and general principle (like logic) that would capture all the aspects of intelligent behavior. Roger Schank described their "anti-logic" approaches as "scruffy", as opposed to the "neat" paradigms at CMU and Stanford.[94] Commonsense knowledge bases such as Doug Lenat's Cyc are an example of "scruffy" AI, since they must be built by hand, one complicated concept at a time.[106]

Knowledge-based: When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications.[107] This "knowledge revolution" led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software.[30] The knowledge revolution was also driven by the realization that enormous amounts of knowledge would be required by many simple AI applications.
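A minimal sketch of the rule-based inference at the heart of simple expert systems may help here; the rules and facts below are invented, and real systems such as Cyc or commercial expert-system shells are far more elaborate.

```python
# Minimal sketch (toy rules): forward chaining as used by simple
# rule-based expert systems -- keep firing rules whose conditions are
# all known until no new facts can be derived.
rules = [
    ({"fever", "rash"}, "measles_suspected"),
    ({"measles_suspected"}, "refer_to_doctor"),
]
facts = {"fever", "rash"}

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(sorted(facts))  # includes the two derived conclusions
```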
Sub-symbolic

By the 1980s, progress in symbolic AI seemed to stall, and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. A number of researchers began to look into "sub-symbolic" approaches to specific AI problems.[95]

Bottom-up, embodied, situated, behavior-based or nouvelle AI: Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focused on the basic engineering problems that would allow robots to move and survive.[108] Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 1950s and reintroduced the use of control theory in AI. This coincided with the development of the embodied mind thesis in the related field of cognitive science: the idea that aspects of the body (such as movement, perception and visualization) are required for higher intelligence.

Computational intelligence: Interest in neural networks and "connectionism" was revived by David Rumelhart and others in the middle 1980s.[109] These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence.[110]

Statistical

In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes. The shared mathematical language has also permitted a high level of collaboration with more established fields, such as mathematics, economics and operations research. Stuart Russell and Peter Norvig describe this movement as nothing less than a "revolution" and "the victory of the neats".[33] Critics argue that these techniques are too focused on particular problems and have failed to address the long-term goal of general intelligence.[111] There is an ongoing debate about the relevance and validity of statistical approaches in AI, exemplified in part by exchanges between Peter Norvig and Noam Chomsky.[112][113]

Integrating the approaches

Intelligent agent paradigm: An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. The simplest intelligent agents are programs that solve specific problems; more complicated agents include human beings and organizations of human beings (such as firms). The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach: an agent that solves a specific problem can use any approach that works, whether symbolic and logical, sub-symbolic (such as neural networks), or new. The paradigm also gives researchers a common language for communicating with other fields, such as decision theory and economics, that also use concepts of abstract agents. The intelligent agent paradigm became widely accepted during the 1990s.[2]

Agent architectures and cognitive architectures: Researchers have designed systems to build intelligent systems out of interacting intelligent agents in a multi-agent system.[114] A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling.[115] Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system.[116]
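To make the perceive-decide-act cycle behind the intelligent-agent paradigm concrete, here is a deliberately tiny sketch with an invented thermostat-like agent and environment; it illustrates the definition above rather than any particular architecture.

```python
import random

# Minimal sketch (toy environment): an agent maps percepts to actions
# so as to keep a quantity near a target value -- the perceive/decide/act
# cycle behind the intelligent-agent paradigm.
class ThermostatAgent:
    def __init__(self, target):
        self.target = target

    def act(self, percept):             # percept: current temperature
        if percept < self.target - 1:
            return "heat"
        if percept > self.target + 1:
            return "cool"
        return "idle"

temperature = 15.0
agent = ThermostatAgent(target=20.0)
for step in range(10):
    action = agent.act(temperature)                       # decide
    temperature += {"heat": 1.5, "cool": -1.5, "idle": 0.0}[action]
    temperature += random.uniform(-0.2, 0.2)              # environment noise
print(round(temperature, 1))   # settles close to the 20-degree target
```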
Tools

In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. A few of the most general of these methods are discussed below.

Search and optimization

Many problems in AI can be solved in theory by intelligently searching through many possible solutions.[117] Reasoning can be reduced to performing a search: for example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule.[118] Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis.[119] Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.[73] Many learning algorithms use search algorithms based on optimization.

Simple exhaustive searches[120] are rarely sufficient for most real-world problems: the search space (the number of places to search) quickly grows to astronomical size, and the result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that eliminate choices unlikely to lead to the goal (called "pruning the search tree"). Heuristics supply the program with a best guess for the path on which the solution lies,[121] limiting the search for solutions to a smaller set of candidates.[74]

A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, keep moving the guess uphill until we reach the top. Other optimization algorithms are simulated annealing, beam search and random optimization.[122]

Evolutionary computation uses a form of optimization search. For example, it may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization)[123] and evolutionary algorithms (such as genetic algorithms, gene expression programming and genetic programming).[124]
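The "blind hill climbing" picture above can be made concrete in a few lines of code; the objective function below is invented, and the sketch deliberately ignores the restarts and randomization that practical optimizers add.

```python
import random

# Minimal sketch (toy objective): blind hill climbing -- start from a
# random guess and keep taking small steps that improve the objective,
# stopping when no neighbouring step helps.
def objective(x):
    return -(x - 3.0) ** 2 + 5.0     # single peak at x = 3

x = random.uniform(-10.0, 10.0)
step = 0.1
while True:
    neighbours = [x - step, x + step]
    best = max(neighbours, key=objective)
    if objective(best) <= objective(x):
        break                        # local optimum reached
    x = best

print(round(x, 1))   # close to 3.0, the top of the hill
```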
Logic

Logic[125] is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning,[126] and inductive logic programming is a method for learning.[127] Several different forms of logic are used in AI research. Propositional (or sentential) logic[128] is the logic of statements that can be true or false. First-order logic[129] also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic[130] is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply true (1) or false (0); fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer-product control systems. Subjective logic[131] models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a beta distribution; by this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence. Default logics, non-monotonic logics and circumscription[51] are forms of logic designed to help with default reasoning and the qualification problem. Several extensions of logic have been designed to handle specific domains of knowledge, such as description logics;[45] situation calculus, event calculus and fluent calculus (for representing events and time);[46] causal calculus;[47] and belief calculus and modal logics.[48]

Probabilistic methods for uncertain reasoning

Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. AI researchers have devised a number of powerful tools to solve these problems using methods from probability theory and economics.[132] Bayesian networks[133] are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm),[134] learning (using the expectation-maximization algorithm),[135] planning (using decision networks)[136] and perception (using dynamic Bayesian networks).[137] Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g. hidden Markov models or Kalman filters).[137] A key concept from the science of economics is "utility": a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed that analyze how an agent can make choices and plan, using decision theory, decision analysis[138] and information value theory.[57] These tools include models such as Markov decision processes,[139] dynamic decision networks,[137] game theory and mechanism design.[140]

Classifiers and statistical learning methods

The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine the closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class; a class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.[141] A classifier can be trained in various ways; there are many statistical and machine-learning approaches. The most widely used classifiers are the neural network,[142] kernel methods such as the support vector machine,[143] the k-nearest neighbor algorithm,[144] the Gaussian mixture model,[145] the naive Bayes classifier[146] and the decision tree.[147] The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified; there is no single classifier that works best on all given problems (this is also referred to as the "no free lunch" theorem), and determining a suitable classifier for a given problem is still more an art than a science.[148]
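As one hedged example of the classifiers listed above, the sketch below implements a word-count naive Bayes classifier on a toy, invented data set; it is meant only to show how such a classifier picks the most probable class, not how production systems are built.

```python
import math
from collections import Counter, defaultdict

# Minimal sketch (toy data): a naive Bayes text classifier -- pick the
# class that maximizes P(class) * product of P(word | class), estimated
# from word counts with add-one smoothing.
training = [
    ("buy cheap pills now", "spam"),
    ("cheap offer buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday lunch with the team", "ham"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("cheap pills offer"))   # "spam" on this toy data
```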
Neural networks

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain. The study of artificial neural networks[142] began in the decade before the field of AI research was founded, in the work of Walter Pitts and Warren McCulloch. Other important early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who developed the backpropagation algorithm.[149] The main categories of networks are acyclic or feedforward neural networks (where the signal passes in only one direction) and recurrent neural networks (which allow feedback). Among the most popular feedforward networks are perceptrons, multi-layer perceptrons and radial basis networks.[150] Among recurrent networks, the most famous is the Hopfield net, a form of attractor network first described by John Hopfield in 1982.[151] Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning and competitive learning.[152] Hierarchical temporal memory is an approach that models some of the structural and algorithmic properties of the neocortex.[153]

Control theory

Control theory, the grandchild of cybernetics, has many important applications, especially in robotics.[154]

Languages

AI researchers have developed several specialized languages for AI research, including Lisp[155] and Prolog.[156]

Evaluating progress

In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent, now known as the Turing test. This procedure allows almost all the major problems of artificial intelligence to be tested; however, it is a very difficult challenge, and at present all agents fail.[157] Artificial intelligence can also be evaluated on specific problems, such as small problems in chemistry, handwriting recognition and game playing. Such tests have been termed subject-matter-expert Turing tests. Smaller problems provide more achievable goals, and there is an ever-increasing number of positive results.[158] One classification for the outcomes of an AI test is:[159] optimal (it is not possible to perform better); strong super-human (performs better than all humans); super-human (performs better than most humans); and sub-human (performs worse than most humans). For example, performance at draughts is optimal,[160] performance at chess is super-human and nearing strong super-human (see computer chess: computers versus humans),[160] and performance at many everyday tasks (such as recognizing a face or crossing a room without bumping into something) is sub-human. A quite different approach measures machine intelligence through tests developed from mathematical definitions of intelligence. Examples of these kinds of tests began in the late nineties, devising intelligence tests using notions from Kolmogorov complexity and data compression.[161] Two major advantages of mathematical definitions are their applicability to non-human intelligences and the absence of any requirement for human testers. An area to which artificial intelligence has contributed greatly is intrusion detection.[162]

Applications

An automated online assistant providing customer service on a web page is one of many very primitive applications of artificial intelligence. Artificial intelligence techniques are pervasive and too numerous to list. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is described as the AI effect.[163]
Competitions and prizes

There are a number of competitions and prizes to promote research in artificial intelligence. The main areas promoted are general machine intelligence, conversational behavior, data mining, robotic cars, robot soccer and games.

Platforms

A platform (or computing platform) is defined as some sort of hardware architecture or software framework (including application frameworks) that allows software to run. As Rodney Brooks[164] pointed out many years ago, it is not just the artificial-intelligence software that defines the AI features of a platform, but rather the actual platform itself that affects the AI that results; that is, there needs to be work on AI problems on real-world platforms rather than in isolation. A wide variety of platforms has allowed different aspects of AI to develop, ranging from expert systems (albeit PC-based, but still entire real-world systems) to various robot platforms such as the widely available Roomba with its open interface.[165]

Philosophy

Artificial intelligence, by claiming to be able to recreate the capabilities of the human mind, is both a challenge and an inspiration for philosophy. Are there limits to how intelligent machines can be? Is there an essential difference between human intelligence and artificial intelligence? Can a machine have a mind and consciousness? A few of the most influential answers to these questions are given below.[166]

Turing's "polite convention": We need not decide if a machine can "think"; we need only decide if a machine can act as intelligently as a human being. This approach to the philosophical problems associated with artificial intelligence forms the basis of the Turing test.[157]

The Dartmouth proposal: "Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it." This conjecture was printed in the proposal for the Dartmouth conference of 1956 and represents the position of most working AI researchers.[167]

Newell and Simon's physical symbol system hypothesis: "A physical symbol system has the necessary and sufficient means of general intelligent action." Newell and Simon argue that intelligence consists of formal operations on symbols.[168] Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation, and on having a "feel" for the situation rather than explicit symbolic knowledge (see Dreyfus' critique of AI).[169][170]

Gödel's incompleteness theorem: A formal system (such as a computer program) cannot prove all true statements.[171] Roger Penrose is among those who claim that Gödel's theorem limits what machines can do (see The Emperor's New Mind).[172]

Searle's "strong AI hypothesis": "The appropriately programmed computer with the right inputs and outputs would thereby have a mind in exactly the same sense human beings have minds."[173] John Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the "mind" might be.[174]

The artificial brain argument: The brain can be simulated. Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation would be essentially identical to the original.[90]
Predictions and ethics

Artificial intelligence is a common topic in both science fiction and projections about the future of technology and society. The existence of an artificial intelligence that rivals human intelligence raises difficult ethical issues, and the potential power of the technology inspires both hopes and fears.

In fiction, artificial intelligence has appeared fulfilling many roles: a servant (R2-D2 in Star Wars), a law enforcer (K.I.T.T. in Knight Rider), a comrade (Lt. Commander Data in Star Trek: The Next Generation), a conqueror or overlord (The Matrix, Omnius), a dictator (With Folded Hands), a benevolent provider or de facto ruler (The Culture), a supercomputer (the Red Queen in Resident Evil, Gilium in Outlaw Star), an assassin (Terminator), a sentient race (Battlestar Galactica, Transformers, Mass Effect), an extension of human abilities (Ghost in the Shell), and the savior of the human race (R. Daneel Olivaw in Isaac Asimov's Robot series).

Mary Shelley's Frankenstein considers a key issue in the ethics of artificial intelligence: if a machine can be created that has intelligence, could it also feel? If it can feel, does it have the same rights as a human? The idea also appears in modern science fiction, including the films I, Robot, Blade Runner and A.I.: Artificial Intelligence, in which humanoid machines have the ability to feel human emotions. This issue, now known as "robot rights", is currently being considered by, for example, California's Institute for the Future, although many critics believe that the discussion is premature.[175] The subject is profoundly discussed in the 2010 documentary film Plug & Pray.[176]

Martin Ford, author of The Lights in the Tunnel: Automation, Accelerating Technology and the Economy of the Future,[177] and others argue that specialized artificial-intelligence applications, robotics and other forms of automation will ultimately result in significant unemployment as machines begin to match and exceed the capability of workers to perform most routine and repetitive jobs. Ford predicts that many knowledge-based occupations, and in particular entry-level jobs, will be increasingly susceptible to automation via expert systems, machine learning[178] and other AI-enhanced applications. AI-based applications may also be used to amplify the capabilities of low-wage offshore workers, making it more feasible to outsource knowledge work.[179]

Joseph Weizenbaum wrote that AI applications cannot, by definition, successfully simulate genuine human empathy, and that the use of AI technology in fields such as customer service or psychotherapy[180] was deeply misguided. Weizenbaum was also bothered that AI researchers (and some philosophers) were willing to view the human mind as nothing more than a computer program, a position now known as computationalism. To Weizenbaum, these points suggest that AI research devalues human life.[181]

Many futurists believe that artificial intelligence will ultimately transcend the limits of progress. Ray Kurzweil has used Moore's law (which describes the relentless exponential improvement in digital technology) to calculate that desktop computers will have the same processing power as human brains by the year 2029. He also predicts that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science-fiction writer Vernor Vinge named the "singularity".[182] Robot designer Hans Moravec, cyberneticist Kevin Warwick and inventor Ray Kurzweil have predicted that humans and machines will merge in the future into cyborgs that are more capable and powerful than
either.[183] This idea, called transhumanism, has roots in the work of Aldous Huxley and Robert Ettinger and has been illustrated in fiction as well, for example in the manga Ghost in the Shell and the science-fiction series Dune.

In the 1980s, artist Hajime Sorayama's Sexy Robots series was painted and published in Japan, depicting the organic human form with lifelike muscular metallic skins; the Gynoids book followed, and was used by or influenced movie makers including George Lucas and other creatives. Sorayama never considered these organic robots to be a real part of nature but always an unnatural product of the human mind, a fantasy existing in the mind even when realized in actual form. Almost 20 years later, the first AI robotic pet, AIBO, became available as a companion to people. AIBO grew out of Sony's Computer Science Laboratory (CSL); engineer Toshitada Doi is credited as AIBO's original progenitor. In 1994 he began work on robots with artificial-intelligence expert Masahiro Fujita within CSL, and Doi's friend, the artist Hajime Sorayama, was enlisted to create the initial designs for AIBO's body. Those designs are now part of the permanent collections of the Museum of Modern Art and the Smithsonian Institution, and later versions of AIBO have been used in studies at Carnegie Mellon University. In 2006, AIBO was added to Carnegie Mellon University's Robot Hall of Fame.

Political scientist Charles T. Rubin believes that AI can be neither designed nor guaranteed to be friendly.[184] He argues that any sufficiently advanced benevolence may be indistinguishable from malevolence: humans should not assume that machines or robots would treat us favorably, because there is no a priori reason to believe they would be sympathetic to our system of morality, which has evolved along with our particular biology (and which AIs would not share). Edward Fredkin argues that artificial intelligence is the next stage in evolution, an idea first proposed by Samuel Butler's "Darwin among the Machines" (1863) and expanded upon by George Dyson in his 1998 book of the same name.[185]

See also

Outline of artificial intelligence; AI-complete; Artificial intelligence in fiction; Artificial Intelligence (journal); Artificial intelligence (video games); Synthetic intelligence; Cognitive sciences; Developmental robotics; Computer Go; Human Cognome Project; Friendly artificial intelligence; List of basic artificial intelligence topics; List of AI researchers; List of important AI publications; List of AI projects; List of machine learning algorithms; List of emerging technologies; List of scientific journals; Philosophy of mind; Technological singularity; Never-Ending Language Learning.

References

Notes

Definition of AI as the study of intelligent agents: Poole, Mackworth & Goebel 1998, p. 1, which provides the version used in this article (note that they use the term "computational intelligence" as a synonym for artificial intelligence); Russell & Norvig 2003 (p. 55), who prefer the term "rational agent" and write that "the whole-agent view is now widely accepted in the field"; Nilsson 1998.
The intelligent agent paradigm: Russell & Norvig 2003, pp. 27-32, 968-972; Poole, Mackworth & Goebel 1998, pp. 7-21; Luger & Stubblefield 2004, pp. 235-240. The definition used in this article, in terms of goals, actions, perception and environment, is due to Russell & Norvig 2003; other definitions also include knowledge and learning as additional criteria.
Although there is some controversy on this point (see Crevier 1993, p. 50), McCarthy states unequivocally "I came up with the term" in a CNET interview (Skillings 2006); McCarthy first used the term in the proposal for the Dartmouth conference, which appeared in 1955 (McCarthy et al. 1955). McCarthy's definition of AI: McCarthy 2007.
Pamela McCorduck (2004, p. 424) writes of the "rough shattering" of AI into subfields (vision, natural language, decision theory, genetic algorithms, robotics), each with its own sub-subfields that would hardly have anything to say to each other.
The list of intelligent traits is based on the topics covered by the major AI textbooks: Russell & Norvig 2003; Luger & Stubblefield 2004; Poole, Mackworth & Goebel 1998; Nilsson 1998.
General intelligence (strong AI) is discussed in popular introductions to AI: Kurzweil 1999; Kurzweil 2005; see also the Dartmouth proposal under Philosophy.
AI as "the scientific apotheosis of a venerable cultural tradition" and "the urge to forge the gods": McCorduck 2004, pp. xviii, 3, 34, 340-400.
The optimism referred to includes the predictions of early AI researchers as well as the ideas of modern transhumanists such as Ray Kurzweil; the setbacks include the ALPAC report of 1966, the abandonment of perceptrons in 1970, the Lighthill report of 1973 and the collapse of the Lisp machine market in 1987.
AI applications widely used behind the scenes: Russell & Norvig 2003, p. 28; Kurzweil 2005, p. 265; NRC 1999, pp. 216-222.
AI in myth, sacred automata, humanoid automata and artificial beings (Yan Shi, Hero of Alexandria, al-Jazari, von Kempelen, takwin, the Golem, Paracelsus' homunculus) and AI in early science fiction: McCorduck 2004, pp. 4-25; Crevier 1993, p. 1; Russell & Norvig 2003, p. 939; Needham 1986, p. 53; O'Connor 1994; Buchanan 2005, p. 50.
The insight that digital computers can simulate any process of formal reasoning is known as the Church-Turing thesis. Formal reasoning: Berlinski, David (2000), The Advent of the Algorithm, Harcourt Books, ISBN 0-15-601391-6.
AI's immediate precursors (Alan Turing, John von Neumann, Norbert Wiener, Claude Shannon, Warren McCulloch, Walter Pitts, Donald Hebb): McCorduck 2004, pp. 51-107; Crevier 1993, pp. 27-32; Russell & Norvig 2003, pp. 15, 940; Moravec 1988, p. 3.
The Dartmouth conference ("the birth of artificial intelligence") and the hegemony of its attendees: McCorduck 2004, pp. 111-136; Crevier 1993, pp. 47-49; Russell & Norvig 2003, p. 17; NRC 1999, pp. 200-201.
The "golden years" of symbolic AI (1956-1973), DARPA funding of undirected pure research in the 1960s, AI in England, and the optimism of early AI (Simon 1965, p. 96; Minsky 1967, p. 2): McCorduck 2004; Crevier 1993; Moravec 1988; Russell & Norvig 2003; NRC 1999; Howe 1994.
The first AI winter, the Mansfield Amendment and the Lighthill report: Crevier 1993, pp. 115-117; Russell & Norvig 2003, p. 22; NRC 1999, pp. 212-213; Howe 1994.
Expert systems, the boom of the 1980s (the Fifth Generation project, Alvey, MCC, SCI) and the second AI winter: ACM 1998, I.2.1; Russell & Norvig 2003; Luger & Stubblefield 2004; Nilsson 1998, chpt. 17.4; McCorduck 2004; Crevier 1993; NRC 1999.
Formal methods are now preferred ("victory of the neats"): Russell & Norvig 2003, pp. 25-26; McCorduck 2004, pp. 486-487.
DARPA Grand Challenge home page, archive.darpa.mil, retrieved 31 October 2011; Markoff, John (16 February 2011), "On Jeopardy! Watson Win Is All but Trivial", The New York Times; "Kinect's AI breakthrough explained".
Problem solving, puzzle solving, game playing and deduction: Russell & Norvig 2003, chpt. 3-9; Poole, Mackworth & Goebel 1998, chpt. 2-3, 7-9; Luger & Stubblefield 2004, chpt. 3-4, 6, 8; Nilsson 1998, chpt. 7-12.
Uncertain reasoning; intractability, efficiency and the combinatorial explosion: Russell & Norvig 2003, pp. 9, 21-22, 452-644; Poole, Mackworth & Goebel 1998, pp. 345-395; Luger & Stubblefield 2004, pp. 333-381; Nilsson 1998, chpt. 19.
Psychological evidence of sub-symbolic reasoning: Wason & Shapiro 1966 (see Wason selection task); Kahneman, Slovic & Tversky 1982 (see list of cognitive biases); Lakoff & Núñez 2000 (see Where Mathematics Comes From).
Knowledge representation and knowledge engineering; representing categories and relations, events and time, causes and effects, and knowledge about knowledge; ontology: ACM 1998, I.2.4; Russell & Norvig 2003; Poole, Mackworth & Goebel 1998; Luger & Stubblefield 2004; Nilsson 1998, chpt. 17-18.
The qualification problem: McCarthy & Hayes 1969; Russell & Norvig 2003.
Default reasoning and default logic, non-monotonic logics, circumscription, the closed-world assumption and abduction: Russell & Norvig 2003, pp. 354-360; Poole, Mackworth & Goebel 1998; Luger & Stubblefield 2004, pp. 335-363; Nilsson 1998, 18.3.3.
Breadth of commonsense knowledge: Russell & Norvig 2003, p. 21; Crevier 1993, pp. 113-114; Moravec 1988, p. 13; Lenat & Guha 1989.
Expert knowledge as embodied intuition: Dreyfus & Dreyfus 1986; Gladwell 2005; Hawkins & Blakeslee 2005; but see Augusto, Luis M. (2013), "Unconscious representations 1" and "Unconscious representations 2", Axiomathes, doi:10.1007/s10516-012-9206-z and doi:10.1007/s10516-012-9207-y, for the view that human cognition is essentially symbolic.
Planning, classical planning, planning and acting in non-deterministic domains, multi-agent planning and emergent behavior: ACM 1998, I.2.8; Russell & Norvig 2003, pp. 375-459; Poole, Mackworth & Goebel 1998, pp. 281-316; Luger & Stubblefield 2004, pp. 314-329; Nilsson 1998, chpt. 10, 22.
Information value theory: Russell & Norvig 2003, pp. 600-604.
Learning (including Tom Mitchell's widely quoted definition, Turing 1950 on the centrality of learning, and Solomonoff's 1956 report "An Inductive Inference Machine"): ACM 1998, I.2.6; Russell & Norvig 2003, pp. 649-788; Poole, Mackworth & Goebel 1998, pp. 397-438; Luger & Stubblefield 2004, pp. 385-542; Nilsson 1998.
Reinforcement learning: Russell & Norvig 2003, pp. 763-788; Luger & Stubblefield 2004, pp. 442-449.
Developmental robotics: Weng et al. 2001; Lungarella et al. 2003; Asada et al. 2009; Oudeyer 2010.
Natural language processing and its applications (information retrieval, text mining, machine translation): ACM 1998, I.2.7; Russell & Norvig 2003, pp. 790-831, 840-857; Poole, Mackworth & Goebel 1998, pp. 91-104; Luger & Stubblefield 2004, pp. 591-632.
Robotics; moving and configuration space; robotic mapping and localization: ACM 1998, I.2.9; Russell & Norvig 2003, pp. 901-942; Poole, Mackworth & Goebel 1998, pp. 443-460; Tecuci, G. (2012), "Artificial intelligence", WIREs Computational Statistics 4: 168-180, doi:10.1002/wics.200.
Machine perception, computer vision, speech recognition and object recognition: ACM 1998, I.2.7, I.2.10; Russell & Norvig 2003, pp. 537-581, 863-898; Nilsson 1998, chpt. 6.
Affective computing: Kismet, MIT Artificial Intelligence Laboratory, Humanoid Robotics Group; Thro 1993; Edelson 1991; Tao & Tan 2005; James 1884; Picard 1995 (the MIT technical report credited with creating the field); Kleine-Cosack 2006; Diamond 2003; Minsky 2006.
Artificial consciousness and the artificial brain argument: Aleksander 1995; Edelman 2007; Russell & Norvig 2003, p. 957; Crevier 1993, pp. 271, 279; Moravec 1988; Kurzweil 2005, p. 262; Hawkins & Blakeslee 2005.
AI-complete: Shapiro 1992, p. 9. Disagreement about what AI is all about: Nilsson 1983, p. 10.
Biological intelligence vs. intelligence in general: Russell & Norvig 2003, pp. 2-3; McCorduck 2004, pp. 100-101; Kolata 1982; Maker 2006.
Neats vs. scruffies: McCorduck 2004; Crevier 1993, p. 168; Nilsson 1983, pp. 10-11.
Symbolic vs. sub-symbolic AI: Nilsson 1998, p. 7; Haugeland 1985, pp. 112-117, 255; Wang, Pei (2008), Artificial General Intelligence 2008, IOS Press, p. 63, ISBN 978-1-58603-833-5.
Cognitive simulation, Newell and Simon, AI at CMU (then Carnegie Tech), EPAM and the history of Soar: McCorduck 2004; Crevier 1993.
McCarthy and AI research at SAIL and SRI International; AI research at Edinburgh and in France and the birth of Prolog; AI at MIT under Marvin Minsky in the 1960s: McCorduck 2004; Crevier 1993; Howe 1994; Russell & Norvig 2003, p. 19.
Cyc: McCorduck 2004, p. 489; Crevier 1993, pp. 239-243; Russell & Norvig 2003, pp. 363-365; Lenat & Guha 1989.
The knowledge revolution: McCorduck 2004; Russell & Norvig 2003, pp. 22-23.
Embodied approaches to AI: McCorduck 2004, pp. 454-462; Brooks 1990; Moravec 1988.
Revival of connectionism: Crevier 1993, pp. 214-215; Russell & Norvig 2003, p. 25.
Computational intelligence: IEEE Computational Intelligence Society; Langley, Pat (2011), "The changing science of machine learning", Machine Learning 82(3): 275-279, doi:10.1007/s10994-011-5242-y.
The Norvig-Chomsky debate: Katz, Yarden, "Noam Chomsky on Where Artificial Intelligence Went Wrong", The Atlantic, 1 November 2012; Norvig, Peter, "On Chomsky and the Two Cultures of Statistical Learning".
Agent architectures, hybrid intelligent systems and hierarchical control systems: Russell & Norvig 2003; Nilsson 1998, chpt. 25; Albus, J. S., "4-D/RCS: A reference model architecture for unmanned ground vehicles", Proceedings of the SPIE AeroSense Session on Unmanned Ground Vehicle Technology, vol. 3693, pp. 11-20.
Search algorithms; logical deduction as search; state-space search and planning; uninformed, heuristic and optimization searches: Russell & Norvig 2003, pp. 59-189; Poole, Mackworth & Goebel 1998, pp. 113-163; Luger & Stubblefield 2004, pp. 79-164; Nilsson 1998, chpt. 7-12.
Artificial life and society-based learning; genetic programming and genetic algorithms: Luger & Stubblefield 2004, pp. 509-541; Nilsson 1998, chpt. 4.2; Holland, John H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, ISBN 0-262-58111-6; Koza, John R. (1992), Genetic Programming, MIT Press, ISBN 0-262-11170-5; Poli, Langdon & McPhee (2008), A Field Guide to Genetic Programming, Lulu.com, ISBN 978-1-4092-0073-4 (freely available from www.gp-field-guide.org.uk).
Logic; satplan; explanation-based, relevance-based and inductive logic programming and case-based reasoning; propositional and first-order logic; fuzzy logic: ACM 1998, I.2.3-4; Russell & Norvig 2003, pp. 194-310, 402-407, 526-527, 678-710; Poole, Mackworth & Goebel 1998; Luger & Stubblefield 2004; Nilsson 1998, chpt. 13-21.
Stochastic methods for uncertain reasoning; Bayesian networks, Bayesian inference, Bayesian learning and the expectation-maximization algorithm, decision networks; stochastic temporal models (dynamic Bayesian networks, hidden Markov models, Kalman filters); decision theory and decision analysis; Markov decision processes and dynamic decision networks; game theory and mechanism design: ACM 1998, I.2.3; Russell & Norvig 2003, pp. 462-644; Poole, Mackworth & Goebel 1998, pp. 345-395; Luger & Stubblefield 2004; Nilsson 1998, chpt. 19-20.
Statistical learning methods and classifiers (neural networks, kernel methods such as the support vector machine, k-nearest neighbor, Gaussian mixture model, naive Bayes, decision tree): Russell & Norvig 2003, pp. 653-664, 712-754; Luger & Stubblefield 2004; Poole, Mackworth & Goebel 1998; classifier performance: van der Walt & Bernard 2006.
Neural networks and connectionism; backpropagation; feedforward, recurrent and attractor networks; competitive and Hebbian learning; hierarchical temporal memory: Russell & Norvig 2003, pp. 736-748, 758; Luger & Stubblefield 2004, pp. 453-505; Nilsson 1998, chpt. 3; Hawkins & Blakeslee 2005.
Control theory: ACM 1998, I.2.8; Russell & Norvig 2003, pp. 926-932.
Lisp and Prolog: Luger & Stubblefield 2004; Crevier 1993, pp. 59-62; Russell & Norvig 2003, p. 18; Poole, Mackworth & Goebel 1998, pp. 477-491.
The Turing test (Turing's original publication, historical influence and philosophical implications): Turing 1950; Haugeland 1985, pp. 6-9; Crevier 1993, p. 24; McCorduck 2004, pp. 70-71; Russell & Norvig 2003, pp. 2-3, 948; Rajani, Sandeep (2011), "Artificial intelligence: man or machine", International Journal of Information Technology and Knowledge Management 4(1): 173-176.
Mathematical definitions of intelligence: Hernández-Orallo 2000; Dowe & Hajek 1997; Hernández-Orallo & Dowe 2010.
Kumar 2012; "AI set to exceed human brain power", CNN, 26 July 2006.
Brooks, R. A. (1991), "How to build complete creatures rather than isolated cognitive simulators", in K. VanLehn (ed.), Architectures for Intelligence, pp. 225-239, Lawrence Erlbaum Associates; "Hacking Roomba".
Philosophy of AI: all of the positions in this section are mentioned in standard discussions of the subject, such as Russell & Norvig 2003, pp. 947-960, and Fearn 2007, pp. 38-55.
The Dartmouth proposal: McCarthy et al. 1955 (the original proposal); Crevier 1993, p. 49 (historical significance).
The physical symbol systems hypothesis: Newell & Simon 1976, p. 116; McCorduck 2004, p. 153; Russell & Norvig 2003, p. 18; Dreyfus' criticism of its "psychological assumption": Dreyfus 1992, p. 156.
Dreyfus' critique of artificial intelligence: Dreyfus 1972; Dreyfus & Dreyfus 1986; Crevier 1993, pp. 120-132; McCorduck 2004, pp. 211-239; Russell & Norvig 2003, pp. 950-952.
The mathematical objection (Gödel's theorems) and its refutations: Russell & Norvig 2003, p. 949; McCorduck 2004, pp. 448-449; Lucas 1961; Penrose 1989; Turing 1950; Hofstadter 1979; background in Gödel 1931, Church 1936, Kleene 1935, Turing 1937.
Searle's strong AI hypothesis: this version is from Searle 1999 and is also quoted in Dennett 1991, p. 435; Searle's original formulation appears in Searle 1980, p. 1. Strong AI is defined similarly by Russell & Norvig 2003, p. 947, who call the assertion that machines could possibly act intelligently the "weak AI" hypothesis, and the assertion that machines doing so are actually thinking (as opposed to simulating thinking) the "strong AI" hypothesis.
searle s chinese room argument searle 1980 searle s original presentation of the thought experiment searle 1999 discussion russell amp norvig 2003 pp 160 958 960 mccorduck 2004 pp 160 443 445 crevier 1993 pp 160 269 271 robot rights russell amp norvig 2003 p 160 964 robots could demand legal rights bbc news 21 december 2006 retrieved 3 february 2011 160 prematurity of henderson mark 24 april 2007 human rights for robots we re getting carried away the times online london 160 dead link in fiction mccorduck 2004 p 160 190 25 discusses frankenstein and identifies the key ethical issues as scientific hubris and the suffering of the monster i e robot rights independent documentary plug amp pray featuring joseph weizenbaum and raymond kurzweil ford martin r 2009 the lights in the tunnel automation accelerating technology and the economy of the future acculant publishing isbn 160 978 1448659814 e book available free online 160 machine learning a job killer ai could decrease the demand for human labor russell amp norvig 2003 pp 160 960 961 ford martin 2009 the lights in the tunnel automation accelerating technology and the economy of the future acculant publishing isbn 160 978 1 4486 5981 4 160 in the early 1970s kenneth colby presented a version of weizenbaum s eliza known as doctor which he promoted as a serious therapeutic tool crevier 1993 pp 160 132 144 joseph weizenbaum s critique of ai weizenbaum 1976 crevier 1993 pp 160 132 144 mccorduck 2004 pp 160 356 373 russell amp norvig 2003 p 160 961 weizenbaum the ai researcher who developed the first chatterbot program eliza argued in 1976 that the misuse of artificial intelligence has the potential to devalue human life technological singularity vinge 1993 kurzweil 2005 russell amp norvig 2003 p 160 963 transhumanism moravec 1988 kurzweil 2005 russell amp norvig 2003 p 160 963 rubin charles spring 2003 artificial intelligence and human nature the new atlantis 1 88 100 160 ai as evolution edward fredkin is quoted in mccorduck 2004 p 160 401 butler samuel 13 june 1863 darwin among the machines the press christchurch new zealand 160 wikilink embedded in url title help letter to the editor dyson george 1998 darwin among the machiens allan lane science isbn 160 0 7382 0030 1 160 references edit ai textbooks edit luger george stubblefield william 2004 artificial intelligence structures and strategies for complex problem solving 5th ed the benjamin cummings publishing company inc isbn 160 0 8053 4780 1 160 neapolitan richard jiang xia 2012 contemporary artificial intelligence chapman amp hall crc isbn 160 978 1 4398 4469 4 160 nilsson nils 1998 artificial intelligence a new synthesis morgan kaufmann publishers isbn 160 978 1 55860 467 4 160 russell stuart j norvig peter 2003 artificial intelligence a modern approach 2nd ed upper saddle river new jersey prentice hall isbn 160 0 13 790395 2 160 poole david mackworth alan goebel randy 1998 computational intelligence a logical approach new york oxford university press isbn 160 0 19 510270 3 160 winston patrick henry 1984 artificial intelligence reading massachusetts addison wesley isbn 160 0 201 08259 4 160 history of ai edit crevier daniel 1993 ai the tumultuous search for artificial intelligence new york ny basicbooks isbn 0 465 02997 3 160 mccorduck pamela 2004 machines who think 2nd ed natick ma a k peters ltd isbn 160 1 56881 205 1 160 nilsson nils 2010 the quest for artificial intelligence a history of ideas and achievements new york isbn 160 978 0 521 12293 1 160 unknown parameter publishier ignored 
help other sources edit acm computing classification system artificial intelligence acm 1998 retrieved 30 august 2007 160 aleksander igor 1995 artificial neuroconsciousness an update iwann archived from the original on 2 march 1997 160 bibtex internet archive brooks rodney 1990 elephants don t play chess pdf robotics and autonomous systems 6 3 15 doi 10 1016 s0921 8890 05 80025 9 archived from the original on 9 august 2007 retrieved 30 august 2007 160 buchanan bruce g 2005 a very brief history of artificial intelligence pdf ai magazine 53 60 archived from the original on 26 september 2007 retrieved 30 august 2007 160 dennett daniel 1991 consciousness explained the penguin press isbn 160 0 7139 9037 6 160 dreyfus hubert 1972 what computers can t do new york mit press isbn 160 0 06 011082 1 160 dreyfus hubert 1979 what computers still can t do new york mit press isbn 160 0 262 04134 0 160 dreyfus hubert dreyfus stuart 1986 mind over machine the power of human intuition and expertise in the era of the computer oxford uk blackwell isbn 160 0 02 908060 6 160 dreyfus hubert 1992 what computers still can t do new york mit press isbn 160 0 262 54067 3 160 edelman gerald 23 november 2007 gerald edelman neural darwinism and brain based devices talking robots 160 fearn nicholas 2007 the latest answers to the oldest questions a philosophical adventure with the world s greatest thinkers new york grove press isbn 160 0 8021 1839 9 160 forster dion 2006 self validating consciousness in strong artificial intelligence an african theological contribution pretoria university of south africa 160 gladwell malcolm 2005 blink new york little brown and co isbn 160 0 316 17232 4 160 haugeland john 1985 artificial intelligence the very idea cambridge mass mit press isbn 160 0 262 08153 9 160 hawkins jeff blakeslee sandra 2005 on intelligence new york ny owl books isbn 160 0 8050 7853 3 160 hofstadter douglas 1979 g del escher bach an eternal golden braid new york ny vintage books isbn 160 0 394 74502 7 160 howe j november 1994 artificial intelligence at edinburgh university a perspective retrieved 30 august 2007 160 kahneman daniel slovic d tversky amos 1982 judgment under uncertainty heuristics and biases new york cambridge university press isbn 160 0 521 28414 7 160 kolata g 1982 how can computers get common sense science 217 4566 1237 1238 doi 10 1126 science 217 4566 1237 pmid 160 17837639 160 kurzweil ray 1999 the age of spiritual machines penguin books isbn 160 0 670 88217 8 160 kurzweil ray 2005 the singularity is near penguin books isbn 160 0 670 03384 7 160 lakoff george 1987 women fire and dangerous things what categories reveal about the mind university of chicago press isbn 160 0 226 46804 6 160 lakoff george n ez rafael e 2000 where mathematics comes from how the embodied mind brings mathematics into being basic books isbn 160 0 465 03771 2 160 lenat douglas guha r v 1989 building large knowledge based systems addison wesley isbn 160 0 201 51752 3 160 lighthill professor sir james 1973 artificial intelligence a general survey artificial intelligence a paper symposium science research council 160 lucas john 1961 minds machines and g del in anderson a r minds and machines archived from the original on 19 august 2007 retrieved 30 august 2007 160 maker meg houston 2006 ai 50 ai past present future dartmouth college archived from the original on 8 october 2008 retrieved 16 october 2008 160 mccarthy john minsky marvin rochester nathan shannon claude 1955 a proposal for the dartmouth summer research project 
on artificial intelligence archived from the original on 26 august 2007 retrieved 30 august 2007 160 mccarthy john hayes p j 1969 some philosophical problems from the standpoint of artificial intelligence machine intelligence 4 463 502 archived from the original on 10 august 2007 retrieved 30 august 2007 160 mccarthy john 12 november 2007 what is artificial intelligence 160 minsky marvin 1967 computation finite and infinite machines englewood cliffs n j prentice hall isbn 160 0 13 165449 7 160 minsky marvin 2006 the emotion machine new york ny simon amp schusterl isbn 160 0 7432 7663 9 160 moravec hans 1976 the role of raw power in intelligence retrieved 30 august 2007 160 moravec hans 1988 mind children harvard university press isbn 160 0 674 57616 0 160 nrc united states national research council 1999 developments in artificial intelligence funding a revolution government support for computing research national academy press 160 needham joseph 1986 science and civilization in china volume 2 caves books ltd 160 newell allen simon h a 1963 gps a program that simulates human thought in feigenbaum e a feldman j computers and thought new york mcgraw hill 160 newell allen simon h a 1976 computer science as empirical inquiry symbols and search communications of the acm 19 3 160 nilsson nils 1983 artificial intelligence prepares for 2001 ai magazine 1 1 160 presidential address to the association for the advancement of artificial intelligence penrose roger 1989 the emperor s new mind concerning computer minds and the laws of physics oxford university press isbn 160 0 19 851973 7 160 searle john 1980 minds brains and programs behavioral and brain sciences 3 3 417 457 doi 10 1017 s0140525x00005756 160 searle john 1999 mind language and society new york ny basic books isbn 160 0 465 04521 9 oclc 160 231867665 43689264 160 serenko alexander detlor brian 2004 intelligent agents as innovations ai and society 18 4 364 381 doi 10 1007 s00146 004 0310 5 160 serenko alexander ruhi umar cocosila mihail 2007 unplanned effects of intelligent agents on internet use social informatics approach ai and society 21 1 2 141 166 doi 10 1007 s00146 006 0051 8 160 shapiro stuart c 1992 artificial intelligence in shapiro stuart c encyclopedia of artificial intelligence 2nd ed new york john wiley pp 160 54 57 isbn 160 0 471 50306 1 160 simon h a 1965 the shape of automation for men and management new york harper amp row 160 skillings jonathan 3 july 2006 getting machines to think like us cnet retrieved 3 february 2011 160 tecuci gheorghe march april 2012 artificial intelligence wiley interdisciplinary reviews computational statistics wiley 4 2 168 180 doi 10 1002 wics 200 160 accessdate requires url help turing alan october 1950 computing machinery and intelligence mind lix 236 433 460 doi 10 1093 mind lix 236 433 issn 160 0026 4423 retrieved 2008 08 18 160 van der walt christiaan bernard etienne 2006 lt year is presumed based on acknowledgements at the end of the article gt data characteristics that determine classifier performance pdf retrieved 5 august 2009 160 vinge vernor 1993 the coming technological singularity how to survive in the post human era 160 wason p c shapiro d 1966 reasoning in foss b m new horizons in psychology harmondsworth penguin 160 weizenbaum joseph 1976 computer power and human reason san francisco w h freeman amp company isbn 160 0 7167 0464 1 160 kumar gulshan krishan kumar 2012 the use of artificial intelligence based ensembles for intrusion detection a review applied computational 
intelligence and soft computing 2012 1 20 doi 10 1155 2012 850160 retrieved 11 february 2013 further reading edit techcast article series john sagi framing consciousness boden margaret mind as machine oxford university press 2006 johnston john 2008 the allure of machinic life cybernetics artificial life and the new ai mit press myers courtney boyd ed 2009 the ai report forbes june 2009 serenko alexander 2010 the development of an ai journal ranking based on the revealed preference approach pdf journal of informetrics 4 4 447 459 doi 10 1016 j joi 2010 04 001 sun r and bookman l eds computational architectures integrating neural and symbolic processes kluwer academic publishers needham ma 1994 external links edit what is ai an introduction to artificial intelligence by ai founder john mccarthy the handbook of artificial intelligence volume by avron barr and edward a feigenbaum stanford university logic and artificial intelligence entry by richmond thomason in the stanford encyclopedia of philosophy ai at the open directory project aitopics a large directory of links and other resources maintained by the association for the advancement of artificial intelligence the leading organization of academic ai researchers artificial intelligence discussion group
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_for_Computing_Machinery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_for_Computing_Machinery new file mode 100644 index 00000000..0f9d2b2e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_for_Computing_Machinery @@ -0,0 +1 @@ +association for computing machinery formation 1947 type 501 c 3 not for profit membership corporation headquarters new york city membership 100 000 president vint cerf website www acm org the association for computing machinery acm is a u s based international learned society for computing it was founded in 1947 and is the world s largest and most prestigious 1 scientific and educational computing society it is a not for profit professional membership group 2 its membership is more than 100 000 as of 2011 its headquarters are in new york city the acm and the ieee computer society are the primary us umbrella organizations for academic and scholarly interests in computing unlike the ieee the acm is solely dedicated to computing contents 1 activities 2 services 3 digital library 4 competition 5 membership grades 5 1 fellows 5 2 other membership grades 6 chapters 6 1 special interest groups 6 2 professional chapters 6 3 student chapters 7 conferences 8 awards 9 leadership 10 infrastructure 11 acm w association for computing machinery committee on women 11 1 athena lectures 12 publications 13 see also 14 references 15 external links activities edit two penn plaza site of the acm headquarters in new york city acm is organized into over 170 local chapters and 35 special interest groups sigs through which it conducts most of its activities additionally there are over 500 college and university chapters the first student chapter was founded in 1961 at the university of louisiana at lafayette many of the sigs like siggraph sigplan sigcse and sigcomm sponsor regular conferences which have become famous as the dominant venue for presenting innovations in certain fields the groups also publish a large number of specialized journals magazines and newsletters acm also sponsors other computer science related events such as the worldwide acm international collegiate programming contest icpc and has sponsored some other events such as the chess match between garry kasparov and the ibm deep blue computer services edit acm press publishes a prestigious citation needed academic journal journal of the acm and general magazines for computer professionals communications of the acm also known as communications or cacm and queue other publications of the acm include acm xrds formerly crossroads and renamed and designed in 2010 the most popular student computing magazine in the us acm interactions an interdisciplinary hci publication focused on the connections between experiences people
and technology and the third largest acm publication 3 acm computing surveys csur a number of journals specific to subfields of computer science titled acm transactions some of the more notable transactions include acm transactions on computer systems tocs ieee acm transactions on computational biology and bioinformatics tcbb acm transactions on computational logic tocl acm transactions on computer human interaction tochi acm transactions on database systems tods acm transactions on graphics tog acm transactions on mathematical software toms acm transactions on multimedia computing communications and applications tomccap ieee acm transactions on networking ton acm transactions on programming languages and systems toplas although communications no longer publishes primary research and is not considered a prestigious venue many of the great debates and results in computing history have been published in its pages acm has made almost all of its publications available to paid subscribers online at its digital library and also has a guide to computing literature individual members additionally have access to safari books online and books24x7 the acm also offers insurance online courses and other services to its members digital library edit the acm digital library a part of the acm portal contains a comprehensive archive of the organization s journals magazines and conference proceedings online services include a forum called ubiquity and tech news digest acm requires the copyright of all submissions to be assigned to the organization as a condition of publishing the work 4 authors may post the documents on their own websites but they are required to link back to the digital library s reference page for the paper though authors are not allowed to charge for access to copies of their work downloading a copy from the acm site requires a paid subscription competition edit acm s primary historical competitor has been the ieee computer society which is the largest subgroup of the institute of electrical and electronics engineers the ieee focuses more on hardware and standardization issues than theoretical computer science but there is considerable overlap with acm s agenda they occasionally cooperate on projects like developing computing curricula 5 some of the major awards in computer science are given jointly by acm and the ieee cs 6 there is also a mounting challenge to the acm s publication practices coming from the open access movement some authors see a centralized peer review process as less relevant and publish on their home pages or on unreviewed sites like arxiv other organizations have sprung up which do their peer review entirely free and online such as journal of artificial intelligence research jair journal of machine learning research jmlr and the journal of research and practice in information technology membership grades edit in addition to student and regular members acm has several advanced membership grades to recognize those with multiple years of membership and demonstrated performance that sets them apart from their peers 7 fellows edit see also category fellows of the association for computing machinery the acm fellows program was established by council of the association for computing machinery in 1993 to recognize and honor outstanding acm members for their achievements in computer science and information technology and for their significant contributions to the mission of the acm there are presently about 500 fellows 8 out of about 60 000 professional members other 
membership grades edit in 2006 acm began recognizing two additional membership grades senior members have ten or more years of professional experience and 5 years of continuous acm membership distinguished engineers and distinguished scientists have at least 15 years of profession experience and 5 years of continuous acm membership and who have made a significant impact on the computing field chapters edit acm has three kinds of chapters special interest groups 9 professional chapters and student chapters 10 special interest groups edit sigaccess accessible computing sigact algorithms and computation theory sigada ada programming language sigapp applied computing sigarch computer architecture sigart artificial intelligence sigbed embedded systems sigcas computers and society sigchi computer human interaction sigcomm data communication sigcse computer science education sigda design automation sigdoc design of communication sigecom electronic commerce sigevo genetic and evolutionary computation siggraph computer graphics and interactive techniques sighpc high performance computing sigir information retrieval sigite information technology education sigkdd knowledge discovery and data mining sigmetrics measurement and evaluation sigmicro microarchitecture sigmis management information systems sigmm multimedia sigmobile mobility of systems users data and computing sigmod management of data sigops operating systems sigplan programming languages sigsac security audit and control sigsam symbolic and algebraic manipulation sigsim simulation and modeling sigsoft software engineering sigspatial spatial information siguccs university and college computing services sigweb hypertext hypermedia and web professional chapters edit as of 2011 acm has professional amp sig chapters in 56 countries 11 student chapters edit as of 2011 there exist acm student chapters in 38 different countries 12 these chapters include acm student chapter isi kolkata ascisik acm student chapter feu east asia college aristotle university of thessaloniki auth acm birla institute of technology mesra bit acm birla institute of technology and science bits acm brock university baldwin wallace university california state university long beach csulbacm california state university sacramento csusacm college of engineering guindy anna university au ceg acm cornell university acsu florida state university georgia institute of technology gtacm heritage institute of technology kolkata acm hitk indian institute of technology delhi acm iitd johns hopkins university jhuacm lehigh university louisiana state university acm lsu mississippi state university national institute of technology trichy national institute of technology calicut nit calicut acm student chapter national institute of technology surat nit surat acm student chapter national university of computer amp emerging sciences nuces acm national university of singapore nus student chapter of the acm new jersey institute of technology north carolina state university peirce college pennsylvania state university portland state university pdx acm purdue university psg college of technology psg tech acm rochester institute of technology rit acm stanford university southern illinois university carbondale texas lutheran university university at buffalo ub acm university of alabama in huntsville university of arizona uofa acm university of california irvine acm uci university of california los angeles ucla acm university of california san diego cses university of california santa barbara ucsb 
acm university of california santa cruz ucsc acm university of illinois chicago uic acm university of illinois at urbana champaign acm uiuc university of iowa university of kurdistan iran university of massachusetts amherst umass acm university of minnesota twin cities uofm university of missouri mizzou acm university of the philippines upacm university of south alabama acmusa university of tehran utacm university of texas austin utacm university of texas pan american acmutpa universit t ulm vidyalankar institute of technology mumbai dwarkadas j sanghvi college of engineering mumbai washington university in st louis wu acm washington state university wsu acm western washington university wwu acm worcester polytechnic institute wpiacm yeshwantrao chavan college of engineering ycce usha mittal institute of technology conferences edit the acm sponsors numerous conferences listed below most of the special interest groups also have an annual conference acm conferences are often very popular publishing venues and are therefore very competitive for example the 2007 siggraph conference attracted about 30000 visitors and cikm only accepted 15 of the long papers that were submitted in 2005 chi conference on human factors in computing systems cikm conference on information and knowledge management 13 dac design automation conference debs distributed event based systems fcrc federated computing research conference gecco genetic and evolutionary computation conference 14 sc supercomputing conference siggraph international conference on computer graphics and interactive techniques hypertext conference on hypertext and hypermedia 15 jcdl joint conference on digital libraries 16 oopsla conference on object oriented programming systems languages and applications www world wide web conference the acm is a co presenter and founding partner of the grace hopper celebration of women in computing ghc with the anita borg institute for women and technology 17 there are some conferences hosted by acm student branches this includes reflections projections which is hosted by uiuc acm citation needed awards edit the acm presents or co presents a number of awards for outstanding technical and professional achievements and contributions in computer science and information technology 18 a m turing award acm infosys foundation award in the computing sciences distinguished service award doctoral dissertation award eckert mauchly award gordon bell prize grace murray hopper award paris kanellakis theory and practice award karl v karlstrom outstanding educator award acm ieee cs ken kennedy award eugene l lawler award outstanding contribution to acm award allen newell award acm presidential award siam acm prize in computational science and engineering software system award acm programming systems and languages paper award acm w athena lecturer award leadership edit the president of the acm for 2012 2014 19 is vint cerf an american computer scientist who is recognized as one of the fathers of the internet he is the successor of alain chesnais 2010 2012 20 a french citizen living in toronto where he runs his company named visual transitions and wendy hall of the university of southampton acm is led by a council consisting of the president vice president treasurer past president sig governing board chair publications board chair three representatives of the sig governing board and seven members at large this institution is often referred to simply as council in communications of the acm infrastructure edit acm has five boards 
that make up various committees and subgroups to help headquarters staff maintain quality services and products these boards are as follows publications board sig governing board education board membership services board professions board acm w association for computing machinery committee on women edit acm w the acm s committee on women in computing is set up to support inform celebrate and work with women in computing dr anita borg was a great supporter of acm w acm w provides various resources for women in computing as well as high school girls interested in the field acm w also reaches out internationally to those women who are involved and interested in computing athena lectures edit the acm w holds annual athena lectures to honor outstanding women researchers who have made fundamental contributions to computer science starting from 2006 speakers are nominated by sig officers 21 2006 2007 professor deborah estrin of ucla 2007 2008 professor karen sp rck jones of cambridge university 2008 2009 professor shafi goldwasser of mit and the weitzmann institute of science 2009 2010 susan eggers of the university of washington 2010 2011 mary jane irwin of the pennsylvania state university 2011 2012 judith s olson of the university of california irvine 2012 2013 nancy lynch of mit publications edit in 1997 acm press published wizards and their wonders portraits in computing isbn 0897919602 written by christopher morgan with new photographs by louis fabian bachrach the book is a collection of historic and current portrait photographs of figures from the computer industry see also edit computer science portal computing portal acm classification scheme association of information technology professionals bernard galler former president category presidents of the association for computing machinery computer science computing edmund berkeley co founder franz alt former president grace murray hopper award awarded by the acm institution of analysts and programmers ken kennedy award awarded by acm and the ieee computer society timeline of computing 2400 bc 1949 turing award references edit indiana university media relations indiana edu retrieved 2012 10 02 160 acm 501 c 3 status as a group irs gov retrieved 2012 10 01 160 wakkary r stolterman e 2011 welcome our first interactions interactions 18 5 doi 10 1145 1897239 1897240 160 edit acm copyright policy acm org 160 joint task force of association for computing machinery acm association for information systems ais and ieee computer society ieee cs computing curricula 2005 the overview report 160 see e g ken kennedy award acm senior members an overview acm org 160 list of acm fellows fellows acm org retrieved 2012 06 07 160 acm special interest groups archived from the original on july 27 2010 lt dashbot gt retrieved august 7 2010 160 acm chapters retrieved august 7 2010 160 worldwide professional chapters association for computing machinery acm retrieved 2012 12 27 160 student chapters http campus acm org public chapters geo_listing index cfm ct student amp inus 0 conference on information and knowledge management cikm cikmconference org 160 gecco 2009 sigevo org 160 hypertext 2009 ht2009 org 160 joint conference on digital library jcdl home jcdl 160 grace hopper celebration of women in computing largest gathering of women in computing attracts researchers industry retrieved june 27 2011 160 acm awards retrieved april 26 2012 160 acm elects vint cerf as president acm org may 25 2012 160 acm elects new leaders committed to expanding international 
initiatives acm org june 9 2010 athena talks at acm w retrieved 10 january 2013 external links edit official website acm portal for publications acm digital library association for computing machinery records 1947 2009 charles babbage institute university of minnesota \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_learning new file mode 100644 index 00000000..28719e0e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_learning @@ -0,0 +1 @@ +association rule learning in data mining association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases it is intended to identify strong rules discovered in databases using different measures of interestingness 1 based on the concept of strong rules rakesh agrawal et al 2 introduced association rules for discovering regularities between products in large scale transaction data recorded by point of sale pos systems in supermarkets for example the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together he or she is likely to also buy hamburger meat such information can be used as the basis for decisions about marketing activities such as e g promotional pricing or product placements in addition to the above example from market basket analysis association rules are employed today in many application areas including web usage mining intrusion detection continuous production and bioinformatics as opposed to sequence mining association rule learning typically does not
consider the order of items either within a transaction or across transactions contents 1 definition 2 useful concepts 3 process 4 history 5 alternative measures of interestingness 6 statistically sound associations 7 algorithms 7 1 apriori algorithm 7 2 eclat algorithm 7 3 fp growth algorithm 7 4 guha procedure assoc 7 5 opus search 8 lore 9 other types of association mining 10 see also 11 references 12 external links 12 1 bibliographies 12 2 implementations definition edit example database with 4 items and 5 transactions (columns milk, bread, butter, beer) transaction 1: 1 1 0 0 transaction 2: 0 0 1 0 transaction 3: 0 0 0 1 transaction 4: 1 1 1 0 transaction 5: 0 1 0 0 following the original definition by agrawal et al 2 the problem of association rule mining is defined as let i = {i1, i2, ..., in} be a set of n binary attributes called items let d = {t1, t2, ..., tm} be a set of transactions called the database each transaction in d has a unique transaction id and contains a subset of the items in i a rule is defined as an implication of the form x => y where x and y are disjoint itemsets drawn from i the sets of items for short itemsets x and y are called antecedent left hand side or lhs and consequent right hand side or rhs of the rule respectively to illustrate the concepts we use a small example from the supermarket domain the set of items is i = {milk, bread, butter, beer} and a small database containing the items 1 codes presence and 0 absence of an item in a transaction is shown in the table above an example rule for the supermarket could be {butter, bread} => {milk} meaning that if butter and bread are bought customers also buy milk note this example is extremely small in practical applications a rule needs a support of several hundred transactions before it can be considered statistically significant and datasets often contain thousands or millions of transactions useful concepts edit to select interesting rules from the set of all possible rules constraints on various measures of significance and interest can be used the best known constraints are minimum thresholds on support and confidence the support supp(x) of an itemset x is defined as the proportion of transactions in the data set which contain the itemset in the example database the itemset {milk, bread, butter} has a support of 1/5 = 0.2 since it occurs in 20% of all transactions 1 out of 5 transactions the confidence of a rule is defined as conf(x => y) = supp(x ∪ y) / supp(x) for example the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database which means that for 50% of the transactions containing milk and bread the rule is correct 50% of the times a customer buys milk and bread butter is bought as well be careful when reading the expression here supp(x ∪ y) means support for occurrences of transactions where x and y both appear not support for occurrences of transactions where either x or y appears the latter interpretation arising because set union is equivalent to logical disjunction the argument of supp() is a set of preconditions and thus becomes more restrictive as it grows instead of more inclusive confidence can be interpreted as an estimate of the probability p(y | x) the probability of finding the rhs of the rule in transactions under the condition that these transactions also contain the lhs 3 the lift of a rule is defined as lift(x => y) = supp(x ∪ y) / (supp(x) × supp(y)) or the ratio of the observed support to that expected if x and y were independent the rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25 the conviction of a rule is defined as conv(x => y) = (1 - supp(y)) / (1 - conf(x => y)) the rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2 and can be interpreted as the ratio of the expected frequency that x occurs without y that is to say the frequency that the rule makes an incorrect prediction if x and y were independent divided by the observed frequency of incorrect predictions in this example the conviction value of 1.2 shows that the rule would be incorrect 20% more often 1.2 times as often if the association between x and y was purely random chance
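To make the worked numbers above easy to check, here is a minimal Python sketch; it is not part of the scraped article, and the helper names transactions, support, confidence, lift and conviction are our own. It recomputes the quoted values for the rule {milk, bread} => {butter} on the five-transaction toy database.

# Minimal sketch: recompute support, confidence, lift and conviction
# for the toy database used in the "useful concepts" section above.
transactions = [
    {"milk", "bread"},            # transaction 1
    {"butter"},                   # transaction 2
    {"beer"},                     # transaction 3
    {"milk", "bread", "butter"},  # transaction 4
    {"bread"},                    # transaction 5
]

def support(itemset):
    """Proportion of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """supp(lhs union rhs) / supp(lhs), an estimate of P(rhs | lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    """Observed support over the support expected if lhs and rhs were independent."""
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

def conviction(lhs, rhs):
    """(1 - supp(rhs)) / (1 - conf); larger when the rule errs less often than chance."""
    conf = confidence(lhs, rhs)
    return float("inf") if conf == 1.0 else (1 - support(rhs)) / (1 - conf)

lhs, rhs = {"milk", "bread"}, {"butter"}
print(support(lhs | rhs))    # 0.2   (1 of 5 transactions)
print(confidence(lhs, rhs))  # 0.5   (= 0.2 / 0.4)
print(lift(lhs, rhs))        # ~1.25 (= 0.2 / (0.4 * 0.4))
print(conviction(lhs, rhs))  # ~1.2  (= 0.6 / 0.5)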
process edit frequent itemset lattice where the color of the box indicates how many transactions contain the combination of items note that lower levels of the lattice can contain at most the minimum number of their parents items e g {a, c} can occur in at most min(supp({a}), supp({c})) of the transactions this is called the downward closure property 2 association rules are usually required to satisfy a user specified minimum support and a user specified minimum confidence at the same time association rule generation is usually split up into two separate steps first minimum support is applied to find all frequent itemsets in a database second these frequent itemsets and the minimum confidence constraint are used to form rules while the second step is straightforward the first step needs more attention finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets item combinations the set of possible itemsets is the power set over i and has size 2^n - 1 excluding the empty set which is not a valid itemset although the size of the power set grows exponentially in the number of items n efficient search is possible using the downward closure property of support 2 4 also called anti monotonicity 5 which guarantees that for a frequent itemset all its subsets are also frequent and thus for an infrequent itemset all its supersets must also be infrequent exploiting this property efficient algorithms e g apriori 6 and eclat 7 can find all frequent itemsets history edit the concept of association rules was popularised particularly due to the 1993 article of agrawal et al 2 which has acquired more than 6000 citations according to google scholar as of march 2008 and is thus one of the most cited papers in the data mining field however it is possible that what is now called association rules is similar to what appears in the 1966 paper 8 on guha a general data mining method developed by petr hájek et al 9 alternative measures of interestingness edit next to confidence also other measures of interestingness for rules were proposed some popular measures are all confidence 10 collective strength 11 conviction 12 leverage 13 lift originally called interest 14 a definition of these measures can be found here several more measures are presented and compared by tan et al 15 looking for techniques that can model what the user has known and using these models as interestingness measures is currently an active research trend under the name of subjective interestingness statistically sound associations edit one limitation of the standard approach to discovering associations is that by searching massive numbers of possible associations to look for collections of items that appear to be associated there is a large risk of finding many spurious associations these are collections of items that co occur with unexpected frequency in the data but only do so by chance for example suppose we are considering a collection of 10 000 items and looking for rules containing two items in the left hand side and 1 item in the right hand side there are approximately 1 000 000 000 000 such rules if we apply a statistical test for independence with a significance level of 0.05 it means there is only a 5% chance of accepting a rule if there is no association if we assume there are no associations we should nonetheless expect to find 50 000 000 000 rules statistically sound association discovery 16 17 controls this risk in most cases reducing the risk of finding
any spurious associations to a user specified significance level algorithms edit many algorithms for generating association rules were presented over time some well known algorithms are apriori eclat and fp growth but they only do half the job since they are algorithms for mining frequent itemsets another step needs to be done after to generate rules from frequent itemsets found in a database apriori algorithm edit main article apriori algorithm apriori 6 is the best known algorithm to mine association rules it uses a breadth first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support eclat algorithm edit eclat 7 is a depth first search algorithm using set intersection fp growth algorithm edit fp stands for frequent pattern in the first pass the algorithm counts occurrence of items attribute value pairs in the dataset and stores them to header table in the second pass it builds the fp tree structure by inserting instances items in each instance have to be sorted by descending order of their frequency in the dataset so that the tree can be processed quickly items in each instance that do not meet minimum coverage threshold are discarded if many instances share most frequent items fp tree provides high compression close to tree root recursive processing of this compressed version of main dataset grows large item sets directly instead of generating candidate items and testing them against the entire database growth starts from the bottom of the header table having longest branches by finding all instances matching given condition new tree is created with counts projected from the original tree corresponding to the set of instances that are conditional on the attribute with each node getting sum of its children counts recursive growth ends when no individual items conditional on the attribute meet minimum support threshold and processing continues on the remaining header items of the original fp tree once the recursive process has completed all large item sets with minimum coverage have been found and association rule creation begins 18 guha procedure assoc edit guha is a general method for exploratory data analysis that has theoretical foundations in observational calculi 19 the assoc procedure 20 is a guha method which mines for generalized association rules using fast bitstrings operations the association rules mined by this method are more general than those output by apriori for example items can be connected both with conjunction and disjunctions and the relation between antecedent and consequent of the rule is not restricted to setting minimum support and confidence as in apriori an arbitrary combination of supported interest measures can be used opus search edit opus is an efficient algorithm for rule discovery that in contrast to most alternatives does not require either monotone or anti monotone constraints such as minimum support 21 initially used to find rules for a fixed consequent 21 22 it has subsequently been extended to find rules with any item as a consequent 23 opus search is the core technology in the popular magnum opus association discovery system lore edit a famous story about association rule mining is the beer and diaper story a purported survey of behavior of supermarket shoppers discovered that customers presumably young men who buy diapers tend also to buy beer this anecdote became popular as an example of how unexpected association rules might be found from everyday data 
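The two-step process described above (find frequent itemsets first, then form rules that meet minimum confidence) can be made concrete with a short level-wise sketch in the spirit of Apriori. It is a didactic illustration under a user-specified minimum support and confidence, not the optimized algorithm of Agrawal and Srikant; the function names are illustrative. The prune step is exactly the downward-closure property: a k-itemset survives as a candidate only if every one of its (k-1)-subsets was frequent.

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    """Level-wise (breadth-first) frequent-itemset search with downward-closure pruning."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # frequent 1-itemsets
    frequent = frequent_among({frozenset([i]) for t in transactions for i in t})
    all_frequent, k = dict(frequent), 2
    while frequent:
        prev = list(frequent)
        # join: unions of frequent (k-1)-itemsets that yield k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: every (k-1)-subset of a surviving candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = frequent_among(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules_from_itemsets(freq, min_confidence):
    """Second step: split each frequent itemset Z into X -> Z \\ X and keep
    the rules whose confidence supp(Z) / supp(X) meets the threshold."""
    rules = []
    for itemset, supp in freq.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / freq[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), supp, conf))
    return rules

db = [{"milk", "bread"}, {"butter"}, {"beer"}, {"milk", "bread", "butter"}, {"bread"}]
freq = apriori(db, min_support=0.4)
for lhs, rhs, s, c in rules_from_itemsets(freq, min_confidence=0.7):
    print(lhs, "->", rhs, f"support={s:.2f} confidence={c:.2f}")
```

Eclat and FP-growth find the same frequent itemsets with different search strategies (depth-first set intersection and FP-tree projection, respectively); the rule-generation step afterwards is unchanged.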
there are varying opinions as to how much of the story is true 24 daniel powers says 24 in 1992 thomas blischok manager of a retail consulting group at teradata and his staff prepared an analysis of 1 2 million market baskets from about 25 osco drug stores database queries were developed to identify affinities the analysis did discover that between 5 00 and 7 00 p m that consumers bought beer and diapers osco managers did not exploit the beer and diapers relationship by moving the products closer together on the shelves other types of association mining edit contrast set learning is a form of associative learning contrast set learners use rules that differ meaningfully in their distribution across subsets 25 weighted class learning is another form of associative learning in which weight may be assigned to classes to give focus to a particular issue of concern for the consumer of the data mining results high order pattern discovery techniques facilitate the capture of high order polythetic patterns or event associations that are intrinsic to complex real world data 26 k optimal pattern discovery provides an alternative to the standard approach to association rule learning that requires that each pattern appear frequently in the data generalized association rules hierarchical taxonomy concept hierarchy quantitative association rules categorical and quantitative data 27 interval data association rules e g partition the age into 5 year increment ranged maximal association rules sequential pattern mining discovers subsequences that are common to more than minsup sequences in a sequence database where minsup is set by the user a sequence is an ordered list of transactions 28 sequential rules discovering relationships between items while considering the time ordering it is generally applied on a sequence database for example a sequential rule found in database of sequences of customer transactions can be that customers who bought a computer and cd roms later bought a webcam with a given confidence and support see also edit sequence mining production system references edit piatetsky shapiro gregory 1991 discovery analysis and presentation of strong rules in piatetsky shapiro gregory and frawley william j eds knowledge discovery in databases aaai mit press cambridge ma a b c d e agrawal r imieli ski t swami a 1993 mining association rules between sets of items in large databases proceedings of the 1993 acm sigmod international conference on management of data sigmod 93 p 160 207 doi 10 1145 170035 170072 isbn 160 0897915925 160 edit hipp j g ntzer u nakhaeizadeh g 2000 algorithms for association rule mining a general survey and comparison acm sigkdd explorations newsletter 2 58 doi 10 1145 360402 360421 160 edit tan pang ning michael steinbach kumar vipin 2005 chapter 6 association analysis basic concepts and algorithms introduction to data mining addison wesley isbn 160 0 321 32136 7 160 pei jian han jiawei and lakshmanan laks v s mining frequent itemsets with convertible constraints in proceedings of the 17th international conference on data engineering april 2 6 2001 heidelberg germany 2001 pages 433 442 a b agrawal rakesh and srikant ramakrishnan fast algorithms for mining association rules in large databases in bocca jorge b jarke matthias and zaniolo carlo editors proceedings of the 20th international conference on very large data bases vldb santiago chile september 1994 pages 487 499 a b zaki m j 2000 scalable algorithms for association mining ieee transactions on knowledge and data 
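Sequential pattern mining, as described in the list above, counts how many sequences contain a pattern as an in-order subsequence and compares that count to minsup. A minimal containment check, with hypothetical purchase sequences, might look like the sketch below; this is only the support-counting core, not a full miner such as SPADE.

```python
def contains(sequence, pattern):
    """True if `pattern` (an ordered list of itemsets) occurs in `sequence`:
    each pattern element is contained in some transaction, in order, with each
    match occurring after the previous one."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not set(wanted) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def sequence_support(db, pattern):
    """Number of sequences in the database containing the pattern."""
    return sum(contains(seq, pattern) for seq in db)

# hypothetical customer-purchase sequences; each inner set is one transaction
db = [
    [{"computer"}, {"cd-rom"}, {"webcam"}],
    [{"computer", "cd-rom"}, {"printer"}, {"webcam"}],
    [{"computer"}, {"printer"}],
]
pattern = [{"computer"}, {"webcam"}]
minsup = 2
print(sequence_support(db, pattern) >= minsup)   # True: 2 of the 3 sequences match
```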
engineering 12 3 372 390 doi 10 1109 69 846291 160 edit h jek petr havel ivan chytil metod j the guha method of automatic hypotheses determination computing 1 1966 293 308 h jek petr feglar tomas rauch jan and coufal david the guha method data preprocessing and mining database support for data mining applications springer 2004 isbn 978 3 540 22479 2 omiecinski edward r alternative interest measures for mining associations in databases ieee transactions on knowledge and data engineering 15 1 57 69 jan feb 2003 aggarwal charu c and yu philip s a new framework for itemset generation in pods 98 symposium on principles of database systems seattle wa usa 1998 pages 18 24 brin sergey motwani rajeev ullman jeffrey d and tsur shalom dynamic itemset counting and implication rules for market basket data in sigmod 1997 proceedings of the acm sigmod international conference on management of data sigmod 1997 tucson arizona usa may 1997 pp 255 264 piatetsky shapiro gregory discovery analysis and presentation of strong rules knowledge discovery in databases 1991 pp 229 248 brin sergey motwani rajeev ullman jeffrey d and tsur shalom dynamic itemset counting and implication rules for market basket data in sigmod 1997 proceedings of the acm sigmod international conference on management of data sigmod 1997 tucson arizona usa may 1997 pp 265 276 tan pang ning kumar vipin and srivastava jaideep selecting the right objective measure for association analysis information systems 29 4 293 313 2004 webb geoffrey i 2007 discovering significant patterns machine learning 68 1 netherlands springer pp 1 33 online access gionis aristides mannila heikki mielik inen taneli and tsaparas panayiotis assessing data mining results via swap randomization acm transactions on knowledge discovery from data tkdd volume 1 issue 3 december 2007 article no 14 witten frank hall data mining practical machine learning tools and techniques 3rd edition rauch jan logical calculi for knowledge discovery in databases in proceedings of the first european symposium on principles of data mining and knowledge discovery springer 1997 pp 47 57 h jek petr and havr nek tom 1978 mechanizing hypothesis formation mathematical foundations for a general theory springer verlag isbn 160 3 540 08738 9 160 a b webb geoffrey i 1995 opus an efficient admissible algorithm for unordered search journal of artificial intelligence research 3 menlo park ca aaai press pp 431 465 online access bayardo roberto j jr agrawal rakesh gunopulos dimitrios 2000 constraint based rule mining in large dense databases data mining and knowledge discovery 4 2 217 240 doi 10 1023 a 1009895914772 160 webb geoffrey i 2000 efficient search for association rules in ramakrishnan raghu and stolfo sal eds proceedings of the sixth acm sigkdd international conference on knowledge discovery and data mining kdd 2000 boston ma new york ny the association for computing machinery pp 99 107 online access a b http www dssresources com newsletters 66 php menzies tim and hu ying data mining for very busy people ieee computer october 2003 pp 18 25 wong andrew k c wang yang 1997 high order pattern discovery from discrete valued data ieee transactions on knowledge and data engineering tkde 877 893 160 salleb aouissi ansaf vrain christel and nortet cyril 2007 quantminer a genetic algorithm for mining quantitative association rules international joint conference on artificial intelligence ijcai 1035 1040 160 zaki mohammed j 2001 spade an efficient algorithm for mining frequent sequences machine learning 
journal 42 pp 31 60 external links edit bibliographies edit hahsler michael annotated bibliography on association rules statsoft electronic statistics textbook association rules implementations edit sipina a free academic data mining software which includes a model for association rule learning pervasive datarush data mining platform for big data includes association rule mining kxen a commercial data mining software silverlight widget for live demonstration of association rule mining using apriori algorithm rapidminer a free java data mining software suite community edition gnu orange a free data mining software suite module orngassoc ruby implementation ai4r arules a package for mining association rules and frequent itemsets with r c borgelt s implementation of apriori and eclat frequent itemset mining implementations repository fimi frequent pattern mining implementations from bart goethals weka a collection of machine learning algorithms for data mining tasks written in java knime an open source workflow oriented data preprocessing and analysis platform zaki mohammed j data mining software magnum opus a system for statistically sound association discovery lisp miner mines for generalized guha association rules uses bitstrings not apriori algorithm ferda dataminer an extensible visual data mining platform implements guha procedures assoc and features multirelational data mining statistica commercial statistics software with an association rules module spmf an open source data mining platform offering more than 48 algorithms for association rule mining itemset mining and sequential pattern mining includes a simple user interface and java source code is distributed under the gpl artool gpl java association rule mining application with gui offering implementations of multiple algorithms for discovery of frequent patterns and extraction of association rules includes apriori and fpgrowth easyminer a web based association rule mining system for interactive mining free demo based on lisp miner \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_mining new file mode 100644 index 00000000..00c2578b --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Association_rule_mining @@ -0,0 +1 @@ +association rule learning wikipedia the free encyclopedia
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_distillation_of_structure b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_distillation_of_structure new file mode 100644 index 00000000..208b8fc6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_distillation_of_structure @@ -0,0 +1 @@ +automatic distillation of structure wikipedia the free encyclopedia automatic distillation of structure from wikipedia the free encyclopedia automatic distillation of structure adios is an algorithm that can analyse source material such as text and come up with meaningful information about the generative structures that gave rise to the source one application of the algorithm is grammar induction adios can read a source text and infer grammatical rules based on structures and patterns found in the text using these the system can then generate new well structured sentences adios was developed by zach solan david horn and eytan ruppin from tel aviv university israel and shimon edelman from cornell university new york usa references edit computers learn a new language new scientist 2005 08 06 computer program learns language rules and composes sentences all without outside help gizmag 2005 09 02 adios project homepage \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_summarization b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_summarization new file mode 100644 index 00000000..4b075003 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Automatic_summarization @@ -0,0 +1 @@ +automatic summarization wikipedia the free encyclopedia automatic summarization from wikipedia the free encyclopedia automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document as the problem of information overload has grown and as the quantity of data has increased so has interest in automatic summarization technologies that can make a coherent summary take into account variables such as length writing style and syntax an example of the use of summarization technology is search engines such as google document summarization is another generally there are two approaches to automatic summarization extraction and abstraction extractive methods work by selecting a subset of existing words phrases or sentences in the original text to form the summary in contrast abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate such a summary might contain words not explicitly present in the original the state of the art abstractive methods are still quite weak so most research has focused on extractive methods contents 1 methods 1 1 extraction based summarization 1 2 abstraction based summarization 1 3 maximum entropy based summarization 1 4 aided summarization 2 applications 2 1 keyphrase extraction 2 1 1 task description and example 2 1 2 keyphrase extraction as supervised learning 2 1 2 1 design choices 2 1 2 1 1 what are the examples 2 1 2 1 2 what are the features 2 1 2 1 3 how many keyphrases to return 2 1 2 1 4 what learning algorithm 2 1 3 unsupervised keyphrase extraction textrank 2 1 3 1 design choices 2 1 3 1 1 what should vertices be 2 1 3 1 2 how should we create edges 2 1 3 1 3 how are the final keyphrases formed 2 1 3 2 why it works 2 2 document summarization 2 2 1 overview of supervised learning approaches 2 2 2 unsupervised approaches textrank and lexrank 2 2 2 1 design choices 2 2 2 1 1 what are the vertices 2 2 2 1 2 what are the edges 2 2 2 1 3 how are summaries formed 2 2 2 2 textrank and lexrank differences 2 2 3 why unsupervised summarization works 2 2 4 multi document summarization 2 2 4 1 incorporating diversity grasshopper algorithm 3 evaluation techniques 3 1 intrinsic and extrinsic evaluation 3 2 inter textual and intra textual 3 3 current difficulties in evaluating summaries automatically 3 4 evaluating summaries qualitatively 4 see also 5 references 5 1 further reading methods edit methods of automatic summarization include extraction based abstraction based maximum entropy based and aided summarization extraction based summarization edit two particular types of summarization often addressed in the literature are keyphrase extraction where the goal is to select individual words or phrases to tag a document and document summarization where the goal is to
select whole sentences to create a short paragraph summary abstraction based summarization edit extraction techniques merely copy the information deemed most important by the system to the summary for example key clauses sentences or paragraphs while abstraction involves paraphrasing sections of the source document in general abstraction can condense a text more strongly than extraction but the programs that can do this are harder to develop as they require the use of natural language generation technology which itself is a growing field while some work has been done in abstractive summarization creating an abstract synopsis like that of a human the majority of summarization systems are extractive selecting a subset of sentences to place in a summary maximum entropy based summarization edit even though automating abstractive summarization is the goal of summarization research most practical systems are based on some form of extractive summarization extracted sentences can form a valid summary in itself or form a basis for further condensation operations furthermore evaluation of extracted summaries can be automated since it is essentially a classification task during the duc 2001 and 2002 evaluation workshops tno disambiguation needed developed a sentence extraction system for multi document summarization in the news domain the system was based on a hybrid system using a naive bayes classifier and statistical language models for modeling salience although the system exhibited good results we wanted to explore the effectiveness of a maximum entropy me classifier for the meeting summarization task as me is known to be robust against feature dependencies maximum entropy has also been applied successfully for summarization in the broadcast news domain aided summarization edit machine learning techniques from closely related fields such as information retrieval or text mining have been successfully adapted to help automatic summarization apart from fully automated summarizers fas there are systems that aid users with the task of summarization mahs machine aided human summarization for example by highlighting candidate passages to be included in the summary and there are systems that depend on post processing by a human hams human aided machine summarization applications edit there are different types of summaries depending what the summarization program focuses on to make the summary of the text for example generic summaries or query relevant summaries sometimes called query based summaries summarization systems are able to create both query relevant text summaries and generic machine generated summaries depending on what the user needs summarization of multimedia documents e g pictures or movies is also possible some systems will generate a summary based on a single source document while others can use multiple source documents for example a cluster of news stories on the same topic these systems are known as multi document summarization systems keyphrase extraction edit task description and example edit the task is the following you are given a piece of text such as a journal article and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text in the case of research articles many authors provide manually assigned keywords but most text lacks pre existing keyphrases for example news articles rarely have keyphrases attached but it would be useful to be able to automatically do so for a number of applications discussed below consider the example 
text from a recent news article the army corps of engineers rushing to meet president bush s promise to protect new orleans by the start of the 2006 hurricane season installed defective flood control pumps last year despite warnings from its own expert that the equipment would fail during a storm according to documents obtained by the associated press an extractive keyphrase extractor might select army corps of engineers president bush new orleans and defective flood control pumps as keyphrases these are pulled directly from the text in contrast an abstractive keyphrase system would somehow internalize the content and generate keyphrases that might be more descriptive and more like what a human would produce such as political negligence or inadequate protection from floods note that these terms do not appear in the text and require a deep understanding which makes it difficult for a computer to produce such keyphrases keyphrases have many applications such as to improve document browsing by providing a short summary also keyphrases can improve information retrieval if documents have keyphrases assigned a user could search by keyphrase to produce more reliable hits than a full text search also automatic keyphrase extraction can be useful in generating index entries for a large text corpus keyphrase extraction as supervised learning edit beginning with the turney paper many researchers have approached keyphrase extraction as a supervised machine learning problem given a document we construct an example for each unigram bigram and trigram found in the text though other text units are also possible as discussed below we then compute various features describing each example e g does the phrase begin with an upper case letter we assume there are known keyphrases available for a set of training documents using the known keyphrases we can assign positive or negative labels to the examples then we learn a classifier that can discriminate between positive and negative examples as a function of the features some classifiers make a binary classification for a test example while others assign a probability of being a keyphrase for instance in the above text we might learn a rule that says phrases with initial capital letters are likely to be keyphrases after training a learner we can select keyphrases for test documents in the following manner we apply the same example generation strategy to the test documents then run each example through the learner we can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model if probabilities are given a threshold is used to select the keyphrases keyphrase extractors are generally evaluated using precision and recall precision measures how many of the proposed keyphrases are actually correct recall measures how many of the true keyphrases your system proposed the two measures can be combined in an f score which is the harmonic mean of the two f 160 160 2pr p 160 160 r matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization design choices edit designing a supervised keyphrase extraction system involves deciding on several choices some of these apply to unsupervised too what are the examples edit the first choice is exactly how to generate examples turney and others have used all possible unigrams bigrams and trigrams without intervening punctuation and after removing stopwords hulth showed that you can get some improvement 
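The F-score expression above was garbled during text extraction; the intended formula is the harmonic mean F = 2PR / (P + R). A small evaluation sketch, assuming a hypothetical system output scored against the article's example keyphrases, with simple lower-casing standing in for the stemming or normalization step the text mentions:

```python
def keyphrase_scores(proposed, gold):
    """Precision, recall and F-score for a set of proposed keyphrases."""
    norm = lambda s: s.lower().strip()          # crude stand-in for stemming
    proposed = {norm(p) for p in proposed}
    gold = {norm(g) for g in gold}
    hits = len(proposed & gold)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(gold) if gold else 0.0
    # harmonic mean: F = 2PR / (P + R)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# hypothetical system output vs. the gold keyphrases from the article's example
print(keyphrase_scores(
    ["Army Corps of Engineers", "New Orleans", "hurricane season"],
    ["army corps of engineers", "president bush", "new orleans",
     "defective flood control pumps"]))   # ~(0.67, 0.5, 0.57)
```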
by selecting examples to be sequences of tokens that match certain patterns of part of speech tags ideally the mechanism for generating examples produces all the known labeled keyphrases as candidates though this is often not the case for example if we use only unigrams bigrams and trigrams then we will never be able to extract a known keyphrase containing four words thus recall may suffer however generating too many examples can also lead to low precision what are the features edit we also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non keyphrases typically features involve various term frequencies how many times a phrase appears in the current text or in a larger corpus the length of the example relative position of the first occurrence various boolean syntactic features e g contains all caps etc the turney paper used about 12 such features hulth uses a reduced set of features which were found most successful in the kea keyphrase extraction algorithm work derived from turney s seminal paper how many keyphrases to return edit in the end the system will need to return a list of keyphrases for a test document so we need to have a way to limit the number ensemble methods i e using votes from several classifiers have been used to produce numeric scores that can be thresholded to provide a user provided number of keyphrases this is the technique used by turney with c4 5 decision trees hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number what learning algorithm edit once examples and features are created we need a way to learn to predict keyphrases virtually any supervised learning algorithm could be used such as decision trees naive bayes and rule induction in the case of turney s genex algorithm a genetic algorithm is used to learn parameters for a domain specific keyphrase extraction algorithm the extractor follows a series of heuristics to identify keyphrases the genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases unsupervised keyphrase extraction textrank edit while supervised methods have some nice properties like being able to produce interpretable rules for what features characterize a keyphrase they also require a large amount of training data many documents with known keyphrases are needed furthermore training on a specific domain tends to customize the extraction process to that domain so the resulting classifier is not necessarily portable as some of turney s results demonstrate unsupervised keyphrase extraction removes the need for training data it approaches the problem from a different angle instead of trying to learn explicit features that characterize keyphrases the textrank algorithm 1 exploits the structure of the text itself to determine keyphrases that appear central to the text in the same way that pagerank selects important web pages recall this is based on the notion of prestige or recommendation from social networks in this way textrank does not rely on any previous training data at all but rather can be run on any arbitrary piece of text and it can produce output simply based on the text s intrinsic properties thus the algorithm is easily portable to new domains and languages textrank is a general purpose graph based ranking algorithm for nlp essentially it runs pagerank on a graph specially designed for a particular nlp task for keyphrase 
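A minimal sketch of the example-generation and feature steps just described, assuming a tiny stopword list and a handful of illustrative features; a real system would add part-of-speech patterns (as in Hulth's work) and feed the feature vectors to a trained classifier such as a decision tree or naive Bayes model.

```python
import re

# tiny illustrative stopword list; a real system would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "by", "that"}

def candidate_phrases(text, max_n=3):
    """All unigrams, bigrams and trigrams that do not cross punctuation and do
    not start or end with a stopword (one simple way to generate examples)."""
    candidates = set()
    for chunk in re.split(r"[.,;:!?()]", text):   # no n-gram spans punctuation
        tokens = chunk.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0].lower() in STOPWORDS or gram[-1].lower() in STOPWORDS:
                    continue
                candidates.add(" ".join(gram))
    return candidates

def features(phrase, text):
    """A few of the kinds of features listed above, computed per candidate."""
    lowered = text.lower()
    return {
        "term_frequency": lowered.count(phrase.lower()),
        "n_tokens": len(phrase.split()),
        "relative_first_occurrence": lowered.find(phrase.lower()) / max(len(text), 1),
        "starts_uppercase": phrase[0].isupper(),
    }

text = ("The Army Corps of Engineers, rushing to meet President Bush's promise "
        "to protect New Orleans, installed defective flood control pumps last year.")
for phrase in sorted(candidate_phrases(text))[:8]:
    print(phrase, features(phrase, text))
```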
extraction it builds a graph using some set of text units as vertices edges are based on some measure of semantic or lexical similarity between the text unit vertices unlike pagerank the edges are typically undirected and can be weighted to reflect a degree of similarity once the graph is constructed it is used to form a stochastic matrix combined with a damping factor as in the random surfer model and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 i e the stationary distribution of the random walk on the graph design choices edit what should vertices be edit the vertices should correspond to what we want to rank potentially we could do something similar to the supervised methods and create a vertex for each unigram bigram trigram etc however to keep the graph small the authors decide to rank individual unigrams in a first step and then include a second step that merges highly ranked adjacent unigrams to form multi word phrases this has a nice side effect of allowing us to produce keyphrases of arbitrary length for example if we rank unigrams and find that advanced natural language and processing all get high ranks then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together note that the unigrams placed in the graph can be filtered by part of speech the authors found that adjectives and nouns were the best to include thus some linguistic knowledge comes into play in this step how should we create edges edit edges are created based on word co occurrence in this application of textrank two vertices are connected by an edge if the unigrams appear within a window of size n in the original text n is typically around 2 10 thus natural and language might be linked in a text about nlp natural and processing would also be linked because they would both appear in the same string of n words these edges build on the notion of text cohesion and the idea that words that appear near each other are likely related in a meaningful way and recommend each other to the reader how are the final keyphrases formed edit since this method simply ranks the individual vertices we need a way to threshold or produce a limited number of keyphrases the technique chosen is to set a count t to be a user specified fraction of the total number of vertices in the graph then the top t vertices unigrams are selected based on their stationary probabilities a post processing step is then applied to merge adjacent instances of these t unigrams as a result potentially more or less than t final keyphrases will be produced but the number should be roughly proportional to the length of the original text why it works edit it is not initially clear why applying pagerank to a co occurrence graph would produce useful keyphrases one way to think about it is the following a word that appears multiple times throughout a text may have many different co occurring neighbors for example in a text about machine learning the unigram learning might co occur with machine supervised un supervised and semi supervised in four different sentences thus the learning vertex would be a central hub that connects to these other modifying words running pagerank textrank on the graph is likely to rank learning highly similarly if the text contains the phrase supervised classification then there would be an edge between supervised and classification if classification appears several other places and thus has many neighbors it is 
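The graph construction just described (unigram vertices, co-occurrence edges within a small window, PageRank-style scores, then merging adjacent top-ranked unigrams into phrases) can be sketched without any NLP libraries. The part-of-speech filter on vertices is omitted here, the window size, damping factor and top fraction are tunable assumptions, and the sample sentence is only demonstration text.

```python
from collections import defaultdict

def textrank_keyphrases(tokens, window=3, damping=0.85, iters=50, top_fraction=0.33):
    """Co-occurrence graph over tokens plus PageRank by power iteration, then
    adjacent high-ranking tokens are merged into multi-word keyphrases."""
    vocab = sorted(set(tokens))
    neighbours = defaultdict(set)
    for i in range(len(tokens)):                      # edge if two distinct tokens
        for j in range(i + 1, min(i + window, len(tokens))):   # co-occur in a window
            if tokens[i] != tokens[j]:
                neighbours[tokens[i]].add(tokens[j])
                neighbours[tokens[j]].add(tokens[i])

    # power iteration for the PageRank-style score with a damping factor
    score = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iters):
        new = {}
        for w in vocab:
            rank = sum(score[v] / len(neighbours[v]) for v in neighbours[w])
            new[w] = (1 - damping) / len(vocab) + damping * rank
        score = new

    # keep the top T unigrams, T = a fraction of the number of vertices
    top_t = max(1, int(top_fraction * len(vocab)))
    keep = set(sorted(vocab, key=score.get, reverse=True)[:top_t])

    # post-processing: merge adjacent kept unigrams into phrases
    phrases, current = set(), []
    for tok in tokens + [None]:
        if tok in keep:
            current.append(tok)
        else:
            if current:
                phrases.add(" ".join(current))
            current = []
    return sorted(phrases, key=lambda p: -sum(score[w] for w in p.split()))

tokens = ("compatibility of systems of linear constraints over the set of "
          "natural numbers criteria of compatibility of a system of linear "
          "diophantine equations strict inequations and nonstrict inequations "
          "are considered").split()
print(textrank_keyphrases(tokens))
```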
importance would contribute to the importance of supervised if it ends up with a high rank it will be selected as one of the top t unigrams along with learning and probably classification in the final post processing step we would then end up with keyphrases supervised learning and supervised classification in short the co occurrence graph will contain densely connected regions for terms that appear often and in different contexts a random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters this is similar to densely connected web pages getting ranked highly by pagerank document summarization edit like keyphrase extraction document summarization hopes to identify the essence of a text the only real difference is that now we are dealing with larger text units whole sentences instead of words and phrases before getting into the details of some summarization methods we will mention how summarization systems are typically evaluated the most common way is using the so called rouge recall oriented understudy for gisting evaluation measure this is a recall based measure that determines how well a system generated summary covers the content present in one or more human generated model summaries known as references it is recall based to encourage systems to include all the important topics in the text recall can be computed with respect to unigram bigram trigram or 4 gram matching though rouge 1 unigram matching has been shown to correlate best with human assessments of system generated summaries i e the summaries with highest rouge 1 values correlate with the summaries humans deemed the best rouge 1 is computed as division of count of unigrams in reference that appear in system and count of unigrams in reference summary if there are multiple references the rouge 1 scores are averaged because rouge is based only on content overlap it can determine if the same general concepts are discussed between an automatic summary and a reference summary but it cannot determine if the result is coherent or the sentences flow together in a sensible manner high order n gram rouge measures try to judge fluency to some degree note that rouge is similar to the bleu measure for machine translation but bleu is precision based because translation systems favor accuracy a promising line in document summarization is adaptive document text summarization 2 the idea of adaptive summarization involves preliminary recognition of document text genre and subsequent application of summarization algorithms optimized for this genre first summarizes that perform adaptive summarization have been created 3 overview of supervised learning approaches edit supervised text summarization is very much like supervised keyphrase extraction basically if you have a collection of documents and human generated summaries for them you can learn features of sentences that make them good candidates for inclusion in the summary features might include the position in the document i e the first few sentences are probably important the number of words in the sentence etc the main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as in summary or not in summary this is not typically how people create summaries so simply using journal abstracts or existing summaries is usually not sufficient the sentences in these summaries do not 
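ROUGE-1 as described above is unigram recall against one or more reference summaries, averaged over references. A small sketch, clipping overlap counts per reference in the usual ROUGE way and using made-up summaries:

```python
from collections import Counter

def rouge_1(system_summary, reference_summaries):
    """ROUGE-1 recall: overlapping unigrams (clipped per reference) divided by
    the number of unigrams in the reference, averaged over all references."""
    sys_counts = Counter(system_summary.lower().split())
    scores = []
    for ref in reference_summaries:
        ref_counts = Counter(ref.lower().split())
        overlap = sum(min(cnt, sys_counts[tok]) for tok, cnt in ref_counts.items())
        scores.append(overlap / sum(ref_counts.values()))
    return sum(scores) / len(scores)

system = "the cat was found under the bed"
references = ["the cat was under the bed",
              "the tiny cat was asleep under the big bed"]
print(round(rouge_1(system, references), 3))   # ~0.833
```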
A promising line in document summarization is adaptive document/text summarization.[2] The idea of adaptive summarization involves preliminary recognition of the document/text genre and subsequent application of summarization algorithms optimized for this genre. The first summarizers that perform adaptive summarization have been created.[3]

Overview of supervised learning approaches

Supervised text summarization is very much like supervised keyphrase extraction. Basically, if you have a collection of documents and human-generated summaries for them, you can learn features of sentences that make them good candidates for inclusion in the summary. Features might include the position in the document (i.e., the first few sentences are probably important), the number of words in the sentence, etc. The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences, so that the sentences in an original training document can be labeled as "in summary" or "not in summary". This is not typically how people create summaries, so simply using journal abstracts or existing summaries is usually not sufficient. The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training. Note, however, that these natural summaries can still be used for evaluation purposes, since ROUGE-1 only cares about unigrams.

Unsupervised approaches: TextRank and LexRank

The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction, and gets around the issue of costly training data. Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document. Then the sentences can be ranked with regard to their similarity to this centroid sentence. A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[4] is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time; LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.

Design choices

What are the vertices?

In both LexRank and TextRank, a graph is constructed by creating a vertex for each sentence in the document.

What are the edges?

The edges between sentences are based on some form of semantic similarity or content overlap. While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a very similar measure based on the number of words two sentences have in common, normalized by the sentences' lengths. The LexRank paper explored using unweighted edges after applying a threshold to the cosine values, but also experimented with using edges with weights equal to the similarity score. TextRank uses continuous similarity scores as weights.

How are summaries formed?

In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top-ranking sentences, using a threshold or length cutoff to limit the size of the summary.
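The sentence-graph construction and ranking described above can be sketched as follows. This version uses cosine similarity of TF-IDF vectors for edge weights (the LexRank choice) and a weighted PageRank, without the thresholding step or any feature combination; the tokenization, the IDF variant, and the parameter values are assumptions for illustration, not the published systems.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """TF-IDF vectors, treating each sentence as a 'document'."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(sentences)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_sentences(sentences, damping=0.85, iters=50):
    """LexRank-style ranking sketch: cosine similarities of TF-IDF vectors
    are the edge weights of a sentence graph; a weighted PageRank power
    iteration gives sentence importance."""
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    score = [1.0 / n] * n
    for _ in range(iters):
        score = [(1 - damping) / n
                 + damping * sum(sim[j][i] / (sum(sim[j]) or 1.0) * score[j]
                                 for j in range(n))
                 for i in range(n)]
    return sorted(range(n), key=lambda i: score[i], reverse=True)

sentences = ["Graph-based ranking scores each sentence.",
             "Each sentence is scored by a graph-based ranking algorithm.",
             "The weather was pleasant yesterday."]
print([sentences[i] for i in rank_sentences(sentences)[:2]])
```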
TextRank and LexRank differences

It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length, using a linear combination with either user-specified or automatically tuned weights. In this case, some training documents might be needed, though the TextRank results show the additional features are not absolutely necessary.

Another important distinction is that TextRank was used for single-document summarization, while LexRank has been applied to multi-document summarization. The task remains the same in both cases; only the number of sentences to choose from has grown. However, when summarizing multiple documents, there is a greater risk of selecting duplicate or highly redundant sentences to place in the same summary. Imagine you have a cluster of news articles on a particular event and you want to produce one summary. Each article is likely to have many similar sentences, and you would only want to include distinct ideas in the summary. To address this issue, LexRank applies a heuristic post-processing step that builds up a summary by adding sentences in rank order, but discards any sentences that are too similar to ones already placed in the summary. The method used is called Cross-Sentence Information Subsumption (CSIS).

Why unsupervised summarization works

These methods work based on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. The importance of this sentence also stems from the importance of the sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to any arbitrary new text. The methods are domain-independent and easily portable. One could imagine that the features indicating important sentences in the news domain might vary considerably from the biomedical domain. However, the unsupervised "recommendation"-based approach applies to any domain.

Multi-document summarization

Main article: Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with information contained in a large cluster of documents. In such a way, multi-document summarization systems complement the news aggregators, performing the next step down the road of coping with information overload. Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, hence limiting the need for accessing original files to cases when refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention.

Incorporating diversity: the GRASSHOPPER algorithm

Multi-document extractive summarization faces a problem of potential redundancy. Ideally, we would like to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another). LexRank deals with diversity as a heuristic final stage using CSIS, and other systems have used similar methods, such as Maximal Marginal Relevance (MMR), in trying to eliminate redundancy in information retrieval results. There is a general-purpose graph-based ranking algorithm, like Page/Lex/TextRank, that handles both "centrality" and "diversity" in a unified mathematical framework based on absorbing Markov chain random walks. (An absorbing random walk is like a standard random walk, except some states are now absorbing states that act as "black holes" and cause the walk to end abruptly at that state.) The algorithm is called GRASSHOPPER. In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).
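A minimal sketch of the redundancy-removal idea mentioned above: add sentences in rank order and discard those too similar to sentences already in the summary, in the spirit of CSIS or MMR. The Jaccard word-overlap similarity and the 0.5 threshold are stand-ins, not the measures used in the published systems, and this is not the GRASSHOPPER absorbing-walk formulation.

```python
import re

def word_overlap(a, b):
    """Jaccard word overlap, a simple stand-in for a real similarity measure."""
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def greedy_diverse_summary(sentences, ranking, max_sentences=5,
                           redundancy_threshold=0.5, similarity=word_overlap):
    """Build a summary by taking sentences in rank order, skipping any
    sentence that is too similar to one already placed in the summary."""
    summary = []
    for idx in ranking:
        if len(summary) == max_sentences:
            break
        candidate = sentences[idx]
        if all(similarity(candidate, chosen) < redundancy_threshold
               for chosen in summary):
            summary.append(candidate)
    return summary

ranked = ["Prices rose sharply on Monday.",                    # rank 1
          "On Monday, prices rose sharply.",                   # rank 2 (redundant)
          "Analysts expect further increases next quarter."]   # rank 3
print(greedy_diverse_summary(ranked, ranking=range(len(ranked)), max_sentences=2))
```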
Evaluation techniques

The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries. Evaluation techniques fall into intrinsic and extrinsic,[5] and inter-textual and intra-textual.[6]

Intrinsic and extrinsic evaluation

An intrinsic evaluation tests the summarization system in and of itself, while an extrinsic evaluation tests the summarization based on how it affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.

Inter-textual and intra-textual

Intra-textual methods assess the output of a specific summarization system, while inter-textual ones focus on contrastive analysis of the outputs of several summarization systems.

Human judgement often has wide variance on what is considered a "good" summary, which means that making the evaluation process automatic is particularly difficult. Manual evaluation can be used, but this is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

One of the metrics used in NIST's annual Document Understanding Conferences, in which research groups submit their systems for both summarization and translation tasks, is the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation).[3] It essentially calculates n-gram overlaps between automatically generated summaries and previously written human summaries. A high level of overlap should indicate a high level of shared concepts between the two summaries. Note that overlap metrics like this are unable to provide any feedback on a summary's coherence. Anaphor resolution remains another problem yet to be fully solved.

Current difficulties in evaluating summaries automatically

Evaluating summaries, either manually or automatically, is a hard task. The main difficulty in evaluation comes from the impossibility of building a fair gold standard against which the results of the systems can be compared. Furthermore, it is also very hard to determine what a correct summary is, because there is always the possibility that a system generates a good summary that is quite different from any human summary used as an approximation to the correct output.

Content selection is not a deterministic problem. People are subjective, and different authors would choose different sentences. Individuals may also not be consistent: a particular person may choose different sentences at different times. Two distinct sentences expressed in different words can express the same meaning; this phenomenon is known as paraphrasing. There is an approach to automatically evaluating summaries using paraphrases (ParaEval).

Most summarization systems perform an extractive approach, selecting and copying important sentences from the source documents. Although humans can also cut and paste relevant information from a text, most of the time they rephrase sentences when necessary, or they join different related pieces of information into one sentence.

Evaluating summaries qualitatively

The main drawback of the evaluation systems existing so far is that we need at least one reference summary, and for some methods more than one, to be able to compare automatic summaries with models. This is a hard and expensive task; much effort has to be made in order to have corpora of texts and their corresponding summaries. Furthermore, for some methods, not only do we need human-made summaries available for comparison, but manual annotation also has to be performed for some of them (e.g., SCU in the Pyramid Method). In any case, what the evaluation methods need as input is a set of summaries to serve as gold standards and a set of automatic summaries. Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.
To overcome these problems, quantitative evaluation might not be the only way to evaluate summaries; a qualitative automatic evaluation would also be important.

See also

Sentence extraction
Text mining
Multi-document summarization
Open Text Summarizer, an open-sourced text summarizing library
A tool for text summarization written in MATLAB

References

1. Rada Mihalcea and Paul Tarau, 2004: TextRank: Bringing Order into Texts. Department of Computer Science, University of North Texas.
2. Yatsko, V. et al. Automatic genre recognition and adaptive text summarization. In: Automatic Documentation and Mathematical Linguistics, 2010, Volume 44, Number 3, pp. 111-120.
3. UNIS (Universal Summarizer).
4. Güneş Erkan and Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.
5. Mani, I. Summarization evaluation: an overview.
6. Yatsko, V. A.; Vishnyakov, T. N. A method for evaluating modern systems of automatic text summarization. In: Automatic Documentation and Mathematical Linguistics, 2007, V. 41, No. 3, pp. 93-103.

Further reading

Hercules Dalianis, 2003: Porting and evaluation of automatic summarization.
Roxana Angheluta, 2002: The Use of Topic Segmentation for Automatic Summarization.
Anne Buist, 2004: Automatic Summarization of Meeting Data: A Feasibility Study.
Annie Louis, 2009: Performance Confidence Estimation for Automatic Summarization.
Elena Lloret and Manuel Palomar, 2009: Challenging Issues of Automatic Summarization: Relevance Detection and Quality-based Evaluation.
Andrew Goldberg, 2007: Automatic Summarization.
Endres-Niggemeyer, Brigitte, 1998: Summarizing Information. ISBN 3-540-63735-4.
Marcu, Daniel, 2000: The Theory and Practice of Discourse Parsing and Summarization. ISBN 0-262-13372-5.
Mani, Inderjeet, 2001: Automatic Summarization. ISBN 1-58811-060-5.
Huff, Jason, 2010: AutoSummarize. Conceptual artwork using automatic summarization software in Microsoft Word 2008.
Lehmam, Abderrafih, 2010: Essential Summarizer: innovative automatic text summarization software in twenty languages. ACM Digital Library; published in Proceedings of RIAO '10: Adaptivity, Personalization and Fusion of Heterogeneous Information, CID, Paris, France.
Xiaojin Zhu, Andrew Goldberg, Jurgen Van Gael, and David Andrzejewski, 2007: Improving Diversity in Ranking using Absorbing Random Walks (the GRASSHOPPER algorithm).
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Bayes_theorem b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Bayes_theorem new file mode 100644 index 00000000..a196080d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Bayes_theorem @@ -0,0 +1 @@ +bad title the requested page title is invalid it may be empty contain unsupported characters or include a non local or incorrectly linked interwiki prefix \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Biomedical_text_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Biomedical_text_mining new file mode 100644 index 00000000..aebe456e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Biomedical_text_mining @@ -0,0 +1 @@ +biomedical text mining also known as bionlp refers to text mining applied to texts and literature of the biomedical and molecular biology domain it is a rather recent research field on the edge of natural language processing bioinformatics medical informatics and computational linguistics there is an increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature due to the increasing number of electronically available publications stored in databases such as pubmed main applications edit the main developments
in this area have been related to the identification of biological entities named entity recognition such as protein and gene names in free text the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature automatic extraction of protein interactions and associations of proteins to functional concepts e g gene ontology terms even the extraction of kinetic parameters from text or the subcellular location of proteins have been addressed by information extraction and text mining technology examples edit pie pie protein interaction information extraction is a configurable web service to extract ppi relevant articles from medline kleio an advanced information retrieval system providing knowledge enriched searching for biomedicine facta a medline search engine for finding associations between biomedical concepts the facta visualizer helps intuitive understanding of facta search results through graphical visualization of the results 1 u compare u compare is an integrated text mining natural language processing system based on the uima framework with an emphasis on components for biomedical text mining 2 termine a term management system that identifies key terms in biomedical and other text types plan2l extraction of gene regulation relations protein protein interactions mutations ranked associations and cellular and developmental process associations for genes and proteins of the plant arabidopsis from abstracts and full text articles medie an intelligent search engine to retrieve biomedical correlations from medline based on indexing by natural language processing and text mining techniques 3 acromine an acronym dictionary which can be used to find distinct expanded forms of acronyms from medline 4 acromine disambiguator disambiguates abbreviations in biomedical text with their correct full forms 5 genia tagger analyses biomedical text and outputs base forms part of speech tags chunk tags and named entity tags nemine recognises gene protein names in text yeast metaboliner recognizes yeast metabolite names in text smart dictionary lookup machine learning based gene protein name lookup tpx a concept assisted search and navigation tool for biomedical literature analyses runs on pubmed pmc and can be configured on request to run on local literature repositories too 6 chilibot a tool for finding relationships between genes or gene products ebimed ebimed is a web application that combines information retrieval and extraction from medline 7 fable a gene centric text mining search engine for medline goannotator an online tool that uses semantic similarity for verification of electronic protein annotations using go terms automatically extracted from literature gopubmed retrieves pubmed abstracts for your search query then detects ontology terms from the gene ontology and medical subject headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms their synonyms or descendants anne o tate retrieves sets of pubmed records using a standard pubmed interface and analyzes them arranging content of pubmed record fields mesh author journal words from title and abtsracts and others in order of frequency information hyperlinked over proteins ihop 8 a network of concurring genes and proteins extends through the scientific literature touching on phenotypes pathologies and gene function ihop provides this network as a natural way of accessing 
millions of pubmed abstracts by using genes and proteins as hyperlinks between sentences and abstracts the information in pubmed can be converted into one navigable resource bringing all advantages of the internet to scientific literature research litinspector gene and signal transduction pathway data mining in pubmed abstracts nextbio life sciences search engine with a text mining functionality that utilizes pubmed abstracts ex literature search and clinical trials example to return concepts relevant to the query based on a number of heuristics including ontology relationships journal impact publication date and authorship the neuroscience information framework nif a neuroscience research hub with a search engine specifically tailored for neuroscience direct access to over 180 databases and curated resources built as part of the nih blueprint for neuroscience research pubanatomy an interactive visual search engine that provides new ways to explore relationships among medline literature text mining results anatomical structures gene expression and other background information pubgene co occurrence networks display of gene and protein symbols as well as mesh go pubchem and interaction terms such as binds or induces as these appear in medline records that is pubmed titles and abstracts whatizit whatizit is great at identifying molecular biology terms and linking them to publicly available databases 9 xtractor discovering newer scientific relations across pubmed abstracts a tool to obtain manually annotated expert curated relationships for proteins diseases drugs and biological processes as they get published in pubmed medical abstract medical abstract is an aggregator for medical abstract journal from pubmed abstracts mugex mugex is a tool for finding disease specific mutation gene pairs medcase medcase is an experimental tool of faculties of veterinary medicine and computer science in cluj napoca designed as a homeostatic serving sistem with natural language support for medical applications becas becas is a web application api and widget for biomedical concept identification able to annotate free text and pubmed abstracts note a workbench for biomedical text mining including information retrieval name entity recognition and relation extraction plugins conferences at which bionlp research is presented edit bionlp is presented at a variety of meetings pacific symposium on biocomputing in plenary session intelligent systems for molecular biology in plenary session and also in the biolink and bio ontologies workshops association for computational linguistics and north american association for computational linguistics annual meetings and associated workshops in plenary session and as part of the bionlp workshop see below bionlp 2010 american medical informatics association annual meeting in plenary session see also edit biocreative trec genomics medical literature retrieval external links edit bio nlp resources systems and application database collection the bionlp mailing list archives corpora for biomedical text mining the biocreative evaluations of biomedical text mining technologies directory of people involved in bionlp national centre for text mining nactem references edit tsuruoka y tsujii j and ananiadou s 2008 facta a text search engine for finding associated biomedical concepts bioinformatics 24 21 2559 2560 doi 10 1093 bioinformatics btn469 pmc 160 2572701 pmid 160 18772154 160 kano y baumgartner jr wa mccrohon l ananiadou s cohen kb hunter l and tsujii j 2009 u compare share and 
compare text mining tools with uima bioinformatics 25 15 1997 1998 doi 10 1093 bioinformatics btp289 pmc 160 2712335 pmid 160 19414535 160 miyao y ohta t masuda k tsuruoka y yoshida k ninomiya t and tsujii j 2006 semantic retrieval for the accurate identification of relational concepts in massive textbases proceedings of coling acl 2006 pp 160 1017 1024 160 okazaki n and ananiadou s 2006 building an abbreviation dictionary using a term recognition approach bioinformatics 22 24 3089 3095 doi 10 1093 bioinformatics btl534 pmid 160 17050571 160 okazaki n ananiadou s and tsujii j 2010 building a high quality sense inventory for improved abbreviation disambiguation bioinformatics 26 9 1246 1253 doi 10 1093 bioinformatics btq129 pmc 160 2859134 pmid 160 20360059 160 thomas joseph vangala g saipradeep ganesh sekar venkat raghavan rajgopal srinivasan aditya rao sujatha kotte amp naveen sivadasan 2012 tpx biomedical literature search made easy bioinformation 8 12 578 580 doi 10 6026 97320630008578 pmid 160 22829734 160 rebholz schuhmann d kirsch h arregui m gaudan s riethoven m and stoehr p 2007 ebimed text crunching to gather facts for proteins from medline bioinformatics 23 2 e237 e244 doi 10 1093 bioinformatics btl302 pmid 160 17237098 160 hoffmann r valencia a september 2005 implementing the ihop concept for navigation of biomedical literature bioinformatics 21 suppl 2 ii252 8 doi 10 1093 bioinformatics bti1142 pmid 160 16204114 160 rebholz schuhmann d arregui m gaudan s kirsch h jimeno a november 2008 text processing through web services calling whatizit bioinformatics 24 2 296 298 doi 10 1093 bioinformatics btm557 pmid 160 18006544 160 krallinger m valencia a 2005 text mining and information retrieval services for molecular biology genome biol 6 7 224 doi 10 1186 gb 2005 6 7 224 pmc 160 1175978 pmid 160 15998455 160 hoffmann r krallinger m andres e tamames j blaschke c valencia a may 2005 text mining for metabolic pathways signaling cascades and protein networks sci stke 2005 283 pe21 doi 10 1126 stke 2832005pe21 pmid 160 15886388 160 krallinger m erhardt ra valencia a march 2005 text mining approaches in molecular biology and biomedicine drug discov today 10 6 439 45 doi 10 1016 s1359 6446 05 03376 3 pmid 160 15808823 160 biomedical literature mining publications blimp a comprehensive and regularly updated index of publications on bio medical text mining retrieved from http en wikipedia org w index php title biomedical_text_mining amp oldid 558789640 categories data miningbioinformatics navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 7 june 2013 at 17 56 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Business_intelligence 
b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Business_intelligence new file mode 100644 index 00000000..d66c395a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Business_intelligence @@ -0,0 +1 @@ +business intelligence wikipedia the free encyclopedia business intelligence from wikipedia the free encyclopedia jump to navigation search business intelligence bi is a set of theories methodologies processes architectures and technologies that transform raw data into meaningful and useful information for business purposes bi can handle large amounts of information to help identify and develop new opportunities making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long term stability 1 bi technologies provide historical current and predictive views of business operations common functions of business intelligence technologies are reporting online analytical processing analytics data mining process mining complex event processing business performance management benchmarking text mining predictive analytics and prescriptive analytics though the term business intelligence is sometimes a synonym for competitive intelligence because they both support decision making bi uses technologies processes and applications to analyze mostly internal structured data and business processes while competitive intelligence gathers analyzes and disseminates information with a topical focus on company competitors if understood broadly business intelligence can include the subset of competitive intelligence 2 contents 1 history 2 business intelligence and data warehousing 3 business intelligence and business analytics 4 applications in an enterprise 5 prioritization of business intelligence projects 6 success factors of implementation 6 1 business sponsorship 6 2 business needs 6 3 amount and quality of available data 7 user aspect 8 bi portals 9 marketplace 9 1 industry specific 10 semi structured or unstructured data 10 1 unstructured data vs semi structured data 10 2 problems with semi structured or unstructured data 10 3 the use of metadata 11 future 12 see also 13 references 14 bibliography 15 external links history edit in a 1958 article ibm researcher hans peter luhn used the term business intelligence he defined intelligence as the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal 3 business intelligence as it is understood today is said to have evolved from the decision support systems that began in the 1960s and developed throughout the mid 1980s dss originated in the computer aided models created to assist with decision making and planning from dss data warehouses executive information systems olap and business intelligence came into focus beginning in the late 80s in 1989 howard dresner later a gartner group analyst proposed business intelligence as an umbrella term to describe concepts and methods to improve business decision making by using fact based support systems 4 it was not until the late 1990s that this usage was widespread 5 business intelligence and data warehousing edit often bi applications use data gathered from a data warehouse or a data mart a data warehouse is a copy of transactional data so that it facillitates in decision support however not all data warehouses are used for business intelligence nor do all business intelligence applications require a data warehouse to distinguish between the concepts of business 
intelligence and data warehouses forrester research often defines business intelligence in one of two ways using a broad definition business intelligence is a set of methodologies processes architectures and technologies that transform raw data into meaningful and useful information used to enable more effective strategic tactical and operational insights and decision making 6 when using this definition business intelligence also includes technologies such as data integration data quality data warehousing master data management text and content analytics and many others that the market sometimes lumps into the information management segment therefore forrester refers to data preparation and data usage as two separate but closely linked segments of the business intelligence architectural stack forrester defines the latter narrower business intelligence market as referring to just the top layers of the bi architectural stack such as reporting analytics and dashboards 7 business intelligence and business analytics edit thomas davenport argues that business intelligence should be divided into querying reporting olap an alerts tool and business analytics in this definition business analytics is the subset of bi based on statistics prediction and optimization 8 applications in an enterprise edit business intelligence can be applied to the following business purposes in order to drive business value citation needed measurement program that creates a hierarchy of performance metrics see also metrics reference model and benchmarking that informs business leaders about progress towards business goals business process management analytics program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery frequently involves data mining process mining statistical analysis predictive analytics predictive modeling business process modeling complex event processing and prescriptive analytics reporting enterprise reporting program that builds infrastructure for strategic reporting to serve the strategic management of a business not operational reporting frequently involves data visualization executive information system and olap collaboration collaboration platform program that gets different areas both inside and outside the business to work together through data sharing and electronic data interchange knowledge management program to make the company data driven through strategies and practices to identify create represent distribute and enable adoption of insights and experiences that are true business knowledge knowledge management leads to learning management and regulatory compliance in addition to above business intelligence also can provide a pro active approach such as alarm function to alert immediately to end user there are many types of alerts for example if some business value exceeds the threshold value the color of that amount in the report will turn red and the business analyst is alerted sometimes an alert mail will be sent to the user as well this end to end process requires data governance which should be handled by the expert citation needed prioritization of business intelligence projects edit it is often difficult to provide a positive business case for business intelligence initiatives and often the projects must be prioritized through strategic initiatives here are some hints to increase the benefits for a bi project as described by kimball 9 you must determine the tangible benefits such as eliminated cost of 
producing legacy reports enforce access to data for the entire organization 10 in this way even a small benefit such as a few minutes saved makes a difference when multiplied by the number of employees in the entire organization as described by ross weil amp roberson for enterprise architecture 11 consider letting the bi project be driven by other business initiatives with excellent business cases to support this approach the organization must have enterprise architects who can identify suitable business projects use a structured and quantitative methodology to create defensible prioritization in line with the actual needs of the organization such as a weighted decision matrix 12 success factors of implementation edit before implementing a bi solution it is worth taking different factors into consideration before proceeding according to kimball et al these are the three critical areas that you need to assess within your organization before getting ready to do a bi project 13 the level of commitment and sponsorship of the project from senior management the level of business need for creating a bi implementation the amount and quality of business data available business sponsorship edit the commitment and sponsorship of senior management is according to kimball et al the most important criteria for assessment 14 this is because having strong management backing helps overcome shortcomings elsewhere in the project however as kimball et al state even the most elegantly designed dw bi system cannot overcome a lack of business management sponsorship 15 it is important that personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a bi system the best business sponsor should have organizational clout and should be well connected within the organization it is ideal that the business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or drawbacks the management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project support from multiple members of the management ensures the project does not fail if one person leaves the steering group however having many managers work together on the project can also mean that there are several different interests that attempt to pull the project in different directions such as if different departments want to put more emphasis on their usage this issue can be countered by an early and specific analysis of the business areas that benefit the most from the implementation all stakeholders in project should participate in this analysis in order for them to feel ownership of the project and to find common ground another management problem that should be encountered before start of implementation is if the business sponsor is overly aggressive if the management individual gets carried away by the possibilities of using bi and starts wanting the dw or bi implementation to include several different sets of data that were not included in the original planning phase however since extra implementations of extra data may add many months to the original plan it s wise to make sure the person from management is aware of his actions business needs edit because of the close relationship with senior management another critical thing that must be assessed before the project begins is whether or not there is a business need and whether there is a clear business benefit by doing the implementation 16 the needs 
and benefits of the implementation are sometimes driven by competition and the need to gain an advantage in the market another reason for a business driven approach to implementation of bi is the acquisition of other organizations that enlarge the original organization it can sometimes be beneficial to implement dw or bi in order to create more oversight companies that implement bi are often large multinational organizations with diverse subsidiaries 17 a well designed bi solution provides a consolidated view of key business data not available anywhere else in the organization giving management visibility and control over measures that otherwise would not exist amount and quality of available data edit without good data it does not matter how good the management sponsorship or business driven motivation is without proper data or with too little quality data any bi implementation fails before implementation it is a good idea to do data profiling this analysis identifies the content consistency and structure 16 of the data this should be done as early as possible in the process and if the analysis shows that data is lacking put the project on the shelf temporarily while the it department figures out how to properly collect data when planning for business data and business intelligence requirements it is always advisable to consider specific scenarios that apply to a particular organization and then select the business intelligence features best suited for the scenario often scenarios revolve around distinct business processes each built on one or more data sources these sources are used by features that present that data as information to knowledge workers who subsequently act on that information the business needs of the organization for each business process adopted correspond to the essential steps of business intelligence these essential steps of business intelligence includes but not limited to go through business data sources in order to collect needed data convert business data to information and present appropriately query and analyze data act on those data collected user aspect edit some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company ultimately the bi system must be accepted and utilized by the users in order for it to add value to the organization 18 19 if the usability of the system is poor the users may become frustrated and spend a considerable amount of time figuring out how to use the system or may not be able to really use the system if the system does not add value to the users mission they simply don t use it 19 to increase user acceptance of a bi system it can be advisable to consult business users at an early stage of the dw bi lifecycle for example at the requirements gathering phase 18 this can provide an insight into the business process and what the users need from the bi system there are several methods for gathering this information such as questionnaires and interview sessions when gathering the requirements from the business users the local it department should also be consulted in order to determine to which degree it is possible to fulfill the business s needs based on the available data 18 taking on a user centered approach throughout the design and development stage may further increase the chance of rapid user adoption of the bi system 19 besides focusing on the user experience offered by the bi applications it may also possibly motivate the users to utilize the system by adding an 
element of competition kimball 18 suggests implementing a function on the business intelligence portal website where reports on system usage can be found by doing so managers can see how well their departments are doing and compare themselves to others and this may spur them to encourage their staff to utilize the bi system even more in a 2007 article h j watson gives an example of how the competitive element can act as an incentive 20 watson describes how a large call centre implemented performance dashboards for all call agents with monthly incentive bonuses tied to performance metrics also agents could compare their performance to other team members the implementation of this type of performance measurement and competition significantly improved agent performance bi chances of success can be improved by involving senior management to help make bi a part of the organizational culture and by providing the users with necessary tools training and support 20 training encourages more people to use the bi application 18 providing user support is necessary to maintain the bi system and resolve user problems 19 user support can be incorporated in many ways for example by creating a website the website should contain great content and tools for finding the necessary information furthermore helpdesk support can be used the help desk can be manned by power users or the dw bi project team 18 bi portals edit a business intelligence portal bi portal is the primary access interface for data warehouse dw and business intelligence bi applications the bi portal is the users first impression of the dw bi system it is typically a browser application from which the user has access to all the individual services of the dw bi system reports and other analytical functionality the bi portal must be implemented in such a way that it is easy for the users of the dw bi application to call on the functionality of the application 21 the bi portal s main functionality is to provide a navigation system of the dw bi application this means that the portal has to be implemented in a way that the user has access to all the functions of the dw bi application the most common way to design the portal is to custom fit it to the business processes of the organization for which the dw bi application is designed in that way the portal can best fit the needs and requirements of its users 22 the bi portal needs to be easy to use and understand and if possible have a look and feel similar to other applications or web content of the organization the dw bi application is designed for consistency the following is a list of desirable features for web portals in general and bi portals in particular usable user should easily find what they need in the bi tool content rich the portal is not just a report printing tool it should contain more functionality such as advice help support information and documentation clean the portal should be designed so it is easily understandable and not over complex as to confuse the users current the portal should be updated regularly interactive the portal should be implemented in a way that makes it easy for the user to use its functionality and encourage them to use the portal scalability and customization give the user the means to fit the portal to each user value oriented it is important that the user has the feeling that the dw bi application is a valuable resource that is worth working on marketplace edit there are a number of business intelligence vendors often categorized into the remaining 
independent pure play vendors and consolidated megavendors that have entered the market through a recent trend when of acquisitions in the bi industry 23 some companies adopting bi software decide to pick and choose from different product offerings best of breed rather than purchase one comprehensive integrated solution full service 24 industry specific edit specific considerations for business intelligence systems have to be taken in some sectors such as governmental banking regulations the information collected by banking institutions and analyzed with bi software must be protected from some groups or individuals while being fully available to other groups or individuals therefore bi solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing law semi structured or unstructured data edit businesses create a huge amount of valuable information in the form of e mails memos notes from call centers news user groups chats reports web pages presentations image files video files and marketing material and news according to merrill lynch more than 85 of all business information exists in these forms these information types are called either semi structured or unstructured data however organizations often only use these documents once 25 the management of semi structured data is recognized as a major unsolved problem in the information technology industry 26 according to projections from gartner 2003 white collar workers spend anywhere from 30 to 40 percent of their time searching finding and assessing unstructured data bi uses both structured and unstructured data but the former is easy to search and the latter contains a large quantity of the information needed for analysis and decision making 26 27 because of the difficulty of properly searching finding and assessing unstructured or semi structured data organizations may not draw upon these vast reservoirs of information which could influence a particular decision task or project this can ultimately lead to poorly informed decision making 25 therefore when designing a business intelligence dw solution the specific problems associated with semi structured and unstructured data must be accommodated for as well as those for the structured data 27 unstructured data vs semi structured data edit unstructured and semi structured data have different meanings depending on their context in the context of relational database systems unstructured data cannot be stored in predictably ordered columns and rows one type of unstructured data is typically stored in a blob binary large object a catch all data type available in most relational database management systems unstructured data may also refer to irregularly or randomly repeated column patterns that vary from row to row within each file or document many of these data types however like e mails word processing text files ppts image files and video files conform to a standard that offers the possibility of metadata metadata can include information such as author and time of creation and this can be stored in a relational database therefore it may be more accurate to talk about this as semi structured documents or data 26 but no specific consensus seems to have been reached unstructured data can also simply be the knowledge that business users have about future business trends business forecasting naturally aligns with the bi system because business users think of their business in aggregate terms capturing the business knowledge that may only exist 
in the minds of business users provides some of the most important data points for a complete bi solution problems with semi structured or unstructured data edit there are several challenges to developing bi with semi structured data according to inmon amp nesavich 28 some of those are physically accessing unstructured textual data unstructured data is stored in a huge variety of formats terminology among researchers and analysts there is a need to develop a standardized terminology volume of data as stated earlier up to 85 of all data exists as semi structured data couple that with the need for word to word and semantic analysis searchability of unstructured textual data a simple search on some data e g apple results in links where there is a reference to that precise search term inmon amp nesavich 2008 28 gives an example a search is made on the term felony in a simple search the term felony is used and everywhere there is a reference to felony a hit to an unstructured document is made but a simple search is crude it does not find references to crime arson murder embezzlement vehicular homicide and such even though these crimes are types of felonies the use of metadata edit to solve problems with searchability and assessment of data it is necessary to know something about the content this can be done by adding context through the use of metadata 25 many systems already capture some metadata e g filename author size etc but more useful would be metadata about the actual content e g summaries topics people or companies mentioned two technologies designed for generating metadata about content are automatic categorization and information extraction future edit a 2009 gartner paper predicted 29 these developments in the business intelligence market because of lack of information processes and tools through 2012 more than 35 percent of the top 5 000 global companies regularly fail to make insightful decisions about significant changes in their business and markets by 2012 business units will control at least 40 percent of the total budget for business intelligence by 2012 one third of analytic applications applied to business processes will be delivered through coarse grained application mashups a 2009 information management special report predicted the top bi trends green computing social networking data visualization mobile bi predictive analytics composite applications cloud computing and multitouch 30 other business intelligence trends include the following third party soa bi products increasingly address etl issues of volume and throughput cloud computing and software as a service saas are ubiquitous companies embrace in memory processing 64 bit processing and pre packaged analytic bi applications operational applications have callable bi components with improvements in response time scaling and concurrency near or real time bi analytics is a baseline expectation open source bi software replaces vendor offerings other lines of research include the combined study of business intelligence and uncertain data 31 32 in this context the data used is not assumed to be precise accurate and complete instead data is considered uncertain and therefore this uncertainty is propagated to the results produced by bi according to a study by the aberdeen group there has been increasing interest in software as a service saas business intelligence over the past years with twice as many organizations using this deployment approach as one year ago 15 in 2009 compared to 7 in 2008 citation needed an article by 
infoworld s chris kanaracus points out similar growth data from research firm idc which predicts the saas bi market will grow 22 percent each year through 2013 thanks to increased product sophistication strained it budgets and other factors 33 see also edit accounting intelligence analytic applications artificial intelligence marketing business intelligence 2 0 business intelligence 3 0 business process discovery business process management business activity monitoring business service management customer dynamics data presentation architecture data visualization decision engineering enterprise planning systems document intelligence integrated business planning location intelligence meteorological intelligence mobile business intelligence operational intelligence business information systems business intelligence tools process mining runtime intelligence sales intelligence spend management test and learn references edit rud olivia 2009 business intelligence success factors tools for aligning your business in the global economy hoboken n j wiley amp sons isbn 160 978 0 470 39240 9 160 kobielus james 30 april 2010 what s not bi oh don t get me started oops too late here goes business intelligence is a non domain specific catchall for all the types of analytic data that can be delivered to users in reports dashboards and the like when you specify the subject domain for this intelligence then you can refer to competitive intelligence market intelligence social intelligence financial intelligence hr intelligence supply chain intelligence and the like 160 h p luhn 1958 a business intelligence system ibm journal 2 4 314 doi 10 1147 rd 24 0314 160 d j power 10 march 2007 a brief history of decision support systems version 4 0 dssresources com retrieved 10 july 2008 160 power d j a brief history of decision support systems retrieved 1 november 2010 160 evelson boris 21 november 2008 topic overview business intelligence 160 evelson boris 29 april 2010 want to know what forrester s lead data analysts are thinking about bi and the data domain 160 henschen doug 4 january 2010 analytics at work q amp a with tom davenport interview http www informationweek com news software bi 222200096 kimball et al 2008 29 are you ready for the new business intelligence dell com retrieved 2012 06 19 160 jeanne w ross peter weil david c robertson 2006 enterprise architecture as strategy p 117 isbn 1 59139 839 8 krapohl donald a structured methodology for group decision making augmentedintel retrieved 22 april 2013 160 kimball et al 2008 p 298 kimball et al 2008 16 kimball et al 2008 18 a b kimball et al 2008 17 how companies are implementing business intelligence competency centers computer world retrieved april 2006 160 a b c d e f kimball a b c d swain scheps business intelligence for dummies 2008 isbn 978 0 470 12723 0 a b watson hugh j wixom barbara h 2007 the current state of business intelligence computer 40 9 96 doi 10 1109 mc 2007 331 160 the data warehouse lifecycle toolkit 2nd ed ralph kimball 2008 microsoft data warehouse toolkit wiley publishing 2006 pendse nigel 7 march 2008 consolidations in the bi industry the olap report 160 imhoff claudia 4 april 2006 three trends in business intelligence technology 160 a b c rao r 2003 from unstructured data to actionable intelligence it professional 5 6 29 doi 10 1109 mitp 2003 1254966 160 a b c blumberg r amp s atre 2003 the problem with unstructured data dm review 42 46 160 a b negash s 2004 business intelligence communications of the association of information 
systems 13 177 195 160 a b inmon b amp a nesavich unstructured textual data in the organization from managing unstructured data in the organization prentice hall 2008 pp 1 13 gartner reveals five business intelligence predictions for 2009 and beyond gartner com 15 january 2009 campbell don 23 june 2009 10 red hot bi trends information management 160 rodriguez carlos daniel florian casati fabio cappiello cinzia 2010 toward uncertain business intelligence the case of key indicators ieee internet computing 14 4 32 doi 10 1109 mic 2010 59 160 rodriguez c daniel f casati f amp cappiello c 2009 computing uncertain key indicators from uncertain data pp 160 106 120 160 conference iciq 09 year 2009 saas bi growth will soar in 2010 cloud computing infoworld 2010 02 01 retrieved on 17 january 2012 bibliography edit ralph kimball et al the data warehouse lifecycle toolkit 2nd ed wiley isbn 0 470 47957 4 peter rausch alaa sheta aladdin ayesh 160 business intelligence and performance management theory systems and industrial applications springer verlag u k 2013 isbn 978 1 4471 4865 4 external links edit chaudhuri surajit dayal umeshwar narasayya vivek august 2011 an overview of business intelligence technology communications of the acm 54 8 88 98 doi 10 1145 1978542 1978562 retrieved 26 october 2011 160 v t e data warehouse 160 creating the data warehouse concepts database dimension dimensional modeling fact olap star schema aggregate variants anchor modeling column oriented dbms data vault modeling holap molap rolap operational data store elements data dictionary metadata data mart sixth normal form surrogate key fact fact table early arriving fact measure dimension dimension table degenerate slowly changing filling extract transform load etl extract transform load 160 using the data warehouse concepts business intelligence dashboard data mining decision support system dss olap cube languages data mining extensions dmx multidimensional expressions mdx xml for analysis xmla tools business intelligence tools reporting software spreadsheet 160 related people bill inmon ralph kimball products comparison of olap servers data warehousing products and their producers retrieved from http en wikipedia org w index php title business_intelligence amp oldid 560227840 categories business intelligencefinancial data analysisdata managementintelligence information gathering hidden categories all articles with unsourced statementsarticles with unsourced statements from october 2010articles with unsourced statements from january 2012vague or ambiguous time from september 2012articles with unsourced statements from november 2010use dmy dates from march 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky dansk deutsch espa ol fran ais hrvatski bahasa indonesia italiano latga u latvie u lietuvi nederlands norsk bokm l polski portugus sloven ina suomi svenska edit links this page was last modified on 18 june 2013 at 09 40 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and 
privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Buzzword b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Buzzword new file mode 100644 index 00000000..833d2af9 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Buzzword @@ -0,0 +1 @@ +buzzword wikipedia the free encyclopedia buzzword from wikipedia the free encyclopedia jump to navigation search a buzzword is a word or phrase used to impress or an expression which is fashionable buzzwords often originate in jargon buzzwords are often neologisms 1 the term was first used in 1946 as student slang 2 contents 1 examples 2 see also 3 footnotes 4 further reading 5 external links examples edit the following terms are or were examples of buzzwords see also list of buzzwords long tail 3 next generation 4 paradigm 5 paradigm shift 6 see also edit buzzword bingo buzzword compliant golden hammer marketing buzz marketing speak memetics power word psychobabble virtue word weasel word footnotes edit grammar about com definition of buzzword online etymology dictionary douglas harper historian the register the long tail s maths begin to crumble evolt buzzword bingo the buzzword bingo book the complete definitive guide to the underground workplace game of doublespeak author benjamin yoskovitz publisher villard isbn 978 0 375 75348 0 cnet com s top 10 buzzwords further reading edit negus k pickering m 2004 creativity communication and cultural value sage publications ltd collins david 2000 management fads and buzzwords 160 critical practical perspectives london 160 new york 160 routledge godin b 2006 the knowledge based economy conceptual framework or buzzword the journal of technology transfer 31 1 17 external links edit look up buzzword 160 or buzz phrase in wiktionary the free dictionary the buzzword generator generates buzzwords and sample sentences containing such generated buzzwords languagemonitor watchdog on contemporary english usage n gage at e3 showcases immersive games and next generation mobile gaming an example of buzzwords in action the web economy bullshit generator on living wage affordable housing etc view buzzwords add buzzwords comment on buzzwords the online dictionary of language terminology guide to corporate buzzwords part 1 a look at buzzwords in corporate america v t e propaganda techniques ad hominem bandwagon effect big lie blood libel buzzword card stacking censorship code word dog whistle politics doublespeak euphemism framing glittering generality historical revisionism ideograph indoctrination lawfare lesser of two evils principle limited hangout loaded language loosely associated statements newspeak obscurantism plain folks public relations slogan spin weasel word this linguistics article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title buzzword amp oldid 553486214 categories buzzwordspropaganda techniques using wordsrhetorical techniqueslinguistics stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related 
changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages dansk deutsch fran ais nederlands occitan polski simple english svenska edit links this page was last modified on 4 may 2013 at 13 31 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/CIKM_Conference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/CIKM_Conference new file mode 100644 index 00000000..80e785ab --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/CIKM_Conference @@ -0,0 +1 @@ +conference on information and knowledge management wikipedia the free encyclopedia conference on information and knowledge management from wikipedia the free encyclopedia redirected from cikm conference jump to navigation search the acm conference on information and knowledge management cikm pronounced sik m is an annual computer science research conference dedicated to information and knowledge management since the first event in 1992 the conference has evolved into one of the major forums for research on database management information retrieval and knowledge management 1 2 the conference is noted for its interdisciplinarity as it brings together communities that otherwise often publish at separate venues recent editions have attracted well beyond 500 participants 3 in addition to the main research program the conference also features a number of workshops tutorials and industry presentations 4 for many years the conference was held in the usa since 2005 venues in other countries have been selected as well locations include 5 1992 baltimore maryland usa 1993 washington d c usa 1994 gaithersburg maryland usa 1995 baltimore maryland usa 1996 rockville maryland usa 1997 las vegas nevada usa 1998 bethesda maryland usa 1999 kansas city missouri usa 2000 washington d c usa 2001 atlanta georgia usa 2002 mclean virginia usa 2003 new orleans louisiana usa 2004 washington d c usa 2005 bremen germany 2006 arlington virginia usa 2007 lisbon portugal 6 2008 napa valley california usa 7 2009 hong kong china 8 2010 toronto ontario canada 9 2011 glasgow scotland uk 10 see also edit sigir conference references edit official home page arnetminer ranking list www ir arnetminer retrieved 2011 06 11 160 cikm 2011 sponsorship page cikm 2011 home page dblp http www fc ul pt cikm2007 1 2 3 4 external links edit official home page retrieved from http en wikipedia org w index php title conference_on_information_and_knowledge_management amp oldid 559530156 categories computer science conferences navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 12 june 2013 at 08 10 text is available 
under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Category_Data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Category_Data_mining new file mode 100644 index 00000000..171ff5bf --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Category_Data_mining @@ -0,0 +1 @@ +category data mining wikipedia the free encyclopedia category data mining from wikipedia the free encyclopedia jump to navigation search the main article for this category is data mining wikimedia commons has media related to data mining see also categories machine learning 160 and data analysis data mining facilities are included in some of the category data analysis software and category statistical software products subcategories this category has the following 5 subcategories out of 5 total a applied data mining 18 p c cluster analysis 2 c 14 p d data miners 12 p data mining and machine learning software 1 c 45 p dimension reduction 20 p pages in category data mining the following 55 pages are in this category out of 55 total this list may not reflect recent changes learn more 160 data mininga accuracy paradox affinity analysis alpha algorithm anomaly detection anomaly detection at multiple scales apriori algorithm association rule learning automatic distillation of structure automatic summarizationb biomedical text miningc cluster analysis co occurrence networks concept drift concept mining conference on knowledge discovery and data mining contrast set learningd data classification business intelligence data dredging d cont data mining and knowledge discovery data stream mining decision tree learning document classificatione ecml pkdd elastic map evolutionary data miningf feature vector formal concept analysis fsa red algorithmg gene expression programming gsp algorithmk k optimal pattern discoveryl lift data mining list of machine learning algorithms local outlier factorm mining software repositories molecule mining multifactor dimensionality reduction n nearest neighbor search nothing to hide argumento optimal matchingp proactive discovery of insider threats using graph analysis and learning profiling practicesr receiver operating characteristic ren rou rouge metric s sequence mining sigkdd software mining spss modeler structure miningt text miningu uncertain dataw ward s method web mining retrieved from http en wikipedia org w index php title category data_mining amp oldid 547416974 categories data analysiscomputational statisticsinformation technology managementalgorithmsinformation sciencehidden categories commons category with local link same as on wikidata navigation menu personal tools create accountlog in namespaces category talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information print export create a book download as pdf printable version languages deutsch espa ol euskara fran ais italiano portugus sloven ina srpski basa sunda t 
rk e ti ng vi t edit links this page was last modified on 28 march 2013 at 09 55 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cluster_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cluster_analysis new file mode 100644 index 00000000..ee27cbce --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cluster_analysis @@ -0,0 +1 @@ +cluster analysis wikipedia the free encyclopedia cluster analysis from wikipedia the free encyclopedia jump to navigation search the result of a cluster analysis shown as the coloring of the squares into three clusters this section needs additional citations for verification please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed may 2012 cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group called cluster are more similar in some sense or another to each other than to those in other groups clusters it is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields including machine learning pattern recognition image analysis information retrieval and bioinformatics cluster analysis itself is not one specific algorithm but the general task to be solved it can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them popular notions of clusters include groups with small distances among the cluster members dense areas of the data space intervals or particular statistical distributions clustering can therefore be formulated as a multi objective optimization problem the appropriate clustering algorithm and parameter settings including values such as the distance function to use a density threshold or the number of expected clusters depend on the individual data set and intended use of the results cluster analysis as such is not an automatic task but an iterative process of knowledge discovery or interactive multi objective optimization that involves trial and failure it will often be necessary to modify data preprocessing and model parameters until the result achieves the desired properties besides the term clustering there are a number of terms with similar meanings including automatic classification numerical taxonomy botryology from greek grape and typological analysis the subtle differences are often in the usage of the results while in data mining the resulting groups are the matter of interest in automatic classification primarily their discriminative power is of interest this often leads to misunderstandings between researchers coming from the fields of data mining and machine learning since they use the same terms and often the same algorithms but have different goals contents 1 clusters and clusterings 2 clustering algorithms 2 1 connectivity based clustering hierarchical clustering 2 2 centroid based clustering 2 3 distribution based clustering 2 4 density based clustering 2 5 newer developments 3 evaluation of clustering results 3 1 internal evaluation 3 2 external 
evaluation 4 clustering axioms 4 1 formal preliminaries 4 2 axioms 5 applications 6 see also 6 1 related topics 6 2 related methods 7 references clusters and clusterings edit the notion of a cluster cannot be precisely defined 1 which is one of the reasons why there are so many clustering algorithms there of course is a common denominator a group of data objects however different researchers employ different cluster models and for each of these cluster models again different algorithms can be given the notion of a cluster as found by different algorithms varies significantly in its properties understanding these cluster models is key to understanding the differences between the various algorithms typical cluster models include connectivity models for example hierarchical clustering builds models based on distance connectivity centroid models for example the k means algorithm represents each cluster by a single mean vector distribution models clusters are modeled using statistical distributions such as multivariate normal distributions used by the expectation maximization algorithm density models for example dbscan and optics defines clusters as connected dense regions in the data space subspace models in biclustering also known as co clustering or two mode clustering clusters are modeled with both cluster members and relevant attributes group models some algorithms unfortunately do not provide a refined model for their results and just provide the grouping information graph based models a clique i e a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster relaxations of the complete connectivity requirement a fraction of the edges can be missing are known as quasi cliques a clustering is essentially a set of such clusters usually containing all objects in the data set additionally it may specify the relationship of the clusters to each other for example a hierarchy of clusters embedded in each other clusterings can be roughly distinguished in hard clustering each object belongs to a cluster or not soft clustering also fuzzy clustering each object belongs to each cluster to a certain degree e g 160 a likelihood of belonging to the cluster there are also finer distinctions possible for example strict partitioning clustering here each object belongs to exactly one cluster strict partitioning clustering with outliers objects can also belong to no cluster and are considered outliers overlapping clustering also alternative clustering multi view clustering while usually a hard clustering objects may belong to more than one cluster hierarchical clustering objects that belong to a child cluster also belong to the parent cluster subspace clustering while an overlapping clustering within a uniquely defined subspace clusters are not expected to overlap clustering algorithms edit clustering algorithms can be categorized based on their cluster model as listed above the following overview will only list the most prominent examples of clustering algorithms as there are possibly over 100 published clustering algorithms not all provide models for their clusters and can thus not easily be categorized an overview of algorithms explained in wikipedia can be found in the list of statistics algorithms there is no objectively correct clustering algorithm but as it was noted clustering is in the eye of the beholder 1 the most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally 
unless there is a mathematical reason to prefer one cluster model over another it should be noted that an algorithm that is designed for one kind of models has no chance on a data set that contains a radically different kind of models 1 for example k means cannot find non convex clusters 1 connectivity based clustering hierarchical clustering edit main article hierarchical clustering connectivity based clustering also known as hierarchical clustering is based on the core idea of objects being more related to nearby objects than to objects farther away as such these algorithms connect objects to form clusters based on their distance a cluster can be described largely by the maximum distance needed to connect parts of the cluster at different distances different clusters will form which can be represented using a dendrogram which explains where the common name hierarchical clustering comes from these algorithms do not provide a single partitioning of the data set but instead provide an extensive hierarchy of clusters that merge with each other at certain distances in a dendrogram the y axis marks the distance at which the clusters merge while the objects are placed along the x axis such that the clusters don t mix connectivity based clustering is a whole family of methods that differ by the way distances are computed apart from the usual choice of distance functions the user also needs to decide on the linkage criterion since a cluster consists of multiple objects there are multiple candidates to compute the distance to to use popular choices are known as single linkage clustering the minimum of object distances complete linkage clustering the maximum of object distances or upgma unweighted pair group method with arithmetic mean also known as average linkage clustering furthermore hierarchical clustering can be agglomerative starting with single elements and aggregating them into clusters or divisive starting with the complete data set and dividing it into partitions while these methods are fairly easy to understand the results are not always easy to use as they will not produce a unique partitioning of the data set but a hierarchy the user still needs to choose appropriate clusters from the methods are not very robust towards outliers which will either show up as additional clusters or even cause other clusters to merge known as chaining phenomenon in particular with single linkage clustering in the general case the complexity is which makes them too slow for large data sets for some special cases optimal efficient methods of complexity are known slink 2 for single linkage and clink 3 for complete linkage clustering in the data mining community these methods are recognized as a theoretical foundation of cluster analysis but often considered obsolete they did however provide inspiration for many later methods such as density based clustering linkage clustering examples single linkage on gaussian data at 35 clusters the biggest cluster starts fragmenting into smaller parts while before it was still connected to the second largest due to the single link effect single linkage on density based clusters 20 clusters extracted most of which contain single elements since linkage clustering does not have a notion of noise centroid based clustering edit main article k means clustering in centroid based clustering clusters are represented by a central vector which may not necessarily be a member of the data set when the number of clusters is fixed to k k means clustering gives a formal definition as an 
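[Editor's note] The linkage-based methods described above (single, complete and average/UPGMA linkage, with clusters read off a dendrogram at a chosen level) map directly onto standard tooling. A minimal Python sketch follows, assuming SciPy is available; the synthetic point set and the choice of two clusters are illustrative assumptions, not values from the article.

# Illustrative only: agglomerative clustering with the linkage criteria discussed above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose synthetic groups of 2-D points (not data from the article).
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])

for method in ("single", "complete", "average"):      # "average" ~ UPGMA
    Z = linkage(X, method=method)                     # hierarchical merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy into 2 flat clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])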
optimization problem find the cluster centers and assign the objects to the nearest cluster center such that the squared distances from the cluster are minimized the optimization problem itself is known to be np hard and thus the common approach is to search only for approximate solutions a particularly well known approximative method is lloyd s algorithm 4 often actually referred to as k means algorithm it does however only find a local optimum and is commonly run multiple times with different random initializations variations of k means often include such optimizations as choosing the best of multiple runs but also restricting the centroids to members of the data set k medoids choosing medians k medians clustering choosing the initial centers less randomly k means or allowing a fuzzy cluster assignment fuzzy c means most k means type algorithms require the number of clusters to be specified in advance which is considered to be one of the biggest drawbacks of these algorithms furthermore the algorithms prefer clusters of approximately similar size as they will always assign an object to the nearest centroid this often leads to incorrectly cut borders in between of clusters which is not surprising as the algorithm optimized cluster centers not cluster borders k means has a number of interesting theoretical properties on one hand it partitions the data space into a structure known as voronoi diagram on the other hand it is conceptually close to nearest neighbor classification and as such popular in machine learning third it can be seen as a variation of model based classification and lloyd s algorithm as a variation of the expectation maximization algorithm for this model discussed below k means clustering examples k means separates data into voronoi cells which assumes equal sized clusters not adequate here k means cannot represent density based clusters distribution based clustering edit the clustering model most closely related to statistics is based on distribution models clusters can then easily be defined as objects belonging most likely to the same distribution a nice property of this approach is that this closely resembles the way artificial data sets are generated by sampling random objects from a distribution while the theoretical foundation of these methods is excellent they suffer from one key problem known as overfitting unless constraints are put on the model complexity a more complex model will usually always be able to explain the data better which makes choosing the appropriate model complexity inherently difficult one prominent method is known as gaussian mixture models using the expectation maximization algorithm here the data set is usually modeled with a fixed to avoid overfitting number of gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit better to the data set this will converge to a local optimum so multiple runs may produce different results in order to obtain a hard clustering objects are often then assigned to the gaussian distribution they most likely belong to for soft clusterings this is not necessary distribution based clustering is a semantically strong method as it not only provides you with clusters but also produces complex models for the clusters that can also capture correlation and dependence of attributes however using these algorithms puts an extra burden on the user to choose appropriate data models to optimize and for many real data sets there may be no mathematical model available the 
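[Editor's note] As a companion to the passage above on Lloyd's algorithm with random restarts and on Gaussian mixtures fitted by EM, here is a minimal sketch assuming scikit-learn; the synthetic data, the cluster counts and the random seeds are illustrative assumptions, not values from the article.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.6, size=(50, 2)) for loc in (0.0, 4.0, 8.0)])

# Lloyd's algorithm, run with several random initializations (n_init) as described above.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hard_labels = km.labels_

# Gaussian mixture fitted by EM; predict_proba gives the "soft" memberships,
# its argmax the usual hard assignment.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(X)
soft = gmm.predict_proba(X)
hard_from_gmm = soft.argmax(axis=1)

# Cluster sizes only (label numbering differs between the two models).
print(np.bincount(hard_labels), np.bincount(hard_from_gmm))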
algorithm is able to optimize e g assuming gaussian distributions is a rather strong assumption on the data em clustering examples on gaussian distributed data em works well since it uses gaussians for modelling clusters density based clusters cannot be modeled using gaussian distributions density based clustering edit in density based clustering 5 clusters are defined as areas of higher density than the remainder of the data set objects in these sparse areas that are required to separate clusters are usually considered to be noise and border points the most popular 6 density based clustering method is dbscan 7 in contrast to many newer methods it features a well defined cluster model called density reachability similar to linkage based clustering it is based on connecting points within certain distance thresholds however it only connects points that satisfy a density criterion in the original variant defined as a minimum number of other objects within this radius a cluster consists of all density connected objects which can form a cluster of an arbitrary shape in contrast to many other methods plus all objects that are within these objects range another interesting property of dbscan is that its complexity is fairly low it requires a linear number of range queries on the database and that it will discover essentially the same results it is deterministic for core and noise points but not for border points in each run therefore there is no need to run it multiple times optics 8 is a generalization of dbscan that removes the need to choose an appropriate value for the range parameter and produces a hierarchical result related to that of linkage clustering deli clu 9 density link clustering combines ideas from single linkage clustering and optics eliminating the parameter entirely and offering performance improvements over optics by using an r tree index the key drawback of dbscan and optics is that they expect some kind of density drop to detect cluster borders moreover they can not detect intrinsic cluster structures which are prevalent in the majority of real life data a variation of dbscan endbscan 10 efficiently detects such kinds of structures on data sets with for example overlapping gaussian distributions a common use case in artificial data the cluster borders produced by these algorithms will often look arbitrary because the cluster density decreases continuously on a data set consisting of mixtures of gaussians these algorithms are nearly always outperformed by methods such as em clustering that are able to precisely model this kind of data density based clustering examples density based clustering with dbscan dbscan assumes clusters of similar density and may have problems separating nearby clusters optics is a dbscan variant that handles different densities much better newer developments edit in recent years considerable effort has been put into improving algorithm performance of the existing algorithms 11 among them are clarans ng and han 1994 12 and birch zhang et al 1996 13 with the recent need to process larger and larger data sets also known as big data the willingness to trade semantic meaning of the generated clusters for performance has been increasing this led to the development of pre clustering methods such as canopy clustering which can process huge data sets efficiently but the resulting clusters are merely a rough pre partitioning of the data set to then analyze the partitions with existing slower methods such as k means clustering various other approaches to 
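[Editor's note] The density-based methods just described also have library implementations; a minimal sketch assuming scikit-learn, where eps and min_samples correspond to the radius and minimum-neighbour count in the DBSCAN description above (their numeric values here are illustrative) and label -1 marks noise points. OPTICS drops the fixed radius and instead produces a reachability ordering.

import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(2)
dense = rng.normal(0, 0.3, size=(100, 2))
sparse = rng.uniform(-4, 4, size=(20, 2))              # scattered background points
X = np.vstack([dense, dense + [3, 3], sparse])         # two dense blobs plus noise

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN clusters:", set(db.labels_) - {-1},
      "| noise points:", int((db.labels_ == -1).sum()))

op = OPTICS(min_samples=5).fit(X)                      # no eps needed up front
print("OPTICS clusters:", set(op.labels_) - {-1})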
clustering have been tried such as seed based clustering 14 for high dimensional data many of the existing methods fail due to the curse of dimensionality which renders particular distance functions problematic in high dimensional spaces this led to new clustering algorithms for high dimensional data that focus on subspace clustering where only some attributes are used and cluster models include the relevant attributes for the cluster and correlation clustering that also looks for arbitrary rotated correlated subspace clusters that can be modeled by giving a correlation of their attributes examples for such clustering algorithms are clique 15 and subclu 16 ideas from density based clustering methods in particular the dbscan optics family of algorithms have been adopted to subspace clustering hisc 17 hierarchical subspace clustering and dish 18 and correlation clustering hico 19 hierarchical corelation clustering 4c 20 using correlation connectivity and eric 21 exploring hierarchical density based correlation clusters several different clustering systems based on mutual information have been proposed one is marina meil s variation of information metric 22 another provides hierarchical clustering 23 using genetic algorithms a wide range of different fit functions can be optimized including mutual information 24 also message passing algorithms a recent development in computer science and statistical physics has led to the creation of new types of clustering algorithms 25 evaluation of clustering results edit evaluation of clustering results sometimes is referred to as cluster validation there have been several suggestions for a measure of similarity between two clusterings such a measure can be used to compare how well different data clustering algorithms perform on a set of data these measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method internal evaluation edit when a clustering result is evaluated based on the data that was clustered itself this is called internal evaluation these methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters one drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications 26 additionally this evaluation is biased towards algorithms that use the same cluster model for example k means clustering naturally optimizes object distances and a distance based internal criterion will likely overrate the resulting clustering therefore the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another but this shall not imply that one algorithm produces more valid results than another 1 validity as measured by such an index depends on the claim that this kind of structure exists in the data set an algorithm designed for some kind of models has no chance if the data set contains a radically different set of models or if the evaluation measures a radically different criterion 1 for example k means clustering can only find convex clusters and many evaluation indexes assume convex clusters on a data set with non convex clusters neither the use of k means nor of an evaluation criterion that assumes convexity is sound the following methods can be used to assess the quality of clustering algorithms based on internal criterion linear algebra measure 
laura a mather 2000 jasist 51 602 613 one of the most common models in information retrieval ir the vector space model represents a document set as a term document matrix where each row corresponds to a term and each column corresponds to a document because of the use of matrices in ir it is possible to apply linear algebra to this ir model this paper describes an application of linear algebra to text clustering namely a metric for measuring cluster quality the metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters the metric compares the singular values of the term document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms across clusters because the metric can be difficult to interpret a standardization of the metric is defined which specifies the number of standard deviations a clustering of a document set is from an average random clustering of that document set empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm davies bouldin index the davies bouldin index can be calculated by the following formula where n is the number of clusters is the centroid of cluster is the average distance of all elements in cluster to centroid and is the distance between centroids and since algorithms that produce clusters with low intra cluster distances high intra cluster similarity and high inter cluster distances low inter cluster similarity will have a low davies bouldin index the clustering algorithm that produces a collection of clusters with the smallest davies bouldin index is considered the best algorithm based on this criterion dunn index j c dunn 1974 the dunn index aims to identify dense and well separated clusters it is defined as the ratio between the minimal inter cluster distance to maximal intra cluster distance for each cluster partition the dunn index can be calculated by the following formula 27 where represents the distance between clusters and and measures the intra cluster distance of cluster the inter cluster distance between two clusters may be any number of distance measures such as the distance between the centroids of the clusters similarly the intra cluster distance may be measured in a variety ways such as the maximal distance between any pair of elements in cluster since internal criterion seek clusters with high intra cluster similarity and low inter cluster similarity algorithms that produce clusters with high dunn index are more desirable external evaluation edit in external evaluation clustering results are evaluated based on data that was not used for clustering such as known class labels and external benchmarks such benchmarks consist of a set of pre classified items and these sets are often created by human experts thus the benchmark sets can be thought of as a gold standard for evaluation these types of evaluation methods measure how close the clustering is to the predetermined benchmark classes however it has recently been discussed whether this is adequate for real data or only on synthetic data sets with a factual ground truth since classes can contain internal structure the attributes present may not allow separation of clusters or the classes may contain anomalies 28 additionally from a knowledge discovery point of view the reproduction of known knowledge may not necessarily 
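[Editor's note] The Davies-Bouldin and Dunn definitions above lost their formulas in extraction. The sketch below, assuming scikit-learn and SciPy, uses the library's Davies-Bouldin score and a small hand-rolled Dunn index (minimum between-cluster distance over maximum cluster diameter, one common variant) on synthetic data; it is an illustration, not the article's own computation.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dunn(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # maximal intra-cluster distance (cluster "diameter")
    max_diam = max(cdist(c, c).max() for c in clusters)
    # minimal distance between points of different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc, 0.4, size=(40, 2)) for loc in (0.0, 3.0, 6.0)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Davies-Bouldin (lower is better):", davies_bouldin_score(X, labels))
print("Dunn index (higher is better):   ", dunn(X, labels))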
be the intended result 28 some of the measures of quality of a cluster algorithm using external criterion include rand measure william m rand 1971 29 the rand index computes how similar the clusters returned by the clustering algorithm are to the benchmark classifications one can also view the rand index as a measure of the percentage of correct decisions made by the algorithm it can be computed using the following formula where is the number of true positives is the number of true negatives is the number of false positives and is the number of false negatives one issue with the rand index is that false positives and false negatives are equally weighted this may be an undesirable characteristic for some clustering applications the f measure addresses this concern f measure the f measure can be used to balance the contribution of false negatives by weighting recall through a parameter let precision and recall be defined as follows where is the precision rate and is the recall rate we can calculate the f measure by using the following formula 26 notice that when in other words recall has no impact on the f measure when and increasing allocates an increasing amount of weight to recall in the final f measure pair counting f measure is the f measure applied to the set of object pairs where objects are paired with each other when they are part of the same cluster this measure is able to compare clusterings with different numbers of clusters jaccard index the jaccard index is used to quantify the similarity between two datasets the jaccard index takes on a value between 0 and 1 an index of 1 means that the two dataset are identical and an index of 0 indicates that the datasets have no common elements the jaccard index is defined by the following formula this is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets fowlkes mallows index e b fowlkes amp c l mallows 1983 30 the fowlkes mallows index computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications the higher the value of the fowlkes mallows index the more similar the clusters and the benchmark classifications are it can be computed using the following formula where is the number of true positives is the number of false positives and is the number of false negatives the index is the geometric mean of the precision and recall and while the f measure is their harmonic mean 31 moreover precision and recall are also known as wallace s indices and 32 confusion matrix a confusion matrix can be used to quickly visualize the results of a classification or clustering algorithm it shows how different a cluster is from the gold standard cluster the mutual information is an information theoretic measure of how much information is shared between a clustering and a ground truth classification that can detect a non linear similarity between two clusterings adjusted mutual information is the corrected for chance variant of this that has a reduced bias for varying cluster numbers clustering axioms edit given that there is a myriad of clustering algorithms and objectives it is helpful to reason about clustering independently of any particular algorithm objective function or generative data model this can be achieved by defining a clustering function as one that satisfies a set of properties this is often termed as an axiomatic system functions that satisfy the basic axioms are called clustering functions 33 formal preliminaries edit a 
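[Editor's note] To make the pair-counting measures above concrete, the following sketch (assuming scikit-learn) counts the true/false positive and negative object pairs behind the Rand index and a pairwise Jaccard index by hand, and calls library routines for the chance-corrected Rand, Fowlkes-Mallows and mutual-information variants; the two label vectors are a made-up example.

from itertools import combinations
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             adjusted_mutual_info_score)

def pair_counts(truth, pred):
    """Count object pairs: TP (same cluster in both), TN, FP, FN."""
    tp = tn = fp = fn = 0
    for (t1, p1), (t2, p2) in combinations(zip(truth, pred), 2):
        same_t, same_p = (t1 == t2), (p1 == p2)
        if same_t and same_p:
            tp += 1
        elif not same_t and not same_p:
            tn += 1
        elif not same_t and same_p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [0, 0, 1, 1, 1, 1, 2, 2, 0]          # a deliberately imperfect clustering

tp, tn, fp, fn = pair_counts(truth, pred)
print("Rand index           :", (tp + tn) / (tp + tn + fp + fn))
print("Pairwise Jaccard     :", tp / (tp + fp + fn))
print("Adjusted Rand index  :", adjusted_rand_score(truth, pred))
print("Fowlkes-Mallows index:", fowlkes_mallows_score(truth, pred))
print("Adjusted mutual info :", adjusted_mutual_info_score(truth, pred))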
partitioning function acts on a set of points along with an integer and pairwise distances among the points in the points in are not assumed to belong to any specific set the pairwise distances are the only data the partitioning function has about them since we wish to deal with point sets that do not necessarily belong to a specific set we identify the points with the set we can then define a distance function to be any function such that for distinct we have if and only if and in other words must be symmetric and two points have distance zero if only if they are the same point a partitioning function is a function that takes a distance function on and an integer and returns a partitioning of a partitioning of is a collection of non empty disjoint subsets of whose union is the sets in will be called its clusters two clustering functions are equivalent if and only if they output the same partitioning on all values of and i e functionally equivalent axioms edit now in an effort to distinguish clustering functions from partitioning functions we lay down some properties that one may like a clustering function to satisfy here is the first one if is a distance function then we define to be the same function with all distances multiplied by scale invariance for any distance function number of clusters and scalar we have this property simply requires the function to be immune to stretching or shrinking the data points linearly it effectively disallows clustering functions to be sensitive to changes in units of measurement which is desirable we would like clustering functions to not have any predefined hard coded distance values in their decision process the next property ensures that the clustering function is rich in types of partitioning it could output for a fixed and let range be the set of all possible outputs while varying richess for any number of clusters range is equal to the set of all partitions of in other words if we are given a set of points such that all we know about the points are pairwise distances then for any partitioning there should exist a such that by varying distances amongst points we should be able to obtain all possible partitionings the next property is more subtle we call a partitioning function consistent if it satisfies the following when we shrink distances between points in the same cluster and expand distances between between points in different clusters we get the same result formally we say that is a transformation of if a for all belonging to the same cluster of we have and b for all belonging to different clusters of we have in other words is a transformation of such that points inside the same cluster are brought closer together and points not inside the same cluster are moved further away from one another consistency fix let be a distance function and be a transformation of then in other words suppose that we run the partitioning function on to get back a particular partitioning now with respect to if we shrink in cluster distances or expand between cluster distances and run again we should still get back the same result namely the partitioning function is forced to return a fixed number of clusters if this were not the case then the above three properties could never be satisfied by any function 34 in many popular clustering algorithms such as means single linkage and spectral clustering the number of clusters to be returned is determined beforehand by the human user or other methods and passed into the clustering function as a parameter applications edit 
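[Editor's note] Before the applications list that follows, the three axiom statements defined above (whose formulas were stripped during extraction) can be restated; the LaTeX below uses standard notation and is a reconstruction, not the article's original symbols. Here f is the clustering function, d a distance function on the point set S, k the number of clusters, and d' a consistency transformation of d with respect to f(d, k).

% Reconstruction in standard notation (an editorial assumption about the lost formulas).
\text{Scale invariance: } f(\lambda \cdot d,\, k) = f(d,\, k) \quad \text{for every scalar } \lambda > 0
\text{Richness: } \operatorname{Range}\bigl(f(\cdot,\, k)\bigr) = \{\text{all partitions of } S \text{ into } k \text{ non-empty sets}\}
\text{Consistency: } f(d',\, k) = f(d,\, k) \text{ whenever } d' \text{ shrinks within-cluster and expands between-cluster distances of } f(d,\, k)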
biology computational biology and bioinformatics plant and animal ecology cluster analysis is used to describe and to make spatial and temporal comparisons of communities assemblages of organisms in heterogeneous environments it is also used in plant systematics to generate artificial phylogenies or clusters of organisms individuals at the species genus or higher level that share a number of attributes transcriptomics clustering is used to build groups of genes with related expression patterns also known as coexpressed genes often such groups contain functionally related proteins such as enzymes for a specific pathway or genes that are co regulated high throughput experiments using expressed sequence tags ests or dna microarrays can be a powerful tool for genome annotation a general aspect of genomics sequence analysis clustering is used to group homologous sequences into gene families this is a very important concept in bioinformatics and evolutionary biology in general see evolution by gene duplication high throughput genotyping platforms clustering algorithms are used to automatically assign genotypes human genetic clustering the similarity of genetic data is used in clustering to infer population structures medicine medical imaging on pet scans cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image in this application actual position does not matter but the voxel intensity is considered as a vector with a dimension for each image that was taken over time this technique allows for example accurate measurement of the rate a radioactive tracer is delivered to the area of interest without a separate sampling of arterial blood an intrusive technique that is most common today imrt segmentation clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in mlc based radiation therapy business and marketing market research cluster analysis is widely used in market research when working with multivariate data from surveys and test panels market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers potential customers and for use in market segmentation product positioning new product development and selecting test markets grouping of shopping items clustering can be used to group all the shopping items available on the web into a set of unique products for example all the items on ebay can be grouped into unique products ebay doesn t have the concept of a sku world wide web social network analysis in the study of social networks clustering may be used to recognize communities within large groups of people search result grouping in the process of intelligent grouping of the files and websites clustering may be used to create a more relevant set of search results compared to normal search engines like google there are currently a number of web based clustering tools such as clusty slippy map optimization flickr s map of photos and other map sites use clustering to reduce the number of markers on a map this makes it both faster and reduces the amount of visual clutter computer science software evolution clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed it is a form of restructuring and hence is a way of directly preventative maintenance image segmentation clustering 
can be used to divide a digital image into distinct regions for border detection or object recognition evolutionary algorithms clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies recommender systems recommender systems are designed to recommend new items based on a user s tastes they sometimes use clustering algorithms to predict a user s preferences based on the preferences of other users in the user s cluster markov chain monte carlo methods clustering is often utilized to locate and characterize extrema in the target distribution social science crime analysis cluster analysis can be used to identify areas where there are greater incidences of particular types of crime by identifying these distinct areas or hot spots where a similar crime has happened over a period of time it is possible to manage law enforcement resources more effectively educational data mining cluster analysis is for example used to identify groups of schools or students with similar properties typologies from poll data projects such as those underaken by the pew research center use cluster analysis to discern typologies of opinions habits and demographics that may be useful in politics and marketing others field robotics clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data 35 mathematical chemistry to find structural similarity etc for example 3000 chemical compounds were clustered in the space of 90 topological indices 36 climatology to find weather regimes or preferred sea level pressure atmospheric patterns 37 petroleum geology cluster analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties physical geography the clustering of chemical properties in different sample locations see also edit wikimedia commons has media related to cluster analysis this section may require cleanup to meet wikipedia s quality standards october 2011 related topics edit main article data mining clustering high dimensional data curse of dimensionality data stream clustering dimension reduction silhouette parallel coordinates related methods edit see also category data clustering algorithms artificial neural network ann cluster weighted modeling consensus clustering constrained clustering idea networking instance based learning latent class analysis multidimensional scaling nearest neighbor search conceptual clustering neighbourhood components analysis paired difference test principal component analysis structured data analysis statistics sequence clustering references edit a b c d e f estivill castro v 2002 why so many clustering algorithms acm sigkdd explorations newsletter 4 65 doi 10 1145 568574 568575 160 edit r sibson 1973 slink an optimally efficient algorithm for the single link cluster method the computer journal british computer society 16 1 30 34 doi 10 1093 comjnl 16 1 30 160 d defays 1977 an efficient algorithm for a complete link method the computer journal british computer society 20 4 364 366 doi 10 1093 comjnl 20 4 364 160 lloyd s 1982 least squares quantization in pcm ieee transactions on information theory 28 2 129 137 doi 10 1109 tit 1982 1056489 160 edit hans peter kriegel peer kr ger j rg sander arthur zimek 2011 density based clustering wires data mining and knowledge discovery 1 3 231 240 doi 10 1002 widm 30 160 microsoft 
academic search most cited data mining articles dbscan is on rank 24 when accessed on 4 18 2010 martin ester hans peter kriegel j rg sander xiaowei xu 1996 a density based algorithm for discovering clusters in large spatial databases with noise in evangelos simoudis jiawei han usama m fayyad proceedings of the second international conference on knowledge discovery and data mining kdd 96 aaai press pp 160 226 231 isbn 160 1 57735 004 9 160 mihael ankerst markus m breunig hans peter kriegel j rg sander 1999 optics ordering points to identify the clustering structure acm sigmod international conference on management of data acm press pp 160 49 60 160 achtert e b hm c kr ger p 2006 deli clu boosting robustness completeness usability and efficiency of hierarchical clustering by a closest pair ranking lncs advances in knowledge discovery and data mining lecture notes in computer science 3918 119 128 doi 10 1007 11731139_16 isbn 160 978 3 540 33206 0 160 edit s roy d k bhattacharyya 2005 an approach to find embedded clusters using density based techniques lncs vol 3816 springer verlag pp 160 523 535 160 z huang extensions to the k means algorithm for clustering large data sets with categorical values data mining and knowledge discovery 2 283 304 1998 r ng and j han efficient and effective clustering method for spatial data mining in proceedings of the 20th vldb conference pages 144 155 santiago chile 1994 tian zhang raghu ramakrishnan miron livny an efficient data clustering method for very large databases in proc int l conf on management of data acm sigmod pp 103 114 can f ozkarahan e a 1990 concepts and effectiveness of the cover coefficient based clustering methodology for text databases acm transactions on database systems 15 4 483 doi 10 1145 99935 99938 160 edit agrawal r gehrke j gunopulos d raghavan p 2005 automatic subspace clustering of high dimensional data data mining and knowledge discovery 11 5 doi 10 1007 s10618 005 1396 1 160 edit karin kailing hans peter kriegel and peer kr ger density connected subspace clustering for high dimensional data in proc siam int conf on data mining sdm 04 pp 246 257 2004 achtert e b hm c kriegel h p kr ger p m ller gorman i zimek a 2006 finding hierarchies of subspace clusters lncs knowledge discovery in databases pkdd 2006 lecture notes in computer science 4213 446 453 doi 10 1007 11871637_42 isbn 160 978 3 540 45374 1 160 edit achtert e b hm c kriegel h p kr ger p m ller gorman i zimek a 2007 detection and visualization of subspace cluster hierarchies lncs advances in databases concepts systems and applications lecture notes in computer science 4443 152 163 doi 10 1007 978 3 540 71703 4_15 isbn 160 978 3 540 71702 7 160 edit achtert e b hm c kr ger p zimek a 2006 mining hierarchies of correlation clusters proc 18th international conference on scientific and statistical database management ssdbm 119 128 doi 10 1109 ssdbm 2006 35 isbn 160 0 7695 2590 3 160 edit b hm c kailing k kr ger p zimek a 2004 computing clusters of correlation connected objects proceedings of the 2004 acm sigmod international conference on management of data sigmod 04 p 160 455 doi 10 1145 1007568 1007620 isbn 160 1581138598 160 edit achtert e bohm c kriegel h p kr ger p zimek a 2007 on exploring complex relationships of correlation clusters 19th international conference on scientific and statistical database management ssdbm 2007 p 160 7 doi 10 1109 ssdbm 2007 21 isbn 160 0 7695 2868 6 160 edit meil marina 2003 comparing clusterings by the variation of information learning 
theory and kernel machines lecture notes in computer science 2777 173 187 doi 10 1007 978 3 540 45167 9_14 isbn 160 978 3 540 40720 1 160 alexander kraskov harald st gbauer ralph g andrzejak and peter grassberger hierarchical clustering based on mutual information 2003 arxiv q bio 0311039 auffarth b 2010 clustering by a genetic algorithm with biased mutation operator wcci cec ieee july 18 23 2010 http citeseerx ist psu edu viewdoc summary doi 10 1 1 170 869 b j frey and d dueck 2007 clustering by passing messages between data points science 315 5814 972 976 doi 10 1126 science 1136800 pmid 160 17218491 160 papercore summary frey2007 a b christopher d manning prabhakar raghavan amp hinrich schutze introduction to information retrieval cambridge university press isbn 160 978 0 521 86571 5 160 dunn j 1974 well separated clusters and optimal fuzzy partitions journal of cybernetics 4 95 104 doi 10 1080 01969727408546059 160 a b ines f rber stephan g nnemann hans peter kriegel peer kr ger emmanuel m ller erich schubert thomas seidl arthur zimek 2010 on using class labels in evaluation of clusterings in xiaoli z fern ian davidson jennifer dy multiclust discovering summarizing and using multiple clusterings acm sigkdd 160 w m rand 1971 objective criteria for the evaluation of clustering methods journal of the american statistical association american statistical association 66 336 846 850 doi 10 2307 2284239 jstor 160 2284239 160 e b fowlkes amp c l mallows 1983 a method for comparing two hierarchical clusterings journal of the american statistical association 78 553 569 l hubert et p arabie comparing partitions j of classification 2 1 1985 d l wallace comment journal of the american statistical association 78 160 569 579 1983 r b zadeh s ben david a uniqueness theorem for clustering in proceedings of the conference of uncertainty in artificial intelligence 2009 j kleinberg an impossibility theorem for clustering proceedings of the neural information processing systems conference 2002 bewley a et al real time volume estimation of a dragline payload ieee international conference on robotics and automation 2011 1571 1576 basak s c magnuson v r niemi c j regal r r determining structural similarity of chemicals using graph theoretic indices discr appl math 19 1988 17 44 huth r et al classifications of atmospheric circulation patterns recent advances and applications ann n y acad sci 1146 2008 105 152 retrieved from http en wikipedia org w index php title cluster_analysis amp oldid 561326233 categories data miningdata analysiscluster analysisgeostatisticsmachine learningmultivariate statisticshidden categories articles with inconsistent citation formatsarticles needing additional references from may 2012all articles needing additional referencescommons category without a link on wikidataarticles needing cleanup from october 2011all articles needing cleanupcleanup tagged articles without a reason field from october 2011wikipedia pages needing cleanup from october 2011 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky deutsch eesti espa ol euskara fran ais 
hrvatski italiano latvie u magyar nederlands polski portugus sloven ina srpskohrvatski svenska ti ng vi t edit links this page was last modified on 24 june 2013 at 07 56 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Co_occurrence_networks b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Co_occurrence_networks new file mode 100644 index 00000000..0fabf379 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Co_occurrence_networks @@ -0,0 +1 @@ +co occurrence networks wikipedia the free encyclopedia co occurrence networks from wikipedia the free encyclopedia jump to navigation search a co occurrence network created with kh coder co occurrence networks are generally used to provide a graphic visualization of potential relationships between people organizations concepts or other entities represented within written material the generation and visualization of co occurrence networks has become practical with the advent of electronically stored text amenable to text mining by way of definition co occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text networks are generated by connecting pairs of terms using a set of criteria defining co occurrence for example terms a and b may be said to co occur if they both appear in a particular article another article may contain terms b and c linking a to b and b to c creates a co occurrence network of these three terms rules to define co occurrence within a text corpus can be set according to desired criteria for example a more stringent criteria for co occurrence may require a pair of terms to appear in the same sentence contents 1 methods and development 2 applications and use 3 see also 4 references methods and development edit co occurrence networks can be created for any given list of terms any dictionary in relation to any collection of texts any text corpus co occurring pairs of terms can be called neighbors and these often group into neighborhoods based on their interconnections individual terms may have several neighbors neighborhoods may connect to one another through at least one individual term or may remain unconnected individual terms are within the context of text mining symbolically represented as text strings in the real world the entity identified by a term normally has several symbolic representations it is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols occurrence of an individual term is established by searching for each known symbolic representations of the term the process can be augmented through nlp natural language processing algorithms that interrogate segments of text for possible alternatives such as word order spacing and hyphenation nlp can also be used to identify sentence structure and categorize text strings according to grammar for example categorizing a string of text as a noun based on a preceding string of text known to be an article graphic representation of co occurrence networks allow them to be visualized and inferences drawn regarding 
relationships between entities in the domain represented by the dictionary of terms applied to the text corpus meaningful visualization normally requires simplifications of the network for example networks may be drawn such that the number of neighbors connecting to each term is limited the criteria for limiting neighbors might be based on the absolute number of co occurrences or more subtle criteria such as probability of co occurrence or the presence of an intervening descriptive term quantitative aspects of the underlying structure of a co occurrence network might also be informative such as the overall number of connections between entities clustering of entities representing sub domains detecting synonyms 1 etc applications and use edit some working applications of the co occurrence approach are available to the public through the internet pubgene is an example of an application that addresses the interests of biomedical community by presenting networks based on the co occurrence of genetics related terms as these appear in medline records 2 3 the website namebase is an example of how human relationships can be inferred by examining networks constructed from the co occurrence of personal names in newspapers and other texts as in ozgur et al 4 networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes so called open source intelligence or osint related techniques include co citation networks as well as the analysis of hyperlink and content structure on the internet such as in the analysis of web sites connected to terrorism 5 see also edit takada h saito k yamada t kimura m analysis of growing co occurrence networks sig kbs journal code x0831a 2006 vol 73rd no page 117 122 language japanese liu chua t s building semantic perceptron net for topic spotting proceedings of the 39th annual meeting on association for computational linguistics 2001 378 385 references edit cohen am hersh wr dubay c spackman k using co occurrence network structure to extract synonymous gene and protein names from medline abstracts bmc bioinformatics 2005 6 103 jenssen tk laegreid a komorowski j hovig e a literature network of human genes for high throughput analysis of gene expression nature genetics 2001 may 28 1 21 8 pmid 11326270 grivell l mining the bibliome searching for a needle in a haystack new computing tools are needed to effectively scan the growing amount of scientific literature for useful information embo reports 2001 mar 3 3 200 3 doi 10 1093 embo reports kvf059 pmid 11882534 ozgur a cetin b bingol h co occurrence network of reuters news 15 dec 2007 http arxiv org abs 0712 2491 zhou y reid e qin j chen h lai g us domestic extremist groups on the web link and content analysis http doi ieeecomputersociety org 10 1109 mis 2005 96 retrieved from http en wikipedia org w index php title co occurrence_networks amp oldid 559177716 categories biological databasescomputational linguisticsdata miningdomain specific search enginesintelligence gathering disciplinesmedical researchopen source intelligence navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite 
this page print export create a book download as pdf printable version languages edit links this page was last modified on 10 june 2013 at 05 16 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computational_complexity_theory b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computational_complexity_theory new file mode 100644 index 00000000..7ebf39a3 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computational_complexity_theory @@ -0,0 +1 @@ +computational complexity theory wikipedia the free encyclopedia computational complexity theory from wikipedia the free encyclopedia jump to navigation search computational complexity theory is a branch of the theory of computation in theoretical computer science and mathematics that focuses on classifying computational problems according to their inherent difficulty and relating those classes to each other a computational problem is understood to be a task that is in principle amenable to being solved by a computer which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps a problem is regarded as inherently difficult if its solution requires significant resources whatever the algorithm used the theory formalizes this intuition by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them such as time and storage other complexity measures are also used such as the amount of communication used in communication complexity the number of gates in a circuit used in circuit complexity and the number of processors used in parallel computing one of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do closely related fields in theoretical computer science are analysis of algorithms and computability theory a key distinction between analysis of algorithms and computational complexity theory is that the former is devoted to analyzing the amount of resources needed by a particular algorithm to solve a problem whereas the latter asks a more general question about all possible algorithms that could be used to solve the same problem more precisely it tries to classify problems that can or cannot be solved with appropriately restricted resources in turn imposing restrictions on the available resources is what distinguishes computational complexity from computability theory the latter theory asks what kind of problems can in principle be solved algorithmically contents 1 computational problems 1 1 problem instances 1 2 representing problem instances 1 3 decision problems as formal languages 1 4 function problems 1 5 measuring the size of an instance 2 machine models and complexity measures 2 1 turing machine 2 2 other machine models 2 3 complexity measures 2 4 best worst and average case complexity 2 5 upper and lower bounds on the complexity of problems 3 complexity classes 3 1 defining complexity classes 3 2 important complexity classes 3 3 hierarchy theorems 3 4 reduction 4 important open problems 4 1 p versus np problem 4 2 problems in np not known to 
be in p or np complete 4 3 separations between other complexity classes 5 intractability 6 continuous complexity theory 7 history 8 see also 9 notes 10 references 10 1 textbooks 10 2 surveys 11 external links computational problems edit a traveling salesperson tour through germany s 15 largest cities problem instances edit a computational problem can be viewed as an infinite collection of instances together with a solution for every instance the input string for a computational problem is referred to as a problem instance and should not be confused with the problem itself in computational complexity theory a problem refers to the abstract question to be solved in contrast an instance of this problem is a rather concrete utterance which can serve as the input for a decision problem for example consider the problem of primality testing the instance is a number e g 15 and the solution is yes if the number is prime and no otherwise in this case no stated another way the instance is a particular input to the problem and the solution is the output corresponding to the given input to further highlight the difference between a problem and an instance consider the following instance of the decision version of the traveling salesman problem is there a route of at most 2000 kilometres passing through all of germany s 15 largest cities the quantitative answer to this particular problem instance is of little use for solving other instances of the problem such as asking for a round trip through all sites in milan whose total length is at most 10 160 km for this reason complexity theory addresses computational problems and not particular problem instances representing problem instances edit when considering computational problems a problem instance is a string over an alphabet usually the alphabet is taken to be the binary alphabet i e the set 0 1 and thus the strings are bitstrings as in a real world computer mathematical objects other than bitstrings must be suitably encoded for example integers can be represented in binary notation and graphs can be encoded directly via their adjacency matrices or by encoding their adjacency lists in binary even though some proofs of complexity theoretic theorems regularly assume some concrete choice of input encoding one tries to keep the discussion abstract enough to be independent of the choice of encoding this can be achieved by ensuring that different representations can be transformed into each other efficiently decision problems as formal languages edit a decision problem has only two possible outputs yes or no or alternately 1 or 0 on any input decision problems are one of the central objects of study in computational complexity theory a decision problem is a special type of computational problem whose answer is either yes or no or alternately either 1 or 0 a decision problem can be viewed as a formal language where the members of the language are instances whose output is yes and the non members are those instances whose output is no the objective is to decide with the aid of an algorithm whether a given input string is a member of the formal language under consideration if the algorithm deciding this problem returns the answer yes the algorithm is said to accept the input string otherwise it is said to reject the input an example of a decision problem is the following the input is an arbitrary graph the problem consists in deciding whether the given graph is connected or not the formal language associated with this decision problem is then the set of all 
connected graphs of course to obtain a precise definition of this language one has to decide how graphs are encoded as binary strings function problems edit a function problem is a computational problem where a single output of a total function is expected for every input but the output is more complex than that of a decision problem that is it isn t just yes or no notable examples include the traveling salesman problem and the integer factorization problem it is tempting to think that the notion of function problems is much richer than the notion of decision problems however this is not really the case since function problems can be recast as decision problems for example the multiplication of two integers can be expressed as the set of triples a 160 b 160 c such that the relation a 160 160 b 160 160 c holds deciding whether a given triple is member of this set corresponds to solving the problem of multiplying two numbers similarly finding the minimum value of a mathematical function f x is equivalent to a search on k for the problem of determining whether a feasible point exists for f x k measuring the size of an instance edit to measure the difficulty of solving a computational problem one may wish to see how much time the best algorithm requires to solve the problem however the running time may in general depend on the instance in particular larger instances will require more time to solve thus the time required to solve a problem or the space required or any measure of complexity is calculated as function of the size of the instance this is usually taken to be the size of the input in bits complexity theory is interested in how algorithms scale with an increase in the input size for instance in the problem of finding whether a graph is connected how much more time does it take to solve a problem for a graph with 2n vertices compared to the time taken for a graph with n vertices if the input size is n the time taken can be expressed as a function of n since the time taken on different inputs of the same size can be different the worst case time complexity t n is defined to be the maximum time taken over all inputs of size n if t n is a polynomial in n then the algorithm is said to be a polynomial time algorithm cobham s thesis says that a problem can be solved with a feasible amount of resources if it admits a polynomial time algorithm machine models and complexity measures edit turing machine edit an artistic representation of a turing machine main article turing machine a turing machine is a mathematical model of a general computing machine it is a theoretical device that manipulates symbols contained on a strip of tape turing machines are not intended as a practical computing technology but rather as a thought experiment representing a computing machine anything from an advanced supercomputer to a mathematician with a pencil and paper it is believed that if a problem can be solved by an algorithm there exists a turing machine that solves the problem indeed this is the statement of the church turing thesis furthermore it is known that everything that can be computed on other models of computation known to us today such as a ram machine conway s game of life cellular automata or any programming language can be computed on a turing machine since turing machines are easy to analyze mathematically and are believed to be as powerful as any other model of computation the turing machine is the most commonly used model in complexity theory many types of turing machines are used to define 
complexity classes such as deterministic turing machines probabilistic turing machines non deterministic turing machines quantum turing machines symmetric turing machines and alternating turing machines they are all equally powerful in principle but when resources such as time or space are bounded some of these may be more powerful than others a deterministic turing machine is the most basic turing machine which uses a fixed set of rules to determine its future actions a probabilistic turing machine is a deterministic turing machine with an extra supply of random bits the ability to make probabilistic decisions often helps algorithms solve problems more efficiently algorithms that use random bits are called randomized algorithms a non deterministic turing machine is a deterministic turing machine with an added feature of non determinism which allows a turing machine to have multiple possible future actions from a given state one way to view non determinism is that the turing machine branches into many possible computational paths at each step and if it solves the problem in any of these branches it is said to have solved the problem clearly this model is not meant to be a physically realizable model it is just a theoretically interesting abstract machine that gives rise to particularly interesting complexity classes for examples see nondeterministic algorithm other machine models edit many machine models different from the standard multi tape turing machines have been proposed in the literature for example random access machines perhaps surprisingly each of these models can be converted to another without providing any extra computational power the time and memory consumption of these alternate models may vary 1 what all these models have in common is that the machines operate deterministically however some computational problems are easier to analyze in terms of more unusual resources for example a nondeterministic turing machine is a computational model that is allowed to branch out to check many different possibilities at once the nondeterministic turing machine has very little to do with how we physically want to compute algorithms but its branching exactly captures many of the mathematical models we want to analyze so that nondeterministic time is a very important resource in analyzing computational problems complexity measures edit for a precise definition of what it means to solve a problem using a given amount of time and space a computational model such as the deterministic turing machine is used the time required by a deterministic turing machine m on input x is the total number of state transitions or steps the machine makes before it halts and outputs the answer yes or no a turing machine m is said to operate within time f n if the time required by m on each input of length n is at most f n a decision problem a can be solved in time f n if there exists a turing machine operating in time f n that solves the problem since complexity theory is interested in classifying problems based on their difficulty one defines sets of problems based on some criteria for instance the set of problems solvable within time f n on a deterministic turing machine is then denoted by dtime f n analogous definitions can be made for space requirements although time and space are the most well known complexity resources any complexity measure can be viewed as a computational resource complexity measures are very generally defined by the blum complexity axioms other complexity measures used in complexity 
theory include communication complexity circuit complexity and decision tree complexity the complexity of an algorithm is often expressed using big o notation best worst and average case complexity edit visualization of the quicksort algorithm that has average case performance the best worst and average case complexity refer to three different ways of measuring the time complexity or any other complexity measure of different inputs of the same size since some inputs of size n may be faster to solve than others we define the following complexities best case complexity this is the complexity of solving the problem for the best input of size n worst case complexity this is the complexity of solving the problem for the worst input of size n average case complexity this is the complexity of solving the problem on an average this complexity is only defined with respect to a probability distribution over the inputs for instance if all inputs of the same size are assumed to be equally likely to appear the average case complexity can be defined with respect to the uniform distribution over all inputs of size n for example consider the deterministic sorting algorithm quicksort this solves the problem of sorting a list of integers that is given as the input the worst case is when the input is sorted or sorted in reverse order and the algorithm takes time o n2 for this case if we assume that all possible permutations of the input list are equally likely the average time taken for sorting is o n log n the best case occurs when each pivoting divides the list in half also needing o n log n time upper and lower bounds on the complexity of problems edit to classify the computation time or similar resources such as space consumption one is interested in proving upper and lower bounds on the minimum amount of time required by the most efficient algorithm solving a given problem the complexity of an algorithm is usually taken to be its worst case complexity unless specified otherwise analyzing a particular algorithm falls under the field of analysis of algorithms to show an upper bound t n on the time complexity of a problem one needs to show only that there is a particular algorithm with running time at most t n however proving lower bounds is much more difficult since lower bounds make a statement about all possible algorithms that solve a given problem the phrase all possible algorithms includes not just the algorithms known today but any algorithm that might be discovered in the future to show a lower bound of t n for a problem requires showing that no algorithm can have time complexity lower than t n upper and lower bounds are usually stated using the big o notation which hides constant factors and smaller terms this makes the bounds independent of the specific details of the computational model used for instance if t n 160 160 7n2 160 160 15n 160 160 40 in big o notation one would write t n 160 160 o n2 complexity classes edit defining complexity classes edit a complexity class is a set of problems of related complexity simpler complexity classes are defined by the following factors the type of computational problem the most commonly used problems are decision problems however complexity classes can be defined based on function problems counting problems optimization problems promise problems etc the model of computation the most common model of computation is the deterministic turing machine but many complexity classes are based on nondeterministic turing machines boolean circuits quantum turing 
machines monotone circuits etc the resource or resources that are being bounded and the bounds these two properties are usually stated together such as polynomial time logarithmic space constant depth etc of course some complexity classes have complex definitions that do not fit into this framework thus a typical complexity class has a definition like the following the set of decision problems solvable by a deterministic turing machine within time f n this complexity class is known as dtime f n but bounding the computation time above by some concrete function f n often yields complexity classes that depend on the chosen machine model for instance the language xx x is any binary string can be solved in linear time on a multi tape turing machine but necessarily requires quadratic time in the model of single tape turing machines if we allow polynomial variations in running time cobham edmonds thesis states that the time complexities in any two reasonable and general models of computation are polynomially related goldreich 2008 chapter 1 2 this forms the basis for the complexity class p which is the set of decision problems solvable by a deterministic turing machine within polynomial time the corresponding set of function problems is fp important complexity classes edit a representation of the relation among complexity classes many important complexity classes can be defined by bounding the time or space used by the algorithm some important complexity classes of decision problems defined in this manner are the following complexity class model of computation resource constraint dtime f n deterministic turing machine time f n p deterministic turing machine time poly n exptime deterministic turing machine time 2poly n ntime f n non deterministic turing machine time f n np non deterministic turing machine time poly n nexptime non deterministic turing machine time 2poly n dspace f n deterministic turing machine space f n l deterministic turing machine space o log n pspace deterministic turing machine space poly n expspace deterministic turing machine space 2poly n nspace f n non deterministic turing machine space f n nl non deterministic turing machine space o log n npspace non deterministic turing machine space poly n nexpspace non deterministic turing machine space 2poly n it turns out that pspace npspace and expspace nexpspace by savitch s theorem other important complexity classes include bpp zpp and rp which are defined using probabilistic turing machines ac and nc which are defined using boolean circuits and bqp and qma which are defined using quantum turing machines p is an important complexity class of counting problems not decision problems classes like ip and am are defined using interactive proof systems all is the class of all decision problems hierarchy theorems edit main articles time hierarchy theorem and space hierarchy theorem for the complexity classes defined in this way it is desirable to prove that relaxing the requirements on say computation time indeed defines a bigger set of problems in particular although dtime n is contained in dtime n2 it would be interesting to know if the inclusion is strict for time and space requirements the answer to such questions is given by the time and space hierarchy theorems respectively they are called hierarchy theorems because they induce a proper hierarchy on the classes defined by constraining the respective resources thus there are pairs of complexity classes such that one is properly included in the other having deduced such proper set 
inclusions we can proceed to make quantitative statements about how much more additional time or space is needed in order to increase the number of problems that can be solved more precisely the time hierarchy theorem states that the space hierarchy theorem states that the time and space hierarchy theorems form the basis for most separation results of complexity classes for instance the time hierarchy theorem tells us that p is strictly contained in exptime and the space hierarchy theorem tells us that l is strictly contained in pspace reduction edit main article reduction complexity many complexity classes are defined using the concept of a reduction a reduction is a transformation of one problem into another problem it captures the informal notion of a problem being at least as difficult as another problem for instance if a problem x can be solved using an algorithm for y x is no more difficult than y and we say that x reduces to y there are many different types of reductions based on the method of reduction such as cook reductions karp reductions and levin reductions and the bound on the complexity of reductions such as polynomial time reductions or log space reductions the most commonly used reduction is a polynomial time reduction this means that the reduction process takes polynomial time for example the problem of squaring an integer can be reduced to the problem of multiplying two integers this means an algorithm for multiplying two integers can be used to square an integer indeed this can be done by giving the same input to both inputs of the multiplication algorithm thus we see that squaring is not more difficult than multiplication since squaring can be reduced to multiplication this motivates the concept of a problem being hard for a complexity class a problem x is hard for a class of problems c if every problem in c can be reduced to x thus no problem in c is harder than x since an algorithm for x allows us to solve any problem in c of course the notion of hard problems depends on the type of reduction being used for complexity classes larger than p polynomial time reductions are commonly used in particular the set of problems that are hard for np is the set of np hard problems if a problem x is in c and hard for c then x is said to be complete for c this means that x is the hardest problem in c since many problems could be equally hard one might say that x is one of the hardest problems in c thus the class of np complete problems contains the most difficult problems in np in the sense that they are the ones most likely not to be in p because the problem p 160 160 np is not solved being able to reduce a known np complete problem 2 to another problem 1 would indicate that there is no known polynomial time solution for 1 this is because a polynomial time solution to 1 would yield a polynomial time solution to 2 similarly because all np problems can be reduced to the set finding an np complete problem that can be solved in polynomial time would mean that p 160 160 np 2 important open problems edit diagram of complexity classes provided that p 160 160 np the existence of problems in np outside both p and np complete in this case was established by ladner 3 p versus np problem edit main article p versus np problem the complexity class p is often seen as a mathematical abstraction modeling those computational tasks that admit an efficient algorithm this hypothesis is called the cobham edmonds thesis the complexity class np on the other hand contains many problems that people would 
like to solve efficiently but for which no efficient algorithm is known such as the boolean satisfiability problem the hamiltonian path problem and the vertex cover problem since deterministic turing machines are special nondeterministic turing machines it is easily observed that each problem in p is also member of the class np the question of whether p equals np is one of the most important open questions in theoretical computer science because of the wide implications of a solution 2 if the answer is yes many important problems can be shown to have more efficient solutions these include various types of integer programming problems in operations research many problems in logistics protein structure prediction in biology 4 and the ability to find formal proofs of pure mathematics theorems 5 the p versus np problem is one of the millennium prize problems proposed by the clay mathematics institute there is a us 1 000 000 prize for resolving the problem 6 problems in np not known to be in p or np complete edit it was shown by ladner that if p np then there exist problems in np that are neither in p nor np complete 3 such problems are called np intermediate problems the graph isomorphism problem the discrete logarithm problem and the integer factorization problem are examples of problems believed to be np intermediate they are some of the very few np problems not known to be in p or to be np complete the graph isomorphism problem is the computational problem of determining whether two finite graphs are isomorphic an important unsolved problem in complexity theory is whether the graph isomorphism problem is in p np complete or np intermediate the answer is not known but it is believed that the problem is at least not np complete 7 if graph isomorphism is np complete the polynomial time hierarchy collapses to its second level 8 since it is widely believed that the polynomial hierarchy does not collapse to any finite level it is believed that graph isomorphism is not np complete the best algorithm for this problem due to laszlo babai and eugene luks has run time 2o n log n for graphs with n vertices the integer factorization problem is the computational problem of determining the prime factorization of a given integer phrased as a decision problem it is the problem of deciding whether the input has a factor less than k no efficient integer factorization algorithm is known and this fact forms the basis of several modern cryptographic systems such as the rsa algorithm the integer factorization problem is in np and in co np and even in up and co up 9 if the problem is np complete the polynomial time hierarchy will collapse to its first level i e np will equal co np the best known algorithm for integer factorization is the general number field sieve which takes time o e 64 9 1 3 n log 2 1 3 log n log 2 2 3 to factor an n bit integer however the best known quantum algorithm for this problem shor s algorithm does run in polynomial time unfortunately this fact doesn t say much about where the problem lies with respect to non quantum complexity classes separations between other complexity classes edit many known complexity classes are suspected to be unequal but this has not been proved for instance p np pp pspace but it is possible that p pspace if p is not equal to np then p is not equal to pspace either since there are many known complexity classes between p and pspace such as rp bpp pp bqp ma ph etc it is possible that all these complexity classes collapse to one class proving that any of these 
classes are unequal would be a major breakthrough in complexity theory along the same lines co np is the class containing the complement problems i e problems with the yes no answers reversed of np problems it is believed 10 that np is not equal to co np however it has not yet been proven it has been shown that if these two complexity classes are not equal then p is not equal to np similarly it is not known if l the set of all problems that can be solved in logarithmic space is strictly contained in p or equal to p again there are many complexity classes between the two such as nl and nc and it is not known if they are distinct or equal classes it is suspected that p and bpp are equal however it is currently open if bpp nexp intractability edit see also combinatorial explosion problems that can be solved in theory e g given infinite time but which in practice take too long for their solutions to be useful are known as intractable problems 11 in complexity theory problems that lack polynomial time solutions are considered to be intractable for more than the smallest inputs in fact the cobham edmonds thesis states that only those problems that can be solved in polynomial time can be feasibly computed on some computational device problems that are known to be intractable in this sense include those that are exptime hard if np is not the same as p then the np complete problems are also intractable in this sense to see why exponential time algorithms might be unusable in practice consider a program that makes 2n operations before halting for small n say 100 and assuming for the sake of example that the computer does 1012 operations each second the program would run for about 4 160 160 1010 years which is the same order of magnitude as the age of the universe even with a much faster computer the program would only be useful for very small instances and in that sense the intractability of a problem is somewhat independent of technological progress nevertheless a polynomial time algorithm is not always practical if its running time is say n15 it is unreasonable to consider it efficient and it is still useless except on small instances what intractability means in practice is open to debate saying that a problem is not in p does not imply that all large cases of the problem are hard or even that most of them are for example the decision problem in presburger arithmetic has been shown not to be in p yet algorithms have been written that solve the problem in reasonable times in most cases similarly algorithms can solve the np complete knapsack problem over a wide range of sizes in less than quadratic time and sat solvers routinely handle large instances of the np complete boolean satisfiability problem continuous complexity theory edit continuous complexity theory can refer to complexity theory of problems that involve continuous functions that are approximated by discretizations as studied in numerical analysis one approach to complexity theory of numerical analysis 12 is information based complexity continuous complexity theory can also refer to complexity theory of the use of analog computation which uses continuous dynamical systems and differential equations 13 control theory can be considered a form of computation and differential equations are used in the modelling of continuous time and hybrid discrete continuous time systems 14 history edit the analysis of algorithms has been studied long before the invention of computers gabriel lam gave a running time analysis of the euclidean algorithm in 
1844 before the actual research explicitly devoted to the complexity of algorithmic problems started off numerous foundations were laid out by various researchers most influential among these was the definition of turing machines by alan turing in 1936 which turned out to be a very robust and flexible notion of computer fortnow amp homer 2003 date the beginning of systematic studies in computational complexity to the seminal paper on the computational complexity of algorithms by juris hartmanis and richard stearns 1965 which laid out the definitions of time and space complexity and proved the hierarchy theorems also in 1965 edmonds defined a good algorithm as one with running time bounded by a polynomial of the input size 15 according to fortnow amp homer 2003 earlier papers studying problems solvable by turing machines with specific bounded resources include john myhill s definition of linear bounded automata myhill 1960 raymond smullyan s study of rudimentary sets 1961 as well as hisao yamada s paper 16 on real time computations 1962 somewhat earlier boris trakhtenbrot 1956 a pioneer in the field from the ussr studied another specific complexity measure 17 as he remembers however my initial interest in automata theory was increasingly set aside in favor of computational complexity an exciting fusion of combinatorial methods inherited from switching theory with the conceptual arsenal of the theory of algorithms these ideas had occurred to me earlier in 1955 when i coined the term signalizing function which is nowadays commonly known as complexity measure boris trakhtenbrot 160 from logic to theoretical computer science an update in pillars of computer science lncs 4800 springer 2008 in 1967 manuel blum developed an axiomatic complexity theory based on his axioms and proved an important result the so called speed up theorem the field really began to flourish in 1971 when the us researcher stephen cook and working independently leonid levin in the ussr proved that there exist practically relevant problems that are np complete in 1972 richard karp took this idea a leap forward with his landmark paper reducibility among combinatorial problems in which he showed that 21 diverse combinatorial and graph theoretical problems each infamous for its computational intractability are np complete 18 relationship between computability theory complexity theory and formal language theory see also edit list of computability and complexity topics list of important publications in theoretical computer science unsolved problems in computer science category computational problems list of complexity classes structural complexity theory descriptive complexity theory quantum complexity theory context of computational complexity parameterized complexity game complexity proof complexity transcomputational problem notes edit references edit see arora amp barak 2009 chapter 1 the computational model and why it doesn t matter a b see sipser 2006 chapter 7 time complexity a b ladner richard e 1975 on the structure of polynomial time reducibility pdf journal of the acm jacm 22 1 151 171 doi 10 1145 321864 321877 160 berger bonnie a leighton t 1998 protein folding in the hydrophobic hydrophilic hp model is np complete journal of computational biology 5 1 27 40 doi 10 1089 cmb 1998 5 27 pmid 160 9541869 160 cook stephen april 2000 the p versus np problem clay mathematics institute retrieved 2006 10 18 160 jaffe arthur m 2006 the millennium grand challenge in mathematics notices of the ams 53 6 retrieved 2006 10 18 160 
arvind vikraman kurur piyush p 2006 graph isomorphism is in spp information and computation 204 5 835 852 doi 10 1016 j ic 2006 02 002 160 uwe sch ning graph isomorphism is in the low hierarchy proceedings of the 4th annual symposium on theoretical aspects of computer science 1987 114 124 also journal of computer and system sciences vol 37 1988 312 323 lance fortnow computational complexity blog complexity class of the week factoring september 13 2002 http weblog fortnow com 2002 09 complexity class of week factoring html boaz barak s course on computational complexity lecture 2 hopcroft j e motwani r and ullman j d 2007 introduction to automata theory languages and computation addison wesley boston san francisco new york page 368 smale steve 1997 complexity theory and numerical analysis acta numerica cambridge univ press citeseerx 10 1 1 33 4678 160 a survey on continuous time computations olivier bournez manuel campagnolo new computational paradigms changing conceptions of what is computable cooper s b and l o we b and sorbi a eds new york springer verlag pages 383 423 2008 tomlin claire j mitchell ian bayen alexandre m oishi meeko july 2003 computational techniques for the verification of hybrid systems proceedings of the ieee 91 7 citeseerx 10 1 1 70 4296 160 richard m karp combinatorics complexity and randomness 1985 turing award lecture yamada h 1962 real time computation and recursive functions not real time computable ieee transactions on electronic computers ec 11 6 753 760 doi 10 1109 tec 1962 5219459 160 edit trakhtenbrot b a signalizing functions and tabular operators uchionnye zapiski penzenskogo pedinstituta transactions of the penza pedagogoical institute 4 75 87 1956 in russian richard m karp 1972 reducibility among combinatorial problems in r e miller and j w thatcher editors complexity of computer computations new york plenum pp 160 85 103 160 textbooks edit arora sanjeev barak boaz 2009 computational complexity a modern approach cambridge isbn 160 978 0 521 42426 4 zbl 160 1193 68112 160 downey rod fellows michael 1999 parameterized complexity berlin new york springer verlag 160 du ding zhu ko ker i 2000 theory of computational complexity john wiley amp sons isbn 160 978 0 471 34506 0 160 goldreich oded 2008 computational complexity a conceptual perspective cambridge university press 160 van leeuwen jan ed 1990 handbook of theoretical computer science vol a algorithms and complexity mit press isbn 160 978 0 444 88071 0 160 papadimitriou christos 1994 computational complexity 1st ed addison wesley isbn 160 0 201 53082 1 160 sipser michael 2006 introduction to the theory of computation 2nd ed usa thomson course technology isbn 160 0 534 95097 3 160 garey michael r johnson david s 1979 computers and intractability a guide to the theory of np completeness w 160 h 160 freeman isbn 160 0 7167 1045 5 160 surveys edit khalil hatem ulery dana 1976 a review of current studies on complexity of algorithms for partial differential equations acm 76 proceedings of the 1976 annual conference p 160 197 doi 10 1145 800191 805573 160 cook stephen 1983 an overview of computational complexity commun acm acm 26 6 400 408 doi 10 1145 358141 358144 issn 160 0001 0782 160 fortnow lance homer steven 2003 a short history of computational complexity bulletin of the eatcs 80 95 133 160 mertens stephan 2002 computational complexity for physicists computing in science and engg piscataway nj usa ieee educational activities department 4 3 31 47 arxiv cond mat 0012185 doi 10 1109 5992 998639 issn 160 
1521 9615 160 external links edit the complexity zoo v t e important complexity classes more considered feasible dlogtime ac0 acc0 tc0 l sl rl nl nc sc cc p p complete zpp rp bpp bqp suspected infeasible up np np complete np hard co np co np complete am ph p pp p p complete ip pspace pspace complete considered infeasible exptime nexptime expspace elementary pr r re all class hierarchies polynomial hierarchy exponential hierarchy grzegorczyk hierarchy arithmetic hierarchy boolean hierarchy families of classes dtime ntime dspace nspace probabilistically checkable proof interactive proof system retrieved from http en wikipedia org w index php title computational_complexity_theory amp oldid 561560583 categories computational complexity theory navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky deutsch eesti espa ol fran ais hrvatski italiano lietuvi bahasa melayu nederlands norsk bokm l polski portugus rom n simple english sloven ina srpski srpskohrvatski suomi svenska ti ng vi t edit links this page was last modified on 25 june 2013 at 19 07 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computer_science b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computer_science new file mode 100644 index 00000000..e08d422a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Computer_science @@ -0,0 +1 @@ +computer science wikipedia the free encyclopedia computer science from wikipedia the free encyclopedia jump to navigation search computer science or computing science abbreviated cs or compsci is the scientific and practical approach to computation and its applications a computer scientist specializes in the theory of computation and the design of computational systems 1 its subfields can be divided into a variety of theoretical and practical disciplines some fields such as computational complexity theory which explores the fundamental properties of computational problems are highly abstract whilst fields such as computer graphics emphasize real world visual applications still other fields focus on the challenges in implementing computation for example programming language theory considers various approaches to the description of computation whilst the study of computer programming itself investigates various aspects of the use of programming language and complex systems human computer interaction considers the challenges in making computers and computations useful usable and universally accessible to humans computer science deals with the theoretical foundations of information and computation together with practical techniques for the implementation and application of these foundations contents 1 history 1 1 major achievements 2 
philosophy 2 1 name of the field 3 areas of computer science 3 1 theoretical computer science 3 1 1 theory of computation 3 1 2 information and coding theory 3 1 3 algorithms and data structures 3 1 4 programming language theory 3 1 5 formal methods 3 2 applied computer science 3 2 1 artificial intelligence 3 2 2 computer architecture and engineering 3 2 3 computer graphics and visualization 3 2 4 computer security and cryptography 3 2 5 computational science 3 2 6 computer networks 3 2 7 concurrent parallel and distributed systems 3 2 8 databases and information retrieval 3 2 9 health informatics 3 2 10 information science 3 2 11 software engineering 4 academia 4 1 conferences 4 2 journals 5 education 6 see also 7 notes 8 references 9 further reading 10 external links history edit main article history of computer science charles babbage is credited with inventing the first mechanical computer ada lovelace is credited with writing the first algorithm intended for processing on a computer the earliest foundations of what would become computer science predate the invention of the modern digital computer machines for calculating fixed numerical tasks such as the abacus have existed since antiquity but they only supported the human mind aiding in computations as complex as multiplication and division blaise pascal designed and constructed the first working mechanical calculator pascal s calculator in 1642 two hundred years later thomas de colmar launched the mechanical calculator industry 2 when he released his simplified arithmometer which was the first calculating machine strong enough and reliable enough to be used daily in an office environment charles babbage started the design of the first automatic mechanical calculator his difference engine in 1822 which eventually gave him the idea of the first programmable mechanical calculator his analytical engine 3 he started developing this machine in 1834 and in less than two years he had sketched out many of the salient features of the modern computer a crucial step was the adoption of a punched card system derived from the jacquard loom 4 making it infinitely programmable 5 in 1843 during the translation of a french article on the analytical engine ada lovelace wrote in one of the many notes she included an algorithm to compute the bernoulli numbers which is considered to be the first computer program 6 around 1885 herman hollerith invented the tabulator which used punched cards to process statistical information eventually his company became part of ibm in 1937 one hundred years after babbage s impossible dream howard aiken convinced ibm which was making all kinds of punched card equipment and was also in the calculator business 7 to develop his giant programmable calculator the ascc harvard mark i based on babbage s analytical engine which itself used cards and a central computing unit when the machine was finished some hailed it as babbage s dream come true 8 during the 1940s as new and more powerful computing machines were developed the term computer came to refer to the machines rather than their human predecessors 9 as it became clear that computers could be used for more than just mathematical calculations the field of computer science broadened to study computation in general computer science began to be established as a distinct academic discipline in the 1950s and early 1960s 10 11 the world s first computer science degree program the cambridge diploma in computer science began at the university of cambridge computer laboratory in 
1953 the first computer science degree program in the united states was formed at purdue university in 1962 12 since practical computers became available many applications of computing have become distinct areas of study in their own right although many initially believed it was impossible that computers themselves could actually be a scientific field of study in the late fifties it gradually became accepted among the greater academic population 13 it is the now well known ibm brand that formed part of the computer science revolution during this time ibm short for international business machines released the ibm 704 14 and later the ibm 709 15 computers which were widely used during the exploration period of such devices still working with the ibm computer was frustrating if you had misplaced as much as one letter in one instruction the program would crash and you would have to start the whole process over again 13 during the late 1950s the computer science discipline was very much in its developmental stages and such issues were commonplace time has seen significant improvements in the usability and effectiveness of computing technology modern society has seen a significant shift in the users of computer technology from usage only by experts and professionals to a near ubiquitous user base initially computers were quite costly and some degree of human aid was needed for efficient use in part from professional computer operators as computer adoption became more widespread and affordable less human assistance was needed for common usage major achievements edit the german military used the enigma machine shown here during world war ii for communication they thought to be secret the large scale decryption of enigma traffic at bletchley park was an important factor that contributed to allied victory in wwii 16 despite its short history as a formal academic discipline computer science has made a number of fundamental contributions to science and society in fact along with electronics it is a founding science of the current epoch of human history called the information age and a driver of the information revolution seen as the third major leap in human technological progress after the industrial revolution 1750 1850 ce and the agricultural revolution 8000 5000 bce these contributions include the start of the digital revolution which includes the current information age and the internet 17 a formal definition of computation and computability and proof that there are computationally unsolvable and intractable problems 18 the concept of a programming language a tool for the precise expression of methodological information at various levels of abstraction 19 in cryptography breaking the enigma code was an important factor contributing to the allied victory in world war ii 16 scientific computing enabled practical evaluation of processes and situations of great complexity as well as experimentation entirely by software it also enabled advanced study of the mind and mapping of the human genome became possible with the human genome project 17 distributed computing projects such as folding home explore protein folding algorithmic trading has increased the efficiency and liquidity of financial markets by using artificial intelligence machine learning and other statistical and numerical techniques on a large scale 20 high frequency algorithmic trading can also exacerbate volatility 21 computer graphics and computer generated imagery have become almost ubiquitous in modern entertainment particularly in 
television cinema advertising animation and video games even films that feature no explicit cgi are usually filmed now on digital cameras or edited or postprocessed using a digital video editor citation needed simulation of various processes including computational fluid dynamics physical electrical and electronic systems and circuits as well as societies and social situations notably war games along with their habitats among many others modern computers enable optimization of such designs as complete aircraft notable in electrical and electronic circuit design are spice as well as software for physical realization of new or modified designs the latter includes essential design software for integrated circuits citation needed artificial intelligence is becoming increasingly important as its getting smarter and more complex there are many applications of the ai some of which can be seen at homes like the robotic vacuum cleaners and in video games or on the modern battlefield like drones anti missile systems and squad support robots philosophy edit main article philosophy of computer science a number of computer scientists have argued for the distinction of three separate paradigms in computer science peter wegner argued that those paradigms are science technology and mathematics 22 peter denning s working group argued that they are theory abstraction modeling and design 23 amnon h eden described them as the rationalist paradigm which treats computer science as a branch of mathematics which is prevalent in theoretical computer science and mainly employs deductive reasoning the technocratic paradigm which might be found in engineering approaches most prominently in software engineering and the scientific paradigm which approaches computer related artifacts from the empirical perspective of natural sciences identifiable in some branches of artificial intelligence 24 name of the field edit the term computer science appears in a 1959 article in communications of the acm 25 in which louis fein argues for the creation of a graduate school in computer sciences analogous to the creation of harvard business school in 1921 justifying the name by arguing that like management science it is applied and interdisciplinary in nature yet at the same time has all the characteristics of an academic discipline 26 his efforts and those of others such as numerical analyst george forsythe were rewarded universities went on to create such programs starting with purdue in 1962 27 despite its name a significant amount of computer science does not involve the study of computers themselves because of this several alternative names have been proposed 28 certain departments of major universities prefer the term computing science to emphasize precisely that difference danish scientist peter naur suggested the term datalogy 29 to reflect the fact that the scientific discipline revolves around data and data treatment while not necessarily involving computers the first scientific institution to use the term was the department of datalogy at the university of copenhagen founded in 1969 with peter naur being the first professor in datalogy the term is used mainly in the scandinavian countries also in the early days of computing a number of terms for the practitioners of the field of computing were suggested in the communications of the acm turingineer turologist flow charts man applied meta mathematician and applied epistemologist 30 three months later in the same journal comptologist was suggested followed next year by 
hypologist 31 the term computics has also been suggested 32 in europe terms derived from contracted translations of the expression automatic information e g informazione automatica in italian or information and mathematics are often used e g informatique french informatik german informatica italy inform tica spain portugal or informatika slavic languages are also used and have also been adopted in the uk as in the school of informatics of the university of edinburgh 33 a folkloric quotation often attributed to but almost certainly not first formulated by edsger dijkstra states that computer science is no more about computers than astronomy is about telescopes note 1 the design and deployment of computers and computer systems is generally considered the province of disciplines other than computer science for example the study of computer hardware is usually considered part of computer engineering while the study of commercial computer systems and their deployment is often called information technology or information systems however there has been much cross fertilization of ideas between the various computer related disciplines computer science research also often intersects other disciplines such as philosophy cognitive science linguistics mathematics physics statistics and logic computer science is considered by some to have a much closer relationship with mathematics than many scientific disciplines with some observers saying that computing is a mathematical science 10 early computer science was strongly influenced by the work of mathematicians such as kurt g del and alan turing and there continues to be a useful interchange of ideas between the two fields in areas such as mathematical logic category theory domain theory and algebra the relationship between computer science and software engineering is a contentious issue which is further muddied by disputes over what the term software engineering means and how computer science is defined 34 david parnas taking a cue from the relationship between other engineering and science disciplines has claimed that the principal focus of computer science is studying the properties of computation in general while the principal focus of software engineering is the design of specific computations to achieve practical goals making the two separate but complementary disciplines 35 the academic political and funding aspects of computer science tend to depend on whether a department formed with a mathematical emphasis or with an engineering emphasis computer science departments with a mathematics emphasis and with a numerical orientation consider alignment with computational science both types of departments tend to make efforts to bridge the field educationally if not across all research areas of computer science edit as a discipline computer science spans a range of topics from theoretical studies of algorithms and the limits of computation to the practical issues of implementing computing systems in hardware and software 36 37 csab formerly called computing sciences accreditation board which is made up of representatives of the association for computing machinery acm and the ieee computer society ieee cs 38 identifies four areas that it considers crucial to the discipline of computer science theory of computation algorithms and data structures programming methodology and languages and computer elements and architecture in addition to these four areas csab also identifies fields such as software engineering artificial intelligence computer networking and 
communication database systems parallel computation distributed computation computer human interaction computer graphics operating systems and numerical and symbolic computation as being important areas of computer science 36 theoretical computer science edit main article theoretical computer science the broader field of theoretical computer science encompasses both the classical theory of computation and a wide range of other topics that focus on the more abstract logical and mathematical aspects of computing theory of computation edit main article theory of computation according to peter j denning the fundamental question underlying computer science is what can be efficiently automated 10 the study of the theory of computation is focused on answering fundamental questions about what can be computed and what amount of resources are required to perform those computations in an effort to answer the first question computability theory examines which computational problems are solvable on various theoretical models of computation the second question is addressed by computational complexity theory which studies the time and space costs associated with different approaches to solving a multitude of computational problems the famous p np problem one of the millennium prize problems 39 is an open problem in the theory of computation p np 160 gnitirw terces automata theory computability theory computational complexity theory cryptography quantum computing theory information and coding theory edit main articles information theory and coding theory information theory is related to the quantification of information this was developed by claude e shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and communicating data coding theory is the study of the properties of codes systems for converting information from one form to another and their fitness for a specific application codes are used for data compression cryptography error detection and correction and more recently also for network coding codes are studied for the purpose of designing efficient and reliable data transmission methods algorithms and data structures edit analysis of algorithms algorithms data structures computational geometry programming language theory edit main article programming language theory programming language theory plt is a branch of computer science that deals with the design implementation analysis characterization and classification of programming languages and their individual features it falls within the discipline of computer science both depending on and affecting mathematics software engineering and linguistics it is an active research area with numerous dedicated academic journals type theory compiler design programming languages formal methods edit main article formal methods formal methods are a particular kind of mathematically based technique for the specification development and verification of software and hardware systems the use of formal methods for software and hardware design is motivated by the expectation that as in other engineering disciplines performing appropriate mathematical analysis can contribute to the reliability and robustness of a design they form an important theoretical underpinning for software engineering especially where safety or security is involved formal methods are a useful adjunct to software testing since they help avoid errors and can also give a framework for testing for industrial use tool support is required 
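as one concrete illustration of the tool support mentioned above, the short sketch below uses an smt solver to prove a small specification for all inputs instead of sampling a few test cases. it is only an illustrative sketch and not part of the article; it assumes the z3 solver and its python bindings (the z3-solver package) are available, and the modelled routine, the property and the variable names are invented for the example.

# prove that a tiny max(x, y) model meets its specification for all integers;
# an unsat answer means no counterexample to the specification exists
from z3 import Int, Solver, If, And, Or, Not, unsat

x, y = Int("x"), Int("y")
biggest = If(x >= y, x, y)                 # model of the routine under scrutiny
spec = And(biggest >= x, biggest >= y,     # the specification it must satisfy
           Or(biggest == x, biggest == y))

solver = Solver()
solver.add(Not(spec))                      # ask the solver for a counterexample
assert solver.check() == unsat             # proven: the property holds for every x, y

unlike a test suite, which checks a handful of sampled inputs, the solver either proves the property for every pair of integers or returns a concrete counterexample via solver.model()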
however the high cost of using formal methods means that they are usually only used in the development of high integrity and life critical systems where safety or security is of utmost importance formal methods are best described as the application of a fairly broad variety of theoretical computer science fundamentals in particular logic calculi formal languages automata theory and program semantics but also type systems and algebraic data types to problems in software and hardware specification and verification applied computer science edit artificial intelligence edit main article artificial intelligence this branch of computer science aims to or is required to synthesise goal orientated processes such as problem solving decision making environmental adaptation learning and communication which are found in humans and animals from its origins in cybernetics and in the dartmouth conference 1956 artificial intelligence ai research has been necessarily cross disciplinary drawing on areas of expertise such as applied mathematics symbolic logic semiotics electrical engineering philosophy of mind neurophysiology and social intelligence ai is associated in the popular mind with robotic development but the main field of practical application has been as an embedded component in areas of software development which require computational understanding and modeling such as finance and economics data mining and the physical sciences the starting point in the late 1940s was alan turing s question can computers think and the question remains effectively unanswered although the turing test is still used to assess computer output on the scale of human intelligence but the automation of evaluative and predictive tasks has been increasingly successful as a substitute for human monitoring and intervention in domains of computer application involving complex real world data machine learning computer vision image processing pattern recognition cognitive science data mining evolutionary computation information retrieval knowledge representation natural language processing robotics medical image computing computer architecture and engineering edit main articles computer architecture and computer engineering computer architecture or digital computer organization is the conceptual design and fundamental operational structure of a computer system it focuses largely on the way by which the central processing unit performs internally and accesses addresses in memory the field often involves disciplines of computer engineering and electrical engineering selecting and interconnecting hardware components to create computers that meet functional performance and cost goals digital logic microarchitecture multiprocessing operating systems computer networks databases information security ubiquitous computing systems architecture compiler design programming languages computer graphics and visualization edit main article computer graphics computer science computer graphics is the study of digital visual contents and involves synthese and manipulations of image data the study is connected to many other fields in computer science including computer vision image processing and computational geometry and is heavily applied in the fields of special effects and video games computer security and cryptography edit main articles computer security and cryptography computer security is a branch of computer technology whose objective includes protection of information from unauthorized access disruption or modification while maintaining 
the accessibility and usability of the system for its intended users cryptography is the practice and study of hiding encryption and therefore deciphering decryption information modern cryptography is largely related to computer science for many encryption and decryption algorithms are based on their computational complexity computational science edit computational science or scientific computing is the field of study concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems in practical use it is typically the application of computer simulation and other forms of computation to problems in various scientific disciplines numerical analysis computational physics computational chemistry bioinformatics computer networks edit main article computer network this branch of computer science aims to manage networks between computers worldwide concurrent parallel and distributed systems edit main articles concurrency computer science and distributed computing concurrency is a property of systems in which several computations are executing simultaneously and potentially interacting with each other a number of mathematical models have been developed for general concurrent computation including petri nets process calculi and the parallel random access machine model a distributed system extends the idea of concurrency onto multiple computers connected through a network computers within the same distributed system have their own private memory and information is often exchanged amongst themselves to achieve a common goal databases and information retrieval edit main articles database and database management systems a database is intended to organize store and retrieve large amounts of data easily digital databases are managed using database management systems to store create maintain and search data through database models and query languages health informatics edit main article health informatics health informatics in computer science deals with computational techniques for solving problems in health care information science edit main article information science information retrieval knowledge representation natural language processing human computer interaction software engineering edit main article software engineering software engineering is the study of designing implementing and modifying software in order to ensure it is of high quality affordable maintainable and fast to build it is a systematic approach to software design involving the application of engineering practices to software software engineering deals with the organizing and analyzing of software it doesn t just deal with the creation or manufacture of new software but its internal maintenance and arrangement both computer applications software engineers and computer systems software engineers are projected to be among the fastest growing occupations from 2008 and 2018 academia edit conferences edit conferences are strategic events of the academic research in computer science during those conferences researchers from the public and private sectors present their recent work and meet proceedings of these conferences are an important part of the computer science literature further information list of computer science conferences journals edit further information category computer science journals this section requires expansion june 2011 education edit some universities teach computer science as a theoretical study of computation and algorithmic 
reasoning these programs often feature the theory of computation analysis of algorithms formal methods concurrency theory databases computer graphics and systems analysis among others they typically also teach computer programming but treat it as a vessel for the support of other fields of computer science rather than a central focus of high level study the acm ieee cs joint curriculum task force computing curriculum 2005 and 2008 update 40 gives a guideline for university curriculum other colleges and universities as well as secondary schools and vocational programs that teach computer science emphasize the practice of advanced programming rather than the theory of algorithms and computation in their computer science curricula such curricula tend to focus on those skills that are important to workers entering the software industry the process aspects of computer programming are often referred to as software engineering while computer science professions increasingly drive the u s economy computer science education is absent in most american k 12 curricula a report entitled running on empty the failure to teach k 12 computer science in the digital age was released in october 2010 by association for computing machinery acm and computer science teachers association csta and revealed that only 14 states have adopted significant education standards for high school computer science the report also found that only nine states count high school computer science courses as a core academic subject in their graduation requirements in tandem with running on empty a new non partisan advocacy coalition computing in the core cinc was founded to influence federal and state policy such as the computer science education act which calls for grants to states to develop plans for improving computer science education and supporting computer science teachers within the united states a gender gap in computer science education has been observed as well research conducted by the wgbh educational foundation and the association for computing machinery acm revealed that more than twice as many high school boys considered computer science to be a very good or good college major than high school girls 41 in addition the high school advanced placement ap exam for computer science has displayed a disparity in gender compared to other ap subjects it has the lowest number of female participants with a composition of about 15 percent women 42 this gender gap in computer science is further witnessed at the college level where 31 percent of undergraduate computer science degrees are earned by women and only 8 percent of computer science faculty consists of women 43 according to an article published by the epistemic games group in august 2012 the number of women graduates in the computer science field has declined to 13 percent see also edit main article outline of computer science computer science portal book computer science academic genealogy of computer scientists informatics academic field list of academic computer science departments list of computer science conferences list of computer scientists list of publications in computer science list of pioneers in computer science list of software engineering topics list of unsolved problems in computer science women in computing notes edit see the entry computer science on wikiquote for the history of this quotation references edit wordnet search 3 1 wordnetweb princeton edu retrieved 2012 05 14 160 in 1851 science museum introduction to babbage archived from the original on 
2006 09 08 retrieved 2006 09 24 160 anthony hyman charles babbage pioneer of the computer 1982 the introduction of punched cards into the new engine was important not only as a more convenient form of control than the drums or because programs could now be of unlimited extent and could be stored and repeated without the danger of introducing errors in setting the machine by hand it was important also because it served to crystallize babbage s feeling that he had invented something really new something much more than a sophisticated calculating machine bruce collier 1970 a selection and adaptation from ada s notes found in ada the enchantress of numbers by betty alexandra toole ed d strawberry press mill valley ca retrieved 2006 05 04 160 in this sense aiken needed ibm whose technology included the use of punched cards the accumulation of numerical data and the transfer of numerical data from one register to another bernard cohen p 44 2000 brian randell p 187 1975 the association for computing machinery acm was founded in 1947 a b c denning p j 2000 computer science the discipline pdf encyclopedia of computer science archived from the original on 2006 05 25 160 some edsac statistics cl cam ac uk retrieved 2011 11 19 160 computer science pioneer samuel d conte dies at 85 july 1 2002 a b levy steven 1984 hackers heroes of the computer revolution doubleday isbn 160 0 385 19195 2 160 http www computerhistory org revolution computer graphics music and art 15 222 633 http archive computerhistory org resources text ibm ibm 709 1957 102646304 pdf a b david kahn the codebreakers 1967 isbn 0 684 83130 9 a b http www cis cornell edu dean presentations slides bgu pdf constable r l march 2000 computer science achievements and challenges circa 2000 pdf 160 abelson h g j sussman with j sussman 1996 structure and interpretation of computer programs 2nd ed mit press isbn 160 0 262 01153 0 the computer revolution is a revolution in the way we think and in the way we express what we think the essence of this change is the emergence of what might best be called procedural epistemology the study of the structure of knowledge from an imperative point of view as opposed to the more declarative point of view taken by classical mathematical subjects 160 black box traders are on the march the telegraph august 26 2006 the impact of high frequency trading on an electronic market papers ssrn com doi 10 2139 ssrn 1686004 retrieved 2012 05 14 160 wegner p october 13 15 1976 research paradigms in computer science proceedings of the 2nd international conference on software engineering san francisco california united states ieee computer society press los alamitos ca 160 denning p j comer d e gries d mulder m c tucker a turner a j young p r jan 1989 computing as a discipline communications of the acm 32 9 23 doi 10 1145 63238 63239 160 volume 64 edit eden a h 2007 three paradigms of computer science minds and machines 17 2 135 167 doi 10 1007 s11023 007 9060 8 160 edit louis fine 1959 the role of the university in computers data processing and related fields communications of the acm 2 9 7 14 doi 10 1145 368424 368427 160 id p 11 donald knuth 1972 george forsythe and the development of computer science comms acm matti tedre 2006 the development of computer science a sociocultural perspective p 260 peter naur 1966 the science of datalogy communications of the acm 9 7 485 doi 10 1145 365719 366510 160 communications of the acm 1 4 p 6 communications of the acm 2 1 p 4 ieee computer 28 12 p 136 p mounier kuhn l informatique en 
france de la seconde guerre mondiale au plan calcul l mergence d une science paris pups 2010 ch 3 amp 4 m tedre 2011 computing as a science a survey of competing viewpoints minds and machines 21 3 361 387 parnas d l 1998 annals of software engineering 6 19 37 doi 10 1023 a 1018949113292 160 edit p 19 rather than treat software engineering as a subfield of computer science i treat it as an element of the set civil engineering mechanical engineering chemical engineering electrical engineering a b computing sciences accreditation board 28 may 1997 computer science as a profession archived from the original on 2008 06 17 retrieved 2010 05 23 160 committee on the fundamentals of computer science challenges and opportunities national research council 2004 computer science reflections on the field reflections from the field national academies press isbn 160 978 0 309 09301 9 160 csab inc csab org 2011 08 03 retrieved 2011 11 19 160 clay mathematics institute p np acm curricula recommendations retrieved 2012 11 18 160 http www acm org membership nic pdf gilbert alorie newsmaker computer science s gender gap cnet news 160 dovzan nicole examining the gender gap in technology university of michigan 160 computer software engineer u s bureau of labor statistics u s bureau of labor statistics n d web 05 feb 2013 further reading edit overview tucker allen b 2004 computer science handbook 2nd ed chapman and hall crc isbn 160 1 58488 360 x 160 within more than 70 chapters every one new or significantly revised one can find any kind of information and references about computer science one can imagine all in all there is absolute nothing about computer science that can not be found in the 2 5 kilogram encyclopaedia with its 110 survey articles christoph meinel zentralblatt math van leeuwen jan 1994 handbook of theoretical computer science the mit press isbn 160 0 262 72020 5 160 this set is the most unique and possibly the most useful to the theoretical computer science community in support both of teaching and research the books can be used by anyone wanting simply to gain an understanding of one of these areas or by someone desiring to be in research in a topic or by instructors wishing to find timely information on a subject they are teaching outside their major areas of expertise rocky ross sigact news ralston anthony reilly edwin d hemmendinger david 2000 encyclopedia of computer science 4th ed grove s dictionaries isbn 160 1 56159 248 x 160 since 1976 this has been the definitive reference work on computer computing and computer science alphabetically arranged and classified into broad subject areas the entries cover hardware computer systems information and data software the mathematics of computing theory of computation methodologies applications and computing milieu the editors have done a commendable job of blending historical perspective and practical reference information the encyclopedia remains essential for most public and academic library reference collections joe accardin northeastern illinois univ chicago edwin d reilly 2003 milestones in computer science and information technology greenwood publishing group isbn 160 978 1 57356 521 9 160 selected papers knuth donald e 1996 selected papers on computer science csli publications cambridge university press 160 collier bruce the little engine that could ve the calculating machines of charles babbage garland publishing inc isbn 160 0 8240 0043 9 160 cohen bernard 2000 howard aiken portrait of a computer pioneer the mit press isbn 160 978 0 
262 53179 5 randell brian 1973 the origins of digital computers selected papers springer verlag isbn 3 540 06169 x covering a period from 1966 to 1993 its interest lies not only in the content of each of these papers still timely today but also in their being put together so that ideas expressed at different times complement each other nicely n bernard zentralblatt math articles peter j denning is computer science science communications of the acm april 2005 peter j denning great principles in computing curricula technical symposium on computer science education 2004 research evaluation for computer science informatics europe report shorter journal version bertrand meyer christine choppy jan van leeuwen and jorgen staunstrup research evaluation for computer science in communications of the acm vol 52 no 4 pp 31 34 april 2009 curriculum and classification association for computing machinery 1998 acm computing classification system 1998 joint task force of association for computing machinery acm association for information systems ais and ieee computer society ieee cs computing curricula 2005 the overview report september 30 2005 norman gibbs allen tucker a model curriculum for a liberal arts degree in computer science communications of the acm volume 29 issue 3 march 1986 external links find more about computer science at wikipedia s sister projects definitions and translations from wiktionary media from commons learning resources from wikiversity news stories from wikinews quotations from wikiquote source texts from wikisource textbooks from wikibooks library resources about computer science resources in your library resources in other libraries computer science at the open directory project scholarly societies in computer science best papers awards in computer science since 1996 photographs of computer scientists by bertrand meyer eecs berkeley edu bibliography and academic search engines citeseerx article search engine digital library and repository for scientific and academic papers with a focus on computer and information science dblp computer science bibliography article computer science bibliography website hosted at universität trier in germany the collection of computer science bibliographies article professional organizations association for computing machinery ieee computer society informatics europe misc computer science stack exchange a community run question and answer site for computer science retrieved from http en wikipedia org w index php title computer_science amp oldid 561000777 categories computer science \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_drift b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_drift new file mode 100644 index 00000000..6bceb5f6 --- /dev/null +++ b/ss2013/1_Web
Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_drift @@ -0,0 +1 @@ +concept drift wikipedia the free encyclopedia concept drift from wikipedia the free encyclopedia jump to navigation search in predictive analytics and machine learning the concept drift means that the statistical properties of the target variable which the model is trying to predict change over time in unforeseen ways this causes problems because the predictions become less accurate as time passes the term concept refers to the quantity to be predicted more generally it can also refer to other phenomena of interest besides the target concept such as an input but in the context of concept drift the term commonly refers to the target variable contents 1 examples 2 possible remedies 3 software 4 datasets 4 1 real 4 2 other 4 3 synthetic 4 4 data generation frameworks 5 projects 6 meetings 7 mailing list 8 bibliographic references 8 1 reviews 9 see also examples edit in a fraud detection application the target concept may be a binary attribute fraudulent with values yes or no that indicates whether a given transaction is fraudulent or in a weather prediction application there may be several target concepts such as temperature pressure and humidity the behavior of the customers in an online shop may change over time for example if weekly merchandise sales are to be predicted and a predictive model has been developed that works satisfactorily the model may use inputs such as the amount of money spent on advertising promotions being run and other metrics that may affect sales the model is likely to become less and less accurate over time this is concept drift in the merchandise sales application one reason for concept drift may be seasonality which means that shopping behavior changes seasonally perhaps there will be higher sales in the winter holiday season than during the summer for example possible remedies edit to prevent deterioration in prediction accuracy because of concept drift both active and passive solutions can be adopted active solutions rely on triggering mechanisms e g change detection tests basseville and nikiforov 1993 alippi and roveri 2007 to explicitly detect concept drift as a change in the statistics of the data generating process in stationary conditions any fresh information made available can be integrated to improve the model differently when concept drift is detected the current model is no more up to date and must be substituted with a new one to maintain the prediction accuracy gama et al 2004 alippi et al 2011 on the contrary in passive solutions the model is continuously updated e g by retraining the model on the most recently observed samples widmer and kubat 1996 or enforcing an ensemble of classifiers elwell and polikar 2011 contextual information when available can be used to better explain the causes of the concept drift for instance in the sales prediction application concept drift might be compensated by adding information about the season to the model by providing information about the time of the year the rate of deterioration of your model is likely to decrease concept drift is unlikely to be eliminated altogether this is because actual shopping behavior does not follow any static finite model new factors may arise at any time that influence shopping behavior the influence of the known factors or their interactions may change concept drift cannot be avoided for complex phenomenon that are not governed by fixed laws of nature all processes that arise from human activity such as 
socioeconomic processes and biological processes are likely to experience concept drift therefore periodic retraining also known as refreshing of any model is necessary software edit rapidminer formerly yale yet another learning environment free open source software for knowledge discovery data mining and machine learning also featuring data stream mining learning time varying concepts and tracking drifting concept if used in combination with its data stream mining plugin formerly concept drift plugin eddm eddm early drift detection method free open source implementation of drift detection methods in weka machine learning moa massive online analysis free open source software specific for mining data streams with concept drift it contains a prequential evaluation method the eddm concept drift methods a reader of arff real datasets and artificial stream generators as sea concepts stagger rotating hyperplane random tree and random radius based functions moa supports bi directional interaction with weka machine learning datasets edit real edit airline approximately 116 million flight arrival and departure records cleaned and sorted compiled by e ikonomovska reference data expo 2009 competition 1 access chess com online games and luxembourg social survey datasets compiled by i zliobaite access ecue spam 2 datasets each consisting of more than 10 000 emails collected over a period of approximately 2 years by an individual access from s j delany webpage elec2 electricity demand 2 classes 45312 instances reference m harries splice 2 comparative evaluation electricity pricing technical report the university of south wales 1999 access from j gama webpage comment on applicability pakdd 09 competition data represents the credit evaluation task it is collected over a five year period unfortunately the true labels are released only for the first part of the data access sensor stream and power supply stream datasets are available from x zhu s stream data mining repository access text mining a collection of text mining datasets with concept drift maintained by i katakis access gas sensor array drift dataset a collection of 13910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations 2 other edit kdd 99 competition data contains simulated intrusions in a military network environment it is often used as a benchmark to evaluate handling concept drift access synthetic edit sine line plane circle and boolean data sets l l minku a p white x yao the impact of diversity on on line ensemble learning in the presence of concept drift ieee transactions on knowledge and data engineering vol 22 no 5 pp 160 730 742 2010 access from l minku webpage sea concepts n w street y kim a streaming ensemble algorithm sea for large scale classification kdd 01 proceedings of the seventh acm sigkdd international conference on knowledge discovery and data mining 2001 access from j gama webpage stagger j c schlimmer r h granger incremental learning from noisy data mach learn vol 1 no 3 1986 data generation frameworks edit l l minku a p white x yao the impact of diversity on on line ensemble learning in the presence of concept drift ieee transactions on knowledge and data engineering vol 22 no 5 pp 160 730 742 2010 download from l minku webpage lindstrom p sj delany amp b macnamee 2008 autopilot simulating changing concepts in real data in proceedings of the 19th irish conference on artificial intelligence amp cognitive science d bridge k brown b o 
sullivan amp h sorensen eds p272 263 pdf narasimhamurthy a l i kuncheva a framework for generating data to simulate changing environments proc iasted artificial intelligence and applications innsbruck austria 2007 384 389 pdf code projects edit infer computational intelligence platform for evolving and robust predictive systems 2010 2014 bournemouth university uk evonik industries germany research and engineering centre poland hacdais handling concept drift in adaptive information systems 2008 2012 eindhoven university of technology the netherlands kdus knowledge discovery from ubiquitous streams inesc porto and laboratory of artificial intelligence and decision support portugal adept adaptive dynamic ensemble prediction techniques university of manchester uk university of bristol uk aladdin autonomous learning agents for decentralised data and information networks 2005 2010 meetings edit 2011 lee 2011 special session on learning in evolving environments and its application on real world problems at icmla 11 hacdais 2011 the 2nd international workshop on handling concept drift in adaptive information systems icais 2011 track on incremental learning ijcnn 2011 special session on concept drift and learning dynamic environments cidue 2011 symposium on computational intelligence in dynamic and uncertain environments 2010 hacdais 2010 international workshop on handling concept drift in adaptive information systems importance challenges and solutions icmla10 special session on dynamic learning in non stationary environments sac 2010 data streams track at acm symposium on applied computing sensorkdd 2010 international workshop on knowledge discovery from sensor data streamkdd 2010 novel data stream pattern mining techniques concept drift and learning in nonstationary environments at ieee world congress on computational intelligence mlmds 2010 special session on machine learning methods for data streams at the 10th international conference on intelligent design and applications isda 10 mailing list edit announcements discussions job postings related to the topic of concept drift in data mining machine learning posts are moderated to subscribe go to the group home page http groups google com group conceptdrift bibliographic references edit many papers have been published describing algorithms for concept drift detection only reviews surveys and overviews are here reviews edit zliobaite i learning under concept drift an overview technical report 2009 faculty of mathematics and informatics vilnius university vilnius lithuania pdf jiang j a literature survey on domain adaptation of statistical classifiers 2008 pdf kuncheva l i classifier ensembles for detecting concept change in streaming data overview and perspectives proc 2nd workshop suema 2008 ecai 2008 patras greece 2008 5 10 pdf gaber m m zaslavsky a and krishnaswamy s mining data streams a review in acm sigmod record vol 34 no 1 june 2005 issn 0163 5808 kuncheva l i classifier ensembles for changing environments proceedings 5th international workshop on multiple classifier systems mcs2004 cagliari italy in f roli j kittler and t windeatt eds lecture notes in computer science vol 3077 2004 1 15 pdf tsymbal a the problem of concept drift definitions and related work technical report 2004 department of computer science trinity college dublin ireland pdf see also edit data stream mining data mining machine learning retrieved from http en wikipedia org w index php title concept_drift amp oldid 559287343 categories data miningmachine learning 
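the possible remedies described above can be made concrete with a few lines of code. the sketch below is an illustrative simplification of the active approach, monitoring the online error rate of a deployed classifier loosely in the spirit of the drift detection idea of gama et al 2004 cited in the bibliography; it is not taken from the article, and the class name, thresholds and warm up length are assumptions made for the example.

# signal "warning" or "drift" when the running error rate of a model rises
# well above the best level seen so far (simplified ddm-style monitor)
from math import sqrt

class ErrorRateDriftMonitor:
    def __init__(self, warn_level=2.0, drift_level=3.0, warm_up=30):
        self.n = 0                       # labelled examples seen so far
        self.p = 1.0                     # running error rate
        self.s = 0.0                     # its standard deviation
        self.p_min = float("inf")        # lowest error rate observed so far
        self.s_min = float("inf")
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.warm_up = warm_up

    def update(self, mispredicted):
        """feed one outcome: True if the model's prediction was wrong"""
        self.n += 1
        self.p += (float(mispredicted) - self.p) / self.n
        self.s = sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.warm_up:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift_level * self.s_min:
            return "drift"               # active remedy: retrain or replace the model
        if self.p + self.s > self.p_min + self.warn_level * self.s_min:
            return "warning"             # start buffering recent examples
        return "stable"

a passive alternative from the same section would skip the detector entirely and simply retrain the model at fixed intervals on a sliding window of the most recent labelled examples.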
navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 10 june 2013 at 20 42 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_mining new file mode 100644 index 00000000..5ffd0398 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Concept_mining @@ -0,0 +1 @@ +concept mining wikipedia the free encyclopedia concept mining from wikipedia the free encyclopedia jump to navigation search concept mining is an activity that results in the extraction of concepts from artifacts solutions to the task typically involve aspects of artificial intelligence and statistics such as data mining and text mining 1 because artifacts are typically a loosely structured sequence of words and other symbols rather than concepts the problem is nontrivial but it can provide powerful insights into the meaning provenance and similarity of documents contents 1 methods 2 applications 2 1 detecting and indexing similar documents in large corpora 2 2 clustering documents by topic 3 references 4 see also methods edit traditionally the conversion of words to concepts has been performed using a thesaurus 2 and for computational techniques the tendency is to do the same the thesauri used are either specially created for the task or a pre existing language model usually related to princeton s wordnet the mappings of words to concepts 3 are often ambiguous typically each word in a given language will relate to several possible concepts humans use context to disambiguate the various meanings of a given piece of text where available machine translation systems cannot easily infer context for the purposes of concept mining however these ambiguities tend to be less important than they are with machine translation for in large documents the ambiguities tend to even out much as is the case with text mining there are many techniques for disambiguation that may be used examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora recently techniques that base on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community applications edit detecting and indexing similar documents in large corpora edit one of the spin offs of calculating document statistics in the concept domain rather than the word domain is that concepts form natural tree structures based on hypernymy and meronymy these structures can be used to produce simple tree membership statistics that can be used to locate any document in a euclidean 
concept space if the size of a document is also considered as another dimension of this space then an extremely efficient indexing system can be created this technique is currently in commercial use locating similar legal documents in a 2 5 million document corpus clustering documents by topic edit standard numeric clustering techniques may be used in concept space as described above to locate and index documents by the inferred topic these are numerically far more efficient than their text mining cousins and tend to behave more intuitively in that they map better to the similarity measures a human would generate references edit yuen hsien tseng chun yen chang shu nu chang rundgren and carl johan rundgren mining concept maps from news stories for measuring civic scientific literacy in media computers and education vol 55 no 1 august 2010 pp 165 177 yuen hsien tseng automatic thesaurus generation for chinese documents journal of the american society for information science and technology vol 53 no 13 nov 2002 pp 1130 1138 yuen hsien tseng generic title labeling for clustered documents expert systems with applications vol 37 no 3 15 march 2010 pp 2247 2254 see also edit formal concept analysis information extraction compound term processing retrieved from http en wikipedia org w index php title concept_mining amp oldid 554088888 categories natural language processingartificial intelligence applicationsdata mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 8 may 2013 at 07 03 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Information_and_Knowledge_Management b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Information_and_Knowledge_Management new file mode 100644 index 00000000..52446222 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Information_and_Knowledge_Management @@ -0,0 +1 @@ +conference on information and knowledge management wikipedia the free encyclopedia conference on information and knowledge management from wikipedia the free encyclopedia jump to navigation search the acm conference on information and knowledge management cikm pronounced sik m is an annual computer science research conference dedicated to information and knowledge management since the first event in 1992 the conference has evolved into one of the major forums for research on database management information retrieval and knowledge management 1 2 the conference is noted for its interdisciplinarity as it brings together communities that otherwise often publish at separate venues recent editions have attracted well beyond 500 participants 3 in addition to the main 
research program the conference also features a number of workshops tutorials and industry presentations 4 for many years the conference was held in the usa since 2005 venues in other countries have been selected as well locations include 5 1992 baltimore maryland usa 1993 washington d c usa 1994 gaithersburg maryland usa 1995 baltimore maryland usa 1996 rockville maryland usa 1997 las vegas nevada usa 1998 bethesda maryland usa 1999 kansas city missouri usa 2000 washington d c usa 2001 atlanta georgia usa 2002 mclean virginia usa 2003 new orleans louisiana usa 2004 washington d c usa 2005 bremen germany 2006 arlington virginia usa 2007 lisbon portugal 6 2008 napa valley california usa 7 2009 hong kong china 8 2010 toronto ontario canada 9 2011 glasgow scotland uk 10 see also edit sigir conference references edit official home page arnetminer ranking list www ir arnetminer retrieved 2011 06 11 160 cikm 2011 sponsorship page cikm 2011 home page dblp http www fc ul pt cikm2007 1 2 3 4 external links edit official home page retrieved from http en wikipedia org w index php title conference_on_information_and_knowledge_management amp oldid 559530156 categories computer science conferences navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 12 june 2013 at 08 10 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Knowledge_Discovery_and_Data_Mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Knowledge_Discovery_and_Data_Mining new file mode 100644 index 00000000..b59d3eab --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Conference_on_Knowledge_Discovery_and_Data_Mining @@ -0,0 +1 @@ +sigkdd wikipedia the free encyclopedia sigkdd from wikipedia the free encyclopedia redirected from conference on knowledge discovery and data mining jump to navigation search sigkdd is the association for computing machinery s special interest group on knowledge discovery and data mining it became an official acm sig in 1998 the official web page of sigkdd can be found on www kdd org the current chairman of sigkdd since 2009 is usama m fayyad ph d contents 1 conferences 2 kdd cup 3 awards 4 sigkdd explorations 5 current executive committee 6 information directors 7 references 8 external links conferences edit sigkdd has hosted an annual conference acm sigkdd conference on knowledge discovery and data mining kdd since 1995 kdd conferences grew from kdd knowledge discovery and data mining workshops at aaai conferences which were started by gregory piatetsky shapiro in 1989 1991 and 1993 and usama fayyad in 1994 1 conference papers of each proceedings of the sigkdd international conference on 
knowledge discovery and data mining are published through acm 2 kdd 2012 took place in beijing china 3 and kdd 2013 will take place in chicago united states aug 11 14 2013 kdd cup edit sigkdd sponsors the kdd cup competition every year in conjunction with the annual conference it is aimed at members of the industry and academia particularly students interested in kdd awards edit the group also annually recognizes members of the kdd community with its innovation award and service award additionally kdd presents a best paper award 4 to recognize the highest quality paper at each conference sigkdd explorations edit sigkdd has also published a biannual academic journal titled sigkdd explorations since june 1999 editors in chief bart goethals since 2010 osmar r zaiane 2008 2010 ramakrishnan srikant 2006 2007 sunita sarawagi 2003 2006 usama fayyad 1999 2002 current executive committee edit chair usama fayyad 2009 treasurer osmar r zaiane 2009 directors johannes gehrke robert grossman david d jensen 5 raghu ramakrishnan sunita sarawagi 6 ramakrishnan srikant 7 former chairpersons gregory piatetsky shapiro 8 2005 2008 won kim 1998 2004 information directors edit ankur teredesai 2011 gabor melli 9 2004 2011 ramakrishnan srikant 1998 2003 references edit http www sigkdd org conferences php http dl acm org event cfm id re329 http kdd2012 sigkdd org kdd conference best paper awards retrieved 2012 04 07 160 http kdl cs umass edu people jensen http www it iitb ac in sunita http www rsrikant com http www kdnuggets com gps html http www gabormeli com rkb external links edit acm sigkdd homepage acm sigkdd explorations homepage kdd 2013 conference homepage kdd 2012 conference homepage this computing article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title sigkdd amp oldid 558448906 categories association for computing machinery special interest groupsdata miningcomputing stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 5 june 2013 at 14 14 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Contrast_set_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Contrast_set_learning new file mode 100644 index 00000000..25b30bb7 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Contrast_set_learning @@ -0,0 +1 @@ +contrast set learning wikipedia the free encyclopedia contrast set learning from wikipedia the free encyclopedia jump to navigation search contrast set learning is a form of association rule learning that seeks to identify meaningful differences between separate groups by reverse engineering the key predictors that identify for 
each particular group for example given a set of attributes for a pool of students labeled by degree type a contrast set learner would identify the contrasting features between students seeking bachelor s degrees and those working toward phd degrees contents 1 overview 1 1 example supermarket purchases 2 treatment learning 2 1 example boston housing data 3 algorithms 3 1 stucco 3 2 tar3 4 references overview edit a common practice in data mining is to classify to look at the attributes of an object or situation and make a guess at what category the observed item belongs to as new evidence is examined typically by feeding a training set to a learning algorithm these guesses are re ned and improved contrast set learning works in the opposite direction while classi ers read a collection of data and collect information that is used to place new data into a series of discrete categories contrast set learning takes the category that an item belongs to and attempts to reverse engineer the statistical evidence that identifies an item as a member of a class that is contrast set learners seek rules associating attribute values with changes to the class distribution 1 they seek to identify the key predictors that contrast one classification from another for example an aerospace engineer might record data on test launches of a new rocket measurements would be taken at regular intervals throughout the launch noting factors such as the trajectory of the rocket operating temperatures external pressures and so on if the rocket launch fails after a number of successful tests the engineer could use contrast set learning to distinguish between the successful and failed tests a contrast set learner will produce a set of association rules that when applied will indicate the key predictors of each failed tests versus the successful ones the temperature was too high the wind pressure was too high etc contrast set learning is a form of association rule learning association rule learners typically offer rules linking attributes commonly occurring together in a training set for instance people who are enrolled in four year programs and take a full course load tend to also live near campus instead of nding rules that describe the current situation contrast set learners seek rules that differ meaningfully in their distribution across groups and thus can be used as predictors for those groups 2 for example a contrast set learner could ask what are the key identifiers of a person with a bachelor s degree or a person with a phd and how do people with phd s and bachelor s degrees differ standard classifier algorithms such as c4 5 have no concept of class importance that is they do not know if a class is good or bad such learners cannot bias or filter their predictions towards certain desired classes as the goal of contrast set learning is to discover meaningful differences between groups it is useful to be able to target the learned rules towards certain classifications several contrast set learners such as minwal 3 or the family of tar algorithms 4 5 6 assign weights to each class in order to focus the learned theories toward outcomes that are of interest to a particular audience thus contrast set learning can be though of as a form of weighted class learning 7 example supermarket purchases edit the differences between standard classification association rule learning and contrast set learning can be illustrated with a simple supermarket metaphor in the following small dataset each row is a supermarket transaction and 
Hamburger  Potatoes  Foie gras  Onions  Champagne  Purpose of purchases
1          1         0          1       0          cookout
1          1         0          1       0          cookout
0          0         1          0       1          anniversary
1          1         0          1       0          cookout
1          1         0          0       1          frat party

Given this data: association rule learning may discover that customers who buy onions and potatoes together are likely to also purchase hamburger meat; classification may discover that customers who bought onions, potatoes, and hamburger meat were purchasing items for a cookout; contrast set learning may discover that the major difference between customers shopping for a cookout and those shopping for an anniversary dinner is that customers acquiring items for a cookout purchase onions, potatoes, and hamburger meat, and do not purchase foie gras or champagne.

Treatment learning

Treatment learning is a form of weighted contrast set learning that takes a single desirable group and contrasts it against the remaining undesirable groups (the level of desirability is represented by weighted classes).[4] The resulting "treatment" suggests a set of rules that, when applied, will lead to the desired outcome. Treatment learning differs from standard contrast set learning through the following constraints: rather than seeking the differences between all groups, treatment learning specifies a particular group to focus on, applies a weight to this desired grouping, and lumps the remaining groups into one "undesired" category; and it has a stated focus on minimal theories. In practice, treatments are limited to a maximum of four constraints (i.e., rather than stating all of the reasons that a rocket differs from a skateboard, a treatment learner will state one to four major differences that predict for rockets at a high level of statistical significance).

This focus on simplicity is an important goal for treatment learners. Treatment learning seeks the smallest change that has the greatest impact on the class distribution.[7] Conceptually, treatment learners explore all possible subsets of the range of values for all attributes. Such a search is often infeasible in practice, so treatment learning often focuses instead on quickly pruning and ignoring attribute ranges that, when applied, lead to a class distribution where the desired class is in the minority.[6]

Example: Boston housing data

The following example demonstrates the output of the treatment learner TAR3 on a dataset of housing data from the city of Boston (a nontrivial public dataset with over 500 examples). In this dataset, a number of factors are collected for each house, and each house is classified according to its quality (low, medium-low, medium-high, and high). The desired class is set to "high", and all other classes are lumped together as undesirable. The output of the treatment learner is as follows:

Baseline class distribution: low: 29%, medlow: 29%, medhigh: 21%, high: 21%
Suggested treatment: PTRATIO = [12.6, 16), RM = [6.7, 9.78)
New class distribution: low: 0%, medlow: 0%, medhigh: 3%, high: 97%

With no applied treatments (rules), the desired class represents only 21% of the class distribution. However, if we filter the data set for houses with 6.7 to 9.78 rooms and a neighborhood parent-teacher ratio of 12.6 to 16, then 97% of the remaining examples fall into the desired class (high-quality houses).
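Returning to the supermarket example above, the core contrast-set idea can be sketched minimally in Python: compute the support of a candidate itemset separately per group and look at how much the supports differ. This is not code from any of the cited systems; the data simply mirrors the toy table and the function name is invented for illustration.

from collections import defaultdict

# rows: (basket as a set of purchased items, group label), mirroring the toy table
transactions = [
    ({"hamburger", "potatoes", "onions"}, "cookout"),
    ({"hamburger", "potatoes", "onions"}, "cookout"),
    ({"foie gras", "champagne"}, "anniversary"),
    ({"hamburger", "potatoes", "onions"}, "cookout"),
    ({"hamburger", "potatoes", "champagne"}, "frat party"),
]

def group_support(itemset, data):
    """Support of `itemset` within each group: the fraction of that group's
    transactions containing every item of the set."""
    counts, totals = defaultdict(int), defaultdict(int)
    for basket, group in data:
        totals[group] += 1
        if itemset <= basket:
            counts[group] += 1
    return {g: counts[g] / totals[g] for g in totals}

# A candidate contrast set: "onions and potatoes are purchased together".
support = group_support({"onions", "potatoes"}, transactions)
print(support)                                    # {'cookout': 1.0, 'anniversary': 0.0, 'frat party': 0.0}
# A simple "largeness" measure: maximum support difference across groups.
print(max(support.values()) - min(support.values()))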
Algorithms

There are a number of algorithms that perform contrast set learning. The following subsections describe two examples.

STUCCO

The STUCCO contrast set learner[1][2] treats the task of learning from contrast sets as a tree search problem, where the root node of the tree is an empty contrast set. Children are added by specializing the set with additional items, picked through a canonical ordering of attributes (to avoid visiting the same nodes twice). Children are formed by appending terms that follow all existing terms in the given ordering. The formed tree is searched in a breadth-first manner. Given the nodes at each level, the dataset is scanned and the support is counted for each group. Each node is then examined to determine whether it is significant and large, whether it should be pruned, and whether new children should be generated. After all significant contrast sets are located, a post-processor selects a subset to show to the user. The low-order, simpler results are shown first, followed by the higher-order results, which are surprising and significantly different.[2]

The support calculation comes from testing a null hypothesis that the contrast set support is equal across all groups (i.e., that contrast set support is independent of group membership). The support count for each group is a frequency value that can be analyzed in a contingency table, where each row represents the truth value of the contrast set and each column indicates the group membership frequency. If there is a difference in proportions between the contrast set frequencies and those of the null hypothesis, the algorithm must then determine whether the differences in proportions represent a relation between variables or whether they can be attributed to random causes. This can be determined through a chi-square test comparing the observed frequency counts to the expected counts.

Nodes are pruned from the tree when all specializations of the node can never lead to a significant and large contrast set. The decision to prune is based on:

Minimum deviation size: the maximum difference between the support of any two groups must be greater than a user-specified threshold.
Expected cell frequencies: the expected cell frequencies of a contingency table can only decrease as the contrast set is specialized; when these frequencies are too small, the validity of the chi-square test is violated.
Bounds: an upper bound is kept on the distribution of a statistic calculated when the null hypothesis is true; nodes are pruned when it is no longer possible to meet this cutoff.
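The contingency-table test described above can be sketched as follows. The counts are made up, and SciPy's chi2_contingency is assumed here as a stand-in for the chi-square test STUCCO performs; it is an illustration, not STUCCO's own code.

from scipy.stats import chi2_contingency

# Rows: contrast set holds / does not hold; columns: counts per group.
table = [
    [30,  5,  8],   # transactions in which the contrast set holds, per group
    [10, 45, 42],   # transactions in which it does not hold, per group
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
# A STUCCO-style learner would call the set significant if p_value falls below
# the (corrected) significance level, after checking that the expected cell
# frequencies are large enough for the test to be valid.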
TAR3

The TAR3[5][8] weighted contrast set learner is based on two fundamental concepts: the lift and support of a rule set. The lift of a set of rules is the change that some decision makes to a set of examples after imposing that decision (i.e., how the class distribution shifts in response to the imposition of a rule). TAR3 seeks the smallest set of rules which induces the biggest changes in the sum of the weights attached to each class, multiplied by the frequency at which each class occurs. The lift is calculated by dividing the score of the set on which the set of rules is imposed by the score of the baseline set (i.e., with no rules applied). Note that by reversing the lift scoring function, the TAR3 learner can also select for the remaining classes and reject the target class.

It is problematic to rely on the lift of a rule set alone. Incorrect or misleading data (noise), if correlated with failing examples, may result in an overfitted rule set. Such an overfitted model may have a large lift score, but it does not accurately reflect the prevailing conditions within the dataset. To avoid overfitting, TAR3 utilizes a support threshold and rejects all rules that fall on the wrong side of this threshold. Given a target class, the support threshold is a user-supplied value (usually 0.2) which is compared to the ratio of the frequency of the target class when the rule set has been applied to the frequency of that class in the overall dataset. TAR3 rejects all sets of rules with support lower than this threshold.

By requiring both a high lift and a high support value, TAR3 not only returns ideal rule sets but also favors smaller sets of rules (the fewer rules adopted, the more evidence exists supporting those rules). The TAR3 algorithm only builds sets of rules from attribute value ranges with a high heuristic value. The algorithm determines which ranges to use by first determining the lift score of each attribute's value ranges. These individual scores are then sorted and converted into a cumulative probability distribution. TAR3 randomly selects values from this distribution, meaning that low-scoring ranges are unlikely to be selected. To build a candidate rule set, several ranges are selected and combined. These candidate rule sets are then scored and sorted. If no improvement is seen after a user-defined number of rounds, the algorithm terminates and returns the top-scoring rule sets.
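A rough, illustrative sketch of the two TAR3 scores as described above (lift from weighted class frequencies, support as the ratio of target-class frequencies). The toy rows, class weights, rule, and helper names are invented; this is not the TAR3 implementation itself.

def score(rows, weights):
    # sum of class weight x class frequency over the given rows
    n = len(rows)
    return sum(weights[cls] for _, cls in rows) / n if n else 0.0

def lift_and_support(rows, rule, target, weights):
    treated = [r for r in rows if rule(r[0])]
    lift = score(treated, weights) / score(rows, weights)
    # support: frequency of the target class after the rule is applied,
    # relative to its frequency in the whole data set
    freq = lambda rs: sum(1 for _, cls in rs if cls == target) / len(rs)
    return lift, freq(treated) / freq(rows)

# toy data: (attribute dict, class label); "high" is the desired class
rows = [({"rm": 7.1}, "high"), ({"rm": 5.0}, "low"),
        ({"rm": 6.9}, "high"), ({"rm": 4.8}, "medlow")]
weights = {"high": 4, "medhigh": 3, "medlow": 2, "low": 1}
rule = lambda attrs: 6.7 <= attrs["rm"] < 9.78
print(lift_and_support(rows, rule, "high", weights))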
References

1. Stephen Bay and Michael Pazzani (2001). "Detecting group differences: Mining contrast sets." Data Mining and Knowledge Discovery 5(3): 213-246.
2. Stephen Bay and Michael Pazzani (1999). "Detecting change in categorical data: Mining contrast sets." KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
3. C. H. Cai, A. W. C. Fu, C. H. Cheng, and W. W. Kwong (1998). "Mining association rules with weighted items." Proceedings of the International Database Engineering and Applications Symposium (IDEAS 98).
4. Y. Hu (2003). "Treatment learning: Implementation and application."
5. K. Gundy-Burlet, J. Schumann, T. Barrett, and T. Menzies (2007). "Parametric analysis of Antares re-entry guidance algorithms using advanced test generation and data analysis." 9th International Symposium on Artificial Intelligence, Robotics and Automation in Space.
6. Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet (2010). "Automatically finding the control variables for complex system behavior." Automated Software Engineering 17(4).
7. T. Menzies and Y. Hu (2003). "Data mining for very busy people." IEEE Computer 36(11): 22-29.
8. J. Schumann, K. Gundy-Burlet, C. Pasareanu, T. Menzies, and A. Barrett (2009). "Software V&V support by parametric analysis of large software simulation systems." Proceedings of the 2009 IEEE Aerospace Conference.

\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cross_Industry_Standard_Process_for_Data_Mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cross_Industry_Standard_Process_for_Data_Mining
new file mode 100644
index 00000000..6bac7a99
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Cross_Industry_Standard_Process_for_Data_Mining
@@ -0,0 +1 @@
+Cross Industry Standard Process for Data Mining (from Wikipedia, the free encyclopedia)

CRISP-DM stands for Cross Industry Standard Process for Data Mining.[1] It is a data mining process model that describes commonly used approaches that expert data miners use to tackle problems. Polls conducted in 2002, 2004, and 2007 show that it is the leading methodology used by data miners.[2][3][4] The only other data mining standard named in these polls was SEMMA; however, 3-4 times as many people reported using CRISP-DM. A review and critique of data mining process models in 2009 called CRISP-DM the "de facto standard for developing data mining and knowledge discovery projects".[5] Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek's 2006 review[6] and Azevedo and Santos' 2008 comparison of CRISP-DM and SEMMA.[7]

Major phases

CRISP-DM breaks the process of data mining into six major phases.[8] The sequence of the phases is not strict, and moving back and forth between different phases is always required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself: a data mining process continues after a solution has been deployed; the lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.

[Figure: process diagram showing the relationship between the different phases of CRISP-DM.]

Business understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Data understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets that form hypotheses for hidden information.

Data preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool or tools) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools.

Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of the data; therefore, stepping back to the data preparation phase is often needed.
Evaluation: At this stage in the project, you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front which actions will need to be carried out in order to actually make use of the created models.
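As a purely illustrative aid (not part of CRISP-DM itself), a small project might be organized along the six phases described above roughly as follows. The file name, target column, and model choice are placeholder assumptions; only the phase structure is the point.

import pandas as pd

def business_understanding():
    # Phase 1: turn the business goal into a data mining problem definition.
    return {"goal": "predict customer churn", "target": "churned"}   # hypothetical

def data_understanding(path):
    # Phase 2: collect the data and get familiar with it.
    df = pd.read_csv(path)
    print(df.describe())
    return df

def data_preparation(df, target):
    # Phase 3: build the final dataset (selection, cleaning, transformation).
    df = df.dropna()
    return df.drop(columns=[target]), df[target]

def modeling(X, y):
    # Phase 4: apply a modeling technique and calibrate its parameters.
    from sklearn.tree import DecisionTreeClassifier
    return DecisionTreeClassifier(max_depth=3).fit(X, y)

def evaluation(model, X, y):
    # Phase 5: check the model against the (business) objectives.
    return model.score(X, y)

def deployment(model):
    # Phase 6: make the result usable, e.g. a report or a repeatable process.
    print("deploying", model)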
History

CRISP-DM was conceived in 1996. In 1997 it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA, an insurance company. This core consortium brought different experiences to the project: ISL (later acquired and merged into SPSS Inc.); the computer giant NCR Corporation, which produced the Teradata data warehouse and its own data mining software; Daimler-Benz, which had a significant data mining team; and OHRA, which was just starting to explore the potential use of data mining. The first version of the methodology was presented at the 4th CRISP-DM SIG workshop in Brussels in March 1999[9] and published as a step-by-step data mining guide later that year.[10]

Between 2006 and 2008 a CRISP-DM 2.0 SIG was formed, and there were discussions about updating the CRISP-DM process model.[5][11] The current status of these efforts is not known; however, the original crisp-dm.org website cited in the reviews[6][7] and the CRISP-DM 2.0 SIG website[5][11] are both no longer active. While many non-IBM data mining practitioners use CRISP-DM,[2][3][4][5] IBM is the primary corporation that currently embraces the CRISP-DM process model. It makes some of the old CRISP-DM documents available for download[10] and has incorporated it into its SPSS Modeler product.

References

1. Shearer, C. (2000). "The CRISP-DM model: The new blueprint for data mining." Journal of Data Warehousing 5: 13-22.
2. Gregory Piatetsky-Shapiro (2002). KDnuggets methodology poll.
3. Gregory Piatetsky-Shapiro (2004). KDnuggets methodology poll.
4. Gregory Piatetsky-Shapiro (2007). KDnuggets methodology poll.
5. Óscar Marbán, Gonzalo Mariscal, and Javier Segovia (2009). "A data mining & knowledge discovery process model." In Data Mining and Knowledge Discovery in Real Life Applications, edited by Julio Ponce and Adem Karahoca. I-Tech, Vienna, Austria, pp. 438-453. ISBN 978-3-902613-53-0.
6. Lukasz Kurgan and Petr Musilek (2006). "A survey of knowledge discovery and data mining process models." The Knowledge Engineering Review 21(1): 1-24. Cambridge University Press. doi:10.1017/S0269888906000737.
7. Azevedo, A. and Santos, M. F. (2008). "KDD, SEMMA and CRISP-DM: A parallel overview." Proceedings of the IADIS European Conference on Data Mining 2008, pp. 182-185.
8. Harper, Gavin; Stephen D. Pickett (August 2006). "Methods for mining HTS data." Drug Discovery Today 11(15-16): 694-699. doi:10.1016/j.drudis.2006.06.006. PMID 16846796.
9. Pete Chapman (1999). "The CRISP-DM User Guide."
10. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth (2000). "CRISP-DM 1.0: Step-by-step data mining guide."
11. Colin Shearer (2006). "First CRISP-DM 2.0 workshop held."

External links

Le site des dataminers, article published by Pascal Bizzari (May 2009).

\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data
new file mode 100644
index 00000000..25b51d04
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data
@@ -0,0 +1 @@
+Data (from Wikipedia, the free encyclopedia)

For data in computer science, see Data (computing). For other uses, see Data (disambiguation).

Data are values of qualitative or quantitative variables belonging to a set of items. Data in computing (or data processing) are represented in a structure that is often tabular (represented by rows and columns), a tree (a set of nodes with parent-children relationships), or a graph (a set of interconnected nodes). Data are typically the results of measurements and can be visualised using graphs or images. Data as an abstract concept can be viewed as the lowest level of abstraction from which information and then knowledge are derived.

Raw data, i.e. unprocessed data, refers to a collection of numbers and characters and is a relative term; data processing commonly occurs by stages, and the processed data from one stage may be considered the raw data of the next. Field data refers to raw data collected in an uncontrolled, in situ environment. Experimental data refers to data generated within the context of a scientific investigation, by observation and recording.

The word data is the plural of datum, the neuter past participle of the Latin dare, "to give", hence "something given". In discussions of problems in geometry, mathematics, engineering, and so on, the terms givens and data are used interchangeably; such usage is the origin of data as a concept in computer science or data processing: data are numbers, words, images, etc., accepted as they stand.
Though data is also increasingly used in the humanities (particularly in the growing digital humanities), it has been suggested that the highly interpretive nature of the humanities might be at odds with the ethos of data as "given". Peter Checkland introduced the term capta (from the Latin capere, "to take") to distinguish between an immense number of possible data and a sub-set of them to which attention is oriented.[1] Johanna Drucker has argued that since the humanities affirm knowledge production as "situated, partial, and constitutive", using data may introduce assumptions that are counterproductive, for example that phenomena are discrete or are observer-independent.[2] The term capta, which emphasizes the act of observation as constitutive, is offered as an alternative to data for visual representations in the humanities.

Usage in English

In English, the word datum is still used in the general sense of "an item given". In cartography, geography, nuclear magnetic resonance, and technical drawing it is often used to refer to a single specific reference datum from which distances to all other data are measured. Any measurement or result is a datum, but data point is more usual,[3] albeit tautological (or, more generously, pleonastic). In one sense, datum is a count noun with the plural datums (see usage in the datum article) that can be used with cardinal numbers (e.g. "80 datums"); data (originally a Latin plural) is not used like a normal count noun with cardinal numbers, but it can be used as a plural with plural determiners such as these and many, in addition to its use as a singular abstract mass noun with a verb in the singular form.[4] Even when a very small quantity of data is referenced (one number, for example), the phrase piece of data is often used, as opposed to datum. The debate over appropriate usage is ongoing.[5][6][7]

The IEEE Computer Society allows usage of data as either a mass noun or a plural, based on author preference.[8] Some professional organizations and style guides[9] require that authors treat data as a plural noun; for example, the Air Force Flight Test Center specifically states that the word data is always plural, never singular.[10] Data is most often used as a singular mass noun in educated everyday usage.[11][12] Some major newspapers, such as The New York Times, use it either in the singular or the plural; in The New York Times, the phrases "the survey data are still being analyzed" and "the first year for which data is available" have appeared within one day.[13] The Wall Street Journal explicitly allows this in its style guide.[14] The Associated Press style guide classifies data as a collective noun that takes the singular when treated as a unit but the plural when referring to individual items ("The data is sound", but "The data have been carefully collected").[15]

In scientific writing, data is often treated as a plural, as in "These data do not support the conclusions", but it is also used as a singular mass entity like information (for instance, in computing and related disciplines).[16] British usage now widely accepts treating data as singular in standard English,[17] including everyday newspaper usage,[18] at least in non-scientific use.[19] UK scientific publishing still prefers treating it as a plural.[20] Some UK university style guides recommend using data for both singular and plural use,[21] and some recommend treating it only as a singular in connection with computers.[22]
Meaning of data, information, and knowledge

The terms data, information, and knowledge are frequently used for overlapping concepts. The main difference is in the level of abstraction being considered: data is the lowest level of abstraction, information is the next level, and finally knowledge is the highest level among all three.[23] Data on its own carries no meaning; for data to become information, it must be interpreted and take on a meaning. For example, the height of Mt. Everest is generally considered "data"; a book on Mt. Everest's geological characteristics may be considered "information"; and a report containing practical information on the best way to reach Mt. Everest's peak may be considered "knowledge".

Information as a concept bears a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the concept of a sign to distinguish between data and information: data are symbols, while information occurs when the symbols are used to refer to something.[24][25]

It is people and computers who collect data and impose patterns on it. These patterns are seen as information, which can be used to enhance knowledge. These patterns can be interpreted as truth and are authorized as aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken.[26]

Mechanical computing devices are classified according to the means by which they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a datum as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted 0 and 1. More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.

Some special forms of data are distinguished. A computer program is a collection of data which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A similar yet earlier term for metadata is ancillary data. The prototypical example of metadata is the library catalog, which is a description of the contents of books.
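A tiny illustration of the point above about symbol sequences and metadata; the chosen datum, encoding, and metadata fields are arbitrary examples, not taken from the article.

# the same datum, the character "A", at increasingly low-level representations
datum = "A"
as_bytes = datum.encode("utf-8")          # sequence of bytes
as_bits = format(as_bytes[0], "08b")      # sequence of symbols from the binary alphabet {0, 1}
# a trivial piece of metadata: data describing the datum rather than being it
metadata = {"type": "character", "encoding": "utf-8", "length": len(as_bytes)}
print(datum, as_bytes, as_bits, metadata)  # A b'A' 01000001 {...}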
See also

Biological data, data acquisition, data analysis, data cable, data domain, data element, data farming, data governance, data integrity, data maintenance, data management, data mining, data modeling, computer data processing, data remanence, data set, data warehouse, database, datasheet, environmental data rescue, fieldwork, metadata, scientific data archiving, statistics, data structure.

References

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
P. Checkland and S. Holwell (1998). Information, Systems and Information Systems: Making Sense of the Field. Chichester, West Sussex: John Wiley & Sons. pp. 86-89. ISBN 0-471-95820-4.
Johanna Drucker (2011). "Humanities approaches to graphical display."
Matt Dye (2001). "Writing reports." University of Bristol.
"data, datum." Merriam-Webster's Dictionary of English Usage. Springfield, Massachusetts: Merriam-Webster, 2002. pp. 317-318. ISBN 978-0-87779-132-4.
"Data is a singular noun."
Grammarist: "data."
Dictionary.com: "data."
IEEE Computer Society Style Guide. IEEE Computer Society.
WHO Style Guide. Geneva: World Health Organization, 2004. p. 43.
The Author's Guide to Writing Air Force Flight Test Center Technical Reports. Air Force Flight Test Center.
New Oxford Dictionary of English (1999): "In educated everyday usage as represented by the Guardian newspaper it is nowadays most often used as a singular." http://www.lexically.net/timjohns/kibbitzer/revis006.htm
"When serving the Lord, ministers are often found to neglect themselves." The New York Times (2009); "Investment tax cuts help mostly the rich." The New York Times (2009).
"Is Data Is, or Is Data Ain't, a Plural?" The Wall Street Journal (2012).
The Associated Press (June 2002). "Collective nouns." In Norm Goldstein, The Associated Press Stylebook and Briefing on Media Law. Cambridge, Massachusetts: Perseus. p. 52. ISBN 0-7382-0740-3.
R. W. Burchfield, ed. (1996). "data." Fowler's Modern English Usage (3rd ed.). Oxford: Clarendon Press. pp. 197-198. ISBN 0-19-869126-2.
New Oxford Dictionary of English (1999).
Tim Johns (1997). "Data: singular or plural?" ("In educated everyday usage as represented by the Guardian newspaper it is nowadays most often used as a singular.")
"data." Compact Oxford Dictionary.
"Data: singular or plural?" Blair Wisconsin International University.
"Singular or plural?" University of Nottingham Style Book. University of Nottingham.
"Computers and computer systems." OpenLearn.
Akash Mitra (2011). "Classifying data for successful modeling."
P. Beynon-Davies (2002). Information Systems: An Introduction to Informatics in Organisations. Basingstoke, UK: Palgrave Macmillan. ISBN 0-333-96390-3.
P. Beynon-Davies (2009). Business Information Systems. Basingstoke, UK: Palgrave. ISBN 978-0-230-20368-6.
Sharon Daniel. "The Database: An Aesthetics of Dignity."

External links

Look up data in Wiktionary, the free dictionary. "Data is a singular noun" (a detailed assessment).
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_Mining_and_Knowledge_Discovery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_Mining_and_Knowledge_Discovery
new file mode 100644
index 00000000..623d37e4
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_Mining_and_Knowledge_Discovery
@@ -0,0 +1 @@
+Data Mining and Knowledge Discovery (from Wikipedia, the free encyclopedia)

Abbreviated title (ISO 4): Data Min. Knowl. Discov. | Discipline: computer science | Language: English | Publisher: Springer Science+Business Media | Publication history: 1997-present | Frequency: triannual | Impact factor (2011): 1.545 | ISSN: 1384-5810 (print), 1573-756X (web) | LCCN: sn98038132 | CODEN: DMKDFD | OCLC number: 38037443 | Links: journal homepage, online access

Data Mining and Knowledge Discovery is a triannual peer-reviewed scientific journal focusing on data mining. It is published by Springer Science+Business Media. As of 2012, the editor-in-chief is Geoffrey I. Webb.

External links

Official website.

\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_analysis
new file mode 100644
index 00000000..ba079c16
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_analysis
@@ -0,0 +1 @@
+Data analysis (from Wikipedia, the free encyclopedia)
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical or structural models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis. Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

The process of data analysis

Data analysis is a process within which several phases can be distinguished.[1]

Data cleaning

Data cleaning is an important procedure during which the data are inspected and erroneous data are, if necessary, preferable, and possible, corrected. Data cleaning can be done during the stage of data entry. If this is done, it is important that no subjective decisions are made. The guiding principle provided by Adèr is: during subsequent manipulations of the data, information should always be cumulatively retrievable. In other words, it should always be possible to undo any data set alterations. Therefore, it is important not to throw information away at any stage in the data cleaning phase. All information should be saved (i.e., when altering variables, both the original values and the new values should be kept, either in a duplicate data set or under a different variable name), and all alterations to the data set should be carefully and clearly documented, for instance in a syntax file or a log.[2]
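A hedged sketch of the "cumulatively retrievable" principle above, using pandas; the column name, the plausible value range, and the log layout are invented for the example.

import numpy as np
import pandas as pd

raw = pd.DataFrame({"age": [23, 35, -1, 41, 230]})   # -1 and 230 are clearly erroneous

cleaned = raw.copy()
# keep the original values; store the corrected values under a new variable name
cleaned["age_clean"] = cleaned["age"].where(cleaned["age"].between(0, 120), np.nan)

# document the alteration in a small log that travels with the data set
changed = cleaned["age"] != cleaned["age_clean"]
log = cleaned.loc[changed, ["age"]].assign(action="outside 0-120, set to NaN")

print(cleaned)
print(log)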
Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions.[3]

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analyses: frequency counts; descriptive statistics (mean, standard deviation, median); normality (skewness, kurtosis, frequency histograms); n: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable; test for common-method variance. The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.[4]

Quality of measurements

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature. There are two ways to assess measurement quality: confirmatory factor analysis, and analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's alpha of the scales, and the change in the Cronbach's alpha when an item would be deleted from a scale.[5]

Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[6] Possible transformations of variables are:[7]

Square root transformation (if the distribution differs moderately from normal)
Log transformation (if the distribution differs substantially from normal)
Inverse transformation (if the distribution differs severely from normal)
Make categorical (ordinal/dichotomous) (if the distribution differs severely from normal and no transformations help)
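The transformations listed above can be sketched with numpy as follows; the simulated data are arbitrary, and the mapping of each transformation to a degree of non-normality simply follows the list.

import numpy as np

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed example data

sqrt_x = np.sqrt(x)      # moderate departure from normality
log_x = np.log(x)        # substantial departure
inv_x = 1.0 / x          # severe departure
# severe departure with no helpful transformation: discretise instead (quartiles)
categorical_x = np.digitize(x, bins=np.quantile(x, [0.25, 0.5, 0.75]))

for name, t in [("sqrt", sqrt_x), ("log", log_x), ("inverse", inv_x)]:
    print(name, round(float(np.mean(t)), 3))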
Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups. If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample. Other possible data distortions that should be checked are: dropout (this should be identified during the initial data analysis phase); item non-response (whether this is random or not should be assessed during the initial data analysis phase); and treatment quality (using manipulation checks).[8]

Characteristics of the data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase. The characteristics of the data sample can be assessed by looking at: basic statistics of important variables; scatter plots; correlations and associations; cross-tabulations.[9]

Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken. Also, the original plan for the main data analyses can and should be specified in more detail and/or rewritten. In order to do this, several decisions about the main data analyses can and should be made: in the case of non-normals, should one transform variables, make variables categorical (ordinal/dichotomous), or adapt the analysis method? In the case of missing data, should one neglect or impute the missing data, and which imputation technique should be used? In the case of outliers, should one use robust analysis techniques? In case items do not fit the scale, should one adapt the measurement instrument by omitting items, or rather ensure comparability with other uses of the measurement instrument(s)? In the case of too-small subgroups, should one drop the hypothesis about inter-group differences, or use small-sample techniques like exact tests or bootstrapping? In case the randomization procedure seems to be defective, can and should one calculate propensity scores and include them as covariates in the main analyses?[10]

Analyses

Several analyses can be used during the initial data analysis phase:[11] univariate statistics (single variable); bivariate associations (correlations); graphical techniques (scatter plots). It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level.[12] For nominal and ordinal variables: frequency counts (numbers and percentages); associations (circumambulations (crosstabulations), hierarchical loglinear analysis (restricted to a maximum of 8 variables), loglinear analysis (to identify relevant/important variables and possible confounders)); exact tests or bootstrapping (in case subgroups are small); computation of new variables. For continuous variables: distribution (statistics (M, SD, variance, skewness, kurtosis), stem-and-leaf displays, box plots).

Main data analysis

In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analysis needed to write the first draft of the research report.[13]

Exploratory and confirmatory approaches

In the main analysis phase, either an exploratory or a confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis, no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis, clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once, there is a high chance of finding at least one of them to be significant, but this can be due to a Type 1 error. It is important to always adjust the significance level when testing multiple models, with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratorily in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type 1 error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[14]

Stability of results

It is important to obtain some indication of how generalizable the results are.[15] While this is hard to check, one can look at the stability of the results: are the results reliable and reproducible? There are two main ways of doing this. Cross-validation: by splitting the data into multiple parts, we can check whether an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well. Sensitivity analysis: a procedure to study the behavior of a system or model when global parameters are systematically varied. One way to do this is with bootstrapping.
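Both stability checks above can be sketched briefly with scikit-learn on synthetic data; the model and the data-generating process are placeholders, not prescribed by the text.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# cross-validation: does a model fitted on one part of the data generalise to another part?
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("cv r^2 per fold:", np.round(scores, 3))

# bootstrapping: how stable are the estimated coefficients under resampling?
boot_coefs = [
    LinearRegression().fit(*resample(X, y, random_state=i)).coef_
    for i in range(100)
]
print("coefficient std over bootstraps:", np.round(np.std(boot_coefs, axis=0), 3))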
Statistical methods

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

General linear model: a widely used model on which various methods are based (e.g. t-test, ANOVA, ANCOVA, MANOVA); usable for assessing the effect of several predictors on one or more continuous dependent variables.
Generalized linear model: an extension of the general linear model for discrete dependent variables.
Structural equation modelling: usable for assessing latent structures from measured manifest variables.
Item response theory: models for (mostly) assessing one latent variable from several binary measured variables (e.g. an exam).

Free software for data analysis

ROOT: C++ data analysis framework developed at CERN.
PAW: FORTRAN/C data analysis framework developed at CERN.
SCaVis: Java (multi-platform) data analysis framework developed at ANL.
KNIME: the Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
Data Applied: an online data mining and data visualization solution.
R: a programming language and software environment for statistical computing and graphics.
DevInfo: a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
Zeptoscope Basic:[16] interactive Java-based plotter developed at Nanomix.
Lavastorm Analytics Engine Public Edition: free desktop edition for organizations.
GeNIe: discovery of causal relationships from data; learning and inference with Bayesian networks; industrial-quality software developed at the Decision Systems Laboratory, University of Pittsburgh.
ANTz: C realtime 3D data visualization; hierarchical object trees that combine multiple topologies with millions of nodes.

Commercial software for data analysis

Holsys: a tool for the analysis of complex systems (sensor networks, industrial plants) based on a reinterpretation of the if-then clause in the sense of the theory of holons.
Infobright: its high-performance analytic database is designed for analyzing large volumes of machine-generated data.

Education

In education, most educators have access to a data system for the purpose of analyzing student data.[17] These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system, and making key package, display, and content decisions) to improve the accuracy of educators' data analyses.[18]

Nuclear and particle physics

In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. They are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. Data reduction is most often (especially in large particle physics experiments) an automatic, batch-mode operation carried out by software written ad hoc. The resulting data (n-tuples) are then scrutinized by the physicists using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory. The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4 in order to predict the response of the detector to a given theoretical event, thus producing simulated events which are then compared to experimental data.
See also

Statistics portal. Wikiversity has learning materials about data analysis. Analytics, business intelligence, censoring (statistics), computational physics, data acquisition, data governance, data mining, data presentation architecture, digital signal processing, dimension reduction, early case assessment, exploratory data analysis, Fourier analysis, machine learning, multilinear PCA, multilinear subspace learning, nearest neighbor search, predictive analytics, principal component analysis, qualitative research, scientific computing, structured data analysis (statistics), test method, text analytics, unstructured data, wavelet.

References

1. Adèr (2008), pp. 334-335.
2. Adèr (2008), pp. 336-337.
3. Adèr (2008), p. 337.
4. Adèr (2008), pp. 338-341.
5. Adèr (2008), pp. 341-342.
6. Adèr (2008), p. 344.
7. Tabachnick & Fidell (2007), pp. 87-88.
8. Adèr (2008), pp. 344-345.
9. Adèr (2008), p. 345.
10. Adèr (2008), pp. 345-346.
11. Adèr (2008), pp. 346-347.
12. Adèr (2008), pp. 349-353.
13. Adèr (2008), p. 363.
14. Adèr (2008), pp. 361-362.
15. Adèr (2008), pp. 368-371.
16. Zeptoscope, synopsia.net.
17. Aarons, D. (2009). "Report finds states on course to build pupil data systems." Education Week 29(13), 6.
18. Rankin, J. (2013, March 28). "How data systems & reports can either fight or propagate the data analysis error epidemic, and how educator leaders can help." Presentation conducted at the Technology Information Center for Administrative Leadership (TICAL) School Leadership Summit.
Adèr, H. J. (2008). "Chapter 14: Phases and initial steps in data analysis." In H. J. Adèr & G. J. Mellenbergh (Eds.) (with contributions by D. J. Hand), Advising on Research Methods: A Consultant's Companion (pp. 333-356). Huizen, the Netherlands: Johannes van Kessel Publishing.
Adèr, H. J. (2008). "Chapter 15: The main analysis phase." In H. J. Adèr & G. J. Mellenbergh (Eds.) (with contributions by D. J. Hand), Advising on Research Methods: A Consultant's Companion (pp. 333-356). Huizen, the Netherlands: Johannes van Kessel Publishing.
Tabachnick, B. G. & Fidell, L. S. (2007). "Chapter 4: Cleaning up your act. Screening data prior to analysis." In B. G. Tabachnick & L. S. Fidell (Eds.), Using Multivariate Statistics (Fifth Edition, pp. 60-116). Boston: Pearson Education, Inc. / Allyn and Bacon.

Further reading

Adèr, H. J. & Mellenbergh, G. J. (with contributions by D. J. Hand) (2008). Advising on Research Methods: A Consultant's Companion. Huizen, the Netherlands: Johannes van Kessel Publishing.
ASTM International (2002). Manual on Presentation of Data and Control Chart Analysis, MNL 7A. ISBN 0-8031-2093-1.
Juran, Joseph M.; Godfrey, A. Blanton (1999). Juran's Quality Handbook (5th ed.). New York: McGraw-Hill. ISBN 0-07-034003-X.
Lewis-Beck, Michael S. (1995). Data Analysis: An Introduction. Sage Publications Inc. ISBN 0-8039-5772-6.
NIST/SEMATECH (2008). Handbook of Statistical Methods.
Pyzdek, T. (2003). Quality Engineering Handbook. ISBN 0-8247-4614-7.
Richard Veryard (1984). Pragmatic Data Analysis. Oxford: Blackwell Scientific Publications. ISBN 0-632-01311-7.
Tabachnick, B. G. & Fidell, L. S. (2007). Using Multivariate Statistics (Fifth Edition). Boston: Pearson Education, Inc. / Allyn and Bacon. ISBN 978-0-205-45938-4.
Vance (September 8, 2011). "Data Analytics: Crunching the Future." Bloomberg Businessweek. Retrieved 26 September 2011.
Hair, Joseph (2008). Marketing Research (4th ed.). McGraw-Hill. "Data analysis: testing for association." ISBN 0-07-340470-5.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_classification_business_intelligence_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_classification_business_intelligence_
new file mode 100644
index 00000000..3995fdaf
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_classification_business_intelligence_
@@ -0,0 +1 @@
+Data classification (business intelligence) (from Wikipedia, the free encyclopedia)

In business intelligence, data classification has close ties to data clustering, but where data clustering is descriptive, data classification is predictive.[1][2] In essence, data classification consists of using variables with known values to predict the unknown or future values of other variables. It can be used in, e.g., direct marketing, insurance fraud detection, or medical diagnosis.[2]

The first step in doing a data classification is to cluster the data set used for category training, to create the wanted number of categories. An algorithm, called the classifier, is then used on the categories, creating a descriptive model for each. These models can then be used to categorize new items in the created classification system.[1]

According to Golfarelli and Rizzi, these are the measures of effectiveness of the classifier:[1]

Predictive accuracy: how well does it predict the categories for new observations?
Speed: what is the computational cost of using the classifier?
Robustness: how well do the created models perform if data quality is low?
Scalability: does the classifier function efficiently with large amounts of data?
Interpretability: are the results understandable to users?

Typical examples of input for data classification could be variables such as demographics, lifestyle information, or economic behaviour.
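A short illustrative sketch of the cluster-then-classify workflow described above, using scikit-learn on synthetic "customer attributes"; the cluster count, model choice, and data are arbitrary assumptions, not part of the article.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# pretend these are customer attributes (e.g. demographics, spending behaviour)
customers = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0, 5, 10)])

# step 1 (descriptive): cluster the training data to create the categories
categories = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)

# step 2 (predictive): train a classifier on the created categories
classifier = DecisionTreeClassifier().fit(customers, categories)

# categorise new, unseen items with the resulting model
new_customers = np.array([[0.2, -0.1], [9.8, 10.3]])
print(classifier.predict(new_customers))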
References

1. Golfarelli, M. & Rizzi, S. (2009). Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill Osborne. ISBN 0-07-161039-1.
2. Kimball, R. et al. (2008). The Data Warehouse Lifecycle Toolkit (2nd ed.). Wiley. ISBN 0-471-25547-5.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_collection b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_collection
new file mode 100644
index 00000000..a58e3e78
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_collection
@@ -0,0 +1 @@
+Data collection

Data collection usually takes place early on in an improvement project and is often formalised through a data collection plan,[1] which often contains the following activities:

Pre-collection activity: agree on goals, target data, definitions, and methods.
Collection: the data collection itself.
Present findings: usually involves some form of sorting,[2] analysis, and/or presentation.

Prior to any data collection, pre-collection activity is one of the most crucial steps in the process. It is often discovered too late that the value of interview information is discounted as a consequence of poor sampling of both questions and informants and poor elicitation techniques.[3] After pre-collection activity is fully completed, data collection in the field, whether by interviewing or other methods, can be carried out in a structured, systematic, and scientific way.

A formal data collection process is necessary, as it ensures that the data gathered are both defined and accurate and that subsequent decisions based on arguments embodied in the findings are valid.[4] The process provides both a baseline from which to measure and, in certain cases, a target on what to improve.
The other main types of collection are census, sample survey, and administrative by-product, each with their respective advantages and disadvantages. A census refers to data collection about everyone or everything in a group or statistical population, and has advantages such as accuracy and detail and disadvantages such as cost and time. Sampling is a data collection method that includes only part of the total population, and has advantages such as cost and time and disadvantages such as accuracy and detail. Administrative by-product data are collected as a by-product of an organization's day-to-day operations, and have advantages such as accuracy, time, and simplicity, and disadvantages such as inflexibility and lack of control.[5]

See also: scientific data archiving, data management, experiment, observational study, sampling (statistics), statistical survey, survey data collection.

References

1. leanyourcompany.com, "Establishing a Data Collection Plan".
2. Coxon, Anthony Peter Macmillan. Sorting Data: Collection and Analysis. ISBN 0-8039-7237-7.
3. Weller, S. & Romney, A. (1988). Systematic Data Collection (Qualitative Research Methods Series 10). Thousand Oaks, California: Sage Publications. ISBN 0-8039-3074-7.
4. Sapsford, Roger & Jupp, Victor. Data Collection and Analysis. ISBN 0-7619-5046-X.
5. Weimer, J. (ed.) (1995). Research Techniques in Human Engineering. Englewood Cliffs, NJ: Prentice Hall. ISBN 0-13-097072-7.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_dredging b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_dredging
new file mode 100644
index 00000000..627e67aa
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_dredging
@@ -0,0 +1 @@
+Data dredging

Data dredging (data fishing, data snooping, equation fitting) is the inappropriate, sometimes deliberate, use of data mining to uncover misleading relationships in data. Data-snooping bias is a form of statistical bias that arises from this misuse of statistics: any relationships found might appear valid within the test set, but they would have no statistical significance in the wider population.

Data dredging and data-snooping bias can occur when researchers either do not form a hypothesis in advance or narrow the data used to reduce the probability of the sample refuting a specific hypothesis. Although data-snooping bias can occur in any field that uses data mining, it is of particular concern in finance and medical research, both of which make heavy use of data mining.

The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation.
Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance level. When large numbers of tests are performed, some produce false results; hence 5% of randomly chosen hypotheses turn out to be significant at the 5% level, 1% turn out to be significant at the 1% significance level, and so on, by chance alone. This, and a comic example (http://imgs.xkcd.com/comics/significant.png), exemplify the multiple-comparisons hazard in data dredging: there is no overall effect of jelly beans on acne. Also, subgroups are sometimes explored without alerting the reader to the number of questions at issue, which can lead to misinformed conclusions.[1]

When enough hypotheses are tested, it is virtually certain that some will falsely appear statistically significant, since every data set with any degree of randomness contains some spurious correlations. Researchers using data mining techniques can be easily misled by these apparently significant results, even though they are mere artifacts of random variation. Circumventing the traditional scientific approach by conducting an experiment without a hypothesis can lead to premature conclusions. Data mining can be used negatively to seek more information from a data set than it actually contains. Failure to adjust existing statistical models when applying them to new data sets can also result in the appearance of new patterns between different attributes that would otherwise not have shown up. Overfitting, oversearching, overestimation, and attribute-selection errors are all actions that can lead to data dredging.

Types of problem

Drawing conclusions from data

The conventional frequentist statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, followed by carrying out a statistical significance test to see whether the results could be due to the effects of chance. The last step is called testing against the null hypothesis.

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same population, it is impossible to determine whether the patterns found are chance patterns (see testing hypotheses suggested by the data).

Here is a simple example. Throwing a coin five times, with a result of two heads and three tails, might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then throw the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging.
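The "5% of randomly chosen hypotheses turn out to be significant at the 5% level" point is easy to reproduce numerically. The following sketch assumes NumPy and SciPy are available and uses invented, purely random data, so every null hypothesis is true by construction:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_tests, alpha = 1000, 0.05
false_positives = 0
for _ in range(n_tests):
    # two samples drawn from the same distribution: no real effect exists
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
print(f"{false_positives} of {n_tests} true null hypotheses rejected at alpha={alpha}")

Typically around 5% of the tests come out "significant" by chance alone, which is exactly the multiple-comparisons hazard described above.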
Hypothesis suggested by non-representative data

Main article: testing hypotheses suggested by the data.

In a list of 367 people, at least two have the same day and month of birth. Suppose Mary and John both celebrate birthdays on August 7. Data snooping would, by design, try to find additional similarities between Mary and John, such as: Are they the youngest and the oldest persons in the list? Have they met in person once, twice, three times? Do their fathers have the same first name, or their mothers the same maiden name? By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we can almost certainly find some similarity between them. Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our hypothesis, biased by data snooping, can then become "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself very strongly supports that correlation, since no one with a different birthday had switched minors three times in college. However, when we turn to the larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than once. The "fact" exists only for a very small, specific sample, not for the public as a whole (see also reproducible research).

Bias

Main article: bias.

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalised abacavir, since its patients were more high-risk and so more of them had heart attacks.[1] This problem can be very severe, for example in observational studies.[1][2] Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[1] By selecting papers with a significant p-value, negative studies are selected against, which is publication bias.

Multiple modelling

Another aspect of the conditioning of statistical tests by knowledge of the data can be seen in the frequent use of linear regression in data analysis. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see stepwise regression) and substantive considerations that lead the authors to favor some of their models over others, and there is liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test, possibly an imprecise intuitive one, that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net, in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, it may introduce bias and alter mean square error in estimation.[3][4]
Examples in meteorology and epidemiology

In meteorology, data set A is often weather data up to the present, which ensures that, even subconsciously, subset B of the data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for hundreds or thousands of different, mostly uncorrelated, variables. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a p-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is more likely than not to get at least one null hypothesis with a p-value less than 0.01.

Remedies

Looking for patterns in data is legitimate; applying a statistical test of significance (hypothesis testing) to the same data the pattern was learned from is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset, say subset A, is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid.

Another remedy for data dredging is to record the number of all significance tests conducted during the experiment and simply multiply the final significance level by this number (the Bonferroni correction); however, this is a very conservative metric. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions, are Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. The use of a false discovery rate is a more sophisticated approach that has become a popular method for control of multiple hypothesis tests.

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.[4]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and the method used to examine the data. Thus, if someone says that a certain event has a probability of 20% plus or minus 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.
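The out-of-sample and Bonferroni remedies can be combined in a few lines. This is a sketch on invented data, assuming NumPy and SciPy are installed; the variable counts and thresholds are illustrative only:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))   # 50 candidate explanatory variables
y = rng.normal(size=200)         # an outcome unrelated to all of them

# Randomly partition the observations into subset A (exploration) and subset B (confirmation).
idx = rng.permutation(200)
A, B = idx[:100], idx[100:]

# Dredge subset A for the variable most correlated with the outcome ...
best = max(range(X.shape[1]), key=lambda j: abs(pearsonr(X[A, j], y[A])[0]))
p_a = pearsonr(X[A, best], y[A])[1]

# ... then test that single hypothesis on subset B, with a Bonferroni-corrected threshold.
p_b = pearsonr(X[B, best], y[B])[1]
threshold = 0.05 / X.shape[1]
print(f"variable {best}: p={p_a:.3f} on A, p={p_b:.3f} on B, corrected threshold {threshold:.4f}")

On data with no real structure, the variable that looks best on subset A will usually fail both the confirmation test on B and the corrected threshold, which is the point of the two remedies described above.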
See also: base rate fallacy, Bonferroni inequalities, pareidolia, predictive analytics, misuse of statistics, Lincoln-Kennedy coincidences urban legend.

References

1. Young, S. S. & Karr, A. (2011). "Deming, data and observational studies". Significance 8 (3).
2. Smith, G. D. & Shah, E. (2002). "Data dredging, bias, or confounding". BMJ 325. PMC 1124898.
3. Selvin, H. C. & Stuart, A. (1966). "Data-dredging procedures in survey analysis". The American Statistician 20 (3), 20-23.
4. Berk, R., Brown, L. & Zhao, L. (2009). "Statistical inference after model selection". Journal of Quantitative Criminology. doi:10.1007/s10940-009-9077-7.

Further reading: Ioannidis, John P. A. (August 30, 2005). "Why most published research findings are false". PLoS Medicine 2 (8): e124. doi:10.1371/journal.pmed.0020124. PMID 16060722.

External links: a bibliography on data-snooping bias.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_management b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_management
new file mode 100644
index 00000000..e0400a6c
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_management
@@ -0,0 +1 @@
+Data management

Data management comprises all the disciplines related to managing data as a valuable resource.

Overview

The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data resource management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise." This definition is fairly broad and encompasses a number of professions which may not have direct technical contact with lower-level aspects of data management, such as relational database management. Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."[1]
The concept of data management arose in the 1980s as technology moved from sequential processing (first cards, then tape) to random-access processing. Since it was now technically possible to store a single fact in a single place and access it using random-access disk, those suggesting that data management was more important than process management used arguments such as "a customer's home address is stored in 75 (or some other large number of) places in our computer systems." During this period, random-access processing was not competitively fast, so those suggesting that process management was more important than data management used batch processing time as their primary argument. As applications moved more and more into real-time, interactive applications, it became obvious to most practitioners that both management processes were important: if the data was not well defined, the data would be misused in applications; if the process wasn't well defined, it was impossible to meet user needs.

Corporate data quality management

Corporate data quality management (CDQM) is, according to the European Foundation for Quality Management and the Competence Center Corporate Data Quality (CC CDQ, University of St. Gallen), the whole set of activities intended to improve corporate data quality, both reactive and preventive. The main premise of CDQM is the business relevance of high-quality corporate data. CDQM comprises the following activity areas:[2]

Strategy for corporate data quality: as CDQM is affected by various business drivers and requires the involvement of multiple divisions in an organization, it must be considered a company-wide endeavor.
Corporate data quality controlling: effective CDQM requires compliance with standards, policies, and procedures. Compliance is monitored according to previously defined metrics and performance indicators and reported to stakeholders (see the sketch after this list).
Corporate data quality organization: CDQM requires clear roles and responsibilities for the use of corporate data. The CDQM organization defines tasks and privileges for decision making for CDQM.
Corporate data quality processes and methods: in order to handle corporate data properly and in a standardized way across the entire organization, and to ensure corporate data quality, standard procedures and guidelines must be embedded in the company's daily processes.
Data architecture for corporate data quality: the data architecture consists of the data object model (which comprises the unambiguous definition and the conceptual model of corporate data) and the data storage and distribution architecture.
Applications for corporate data quality: software applications support the activities of corporate data quality management. Their use must be planned, monitored, managed, and continuously improved.
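The "previously defined metrics and performance indicators" mentioned under corporate data quality controlling can be as simple as completeness and uniqueness ratios. The following plain-Python sketch is hypothetical: the record layout and the two metrics are invented for illustration and are not part of the EFQM/CC CDQ framework itself.

# Hypothetical customer records with one missing and one duplicated e-mail address.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]

filled = [r["email"] for r in records if r["email"] is not None]
completeness = len(filled) / len(records)            # share of records with a value
uniqueness = len(set(filled)) / max(len(filled), 1)  # share of non-duplicate values

print(f"completeness={completeness:.0%}, uniqueness={uniqueness:.0%}")

Indicators of this kind would then be tracked over time and reported to stakeholders, as the controlling activity above describes.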
Topics in data management

Topics in data management, grouped by the DAMA-DMBOK framework,[3] include:

Data governance: data asset, data governance, data steward.
Data architecture, analysis and design: data analysis, data architecture, data modeling.
Database management: data maintenance, database administration, database management system.
Data security management: data access, data erasure, data privacy, data security.
Data quality management: data cleansing, data integrity, data enrichment, data quality, data quality assurance.
Reference and master data management: data integration, master data management, reference data.
Data warehousing and business intelligence management: business intelligence, data mart, data mining, data movement (extract, transform and load), data warehousing.
Document, record and content management: document management system, records management.
Metadata management: metadata management, metadata, metadata discovery, metadata publishing, metadata registry.
Contact data management: business continuity planning, marketing operations, customer data integration, identity management, identity theft, data theft, ERP software, CRM software, address (geography), postal code, email address, telephone number.

Body of knowledge

The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK Guide), under the guidance of a new DAMA-DMBOK editorial board, has been available since April 5, 2009.

Usage

In modern management usage, one can easily discern a trend away from the term "data" in composite expressions towards the term "information" or even "knowledge" when talking in a non-technical context. Thus there exists not only data management but also information management and knowledge management. This is a misleading trend, as it obscures the fact that traditional data is what is managed or somehow processed. On second look, the distinction between data and derived values can be seen in the information ladder: while data can exist as such, information and knowledge are always in the eye (or rather the brain) of the beholder and can only be measured in relative units.

See also: information architecture, enterprise architecture, information design, information system, controlled vocabulary.

Notes

1. DAMA-DMBOK Guide (Data Management Body of Knowledge) Introduction and Project Status, http://www.dama.org/files/public/di_dama_dmbok_guide_presentation_2007.pdf
2. EFQM and IWI-HSG: EFQM Framework for Corporate Data Quality Management. Brussels: EFQM Press, 2011 (forthcoming).
3. DAMA-DMBOK Functional Framework, http://www.dama.org/i4a/pages/index.cfm?pageid=3364

External links: Data management at the Open Directory Project.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_mining
new file mode 100644
index 00000000..d4a5e0a3
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_mining
@@ -0,0 +1 @@
+Data mining

Not to be confused with analytics, information extraction, or data analysis.

Data mining (the analysis step of the "knowledge discovery in databases" process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2]
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The term is a buzzword[5] and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics), but it is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence. In the proper use of the word, the key term is discovery,[citation needed] commonly defined as detecting something new. Even the popular book "Data Mining: Practical Machine Learning Tools and Techniques with Java",[6] which covers mostly machine learning material, was originally to be named just "Practical Machine Learning", and the term "data mining" was only added for marketing reasons.[7] Often the more general terms "large-scale data analysis" or "analytics", or, when referring to actual methods, "artificial intelligence" and "machine learning", are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection and data preparation nor the result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are, or may be, too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Data mining uses information from past data to analyze the outcome of a particular problem or situation that may arise. Data mining works on data stored in data warehouses, which are used to hold the data being analyzed; that data may come from all parts of a business, from production to management. Managers also use data mining to decide upon marketing strategies for their products, and can use the data to compare and contrast themselves with competitors. Data mining turns its data into real-time analysis that can be used to increase sales, promote new products, or drop products that do not add value for the company.
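As a minimal illustration of the "unusual records (anomaly detection)" task mentioned above, the following sketch flags outliers in an invented series of transaction amounts using a simple z-score rule; the data, the threshold of 3, and the use of NumPy are assumptions for illustration, not a method prescribed by the article:

import numpy as np

rng = np.random.default_rng(2)
# 99 ordinary transaction amounts plus one injected anomaly
amounts = np.append(rng.normal(loc=100.0, scale=10.0, size=99), 400.0)

z = (amounts - amounts.mean()) / amounts.std()   # standard scores
unusual = np.flatnonzero(np.abs(z) > 3)          # records far from the bulk of the data

print("unusual records at indices:", unusual)

Real anomaly-detection methods are usually more robust than a global z-score, but the principle of isolating records that deviate strongly from the rest of the data is the same.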
Etymology

In the 1960s, statisticians used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" appeared around 1990 in the database community. At the beginning of the century there was the phrase "database mining", trademarked by HNC (a San Diego-based company, now merged into FICO) to pitch their Data Mining Workstation;[8] researchers consequently turned to "data mining". Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the topic (1989), and this term became more popular in the AI and machine learning communities; however, the term data mining became more popular in the business and press communities.[9] Currently, data mining and knowledge discovery are used interchangeably.

Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity, and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[10] in large data sets. It bridges the gap from applied statistics and artificial intelligence, which usually provide the mathematical background, to database management, by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

Research and evolution

The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[11] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[12]

Computer science conferences on data mining include: CIKM (ACM Conference on Information and Knowledge Management); DMIN (International Conference on Data Mining); DMKD (Research Issues on Data Mining and Knowledge Discovery); ECDM (European Conference on Data Mining); ECML-PKDD (European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases); EDM (International Conference on Educational Data Mining); ICDM (IEEE International Conference on Data Mining); KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining); MLDM (Machine Learning and Data Mining in Pattern Recognition); PAKDD (the annual Pacific-Asia Conference on Knowledge Discovery and Data Mining); PAW (Predictive Analytics World); SDM (SIAM International Conference on Data Mining); SSTD (Symposium on Spatial and Temporal Databases); and WSDM (ACM Conference on Web Search and Data Mining). Data mining topics are also present at many data management and database conferences, such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.
Process

The knowledge discovery in databases (KDD) process is commonly defined with the stages (1) selection, (2) pre-processing, (3) transformation, (4) data mining, and (5) interpretation/evaluation.[1] It exists, however, in many variations on this theme, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) business understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment; or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading methodology used by data miners.[13][14][15] The only other data mining standard named in these polls was SEMMA; however, 3-4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[16][17] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[18]

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned; data cleaning removes the observations containing noise and those with missing data.

Data mining

Data mining involves six common classes of tasks:[1]

Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits; using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis (see the sketch after this list).
Clustering: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as legitimate or as spam.
Regression: attempts to find a function which models the data with the least error.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
Sequential pattern mining: finds sets of data items that occur together frequently in some sequences. Sequential pattern mining, which extracts frequent subsequences from a sequence database, has attracted a great deal of interest in recent data mining research because it is the basis of many applications, such as web user analysis, stock trend prediction, DNA sequence analysis, finding language or linguistic patterns in natural language texts, and using the history of symptoms to predict certain kinds of disease.
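The market-basket idea behind association rule learning comes down to counting supports and confidences. A small, self-contained sketch on made-up transactions; the baskets and the 0.4/0.6 thresholds are invented for illustration, and production systems would use an algorithm such as Apriori or FP-growth rather than this brute-force pair count:

from itertools import combinations
from collections import Counter

baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"beer", "chips", "bread"},
    {"milk", "bread"},
]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(frozenset(p) for basket in baskets
                      for p in combinations(sorted(basket), 2))

for pair, count in pair_counts.items():
    support = count / n                        # share of baskets containing both items
    for lhs, rhs in (tuple(pair), tuple(pair)[::-1]):
        confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
        if support >= 0.4 and confidence >= 0.6:
            print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")

With these thresholds only the beer/chips rules survive, mirroring the "frequently bought together" use case described in the list above.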
Results validation

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid: it is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a training set of sample e-mails; once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained, and the accuracy of the patterns can then be measured from how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves. If the learned patterns do not meet the desired standards, then it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

Standards

There have been some efforts to define standards for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since; JDM 2.0 was withdrawn without reaching a final draft. For exchanging the extracted models, in particular for use in predictive analytics, the key standard is the Predictive Model Markup Language (PMML), an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover, for example, subspace clustering have been proposed independently of the DMG.[19]

Notable uses

Games

Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully acquire the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), is used to yield insightful
patterns berlekamp in dots and boxes etc and john nunn in chess endgames are notable examples of researchers doing this work though they were not and are not involved in tablebase generation business edit data mining is the analysis of historical business activities stored as static data in data warehouse databases to reveal hidden patterns and trends data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information examples of what businesses use data mining for include performing market analysis to identify new product bundles finding the root cause of manufacturing problems to prevent customer attrition and acquire new customers cross sell to existing customers and profile customers with more accuracy 20 in today s world raw data is being collected by companies at an exploding rate for example walmart processes over 20 million point of sale transactions every day this information is stored in a centralized database but would be useless without some type of data mining software to analyse it if walmart analyzed their point of sale data with data mining techniques they would be able to determine sales trends develop marketing campaigns and more accurately predict customer loyalty 21 every time we use our credit card a store loyalty card or fill out a warranty card data is being collected about our purchasing behavior many people find the amount of information stored about us from companies such as google facebook and amazon disturbing and are concerned about privacy although there is the potential for our personal data to be used in harmful or unwanted ways it is also being used to make our lives better for example ford and audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions 22 data mining in customer relationship management applications can contribute significantly to the bottom line citation needed rather than randomly contacting a prospect or customer through a call center or sending mail a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer more sophisticated methods may be used to optimize resources across campaigns so that one may predict to which channel and to which offer an individual is most likely to respond across all potential offers additionally sophisticated applications could be used to automate mailing once the results from data mining potential prospect customer and channel offer are determined this sophisticated application can either automatically send an e mail or a regular mail finally in cases where many people will take an action without an offer uplift modeling can be used to determine which people have the greatest increase in response if given an offer uplift modeling thereby enables marketers to focus mailings and offers on persuadable people and not to send offers to people who will buy the product without an offer data clustering can also be used to automatically discover the segments or groups within a customer data set businesses employing data mining may see a return on investment but also they recognize that the number of predictive models can quickly become very large rather than using one model to predict how many customers will churn a business could build a separate model for each region and customer type then instead of sending an offer to all people that are likely to 
churn it may only want to send offers to loyal customers finally the business may want to determine which customers are going to be profitable over a certain window in time and only send the offers to those that are likely to be profitable in order to maintain this quantity of models they need to manage model versions and move on to automated data mining data mining can also be helpful to human resources hr departments in identifying the characteristics of their most successful employees information obtained such as universities attended by highly successful employees can help hr focus recruiting efforts accordingly additionally strategic enterprise management applications help a company translate corporate level goals such as profit and margin share targets into operational decisions such as production plans and workforce levels 23 another example of data mining often called the market basket analysis relates to its use in retail sales if a clothing store records the purchases of customers a data mining system could identify those customers who favor silk shirts over cotton ones although some explanations of relationships may be difficult taking advantage of it is easier the example deals with association rules within transaction based data not all data are transaction based and logical or inexact rules may also be present within a database market basket analysis has also been used to identify the purchase patterns of the alpha consumer alpha consumers are people that play a key role in connecting with the concept behind a product then adopting that product and finally validating it for the rest of society analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands citation needed data mining is a highly effective tool in the catalog marketing industry citation needed catalogers have a rich database of history of their customer transactions for millions of customers dating back a number of years data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns data mining for business applications is a component that needs to be integrated into a complex modeling and decision making process reactive business intelligence rbi advocates a holistic approach that integrates data mining modeling and interactive visualization into an end to end discovery and continuous innovation process powered by human and automated learning 24 in the area of decision making the rbi approach has been used to mine knowledge that is progressively acquired from the decision maker and then self tune the decision method accordingly 25 an example of data mining related to an integrated circuit ic production line is described in the paper mining ic test data to optimize vlsi testing 26 in this paper the application of data mining and decision analysis to the problem of die level functional testing is described experiments mentioned demonstrate the ability to apply a system of mining historical die test data to create a probabilistic model of patterns of die failure these patterns are then utilized to decide in real time which die to test next and when to stop testing this system has been shown based on experiments with historical test data to have the potential to improve profits on mature ic products science and engineering edit in recent years data mining has been used widely in the areas of science and engineering such as bioinformatics genetics medicine education and 
electrical power engineering in the study of human genetics sequence mining helps address the important goal of understanding the mapping relationship between the inter individual variations in human dna sequence and the variability in disease susceptibility in simple terms it aims to find out how the changes in an individual s dna sequence affects the risks of developing common diseases such as cancer which is of great importance to improving methods of diagnosing preventing and treating these diseases the data mining method that is used to perform this task is known as multifactor dimensionality reduction 27 in the area of electrical power engineering data mining methods have been widely used for condition monitoring of high voltage electrical equipment the purpose of condition monitoring is to obtain valuable information on for example the status of the insulation or other important safety related parameters data clustering techniques such as the self organizing map som have been applied to vibration monitoring and analysis of transformer on load tap changers oltcs using vibration monitoring it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms obviously different tap positions will generate different signals however there was considerable variability amongst normal condition signals for exactly the same tap position som has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities 28 data mining methods have also been applied to dissolved gas analysis dga in power transformers dga as a diagnostics for power transformers has been available for many years methods such as som has been applied to analyze generated data and to determine trends which are not obvious to the standard dga ratio methods such as duval triangle 28 another example of data mining in science and engineering is found in educational research where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning 29 and to understand factors influencing university student retention 30 a similar example of social application of data mining is its use in expertise finding systems whereby descriptors of human expertise are extracted normalized and classified so as to facilitate the finding of experts particularly in scientific and technical fields in this way data mining can facilitate institutional memory other examples of application of data mining methods are biomedical data facilitated by domain ontologies 31 mining clinical trial data 32 and traffic analysis using som 33 in adverse drug reaction surveillance the uppsala monitoring centre has since 1998 used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the who global database of 4 6 160 million suspected adverse drug reaction incidents 34 recently similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses 35 data mining has been applied software artifacts within the realm of software engineering mining software repositories human rights edit data mining of government records particularly records of the justice system i e courts prisons enables the discovery of systemic human rights violations in connection to generation and publication of invalid or fraudulent legal records by various 
government agencies 36 37 medical data mining edit in 2011 the case of sorrell v ims health inc decided by the supreme court of the united states ruled that pharmacies may share information with outside companies this practice was authorized under the 1st amendment of the constitution protecting the freedom of speech 38 spatial data mining edit spatial data mining is the application of data mining methods to spatial data the end objective of spatial data mining is to find patterns in data with respect to geography so far data mining and geographic information systems gis have existed as two separate technologies each with its own methods traditions and approaches to visualization and data analysis particularly most contemporary gis have only very basic spatial analysis functionality the immense explosion in geographically referenced data occasioned by developments in it digital mapping remote sensing and the global diffusion of gis emphasizes the importance of developing data driven inductive approaches to geographical analysis and modeling data mining offers great potential benefits for gis based applied decision making recently the task of integrating these two technologies has become of critical importance especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realize the huge potential of the information contained therein among those organizations are offices requiring analysis or dissemination of geo referenced statistical data public health services searching for explanations of disease clustering environmental agencies assessing the impact of changing land use patterns on climate change geo marketing companies doing customer segmentation based on spatial location challenges in spatial mining geospatial data repositories tend to be very large moreover existing gis datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems algorithmic requirements differ substantially for relational attribute data management and for topological feature data management 39 related to this is the range and diversity of geographic data formats which present unique challenges the digital geographic data revolution is creating new types of data formats beyond the traditional vector and raster formats geographic data repositories increasingly include ill structured data such as imagery and geo referenced multi media 40 there are several critical research challenges in geographic knowledge discovery and data mining miller and han 41 offer the following list of emerging research topics in the field developing and supporting geographic data warehouses gdw s spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses creating an integrated gdw requires solving issues of spatial and temporal data interoperability including differences in semantics referencing systems geometry accuracy and position better spatio temporal representations in geographic knowledge discovery current geographic knowledge discovery gkd methods generally use very simple representations of geographic objects and spatial relationships geographic data mining methods should recognize more complex geographic objects i e lines and polygons and relationships i e non euclidean distances direction connectivity and interaction through attributed geographic space such as terrain furthermore the time dimension needs to be more fully integrated into 
There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[41] offer the following list of emerging research topics in the field:
- Developing and supporting geographic data warehouses (GDWs): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability, including differences in semantics, referencing systems, geometry, accuracy, and position.
- Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e., lines and polygons) and relationships (i.e., non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.
- Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

Sensor data mining

Wireless sensor networks can be used for facilitating the collection of data for spatial data mining for a variety of applications, such as air pollution monitoring.[42] A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy, due to the spatial correlation between sensor observations, inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed to make spatial data mining more efficient.[43]

Visual data mining

In the process of turning from analog into digital, large data sets have been generated, collected, and stored, and statistical patterns, trends, and information hidden in the data can be discovered in order to build predictive models. Studies suggest visual data mining is faster and much more intuitive than traditional data mining.[44][45][46] (See also: computer vision.)

Music data mining

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for the purpose of classifying music into genres in a more objective manner.[47]

Surveillance

Data mining has been used to stop terrorist programs under the U.S. government, including the Total Information Awareness (TIA) program, Secure Flight (formerly known as Computer Assisted Passenger Prescreening System, CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE),[48] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[49] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[50]

In the context of combating terrorism, two particularly plausible methods of data mining are pattern mining and subject-based data mining.

Pattern mining

Pattern mining is a data mining method that involves finding existing patterns in data. In this context, "patterns" often means association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer => potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise."[51][52][53] Pattern mining includes new areas such as music information retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search methods.
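To make the beer/potato-chips rule above concrete, here is a minimal sketch of how the support and confidence of an association rule are computed from transaction data; the transactions and the rule_stats helper are invented for illustration, and a confidence of 0.8 would correspond to the "80%" quoted above.

    transactions = [
        {"beer", "chips", "salsa"},
        {"beer", "chips"},
        {"beer", "diapers"},
        {"beer", "chips", "bread"},
        {"milk", "bread"},
    ]

    def rule_stats(antecedent, consequent, transactions):
        """Return (support, confidence) for the rule antecedent -> consequent."""
        n = len(transactions)
        has_a = sum(1 for t in transactions if antecedent <= t)
        has_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
        support = has_both / n
        confidence = has_both / has_a if has_a else 0.0
        return support, confidence

    print(rule_stats({"beer"}, {"chips"}, transactions))   # (0.6, 0.75)

Algorithms such as Apriori first enumerate frequent itemsets so that rules like this can be found without checking every possible item combination.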
Subject-based data mining

Subject-based data mining is a data mining method involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[52]

Knowledge grid

Knowledge discovery "on the grid" generally refers to conducting knowledge discovery in an open environment using grid computing concepts, allowing users to integrate data from various online data sources as well as make use of remote resources for executing their data mining tasks. The earliest example was the Discovery Net,[54][55] developed at Imperial College London, which won the "Most Innovative Data-Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed knowledge discovery application for a bioinformatics application. Other examples include work conducted by researchers at the University of Calabria, who developed a Knowledge Grid architecture for distributed knowledge discovery, based on grid computing.[56][57]

Reliability and validity

Data mining can be misused, and can also unintentionally produce results which appear significant but which do not actually predict future behavior and cannot be reproduced on a new sample of data; see data dredging.

Privacy concerns and ethics

Some people believe that data mining itself is ethically neutral.[58] While the term "data mining" has no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise). To be precise, data mining is a statistical method that is applied to a set of information (i.e., a data set). Associating these data sets with people is an extreme narrowing of the types of data that are available: examples could range from a set of crash test data for passenger vehicles to the performance of a group of stocks. These types of data sets make up a great proportion of the information available to be acted on by data mining methods, and rarely have ethical concerns associated with them. However, the ways in which data mining can be used can, in some cases and contexts, raise questions regarding privacy, legality, and ethics.[59] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[60][61]

Data mining requires data preparation which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis, but that also might make identification of private, individual-level data deducible or otherwise apparent.[62] This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.
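A toy illustration (not from the article) of why such aggregation can defeat anonymity: joining an "anonymized" extract with a public register on quasi-identifiers such as ZIP code, birth year, and sex can single individuals out. All names, values, and field names below are fabricated.

    # "anonymized" records: direct identifiers removed, quasi-identifiers kept
    medical = [
        {"zip": "13353", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
        {"zip": "13353", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    ]

    # public register containing names and the same quasi-identifiers
    register = [
        {"name": "A. Example", "zip": "13353", "birth_year": 1980, "sex": "F"},
        {"name": "B. Sample",  "zip": "10115", "birth_year": 1975, "sex": "M"},
    ]

    keys = ("zip", "birth_year", "sex")
    index = {tuple(r[k] for k in keys): r["name"] for r in register}

    for rec in medical:
        name = index.get(tuple(rec[k] for k in keys))
        if name:
            # the quasi-identifier combination is unique enough to re-identify
            print(name, "->", rec["diagnosis"])   # A. Example -> asthma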
It is recommended that an individual is made aware of the following before data are collected:[62]
- the purpose of the data collection and any (known) data mining projects,
- how the data will be used,
- who will be able to mine the data and use the data and their derivatives,
- the status of security surrounding access to the data,
- how collected data can be updated.

In America, privacy concerns have been addressed to some extent by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena," says the AAHC. "More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[63] This underscores the necessity for data anonymity in data aggregation and mining practices.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[62] However, even "de-identified" or "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[64]

Software

See also: Category: Data mining and machine learning software.

Free open-source data mining software and applications:
- Carrot2: text and search results clustering framework.
- Chemicalize.org: a chemical structure miner and web search engine.
- ELKI: a university research project with advanced cluster analysis and outlier detection methods, written in the Java language.
- GATE: a natural language processing and language engineering tool.
- SCaVis: Java cross-platform data analysis framework developed at Argonne National Laboratory.
- KNIME: the Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
- ML-Flex: a software package that enables users to integrate with third-party machine learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
- NLTK (Natural Language Toolkit): a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
- SenticNet API: a semantic and affective resource for opinion mining and sentiment analysis.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics; part of the GNU Project.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture, a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications:
- Angoss KnowledgeSTUDIO: data mining tool provided by Angoss.
- BIRT Analytics: visual data mining and predictive analytics tool provided by Actuate Corporation.
- Clarabridge: enterprise-class text analytics solution.
- E-NI (e-mining, e-monitor): data mining tool based on temporal patterns.
- IBM DB2 Intelligent Miner: in-database data mining platform provided by IBM, with
modeling scoring and visualization services based on the sql mm pmml framework ibm spss modeler data mining software provided by ibm kxen modeler data mining tool provided by kxen lionsolver an integrated software application for data mining business intelligence and modeling that implements the learning and intelligent optimization lion approach microsoft analysis services data mining software provided by microsoft oracle data mining data mining software by oracle predixion insight data mining software by predixion software sas enterprise miner data mining software provided by the sas institute statistica data miner data mining software provided by statsoft holsys one tool for the analysis of complex systems sensors network industrial plant based on a reinterpretation of the if then clause in the sense of the theory of holons marketplace surveys edit several researchers and organizations have conducted reviews of data mining tools and surveys of data miners these identify some of the strengths and weaknesses of the software packages they also provide an overview of the behaviors preferences and views of data miners some of these reports include 2011 wiley interdisciplinary reviews data mining and knowledge discovery 65 annual rexer analytics data miner surveys 2007 2011 66 forrester research 2010 predictive analytics and data mining solutions report 67 gartner 2008 magic quadrant report 68 robert a nisbet s 2006 three part series of articles data mining tools which one is best for crm 69 haughton et al s 2003 review of data mining software packages in the american statistician 70 see also edit methods anomaly outlier change detection association rule learning classification cluster analysis decision tree factor analysis multilinear subspace learning neural networks regression analysis sequence mining structured data analysis text mining application domains analytics bioinformatics business intelligence data analysis data warehouse decision support system drug discovery exploratory data analysis predictive analytics web mining application examples see also category applied data mining customer analytics data mining in agriculture data mining in meteorology educational data mining national security agency police enforced anpr in the uk quantitative structure activity relationship surveillance mass surveillance e g stellar wind related topics data mining is about analyzing data for information about extracting information out of data see data integration data transformation information extraction information integration named entity recognition profiling practices web scraping references edit a b c fayyad usama piatetsky shapiro gregory smyth padhraic 1996 from data mining to knowledge discovery in databases retrieved 17 december 2008 160 a b c d data mining curriculum acm sigkdd 2006 04 30 retrieved 2011 10 28 160 clifton christopher 2010 encyclop dia britannica definition of data mining retrieved 2010 12 09 160 hastie trevor tibshirani robert friedman jerome 2009 the elements of statistical learning data mining inference and prediction retrieved 2012 08 07 160 see e g okairp 2005 fall conference arizona state university about com datamining witten ian h frank eibe hall mark a 30 january 2011 data mining practical machine learning tools and techniques 3 ed elsevier isbn 160 978 0 12 374856 0 160 bouckaert remco r frank eibe hall mark a holmes geoffrey pfahringer bernhard reutemann peter witten ian h 2010 weka experiences with a java open source project journal of machine learning research 
11 2533 2541 the original title practical machine learning was changed 160 the term data mining was added primarily for marketing reasons 160 mena jess 2011 machine learning forensics for law enforcement security and intelligence boca raton fl crc press taylor amp francis group isbn 160 978 1 4398 6069 4 160 piatetsky shapiro gregory parker gary 2011 lesson data mining and knowledge discovery an introduction introduction to data mining kd nuggets retrieved 30 august 2012 160 kantardzic mehmed 2003 data mining concepts models methods and algorithms john wiley amp sons isbn 160 0 471 22852 4 oclc 160 50055336 160 proceedings international conferences on knowledge discovery and data mining acm new york sigkdd explorations acm new york gregory piatetsky shapiro 2002 kdnuggets methodology poll gregory piatetsky shapiro 2004 kdnuggets methodology poll gregory piatetsky shapiro 2007 kdnuggets methodology poll scar marb n gonzalo mariscal and javier segovia 2009 a data mining amp knowledge discovery process model in data mining and knowledge discovery in real life applications book edited by julio ponce and adem karahoca isbn 978 3 902613 53 0 pp 160 438 453 february 2009 i tech vienna austria lukasz kurgan and petr musilek 2006 a survey of knowledge discovery and data mining process models the knowledge engineering review volume 21 issue 1 march 2006 pp 160 1 24 cambridge university press new york ny usa doi 10 1017 s0269888906000737 azevedo a and santos m f kdd semma and crisp dm a parallel overview in proceedings of the iadis european conference on data mining 2008 pp 160 182 185 g nnemann stephan kremer hardy seidl thomas 2011 an extension of the pmml standard to subspace clustering models proceedings of the 2011 workshop on predictive markup language modeling pmml 11 p 160 48 doi 10 1145 2023598 2023605 isbn 160 9781450308373 160 edit o brien j a amp marakas g m 2011 management information systems new york ny mcgraw hill irwin alexander d n d data mining retrieved from the university of texas at austin college of liberal arts http www laits utexas edu anorman bus for course mat alex goss s 2013 april 10 data mining and our personal privacy retrieved from the telegraph http www macon com 2013 04 10 2429775 data mining and our personal privacy html monk ellen wagner bret 2006 concepts in enterprise resource planning second edition boston ma thomson course technology isbn 160 0 619 21663 8 oclc 160 224465825 160 battiti roberto and brunato mauro reactive business intelligence from data to models to insight reactive search srl italy february 2011 isbn 978 88 905795 0 9 battiti roberto passerini andrea 2010 brain computer evolutionary multi objective optimization bc emo a genetic algorithm adapting to the decision maker ieee transactions on evolutionary computation 14 15 671 687 doi 10 1109 tevc 2010 2058118 160 fountain tony dietterich thomas and sudyka bill 2000 mining ic test data to optimize vlsi testing in proceedings of the sixth acm sigkdd international conference on knowledge discovery amp data mining acm press pp 18 25 zhu xingquan davidson ian 2007 knowledge discovery and data mining challenges and realities new york ny hershey p 160 18 isbn 160 978 1 59904 252 7 160 a b mcgrail anthony j gulski edward allan david birtwhistle david blackburn trevor r groot edwin r s data mining techniques to assess the condition of high voltage electrical plant cigr wg 15 11 of study committee 15 160 baker ryan s j d is gaming the system state or trait educational data mining through the multi contextual 
application of a validated behavioral model workshop on data mining for user modeling 2007 160 superby aguirre juan francisco vandamme jean philippe meskens nadine determination of factors influencing the achievement of the first year university students using data mining methods workshop on educational data mining 2006 160 zhu xingquan davidson ian 2007 knowledge discovery and data mining challenges and realities new york ny hershey pp 160 163 189 isbn 160 978 1 59904 252 7 160 zhu xingquan davidson ian 2007 knowledge discovery and data mining challenges and realities new york ny hershey pp 160 31 48 isbn 160 978 1 59904 252 7 160 chen yudong zhang yi hu jianming li xiang 2006 traffic data analysis using kernel pca and self organizing map ieee intelligent vehicles symposium 160 bate andrew lindquist marie edwards i ralph olsson sten orre roland lansner anders and de freitas rogelio melhado a bayesian neural network method for adverse drug reaction signal generation european journal of clinical pharmacology 1998 jun 54 4 315 21 nor n g niklas bate andrew hopstadius johan star kristina and edwards i ralph 2008 temporal pattern discovery for trends and transient effects its application to patient records proceedings of the fourteenth international conference on knowledge discovery and data mining sigkdd 2008 las vegas nv pp 963 971 zernik joseph data mining as a civic duty online public prisoners registration systems international journal on social media monitoring measurement mining 1 84 96 2010 zernik joseph data mining of online judicial records of the networked us federal courts international journal on social media monitoring measurement mining 1 69 83 2010 david g savage 2011 06 24 pharmaceutical industry supreme court sides with pharmaceutical industry in two decisions los angeles times retrieved 2012 11 07 160 healey richard g 1991 database management systems in maguire david j goodchild michael f and rhind david w eds geographic information systems principles and applications london gb longman camara antonio s and raper jonathan eds 1999 spatial multimedia and virtual reality london gb taylor and francis miller harvey j and han jiawei eds 2001 geographic data mining and knowledge discovery london gb taylor amp francis ma y richards m ghanem m guo y hassard j 2008 air pollution monitoring and mining based on sensor grid in london sensors 8 6 3601 doi 10 3390 s8063601 160 edit ma y guo y tian x ghanem m 2011 distributed clustering based aggregation algorithm for spatial correlated sensor networks ieee sensors journal 11 3 641 doi 10 1109 jsen 2010 2056916 160 edit zhao kaidi and liu bing tirpark thomas m and weimin xiao a visual data mining framework for convenient identification of useful knowledge keim daniel a information visualization and visual data mining burch michael diehl stephan wei gerber peter visual data mining in software archives pachet fran ois westermann gert and laigre damien musical data mining for electronic music distribution proceedings of the 1st wedelmusic conference firenze italy 2001 pp 101 106 government accountability office data mining early attention to privacy in developing a key dhs program could reduce risks gao 07 293 february 2007 washington dc secure flight program report msnbc total terrorism information awareness tia is it truly dead electronic frontier foundation official website 2003 retrieved 2009 03 15 160 agrawal rakesh mannila heikki srikant ramakrishnan toivonen hannu and verkamo a inkeri fast discovery of association rules in advances in 
knowledge discovery and data mining mit press 1996 pp 307 328 a b national research council protecting individual privacy in the struggle against terrorists a framework for program assessment washington dc national academies press 2008 haag stephen cummings maeve phillips amy 2006 management information systems for the information age toronto mcgraw hill ryerson p 160 28 isbn 160 0 07 095569 7 oclc 160 63194770 160 ghanem moustafa guo yike rowe anthony wendel patrick 2002 grid based knowledge discovery services for high throughput informatics proceedings 11th ieee international symposium on high performance distributed computing p 160 416 doi 10 1109 hpdc 2002 1029946 isbn 160 0 7695 1686 6 160 edit ghanem moustafa curcin vasa wendel patrick guo yike 2009 building and using analytical workflows in discovery net data mining techniques in grid computing environments p 160 119 doi 10 1002 9780470699904 ch8 isbn 160 9780470699904 160 edit cannataro mario talia domenico january 2003 the knowledge grid an architecture for distributed knowledge discovery communications of the acm 46 1 89 93 doi 10 1145 602421 602425 retrieved 17 october 2011 160 talia domenico trunfio paolo july 2010 how distributed data mining tasks can thrive as knowledge services communications of the acm 53 7 132 137 doi 10 1145 1785414 1785451 retrieved 17 october 2011 160 seltzer william the promise and pitfalls of data mining ethical issues 160 pitts chip 15 march 2007 the end of illegal domestic spying don t count on it washington spectator 160 taipale kim a 15 december 2003 data mining and domestic security connecting the dots to make sense of data columbia science and technology law review 5 2 oclc 160 45263753 ssrn 160 546782 160 resig john and teredesai ankur 2004 a framework for mining instant messaging services proceedings of the 2004 siam dm conference 160 a b c think before you dig privacy implications of data mining amp aggregation nascio research brief september 2004 biotech business week editors june 30 2008 biomedicine hipaa privacy rule impedes biomedical research biotech business week retrieved 17 november 2009 from lexisnexis academic aol search data identified individuals securityfocus august 2006 mikut ralf reischl markus september october 2011 data mining tools wiley interdisciplinary reviews data mining and knowledge discovery 1 5 431 445 doi 10 1002 widm 24 retrieved october 21 2011 160 karl rexer heather allen amp paul gearan 2011 understanding data miners analytics magazine may june 2011 informs institute for operations research and the management sciences kobielus james the forrester wave predictive analytics and data mining solutions q1 2010 forrester research 1 july 2008 herschel gareth magic quadrant for customer data mining applications gartner inc 1 july 2008 nisbet robert a 2006 data mining tools which one is best for crm part 1 information management special reports january 2006 haughton dominique deichmann joel eshghi abdolreza sayek selin teebagy nicholas and topi heikki 2003 a review of software packages for data mining the american statistician vol 57 no 4 pp 290 309 further reading edit cabena peter hadjnian pablo stadler rolf verhees jaap and zanasi alessandro 1997 discovering data mining from concept to implementation prentice hall isbn 0 13 743980 6 feldman ronen and sanger james the text mining handbook cambridge university press isbn 978 0 521 83657 9 guo yike and grossman robert editors 1999 high performance data mining scaling algorithms applications and systems kluwer academic 
Publishers. Hastie, Trevor; Tibshirani, Robert; and Friedman, Jerome (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, ISBN 0-387-95284-5. Liu, Bing (2007): Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer, ISBN 3-540-37881-2. Murphy, Chris (16 May 2011): "Is Data Mining Free Speech?", InformationWeek (UMB), p. 12. Nisbet, Robert; Elder, John; and Miner, Gary (2009): Handbook of Statistical Analysis & Data Mining Applications, Academic Press/Elsevier, ISBN 978-0-12-374765-5. Poncelet, Pascal; Masseglia, Florent; and Teisseire, Maguelonne (editors) (October 2007): Data Mining Patterns: New Methods and Applications, Information Science Reference, ISBN 978-1-59904-162-9. Tan, Pang-Ning; Steinbach, Michael; and Kumar, Vipin (2005): Introduction to Data Mining, ISBN 0-321-32136-7. Theodoridis, Sergios; and Koutroumbas, Konstantinos (2009): Pattern Recognition (4th edition), Academic Press, ISBN 978-1-59749-272-0. Weiss, Sholom M.; and Indurkhya, Nitin (1998): Predictive Data Mining, Morgan Kaufmann. Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011): Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Elsevier, ISBN 978-0-12-374856-0 (see also the free Weka software). Ye, Nong (2003): The Handbook of Data Mining, Mahwah, NJ: Lawrence Erlbaum.

External links: Wikimedia Commons has media related to data mining. Data mining software at the Open Directory Project.
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_pre_processing b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_pre_processing new file mode 100644 index 00000000..de42be25 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_pre_processing @@ -0,0 +1 @@ +Data pre-processing - Wikipedia, the free encyclopedia.

Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., income: -100), impossible data combinations (e.g., sex: male, pregnant: yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis.[1]

If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.[2]
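A minimal sketch of the kind of screening described above (out-of-range values, impossible combinations, missing values), followed by imputation and min-max normalization; the records, column names, and rules are invented for illustration and are not the algorithm referenced in the article.

    records = [
        {"age": 34,   "income": 52000, "sex": "female", "pregnant": "no"},
        {"age": None, "income": 61000, "sex": "male",   "pregnant": "no"},    # missing value
        {"age": 29,   "income": -100,  "sex": "female", "pregnant": "no"},    # out-of-range value
        {"age": 41,   "income": 48000, "sex": "male",   "pregnant": "yes"},   # impossible combination
    ]

    # 1. cleaning: drop impossible combinations, null out out-of-range values
    clean = [r for r in records if not (r["sex"] == "male" and r["pregnant"] == "yes")]
    for r in clean:
        if r["income"] is not None and r["income"] < 0:
            r["income"] = None

    # 2. imputation: replace missing numeric values with the column mean
    for col in ("age", "income"):
        known = [r[col] for r in clean if r[col] is not None]
        mean = sum(known) / len(known)
        for r in clean:
            if r[col] is None:
                r[col] = mean

    # 3. normalization: min-max scale numeric columns to [0, 1]
    for col in ("age", "income"):
        lo, hi = min(r[col] for r in clean), max(r[col] for r in clean)
        for r in clean:
            r[col] = (r[col] - lo) / (hi - lo) if hi > lo else 0.0

    print(clean)   # the screened, imputed, normalized training set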
References: Pyle, D. (1999): Data Preparation for Data Mining, Morgan Kaufmann Publishers, Los Altos, California. Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. (2006): "Data preprocessing for supervised learning", International Journal of Computer Science, Vol. 1, No. 2, pp. 111-117.
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_set b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_set new file mode 100644 index 00000000..e8600db5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_set @@ -0,0 +1 @@ +Data set - Wikipedia, the free encyclopedia. (For the IBM mainframe term for a file, see data set (IBM mainframe); for the telecommunications interface device, see modem.)

A dataset (or data set) is a collection of data. Most commonly a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where each column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question. The dataset lists values for each of the variables, such as the height and weight of an object, for each member of the dataset. Each value is known as a datum. The dataset may comprise data for one or more members, corresponding to the number of rows. The term dataset may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event.

History: Historically, the term originated in the mainframe field, where it had a well-defined meaning very close to the contemporary computer file.[citation needed]

Properties: Several characteristics define a dataset's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.[1] In the simplest case, there is only one variable, and the dataset consists of a single column of values, often represented as a list. In spite of the name, such a univariate dataset is not a set in the usual mathematical sense, since a given value may occur multiple times; usually the order does not matter, and then the collection of values may be considered a multiset rather than an ordered list.[original research?] The values may be numbers, such as real numbers or integers, for example representing a
person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind; however, there may also be missing values, which must be indicated in some way.

In statistics, datasets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Datasets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still present their data in the classical dataset fashion.

Classic datasets: Several classic datasets have been used extensively in the statistical literature:
- Iris flower data set: multivariate dataset introduced by Ronald Fisher (1936).[2]
- Categorical data analysis: datasets used in the book An Introduction to Categorical Data Analysis by Agresti are provided on-line by StatLib.
- Robust statistics: datasets used in Robust Regression and Outlier Detection (Rousseeuw and Leroy, 1986), provided on-line at the University of Cologne.
- Time series: data used in Chatfield's book The Analysis of Time Series are provided on-line by StatLib.
- Extreme values: data used in the book An Introduction to the Statistical Modeling of Extreme Values are provided on-line by Stuart Coles, the book's author.[dead link]
- Bayesian data analysis: data used in the book are provided on-line by Andrew Gelman, one of the book's authors.
- The BUPA liver data, used in several papers in the machine learning (data mining) literature.
- Anscombe's quartet: a small dataset illustrating the importance of graphing the data to avoid statistical fallacies.

See also: interoperability.[3]

Notes: 1. Jan M. Żytkow; Jan Rauch (1999): Principles of Data Mining and Knowledge Discovery, ISBN 978-3-540-66490-1. 2. Fisher, R. A. (1936): "The use of multiple measurements in taxonomic problems", Annals of Eugenics 7: 179-188, doi:10.1111/j.1469-1809.1936.tb02137.x. 3. King.

External links: The Datahub, a community-managed home for open datasets. Research Pipeline, a wiki website with links to datasets on many different topics. StatLib datasets archive. StatLib JASA data archive. data.gov.uk government public data. GCMD, the Global Change Master Directory, contains more than 20,000 descriptions of Earth science datasets and services covering all aspects of Earth and environmental sciences.
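A tiny, invented example of the table view described above (rows as members, columns as variables), together with one per-variable summary statistic; the numbers carry no meaning and only show the shape of a dataset.

    import statistics

    columns = ["height_cm", "weight_kg"]     # variables
    rows = [                                 # one row per member of the dataset
        [172.0, 68.5],
        [181.5, 77.0],
        [165.2, 59.8],
    ]

    for j, name in enumerate(columns):
        values = [row[j] for row in rows]
        print(name, round(statistics.mean(values), 1), round(statistics.stdev(values), 1))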
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_stream_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_stream_mining new file mode 100644 index 00000000..0381b9c7 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_stream_mining @@ -0,0 +1 @@ +Data stream mining - Wikipedia, the free encyclopedia.

Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times, using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.

In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream, given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances, or the rules underlying their labeling, may change over time, i.e., the goal of the prediction, the class to be predicted, or the target value to be predicted, may change over time. This problem is referred to as concept drift.

Software for data stream mining:
- RapidMiner: free open-source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time-varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin, formerly the concept drift plugin).
- MOA (Massive Online Analysis): free open-source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius-based functions. MOA supports bi-directional interaction with Weka (machine learning).

Events:
- International Workshop on Ubiquitous Data Mining, held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI) in Beijing, China, August 3-5, 2013.
- International Workshop on Knowledge Discovery from Ubiquitous Data Streams, held in conjunction with the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) in Warsaw, Poland, in September 2007.
- ACM Symposium on Applied Computing Data Streams
track held in conjunction with the 2007 acm symposium on applied computing sac 2007 in seoul korea in march 2007 ieee international workshop on mining evolving and streaming data iwmesd 2006 to be held in conjunction with the 2006 ieee international conference on data mining icdm 2006 in hong kong in december 2006 fourth international workshop on knowledge discovery from data streams iwkdds to be held in conjunction with the 17th european conference on machine learning ecml and the 10th european conference on principles and practice of knowledge discovery in databases pkdd ecml pkdd 2006 in berlin germany in september 2006 researchers working on data stream mining edit carlo zaniolo university of california los angeles ucla california united states jo o gama university of porto portugal leandro minku the university of birmingham uk mohamed medhat gaber university of portsmouth uk olfa nasraoui university of louisville usa hua fu li national chiao tung university taiwan eyke h llermeier university of marburg germany marco grawunder university of oldenburg germany latifur khan university of texas at dallas usa pedro pereira rodrigues university of porto portugal michael hahsler southern methodist university texas usa master references edit mohamed medhat gaber arkady zaslavsky and shonali krishnaswamy mining data streams a review acm sigmod record vol 34 no 2 june 2005 pp 18 26 brian babcock shivnath babu mayur datar rajeev motwani and jennifer widom models and issues in data stream systems in proc 21st acm sigact sigmod sigart symposium on principles of database systems pods 2002 madison wisconsin usa june 2002 mining data streams bibliography bibliographic references edit minku and yao ddd a new ensemble approach for dealing with concept drift ieee transactions on knowledge and data engineering 24 4 p 619 633 2012 hahsler michael and dunham margaret h temporal structure learning for clustering massive data streams in real time in siam conference on data mining sdm11 pages 664 675 siam april 2011 minku white and yao the impact of diversity on on line ensemble learning in the presence of concept drift ieee transactions on knowledge and data engineering 22 5 p 730 742 2010 mohammad m masud jing gao latifur khan jiawei han bhavani m thuraisingham integrating novel class detection with classification for concept drifting data streams ecml pkdd 2 2009 79 94 extended version will appear in tkde journal scholz martin and klinkenberg ralf boosting classifiers for drifting concepts in intelligent data analysis ida special issue on knowledge discovery from data streams vol 11 no 1 pages 3 28 march 2007 nasraoui o cerwinske j rojas c and gonzalez f collaborative filtering in dynamic usage environments in proc of cikm 2006 conference on information and knowledge management arlington va nov 2006 nasraoui o rojas c and cardona c a framework for mining evolving trends in web data streams using dynamic learning and retrospective validation journal of computer networks special issue on web dynamics 50 10 1425 1652 july 2006 scholz martin and klinkenberg ralf an ensemble classifier for drifting concepts in gama j and aguilar ruiz j s editors proceedings of the second international workshop on knowledge discovery in data streams pages 53 64 porto portugal 2005 klinkenberg ralf learning drifting concepts example selection vs example weighting in intelligent data analysis ida special issue on incremental learning systems capable of dealing with concept drift vol 8 no 3 pages 281 300 2004 klinkenberg ralf using 
labeled and unlabeled data to learn drifting concepts in kubat miroslav and morik katharina editors workshop notes of the ijcai 01 workshop on em learning from temporal and spatial data pages 16 24 ijcai menlo park ca usa aaai press 2001 maloof m and michalski r selecting examples for partial memory learning machine learning 41 11 2000 pp 160 27 52 koychev i gradual forgetting for adaptation to concept drift in proceedings of ecai 2000 workshop current issues in spatio temporal reasoning berlin germany 2000 pp 160 101 106 klinkenberg ralf and joachims thorsten detecting concept drift with support vector machines in langley pat editor proceedings of the seventeenth international conference on machine learning icml pages 487 494 san francisco ca usa morgan kaufmann 2000 koychev i and schwab i adaptation to drifting user s interests proc of ecml 2000 workshop machine learning in new information age barcelona spain 2000 pp 160 39 45 schwab i pohl w and koychev i learning to recommend from positive evidence proceedings of intelligent user interfaces 2000 acm press 241 247 klinkenberg ralf and renz ingrid adaptive information filtering learning in the presence of concept drifts in sahami mehran and craven mark and joachims thorsten and mccallum andrew editors workshop notes of the icml aaai 98 workshop em learning for text categorization pages 33 40 menlo park ca usa aaai press 1998 grabtree i soltysiak s identifying and tracking changing interests international journal of digital libraries springer verlag vol 2 38 53 widmer g tracking context changes through meta learning machine learning 27 1997 pp 160 256 286 maloof m a and michalski r s learning evolving concepts using partial memory approach working notes of the 1995 aaai fall symposium on active learning boston ma pp 160 70 73 1995 mitchell t caruana r freitag d mcdermott j and zabowski d experience with a learning personal assistant communications of the acm 37 7 1994 pp 160 81 91 widmer g and kubat m learning in the presence of concept drift and hidden contexts machine learning 23 1996 pp 160 69 101 schlimmer j and granger r incremental learning from noisy data machine learning 1 3 1986 317 357 books edit jo o gama and mohamed medhat gaber eds learning from data streams processing techniques in sensor networks springer 2007 auroop r ganguly jo o gama olufemi a omitaomu mohamed m gaber and ranga r vatsavai eds knowledge discovery from sensor data crc press 2008 jo o gama knowledge discovery from data streams chapman and hall crc 2010 see also edit concept drift data mining sequence mining streaming algorithm stream processing wireless sensor network external references edit ibm spade stream processing application declarative engine ibm infosphere streams streamit programming language and compilation infrastructure by mit csail retrieved from http en wikipedia org w index php title data_stream_mining amp oldid 548633870 categories data miningbusiness intelligencehidden categories articles needing cleanup from november 2012all articles needing cleanupcleanup tagged articles with a reason field from november 2012wikipedia pages needing cleanup from november 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent 
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_visualization b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_visualization new file mode 100644 index 00000000..7494649a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_visualization @@ -0,0 +1 @@ +Data visualization - Wikipedia, the free encyclopedia.

[Figure: a data visualization of Wikipedia as part of the World Wide Web, demonstrating hyperlinks.]

Data visualization is the study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".[1]

Overview

[Figure: a data visualization from social media.]

According to Friedman (2008), the "main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn't mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose: to communicate information."[2] Indeed, Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention.[3]

Data visualization is closely related to information graphics, information visualization, scientific visualization, and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching, and development. According to Post et al. (2002), it has united scientific and information visualization.[4] Brian Willison has demonstrated that data visualization has also been linked to enhancing agile software development and customer engagement.[5] KPI Library has developed the "Periodic Table of Visualization Methods", an interactive chart displaying various data visualization methods. It includes six types of data visualization methods: data, information, concept, strategy, metaphor, and compound.[6]

Data visualization scope

There are different approaches on the scope of data visualization. One common focus is on information presentation, such as Friedman (2008) presented it. In this way,
friendly 2008 presumes two main parts of data visualization statistical graphics and thematic cartography 1 in this line the data visualization modern approaches 2007 article gives an overview of seven subjects of data visualization 7 mindmaps displaying news displaying data displaying connections displaying websites articles amp resources tools and services all these subjects are closely related to graphic design and information representation on the other hand from a computer science perspective frits h post 2002 categorized the field into a number of sub fields 4 visualization algorithms and techniques volume visualization information visualization multiresolution methods modelling techniques and interaction techniques and architectures for different types of visualizations and their connection to infographics see infographics related fields edit data acquisition edit data acquisition is the sampling of the real world to generate data that can be manipulated by a computer sometimes abbreviated daq or das data acquisition typically involves acquisition of signals and waveforms and processing the signals to obtain desired information the components of data acquisition systems include appropriate sensors that convert any measurement parameter to an electrical signal which is acquired by data acquisition hardware data analysis edit data analysis is the process of studying and summarizing data with the intent to extract useful information and develop conclusions data analysis is closely related to data mining but data mining tends to focus on larger data sets with less emphasis on making inference and often uses data that was originally collected for a different purpose in statistical applications some people divide data analysis into descriptive statistics exploratory data analysis and inferential statistics or confirmatory data analysis where the eda focuses on discovering new features in the data and cda on confirming or falsifying existing hypotheses types of data analysis are exploratory data analysis eda an approach to analyzing data for the purpose of formulating hypotheses worth testing complementing the tools of conventional statistics for testing hypotheses it was so named by john tukey qualitative data analysis qda or qualitative research is the analysis of non numerical data for example words photographs observations etc data governance edit data governance encompasses the people processes and technology required to create a consistent enterprise view of an organisation s data in order to increase consistency amp confidence in decision making decrease the risk of regulatory fines improve data security maximize the income generation potential of data designate accountability for information quality data management edit data management comprises all the academic disciplines related to managing data as a valuable resource the official definition provided by dama is that data resource management is the development and execution of architectures policies practices and procedures that properly manage the full data lifecycle needs of an enterprise this definition is fairly broad and encompasses a number of professions that may not have direct technical contact with lower level aspects of data management such as relational database management data mining edit data mining is the process of sorting through large amounts of data and picking out relevant information it is usually used by business intelligence organizations and financial analysts but is increasingly being used in the sciences 
to extract information from the enormous data sets generated by modern experimental and observational methods it has been described as the nontrivial extraction of implicit previously unknown and potentially useful information from data 8 and the science of extracting useful information from large data sets or databases 9 in relation to enterprise resource planning according to monk 2006 data mining is the statistical and logical analysis of large sets of transaction data looking for patterns that can aid decision making 10
data transforms edit data transforms is the process of automation and transformation of both real time and offline data from one format to another there are standards and protocols that provide the specifications and rules and it usually occurs in the process pipeline of aggregation or consolidation or interoperability the primary use cases are in integration systems organizations and compliance personnel
data visualization software edit
software | type | targeted users | license
antz | realtime 3d data visualization | analysts scientists programmers vr | public domain
amira | gui code data visualisation | scientists | proprietary
avizo | gui code data visualisation | engineers and scientists | proprietary
cave5d | virtual reality data visualization | scientists | open source
curios it | interactive 3d data visualization | business managers | proprietary
data desk | gui data visualisation | statistician | proprietary
davix | operating system with data tools | security consultant | various
dundas data visualization inc | gui data visualisation | business managers | proprietary
elki | data mining visualizations | scientists and teachers | open source
eye sys | gui code data visualisation | engineers and scientists | proprietary
ferret data visualization and analysis | gridded datasets visualisation | oceanographers and meteorologists | open source
fusioncharts | component | programmers | proprietary
geoscape | geographic data visualisation | business users | proprietary
treemap | gui data visualisation | business managers | proprietary
trendalyzer | data visualisation | teachers | proprietary
tulip | gui data visualization | researchers and engineers | open source
gephi | gui data visualisation | statistician | open source
ggobi | gui data visualisation | statistician | open source
grapheur | gui data visualisation | business users project managers coaches | proprietary
ggplot2 | data visualization package for r | programmers | open source
mondrian | gui data visualisation | statistician | open source
ibm opendx | gui code data visualisation | engineers and scientists | open source
idl programming language | code data visualisation | programmer | many
idl programming language | programming language | programmer | open source
inetsoft | gui data visualization | business users developers academics | proprietary
infogr am | online infographic tool | journalists bloggers education business users | proprietary
instantatlas | gis data visualisation | analysts researchers statisticians and gis professionals | proprietary
mevislab | gui code data visualisation | engineers and scientists | proprietary
mindview | mind map graphic visualisation | business users and project managers | proprietary
kumu | web based relationship visualization | social impact business government and policy | proprietary
panopticon software | enterprise application sdk rapid development kit rdk | capital markets telecommunications energy government | proprietary
panorama software | gui data visualisation | business users | proprietary
panxpan | gui data visualisation | business users | proprietary
paraview | gui code data visualisation | engineers and scientists | bsd
processing programming language | programming language | programmers | gpl
profileplot | gui data visualisation | engineers and scientists | proprietary
protovis | library toolkit | programmers | bsd
qunb | gui data visualisation | non expert business users | proprietary
sas institute | gui data visualisation | business users analysts | proprietary
sciencegl | components solutions oem | scientists engineers analysts | proprietary
smile software | gui code data visualisation | engineers and scientists | proprietary
spotfire | gui data visualisation | business users | proprietary
statsoft | company of gui code data visualisation software | engineers and scientists | proprietary
tableau software | gui data visualisation | business users | proprietary
powerpanels | gui data visualisation | business users | proprietary
the hive group honeycomb | gui data visualisation | energy financial services manufacturers government military | proprietary
the hive group hiveondemand | gui data visualisation | business users academic users | proprietary
tinkerplots | gui data visualisation | students | proprietary
tom sawyer software | data visualization and social network analysis applications | capital markets telecommunications energy government business users engineers and scientists | proprietary
trade space visualizer | gui code data visualisation | engineers and scientists | proprietary
visifire | library | programmers | was open source now proprietary
vis5d | gui data visualization | scientists | open source
visad | java jython library | programmers | open source
visit | gui code data visualisation | engineers and scientists | bsd
vtk | c library | programmers | open source
weave | web based data visualization | many | open source 11
yoix | programming language | programmers | open source
visual ly | company creative tools data curation and visualization | proprietary
holsys one show the algorithms inside a data | gui data visualisation | engineers and scientists | proprietary
data presentation architecture edit data presentation architecture dpa is a skill set that seeks to identify locate manipulate format and present data in such a way as to optimally communicate meaning and proffer knowledge historically the term data presentation architecture is attributed to kelly lautt 12 data presentation architecture dpa is a rarely applied skill set critical for the success and value of business intelligence data presentation architecture weds the science of numbers data and statistics in discovering valuable information from data and making it usable relevant and actionable with the arts of data visualization communications organizational psychology and change management in order to provide business intelligence solutions with the data scope delivery timing format and visualizations that will most effectively support and drive operational tactical and strategic behaviour toward understood business or organizational goals dpa is neither an it nor a business skill set but exists as a separate field of expertise often confused with data visualization data presentation architecture is a much broader skill set that includes
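The data transforms passage above describes converting data between formats inside an aggregation or consolidation pipeline. A minimal sketch of such a conversion, assuming invented field names and a CSV-to-JSON direction chosen purely for illustration:

    # Minimal sketch of a format transform (CSV -> JSON), as described in the
    # data transforms passage above. Field names and values are invented.
    import csv, io, json

    csv_text = "sensor,timestamp,value\nA1,2013-06-01T12:00,21.5\nA2,2013-06-01T12:00,19.8\n"

    records = list(csv.DictReader(io.StringIO(csv_text)))   # parse rows into dicts
    for r in records:
        r["value"] = float(r["value"])                      # normalise types during the transform

    print(json.dumps(records, indent=2))                    # emit the consolidated JSON form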
determining what data on what schedule and in what exact format is to be presented not just the best way to present data that has already been chosen which is data visualization data visualization skills are one element of dpa objectives edit dpa has two main objectives to use data to provide knowledge in the most effective manner possible provide relevant timely and complete data to each audience member in a clear and understandable manner that conveys important meaning is actionable and can affect understanding behavior and decisions to use data to provide knowledge in the most efficient manner possible minimize noise complexity and unnecessary data or detail given each audience s needs and roles scope edit with the above objectives in mind the actual work of data presentation architecture consists of defining important meaning relevant knowledge that is needed by each audience member in each context finding the right data subject area historical reach breadth level of detail etc determining the required periodicity of data updates the currency of the data determining the right timing for data presentation when and how often the user needs to see the data utilizing appropriate analysis grouping visualization and other presentation formats creating effective delivery mechanisms for each audience member depending on their role tasks locations and access to technology related fields edit dpa work has some commonalities with several other fields including business analysis in determining business goals collecting requirements mapping processes solution architecture in determining the optimal detailed solution including the scope of data to include given the business goals business process improvement in that its goal is to improve and streamline actions and decisions in furtherance of business goals statistical analysis or data analysis in that it creates information and knowledge out of data data visualization in that it uses well established theories of visualization to add or highlight meaning or importance in data presentation information architecture but information architecture s focus is on unstructured data and therefore excludes both analysis in the statistical data sense and direct transformation of the actual content data for dpa into new entities and combinations graphic or user design as the term dpa is used it falls just short of design in that it does not consider such detail as colour palates styling branding and other aesthetic concerns unless these design elements are specifically required or beneficial for communication of meaning impact severity or other information of business value for example choosing to provide a specific colour in graphical elements that represent data of specific meaning or concern is part of the dpa skill set choosing locations for various data presentation elements on a presentation page such as in a company portal in a report or on a web page in order to convey hierarchy priority importance or a rational progression for the user is part of the dpa skill set see also edit information architecture information design infographic scientific visualization software visualization business analysis business intelligence data analysis statistical analysis data warehouse data profiling analytics corporate performance management information architecture information graphics information visualization statistical graphics interaction techniques interaction design visual analytics balanced scorecard references edit a b michael friendly 2008 milestones in the 
history of thematic cartography statistical graphics and data visualization
vitaly friedman 2008 data visualization and infographics in graphics monday inspiration january 14th 2008
fernanda viegas and martin wattenberg how to make data look sexy cnn com april 19 2011 http articles cnn com 2011 04 19 opinion sexy data_1_visualization 21st century engagement _s pm opinion
frits h post gregory m nielson and georges pierre bonneau 2002 data visualization the state of the art research paper tu delft 2002
brian willison visualization driven rapid prototyping parsons institute for information mapping 2008
lengler ralph periodic table of visualization methods www visual literacy org retrieved 15 march 2013
data visualization modern approaches in graphics august 2nd 2007
w frawley and g piatetsky shapiro and c matheus fall 1992 knowledge discovery in databases an overview ai magazine pp 213 228 issn 0738 4602
d hand h mannila p smyth 2001 principles of data mining mit press cambridge ma isbn 0 262 08290 x
ellen monk bret wagner 2006 concepts in enterprise resource planning second edition thomson course technology boston ma isbn 0 619 21663 8
http oicweave org
the first formal recorded public usages of the term data presentation architecture were at the three formal microsoft office 2007 launch events in dec jan and feb of 2007 08 in edmonton calgary and vancouver canada in a presentation by kelly lautt describing a business intelligence system designed to improve service quality in a pulp and paper company the term was further used and recorded in public usage on december 16 2009 in a microsoft canada presentation on the value of merging business intelligence with corporate collaboration processes
further reading edit
chandrajit bajaj bala krishnamurthy 1999 data visualization techniques
william s cleveland 1993 visualizing data hobart press
william s cleveland 1994 the elements of graphing data hobart press
alexander n gorban balazs kegl donald wunsch and andrei zinovyev 2008 principal manifolds for data visualization and dimension reduction lncse 58 springer
john p lee and georges g grinstein eds 1994 database issues for data visualization ieee visualization 93 workshop san diego
peter r keller and mary keller 1993 visual cues practical data visualization
frits h post gregory m nielson and georges pierre bonneau 2002 data visualization the state of the art
stewart liff and pamela a posey seeing is believing how the new art of visual management can boost performance throughout your organization amacom new york 2007 isbn 978 0 8144 0035 7
stephen few 2009 fundamental differences in analytical tools exploratory custom or customizable
external links edit
yau nathan 2011 visualize this the flowing data guide to design visualization and statistics wiley p 384 isbn 978 0470944882
milestones in the history of thematic cartography statistical graphics and data visualization an illustrated chronology of innovations by michael friendly and daniel j denis
peer reviewed definition of data visualization with commentaries
we love infographics
the data visualization academy
http bigdataviz dk
visualcomplexity
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_warehouse b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_warehouse new file mode 100644 index 00000000..b1816527 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Data_warehouse @@ -0,0 +1 @@
+data warehouse wikipedia the free encyclopedia data warehouse from wikipedia the free encyclopedia data warehouse overview in computing a data warehouse or enterprise data warehouse dw dwh or edw is a database used for reporting and data analysis it is a central repository of data which is created by integrating data from one or more disparate sources data warehouses store current as well as historical data and are used for creating trending reports for
senior management reporting such as annual and quarterly comparisons the data stored in the warehouse are uploaded from the operational systems such as marketing sales etc shown in the figure to the right the data may pass through an operational data store for additional operations before they are used in the dw for reporting the typical etl based data warehouse uses staging data integration and access layers to house its key functions the staging layer or staging database stores raw data extracted from each of the disparate source data systems the integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store ods database the integrated data are then moved to yet another database often called the data warehouse database where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts the combination of facts and dimensions is sometimes called a star schema the access layer helps users retrieve data 1 a data warehouse constructed from an integrated data source systems does not require etl staging databases or operational data store databases the integrated data source systems may be considered to be a part of a distributed operational data store layer data federation methods or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables unlike the etl based data warehouse the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data this integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems data warehouses can be subdivided into data marts data marts store subsets of data from a warehouse this definition of the data warehouse focuses on data storage the main source of the data is cleaned transformed cataloged and made available for use by managers and other business professionals for data mining online analytical processing market research and decision support marakas amp o brien 2009 however the means to retrieve and analyze data to extract transform and load data and to manage the data dictionary are also considered essential components of a data warehousing system many references to data warehousing use this broader context thus an expanded definition for data warehousing includes business intelligence tools tools to extract transform and load data into the repository and tools to manage and retrieve metadata contents 1 benefits of a data warehouse 2 generic data warehouse environment 3 history 4 information storage 4 1 facts 4 2 dimensional vs normalized approach for storage of data 5 top down versus bottom up design methodologies 5 1 bottom up design 5 2 top down design 5 3 hybrid design 6 data warehouses versus operational systems 7 evolution in organization use 8 sample applications 9 see also 10 references 11 further reading 12 external links benefits of a data warehouse edit a data warehouse maintains a copy of information from the source transaction systems this architectural complexity provides the opportunity to congregate data from multiple sources into a single database so a single query engine can be used to present data mitigate the problem of database isolation level lock contention in transaction processing 
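The paragraph above describes a fact table surrounded by dimension tables (a star schema). A minimal sketch in SQLite, with invented table and column names, showing one fact table, two dimensions and a typical aggregate query over the join:

    # Minimal sketch of a star schema (one fact table, two dimension tables) in SQLite.
    # Table and column names are invented for illustration.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date,
                              product_key INTEGER REFERENCES dim_product,
                              units INTEGER, revenue REAL);
    """)
    con.executemany("INSERT INTO dim_date VALUES (?,?,?)", [(20130601, 2013, 6), (20130602, 2013, 6)])
    con.executemany("INSERT INTO dim_product VALUES (?,?,?)", [(1, "widget", "hardware"), (2, "gadget", "hardware")])
    con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                    [(20130601, 1, 3, 30.0), (20130601, 2, 1, 25.0), (20130602, 1, 2, 20.0)])

    # A typical analytic query joins the fact table to its dimensions and aggregates.
    for row in con.execute("""
        SELECT d.month, p.category, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category"""):
        print(row)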
systems caused by attempts to run large long running analysis queries in transaction processing databases maintain data history even if the source transaction systems do not integrate data from multiple source systems enabling a central view across the enterprise this benefit is always valuable but particularly so when the organization has grown by merger improve data quality by providing consistent codes and descriptions flagging or even fixing bad data present the organization s information consistently provide a single common data model for all data of interest regardless of the data s source restructure the data so that it makes sense to the business users restructure the data so that it delivers excellent query performance even for complex analytic queries without impacting the operational systems add value to operational business applications notably customer relationship management crm systems generic data warehouse environment edit the environment for data warehouses and marts includes the following source systems that provide data to the warehouse or mart data integration technology and processes that are needed to prepare the data for use different architectures for storing data in an organization s data warehouse or data marts different tools and applications for the variety of users metadata data quality and governance processes must be in place to ensure that the warehouse or mart meets its purposes in regards to source systems listed above rainer states a common source for the data in data warehouses is the company s operational databases which can be relational databases 130 regarding data integration rainer states it is necessary to extract data from source systems transform them and load them into a data mart or warehouse 131 rainer discusses storing data in an organization s data warehouse or data marts there are a variety of possible architectures to store decision support data 131 metadata are data about data it personnel need information about data sources database table and column names refresh schedules and data usage measures 133 today the most successful companies are those that can respond quickly and flexibly to market changes and opportunities a key to this response is the effective and efficient use of data and information by analysts and managers rainer 127 a data warehouse is a repository of historical data that are organized by subject to support decision makers in the organization 128 once data are stored in a data mart or warehouse they can be accessed rainer r kelly 2012 05 01 introduction to information systems enabling and transforming business 4th edition page 129 wiley kindle edition v history edit the concept of data warehousing dates back to the late 1980s 2 when ibm researchers barry devlin and paul murphy developed the business data warehouse in essence the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments the concept attempted to address the various problems associated with this flow mainly the high costs associated with it in the absence of a data warehousing architecture an enormous amount of redundancy was required to support multiple decision support environments in larger corporations it was typical for multiple decision support environments to operate independently though each environment served different users they often required much of the same stored data the process of gathering cleaning and integrating data from various sources usually 
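One benefit listed above is that the warehouse keeps history even when the source transaction system overwrites its records. A minimal sketch of that idea, assuming an invented customer table and append-only, dated snapshot loads:

    # Minimal sketch of keeping history in the warehouse even though the source
    # overwrites records: each load appends a dated snapshot instead of updating
    # in place. Names and values are invented.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE customer_history
                   (customer_id INTEGER, name TEXT, city TEXT, load_date TEXT)""")

    def load_snapshot(rows, load_date):
        # Append-only load: earlier snapshots are never modified or deleted.
        con.executemany("INSERT INTO customer_history VALUES (?,?,?,?)",
                        [(r[0], r[1], r[2], load_date) for r in rows])

    load_snapshot([(1, "acme", "berlin")], "2013-06-01")
    load_snapshot([(1, "acme", "hamburg")], "2013-07-01")   # the source only keeps the new city

    for row in con.execute("SELECT * FROM customer_history ORDER BY load_date"):
        print(row)   # both versions of customer 1 remain queryable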
from long term existing operational systems usually referred to as legacy systems was typically in part replicated for each environment moreover the operational systems were frequently reexamined as new decision support requirements emerged often new requirements necessitated gathering cleaning and integrating new data from data marts that were tailored for ready access by users key developments in early years of data warehousing were 1960s 160 general mills and dartmouth college in a joint research project develop the terms dimensions and facts 3 1970s 160 acnielsen and iri provide dimensional data marts for retail sales 3 1970s 160 bill inmon begins to define and discuss the term data warehouse 1975 160 sperry univac introduce mapper maintain prepare and produce executive reports is a database management and reporting system that includes the world s first 4gl it was the first platform specifically designed for building information centers a forerunner of contemporary enterprise data warehousing platforms 1983 160 teradata introduces a database management system specifically designed for decision support 1983 160 sperry corporation martyn richard jones defines the sperry information center approach which while not being a true dw in the inmon sense did contain many of the characteristics of dw structures and process as defined previously by inmon and later by devlin first used at the tsb england amp wales 1984 160 metaphor computer systems founded by david liddle and don massaro releases data interpretation system dis dis was a hardware software package and gui for business users to create a database management and analytic system 1988 160 barry devlin and paul murphy publish the article an architecture for a business and information system in ibm systems journal where they introduce the term business data warehouse 1990 160 red brick systems founded by ralph kimball introduces red brick warehouse a database management system specifically for data warehousing 1991 160 prism solutions founded by bill inmon introduces prism warehouse manager software for developing a data warehouse 1992 160 bill inmon publishes the book building the data warehouse 4 1995 160 the data warehousing institute a for profit organization that promotes data warehousing is founded 1996 160 ralph kimball publishes the book the data warehouse toolkit 5 2000 160 daniel linstedt releases the data vault enabling real time auditable data warehouses warehouse information storage edit facts edit a fact is a value or measurement which represents a fact about the managed entity or system facts as reported by the reporting entity are said to be at raw level e g if a bts received 1 000 requests for traffic channel allocation it allocates for 820 and rejects the remaining then it would report 3 facts or measurements to a management system tch_req_total 1000 tch_req_success 820 tch_req_fail 180 facts at raw level are further aggregated to higher levels in various dimensions to extract more service or business relevant information out of it these are called aggregates or summaries or aggregated facts e g if there are 3 btss in a city then facts above can be aggregated from bts to city level in network dimension e g dimensional vs normalized approach for storage of data edit there are two leading approaches to storing data in a data warehouse 160 the dimensional approach and the normalized approach the dimensional approach whose supporters are referred to as kimballites believe in ralph kimball s approach in which it is stated 
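The BTS example above reports raw facts (tch_req_total, tch_req_success, tch_req_fail) per base station and then aggregates them to city level along the network dimension. A minimal sketch of that roll-up; the counter names come from the passage, while the per-BTS numbers and identifiers are invented:

    # Minimal sketch of rolling raw facts up a dimension: per-BTS counters
    # (names taken from the passage) aggregated to city level. The sample
    # numbers and city/BTS identifiers are invented.
    from collections import defaultdict

    raw_facts = [
        {"city": "bonn", "bts": "bts1", "tch_req_total": 1000, "tch_req_success": 820, "tch_req_fail": 180},
        {"city": "bonn", "bts": "bts2", "tch_req_total": 600,  "tch_req_success": 540, "tch_req_fail": 60},
        {"city": "bonn", "bts": "bts3", "tch_req_total": 400,  "tch_req_success": 390, "tch_req_fail": 10},
    ]

    city_totals = defaultdict(lambda: defaultdict(int))
    for fact in raw_facts:
        for measure in ("tch_req_total", "tch_req_success", "tch_req_fail"):
            city_totals[fact["city"]][measure] += fact[measure]   # aggregate along the network dimension

    print(dict(city_totals["bonn"]))   # aggregated facts at city level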
that the data warehouse should be modeled using a dimensional model star schema the normalized approach also called the 3nf model whose supporters are referred to as inmonites believe in bill inmon s approach in which it is stated that the data warehouse should be modeled using an e r model normalized model in a dimensional approach transaction data are partitioned into facts which are generally numeric transaction data and dimensions which are the reference information that gives context to the facts for example a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products and into dimensions such as order date customer name product number order ship to and bill to locations and salesperson responsible for receiving the order a key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use also the retrieval of data from the data warehouse tends to operate very quickly dimensional structures are easy to understand for business users because the structure is divided into measurements facts and context dimensions facts are related to the organization s business processes and operational system whereas the dimensions surrounding them contain context about the measurement kimball ralph 2008 the main disadvantages of the dimensional approach are in order to maintain the integrity of facts and dimensions loading the data warehouse with data from different operational systems is complicated and it is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business in the normalized approach the data in the data warehouse are stored following to a degree database normalization rules tables are grouped together by subject areas that reflect general data categories e g data on customers products finance etc the normalized structure divides data into entities which creates several tables in a relational database when applied in large enterprises the result is dozens of tables that are linked together by a web of joins furthermore each of the created entities is converted into separate physical tables when the database is implemented kimball ralph 2008 the main advantage of this approach is that it is straightforward to add information into the database a disadvantage of this approach is that because of the number of tables involved it can be difficult for users both to join data from different sources into meaningful information and then access the information without a precise understanding of the sources of data and of the data structure of the data warehouse it should be noted that both normalized and dimensional models can be represented in entity relationship diagrams as both contain joined relational tables the difference between the two models is the degree of normalization these approaches are not mutually exclusive and there are other approaches dimensional approaches can involve normalizing data to a degree kimball ralph 2008 in information driven business wiley 2010 6 robert hillard proposes an approach to comparing the two approaches based on the information needs of the business problem the technique shows that normalized models hold far more information than their dimensional equivalents even when the same fields are used in both models but this extra information comes at the cost of usability the technique measures information quantity in terms of information entropy and usability in terms of 
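The passage contrasts the dimensional and normalized ways of storing the same sales transaction. A minimal sketch of the two shapes side by side, with invented keys and values:

    # Minimal sketch contrasting the two modelling styles described above for the
    # same sales transaction. Layouts and values are invented for illustration.

    # Dimensional (star schema): one fact row holding measures plus dimension keys.
    fact_row = {"order_date_key": 20130601, "customer_key": 7, "product_key": 42,
                "units_ordered": 3, "price_paid": 59.97}

    # Normalized (3NF): the same information spread over entity tables linked by keys.
    orders      = [{"order_id": 1001, "customer_id": 7, "order_date": "2013-06-01"}]
    order_lines = [{"order_id": 1001, "product_id": 42, "units": 3, "price": 59.97}]
    customers   = [{"customer_id": 7, "name": "example customer"}]
    products    = [{"product_id": 42, "name": "example product"}]

    # The dimensional form answers analytic questions with few joins; the normalized
    # form avoids redundancy but needs more joins to reassemble the same picture.
    print(fact_row, orders[0], order_lines[0], sep="\n")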
the small worlds data transformation measure 7 top down versus bottom up design methodologies edit this section appears to be written like an advertisement please help improve it by rewriting promotional content from a neutral point of view and removing any inappropriate external links november 2012 bottom up design edit ralph kimball a well known author on data warehousing 8 is a proponent of an approach to data warehouse design which he describes as bottom up 9 in the bottom up approach data marts are first created to provide reporting and analytical capabilities for specific business processes it is important to note that in kimball methodology the bottom up process is the result of an initial business oriented top down analysis of the relevant business processes to be modelled data marts contain primarily dimensions and facts facts can contain either atomic data and if necessary summarized data the single data mart often models a specific business area such as sales or production these data marts can eventually be integrated to create a comprehensive data warehouse the integration of data marts is managed through the implementation of what kimball calls a data warehouse bus architecture 10 the data warehouse bus architecture is primarily an implementation of the bus a collection of conformed dimensions and conformed facts which are dimensions that are shared in a specific way between facts in two or more data marts the integration of the data marts in the data warehouse is centered on the conformed dimensions residing in the bus that define the possible integration points between data marts the actual integration of two or more data marts is then done by a process known as drill across a drill across works by grouping summarizing the data along the keys of the shared conformed dimensions of each fact participating in the drill across followed by a join on the keys of these grouped summarized facts maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse the most important management task is making sure dimensions among data marts are consistent in kimball s words this means that the dimensions conform some consider it an advantage of the kimball method that the data warehouse ends up being segmented into a number of logically self contained up to and including the bus and consistent data marts rather than a big and often complex centralized model business value can be returned as quickly as the first data marts can be created and the method gives itself well to an exploratory and iterative approach to building data warehouses for example the data warehousing effort might start in the sales department by building a sales data mart upon completion of the sales data mart the business might then decide to expand the warehousing activities into the say production department resulting in a production data mart the requirement for the sales data mart and the production data mart to be integrable is that they share the same bus that will be that the data warehousing team has made the effort to identify and implement the conformed dimensions in the bus and that the individual data marts links that information from the bus note that this does not require 100 awareness from the onset of the data warehousing effort no master plan is required upfront the sales data mart is good as it is assuming that the bus is complete and the production data mart can be constructed virtually independent of the sales data mart but not 
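The drill-across described above groups each fact table by the keys of a shared, conformed dimension and then joins the summaries. A minimal SQLite sketch, assuming invented sales and production fact tables that share a date key:

    # Minimal sketch of a drill-across: two fact tables that share a conformed
    # date dimension are each summarized by the shared key and then joined.
    # Table and column names are invented.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE fact_sales      (date_key INTEGER, revenue REAL);
    CREATE TABLE fact_production (date_key INTEGER, units_built INTEGER);
    """)
    con.executemany("INSERT INTO fact_sales VALUES (?,?)", [(20130601, 100.0), (20130601, 50.0), (20130602, 80.0)])
    con.executemany("INSERT INTO fact_production VALUES (?,?)", [(20130601, 12), (20130602, 9)])

    # Summarize each fact table along the conformed dimension, then join the summaries.
    for row in con.execute("""
        SELECT s.date_key, s.revenue, p.units_built
        FROM (SELECT date_key, SUM(revenue) AS revenue FROM fact_sales GROUP BY date_key) s
        JOIN (SELECT date_key, SUM(units_built) AS units_built FROM fact_production GROUP BY date_key) p
          ON p.date_key = s.date_key"""):
        print(row)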
independent of the bus if integration via the bus is achieved the data warehouse through its two data marts will not only be able to deliver the specific information that the individual data marts are designed to do in this example either sales or production information but can deliver integrated sales production information which often is of critical business value top down design edit bill inmon one of the first authors on the subject of data warehousing has defined a data warehouse as a centralized repository for the entire enterprise 10 inmon is one of the leading proponents of the top down approach to data warehouse design in which the data warehouse is designed using a normalized enterprise data model atomic data that is data at the lowest level of detail are stored in the data warehouse dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse in the inmon vision the data warehouse is at the center of the corporate information factory cif which provides a logical framework for delivering business intelligence bi and business management capabilities inmon states that the data warehouse is subject oriented the data in the data warehouse is organized so that all the data elements relating to the same real world event or object are linked together non volatile data in the data warehouse are never over written or deleted 160 once committed the data are static read only and retained for future reporting integrated the data warehouse contains data from most or all of an organization s operational systems and these data are made consistent time variant for an operational system the stored data contains the current value the data warehouse however contains the history of data values the top down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository top down design has also proven to be robust against business changes generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task the main disadvantage to the top down methodology is that it represents a very large project with a very broad scope the up front cost for implementing a data warehouse using the top down methodology is significant and the duration of time from the start of project to the point that end users experience initial benefits can be substantial in addition the top down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases 10 hybrid design edit data warehouse dw solutions often resemble the hub and spokes architecture legacy systems feeding the dw bi solution often include customer relationship management crm and enterprise resource planning solutions erp generating large amounts of data to consolidate these various data models and facilitate the extract transform load etl process dw solutions often make use of an operational data store ods the information from the ods is then parsed into the actual dw to reduce data redundancy larger systems will often store the data in a normalized way data marts for specific reports can then be built on top of the dw solution it is important to note that the dw database in a hybrid solution is kept on third normal form to eliminate data redundancy a normal relational database however is not efficient for business intelligence reports where dimensional modelling is prevalent small data marts can shop for 
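The hybrid flow described above extracts data from several operational systems, consolidates it into one layout (an ODS-style staging area) and loads it onward. A minimal extract-transform-load sketch with invented source formats and field names:

    # Minimal sketch of an extract-transform-load step: rows are pulled from two
    # invented source systems, normalised to one layout, and appended to an
    # ODS-style staging list before loading.
    def extract():
        crm_rows = [{"cust": "ACME", "spend": "100.50"}]          # pretend CRM export
        erp_rows = [{"customer_name": "acme", "amount": 49.5}]    # pretend ERP export
        return crm_rows, erp_rows

    def transform(crm_rows, erp_rows):
        ods = []
        for r in crm_rows:
            ods.append({"customer": r["cust"].lower(), "amount": float(r["spend"])})
        for r in erp_rows:
            ods.append({"customer": r["customer_name"].lower(), "amount": float(r["amount"])})
        return ods   # one consistent layout for the integration layer

    def load(ods, warehouse):
        warehouse.extend(ods)   # in a real system this would write to the warehouse database

    warehouse = []
    load(transform(*extract()), warehouse)
    print(warehouse)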
data from the consolidated warehouse and use the filtered specific data for the fact tables and dimensions required the dw effectively provides a single source of information from which the data marts can read creating a highly flexible solution from a bi point of view the hybrid architecture allows a dw to be replaced with a master data management solution where operational not static information could reside the data vault modeling components follow hub and spokes architecture this modeling style is a hybrid design consisting of the best practices from both 3rd normal form and star schema the data vault model is not a true 3rd normal form and breaks some of the rules that 3nf dictates be followed it is however a top down architecture with a bottom up design the data vault model is geared to be strictly a data warehouse it is not geared to be end user accessible which when built still requires the use of a data mart or star schema based release area for business purposes data warehouses versus operational systems edit operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity relationship model operational system designers generally follow the codd rules of database normalization in order to ensure data integrity codd defined five increasingly stringent rules of normalization fully normalized database designs that is those satisfying all five codd rules often result in information from a business transaction being stored in dozens to hundreds of tables relational databases are efficient at managing the relationships between these tables the databases have very fast insert update performance because only a small amount of data in those tables is affected each time a transaction is processed finally in order to improve performance older data are usually periodically purged from operational systems evolution in organization use edit these terms refer to the level of sophistication of a data warehouse offline operational data warehouse data warehouses in this stage of evolution are updated on a regular time cycle usually daily weekly or monthly from the operational systems and the data is stored in an integrated reporting oriented data offline data warehouse data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data are stored in a data structure designed to facilitate reporting on time data warehouse online integrated data warehousing represent the real time data warehouses stage data in the warehouse is updated for every transaction performed on the source data integrated data warehouse these data warehouses assemble data from different areas of business so users can look up the information they need across other systems 11 sample applications edit some of the applications of data warehousing include agriculture 12 biological data analysis call record analysis churn prediction for telecom subscribers credit card users etc decision support financial forecasting insurance fraud analysis logistics and inventory management trend analysis see also edit accounting intelligence anchor modeling business intelligence business intelligence tools data integration data mart data mining data presentation architecture data scraping data warehouse appliance database management system decision support system data vault modeling executive information system extract transform load master data management online analytical processing 
online transaction processing operational data store semantic warehousing snowflake schema software as a service star schema slowly changing dimension
references edit
patil preeti s srikantha rao suryakant b patil 2011 optimization of data warehousing system simplification in reporting and analysis international journal of computer applications foundation of computer science 9 6 33 37
the story so far 2002 04 15 retrieved 2008 09 21
kimball 2002 pg 16
inmon bill 1992 building the data warehouse wiley isbn 0 471 56960 7
kimball ralph 1996 the data warehouse toolkit wiley isbn 0 471 15337 0
hillard robert 2010 information driven business wiley isbn 978 0 470 62577 4
information theory and business intelligence strategy small worlds data transformation measure mike2 0 the open source methodology for information development mike2 openmethodology org retrieved 2013 06 14
kimball 2002 pg 310
the bottom up misnomer 2003 09 17 retrieved 2012 02 14
ericsson 2004 pp 28 29 data warehouse
abdullah ahsan 2009 analysis of mealybug incidence on the cotton crop using adss olap online analytical processing tool volume 69 issue 1 computers and electronics in agriculture 69 59 72 doi 10 1016 j compag 2009 07 003
further reading edit
davenport thomas h and harris jeanne g competing on analytics the new science of winning 2007 harvard business school press isbn 978 1 4221 0332 6
ganczarski joe data warehouse implementations critical implementation factors study 2009 vdm verlag isbn 3 639 18589 7 isbn 978 3 639 18589 8
kimball ralph and ross margy the data warehouse toolkit second edition 2002 john wiley and sons inc isbn 0 471 20024 7
linstedt graziano hultgren the business of data vault modeling second edition 2010 dan linstedt isbn 978 1 4357 1914 9
william inmon building the data warehouse 2005 john wiley and sons isbn 978 8 1265 0645 3
external links edit
ralph kimball articles
international journal of computer applications
data warehouse introduction
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Database_system b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Database_system new file mode 100644 index 00000000..808e6b84 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Database_system @@ -0,0 +1 @@
+database wikipedia the free encyclopedia database from wikipedia the free encyclopedia redirected from database system a database is an organized collection of data the data is typically organized to model relevant aspects of reality for example the availability of rooms in hotels in a way that supports processes requiring this information for example finding a hotel with vacancies database management systems dbmss are specially designed applications that interact with the user other applications and the database itself to capture and analyze data a general purpose database management system dbms is a software system designed to allow the definition creation querying update and administration of databases well known dbmss include mysql postgresql sqlite microsoft sql server microsoft access oracle sap dbase foxpro ibm db2 and filemakerpro a database is not generally portable across different dbms but different dbmss can inter operate by using standards such as sql and odbc or jdbc to allow a single application to work with more than one database contents 1 terminology and overview 2 applications and roles 2 1 general purpose and special purpose dbmss 3 history 3 1 1960s navigational dbms 3 2 1970s relational dbms 3 3 database machines and appliances 3 4 late 1970s sql dbms 3 5 1980s desktop databases 3 6 1980s object oriented databases 3 7 2000s nosql and newsql databases 4 database research 5 database type examples 6 database design and modeling 6 1 database models 6 2 external conceptual and internal views 7 database languages 8 performance security and availability 8 1 database storage 8 1 1 database materialized views 8 1 2 database and database object replication 8 2 database security 8 3 transactions and concurrency 8 4 migration 8 5 database building maintaining and tuning 8 6 backup and restore 8 7 other 9 see also 10 references 11 further reading 12 external links terminology and overview edit formally the term database refers to the data itself and supporting data structures databases are created to operate large quantities of information by inputting storing retrieving and managing that information databases are set up so that one set of software programs provides all users with access
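The opening above says a general-purpose DBMS supports definition, creation/update, querying and administration of data. A minimal sketch of those interactions using the standard-library sqlite3 module; the table echoes the article's hotel-vacancy example, and the column names and sample values are invented:

    # Minimal sketch of the basic DBMS interactions named above (definition,
    # update, retrieval, administration), using the standard-library sqlite3
    # module. Table and column names are invented.
    import sqlite3

    con = sqlite3.connect(":memory:")

    con.execute("CREATE TABLE room (hotel TEXT, number INTEGER, vacant INTEGER)")   # data definition
    con.executemany("INSERT INTO room VALUES (?,?,?)",                              # update (inserting data)
                    [("seaview", 101, 1), ("seaview", 102, 0), ("altstadt", 7, 1)])

    # retrieval: find hotels with vacancies, echoing the article's example
    for hotel, free in con.execute(
            "SELECT hotel, COUNT(*) FROM room WHERE vacant = 1 GROUP BY hotel"):
        print(hotel, free)

    con.execute("PRAGMA integrity_check")   # a small administrative operation
    con.close()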
to all the data databases use a table format that is made up of rows and columns each piece of information is entered into a row which then creates a record once the records are created in the database they can be organized and operated in a variety of ways that are limited mainly by the software being used databases are somewhat similar to spreadsheets but databases are more demanding than spreadsheets because of their ability to manipulate the data that is stored it is possible to do a number of functions with a database that would be more difficult to do with a spreadsheet the word data is normally defined as facts from which information can be derived a database may contain millions of such facts from these facts the database management system dbms can develop information a database management system dbms is a suite of computer software providing the interface between users and a database or databases because they are so closely related the term database when used casually often refers to both a dbms and the data it manipulates outside the world of professional information technology the term database is sometimes used casually to refer to any collection of data perhaps a spreadsheet maybe even a card index this article is concerned only with databases where the size and usage requirements necessitate use of a database management system 1 the interactions catered for by most existing dbms fall into four main groups data definition defining new data structures for a database removing data structures from the database modifying the structure of existing data update inserting modifying and deleting data retrieval obtaining information either for end user queries and reports or for processing by applications administration registering and monitoring users enforcing data security monitoring performance maintaining data integrity dealing with concurrency control and recovering information if the system fails a dbms is responsible for maintaining the integrity and security of stored data and for recovering information if the system fails both a database and its dbms conform to the principles of a particular database model 2 database system refers collectively to the database model database management system and database 3 physically database servers are dedicated computers that hold the actual databases and run only the dbms and related software database servers are usually multiprocessor computers with generous memory and raid disk arrays used for stable storage hardware database accelerators connected to one or more servers via a high speed channel are also used in large volume transaction processing environments dbmss are found at the heart of most database applications dbmss may be built around a custom multitasking kernel with built in networking support but modern dbmss typically rely on a standard operating system to provide these functions citation needed since dbmss comprise a significant economical market computer and storage vendors often take into account dbms requirements in their own development plans citation needed databases and dbmss can be categorized according to the database model s that they support such as relational or xml the type s of computer they run on from a server cluster to a mobile phone the query language s used to access the database such as sql or xquery and their internal engineering which affects performance scalability resilience and security applications and roles edit this section does not cite any references or sources please help improve this section 
by adding citations to reliable sources unsourced material may be challenged and removed march 2013 most organizations in developed countries today depend on databases for their business operations increasingly databases are not only used to support the internal operations of the organization but also to underpin its online interactions with customers and suppliers see enterprise software databases are not used only to hold administrative information but are often embedded within applications to hold more specialized data for example engineering data or economic models examples of database applications include computerized library systems flight reservation systems and computerized parts inventory systems client server or transactional dbmss are often complex to maintain high performance availability and security when many users are querying and updating the database at the same time personal desktop based database systems tend to be less complex for example filemaker and microsoft access come with built in graphical user interfaces general purpose and special purpose dbmss edit a dbms has evolved into a complex software system and its development typically requires thousands of person years of development effort 4 some general purpose dbmss such as adabas oracle and db2 have been undergoing upgrades since the 1970s general purpose dbmss aim to meet the needs of as many applications as possible which adds to the complexity however the fact that their development cost can be spread over a large number of users means that they are often the most cost effective approach however a general purpose dbms is not always the optimal solution in some cases a general purpose dbms may introduce unnecessary overhead therefore there are many examples of systems that use special purpose databases a common example is an email system email systems are designed to optimize the handling of email messages and do not need significant portions of a general purpose dbms functionality many databases have application software that accesses the database on behalf of end users without exposing the dbms interface directly application programmers may use a wire protocol directly or more likely through an application programming interface database designers and database administrators interact with the dbms through dedicated interfaces to build and maintain the applications databases and thus need some more knowledge and understanding about how dbmss operate and the dbmss external interfaces and tuning parameters general purpose databases are usually developed by one organization or community of programmers while a different group builds the applications that use it in many companies specialized database administrators maintain databases run reports and may work on code that runs on the databases themselves rather than in the client application history edit with the progress in technology in the areas of processors computer memory computer storage and computer networks the sizes capabilities and performance of databases and their respective dbmss have grown in orders of magnitudes the development of database technology can be divided into three eras based on data model or structure navigational 5 sql relational and post relational the two main early navigational data models were the hierarchical model epitomized by ibm s ims system and the codasyl model network model implemented in a number of products such as idms the relational model first proposed in 1970 by edgar f codd departed from this tradition by insisting that 
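The history above contrasts navigational models, where an application follows links from record to record, with the relational idea of asking for data by content. A minimal sketch of the two access styles over the same invented records:

    # Minimal sketch contrasting navigational access (following links from record
    # to record) with content-based search, as discussed above. Data is invented.
    records = {
        "r1": {"name": "order-1", "customer": "acme", "next": "r2"},   # explicit link to the next record
        "r2": {"name": "order-2", "customer": "acme", "next": None},
        "r3": {"name": "order-3", "customer": "globex", "next": None},
    }

    # Navigational style: start from a known record and follow the links.
    current = "r1"
    while current is not None:
        record = records[current]
        print("visited", record["name"])
        current = record["next"]

    # Content-based style: ask for records matching a predicate, regardless of links.
    acme_orders = [r["name"] for r in records.values() if r["customer"] == "acme"]
    print("matched", acme_orders)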
applications should search for data by content rather than by following links the relational model is made up of ledger style tables each used for a different type of entity it was not until the mid 1980s that computing hardware became powerful enough to allow relational systems dbmss plus applications to be widely deployed by the early 1990s however relational systems were dominant for all large scale data processing applications and they remain dominant today 2012 except in niche areas the dominant database language is the standard sql for the relational model which has influenced database languages for other data models citation needed object databases were invented in the 1980s to overcome the inconvenience of object relational impedance mismatch which led to the coining of the term post relational but also development of hybrid object relational databases the next generation of post relational databases in the 2000s became known as nosql databases introducing fast key value stores and document oriented databases a competing next generation known as newsql databases attempted new implementations that retained the relational sql model while aiming to match the high performance of nosql compared to commercially available relational dbmss 1960s navigational dbms edit basic structure of navigational codasyl database model further information navigational database the introduction of the term database coincided with the availability of direct access storage disks and drums from the mid 1960s onwards the term represented a contrast with the tape based systems of the past allowing shared interactive use rather than daily batch processing the oxford english dictionary cites citation needed a 1962 technical report as the first to use the term data base as computers grew in speed and capability a number of general purpose database systems emerged by the mid 1960s there were a number of such systems in commercial use interest in a standard began to grow and charles bachman author of one such product the integrated data store ids founded the database task group within codasyl the group responsible for the creation and standardization of cobol in 1971 they delivered their standard which generally became known as the codasyl approach and soon a number of commercial products based on this approach were made available the codasyl approach was based on the manual navigation of a linked data set which was formed into a large network records could be found either by use of a primary key known as a calc key typically implemented by hashing by navigating relationships called sets from one record to another or by scanning all the records in sequential order later systems added b trees to provide alternate access paths many codasyl databases also added a query language that was very straightforward however in the final tally codasyl was very complex and required significant training and effort to produce useful applications ibm also had their own dbms system in 1968 known as ims ims was a development of software written for the apollo program on the system 360 ims was generally similar in concept to codasyl but used a strict hierarchy for its model of data navigation instead of codasyl s network model both concepts later became known as navigational databases due to the way data was accessed and bachman s 1973 turing award presentation was the programmer as navigator ims is classified as a hierarchical database idms and cincom systems total database are classified as network databases 1970s relational dbms 
edit edgar codd worked at ibm in san jose california in one of their offshoot offices that was primarily involved in the development of hard disk systems he was unhappy with the navigational model of the codasyl approach notably the lack of a search facility in 1970 he wrote a number of papers that outlined a new approach to database construction that eventually culminated in the groundbreaking a relational model of data for large shared data banks 6 in this paper he described a new system for storing and working with large databases instead of records being stored in some sort of linked list of free form records as in codasyl codd s idea was to use a table of fixed length records with each table used for a different type of entity a linked list system would be very inefficient when storing sparse databases where some of the data for any one record could be left empty the relational model solved this by splitting the data into a series of normalized tables or relations with optional elements being moved out of the main table to where they would take up room only if needed data may be freely inserted deleted and edited in these tables with the dbms doing whatever maintenance needed to present a table view to the application user in the relational model related records are linked together with a key the relational model also allowed the content of the database to evolve without constant rewriting of links and pointers the relational part comes from entities referencing other entities in what is known as one to many relationship like a traditional hierarchical model and many to many relationship like a navigational network model thus a relational model can express both hierarchical and navigational models as well as its native tabular model allowing for pure or combined modeling in terms of these three models as the application requires for instance a common use of a database system is to track information about users their name login information various addresses and phone numbers in the navigational approach all of these data would be placed in a single record and unused items would simply not be placed in the database in the relational approach the data would be normalized into a user table an address table and a phone number table for instance records would be created in these optional tables only if the address or phone numbers were actually provided linking the information back together is the key to this system in the relational model some bit of information was used as a key uniquely defining a particular record when information was being collected about a user information stored in the optional tables would be found by searching for this key for instance if the login name of a user is unique addresses and phone numbers for that user would be recorded with the login name as its key this simple re linking of related data back into a single collection is something that traditional computer languages are not designed for just as the navigational approach would require programs to loop in order to collect records the relational approach would require loops to collect information about any one record codd s solution to the necessary looping was a set oriented language a suggestion that would later spawn the ubiquitous sql using a branch of mathematics known as tuple calculus he demonstrated that such a system could support all the operations of normal databases inserting updating etc as well as providing a simple system for finding and returning sets of data in a single operation codd s 
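The passage uses a user/address/phone example in which optional data lives in separate normalized tables and is re-linked through a key such as the login name. A minimal SQLite sketch of that example; the column names and sample values are invented:

    # Minimal sketch of the passage's user/address/phone example: optional data
    # lives in separate normalized tables and is linked back by a key (the login
    # name). Sample values are invented.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE user    (login TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE address (login TEXT REFERENCES user(login), street TEXT);
    CREATE TABLE phone   (login TEXT REFERENCES user(login), number TEXT);
    """)
    con.execute("INSERT INTO user VALUES ('jdoe', 'J. Doe')")
    con.execute("INSERT INTO address VALUES ('jdoe', '1 example street')")
    # No phone number was provided, so no phone row exists for this user.

    # Re-linking the optional data: outer joins collect whatever was recorded.
    for row in con.execute("""
        SELECT u.login, u.name, a.street, p.number
        FROM user u
        LEFT JOIN address a ON a.login = u.login
        LEFT JOIN phone   p ON p.login = u.login
        WHERE u.login = 'jdoe'"""):
        print(row)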
Codd's paper was picked up by two people at Berkeley, Eugene Wong and Michael Stonebraker. They started a project known as INGRES using funding that had already been allocated for a geographical database project, with student programmers producing the code. Beginning in 1973, INGRES delivered its first test products, which were generally ready for widespread use in 1979. INGRES was similar to System R in a number of ways, including the use of a data-access language known as QUEL; over time, INGRES moved to the emerging SQL standard.

IBM itself did one test implementation of the relational model, PRTV, and a production one, Business System 12, both now discontinued. Honeywell wrote MRDS for Multics, and now there are two new implementations, Alphora Dataphor and Rel. Most other DBMS implementations usually called relational are actually SQL DBMSs.

In 1970, the University of Michigan began development of the MICRO Information Management System,[7] based on D. L. Childs' set-theoretic data model.[8][9][10] MICRO was used to manage very large data sets by the US Department of Labor, the U.S. Environmental Protection Agency, and researchers from the University of Alberta, the University of Michigan, and Wayne State University. It ran on IBM mainframe computers using the Michigan Terminal System[11] and remained in production until 1998.

Database machines and appliances

Main article: Database machine

In the 1970s and 1980s, attempts were made to build database systems with integrated hardware and software. The underlying philosophy was that such integration would provide higher performance at lower cost. Examples were IBM's System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine. Another approach to hardware support for database management was ICL's CAFS accelerator, a hardware disk controller with programmable search capabilities. In the long term, these efforts were generally unsuccessful because specialized database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus most database systems nowadays are software systems running on general-purpose hardware, using general-purpose computer data storage. However, the idea is still pursued for certain applications by companies such as Netezza and Oracle (Exadata).

Late 1970s, SQL DBMS

IBM started working on a prototype system loosely based on Codd's concepts as System R in the early 1970s. The first version was ready in 1974/5, and work then started on multi-table systems in which the data could be split so that all of the data for a record (some of which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user versions were tested by customers in 1978 and 1979, by which time a standardized query language, SQL,[citation needed] had been added. Codd's ideas were establishing themselves as both workable and superior to CODASYL, pushing IBM to develop a true production version of System R, known as SQL/DS and, later, Database 2 (DB2).

Larry Ellison's Oracle started from a different chain, based on IBM's papers on System R, and beat IBM to market when its first version was released in 1978.[citation needed]

Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres, now known as PostgreSQL. PostgreSQL is often used for global mission-critical applications: the .org and .info domain name registries use it as their primary data store, as do many large companies and financial institutions.

In Sweden, Codd's paper was also read, and Mimer SQL was developed from the mid-1970s at Uppsala University.
In 1984, this project was consolidated into an independent enterprise. In the early 1980s, Mimer introduced transaction handling for high robustness in applications, an idea that was subsequently implemented in most other DBMSs.

Another data model, the entity-relationship model, emerged in 1976 and gained popularity for database design, as it emphasized a more familiar description than the earlier relational model. Later, entity-relationship constructs were retrofitted as a data-modeling construct for the relational model, and the difference between the two has become irrelevant.[citation needed]

1980s, desktop databases

The 1980s ushered in the age of desktop computing. The new computers empowered their users with spreadsheets like Lotus 1-2-3 and database software like dBASE. The dBASE product was lightweight and easy for any computer user to understand out of the box. C. Wayne Ratliff, the creator of dBASE, stated: "dBASE was different from programs like BASIC, C, FORTRAN, and COBOL in that a lot of the dirty work had already been done. The data manipulation is done by dBASE instead of by the user, so the user can concentrate on what he is doing, rather than having to mess with the dirty details of opening, reading, and closing files, and managing space allocation."[12] dBASE was one of the top-selling software titles in the 1980s and early 1990s.

1980s, object-oriented databases

The 1980s, along with the rise of object-oriented programming, saw a change in how data in databases were handled. Programmers and designers began to treat the data in their databases as objects: if a person's data were in a database, that person's attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. This allows relations between data to be relations to objects and their attributes, and not to individual fields.[13] The term "object-relational impedance mismatch" described the inconvenience of translating between programmed objects and database tables. Object databases and object-relational databases attempt to solve this problem by providing an object-oriented language (sometimes as extensions to SQL) that programmers can use as an alternative to purely relational SQL. On the programming side, libraries known as object-relational mappings (ORMs) attempt to solve the same problem.

2000s, NoSQL and NewSQL databases

Main article: NoSQL

The next generation of post-relational databases in the 2000s became known as NoSQL databases, including fast key-value stores and document-oriented databases. XML databases are a type of structured, document-oriented database that allows querying based on XML document attributes.

NoSQL databases are often very fast, do not require fixed table schemas, avoid join operations by storing denormalized data, and are designed to scale horizontally. In recent years there has been high demand for massively distributed databases with high partition tolerance, but according to the CAP theorem it is impossible for a distributed system to simultaneously provide consistency, availability, and partition tolerance guarantees; a distributed system can satisfy any two of these guarantees at the same time, but not all three. For that reason, many NoSQL databases use what is called eventual consistency, providing availability and partition tolerance with the highest level of data consistency achievable under those constraints. The most popular NoSQL systems include MongoDB, memcached, Redis, CouchDB, Hazelcast, Apache Cassandra, and HBase,[14] all of which are open-source software products.
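As a rough illustration of the key-value and document-oriented storage styles mentioned above, and of the denormalization they encourage, here is a purely didactic Python sketch; it is not the API of MongoDB, Redis, or any other product named in the text, just a toy model of the idea.

```python
import json

# A toy key-value store: values are opaque blobs addressed only by key.
kv = {}
kv["user:ecodd"] = json.dumps({
    "name": "Edgar Codd",
    "phones": ["+1-408-555-0100"],   # denormalized: everything about the
    "addresses": [],                 # entity lives in one document,
})                                   # so no joins are needed to read it

# Reading is a single lookup by key; there is no fixed table schema.
doc = json.loads(kv["user:ecodd"])
print(doc["name"], doc["phones"])

# A document-oriented store adds content-based queries over such documents,
# e.g. "find all users with at least one phone number":
users_with_phones = [
    json.loads(v) for k, v in kv.items()
    if k.startswith("user:") and json.loads(v)["phones"]
]
print(len(users_with_phones))  # 1
```

Real systems add persistence, replication, and horizontal partitioning on top of this basic model, which is where the CAP trade-offs discussed above come into play.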
A number of newer relational databases, continuing the use of SQL but aiming for performance comparable to NoSQL, are known as NewSQL.

Database research

Database technology has been an active research topic since the 1960s, both in academia and in the research and development groups of companies (for example, IBM Research). Research activity includes theory and the development of prototypes. Notable research topics have included models, the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more. The database research area has several dedicated academic journals (for example, ACM Transactions on Database Systems (TODS) and Data and Knowledge Engineering (DKE)) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE).

Database type examples

One way to classify databases involves the type of their contents, for example bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example accounting, music compositions, movies, banking, manufacturing, or insurance. A third way is by some technical aspect, such as the database structure or interface type. This section lists a few of the adjectives used to characterize different kinds of databases.

An active database includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering, and authorization. Many databases provide active database features in the form of database triggers.

A cloud database relies on cloud technology. Both the database and most of its DBMS reside remotely, "in the cloud", while its applications are developed by programmers and later maintained and used by end users through a web browser and open APIs.

Data warehouses archive data from operational databases and often from external sources such as market research firms. The warehouse becomes the central source of data for use by managers and other end users who may not have access to operational data. For example, sales data might be aggregated to weekly totals and converted from internal product codes to UPCs so that they can be compared with ACNielsen data. Some basic and essential components of data warehousing include retrieving, analyzing, and mining data, then transforming, loading, and managing the data so as to make it available for further use.

A deductive database combines logic programming with a relational database, for example by using the Datalog language.

A distributed database is one in which both the data and the DBMS span multiple computers.

A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are one of the main categories of NoSQL databases.

An embedded database system is a DBMS which is tightly integrated with application software that requires access to stored data in such a way that the DBMS is hidden from the application's end users and requires little or no ongoing maintenance.[15]

End-user databases consist of data developed by individual end users. Examples are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support such databases; some of them are much simpler than full-fledged DBMSs, offering more elementary DBMS functionality.

A federated database system comprises several distinct databases, each with its own DBMS.
It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (in which case it would also be a heterogeneous database system), and provides them with an integrated conceptual view. Sometimes the term multi-database is used as a synonym for federated database, though it may refer to a less integrated group of databases (e.g., without an FDBMS and a managed integrated schema) that cooperate in a single application. In this case, middleware is typically used for distribution; it usually includes an atomic commit protocol (ACP), e.g. the two-phase commit protocol, to allow distributed (global) transactions across the participating databases.

A graph database is a kind of NoSQL database that uses graph structures with nodes, edges, and properties to represent and store information. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.

In a hypertext or hypermedia database, any word or piece of text representing an object, e.g. another piece of text, an article, a picture, or a film, can be hyperlinked to that object. Hypertext databases are particularly useful for organizing large amounts of disparate information; for example, they are useful for organizing online encyclopedias, where users can conveniently jump around the text. The World Wide Web is thus a large distributed hypertext database.

An in-memory database is a database that primarily resides in main memory but is typically backed up by non-volatile computer data storage. Main-memory databases are faster than disk databases and so are often used where response time is critical, such as in telecommunications network equipment.[16]

A knowledge base (abbreviated KB or kb[17][18]) is a special kind of database for knowledge management, providing the means for the computerized collection, organization, and retrieval of knowledge; also a collection of data representing problems with their solutions and related experiences.

A mobile database can be carried on or synchronized from a mobile computing device.

Operational databases store detailed data about the operations of an organization. They typically process relatively high volumes of updates using transactions. Examples include customer databases that record contact, credit, and demographic information about a business's customers; personnel databases that hold information such as salary, benefits, and skills data about employees; enterprise resource planning systems that record details about product components and parts inventory; and financial databases that keep track of the organization's money, accounting, and financial dealings.

A parallel database seeks to improve performance through parallelization for tasks such as loading data, building indexes, and evaluating queries. The major parallel DBMS architectures, which are induced by the underlying hardware architecture, are: shared-memory architecture, where multiple processors share the main memory space as well as other data storage; shared-disk architecture, where each processing unit (typically consisting of multiple processors) has its own main memory, but all units share the other storage; and shared-nothing architecture, where each processing unit has its own main memory and other storage.

Probabilistic databases employ fuzzy logic to draw inferences from imprecise data.

Real-time databases process transactions fast enough for the result to come back and be acted on right away.

A spatial database can store data with multidimensional features.
Queries on such data include location-based queries, like "Where is the closest hotel in my area?"

A temporal database has built-in time aspects, for example a temporal data model and a temporal version of SQL. More specifically, the temporal aspects usually include valid time and transaction time.

A terminology-oriented database builds upon an object-oriented database, often customized for a specific field.

An unstructured-data database is intended to store, in a manageable and protected way, diverse objects that do not fit naturally and conveniently into common databases. It may include email messages, documents, journals, multimedia objects, and so on. The name may be misleading, since some objects can be highly structured; however, the entire possible object collection does not fit into a predefined structured framework. Most established DBMSs now support unstructured data in various ways, and new dedicated DBMSs are emerging.

Database design and modeling

Main article: Database design

The first task of a database designer is to produce a conceptual data model that reflects the structure of the information to be held in the database. A common approach is to develop an entity-relationship model, often with the aid of drawing tools. Another popular approach is the Unified Modeling Language. A successful data model will accurately reflect the possible state of the external world being modeled: for example, if people can have more than one phone number, it will allow this information to be captured. Designing a good conceptual data model requires a good understanding of the application domain; it typically involves asking deep questions about the things of interest to an organization, like "Can a customer also be a supplier?", or "If a product is sold with two different forms of packaging, are those the same product or different products?", or "If a plane flies from New York to Dubai via Frankfurt, is that one flight or two, or maybe even three?" The answers to these questions establish definitions of the terminology used for entities (customers, products, flights, flight segments) and their relationships and attributes.

Producing the conceptual data model sometimes involves input from business processes, or the analysis of workflow in the organization. This can help to establish what information is needed in the database and what can be left out; for example, it can help when deciding whether the database needs to hold historic data as well as current data.

Having produced a conceptual data model that users are happy with, the next stage is to translate this into a schema that implements the relevant data structures within the database. This process is often called logical database design, and the output is a logical data model expressed in the form of a schema. Whereas the conceptual data model is (in theory at least) independent of the choice of database technology, the logical data model will be expressed in terms of a particular database model supported by the chosen DBMS. (The terms data model and database model are often used interchangeably, but in this article we use data model for the design of a specific database, and database model for the modeling notation used to express that design.)

The most popular database model for general-purpose databases is the relational model, or more precisely the relational model as represented by the SQL language. The process of creating a logical database design using this model uses a methodical approach known as normalization. The goal of normalization is to ensure that each elementary "fact" is recorded in only one place, so that insertions, updates, and deletions automatically maintain consistency.
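As a sketch of how the conceptual questions above might be turned into a logical (relational) schema, the snippet below separates a flight from its flight segments, so that the New York to Dubai via Frankfurt case is one flight with two segments. The names (flight, flight_segment, and their columns) are hypothetical choices for illustration, not taken from any standard design.

```python
import sqlite3

# A hypothetical logical schema: one row per flight, one row per segment,
# linked by the flight number as a key. Each fact is stored exactly once.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flight (
        flight_no   TEXT PRIMARY KEY,
        origin      TEXT NOT NULL,
        destination TEXT NOT NULL
    );
    CREATE TABLE flight_segment (
        flight_no  TEXT REFERENCES flight(flight_no),
        seq        INTEGER,            -- order of the segment within the flight
        from_city  TEXT,
        to_city    TEXT,
        PRIMARY KEY (flight_no, seq)
    );
""")
con.execute("INSERT INTO flight VALUES ('XY100', 'New York', 'Dubai')")
con.executemany("INSERT INTO flight_segment VALUES (?, ?, ?, ?)", [
    ("XY100", 1, "New York", "Frankfurt"),
    ("XY100", 2, "Frankfurt", "Dubai"),
])
print(con.execute(
    "SELECT COUNT(*) FROM flight_segment WHERE flight_no = 'XY100'"
).fetchone()[0])   # 2 segments belonging to 1 flight
```

Whether this particular split is right depends on the answers an organization gives to the conceptual questions; the code only shows how such answers end up as normalized tables.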
The final stage of database design is to make the decisions that affect performance, scalability, recovery, security, and the like. This is often called physical database design. A key goal during this stage is data independence, meaning that the decisions made for performance-optimization purposes should be invisible to end users and applications. Physical design is driven mainly by performance requirements and requires a good knowledge of the expected workload and access patterns, and a deep understanding of the features offered by the chosen DBMS.

Another aspect of physical database design is security. It involves both defining access control to database objects and defining security levels and methods for the data itself.

Database models

[Figure: collage of five types of database models]

Main article: Database model

A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model (or the SQL approximation of relational), which uses a table-based format.

Common logical data models for databases include the hierarchical database model, network model, relational model, entity-relationship model, enhanced entity-relationship model, object model, document model, entity-attribute-value model, and star schema. An object-relational database combines the two related structures. Physical data models include the inverted index and flat file. Other models include the associative model, multidimensional model, multivalue model, semantic model, XML database, and named graph.

External, conceptual, and internal views

[Figure: traditional view of data][19]

A database management system provides three views of the database data.

The external level defines how each group of end users sees the organization of data in the database. A single database can have any number of views at the external level.

The conceptual level unifies the various external views into a coherent global view.[20] It provides the synthesis of all the external views. It is out of the scope of the various database end users, and is rather of interest to database application developers and database administrators.

The internal level (or physical level) is the internal organization of data inside a DBMS (see the Implementation section below). It is concerned with cost, performance, scalability, and other operational matters. It deals with the storage layout of the data, using storage structures such as indexes to enhance performance. Occasionally it stores data of individual views (materialized views), computed from generic data, if a performance justification exists for such redundancy. It balances all the external views' performance requirements, which may conflict, in an attempt to optimize overall performance across all activities.

While there is typically only one conceptual (or logical) and one physical (or internal) view of the data, there can be any number of different external views. This allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. For example, the financial department of a company needs the payment details of all employees as part of the company's expenses, but does not need the details about employees that are of interest to the human resources department; thus different departments need different views of the company's database.
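To make the idea of multiple external views over one conceptual schema concrete, here is a minimal sketch using Python's sqlite3. The employee table and the two departmental views (finance_view, hr_view) are hypothetical names chosen for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One conceptual/logical schema...
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        name     TEXT,
        salary   REAL,      -- of interest to finance
        medical  TEXT       -- of interest to human resources
    );
    -- ...and two external views, each exposing only what its users need.
    CREATE VIEW finance_view AS SELECT emp_id, name, salary  FROM employee;
    CREATE VIEW hr_view      AS SELECT emp_id, name, medical FROM employee;
""")
con.execute("INSERT INTO employee VALUES (1, 'A. Turing', 50000.0, 'none')")

print(con.execute("SELECT * FROM finance_view").fetchall())
print(con.execute("SELECT * FROM hr_view").fetchall())
# Internal storage details (indexes, pages, file layout) stay hidden behind
# both views; that separation is the data independence described next.
```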
The three-level database architecture relates to the concept of data independence, which was one of the major initial driving forces of the relational model. The idea is that changes made at a certain level do not affect the view at a higher level: for example, changes in the internal level do not affect application programs written using conceptual-level interfaces, which reduces the impact of making physical changes to improve performance.

The conceptual view provides a level of indirection between internal and external. On one hand it provides a common view of the database, independent of the different external view structures, and on the other hand it abstracts away details of how the data are stored or managed (the internal level). In principle, every level, and even every external view, can be presented by a different data model. In practice, a given DBMS usually uses the same data model for both the external and the conceptual levels (e.g., the relational model). The internal level, which is hidden inside the DBMS and depends on its implementation (see the Implementation section below), requires a different level of detail and uses its own types of data structures. Separating the external, conceptual, and internal levels was a major feature of the relational database model implementations that dominate 21st-century databases.[20]

Database languages

Database languages are special-purpose languages which do one or more of the following: a data definition language defines data types and the relationships among them; a data manipulation language performs tasks such as inserting, updating, or deleting data occurrences; a query language allows searching for information and computing derived information.

Database languages are specific to a particular data model. Notable examples include the following.

SQL combines the roles of data definition, data manipulation, and query in a single language. It was one of the first commercial languages for the relational model, although it departs in some respects from the relational model as described by Codd (for example, the rows and columns of a table can be ordered). SQL became a standard of the American National Standards Institute (ANSI) in 1986 and of the International Organization for Standardization (ISO) in 1987. The standards have been regularly enhanced since and are supported, with varying degrees of conformance, by all mainstream commercial relational DBMSs.[21][22]

OQL is an object-model language standard (from the Object Data Management Group). It has influenced the design of some newer query languages like JDOQL and EJB QL.

XQuery is a standard XML query language implemented by XML database systems such as MarkLogic and eXist, by relational databases with XML capability such as Oracle and DB2, and also by in-memory XML processors such as Saxon. SQL/XML combines XQuery with SQL.[23]

A database language may also incorporate features like: DBMS-specific configuration and storage-engine management; computations to modify query results, such as counting, summing, averaging, sorting, grouping, and cross-referencing; constraint enforcement (e.g., in an automotive database, only allowing one engine type per car); and application programming interface versions of the query language, for programmer convenience.
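The division of labor between data definition, data manipulation, and query described above can be seen in one short sqlite3 sketch; the car/engine schema is hypothetical and only illustrates which role each statement plays, including a simple constraint.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Data definition language (DDL): define types, relationships, constraints.
con.execute("""
    CREATE TABLE car (
        car_id      INTEGER PRIMARY KEY,
        model       TEXT NOT NULL,
        engine_type TEXT NOT NULL CHECK (engine_type IN ('petrol', 'diesel'))
    )
""")

# Data manipulation language (DML): insert, update, or delete occurrences.
con.execute("INSERT INTO car VALUES (1, 'Example GT', 'petrol')")
con.execute("UPDATE car SET engine_type = 'diesel' WHERE car_id = 1")

# Query language: search for information and compute derived information.
print(con.execute(
    "SELECT engine_type, COUNT(*) FROM car GROUP BY engine_type"
).fetchall())   # [('diesel', 1)]
```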
Performance, security, and availability

Because of the critical importance of database technology to the smooth running of an enterprise, database systems include complex mechanisms to deliver the required performance, security, and availability, and to allow database administrators to control the use of these features.

Database storage

Main articles: Computer data storage and Database engine

Database storage is the container of the physical materialization of a database. It comprises the internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data", and internal data structures) to reconstruct the conceptual and external levels from the internal level when needed. Putting data into permanent storage is generally the responsibility of the database engine, also known as the storage engine. Though typically accessed by a DBMS through the underlying operating system, and often using the operating system's file systems as intermediates for storage layout, storage properties and configuration settings are extremely important for the efficient operation of the DBMS and thus are closely maintained by database administrators. A DBMS, while in operation, always has its database residing in several types of storage (e.g., memory and external storage). The database data and the additional needed information, possibly in very large amounts, are coded into bits. Data typically reside in storage in structures that look completely different from the way the data look at the conceptual and external levels, but in ways that attempt to optimize the reconstruction of these levels when needed by users and programs, as well as the computation of additional types of needed information from the data (e.g., when querying the database).

Some DBMSs support specifying which character encoding was used to store data, so multiple encodings can be used in the same database.

Various low-level database storage structures are used by the storage engine to serialize the data model so it can be written to the medium of choice. Techniques such as indexing may be used to improve performance. Conventional storage is row-oriented, but there are also column-oriented and correlation databases.

Database materialized views

Main article: Materialized view

Often storage redundancy is employed to increase performance. A common example is storing materialized views, which consist of frequently needed external views or query results. Storing such views saves the expensive computation of them each time they are needed. The downsides of materialized views are the overhead incurred when updating them to keep them synchronized with their original (updated) database data, and the cost of storage redundancy.

Database and database-object replication

Main article: Database replication

Occasionally a database employs storage redundancy by replicating database objects (with one or more copies) to increase data availability, both to improve performance of simultaneous accesses by multiple end users to the same database object, and to provide resiliency in case of partial failure of a distributed database. Updates of a replicated object need to be synchronized across the object copies. In many cases, the entire database is replicated.

Database security

Main article: Database security

Database security deals with all the various aspects of protecting the database content, its owners, and its users. It ranges from protection against intentional unauthorized database uses to unintentional database accesses by unauthorized entities (e.g., a person or a computer program).

Database access control deals with controlling who (a person or a certain computer program) is allowed to access what information in the database. The information may comprise specific database objects (e.g., record types, specific records, data structures), certain computations over certain objects (e.g., query types, or specific queries),
or the use of specific access paths to the former (e.g., using specific indexes or other data structures to access information). Database access controls are set by special authorized personnel (authorized by the database owner) using dedicated, protected security DBMS interfaces. This may be managed directly on an individual basis, by the assignment of individuals and privileges to groups, or, in the most elaborate models, by the assignment of individuals and groups to roles which are then granted entitlements.

Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or to subsets of it called "subschemas". For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access only to work history and medical data. If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases.

Data security in general deals with protecting specific chunks of data, both physically (i.e., from corruption, destruction, or removal; see, e.g., physical security) and in terms of their interpretation into meaningful information (e.g., by looking at the strings of bits they comprise and concluding specific valid credit-card numbers; see, e.g., data encryption).

Change and access logging records who accessed which attributes, what was changed, and when it was changed. Logging services allow for a forensic database audit later, by keeping a record of access occurrences and changes. Sometimes application-level code is used to record changes rather than leaving this to the database. Monitoring can be set up to attempt to detect security breaches.

Transactions and concurrency

Database transactions can be used to introduce some level of fault tolerance and data integrity after recovery from a crash. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring a lock), an abstraction supported in databases and other systems. Each transaction has well-defined boundaries in terms of which program/code executions are included in that transaction, determined by the transaction's programmer via special transaction commands. The acronym ACID describes some ideal properties of a database transaction: atomicity, consistency, isolation, and durability.

Further information: Concurrency control
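The following minimal sketch, again using Python's sqlite3, illustrates transaction boundaries and atomicity: either both operations inside the unit of work take effect, or neither does. The account table and the transfer scenario are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
con.commit()

try:
    # One transaction: both updates must succeed together (atomicity).
    con.execute("UPDATE account SET balance = balance - 40 WHERE id = 1")
    con.execute("UPDATE account SET balance = balance + 40 WHERE id = 2")
    con.commit()          # make the unit of work durable
except sqlite3.Error:
    con.rollback()        # on any failure, undo the partial work

print(con.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
# [(1, 60.0), (2, 40.0)]
```

Concurrency control (locking, isolation levels) builds on these same transaction boundaries when many such units of work run at once.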
Migration

See also: the Database migration section in the article Data migration.

A database built with one DBMS is not portable to another DBMS (i.e., the other DBMS cannot run it). However, in some situations it is desirable to move (migrate) a database from one DBMS to another. The reasons are primarily economical (different DBMSs may have different total costs of ownership, or TCOs), functional, and operational (different DBMSs may have different capabilities). Migration involves transforming the database from one DBMS type to another. The transformation should keep the database-related applications (i.e., all related application programs) intact if possible; thus the database's conceptual and external architectural levels should be maintained in the transformation, and it may be desired that some aspects of the internal architecture level are maintained as well. A complex or large database migration may be a complicated and costly one-time project by itself, which should be factored into the decision to migrate, in spite of the fact that tools may exist to help migration between specific DBMSs. Typically, a DBMS vendor provides tools to help import databases from other popular DBMSs.

Database building, maintaining, and tuning

Main article: Database tuning

After designing a database for an application comes the stage of building the database. Typically, an appropriate general-purpose DBMS can be selected for this purpose. A DBMS provides the user interfaces needed by database administrators to define the application's data structures within the DBMS's respective data model; other user interfaces are used to select needed DBMS parameters (security-related settings, storage allocation parameters, etc.).

When the database is ready (all its data structures and other needed components are defined), it is typically populated with the initial application data (database initialization, which is typically a distinct project, in many cases using specialized DBMS interfaces that support bulk insertion) before being made operational. In some cases the database becomes operational while empty of application data, and data are accumulated during its operation.

After the database is built and made operational, the database maintenance stage begins: various database parameters may need changing and tuning for better performance, the application's data structures may be changed or added to, new related application programs may be written to add to the application's functionality, and so on.

Databases are often confused with spreadsheets such as Microsoft Excel (as distinct from a database product such as Microsoft Access). Both can be used to store information, but they differ in their strengths and weaknesses. Spreadsheets offer very simple data storage, are relatively easy to use, and require less planning; their weaknesses are data-integrity problems (inaccurate, inconsistent, and out-of-date versions and data) and the possibility of incorrect formulas. Databases require more planning and design, but they provide methods for keeping data up to date and consistent, generally yield higher-quality data than spreadsheets, and are well suited to storing and organizing information.

Backup and restore

Main article: Backup

Sometimes it is desired to bring a database back to a previous state, for many reasons, e.g., when the database is found to be corrupted due to a software error, or when it has been updated with erroneous data. To achieve this, a backup operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their embedding in the database's data structures) is kept within dedicated backup files; many techniques exist to do this effectively. When a database administrator decides to bring the database back to a previous state (e.g., by specifying a desired point in time when the database was in that state), these files are used to restore it.

Other

Other DBMS features might include: database logs; a graphics component for producing graphs and charts, especially in a data warehouse system; a query optimizer, which performs query optimization on every query to choose the most efficient query plan (a partial order, or tree, of operations) to be executed to compute the query result, and which may be specific to a particular storage engine;
and tools or hooks for database design, application programming, application program maintenance, database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a DBMS and related database may span computers, networks, and storage units), related database mapping (especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, and so on.

See also

Comparison of database tools; Comparison of object database management systems; Comparison of object-relational database management systems; Comparison of relational database management systems; Data access; Data hierarchy; Data store; Data warehouse; Database testing; Database-centric architecture; Metadata

References

1. Ullman, Jeffrey (1997). First Course in Database Systems. Prentice-Hall / Simon & Schuster, p. 1. ISBN 0-13-861337-0.
2. Tsichritzis, D. C. and F. H. Lochovsky (1982). Data Models. Englewood Cliffs: Prentice-Hall.
3. Beynon-Davies, P. (2004). Database Systems, 3rd edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2.
4. Chong, Raul F.; Dang, Michael; Snow, Dwaine R.; Wang, Xiaomei (3 July 2008). "Introduction to DB2". Retrieved 17 March 2013. (This article quotes a development time of 5 years involving 750 people for DB2 release 9 alone.)
5. Bachman, C. W. (November 1973). "The Programmer as Navigator". CACM (1973 Turing Award lecture).
6. Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13(6): 377-387.
7. Hershey, William and Carol Easthope. "A set theoretic data structure and retrieval language". Spring Joint Computer Conference, May 1972; ACM SIGIR Forum 7(4), December 1972, pp. 45-55. doi:10.1145/1095495.1095500.
8. North, Ken. "Sets, Data Models and Data Independence". Dr. Dobb's, 10 March 2010.
9. Childs, D. L. (1968). "Description of a Set-Theoretic Data Structure". CONCOMP (Research in Conversational Use of Computers) Project, Technical Report 3, University of Michigan, Ann Arbor, Michigan, USA.
10. Childs, D. L. (1968). "Feasibility of a Set-Theoretic Data Structure: A General Structure Based on a Reconstituted Definition of Relation". CONCOMP Project, Technical Report 6, University of Michigan, Ann Arbor, Michigan, USA.
11. Kahn, M. A.; Rumelhart, D. L.; Bronson, B. L. (October 1977). MICRO Information Management System (Version 5.0) Reference Manual. Institute of Labor and Industrial Relations (ILIR), University of Michigan and Wayne State University.
12. Interview with Wayne Ratliff, http://www.foxprohistory.org/interview_wayne_ratliff.htm
13. "Development of an object-oriented DBMS". Portland, Oregon, United States, pp. 472-482, 1986. ISBN 0-89791-204-7.
14. "DB-Engines Ranking", January 2013. Retrieved 22 January 2013.
15. Graves, Steve. "COTS Databases For Embedded Systems". Embedded Computing Design magazine, January 2007. Retrieved 13 August 2008.
16. "TeleCommunication Systems Signs Up as a Reseller of TimesTen; Mobile Operators and Carriers Gain Real-Time Platform for Location Based Services". Business Wire, 24 June 2002.
17. Rahwan, Iyad and Guillermo R. Simari (eds.). Argumentation in Artificial Intelligence.
18. "OWL DL Semantics". Retrieved 10 December 2010.
19. itl.nist.gov (1993). "Integration Definition for Information Modeling (IDEF1X)". 21 December 1993.
20. Date 1990, pp. 31-32.
21. Chapple, Mike. "SQL Fundamentals". Databases.about.com. Retrieved 28 January 2009.
22. "Structured Query Language (SQL)". International Business Machines, 27 October 2006. Retrieved 10 June 2007.
23. Wagner, Michael (2010). SQL/XML:2006: Evaluierung der Standardkonformität ausgewählter Datenbanksysteme, 1. Auflage. Diplomica Verlag. ISBN 3-8366-9609-6.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_support_system b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_support_system
new file mode 100644
index 00000000..0b440bd8
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_support_system
@@ -0,0 +1 @@
+Decision support system

[Figure: example of a decision support system for the John Day Reservoir]

A decision support system (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization and help people make decisions about problems that may be rapidly changing and not easily specified in advance. Decision support systems can be fully computerized, human-powered, or a combination of both, and they include knowledge-based systems. A properly designed DSS is an interactive, software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, or
business models, to identify and solve problems and make decisions. Typical information that a decision support application might gather and present includes: inventories of information assets (including legacy and relational data sources, cubes, data warehouses, and data marts); comparative sales figures between one period and the next; and projected revenue figures based on product-sales assumptions.

History

According to Keen,[1] the concept of decision support has evolved from two main areas of research: the theoretical studies of organizational decision making done at the Carnegie Institute of Technology during the late 1950s and early 1960s, and the technical work on computing technology in the 1960s. DSS became an area of research of its own in the middle of the 1970s, before gaining in intensity during the 1980s. In the middle and late 1980s, executive information systems (EIS), group decision support systems (GDSS), and organizational decision support systems (ODSS) evolved from the single-user and model-oriented DSS.

According to Sol (1987),[2] the definition and scope of DSS has been migrating over the years. In the 1970s, DSS was described as "a computer-based system to aid decision making"; in the late 1970s, the DSS movement started focusing on "interactive computer-based systems which help decision-makers utilize data bases and models to solve ill-structured problems"; in the 1980s, DSS should provide systems "using suitable and available technology to improve effectiveness of managerial and professional activities"; and toward the end of the 1980s, DSS faced a new challenge, the design of intelligent workstations.[2]

In 1987, Texas Instruments completed development of the Gate Assignment Display System (GADS) for United Airlines. This decision support system is credited with significantly reducing travel delays by aiding the management of ground operations at various airports, beginning with O'Hare International Airport in Chicago and Stapleton Airport in Denver, Colorado.[3][4] Beginning in about 1990, data warehousing and on-line analytical processing (OLAP) began broadening the realm of DSS. As the turn of the millennium approached, new web-based analytical applications were introduced. The advent of ever-better reporting technologies has seen DSS start to emerge as a critical component of management design; examples of this can be seen in the intense amount of discussion of DSS in the education environment.

DSS also have a weak connection to the user-interface paradigm of hypertext. Both the University of Vermont PROMIS system (for medical decision making) and the Carnegie Mellon ZOG/KMS system (for military and business decision making) were decision support systems that were also major breakthroughs in user-interface research. Furthermore, although hypertext researchers have generally been concerned with information overload, certain researchers, notably Douglas Engelbart, have been focused on decision makers in particular.

Taxonomies

As with the definition, there is no universally accepted taxonomy of DSS either; different authors propose different classifications. Using the relationship with the user as the criterion, Haettenschwiler[5] differentiates passive, active, and cooperative DSS. A passive DSS is a system that aids the process of decision making but cannot
bring out explicit decision suggestions or solutions. An active DSS can bring out such decision suggestions or solutions. A cooperative DSS allows the decision maker (or its advisor) to modify, complete, or refine the decision suggestions provided by the system before sending them back to the system for validation; the system in turn improves, completes, and refines the suggestions of the decision maker and sends them back for validation, and the whole process starts again until a consolidated solution is generated.

Another taxonomy for DSS, created by Daniel Power, uses the mode of assistance as the criterion. Power differentiates communication-driven DSS, data-driven DSS, document-driven DSS, knowledge-driven DSS, and model-driven DSS.[6] A communication-driven DSS supports more than one person working on a shared task; examples include integrated tools like Microsoft's NetMeeting or Groove.[7] A data-driven DSS (or data-oriented DSS) emphasizes access to and manipulation of a time series of internal company data and, sometimes, external data. A document-driven DSS manages, retrieves, and manipulates unstructured information in a variety of electronic formats. A knowledge-driven DSS provides specialized problem-solving expertise stored as facts, rules, procedures, or similar structures.[6] A model-driven DSS emphasizes access to and manipulation of a statistical, financial, optimization, or simulation model; model-driven DSS use data and parameters provided by users to assist decision makers in analyzing a situation, and they are not necessarily data-intensive. DICODESS is an example of an open-source model-driven DSS generator.[8]

Using scope as the criterion, Power[9] differentiates enterprise-wide DSS and desktop DSS. An enterprise-wide DSS is linked to large data warehouses and serves many managers in the company. A desktop, single-user DSS is a small system that runs on an individual manager's PC.

Components

[Figure: design of a drought-mitigation decision support system]

Three fundamental components of a DSS architecture are:[5][6][10][11][12] the database (or knowledge base); the model (i.e., the decision context and user criteria); and the user interface. The users themselves are also important components of the architecture.[5][12]

Development frameworks

DSS systems are not entirely different from other systems and require a structured approach. Such a framework includes people, technology, and the development approach.[10]

DSS technology levels (of hardware and software) may include: the actual application that will be used by the user, i.e., the part of the application that allows the decision maker to make decisions in a particular problem area and act upon that problem; the generator, a hardware/software environment that allows people to easily develop specific DSS applications (this level makes use of CASE tools or systems such as Crystal, Analytica, and iThink); and tools, the lower-level hardware and software, including DSS generators, special languages, function libraries, and linking modules.

An iterative development approach allows the DSS to be changed and redesigned at various intervals. Once the system is designed, it will need to be tested and revised where necessary for the desired outcome.

Classification

There are several ways to classify DSS applications; not every DSS fits neatly into one category, and a system may be a mix of two or more architectures. Holsapple and Whinston[13] classify DSS into the following six frameworks: text-oriented DSS, database-oriented DSS, spreadsheet-oriented DSS, solver-oriented DSS,
rule-oriented DSS, and compound DSS. A compound DSS is the most popular classification for a DSS; it is a hybrid system that includes two or more of the five basic structures described by Holsapple and Whinston.[13]

The support given by DSS can be separated into three distinct, interrelated categories:[14] personal support, group support, and organizational support.

DSS components may be classified as: inputs (factors, numbers, and characteristics to analyze); user knowledge and expertise (inputs requiring manual analysis by the user); outputs (transformed data from which DSS "decisions" are generated); and decisions (results generated by the DSS based on user criteria).

DSSs which perform selected cognitive decision-making functions and are based on artificial intelligence or intelligent-agent technologies are called intelligent decision support systems (IDSS).[citation needed]

The nascent field of decision engineering treats the decision itself as an engineered object, and applies engineering principles such as design and quality assurance to an explicit representation of the elements that make up a decision.

Applications

As mentioned above, there are theoretical possibilities of building such systems in any knowledge domain. One example is the clinical decision support system for medical diagnosis. Other examples include a bank loan officer verifying the credit of a loan applicant, or an engineering firm that has bids on several projects and wants to know whether it can be competitive with its costs.

DSS is extensively used in business and management. Executive dashboards and other business performance software allow faster decision making, identification of negative trends, and better allocation of business resources. With a DSS, all the information from an organization is represented in the form of charts and graphs, i.e., in a summarized way, which helps management take strategic decisions.

A growing area of DSS application, concepts, principles, and techniques is in agricultural production and marketing for sustainable development. For example, the DSSAT4 package,[15][16] developed through financial support of USAID during the 1980s and 1990s, has allowed rapid assessment of several agricultural production systems around the world to facilitate decision making at the farm and policy levels. There are, however, many constraints to the successful adoption of DSS in agriculture.[17]

DSS are also prevalent in forest management, where the long planning time frame demands specific requirements. All aspects of forest management, from log transportation and harvest scheduling to sustainability and ecosystem protection, have been addressed by modern DSSs.

A specific example concerns the Canadian National Railway system, which tests its equipment on a regular basis using a decision support system. A problem faced by any railroad is worn-out or defective rails, which can result in hundreds of derailments per year; using a DSS, CN managed to decrease the incidence of derailments at the same time other companies were experiencing an increase.

Benefits

A DSS improves personal efficiency; speeds up the process of decision making; increases organizational control; encourages exploration and discovery on the part of the decision maker; speeds up problem solving in an organization; facilitates interpersonal communication; promotes learning or training; generates new evidence in support of a decision; creates a competitive advantage over the competition; reveals new approaches to thinking about the problem space; helps automate managerial processes; and creates innovative ideas to speed up performance.
DSS characteristics and capabilities

A DSS should: solve semi-structured and unstructured problems; support managers at all levels; support both individuals and groups; support interdependent and sequential decisions; support intelligence, design, and choice; be adaptable and flexible; be interactive and easy to use; be interactive and efficient; keep the human in control of the process; allow ease of development by end users; support modeling and analysis; provide data access; work standalone, integrated, and web-based; support a variety of decision processes; support a variety of decision trees; and provide quick response.

See also

Clinical decision support system; Decision engineering; Decision-making software; Expert system; Judge-advisor system; Land allocation decision support system; Morphological analysis; Online deliberation; Predictive analytics; Self-service software; Spatial decision support system; Opinion-driven decision support system

References

1. Keen, P. G. W. (1978). Decision Support Systems: An Organizational Perspective. Reading, Mass.: Addison-Wesley. ISBN 0-201-03667-3.
2. Sol, Henk G. et al. (1987). Expert Systems and Artificial Intelligence in Decision Support Systems: Proceedings of the Second Mini Euroconference, Lunteren, The Netherlands, 17-20 November 1985. Springer. ISBN 90-277-2437-7. pp. 1-2.
3. Turban, Efraim; Aronson, Jay E.; Liang, Ting-Peng (2008). Decision Support Systems and Intelligent Systems. p. 574.
4. "Gate delays at airports are minimised for United by Texas Instruments Explorer". Computer Business Review, 26 November 1987.
5. Haettenschwiler, P. (1999). "Neues anwenderfreundliches Konzept der Entscheidungsunterstützung". In: Gutes Entscheiden in Wirtschaft, Politik und Gesellschaft. Zurich: vdf Hochschulverlag AG, pp. 189-208.
6. Power, D. J. (2002). Decision Support Systems: Concepts and Resources for Managers. Westport, Conn.: Quorum Books.
7. Stanhope, P. (2002). Get in the Groove: Building Tools and Peer-to-Peer Solutions with the Groove Platform. New York: Hungry Minds.
8. Gachet, A. (2004). Building Model-Driven Decision Support Systems with Dicodess. Zurich: vdf.
9. Power, D. J. (1996). "What is a DSS?" The On-Line Executive Journal for Data-Intensive Decision Support 1(3).
10. Sprague, R. H. and E. D. Carlson (1982). Building Effective Decision Support Systems. Englewood Cliffs, N.J.: Prentice-Hall. ISBN 0-13-086215-0.
11. Haag, Cummings, McCubbrey, Pinsonneault, Donovan (2000). Management Information Systems for the Information Age. McGraw-Hill Ryerson Limited, pp. 136-140. ISBN 0-07-281947-2.
12. Marakas, G. M. (1999). Decision Support Systems in the Twenty-First Century. Upper Saddle River, N.J.: Prentice-Hall.
13. Holsapple, C. W. and A. B. Whinston (1996). Decision Support Systems: A Knowledge-Based Approach. St. Paul: West Publishing. ISBN 0-324-03578-0.
14. Hackathorn, R. D. and P. G. W. Keen (September 1981). "Organizational Strategies for Personal Computing in Decision Support Systems". MIS Quarterly 5(3).
15. DSSAT4 (pdf).
16. The Decision Support System for Agrotechnology Transfer.
17. Stephens, W. and Middleton, T. (2002). "Why has the uptake of decision support systems been so poor?" In: Crop-Soil Simulation Models in Developing Countries, pp. 129-148 (eds. R. B. Matthews and William Stephens). Wallingford: CABI.
int j risk assessment and management vol 2 nos 3 4 gomes da silva carlos cl maco jo o figueira jos european journal of operational research ender gabriela e book 2005 2011 about the openspace online real time methodology knowledge sharing problem solving results oriented group dialogs about topics that matter with extensive conference documentation in real time download http www openspace online com openspace online_ebook_en pdf jim nez antonio r os insua sixto mateos alfonso computers amp operations research jintrawet attachai 1995 a decision support system for rapid assessment of lowland rice based cropping alternatives in thailand agricultural systems 47 245 258 matsatsinis n f and y siskos 2002 intelligent support systems for marketing decisions kluwer academic publishers power d j 2000 web based and model driven decision support systems concepts and issues in proceedings of the americas conference on information systems long beach california reich yoram kapeliuk adi decision support systems nov2005 vol 41 issue 1 p1 19 19p sauter v l 1997 decision support systems an applied managerial approach new york john wiley silver m 1991 systems that support decision makers description and analysis chichester 160 new york wiley sprague r h and h j watson 1993 decision support systems putting theory into practice englewood clifts n j prentice hall wikimedia commons has media related to decision support systems v t e data warehouse 160 creating the data warehouse concepts database dimension dimensional modeling fact olap star schema aggregate variants anchor modeling column oriented dbms data vault modeling holap molap rolap operational data store elements data dictionary metadata data mart sixth normal form surrogate key fact fact table early arriving fact measure dimension dimension table degenerate slowly changing filling extract transform load etl extract transform load 160 using the data warehouse concepts business intelligence dashboard data mining decision support system dss olap cube languages data mining extensions dmx multidimensional expressions mdx xml for analysis xmla tools business intelligence tools reporting software spreadsheet 160 related people bill inmon ralph kimball products comparison of olap servers data warehousing products and their producers retrieved from http en wikipedia org w index php title decision_support_system amp oldid 560308375 categories information systemsdecision theoryknowledge engineeringhidden categories all articles with unsourced statementsarticles with unsourced statements from october 2010commons category with local link same as on wikidatause dmy dates from october 2010 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages esky deutsch espa ol fran ais bahasa indonesia italiano magyar nederlands norsk bokm l polski portugus sloven ina sloven ina srpski t rk e edit links this page was last modified on 17 june 2013 at 15 24 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered 
trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_tree_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_tree_learning new file mode 100644 index 00000000..8ffa2018 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Decision_tree_learning @@ -0,0 +1 @@ +decision tree learning wikipedia the free encyclopedia decision tree learning from wikipedia the free encyclopedia jump to navigation search this article is about decision trees in machine learning for the use of the term in decision analysis see decision tree decision tree learning used in statistics data mining and machine learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item s target value more descriptive names for such tree models are classification trees or regression trees in these tree structures leaves represent class labels and branches represent conjunctions of features that lead to those class labels in decision analysis a decision tree can be used to visually and explicitly represent decisions and decision making in data mining a decision tree describes data but not decisions rather the resulting classification tree can be an input for decision making this page deals with decision trees in data mining contents 1 general 2 types 3 formulae 3 1 gini impurity 3 2 information gain 4 decision tree advantages 5 limitations 6 extensions 6 1 decision graphs 6 2 search through evolutionary algorithms 7 see also 8 implementations 9 references 10 external links general edit a tree showing survival of passengers on the titanic sibsp is the number of spouses or siblings aboard the figures under the leaves show the probability of survival and the percentage of observations in the leaf decision tree learning is a method commonly used in data mining 1 the goal is to create a model that predicts the value of a target variable based on several input variables an example is shown on the right each interior node corresponds to one of the input variables there are edges to children for each of the possible values of that input variable each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf a tree can be learned by splitting the source set into subsets based on an attribute value test this process is repeated on each derived subset in a recursive manner called recursive partitioning the recursion is completed when the subset at a node has all the same value of the target variable or when splitting no longer adds value to the predictions this process of top down induction of decision trees tdidt 2 is an example of a greedy algorithm and it is by far the most common strategy for learning decision trees from data but it is not the only strategy in fact some approaches have been developed recently allowing tree induction to be performed in a bottom up fashion 3 in data mining decision trees can be described also as the combination of mathematical and computational techniques to aid the description categorisation and generalisation of a given set of data data comes in records of the form the dependent variable y is the target variable that we are trying to understand classify or generalize the vector x is composed of the input variables x1 x2 x3 etc 
that are used for that task types edit decision trees used in data mining are of two main types classification tree analysis is when the predicted outcome is the class to which the data belongs regression tree analysis is when the predicted outcome can be considered a real number e g the price of a house or a patient s length of stay in a hospital the term classification and regression tree cart analysis is an umbrella term used to refer to both of the above procedures first introduced by breiman et al 4 trees used for regression and trees used for classification have some similarities but also some differences such as the procedure used to determine where to split 4 some techniques often called ensemble methods construct more than one decision tree bagging decision trees an early ensemble method builds multiple decision trees by repeatedly resampling training data with replacement and voting the trees for a consensus prediction 5 a random forest classifier uses a number of decision trees in order to improve the classification rate boosted trees can be used for regression type and classification type problems 6 7 rotation forest in which every decision tree is trained by first applying principal component analysis pca on a random subset of the input features 8 decision tree learning is the construction of a decision tree from class labeled training tuples a decision tree is a flow chart like structure where each internal non leaf node denotes a test on an attribute each branch represents the outcome of a test and each leaf or terminal node holds a class label the topmost node in a tree is the root node there are many specific decision tree algorithms notable ones include id3 iterative dichotomiser 3 c4 5 successor of id3 cart classification and regression tree chaid chi squared automatic interaction detector performs multi level splits when computing classification trees 9 mars extends decision trees to better handle numerical data id3 and cart were invented independently at around same time b w 1970 80 yet follow a similar approach for learning decision tree from training tuples formulae edit algorithms for constructing decision trees usually work top down by choosing a variable at each step that best splits the set of items 10 different algorithms use different metrics for measuring best these generally measure the homogeneity of the target variable within the subsets some examples are given below these metrics are applied to each candidate subset and the resulting values are combined e g averaged to provide a measure of the quality of the split gini impurity edit main article gini coefficient used by the cart classification and regression tree algorithm gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item it reaches its minimum zero when all cases in the node fall into a single target category to compute gini impurity for a set of items suppose i takes on values in 1 2 m and let fi be the fraction of items labeled with value i in the set information gain edit main article information gain in decision trees used by the id3 c4 5 and c5 0 tree generation algorithms information gain is based on the concept of entropy from information theory decision tree advantages edit amongst other data mining methods decision 
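The Gini impurity and information gain definitions above lost their formulas when the article was flattened to plain text. As a reconstruction, Gini impurity is I_G = 1 - sum_i f_i^2 over the class fractions f_i, and information gain is the parent entropy minus the weighted entropies of the child subsets. The short Python sketch below computes both for a candidate split; the function names and the toy split are illustrative and not taken from any implementation the article lists.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: I_G = 1 - sum_i f_i^2, where f_i is the
    fraction of items in `labels` carrying class value i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy H = -sum_i f_i * log2(f_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_subsets):
    """Information gain = H(parent) minus the size-weighted average
    H(child), the criterion attributed to ID3 / C4.5 / C5.0 above."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_label_subsets)
    return entropy(parent_labels) - weighted

# Example: a candidate split of 10 labels into two subsets.
parent = ["yes"] * 6 + ["no"] * 4
left = ["yes"] * 5 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3
print(gini_impurity(parent))                    # 0.48
print(information_gain(parent, [left, right]))  # roughly 0.26
```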
trees have various advantages simple to understand and interpret people are able to understand decision tree models after a brief explanation requires little data preparation other techniques often require data normalisation dummy variables need to be created and blank values to be removed able to handle both numerical and categorical data other techniques are usually specialised in analysing datasets that have only one type of variable for example relation rules can be used only with nominal variables while neural networks can be used only with numerical variables uses a white box model if a given situation is observable in a model the explanation for the condition is easily explained by boolean logic an example of a black box model is an artificial neural network since the explanation for the results is difficult to understand possible to validate a model using statistical tests that makes it possible to account for the reliability of the model robust performs well even if its assumptions are somewhat violated by the true model from which the data were generated performs well with large datasets large amounts of data can be analysed using standard computing resources in reasonable time limitations edit the problem of learning an optimal decision tree is known to be np complete under several aspects of optimality and even for simple concepts 11 12 consequently practical decision tree learning algorithms are based on heuristics such as the greedy algorithm where locally optimal decisions are made at each node such algorithms cannot guarantee to return the globally optimal decision tree decision tree learners can create over complex trees that do not generalise well from the training data this is known as overfitting 13 mechanisms such as pruning are necessary to avoid this problem there are concepts that are hard to learn because decision trees do not express them easily such as xor parity or multiplexer problems in such cases the decision tree becomes prohibitively large approaches to solve the problem involve either changing the representation of the problem domain known as propositionalisation 14 or using learning algorithms based on more expressive representations such as statistical relational learning or inductive logic programming for data including categorical variables with different numbers of levels information gain in decision trees is biased in favor of those attributes with more levels 15 extensions edit decision graphs edit in a decision tree all paths from the root node to the leaf node proceed by way of conjunction or and in a decision graph it is possible to use disjunctions ors to join two more paths together using minimum message length mml 16 decision graphs have been further extended to allow for previously unstated new attributes to be learnt dynamically and used at different places within the graph 17 the more general coding scheme results in better predictive accuracy and log loss probabilistic scoring citation needed in general decision graphs infer models with fewer leaves than decision trees search through evolutionary algorithms edit evolutionary algorithms have been used to avoid local optimal decisions and search the decision tree space with little a priori bias 18 19 see also edit decision tree pruning binary decision diagram chaid cart id3 algorithm c4 5 algorithm decision stump incremental decision tree alternating decision tree structured data analysis statistics implementations edit weka a free and open source data mining suite contains many decision 
tree algorithms orange a free data mining software suite module orngtree knime microsoft sql server 1 references edit rokach lior maimon o 2008 data mining with decision trees theory and applications world scientific pub co inc isbn 160 978 9812771711 160 quinlan j r 1986 induction of decision trees machine learning 1 81 106 kluwer academic publishers barros r c cerri r jaskowiak p a carvalho a c p l f a bottom up oblique decision tree induction algorithm proceedings of the 11th international conference on intelligent systems design and applications isda 2011 a b breiman leo friedman j h olshen r a amp stone c j 1984 classification and regression trees monterey ca wadsworth amp brooks cole advanced books amp software isbn 160 978 0 412 04841 8 160 breiman l 1996 bagging predictors machine learning 24 pp 123 140 friedman j h 1999 stochastic gradient boosting stanford university hastie t tibshirani r friedman j h 2001 the elements of statistical learning 160 data mining inference and prediction new york springer verlag rodriguez j j and kuncheva l i and alonso c j 2006 rotation forest a new classifier ensemble method ieee transactions on pattern analysis and machine intelligence 28 10 1619 1630 kass g v 1980 an exploratory technique for investigating large quantities of categorical data applied statistics 29 2 119 127 doi 10 2307 2986296 jstor 160 2986296 160 rokach l maimon o 2005 top down induction of decision trees classifiers a survey ieee transactions on systems man and cybernetics part c 35 4 476 487 doi 10 1109 tsmcc 2004 843247 160 hyafil laurent rivest rl 1976 constructing optimal binary decision trees is np complete information processing letters 5 1 15 17 doi 10 1016 0020 0190 76 90095 8 160 murthy s 1998 automatic construction of decision trees from data a multidisciplinary survey data mining and knowledge discovery principles of data mining 2007 doi 10 1007 978 1 84628 766 4 isbn 160 978 1 84628 765 7 160 edit horv th tam s yamamoto akihiro eds 2003 inductive logic programming lecture notes in computer science 2835 doi 10 1007 b13700 isbn 160 978 3 540 20144 1 160 edit deng h runger g tuv e 2011 bias of importance measures for multi valued attributes and solutions proceedings of the 21st international conference on artificial neural networks icann pp 160 293 300 160 http citeseer ist psu edu oliver93decision html tan amp dowe 2003 papagelis a kalles d 2001 breeding decision trees using evolutionary techniques proceedings of the eighteenth international conference on machine learning p 393 400 june 28 july 01 2001 barros rodrigo c basgalupp m p carvalho a c p l f freitas alex a 2011 a survey of evolutionary algorithms for decision tree induction ieee transactions on systems man and cybernetics part c applications and reviews vol 42 n 3 p 291 312 may 2012 external links edit building decision trees in python from o reilly an addendum to building decision trees in python from o reilly decision trees tutorial using microsoft excel decision trees page at aitopics org a page with commented links decision tree implementation in ruby ai4r evolutionary learning of decision trees in c java implementation of decision trees based on information gain retrieved from http en wikipedia org w index php title decision_tree_learning amp oldid 559683593 categories data miningdecision treesclassification algorithmsmachine learninghidden categories all articles with unsourced statementsarticles with unsourced statements from january 2012 navigation menu personal tools create accountlog in namespaces 
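As a companion to the top-down induction (TDIDT) strategy described above, here is a minimal greedy recursive-partitioning sketch in Python. It assumes numeric feature tuples, uses Gini impurity as the split criterion, and stops on pure nodes, unhelpful splits, or a depth limit, a crude pre-pruning guard of the kind the limitations section says is needed against overfitting. It illustrates the general strategy only; it is not an implementation of ID3, C4.5 or CART.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, labels):
    """Greedy step: pick the (feature, threshold) pair that minimises the
    weighted Gini impurity of the two resulting child nodes."""
    best = None  # (score, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursive partitioning: stop when the node is pure, when no split
    reduces impurity, or at max_depth (a simple pre-pruning guard)."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority class
    split = best_split(rows, labels)
    if split is None or split[0] >= gini(labels):
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left_idx = [i for i, r in enumerate(rows) if r[f] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f, "threshold": t,
        "left": build_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx], depth + 1, max_depth),
    }

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Toy data: predict survival from (class, age) pairs, echoing the Titanic example.
X = [(1, 30), (1, 8), (2, 40), (3, 25), (3, 6), (2, 60)]
y = ["yes", "yes", "no", "no", "yes", "no"]
tree = build_tree(X, y)
print(predict(tree, (3, 7)))
```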
article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages esky deutsch espa ol fran ais italiano magyar nederlands polski portugus ti ng vi t edit links this page was last modified on 13 june 2013 at 06 53 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Discovery_observation_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Discovery_observation_ new file mode 100644 index 00000000..c73428dc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Discovery_observation_ @@ -0,0 +1 @@ +discovery observation wikipedia the free encyclopedia discovery observation from wikipedia the free encyclopedia jump to navigation search sightings redirects here for other uses see sightings disambiguation this article needs additional citations for verification please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed december 2011 discovery is the act of detecting something new or something old that had been unknown with reference to science and academic disciplines discovery is the observation of new phenomena new actions or new events and providing new reasoning to explain the knowledge gathered through such observations with previously acquired knowledge from abstract thought and everyday experiences visual discoveries are often called sightings citation needed contents 1 description 1 1 within science 2 exploration 3 see also 4 references 5 external links description edit new discoveries are acquired through various senses and are usually assimilated merging with pre existing knowledge and actions questioning is a major form of human thought and interpersonal communication and plays a key role in discovery citation needed discoveries are often made due to questions some discoveries lead to the invention of objects processes or techniques a discovery may sometimes be based on earlier discoveries collaborations or ideas and the process of discovery requires at least the awareness that an existing concept or method can be modified or transformed citation needed however some discoveries also represent a radical breakthrough in knowledge within science edit within scientific disciplines discovery is the observation of new phenomena actions or events which helps explain knowledge gathered through previously acquired scientific evidence in science exploration is one of three purposes of research citation needed the other two being description and explanation discovery is made by providing observational evidence and attempts to develop an initial rough understanding of some phenomenon discovery within the field of particle physics has an accepted definition for what constitutes a discovery a five sigma level of certainty 1 such a level defines statistically how 
unlikely it is that an experimental result is due to chance the combination of a five sigma level of certainty and independent confirmation by other experiments turns findings into accepted discoveries 1 exploration edit discovery can also be used to describe the first incursions of peoples from one culture into the geographical and cultural environment of others western culture has used the term discovery in their histories to subtly emphasize the importance of exploration in the history of the world such as in the age of exploration since the european exploration of the world the discovery of every continent island and geographical feature for the european traveler led to the notion that the native people were discovered though many were there centuries or even millennia before in that way the term has eurocentric and ethnocentric meaning often overlooked by westerners citation needed see also edit creativity techniques serendipity timeline of scientific discoveries references edit this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations december 2011 general references b barber 1 september 1961 resistance by scientists to scientific discovery science 134 3479 596 602 doi 10 1126 science 134 3479 596 pmid 160 13686762 160 merton robert k 1957 12 priorities in scientific discovery a chapter in the sociology of science american sociological review 22 6 635 659 doi 10 2307 2089193 issn 160 00031224 jstor 160 2089193 160 carnegie mellon university artificial intelligence and psychology project yulin qin herbert a simon 1990 laboratory replication of scientific discovery processes cognitive science 14 2 281 312 doi 10 1016 0364 0213 90 90005 h oclc 160 832091458 160 preprint a silberschatz a tuzhilin december 1996 what makes patterns interesting in knowledge discovery systems ieee transactions on knowledge and data engineering 8 6 970 974 doi 10 1109 69 553165 160 tomasz imielinski heikki mannila november 1996 a database perspective on knowledge discovery communications of the acm 39 11 58 64 doi 10 1145 240455 240472 160 specific references a b rincon paul 12 december 2011 higgs boson excitement builds over glimpses at lhc bbc news retrieved 2011 12 12 160 external links edit a science odyssey people and discoveries from pbs ted education video how simple ideas lead to scientific discoveries a guide to inventions and discoveries from adrenaline to the zipper from infoplease retrieved from http en wikipedia org w index php title discovery_ observation amp oldid 560760991 categories learningobservationcognitionhidden categories articles needing additional references from december 2011all articles needing additional referencesall articles with unsourced statementsarticles with unsourced statements from december 2011articles lacking in text citations from december 2011all articles lacking in text citations navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages az rbaycanca catal esky deutsch espa ol esperanto fran ais galego magyar 
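The five-sigma criterion quoted above can be made concrete. Assuming SciPy is available, the one-sided tail probability of a normal distribution beyond five standard deviations is roughly 2.9 x 10^-7, about one chance in 3.5 million that such a result is a statistical fluke.

```python
from scipy.stats import norm

# One-sided tail probability beyond 5 standard deviations of a normal
# distribution: the chance of a fluctuation at least this extreme.
p_value = norm.sf(5)        # survival function, i.e. 1 - CDF(5)
print(f"{p_value:.2e}")     # roughly 2.87e-07
```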
svenska ti ng vi t edit links this page was last modified on 20 june 2013 at 15 18 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Document_classification b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Document_classification new file mode 100644 index 00000000..b78982ff --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Document_classification @@ -0,0 +1 @@ +document classification wikipedia the free encyclopedia document classification from wikipedia the free encyclopedia jump to navigation search document classification or document categorization is a problem in library science information science and computer science the task is to assign a document to one or more classes or categories this may be done manually or intellectually or algorithmically the intellectual classification of documents has mostly been the province of library science while the algorithmic classification of documents is used mainly in information science and computer science the problems are overlapping however and there is therefore also interdisciplinary research on document classification the documents to be classified may be texts images music etc each kind of document possesses its special classification problems when not otherwise specified text classification is implied documents may be classified according to their subjects or according to other attributes such as document type author printing year etc in the rest of this article only subject classification is considered there are two main philosophies of subject classification of documents the content based approach and the request based approach contents 1 content based versus request based classification 2 classification versus indexing 3 automatic document classification 4 techniques 5 applications 6 see also 7 references 8 further reading 9 external links content based versus request based classification edit content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned it is for example a rule in much library classification that at least 20 of the content of a book should be about the class to which the book is assigned 1 in automatic classification it could be the number of times given words appears in a document request oriented classification or indexing is classification in which the anticipated request from users is influencing how documents are being classified the classifier ask himself under which descriptors should this entity be found and think of all the possible queries and decide for which ones the entity at hand is relevant soergel 1985 p 160 230 2 request oriented classification may be classification that is targeted towards a particular audience or user group for example a library or a database for feminist studies may classify index documents different compared to a historical library it is probably better however to understand request oriented classification as policy based classification the classification is done according to some ideals and reflects the purpose of the library or database doing the 
classification in this way it is not necessarily a kind of classification or indexing based on user studies only if empirical data about use or users are applied should request oriented classification be regarded as a user based approach classification versus indexing edit sometimes a distinction is made between assigning documents to classes classification versus assigning subjects to documents subject indexing but as frederick wilfrid lancaster has argued this distinction is not fruitful these terminological distinctions he writes are quite meaningless and only serve to cause confusion lancaster 2003 p 160 21 3 the view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa cf aitchison 1986 4 2004 5 broughton 2008 6 riesthuis amp bliedung 1991 7 therefore is the act of labeling a document say by assigning a term from a controlled vocabulary to a document at the same time to assign that document to the class of documents indexed by that term all documents indexed or classified as x belong to the same class of documents automatic document classification edit automatic document classification tasks can be divided into three sorts supervised document classification where some external mechanism such as human feedback provides information on the correct classification for documents unsupervised document classification also known as document clustering where the classification must be done entirely without reference to external information and semi supervised document classification where parts of the documents are labeled by the external mechanism techniques edit automatic document classification techniques include expectation maximization em naive bayes classifier tf idf latent semantic indexing support vector machines svm artificial neural network k nearest neighbour algorithms decision trees such as id3 or c4 5 concept mining rough set based classifier soft set based classifier multiple instance learning natural language processing approaches applications edit classification techniques have been applied to spam filtering a process which tries to discern e mail spam messages from legitimate emails email routing sending an email sent to a general address to a specific address or mailbox depending on topic 8 language identification automatically determining the language of a text genre classification automatically determining the genre of a text 9 readability assessment automatically determining the degree of readability of a text either to find suitable materials for different age groups or reader types or as part of a larger text simplification system see also edit categorization classification disambiguation compound term processing concept based image indexing content based image retrieval document supervised learning unsupervised learning document retrieval document clustering information retrieval knowledge organization knowledge organization system library classification machine learning string metrics subject documents subject indexing text mining web mining concept mining references edit library of congress 2008 the subject headings manual washington dc library of congress policy and standards division sheet h 180 assign headings only for topics that comprise at least 20 of the work soergel dagobert 1985 organizing information principles of data base and retrieval systems orlando fl academic press lancaster f w 2003 indexing and abstracting in theory and practice library association 
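To make the techniques list concrete, the sketch below combines two of the methods named above, tf-idf weighting and a naive Bayes classifier, for the supervised case of spam filtering mentioned under applications. It assumes scikit-learn is available (the article does not name it) and uses a made-up four-document corpus purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled corpus: spam filtering as in the applications list.
docs = [
    "cheap pills buy now limited offer",
    "meeting agenda attached please review",
    "win a free prize click here",
    "project deadline moved to friday",
]
labels = ["spam", "ham", "spam", "ham"]

# tf-idf turns each document into a weighted term vector; multinomial
# naive Bayes then estimates P(class | terms) from the training documents.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["free offer click now", "please review the agenda"]))
# expected output on this toy corpus: ['spam' 'ham']
```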
london aitchison j 1986 a classification as a source for thesaurus the bibliographic classification of h e bliss as a source of thesaurus terms and structure journal of documentation vol 42 no 3 pp 160 181 aitchison j 2004 thesauri from bc2 problems and possibilities revealed in an experimental thesaurus derived from the bliss music schedule bliss classification bulletin vol 46 pp 20 26 broughton v 2008 a faceted classification as the basis of a faceted terminology conversion of a classified structure to thesaurus format in the bliss bibliographic classification 2nd ed axiomathes vol 18 no 2 pp 193 210 riesthuis g j a amp bliedung st 1991 thesaurification of the udc tools for knowledge organization and the human interface vol 2 pp 109 117 index verlag frankfurt stephan busemann sven schmeier and roman g arens 2000 message classification in the call center in sergei nirenburg douglas appelt fabio ciravegna and robert dale eds proc 6th applied natural language processing conf anlp 00 pp 158 165 acl santini marina rosso mark 2008 testing a genre enabled application a preliminary assessment bcs irsg symposium future directions in information access london uk pp 160 54 63 160 further reading edit fabrizio sebastiani machine learning in automated text categorization acm computing surveys 34 1 1 47 2002 stefan b ttcher charles l a clarke and gordon v cormack information retrieval implementing and evaluating search engines mit press 2010 external links edit introduction to document classification bibliography on automated text categorization bibliography on query classification text classification analysis page learning to classify text chap 6 of the book natural language processing with python available online techtc technion repository of text categorization datasets david d lewis s datasets biocreative iii act article classification task dataset retrieved from http en wikipedia org w index php title document_classification amp oldid 552783784 categories information sciencenatural language processingknowledge representationdata miningmachine learning navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol euskara fran ais italiano norsk nynorsk basa sunda suomi edit links this page was last modified on 25 june 2013 at 13 49 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ECML_PKDD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ECML_PKDD new file mode 100644 index 00000000..291ad491 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ECML_PKDD @@ -0,0 +1 @@ +ecml pkdd wikipedia the free encyclopedia ecml pkdd from wikipedia the free encyclopedia jump to navigation search ecml pkdd the european conference on machine learning and 
principles and practice of knowledge discovery in databases is one of the leading 1 2 academic conferences on machine learning and knowledge discovery held in europe every year contents 1 history 2 list of past conferences 3 references 4 external links history edit ecml pkdd is a merger of two european conferences european conference on machine learning ecml and european conference on principles and practice of knowledge discovery in databases pkdd ecml and pkdd have been co located since 2001 3 however both ecml and pkdd retained their own identity until 2007 for example the 2007 conference was known as the 18th european conference on machine learning ecml and the 11th european conference on principles and practice of knowledge discovery in databases pkdd or in brief ecml pkdd 2007 and the both ecml and pkdd had their own conference proceedings in 2008 the conferences were merged into one conference and the division into traditional ecml topics and traditional pkdd topics was removed 4 the history of ecml dates back to 1986 when the european working session on learning was first held in 1993 the name of the conference was changed to european conference on machine learning pkdd was first organised in 1997 originally pkdd stood for the european symposium on principles of data mining and knowledge discovery from databases 5 the name european conference on principles and practice of knowledge discovery in databases was used since 1999 6 list of past conferences edit conference year city country date ecml pkdd 2012 bristol united kingdom september 24 28 ecml pkdd 2011 athens greece september 5 9 ecml pkdd 2010 barcelona spain september 20 24 ecml pkdd 2009 bled slovenia september 7 11 ecml pkdd 2008 antwerp belgium september 15 19 18th ecml 11th pkdd 2007 warsaw poland september 17 21 17th ecml 10th pkdd 2006 berlin germany september 18 22 16th ecml 9th pkdd 2005 porto portugal october 3 7 15th ecml 8th pkdd 2004 pisa italy september 20 24 14th ecml 7th pkdd 2003 cavtat dubrovnik croatia september 22 26 13th ecml 6th pkdd 2002 helsinki finland august 19 23 12th ecml 5th pkdd 2001 freiburg germany september 3 7 conference year city country date 11th ecml 2000 barcelona spain may 30 june 2 10th ecml 1998 chemnitz germany april 21 24 9th ecml 1997 prague czech republic april 23 26 8th ecml 1995 heraclion crete greece april 25 27 7th ecml 1994 catania italy april 6 8 6th ecml 1993 vienna austria april 5 7 5th ewsl 1991 porto portugal march 6 8 4th ewsl 1989 montpellier france december 4 6 3rd ewsl 1988 glasgow scotland uk october 3 5 2nd ewsl 1987 bled yugoslavia may 13 15 1st ewsl 1986 orsay france february 3 4 conference year city country date 4th pkdd 2000 lyon france september 13 16 3rd pkdd 1999 prague czech republic september 15 18 2nd pkdd 1998 nantes france september 23 26 1st pkdd 1997 trondheim norway june 24 27 references edit machine learning and pattern recognition libra retrieved 2009 07 04 160 ecml is number 4 on the list 2007 australian ranking of ict conferences 160 both ecml and pkdd are ranked on tier a past conferences ecml pkdd retrieved 2009 07 04 160 daelemans walter goethals bart morik katharina 2008 preface proceedings of ecml pkdd 2008 lecture notes in artificial intelligence 5211 springer pp 160 v vi doi 10 1007 978 3 540 87479 9 isbn 160 978 3 540 87478 2 160 komorowski jan zytkow jan 1997 preface proceedings of pkdd 1997 lecture notes in artificial intelligence 1263 springer pp 160 v vi doi 10 1007 3 540 63223 9 isbn 160 978 3 540 63223 8 160 zytkow jan rauch jan 1999 
preface proceedings of pkdd 1999 lecture notes in artificial intelligence 1704 springer pp 160 v vii doi 10 1007 b72280 isbn 160 978 3 540 66490 1 160 external links edit ecml pkdd web site ecml proceedings information in dblp pkdd proceedings information in dblp retrieved from http en wikipedia org w index php title ecml_pkdd amp oldid 545470678 categories artificial intelligence conferencesdata miningmachine learning navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 19 march 2013 at 17 22 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Elastic_map b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Elastic_map new file mode 100644 index 00000000..9b399ee4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Elastic_map @@ -0,0 +1 @@ +elastic map wikipedia the free encyclopedia elastic map from wikipedia the free encyclopedia jump to navigation search elastic net redirects here for the statistical regularization technique see elastic net regularization linear pca versus nonlinear principal manifolds 1 for visualization of breast cancer microarray data a configuration of nodes and 2d principal surface in the 3d pca linear manifold the dataset is curved and can not be mapped adequately on a 2d principal plane b the distribution in the internal 2d non linear principal surface coordinates elmap2d together with an estimation of the density of points c the same as b but for the linear 2d pca manifold pca2d the basal breast cancer subtype is visualized more adequately with elmap2d and some features of the distribution become better resolved in comparison to pca2d principal manifolds are produced by the elastic maps algorithm data are available for public competition 2 software is available for free non commercial use 3 4 elastic maps provide a tool for nonlinear dimensionality reduction by their construction they are system of elastic springs embedded in the data space 1 this system approximates a low dimensional manifold the elastic coefficients of this system allow the switch from completely unstructured k means clustering zero elasticity to the estimators located closely to linear pca manifolds for high bending and low stretching modules with some intermediate values of the elasticity coefficients this system effectively approximates non linear principal manifolds this approach is based on a mechanical analogy between principal manifolds that are passing through the middle of data distribution and elastic membranes and plates the method was developed by a n gorban a y zinovyev and a a pitenko in 1996 1998 contents 1 energy of elastic map 2 expectation maximization algorithm 3 applications 4 
references energy of elastic map edit let data set be a set of vectors in a finite dimensional euclidean space the elastic map is represented by a set of nodes in the same space each datapoint has a host node namely the closest node if there are several closest nodes then one takes the node with the smallest number the data set is divided on classes the approximation energy d is the distortion this is the energy of the springs with unit elasticity which connect each data point with its host node it is possible to apply weighting factors to the terms of this sum for example to reflect the standard deviation of the probability density function of any subset of data points on the set of nodes an additional structure is defined some pairs of nodes are connected by elastic edges call this set of pairs some triplets of nodes form bending ribs call this set of triplets the stretching energy is the bending energy is where and are the stretching and bending moduli respectively the stretching energy is sometimes referred to as the membrane term while the bending energy is referred to as the thin plate term 5 for example on the 2d rectangular grid the elastic edges are just vertical and horizontal edges pairs of closest vertices and the bending ribs are the vertical or horizontal triplets of consecutive closest vertices the total energy of the elastic map is thus the position of the nodes is determined by the mechanical equilibrium of the elastic map i e its location is such that it minimizes the total energy expectation maximization algorithm edit for a given splitting of the dataset in classes minimization of the quadratic functional is a linear problem with the sparse matrix of coefficients therefore similarly to pca or k means a splitting method is used for given find for given minimize and find if no change terminate this expectation maximization algorithm guarantees a local minimum of for improving the approximation various additional methods are proposed for example the softening strategy is used this strategy starts with a rigid grids small length small bending and large elasticity modules and coefficients and finishes with soft grids small and the training goes in several epochs each epoch with its own grid rigidness another adaptive strategy is growing net one starts from small amount of nodes and gradually adds new nodes each epoch goes with its own number of nodes applications edit application of principal curves build by the elastic maps method nonlinear quality of life index 6 points represent data of the un 171 countries in 4 dimensional space formed by the values of 4 indicators gross product per capita life expectancy infant mortality tuberculosis incidence different forms and colors correspond to various geographical locations and years red bold line represents the principal curve approximating the dataset most important applications are in bioinformatics 7 8 for exploratory data analysis and visualisation of multidimensional data for data visualisation in economics social and political sciences 9 as an auxiliary tool for data mapping in geographic informational systems and for visualisation of data of various nature recently the method is adapted as a support tool in the decision process underlying the selection optimization and management of financial portfolios 10 references edit a b a n gorban a y zinovyev principal graphs and manifolds in handbook of research on machine learning applications and trends algorithms methods and techniques olivas e s et al eds information science 
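The three energy terms in the elastic map passage above lost their formulas in the plain-text extraction. The LaTeX below reconstructs them in the standard elastic-map notation, with node positions w_j, data classes K_j, the edge set E, the rib-triplet set R, and the stretching and bending moduli lambda and mu named in the text; treat it as a sketch of the intended notation rather than a verbatim restoration of the original article.

```latex
% Approximation (distortion) energy: unit-elasticity springs connecting each
% data point x to its host node w_j (some formulations normalise this term
% by the number of data points)
U^{(D)} = \sum_{j} \sum_{x \in K_j} \lVert x - w_j \rVert^2

% Stretching ("membrane") energy over the set E of elastic edges
U^{(E)} = \lambda \sum_{(i,j) \in E} \lVert w_i - w_j \rVert^2

% Bending ("thin plate") energy over the set R of rib triplets
U^{(R)} = \mu \sum_{(i,j,k) \in R} \lVert w_i - 2\,w_j + w_k \rVert^2

% Total energy, minimised over the node positions (mechanical equilibrium)
U = U^{(D)} + U^{(E)} + U^{(R)}
```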
reference igi global hershey pa usa 2009 28 59 wang y klijn j g zhang y sieuwerts a m look m p yang f talantov d timmermans m meijer van gelder m e yu j et al gene expression profiles to predict distant metastasis of lymph node negative primary breast cancer lancet 365 671 679 2005 data online a zinovyev vidaexpert multidimensional data visualization tool free for non commercial use institut curie paris a zinovyev vidaexpert overview ihes institut des hautes tudes scientifiques bures sur yvette le de france michael kass andrew witkin demetri terzopoulos snakes active contour models int j computer vision 1988 vol 1 4 pp 321 331 a n gorban a zinovyev principal manifolds and graphs in practice from molecular biology to dynamical systems international journal of neural systems vol 20 no 3 2010 219 232 a n gorban b kegl d wunsch a zinovyev eds principal manifolds for data visualisation and dimension reduction lncse 58 springer berlin heidelberg new york 2007 isbn 978 3 540 73749 0 m chac n m l vano h allende h nowak detection of gene expressions in microarrays by applying iteratively elastic neural net in b beliczynski et al eds lecture notes in computer sciences vol 4432 springer berlin heidelberg 2007 355 363 a zinovyev data visualization in political and social sciences in sage international encyclopedia of political science badie b berg schlosser d morlino l a eds 2011 m resta portfolio optimization through elastic maps some evidence from the italian stock exchange knowledge based intelligent information and engineering systems b apolloni r j howlett and l jain eds lecture notes in computer science vol 4693 springer berlin heidelberg 2010 635 641 retrieved from http en wikipedia org w index php title elastic_map amp oldid 545979473 categories data miningmultivariate statisticsdimension reduction navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 21 march 2013 at 13 08 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases new file mode 100644 index 00000000..10a818e6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases @@ -0,0 +1 @@ +ecml pkdd wikipedia the free encyclopedia ecml pkdd from wikipedia the free encyclopedia redirected from european conference on machine learning and principles and practice of knowledge discovery in databases jump to 
navigation search ecml pkdd the european conference on machine learning and principles and practice of knowledge discovery in databases is one of the leading 1 2 academic conferences on machine learning and knowledge discovery held in europe every year contents 1 history 2 list of past conferences 3 references 4 external links history edit ecml pkdd is a merger of two european conferences european conference on machine learning ecml and european conference on principles and practice of knowledge discovery in databases pkdd ecml and pkdd have been co located since 2001 3 however both ecml and pkdd retained their own identity until 2007 for example the 2007 conference was known as the 18th european conference on machine learning ecml and the 11th european conference on principles and practice of knowledge discovery in databases pkdd or in brief ecml pkdd 2007 and the both ecml and pkdd had their own conference proceedings in 2008 the conferences were merged into one conference and the division into traditional ecml topics and traditional pkdd topics was removed 4 the history of ecml dates back to 1986 when the european working session on learning was first held in 1993 the name of the conference was changed to european conference on machine learning pkdd was first organised in 1997 originally pkdd stood for the european symposium on principles of data mining and knowledge discovery from databases 5 the name european conference on principles and practice of knowledge discovery in databases was used since 1999 6 list of past conferences edit conference year city country date ecml pkdd 2012 bristol united kingdom september 24 28 ecml pkdd 2011 athens greece september 5 9 ecml pkdd 2010 barcelona spain september 20 24 ecml pkdd 2009 bled slovenia september 7 11 ecml pkdd 2008 antwerp belgium september 15 19 18th ecml 11th pkdd 2007 warsaw poland september 17 21 17th ecml 10th pkdd 2006 berlin germany september 18 22 16th ecml 9th pkdd 2005 porto portugal october 3 7 15th ecml 8th pkdd 2004 pisa italy september 20 24 14th ecml 7th pkdd 2003 cavtat dubrovnik croatia september 22 26 13th ecml 6th pkdd 2002 helsinki finland august 19 23 12th ecml 5th pkdd 2001 freiburg germany september 3 7 conference year city country date 11th ecml 2000 barcelona spain may 30 june 2 10th ecml 1998 chemnitz germany april 21 24 9th ecml 1997 prague czech republic april 23 26 8th ecml 1995 heraclion crete greece april 25 27 7th ecml 1994 catania italy april 6 8 6th ecml 1993 vienna austria april 5 7 5th ewsl 1991 porto portugal march 6 8 4th ewsl 1989 montpellier france december 4 6 3rd ewsl 1988 glasgow scotland uk october 3 5 2nd ewsl 1987 bled yugoslavia may 13 15 1st ewsl 1986 orsay france february 3 4 conference year city country date 4th pkdd 2000 lyon france september 13 16 3rd pkdd 1999 prague czech republic september 15 18 2nd pkdd 1998 nantes france september 23 26 1st pkdd 1997 trondheim norway june 24 27 references edit machine learning and pattern recognition libra retrieved 2009 07 04 160 ecml is number 4 on the list 2007 australian ranking of ict conferences 160 both ecml and pkdd are ranked on tier a past conferences ecml pkdd retrieved 2009 07 04 160 daelemans walter goethals bart morik katharina 2008 preface proceedings of ecml pkdd 2008 lecture notes in artificial intelligence 5211 springer pp 160 v vi doi 10 1007 978 3 540 87479 9 isbn 160 978 3 540 87478 2 160 komorowski jan zytkow jan 1997 preface proceedings of pkdd 1997 lecture notes in artificial intelligence 1263 springer pp 160 v vi doi 10 
1007 3 540 63223 9 isbn 160 978 3 540 63223 8 160 zytkow jan rauch jan 1999 preface proceedings of pkdd 1999 lecture notes in artificial intelligence 1704 springer pp 160 v vii doi 10 1007 b72280 isbn 160 978 3 540 66490 1 160 external links edit ecml pkdd web site ecml proceedings information in dblp pkdd proceedings information in dblp retrieved from http en wikipedia org w index php title ecml_pkdd amp oldid 545470678 categories artificial intelligence conferencesdata miningmachine learning navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 19 march 2013 at 17 22 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Evolutionary_data_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Evolutionary_data_mining new file mode 100644 index 00000000..e93d2b8b --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Evolutionary_data_mining @@ -0,0 +1 @@ +evolutionary data mining wikipedia the free encyclopedia evolutionary data mining from wikipedia the free encyclopedia jump to navigation search this article is an orphan as no other articles link to it please introduce links to this page from related articles suggestions may be available october 2011 evolutionary data mining or genetic data mining is an umbrella term for any data mining using evolutionary algorithms while it can be used for mining data from dna sequences 1 it is not limited to biological contexts and can be used in any classification based prediction scenario which helps predict the value of a user specified goal attribute based on the values of other attributes 2 for instance a banking institution might want to predict whether a customer s credit would be good or bad based on their age income and current savings 2 evolutionary algorithms for data mining work by creating a series of random rules to be checked against a training dataset 3 the rules which most closely fit the data are selected and are mutated 3 the process is iterated many times and eventually a rule will arise that approaches 100 similarity with the training data 2 this rule is then checked against a test dataset which was previously invisible to the genetic algorithm 2 contents 1 process 1 1 data preparation 1 2 data mining 2 see also 3 references process edit data preparation edit before databases can be mined for data using evolutionary algorithms it first has to be cleaned 2 which means incomplete noisy or inconsistent data should be repaired it is imperative that this be done before the mining takes place as it will help the algorithms produce more accurate results 3 if data comes from more than one database they can be integrated or combined at this point 
3 when dealing with large datasets it might be beneficial to also reduce the amount of data being handled 3 one common method of data reduction works by getting a normalized sample of data from the database resulting in much faster yet statistically equivalent results 3 at this point the data is split into two equal but mutually exclusive elements a test and a training dataset 2 the training dataset will be used to let rules evolve which match it closely 2 the test dataset will then either confirm or deny these rules 2 data mining edit evolutionary algorithms work by trying to emulate natural evolution 3 first a random series of rules are set on the training dataset which try to generalize the data into formulas 3 the rules are checked and the ones that fit the data best are kept the rules that do not fit the data are discarded 3 the rules that were kept are then mutated and multiplied to create new rules 3 this process iterates as necessary in order to produce a rule that matches the dataset as closely as possible 3 when this rule is obtained it is then checked against the test dataset 2 if the rule still matches the data then the rule is valid and is kept 2 if it does not match the data then it is discarded and the process begins by selecting random rules again 2 see also edit data mining evolutionary algorithm knowledge discovery pattern mining data analysis references edit wai ho au keith c c chan and xin yao a novel evolutionary data mining algorithm with applications to churn prediction ieee retrieved on 2008 12 4 a b c d e f g h i j k freitas alex a a survey of evolutionary algorithms for data mining and knowledge discovery pontif cia universidade cat lica do paran retrieved on 2008 12 4 a b c d e f g h i j k jiawei han micheline kamber data mining concepts and techniques 2006 morgan kaufmann isbn 1 55860 901 6 retrieved from http en wikipedia org w index php title evolutionary_data_mining amp oldid 524873871 categories data miningdata analysishidden categories orphaned articles from october 2011all orphaned articles navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 26 november 2012 at 00 04 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FICO b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FICO new file mode 100644 index 00000000..facd7484 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FICO @@ -0,0 +1 @@ +fico wikipedia the free encyclopedia fico from wikipedia the free encyclopedia jump to navigation search this article is about the company for the credit score see fico score not to be confused with the financing corporation a mixed ownership government financing vehicle for the 
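The evolve-and-test loop described in the evolutionary data mining passage above can be sketched in a few lines of Python. The rule encoding (one threshold per attribute), the mutation step and the fitness function are illustrative choices rather than the algorithms the article cites, and the banking-style dataset (age, income, savings versus a good/bad credit label) is generated on the fly for the example.

```python
import random

# Toy data echoing the article's banking example: (age, income, savings) -> label.
# Splitting into a training and a held-out test set mirrors the data preparation step.
def make_record():
    age, income, savings = random.randint(18, 70), random.randint(10, 100), random.randint(0, 50)
    label = "good" if income + savings > 70 else "bad"   # hidden ground truth
    return (age, income, savings), label

random.seed(0)
data = [make_record() for _ in range(200)]
train, test = data[:100], data[100:]

# A rule is a triple of thresholds: predict "good" iff every attribute meets its
# threshold (the age threshold stays 0, so the search effectively tunes the rest).
def random_rule():
    return (0, random.randint(0, 100), random.randint(0, 50))

def accuracy(rule, records):
    hits = sum((("good" if all(v >= t for v, t in zip(x, rule)) else "bad") == y)
               for x, y in records)
    return hits / len(records)

def mutate(rule):
    i = random.randrange(3)
    return tuple(t + random.randint(-5, 5) if j == i else t for j, t in enumerate(rule))

# Evolve: keep the rules that fit the training set best, mutate them, iterate.
population = [random_rule() for _ in range(30)]
for generation in range(50):
    population.sort(key=lambda r: accuracy(r, train), reverse=True)
    parents = population[:10]
    population = parents + [mutate(random.choice(parents)) for _ in range(20)]

best = max(population, key=lambda r: accuracy(r, train))
print("train accuracy:", accuracy(best, train))
print("test accuracy :", accuracy(best, test))   # validation on previously unseen data
```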
federal savings and loan insurance corporation fico type public company traded as nyse 160 fico founded 1956 headquarters san jose california key people william j lansing ceo products fico score triad blaze advisor falcon fraud debt manager model builder website www fico com fico is a public company that provides analytics and decision making services including credit scoring 1 intended to help financial services companies make complex high volume decisions 1 contents 1 history 2 location 3 fico score 4 clients 5 see also 6 references 7 external links history edit a pioneer credit score company fico was founded in 1956 as fair isaac and company by engineer bill fair and mathematician earl isaac 2 fico was first headquartered in san rafael ca united states 3 selling its first credit scoring system two years after the company s creation 2 sales of similar systems soon followed in 1987 fico went public 2 the introduction of the first general purpose fico score was in 1989 when beacon debuted at equifax 2 originally called fair isaac and company it was renamed fair isaac corporation in 2003 2 the company rebranded again in 2009 changing its name and ticker symbol to fico 4 5 fico also sells a product called falcon fraud manager for banks and corporations which is a neural network based application designed to fight fraud by proactively detecting unusual transaction patterns 6 location edit fico has its headquarters in san jose california united states 7 and has offices in asia pacific australia brazil canada china india korea malaysia russia singapore spain turkey united kingdom and the usa its main offices are located in san jose ca san rafael ca and san diego ca 8 fico score edit main article fico score a measure of credit risk fico scores are available through all of the major consumer reporting agencies in the united states and canada equifax 9 experian 9 transunion 9 prbc 10 clients edit fico provides its products and services to very large businesses and corporations across a number of fields notably banking see also edit mark n greene references edit a b about us fico official site a b c d e history fico official site fico score retrieved december 01 2011 fair isaac is now fico fico official site fico unveils new ticker symbol on new york stock exchange fico official site kathryn balint fraud fighters san diego union tribune 18 february 2005 worldwide locations fico official site fico com a b c credit reporting agencies fico official site fair isaac and prbc team up to enhance credit risk tools used by mortgage industry fico official site external links edit official website explanation of the fico credit score range http www creditscoreresource com what is a fico credit score range 401539 retrieved from http en wikipedia org w index php title fico amp oldid 556118265 categories companies listed on the new york stock exchangefinancial services companies of the united statescompanies based in minneapolis minnesota navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 21 may 2013 at 15 33 text is available under the 
creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FSA_Red_Algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FSA_Red_Algorithm new file mode 100644 index 00000000..4c65930e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/FSA_Red_Algorithm @@ -0,0 +1 @@ +fsa red algorithm wikipedia the free encyclopedia fsa red algorithm from wikipedia the free encyclopedia jump to navigation search fsa red algorithm 1 is an algorithm for data reduction which is suitable to build strong association rule using data mining method such as apriori algorithm contents 1 setting 2 benefit 3 see also 4 references setting edit fsa red algorithm was introduced by feri sulianta in international conference of information and communication technology icoict indonesia bandung wednesday march 20 2013 2 when he delivered presentation with the theme topics mining food industrys multidimensional data to produce association rules using apriori algorithm as a basis of business stratgey 3 the algorithm is used for data reduction or preprocesssing to minimize the attribute to be analyzed the goal is to make strong association rules using data mining technics related to the data which is reduced the data preprocessing in fsa red performed a few of reduction techniques such as attribute selection row selection and feature selection row selection has done by deleting all signed record which related to the attribute which need to be analyzed feature selection will remove all the unwanted attribute ended with attribute selection to eliminate the non value attributes which is no need to be included the idea base on the justification no matter the reduction has done the reduction procedure have to consider the presence of the other information in all dataset so that the reduction should be done systematically consider the linkages between attributes after the reduction proccess there would be only the in instances in the small scale with integrity by mean no information lost among the attribute in every selective instance flowchart of fsa red algorithm bind with association rule method using apriori algorithm benefit edit the flexibility according to the fsa red algorithm is the way attribute is chosen there is no limitation to exclude the attribute by mean any kind of attribute can be chose as a basis of reduction process even though there would be the attribute which is not the best compare to the others this is the benefit from the reduction procedure which might result rich association patterns of the data see also edit data mining association rule learning apriori algorithm feri sulianta references edit fsa red algorithm ferisulianta com retrieved 24 april 2013 icoict 2013 icoict org retrieved 24 april 2013 icoict 2013 agenda and presenter icoict org retrieved 24 april 2013 retrieved from http en wikipedia org w index php title fsa red_algorithm amp oldid 556714414 categories data miningdata analysisformal sciences navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to 
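The FSA-Red description above is loosely worded, but the core idea it sketches is a preprocessing step that shrinks the dataset (deleting rows tied to the attribute under analysis, removing unwanted columns, and dropping attributes that carry no value) before strong association rules are mined with Apriori. Below is a loose, illustrative reading of that idea only; the selection criteria, column names and helper names are assumptions, not the published algorithm.

    def reduce_dataset(rows, keep_columns, row_predicate):
        # rows: list of dicts, one per transaction/record.
        # Row selection: keep only the records relevant to the attribute
        # being analysed (the predicate encodes that choice).
        selected = [r for r in rows if row_predicate(r)]
        # Feature/attribute selection: drop every column that is not wanted,
        # so the later association-rule pass only sees the reduced attribute set.
        return [{k: r[k] for k in keep_columns if k in r} for r in selected]

    rows = [
        {"product": "bread", "region": "west", "promo": "yes", "returned": "no"},
        {"product": "milk",  "region": "east", "promo": "no",  "returned": "no"},
        {"product": "bread", "region": "east", "promo": "yes", "returned": "yes"},
    ]
    reduced = reduce_dataset(rows,
                             keep_columns=["product", "promo"],
                             row_predicate=lambda r: r["returned"] == "no")
    print(reduced)   # smaller table, ready for an Apriori-style frequent-itemset pass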
wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version this page was last modified on 25 may 2013 at 11 20 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Feature_vector b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Feature_vector new file mode 100644 index 00000000..7a6aa94c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Feature_vector @@ -0,0 +1 @@ +feature vector wikipedia the free encyclopedia feature vector from wikipedia the free encyclopedia jump to navigation search it has been suggested that this article be merged with feature space discuss proposed since march 2013 in pattern recognition and machine learning a feature vector is an n dimensional vector of numerical features that represent some object many algorithms in machine learning require a numerical representation of objects since such representations facilitate processing and statistical analysis when representing images the feature values might correspond to the pixels of an image when representing texts perhaps to term occurrence frequencies feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression feature vectors are often combined with weights using a dot product in order to construct a linear predictor function that is used to determine a score for making a prediction the vector space associated with these vectors is often called the feature space in order to reduce the dimensionality of the feature space a number of dimensionality reduction techniques can be employed higher level features can be obtained from already available features and added to the feature vector for example for the study of diseases the feature age is useful and is defined as age year of death year of birth this process is referred to as feature construction 1 2 feature construction is the application of a set of constructive operators to a set of existing features resulting in construction of new features examples of such constructive operators include checking for the equality conditions the arithmetic operators the array operators max s min s average s as well as other more sophisticated operators for example count s c 3 that counts the number of features in the feature vector s satisfying some condition c or for example distances to other recognition classes generalized by some accepting device feature construction has long been considered a powerful tool for increasing both accuracy and understanding of structure particularly in high dimensional problems 4 applications include studies of disease and emotion recognition from speech 5 references edit liu h motoda h 1998 feature selection for knowledge discovery and data mining kluwer academic publishers norwell ma usa 1998 piramuthu s sikora r t iterative feature construction for improving inductive learning algorithms in journal of expert systems with applications vol 36 iss 2 march 2009 
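To make the feature-vector description above concrete: for text, the feature values might be term occurrence frequencies over a fixed vocabulary, and a linear predictor is obtained by taking the dot product of the feature vector with a weight vector. The short sketch below illustrates this together with the feature-construction example (age = year of death - year of birth); the vocabulary, weights and field names are illustrative assumptions.

    from collections import Counter

    VOCABULARY = ["data", "mining", "credit", "score"]   # fixed feature order (assumed)

    def text_to_feature_vector(text):
        # Term-occurrence frequencies in the fixed vocabulary order.
        counts = Counter(text.lower().split())
        return [counts[term] for term in VOCABULARY]

    def linear_predictor(features, weights, bias=0.0):
        # score = w . x + b, the dot product of weights with the feature vector.
        return sum(w * x for w, x in zip(weights, features)) + bias

    doc = "data mining uses data and more data"
    x = text_to_feature_vector(doc)                     # [3, 1, 0, 0]
    print(x, linear_predictor(x, weights=[0.5, 0.8, -0.2, -0.1]))

    # Feature construction: derive a new feature from existing ones, as in the
    # disease-study example above (age = year of death - year of birth).
    record = {"year_of_birth": 1912, "year_of_death": 1954}
    record["age"] = record["year_of_death"] - record["year_of_birth"]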
pp 3401 3406 2009 bloedorn e michalski r data driven constructive induction a methodology and its applications ieee intelligent systems special issue on feature transformation and subset selection pp 30 37 march april 1998 breiman l friedman t olshen r stone c 1984 classification and regression trees wadsworth sidorova j badia t syntactic learning for eseda 1 tool for enhanced speech emotion detection and analysis internet technology and secured transactions conference 2009 icitst 2009 london november 9 12 ieee see also edit feature extraction feature selection dimensionality reduction this applied mathematics related article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title feature_vector amp oldid 556219315 categories machine learningdata miningapplied mathematics stubshidden categories articles to be merged from march 2013all articles to be merged navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch edit links this page was last modified on 22 may 2013 at 05 30 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Formal_concept_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Formal_concept_analysis new file mode 100644 index 00000000..622a0117 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Formal_concept_analysis @@ -0,0 +1 @@ +formal concept analysis wikipedia the free encyclopedia formal concept analysis from wikipedia the free encyclopedia jump to navigation search this article may be expanded with text translated from the corresponding article in the german wikipedia february 2012 click show on the right to read important instructions before translating view a machine translated version of the german article google s machine translation is a useful starting point for translations but translators must revise errors as necessary and confirm that the translation is accurate rather than simply copy pasting machine translated text into the english wikipedia do not translate text that appears unreliable or low quality if possible verify the text with references provided in the foreign language article after translating translated de formale begriffsanalyse must be added to the talk page to ensure copyright compliance for more guidance see wikipedia translation in information science formal concept analysis is a principled way of deriving a concept hierarchy or formal ontology from a collection of objects and their properties each concept in the hierarchy represents the set of objects sharing the same values for a certain set of properties and each sub concept in the hierarchy contains a subset of the objects in the concepts above it the term was introduced by 
rudolf wille in 1984 and builds on applied lattice and order theory that was developed by birkhoff and others in the 1930s formal concept analysis finds practical application in fields including data mining text mining machine learning knowledge management semantic web software development and biology contents 1 overview and history 2 motivation and philosophical background 3 example 4 contexts and concepts 5 concept lattice of a context 6 concept algebra of a context 7 recovering the context from the line diagram 8 efficient construction 9 tools 10 see also 11 notes 12 references 13 external links overview and history edit the original motivation of formal concept analysis was the concrete representation of complete lattices and their properties by means of formal contexts data tables that represent binary relations between objects and attributes in this theory a formal concept is defined to be a pair consisting of a set of objects the extent and a set of attributes the intent such that the extent consists of all objects that share the given attributes and the intent consists of all attributes shared by the given objects in this way formal concept analysis formalizes the notions of extension and intension pairs of formal concepts may be partially ordered by the subset relation between their sets of objects or equivalently by the superset relation between their sets of attributes this ordering results in a graded system of sub and superconcepts a concept hierarchy which can be displayed as a line diagram the family of these concepts obeys the mathematical axioms defining a lattice and is called more formally a concept lattice in french this is called a treillis de galois galois lattice because of the relation between the sets of concepts and attributes is a galois connection the theory in its present form goes back to the darmstadt research group led by rudolf wille bernhard ganter and peter burmeister where formal concept analysis originated in the early 1980s the mathematical basis however was already created by garrett birkhoff in the 1930s as part of the general lattice theory before the work of the darmstadt group there were already approaches in various french groups philosophical foundations of formal concept analysis refer in particular to charles s peirce and the educationalist hartmut von hentig motivation and philosophical background edit in his article restructuring lattice theory 1982 initiating formal concept analysis as a mathematical discipline rudolf wille starts from a discontent with the current lattice theory and pure mathematics in general the production of theoretical results often achieved by elaborate mental gymnastics were impressive but the connections between neighbouring domains even parts of a theory were getting weaker restructuring lattice theory is an attempt to reinvigorate connections with our general culture by interpreting the theory as concretely as possible and in this way to promote better communication between lattice theorists and potential users of lattice theory 1 this aim traces back to hartmut von hentig who in 1972 plaided for restructuring sciences in view of better teaching and in order to make sciences mutually available and more generally i e also without specialized knowledge criticable 2 hence by its origins formal concept analysis aims at interdisciplinarity and democratic control of research 3 it corrects the starting point of lattice theory during the development of formal logic in 19th century then and later in model theory a concept 
as unary predicate had been reduced to its extent now again the philosophy of concepts should become less abstract by considering the intent hence formal concept analysis is oriented towards the categories extension and intension of linguistics and classical conceptual logic 4 fca aims at the clarity of concepts according to charles s peirce s pragmatic maxim by unfolding observable elementary properties of the subsumed objects 3 in his late philosophy peirce assumed that logical thinking aims at perceiving reality by the triade concept judgement and conclusion mathematics is an abstraction of logic develops patterns of possible realities and therefore may support rational communication on this background wille defines the aim and meaning of formal concept analysis as mathematical theory of concepts and concept hierarchies is to support the rational communication of humans by mathematically developing appropriate conceptual structures which can be logically activated 5 example edit a concept lattice for objects consisting of the integers from 1 to 10 and attributes composite c square s even e odd o and prime p consider o 1 2 3 4 5 6 7 8 9 10 and a composite even odd prime square the smallest concept including the number 3 is the one with objects 3 5 7 and attributes odd prime for 3 has both of those attributes and 3 5 7 is the set of objects having that set of attributes the largest concept involving the attribute of being square is the one with objects 1 4 9 and attributes square for 1 4 and 9 are all the square numbers and all three of them have that set of attributes it can readily be seen that both of these example concepts satisfy the formal definitions below the full set of concepts for these objects and attributes is shown in the illustration it includes a concept for each of the original attributes the composite numbers square numbers even numbers odd numbers and prime numbers additionally it includes concepts for the even composite numbers composite square numbers that is all square numbers except 1 even composite squares odd squares odd composite squares even primes and odd primes contexts and concepts edit a formal context consists of a set of objects o a set of unary attributes a and an indication of which objects have which attributes formally it can be regarded as a bipartite graph i 160 160 o 160 160 a composite even odd prime square 1 2 3 4 5 6 7 8 9 10 a formal concept for a context is defined to be a pair oi ai such that oi o ai a every object in oi has every attribute in ai for every object in o that is not in oi there is an attribute in ai that the object does not have for every attribute in a that is not in ai there is an object in oi that does not have that attribute oi is called the extent of the concept ai the intent a context may be described as a table with the objects corresponding to the rows of the table the attributes corresponding to the columns of the table and a boolean value in the example represented graphically as a checkmark in cell x y whenever object x has value y a concept in this representation forms a maximal subarray not necessarily contiguous such that all cells within the subarray are checked for instance the concept highlighted with a different background color in the example table is the one describing the odd prime numbers and forms a 3 160 160 2 subarray in which all cells are checked 6 concept lattice of a context edit the concepts oi ai defined above can be partially ordered by inclusion if oi ai and oj aj are concepts we define a partial 
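The definitions above (a formal context as a binary relation between objects and attributes, and a formal concept as a pair of mutually closed sets) can be checked mechanically with the two derivation operators: the attributes shared by a set of objects, and the objects having a set of attributes. Below is a small sketch for the integers 1 to 10 with the attributes composite, even, odd, prime and square from the example; the helper names are assumptions.

    OBJECTS = range(1, 11)

    def attributes_of(n):
        attrs = set()
        if n % 2 == 0: attrs.add("even")
        if n % 2 == 1: attrs.add("odd")
        if n > 1 and all(n % d for d in range(2, n)): attrs.add("prime")
        if n > 1 and not all(n % d for d in range(2, n)): attrs.add("composite")
        if int(n ** 0.5) ** 2 == n: attrs.add("square")
        return attrs

    # The incidence relation I of the formal context: which objects have which attributes.
    CONTEXT = {n: attributes_of(n) for n in OBJECTS}

    def derive_objects(objects):
        # O' : the attributes shared by every object in the set.
        objects = set(objects)
        return set.intersection(*(CONTEXT[o] for o in objects)) if objects else set().union(*CONTEXT.values())

    def derive_attributes(attributes):
        # A' : the objects that have every attribute in the set.
        return {o for o in OBJECTS if set(attributes) <= CONTEXT[o]}

    def is_formal_concept(extent, intent):
        # (O, A) is a formal concept exactly when O' = A and A' = O.
        return derive_objects(extent) == set(intent) and derive_attributes(intent) == set(extent)

    print(is_formal_concept({3, 5, 7}, {"odd", "prime"}))   # True, as in the example above
    print(is_formal_concept({1, 4, 9}, {"square"}))         # True
    print(is_formal_concept({2, 4, 6}, {"even"}))           # False: "even" also covers 8 and 10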
order by saying that oi ai oj aj whenever oi oj equivalently oi ai oj aj whenever aj ai every pair of concepts in this partial order has a unique greatest lower bound meet the greatest lower bound of oi ai and oj aj is the concept with objects oi oj it has as its attributes the union of ai aj and any additional attributes held by all objects in oi oj symmetrically every pair of concepts in this partial order has a unique least upper bound join the least upper bound of oi ai and oj aj is the concept with attributes ai aj it has as its objects the union of oi oj and any additional objects that have all attributes in ai aj these meet and join operations satisfy the axioms defining a lattice in fact by considering infinite meets and joins analogously to the binary meets and joins defined above one sees that this is a complete lattice it may be viewed as the dedekind macneille completion of a partially ordered set of height two in which the elements of the partial order are the objects and attributes of a and in which two elements x and y satisfy x 160 160 y exactly when x is an object that has attribute 160 y any finite lattice may be generated as the concept lattice for some context for let l be a finite lattice and form a context in which the objects and the attributes both correspond to elements of l in this context let object x have attribute y exactly when x and y are ordered as x y in the lattice then the concept lattice of this context is isomorphic to l itself 7 this construction may be interpreted as forming the dedekind macneille completion of l which is known to produce an isomorphic lattice from any finite lattice concept algebra of a context edit modelling negation in a formal context is somewhat problematic because the complement o oi a ai of a concept oi ai is in general not a concept however since the concept lattice is complete one can consider the join oi ai of all concepts oj aj that satisfy oj 160 160 g oi or dually the meet oi ai of all concepts satisfying aj 160 160 g ai these two operations are known as weak negation and weak opposition respectively this can be expressed in terms of the derivative functions the derivative of a set oi 160 160 o of objects is the set oi 160 160 a of all attributes that hold for all objects in oi the derivative of a set ai 160 160 a of attributes is the set ai 160 160 o of all objects that have all attributes in ai a pair oi ai is a concept if and only if oi 160 160 ai and ai 160 160 oi using this function weak negation can be written as oi ai g a g a and weak opposition can be written as oi ai m b m b the concept lattice equipped with the two additional operations and is known as the concept algebra of a context concept algebras are a generalization of power sets weak negation on a concept lattice l is a weak complementation i e an order reversing map 160 l 160 160 l which satisfies the axioms x 160 160 x and x y 160 160 x y 160 160 x weak composition is a dual weak complementation a bounded lattice such as a concept algebra which is equipped with a weak complementation and a dual weak complementation is called a weakly dicomplemented lattice weakly dicomplemented lattices generalize distributive orthocomplemented lattices i e boolean algebras 8 9 recovering the context from the line diagram edit the line diagram of the concept lattice encodes enough information to recover the original context from which it was formed each object of the context corresponds to a lattice element the element with the minimal object set that contains that 
object and with an attribute set consisting of all attributes of the object symmetrically each attribute of the context corresponds to a lattice element the one with the minimal attribute set containing that attribute and with an object set consisting of all objects with that attribute we may label the nodes of the line diagram with the objects and attributes they correspond to with this labeling object x has attribute y if and only if there exists a monotonic path from x to y in the diagram 10 efficient construction edit kuznetsov amp obiedkov 2001 survey the many algorithms that have been developed for constructing concept lattices these algorithms vary in many details but are in general based on the idea that each edge of the line diagram of the concept lattice connects some concept c to the concept formed by the join of c with a single object thus one can build up the concept lattice one concept at a time by finding the neighbors in the line diagram of known concepts starting from the concept with an empty set of objects the amount of time spent to traverse the entire concept lattice in this way is polynomial in the number of input objects and attributes per generated concept tools edit many fca software applications are available today the main purpose of these tools varies from formal context creation to formal concept mining and generating the concepts lattice of a given formal context and the corresponding association rules most of these tools are academic and still under active development one can find a non exhaustive list of fca tools in the fca software website most of these tools are open source applications like conexp toscanaj lattice miner 11 coron fcabedrock etc see also edit biclustering description logic cluster analysis concept mining conceptual clustering factor analysis notes edit rudolf wille restructuring lattice theory an approach based on hierarchies of concepts reprint in icfca 09 proceedings of the 7th international conference on formal concept analysis berlin heidelberg 2009 p 314 hartmut von hentig magier oder magister ber die einheit der wissenschaft im verst ndigungsproze klett 1972 suhrkamp 1974 cited after karl erich wolff ordnung wille und begriff ernst schr der zentrum f r begriffliche wissensverarbeitung darmstadt 2003 a b johannes wollbold attribute exploration of gene regulatory processes phd thesis university of jena 2011 p 9 bernhard ganter bernhard and rudolf wille formal concept analysis mathematical foundations springer berlin isbn 3 540 62771 5 p 1 rudolf wille formal concept analysis as mathematical theory of concepts and concept hierarchies in b ganter et al formal concept analysis foundations and applications springer 2005 p 1f wolff section 2 stumme theorem 1 wille rudolf 2000 boolean concept logic in ganter b mineau g w iccs 2000 conceptual structures logical linguistic and computational issues lnai 1867 springer pp 160 317 331 isbn 160 978 3 540 67859 5 160 kwuida l onard 2004 dicomplemented lattices a contextual generalization of boolean algebras shaker verlag isbn 160 978 3 8322 3350 1 160 wolff section 3 boumedjout lahcen and leonard kwuida lattice miner a tool for concept lattice construction and exploration in suplementary proceeding of international conference on formal concept analysis icfca 10 2010 references edit ganter bernhard stumme gerd wille rudolf eds 2005 formal concept analysis foundations and applications lecture notes in artificial intelligence no 3626 springer verlag isbn 160 3 540 27891 5 160 ganter bernhard wille 
rudolf 1998 formal concept analysis mathematical foundations springer verlag berlin isbn 160 3 540 62771 5 160 translated by c franzke carpineto claudio romano giovanni 2004 concept data analysis theory and applications wiley isbn 160 978 0 470 85055 8 160 kuznetsov sergei o obiedkov sergei a 2001 algorithms for the construction of concept lattices and their diagram graphs principles of data mining and knowledge discovery lecture notes in computer science 2168 springer verlag pp 160 289 300 doi 10 1007 3 540 44794 6_24 isbn 160 978 3 540 42534 2 160 wolff karl erich 1994 a first course in formal concept analysis in f faulbaum statsoft 93 gustav fischer verlag pp 160 429 438 160 davey b a priestley h a 2002 3 formal concept analysis introduction to lattices and order cambridge university press isbn 160 978 0 521 78451 1 160 external links edit a formal concept analysis homepage demo 11th international conference on formal concept analysis icfca 2013 dresden germany may 21 24 2013 retrieved from http en wikipedia org w index php title formal_concept_analysis amp oldid 560607817 categories machine learninglattice theorydata miningontology information science hidden categories articles to be expanded from february 2012all articles to be expandedarticles needing translation from german wikipedia navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol fran ais suomi edit links this page was last modified on 19 june 2013 at 13 58 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/GSP_Algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/GSP_Algorithm new file mode 100644 index 00000000..71a92d50 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/GSP_Algorithm @@ -0,0 +1 @@ +gsp algorithm wikipedia the free encyclopedia gsp algorithm from wikipedia the free encyclopedia jump to navigation search this article does not cite any references or sources please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed may 2007 gsp algorithm generalized sequential pattern algorithm is an algorithm used for sequence mining the algorithms for solving sequence mining problems are mostly based on the a priori level wise algorithm one way to use the level wise paradigm is to first discover all the frequent items in a level wise fashion it simply means counting the occurrences of all singleton elements in the database then the transactions are filtered by removing the non frequent items at the end of this step each transaction consists of only the frequent elements it originally contained this modified database becomes an input to the gsp algorithm this process requires one pass over the 
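The first, level-wise pass described here (count all singleton items, then filter the non-frequent items out of every transaction so each transaction consists of only the frequent elements it originally contained) is easy to express directly; the sketch below uses only the standard library and assumed names. The later GSP passes over candidate k-sequences described below follow the same counting pattern.

    from collections import Counter

    def frequent_items(sequence_db, min_support):
        # Pass 1: for every item, count the number of data sequences containing it.
        counts = Counter()
        for sequence in sequence_db:
            counts.update({item for event in sequence for item in event})
        return {item for item, c in counts.items() if c >= min_support}

    def filter_database(sequence_db, frequent):
        # Remove the non-frequent items so each transaction consists of only
        # the frequent elements it originally contained.
        filtered = []
        for sequence in sequence_db:
            events = [tuple(i for i in event if i in frequent) for event in sequence]
            filtered.append([e for e in events if e])   # drop events left empty
        return filtered

    # A sequence database: each data sequence is a list of events (itemsets).
    db = [
        [("a",), ("b", "c")],
        [("a", "c"), ("b",)],
        [("d",), ("a",)],
    ]
    frequent = frequent_items(db, min_support=2)        # {'a', 'b', 'c'}
    print(filter_database(db, frequent))                # input to the GSP passes proper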
whole database gsp algorithm makes multiple database passes in the first pass all single items 1 sequences are counted from the frequent items a set of candidate 2 sequences are formed and another pass is made to identify their frequency the frequent 2 sequences are used to generate the candidate 3 sequences and this process is repeated until no more frequent sequences are found there are two main steps in the algorithm candidate generation given the set of frequent k 1 frequent sequences f k 1 the candidates for the next pass are generated by joining f k 1 with itself a pruning phase eliminates any sequence at least one of whose subsequences is not frequent support counting normally a hash tree based search is employed for efficient support counting finally non maximal frequent sequences are removed algorithm edit f1 the set of frequent 1 sequence k 2 do while f k 1 null generate candidate sets ck set of candidate k sequences for all input sequences s in the database d do increment count of all a in ck if s supports a fk a ck such that its frequency exceeds the threshold k k 1 result set of all frequent sequences is the union of all fks end do end do the above algorithm looks like the apriori algorithm one main difference is however the generation of candidate sets let us assume that a b and a c are two frequent 2 sequences the items involved in these sequences are a b and a c respectively the candidate generation in a usual apriori style would give a b c as a 3 itemset but in the present context we get the following 3 sequences as a result of joining the above 2 sequences a b c a c b and a bc the candidate generation phase takes this into account the gsp algorithm discovers frequent sequences allowing for time constraints such as maximum gap and minimum gap among the sequence elements moreover it supports the notion of a sliding window i e of a time interval within which items are observed as belonging to the same event even if they originate from different events see also edit sequence mining references edit data mining techniques pujari arun k 2001 universities press isbn 160 81 7371 380 4 160 missing or empty title help pp 256 260 at google books retrieved from http en wikipedia org w index php title gsp_algorithm amp oldid 538332900 categories data mininghidden categories articles lacking sources from may 2007all articles lacking sourcespages with citations lacking titlesarticles with example pseudocode navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 15 february 2013 at 02 01 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Gene_expression_programming b/ss2013/1_Web 
Mining/Uebungen/5_Uebung/abgabe/articles_text/Gene_expression_programming new file mode 100644 index 00000000..0e594364 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Gene_expression_programming @@ -0,0 +1 @@ +gene expression programming wikipedia the free encyclopedia gene expression programming from wikipedia the free encyclopedia jump to navigation search a major contributor to this article appears to have a close connection with its subject it may require cleanup to comply with wikipedia s content policies particularly neutral point of view please discuss further on the talk page november 2012 gene expression programming gep is an evolutionary algorithm that creates computer programs or models these computer programs are complex tree structures that learn and adapt by changing their sizes shapes and composition much like a living organism and like living organisms the computer programs of gep are also encoded in simple linear chromosomes of fixed length thus gep is a genotype phenotype system benefiting from a simple genome to keep and transmit the genetic information and a complex phenotype to explore the environment and adapt to it contents 1 background 2 encoding the genotype 3 expression trees the phenotype 4 k expressions and genes 5 multigenic chromosomes 6 cells and code reuse 6 1 homeotic genes and the cellular system 6 2 multiple main programs and multicellular systems 7 other levels of complexity 8 the basic gene expression algorithm 8 1 populations of programs 8 2 fitness functions and the selection environment 8 2 1 the selection environment or training data 8 2 2 fitness functions 8 2 2 1 fitness functions for regression 8 2 2 2 fitness functions for classification and logistic regression 8 2 2 3 fitness functions for boolean problems 8 3 selection and elitism 8 4 reproduction with modification 8 4 1 replication and selection 8 4 2 mutation 8 4 3 recombination 8 4 4 transposition 8 4 5 inversion 8 4 6 other genetic operators 9 the gep rnc algorithm 10 neural networks 11 decision trees 12 criticism 13 software 13 1 commercial applications 13 2 open source libraries 14 further reading 15 see also 16 references 17 external links background edit evolutionary algorithms use populations of individuals select individuals according to fitness and introduce genetic variation using one or more genetic operators their use in artificial computational systems dates back to the 1950s where they were used to solve optimization problems e g box 1957 1 and friedman 1959 2 but it was with the introduction of evolution strategies by rechenberg in 1965 3 that evolutionary algorithms gained popularity a good overview text on evolutionary algorithms is the book an introduction to genetic algorithms by mitchell 1996 4 gene expression programming 5 belongs to the family of evolutionary algorithms and is closely related to genetic algorithms and genetic programming from genetic algorithms it inherited the linear chromosomes of fixed length and from genetic programming it inherited the expressive parse trees of varied sizes and shapes in gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype creating a genotype phenotype system this genotype phenotype system is multigenic thus encoding multiple parse trees in each chromosome this means that the computer programs created by gep are composed of multiple parse trees because these parse trees are the result of gene expression in gep they are called expression trees 
encoding the genotype edit the genome of gene expression programming consists of a linear symbolic string or chromosome of fixed length composed of one or more genes of equal size these genes despite their fixed length code for expression trees of different sizes and shapes an example of a chromosome with two genes each of size 9 is the string position zero indicates the start of each gene 012345678012345678 l a baccd clabacd where l represents the natural logarithm function and a b c and d represent the variables and constants used in a problem expression trees the phenotype edit as shown above the genes of gene expression programming have all the same size however these fixed length strings code for expression trees of different sizes this means that the size of the coding regions varies from gene to gene allowing for adaptation and evolution to occur smoothly for example the mathematical expression can also be represented as an expression tree where q represents the square root function this kind of expression tree consists of the phenotypic expression of gep genes whereas the genes are linear strings encoding these complex structures for this particular example the linear string corresponds to 01234567 q abcd which is the straightforward reading of the expression tree from top to bottom and from left to right these linear strings are called k expressions from karva notation going from k expressions to expression trees is also very simple for example the following k expression 01234567890 q b baqba is composed of two different terminals the variables a and b two different functions of two arguments and and a function of one argument q its expression gives k expressions and genes edit the k expressions of gene expression programming correspond to the region of genes that gets expressed this means that there might be sequences in the genes that are not expressed which is indeed true for most genes the reason for these noncoding regions is to provide a buffer of terminals so that all k expressions encoded in gep genes correspond always to valid programs or expressions the genes of gene expression programming are therefore composed of two different domains a head and a tail each with different properties and functions the head is used mainly to encode the functions and variables chosen to solve the problem at hand whereas the tail while also used to encode the variables provides essentially a reservoir of terminals to ensure that all programs are error free for gep genes the length of the tail is given by the formula where h is the head s length and nmax is maximum arity for example for a gene created using the set of functions f q and the set of terminals t a b nmax 2 and if we choose a head length of 15 then t 15 2 1 1 16 which gives a gene length g of 15 16 31 the randomly generated string below is an example of one such gene 0123456789012345678901234567890 b a aqab b babbabbbababbaaa it encodes the expression tree which in this case only uses 8 of the 31 elements that constitute the gene it s not hard to see that despite their fixed length each gene has the potential to code for expression trees of different sizes and shapes with the simplest composed of only one node when the first element of a gene is a terminal and the largest composed of as many nodes as there are elements in the gene when all the elements in the head are functions with maximum arity it s also not hard to see that it is trivial to implement all kinds of genetic modification mutation inversion insertion recombination 
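The head/tail layout and the breadth-first ("karva") reading described above can be reproduced in a few lines: the tail length follows t = h * (nmax - 1) + 1, and a K-expression is decoded level by level, each level consuming as many symbols as the total arity of the level before it, so any leftover tail symbols form the noncoding buffer. A sketch under an assumed tiny function set; the evaluation helper is illustrative and not part of the article.

    import math
    import operator

    # Arity of each symbol in an assumed function/terminal set (Q = square root).
    ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1, "a": 0, "b": 0}
    APPLY = {"+": operator.add, "-": operator.sub, "*": operator.mul,
             "/": operator.truediv, "Q": math.sqrt}

    def tail_length(head_length, max_arity):
        # t = h * (n_max - 1) + 1; e.g. h = 15, n_max = 2 gives t = 16, gene length 31.
        return head_length * (max_arity - 1) + 1

    def decode(gene):
        # Breadth-first ("karva") reading: each level uses as many symbols as the
        # total arity of the previous level, so decoding stops inside the gene and
        # the remaining tail symbols stay noncoding.
        levels, start, width = [], 0, 1
        while width > 0:
            level = list(gene[start:start + width])
            levels.append(level)
            start += width
            width = sum(ARITY[s] for s in level)
        return levels

    def evaluate(levels, variables):
        # Evaluate bottom-up: the children of a node are consecutive nodes on the
        # next level, allocated left to right according to each node's arity.
        values = []
        for level in reversed(levels):
            children = iter(values)
            level_values = []
            for s in level:
                args = [next(children) for _ in range(ARITY[s])]
                level_values.append(APPLY[s](*args) if args else variables[s])
            values = level_values
        return values[0]

    head = "+Q*"                                     # h = 3
    print(tail_length(len(head), max_arity=2))       # 4, so the whole gene has 7 symbols
    gene = head + "aabb"                             # the tail holds terminals only
    levels = decode(gene)
    print(levels)                                    # [['+'], ['Q', '*'], ['a', 'a', 'b']]; last 'b' is noncoding
    print(evaluate(levels, {"a": 4.0, "b": 3.0}))    # sqrt(4) + 4*3 = 14.0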
and so on with the guarantee that all resulting offspring encode correct error free programs multigenic chromosomes edit the chromosomes of gene expression programming are usually composed of more than one gene of equal length each gene codes for a sub expression tree sub et or sub program then the sub ets can interact with one another in different ways forming a more complex program the figure shows an example of a program composed of three sub ets expression of gep genes as sub ets a a three genic chromosome with the tails shown in bold b the sub ets encoded by each gene in the final program the sub ets could be linked by addition or some other function as there are no restrictions to the kind of linking function one might choose some examples of more complex linkers include taking the average the median the midrange thresholding their sum to make a binomial classification applying the sigmoid function to compute a probability and so on these linking functions are usually chosen a priori for each problem but they can also be evolved elegantly and efficiently by the cellular system 6 7 of gene expression programming cells and code reuse edit in gene expression programming homeotic genes control the interactions of the different sub ets or modules of the main program the expression of such genes results in different main programs or cells that is they determine which genes are expressed in each cell and how the sub ets of each cell interact with one another in other words homeotic genes determine which sub ets are called upon and how often in which main program or cell and what kind of connections they establish with one another homeotic genes and the cellular system edit homeotic genes have exactly the same kind of structural organization as normal genes and they are built using an identical process they also contain a head domain and a tail domain with the difference that the heads contain now linking functions and a special kind of terminals genic terminals that represent the normal genes the expression of the normal genes results as usual in different sub ets which in the cellular system are called adfs automatically defined functions as for the tails they contain only genic terminals that is derived features generated on the fly by the algorithm for example the chromosome in the figure has three normal genes and one homeotic gene and encodes a main program that invokes three different functions a total of four times linking them in a particular way expression of a unicellular system with three adfs a the chromosome composed of three conventional genes and one homeotic gene shown in bold b the adfs encoded by each conventional gene c the main program or cell from this example it is clear that the cellular system not only allows the unconstrained evolution of linking functions but also code reuse and it shouldn t be hard to implement recursion in this system multiple main programs and multicellular systems edit multicellular systems are composed of more than one homeotic gene each homeotic gene in this system puts together a different combination of sub expression trees or adfs creating multiple cells or main programs for example the program shown in the figure was created using a cellular system with two cells and three normal genes expression of a multicellular system with three adfs and two main programs a the chromosome composed of three conventional genes and two homeotic genes shown in bold b the adfs encoded by each conventional gene c two different main programs expressed in two 
different cells the applications of these multicellular systems are multiple and varied and like the multigenic systems they can be used both in problems with just one output and in problems with multiple outputs other levels of complexity edit the head tail domain of gep genes both normal and homeotic is the basic building block of all gep algorithms however gene expression programming also explores other chromosomal organizations that are more complex than the head tail structure essentially these complex structures consist of functional units or genes with a basic head tail domain plus one or more extra domains these extra domains usually encode random numerical constants that the algorithm relentlessly fine tunes in order to find a good solution for instance these numerical constants may be the weights or factors in a function approximation problem see the gep rnc algorithm below they may be the weights and thresholds of a neural network see the gep nn algorithm below the numerical constants needed for the design of decision trees see the gep dt algorithm below the weights needed for polynomial induction or the random numerical constants used to discover the parameter values in a parameter optimization task the basic gene expression algorithm edit the fundamental steps of the basic gene expression algorithm are listed below in pseudocode 1 select function set 2 select terminal set 3 load dataset for fitness evaluation 4 create chromosomes of initial population randomly 5 for each program in population a express chromosome b execute program c evaluate fitness 6 verify stop condition 7 select programs 8 replicate selected programs to form the next population 9 modify chromosomes using genetic operators 10 go to step 5 the first four steps prepare all the ingredients that are needed for the iterative loop of the algorithm steps 5 through 10 of these preparative steps the crucial one is the creation of the initial population which is created randomly using the elements of the function and terminal sets populations of programs edit like all evolutionary algorithms gene expression programming works with populations of individuals which in this case are computer programs therefore some kind of initial population must be created to get things started subsequent populations are descendants via selection and genetic modification of the initial population in the genotype phenotype system of gene expression programming it is only necessary to create the simple linear chromosomes of the individuals without worrying about the structural soundness of the programs they code for as their expression always results in syntactically correct programs fitness functions and the selection environment edit fitness functions and selection environments called training datasets in machine learning are the two facets of fitness and are therefore intricately connected indeed the fitness of a program depends not only on the cost function used to measure its performance but also on the training data chosen to evaluate fitness the selection environment or training data edit the selection environment consists of the set of training records which are also called fitness cases these fitness cases could be a set of observations or measurements concerning some problem and they form what is called the training dataset the quality of the training data is essential for the evolution of good solutions a good training set should be representative of the problem at hand and also well balanced otherwise the algorithm might get 
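The ten pseudocode steps listed above map directly onto a conventional evolutionary loop. Below is a compressed sketch of that loop; chromosome creation, expression and the genetic operators are passed in as callables (the karva decoding shown earlier could back express_and_evaluate), and all function names are assumptions.

    import random

    def run_gep(create_chromosome, express_and_evaluate, modify,
                population_size=50, generations=100, target_fitness=None, seed=0):
        rng = random.Random(seed)
        # Steps 1-4: function set, terminal set and training data are assumed to be
        # captured inside create_chromosome and express_and_evaluate.
        population = [create_chromosome(rng) for _ in range(population_size)]
        best = None
        for _ in range(generations):
            # Step 5: express each chromosome, execute the program, evaluate fitness.
            scored = sorted(((express_and_evaluate(c), c) for c in population),
                            key=lambda pair: pair[0], reverse=True)
            if best is None or scored[0][0] > best[0]:
                best = scored[0]
            # Step 6: stop condition.
            if target_fitness is not None and best[0] >= target_fitness:
                break
            # Steps 7-8: fitness-proportional (roulette-wheel) selection plus simple
            # elitism: the best chromosome is cloned unchanged into the next generation.
            weights = [max(f, 0.0) + 1e-12 for f, _ in scored]
            parents = rng.choices([c for _, c in scored], weights=weights,
                                  k=population_size - 1)
            # Step 9: modify the selected chromosomes with the genetic operators.
            population = [scored[0][1]] + [modify(p, rng) for p in parents]
        return best   # (fitness, chromosome)

    # Smoke test with a stand-in "chromosome": a single number steered towards 0.5.
    fit, chrom = run_gep(create_chromosome=lambda rng: rng.uniform(-1.0, 1.0),
                         express_and_evaluate=lambda c: 1.0 / (1.0 + abs(c - 0.5)),
                         modify=lambda c, rng: c + rng.gauss(0.0, 0.1),
                         target_fitness=0.99)
    print(fit, chrom)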
stuck at some local optimum in addition it is also important to avoid using unnecessarily large datasets for training as this will slow things down unnecessarily a good rule of thumb is to choose enough records for training to enable a good generalization in the validation data and leave the remaining records for validation and testing fitness functions edit broadly speaking there are essentially three different kinds of problems based on the kind of prediction being made 1 problems involving numeric continuous predictions 2 problems involving categorical or nominal predictions both binomial and multinomial 3 problems involving binary or boolean predictions the first type of problem goes by the name of regression the second is known as classification with logistic regression as a special case where besides the crisp classifications like yes or no a probability is also attached to each outcome and the last one is related to boolean algebra and logic synthesis fitness functions for regression edit in regression the response or dependent variable is numeric usually continuous and therefore the output of a regression model is also continuous so it s quite straightforward to evaluate the fitness of the evolving models by comparing the output of the model to the value of the response in the training data there are several basic fitness functions for evaluating model performance with the most common being based on the error or residual between the model output and the actual value such functions include the mean squared error root mean squared error mean absolute error relative squared error root relative squared error relative absolute error and others all these standard measures offer a fine granularity or smoothness to the solution space and therefore work very well for most applications but some problems might require a coarser evolution such as determining if a prediction is within a certain interval for instance less than 10 of the actual value however even if one is only interested in counting the hits that is a prediction that is within the chosen interval making populations of models evolve based on just the number of hits each program scores is usually not very efficient due to the coarse granularity of the fitness landscape thus the solution usually involves combining these coarse measures with some kind of smooth function such as the standard error measures listed above fitness functions based on the correlation coefficient and r square are also very smooth for regression problems these functions work best by combining them with other measures because by themselves they only tend to measure correlation not caring for the range of values of the model output so by combining them with functions that work at approximating the range of the target values they form very efficient fitness functions for finding models with good correlation and good fit between predicted and actual values fitness functions for classification and logistic regression edit the design of fitness functions for classification and logistic regression takes advantage of three different characteristics of classification models the most obvious is just counting the hits that is if a record is classified correctly it is counted as a hit this fitness function is very simple and works well for simple problems but for more complex problems or datasets highly unbalanced it gives poor results one way to improve this type of hits based fitness function consists of expanding the notion of correct and incorrect classifications in 
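As the regression discussion above suggests, a hits-only fitness is too coarse on its own, so it is usually blended with a smooth error measure such as RMSE. Below is a small illustrative fitness function along those lines; the 10% hit interval and the way the two terms are combined are arbitrary assumptions for the sketch.

    import math

    def rmse(predicted, actual):
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    def hits(predicted, actual, tolerance=0.10):
        # A "hit" is a prediction within 10% of the actual value (assumed interval).
        return sum(1 for p, a in zip(predicted, actual)
                   if abs(p - a) <= tolerance * abs(a))

    def regression_fitness(predicted, actual):
        # Coarse hit count for the behaviour we want, smoothed by 1/(1+RMSE)
        # so that near-misses still produce a usable fitness gradient.
        return hits(predicted, actual) + 1.0 / (1.0 + rmse(predicted, actual))

    actual    = [10.0, 20.0, 30.0, 40.0]
    predicted = [10.5, 19.0, 35.0, 80.0]
    print(regression_fitness(predicted, actual))   # 2 hits plus a small smoothing term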
a binary classification task correct classifications can be 00 or 11 the 00 representation means that a negative case represented by 0 was correctly classified whereas the 11 means that a positive case represented by 1 was correctly classified classifications of the type 00 are called true negatives tn and 11 true positives tp there are also two types of incorrect classifications and they are represented by 01 and 10 they are called false positives fp when the actual value is 0 and the model predicts a 1 and false negatives fn when the target is 1 and the model predicts a 0 the counts of tp tn fp and fn are usually kept on a table known as the confusion matrix confusion matrix for a binomial classification task so by counting the tp tn fp and fn and further assigning different weights to these four types of classifications it is possible to create smoother and therefore more efficient fitness functions some popular fitness functions based on the confusion matrix include sensitivity specificity recall precision f measure jaccard similarity matthews correlation coefficient and cost gain matrix which combines the costs and gains assigned to the 4 different types of classifications these functions based on the confusion matrix are quite sophisticated and are adequate to solve most problems efficiently but there is another dimension to classification models which is key to exploring more efficiently the solution space and therefore results in the discovery of better classifiers this new dimension involves exploring the structure of the model itself which includes not only the domain and range but also the distribution of the model output and the classifier margin by exploring this other dimension of classification models and then combining the information about the model with the confusion matrix it is possible to design very sophisticated fitness functions that allow the smooth exploration of the solution space for instance one can combine some measure based on the confusion matrix with the mean squared error evaluated between the raw model outputs and the actual values or combine the f measure with the r square evaluated for the raw model output and the target or the cost gain matrix with the correlation coefficient and so on more exotic fitness functions that explore model granularity include the area under the roc curve and rank measure also related to this new dimension of classification models is the idea of assigning probabilities to the model output which is what is done in logistic regression then it is also possible to use these probabilities and evaluate the mean squared error or some other similar measure between the probabilities and the actual values then combine this with the confusion matrix to create very efficient fitness functions for logistic regression popular examples of fitness functions based on the probabilities include maximum likelihood estimation and hinge loss fitness functions for boolean problems edit in logic there is no model structure as defined above for classification and logistic regression to explore the domain and range of logical functions comprises only 0 s and 1 s or false and true so the fitness functions available for boolean algebra can only be based on the hits or on the confusion matrix as explained in the section above selection and elitism edit roulette wheel selection is perhaps the most popular selection scheme used in evolutionary computation it involves mapping the fitness of each program to a slice of the roulette wheel proportional to its 
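Returning to the confusion-matrix fitness functions described above: counting TP, TN, FP and FN and combining them into one of the listed measures takes only a few lines. The sketch below multiplies sensitivity by specificity, which rewards models that do well on both classes of an unbalanced dataset; the choice of combination is an illustrative assumption.

    def confusion_matrix(predicted, actual):
        # predicted/actual are 0/1 labels; returns (tp, tn, fp, fn).
        tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
        tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
        fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
        fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
        return tp, tn, fp, fn

    def classification_fitness(predicted, actual):
        # Sensitivity * specificity: both classes must be predicted well.
        tp, tn, fp, fn = confusion_matrix(predicted, actual)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        return sensitivity * specificity

    actual    = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 1, 0]
    print(confusion_matrix(predicted, actual))        # (2, 4, 1, 1)
    print(classification_fitness(predicted, actual))  # (2/3) * (4/5) = 0.53...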
fitness then the roulette is spun as many times as there are programs in the population in order to keep the population size constant so with roulette wheel selection programs are selected both according to fitness and the luck of the draw which means that some times the best traits might be lost however by combining roulette wheel selection with the cloning of the best program of each generation one guarantees that at least the very best traits are not lost this technique of cloning the best of generation program is known as simple elitism and is used by most stochastic selection schemes reproduction with modification edit the reproduction of programs involves first the selection and then the reproduction of their genomes genome modification is not required for reproduction but without it adaptation and evolution won t take place replication and selection edit the selection operator selects the programs for the replication operator to copy depending on the selection scheme the number of copies one program originates may vary with some programs getting copied more than once while others are copied just once or not at all in addition selection is usually set up so that the population size remains constant from one generation to another the replication of genomes in nature is very complex and it took scientists a long time to discover the dna double helix and propose a mechanism for its replication but the replication of strings is trivial in artificial evolutionary systems where only an instruction to copy strings is required to pass all the information in the genome from generation to generation the replication of the selected programs is a fundamental piece of all artificial evolutionary systems but for evolution to occur it needs to be implemented not with the usual precision of a copy instruction but rather with a few errors thrown in indeed genetic diversity is created with genetic operators such as mutation recombination transposition inversion and many others mutation edit in gene expression programming mutation is the most important genetic operator 8 it changes genomes by changing an element by another the accumulation of many small changes over time can create great diversity in gene expression programming mutation is totally unconstrained which means that in each gene domain any domain symbol can be replaced by another for example in the heads of genes any function can be replaced by a terminal or another function regardless of the number of arguments in this new function and a terminal can be replaced by a function or another terminal recombination edit recombination usually involves two parent chromosomes to create two new chromosomes by combining different parts from the parent chromosomes and as long as the parent chromosomes are aligned and the exchanged fragments are homologous that is occupy the same position in the chromosome the new chromosomes created by recombination will always encode syntactically correct programs different kinds of crossover are easily implemented either by changing the number of parents involved there s no reason for choosing only two the number of split points or the way one chooses to exchange the fragments for example either randomly or in some orderly fashion for example gene recombination which is a special case of recombination can be done by exchanging homologous genes genes that occupy the same position in the chromosome or by exchanging genes chosen at random from any position in the chromosome transposition edit transposition involves the 
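The mutation and recombination operators described above are easy to state for a single head/tail gene: head positions may receive any function or terminal, tail positions only terminals (which is what keeps every offspring syntactically valid), and one-point recombination swaps aligned suffixes between two parents of identical structure. A sketch with assumed symbol sets and head length.

    import random

    FUNCTIONS = ["+", "-", "*", "Q"]       # assumed function set (Q = square root)
    TERMINALS = ["a", "b"]
    HEAD_LENGTH = 7                        # tail length is then 8 for max arity 2

    def mutate(gene, rate, rng):
        # Head positions can take any symbol, tail positions only terminals,
        # so the mutated gene still decodes to a valid expression tree.
        symbols = list(gene)
        for i in range(len(symbols)):
            if rng.random() < rate:
                pool = FUNCTIONS + TERMINALS if i < HEAD_LENGTH else TERMINALS
                symbols[i] = rng.choice(pool)
        return "".join(symbols)

    def one_point_recombination(parent1, parent2, rng):
        # Swap everything after a common crossover point; because the parents are
        # aligned and share the same head/tail structure, both children stay valid.
        point = rng.randrange(1, len(parent1))
        return (parent1[:point] + parent2[point:],
                parent2[:point] + parent1[point:])

    rng = random.Random(1)
    p1 = "+Q*-abaababbbab"     # 7-symbol head + 8-symbol tail
    p2 = "Qa+*babbbaababa"
    print(mutate(p1, rate=0.1, rng=rng))
    print(one_point_recombination(p1, p2, rng=rng))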
Transposition: Transposition involves the introduction of an insertion sequence somewhere in a chromosome. In gene expression programming, insertion sequences may originate anywhere in the chromosome, but they are inserted only in the heads of genes; this guarantees that even insertion sequences taken from the tails result in error-free programs. For transposition to work properly it must preserve chromosome length and gene structure, so in gene expression programming transposition can be implemented using two different methods: the first creates a shift at the insertion site followed by a deletion at the end of the head, while the second overwrites the local sequence at the target site and is therefore easier to implement. Both methods can be made to operate between chromosomes, within a chromosome, or even within a single gene. Inversion: Inversion is an interesting operator, especially powerful for combinatorial optimization.[9] It consists of inverting a small sequence within a chromosome. In gene expression programming it is easily implemented in all gene domains, and in all cases the offspring produced are syntactically correct: for any gene domain, a sequence ranging from two elements up to the size of the domain itself is chosen at random within that domain and then inverted. Other genetic operators: Several other genetic operators exist, and in gene expression programming, with its different genes and gene domains, the possibilities are endless. Operators such as one-point recombination, two-point recombination, gene recombination, uniform recombination, gene transposition, root transposition, domain-specific mutation, domain-specific inversion, and domain-specific transposition are easily implemented and widely used. The GEP-RNC algorithm: Numerical constants are essential elements of mathematical and statistical models, and it is therefore important to allow their integration in the models designed by evolutionary algorithms. Gene expression programming solves this problem elegantly through an extra gene domain, the Dc, for handling random numerical constants (RNCs). Combining this domain with a special placeholder terminal for the RNCs produces a richly expressive system. Structurally, the Dc comes after the tail, has a length equal to the size of the tail t, and is composed of the symbols used to represent the RNCs. Consider, for example, a simple chromosome composed of a single gene with a head size of 7: the gene proper occupies positions 0-14 (a head of 7 symbols followed by a tail of 8, containing functions, the terminal a, and the RNC placeholder), while the Dc stretches over positions 15-22 and holds the symbols 68083295. A chromosome of this kind is expressed as usual, and the placeholder terminals in the resulting expression tree are then replaced, from left to right and from top to bottom, by the symbols (for simplicity represented by numerals) in the Dc. The values corresponding to these symbols are kept in an array, and for simplicity the number represented by each numeral indicates its position in the array (a small decoding sketch is given below). For instance, for the following 10-element array of RNCs, C = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}, the placeholders in the expression tree are replaced by the corresponding constants. This structure for handling random numerical constants is at the heart of different GEP systems, such as GEP neural networks and GEP decision trees. Like the basic gene expression algorithm, the GEP-RNC algorithm is multigenic, and its chromosomes are decoded as usual by expressing one gene after another and then linking them all together by the same kind of linking process.
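As a small illustration of the Dc decoding just described, the sketch below maps each numeral in a Dc domain to the constant stored at that index. The names are illustrative, and the constant values are simply the ones quoted above as they appear in this text (any signs lost in extraction are not restored).

def resolve_rncs(dc_symbols, constants):
    # Each numeral in the Dc domain is an index into the array of random numerical constants.
    return [constants[int(symbol)] for symbol in dc_symbols]

# The ten-element array quoted above:
C = [0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755]
print(resolve_rncs("68083295", C))  # constants consumed by the placeholder terminals, left to right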
The genetic operators used in the GEP-RNC system are an extension of the genetic operators of the basic GEP algorithm (see above), and they can all be implemented straightforwardly in these new chromosomes. The basic operators of mutation, inversion, transposition, and recombination are thus also used in the GEP-RNC algorithm. In addition, special Dc-specific operators such as mutation, inversion, and transposition are used to aid a more efficient circulation of the RNCs among individual programs, and a special mutation operator allows the permanent introduction of variation in the set of RNCs. The initial set of RNCs is randomly created at the beginning of a run, which means that, for each gene in the initial population, a specified number of numerical constants chosen from a certain range are randomly generated; their circulation and mutation are then enabled by the genetic operators. Neural networks: An artificial neural network (ANN or NN) is a computational device that consists of many simple connected units, or neurons. The connections between the units are usually weighted by real-valued weights; these weights are the primary means of learning in neural networks, and a learning algorithm is usually used to adjust them. Structurally, a neural network has three different classes of units: input units, hidden units, and output units. An activation pattern is presented at the input units and then spreads forward, from the input units through one or more layers of hidden units to the output units. The activation coming into a unit from another unit is multiplied by the weight on the link over which it spreads; all incoming activation is then added together, and the unit becomes activated only if the result is above the unit's threshold. In summary, the basic components of a neural network are the units, the connections between them, the weights, and the thresholds, so in order to fully simulate an artificial neural network one must somehow encode these components in a linear chromosome and then be able to express them in a meaningful way. In GEP neural networks (GEP-NN or GEP-nets), the network architecture is encoded in the usual head/tail structure.[10] The head contains special functions (neurons) that activate the hidden and output units (in the GEP context all these units are more appropriately called functional units) as well as terminals that represent the input units; the tail, as usual, contains only terminals (input units). Besides the head and the tail, these neural-network genes contain two additional domains, Dw and Dt, for encoding the weights and thresholds of the neural network. Structurally, the Dw comes after the tail, and its length dw depends on the head size h and the maximum arity nmax, being given by the formula dw = h × nmax; the Dt comes after Dw and has a length dt equal to t. Both domains are composed of symbols representing the weights and thresholds of the neural network. For each NN gene, the weights and thresholds are created at the beginning of each run, but their circulation and adaptation are guaranteed by the usual genetic operators of mutation, transposition, inversion, and recombination; special operators are also used to allow a constant flow of genetic variation in the set of weights and thresholds. Consider, for example, a neural network with two input units (i1 and i2), two hidden units (h1 and h2), and one output unit (o1). It has a total of six connections, with six corresponding weights represented by the numerals 1-6; for simplicity, the thresholds are all equal to 1 and are omitted.
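The domain sizes just described follow directly from the head size h and the maximum arity nmax. The helper below is a small bookkeeping sketch (names are illustrative); the tail length uses the standard gene expression programming relation t = h(nmax - 1) + 1.

def gep_nn_domain_lengths(h, n_max):
    # Domain sizes for a GEP neural-network gene with head size h and maximum arity n_max.
    t = h * (n_max - 1) + 1   # tail length, as in basic gene expression programming
    dw = h * n_max            # length of the weights domain Dw
    dt = t                    # length of the thresholds domain Dt
    return t, dw, dt

# For the exclusive-or gene quoted below (head size 3, functions of arity at most 2):
print(gep_nn_domain_lengths(3, 2))  # -> (4, 6, 4): a Dw of size 6, matching its six weight symbols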
The representation just shown is the canonical neural-network representation, but neural networks can also be represented as a tree, in which case a and b represent the two inputs i1 and i2 and D represents a function of connectivity two. This function adds all its weighted arguments and then thresholds the resulting activation in order to determine the forwarded output. The output (zero or one in this simple case) depends on the threshold of each unit: if the total incoming activation is equal to or greater than the threshold, the output is one, and zero otherwise. Such an NN tree can be linearized as the gene DDDabab654321 (positions 0-12), where the structure in positions 7-12 (the Dw) encodes the weights; the values of the weights are kept in an array and retrieved as necessary for expression. As a more concrete example, a neural-net gene for the exclusive-or problem, with a head size of 3 and a Dw of size 6, is DDDabab393257 (positions 0-12). Its expression results in a neural network which, for the set of weights W = {1.978, 0.514, 0.465, 1.22, 1.686, 1.797, 0.197, 1.606, 0, 1.753}, is a perfect solution to the exclusive-or function. Besides simple Boolean functions with binary inputs and binary outputs, the GEP-nets algorithm can handle all kinds of functions or neurons: linear neurons, tanh neurons, atan neurons, logistic neurons, limit neurons, radial-basis and triangular-basis neurons, all kinds of step neurons, and so on. Also interesting is that the GEP-nets algorithm can use all these neurons together and let evolution decide which ones work best for the problem at hand. GEP-nets can therefore be used not only for Boolean problems but also for logistic regression, classification, and regression. In all cases, GEP-nets can be implemented not only with multigenic systems but also with cellular systems, both unicellular and multicellular; furthermore, multinomial classification problems can be tackled in one go by GEP-nets, both with multigenic systems and with multicellular systems. Decision trees: Decision trees (DT) are classification models in which a series of questions and answers are mapped using nodes and directed edges. Decision trees have three types of nodes: a root node, internal nodes, and leaf (or terminal) nodes. The root node and all internal nodes represent test conditions for different attributes or variables in a dataset; leaf nodes specify the class label for all the different paths in the tree. Most decision-tree induction algorithms involve selecting an attribute for the root node and then making the same kind of informed decision for all the other nodes in the tree. Decision trees can also be created by gene expression programming,[11] with the advantage that all the decisions concerning the growth of the tree are made by the algorithm itself, without any kind of human input. There are basically two different types of DT algorithms: one for inducing decision trees with only nominal attributes, and another for inducing decision trees with both numeric and nominal attributes. This aspect of decision-tree induction carries over to gene expression programming, and there are two GEP algorithms for decision-tree induction: the evolvable decision trees (EDT) algorithm, for dealing exclusively with nominal attributes, and EDT-RNC (EDT with random numerical constants), for handling both nominal and numeric attributes. In the decision trees induced by gene expression programming, the attributes behave as function nodes in the basic gene expression algorithm, whereas the class labels behave as terminals.
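The behaviour of the thresholded units described above is easy to sketch. The snippet below is illustrative only: the wiring mirrors the two-hidden-unit network discussed in this section, but the weights are hand-picked to show that such a net can compute exclusive-or, and are not the evolved weights quoted above.

def threshold_unit(inputs, weights, threshold=1.0):
    # Sum the weighted incoming activation and fire only if it reaches the threshold.
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

def two_hidden_one_output(i1, i2, w_h1, w_h2, w_out):
    # Two inputs feed two hidden units, whose outputs feed a single output unit.
    h1 = threshold_unit((i1, i2), w_h1)
    h2 = threshold_unit((i1, i2), w_h2)
    return threshold_unit((h1, h2), w_out)

# With these hand-picked weights the wiring computes exclusive-or:
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, two_hidden_one_output(i1, i2, (1, -1), (-1, 1), (1, 1)))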
Attribute nodes therefore have associated with them a specific arity, or number of branches, that determines their growth and, ultimately, the growth of the tree, while class labels behave like terminals, so that for a k-class classification task a terminal set with k terminals, representing the k different classes, is used. The rules for encoding a decision tree in a linear genome are very similar to the rules used to encode mathematical expressions (see above): for decision-tree induction the genes also have a head and a tail, with the head containing attributes and terminals and the tail containing only terminals. This again ensures that all decision trees designed by GEP are valid programs. Furthermore, the size of the tail t is again dictated by the head size h and by the number of branches nmax of the attribute with the most branches, and is given by the equation t = h(nmax - 1) + 1. For example, a decision tree for deciding whether to play outside can be linearly encoded as the gene howbaaba (positions 0-7), where h represents the attribute humidity, o the attribute outlook, w the attribute windy, and a and b the class labels yes and no, respectively (a decoding sketch for this gene is given at the end of this section). Note that the edges connecting the nodes are properties of the data, specifying the type and number of branches of each attribute, and therefore do not have to be encoded. The process of decision-tree induction with gene expression programming starts, as usual, with an initial population of randomly created chromosomes; the chromosomes are expressed as decision trees, their fitness is evaluated against a training dataset, and according to fitness they are selected to reproduce with modification. The genetic operators are exactly the same as those used in a conventional unigenic system, for example mutation, inversion, transposition, and recombination. Decision trees with both nominal and numeric attributes are also easily induced with gene expression programming, using the framework described above for dealing with random numerical constants: the chromosomal architecture includes an extra domain for encoding random numerical constants, which are used as thresholds for splitting the data at each branching node. For example, the gene wothabababbbabba46336 (positions 0-20), with a head size of 5 and the Dc starting at position 16, encodes such a decision tree. In this system, every node in the head, irrespective of its type (numeric attribute, nominal attribute, or terminal), has associated with it a random numerical constant, which for simplicity is represented by a numeral 0-9. These random numerical constants are encoded in the Dc domain, and their expression follows a very simple scheme: from top to bottom and from left to right, the elements in the Dc are assigned one by one to the elements in the decision tree. So, for the following array of RNCs, C = {62, 51, 68, 83, 86, 41, 43, 44, 9, 67}, the decision tree above results in a tree with those constants attached to its nodes, which can also be drawn as a conventional decision tree. Criticism: GEP has been criticized for not being a major improvement over other genetic programming techniques; in many experiments it did not perform better than existing methods.[12] Software: Commercial applications: GeneXproTools is a predictive analytics suite developed by Gepsoft. Its modeling frameworks include logistic regression, classification, regression, time series prediction, and logic synthesis. GeneXproTools implements the basic gene expression algorithm and the GEP-RNC algorithm, both used in all of its modeling frameworks.
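As a small illustration of how a decision-tree gene such as howbaaba (quoted above) is expressed, the sketch below performs the breadth-first (Karva) decoding, assuming the usual branch counts for that dataset: a three-way outlook attribute and two-way humidity and windy attributes. These arities, and all names, are assumptions for illustration, not data taken from this article.

from collections import deque

ARITY = {"h": 2, "o": 3, "w": 2, "a": 0, "b": 0}  # assumed branch counts; a and b are class labels

def decode(gene):
    # Breadth-first (Karva) expression: symbols fill the tree level by level, left to right.
    nodes = [{"symbol": s, "children": []} for s in gene]
    queue, nxt = deque([nodes[0]]), 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY[node["symbol"]]):
            child = nodes[nxt]
            nxt += 1
            node["children"].append(child)
            queue.append(child)
    return nodes[0]

tree = decode("howbaaba")
print(tree["symbol"], [c["symbol"] for c in tree["children"]])  # h splits into o and w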
Open source libraries: GEP4J (GEP for Java), a project created by Jason Thomas, is an open-source implementation of gene expression programming in Java; it implements different GEP algorithms, including evolving decision trees (with nominal, numeric, or mixed attributes) and automatically defined functions, and is hosted at Google Code. PyGEP (Gene Expression Programming for Python), created by Ryan O'Neil with the goal of creating a simple library suitable for the academic study of gene expression programming in Python, aims for ease of use and rapid implementation; it implements standard multigenic chromosomes and the genetic operators mutation, crossover, and transposition, and is hosted at Google Code. JGEP (Java GEP toolkit), created by Matthew Sottile to rapidly build Java prototype codes that use GEP, which can then be rewritten in a language such as C or Fortran for real speed, is hosted at SourceForge. Further reading: Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7. Ferreira, C. (2002). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Angra do Heroismo, Portugal. ISBN 972-95890-5-4. See also: artificial intelligence, decision trees, evolutionary algorithms, genetic algorithms, genetic programming, GeneXproTools, machine learning, neural networks. References: 1. Box, G. E. P. (1957). Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, 6, 81-101. 2. Friedman, G. J. (1959). Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171-184. 3. Rechenberg, Ingo (1973). Evolutionsstrategie. Stuttgart: Frommann-Holzboog. ISBN 3-7728-0373-3. 4. Mitchell, Melanie (1996). An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press. 5. Ferreira, C. (2001). Gene expression programming: A new adaptive algorithm for solving problems. Complex Systems, 13(2), 87-129. 6. Ferreira, C. (2002). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Angra do Heroismo, Portugal. ISBN 972-95890-5-4. 7. Ferreira, C. (2006). Automatically defined functions in gene expression programming. In N. Nedjah, L. de M. Mourelle, and A. Abraham (eds.), Genetic Systems Programming: Theory and Experiences, Studies in Computational Intelligence, vol. 13, pp. 21-56. Springer-Verlag. 8. Ferreira, C. (2002). Mutation, transposition, and recombination: An analysis of the evolutionary dynamics. In H. J. Caulfield, S. H. Chen, H. D. Cheng, R. Duro, V. Honavar, E. E. Kerre, M. Lu, M. G. Romay, T. K. Shih, D. Ventura, P. P. Wang, and Y. Yang (eds.), Proceedings of the 6th Joint Conference on Information Sciences, 4th International Workshop on Frontiers in Evolutionary Algorithms, pages 614-617, Research Triangle Park, North Carolina, USA. 9. Ferreira, C. (2002). Combinatorial optimization by gene expression programming: Inversion revisited. In J. M. Santos and A. Zapico (eds.), Proceedings of the Argentine Symposium on Artificial Intelligence, pages 160-174, Santa Fe, Argentina. 10. Ferreira, C. (2006). Designing neural networks using gene expression programming. In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay (eds.), Applied Soft Computing Technologies: The Challenge of Complexity, pages 517-536. Springer-Verlag. 11. Ferreira, C. (2006). Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. Springer-Verlag. ISBN 3-540-32796-7. 12. Oltean, M., and Grosan, C. (2003). A comparison of several linear genetic programming techniques. Complex Systems, 14(4), 285-314. External links: GEP home page, maintained by the inventor of gene expression programming; GeneXproTools, commercial GEP software.
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Genetic_algorithms b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Genetic_algorithms new file mode 100644 index 00000000..5422609f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Genetic_algorithms @@ -0,0 +1 @@ +Genetic algorithm - Wikipedia, the free encyclopedia. Genetic algorithm. From Wikipedia, the free encyclopedia. In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems.[1] Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. Genetic algorithms find application in bioinformatics, phylogenetics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics, pharmacometrics, and other fields. Contents: 1 Methodology (1.1 Initialization, 1.2 Selection, 1.3 Genetic operators, 1.4 Termination); 2 The building block hypothesis; 3 Limitations; 4 Variants; 5 Problem domains; 6 History; 7 Related techniques (7.1 Parent fields, 7.2 Related fields: 7.2.1 Evolutionary algorithms, 7.2.2 Swarm intelligence, 7.2.3 Other evolutionary computing algorithms, 7.2.4 Other metaheuristic methods, 7.2.5 Other stochastic optimisation methods); 8 See also; 9 References; 10 Bibliography; 11 External links (11.1 Resources, 11.2 Tutorials). Methodology: In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered. Traditionally, solutions are represented in binary as strings of 0s and
1s, but other encodings are also possible.[2] The evolution usually starts from a population of randomly generated individuals and is an iterative process, the population in each iteration being called a generation. In each generation the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The fitter individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation, which is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. A typical genetic algorithm requires a genetic representation of the solution domain and a fitness function to evaluate the solution domain. A standard representation of each candidate solution is an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable-length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming, graph-form representations are explored in evolutionary programming, and a mix of both linear chromosomes and trees is explored in gene expression programming. Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion, and selection operators. Initialization: Initially, many individual solutions are (usually) randomly generated to form an initial population. The population size depends on the nature of the problem but typically contains several hundreds or thousands of possible solutions. Traditionally, the population is generated randomly, allowing the entire range of possible solutions (the search space); occasionally, the solutions may be seeded in areas where optimal solutions are likely to be found. Selection: (Main article: Selection (genetic algorithm).) During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process, where fitter solutions (as measured by a fitness function) are typically more likely to be selected. Certain selection methods rate the fitness of each solution and preferentially select the best solutions; other methods rate only a random sample of the population, as the former process may be very time-consuming. The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem-dependent. For instance, in the knapsack problem one wants to maximize the total value of objects that can be put in a knapsack of some fixed capacity. A representation of a solution might be an array of bits, where each bit represents a different object and the value of the bit (0 or 1) represents whether or not the object is in the knapsack. Not every such representation is valid, as the size of the objects may exceed the capacity of the knapsack. The fitness of the solution is the sum of the values of all objects in the knapsack if the representation is valid, or 0 otherwise (a sketch of such a fitness function, wrapped in the generational loop just described, is given below).
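The following is a minimal sketch of the bit-string knapsack fitness just described, wrapped in the generational loop outlined in this section (fitness-proportionate selection, one-point crossover, bit-flip mutation, an elitist carry-over, and a fixed generation budget). Object values, weights, and all parameter settings are illustrative placeholders, not data from this article.

import random

VALUES = [10, 40, 30, 50, 35]      # illustrative object values
WEIGHTS = [5, 20, 10, 25, 15]      # illustrative object sizes
CAPACITY = 50
POP_SIZE, GENERATIONS, P_CROSS, P_MUT = 30, 100, 0.9, 0.02

def fitness(bits):
    # Sum of the packed values, or 0 when the knapsack capacity is exceeded.
    weight = sum(w for w, b in zip(WEIGHTS, bits) if b)
    return sum(v for v, b in zip(VALUES, bits) if b) if weight <= CAPACITY else 0

def select(population, fits):
    # Fitness-proportionate (roulette-wheel) selection; uniform fallback if every fitness is 0.
    total = sum(fits)
    if total == 0:
        return random.choice(population)
    pick, running = random.uniform(0, total), 0.0
    for individual, f in zip(population, fits):
        running += f
        if running >= pick:
            return individual
    return population[-1]

def crossover(a, b):
    # One-point crossover of two parent bit strings.
    if random.random() > P_CROSS:
        return a[:], b[:]
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits):
    # Independent bit-flip mutation.
    return [1 - b if random.random() < P_MUT else b for b in bits]

population = [[random.randint(0, 1) for _ in VALUES] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    fits = [fitness(ind) for ind in population]
    offspring = [max(population, key=fitness)]   # elitist carry-over of the current best
    while len(offspring) < POP_SIZE:
        child1, child2 = crossover(select(population, fits), select(population, fits))
        offspring += [mutate(child1), mutate(child2)]
    population = offspring[:POP_SIZE]

best = max(population, key=fitness)
print(best, fitness(best))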
In some problems it is hard or even impossible to define the fitness expression; in these cases, a simulation may be used to determine the fitness value of a phenotype (e.g., computational fluid dynamics is used to determine the air resistance of a vehicle whose shape is encoded as the phenotype), or even interactive genetic algorithms are used. Genetic operators: (Main articles: Crossover (genetic algorithm) and Mutation (genetic algorithm).) The next step is to generate a second-generation population of solutions from those selected, through the genetic operators crossover (also called recombination) and/or mutation. For each new solution to be produced, a pair of parent solutions is selected for breeding from the pool selected previously. By producing a child solution using the above methods of crossover and mutation, a new solution is created which typically shares many of the characteristics of its parents. New parents are selected for each new child, and the process continues until a new population of solutions of appropriate size is generated. Although reproduction methods based on two parents are more biology-inspired, some research[3][4] suggests that more than two parents generate higher-quality chromosomes. These processes ultimately result in a next-generation population of chromosomes that is different from the initial generation. Generally, the average fitness of the population will have increased by this procedure, since only the best organisms from the first generation are selected for breeding, along with a small proportion of less fit solutions. These less fit solutions ensure genetic diversity within the genetic pool of the parents and therefore ensure the genetic diversity of the subsequent generation of children. Opinion is divided over the importance of crossover versus mutation; there are many references in Fogel (2006) that support the importance of mutation-based search. Although crossover and mutation are known as the main genetic operators, it is possible to use other operators such as regrouping, colonization-extinction, or migration.[5] It is worth tuning parameters such as the mutation probability, crossover probability, and population size to find reasonable settings for the problem class being worked on. A very small mutation rate may lead to genetic drift (which is non-ergodic in nature); a recombination rate that is too high may lead to premature convergence of the genetic algorithm; a mutation rate that is too high may lead to loss of good solutions unless there is elitist selection. There are theoretical, but not yet practical, upper and lower bounds for these parameters that can help guide selection.[citation needed] Termination: This generational process is repeated until a termination condition has been reached. Common terminating conditions are: a solution is found that satisfies minimum criteria; a fixed number of generations is reached; the allocated budget (computation time or money) is reached; the highest-ranking solution's fitness is reaching or has reached a plateau such that successive iterations no longer produce better results; manual inspection; or combinations of the above. The building block hypothesis: Genetic algorithms are simple to implement, but their behavior is difficult to understand; in particular, it is difficult to understand why these algorithms frequently succeed at generating solutions of high fitness when applied to practical problems. The building block hypothesis (BBH) consists of a description of a
heuristic that performs adaptation by identifying and recombining building blocks i e low order low defining length schemata with above average fitness a hypothesis that a genetic algorithm performs adaptation by implicitly and efficiently implementing this heuristic goldberg describes the heuristic as follows short low order and highly fit schemata are sampled recombined crossed over and resampled to form strings of potentially higher fitness in a way by working with these particular schemata the building blocks we have reduced the complexity of our problem instead of building high performance strings by trying every conceivable combination we construct better and better strings from the best partial solutions of past samplings because highly fit schemata of low defining length and low order play such an important role in the action of genetic algorithms we have already given them a special name building blocks just as a child creates magnificent fortresses through the arrangement of simple blocks of wood so does a genetic algorithm seek near optimal performance through the juxtaposition of short low order high performance schemata or building blocks 6 limitations edit there are several limitations of the use of a genetic algorithm compared to alternative optimization algorithms repeated fitness function evaluation for complex problems is often the most prohibitive and limiting segment of artificial evolutionary algorithms finding the optimal solution to complex high dimensional multimodal problems often requires very expensive fitness function evaluations in real world problems such as structural optimization problems one single function evaluation may require several hours to several days of complete simulation typical optimization methods can not deal with such types of problem in this case it may be necessary to forgo an exact evaluation and use an approximated fitness that is computationally efficient it is apparent that amalgamation of approximate models may be one of the most promising approaches to convincingly use ga to solve complex real life problems genetic algorithms do not scale well with complexity that is where the number of elements which are exposed to mutation is large there is often an exponential increase in search space size this makes it extremely difficult to use the technique on problems such as designing an engine a house or plane in order to make such problems tractable to evolutionary search they must be broken down into the simplest representation possible hence we typically see evolutionary algorithms encoding designs for fan blades instead of engines building shapes instead of detailed construction plans airfoils instead of whole aircraft designs the second problem of complexity is the issue of how to protect parts that have evolved to represent good solutions from further destructive mutation particularly when their fitness assessment requires them to combine well with other parts it has been suggested by some citation needed in the community that a developmental approach to evolved solutions could overcome some of the issues of protection but this remains an open research question the better solution is only in comparison to other solutions as a result the stop criterion is not clear in every problem in many problems gas may have a tendency to converge towards local optima or even arbitrary points rather than the global optimum of the problem this means that it does not know how to sacrifice short term fitness to gain longer term fitness the likelihood of 
this occurring depends on the shape of the fitness landscape certain problems may provide an easy ascent towards a global optimum others may make it easier for the function to find the local optima this problem may be alleviated by using a different fitness function increasing the rate of mutation or by using selection techniques that maintain a diverse population of solutions 7 although the no free lunch theorem 8 proves citation needed that there is no general solution to this problem a common technique to maintain diversity is to impose a niche penalty wherein any group of individuals of sufficient similarity niche radius have a penalty added which will reduce the representation of that group in subsequent generations permitting other less similar individuals to be maintained in the population this trick however may not be effective depending on the landscape of the problem another possible technique would be to simply replace part of the population with randomly generated individuals when most of the population is too similar to each other diversity is important in genetic algorithms and genetic programming because crossing over a homogeneous population does not yield new solutions in evolution strategies and evolutionary programming diversity is not essential because of a greater reliance on mutation operating on dynamic data sets is difficult as genomes begin to converge early on towards solutions which may no longer be valid for later data several methods have been proposed to remedy this by increasing genetic diversity somehow and preventing early convergence either by increasing the probability of mutation when the solution quality drops called triggered hypermutation or by occasionally introducing entirely new randomly generated elements into the gene pool called random immigrants again evolution strategies and evolutionary programming can be implemented with a so called comma strategy in which parents are not maintained and new parents are selected only from offspring this can be more effective on dynamic problems gas cannot effectively solve problems in which the only fitness measure is a single right wrong measure like decision problems as there is no way to converge on the solution no hill to climb in these cases a random search may find a solution as quickly as a ga however if the situation allows the success failure trial to be repeated giving possibly different results then the ratio of successes to failures provides a suitable fitness measure for specific optimization problems and problem instances other optimization algorithms may find better solutions than genetic algorithms given the same amount of computation time alternative and complementary algorithms include evolution strategies evolutionary programming simulated annealing gaussian adaptation hill climbing and swarm intelligence e g ant colony optimization particle swarm optimization and methods based on integer linear programming the question of which if any problems are suited to genetic algorithms in the sense that such algorithms are better than others is open and controversial variants edit the simplest algorithm represents each chromosome as a bit string typically numeric parameters can be represented by integers though it is possible to use floating point representations the floating point representation is natural to evolution strategies and evolutionary programming the notion of real valued genetic algorithms has been offered but is really a misnomer because it does not really represent the building block 
theory that was proposed by john henry holland in the 1970s this theory is not without support though based on theoretical and experimental results see below the basic algorithm performs crossover and mutation at the bit level other variants treat the chromosome as a list of numbers which are indexes into an instruction table nodes in a linked list hashes objects or any other imaginable data structure crossover and mutation are performed so as to respect data element boundaries for most data types specific variation operators can be designed different chromosomal data types seem to work better or worse for different specific problem domains when bit string representations of integers are used gray coding is often employed in this way small changes in the integer can be readily effected through mutations or crossovers this has been found to help prevent premature convergence at so called hamming walls in which too many simultaneous mutations or crossover events must occur in order to change the chromosome to a better solution other approaches involve using arrays of real valued numbers instead of bit strings to represent chromosomes theoretically the smaller the alphabet the better the performance but paradoxically good results have been obtained from using real valued chromosomes a very successful slight variant of the general process of constructing a new population is to allow some of the better organisms from the current generation to carry over to the next unaltered this strategy is known as elitist selection parallel implementations of genetic algorithms come in two flavours coarse grained parallel genetic algorithms assume a population on each of the computer nodes and migration of individuals among the nodes fine grained parallel genetic algorithms assume an individual on each processor node which acts with neighboring individuals for selection and reproduction other variants like genetic algorithms for online optimization problems introduce time dependence or noise in the fitness function genetic algorithms with adaptive parameters adaptive genetic algorithms agas is another significant and promising variant of genetic algorithms the probabilities of crossover pc and mutation pm greatly determine the degree of solution accuracy and the convergence speed that genetic algorithms can obtain instead of using fixed values of pc and pm agas utilize the population information in each generation and adaptively adjust the pc and pm in order to maintain the population diversity as well as to sustain the convergence capacity in aga adaptive genetic algorithm 9 the adjustment of pc and pm depends on the fitness values of the solutions in caga clustering based adaptive genetic algorithm 10 through the use of clustering analysis to judge the optimization states of the population the adjustment of pc and pm depends on these optimization states it can be quite effective to combine ga with other optimization methods ga tends to be quite good at finding generally good global solutions but quite inefficient at finding the last few mutations to find the absolute optimum other techniques such as simple hill climbing are quite efficient at finding absolute optimum in a limited region alternating ga and hill climbing can improve the efficiency of ga while overcoming the lack of robustness of hill climbing this means that the rules of genetic variation may have a different meaning in the natural case for instance provided that steps are stored in consecutive order crossing over may sum a number of steps 
from maternal dna adding a number of steps from paternal dna and so on this is like adding vectors that more probably may follow a ridge in the phenotypic landscape thus the efficiency of the process may be increased by many orders of magnitude moreover the inversion operator has the opportunity to place steps in consecutive order or any other suitable order in favour of survival or efficiency see for instance 11 or example in travelling salesman problem in particular the use of an edge recombination operator a variation where the population as a whole is evolved rather than its individual members is known as gene pool recombination a number of variations have been developed to attempt to improve performance of gas on problems with a high degree of fitness epistasis i e where the fitness of a solution consists of interacting subsets of its variables such algorithms aim to learn before exploiting these beneficial phenotypic interactions as such they are aligned with the building block hypothesis in adaptively reducing disruptive recombination prominent examples of this approach include the mga 12 gemga 13 and llga 14 problem domains edit problems which appear to be particularly appropriate for solution by genetic algorithms include timetabling and scheduling problems and many scheduling software packages are based on gas citation needed gas have also been applied to engineering genetic algorithms are often applied as an approach to solve global optimization problems as a general rule of thumb genetic algorithms might be useful in problem domains that have a complex fitness landscape as mixing i e mutation in combination with crossover is designed to move the population away from local optima that a traditional hill climbing algorithm might get stuck in observe that commonly used crossover operators cannot change any uniform population mutation alone can provide ergodicity of the overall genetic algorithm process seen as a markov chain examples of problems solved by genetic algorithms include mirrors designed to funnel sunlight to a solar collector antennae designed to pick up radio signals in space and walking methods for computer figures many of their solutions have been highly effective unlike anything a human engineer would have produced and inscrutable as to how they arrived at that solution history edit computer simulations of evolution started as early as in 1954 with the work of nils aall barricelli who was using the computer at the institute for advanced study in princeton new jersey 15 16 his 1954 publication was not widely noticed starting in 1957 17 the australian quantitative geneticist alex fraser published a series of papers on simulation of artificial selection of organisms with multiple loci controlling a measurable trait from these beginnings computer simulation of evolution by biologists became more common in the early 1960s and the methods were described in books by fraser and burnell 1970 18 and crosby 1973 19 fraser s simulations included all of the essential elements of modern genetic algorithms in addition hans joachim bremermann published a series of papers in the 1960s that also adopted a population of solution to optimization problems undergoing recombination mutation and selection bremermann s research also included the elements of modern genetic algorithms 20 other noteworthy early pioneers include richard friedberg george friedman and michael conrad many early papers are reprinted by fogel 1998 21 although barricelli in work he reported in 1963 had simulated the 
evolution of ability to play a simple game 22 artificial evolution became a widely recognized optimization method as a result of the work of ingo rechenberg and hans paul schwefel in the 1960s and early 1970s rechenberg s group was able to solve complex engineering problems through evolution strategies 23 24 25 26 another approach was the evolutionary programming technique of lawrence j fogel which was proposed for generating artificial intelligence evolutionary programming originally used finite state machines for predicting environments and used variation and selection to optimize the predictive logics genetic algorithms in particular became popular through the work of john holland in the early 1970s and particularly his book adaptation in natural and artificial systems 1975 his work originated with studies of cellular automata conducted by holland and his students at the university of michigan holland introduced a formalized framework for predicting the quality of the next generation known as holland s schema theorem research in gas remained largely theoretical until the mid 1980s when the first international conference on genetic algorithms was held in pittsburgh pennsylvania as academic interest grew the dramatic increase in desktop computational power allowed for practical application of the new technique in the late 1980s general electric started selling the world s first genetic algorithm product a mainframe based toolkit designed for industrial processes in 1989 axcelis inc released evolver the world s first commercial ga product for desktop computers the new york times technology writer john markoff wrote 27 about evolver in 1990 related techniques edit see also list of genetic algorithm applications parent fields edit genetic algorithms are a sub field of evolutionary algorithms evolutionary computing metaheuristics stochastic optimization optimization related fields edit evolutionary algorithms edit this section needs additional citations for verification please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed may 2011 evolutionary algorithms is a sub field of evolutionary computing evolution strategies es see rechenberg 1994 evolve individuals by means of mutation and intermediate or discrete recombination es algorithms are designed particularly to solve problems in the real value domain they use self adaptation to adjust control parameters of the search de randomization of self adaptation has led to the contemporary covariance matrix adaptation evolution strategy cma es evolutionary programming ep involves populations of solutions with primarily mutation and selection and arbitrary representations they use self adaptation to adjust parameters and can include other variation operations such as combining information from multiple parents gene expression programming gep also uses populations of computer programs these complex computer programs are encoded in simpler linear chromosomes of fixed length which are afterwards expressed as expression trees expression trees or computer programs evolve because the chromosomes undergo mutation and recombination in a manner similar to the canonical ga but thanks to the special organization of gep chromosomes these genetic modifications always result in valid computer programs 28 genetic programming gp is a related technique popularized by john koza in which computer programs rather than function parameters are optimized genetic programming often uses tree based internal 
data structures to represent the computer programs for adaptation instead of the list structures typical of genetic algorithms grouping genetic algorithm gga is an evolution of the ga where the focus is shifted from individual items like in classical gas to groups or subset of items 29 the idea behind this ga evolution proposed by emanuel falkenauer is that solving some complex problems a k a clustering or partitioning problems where a set of items must be split into disjoint group of items in an optimal way would better be achieved by making characteristics of the groups of items equivalent to genes these kind of problems include bin packing line balancing clustering with respect to a distance measure equal piles etc on which classic gas proved to perform poorly making genes equivalent to groups implies chromosomes that are in general of variable length and special genetic operators that manipulate whole groups of items for bin packing in particular a gga hybridized with the dominance criterion of martello and toth is arguably the best technique to date interactive evolutionary algorithms are evolutionary algorithms that use human evaluation they are usually applied to domains where it is hard to design a computational fitness function for example evolving images music artistic designs and forms to fit users aesthetic preference swarm intelligence edit swarm intelligence is a sub field of evolutionary computing ant colony optimization aco uses many ants or agents to traverse the solution space and find locally productive areas while usually inferior to genetic algorithms and other forms of local search it is able to produce results in problems where no global or up to date perspective can be obtained and thus the other methods cannot be applied citation needed particle swarm optimization pso is a computational method for multi parameter optimization which also uses population based approach a population swarm of candidate solutions particles moves in the search space and the movement of the particles is influenced both by their own best known position and swarm s global best known position like genetic algorithms the pso method depends on information sharing among population members in some problems the pso is often more computationally efficient than the gas especially in unconstrained problems with continuous variables 30 intelligent water drops or the iwd algorithm 31 is a nature inspired optimization algorithm inspired from natural water drops which change their environment to find the near optimal or optimal path to their destination the memory is the river s bed and what is modified by the water drops is the amount of soil on the river s bed other evolutionary computing algorithms edit evolutionary computation is a sub field of the metaheuristic methods harmony search hs is an algorithm mimicking the behaviour of musicians in the process of improvisation memetic algorithm ma also called hybrid genetic algorithm among others is a relatively new evolutionary method where local search is applied during the evolutionary cycle the idea of memetic algorithms comes from memes which unlike genes can adapt themselves in some problem areas they are shown to be more efficient than traditional evolutionary algorithms bacteriologic algorithms ba inspired by evolutionary ecology and more particularly bacteriologic adaptation evolutionary ecology is the study of living organisms in the context of their environment with the aim of discovering how they adapt its basic concept is that in a 
heterogeneous environment you can t find one individual that fits the whole environment so you need to reason at the population level it is also believed bas could be successfully applied to complex positioning problems antennas for cell phones urban planning and so on or data mining 32 cultural algorithm ca consists of the population component almost identical to that of the genetic algorithm and in addition a knowledge component called the belief space differential search algorithm ds inspired by migration of superorganisms 33 gaussian adaptation normal or natural adaptation abbreviated na to avoid confusion with ga is intended for the maximisation of manufacturing yield of signal processing systems it may also be used for ordinary parametric optimisation it relies on a certain theorem valid for all regions of acceptability and all gaussian distributions the efficiency of na relies on information theory and a certain theorem of efficiency its efficiency is defined as information divided by the work needed to get the information 34 because na maximises mean fitness rather than the fitness of the individual the landscape is smoothed such that valleys between peaks may disappear therefore it has a certain ambition to avoid local peaks in the fitness landscape na is also good at climbing sharp crests by adaptation of the moment matrix because na may maximise the disorder average information of the gaussian simultaneously keeping the mean fitness constant other metaheuristic methods edit metaheuristic methods broadly fall within stochastic optimisation methods simulated annealing sa is a related global optimization technique that traverses the search space by testing random mutations on an individual solution a mutation that increases fitness is always accepted a mutation that lowers fitness is accepted probabilistically based on the difference in fitness and a decreasing temperature parameter in sa parlance one speaks of seeking the lowest energy instead of the maximum fitness sa can also be used within a standard ga algorithm by starting with a relatively high rate of mutation and decreasing it over time along a given schedule tabu search ts is similar to simulated annealing in that both traverse the solution space by testing mutations of an individual solution while simulated annealing generates only one mutated solution tabu search generates many mutated solutions and moves to the solution with the lowest energy of those generated in order to prevent cycling and encourage greater movement through the solution space a tabu list is maintained of partial or complete solutions it is forbidden to move to a solution that contains elements of the tabu list which is updated as the solution traverses the solution space extremal optimization eo unlike gas which work with a population of candidate solutions eo evolves a single solution and makes local modifications to the worst components this requires that a suitable representation be selected which permits individual solution components to be assigned a quality measure fitness the governing principle behind this algorithm is that of emergent improvement through selectively removing low quality components and replacing them with a randomly selected component this is decidedly at odds with a ga that selects good solutions in an attempt to make better solutions other stochastic optimisation methods edit the cross entropy ce method generates candidates solutions via a parameterized probability distribution the parameters are updated via cross entropy 
minimization so as to generate better samples in the next iteration reactive search optimization rso advocates the integration of sub symbolic machine learning techniques into search heuristics for solving complex optimization problems the word reactive hints at a ready response to events during the search through an internal online feedback loop for the self tuning of critical parameters methodologies of interest for reactive search include machine learning and statistics in particular reinforcement learning active or query learning neural networks and meta heuristics see also edit list of genetic algorithm applications propagation of schema universal darwinism metaheuristics references edit mitchell 1996 p 160 2 a b whitley 1994 p 160 66 eiben a e et al 1994 genetic algorithms with multi parent recombination ppsn iii proceedings of the international conference on evolutionary computation the third conference on parallel problem solving from nature 78 87 isbn 3 540 58484 6 ting chuan kang 2005 on the mean convergence time of multi parent genetic algorithms without selection advances in artificial life 403 412 isbn 978 3 540 28848 0 akbari ziarati 2010 a multilevel evolutionary algorithm for optimizing numerical functions ijiec 2 2011 419 430 1 goldberg 1989 p 160 41 taherdangkoo mohammad paziresh mahsa yazdi mehran bagheri mohammad hadi 19 november 2012 an efficient algorithm for function optimization modified stem cells algorithm central european journal of engineering 3 1 36 50 doi 10 2478 s13531 012 0047 8 160 wolpert d h macready w g 1995 no free lunch theorems for optimisation santa fe institute sfi tr 05 010 santa fe srinivas m and patnaik l adaptive probabilities of crossover and mutation in genetic algorithms ieee transactions on system man and cybernetics vol 24 no 4 pp 656 667 1994 zhang j chung h and lo w l clustering based adaptive crossover and mutation probabilities for genetic algorithms ieee transactions on evolutionary computation vol 11 no 3 pp 326 335 2007 evolution in a nutshell d e goldberg b korb and k deb messy genetic algorithms motivation analysis and first results complex systems 5 3 493 530 october 1989 gene expression the missing link in evolutionary computation g harik learning linkage to efficiently solve problems of bounded difficulty using genetic algorithms phd thesis dept computer science university of michigan ann arbour 1997 barricelli nils aall 1954 esempi numerici di processi di evoluzione methodos 45 68 160 barricelli nils aall 1957 symbiogenetic evolution processes realized by artificial methods methodos 143 182 160 fraser alex 1957 simulation of genetic systems by automatic digital computers i introduction aust j biol sci 10 484 491 160 fraser alex donald burnell 1970 computer models in genetics new york mcgraw hill isbn 160 0 07 021904 4 160 crosby jack l 1973 computer simulation in genetics london john wiley amp sons isbn 160 0 471 18880 8 160 02 27 96 uc berkeley s hans bremermann professor emeritus and pioneer in mathematical biology has died at 69 fogel david b editor 1998 evolutionary computation the fossil record new york ieee press isbn 160 0 7803 3481 7 160 barricelli nils aall 1963 numerical testing of evolution theories part ii preliminary tests of performance symbiogenesis and terrestrial life acta biotheoretica 16 99 126 160 rechenberg ingo 1973 evolutionsstrategie stuttgart holzmann froboog isbn 160 3 7728 0373 3 160 schwefel hans paul 1974 numerische optimierung von computer modellen phd thesis 160 schwefel hans paul 1977 numerische 
optimierung von computor modellen mittels der evolutionsstrategie 160 mit einer vergleichenden einf hrung in die hill climbing und zufallsstrategie basel stuttgart birkh user isbn 160 3 7643 0876 1 160 schwefel hans paul 1981 numerical optimization of computer models translation of 1977 numerische optimierung von computor modellen mittels der evolutionsstrategie chichester 160 new york wiley isbn 160 0 471 09988 0 160 markoff john 1990 08 29 what s the best answer it s survival of the fittest new york times retrieved 2009 08 09 160 ferreira c gene expression programming a new adaptive algorithm for solving problems complex systems vol 13 issue 2 87 129 160 falkenauer emanuel 1997 genetic algorithms and grouping problems chichester england john wiley amp sons ltd isbn 160 978 0 471 97150 4 160 rania hassan babak cohanim olivier de weck gerhard vente r 2005 a comparison of particle swarm optimization and the genetic algorithm hamed shah hosseini the intelligent water drops algorithm a nature inspired swarm based optimization algorithm international journal of bio inspired computation ijbic vol 1 no 2009 2 dead link baudry benoit franck fleurey jean marc j z quel and yves le traon march april 2005 automatic test case optimization a bacteriologic algorithm pdf ieee software ieee computer society 22 2 76 82 doi 10 1109 ms 2005 30 retrieved 2009 08 09 160 civicioglu p 2012 transforming geocentric cartesian coordinates to geodetic coordinates by using differential search algorithm computers amp geosciences 46 229 247 doi 10 1016 j cageo 2011 12 011 160 kjellstr m g december 1991 on the efficiency of gaussian adaptation journal of optimization theory and applications 71 3 589 597 doi 10 1007 bf00941405 160 bibliography edit this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations june 2010 banzhaf wolfgang nordin peter keller robert francone frank 1998 genetic programming an introduction san francisco ca morgan kaufmann isbn 160 978 1558605107 160 bies robert r muldoon matthew f pollock bruce g manuck steven smith gwenn and sale mark e 2006 a genetic algorithm based hybrid machine learning approach to model selection journal of pharmacokinetics and pharmacodynamics netherlands springer 196 221 160 cha sung hyuk tappert charles c 2009 a genetic algorithm for constructing compact binary decision trees journal of pattern recognition research 4 1 1 13 160 fraser alex s 1957 simulation of genetic systems by automatic digital computers i introduction australian journal of biological sciences 10 484 491 160 goldberg david 1989 genetic algorithms in search optimization and machine learning reading ma addison wesley professional isbn 160 978 0201157673 160 goldberg david 2002 the design of innovation lessons from and for competent genetic algorithms norwell ma kluwer academic publishers isbn 160 978 1402070983 160 fogel david evolutionary computation toward a new philosophy of machine intelligence 3rd ed piscataway nj ieee press isbn 160 978 0471669517 160 holland john 1992 adaptation in natural and artificial systems cambridge ma mit press isbn 160 978 0262581110 160 koza john 1992 genetic programming on the programming of computers by means of natural selection cambridge ma mit press isbn 160 978 0262111706 160 michalewicz zbigniew 1996 genetic algorithms data structures evolution programs springer verlag isbn 160 978 3540606765 160 mitchell melanie 1996 an introduction 
to genetic algorithms cambridge ma mit press isbn 160 9780585030944 160 poli r langdon w b mcphee n f 2008 a field guide to genetic programming lulu com freely available from the internet isbn 160 978 1 4092 0073 4 160 rechenberg ingo 1994 evolutionsstrategie 94 stuttgart fromman holzboog schmitt lothar m nehaniv chrystopher l fujii robert h 1998 linear analysis of genetic algorithms theoretical computer science 208 111 148 schmitt lothar m 2001 theory of genetic algorithms theoretical computer science 259 1 61 schmitt lothar m 2004 theory of genetic algorithms ii models for genetic operators over the string tensor representation of populations and convergence to global optima for arbitrary fitness function under scaling theoretical computer science 310 181 231 schwefel hans paul 1974 numerische optimierung von computer modellen phd thesis reprinted by birkh user 1977 vose michael 1999 the simple genetic algorithm foundations and theory cambridge ma mit press isbn 160 978 0262220583 160 whitley darrell 1994 a genetic algorithm tutorial statistics and computing 4 2 65 85 doi 10 1007 bf00175354 160 accessdate requires url help hingston philip barone luigi michalewicz zbigniew 2008 design by evolution advances in evolutionary design springer isbn 160 978 3540741091 160 eiben agoston smith james 2003 introduction to evolutionary computing springer isbn 160 978 3540401841 160 external links edit this article s use of external links may not follow wikipedia s policies or guidelines please improve this article by removing excessive or inappropriate external links and converting useful links where appropriate into footnote references january 2010 resources edit genetic algorithms index the site genetic programming notebook provides a structured resource pointer to web pages in genetic algorithms field tutorials edit genetic algorithms computer programs that evolve in ways that resemble natural selection can solve complex problems even their creators do not fully understand an excellent introduction to ga by john holland and with an application to the prisoner s dilemma an online interactive ga demonstrator to practise or learn how a ga works learn step by step or watch global convergence in batch change population size crossover rate mutation rate and selection mechanism and add constraints a genetic algorithm tutorial by darrell whitley computer science department colorado state university an excellent tutorial with lots of theory essentials of metaheuristics 2009 225 p free open text by sean luke global optimization algorithms theory and application v t e major subfields of optimization convex programming integer programming quadratic programming nonlinear programming stochastic programming robust optimization combinatorial optimization infinite dimensional optimization metaheuristics constraint satisfaction multiobjective optimization retrieved from http en wikipedia org w index php title genetic_algorithm amp oldid 560754922 categories genetic algorithmsmathematical optimizationoptimization algorithms and methodssearch algorithmscyberneticshidden categories all articles with dead external linksarticles with dead external links from september 2011all articles with unsourced statementsarticles with unsourced statements from november 2009articles with unsourced statements from june 2012articles with unsourced statements from december 2011articles needing additional references from may 2011all articles needing additional referencesarticles with unsourced statements from august 2007articles lacking 
in text citations from june 2010all articles lacking in text citationspages using citations with accessdate and no urlwikipedia external links cleanup from january 2010wikipedia spam cleanup from january 2010use dmy dates from august 2010 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky dansk deutsch espa ol fran ais galego hrvatski bahasa indonesia italiano latina latvie u lietuvi magyar nederlands norsk bokm l polski portugus rom n simple english sloven ina srpski suomi svenska t rk e ti ng vi t volap k edit links this page was last modified on 20 june 2013 at 14 30 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/IEEE b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/IEEE new file mode 100644 index 00000000..41e8729f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/IEEE @@ -0,0 +1 @@ +institute of electrical and electronics engineers wikipedia the free encyclopedia institute of electrical and electronics engineers from wikipedia the free encyclopedia redirected from ieee jump to navigation search ieee redirects here it is not to be confused with the institution of electrical engineers iee ieee type professional organization founded january 1 1963 headquarters new york city new york united states origins merger of the american institute of electrical engineers and the institute of radio engineers key people gordon day president and ceo area served worldwide focus electrical electronics communications computer engineering computer science and information technology 1 method industry standards conferences publications revenue us 330 million members 425 000 website www ieee org the institute of electrical and electronics engineers ieee read i triple e is a professional association headquartered in new york city that is dedicated to advancing technological innovation and excellence it has more than 425 000 members in more than 160 countries about 51 4 of whom reside in the united states 2 3 contents 1 history 2 publications 3 educational activities 4 standards and development process 5 membership and member grades 6 awards 6 1 medals 6 2 technical field awards 6 3 recognitions 6 4 prize paper awards 6 5 scholarships 7 societies 8 technical councils 9 technical committees 10 organizational units 11 ieee foundation 12 copyright policy 13 see also 14 references 15 external links history the ieee corporate office is on the 17th floor of 3 park avenue in new york city the ieee is incorporated under the not for profit corporation law of the state of new york in the united states 4 it was formed in 1963 by the merger of the institute of radio engineers ire founded 1912 and the american institute of electrical engineers aiee 
founded 1884 the major interests of the aiee were wire communications telegraphy and telephony and light and power systems the ire concerned mostly radio engineering and was formed from two smaller organizations the society of wireless and telegraph engineers and the wireless institute with the rise of electronics in the 1930s electronics engineers usually became members of the ire but the applications of electron tube technology became so extensive that the technical boundaries differentiating the ire and the aiee became difficult to distinguish after world war ii the two organizations became increasingly competitive and in 1961 the leadership of both the ire and the aiee resolved to consolidate the two organizations the two organizations formally merged as the ieee on january 1 1963 notable presidents of ieee and its founding organizations include elihu thomson aiee 1889 1890 alexander graham bell aiee 1891 1892 charles proteus steinmetz aiee 1901 1902 lee de forest ire 1930 frederick e terman ire 1941 william r hewlett ire 1954 ernst weber ire 1959 ieee 1963 and ivan getting ieee 1978 ieee s constitution defines the purposes of the organization as scientific and educational directed toward the advancement of the theory and practice of electrical electronics communications and computer engineering as well as computer science the allied branches of engineering and the related arts and sciences 1 in pursuing these goals the ieee serves as a major publisher of scientific journals and organizer of conferences workshops and symposia many of which have associated published proceedings it is also a leading standards development organization for the development of industrial standards having developed over 900 active industry technical standards in a broad range of disciplines including electric power and energy biomedical technology and healthcare information technology information assurance telecommunications consumer electronics transportation aerospace and nanotechnology ieee develops and participates in educational activities such as accreditation of electrical engineering programs in institutes of higher learning the ieee logo is a diamond shaped design which illustrates the right hand grip rule embedded in benjamin franklin s kite and it was created at the time of the 1963 merger 5 ieee has a dual complementary regional and technical structure with organizational units based on geography e g the ieee philadelphia section ieee south africa section 1 and technical focus e g the ieee computer society it manages a separate organizational unit ieee usa which recommends policies and implements programs specifically intended to benefit the members the profession and the public in the united states the ieee includes 38 technical societies organized around specialized technical fields with more than 300 local organizations that hold regular meetings the ieee standards association is in charge of the standardization activities of the ieee publications main article list of ieee publications ieee produces 30 of the world s literature in the electrical and electronics engineering and computer science fields publishing well over 100 peer reviewed journals 6 the published content in these journals as well as the content from several hundred annual conferences sponsored by the ieee are available in the ieee online digital library for subscription based access and individual publication purchases 7 in addition to journals and conference proceedings the ieee also publishes tutorials and the standards that 
are produced by its standardization committees educational activities picture of the place where an office of ieee works in the district university of bogot colombia the ieee provides learning opportunities within the engineering sciences research and technology the goal of the ieee education programs is to ensure the growth of skill and knowledge in the electricity related technical professions and to foster individual commitment to continuing education among ieee members the engineering and scientific communities and the general public ieee offers educational opportunities such as ieee elearning library 8 the education partners program 9 standards in education 10 and continuing education units ceus 11 ieee elearning library is a collection of online educational courses designed for self paced learning education partners exclusive for ieee members offers on line degree programs certifications and courses at a 10 discount the standards in education website explains what standards are and the importance of developing and using them the site includes tutorial modules and case illustrations to introduce the history of standards the basic terminology their applications and impact on products as well as news related to standards book reviews and links to other sites that contain information on standards currently twenty nine states in the united states require professional development hours pdh to maintain a professional engineering license encouraging engineers to seek continuing education units ceus for their participation in continuing education programs ceus readily translate into professional development hours pdhs with 1 ceu being equivalent to 10 pdhs countries outside the united states such as south africa similarly require continuing professional development cpd credits and it is anticipated that ieee expert now courses will feature in the cpd listing for south africa ieee also sponsors a website 12 designed to help young people understand better what engineering means and how an engineering career can be made part of their future students of age 8 18 parents and teachers can explore the site to prepare for an engineering career ask experts engineering related questions play interactive games explore curriculum links and review lesson plans this website also allows students to search for accredited engineering degree programs in canada and the united states visitors are able to search by state province territory country degree field tuition ranges room and board ranges size of student body and location rural suburban or urban standards and development process main article ieee standards association ieee is one of the leading standards making organizations in the world ieee performs its standards making and maintaining functions through the ieee standards association ieee sa ieee standards affect a wide range of industries including power and energy biomedical and healthcare information technology it telecommunications transportation nanotechnology information assurance and many more in 2005 ieee had close to 900 active standards with 500 standards under development one of the more notable ieee standards is the ieee 802 lan man group of standards which includes the ieee 802 3 ethernet standard and the ieee 802 11 wireless networking standard membership and member grades most ieee members are electrical and electronics engineers but the organization s wide scope of interests has attracted people in other disciplines as well e g computer science mechanical engineering civil engineering 
biology physics and mathematics an individual can join the ieee as a student member professional member or associate member in order to qualify for membership the individual must fulfil certain academic or professional criteria and abide to the code of ethics and bylaws of the organization there are several categories and levels of ieee membership and affiliation student members student membership is available for a reduced fee to those who are enrolled in an accredited institution of higher education as undergraduate or graduate students in technology or engineering members ordinary or professional membership requires that the individual have graduated from a technology or engineering program of an appropriately accredited institution of higher education or have demonstrated professional competence in technology or engineering through at least six years of professional work experience an associate membership is available to individuals whose area of expertise falls outside the scope of the ieee or who does not at the time of enrollment meet all the requirements for full membership students and associates have all the privileges of members except the right to vote and hold certain offices society affiliates some ieee societies also allow a person who is not an ieee member to become a society affiliate of a particular society within the ieee which allows a limited form of participation in the work of a particular ieee society senior members upon meeting certain requirements a professional member can apply for senior membership which is the highest level of recognition that a professional member can directly apply for applicants for senior member must have at least three letters of recommendation from senior fellow or honorary members and fulfill other rigorous requirements of education achievement remarkable contribution and experience in the field the senior members are a selected group and certain ieee officer positions are available only to senior and fellow members senior membership is also one of the requirements for those who are nominated and elevated to the grade ieee fellow a distinctive honor fellow members the fellow grade of membership is the highest level of membership and cannot be applied for directly by the member instead the candidate must be nominated by others this grade of membership is conferred by the ieee board of directors in recognition of a high level of demonstrated extraordinary accomplishment honorary members individuals who are not ieee members but have demonstrated exceptional contributions such as being a recipient of an ieee medal of honor may receive honorary membership from the ieee board of directors life members and life fellows members who have reached the age of 65 and whose number of years of membership plus their age in years adds up to at least 100 are recognized as life members and in the case of fellow members as life fellows awards through its awards program the ieee recognizes contributions that advance the fields of interest to the ieee for nearly a century the ieee awards program has paid tribute to technical professionals whose exceptional achievements and outstanding contributions have made a lasting impact on technology society and the engineering profession funds for the awards program other than those provided by corporate sponsors for some awards are administered by the ieee foundation medals ieee medal of honor ieee edison medal ieee founders medal for leadership planning and administration ieee james h mulligan jr education medal ieee 
alexander graham bell medal for communications engineering ieee simon ramo medal for systems engineering ieee medal for engineering excellence ieee medal for environmental and safety technologies ieee medal in power engineering ieee richard w hamming medal for information technology ieee heinrich hertz medal for electromagnetics ieee john von neumann medal for computer related technology ieee jack s kilby signal processing medal ieee dennis j picard medal for radar technologies and applications ieee robert n noyce medal for microelectronics ieee medal for innovations in healthcare technology ieee rse wolfson james clerk maxwell award ieee centennial medal technical field awards ieee biomedical engineering award ieee cledo brunetti award for nanotechnology and miniaturization ieee claude e shannon award in information theory ieee components packaging and manufacturing technologies award ieee control systems award ieee electromagnetics award ieee james l flanagan speech and audio processing award ieee andrew s grove award for solid state devices ieee herman halperin electric transmission and distribution award ieee masaru ibuka consumer electronics award ieee internet award ieee reynold b johnson data storage device technology award ieee reynold b johnson information storage systems award ieee richard harold kaufmann award for industrial systems engineering ieee joseph f keithley award in instrumentation and measurement ieee gustav robert kirchhoff award for electronic circuits and systems ieee leon k kirchmayer graduate teaching award ieee koji kobayashi computers and communications award ieee william e newell power electronics award ieee daniel e noble award for emerging technologies ieee donald o pederson award in solid state circuits ieee frederik philips award for management of research and development ieee photonics award ieee emanuel r piore award for information processing systems in computer science ieee judith a resnik award for space engineering ieee robotics and automation award ieee frank rosenblatt award for biologically and linguistically motivated computational paradigms such as neural networks ieee david sarnoff award for electronics ieee charles proteus steinmetz award for standardization ieee marie sklodowska curie award for nuclear and plasma engineering ieee eric e sumner award for communications technology ieee undergraduate teaching award ieee nikola tesla award for power technology ieee kiyo tomiyasu award for technologies holding the promise of innovative applications recognitions ieee haraden pratt award ieee richard m emberson award ieee corporate innovation recognition ieee ernst weber engineering leadership recognition ieee honorary membership prize paper awards ieee donald g fink prize paper award ieee w r g baker award scholarships ieee life members graduate study fellowship in electrical engineering was established by the ieee in 2000 the fellowship is awarded annually to a first year full time graduate student obtaining their masters for work in the area of electrical engineering at an engineering school program of recognized standing worldwide 13 ieee charles legeyt fortescue graduate scholarship was established by the ire in 1939 to commemorate charles legeyt fortescue s contributions to electrical engineering the scholarship is awarded for one year of full time graduate work obtaining their masters in electrical engineering an ane engineering school of recognized standing in the united states 14 societies ieee is supported by 38 societies each one focused 
on a certain knowledge area they provide specialized publications conferences business networking and sometimes other services 15 16 ieee aerospace and electronic systems society ieee antennas amp propagation society ieee broadcast technology society ieee circuits and systems society ieee communications society ieee components packaging amp manufacturing technology society ieee computational intelligence society ieee computer society ieee consumer electronics society ieee control systems society ieee dielectrics amp electrical insulation society ieee education society ieee electromagnetic compatibility society ieee electron devices society ieee engineering in medicine and biology society ieee geoscience and remote sensing society ieee industrial electronics society ieee industry applications society ieee information theory society ieee instrumentation amp measurement society ieee intelligent transportation systems society ieee magnetics society ieee microwave theory and techniques society ieee nuclear and plasma sciences society ieee oceanic engineering society ieee photonics society ieee power electronics society ieee power amp energy society ieee product safety engineering society ieee professional communication society ieee reliability society ieee robotics and automation society ieee signal processing society ieee society on social implications of technology ieee solid state circuits society ieee systems man amp cybernetics society ieee ultrasonics ferroelectrics amp frequency control society ieee vehicular technology society technical councils ieee technical councils are collaborations of several ieee societies on a broader knowledge area there are currently seven technical councils 15 17 ieee biometrics council ieee council on electronic design automation ieee nanotechnology council ieee sensors council ieee council on superconductivity ieee systems council ieee technology management council technical committees to allow a quick response to new innovations ieee can also organize technical committees on top of their societies and technical councils there are currently two such technical committees 15 ieee committee on earth observation iceo ieee technical committee on rfid crfid organizational units technical activities board tab ieee foundation the ieee foundation is a charitable foundation established in 1973 to support and promote technology education innovation and excellence 18 it is incorporated separately from the ieee although it has a close relationship to it members of the board of directors of the foundation are required to be active members of ieee and one third of them must be current or former members of the ieee board of directors initially the ieee foundation s role was to accept and administer donations for the ieee awards program but donations increased beyond what was necessary for this purpose and the scope was broadened in addition to soliciting and administering unrestricted funds the foundation also administers donor designated funds supporting particular educational humanitarian historical preservation and peer recognition programs of the ieee 18 as of the end of 2009 the foundation s total assets were 27 million split equally between unrestricted and donor designated funds 19 copyright policy the ieee generally does not create its own research it is a professional organization that coordinates journal peer review activities and holds subject specific conferences in which authors present their research the ieee then publishes the authors papers in journals and 
other proceedings and authors are required to transfer their copyright for works they submit for publication 20 21 section 6 3 1 ieee copyright policies subsections 7 and 8 states that all authors shall transfer to the ieee in writing any copyright they hold for their individual papers but that the ieee will grant the authors permission to make copies and use the papers they originally authored so long as such use is permitted by the board of directors the guidelines for what the board considers a permitted use are not entirely clear although posting a copy on a personally controlled website is allowed the author is also not allowed to change the work absent explicit approval from the organization the ieee justifies this practice in the first paragraph of that section by stating that they will serve and protect the interests of its authors and their employers 20 21 the ieee places research papers and other publications such as ieee standards behind a paywall 20 although the ieee explicitly allows authors to make a copy of the papers that they authored freely available on their own website as of september 2011 the ieee also provides authors for most new journal papers with the option to pay to allow free download of their papers by the public from the ieee publication website 22 ieee publications have received a green 23 rating from the sherpa romeo guide 24 for affirming authors and or their companies shall have the right to post their ieee copyrighted material on their own servers without permission ieee publication policy 8 1 9 d 25 this open access policy effectively allows authors at their choice to make their article openly available roughly 1 3 of the ieee authors take this route citation needed some other professional associations use different copyright policies for example the usenix association 20 requires that the author only give up the right to publish the paper elsewhere for 12 months in addition to allowing authors to post copies of the paper on their own website during that time the organization operates successfully even though all of its publications are freely available online 20 see also certified software development professional csdp program of the ieee computer society eta kappa nu the electrical and computer engineering honor society of the ieee institution of engineering and technology association of scientists developers and faculties ieee s sanctions against iranian scientists references a b ieee technical activities board operations manual ieee retrieved december 160 7 160 2010 160 2010 12 07 160 section 1 3 technical activities objectives ieee at a glance gt ieee quick facts ieee december 160 31 160 2010 160 2010 12 31 retrieved may 160 5 160 2013 160 2013 05 05 160 ieee 2012 annual report ieee october 160 2011 160 2011 10 retrieved may 160 5 160 2013 160 2013 05 05 160 ieee technical activities board operations manual ieee retrieved november 160 10 160 2010 160 2010 11 10 160 section 1 1 ieee incorporation ieee master brand and logos www ieee org retrieved 2011 01 28 160 about ieee ieee s online digital library ieee ieee expert now ieee ieee education partners program ieee the ieee standards education pages have moved ieee ieee continuing education units welcome to tryengineering org ieee life member graduate study fellowship retrieved on 2010 01 23 charles legeyt fortescue graduate scholarship retrieved on 2010 01 23 a b c ieee societies amp communities ieee retrieved november 160 7 160 2010 160 2010 11 07 160 ieee society memberships ieee retrieved november 
7, 2010.
"IEEE Technical Councils". IEEE. Retrieved November 8, 2010.
"IEEE Foundation Home Page".
"IEEE Foundation Overview Page".
Johns, Chris (March 12, 2011). "Matt Blaze's criticism of the ACM and the IEEE". Washington College of Law Intellectual Property Brief, American University. Retrieved 2011-04-17. (This section uses content available under the CC BY-SA 3.0 license; the American University Washington College of Law Intellectual Property Brief is licensed by Dan Rosenthal under a Creative Commons Attribution 3.0 United States License and hosted by Dan Rosenthal.)
"6.3.1 IEEE Copyright Policies" (available online). IEEE. 2011. Retrieved 2011-04-17.
Davis, Amanda. "Most IEEE journals are now open access". The Institute, October 7, 2011.
SHERPA/RoMEO colour code.
SHERPA/RoMEO site.
IEEE Publication Policy 8.1.9.D (dead link).
External links
Official IEEE website.
IEEE Global History Network, a wiki-based website containing information about the history of IEEE, its members, their professions and their technologies.
IEEE Xplore, the IEEE Xplore digital library with over 2.6 million technical documents available online for purchase.
IEEE.tv, a video content website operated by the IEEE.
IEEE eLearning Library, an online library of more than 200 self-study multimedia short courses and tutorials in technical fields of interest to the IEEE.
Retrieved from http://en.wikipedia.org/w/index.php?title=Institute_of_electrical_and_electronics_engineers&oldid=561665911 \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Information_extraction b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Information_extraction new file
mode 100644 index 00000000..8a41b673 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Information_extraction @@ -0,0 +1 @@ +information extraction wikipedia the free encyclopedia information extraction from wikipedia the free encyclopedia jump to navigation search information extraction ie is the task of automatically extracting structured information from unstructured and or semi structured machine readable documents in most of the cases this activity concerns processing human language texts by means of natural language processing nlp recent activities in multimedia document processing like automatic annotation and content extraction out of images audio video could be seen as information extraction due to the difficulty of the problem current approaches to ie focus on narrowly restricted domains an example is the extraction from news wire reports of corporate mergers such as denoted by the formal relation from an online news sentence such as yesterday new york based foo inc announced their acquisition of bar corp a broad goal of ie is to allow computation to be done on the previously unstructured data a more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data structured data is semantically well defined data from a chosen target domain interpreted with respect to category and context contents 1 history 2 present significance 3 tasks and subtasks 4 world wide web applications 5 approaches 6 free or open source software and services 7 see also 8 references 9 external links history edit information extraction dates back to the late 1970s in the early days of nlp 1 an early commercial system from the mid 1980s was jasper built for reuters by the carnegie group with the aim of providing real time financial news to financial traders 2 beginning in 1987 ie was spurred by a series of message understanding conferences muc is a competition based conference that focused on the following domains muc 1 1987 muc 2 1989 naval operations messages muc 3 1991 muc 4 1992 terrorism in latin american countries muc 5 1993 joint ventures and microelectronics domain muc 6 1995 news articles on management changes muc 7 1998 satellite launch reports considerable support came from darpa the us defense agency who wished to automate mundane tasks performed by government analysts such as scanning newspapers for possible links to terrorism present significance edit the present significance of ie pertains to the growing amount of information available in unstructured form tim berners lee inventor of the world wide web refers to the existing internet as the web of documents 3 and advocates that more of the content be made available as a web of data 4 until this transpires the web largely consists of unstructured documents lacking semantic metadata knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form or by marking up with xml tags an intelligent agent monitoring a news data feed requires ie to transform unstructured data into something that can be reasoned with a typical application of ie is to scan a set of documents written in a natural language and populate a database with the information extracted 5 tasks and subtasks edit applying information extraction on text is linked to the problem of text simplification in order to create a structured view of the information present in free text the overall goal being to create a more easily machine 
readable machine-readable text to process. Typical subtasks of IE include:
- Named entity extraction, which could include:
  - Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without having any existing knowledge about the entity instances: for example, in processing the sentence "M. Smith likes fishing", named entity detection would mean detecting that the phrase "M. Smith" refers to a person, but without necessarily having or using any knowledge about a certain M. Smith who is (or might be) the specific person that sentence is talking about.
  - Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person M. Smith.
  - Relationship extraction: identification of relations between entities, such as PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM") or PERSON located in LOCATION (extracted from the sentence "Bill is in France"); a minimal sketch follows below.
- Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
  - Table extraction: finding and extracting tables from documents.
  - Comments extraction: extracting comments from the actual content of an article in order to restore the link between each sentence and its author.
- Language and vocabulary analysis:
  - Terminology extraction: finding the relevant terms for a given corpus.
- Audio extraction:
  - Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance [6], time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.
Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly agreed, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE. IE on non-text documents is an increasingly researched topic, and information extracted from multimedia documents can now be expressed in a high-level structure as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.
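As a minimal illustration of the relationship-extraction subtask listed above, the following sketch (not part of the article; every pattern, name and sentence in it is hypothetical) uses the simplest of the approaches discussed further below, a hand-written regular expression, to pull PERSON works for ORGANIZATION pairs out of raw sentences such as "Bill works for IBM".

```python
import re

# Hand-written pattern for one relation type: "<Person> works for <Organization>".
# Hypothetical names throughout; real systems would use trained sequence models
# (HMMs, MEMMs, CRFs) rather than a single regular expression.
WORKS_FOR = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) works for (?P<org>[A-Z]\w*)"
)

def extract_works_for(text):
    """Return (person, organization) pairs matched by the pattern."""
    return [(m.group("person"), m.group("org")) for m in WORKS_FOR.finditer(text)]

if __name__ == "__main__":
    sample = "Bill works for IBM. Mary Smith works for Acme."
    print(extract_works_for(sample))
    # -> [('Bill', 'IBM'), ('Mary Smith', 'Acme')]
```

A single pattern like this is brittle (it misses "Bill is employed by IBM", for example), which is why the classifier- and sequence-model-based methods listed in the Approaches section are the standard alternatives.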
World Wide Web applications
IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for IE systems that help people cope with the enormous amount of data available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development, and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and layout formats that are available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically. Wrappers typically handle highly structured collections of web pages, such as product catalogues and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural-language knowledge and thus can also be applied to less structured text.
Approaches
Three standard approaches are now widely accepted:
- Hand-written regular expressions (perhaps stacked)
- Classifiers: generative (naïve Bayes) or discriminative (maximum-entropy models)
- Sequence models: hidden Markov models (HMMs), conditional/maximum-entropy Markov models (CMMs/MEMMs), and conditional random fields (CRFs)
Conditional random fields are commonly used in conjunction with IE for tasks as varied as extracting information from research papers [7] and extracting navigation instructions [8]. Numerous other approaches exist for IE, including hybrid approaches that combine some of the standard approaches previously listed.
Free or open-source software and services
- General Architecture for Text Engineering (GATE), which is bundled with a free information extraction system
- OpenCalais, an automated information extraction web service from Thomson Reuters (free limited version)
- Machine Learning for Language Toolkit (Mallet), a Java-based package for a variety of natural language processing tasks, including information extraction
- DBpedia Spotlight, an open-source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution
- See also CRF implementations
See also: AI effect; Applications of artificial intelligence; Concept mining; DARPA TIPSTER Program; Enterprise search; Faceted search; Knowledge extraction; Named entity recognition; Nutch; Semantic translation; Web scraping. Lists: List of emerging technologies; Outline of artificial intelligence.
References
1. Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. "Automatic Extraction of Facts from Press Releases to Generate News Stories". CiteSeerX: 10.1.1.14.7943.
2. Cowie, Jim; Wilks, Yorick. "Information Extraction". CiteSeerX: 10.1.1.61.6480.
3. "Linked Data - The Story So Far".
4. "Tim Berners-Lee on the next Web".
5. Srihari, R. K.; Li, W.; Niu, C.; Cornell, T. "InfoXtract: A Customizable Intermediate Level Information Extraction Engine". Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.
6. Zils, A.; Pachet, F.; Delerue, O.; Gouyon, F. "Automatic Extraction of Drum Tracks from Polyphonic Music Signals". Proceedings of WedelMusic, Darmstadt, Germany, 2002.
7. Peng, F.; McCallum, A. (2006). "Information extraction from research papers using conditional random fields". Information Processing & Management 42: 963. doi:10.1016/j.ipm.2005.09.002.
8. Shimizu, Nobuyuki; Hass, Andrew (2006). "Extracting Frame-based Knowledge Representation from Route Instructions".
External links: MUC; ACE (LDC); ACE (NIST); Alias-i competition page, a listing of academic toolkits and industrial toolkits for natural language
information extraction gabor melli s page on ie detailed description of the information extraction task crf yet another crf toolkit a survey of web information extraction systems a comprehensive survey an information extraction framework a framework to develop and compare information extractors enterprise search retrieved from http en wikipedia org w index php title information_extraction amp oldid 550994368 categories natural language processingartificial intelligencehidden categories pages using web citations with no url navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages esky deutsch espa ol euskara srpski edit links this page was last modified on 18 april 2013 at 16 02 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/International_Conference_on_Very_Large_Data_Bases b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/International_Conference_on_Very_Large_Data_Bases new file mode 100644 index 00000000..cf0662c6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/International_Conference_on_Very_Large_Data_Bases @@ -0,0 +1 @@ +vldb wikipedia the free encyclopedia vldb from wikipedia the free encyclopedia redirected from international conference on very large data bases jump to navigation search this article is about international conference on very large databases for large size databases see very large database vldb abbreviation vldb discipline database publication details publisher vldb endowment inc history 1975 frequency annual vldb is an annual conference held by the non profit very large data base endowment inc the mission of vldb is to promote and exchange scholarly work in databases and related fields throughout the world the vldb conference began in 1975 and is closely associated with sigmod and sigkdd acceptance rate of vldb averaged from 1993 to 2007 is 16 1 and the rate for the core database technology track is 16 7 in 2009 and 18 4 in 2010 2 venues edit year city country link 2014 hangzhou china http www vldb org 2014 2013 riva del garda italy http www vldb org 2013 2012 istanbul turkey http www vldb2012 org 2011 seattle united states http www vldb org 2011 2010 singapore http vldb2010 org 2009 lyon france http vldb2009 org 2008 auckland new zealand vldb at cs auckland ac nz 2007 vienna austria http www vldb2007 org 2006 seoul south korea dblp 2005 trondheim norway dblp 2004 toronto canada dblp 2003 berlin germany dblp 2002 hong kong china dblp 2001 rome italy dblp 2000 cairo egypt dblp 1999 edinburgh scotland uni trier de 1998 new york usa uni trier de 1997 athens greece uni trier de references edit apers peter 2007 acceptance rates major database conferences retrieved 2009 06 12 160 vldb statistics 2010 retrieved 2012 09 
17 160 external links edit vldb endowment inc retrieved from http en wikipedia org w index php title vldb amp oldid 559622487 categories computer science conferences navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 12 june 2013 at 20 29 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/KDD_Conference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/KDD_Conference new file mode 100644 index 00000000..4287c88a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/KDD_Conference @@ -0,0 +1 @@ +sigkdd wikipedia the free encyclopedia sigkdd from wikipedia the free encyclopedia redirected from kdd conference jump to navigation search sigkdd is the association for computing machinery s special interest group on knowledge discovery and data mining it became an official acm sig in 1998 the official web page of sigkdd can be found on www kdd org the current chairman of sigkdd since 2009 is usama m fayyad ph d contents 1 conferences 2 kdd cup 3 awards 4 sigkdd explorations 5 current executive committee 6 information directors 7 references 8 external links conferences edit sigkdd has hosted an annual conference acm sigkdd conference on knowledge discovery and data mining kdd since 1995 kdd conferences grew from kdd knowledge discovery and data mining workshops at aaai conferences which were started by gregory piatetsky shapiro in 1989 1991 and 1993 and usama fayyad in 1994 1 conference papers of each proceedings of the sigkdd international conference on knowledge discovery and data mining are published through acm 2 kdd 2012 took place in beijing china 3 and kdd 2013 will take place in chicago united states aug 11 14 2013 kdd cup edit sigkdd sponsors the kdd cup competition every year in conjunction with the annual conference it is aimed at members of the industry and academia particularly students interested in kdd awards edit the group also annually recognizes members of the kdd community with its innovation award and service award additionally kdd presents a best paper award 4 to recognize the highest quality paper at each conference sigkdd explorations edit sigkdd has also published a biannual academic journal titled sigkdd explorations since june 1999 editors in chief bart goethals since 2010 osmar r zaiane 2008 2010 ramakrishnan srikant 2006 2007 sunita sarawagi 2003 2006 usama fayyad 1999 2002 current executive committee edit chair usama fayyad 2009 treasurer osmar r zaiane 2009 directors johannes gehrke robert grossman david d jensen 5 raghu ramakrishnan sunita sarawagi 6 ramakrishnan srikant 7 former chairpersons gregory piatetsky shapiro 8 2005 2008 won kim 1998 2004 information directors edit 
ankur teredesai 2011 gabor melli 9 2004 2011 ramakrishnan srikant 1998 2003 references edit http www sigkdd org conferences php http dl acm org event cfm id re329 http kdd2012 sigkdd org kdd conference best paper awards retrieved 2012 04 07 160 http kdl cs umass edu people jensen http www it iitb ac in sunita http www rsrikant com http www kdnuggets com gps html http www gabormeli com rkb external links edit acm sigkdd homepage acm sigkdd explorations homepage kdd 2013 conference homepage kdd 2012 conference homepage this computing article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title sigkdd amp oldid 558448906 conferences categories association for computing machinery special interest groupsdata miningcomputing stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 5 june 2013 at 14 14 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/K_optimal_pattern_discovery b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/K_optimal_pattern_discovery new file mode 100644 index 00000000..b98c7af2 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/K_optimal_pattern_discovery @@ -0,0 +1 @@ +k optimal pattern discovery wikipedia the free encyclopedia k optimal pattern discovery from wikipedia the free encyclopedia jump to navigation search k optimal pattern discovery is a data mining technique that provides an alternative to the frequent pattern discovery approach that underlies most association rule learning techniques frequent pattern discovery techniques find all patterns for which there are sufficiently frequent examples in the sample data in contrast k optimal pattern discovery techniques find the k patterns that optimize a user specified measure of interest the parameter k is also specified by the user examples of k optimal pattern discovery techniques include k optimal classification rule discovery 1 k optimal subgroup discovery 2 finding k most interesting patterns using sequential sampling 3 mining top k frequent closed patterns without minimum support 4 k optimal rule discovery 5 in contrast to k optimal rule discovery and frequent pattern mining techniques subgroup discovery focuses on mining interesting patterns with respect to a specified target property of interest this includes for example binary nominal or numeric attributes 6 but also more complex target concepts such as correlations between several variables background knowledge 7 like constraints and ontological relations can often be successfully applied for focusing and improving the discovery results references edit webb g i 1995 opus an efficient admissible 
algorithm for unordered search journal of artificial intelligence research 3 431 465 wrobel stefan 1997 an algorithm for multi relational discovery of subgroups in proceedings first european symposium on principles of data mining and knowledge discovery springer scheffer t amp wrobel s 2002 finding the most interesting patterns in a database quickly by using sequential sampling journal of machine learning research 3 833 862 han j wang j lu y amp tzvetkov p 2002 mining top k frequent closed patterns without minimum support in proceedings of the international conference on data mining pp 211 218 webb g i amp zhang s 2005 k optimal rule discovery data mining and knowledge discovery 10 1 39 79 kloesgen w 1996 explora a multipattern and multistrategy discovery assistant advances in knowledge discovery and data mining pp 249 271 atzmueller m puppe f buscher hp 2005 exploiting background knowledge for knowledge intensive subgroup discovery proc ijcai 05 19th international joint conference on artificial intelligence morgan kaufmann external links edit software edit magnum opus vikamine retrieved from http en wikipedia org w index php title k optimal_pattern_discovery amp oldid 517405257 categories data mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 12 october 2012 at 14 29 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Lift_data_mining_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Lift_data_mining_ new file mode 100644 index 00000000..b3a3795f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Lift_data_mining_ @@ -0,0 +1 @@ +lift data mining wikipedia the free encyclopedia lift data mining from wikipedia the free encyclopedia jump to navigation search for other uses see lift in data mining and association rule learning lift is a measure of the performance of a targeting model association rule at predicting or classifying cases as having an enhanced response with respect to the population as a whole measured against a random choice targeting model a targeting model is doing a good job if the response within the target is much better than the average for the population as a whole lift is simply the ratio of these values target response divided by average response for example suppose a population has an average response rate of 5 but a certain model or rule has identified a segment with a response rate of 20 then that segment would have a lift of 4 0 20 5 typically the modeller seeks to divide the population into quantiles and rank the quantiles by lift organizations can then consider each quantile and by weighing the predicted response rate and associated financial benefit against the 
cost they can decide whether to market to that quantile or not lift is analogous to information retrieval s average precision metric if one treats the precision fraction of the positives that are true positives as the target response probability the lift curve can also be considered a variation on the receiver operating characteristic roc curve and is also known in econometrics as the lorenz or power curve 1 example edit assume the data set being mined is antecedent consequent a 0 a 0 a 1 a 0 b 1 b 0 b 1 where the antecedent is the input variable that we can control and the consequent is the variable we are trying to predict real mining problems would typically have more complex antecedents but usually focus on single value consequents most mining algorithms would determine the following rules targeting models rule 1 a implies 0 rule 2 b implies 1 because these are simply the most common patterns found in the data a simple review of the above table should make these rules obvious the support for rule 1 is 3 7 because that is the number of items in the dataset in which the antecedent is a and the consequent 0 the support for rule 2 is 2 7 because two of the seven records meet the antecedent of b and the consequent of 1 the supports can be written as the confidence for rule 1 is 3 4 because three of the four records that meet the antecedent of a meet the consequent of 0 the confidence for rule 2 is 2 3 because two of the three records that meet the antecedent of b meet the consequent of 1 the confidences can be written as lift can be found by dividing the confidence by the unconditional probability of the consequent or by dividing the support by the probability of the antecedent times the probability of the consequent so the lift for rule 1 is 3 4 4 7 1 3125 the lift for rule 2 is 2 3 3 7 2 3 7 3 14 9 1 5 if some rule had a lift of 1 it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other when two events are independent of each other no rule can be drawn involving those two events if the lift is positive like it is here for rules 1 and 2 that lets us know the degree to which those two occurrences are dependent on one another and makes those rules potentially useful for predicting the consequent in future data sets observe that even though rule 1 has higher confidence it has lower lift intuitively it would seem that rule 1 is more valuable because of its higher confidence it seems more accurate better supported but accuracy of the rule independent of the data set can be misleading the value of lift is that it considers both the confidence of the rule and the overall data set references edit tuff ry st phane 2011 data mining and statistics for decision making chichester gb john wiley amp sons translated from the french data mining et statistique d cisionnelle ditions technip 2008 coppock david s 2002 06 21 data modeling and management why lift retrieved 2007 02 19 160 see also edit uplift modelling retrieved from http en wikipedia org w index php title lift_ data_mining amp oldid 544692864 categories data mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page 
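As a quick check of the worked example above, here is a minimal Python sketch (the helper names are my own) that reproduces the quoted support, confidence and lift values, using exact fractions so the 3/4, 4/7 and 14/9 terms stay visible:

from fractions import Fraction

# The seven (antecedent, consequent) records from the example above.
data = [("A", 0), ("A", 0), ("A", 1), ("A", 0), ("B", 1), ("B", 0), ("B", 1)]

def support(rule, records):
    """P(antecedent and consequent): fraction of all records matching both sides."""
    a, c = rule
    return Fraction(sum(1 for x, y in records if x == a and y == c), len(records))

def confidence(rule, records):
    """P(consequent | antecedent): matching records among those with the antecedent."""
    a, c = rule
    with_antecedent = [y for x, y in records if x == a]
    return Fraction(sum(1 for y in with_antecedent if y == c), len(with_antecedent))

def lift(rule, records):
    """Confidence divided by the unconditional probability of the consequent."""
    _, c = rule
    p_consequent = Fraction(sum(1 for _, y in records if y == c), len(records))
    return confidence(rule, records) / p_consequent

for rule in [("A", 0), ("B", 1)]:
    print(rule, support(rule, data), confidence(rule, data),
          lift(rule, data), float(lift(rule, data)))
# ('A', 0) 3/7 3/4 21/16 1.3125
# ('B', 1) 2/7 2/3 14/9 1.555...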
print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 16 march 2013 at 17 38 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/List_of_machine_learning_algorithms b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/List_of_machine_learning_algorithms new file mode 100644 index 00000000..eb566689 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/List_of_machine_learning_algorithms @@ -0,0 +1 @@ +list of machine learning algorithms wikipedia the free encyclopedia list of machine learning algorithms from wikipedia the free encyclopedia jump to navigation search contents 1 supervised learning 1 1 statistical classification 2 unsupervised learning 2 1 association rule learning 2 2 hierarchical clustering 2 3 partitional clustering 3 reinforcement learning 4 others supervised learning edit aode artificial neural network backpropagation bayesian statistics naive bayes classifier bayesian network bayesian knowledge base case based reasoning decision trees inductive logic programming gaussian process regression gene expression programming group method of data handling gmdh learning automata learning vector quantization logistic model tree minimum message length decision trees decision graphs etc lazy learning instance based learning nearest neighbor algorithm analogical modeling probably approximately correct learning pac learning ripple down rules a knowledge acquisition methodology symbolic machine learning algorithms subsymbolic machine learning algorithms support vector machines random forests ensembles of classifiers bootstrap aggregating bagging boosting meta algorithm ordinal classification regression analysis information fuzzy networks ifn statistical classification edit anova linear classifiers fisher s linear discriminant logistic regression naive bayes classifier perceptron support vector machines quadratic classifiers k nearest neighbor boosting decision trees c4 5 random forests bayesian networks hidden markov models unsupervised learning edit artificial neural network data clustering expectation maximization algorithm self organizing map radial basis function network vector quantization generative topographic map information bottleneck method ibsead association rule learning edit apriori algorithm eclat algorithm fp growth algorithm hierarchical clustering edit single linkage clustering conceptual clustering partitional clustering edit k means algorithm fuzzy clustering reinforcement learning edit temporal difference learning q learning learning automata monte carlo method sarsa others edit data pre processing retrieved from http en wikipedia org w index php title list_of_machine_learning_algorithms amp oldid 552756294 categories machine learning algorithmsartificial intelligencedata mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia 
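Since the list above only names the algorithms, a short sketch may help show how two of its entries differ in practice: a decision tree (supervised, trained with labels) and k-means (unsupervised partitional clustering, given only the features), here via scikit-learn. The two-blob data set is invented purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 2-D, labelled 0 and 1.
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)), rng.normal(2.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised: the labels y are used during training.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[-2.1, -1.9], [1.8, 2.2]]))   # expected: [0 1]

# Unsupervised: only X is seen; k-means recovers the two groups as clusters
# (cluster ids 0/1 may be swapped relative to y, since no labels were given).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])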
toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 29 april 2013 at 17 29 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Local_outlier_factor b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Local_outlier_factor new file mode 100644 index 00000000..655d3ae7 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Local_outlier_factor @@ -0,0 +1 @@ +local outlier factor wikipedia the free encyclopedia local outlier factor from wikipedia the free encyclopedia jump to navigation search local outlier factor lof is an anomaly detection algorithm presented as lof identifying density based local outliers by markus m breunig hans peter kriegel raymond t ng and j rg sander 1 the key idea of lof is comparing the local density of a point s neighborhood with the local density of its neighbors lof shares some concepts with dbscan and optics such as the concepts of core distance and reachability distance which are used for local density estimation contents 1 basic idea 2 formal 3 advantages 4 disadvantages and extensions 5 references basic idea edit basic idea of lof comparing the local density of a point with the densities of its neighbors a has a much lower density than its neighbors as indicated by the title the local outlier factor is based on a concept of a local density where locality is given by nearest neighbors whose distance is used to estimate the density by comparing the local density of an object to the local densities of its neighbors one can identify regions of similar density and points that have a substantially lower density than their neighbors these are considered to be outliers the local density is estimated by the typical distance at which a point can be reached from its neighbors the definition of reachability distance used in lof is an additional measure to produce more stable results within clusters formal edit let be the distance of the object to the k nearest neighbor note that the set of the k nearest neighbors includes all objects at this distance which can in the case of a tie be more than k objects we denote the set of k nearest neighbors as illustration of the reachability distance objects b and c have the same reachability distance k 3 while d is not a k nearest neighbor this distance is used to define what is called reachability distance in words the reachability distance of an object from is the true distance of the two objects but at least the of objects that belong to the k nearest neighbors of the core of see dbscan cluster analysis are considered to be equally distant the reason for this distance is to get more stable results note that this is not a distance in the mathematical definition since it is not symmetric the local reachability density of an object is defined by which is the quotient of the average reachability distance of the object from its neighbors note that it is not the average reachability of the neighbors from which by definition would be the but the distance 
at which it can be reached from its neighbors with duplicate points this value can become infinite the local reachability densities are then compared with those of the neighbors using which is the average local reachability density of the neighbors divided by the objects own local reachability density a value of approximately indicates that the object is comparable to its neighbors and thus not an outlier a value below indicates a denser region which would be an inlier while values significantly larger than indicate outliers advantages edit lof scores as visualized by elki while the upper right cluster has a comparable density to the outliers close to the bottom left cluster they are detected correctly due to the local approach lof is able to identify outliers in a data set that would not be outliers in another area of the data set for example a point at a small distance to a very dense cluster is an outlier while a point within a sparse cluster might exhibit similar distances to its neighbors while the geometric intuition of lof is only applicable to low dimensional vector spaces the algorithm can be applied in any context a dissimilarity function can be defined it has experimentally been shown to work very well in numerous setups often outperforming the competitors for example in network intrusion detection 2 disadvantages and extensions edit the resulting values are quotient values and hard to interpret a value of 1 or even less indicates a clear inlier but there is no clear rule for when a point is an outlier in one data set a value of 1 1 may already be an outlier in another dataset and parameterization with strong local fluctuations a value of 2 could still be an inlier these differences can also occur within a dataset due to the locality of the method there exist extensions of lof that try to improve over lof in these aspects feature bagging for outlier detection 3 runs lof on multiple projections and combines the results for improved detection qualities in high dimensions local outlier probability loop 4 is a method derived from lof but using inexpensive local statistics to become less sensitive to the choice of the parameter k in addition the resulting values are scaled to a value range of interpreting and unifying outlier scores 5 proposes a normalization of the lof outlier scores to the interval using statistical scaling to increase usability and can be seen a improved version of the loop ideas on evaluation of outlier rankings and outlier scores 6 proposes methods for measuring similarity and diversity of methods for building advanced outlier detection ensembles using lof variants and other algorithms and improving on the feature bagging approach discussed above references edit breunig m m kriegel h p ng r t sander j 2000 lof identifying density based local outliers acm sigmod record 29 93 doi 10 1145 335191 335388 160 edit ar lazarevic aysel ozgur levent ertoz jaideep srivastava vipin kumar 2003 a comparative study of anomaly detection schemes in network intrusion detection proc 3rd siam international conference on data mining 25 36 160 lazarevic a kumar v 2005 feature bagging for outlier detection proc 11th acm sigkdd international conference on knowledge discovery in data mining 157 166 doi 10 1145 1081870 1081891 160 edit kriegel h p kr ger p schubert e zimek a 2009 loop local outlier probabilities proc 18th acm conference on information and knowledge management cikm 1649 doi 10 1145 1645953 1646195 160 edit hans peter kriegel peer kr ger erich schubert arthur zimek 2011 
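The brute-force sketch below spells out the standard LOF quantities described above, following Breunig et al.: the k-distance, the reachability distance reach-dist_k(A,B) = max(k-distance(B), d(A,B)), the local reachability density lrd_k(A) = |N_k(A)| / sum of reach-dist_k(A,B) over the neighbours B, and the LOF score as the mean of lrd_k(B)/lrd_k(A). It is a minimal illustration for tiny point sets, not the ELKI implementation mentioned above, and it ignores the tie and duplicate-point corner cases the article discusses.

from math import dist  # Euclidean distance, Python 3.8+

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (ties beyond k are ignored)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))
    return order[:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """reach-dist_k(i <- j) = max(k-distance(j), d(i, j))."""
    return max(k_distance(points, j, k), dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse mean reachability distance to the neighbours."""
    neigh = knn(points, i, k)
    return len(neigh) / sum(reach_dist(points, i, j, k) for j in neigh)

def lof(points, i, k):
    """Average lrd of the neighbours divided by the point's own lrd."""
    neigh = knn(points, i, k)
    return sum(lrd(points, j, k) for j in neigh) / (len(neigh) * lrd(points, i, k))

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
for i, p in enumerate(pts):
    print(p, round(lof(pts, i, k=2), 2))  # the isolated point (5, 5) scores well above 1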
interpreting and unifying outlier scores proc 11th siam international conference on data mining 160 erich schubert remigius wojdanowski hans peter kriegel arthur zimek 2012 on evaluation of outlier rankings and outlier scores proc 12 siam international conference on data mining 160 retrieved from http en wikipedia org w index php title local_outlier_factor amp oldid 559246766 categories statistical outliersdata miningmachine learning algorithms navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch srpski edit links this page was last modified on 10 june 2013 at 15 57 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Machine_learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Machine_learning new file mode 100644 index 00000000..e4f07753 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Machine_learning @@ -0,0 +1 @@ +machine learning wikipedia the free encyclopedia machine learning from wikipedia the free encyclopedia jump to navigation search for the journal see machine learning journal for the cognitive psychology theory see statistical learning in language acquisition see also pattern recognition machine learning a branch of artificial intelligence is about the construction and study of systems that can learn from data for example a machine learning system could be trained on email messages to learn to distinguish between spam and non spam messages after learning it can then be used to classify new email messages into spam and non spam folders the core of machine learning deals with representation and generalization representation of data instances and functions evaluated on these instances are part of all machine learning systems generalization is the property that the system will perform well on unseen data instances the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory there is a wide variety of machine learning tasks and successful applications optical character recognition in which printed characters are recognized automatically based on previous examples is a classic example of machine learning 1 contents 1 definition 2 generalization 3 machine learning and data mining 4 human interaction 5 algorithm types 6 theory 7 approaches 7 1 decision tree learning 7 2 association rule learning 7 3 artificial neural networks 7 4 genetic programming 7 5 inductive logic programming 7 6 support vector machines 7 7 clustering 7 8 bayesian networks 7 9 reinforcement learning 7 10 representation learning 7 11 similarity and metric learning 7 12 sparse dictionary learning 8 applications 9 software 10 journals and conferences 11 see also 12 references 13 further 
reading 14 external links definition edit in 1959 arthur samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed 2 tom m mitchell provided a widely quoted more formal definition a computer program is said to learn from experience e with respect to some class of tasks t and performance measure p if its performance at tasks in t as measured by p improves with experience e 3 this definition is notable for its defining machine learning in fundamentally operational rather than cognitive terms thus following alan turing s proposal in turing s paper computing machinery and intelligence that the question can machines think be replaced with the question can machines do what we as thinking entities can do 4 generalization edit generalization in this context is the ability of an algorithm to perform accurately on new unseen examples after having trained on a learning data set the core objective of a learner is to generalize from its experience 5 6 the training examples come from some generally unknown probability distribution and the learner has to extract from them something more general something about that distribution that allows it to produce useful predictions in new cases machine learning and data mining edit these two terms are commonly confused as they often employ the same methods and overlap significantly they can be roughly defined as follows machine learning focuses on prediction based on known properties learned from the training data data mining which is the analysis step of knowledge discovery in databases focuses on the discovery of previously unknown properties on the data the two areas overlap in many ways data mining uses many machine learning methods but often with a slightly different goal in mind on the other hand machine learning also employs data mining methods as unsupervised learning or as a preprocessing step to improve learner accuracy much of the confusion between these two research communities which do often have separate conferences and separate journals ecml pkdd being a major exception comes from the basic assumptions they work with in machine learning performance is usually evaluated with respect to the ability to reproduce known knowledge while in knowledge discovery and data mining kdd the key task is the discovery of previously unknown knowledge evaluated with respect to known knowledge an uninformed unsupervised method will easily be outperformed by supervised methods while in a typical kdd task supervised methods cannot be used due to the unavailability of training data human interaction edit some machine learning systems attempt to eliminate the need for human intuition in data analysis while others adopt a collaborative approach between human and machine human intuition cannot however be entirely eliminated since the system s designer must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data algorithm types edit machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or the type of input available during training the machine citation needed supervised learning generates a function that maps inputs to desired outputs also called labels because they are often provided by human experts labeling the training examples for example in a classification problem the learner approximates a function mapping a vector into classes by looking at input output examples of the function 
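Mitchell's definition and the remarks on generalization above can be illustrated with a small sketch: held-out accuracy (performance measure P) on a classification task (T) tends to rise as more training examples (experience E) are supplied. The synthetic two-blob data and the choice of a k-nearest-neighbour learner are assumptions made only for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Two overlapping Gaussian blobs labelled 0 and 1.
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(500, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
X_test, y_test = X[800:], y[800:]        # held-out data for estimating generalization

for n_train in (10, 50, 200, 800):       # growing "experience E"
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[:n_train], y[:n_train])
    print(n_train, round(clf.score(X_test, y_test), 3))
# Held-out accuracy should tend to rise with n_train (not necessarily monotonically),
# which is the "improves with experience E" part of the definition.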
unsupervised learning models a set of inputs like clustering see also data mining and knowledge discovery here labels are not known during training semi supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier transduction or transductive inference tries to predict new outputs on specific and fixed test cases from observed specific training cases reinforcement learning learns how to act given an observation of the world every action has some impact in the environment and the environment provides feedback in the form of rewards that guides the learning algorithm learning to learn learns its own inductive bias based on previous experience developmental learning elaborated for robot learning generates its own sequences also called curriculum of learning situations to cumulatively acquire repertoires of novel skills through autonomous self exploration and social interaction with human teachers and using guidance mechanisms such as active learning maturation motor synergies and imitation theory edit main article computational learning theory the computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory because training sets are finite and the future is uncertain learning theory usually does not yield guarantees of the performance of algorithms instead probabilistic bounds on the performance are quite common in addition to performance bounds computational learning theorists study the time complexity and feasibility of learning in computational learning theory a computation is considered feasible if it can be done in polynomial time there are two kinds of time complexity results positive results show that a certain class of functions can be learned in polynomial time negative results show that certain classes cannot be learned in polynomial time there are many similarities between machine learning theory and statistical inference although they use different terms approaches edit main article list of machine learning algorithms decision tree learning edit main article decision tree learning decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item s target value association rule learning edit main article association rule learning association rule learning is a method for discovering interesting relations between variables in large databases artificial neural networks edit main article artificial neural network an artificial neural network ann learning algorithm usually called neural network nn is a learning algorithm that is inspired by the structure and functional aspects of biological neural networks computations are structured in terms of an interconnected group of artificial neurons processing information using a connectionist approach to computation modern neural networks are non linear statistical data modeling tools they are usually used to model complex relationships between inputs and outputs to find patterns in data or to capture the statistical structure in an unknown joint probability distribution between observed variables genetic programming edit main articles genetic programming and evolutionary computation genetic programming gp is an evolutionary algorithm based methodology inspired by biological evolution to find computer programs that perform a user defined task it is a specialization of genetic algorithms ga where each individual is a computer 
program it is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program s ability to perform a given computational task inductive logic programming edit main article inductive logic programming inductive logic programming ilp is an approach to rule learning using logic programming as a uniform representation for examples background knowledge and hypotheses given an encoding of the known background knowledge and a set of examples represented as a logical database of facts an ilp system will derive a hypothesized logic program which entails all the positive and none of the negative examples support vector machines edit main article support vector machines support vector machines svms are a set of related supervised learning methods used for classification and regression given a set of training examples each marked as belonging to one of two categories an svm training algorithm builds a model that predicts whether a new example falls into one category or the other clustering edit main article cluster analysis cluster analysis is the assignment of a set of observations into subsets called clusters so that observations within the same cluster are similar according to some predesignated criterion or criteria while observations drawn from different clusters are dissimilar different clustering techniques make different assumptions on the structure of the data often defined by some similarity metric and evaluated for example by internal compactness similarity between members of the same cluster and separation between different clusters other methods are based on estimated density and graph connectivity clustering is a method of unsupervised learning and a common technique for statistical data analysis bayesian networks edit main article bayesian network a bayesian network belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph dag for example a bayesian network could represent the probabilistic relationships between diseases and symptoms given symptoms the network can be used to compute the probabilities of the presence of various diseases efficient algorithms exist that perform inference and learning reinforcement learning edit main article reinforcement learning reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long term reward reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states reinforcement learning differs from the supervised learning problem in that correct input output pairs are never presented nor sub optimal actions explicitly corrected representation learning edit several learning algorithms mostly unsupervised learning algorithms aim at discovering better representations of the inputs provided during training classical examples include principal components analysis and cluster analysis representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful often as a pre processing step before performing classification or predictions allowing to reconstruct the inputs coming from the unknown data generating distribution while not being necessarily faithful for configurations that are implausible under that distribution manifold learning 
algorithms attempt to do so under the constraint that the learned representation is low dimensional sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse has many zeros multilinear subspace learning algorithms aim to learn low dimensional representations directly from tensor representations for multidimensional data without reshaping them into high dimensional vectors 7 deep learning algorithms discover multiple levels of representation or a hierarchy of features with higher level more abstract features defined in terms of or generating lower level features it has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data 8 similarity and metric learning edit main article similarity learning in this problem the learning machine is given pairs of examples that are considered similar and pairs of less similar objects it then needs to learn a similarity function or a distance metric function that can predict if new objects are similar it is sometimes used in recommendation systems sparse dictionary learning edit in this method a datum is represented as a linear combination of basis functions and the coefficients are assumed to be sparse let x be a d dimensional datum d be a d by n matrix where each column of d represents a basis function r is the coefficient to represent x using d mathematically sparse dictionary learning means the following where r is sparse generally speaking n is assumed to be larger than d to allow the freedom for a sparse representation sparse dictionary learning has been applied in several contexts in classification the problem is to determine which classes a previously unseen datum belongs to suppose a dictionary for each class has already been built then a new datum is associated with the class such that it s best sparsely represented by the corresponding dictionary sparse dictionary learning has also been applied in image de noising the key idea is that a clean image path can be sparsely represented by an image dictionary but the noise cannot 9 applications edit this section does not cite any references or sources please help improve this section by adding citations to reliable sources unsourced material may be challenged and removed february 2013 applications for machine learning include machine perception computer vision natural language processing syntactic pattern recognition search engines medical diagnosis bioinformatics brain machine interfaces cheminformatics detecting credit card fraud stock market analysis classifying dna sequences sequence mining speech and handwriting recognition object recognition in computer vision game playing software engineering adaptive websites robot locomotion computational advertising computational finance structural health monitoring sentiment analysis or opinion mining affective computing information retrieval recommender systems in 2006 the online movie company netflix held the first netflix prize competition to find a program to better predict user preferences and improve the accuracy on its existing cinematch movie recommendation algorithm by at least 10 a joint team made up of researchers from at amp t labs research in collaboration with the teams big chaos and pragmatic theory built an ensemble model to win the grand prize in 2009 for 1 million 10 software edit ayasdi angoss knowledgestudio apache mahout gesture recognition toolkit ibm spss modeler knime kxen modeler 
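The sparse dictionary learning paragraph above describes representing a datum x as a sparse linear combination of dictionary atoms, i.e. x ≈ D r with the coefficient vector r sparse. A sketch with scikit-learn's DictionaryLearning follows; note that scikit-learn stores the dictionary row-wise, so the reconstruction reads X ≈ R · D with R the code matrix. The synthetic data and parameter choices are my own assumptions.

import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
# Synthetic data: 200 samples in R^8, each a combination of 2 out of 6 hidden atoms.
true_D = rng.normal(size=(6, 8))
codes = np.zeros((200, 6))
for row in codes:
    row[rng.choice(6, size=2, replace=False)] = rng.normal(size=2)
X = codes @ true_D

# Learn a 6-atom dictionary and sparse codes with at most 2 non-zeros per sample.
dl = DictionaryLearning(n_components=6, transform_algorithm="omp",
                        transform_n_nonzero_coefs=2, random_state=0)
R = dl.fit_transform(X)
print(R.shape, dl.components_.shape)           # (200, 6) (6, 8)
print(np.mean(np.sum(R != 0, axis=1)))         # roughly 2 non-zeros per sample
print(np.linalg.norm(X - R @ dl.components_))  # reconstruction error should be small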
lionsolver matlab mlpy mcmll opencv dlib oracle data mining orange python scikit learn r rapidminer salford predictive modeler sas enterprise miner shogun toolbox statistica data miner and weka are software suites containing a variety of machine learning algorithms journals and conferences edit machine learning journal journal of machine learning research neural computation journal journal of intelligent systems journal international conference on machine learning icml conference neural information processing systems nips conference see also edit artificial intelligence portal adaptive control automatic reasoning cache language model computational intelligence computational neuroscience cognitive science cognitive modeling data mining explanation based learning hidden markov model list of machine learning algorithms important publications in machine learning multi label classification multilinear subspace learning pattern recognition predictive analytics robot learning developmental robotics references edit wernick yang brankov yourganov and strother machine learning in medical imaging ieee signal processing magazine vol 27 no 4 july 2010 pp 25 38 phil simon march 18 2013 too big to ignore the business case for big data wiley p 160 89 isbn 160 978 1118638170 160 mitchell t 1997 machine learning mcgraw hill isbn 0 07 042807 7 p 2 harnad stevan 2008 the annotation game on turing 1950 on computing machinery and intelligence in epstein robert peters grace the turing test sourcebook philosophical and methodological issues in the quest for the thinking computer kluwer 160 christopher m bishop 2006 pattern recognition and machine learning springer isbn 0 387 31073 8 mehryar mohri afshin rostamizadeh ameet talwalkar 2012 foundations of machine learning the mit press isbn 9780262018258 lu haiping plataniotis k n venetsanopoulos a n 2011 a survey of multilinear subspace learning for tensor data pattern recognition 44 7 1540 1551 doi 10 1016 j patcog 2011 01 004 160 yoshua bengio 2009 learning deep architectures for ai now publishers inc pp 160 1 3 isbn 160 978 1 60198 294 0 160 aharon m m elad and a bruckstein 2006 k svd an algorithm for designing overcomplete dictionaries for sparse representation signal processing ieee transactions on 54 11 4311 4322 belkor home page research att com further reading edit mehryar mohri afshin rostamizadeh ameet talwalkar 2012 foundations of machine learning the mit press isbn 9780262018258 ian h witten and eibe frank 2011 data mining practical machine learning tools and techniques morgan kaufmann 664pp isbn 978 0123748560 sergios theodoridis konstantinos koutroumbas 2009 pattern recognition 4th edition academic press isbn 978 1 59749 272 0 mierswa ingo and wurst michael and klinkenberg ralf and scholz martin and euler timm yale rapid prototyping for complex data mining tasks in proceedings of the 12th acm sigkdd international conference on knowledge discovery and data mining kdd 06 2006 bing liu 2007 web data mining exploring hyperlinks contents and usage data springer isbn 3 540 37881 2 toby segaran 2007 programming collective intelligence o reilly isbn 0 596 52932 5 huang t m kecman v kopriva i 2006 kernel based algorithms for mining huge data sets supervised semi supervised and unsupervised learning springer verlag berlin heidelberg 260 pp 160 96 illus hardcover isbn 3 540 31681 7 ethem alpayd n 2004 introduction to machine learning adaptive computation and machine learning mit press isbn 0 262 01211 1 mackay d j c 2003 information theory inference and learning 
algorithms cambridge university press isbn 0 521 64298 1 kecman vojislav 2001 learning and soft computing support vector machines neural networks and fuzzy logic models the mit press cambridge ma 608 pp 268 illus isbn 0 262 11255 8 trevor hastie robert tibshirani and jerome friedman 2001 the elements of statistical learning springer isbn 0 387 95284 5 richard o duda peter e hart david g stork 2001 pattern classification 2nd edition wiley new york isbn 0 471 05669 3 bishop c m 1995 neural networks for pattern recognition oxford university press isbn 0 19 853864 2 ryszard s michalski george tecuci 1994 machine learning a multistrategy approach volume iv morgan kaufmann isbn 1 55860 251 8 sholom weiss and casimir kulikowski 1991 computer systems that learn morgan kaufmann isbn 1 55860 065 5 yves kodratoff ryszard s michalski 1990 machine learning an artificial intelligence approach volume iii morgan kaufmann isbn 1 55860 119 8 ryszard s michalski jaime g carbonell tom m mitchell 1986 machine learning an artificial intelligence approach volume ii morgan kaufmann isbn 0 934613 00 1 ryszard s michalski jaime g carbonell tom m mitchell 1983 machine learning an artificial intelligence approach tioga publishing company isbn 0 935382 05 4 vladimir vapnik 1998 statistical learning theory wiley interscience isbn 0 471 03003 1 ray solomonoff an inductive inference machine ire convention record section on information theory part 2 pp 56 62 1957 ray solomonoff an inductive inference machine a privately circulated report from the 1956 dartmouth summer research conference on ai external links edit international machine learning society popular online course by andrew ng at ml class org it uses gnu octave the course is a free version of stanford university s actual course taught by ng whose lectures are also available for free machine learning video lectures retrieved from http en wikipedia org w index php title machine_learning amp oldid 561115081 categories learning in computer visionmachine learninglearningcyberneticshidden categories all articles with unsourced statementsarticles with unsourced statements from march 2013articles needing additional references from february 2013all articles needing additional references navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages az rbaycanca catal esky deutsch eesti espa ol fran ais bahasa indonesia italiano latvie u lietuvi nederlands norsk bokm l norsk nynorsk polski portugus shqip sloven ina srpski suomi svenska tagalog t rk e ti ng vi t edit links this page was last modified on 22 june 2013 at 21 17 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Mining_Software_Repositories b/ss2013/1_Web 
Mining/Uebungen/5_Uebung/abgabe/articles_text/Mining_Software_Repositories new file mode 100644 index 00000000..c16b9f77 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Mining_Software_Repositories @@ -0,0 +1 @@ +mining software repositories wikipedia the free encyclopedia mining software repositories from wikipedia the free encyclopedia the mining software repositories msr field analyzes the rich data available in software repositories such as version control repositories mailing list archives bug tracking systems issue tracking systems etc to uncover interesting and actionable information about software systems projects and software engineering contents 1 data repositories 1 1 metrics 1 2 defect prediction 1 3 collection of open source code 2 techniques 3 tools 3 1 experimentation tools 3 2 metric extraction tools 3 3 mining tools 4 contradictory findings 5 software metrics 6 see also 7 external links data repositories edit metrics edit floss mole 1 defect prediction edit promise software repository 2 collection of open source code edit merobase 3 techniques edit tools edit experimentation tools edit trace lab metric extraction tools edit columbus 4 pmd 5 mining tools edit weka 6 rapidminer 7 contradictory findings edit software metrics edit see also edit software analytics software maintenance software archaeology external links edit working conference on mining software repositories the main software engineering conference in the area retrieved from http en wikipedia org w index php title mining_software_repositories amp oldid 557371996 categories softwaredata miningsoftware engineering \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Molecule_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Molecule_mining new file mode 100644 index 00000000..00c1e8ca ---
/dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Molecule_mining @@ -0,0 +1 @@ +molecule mining wikipedia the free encyclopedia molecule mining from wikipedia the free encyclopedia jump to navigation search this page describes mining for molecules since molecules may be represented by molecular graphs this is strongly related to graph mining and structured data mining the main problem is how to represent molecules while discriminating the data instances one way to do this is chemical similarity metrics which has a long tradition in the field of cheminformatics typical approaches to calculate chemical similarities use chemical fingerprints but this loses the underlying information about the molecule topology mining the molecular graphs directly avoids this problem so does the inverse qsar problem which is preferable for vectorial mappings contents 1 coding moleculei moleculeji 1 1 kernel methods 1 2 maximum common graph methods 2 coding moleculei 2 1 molecular query methods 2 2 methods based on special architectures of neural networks 3 see also 4 references 4 1 further reading 5 see also 6 external links coding moleculei moleculeji edit kernel methods edit marginalized graph kernel 1 optimal assignment kernel 2 3 4 pharmacophore kernel 5 c and r implementation combining the marginalized graph kernel between labeled graphs extensions of the marginalized kernel tanimoto kernels graph kernels based on tree patterns kernels based on pharmacophores for 3d structure of molecules maximum common graph methods edit mcs hscs 6 highest scoring common substructure hscs ranking strategy for single mcs small molecule subgraph detector smsd 7 is a java based software library for calculating maximum common subgraph mcs between small molecules this will help us to find similarity distance between two molecules mcs is also used for screening drug like compounds by hitting molecules which share common subgraph substructure 8 coding moleculei edit molecular query methods edit warmr 9 10 agm 11 12 polyfarm 13 fsg 14 15 molfea 16 mofa moss 17 18 19 gaston 20 lazar 21 parmol 22 contains mofa ffsm gspan and gaston optimized gspan 23 24 smirep 25 dmax 26 sam aim rhc 27 afgen 28 gred 29 g hash 30 methods based on special architectures of neural networks edit bpz 31 32 chemnet 33 ccs 34 35 molnet 36 graph machines 37 see also edit molecular query language chemical graph theory references edit h kashima k tsuda a inokuchi marginalized kernels between labeled graphs the 20th international conference on machine learning icml2003 2003 pdf h fr hlich j k wegner a zell optimal assignment kernels for attributed molecular graphs the 22nd international conference on machine learning icml 2005 omnipress madison wi usa 2005 225 232 pdf h fr hlich j k wegner a zell kernel functions for attributed molecular graphs a new similarity based approach to adme prediction in classification and regression qsar comb sci 2006 25 317 326 doi 10 1002 qsar 200510135 h fr hlich j k wegner a zell assignment kernels for chemical compounds international joint conference on neural networks 2005 ijcnn 05 2005 913 918 citeseer p mahe l ralaivola v stoven j vert the pharmacophore kernel for virtual screening with support vector machines j chem inf model 2006 46 2003 2014 doi 10 1021 ci060138m j k wegner h fr hlich h mielenz a zell data and graph mining in chemical space for adme and activity data sets qsar comb sci 2006 25 205 220 doi 10 1002 qsar 200510009 s a rahman m bashton g l holliday r schrader and j m thornton small 
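The introduction above contrasts fingerprint-based chemical similarity with mining molecular graphs directly. Here is a minimal sketch of the fingerprint side, using invented toy bit-sets and the Tanimoto coefficient (the quantity behind the Tanimoto kernels listed above); real fingerprints such as MACCS keys work the same way but with hundreds of structural keys.

# Toy binary fingerprints: each molecule is represented by the set of structural keys
# (bit positions) that are switched on. These three fingerprints are invented examples.
fingerprints = {
    "mol_A": {1, 4, 7, 9, 12},
    "mol_B": {1, 4, 7, 11},
    "mol_C": {2, 3, 5, 8},
}

def tanimoto(fp1, fp2):
    """Tanimoto (Jaccard) coefficient: shared bits over total distinct bits."""
    return len(fp1 & fp2) / len(fp1 | fp2)

for name in ("mol_B", "mol_C"):
    print("mol_A vs", name, round(tanimoto(fingerprints["mol_A"], fingerprints[name]), 3))
# mol_A vs mol_B 0.5   (3 shared bits out of 6 distinct)
# mol_A vs mol_C 0.0   (no shared bits)

As the article notes, a score of this kind discards the molecule's topology, which is the motivation for the graph-kernel and maximum-common-subgraph methods enumerated above.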
molecule subgraph detector smsd toolkit journal of cheminformatics 2009 1 12 doi 10 1186 1758 2946 1 12 http www ebi ac uk thornton srv software smsd r d king a srinivasan l dehaspe wamr a data mining tool for chemical data j comput aid mol des 2001 15 173 181 doi 10 1023 a 1008171016861 l dehaspe h toivonen king finding frequent substructures in chemical compounds 4th international conference on knowledge discovery and data mining aaai press 1998 30 36 a inokuchi t washio t okada h motoda applying the apriori based graph mining method to mutagenesis data analysis journal of computer aided chemistry 2001 2 87 92 a inokuchi t washio k nishimura h motoda a fast algorithm for mining frequent connected subgraphs ibm research tokyo research laboratory 2002 a clare r d king data mining the yeast genome in a lazy functional language practical aspects of declarative languages padl2003 2003 m kuramochi g karypis an efficient algorithm for discovering frequent subgraphs ieee transactions on knowledge and data engineering 2004 16 9 1038 1051 m deshpande m kuramochi n wale g karypis frequent substructure based approaches for classifying chemical compounds ieee transactions on knowledge and data engineering 2005 17 8 1036 1050 c helma t cramer s kramer l de raedt data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds j chem inf comput sci 2004 44 1402 1411 doi 10 1021 ci034254q t meinl c borgelt m r berthold discriminative closed fragment mining and perfect extensions in mofa proceedings of the second starting ai researchers symposium stairs 2004 2004 t meinl c borgelt m r berthold m philippsen mining fragments with fuzzy chains in molecular databases second international workshop on mining graphs trees and sequences mgts2004 2004 t meinl m r berthold hybrid fragment mining with mofa and fsg proceedings of the 2004 ieee conference on systems man amp cybernetics smc2004 2004 s nijssen j n kok frequent graph mining and its application to molecular databases proceedings of the 2004 ieee conference on systems man amp cybernetics smc2004 2004 c helma predictive toxicology crc press 2005 m w rlein extension and parallelization of a graph mining algorithm friedrich alexander universit t 2006 pdf k jahn s kramer optimizing gspan for molecular datasets proceedings of the third international workshop on mining graphs trees and sequences mgts 2005 2005 x yan j han gspan graph based substructure pattern mining proceedings of the 2002 ieee international conference on data mining icdm 2002 ieee computer society 2002 721 724 a karwath l d raedt smirep predicting chemical activity from smiles j chem inf model 2006 46 2432 2444 doi 10 1021 ci060159g h ando l dehaspe w luyten e craenenbroeck h vandecasteele l meervelt discovering h bonding rules in crystals with inductive logic programming mol pharm 2006 3 665 674 doi 10 1021 mp060034z p mazzatorta l tran b schilter m grigorov integration of structure activity relationship and artificial intelligence systems to improve in silico prediction of ames test mutagenicity j chem inf model 2006 asap alert doi 10 1021 ci600411v n wale g karypis comparison of descriptor spaces for chemical compound retrieval and classification icdm 2006 678 689 a gago alonso j e medina pagola j a carrasco ochoa and j f mart nez trinidad mining connected subgraph mining reducing the number of candidates in proc of ecml pkdd pp 365 376 2008 xiaohong wang jun huan aaron smalter gerald 
lushington application of kernel functions for accurate similarity search in large chemical databases in bmc bioinformatics vol 11 suppl 3 s8 2010 baskin i i v a palyulin and n s zefirov 1993 a methodology for searching direct correlations between structures and properties of organic compounds by using computational neural networks trans title requires title help doklady akademii nauk sssr 333 2 176 179 160 i i baskin v a palyulin n s zefirov 1997 a neural device for searching direct correlations between structures and properties of organic compounds j chem inf comput sci 37 4 715 721 doi 10 1021 ci940128y 160 d b kireev 1995 chemnet a novel neural network based method for graph property mapping j chem inf comput sci 35 2 175 180 doi 10 1021 ci00024a001 160 a m bianucci micheli alessio sperduti alessandro starita antonina 2000 application of cascade correlation networks for structures to chemistry applied intelligence 12 1 2 117 146 doi 10 1023 a 1008368105614 160 a micheli a sperduti a starita a m bianucci 2001 analysis of the internal representations developed by neural networks for structures applied to quantitative structure activity relationship studies of benzodiazepines j chem inf comput sci 41 1 202 218 doi 10 1021 ci9903399 pmid 160 11206375 160 o ivanciuc 2001 molecular structure encoding into artificial neural networks topology roumanian chemical quarterly reviews 8 197 220 160 a goulon t picot a duprat g dreyfus 2007 predicting activities without computing descriptors graph machines for qsar sar and qsar in environmental research 18 1 2 141 153 doi 10 1080 10629360601054313 pmid 160 17365965 160 further reading edit sch lkopf b k tsuda and j p vert kernel methods in computational biology mit press cambridge ma 2004 r o duda p e hart d g stork pattern classification john wiley amp sons 2001 isbn 0 471 05669 3 gusfield d algorithms on strings trees and sequences computer science and computational biology cambridge university press 1997 isbn 0 521 58519 8 r todeschini v consonni handbook of molecular descriptors wiley vch 2000 isbn 3 527 29913 0 see also edit qsar adme partition coefficient external links edit small molecule subgraph detector smsd is a java based software library for calculating maximum common subgraph mcs between small molecules 5th international workshop on mining and learning with graphs 2007 overview for 2006 molecule mining basic chemical expert systems parmol and master thesis documentation java open source distributed mining benchmark algorithm library tu m nchen kramer group molecule mining advanced chemical expert systems dmax chemistry assistant commercial software afgen software for generating fragment based descriptors retrieved from http en wikipedia org w index php title molecule_mining amp oldid 558440096 categories cheminformaticscomputational chemistrydata mininghidden categories pages with citations using translated terms without the original navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 5 june 2013 at 13 02 text is available under the creative commons attribution 
sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Multifactor_dimensionality_reduction b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Multifactor_dimensionality_reduction new file mode 100644 index 00000000..c58521d3 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Multifactor_dimensionality_reduction @@ -0,0 +1 @@ +multifactor dimensionality reduction wikipedia the free encyclopedia multifactor dimensionality reduction from wikipedia the free encyclopedia jump to navigation search multifactor dimensionality reduction mdr is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable mdr was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression the basis of the mdr method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute this process of constructing a new attribute changes the representation space of the data the end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data contents 1 illustrative example 2 data mining with mdr 3 applications 4 software 5 see also 6 references 7 further reading illustrative example edit consider the following simple example using the exclusive or xor function xor is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable the table below represents a simple dataset where the relationship between the attributes x1 and x2 and the class variable y is defined by the xor function such that y x1 xor x2 table 1 x1 x2 y 0 0 0 0 1 1 1 0 1 1 1 0 a data mining algorithm would need to discover or approximate the xor function in order to accurately predict y using information about x1 and x2 an alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling the mdr algorithm would change the representation of the data x1 and x2 in the following manner mdr starts by selecting two attributes in this simple example x1 and x2 are selected each combination of values for x1 and x2 are examined and the number of times y 1 and or y 0 is counted in this simple example y 1 occurs zero times and y 0 occurs once for the combination of x1 0 and x2 0 with mdr the ratio of these counts is computed and compared to a fixed threshold here the ratio of counts is 0 1 which is less than our fixed threshold of 1 since 0 1 lt 1 we encode a new attribute z as a 0 when the ratio is greater than one we encode z as a 1 this process is repeated for all unique combinations of values for x1 and x2 table 2 illustrates our new transformation of the data table 2 z y 0 0 1 1 1 1 0 0 the data mining algorithm now has much less work to do to find a good predictive function in fact 
in this very simple example the function y z has a classification accuracy of 1 a nice feature of constructive induction methods such as mdr is the ability to use any data mining or machine learning method to analyze the new representation of the data decision trees neural networks or a naive bayes classifier could be used data mining with mdr edit as illustrated above the basic constructive induction algorithm in mdr is very simple however its implementation for mining patterns from real data can be computationally complex as with any data mining algorithm there is always concern about overfitting that is data mining algorithms are good at finding patterns in completely random data it is often difficult to determine whether a reported pattern is an important signal or just chance one approach is to estimate the generalizability of a model to independent datasets using methods such as cross validation models that describe random data typically don t generalize another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit permutation testing makes it possible to generate an empirical p value for the result these approaches have all been shown to be useful for choosing and evaluating mdr models applications edit mdr has mostly been applied citation needed to detecting gene gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation autism bladder cancer breast cancer cardiovascular disease hypertension prostate cancer schizophrenia and type ii diabetes however it can be applied to other domains such as economics engineering meteorology etc where interactions among discrete attributes might be important for predicting a binary outcome citation needed software edit www epistasis org provides an open source and freely available mdr software package see also edit dimensionality reduction machine learning multilinear subspace learning data mining this article includes a list of references related reading or external links but its sources remain unclear because it lacks inline citations please improve this article by introducing more precise citations november 2010 references edit ritchie md hahn lw roodi n bailey lr dupont wd parl ff moore jh multifactor dimensionality reduction reveals high order interactions among estrogen metabolism genes in sporadic breast cancer am j hum genet 2001 jul 69 1 138 47 pubmed moore jh williams sm new strategies for identifying gene gene interactions in hypertension ann med 2002 34 2 88 95 pubmed ritchie md hahn lw moore jh power of multifactor dimensionality reduction for detecting gene gene interactions in the presence of genotyping error missing data phenocopy and genetic heterogeneity genet epidemiol 2003 feb 24 2 150 7 pubmed hahn lw ritchie md moore jh multifactor dimensionality reduction software for detecting gene gene and gene environment interactions bioinformatics 2003 feb 12 19 3 376 82 pubmed moore jh the ubiquitous nature of epistasis in determining susceptibility to common human diseases hum hered 2003 56 1 3 73 82 pubmed cho ym ritchie md moore jh park jy lee ku shin hd lee hk park ks multifactor dimensionality reduction shows a two locus interaction associated with type 2 diabetes mellitus diabetologia 2004 mar 47 3 549 54 pubmed tsai ct lai lp lin jl chiang ft hwang jj ritchie md moore jh hsu kl tseng cd liau cs tseng yz renin angiotensin system gene polymorphisms and atrial fibrillation circulation 2004 apr 6 109 13 1640 
6 pubmed hahn lw moore jh ideal discrimination of discrete clinical endpoints using multilocus genotypes in silico biol 2004 4 2 183 94 pubmed coffey cs hebert pr ritchie md krumholz hm gaziano jm ridker pm brown nj vaughan de moore jh an application of conditional logistic regression and multifactor dimensionality reduction for detecting gene gene interactions on risk of myocardial infarction the importance of model validation bmc bioinformatics 2004 apr 30 5 49 pubmed moore jh computational analysis of gene gene interactions using multifactor dimensionality reduction expert rev mol diagn 2004 nov 4 6 795 803 pubmed williams sm ritchie md phillips ja 3rd dawson e prince m dzhura e willis a semenya a summar m white bc addy jh kpodonu j wong lj felder ra jose pa moore jh multilocus analysis of hypertension a hierarchical approach hum hered 2004 57 1 28 38 pubmed bastone l reilly m rader dj foulkes as mdr and prp a comparison of methods for high order genotype phenotype associations hum hered 2004 58 2 82 92 pubmed ma dq whitehead pl menold mm martin er ashley koch ae mei h ritchie md delong gr abramson rk wright hh cuccaro ml hussman jp gilbert jr pericak vance ma identification of significant association and gene gene interaction of gaba receptor subunit genes in autism am j hum genet 2005 sep 77 3 377 88 pubmed soares ml coelho t sousa a batalov s conceicao i sales luis ml ritchie md williams sm nievergelt cm schork nj saraiva mj buxbaum jn susceptibility and modifier genes in portuguese transthyretin v30m amyloid polyneuropathy complexity in a single gene disease hum mol genet 2005 feb 15 14 4 543 53 pubmed qin s zhao x pan y liu j feng g fu j bao j zhang z he l an association study of the n methyl d aspartate receptor nr1 subunit gene grin1 and nr2b subunit gene grin2b in schizophrenia with universal dna microarray eur j hum genet 2005 jul 13 7 807 14 pubmed wilke ra moore jh burmester jk relative impact of cyp3a genotype and concomitant medication on the severity of atorvastatin induced muscle damage pharmacogenet genomics 2005 jun 15 6 415 21 pubmed xu j lowey j wiklund f sun j lindmark f hsu fc dimitrov l chang b turner ar liu w adami ho suh e moore jh zheng sl isaacs wb trent jm gronberg h the interaction of four genes in the inflammation pathway significantly predicts prostate cancer risk cancer epidemiol biomarkers prev 2005 nov 14 11 pt 1 2563 8 pubmed wilke ra reif dm moore jh combinatorial pharmacogenetics nat rev drug discov 2005 nov 4 11 911 8 pubmed ritchie md motsinger aa multifactor dimensionality reduction for detecting gene gene and gene environment interactions in pharmacogenomics studies pharmacogenomics 2005 dec 6 8 823 34 pubmed andrew as nelson hh kelsey kt moore jh meng ac casella dp tosteson td schned ar karagas mr concordance of multiple analytical approaches demonstrates a complex relationship between dna repair gene snps smoking and bladder cancer susceptibility carcinogenesis 2006 may 27 5 1030 7 pubmed moore jh gilbert jc tsai ct chiang ft holden t barney n white bc a flexible computational framework for detecting characterizing and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility j theor biol 2006 jul 21 241 2 252 61 pubmed further reading edit r s michalski pattern recognition as knowledge guided computer induction department of computer science reports no 927 university of illinois urbana june 1978 retrieved from http en wikipedia org w index php title multifactor_dimensionality_reduction amp oldid 539585541 
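The MDR text above describes a concrete constructive-induction step: for each combination of the selected attributes, count how often y = 1 versus y = 0, compare that ratio to a fixed threshold of 1, and encode a pooled attribute z as 1 or 0. The sketch below reimplements just that pooling step on the XOR table from the article. It is a minimal illustration in plain Python, not the MDR package from epistasis.org; the function name, the handling of an empty y = 0 cell, and the data layout are assumptions made for this example.

```python
from collections import Counter

def mdr_encode(rows, threshold=1.0):
    """Constructive induction as described in the MDR article: pool each
    (x1, x2) combination into a single binary attribute z by comparing
    the ratio of y=1 to y=0 counts against a fixed threshold."""
    counts = {}
    for x1, x2, y in rows:
        counts.setdefault((x1, x2), Counter())[y] += 1
    encoding = {}
    for combo, c in counts.items():
        ones, zeros = c[1], c[0]
        # convention for this sketch: an empty y=0 cell counts as "ratio above threshold"
        ratio = ones / zeros if zeros else float("inf")
        encoding[combo] = 1 if ratio > threshold else 0
    return encoding

# Table 1 from the article: y = x1 XOR x2
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(mdr_encode(rows))  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}, matching Table 2
```

With the pooled attribute in hand, any downstream learner (a decision tree, naive Bayes, and so on, as the article notes) only has to model the now-trivial relation y = z.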
categories data miningdimension reductionclassification algorithmshidden categories all articles with unsourced statementsarticles with unsourced statements from december 2010articles lacking in text citations from november 2010all articles lacking in text citationsuse dmy dates from december 2010 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 22 february 2013 at 02 52 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nearest_neighbor_search b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nearest_neighbor_search new file mode 100644 index 00000000..5429f9bc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nearest_neighbor_search @@ -0,0 +1 @@ +nearest neighbor search wikipedia the free encyclopedia nearest neighbor search from wikipedia the free encyclopedia jump to navigation search nearest neighbor search nns also known as proximity search similarity search or closest point search is an optimization problem for finding closest points in metric spaces the problem is given a set s of points in a metric space m and a query point q 160 160 m find the closest point in s to q in many cases m is taken to be d dimensional euclidean space and distance is measured by euclidean distance manhattan distance or other distance metric donald knuth in vol 3 of the art of computer programming 1973 called it the post office problem referring to an application of assigning to a residence the nearest post office contents 1 applications 2 methods 2 1 linear search 2 2 space partitioning 2 3 locality sensitive hashing 2 4 nearest neighbor search in spaces with small intrinsic dimension 2 5 vector approximation files 2 6 compression clustering based search 3 variants 3 1 k nearest neighbor 3 2 approximate nearest neighbor 3 3 nearest neighbor distance ratio 3 4 fixed radius near neighbors 3 5 all nearest neighbors 4 see also 5 notes 6 references 7 further reading 8 external links applications edit the nearest neighbor search problem arises in numerous fields of application including pattern recognition in particular for optical character recognition statistical classification see k nearest neighbor algorithm computer vision computational geometry see closest pair of points problem databases e g content based image retrieval coding theory see maximum likelihood decoding data compression see mpeg 2 standard recommendation systems e g see collaborative filtering internet marketing see contextual advertising and behavioral targeting dna sequencing spell checking suggesting correct spelling plagiarism detection contact searching algorithms in fea similarity scores for predicting career paths of professional 
athletes cluster analysis assignment of a set of observations into subsets called clusters so that observations in the same cluster are similar in some sense usually based on euclidean distance chemical similarity methods edit various solutions to the nns problem have been proposed the quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained the informal observation usually referred to as the curse of dimensionality states that there is no general purpose exact solution for nns in high dimensional euclidean space using polynomial preprocessing and polylogarithmic search time linear search edit the simplest solution to the nns problem is to compute the distance from the query point to every other point in the database keeping track of the best so far this algorithm sometimes referred to as the naive approach has a running time of o nd where n is the cardinality of s and d is the dimensionality of m there are no search data structures to maintain so linear search has no space complexity beyond the storage of the database naive search can on average outperform space partitioning approaches on higher dimensional spaces 1 space partitioning edit since the 1970s branch and bound methodology has been applied to the problem in the case of euclidean space this approach is known as spatial index or spatial access methods several space partitioning methods have been developed for solving the nns problem perhaps the simplest is the k d tree which iteratively bisects the search space into two regions containing half of the points of the parent region queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split depending on the distance specified in the query neighboring branches that might contain hits may also need to be evaluated for constant dimension query time average complexity is o log 160 n 2 in the case of randomly distributed points worst case complexity analyses have been performed 3 alternatively the r tree data structure was designed to support nearest neighbor search in dynamic context as it has efficient algorithms for insertions and deletions in case of general metric space branch and bound approach is known under the name of metric trees particular examples include vp tree and bk tree using a set of points taken from a 3 dimensional space and put into a bsp tree and given a query point taken from the same space a possible solution to the problem of finding the nearest point cloud point to the query point is given in the following description of an algorithm strictly speaking no such point may exist because it may not be unique but in practice usually we only care about finding any one of the subset of all point cloud points that exist at the shortest distance to a given query point the idea is for each branching of the tree guess that the closest point in the cloud resides in the half space containing the query point this may not be the case but it is a good heuristic after having recursively gone through all the trouble of solving the problem for the guessed half space now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane this latter distance is that between the query point and the closest possible point that could exist in the half space not searched if this distance is greater than that returned in the earlier result then clearly there is no 
need to search the other half space if there is such a need then you must go through the trouble of solving the problem for the other half space and then compare its result to the former result and then return the proper result the performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud because as the distance between the query point and the closest point cloud point nears zero the algorithm needs only perform a look up using the query point as a key to get the correct result locality sensitive hashing edit locality sensitive hashing lsh is a technique for grouping points in space into buckets based on some distance metric operating on the points points that are close to each other under the chosen metric are mapped to the same bucket with high probability 4 nearest neighbor search in spaces with small intrinsic dimension edit the cover tree has a theoretical bound that is based on the dataset s doubling constant the bound on search time is o c12 160 log 160 n where c is the expansion constant of the dataset vector approximation files edit in high dimensional spaces tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway to speed up linear search a compressed version of the feature vectors stored in ram is used to prefilter the datasets in a first run the final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation 5 compression clustering based search edit the va file approach is a special case of a compression based search where each feature component is compressed uniformly and independently the optimal compression technique in multidimensional spaces is vector quantization vq implemented through clustering the database is clustered and the most promising clusters are retrieved huge gains over va file tree based indexes and sequential scan have been observed 6 7 also note the parallels between clustering and lsh variants edit there are numerous variants of the nns problem and the two most well known are the k nearest neighbor search and the approximate nearest neighbor search k nearest neighbor edit main article k nearest neighbor algorithm k nearest neighbor search identifies the top k nearest neighbors to the query this technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors k nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors approximate nearest neighbor edit in some applications it may be acceptable to retrieve a good guess of the nearest neighbor in those cases we can use an algorithm which doesn t guarantee to return the actual nearest neighbor in every case in return for improved speed or memory savings often such an algorithm will find the nearest neighbor in a majority of cases but this depends strongly on the dataset being queried algorithms that support the approximate nearest neighbor search include locality sensitive hashing best bin first and balanced box decomposition tree based search 8 approximate nearest neighbor search is becoming an increasingly popular tool for fighting the curse of dimensionality citation needed nearest neighbor distance ratio edit nearest neighbor distance ratio do not apply the threshold on the direct distance from the original point to the challenger neighbor but on a ratio of it depending on the distance to the previous neighbor it is used in cbir to retrieve 
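The space-partitioning passage above describes the standard k-d tree search: descend into the half-space that contains the query, and afterwards visit the other half-space only if the distance to the splitting plane is smaller than the best distance found so far. A minimal sketch of that idea, together with the naive linear scan it is compared against, is given below; it assumes points are numeric tuples, the helper names are invented, and it makes no attempt to handle the high-dimensional regime in which the article says linear search can win.

```python
import math

def linear_search(points, q):
    """Naive O(n*d) scan from the 'linear search' subsection."""
    return min(points, key=lambda p: math.dist(p, q))

def build_kdtree(points, depth=0):
    """Bisect the point set on one coordinate per level of the tree."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nn_search(node, q, best=None):
    """Branch-and-bound descent: search the near half-space first, then the
    far one only if the splitting plane is closer than the current best."""
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or math.dist(p, q) < math.dist(best, q):
        best = p
    near, far = (node["left"], node["right"]) if q[axis] < p[axis] else (node["right"], node["left"])
    best = nn_search(near, q, best)
    if abs(q[axis] - p[axis]) < math.dist(best, q):  # the far side could still hold a closer point
        best = nn_search(far, q, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
q = (9, 2)
assert nn_search(build_kdtree(pts), q) == linear_search(pts, q)  # both return (8, 1)
```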
pictures through a query by example using the similarity between local features more generally it is involved in several matching problems fixed radius near neighbors edit fixed radius near neighbors is the problem where one wants to efficiently find all points given in euclidean space within a given fixed distance from a specified point the data structure should work on a distance which is fixed however the query point is arbitrary all nearest neighbors edit for some applications e g entropy estimation we may have n data points and wish to know which is the nearest neighbor for every one of those n points this could of course be achieved by running a nearest neighbor search once for every point but an improved strategy would be an algorithm that exploits the information redundancy between these n queries to produce a more efficient search as a simple example when we find the distance from point x to point y that also tells us the distance from point y to point x so the same calculation can be reused in two different queries given a fixed dimension a semi definite positive norm thereby including every lp norm and n points in this space the nearest neighbour of every point can be found in o n 160 log 160 n time and the m nearest neighbours of every point can be found in o mn 160 log 160 n time 9 10 see also edit set cover problem statistical distance closest pair of points problem ball tree cluster analysis neighbor joining content based image retrieval curse of dimensionality digital signal processing dimension reduction fixed radius near neighbors fourier analysis instance based learning k nearest neighbor algorithm linear least squares locality sensitive hashing multidimensional analysis nearest neighbor interpolation principal component analysis singular value decomposition time series voronoi diagram wavelet minhash notes edit weber schek blott a quantitative analysis and performance study for similarity search methods in high dimensional spaces 160 andrew moore an introductory tutorial on kd trees 160 lee d t wong c k 1977 worst case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees acta informatica 9 1 23 29 doi 10 1007 bf00263763 160 a rajaraman and j ullman 2010 mining of massive datasets ch 3 160 weber blott an approximation based data structure for similarity search 160 missing or empty url help ramaswamy rose icip 2007 adaptive cluster distance bounding for similarity search in image databases 160 missing or empty url help ramaswamy rose tkde 2001 adaptive cluster distance bounding for high dimensional indexing 160 missing or empty url help s arya d m mount n s netanyahu r silverman and a wu an optimal algorithm for approximate nearest neighbor searching journal of the acm 45 6 891 923 1998 1 clarkson kenneth l 1983 fast algorithms for the all nearest neighbors problem 24th ieee symp foundations of computer science focs 83 pp 160 226 232 doi 10 1109 sfcs 1983 16 160 vaidya p m 1989 an o n 160 log 160 n algorithm for the all nearest neighbors problem discrete and computational geometry 4 1 101 115 doi 10 1007 bf02187718 160 references edit andrews l a template for the nearest neighbor problem c c users journal vol 19 no 11 november 2001 40 49 2001 issn 1075 2838 www ddj com architect 184401449 arya s d m mount n s netanyahu r silverman and a y wu an optimal algorithm for approximate nearest neighbor searching in fixed dimensions journal of the acm vol 45 no 6 pp 160 891 923 beyer k goldstein j ramakrishnan r and shaft 
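Among the variants listed above, k-nearest-neighbor search returns the top k neighbors and, in predictive use, classifies a query by the consensus of their labels. A small brute-force sketch of that voting scheme follows; the data, labels, and helper name are made up for illustration, and a real system would plug in one of the index structures discussed earlier instead of scanning every point.

```python
import math
from collections import Counter
from heapq import nsmallest

def knn_classify(labeled_points, q, k=3):
    """Find the k labeled points closest to q (brute force) and
    return the majority label among them."""
    neighbors = nsmallest(k, labeled_points, key=lambda item: math.dist(item[0], q))
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

data = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
        ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
print(knn_classify(data, (1.1, 1.0), k=3))  # 'a'
```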
u 1999 when is nearest neighbor meaningful in proceedings of the 7th icdt jerusalem israel chung min chen and yibei ling a sampling based estimator for top k query icde 2002 617 627 samet h 2006 foundations of multidimensional and metric data structures morgan kaufmann isbn 0 12 369446 9 zezula p amato g dohnal v and batko m similarity search the metric space approach springer 2006 isbn 0 387 29146 6 further reading edit shasha dennis 2004 high performance discovery in time series berlin springer isbn 160 0 387 00857 8 160 external links edit nearest neighbors and similarity search a website dedicated to educational materials software literature researchers open problems and events related to nn searching maintained by yury lifshits similarity search wiki a collection of links people ideas keywords papers slides code and data sets on nearest neighbours metric spaces library an open source c based library for metric space indexing by karina figueroa gonzalo navarro edgar ch vez ann a library for approximate nearest neighbor searching by david m mount and sunil arya flann fast approximate nearest neighbor search library by marius muja and david g lowe product quantization matlab implementation of approximate nearest neighbor search in the compressed domain by herve jegou messif metric similarity search implementation framework by michal batko and david novak obsearch similarity search engine for java gpl implementation by arnoldo muller developed during google summer of code 2007 knnlsb k nearest neighbors linear scan baseline distributed lgpl implementation by georges qu not lig cnrs neartree an api for finding nearest neighbors among points in spaces of arbitrary dimensions by lawrence c andrews and herbert j bernstein nearpy python framework for fast approximated nearest neighbor search by ole krause sparmann retrieved from http en wikipedia org w index php title nearest_neighbor_search amp oldid 558338819 categories approximation algorithmsclassification algorithmsdata miningdiscrete geometrygeometric algorithmsinformation retrievalmachine learningnumerical analysismathematical optimizationsearchingsearch algorithmshidden categories pages using web citations with no urlall articles with unsourced statementsarticles with unsourced statements from march 2011 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais magyar srpski edit links this page was last modified on 4 june 2013 at 20 03 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Neural_networks b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Neural_networks new file mode 100644 index 00000000..c1f194e5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Neural_networks @@ 
-0,0 +1 @@ +neural network wikipedia the free encyclopedia neural network from wikipedia the free encyclopedia redirected from neural networks jump to navigation search for other uses see neural network disambiguation this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations october 2010 simplified view of a feedforward artificial neural network the term neural network was traditionally used to refer to a network or circuit of biological neurons 1 the modern usage of the term often refers to artificial neural networks which are composed of artificial neurons or nodes thus the term may refer to either biological neural networks made up of real biological neurons or artificial neural networks for solving artificial intelligence problems unlike von neumann model computations artificial neural networks do not separate memory and processing and operate via the flow of signals through the net connections somewhat akin to biological networks these artificial networks may be used for predictive modeling adaptive control and applications where they can be trained via a dataset contents 1 overview 2 history 3 neural networks and artificial intelligence 4 neural networks and neuroscience 4 1 types of models 5 criticism 6 recent improvements 7 see also 8 references 9 external links overview edit a biological neural network is composed of a group or groups of chemically connected or functionally associated neurons a single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive connections called synapses are usually formed from axons to dendrites though dendrodendritic microcircuits 2 and other connections are possible apart from the electrical signaling there are other forms of signaling that arise from neurotransmitter diffusion artificial intelligence cognitive modelling and neural networks are information processing paradigms inspired by the way biological neural systems process data artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks in the artificial intelligence field artificial neural networks have been applied successfully to speech recognition image analysis and adaptive control in order to construct software agents in computer and video games or autonomous robots historically digital computers evolved from the von neumann model and operate via the execution of explicit instructions via access to memory by a number of processors on the other hand the origins of neural networks are based on efforts to model information processing in biological systems unlike the von neumann model neural network computing does not separate memory and processing neural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence history edit the preliminary theoretical base for contemporary neural networks was independently proposed by alexander bain 3 1873 and william james 4 1890 in their work both thoughts and body activity resulted from interactions among neurons within the brain computer simulation of the branching architecture of the dendrites of pyramidal neurons 5 for bain 3 every activity led to the firing of a certain set of neurons when activities were repeated the connections between those neurons strengthened according to his theory this 
repetition was what led to the formation of memory the general scientific community at the time was skeptical of bain s 3 theory because it required what appeared to be an inordinate number of neural connections within the brain it is now apparent that the brain is exceedingly complex and that the same brain wiring can handle multiple problems and inputs james s 4 theory was similar to bain s 3 however he suggested that memories and actions resulted from electrical currents flowing among the neurons in the brain his model by focusing on the flow of electrical currents did not require individual neural connections for each memory or action c s sherrington 6 1898 conducted experiments to test james s theory he ran electrical currents down the spinal cords of rats however instead of demonstrating an increase in electrical current as projected by james sherrington found that the electrical current strength decreased as the testing continued over time importantly this work led to the discovery of the concept of habituation mcculloch and pitts 7 1943 created a computational model for neural networks based on mathematics and algorithms they called this model threshold logic the model paved the way for neural network research to split into two distinct approaches one approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence in the late 1940s psychologist donald hebb 8 created a hypothesis of learning based on the mechanism of neural plasticity that is now known as hebbian learning hebbian learning is considered to be a typical unsupervised learning rule and its later variants were early models for long term potentiation these ideas started being applied to computational models in 1948 with turing s b type machines farley and clark 9 1954 first used computational machines then called calculators to simulate a hebbian network at mit other neural network computational machines were created by rochester holland habit and duda 10 1956 rosenblatt 11 1958 created the perceptron an algorithm for pattern recognition based on a two layer learning computer network using simple addition and subtraction with mathematical notation rosenblatt also described circuitry not in the basic perceptron such as the exclusive or circuit a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by werbos 12 1975 neural network research stagnated after the publication of machine learning research by minsky and papert 13 1969 they discovered two key issues with the computational machines that processed neural networks the first issue was that single layer neural networks were incapable of processing the exclusive or circuit the second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks neural network research slowed until computers achieved greater processing power also key in later advances was the backpropagation algorithm which effectively solved the exclusive or problem werbos 1975 12 the parallel distributed processing of the mid 1980s became popular under the name connectionism the text by rumelhart and mcclelland 14 1986 provided a full exposition on the use of connectionism in computers to simulate neural processes neural networks as used in artificial intelligence have traditionally been viewed as simplified models of neural processing in the brain even though the relation between this 
model and brain biological architecture is debated as it is not clear to what degree artificial neural networks mirror brain function 15 neural networks and artificial intelligence edit main article artificial neural network a neural network nn in the case of artificial neurons called artificial neural network ann or simulated neural network snn is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation in most cases an ann is an adaptive system that changes its structure based on external or internal information that flows through the network in more practical terms neural networks are non linear statistical data modeling or decision making tools they can be used to model complex relationships between inputs and outputs or to find patterns in data however the paradigm of neural networks i e implicit not explicit 160 learning is stressed seems more to correspond to some kind of natural intelligence than to the traditional symbol based artificial intelligence which would stress instead rule based learning an artificial neural network involves a network of simple processing elements artificial neurons which can exhibit complex global behavior determined by the connections between the processing elements and element parameters artificial neurons were first proposed in 1943 by warren mcculloch a neurophysiologist and walter pitts a logician who first collaborated at the university of chicago 16 one classical type of artificial neural network is the recurrent hopfield net in a neural network model simple nodes which can be called by a number of names including neurons neurodes processing elements pe and units are connected together to form a network of nodes hence the term neural network while a neural network does not have to be adaptive per se its practical use comes with algorithms designed to alter the strength weights of the connections in the network to produce a desired signal flow citation needed the concept of a neural network appears to have first been proposed by alan turing in his 1948 paper intelligent machinery in which called them b type unorganised machines 17 the utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it unsupervised neural networks can also be used to learn representations of the input that capture the salient characteristics of the input distribution e g see the boltzmann machine 1983 and more recently deep learning algorithms which can implicitly learn the distribution function of the observed data learning in neural networks is particularly useful in applications where the complexity of the data or task makes the design of such functions by hand impractical the tasks to which artificial neural networks are applied tend to fall within the following broad categories function approximation or regression analysis including time series prediction and modeling classification including pattern and sequence recognition novelty detection and sequential decision making data processing including filtering clustering blind signal separation and compression application areas of anns include system identification and control vehicle control process control game playing and decision making backgammon chess racing pattern recognition radar systems face identification object recognition sequence recognition gesture speech handwritten text recognition medical diagnosis 
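The history above credits Rosenblatt's perceptron with learning through simple addition and subtraction of weights, and notes Minsky and Papert's point that a single-layer network cannot represent the exclusive-or circuit. The sketch below shows that weight-update rule on the linearly separable AND function; it is a didactic toy with invented names, not any of the systems cited in the article.

```python
import random

def train_perceptron(samples, epochs=20, lr=0.1, seed=0):
    """Rosenblatt-style single layer: thresholded weighted sum, with weights
    nudged up or down (simple addition and subtraction) after each error."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# AND is linearly separable, so this converges; a single layer cannot learn XOR,
# which is the limitation the article attributes to Minsky and Papert.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(w, b, x) for x, _ in and_data])  # [0, 0, 0, 1]
```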
financial applications data mining or knowledge discovery in databases kdd visualization and e mail spam filtering neural networks and neuroscience edit theoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems since neural systems are intimately related to cognitive processes and behaviour the field is closely related to cognitive and behavioural modeling the aim of the field is to create models of biological neural systems in order to understand how biological systems work to gain this understanding neuroscientists strive to make a link between observed biological processes data biologically plausible mechanisms for neural processing and learning biological neural network models and theory statistical learning theory and information theory types of models edit many models are used defined at different levels of abstraction and modeling different aspects of neural systems they range from models of the short term behaviour of individual neurons through models of the dynamics of neural circuitry arising from interactions between individual neurons to models of behaviour arising from abstract neural modules that represent complete subsystems these include models of the long term and short term plasticity of neural systems and its relation to learning and memory from the individual neuron to the system level criticism edit a common criticism of neural networks particularly in robotics is that they require a large diversity of training for real world operation this is not surprising since any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases dean pomerleau in his research presented in the paper knowledge based training of artificial neural networks for autonomous robot driving uses a neural network to train a robotic vehicle to drive on multiple types of roads single lane multi lane dirt etc a large amount of his research is devoted to 1 extrapolating multiple training scenarios from a single training experience and 2 preserving past training diversity so that the system does not become overtrained if for example it is presented with a series of right turns it should not learn to always turn right these issues are common in neural networks that must decide from amongst a wide variety of responses but can be dealt with in several ways for example by randomly shuffling the training examples by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example or by grouping examples in so called mini batches a k dewdney a former scientific american columnist wrote in 1997 although neural nets do solve a few toy problems their powers of computation are so limited that i am surprised anyone takes them seriously as a general problem solving tool dewdney p 160 82 arguments for dewdney s position are that to implement large and effective software neural networks much processing and storage resources need to be committed while the brain has hardware tailored to the task of processing signals through a graph of neurons simulating even a most simplified form on von neumann technology may compel a nn designer to fill many millions of database rows for its connections which can consume vast amounts of computer memory and hard disk space furthermore the designer of nn systems will often need to simulate the transmission of signals through many of these 
connections and their associated neurons which must often be matched with incredible amounts of cpu processing power and time while neural networks often yield effective programs they too often do so at the cost of efficiency they tend to consume considerable amounts of time and money arguments against dewdney s position are that neural nets have been successfully used to solve many complex and diverse tasks ranging from autonomously flying aircraft 2 to detecting credit card fraud citation needed technology writer roger bridgman commented on dewdney s statements about neural nets neural networks for instance are in the dock not only because they have been hyped to high heaven what hasn t but also because you could create a successful net without understanding how it worked the bunch of numbers that captures its behaviour would in all probability be an opaque unreadable table valueless as a scientific resource in spite of his emphatic declaration that science is not technology dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers an unreadable table that a useful machine could read would still be well worth having 18 in response to this kind of criticism one should note that although it is true that analyzing what has been learned by an artificial neural network is difficult it is much easier to do so than to analyze what has been learned by a biological neural network furthermore researchers involved in exploring learning algorithms for neural networks are gradually uncovering generic principles which allow a learning machine to be successful for example bengio and lecun 2007 wrote an article regarding local vs non local learning as well as shallow vs deep architecture 3 some other criticisms came from believers of hybrid models combining neural networks and symbolic approaches they advocate the intermix of these two approaches and believe that hybrid models can better capture the mechanisms of the human mind sun and bookman 1990 recent improvements edit this section does not cite any references or sources please help improve this section by adding citations to reliable sources unsourced material may be challenged and removed june 2010 while initially research had been concerned mostly with the electrical characteristics of neurons a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as dopamine acetylcholine and serotonin on behaviour and learning biophysical models such as bcm theory have been important in understanding mechanisms for synaptic plasticity and have had applications in both computer science and neuroscience research is ongoing in understanding the computational algorithms used in the brain with some recent biological evidence for radial basis networks and neural backpropagation as mechanisms for processing data computational devices have been created in cmos for both biophysical simulation and neuromorphic computing more recent efforts show promise for creating nanodevices 19 for very large scale principal components analyses and convolution if successful these efforts could usher in a new era of neural computing 20 that is a step beyond digital computing because it depends on learning rather than programming and because it is fundamentally analog rather than digital even though the first instantiations may in fact be with cmos digital devices between 2009 and 2012 the recurrent neural networks and deep feedforward neural 
networks developed in the research group of j rgen schmidhuber at the swiss ai lab idsia have won eight international competitions in pattern recognition and machine learning 21 for example multi dimensional long short term memory lstm 22 23 won three competitions in connected handwriting recognition at the 2009 international conference on document analysis and recognition icdar without any prior knowledge about the three different languages to be learned variants of the back propagation algorithm as well as unsupervised methods by geoff hinton and colleagues at the university of toronto 24 25 can be used to train deep highly nonlinear neural architectures similar to the 1980 neocognitron by kunihiko fukushima 26 and the standard architecture of vision 27 inspired by the simple and complex cells identified by david h hubel and torsten wiesel in the primary visual cortex deep learning feedforward networks alternate convolutional layers and max pooling layers topped by several pure classification disambiguation needed layers fast gpu based implementations of this approach have won several pattern recognition contests including the ijcnn 2011 traffic sign recognition competition 28 and the isbi 2012 segmentation of neuronal structures in electron microscopy stacks challenge 29 such neural networks also were the first artificial pattern recognizers to achieve human competitive or even superhuman performance 30 on benchmarks such as traffic sign recognition ijcnn 2012 or the mnist handwritten digits problem of yann lecun and colleagues at nyu see also edit adaline adaptive resonance theory artificial neural network backpropagation biological cybernetics biologically inspired computing cerebellar model articulation controller cognitive architecture cognitive science connectomics cultured neuronal networks digital morphogenesis exclusive or gene expression programming group method of data handling habituation in situ adaptive tabulation memristor multilinear subspace learning neural network software neuroscience parallel constraint satisfaction processes parallel distributed processing predictive analytics radial basis function network recurrent neural networks simulated reality support vector machine tensor product network time delay neural network references edit j j hopfield neural networks and physical systems with emergent collective computational abilities proc natl acad sci usa vol 79 pp 2554 2558 april 1982 biophysics 1 arbib p 666 a b c d bain 1873 mind and body the theories of their relation new york d appleton and company 160 a b james 1890 the principles of psychology new york h holt and company 160 plos computational biology issue image vol 6 8 august 2010 plos computational biology 6 8 ev06 ei08 2010 doi 10 1371 image pcbi v06 i08 160 edit sherrington c s experiments in examination of the peripheral distribution of the fibers of the posterior roots of some spinal nerves proceedings of the royal society of london 190 45 186 160 mcculloch warren walter pitts 1943 a logical calculus of ideas immanent in nervous activity bulletin of mathematical biophysics 5 4 115 133 doi 10 1007 bf02478259 160 hebb donald 1949 the organization of behavior new york wiley 160 farley b w a clark 1954 simulation of self organizing systems by digital computer ire transactions on information theory 4 4 76 84 doi 10 1109 tit 1954 1057468 160 rochester n j h holland l h habit and w l duda 1956 tests on a cell assembly theory of the action of the brain using a large digital computer ire transactions on 
information theory 2 3 80 93 doi 10 1109 tit 1956 1056810 160 rosenblatt f 1958 the perceptron a probalistic model for information storage and organization in the brain psychological review 65 6 386 408 doi 10 1037 h0042519 pmid 160 13602029 160 a b werbos p j 1975 beyond regression new tools for prediction and analysis in the behavioral sciences 160 minsky m s papert 1969 an introduction to computational geometry mit press isbn 160 0 262 63022 2 160 rumelhart d e james mcclelland 1986 parallel distributed processing explorations in the microstructure of cognition cambridge mit press 160 russell ingrid neural networks module retrieved 2012 160 mcculloch warren pitts walter a logical calculus of ideas immanent in nervous activity 1943 bulletin of mathematical biophysics 5 115 133 the essential turing by alan m turing and b jack copeland nov 18 2004 isbn 0198250800 page 403 roger bridgman s defence of neural networks yang j j pickett m d li x m ohlberg d a a stewart d r williams r s nat nanotechnol 2008 3 429 433 strukov d b snider g s stewart d r williams r s nature 2008 453 80 83 http www kurzweilai net how bio inspired deep learning keeps winning competitions 2012 kurzweil ai interview with j rgen schmidhuber on the eight competitions won by his deep learning team 2009 2012 graves alex and schmidhuber j rgen offline handwriting recognition with multidimensional recurrent neural networks in bengio yoshua schuurmans dale lafferty john williams chris k i and culotta aron eds advances in neural information processing systems 22 nips 22 december 7th 10th 2009 vancouver bc neural information processing systems nips foundation 2009 pp 545 552 a graves m liwicki s fernandez r bertolami h bunke j schmidhuber a novel connectionist system for improved unconstrained handwriting recognition ieee transactions on pattern analysis and machine intelligence vol 31 no 5 2009 http www scholarpedia org article deep_belief_networks hinton g e osindero s teh y 2006 a fast learning algorithm for deep belief nets neural computation 18 7 1527 1554 doi 10 1162 neco 2006 18 7 1527 pmid 160 16764513 160 k fukushima neocognitron a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position biological cybernetics 36 4 93 202 1980 m riesenhuber t poggio hierarchical models of object recognition in cortex nature neuroscience 1999 d c ciresan u meier j masci j schmidhuber multi column deep neural network for traffic sign classification neural networks 2012 d ciresan a giusti l gambardella j schmidhuber deep neural networks segment neuronal membranes in electron microscopy images in advances in neural information processing systems nips 2012 lake tahoe 2012 d c ciresan u meier j schmidhuber multi column deep neural networks for image classification ieee conf on computer vision and pattern recognition cvpr 2012 external links edit listen to this article info dl sorry your browser either has javascript disabled or does not have any supported player you can download the clip or download a player to play the clip in your browser this audio file was created from a revision of the neural network article dated 2011 11 27 and does not reflect subsequent edits to the article audio help more spoken articles a brief introduction to neural networks d kriesel illustrated bilingual manuscript about artificial neural networks topics so far perceptrons backpropagation radial basis functions recurrent neural networks self organizing maps hopfield networks review of neural networks in 
materials science artificial neural networks tutorial in three languages univ polit cnica de madrid another introduction to ann next generation of neural networks google tech talks performance of neural networks neural networks and information retrieved from http en wikipedia org w index php title neural_network amp oldid 561341512 categories computational neuroscienceneural networksnetwork architecturenetworkseconometricsinformation knowledge and uncertaintyhidden categories articles lacking in text citations from october 2010all articles lacking in text citationsall articles with unsourced statementsarticles with unsourced statements from march 2013articles with unsourced statements from august 2012articles needing additional references from june 2010all articles needing additional referencesarticles with links needing disambiguation from april 2013spoken articlesarticles with haudio microformats navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages catal esky dansk deutsch espa ol esperanto fran ais hrvatski italiano magyar nederlands polski portugus rom n simple english sloven ina sloven ina srpski suomi svenska t rk e ti ng vi t edit links this page was last modified on 24 june 2013 at 10 50 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nothing_to_hide_argument b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nothing_to_hide_argument new file mode 100644 index 00000000..2b542421 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Nothing_to_hide_argument @@ -0,0 +1 @@ +nothing to hide argument wikipedia the free encyclopedia nothing to hide argument from wikipedia the free encyclopedia jump to navigation search the nothing to hide argument is an argument which states that government data mining and surveillance programs do not threaten privacy unless they uncover illegal activities and that if they do uncover illegal activities the person committing these activities does not have the right to keep them private hence a person who favors this argument may state i ve got nothing to hide and therefore do not express opposition to government data mining and surveillance 1 an individual using this argument may say that a person should not have worries about government data mining or surveillance if he she has nothing to hide 2 this argument is commonly used in discussions regarding privacy geoffrey stone a legal scholar said that the use of the argument is all too common 3 bruce schneier a data security expert and cryptographer described it as the most common retort against privacy advocates 3 the motto if you ve got nothing to hide you ve got nothing to fear has been used in the closed circuit television program practiced in cities in 
the united kingdom 3 contents 1 ethnography 2 effect on privacy protection 3 arguments in favor and against the nothing to hide argument 4 see also 5 references 6 notes 7 further reading ethnography edit an ethnographic study by ana viseu andrew clement and jane aspinal of the integration of online services into everyday life was published as situating privacy online complex perceptions and everyday practices in the information communication amp society journal in 2004 it found that in the words of kirsty best author of living in the control society surveillance users and digital screen technologies fully employed middle to middle upper income earners articulated similar beliefs about not being targeted for surveillance compared to other respondents who did not show concern and that in these cases respondents expressed the view that they were not doing anything wrong or that they had nothing to hide 4 of the participant sample in viseu s study one reported using privacy enhancing technology 5 and viseu et al said one of the clearest features of our subjects privacy perceptions and practices was their passivity towards the issue 6 viseu et al said the passivity originated from the nothing to hide argument 7 effect on privacy protection edit viseu et al said that the argument has been well documented in the privacy literature as a stumbling block to the development of pragmatic privacy protection strategies and it too is related to the ambiguous and symbolic nature of the term privacy itself 7 they explained that privacy is an abstract concept and people only become concerned with it once their privacy is gone and they compare a loss to privacy with people knowing that ozone depletion and global warming are negative developments but that the immediate gains of driving the car to work or putting on hairspray outweigh the often invisible losses of polluting the environment 7 arguments in favor and against the nothing to hide argument edit this section requires expansion june 2013 daniel j solove stated in an article for the the chronicle of higher education that he opposes the argument he stated that a government can leak information about a person and cause damage to that person or use information about a person to deny access to services even if a person did not actually engage in wrongdoing and that a government can cause damage to one s personal life through making errors 3 snolove wrote when engaged directly the nothing to hide argument can ensnare for it forces the debate to focus on its narrow understanding of privacy but when confronted with the plurality of privacy problems implicated by government data collection and use beyond surveillance and disclosure the nothing to hide argument in the end has nothing to say 3 danah boyd a social media researcher opposes the argument she said that even though p eople often feel immune from state surveillance because they ve done nothing wrong an entity or group can distort a person s image and harm one s reputation or guilt by association can be used to defame a person 8 bruce schneier a computer security expert and cryptographer expressed opposition citing cardinal richelieu s statement if one would give me six lines written by the hand of the most honest man i would find something in them to have him hanged referring to how a state government can find aspects in a person s life in order to prosecute or blackmail that individual 9 schneier also argued too many wrongly characterize the debate as security versus privacy the real choice is liberty 
versus control 9 johann hari a british writer argued that the nothing to hide argument is irrelevant to the placement of cctv cameras in public places in the united kingdom because the cameras are public areas where one is observed by many people he or she would be unfamiliar with and not in places where you hide 10 see also edit human rights portal politics portal philosophy portal mass surveillance national security right to privacy references edit best kirsty living in the control society surveillance users and digital screen technologies international journal of cultural studies january 2010 volume 13 no 1 p 5 24 doi 10 1177 1367877909348536 available at sage journals mordini emilio nothing to hide biometrics privacy and private sphere in schouten ben niels christian juul andrzej drygajlo and massimo tistarelli editors biometrics and identity management first european workshop bioid 2008 roskilde denmark may 7 9 2008 revised selected papers springer science business media 2008 p 245 258 isbn 3540899901 9783540899907 solove daniel j nothing to hide the false tradeoff between privacy and security yale university press may 31 2011 isbn 0300172311 9780300172317 viseu ana andrew clement and jane aspinal situating privacy online complex perceptions and everyday practices information communication amp society issn 1369 118x 2004 7 1 92 114 doi 10 1080 1369118042000208924 available from taylor amp francis online notes edit mordini p 252 solove nothing to hide the false tradeoff between privacy and security p 1 if you ve got nothing to hide you shouldn t worry about government surveillance a b c d e solove daniel j why privacy matters even if you have nothing to hide the chronicle of higher education may 15 2011 retrieved on june 25 2013 the nothing to hide argument pervades discussions about privacy the data security expert bruce schneier calls it the most common retort against privacy advocates the legal scholar geoffrey stone refers to it as an all too common refrain in its most compelling form it is an argument that the privacy interest is generally minimal thus making the contest with security concerns a foreordained victory for security best p 12 viseu et al p 102 103 viseu et al p 102 a b c viseu et al p 103 boyd danah danah boyd the problem with the i have nothing to hide argument opinion the dallas morning news june 14 2013 retrieved on june 25 2013 it s disturbing to me how often i watch as someone s likeness is constructed in ways a b schneier bruce the eternal value of privacy wired may 18 2006 retrieved on june 25 2013 also available from schneier s personal website hari johann johann hari this strange backlash against cctv the independent monday march 17 2008 retrieved on june 26 2013 further reading edit klein sascha i ve got nothing to hide electronic surveillance of communications privacy and the power of arguments grin verlag apr 26 2012 isbn 3656179131 9783656179139 solove daniel j i ve got nothing to hide and other misunderstandings of privacy san diego law review vol 44 p 745 2007 p 745 issn 0036 4037 accession number 31197940 george washington university law school public law research paper no 289 an essay that was written for a symposium in the san diego law review available at academic search complete heinonline lexisnexis academic and social science research network surveillance and nothing to hide archive cse ise 312 legal social and ethical issues stony brook university powerpoint presentation based off of solove s work retrieved from http en wikipedia org w index php 
title nothing_to_hide_argument amp oldid 561706261 categories privacysurveillancedata mininghidden categories articles to be expanded from june 2013all articles to be expanded navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version this page was last modified on 26 june 2013 at 18 28 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Online_algorithm b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Online_algorithm new file mode 100644 index 00000000..a62a1fe6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Online_algorithm @@ -0,0 +1 @@ +online algorithm wikipedia the free encyclopedia online algorithm from wikipedia the free encyclopedia jump to navigation search this article relies largely or entirely upon a single source relevant discussion may be found on the talk page please help improve this article by introducing citations to additional sources june 2013 in computer science an online algorithm is one that can process its input piece by piece in a serial fashion i e in the order that the input is fed to the algorithm without having the entire input available from the start in contrast an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand for example selection sort requires that the entire list be given before it can sort it while insertion sort doesn t because it does not know the whole input an online algorithm is forced to make decisions that may later turn out not to be optimal and the study of online algorithms has focused on the quality of decision making that is possible in this setting competitive analysis formalizes this idea by comparing the relative performance of an online and offline algorithm for the same problem instance for other points of view on online inputs to algorithms see streaming algorithm focusing on the amount of memory needed to accurately represent past inputs dynamic algorithm focusing on the time complexity of maintaining solutions to problems with online inputs and online machine learning a problem exemplifying the concepts of online algorithms is the canadian traveller problem the goal of this problem is to minimize the cost of reaching a target in a weighted graph where some of the edges are unreliable and may have been removed from the graph however that an edge has been removed failed is only revealed to the traveller when she he reaches one of the edge s endpoints the worst case for this problem is simply that all of the unreliable edges fail and the problem reduces to the usual shortest path problem an alternative analysis of the problem can be made with the help of competitive analysis for this method of analysis the 
offline algorithm knows in advance which edges will fail and the goal is to minimize the ratio between the online and offline algorithms performance this problem is pspace complete contents 1 online algorithms 2 see also 3 references 4 external links online algorithms edit the names below are referenced with capital letters since they appear in papers with capital letters the following are the names of some online algorithms perceptron algorithms for the k server problem balance2 balance slack double coverage equipoise handicap harmonic random slack tight span algorithm tree algorithm work function algorithm wfa see also edit greedy algorithm adversary model job shop scheduling list update problem metrical task systems odds algorithm paging problem real time computing secretary problem ski rental problem linear search problem search games algorithms for calculating variance bandit problem ukkonen s algorithm references edit borodin a el yaniv r 1998 online computation and competitive analysis cambridge university press isbn 160 0 521 56392 5 160 external links edit bibliography of papers on online algorithms this computer science article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title online_algorithm amp oldid 559165636 categories online algorithmscomputer science stubshidden categories articles needing additional references from june 2013all articles needing additional referenceswikiproject computer science stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch fran ais italiano nederlands polski ti ng vi t edit links this page was last modified on 10 june 2013 at 03 24 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Optimal_matching b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Optimal_matching new file mode 100644 index 00000000..f7593c91 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Optimal_matching @@ -0,0 +1 @@ +optimal matching wikipedia the free encyclopedia optimal matching from wikipedia the free encyclopedia jump to navigation search optimal matching is a sequence analysis method used in social science to assess the dissimilarity of ordered arrays of tokens that usually represent a time ordered sequence of socio economic states two individuals have experienced once such distances have been calculated for a set of observations e g individuals in a cohort classical tools such as cluster analysis can be used the method was tailored to social sciences 1 from a technique originally introduced to study molecular biology protein or genetic sequences see sequence alignment optimal matching uses the needleman wunsch algorithm contents 1 algorithm 2 
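[Editor's note] The online-algorithm article above uses insertion sort versus selection sort as its running example. The following minimal Python sketch (not from the article; the generator name is mine) shows what "processing the input piece by piece" means in practice: after every arriving element the algorithm can already report its best answer for the input seen so far, without ever needing the whole list in advance.

import bisect

def online_insertion_sort(stream):
    """Consume elements one at a time, keeping the prefix sorted after each step."""
    sorted_so_far = []
    for x in stream:                   # each element is processed as it arrives
        bisect.insort(sorted_so_far, x)
        yield list(sorted_so_far)      # best answer given only the input seen so far

# The algorithm never needs the whole input up front, unlike selection sort.
for prefix in online_insertion_sort(iter([5, 2, 9, 1])):
    print(prefix)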
criticism 3 optimal matching in causal modelling 4 software 5 references and notes algorithm edit let be a sequence of states belonging to a finite set of possible states let us denote the sequence space i e the set of all possible sequences of states optimal matching algorithms work by defining simple operator algebras that manipulate sequences i e a set of operators in the most simple approach a set composed of only three basic operations to transform sequences is used one state is inserted in the sequence one state is deleted from the sequence and a state is replaced substituted by state imagine now that a cost is associated to each operator given two sequences and the idea is to measure the cost of obtaining from using operators from the algebra let be a sequence of operators such that the application of all the operators of this sequence to the first sequence gives the second sequence where denotes the compound operator to this set we associate the cost that represents the total cost of the transformation one should consider at this point that there might exist different such sequences that transform into a reasonable choice is to select the cheapest of such sequences we thus call distance that is the cost of the least expensive set of transformations that turn into notice that is by definition nonnegative since it is the sum of positive costs and trivially if and only if that is there is no cost the distance function is symmetric if insertion and deletion costs are equal the term indel cost usually refers to the common cost of insertion and deletion considering a set composed of only the three basic operations described above this proximity measure satisfies the triangular inequality transitivity however depends on the definition of the set of elementary operations criticism edit although optimal matching techniques are widely used in sociology and demography such techniques also have their flaws as was pointed out by several authors for example l l wu 2 the main problem in the application of optimal matching is to appropriately define the costs optimal matching in causal modelling edit optimal matching is also a term used in statistical modelling of causal effects in this context it refers to matching cases with controls and is completely separate from the sequence analytic sense software edit tda is a powerful program offering access to some of the latest developments in transition data analysis stata has implemented a package to run optimal matching analysis traminer is an open source r package for analysing and visualizing states and events sequences including optimal matching analysis references and notes edit a abbott and a tsay 2000 sequence analysis and optimal matching methods in sociology review and prospect sociological methods amp research vol 29 3 33 doi 10 1177 0049124100029001001 l l wu 2000 some comments on sequence analysis and optimal matching methods in sociology review and prospect sociological methods amp research 29 41 64 doi 10 1177 0049124100029001003 retrieved from http en wikipedia org w index php title optimal_matching amp oldid 535744874 categories data miningstatistical distance measures navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages 
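[Editor's note] The optimal-matching distance described above is the cost of the cheapest sequence of insertions, deletions and substitutions that turns one state sequence into another. A minimal dynamic-programming sketch follows; the indel and substitution costs are illustrative parameters, and the toy career sequences are invented.

def om_distance(seq_a, seq_b, indel=1.0, sub=2.0):
    """Cost of the cheapest insert/delete/substitute sequence turning seq_a into seq_b."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = distance between the first i states of seq_a and the first j of seq_b
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub
            dp[i][j] = min(dp[i - 1][j] + indel,       # delete a state
                           dp[i][j - 1] + indel,       # insert a state
                           dp[i - 1][j - 1] + cost)    # substitute (or keep)
    return dp[n][m]

# Two careers coded as yearly socio-economic states (invented toy sequences)
print(om_distance("EEUUF", "EEEUF"))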
permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 30 january 2013 at 20 22 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Predictive_analytics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Predictive_analytics new file mode 100644 index 00000000..71f5c4a1 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Predictive_analytics @@ -0,0 +1 @@ +predictive analytics wikipedia the free encyclopedia predictive analytics from wikipedia the free encyclopedia jump to navigation search this article needs additional citations for verification please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed june 2011 predictive analytics encompasses a variety of techniques from statistics modeling machine learning and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events 1 2 in business predictive models exploit patterns found in historical and transactional data to identify risks and opportunities models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions guiding decision making for candidate transactions predictive analytics is used in actuarial science 3 marketing 4 financial services 5 insurance telecommunications 6 retail 7 travel 8 healthcare 9 pharmaceuticals 10 and other fields one of the most well known applications is credit scoring 1 which is used throughout financial services scoring models process a customer s credit history loan application customer data etc in order to rank order individuals by their likelihood of making future credit payments on time a well known example is the fico score contents 1 definition 2 types 2 1 predictive models 2 2 descriptive models 2 3 decision models 3 applications 3 1 analytical customer relationship management crm 3 2 clinical decision support systems 3 3 collection analytics 3 4 cross sell 3 5 customer retention 3 6 direct marketing 3 7 fraud detection 3 8 portfolio product or economy level prediction 3 9 risk management 3 10 underwriting 4 technology and big data influences on predictive analytics 5 analytical techniques 5 1 regression techniques 5 1 1 linear regression model 5 1 2 discrete choice models 5 1 3 logistic regression 5 1 4 multinomial logistic regression 5 1 5 probit regression 5 1 6 logit versus probit 5 1 7 time series models 5 1 8 survival or duration analysis 5 1 9 classification and regression trees 5 1 10 multivariate adaptive regression splines 5 2 machine learning techniques 5 2 1 neural networks 5 2 2 radial basis functions 5 2 3 support vector machines 5 2 4 na ve bayes 5 2 5 k nearest neighbours 5 2 6 geospatial predictive modeling 6 tools 6 1 pmml 7 see also 8 references 8 1 further reading definition edit predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict trends and behavior patterns often the unknown event of interest is in 
the future but predictive analytics can be applied to any type of unknown whether it be in the past present or future for example identifying suspects after a crime has been committed or credit card fraud as it occurs the core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences and exploiting them to predict the unknown outcome it is important to note however that the accuracy and usability of results will depend greatly on the level of data analysis and the quality of assumptions types edit generally the term predictive analytics is used to mean predictive modeling scoring data with predictive models and forecasting however people are increasingly using the term to refer to related analytical disciplines such as descriptive modeling and decision modeling or optimization these disciplines also involve rigorous data analysis and are widely used in business for segmentation and decision making but have different purposes and the statistical techniques underlying them vary predictive models edit predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in order to improve marketing effectiveness this category also encompasses models that seek out subtle data patterns to answer questions about customer performance such as fraud detection models predictive models often perform calculations during live transactions for example to evaluate the risk or opportunity of a given customer or transaction in order to guide a decision with advancements in computing speed individual agent modeling systems have become capable of simulating human behaviors or reactions to given stimuli or scenarios the new term for animating data specifically linked to an individual in a simulated environment is avatar analytics descriptive models edit descriptive models quantify relationships in data in a way that is often used to classify customers or prospects into groups unlike predictive models that focus on predicting a single customer behavior such as credit risk descriptive models identify many different relationships between customers or products descriptive models do not rank order customers by their likelihood of taking a particular action the way predictive models do instead descriptive models can be used for example to categorize customers by their product preferences and life stage descriptive modeling tools can be utilized to develop further models that can simulate large number of individualized agents and make predictions decision models edit decision models describe the relationship between all the elements of a decision the known data including results of predictive models the decision and the forecast results of the decision in order to predict the results of decisions involving many variables these models can be used in optimization maximizing certain outcomes while minimizing others decision models are generally used to develop decision logic or a set of business rules that will produce the desired action for every customer or circumstance applications edit although predictive analytics can be put to use in many applications we outline a few examples where predictive analytics has shown positive impact in recent years analytical customer relationship management crm edit analytical customer relationship management is a frequent commercial application of predictive analysis methods of predictive analysis are applied to customer data to pursue crm objectives which involve 
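[Editor's note] Predictive models as described here rank-order customers by the likelihood of a behaviour, credit scoring being the standard example. Below is a minimal sketch of that idea using scikit-learn's LogisticRegression on synthetic applicant data; the features, coefficients and sample sizes are invented for illustration and are not taken from the article.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented applicant data: two behavioural features; label 1 = paid on time
X = rng.normal(size=(500, 2))
y = (0.8 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Score new applicants and rank-order them by predicted probability of paying on time
applicants = rng.normal(size=(5, 2))
scores = model.predict_proba(applicants)[:, 1]
print(np.argsort(scores)[::-1])        # best-scored applicant first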
constructing a holistic view of the customer no matter where their information resides in the company or the department involved crm uses predictive analysis in applications for marketing campaigns sales and customer services to name a few these tools are required in order for a company to posture and focus their efforts effectively across the breadth of their customer base they must analyze and understand the products in demand or have the potential for high demand predict customers buying habits in order to promote relevant products at multiple touch points and proactively identify and mitigate issues that have the potential to lose customers or reduce their ability to gain new ones analytical customer relationship management can be applied throughout the customers lifecycle acquisition relationship growth retention and win back several of the application areas described below direct marketing cross sell customer retention are part of customer relationship management clinical decision support systems edit experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions like diabetes asthma heart disease and other lifetime illnesses additionally sophisticated clinical decision support systems incorporate predictive analytics to support medical decision making at the point of care a working definition has been proposed by robert hayward of the centre for health evidence clinical decision support systems link health observations with health knowledge to influence health choices by clinicians for improved health care citation needed collection analytics edit every portfolio has a set of delinquent customers who do not make their payments on time the financial institution has to undertake collection activities on these customers to recover the amounts due a lot of collection resources are wasted on customers who are difficult or impossible to recover predictive analytics can help optimize the allocation of collection resources by identifying the most effective collection agencies contact strategies legal actions and other strategies to each customer thus significantly increasing recovery at the same time reducing collection costs cross sell edit often corporate organizations collect and maintain abundant data e g customer records sale transactions as exploiting hidden relationships in the data can provide a competitive advantage for an organization that offers multiple products predictive analytics can help analyze customers spending usage and other behavior leading to efficient cross sales or selling additional products to current customers 2 this directly leads to higher profitability per customer and stronger customer relationships customer retention edit with the number of competing services available businesses need to focus efforts on maintaining continuous consumer satisfaction rewarding consumer loyalty and minimizing customer attrition businesses tend to respond to customer attrition on a reactive basis acting only after the customer has initiated the process to terminate service at this stage the chance of changing the customer s decision is almost impossible proper application of predictive analytics can lead to a more proactive retention strategy by a frequent examination of a customer s past service usage service performance spending and other behavior patterns predictive models can determine the likelihood of a customer terminating service sometime in the near future 6 an intervention with lucrative offers can 
increase the chance of retaining the customer silent attrition the behavior of a customer to slowly but steadily reduce usage is another problem that many companies face predictive analytics can also predict this behavior so that the company can take proper actions to increase customer activity direct marketing edit when marketing consumer products and services there is the challenge of keeping up with competing products and consumer behavior apart from identifying prospects predictive analytics can also help to identify the most effective combination of product versions marketing material communication channels and timing that should be used to target a given consumer the goal of predictive analytics is typically to lower the cost per order or cost per action fraud detection edit fraud is a big problem for many businesses and can be of various types inaccurate credit applications fraudulent transactions both offline and online identity thefts and false insurance claims these problems plague firms of all sizes in many industries some examples of likely victims are credit card issuers insurance companies 11 retail merchants manufacturers business to business suppliers and even services providers a predictive model can help weed out the bads and reduce a business s exposure to fraud predictive modeling can also be used to identify high risk fraud candidates in business or the public sector nigrini developed a risk scoring method to identify audit targets he describes the use of this approach to detect fraud in the franchisee sales reports of an international fast food chain each location is scored using 10 predictors the 10 scores are then weighted to give one final overall risk score for each location the same scoring approach was also used to identify high risk check kiting accounts potentially fraudulent travel agents and questionable vendors a reasonably complex model was used to identify fraudulent monthly reports submitted by divisional controllers 12 the internal revenue service irs of the united states also uses predictive analytics to mine tax returns and identify tax fraud 11 recent when advancements in technology have also introduced predictive behavior analysis for web fraud detection this type of solution utilizes heuristics in order to study normal web user behavior and detect anomalies indicating fraud attempts portfolio product or economy level prediction edit often the focus of analysis is not the consumer but the product portfolio firm industry or even the economy for example a retailer might be interested in predicting store level demand for inventory management purposes or the federal reserve board might be interested in predicting the unemployment rate for the next year these types of problems can be addressed by predictive analytics using time series techniques see below they can also be addressed via machine learning approaches which transform the original time series into a feature vector space where the learning algorithm finds patterns that have predictive power 13 14 risk management edit when employing risk management techniques the results are always to predict and benefit from a future scenario the capital asset pricing model cap m predicts the best portfolio to maximize return probabilistic risk assessment pra when combined with mini delphi techniques and statistical approaches yields accurate forecasts and riskaoa is a stand alone predictive tool 15 these are three examples of approaches that can extend from project to market and from near to long term 
underwriting see below and other business approaches identify risk management as a predictive method underwriting edit many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk for example auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver a financial company needs to assess a borrower s potential and ability to pay before granting a loan for a health insurance provider predictive analytics can analyze a few years of past medical claims data as well as lab pharmacy and other records where available to predict how expensive an enrollee is likely to be in the future predictive analytics can help underwrite these quantities by predicting the chances of illness default bankruptcy etc predictive analytics can streamline the process of customer acquisition by predicting the future risk behavior of a customer using application level data 3 predictive analytics in the form of credit scores have reduced the amount of time it takes for loan approvals especially in the mortgage market where lending decisions are now made in a matter of hours rather than days or even weeks proper predictive analytics can lead to proper pricing decisions which can help mitigate future risk of default technology and big data influences on predictive analytics edit big data is a collection of data sets that are so large and complex that they become awkward to work with using traditional database management tools the volume variety and velocity of big data have introduced challenges across the board for capture storage search sharing analysis and visualization examples of big data sources include web logs rfid and sensor data social networks internet search indexing call detail records military surveillance and complex data in astronomic biogeochemical genomics and atmospheric sciences thanks to technological advances in computer hardware faster cpus cheaper memory and mpp architectures and new technologies such as hadoop mapreduce and in database and text analytics for processing big data it is now feasible to collect analyze and mine massive amounts of structured and unstructured data for new insights 11 today exploring big data and using predictive analytics is within reach of more organizations than ever before analytical techniques edit the approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques regression techniques edit regression models are the mainstay of predictive analytics the focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration depending on the situation there is a wide variety of models that can be applied while performing predictive analytics some of them are briefly discussed below linear regression model edit the linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables this relationship is expressed as an equation that predicts the response variable as a linear function of the parameters these parameters are adjusted so that a measure of fit is optimized much of the effort in model fitting is focused on minimizing the size of the residual as well as ensuring that it is randomly distributed with respect to the model predictions the goal of regression is to select the parameters of the model so as to minimize 
the sum of the squared residuals this is referred to as ordinary least squares ols estimation and results in best linear unbiased estimates blue of the parameters if and only if the gauss markov assumptions are satisfied once the model has been estimated we would be interested to know if the predictor variables belong in the model i e is the estimate of each variable s contribution reliable to do this we can check the statistical significance of the model s coefficients which can be measured using the t statistic this amounts to testing whether the coefficient is significantly different from zero how well the model predicts the dependent variable based on the value of the independent variables can be assessed by using the r statistic it measures predictive power of the model i e the proportion of the total variation in the dependent variable that is explained accounted for by variation in the independent variables discrete choice models edit multivariate regression above is generally used when the response variable is continuous and has an unbounded range often the response variable may not be continuous but rather discrete while mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables some of the assumptions behind the theory of multivariate linear regression no longer hold and there are other techniques such as discrete choice models which are better suited for this type of analysis if the dependent variable is discrete some of those superior methods are logistic regression multinomial logit and probit models logistic regression and probit models are used when the dependent variable is binary logistic regression edit for more details on this topic see logistic regression in a classification setting assigning outcome probabilities to observations can be achieved through the use of a logistic model which is basically a method which transforms information about the binary dependent variable into an unbounded continuous variable and estimates a regular multivariate model see allison s logistic regression for more information on the theory of logistic regression the wald and likelihood ratio test are used to test the statistical significance of each coefficient b in the model analogous to the t tests used in ols regression see above a test assessing the goodness of fit of a classification model is the percentage correctly predicted multinomial logistic regression edit an extension of the binary logit model to cases where the dependent variable has more than 2 categories is the multinomial logit model in such cases collapsing the data into two categories might not make good sense or may lead to loss in the richness of the data the multinomial logit model is the appropriate technique in these cases especially when the dependent variable categories are not ordered for examples colors like red blue green some authors have extended multinomial regression to include feature selection importance methods such as random multinomial logit probit regression edit probit models offer an alternative to logistic regression for modeling categorical dependent variables even though the outcomes tend to be similar the underlying distributions are different probit models are popular in social sciences like economics a good way to understand the key difference between probit and logit models is to assume that there is a latent variable z we do not observe z but instead observe y which takes the value 0 or 1 in the logit model we assume that y follows a logistic 
distribution in the probit model we assume that y follows a standard normal distribution note that in social sciences e g economics probit is often used to model situations where the observed variable y is continuous but takes values between 0 and 1 logit versus probit edit the probit model has been around longer than the logit model they behave similarly except that the logistic distribution tends to be slightly flatter tailed one of the reasons the logit model was formulated was that the probit model was computationally difficult due to the requirement of numerically calculating integrals modern computing however has made this computation fairly simple the coefficients obtained from the logit and probit model are fairly close however the odds ratio is easier to interpret in the logit model practical reasons for choosing the probit model over the logistic model would be there is a strong belief that the underlying distribution is normal the actual event is not a binary outcome e g bankruptcy status but a proportion e g proportion of population at different debt levels time series models edit time series models are used for predicting or forecasting the future behavior of variables these models account for the fact that data points taken over time may have an internal structure such as autocorrelation trend or seasonal variation that should be accounted for as a result standard regression techniques cannot be applied to time series data and methodology has been developed to decompose the trend seasonal and cyclical component of the series modeling the dynamic path of a variable can improve forecasts since the predictable component of the series can be projected into the future time series models estimate difference equations containing stochastic components two commonly used forms of these models are autoregressive models ar and moving average ma models the box jenkins methodology 1976 developed by george box and g m jenkins combines the ar and ma models to produce the arma autoregressive moving average model which is the cornerstone of stationary time series analysis arima autoregressive integrated moving average models on the other hand are used to describe non stationary time series box and jenkins suggest differencing a non stationary time series to obtain a stationary series to which an arma model can be applied non stationary time series have a pronounced trend and do not have a constant long run mean or variance box and jenkins proposed a three stage methodology which includes model identification estimation and validation the identification stage involves identifying if the series is stationary or not and the presence of seasonality by examining plots of the series autocorrelation and partial autocorrelation functions in the estimation stage models are estimated using non linear time series or maximum likelihood estimation procedures finally the validation stage involves diagnostic checking such as plotting the residuals to detect outliers and evidence of model fit in recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as arch autoregressive conditional heteroskedasticity and garch generalized autoregressive conditional heteroskedasticity models frequently used for financial time series in addition time series models are also used to understand inter relationships among economic variables represented by systems of equations using var vector autoregression and structural var models survival or duration 
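[Editor's note] The time-series passage introduces autoregressive models as one of the two building blocks of ARMA. A minimal sketch follows: it simulates an AR(1) process and recovers the autoregressive coefficient by regressing the series on its own lag; a real Box-Jenkins workflow would use a full time-series package, so this only illustrates the autoregressive idea.

import numpy as np

rng = np.random.default_rng(3)

# Simulate a stationary AR(1) process: y_t = 0.7 * y_{t-1} + e_t
phi_true, n = 0.7, 1000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Estimate the autoregressive coefficient by regressing y_t on its own lag
y_lag, y_now = y[:-1], y[1:]
phi_hat = (y_lag @ y_now) / (y_lag @ y_lag)
print(phi_hat)                     # close to 0.7

# One-step-ahead forecast using the estimated dynamics
print(phi_hat * y[-1])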
analysis edit survival analysis is another name for time to event analysis these techniques were primarily developed in the medical and biological sciences but they are also widely used in the social sciences like economics as well as in engineering reliability and failure time analysis censoring and non normality which are characteristic of survival data generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression the normal distribution being a symmetric distribution takes positive as well as negative values but duration by its very nature cannot be negative and therefore normality cannot be assumed when dealing with duration survival data hence the normality assumption of regression models is violated the assumption is that if the data were not censored it would be representative of the population of interest in survival analysis censored observations arise whenever the dependent variable of interest represents the time to a terminal event and the duration of the study is limited in time an important concept in survival analysis is the hazard rate defined as the probability that the event will occur at time t conditional on surviving until time t another concept related to the hazard rate is the survival function which can be defined as the probability of surviving to time t most models try to model the hazard rate by choosing the underlying distribution depending on the shape of the hazard function a distribution whose hazard function slopes upward is said to have positive duration dependence a decreasing hazard shows negative duration dependence whereas constant hazard is a process with no memory usually characterized by the exponential distribution some of the distributional choices in survival models are f gamma weibull log normal inverse normal exponential etc all these distributions are for a non negative random variable duration models can be parametric non parametric or semi parametric some of the models commonly used are kaplan meier and cox proportional hazard model non parametric classification and regression trees edit main article decision tree learning classification and regression trees cart is a non parametric decision tree learning technique that produces either classification or regression trees depending on whether the dependent variable is categorical or numeric respectively decision trees are formed by a collection of rules based on variables in the modeling data set rules based on variables values are selected to get the best split to differentiate observations based on the dependent variable once a rule is selected and splits a node into two the same process is applied to each child node i e it is a recursive procedure splitting stops when cart detects no further gain can be made or some pre set stopping rules are met alternatively the data are split as much as possible and then the tree is later pruned each branch of the tree ends in a terminal node each observation falls into one and exactly one terminal node and each terminal node is uniquely defined by a set of rules a very popular method for predictive analytics is leo breiman s random forests or derived versions of this technique like random multinomial logit multivariate adaptive regression splines edit multivariate adaptive regression splines mars is a non parametric technique that builds flexible models by fitting piecewise linear regressions an important concept associated with regression splines is that of a knot knot is where one local 
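[Editor's note] The survival-analysis passage mentions censoring, the survival function and the Kaplan-Meier estimator. Below is a minimal non-parametric Kaplan-Meier sketch in NumPy; the toy durations and censoring flags are invented.

import numpy as np

def kaplan_meier(durations, event_observed):
    """Non-parametric survival curve; event_observed is 0 for censored spells."""
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    times = np.unique(durations[event_observed])           # distinct event times
    survival, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                   # still under observation at t
        events = np.sum((durations == t) & event_observed)
        s *= 1.0 - events / at_risk                        # conditional survival at t
        survival.append((t, s))
    return survival

# Toy spell data: time until the event, and whether the event was actually seen
print(kaplan_meier([2, 3, 3, 5, 8, 8, 12], [1, 1, 0, 1, 1, 0, 1]))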
regression model gives way to another and thus is the point of intersection between two splines in multivariate and adaptive regression splines basis functions are the tool used for generalizing the search for knots basis functions are a set of functions used to represent the information contained in one or more variables multivariate and adaptive regression splines model almost always creates the basis functions in pairs multivariate and adaptive regression spline approach deliberately overfits the model and then prunes to get to the optimal model the algorithm is computationally very intensive and in practice we are required to specify an upper limit on the number of basis functions machine learning techniques edit machine learning a branch of artificial intelligence was originally employed to develop techniques to enable computers to learn today since it includes a number of advanced statistical methods for regression and classification it finds application in a wide variety of fields including medical diagnostics credit card fraud detection face and speech recognition and analysis of the stock market in certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables in other cases the underlying relationships can be very complex and the mathematical form of the dependencies unknown for such cases machine learning techniques emulate human cognition and learn from training examples to predict future events a brief discussion of some of these methods used commonly for predictive analytics is provided below a detailed study of machine learning can be found in mitchell 1997 neural networks edit neural networks are nonlinear sophisticated modeling techniques that are able to model complex functions they can be applied to problems of prediction classification or control in a wide spectrum of fields such as finance cognitive psychology neuroscience medicine engineering and physics neural networks are used when the exact nature of the relationship between inputs and output is not known a key feature of neural networks is that they learn the relationship between inputs and output through training there are three types of training in neural networks used by different networks supervised and unsupervised training reinforcement learning with supervised being the most common one some examples of neural network training techniques are backpropagation quick propagation conjugate gradient descent projection operator delta bar delta etc some unsupervised network architectures are multilayer perceptrons kohonen networks hopfield networks etc radial basis functions edit a radial basis function rbf is a function which has built into it a distance criterion with respect to a center such functions can be used very efficiently for interpolation and for smoothing of data radial basis functions have been applied in the area of neural networks where they are used as a replacement for the sigmoidal transfer function such networks have 3 layers the input layer the hidden layer with the rbf non linearity and a linear output layer the most popular choice for the non linearity is the gaussian rbf networks have the advantage of not being locked into local minima as do the feed forward networks such as the multilayer perceptron support vector machines edit support vector machines svm are used to detect and exploit complex patterns in data by clustering classifying and ranking the data they are learning machines that are used to perform 
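[Editor's note] The radial-basis-function passage describes a three-layer network: inputs, a Gaussian hidden layer and a linear output layer. The sketch below follows that structure on a toy 1-D regression problem; the choice of centres and the Gaussian width are illustrative assumptions, not prescriptions from the article.

import numpy as np

rng = np.random.default_rng(4)

# 1-D toy regression problem
x = np.linspace(0, 2 * np.pi, 100)[:, None]
y = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=100)

centres = x[::10]                      # hidden-layer centres: a subset of the inputs
width = 0.5                            # Gaussian width (illustrative choice)

def rbf_features(inputs):
    # Gaussian hidden layer: one feature per centre
    d2 = (inputs - centres.T) ** 2
    return np.exp(-d2 / (2 * width ** 2))

# Linear output layer fitted by least squares
H = rbf_features(x)
w, *_ = np.linalg.lstsq(H, y, rcond=None)

y_hat = rbf_features(x) @ w
print(np.mean((y - y_hat) ** 2))       # training error of the fit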
binary classifications and regression estimations they commonly use kernel based methods to apply linear classification techniques to non linear classification problems there are a number of types of svm such as linear polynomial sigmoid etc na ve bayes edit na ve bayes based on bayes conditional probability rule is used for performing classification tasks na ve bayes assumes the predictors are statistically independent which makes it an effective classification tool that is easy to interpret it is best employed when faced with the problem of curse of dimensionality i e when the number of predictors is very high k nearest neighbours edit the nearest neighbour algorithm knn belongs to the class of pattern recognition statistical methods the method does not impose a priori any assumptions about the distribution from which the modeling sample is drawn it involves a training set with both positive and negative values a new sample is classified by calculating the distance to the nearest neighbouring training case the sign of that point will determine the classification of the sample in the k nearest neighbour classifier the k nearest points are considered and the sign of the majority is used to classify the sample the performance of the knn algorithm is influenced by three main factors 1 the distance measure used to locate the nearest neighbours 2 the decision rule used to derive a classification from the k nearest neighbours and 3 the number of neighbours used to classify the new sample it can be proved that unlike other methods this method is universally asymptotically convergent i e as the size of the training set increases if the observations are independent and identically distributed i i d regardless of the distribution from which the sample is drawn the predicted class will converge to the class assignment that minimizes misclassification error see devroy et al geospatial predictive modeling edit conceptually geospatial predictive modeling is rooted in the principle that the occurrences of events being modeled are limited in distribution occurrences of events are neither uniform nor random in distribution there are spatial environment factors infrastructure sociocultural topographic etc that constrain and influence where the locations of events occur geospatial predictive modeling attempts to describe those constraints and influences by spatially correlating occurrences of historical geospatial locations with environmental factors that represent those constraints and influences geospatial predictive modeling is a process for analyzing events through a geographic filter in order to make statements of likelihood for event occurrence or emergence tools edit historically using predictive analytics tools as well as understanding the results they delivered required advanced skills however modern predictive analytics tools are no longer restricted to it specialists as more organizations adopt predictive analytics into decision making processes and integrate it into their operations they are creating a shift in the market toward business users as the primary consumers of the information business users want tools they can use on their own vendors are responding by creating new software that removes the mathematical complexity provides user friendly graphic interfaces and or builds in short cuts that can for example recognize the kind of data available and suggest an appropriate predictive model 16 predictive analytics tools have become sophisticated enough to adequately present and dissect data 
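[Editor's note] The k-nearest-neighbours passage lists the three ingredients of the method: a distance measure, a decision rule and the number of neighbours. A minimal from-scratch sketch with Euclidean distance and majority vote follows; the toy training set is invented.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance measure
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]                     # majority-vote decision rule

# Toy two-class training set
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # -> 1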
problems so that any data savvy information worker can utilize them to analyze data and retrieve meaningful useful results 2 for example modern tools present findings using simple charts graphs and scores that indicate the likelihood of possible outcomes 17 there are numerous tools available in the marketplace that help with the execution of predictive analytics these range from those that need very little user sophistication to those that are designed for the expert practitioner the difference between these tools is often in the level of customization and heavy data lifting allowed notable open source predictive analytic tools include knime orange python r rapidminer weka notable commercial predictive analytic tools include angoss knowledgestudio ibm spss statistics and ibm spss modeler kxen modeler mathematica matlab oracle data mining odm pervasive sap sas and sas enterprise miner statistica tibco pmml edit in an attempt to provide a standard language for expressing predictive models the predictive model markup language pmml has been proposed such an xml based language provides a way for the different tools to define predictive models and to share these between pmml compliant applications pmml 4 0 was released in june 2009 see also edit criminal reduction utilising statistical history data mining learning analytics odds algorithm pattern recognition prescriptive analytics predictive modelling this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations october 2011 references edit a b nyce charles 2007 predictive analytics white paper american institute for chartered property casualty underwriters insurance institute of america p 160 1 160 a b c eckerson wayne may 10 2007 extending the value of your data warehousing investment the data warehouse institute 160 a b conz nathan september 2 2008 insurers shift to customer focused predictive analytics technologies insurance amp technology 160 fletcher heather march 2 2011 the 7 best uses for predictive analytics in multichannel marketing target marketing 160 korn sue april 21 2011 the opportunity for predictive analytics in finance hpc wire 160 a b barkin eric may 2011 crm predictive analytics why it all adds up destination crm 160 das krantik vidyashankar g s july 1 2006 competitive advantage in retail through analytics developing insights creating value information management 160 mcdonald mich le september 2 2010 new technology taps predictive analytics to target travel recommendations travel market report 160 stevenson erin december 16 2011 tech beat can you pronounce health care predictive analytics times standard 160 mckay lauren august 2009 the new prescription for pharma destination crm 160 a b c schiff mike march 6 2012 bi experts why predictive analytics will continue to grow the data warehouse institute 160 nigrini mark june 2011 forensic analytics methods and techniques for forensic accounting investigations hoboken nj john wiley amp sons inc isbn 160 978 0 470 89046 2 160 dhar vasant april 2011 prediction in financial markets the case for small disjuncts acm transactions on intelligent systems and technologies 2 3 160 dhar vasant chou dashin and provost foster october 2000 discovering interesting patterns in investment decision making with glower a genetic learning algorithm overlaid with entropy reduction data mining and knowledge discovery 4 4 160 https acc dau mil communitybrowser aspx id 126070 
halper fran november 1 2011 the top 5 trends in predictive analytics information management 160 maclennan jamie may 1 2012 5 myths about predictive analytics the data warehouse institute 160 further reading edit agresti alan 2002 categorical data analysis hoboken john wiley and sons isbn 160 0 471 36093 7 160 coggeshall stephen davies john jones roger and schutzer daniel intelligent security systems in freedman roy s flein robert a and lederman jess editors 1995 artificial intelligence in the capital markets chicago irwin isbn 160 1 55738 811 3 160 l devroye l gy rfi g lugosi 1996 a probabilistic theory of pattern recognition new york springer verlag 160 enders walter 2004 applied time series econometrics hoboken john wiley and sons isbn 160 0 521 83919 x 160 greene william 2000 econometric analysis prentice hall isbn 160 0 13 013297 7 160 unknown parameter ignored help guid re mathieu howard n sh argamon 2009 rich language analysis for counterterrrorism berlin london new york springer verlag isbn 160 978 3 642 01140 5 160 mitchell tom 1997 machine learning new york mcgraw hill isbn 160 0 07 042807 7 160 siegel eric 2013 predictive analytics the power to predict who will click buy lie or die john wiley isbn 160 978 1 1183 5685 2 160 tukey john 1977 exploratory data analysis new york addison wesley isbn 160 0 201 07616 0 160 finlay steven 2012 credit scoring response modeling and insurance rating a practical guide to forecasting customer behavior basingstoke palgrave macmillan isbn 160 0 230 34776 2 160 retrieved from http en wikipedia org w index php title predictive_analytics amp oldid 559085626 categories statistical modelsbusiness intelligenceinsurancehidden categories articles needing additional references from june 2011all articles needing additional referencesall articles with unsourced statementsarticles with unsourced statements from june 2012vague or ambiguous time from october 2011articles lacking in text citations from october 2011all articles lacking in text citationspages with citations using unsupported parameters navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 20 june 2013 at 09 07 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning new file mode 100644 index 00000000..0c9b79e5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning @@ -0,0 +1 @@ +proactive discovery of insider threats using graph analysis and 
learning wikipedia the free encyclopedia proactive discovery of insider threats using graph analysis and learning from wikipedia the free encyclopedia jump to navigation search proactive discovery of insider threats using graph analysis and learning establishment 2011 sponsor darpa value 9 million goal rapidly data mine large sets to discover anomalies proactive discovery of insider threats using graph analysis and learning or prodigal is a computer system for predicting anomalous behavior amongst humans by data mining network traffic such as emails text messages and log entries 1 it is part of darpa s anomaly detection at multiple scales adams project 2 the initial schedule is for two years and the budget 9 million 3 it uses graph theory machine learning statistical anomaly detection and high performance computing to scan larger sets of data more quickly than in past systems the amount of data analyzed is in the range of terabytes per day 3 the targets of the analysis are employees within the government or defense contracting organizations specific examples of behavior the system is intended to detect include the actions of nidal malik hasan and wikileaks alleged source bradley manning 1 commercial applications may include finance 1 the results of the analysis the five most serious threats per day go to agents analysts and operators working in counterintelligence 1 3 4 primary participants edit georgia institute of technology college of computing georgia tech research institute defense advanced research projects agency army research office science applications international corporation oregon state university university of massachusetts amherst carnegie mellon university see also edit cyber insider threat einstein us cert program threat computer intrusion detection echelon thinthread trailblazer turbulence nsa programs fusion center investigative data warehouse fbi references edit a b c d video interview darpa s adams project taps big data to find the breaking bad inside hpc 2011 11 29 retrieved 2011 12 05 160 brandon john 2011 12 03 could the u s government start reading your emails fox news retrieved 2011 12 06 160 a b c georgia tech helps to develop system that will detect insider threats from massive data sets georgia institute of technology 2011 11 10 retrieved 2011 12 06 160 storm darlene 2011 12 06 sifting through petabytes prodigal monitoring for lone wolf insider threats computer world retrieved 2011 12 06 160 this software article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title proactive_discovery_of_insider_threats_using_graph_analysis_and_learning amp oldid 528260579 categories data miningcomputer securitygeorgia tech research institutedarpaparallel computingsoftware stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 16 december 2012 at 04 41 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered 
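[Editor's note] PRODIGAL's internals are not described here, so the following is emphatically not its method; it is only a generic illustration of statistical anomaly scoring of the kind the article alludes to, flagging the handful of most unusual users in invented per-user activity counts for an analyst to review.

import numpy as np

rng = np.random.default_rng(5)

# Invented per-user daily event counts (e.g. files copied): one row per user, one column per day
counts = rng.poisson(lam=20, size=(50, 30)).astype(float)
counts[7, -1] = 400                     # plant one obviously unusual day for user 7

# Score each user's latest day against that user's own history (simple z-score)
history, latest = counts[:, :-1], counts[:, -1]
z = (latest - history.mean(axis=1)) / (history.std(axis=1) + 1e-9)

top5 = np.argsort(z)[::-1][:5]          # pass the few most anomalous users to an analyst
print(list(zip(top5, np.round(z[top5], 1))))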
trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Profiling_practices b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Profiling_practices new file mode 100644 index 00000000..c89ce9a0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Profiling_practices @@ -0,0 +1 @@ +profiling practices wikipedia the free encyclopedia profiling practices from wikipedia the free encyclopedia jump to navigation search profiling information science refers to the whole process of construction and application of profiles generated by computerized profiling technologies what characterizes profiling technologies is the use of algorithms or other mathematical techniques that allow one to discover patterns or correlations in large quantities of data aggregated in databases when these patterns or correlations are used to identify or represent people they can be called profiles other than a discussion of profiling technologies or population profiling the notion of profiling practices is not just about the construction of profiles but also concerns the application of group profiles to individuals e g in the case of credit scoring price discrimination or identification of security risks hildebrandt amp gutwirth 2008 elmer 2004 profiling is not simply a matter of computerized pattern recognition it enables refined price discrimination targeted servicing detection of fraud and extensive social sorting real time machine profiling constitutes the precondition for emerging socio technical infrastructures envisioned by advocates of ambient intelligence 1 autonomic computing kephart amp chess 2003 and ubiquitous computing weiser 1991 one of the most challenging problems of the information society is dealing with the increasing data overload with the digitizing of all sorts of content as well as the improvement and drop in cost of recording technologies the amount of available information has become enormous and is increasing exponentially it has thus become important for companies governments and individuals to be able to discriminate information from noise detecting those data that are useful or interesting the development of profiling technologies must be seen against this background these technologies are thought to efficiently collect and analyse data in order to find or test knowledge in the form of statistical patterns between data this process is called knowledge discovery in databases kdd fayyad piatetsky shapiro amp smyth 1996 which provides the profiler with sets of correlated data that are used as profiles contents 1 the profiling process 2 types of profiling practices 2 1 supervised and unsupervised learning 2 2 individual and group profiles 2 3 distributive and non distributive profiling 3 application domains 4 risks and issues 5 see also 6 references 6 1 notes and other references the profiling process edit the technical process of profiling can be separated in several steps preliminary grounding the profiling process starts with a specification of the applicable problem domain and the identification of the goals of analysis data collection the target dataset or database for analysis is formed by selecting the relevant data in the light of existing domain knowledge and data understanding data preparation the data are preprocessed for removing noise and reducing complexity by 
eliminating attributes data mining the data are analysed with the algorithm or heuristics developed to suit the data model and goals interpretation the mined patterns are evaluated on their relevance and validity by specialists and or professionals in the application domain e g excluding spurious correlations application the constructed profiles are applied e g to categories of persons to test and fine tune the algorithms institutional decision the institution decides what actions or policies to apply to groups or individuals whose data match a relevant profile data collection preparation and mining all belong to the phase in which the profile is under construction however profiling also refers to the application of profiles meaning the usage of profiles for the identification or categorization of groups or individual persons as can be seen in step six application the process is circular there is a feedback loop between the construction and the application of profiles the interpretation of profiles can lead to the reiterant possibly real time fine tuning of specific previous steps in the profiling process the application of profiles to people whose data were not used to construct the profile is based on data matching which provides new data that allows for further adjustments the process of profiling is both dynamic and adaptive a good illustration of the dynamic and adaptive nature of profiling is the cross industry standard process for data mining crisp dm types of profiling practices edit in order to clarify the nature of profiling technologies some crucial distinctions have to be made between different types of profiling practices apart from the distinction between the construction and the application of profiles the main distinctions are those between bottom up and top down profiling or supervised and unsupervised learning and between individual and group profiles supervised and unsupervised learning edit profiles can be classified according to the way they have been generated fayyad piatetsky shapiro amp smyth 1996 zarsky 2002 3 on the one hand profiles can be generated by testing a hypothesized correlation this is called top down profiling or supervised learning this is similar to the methodology of traditional scientific research in that it starts with a hypothesis and consists of testing its validity the result of this type of profiling is the verification or refutation of the hypothesis one could also speak of deductive profiling on the other hand profiles can be generated by exploring a data base using the data mining process to detect patterns in the data base that were not previously hypothesized in a way this is a matter of generating hypothesis finding correlations one did not expect or even think of once the patterns have been mined they will enter the loop described above and will be tested with the use of new data this is called unsupervised learning two things are important with regard to this distinction first unsupervised learning algorithms seem to allow the construction of a new type of knowledge not based on hypothesis developed by a researcher and not based on causal or motivational relations but exclusively based on stochastical correlations second unsupervised learning algorithms thus seem to allow for an inductive type of knowledge construction that does not require theoretical justification or causal explanation custers 2004 some authors claim that if the application of profiles based on computerized stochastical pattern recognition works i e allows for 
reliable predictions of future behaviours the theoretical or causal explanation of these patterns does not matter anymore anderson 2008 however the idea that blind algorithms provide reliable information does not imply that the information is neutral in the process of collecting and aggregating data into a database the first three steps of the process of profile construction translations are made from real life events to machine readable data these data are then prepared and cleansed to allow for initial computability potential bias will have to be located at these points as well as in the choice of algorithms that are developed it is not possible to mine a database for all possible linear and non linear correlations meaning that the mathematical techniques developed to search for patterns will be determinate of the patterns that can be found in the case of machine profiling potential bias is not informed by common sense prejudice or what psychologists call stereotyping but by the computer techniques employed in the initial steps of the process these techniques are mostly invisible for those to whom profiles are applied because their data match the relevant group profiles individual and group profiles edit profiles must also be classified according to the kind of subject they refer to this subject can either be an individual or a group of people when a profile is constructed with the data of a single person this is called individual profiling jaquet chiffelle 2008 this kind of profiling is used to discover the particular characteristics of a certain individual to enable unique identification or the provision of personalized services however personalized servicing is most often also based on group profiling which allows categorisation of a person as a certain type of person based on the fact that her profile matches with a profile that has been constructed on the basis of massive amounts of data about massive numbers of other people a group profile can refer to the result of data mining in data sets that refer to an existing community that considers itself as such like a religious group a tennis club a university a political party etc in that case it can describe previously unknown patterns of behaviour or other characteristics of such a group community a group profile can also refer to a category of people that do not form a community but are found to share previously unknown patterns of behaviour or other characteristics custers 2004 in that case the group profile describes specific behaviours or other characteristics of a category of people like for instance women with blue eyes and red hair or adults with relatively short arms and legs these categories may be found to correlate with health risks earning capacity mortality rates credit risks etc if an individual profile is applied to the individual that it was mined from then that is direct individual profiling if a group profile is applied to an individual whose data match the profile then that is indirect individual profiling because the profile was generated using data of other people similarly if a group profile is applied to the group that it was mined from then that is direct group profiling jaquet chiffelle 2008 however in as far as the application of a group profile to a group implies the application of the group profile to individual members of the group it makes sense to speak of indirect group profiling especially if the group profile is non distributive distributive and non distributive profiling edit group profiles can also 
be divided in terms of their distributive character vedder 1999 a group profile is distributive when its properties apply equally to all the members of its group all bachelors are unmarried or all persons with a specific gene have 80 chance to contract a specific disease a profile is non distributive when the profile does not necessarily apply to all the members of the group the group of persons with a specific postal code have an average earning capacity of xx or the category of persons with blue eyes has an average chance of 37 to contract a specific disease note that in this case the chance of an individual to have a particular earning capacity or to contract the specific disease will depend on other factors e g sex age background of parents previous health education it should be obvious that apart from tautological profiles like that of bachelors most group profiles generated by means of computer techniques are non distributive this has far reaching implications for the accuracy of indirect individual profiling based on data matching with non distributive group profiles quite apart from the fact that the application of accurate profiles may be unfair or cause undue stigmatisation most group profiles will not be accurate application domains edit profiling technologies can be applied in a variety of different domains and for a variety of purposes these profiling practices will all have different effect and raise different issues knowledge about the behaviour and preferences of customers is of great interest to the commercial sector on the basis of profiling technologies companies can predict the behaviour of different types of customers marketing strategies can then be tailored to the people fitting these types examples of profiling practices in marketing are customers loyalty cards customer relationship management in general and personalized advertising 1 2 3 in the financial sector institutions use profiling technologies for fraud prevention and credit scoring banks want to minimise the risks in giving credit to their customers on the basis of extensive group profiling customers are assigned a certain scoring value that indicates their creditworthiness financial institutions like banks and insurance companies also use group profiling to detect fraud or money laundering databases with transactions are searched with algorithms to find behaviours that deviate from the standard indicating potentially suspicious transactions 2 in the context of employment profiles can be of use for tracking employees by monitoring their online behaviour for the detection of fraud by them and for the deployment of human resources by pooling and ranking their skills leopold amp meints 2008 4 profiling can also be used to support people at work and also for learning by intervening in the design of adaptive hypermedia systems personalising the interaction for instance this can be useful for supporting the management of attention nabeth 2008 in forensic science the possibility exists of linking different databases of cases and suspects and mining these for common patterns this could be used for solving existing cases or for the purpose of establishing risk profiles of potential suspects geradts amp sommer 2008 harcourt 2006 risks and issues edit profiling technologies have raised a host of ethical legal and other issues including privacy equality due process security and liability numerous authors have warned against the affordances of a new technological infrastructure that could emerge on the basis of semi 
autonomic profiling technologies lessig 2006 solove 2004 schwartz 2000 privacy is one of the principal issues raised profiling technologies make possible a far reaching monitoring of an individual s behaviour and preferences profiles may reveal personal or private information about individuals that they might not even be aware of themselves hildebrandt amp gutwirth 2008 profiling technologies are by their very nature discriminatory tools they allow unparalleled kinds of social sorting and segmentation which could have unfair effects the people that are profiled may have to pay higher prices 3 they could miss out on important offers or opportunities and they may run increased risks because catering to their needs is less profitable lyon 2003 in most cases they will not be aware of this since profiling practices are mostly invisible and the profiles themselves are often protected by intellectual property or trade secret this poses a threat to the equality of and solidarity of citizens on a larger scale it might cause the segmentation of society 4 one of the problems underlying potential violations of privacy and non discrimination is that the process of profiling is more often than not invisible for those that are being profiled this creates difficulties in that it becomes hard if not impossible to contest the application of a particular group profile this disturbs principles of due process if a person has no access to information on the basis of which she is withheld benefits or attributed certain risks she cannot contest the way she is being treated steinbock 2005 profiles can be used against people when they end up in the hands of people who are not entitled to access or use them an important issue related to these breaches of security is identity theft when the application of profiles causes harm the liability for this harm has to be determined who is to be held accountable is the software programmer the profiling service provider or the profiled user to be held accountable this issue of liability is especially complex in the case the application and decisions on profiles have also become automated like in autonomic computing or ambient intelligence decisions of automated decisions based on profiling see also edit profiling forensic profiling data mining digital traces identification information identity behavioral targeting digital identity privacy labelling stereotype user profile demographics references edit anderson chris 2008 the end of theory the data deluge makes the scientific method obsolete wired magazine 16 7 160 custers b h m 2004 the power of knowledge tilburg wolf legal publishers 160 elmer g 2004 profiling machines mapping the personal information economy mit press 160 fayyad u m piatetsky shapiro g smyth p 1996 from data mining to knowledge discovery in databases ai magazine 17 3 37 54 160 geradts zeno sommer peter 2008 d6 7c forensic profiling fidis deliverables 6 7c 160 harcourt b e 2006 against prediction profiling policing and punishing in an actuarial age the university of chicago press chicago and london 160 hildebrandt mireille gutwirth serge 2008 profiling the european citizen cross disciplinary perspectives springer dordrecht doi 10 1007 978 1 4020 6914 7 isbn 160 978 1 4020 6913 0 160 jaquet chiffelle david olivier 2008 reply direct and indirect profiling in the light of virtual persons to defining profiling a new type of knowledge in hildebrandt mireille gutwirth serge profiling the european citizen springer netherlands pp 160 17 45 doi 10 1007 978 1 4020 6914 
7_2 160 kephart j o chess d m 2003 the vision of autonomic computing computer 36 1 january 96 104 doi 10 1109 mc 2003 1160055 160 leopold n meints m 2008 profiling in employment situations fraud in hildebrandt mireille gutwirth serge profiling the european citizen springer netherlands pp 160 217 237 doi 10 1007 978 1 4020 6914 7_12 160 lessig l 2006 code 2 0 basic books new york 160 lyon d 2003 surveillance as social sorting privacy risk and digital discrimination routledge 160 nabeth thierry 2008 user profiling for attention support for school and work in hildebrandt mireille gutwirth serge profiling the european citizen springer netherlands pp 160 185 200 doi 10 1007 978 1 4020 6914 7_10 160 schwartz p 2000 beyond lessig s code for the internet privacy cyberspace filters privacy control and fair information practices wis law review 743 743 788 160 solove d j 2004 the digital person technology and privacy in the information age new york new york university press 160 steinbock d 2005 data matching data mining and due process georgia law review 40 1 1 84 160 vedder a 1999 kdd the challenge to individualism ethics and information technology 1 4 275 281 doi 10 1023 a 1010016102284 160 weiser m 1991 the computer for the twenty first century scientific american 265 3 94 104 160 zarsky t 2002 3 mine your own business making the case for the implications of the data mining or personal information in the forum of public opinion yale journal of law and technology 5 4 17 47 160 notes and other references edit istag 2001 scenarios for ambient intelligence in 2010 information society technology advisory group canhoto a i 2007 profiling behaviour the social construction of categories in the detection of financial crime dissertation at london school of economics at http www lse ac uk collections informationsystems pdf theses canhoto pdf odlyzko a 2003 privacy economics and price discrimination on the internet a m odlyzko icec2003 fifth international conference on electronic commerce n sadeh ed acm pp 355 366 available at http www dtc umn edu odlyzko doc privacy economics pdf gandy o 2002 data mining and surveillance in the post 9 11 environment presentation at iamcr barcelona at http www asc upenn edu usr ogandy iamcrdatamining pdf v t e ambient intelligence concepts context awareness internet of things object hyperlinking profiling practices spime supranet ubiquitous computing web of things wireless sensor networks technologies 6lowpan ant dash7 ieee 802 15 4 internet 0 machine to machine radio frequency identification smartdust tera play xbee platforms arduino contiki electric imp gadgeteer iobridge tinyos wiring xively applications ambient device cense connected car home automation homeos internet refrigerator nabaztag smart city smart tv smarter planet pioneers kevin ashton adam dunkels stefano marzano donald norman roel pieper josef preishuber pfl gl john seely brown bruce sterling mark weiser other ambient devices ambiesense ebbits project ipso alliance retrieved from http en wikipedia org w index php title profiling_practices amp oldid 527535631 categories identityidentity managementdata mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information 
cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 11 december 2012 at 14 23 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ROUGE_metric_ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ROUGE_metric_ new file mode 100644 index 00000000..d6792855 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/ROUGE_metric_ @@ -0,0 +1 @@ +rouge metric wikipedia the free encyclopedia rouge metric from wikipedia the free encyclopedia jump to navigation search this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations october 2012 rouge or recall oriented understudy for gisting evaluation 1 is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing the metrics compare an automatically produced summary or translation against a reference or a set of references human produced summary or translation contents 1 metrics 2 see also 3 references 4 external links metrics edit the following five evaluation metrics 2 are available rouge n n gram 3 based co occurrence statistics rouge l longest common subsequence lcs 4 based statistics longest common subsequence problem takes into account sentence level structure similarity naturally and identifies longest co occurring in sequence n grams automatically rouge w weighted lcs based statistics that favors consecutive lcses rouge s skip bigram 5 based co occurrence statistics skip bigram is any pair of words in their sentence order rouge su skip bigram plus unigram based co occurrence statistics rouge can be downloaded from berouge download link see also edit bleu f measure meteor nist metric word error rate wer noun phrase chunking references edit slides of talk by chin yew lin lin chin yew 2004 rouge a package for automatic evaluation of summaries in proceedings of the workshop on text summarization branches out was 2004 barcelona spain july 25 26 2004 lin chin yew and e h hovy 2003 automatic evaluation of summaries using n gram co occurrence statistics in proceedings of 2003 language technology conference hlt naacl 2003 edmonton canada may 27 june 1 2003 lin chin yew and franz josef och 2004a automatic evaluation of machine translation quality using longest common subsequence and skip bigram statistics in proceedings of the 42nd annual meeting of the association for computational linguistics acl 2004 barcelona spain july 21 26 2004 lin chin yew and franz josef och 2004a automatic evaluation of machine translation quality using longest common subsequence and skip bigram statistics in proceedings of the 42nd annual meeting of the association for computational linguistics acl 2004 barcelona spain july 21 26 2004 external links edit rouge web site retrieved from http en wikipedia org w index php title rouge_ metric amp oldid 542628470 categories machine translationcomputational linguisticsnatural language processingtasks of natural language 
processingdata mininghidden categories articles lacking in text citations from october 2012all articles lacking in text citations navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 7 march 2013 at 18 00 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Receiver_operating_characteristic b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Receiver_operating_characteristic new file mode 100644 index 00000000..34f14065 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Receiver_operating_characteristic @@ -0,0 +1 @@ +receiver operating characteristic wikipedia the free encyclopedia receiver operating characteristic from wikipedia the free encyclopedia jump to navigation search roc curve of three epitope predictors in signal detection theory a receiver operating characteristic roc or simply roc curve is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied it is created by plotting the fraction of true positives out of the positives tpr true positive rate vs the fraction of false positives out of the negatives fpr false positive rate at various threshold settings tpr is also known as sensitivity also called recall in some fields and fpr is one minus the specificity or true negative rate in general if both of the probability distributions for detection and false alarm are known the roc curve can be generated by plotting the cumulative distribution function area under the probability distribution from inf to inf of the detection probability in the y axis versus the cumulative distribution function of the false alarm probability in x axis roc analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from and prior to specifying the cost context or the class distribution roc analysis is related in a direct and natural way to cost benefit analysis of diagnostic decision making the roc curve was first developed by electrical engineers and radar engineers during world war ii for detecting enemy objects in battlefields and was soon introduced to psychology to account for perceptual detection of stimuli roc analysis since then has been used in medicine radiology biometrics and other areas for many decades and is increasingly used in machine learning and data mining research the roc is also known as a relative operating characteristic curve because it is a comparison of two operating characteristics tpr and fpr as the criterion changes 1 contents 1 basic concept 2 roc space 3 curves in roc space 4 further interpretations 4 1 area under the curve 4 2 other measures 5 detection error tradeoff graph 6 z 
transformation 7 history 8 see also 9 references 9 1 general references 10 further reading 11 external links basic concept edit terminology and derivations from a confusion matrix true positive tp eqv with hit true negative tn eqv with correct rejection false positive fp eqv with false alarm type i error false negative fn eqv with miss type ii error sensitivity or true positive rate tpr eqv with hit rate recall false positive rate fpr eqv with fall out accuracy acc specificity spc or true negative rate positive predictive value ppv eqv with precision negative predictive value npv false discovery rate fdr matthews correlation coefficient mcc f1 score is the harmonic mean of precision and recall source fawcett 2006 see also type i and type ii errors 160 and sensitivity and specificity a classification model classifier or diagnosis is a mapping of instances between certain classes groups the classifier or diagnosis result can be a real value continuous output in which case the classifier boundary between classes must be determined by a threshold value for instance to determine whether a person has hypertension based on a blood pressure measure or it can be a discrete class label indicating one of the classes let us consider a two class prediction problem binary classification in which the outcomes are labeled either as positive p or negative n there are four possible outcomes from a binary classifier if the outcome from a prediction is p and the actual value is also p then it is called a true positive tp however if the actual value is n then it is said to be a false positive fp conversely a true negative tn has occurred when both the prediction outcome and the actual value are n and false negative fn is when the prediction outcome is n while the actual value is p to get an appropriate example in a real world problem consider a diagnostic test that seeks to determine whether a person has a certain disease a false positive in this case occurs when the person tests positive but actually does not have the disease a false negative on the other hand occurs when the person tests negative suggesting they are healthy when they actually do have the disease let us define an experiment from p positive instances and n negative instances the four outcomes can be formulated in a 2 2 contingency table or confusion matrix as follows 160 actual value 160 p n total prediction outcome p true positive false positive p n false negative true negative n total p n roc space edit the roc space and plots of the four prediction examples the contingency table can derive several evaluation metrics see infobox to draw a roc curve only the true positive rate tpr and false positive rate fpr are needed as functions of some classifier parameter the tpr defines how many correct positive results occur among all positive samples available during the test fpr on the other hand defines how many incorrect positive results occur among all negative samples available during the test a roc space is defined by fpr and tpr as x and y axes respectively which depicts relative trade offs between true positive benefits and false positive costs since tpr is equivalent with sensitivity and fpr is equal to 1 specificity the roc graph is sometimes called the sensitivity vs 1 specificity plot each prediction result or instance of a confusion matrix represents one point in the roc space the best possible prediction method would yield a point in the upper left corner or coordinate 0 1 of the roc space representing 100 sensitivity no false negatives 
and 100 specificity no false positives the 0 1 point is also called a perfect classification a completely random guess would give a point along a diagonal line the so called line of no discrimination from the left bottom to the top right corners regardless of the positive and negative base rates an intuitive example of random guessing is a decision by flipping coins heads or tails as the size of the sample increases a random classifier s roc point migrates towards 0 5 0 5 the diagonal divides the roc space points above the diagonal represent good classification results better than random points below the line poor results worse than random note that the output of a consistently poor predictor could simply be inverted to obtain a good predictor let us look into four prediction results from 100 positive and 100 negative instances a b c c tp 63 fp 28 91 fn 37 tn 72 109 100 100 200 tp 77 fp 77 154 fn 23 tn 23 46 100 100 200 tp 24 fp 88 112 fn 76 tn 12 88 100 100 200 tp 76 fp 12 88 fn 24 tn 88 112 100 100 200 tpr 0 63 tpr 0 77 tpr 0 24 tpr 0 76 fpr 0 28 fpr 0 77 fpr 0 88 fpr 0 12 ppv 0 69 ppv 0 50 ppv 0 21 ppv 0 86 f1 0 66 f1 0 61 f1 0 22 f1 0 81 acc 0 68 acc 0 50 acc 0 18 acc 0 82 plots of the four results above in the roc space are given in the figure the result of method a clearly shows the best predictive power among a b and c the result of b lies on the random guess line the diagonal line and it can be seen in the table that the accuracy of b is 50 however when c is mirrored across the center point 0 5 0 5 the resulting method c is even better than a this mirrored method simply reverses the predictions of whatever method or test produced the c contingency table although the original c method has negative predictive power simply reversing its decisions leads to a new predictive method c which has positive predictive power when the c method predicts p or n the c method would predict n or p respectively in this manner the c test would perform the best the closer a result from a contingency table is to the upper left corner the better it predicts but the distance from the random guess line in either direction is the best indicator of how much predictive power a method has if the result is below the line i e the method is worse than a random guess all of the method s predictions must be reversed in order to utilize its power thereby moving the result above the random guess line curves in roc space edit objects are often classified based on a continuous random variable for example imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g dl and 1 g dl respectively a medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease the experimenter can adjust the threshold black vertical line in the figure which will in turn change the false positive rate increasing the threshold would result in fewer false positives and more false negatives corresponding to a leftward movement on the curve the actual shape of the curve is determined by how much overlap the two distributions have further interpretations edit sometimes the roc is used to generate a summary statistic common versions are the intercept of the roc curve with the line at 90 degrees to the no discrimination line also called youden s j statistic the area between the roc curve and the no discrimination line citation needed the area under the roc curve or auc area under curve or a pronounced a prime 
2 or c statistic 3 d pronounced d prime the distance between the mean of the distribution of activity in the system under noise alone conditions and its distribution under signal alone conditions divided by their standard deviation under the assumption that both these distributions are normal with the same standard deviation under these assumptions it can be proved that the shape of the roc depends only on d c concordance statistic this is a rank order statistic related to somers d statistic it is commonly used in the medical literature to quantify the capacity of the estimated risk score in discriminating among subjects with different event times it varies between 0 5 and 1 0 with higher values indicating a better predictive model for binary outcomes c is identical to the area under the receiver operating characteristic curve although bootstrapping to generate confidence intervals is possible the power of testing the differences between two or more c statistics is low and alternative methods such as logistic regression should probably be used 4 the c statistic has been generalized for use in survival analysis 5 and it is also possible to combine this with statistical weighting systems other extensions have been proposed 6 7 however any attempt to summarize the roc curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm area under the curve edit when using normalized units the area under the curve auc is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one assuming positive ranks higher than negative 8 it can be shown that the area under the roc curve often referred to as simply the auroc is closely related to the mann whitney u 9 10 which tests whether positives are ranked higher than negatives it is also equivalent to the wilcoxon test of ranks 10 the auc is related to the gini coefficient by the formula where 11 in this way it is possible to calculate the auc by using an average of a number of trapezoidal approximations it is also common to calculate the area under the roc convex hull roc auch roch auc as any point on the line segment between two prediction results can be achieved by randomly using one or other system with probabilities proportional to the relative length of the opposite component of the segment 12 interestingly it is also possible to invert concavities just as in the figure the worse solution can be reflected to become a better solution concavities can be reflected in any line segment but this more extreme form of fusion is much more likely to overfit the data 13 the machine learning community most often uses the roc auc statistic for model comparison 14 however this practice has recently been questioned based upon new machine learning research that shows that the auc is quite noisy as a classification measure 15 and has some other significant problems in model comparison 16 17 a reliable and valid auc estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example however the critical research 15 16 suggests frequent failures in obtaining reliable and valid auc estimates thus the practical value of the auc measure has been called into question 17 raising the possibility that the auc may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution one recent explanation of 
the problem with roc auc is that reducing the roc curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system as well as ignoring the possibility of concavity repair so that related alternative measures such as informedness 18 or deltap are recommended 19 these measures are essentially equivalent to the gini for a single prediction point with deltap informedness 2auc 1 whilst deltap markedness represents the dual viz predicting the prediction from the real class and their geometric mean is matthews correlation coefficient 18 alternatively roc auc may be divided into two components its certainty roc cert which corresponds to the single point auc and its consistency roc con which corresponds to multipoint auc singlepoint auc with the pair of measures roc concert being argued to capture some of the additional information that roc adds to the single point measures noting that it can also be applied to roch and should be if it is to capture the real potential of the system whose parameterization is being investigated 20 other measures edit in engineering the area between the roc curve and the no discrimination line is often preferred due to its useful mathematical properties as a non parametric statistic citation needed this area is often simply known as the discrimination in psychophysics the sensitivity index d p or deltap is the most commonly used measure 21 and is equivalent to twice the discrimination being equal also to informedness deskewed wracc and gini coefficient in the single point case single parameterization or single system 18 these measures all have the advantage that 0 represents chance performance whilst informedness 1 represents perfect performance and 1 represents the perverse case of full informedness used to always give the wrong response with informedness being proven to be the probability of making an informed decision rather than guessing 22 roc auc and auch have a related property that chance performance has a fixed value but it is 0 5 and the normalization to 2auc 1 brings this to 0 and allows informedness and gini to be interpreted as kappa statistics but informedness has been shown to have desirable characteristics for machine learning versus other common definitions of kappa such as cohen kappa and fleiss kappa 18 23 the illustration at the top right of the page shows the use of roc graphs for the discrimination between the quality of different algorithms for predicting epitopes the graph shows that if one detects at least 60 of the epitopes in a virus protein at least 30 of the output is falsely marked as epitopes sometimes it can be more useful to look at a specific region of the roc curve rather than at the whole curve it is possible to compute partial auc 24 for example one could focus on the region of the curve with low false positive rate which is often of prime interest for population screening tests 25 another common approach for classification problems in which p n common in bioinformatics applications is to use a logarithmic scale for the x axis 26 detection error tradeoff graph edit example det graph an alternative to the roc curve is the detection error tradeoff det graph which plots the false negative rate missed detections vs the false positive rate false alarms on non linearly transformed x and y axes the transformation function is the quantile function of the normal distribution i e the inverse of the cumulative normal 
distribution it is in fact the same transformation as zroc below except that the complement of the hit rate the miss rate or false negative rate is used this alternative spends more graph area on the region of interest most of the roc area is of little interest one primarily cares about the region tight against the y axis and the top left corner which because of using miss rate instead of its complement the hit rate is the lower left corner in a det plot the det plot is used extensively in the automatic speaker recognition community where the name det was first used the analysis of the roc performance in graphs with this warping of the axes was used by psychologists in perception studies halfway the 20th century where this was dubbed double probability paper z transformation edit this section needs attention from an expert in statistics the specific problem is z transformation is the wrong term z standardization is better see the talk page for details wikiproject statistics or its portal may be able to help recruit an expert june 2013 if a z transformation is applied to the roc curve the curve will be transformed into a straight line 27 this z transformation is based on a normal distribution with a mean of zero and a standard deviation of one in memory strength theory one must assume that the zroc is not only linear but has a slope of 1 0 the normal distributions of targets studied objects that the subjects need to recall and lures non studied objects that the subjects attempt to recall is the factor causing the zroc to be linear the linearity of the zroc curve depends on the standard deviations of the target and lure strength distributions if the standard deviations are equal the slope will be 1 0 if the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution then the slope will be smaller than 1 0 in most studies it has been found that the zroc curve slopes constantly fall below 1 usually between 0 5 and 0 9 28 many experiments yielded a zroc slope of 0 8 a slope of 0 8 implies that the variability of the target strength distribution is 25 larger than the variability of the lure strength distribution 29 another variable used is 160 d d 160 is a measure of sensitivity for yes no recognition that can easily be expressed in terms of z values d 160 measures sensitivity in that it measures the degree of overlap between target and lure distributions it is calculated as the mean of the target distribution minus the mean of the lure distribution expressed in standard deviation units for a given hit rate and false alarm rate d 160 can be calculated with the following equation d 160 160 z hit rate 160 160 z false alarm rate although d 160 is a commonly used parameter it must be recognized that it is only relevant when strictly adhering to the very strong assumptions of strength theory made above 30 the z transformation of a roc curve is always linear as assumed except in special situations the yonelinas familiarity recollection model is a two dimensional account of recognition memory instead of the subject simply answering yes or no to a specific input the subject gives the input a feeling of familiarity which operates like the original roc curve what changes though is a parameter for recollection r recollection is assumed to be all or none and it trumps familiarity if there were no recollection component zroc would have a predicted slope of 1 however when adding the recollection component the zroc curve will be concave up with 
a decreased slope this difference in shape and slope result from an added element of variability due to some items being recollected patients with anterograde amnesia are unable to recollect so their yonelinas zroc curve would have a slope close to 1 0 31 history edit the roc curve was first used during world war ii for the analysis of radar signals before it was employed in signal detection theory 32 following the attack on pearl harbor in 1941 the united states army began new research to increase the prediction of correctly detected japanese aircraft from their radar signals in the 1950s roc curves were employed in psychophysics to assess human and occasionally non human animal detection of weak signals 32 in medicine roc analysis has been extensively used in the evaluation of diagnostic tests 33 34 roc curves are also used extensively in epidemiology and medical research and are frequently mentioned in conjunction with evidence based medicine in radiology roc analysis is a common technique to evaluate new radiology techniques 35 in the social sciences roc analysis is often called the roc accuracy ratio a common technique for judging the accuracy of default probability models roc curves also proved useful for the evaluation of machine learning techniques the first application of roc in machine learning was by spackman who demonstrated the value of roc curves in comparing and evaluating different classification algorithms 36 see also edit wikimedia commons has media related to receiver operating characteristic brier score coefficient of determination constant false alarm rate detection theory false alarm gain information retrieval precision and recall references edit swets john a signal detection theory and roc analysis in psychology and diagnostics 160 collected papers lawrence erlbaum associates mahwah nj 1996 fogarty james baker ryan s hudson scott e 2005 case studies in the use of roc curve analysis for sensor based estimates in human computer interaction acm international conference proceeding series proceedings of graphics interface 2005 waterloo on canadian human computer communications society 160 hastie trevor tibshirani robert friedman jerome h 2009 the elements of statistical learning data mining inference and prediction 2nd ed 160 lavalley mp 2008 logistic regression circulation 117 2395 2399 doi 10 1161 circulationaha 106 682658 heagerty pj zheng y 2005 survival model predictive accuracy and roc curves biometrics 61 92 105 gonen m heller g 2005 concordance probability and discriminatory power in proportional hazards regression biometrika 92 965 970 chambless le diao g 2006 estimation of time dependent area under the roc curve for long term risk prediction stat med 25 3474 3486 fawcett tom 2006 an introduction to roc analysis pattern recognition letters 27 861 874 hanley james a mcneil barbara j 1982 the meaning and use of the area under a receiver operating characteristic roc curve radiology 143 1 29 36 pmid 160 7063747 160 a b mason simon j graham nicholas e 2002 areas beneath the relative operating characteristics roc and relative operating levels rol curves statistical significance and interpretation quarterly journal of the royal meteorological society 128 2145 2166 160 hand david j and till robert j 2001 a simple generalization of the area under the roc curve for multiple class classification problems machine learning 45 171 186 provost f fawcett t 2001 robust classification for imprecise environments machine learning 44 203 231 160 repairing concavities in roc curves 
19th international joint conference on artificial intelligence ijcai 05 2005 pp 160 702 707 160 hanley james a mcneil barbara j 1983 09 01 a method of comparing the areas under receiver operating characteristic curves derived from the same cases radiology 148 3 839 43 pmid 160 6878708 retrieved 2008 12 03 160 a b hanczar blaise hua jianping sima chao weinstein john bittner michael and dougherty edward r 2010 small sample precision of roc related estimates bioinformatics 26 6 822 830 a b lobo jorge m jim nez valverde alberto and real raimundo 2008 auc a misleading measure of the performance of predictive distribution models global ecology and biogeography 17 145 151 a b hand david j 2009 measuring classifier performance a coherent alternative to the area under the roc curve machine learning 77 103 123 a b c d powers david m w 2007 2011 evaluation from precision recall and f factor to roc informedness markedness amp correlation journal of machine learning technologies 2 1 37 63 160 powers david m w 2012 the problem of area under the curve international conference on information science and technology 160 powers david m w 2012 roc concert spring conference on engineering technology 160 perruchet p peereman r 2004 the exploitation of distributional information in syllable processing j neurolinguistics 17 97 119 160 powers david m w 2003 recall and precision versus the bookmaker proceedings of the international conference on cognitive science icsc 2003 sydney australia 2003 pp 529 534 160 powers david m w 2012 the problem with kappa conference of the european chapter of the association for computational linguistics eacl2012 joint robus unsup workshop 160 mcclish donna katzman 1989 08 01 analyzing a portion of the roc curve medical decision making 9 3 190 195 doi 10 1177 0272989x8900900307 pmid 160 2668680 retrieved 2008 09 29 160 dodd lori e pepe margaret s 2003 partial auc estimation and regression biometrics 59 3 614 623 doi 10 1111 1541 0420 00071 pmid 160 14601762 retrieved 2007 12 18 160 karplus kevin 2011 better than chance the importance of null models university of california santa cruz in proceedings of the first international workshop on pattern recognition in proteomics structural biology and bioinformatics pr ps bb 2011 macmillan neil a creelman c douglas 2005 detection theory a user s guide 2nd ed mahwah nj lawrence erlbaum associates isbn 160 1 4106 1114 0 160 glanzer murray kisok kim hilford andy adams john k 1999 slope of the receiver operating characteristic in recognition memory journal of experimental psychology learning memory and cognition 25 2 500 513 160 ratcliff roger mccoon gail tindall michael 1994 empirical generality of data from recognition memory roc functions and implications for gmms journal of experimental psychology learning memory and cognition 20 763 785 160 zhang jun mueller shane t 2005 a note on roc analysis and non parametric estimate of sensitivity psychometrika 70 203 212 160 yonelinas andrew p kroll neal e a dobbins ian g lazzara michele knight robert t 1998 recollection and familiarity deficits in amnesia convergence of remember know process dissociation and receiver operating characteristic data neuropsychology 12 323 339 160 a b green david m swets john a 1966 signal detection theory and psychophysics new york ny john wiley and sons inc isbn 160 0 471 32420 5 160 zweig mark h campbell gregory 1993 receiver operating characteristic roc plots a fundamental evaluation tool in clinical medicine clinical chemistry 39 8 561 577 pmid 160 8472349 160 pepe 
margaret s 2003 the statistical evaluation of medical tests for classification and prediction new york ny oxford isbn 0 19 856582 8 obuchowski nancy a 2003 receiver operating characteristic curves and their use in radiology radiology 229 1 3 8 doi 10 1148 radiol 2291010898 pmid 14519861 spackman kent a 1989 signal detection theory valuable tools for evaluating inductive learning proceedings of the sixth international workshop on machine learning san mateo ca morgan kaufmann pp 160 163 general references edit zhou xiao hua obuchowski nancy a mcclish donna k 2002 statistical methods in diagnostic medicine new york ny wiley & sons isbn 978 0 471 34772 9 further reading edit balakrishnan narayanaswamy 1991 handbook of the logistic distribution marcel dekker inc isbn 978 0 8247 8587 1 brown christopher d and davis herbert t 2006 receiver operating characteristic curves and related decision measures a tutorial chemometrics and intelligent laboratory systems 80 24 38 fawcett tom 2004 roc graphs notes and practical considerations for researchers pattern recognition letters 27 8 882 891 gonen mithat 2007 analyzing receiver operating characteristic curves using sas sas press isbn 978 1 59994 298 1 green william h 2003 econometric analysis fifth edition prentice hall isbn 0 13 066189 9 heagerty patrick j lumley thomas and pepe margaret s 2000 time dependent roc curves for censored survival data and a diagnostic marker biometrics 56 337 344 hosmer david w and lemeshow stanley 2000 applied logistic regression 2nd ed new york ny wiley isbn 0 471 35632 8 lasko thomas a bhagwat jui g zou kelly h and ohno machado lucila 2005 the use of receiver operating characteristic curves in biomedical informatics journal of biomedical informatics 38 5 404 415 stephan carsten wesseling sebastian schink tania and jung klaus 2003 comparison of eight computer programs for receiver operating characteristic analysis clinical chemistry 49 433 439 swets john a dawes robyn m and monahan john 2000 better decisions through science scientific american october pp 82 87 zou kelly h o malley a james mauri laura 2007 receiver operating characteristic analysis for evaluating diagnostic tests and predictive models circulation 115 5 654 7 external links edit kelly h zou s bibliography of roc literature and articles tom fawcett s roc convex hull tutorial program and papers peter flach s tutorial on roc analysis in machine learning the magnificent roc an explanation and interactive demonstration of the connection of rocs to archetypal bi normal test result plots web based calculator for roc curves by john eng convex hull cost trade off etc
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Regression_analysis b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Regression_analysis new file mode 100644 index 00000000..2855494f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Regression_analysis @@ -0,0 +1 @@ +regression analysis wikipedia the free encyclopedia regression analysis from wikipedia the free encyclopedia in statistics regression analysis is a statistical technique for estimating the relationships among variables it includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables more specifically regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed most commonly regression analysis estimates the conditional expectation of the dependent variable given the independent variables that is the average value of the dependent variable when the independent variables are fixed less commonly the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables in all cases the estimation target is a function of the independent variables called the regression function in regression analysis it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution
analysis is widely used for prediction and forecasting where its use has substantial overlap with the field of machine learning regression analysis is also used to understand which among the independent variables are related to the dependent variable and to explore the forms of these relationships in restricted circumstances regression analysis can be used to infer causal relationships between the independent and dependent variables however this can lead to illusions or false relationships so caution is advisable 1 for example correlation does not imply causation a large body of techniques for carrying out regression analysis has been developed familiar methods such as linear regression and ordinary least squares regression are parametric in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions which may be infinite dimensional the performance of regression analysis methods in practice depends on the form of the data generating process and how it relates to the regression approach being used since the true form of the data generating process is generally not known regression analysis often depends to some extent on making assumptions about this process these assumptions are sometimes testable if many data are available regression models for prediction are often useful even when the assumptions are moderately violated although they may not perform optimally however in many applications especially with small effects or questions of causality based on observational data regression methods can give misleading results 2 3 contents 1 history 2 regression models 2 1 necessary number of independent measurements 2 2 statistical assumptions 3 underlying assumptions 4 linear regression 4 1 general linear model 4 2 diagnostics 4 3 limited dependent variables 5 interpolation and extrapolation 6 nonlinear regression 7 power and sample size calculations 8 other methods 9 software 10 see also 11 references 12 further reading 13 external links history edit the earliest form of regression was the method of least squares which was published by legendre in 1805 4 and by gauss in 1809 5 legendre and gauss both applied the method to the problem of determining from astronomical observations the orbits of bodies about the sun mostly comets but also later the then newly discovered minor planets gauss published a further development of the theory of least squares in 1821 6 including a version of the gauss markov theorem the term regression was coined by francis galton in the nineteenth century to describe a biological phenomenon the phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average a phenomenon also known as regression toward the mean 7 8 for galton regression had only this biological meaning 9 10 but his work was later extended by udny yule and karl pearson to a more general statistical context 11 12 in the work of yule and pearson the joint distribution of the response and explanatory variables is assumed to be gaussian this assumption was weakened by r a fisher in his works of 1922 and 1925 13 14 15 fisher assumed that the conditional distribution of the response variable is gaussian but the joint distribution need not be in this respect fisher s assumption is closer to gauss s formulation of 1821 in the 1950s and 1960s economists used electromechanical desk 
calculators to calculate regressions before 1970 it sometimes took up to 24 hours to receive the result from one regression 16 regression methods continue to be an area of active research in recent decades new methods have been developed for robust regression regression involving correlated responses such as time series and growth curves regression in which the predictor or response variables are curves images graphs or other complex data objects regression methods accommodating various types of missing data nonparametric regression bayesian methods for regression regression in which the predictor variables are measured with error regression with more predictor variables than observations and causal inference with regression regression models edit regression models involve the following variables the unknown parameters denoted as which may represent a scalar or a vector the independent variables x the dependent variable y in various fields of application different terminologies are used in place of dependent and independent variables a regression model relates y to a function of x and the approximation is usually formalized as e y 160 160 x 160 160 f x to carry out regression analysis the form of the function f must be specified sometimes the form of this function is based on knowledge about the relationship between y and x that does not rely on the data if no such knowledge is available a flexible or convenient form for f is chosen assume now that the vector of unknown parameters is of length k in order to perform a regression analysis the user must provide information about the dependent variable y if n data points of the form y x are observed where n lt k most classical approaches to regression analysis cannot be performed since the system of equations defining the regression model is underdetermined there are not enough data to recover if exactly n 160 160 k data points are observed and the function f is linear the equations y 160 160 f x can be solved exactly rather than approximately this reduces to solving a set of n equations with n unknowns the elements of which has a unique solution as long as the x are linearly independent if f is nonlinear a solution may not exist or many solutions may exist the most common situation is where n gt k data points are observed in this case there is enough information in the data to estimate a unique value for that best fits the data in some sense and the regression model when applied to the data can be viewed as an overdetermined system in in the last case the regression analysis provides the tools for finding a solution for unknown parameters that will for example minimize the distance between the measured and predicted values of the dependent variable y also known as method of least squares under certain statistical assumptions the regression analysis uses the surplus of information to provide statistical information about the unknown parameters and predicted values of the dependent variable y necessary number of independent measurements edit consider a regression model which has three unknown parameters 0 1 and 2 suppose an experimenter performs 10 measurements all at exactly the same value of independent variable vector x which contains the independent variables x1 x2 and x3 in this case regression analysis fails to give a unique set of estimated values for the three unknown parameters the experimenter did not provide enough information the best one can do is to estimate the average value and the standard deviation of the dependent variable y 
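The passage above describes the regression relation, whose symbols were stripped by the text extraction (in the source article it is written E(Y | X) = f(X, β)), together with the requirement that the number of informative observations n exceed the number of unknown parameters k. The following is a minimal illustrative sketch, not part of the original article, assuming NumPy is available; the 10-observation, 3-parameter setup mirrors the worked example in the text, while the particular numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Least-squares estimate of beta in y ~ X @ beta; needs n > k and full column rank."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Identifiable case: n = 10 observations for k = 3 unknown parameters, with varying regressors.
X = np.column_stack([np.ones(10), rng.normal(size=10), rng.normal(size=10)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=10)
print(fit_ols(X, y))  # approximately [1.0, 2.0, -0.5]

# Degenerate case from the text: all 10 measurements taken at one fixed value of x.
# The columns of X are then proportional, X'X is singular, and no unique beta exists;
# only the mean and spread of y can be estimated.
X_fixed = np.column_stack([np.ones(10), np.full(10, 3.0), np.full(10, -1.0)])
print(np.linalg.matrix_rank(X_fixed.T @ X_fixed))  # 1, i.e. rank-deficient for k = 3
```

The rank check makes the identifiability point concrete: when every measurement is taken at the same regressor values, X'X has rank 1, so the three parameters cannot be separated.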
similarly measuring at two different values of x would give enough data for a regression with two unknowns but not for three or more unknowns if the experimenter had performed measurements at three different values of the independent variable vector x then regression analysis would provide a unique set of estimates for the three unknown parameters in in the case of general linear regression the above statement is equivalent to the requirement that the matrix xtx is invertible statistical assumptions edit when the number of measurements n is larger than the number of unknown parameters k and the measurement errors i are normally distributed then the excess of information contained in n k measurements is used to make statistical predictions about the unknown parameters this excess of information is referred to as the degrees of freedom of the regression underlying assumptions edit classical assumptions for regression analysis include the sample is representative of the population for the inference prediction the error is a random variable with a mean of zero conditional on the explanatory variables the independent variables are measured with no error note if this is not so modeling may be done instead using errors in variables model techniques the predictors are linearly independent i e it is not possible to express any predictor as a linear combination of the others the errors are uncorrelated that is the variance covariance matrix of the errors is diagonal and each non zero element is the variance of the error the variance of the error is constant across observations homoscedasticity if not weighted least squares or other methods might instead be used these are sufficient conditions for the least squares estimator to possess desirable properties in particular these assumptions imply that the parameter estimates will be unbiased consistent and efficient in the class of linear unbiased estimators it is important to note that actual data rarely satisfies the assumptions that is the method is used even though the assumptions are not true variation from the assumptions can sometimes be used as a measure of how far the model is from being useful many of these assumptions may be relaxed in more advanced treatments reports of statistical analyses usually include analyses of tests on the sample data and methodology for the fit and usefulness of the model assumptions include the geometrical support of the variables 17 clarification needed independent and dependent variables often refer to values measured at point locations there may be spatial trends and spatial autocorrelation in the variables that violates statistical assumptions of regression geographic weighted regression is one technique to deal with such data 18 also variables may include values aggregated by areas with aggregated data the modifiable areal unit problem can cause extreme variation in regression parameters 19 when analyzing data aggregated by political boundaries postal codes or census areas results may be very different with a different choice of units linear regression edit main article linear regression see simple linear regression for a derivation of these formulas and a numerical example in linear regression the model specification is that the dependent variable is a linear combination of the parameters but need not be linear in the independent variables for example in simple linear regression for modeling data points there is one independent variable and two parameters and straight line in multiple linear regression there 
are several independent variables or functions of independent variables adding a term in xi2 to the preceding regression gives parabola this is still linear regression although the expression on the right hand side is quadratic in the independent variable it is linear in the parameters and in both cases is an error term and the subscript indexes a particular observation given a random sample from the population we estimate the population parameters and obtain the sample linear regression model the residual is the difference between the value of the dependent variable predicted by the model and the true value of the dependent variable one method of estimation is ordinary least squares this method obtains parameter estimates that minimize the sum of squared residuals sse 20 21 also sometimes denoted rss minimization of this function results in a set of normal equations a set of simultaneous linear equations in the parameters which are solved to yield the parameter estimators illustration of linear regression on a data set in the case of simple regression the formulas for the least squares estimates are where is the mean average of the values and is the mean of the values under the assumption that the population error term has a constant variance the estimate of that variance is given by this is called the mean square error mse of the regression the denominator is the sample size reduced by the number of model parameters estimated from the same data n p for p regressors or n p 1 if an intercept is used 22 in this case p 1 so the denominator is n 2 the standard errors of the parameter estimates are given by under the further assumption that the population error term is normally distributed the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters general linear model edit for a derivation see linear least squares for a numerical example see linear regression in the more general multiple regression model there are p independent variables where xij is the ith observation on the jth independent variable and where the first independent variable takes the value 1 for all i so is the regression intercept the least squares parameter estimates are obtained from p normal equations the residual can be written as the normal equations are in matrix notation the normal equations are written as where the ij element of x is xij the i element of the column vector y is yi and the j element of is thus x is n p y is n 1 and is p 1 the solution is diagnostics edit see also category regression diagnostics once a regression model has been constructed it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters commonly used checks of goodness of fit include the r squared analyses of the pattern of residuals and hypothesis testing statistical significance can be checked by an f test of the overall fit followed by t tests of individual parameters interpretations of these diagnostic tests rest heavily on the model assumptions although examination of the residuals can be used to invalidate a model the results of a t test or f test are sometimes more difficult to interpret if the model s assumptions are violated for example if the error term does not have a normal distribution in small samples the estimated parameters will not follow normal distributions and complicate inference with relatively large samples however a central limit theorem can be invoked such that 
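Several formulas in the simple-regression and general-linear-model passages above lost their symbols during extraction. As a reconstruction of the standard textbook forms (consistent with the surrounding description, but not text recovered from the original page):

$$
\hat\beta_1=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^{2}},\qquad
\hat\beta_0=\bar y-\hat\beta_1\bar x,\qquad
\hat\sigma_\varepsilon^{2}=\mathrm{MSE}=\frac{\sum_{i=1}^{n}\hat\varepsilon_i^{\,2}}{n-2},
$$

$$
\mathrm{SE}(\hat\beta_1)=\frac{\hat\sigma_\varepsilon}{\sqrt{\sum_{i=1}^{n}(x_i-\bar x)^{2}}},\qquad
\mathrm{SE}(\hat\beta_0)=\hat\sigma_\varepsilon\sqrt{\frac{1}{n}+\frac{\bar x^{2}}{\sum_{i=1}^{n}(x_i-\bar x)^{2}}},
$$

and, in the matrix notation of the general linear model, the normal equations and their least-squares solution are

$$
(X^{\top}X)\,\hat\beta = X^{\top}y,\qquad \hat\beta=(X^{\top}X)^{-1}X^{\top}y .
$$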
hypothesis testing may proceed using asymptotic approximations limited dependent variables edit the phrase limited dependent is used in econometric statistics for categorical and constrained variables the response variable may be non continuous limited to lie on some subset of the real line for binary zero or one variables if analysis proceeds with least squares linear regression the model is called the linear probability model nonlinear models for binary dependent variables include the probit and logit model the multivariate probit model is a standard method of estimating a joint relationship between several binary dependent variables and some independent variables for categorical variables with more than two values there is the multinomial logit for ordinal variables with more than two values there are the ordered logit and ordered probit models censored regression models may be used when the dependent variable is only sometimes observed and heckman correction type models may be used when the sample is not randomly selected from the population of interest an alternative to such procedures is linear regression based on polychoric correlation or polyserial correlations between the categorical variables such procedures differ in the assumptions made about the distribution of the variables in the population if the variable is positive with low values and represents the repetition of the occurrence of an event then count models like the poisson regression or the negative binomial model may be used instead interpolation and extrapolation edit regression models predict a value of the y variable given known values of the x variables prediction within the range of values in the dataset used for model fitting is known informally as interpolation prediction outside this range of the data is known as extrapolation performing extrapolation relies strongly on the regression assumptions the further the extrapolation goes outside the data the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values it is generally advised citation needed that when performing extrapolation one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty such intervals tend to expand rapidly as the values of the independent variable s moved outside the range covered by the observed data for such reasons and others some tend to say that it might be unwise to undertake extrapolation 23 however this does not cover the full set of modelling errors that may be being made in particular the assumption of a particular form for the relation between y and x a properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data but it can only do so within the range of values of the independent variables actually available this means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship best practice advice here citation needed is that a linear in variables and linear in parameters relationship should not be chosen simply for computational convenience but that all available knowledge should be deployed in constructing a regression model if this knowledge includes the fact that the dependent variable cannot go outside a certain range of values this can be made use of in selecting the model even if the observed dataset has no values particularly near such bounds the 
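The limited-dependent-variable and extrapolation points above can be illustrated together in a short sketch (scikit-learn and NumPy assumed, with invented toy data; this is an illustration, not part of the article): a linear probability model fitted to a binary outcome can return predicted "probabilities" outside [0, 1] once the predictor moves beyond the observed range, whereas a logit model cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=200).reshape(-1, 1)
p = 1 / (1 + np.exp(-(x.ravel() - 2)))   # true success probability
y = rng.binomial(1, p)                    # binary (limited) dependent variable

lpm = LinearRegression().fit(x, y)        # linear probability model
logit = LogisticRegression().fit(x, y)    # logit model

x_new = np.array([[6.0]])                 # extrapolation beyond the observed 0..4 range
print(lpm.predict(x_new))                 # can exceed 1, an impossible "probability"
print(logit.predict_proba(x_new)[:, 1])   # stays within (0, 1)
```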
implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered at a minimum it can ensure that any extrapolation arising from a fitted model is realistic or in accord with what is known nonlinear regression edit main article nonlinear regression when the model function is not linear in the parameters the sum of squares must be minimized by an iterative procedure this introduces many complications which are summarized in differences between linear and non linear least squares power and sample size calculations edit there are no generally agreed methods for relating the number of observations versus the number of independent variables in the model one rule of thumb suggested by good and hardin is where is the sample size is the number of independent variables and is the number of observations needed to reach the desired precision if the model had only one independent variable 24 for example a researcher is building a linear regression model using a dataset that contains 1000 patients if he decides that five observations are needed to precisely define a straight line then the maximum number of independent variables his model can support is 4 because other methods edit although the parameters of a regression model are usually estimated using the method of least squares other methods which have been used include bayesian methods e g bayesian linear regression percentage regression for situations where reducing percentage errors is deemed more appropriate 25 least absolute deviations which is more robust in the presence of outliers leading to quantile regression nonparametric regression requires a large number of observations and is computationally intensive distance metric learning which is learned by the search of a meaningful distance metric in a given input space 26 software edit main article list of statistical packages all major statistical software packages perform least squares regression analysis and inference simple linear regression and multiple regression using least squares can be done in some spreadsheet applications and on some calculators while many statistical software packages can perform various types of nonparametric and robust regression these methods are less standardized different software packages implement different methods and a method with a given name may be implemented differently in different packages specialized regression software has been developed for use in fields such as survey analysis and neuroimaging see also edit statistics portal curve fitting forecasting fraction of variance unexplained kriging a linear least squares estimation algorithm local regression modifiable areal unit problem multivariate adaptive regression splines multivariate normal distribution pearson product moment correlation coefficient prediction interval robust regression segmented regression stepwise regression trend estimation references edit armstrong j scott 2012 illusions in regression analysis international journal of forecasting forthcoming 28 3 689 doi 10 1016 j ijforecast 2012 02 001 160 david a freedman statistical models theory and practice cambridge university press 2005 r dennis cook sanford weisberg criticism and influence analysis in regression sociological methodology vol 13 1982 pp 313 361 a m legendre nouvelles m thodes pour la d termination des orbites des com tes firmin didot paris 1805 sur la m thode des moindres quarr s appears as an appendix c f gauss theoria motus corporum coelestium in 
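The Good and Hardin rule of thumb quoted above, and the arithmetic behind the statement that 1000 patients with five observations per variable support at most four independent variables, were stripped during extraction. The usual statement of the rule (reconstructed here, and worth verifying against the cited source) is N = m^n, so that

$$
n=\left\lfloor\frac{\ln N}{\ln m}\right\rfloor=\left\lfloor\frac{\ln 1000}{\ln 5}\right\rfloor=\lfloor 4.29\rfloor=4,
$$

where N is the sample size, n the number of independent variables and m the number of observations needed to reach the desired precision with a single independent variable; 5^4 = 625 does not exceed 1000, while 5^5 = 3125 does.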
sectionibus conicis solem ambientum 1809 c f gauss theoria combinationis observationum erroribus minimis obnoxiae 1821 1823 mogull robert g 2004 second semester applied statistics kendall hunt publishing company p 160 59 isbn 160 0 7575 1181 3 160 galton francis 1989 kinship and correlation reprinted 1989 statistical science institute of mathematical statistics 4 2 80 86 doi 10 1214 ss 1177012581 jstor 160 2245330 160 francis galton typical laws of heredity nature 15 1877 492 495 512 514 532 533 galton uses the term reversion in this paper which discusses the size of peas francis galton presidential address section h anthropology 1885 galton uses the term regression in this paper which discusses the height of humans yule g udny 1897 on the theory of correlation journal of the royal statistical society blackwell publishing 60 4 812 54 doi 10 2307 2979746 jstor 160 2979746 160 pearson karl yule g u blanchard norman lee alice 1903 the law of ancestral heredity biometrika biometrika trust 2 2 211 236 doi 10 1093 biomet 2 2 211 jstor 160 2331683 160 fisher r a 1922 the goodness of fit of regression formulae and the distribution of regression coefficients journal of the royal statistical society blackwell publishing 85 4 597 612 doi 10 2307 2341124 jstor 160 2341124 160 ronald a fisher 1954 statistical methods for research workers twelfth ed edinburgh oliver and boyd isbn 160 0 05 002170 2 160 aldrich john 2005 fisher and regression statistical science 20 4 401 417 doi 10 1214 088342305000000331 jstor 160 20061201 160 rodney ramcharan regressions why are economists obessessed with them march 2006 accessed 2011 12 03 n cressie 1996 change of support and the modiable areal unit problem geographical systems 3 159 180 fotheringham a stewart brunsdon chris charlton martin 2002 geographically weighted regression the analysis of spatially varying relationships reprint ed chichester england john wiley isbn 160 978 0 471 49616 8 160 fotheringham as wong dws 1 january 1991 the modifiable areal unit problem in multivariate statistical analysis environment and planning a 23 7 1025 1044 doi 10 1068 a231025 160 m h kutner c j nachtsheim and j neter 2004 applied linear regression models 4th ed mcgraw hill irwin boston p 25 n ravishankar and d k dey 2002 a first course in linear model theory chapman and hall crc boca raton p 101 steel r g d and torrie j h principles and procedures of statistics with special reference to the biological sciences mcgraw hill 1960 page 288 chiang c l 2003 statistical methods of analysis world scientific isbn 981 238 310 7 page 274 section 9 7 4 interpolation vs extrapolation good p i hardin j w 2009 common errors in statistics and how to avoid them 3rd ed hoboken new jersey wiley p 160 211 isbn 160 978 0 470 45798 6 160 tofallis c 2009 least squares percentage regression journal of modern applied statistical methods 7 526 534 doi 10 2139 ssrn 1406472 160 yangjing long 2009 human age estimation by metric learning for regression problems proc international conference on computer analysis of images and patterns 74 82 160 further reading edit william h kruskal and judith m tanur ed 1978 linear hypotheses international encyclopedia of statistics free press v 1 evan j williams i regression pp 523 41 julian c stanley ii analysis of variance pp 541 554 lindley d v 1987 regression and correlation analysis new palgrave a dictionary of economics v 4 pp 160 120 23 birkes david and dodge y alternative methods of regression isbn 0 471 56881 3 chatfield c 1993 calculating interval forecasts 
journal of business and economic statistics 11 pp 160 121 135 draper n r smith h 1998 applied regression analysis 3rd ed john wiley isbn 160 0 471 17082 8 160 fox j 1997 applied regression analysis linear models and related methods sage hardle w applied nonparametric regression 1990 isbn 0 521 42950 1 meade n and t islam 1995 prediction intervals for growth curve forecasts journal of forecasting 14 pp 160 413 430 a sen m srivastava regression analysis theory methods and applications springer verlag berlin 2011 4th printing t strutz data fitting and uncertainty a practical introduction to weighted least squares and beyond vieweg teubner isbn 978 3 8348 1022 9 external links edit wikimedia commons has media related to regression analysis hazewinkel michiel ed 2001 regression analysis encyclopedia of mathematics springer isbn 160 978 1 55608 010 4 160 earliest uses regression basic history and references regression of weakly correlated data how linear regression mistakes can appear when y range is much smaller than x range regression to predict graphics card performance example of multivariate regression in action statistical interpolation with ordinary least squares v t e least squares and regression analysis computational statistics least squares linear least squares non linear least squares iteratively reweighted least squares correlation and dependence pearson product moment correlation rank correlation spearman s rho kendall s tau partial correlation confounding variable regression analysis ordinary least squares partial least squares total least squares ridge regression regression as a statistical model linear regression simple linear regression ordinary least squares generalized least squares weighted least squares general linear model predictor structure polynomial regression growth curve segmented regression local regression non standard nonlinear regression nonparametric semiparametric robust quantile isotonic non normal errors generalized linear model binomial poisson logistic decomposition of variance analysis of variance analysis of covariance multivariate aov model exploration mallows s cp stepwise regression model selection regression model validation background mean and predicted response gauss markov theorem errors and residuals goodness of fit studentized residual minimum mean square error design of experiments response surface methodology optimal design bayesian design numerical approximation numerical analysis approximation theory numerical integration gaussian quadrature orthogonal polynomials chebyshev polynomials chebyshev nodes applications curve fitting calibration curve numerical smoothing and differentiation system identification moving least squares regression analysis category statistics category statistics portal statistics outline statistics topics v t e statistics 160 descriptive statistics continuous data location mean arithmetic geometric harmonic median mode dispersion range standard deviation coefficient of variation percentile interquartile range shape variance skewness kurtosis moments l moments count data index of dispersion summary tables grouped data frequency distribution contingency table dependence pearson product moment correlation rank correlation spearman s rho kendall s tau partial correlation scatter plot statistical graphics bar chart biplot box plot control chart correlogram forest plot histogram q q plot run chart scatter plot stemplot radar chart 160 data collection designing studies effect size standard error statistical power sample size 
article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages az rbaycanca catal esky dansk deutsch espa ol esperanto fran ais bahasa indonesia italiano basa jawa latvie u magyar nederlands norsk bokm l norsk nynorsk o zbekcha polski portugus simple english basa sunda suomi svenska t rk e ti ng vi t edit links this page was last modified on 14 june 2013 at 20 51 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ren_rou b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ren_rou new file mode 100644 index 00000000..b9b1f61c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ren_rou @@ -0,0 +1 @@ +ren rou wikipedia the free encyclopedia ren rou from wikipedia the free encyclopedia jump to navigation search this article has multiple issues please help improve it or discuss these issues on the talk page this article may require cleanup to meet wikipedia s quality standards no cleanup reason has been specified please help improve this if you can may 2011 this article is an orphan as no other articles link to it please introduce links to this page from related articles suggestions may be available february 2011 ren rou chinese 人 pinyin r n r u or ren rou sou suo 人 means to mine certain specific information about someone or some people usually done by a group of non professional participants in coordinated effort with helps of modern technologies especially the internet this action is conducted usually without permission of the subject s being ren roued contents 1 purpose of ren rou 2 triggers for individuals become ren roued 3 methods of conduct information mining 4 pros amp cons 1 5 references purpose of ren rou edit gathering information about people to know more about someone promoting hatred by exposing individual actions or behaviors that is likely to be opposed by the public or certain group of people promoting assault against someone by exposing subject s private information including home address phone number email work place to make this subject available to go after holding people accountable to what they say online by exposing the true identity of an online commentator the subject commentator is attached to his her voice so he she will become targeted and risk being punished by public or law if he she has said something irresponsible online exposing someone s bad or illegal actions like fraud triggers for individuals become ren roued edit online posting or commenting usually in a forum which stimulates the anger of its viewers very badly evidences or traces being revealed against someone suggest that there is potentially a great evil behind him her making people want to find out more methods of conduct information mining edit going after ip address sometimes people who comment anonymously online can leave their ip address available to public or certain group of people ip address can be entered into some site to find out the physical 
location of that ip being assigned to therefore it can leak out more information to go after posting existent information about a subject being ren roued online so other people who know the subject can recognize and contribute more information about this subject hacking breaking into email inboxes hacking into subject computers to mine more information photo taking voice recording usually in covert web coordinated tracking use search engine to search information for instance the subject may have posted some resume on certain sites to be searched for pros amp cons 1 edit pro it s a great way to uncover fraud or illegal actions make people aware goods and evil through learning why the subject is being ren roued there is law to protect privacy issues so subjects being ren roued can still protect themselves against any illegal conduct of ren rou that harms them con it invades people s privacy it makes people fear and nervous even it is unnecessary there is no legal guide line on how to conduct it properly references edit 1 人 http tech sina com cn it 2008 12 30 10182704437 shtml retrieved from http en wikipedia org w index php title ren rou amp oldid 506415231 categories data mininghidden categories articles needing cleanup from may 2011all articles needing cleanupcleanup tagged articles without a reason field from may 2011wikipedia pages needing cleanup from may 2011orphaned articles from february 2011all orphaned articles navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 8 august 2012 at 16 38 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SEMMA b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SEMMA new file mode 100644 index 00000000..0a5aabf6 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SEMMA @@ -0,0 +1 @@ +semma wikipedia the free encyclopedia semma from wikipedia the free encyclopedia jump to navigation search semma is an acronym that stands for sample explore modify model and assess it is a list of sequential steps developed by sas institute inc one of the largest producers of statistics and business intelligence software it guides the implementation of data mining applications 1 although semma is often considered to be a general data mining methodology sas claims that it is rather a logical organisation of the functional tool set of one of their products sas enterprise miner for carrying out the core tasks of data mining 2 contents 1 background 2 phases of semma 3 criticism 4 see also 5 references background edit in the expanding field of data mining there has been a call for a standard methodology or a simply list of best practices for the deverisified and iterative process of data 
mining that users can apply to their data mining projects regardless of industry while the cross industry standard process for data mining or crisp dm founded by the european strategic program on research in information technology initiative aimed to create a netural methodology sas also offered a pattern to follow in its data mining tools phases of semma edit the phases of semma and related tasks are the following 2 sample the process starts with data sampling e g selecting the data set for modeling the data set should be large enough to contain sufficient information to retrieve yet small enough to be used efficiently this phase also deals with data partitioning explore this phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables and also abnormalities with the help of data visualization modify the modify phase contains methods to select create and transform variables in preparation for data modeling model in the model phase the focus is on applying various modeling data mining techniques on the prepared variables in order to create models that possibly provide the desired outcome assess the last phase is assess the evaluation of the modeling results shows the reliability and usefulness of the created models criticism edit semma mainly focuses on the modeling tasks of data mining projects leaving the business aspects out unlike i e crisp dm and its business understanding phase additionally semma is designed to help the users of the sas enterprise miner software therefore applying it outside enterprise miner can be ambiguous 3 see also edit cross industry standard process for data mining references edit azevedo a and santos m f kdd semma and crisp dm a parallel overview in proceedings of the iadis european conference on data mining 2008 pp 182 185 a b sas enterprise miner website rohanizadeh s s and moghadam m b a proposed data mining methodology and its application to industrial procedures journal of industrial engineering 4 2009 pp 37 50 retrieved from http en wikipedia org w index php title semma amp oldid 546308018 categories applied data mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 22 march 2013 at 15 12 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGKDD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGKDD new file mode 100644 index 00000000..69ca3bbc --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGKDD @@ -0,0 +1 @@ +sigkdd wikipedia the free encyclopedia sigkdd from wikipedia the free encyclopedia jump to navigation search sigkdd is the association for computing machinery s special 
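SEMMA, as described in the passage above, is a sequence of phases rather than an algorithm. Purely as an illustrative sketch of how the five phases might map onto a small scripted workflow (pandas and scikit-learn assumed, neither of which is mentioned in the article, and this has nothing to do with SAS Enterprise Miner itself; every name below is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def semma_pipeline(df: pd.DataFrame, target: str):
    """Toy SEMMA-shaped workflow; assumes numeric features and a binary target."""
    # Sample: draw a manageable subset and partition it for honest assessment
    sample = df.sample(frac=0.5, random_state=0)
    train, test = train_test_split(sample, test_size=0.3, random_state=0)

    # Explore: summary statistics and correlations to surface relationships and anomalies
    print(train.describe())
    print(train.corr(numeric_only=True)[target].sort_values())

    # Modify: select and transform variables in preparation for modelling
    features = [c for c in train.columns if c != target]
    scaler = StandardScaler().fit(train[features])
    x_train, x_test = scaler.transform(train[features]), scaler.transform(test[features])

    # Model: apply a modelling technique to the prepared variables
    model = LogisticRegression(max_iter=1000).fit(x_train, train[target])

    # Assess: judge reliability and usefulness on held-out data
    auc = roc_auc_score(test[target], model.predict_proba(x_test)[:, 1])
    return model, auc
```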
interest group on knowledge discovery and data mining it became an official acm sig in 1998 the official web page of sigkdd can be found on www kdd org the current chairman of sigkdd since 2009 is usama m fayyad ph d contents 1 conferences 2 kdd cup 3 awards 4 sigkdd explorations 5 current executive committee 6 information directors 7 references 8 external links conferences edit sigkdd has hosted an annual conference acm sigkdd conference on knowledge discovery and data mining kdd since 1995 kdd conferences grew from kdd knowledge discovery and data mining workshops at aaai conferences which were started by gregory piatetsky shapiro in 1989 1991 and 1993 and usama fayyad in 1994 1 conference papers of each proceedings of the sigkdd international conference on knowledge discovery and data mining are published through acm 2 kdd 2012 took place in beijing china 3 and kdd 2013 will take place in chicago united states aug 11 14 2013 kdd cup edit sigkdd sponsors the kdd cup competition every year in conjunction with the annual conference it is aimed at members of the industry and academia particularly students interested in kdd awards edit the group also annually recognizes members of the kdd community with its innovation award and service award additionally kdd presents a best paper award 4 to recognize the highest quality paper at each conference sigkdd explorations edit sigkdd has also published a biannual academic journal titled sigkdd explorations since june 1999 editors in chief bart goethals since 2010 osmar r zaiane 2008 2010 ramakrishnan srikant 2006 2007 sunita sarawagi 2003 2006 usama fayyad 1999 2002 current executive committee edit chair usama fayyad 2009 treasurer osmar r zaiane 2009 directors johannes gehrke robert grossman david d jensen 5 raghu ramakrishnan sunita sarawagi 6 ramakrishnan srikant 7 former chairpersons gregory piatetsky shapiro 8 2005 2008 won kim 1998 2004 information directors edit ankur teredesai 2011 gabor melli 9 2004 2011 ramakrishnan srikant 1998 2003 references edit http www sigkdd org conferences php http dl acm org event cfm id re329 http kdd2012 sigkdd org kdd conference best paper awards retrieved 2012 04 07 160 http kdl cs umass edu people jensen http www it iitb ac in sunita http www rsrikant com http www kdnuggets com gps html http www gabormeli com rkb external links edit acm sigkdd homepage acm sigkdd explorations homepage kdd 2013 conference homepage kdd 2012 conference homepage this computing article is a stub you can help wikipedia by expanding it v t e retrieved from http en wikipedia org w index php title sigkdd amp oldid 558448906 categories association for computing machinery special interest groupsdata miningcomputing stubs navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages fran ais edit links this page was last modified on 5 june 2013 at 14 14 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization 
privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGMOD b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGMOD new file mode 100644 index 00000000..ddb02c49 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SIGMOD @@ -0,0 +1 @@ +sigmod wikipedia the free encyclopedia sigmod from wikipedia the free encyclopedia jump to navigation search sigmod is the association for computing machinery s special interest group on management of data which specializes in large scale data management problems and databases the annual acm sigmod conference which began in 1975 is considered one of the most important in the field while traditionally this conference had always been held within north america it recently took place in europe in 2004 and asia in 2007 acceptance rate of acm sigmod conference averaged from 1996 to 2012 is 18 with the rate of 17 in 2012 1 in association with sigact and sigart sigmod also sponsors the annual acm symposium on principles of database systems pods conference on the theoretical aspects of database systems pods began in 1982 and has been held jointly with the sigmod conference since 1991 each year the group gives out several awards to contributions to the field of data management the most important of these is the sigmod edgar f codd innovations award named after the computer scientist edgar f codd which is awarded to innovative and highly significant contributions of enduring value to the development understanding or use of database systems and databases additionally sigmod presents a best paper award 2 to recognize the highest quality paper at each conference contents 1 venues of sigmod conferences 2 see also 3 external links 4 references venues of sigmod conferences edit year place link 2013 new york 1 2012 scottsdale 2 2011 athens 3 2010 indianapolis 4 2009 providence 5 2008 vancouver 6 2007 beijing 7 2006 chicago 8 2005 baltimore 9 2004 paris 10 2003 san diego 11 2002 madison 12 2001 santa barbara 13 2000 dallas 14 1999 philadelphia 1998 seattle 1997 tucson 1996 montreal 1995 san jose 1994 minneapolis 1993 washington dc 1992 san diego 1991 denver 1990 atlantic city 1989 portland 1988 chicago 1987 san francisco 1986 washington dc 1985 austin 1984 boston 1983 san jose california 1982 orlando florida 1981 ann arbor 1980 santa monica 1979 boston 1978 austin 1977 toronto 1976 washington dc 1975 san jose see also edit list of computer science conferences cidr conference on innovative data systems research icde ieee international conference on data engineering vldb international conference on very large data bases external links edit sigmod references edit proceedings of the 2012 acm sigmod international conference on management of data 2012 retrieved 2012 09 17 160 sigmod conference best paper awards retrieved 2012 04 07 160 retrieved from http en wikipedia org w index php title sigmod amp oldid 560941902 categories association for computing machinery special interest groups navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book 
download as pdf printable version languages edit links this page was last modified on 21 june 2013 at 17 35 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SPSS_Modeler b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SPSS_Modeler new file mode 100644 index 00000000..8005ba7f --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/SPSS_Modeler @@ -0,0 +1 @@ +spss modeler wikipedia the free encyclopedia spss modeler from wikipedia the free encyclopedia jump to navigation search this article needs additional citations for verification please help improve this article by adding citations to reliable sources unsourced material may be challenged and removed december 2008 ibm spss modeler data mining tool developer s ibm corp stable release 15 0 win unix linux june 2012 operating system windows linux unix type data mining and predictive analytics license proprietary software website http www 01 ibm com software analytics spss products modeler ibm spss modeler is a data mining software application from ibm it is a data mining and text analytics workbench used to build predictive models it has a visual interface which allows users to leverage statistical and data mining algorithms without programming spss modeler has been used in these and other industries customer analytics 1 and customer relationship management crm fraud detection and prevention 2 optimizing insurance claims citation needed risk management citation needed manufacturing quality improvement citation needed healthcare quality improvement 3 forecasting demand or sales citation needed law enforcement 4 and border security education 5 telecommunications 6 entertainment e g predicting movie box office receipts 7 spss modeler was originally named spss clementine by spss inc after which it was renamed pasw modeler in 2009 by spss 8 it was since acquired by ibm in its 2009 acquisition of spss inc and was subsequently renamed ibm spss modeler its current name contents 1 editions 2 architecture 3 features 4 release history 5 product history 6 competitors 7 see also 8 references 9 further reading 10 external links editions edit ibm sells the current version of spss modeler version 15 in two separate bundles of features these two bundles are called editions by ibm spss modeler professional used for structured data such as databases mainframe data systems flat files or bi systems spss modeler premium includes all the features of modeler professional with the addition of text analytics entity analytics social network analysis both editions are available in desktop and server configurations architecture edit spss modeler has a three tier design users manipulate icons and options in the front end application on windows operating systems this front end client application then communicates with the modeler server or directly with a database or dataset the most common configuration in large corporations is to house the modeler server software on a powerful analytical server box windows unix linux which then connects to the corporate data warehouse data processing commands are automatically converted from the icon based user 
interface into a command code which is not visible and is sent to the modeler server for processing where possible this command code will be further compiled into sql and processed in the data warehouse nb this section needs further updating features edit modeling algorithms included automatic classification binary and numeric automatic clustering anomaly detection apriori bayesian networks c amp rt c5 0 chaid amp quest carma cox regression decision list factor analysis pca feature selection k means kohonen two step discriminant support vector machine svm knn logistic regression for binary outcomes neural networks multi layer perceptrons with back propagation learning and radial basis function networks regression linear genlin glm generalized linear mixed models glmm linear equation modeling self learning response model slrm sequence support vector machine time series release history edit clementine 1 0 june 1994 by isl 9 clementine 5 1 jan 2000 clementine 12 0 jan 2008 pasw modeler 13 formerly clementine april 2009 ibm spss modeler 14 0 2010 ibm spss modeler 14 2 2011 ibm spss modeler 15 0 june 2012 product history edit early versions of the software were called clementine and were unix based and designed as a consulting tool and not for sale to customers originally developed by a uk company named integral solutions limited isl 9 the tool quickly garnered the attention of the data mining community at that time in its infancy original in many respects it was the first data mining tool to use an icon based graphical user interface rather than requiring users to write in a programming language in 1998 isl was acquired by spss inc who saw the potential for extended development as a commercial data mining tool in early 2000 the software was developed into a client server architecture and shortly afterward the client front end interface component was completely re written and replaced with a superior java front end spss clementine version 12 0 the client front end runs under windows the server back end unix variants sun hp ux aix linux and windows the graphical user interface is written in java ibm spss modeler 14 2 was the first release of modeler by ibm ibm spss modeler 15 released in june 2012 introduced significant new functionality for social network analysis and entity analytics competitors edit alpine data labs alpine angoss software corporation knowledgeseeker and knowledgestudio knime oracle data mining r programming language sas enterprise miner data mining software provided by the sas institute statistica data miner data mining software provided by statsoft weka see also edit ibm spss statistics list of statistical packages cross industry standard process for data mining references edit forrester research inc 2012 the forrester wave customer analytics solutions http www forrester com pimages rws reprints document 80281 oid 1 krb1c8 http www 01 ibm com software success cssdb nsf cs kkmh 88u29v opendocument amp site default amp cty en_us http www 01 ibm com software analytics spss 12 patient outcomes http www 01 ibm com software success cssdb nsf cs strd 8ljjgh opendocument amp site spss amp cty en_us http public dhe ibm com common ssi ecm en imw14303usen imw14303usen pdf http public dhe ibm com common ssi ecm en ytw03085usen ytw03085usen pdf delen dursun 2009 predicting movie box office receipts using spss clementine data mining software in nisbet robert elder john amp miner gary 2009 handbook of statistical analysis and data mining applications elsevier pp 160 391 415 isbn 160 978 0 
12 374765 5 160 oh my darling spss says goodbye clementine hello pasw intelligent enterprise a b colin shearer 1994 mining the data lode times higher education november 18 1994 further reading edit chapman p clinton j kerber r khabaza t reinartz t shearer c et al 2000 crisp dm 1 0 chicago il spss nisbet r elder j and miner g 2009 handbook of statistical analysis and data mining applications burlington ma academic press elsevier external links edit users guide spss modeler 15 ibm spss modeler website v t e statistical software public domain dataplot epi info cspro x 12 arima open source admb dap gretl jags jmulti gnu octave openbugs pspp r simfit sofa statistics sage xlispstat freeware bv4 1 cumfreq xplore winbugs retail cross platform data desk gauss graphpad instat graphpad prism ibm spss statistics ibm spss modeler jmp matlab mathematica oxmetrics rats sas stata sudaan s plus world programming system wps windows only bmdp eviews genstat medcalc minitab ncss q shazam sigmastat statistica statxact systat the unscrambler unistat excel add ons analyse it spc xl sigmaxl unistat for excel xlfit rexcel category comparison retrieved from http en wikipedia org w index php title spss_modeler amp oldid 536090945 categories data miningdata mining and machine learning softwaredata minersanalysispredictionstatistical softwarestatistical algorithmshidden categories articles needing additional references from december 2008all articles needing additional referencespages using infoboxes with thumbnail imagesall articles with unsourced statementsarticles with unsourced statements from november 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 1 february 2013 at 21 58 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Sequence_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Sequence_mining new file mode 100644 index 00000000..79392db2 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Sequence_mining @@ -0,0 +1 @@ +sequence mining wikipedia the free encyclopedia sequence mining from wikipedia the free encyclopedia jump to navigation search sequence mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence 1 it is usually presumed that the values are discrete and thus time series mining is closely related but usually considered a different activity sequence mining is a special case of structured data mining there are several key traditional computational problems addressed within this field these include building efficient databases and indexes for sequence information extracting the frequently occurring 
patterns comparing sequences for similarity and recovering missing sequence members in general sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning contents 1 string mining 2 itemset mining 3 variants 4 application 5 algorithms 6 see also 7 references 8 external links string mining edit string mining typically deals with a limited alphabet for items that appear in a sequence but the sequence itself may be typically very long examples of an alphabet can be those in the ascii character set used in natural language text nucleotide bases a g c and t in dna sequences or amino acids for protein sequences in biology applications analysis of the arrangement of the alphabet in strings can be used to examine gene and protein sequences to determine their properties knowing the sequence of letters of a dna a protein is not an ultimate goal in itself rather the major task is to understand the sequence in terms of its structure and biological function this is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit in many cases this requires comparing a given sequence with previously studied ones the comparison between the strings becomes complicated when insertions deletions and mutations occur in a string a survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by abouelhoda amp ghanem 2010 which include 2 repeat related problems that deal with operations on single sequences and can be based on exact string matching or approximate string matching methods for finding dispersed fixed length and maximal length repeats finding tandem repeats and finding unique subsequences and missing un spelled subsequences alignment problems that deal with comparison between strings by first aligning one or more sequences examples of popular methods include blast for comparing a single sequence with multiple sequences in a database and clustalw for multiple alignments alignment algorithms can be based on either exact or approximate methods and can also be classified as global alignments semi global alignments and local alignment see sequence alignment itemset mining edit some problems in sequence mining lend themselves discovering frequent itemsets and the order they appear for example one is seeking rules of the form if a customer buys a car he or she is likely to buy insurance within 1 week or in the context of stock prices if nokia up and ericsson up it is likely that motorola up and samsung up within 2 days traditionally itemset mining is used in marketing applications for discovering regularities between frequently co occurring items in large transactions for example by analysing transactions of customer shopping baskets in a supermarket one can produce a rule which reads if a customer buys onions and potatoes together he or she is likely to also buy hamburger meat in the same transaction a survey and taxonomy of the key algorithms for item set mining is presented by han et al 2007 3 the two common techniques that are applied to sequence databases for frequent itemset mining are the influential apriori algorithm and the more recent fp growth technique variants edit the traditional sequential pattern mining is modified including some constraints and some behaviour george and binu 2012 have integrated three significant marketing scenarios for 
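To make the frequent-itemset idea above concrete, here is a small self-contained sketch of level-wise (Apriori-style) itemset counting over a toy transaction list, using only the Python standard library; the baskets and the minimum-support threshold are invented to echo the onions-and-potatoes example in the text, and the sketch omits the candidate-pruning refinements of the full Apriori and FP-growth algorithms.

```python
def frequent_itemsets(transactions, min_support):
    """Level-wise search: an itemset can only be frequent if all of its
    subsets are frequent, so candidates are grown one item at a time."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {s: support(s) for s in current}

    k = 2
    while current:
        # join frequent (k-1)-itemsets into k-item candidates, then prune by support
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates if support(c) >= min_support}
        result.update({c: support(c) for c in current})
        k += 1
    return result

# toy market-basket data echoing the onions/potatoes/hamburger example in the text
baskets = [
    {"onions", "potatoes", "hamburger"},
    {"onions", "potatoes"},
    {"potatoes", "hamburger"},
    {"onions", "potatoes", "hamburger", "beer"},
]
for itemset, sup in sorted(frequent_itemsets(baskets, 0.5).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))
```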
mining promotion oriented sequential patterns 4 the promotion based market scenarios considered in their research are 1 product downturn 2 product revision and 3 product launch drl by considering these they developed a drl prefix span algorithm tailored from of the prefix span for mining all length drl patterns application edit with a great variation of products and user buying behaviors shelf on which products are being displayed is one of the most important resources in retail environment retailers can not only increase their profit but also decrease cost by proper management of shelf space allocation and products display to solve this problem george and binu 2013 have proposed an approach to mine user buying patterns using prefixspan algorithm and place the products on shelves based on the order of mined purchasing patterns 5 algorithms edit commonly used algorithms include gsp algorithm sequential p ttern discovery using equivalence classes spade apriori algorithm freespan prefixspan mapres 6 see also edit association rule learning data mining process mining sequence analysis bioinformatics sequence clustering sequence labeling string computer science sequence alignment time series references edit mabroukeh n r ezeife c i 2010 a taxonomy of sequential pattern mining algorithms acm computing surveys 43 1 doi 10 1145 1824795 1824798 160 edit abouelhoda m ghanem m 2010 string mining in bioinformatics in gaber m m scientific data mining and knowledge discovery springer doi 10 1007 978 3 642 02788 8_9 isbn 160 978 3 642 02787 1 160 han j cheng h xin d yan x 2007 frequent pattern mining current status and future directions data mining and knowledge discovery 15 1 55 86 doi 10 1007 s10618 006 0059 1 160 george aloysius binu d 2012 drl prefixspan a novel pattern growth algorithm for discovering downturn revision and launch drl sequential patterns central european journal of computer science 2 4 426 439 doi 10 2478 s13537 012 0030 8 160 george a binu d 2013 an approach to products placement in supermarkets using prefixspan algorithm journal of king saud university computer and information sciences 25 1 77 87 doi 10 1016 j jksuci 2012 07 001 160 ahmad ishtiaq qazi wajahat m khurshid ahmed ahmad munir hoessli daniel c khawaja iffat choudhary m iqbal shakoori abdul r nasir ud din 1 may 2008 mapres mining association patterns among preferred amino acid residues in the vicinity of amino acids targeted for post translational modifications proteomics 8 10 1954 1958 extra pages or at help doi 10 1002 pmic 200700657 160 external links edit implementations spmf a free open source data mining platform written in java offering more than 45 algorithms for sequential pattern mining sequential rule mining itemset mining and association rule mining retrieved from http en wikipedia org w index php title sequence_mining amp oldid 560316524 categories data miningbioinformaticsbioinformatics algorithmshidden categories pages with citations using conflicting page specifications navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol srpski edit links this page was last 
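The algorithms named in the list above (GSP, SPADE, FreeSpan, PrefixSpan, MAPRes) all rest on the same primitive: counting how many sequences in the database contain a candidate pattern as an order-preserving subsequence. The sketch below illustrates only that support-counting step, not any of the named algorithms; the toy database and item names are invented for illustration, and Python is assumed only because it is convenient here.

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

# toy sequence database: each entry is one customer's ordered purchases
db = [
    ["car", "insurance", "radio"],
    ["car", "radio", "insurance"],
    ["bike", "helmet"],
]
print(support(["car", "insurance"], db))  # 2 -> "buys a car, later buys insurance"
```

A full miner such as GSP or PrefixSpan adds candidate generation or prefix projection on top of this counting step so that not every possible pattern has to be enumerated.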
modified on 17 june 2013 at 16 29 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Society_for_Industrial_and_Applied_Mathematics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Society_for_Industrial_and_Applied_Mathematics new file mode 100644 index 00000000..5b30d471 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Society_for_Industrial_and_Applied_Mathematics @@ -0,0 +1 @@ +society for industrial and applied mathematics wikipedia the free encyclopedia society for industrial and applied mathematics from wikipedia the free encyclopedia jump to navigation search not to be confused with soci t de math matiques appliqu es et industrielles this article relies largely or entirely upon a single source relevant discussion may be found on the talk page please help improve this article by introducing citations to additional sources december 2012 this article relies on references to primary sources please add references to secondary or tertiary sources december 2012 society for industrial and applied mathematics siam logo formation 1951 headquarters philadelphia pennsylvania united states membership gt 12 000 president lloyd n trefethen website www siam org the society for industrial and applied mathematics siam was founded by a small group of mathematicians from academia and industry who met in philadelphia in 1951 to start an organization whose members would meet periodically to exchange ideas about the uses of mathematics in industry this meeting led to the organization of the society for industrial and applied mathematics the membership of siam has grown from a few hundred in the early 1950s to more than 12 000 as of 2009 update siam retains its north american influence but it also has east asian argentinian bulgarian and uk amp ireland sections siam is one of the four parts of the joint policy board for mathematics contents 1 members 2 focus 3 activity groups siags 4 journals 5 books 6 conferences 7 siam news 8 prizes and recognition 8 1 siam fellows 9 moody s mega math m3 challenge 10 students 11 see also 12 references 13 external links members edit membership is open to both individuals and organizations focus edit the focus for the society is applied computational and industrial mathematics and the society often promotes its acronym as science and industry advance with mathematics it is composed of a combination of people from a wide variety of vocations members include engineers scientists industrial mathematicians and academic mathematicians the society is active in promoting the use of analysis and modeling in all settings the society also strives to support and provide guidance to educational institutions wishing to promote applied mathematics activity groups siags edit the society includes a number of activity groups to allow for more focused group discussions and collaborations algebraic geometry analysis of partial differential equations computational science and engineering control and systems theory data mining and analytics discrete mathematics dynamical systems financial mathematics and engineering geometric design geosciences imaging science life 
sciences linear algebra mathematical aspects of materials science nonlinear waves and coherent structures optimization orthogonal polynomials and special functions supercomputing uncertainty quantification journals edit as of 2012 update siam publishes 16 research journals 1 siam journal on applied mathematics siap since 1966 formerly journal of the society for industrial and applied mathematics since 1953 theory of probability and its applications tvp since 1956 translation of teoriya veroyatnostei i ee primeneniya siam review sirev since 1959 siam journal on control and optimization sicon since 1976 formerly siam journal on control since 1966 formerly journal of the society for industrial and applied mathematics series a control since 1962 siam journal on numerical analysis sinum since 1966 formerly journal of the society for industrial and applied mathematics series b numerical analysis since 1964 siam journal on mathematical analysis sima since 1970 siam journal on computing sicomp since 1972 siam journal on matrix analysis and applications simax since 1988 formerly siam journal on algebraic and discrete methods since 1980 siam journal on scientific computing sisc since 1993 formerly siam journal on scientific and statistical computing since 1980 siam journal on discrete mathematics sidma since 1988 siam journal on optimization siopt since 1991 siam journal on applied dynamical systems siads since 2002 multiscale modeling and simulation mms since 2003 siam journal on imaging sciences siims since 2008 siam journal on financial mathematics sifin since 2010 siam asa journal on uncertainty quantification juq since 2013 books edit siam publishes 20 25 books each year conferences edit siam organizes conferences and meetings throughout the year focused on various topics in applied math and computational science siam news edit siam news is a newsletter focused on the applied math and computational science community and is published ten times per year prizes and recognition edit siam recognizes applied mathematician and computational scientists for their contributions to the fields prizes include 2 germund dahlquist prize awarded to a young scientist normally under 45 for original contributions to fields associated with germund dahlquist numerical solution of differential equations and numerical methods for scientific computing 3 ralph e kleinman prize awarded for outstanding research or other contributions that bridge the gap between mathematics and applications each prize may be given either for a single notable achievement or for a collection of such achievements 4 j d crawford prize awarded to one individual for recent outstanding work on a topic in nonlinear science as evidenced by a publication in english in a peer reviewed journal within the four calendar years preceding the meeting at which the prize is awarded 5 richard c diprima prize awarded to a young scientist who has done outstanding research in applied mathematics defined as those topics covered by siam journals and who has completed his her doctoral dissertation and completed all other requirements for his her doctorate during the period running from three years prior to the award date to one year prior to the award date 6 george p lya prize is given every two years alternately in two categories 1 for a notable application of combinatorial theory 2 for a notable contribution in another area of interest to george p lya such as approximation theory complex analysis number theory orthogonal polynomials probability theory or 
mathematical discovery and learning 7 w t and idalia reid prize awarded for research in and contributions to areas of differential equations and control theory 8 theodore von k rm n prize awarded for notable application of mathematics to mechanics and or the engineering sciences made during the five to ten years preceding the award 9 james h wilkinson prize awarded for research in or other contributions to numerical analysis and scientific computing during the six years preceding the award 10 siam fellows edit in 2009 siam instituted a fellows program to recognize certain members who have made outstanding contributions to the fields siam serves 11 moody s mega math m3 challenge edit funded by the moody s foundation and organized by siam the moody s mega math challenge is an applied mathematics modeling competition for high school students along the entire east coast from maine through florida scholarship prizes total 100 000 students edit siam undergraduate research online publishes outstanding undergraduate research in applied and computational mathematics student memberships are generally discounted or free siam has career and job resources for students and other applied mathematicians and computational scientists see also edit american mathematical society references edit journals siam retrieved 2012 12 04 160 prizes awards lectures and fellows siam retrieved 2012 12 04 160 germund dahlquist prize siam retrieved 2012 12 04 160 ralph e kleinman prize siam retrieved 2012 12 04 160 j d crawford prize siag dynamical systems siam retrieved 2012 12 04 160 the richard c diprima prize siam retrieved 2012 12 04 160 george p lya prize siam retrieved 2012 12 04 160 w t and idalia reid prize in mathematics siam retrieved 2012 12 04 160 theodore von k rm n prize siam retrieved 2012 12 04 160 james h wilkinson prize in numerical analysis and scientific computing siam retrieved 2012 12 04 160 fellows program siam retrieved 2012 12 04 160 external links edit official website m3challenge siam org retrieved from http en wikipedia org w index php title society_for_industrial_and_applied_mathematics amp oldid 555254530 categories mathematical societiesorganizations established in 19511951 establishments in the united statesnon profit publishershidden categories articles needing additional references from december 2012all articles needing additional referencesarticles lacking reliable references from december 2012all articles lacking reliable referencesarticles containing potentially dated statements from 2009all articles containing potentially dated statementsarticles containing potentially dated statements from 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch fran ais portugus edit links this page was last modified on 15 may 2013 at 19 06 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact 
wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Software_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Software_mining new file mode 100644 index 00000000..801de8f8 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Software_mining @@ -0,0 +1 @@ +software mining wikipedia the free encyclopedia software mining from wikipedia the free encyclopedia jump to navigation search software mining is an application of knowledge discovery in the area of software modernization which involves understanding existing software artifacts this process is related to a concept of reverse engineering usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary an entity relationship is a frequent format of representing knowledge obtained from existing software object management group omg developed specification knowledge discovery metamodel kdm which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery of existing code contents 1 software mining and data mining 2 text mining software tools 3 levels of software mining 4 forms of representing the results of software mining 5 see also 6 references software mining and data mining edit software mining is closely related to data mining since existing software artifacts contain enormous business value key for the evolution of software systems knowledge discovery from software systems addresses structure behavior as well as the data processed by the software system instead of mining individual data sets software mining focuses on metadata such as database schemas omg knowledge discovery metamodel provides an integrated representation to capturing application metadata as part of a holistic existing system metamodel another omg specification the common warehouse metamodel focuses entirely on mining enterprise metadata text mining software tools edit text mining software tools enable easy handling of text documents for the purpose of data analysis including automatic model generation and document classification document clustering document visualization dealing with web documents and crawling the web levels of software mining edit knowledge discovery in software is related to a concept of reverse engineering software mining addresses structure behavior as well as the data processed by the software system mining software systems may happen at various levels program level individual statements and variables design pattern level call graph level individual procedures and their relationships architectural level subsystems and their interfaces data level individual columns and attributes of data stores application level key data items and their flow through the applications business level domain concepts business rules and their implementation in code forms of representing the results of software mining edit data model metadata metamodels ontology knowledge representation business rule knowledge discovery metamodel kdm business process modeling notation bpmn intermediate representation resource description framework rdf abstract syntax tree ast software metrics graphical user interfaces 1 see also edit mining software repositories references edit interview with the creators of metawidget retrieved from http en wikipedia org w index php title software_mining amp oldid 516129846 categories static program analysis 
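As a concrete, if very simplified, illustration of mining at the "program level" and "call graph level" described above, the snippet below uses Python's standard ast module to extract which functions each function calls from a piece of source code. This is a toy sketch, not KDM-based tooling, and the example source is invented.

```python
import ast

source = '''
def load(path):
    return open(path).read()

def parse(path):
    text = load(path)
    return text.split()
'''

tree = ast.parse(source)
calls = {}
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        # collect direct calls to plain names inside this function body
        called = [n.func.id for n in ast.walk(node)
                  if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
        calls[node.name] = called

print(calls)  # {'load': ['open'], 'parse': ['load']}
```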
toolsdata mining navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages edit links this page was last modified on 5 october 2012 at 10 52 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Spatial_index b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Spatial_index new file mode 100644 index 00000000..f69e99b0 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Spatial_index @@ -0,0 +1 @@ +spatial database wikipedia the free encyclopedia spatial database from wikipedia the free encyclopedia redirected from spatial index jump to navigation search a spatial database is a database that is optimized to store and query data that represents objects defined in a geometric space most spatial databases allow representing simple geometric objects such as points lines and polygons some spatial databases handle more complex structures such as 3d objects topological coverages linear networks and tins while typical databases are designed to manage various numeric and character types of data additional functionality needs to be added for databases to process spatial data types efficiently these are typically called geometry or feature the open geospatial consortium created the simple features specification and sets standards for adding spatial functionality to database systems 1 contents 1 features of spatial databases 2 spatial index 3 spatial database systems 4 see also 5 references 6 further reading 7 external links features of spatial databases edit database systems use indexes to quickly look up values and the way that most databases index data is not optimal for spatial queries instead spatial databases use a spatial index to speed up database operations in addition to typical sql queries such as select statements spatial databases can perform a wide variety of spatial operations the following operations and many more are specified by the open geospatial consortium standard spatial measurements computes line length polygon area the distance between geometries etc spatial functions modify existing features to create new ones for example by providing a buffer around them intersecting features etc spatial predicates allows true false queries about spatial relationships between geometries examples include do two polygons overlap or is there a residence located within a mile of the area we are planning to build the landfill see de 9im geometry constructors creates new geometries usually by specifying the vertices points or nodes which define the shape observer functions queries which return specific information about a feature such as the location of the center of a circle some databases support only simplified or modified sets of these operations 
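The operation families just listed (spatial measurements, spatial functions, spatial predicates, geometry constructors) can also be tried outside a database. The sketch below uses the shapely Python package, which is an assumption of this example rather than something named in the article, and works on invented geometries.

```python
from shapely.geometry import Point, Polygon

parcel = Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])   # geometry constructor
site = Point(1, 1)

print(parcel.area)                  # spatial measurement: polygon area -> 12.0
print(parcel.length)                # spatial measurement: perimeter length
print(site.distance(Point(5, 5)))   # distance between geometries
print(parcel.contains(site))        # spatial predicate -> True
buffered = site.buffer(1.0)         # spatial function: 1-unit buffer around the point
print(buffered.intersects(parcel))  # predicate on the derived geometry -> True
```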
especially in cases of nosql systems like mongodb and couchdb spatial index edit spatial indices are used by spatial databases databases which store information related to objects in space to optimize spatial queries conventional index types do not efficiently handle spatial queries such as how far two points differ or whether points fall within a spatial area of interest common spatial index methods include grid spatial index z order curve quadtree octree ub tree r tree typically the preferred method for indexing spatial data citation needed objects shapes lines and points are grouped using the minimum bounding rectangle mbr objects are added to an mbr within the index that will lead to the smallest increase in its size r tree r tree hilbert r tree x tree kd tree m tree an m tree index can be used for the efficient resolution of similarity queries on complex objects as compared using an arbitrary metric spatial database systems edit all opengis specifications compliant products 2 open source spatial databases and apis some of which are opengis compliant 3 boeing s spatial query server spatially enables sybase ase smallworld vmds the native ge smallworld gis database spatialite extends sqlite with spatial datatypes functions and utilities ibm db2 spatial extender can be used to enable any edition of db2 including the free db2 express c with support for spatial types oracle spatial microsoft sql server has support for spatial types since version 2008 postgresql dbms database management system uses the spatial extension postgis to implement the standardized datatype geometry and corresponding functions mysql dbms implements the datatype geometry plus some spatial functions that have been implemented according to the opengis specifications 4 however in mysql version 5 5 and earlier functions that test spatial relationships are limited to working with minimum bounding rectangles rather than the actual geometries mysql versions earlier than 5 0 16 only supported spatial data in myisam tables as of mysql 5 0 16 innodb ndb bdb and archive also support spatial features neo4j graph database that can build 1d and 2d indexes as btree quadtree and hilbert curve directly in the graph allegrograph a graph database provides a novel mechanism for efficient storage and retrieval of two dimensional geospatial coordinates for resource description framework data it includes an extension syntax for sparql queries mongodb supports geospatial indexes in 2d esri has a number of both single user and multiuser geodatabases spacebase is a real time spatial database 5 couchdb a document based database system that can be spatially enabled by a plugin called geocouch cartodb is a cloud based geospatial database on top of postgresql with postgis stormdb is an upcoming cloud based database on top of postgresql with geospatial capabilities spatialdb by minerp is the worlds first open standards ogc spatial database with spatial type extensions for the mining industry 6 see also edit object based spatial database spatiotemporal database spatial query spatial analysis location intelligence references edit ogc homepage all registered products at opengeospatial org open source gis website http dev mysql com doc refman 5 5 en gis introduction html spacebase product page on the parallel universe website spatialdb product page on the minerp website further reading edit spatial databases a tour shashi shekhar and sanjay chawla prentice hall 2003 isbn 0 13 017480 7 esri press esri press titles include modeling our world the esri 
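The passage above describes the usual R-tree insertion rule: a new object's minimum bounding rectangle (MBR) is routed to the child node whose MBR would grow the least. A minimal sketch of that heuristic follows, with rectangles represented as (xmin, ymin, xmax, ymax) tuples and invented sample data; it is an illustration of the idea, not an R-tree implementation.

```python
def mbr(points):
    """Minimum bounding rectangle of a set of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def enlarge(r, s):
    """Smallest rectangle covering both r and s."""
    return (min(r[0], s[0]), min(r[1], s[1]), max(r[2], s[2]), max(r[3], s[3]))

def best_child(children, new_rect):
    # R-tree insertion heuristic: pick the child MBR whose area grows least
    return min(children, key=lambda c: area(enlarge(c, new_rect)) - area(c))

leaf_mbrs = [(0, 0, 2, 2), (5, 5, 9, 9)]
print(best_child(leaf_mbrs, mbr([(1.5, 1.5), (2.5, 2.0)])))  # -> (0, 0, 2, 2)
```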
guide to geodatabase design and designing geodatabases case studies in gis data modeling 2005 ben franklin award winner pma the independent book publishers association spatial databases with application to gis philippe rigaux michel scholl and agnes voisard morgan kauffman publishers 2002 isbn 1 55860 588 6 external links edit an introduction to postgresql postgis postgresql postgis as components in a service oriented architecture soa a trigger based security alarming scheme for moving objects on road networks sajimon abraham p sojan lal published by springer berlin heidelberg 2008 retrieved from http en wikipedia org w index php title spatial_database amp oldid 555930040 spatial_index categories spatial databasesweb mappinghidden categories all articles with unsourced statementsarticles with unsourced statements from june 2012 navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages esky espa ol fran ais italiano polski portugus ti ng vi t edit links this page was last modified on 20 may 2013 at 11 16 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_inference b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_inference new file mode 100644 index 00000000..695ac468 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_inference @@ -0,0 +1 @@ +statistical inference wikipedia the free encyclopedia statistical inference from wikipedia the free encyclopedia jump to navigation search this article has multiple issues please help improve it or discuss these issues on the talk page this article relies on references to primary sources please add references to secondary or tertiary sources march 2012 this article may contain previously unpublished synthesis of published material that conveys ideas not attributable to the original sources relevant discussion may be found on the talk page march 2012 in statistics statistical inference is the process of drawing conclusions from data that is subject to random variation for example observational errors or sampling variation 1 more substantially the terms statistical inference statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation 2 such as observational errors random sampling or random experimentation 1 initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well defined situations and that it should be general enough to be applied across a range of situations the outcome of statistical inference may be an answer to the question what should be done 
next where this might be a decision about making further experiments or surveys or about drawing a conclusion before implementing some organizational or governmental policy contents 1 introduction 1 1 scope 1 2 comparison to descriptive statistics 2 models and assumptions 2 1 degree of models assumptions 2 2 importance of valid models assumptions 2 2 1 approximate distributions 2 3 randomization based models 2 3 1 model based analysis of randomized experiments 3 modes of inference 3 1 frequentist inference 3 1 1 examples of frequentist inference 3 1 2 frequentist inference objectivity and decision theory 3 2 bayesian inference 3 2 1 examples of bayesian inference 3 2 2 bayesian inference subjectivity and decision theory 3 3 other modes of inference besides frequentist and bayesian 3 3 1 information and computational complexity 3 3 2 fiducial inference 3 3 3 structural inference 4 inference topics 5 see also 6 notes 7 references 8 further reading 9 external links introduction edit scope edit for the most part statistical inference makes propositions about populations using data drawn from the population of interest via some form of random sampling more generally data about a random process is obtained from its observed behavior during a finite period of time given a parameter or hypothesis about which one wishes to make inference statistical inference most often uses a statistical model of the random process that is supposed to generate the data which is known when randomization has been used and a particular realization of the random process i e a set of data the conclusion of a statistical inference is a statistical proposition citation needed some common forms of statistical proposition are an estimate i e a particular value that best approximates some parameter of interest a confidence interval or set estimate i e an interval constructed using a dataset drawn from a population so that under repeated sampling of such datasets such intervals would contain the true parameter value with the probability at the stated confidence level a credible interval i e a set of values containing for example 95 of posterior belief rejection of a hypothesis 3 clustering or classification of data points into groups comparison to descriptive statistics edit statistical inference is generally distinguished from descriptive statistics in simple terms descriptive statistics can be thought of as being just a straightforward presentation of facts in which modeling decisions made by a data analyst have had minimal influence models and assumptions edit main articles statistical model and statistical assumptions any statistical inference requires some assumptions a statistical model is a set of assumptions concerning the generation of the observed data and similar data descriptions of statistical models usually emphasize the role of population quantities of interest about which we wish to draw inference 4 descriptive statistics are typically used as a preliminary step before more formal inferences are drawn 5 degree of models assumptions edit statisticians distinguish between three levels of modeling assumptions fully parametric the probability distributions describing the data generation process are assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters 4 for example one may assume that the distribution of population values is truly normal with unknown mean and variance and that datasets are generated by simple random sampling the family of 
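Among the forms of statistical proposition listed above, a confidence interval is easy to make concrete. Assuming numpy and scipy are available, the sketch below builds a standard 95% t-interval for a population mean from one simulated sample; the data and parameters are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=50.0, scale=8.0, size=30)

n = len(sample)
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)        # estimated standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)       # two-sided 95% critical value

ci = (mean - t_crit * se, mean + t_crit * se)
print(ci)  # under repeated sampling, ~95% of such intervals cover the true mean (here 50)
```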
generalized linear models is a widely used and flexible class of parametric models non parametric the assumptions made about the process generating the data are much less than in parametric statistics and may be minimal 6 for example every continuous probability distribution has a median which may be estimated using the sample median or the hodges lehmann sen estimator which has good properties when the data arise from simple random sampling semi parametric this term typically implies assumptions in between fully and non parametric approaches for example one may assume that a population distribution has a finite mean furthermore one may assume that the mean response level in the population depends in a truly linear manner on some covariate a parametric assumption but not make any parametric assumption describing the variance around that mean i e about the presence or possible form of any heteroscedasticity more generally semi parametric models can often be separated into structural and random variation components one component is treated parametrically and the other non parametrically the well known cox model is a set of semi parametric assumptions importance of valid models assumptions edit whatever level of assumption is made correctly calibrated inference in general requires these assumptions to be correct i e that the data generating mechanisms really has been correctly specified incorrect assumptions of simple random sampling can invalidate statistical inference 7 more complex semi and fully parametric assumptions are also cause for concern for example incorrectly assuming the cox model can in some cases lead to faulty conclusions 8 incorrect assumptions of normality in the population also invalidates some forms of regression based inference 9 the use of any parametric model is viewed skeptically by most experts in sampling human populations most sampling statisticians when they deal with confidence intervals at all limit themselves to statements about estimators based on very large samples where the central limit theorem ensures that these estimators will have distributions that are nearly normal 10 in particular a normal distribution would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any kind of economic population 10 here the central limit theorem states that the distribution of the sample mean for very large samples is approximately normally distributed if the distribution is not heavy tailed approximate distributions edit main articles statistical distance asymptotic theory statistics and approximation theory given the difficulty in specifying exact distributions of sample statistics many methods have been developed for approximating these with finite samples approximation results measure how close a limiting distribution approaches the statistic s sample distribution for example with 10 000 independent samples the normal distribution approximates to two digits of accuracy the distribution of the sample mean for many population distributions by the berry esseen theorem 11 yet for many practical purposes the normal approximation provides a good approximation to the sample mean s distribution when there are 10 or more independent samples according to simulation studies and statisticians experience 11 following kolmogorov s work in the 1950s advanced statistics uses approximation theory and functional analysis to quantify the error of approximation in this approach the metric geometry of probability distributions is studied this 
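The article notes that how well a limiting distribution approximates a sample statistic's actual distribution can be assessed by simulation. A minimal sketch of that check, assuming numpy/scipy and an exponential population chosen purely for illustration: it compares simulated quantiles of the sample mean with the central-limit-theorem normal approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 30, 20_000

# population: exponential with mean 1 (clearly non-normal)
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# normal approximation suggested by the central limit theorem
approx = stats.norm(loc=1.0, scale=1.0 / np.sqrt(n))

for q in (0.025, 0.5, 0.975):
    print(q, round(np.quantile(means, q), 3), round(approx.ppf(q), 3))
```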
approach quantifies approximation error with for example the kullback leibler distance bregman divergence and the hellinger distance 12 13 14 with indefinitely large samples limiting results like the central limit theorem describe the sample statistic s limiting distribution if one exists limiting results are not statements about finite samples and indeed are irrelevant to finite samples 15 16 17 however the asymptotic theory of limiting distributions is often invoked for work with finite samples for example limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations which are popular in econometrics and biostatistics the magnitude of the difference between the limiting distribution and the true distribution formally the error of the approximation can be assessed using simulation 18 the heuristic application of limiting results to finite samples is common practice in many applications especially with low dimensional models with log concave likelihoods such as with one parameter exponential families randomization based models edit main article randomization see also random sample 160 and random assignment for a given dataset that was produced by a randomization design the randomization distribution of a statistic under the null hypothesis is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design in frequentist inference randomization allows inferences to be based on the randomization distribution rather than a subjective model and this is important especially in survey sampling and design of experiments 19 20 statistical inference from randomized studies is also more straightforward than many other situations 21 22 23 in bayesian inference randomization is also of importance in survey sampling use of sampling without replacement ensures the exchangeability of the sample with the population in randomized experiments randomization warrants a missing at random assumption for covariate information 24 objective randomization allows properly inductive procedures 25 26 27 28 many statisticians prefer randomization based analysis of data that was generated by well defined randomization procedures 29 however it is true that in fields of science with developed theoretical knowledge and experimental control randomized experiments may increase the costs of experimentation without improving the quality of inferences 30 31 similarly results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of the same phenomena 32 however a good observational study may be better than a bad randomized experiment the statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model 33 34 however at any time some hypotheses cannot be tested using objective statistical models which accurately describe randomized experiments or random samples in some cases such randomized studies are uneconomical or unethical model based analysis of randomized experiments edit it is standard practice to refer to a statistical model often a linear model when analyzing data from randomized experiments however the randomization scheme guides the choice of a statistical model it is not possible to choose an appropriate model without knowing the randomization scheme 20 seriously misleading results can be obtained 
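Randomization-based inference, as described above, evaluates the test statistic under the assignments the randomization design could have produced. For a simple two-group comparison this reduces to a permutation test; the sketch below (numpy assumed, data simulated) approximates the randomization distribution of the difference in means rather than enumerating every assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 20)   # treatment group
b = rng.normal(0.5, 1.0, 25)   # control group

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])
n_a = len(a)

diffs = []
for _ in range(10_000):
    perm = rng.permutation(pooled)          # one re-randomization of group labels
    diffs.append(perm[:n_a].mean() - perm[n_a:].mean())

p_value = np.mean(np.abs(diffs) >= abs(observed))
print(observed, p_value)
```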
analyzing data from randomized experiments while ignoring the experimental protocol common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units 35 modes of inference edit different schools of statistical inference have become established these schools or paradigms are not mutually exclusive and methods which work well under one paradigm often have attractive interpretations under other paradigms the two main paradigms in use are frequentist and bayesian inference which are both summarized below frequentist inference edit see also frequentist inference this paradigm calibrates the production of propositions clarification needed complicated jargon by considering notional repeated sampling of datasets similar to the one at hand by considering its characteristics under repeated sample the frequentist properties of any statistical inference procedure can be described 160 although in practice this quantification may be challenging examples of frequentist inference edit p value confidence interval frequentist inference objectivity and decision theory edit one interpretation of frequentist inference or classical inference is that it is applicable only in terms of frequency probability that is in terms of repeated sampling from a population however the approach of neyman 36 develops these procedures in terms of pre experiment probabilities that is before undertaking an experiment one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way such a probability need not have a frequentist or repeated sampling interpretation in contrast bayesian inference works in terms of conditional probabilities i e probabilities conditional on the observed data compared to the marginal but conditioned on unknown parameters probabilities used in the frequentist approach the frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions however some elements of frequentist statistics such as statistical decision theory do incorporate utility functions citation needed in particular frequentist developments of optimal inference such as minimum variance unbiased estimators or uniformly most powerful testing make use of loss functions which play the role of negative utility functions loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality property 37 however loss functions are often useful for stating optimality properties for example median unbiased estimators are optimal under absolute value loss functions in that they minimize expected loss and least squares estimators are optimal under squared error loss functions in that they minimize expected loss while statisticians using frequentist inference must choose for themselves the parameters of interest and the estimators test statistic to be used the absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as objective citation needed bayesian inference edit see also bayesian inference the bayesian calculus describes degrees of belief using the language of probability beliefs are positive integrate to one and obey probability axioms bayesian inference uses the available posterior beliefs as the basis for making statistical propositions there are several 
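To make the Bayesian calculus just described concrete: with a Beta prior on a proportion and binomial data, the posterior is again a Beta distribution, from which a posterior mean and a credible interval follow directly. A minimal conjugate-update sketch, assuming scipy and invented counts:

```python
from scipy import stats

successes, failures = 18, 12          # observed data (invented)
a_prior, b_prior = 1.0, 1.0           # uniform Beta(1, 1) prior

a_post = a_prior + successes          # conjugate update
b_post = b_prior + failures

posterior = stats.beta(a_post, b_post)
print(posterior.mean())               # posterior mean of the proportion
print(posterior.ppf([0.025, 0.975]))  # 95% credible interval
```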
different justifications for using the bayesian approach examples of bayesian inference edit credible intervals for interval estimation bayes factors for model comparison bayesian inference subjectivity and decision theory edit many informal bayesian inferences are based on intuitively reasonable summaries of the posterior for example the posterior mean median and mode highest posterior density intervals and bayes factors can all be motivated in this way while a user s utility function need not be stated for this sort of inference these summaries do all depend to some extent on stated prior beliefs and are generally viewed as subjective conclusions methods of prior construction which do not require external input have been proposed but not yet fully developed formally bayesian inference is calibrated with reference to an explicitly stated utility or loss function the bayes rule is the one which maximizes expected utility averaged over the posterior uncertainty formal bayesian inference therefore automatically provides optimal decisions in a decision theoretic sense given assumptions data and utility bayesian inference can be made for essentially any problem although not every statistical inference need have a bayesian interpretation analyses which are not formally bayesian can be logically incoherent a feature of bayesian procedures which use proper priors i e those integrable to one is that they are guaranteed to be coherent some advocates of bayesian inference assert that inference must take place in this decision theoretic framework and that bayesian inference should not conclude with the evaluation and summarization of posterior beliefs other modes of inference besides frequentist and bayesian edit information and computational complexity edit main article minimum description length see also information theory kolmogorov complexity 160 and data mining other forms of statistical inference have been developed from ideas in information theory 38 and the theory of kolmogorov complexity 39 for example the minimum description length mdl principle selects statistical models that maximally compress the data inference proceeds without assuming counterfactual or non falsifiable data generating mechanisms or probability models for the data as might be done in frequentist or bayesian approaches however if a data generating mechanism does exist in reality then according to shannon s source coding theorem it provides the mdl description of the data on average and asymptotically 40 in minimizing description length or descriptive complexity mdl estimation is similar to maximum likelihood estimation and maximum a posteriori estimation using maximum entropy bayesian priors however mdl avoids assuming that the underlying probability model is known the mdl principle can also be applied without assumptions that e g the data arose from independent sampling 40 41 the mdl principle has been applied in communication coding theory in information theory in linear regression and in time series analysis particularly for choosing the degrees of the polynomials in autoregressive moving average arma models 41 information theoretic statistical inference has been popular in data mining which has become a common approach for very large observational and heterogeneous datasets made possible by the computer revolution and internet 39 the evaluation of statistical inferential procedures often uses techniques or criteria from computational complexity theory or numerical analysis 42 43 fiducial inference edit main article 
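The MDL idea of trading model complexity against goodness of fit can be illustrated with polynomial regression. A proper MDL treatment requires an explicit coding scheme; the sketch below instead uses the common BIC-style proxy for a two-part code length (data-fit cost plus parameter cost), with numpy assumed and simulated data, and it typically recovers the true polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, x.size)   # true degree: 2

def two_part_cost(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    n, k = x.size, degree + 1
    # BIC-style proxy: fit cost (code length of residuals) + model-description cost
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

costs = {d: two_part_cost(d) for d in range(6)}
print(min(costs, key=costs.get))   # typically selects degree 2
```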
fiducial inference fiducial inference was an approach to statistical inference based on fiducial probability also known as a fiducial distribution in subsequent work this approach has been called ill defined extremely limited in applicability and even fallacious 44 45 however this argument is the same as that which shows 46 that a so called confidence distribution is not a valid probability distribution and since this has not invalidated the application of confidence intervals it does not necessarily invalidate conclusions drawn from fiducial arguments structural inference edit developing ideas of fisher and of pitman from 1938 to 1939 47 george a barnard developed structural inference or pivotal inference 48 an approach using invariant probabilities on group families barnard reformulated the arguments behind fiducial inference on a restricted class of models on which fiducial procedures would be well defined and useful inference topics edit the topics below are usually included in the area of statistical inference statistical assumptions statistical decision theory estimation theory statistical hypothesis testing revising opinions in statistics design of experiments the analysis of variance and regression survey sampling summarizing statistical data see also edit statistics portal wikimedia commons has media related to statistical inference algorithmic inference induction philosophy philosophy of statistics predictive inference notes edit a b upton g cook i 2008 oxford dictionary of statistics oup isbn 978 0 19 954145 4 dodge y 2003 the oxford dictionary of statistical terms oup isbn 0 19 920613 9 entry for inferential statistics according to peirce acceptance means that inquiry on this question ceases for the time being in science all scientific theories are revisable a b cox 2006 page 2 evans et al michael 2004 probability and statistics the science of uncertainty freeman and company p 160 267 160 van der vaart a w 1998 asymptotic statistics cambridge university press isbn 0 521 78450 6 page 341 kruskal william december 1988 miracles and statistics the casual assumption of independence asa presidential address journal of the american statistical association 83 404 929 940 jstor 160 2290117 160 freedman d a 2008 survival analysis an epidemiological hazard the american statistician 2008 62 110 119 reprinted as chapter 11 pages 169 192 of freedman d a 2010 statistical models and causal inferences a dialogue with the social sciences edited by david collier jasjeet s sekhon and philip b stark cambridge university press isbn 978 0 521 12390 7 berk r 2003 regression analysis a constructive critique advanced quantitative techniques in the social sciences v 11 sage publications isbn 0 7619 2904 5 a b brewer ken 2002 combined survey sampling inference weighing of basu s elephants hodder arnold p 160 6 isbn 160 0 340 69229 4 978 0340692295 check isbn value help 160 a b j rgen hoffman j rgensen s probability with a view towards statistics volume i page 399 full citation needed le cam 1986 page 160 needed erik torgerson 1991 comparison of statistical experiments volume 36 of encyclopedia of mathematics cambridge university press full citation needed liese friedrich and miescke klaus j 2008 statistical decision theory estimation testing and selection springer isbn 160 0 387 73193 8 160 kolmogorov 1963a page 369 the frequency concept based on the notion of limiting frequency as the number of trials increases to infinity does not contribute anything to substantiate the applicability of the results of 
probability theory to real practical problems where we have always to deal with a finite number of trials page 369 indeed limit theorems as 160 tends to infinity are logically devoid of content about what happens at any particular 160 all they can do is suggest certain approaches whose performance must then be checked on the case at hand le cam 1986 page xiv pfanzagl 1994 the crucial drawback of asymptotic theory what we expect from asymptotic theory are results which hold approximately what asymptotic theory has to offer are limit theorems page ix what counts for applications are approximations not limits page 188 pfanzagl 1994 160 by taking a limit theorem as being approximately true for large sample sizes we commit an error the size of which is unknown realistic information about the remaining errors may be obtained by simulations page ix neyman j 1934 on the two different aspects of the representative method the method of stratified sampling and the method of purposive selection journal of the royal statistical society 97 4 557 625 jstor 160 2342192 a b hinkelmann and kempthorne 2008 page 160 needed asa guidelines for a first course in statistics for non statisticians available at the asa website david a freedman et alia s statistics david s moore and george mccabe introduction to the practice of statistics gelman rubin bayesian data analysis peirce 1877 1878 peirce 1883 david freedman et alia statistics and david a freedman statistical models rao c r 1997 statistics and truth putting chance to work world scientific isbn 981 02 3111 3 peirce freedman moore and mccabe citation needed box g e p and friends 2006 improving almost anything ideas and essays revised edition wiley isbn 978 0 471 72755 2 cox 2006 page 196 asa guidelines for a first course in statistics for non statisticians available at the asa website david a freedman et alia s statistics david s moore and george mccabe introduction to the practice of statistics neyman jerzy 1923 1990 on the application of probability theory to agriculturalexperiments essay on principles section 9 statistical science 5 4 465 472 trans dorota m dabrowska and terence p speed hinkelmann amp kempthorne 2008 page 160 needed hinkelmann and kempthorne 2008 chapter 6 neyman j 1937 outline of a theory of statistical estimation based on the classical theory of probability philosophical transactions of the royal society of london a 236 333 380 preface to pfanzagl soofi 2000 a b hansen amp yu 2001 a b hansen and yu 2001 page 747 a b rissanen 1989 page 84 joseph f traub g w wasilkowski and h wozniakowski 1988 page 160 needed judin and nemirovski neyman 1956 zabell 1992 cox 2006 page 66 davison page 12 full citation needed barnard g a 1995 pivotal models and the fiducial argument international statistical review 63 3 309 323 jstor 160 1403482 references edit bickel peter j doksum kjell a 2001 mathematical statistics basic and selected topics 1 second updated printing 2007 ed pearson prentice hall isbn 160 0 13 850363 x mr 160 443141 160 cox d r 2006 principles of statistical inference cup isbn 0 521 68567 2 fisher ronald 1955 statistical methods and scientific induction journal of the royal statistical society series b 17 69 78 criticism of statistical theories of jerzy neyman and abraham wald freedman david a 2009 statistical models theory and practice revised ed cambridge university press pp 160 xiv 442 pp isbn 160 978 0 521 74385 3 mr 160 2489600 160 hansen mark h yu bin june 2001 model selection and the principle of minimum description length review 
paper journal of the american statistical association 96 454 746 774 doi 10 1198 016214501753168398 jstor 160 2670311 mr 160 1939352 160 hinkelmann klaus kempthorne oscar 2008 introduction to experimental design second ed wiley isbn 160 978 0 471 72756 9 160 kolmogorov andrei n 1963a on tables of random numbers sankhy ser a 25 369 375 mr 160 178484 160 kolmogorov andrei n 1963b on tables of random numbers theoretical computer science 207 2 387 395 doi 10 1016 s0304 3975 98 00075 9 mr 160 1643414 160 le cam lucian 1986 asymptotic methods of statistical decision theory springer isbn 0 387 96307 3 neyman jerzy 1956 note on an article by sir ronald fisher journal of the royal statistical society series b 18 2 288 294 jstor 160 2983716 160 reply to fisher 1955 peirce c s 1877 1878 illustrations of the logic of science series popular science monthly vols 12 13 relevant individual papers 1878 march the doctrine of chances popular science monthly v 12 march issue pp 604 615 internet archive eprint 1878 april the probability of induction popular science monthly v 12 pp 705 718 internet archive eprint 1878 june the order of nature popular science monthly v 13 pp 203 217 internet archive eprint 1878 august deduction induction and hypothesis popular science monthly v 13 pp 470 482 internet archive eprint peirce c s 1883 a theory of probable inference studies in logic pp 126 181 little brown and company reprinted 1983 john benjamins publishing company isbn 90 272 3271 7 pfanzagl johann with the assistance of r hamb ker 1994 parametric statistical theory berlin walter de gruyter isbn 160 3 11 013863 8 mr 160 1291393 160 rissanen jorma 1989 stochastic complexity in statistical inquiry series in computer science 15 singapore world scientific isbn 160 9971 5 0859 1 mr 160 1082556 160 soofi ehsan s december 2000 principal information theoretic approaches vignettes for the year 2000 theory and methods ed by george casella journal of the american statistical association 95 452 1349 1353 jstor 160 2669786 mr 160 1825292 160 traub joseph f wasilkowski g w wozniakowski h 1988 information based complexity academic press isbn 160 0 12 697545 0 160 zabell s l aug 1992 r a fisher and fiducial argument statistical science 7 3 369 387 doi 10 1214 ss 1177011233 jstor 160 2246073 160 further reading edit casella g berger r l 2001 statistical inference duxbury press isbn 0 534 24312 6 david a freedman statistical models and shoe leather 1991 sociological methodology vol 21 pp 160 291 313 david a freedman statistical models and causal inferences a dialogue with the social sciences 2010 edited by david collier jasjeet s sekhon and philip b stark cambridge university press kruskal william december 1988 miracles and statistics the casual assumption of independence asa presidential address journal of the american statistical association 83 404 929 940 jstor 160 2290117 160 lenhard johannes 2006 models and statistical inference the controversy between fisher and neyman pearson british journal for the philosophy of science vol 57 issue 1 pp 160 69 91 lindley d 1958 fiducial distribution and bayes theorem journal of the royal statistical society series b 20 102 7 sudderth william d 1994 coherent inference and prediction in statistics in dag prawitz bryan skyrms and westerstahl eds logic methodology and philosophy of science ix proceedings of the ninth international congress of logic methodology and philosophy of science uppsala sweden august 7 14 1991 amsterdam elsevier trusted jennifer 1979 the logic of scientific inference an 
introduction london the macmillan press ltd young g a smith r l 2005 essentials of statistical inference cup isbn 0 521 83971 8 external links edit wikiversity has learning materials about statistical inference mit opencourseware statistical inference v t e statistics 160 descriptive statistics continuous data location mean arithmetic geometric harmonic median mode dispersion range standard deviation coefficient of variation percentile interquartile range shape variance skewness kurtosis moments l moments count data index of dispersion summary tables grouped data frequency distribution contingency table dependence pearson product moment correlation rank correlation spearman s rho kendall s tau partial correlation scatter plot statistical graphics bar chart biplot box plot control chart correlogram forest plot histogram q q plot run chart scatter plot stemplot radar chart 160 data collection designing studies effect size standard error statistical power sample size determination survey methodology sampling stratified sampling opinion poll questionnaire controlled experiment design of experiments randomized experiment random assignment replication blocking factorial experiment optimal design uncontrolled studies natural experiment quasi experiment observational study 160 statistical inference statistical theory sampling distribution order statistics sufficiency completeness exponential family permutation test randomization test empirical distribution bootstrap u statistic efficiency asymptotics robustness frequentist inference unbiased estimator mean unbiased minimum variance median unbiased biased estimators maximum likelihood method of moments minimum distance density estimation confidence interval testing hypotheses power parametric tests likelihood ratio wald score specific tests z normal student s t test f chi squared signed rank 1 sample 2 sample 1 way anova shapiro wilk kolmogorov smirnov bayesian inference bayesian probability prior posterior credible interval bayes factor bayesian estimator maximum posterior estimator 160 correlation and regression analysis correlation pearson product moment correlation partial correlation confounding variable coefficient of determination regression analysis errors and residuals regression model validation mixed effects models simultaneous equations models linear regression simple linear regression ordinary least squares general linear model bayesian regression non standard predictors nonlinear regression nonparametric semiparametric isotonic robust generalized linear model exponential families logistic bernoulli binomial poisson partition of variance analysis of variance anova analysis of covariance multivariate anova degrees of freedom 160 categorical multivariate time series or survival analysis categorical data cohen s kappa contingency table graphical model log linear model mcnemar s test multivariate statistics multivariate regression principal components factor analysis cluster analysis classification copulas time series analysis general decomposition trend stationarity seasonal adjustment time domain acf pacf xcf arma model arima model vector autoregression frequency domain spectral density estimation survival analysis survival function kaplan meier logrank test failure rate proportional hazards models accelerated failure time model 160 applications biostatistics bioinformatics clinical trials amp studies epidemiology medical statistics engineering statistics chemometrics methods engineering probabilistic design process amp quality control 
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_model b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_model
new file mode 100644
index 00000000..5f342fbf
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistical_model
@@ -0,0 +1 @@
+Statistical model (from Wikipedia, the free encyclopedia). A statistical model is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related. In mathematical terms, a statistical model is frequently thought of as a pair (S, P), where S is the set of possible observations and P the set of possible probability distributions on S. It is assumed that there is a distinct element of P which generates the observed data; statistical inference enables us to make statements about which element(s) of this set are likely to be the true one. Most statistical tests can be described in the form of a statistical model. For example, the Student's t-test for comparing the means of two groups can be formulated as seeing if an estimated parameter in the model is different from 0.
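As a minimal sketch of the idea above that a t-test checks a single model parameter (not part of the submitted files; it assumes numpy and scipy are available and uses simulated data), the following compares the classical two-sample t-test with the equivalent regression formulation in which the group indicator's coefficient b1 is tested against 0:

# Hypothetical illustration: the two-sample t-test viewed as a test of a model parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)   # sampled from N(5, 1)
group_b = rng.normal(loc=5.8, scale=1.0, size=30)   # sampled from N(5.8, 1)

# Classical two-sample t-test: are the group means different?
res = stats.ttest_ind(group_a, group_b)

# Same question as a model: y_i = b0 + b1 * group_i + e_i, and test whether b1 = 0.
y = np.concatenate([group_a, group_b])
g = np.concatenate([np.zeros_like(group_a), np.ones_like(group_b)])
X = np.column_stack([np.ones_like(g), g])            # design matrix [1, group]
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]        # least-squares estimates

print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}, estimated group effect b1 = {b1:.3f}")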
Another similarity between tests and models is that there are assumptions involved: error is assumed to be normally distributed in most models.[1] Contents: 1 Formal definition; 2 Model comparison; 3 An example; 4 Classification; 5 See also; 6 References. Formal definition: A statistical model is a collection of probability distribution functions or probability density functions (collectively referred to as distributions for brevity). A parametric model is a collection of distributions, each of which is indexed by a unique finite-dimensional parameter: P = {P_θ : θ ∈ Θ}, where θ is a parameter and Θ is the feasible region of parameters, which is a subset of d-dimensional Euclidean space. A statistical model may be used to describe the set of distributions from which one assumes that a particular data set is sampled. For example, if one assumes that data arise from a univariate Gaussian distribution, then one has assumed a Gaussian model. A non-parametric model is a set of probability distributions with infinite-dimensional parameters, and might be written as P = {all distributions}. A semi-parametric model also has infinite-dimensional parameters, but is not dense in the space of distributions. For example, a mixture of Gaussians with one Gaussian at each data point is dense in the space of distributions. Formally, if d is the dimension of the parameter and n is the number of samples, if d → ∞ as n → ∞ and d/n → 0 as n → ∞, then the model is semi-parametric. Model comparison: Models can be compared to each other. This can be done either when you have done an exploratory data analysis or a confirmatory data analysis. In an exploratory analysis, you formulate all the models you can think of and see which describes your data best. In a confirmatory analysis, you test which of the models you described before the data was collected fits the data best, or test whether your only model fits the data. In linear regression analysis, you can compare the amount of variance explained by the independent variables, R2, across the different models. In general, you can compare models that are nested by using a likelihood-ratio test. Nested models are models that can be obtained by restricting a parameter in a more complex model to be zero. An example: Height and age are probabilistically distributed over humans. They are stochastically related: when you know that a person is of age 7, this influences the chance of this person being 6 feet tall. You could formalize this relationship in a linear regression model of the following form: height_i = b0 + b1·age_i + ε_i, where b0 is the intercept, b1 is a parameter that age is multiplied by to get a prediction of height, ε is the error term, and i is the subject. This means that height starts at some value (there is a minimum height when someone is born) and is predicted by age to some amount. This prediction is not perfect, as error is included in the model. This error contains variance that stems from sex and other variables. When sex is included in the model, the error term will become smaller, as you will have a better idea of the chance that a particular 16-year-old is 6 feet tall when you know this 16-year-old is a girl. The model would become height_i = b0 + b1·age_i + b2·sex_i + ε_i, where the variable sex is dichotomous. This model would presumably have a higher R2. The first model is nested in the second model: the first model is obtained from the second when b2 is restricted to zero.
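A minimal sketch of the nested-model comparison just described (not part of the submitted files; it assumes numpy and scipy, uses simulated data, and applies the equivalent F-test for nested linear models rather than a likelihood-ratio test):

# Hypothetical sketch: compare height ~ age with height ~ age + sex.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
age = rng.uniform(1, 18, size=n)
sex = rng.integers(0, 2, size=n)                                 # dichotomous: 0/1
height = 75 + 6.0 * age + 8.0 * sex + rng.normal(0, 5, size=n)   # simulated heights (cm)

def fit_ols(X, y):
    """Least-squares fit; returns coefficients, residual sum of squares and R^2."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    return beta, rss, 1 - rss / tss

ones = np.ones(n)
X1 = np.column_stack([ones, age])         # model 1: height ~ age
X2 = np.column_stack([ones, age, sex])    # model 2: height ~ age + sex (model 1 is nested in it)

_, rss1, r2_1 = fit_ols(X1, height)
beta2, rss2, r2_2 = fit_ols(X2, height)

# F-test for the nested comparison: does freeing b2 (1 extra parameter) improve the fit?
df_num, df_den = 1, n - X2.shape[1]
F = ((rss1 - rss2) / df_num) / (rss2 / df_den)
p = stats.f.sf(F, df_num, df_den)
print(f"R^2: {r2_1:.3f} -> {r2_2:.3f}, F = {F:.2f}, p = {p:.3g}, b2 = {beta2[2]:.2f}")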
Classification: According to the number of the endogenous variables and the number of equations, models can be classified as complete models (the number of equations equals the number of endogenous variables) and incomplete models. Some other statistical models are the general linear model (restricted to continuous dependent variables), the generalized linear model (for example, logistic regression), the multilevel model, and the structural equation model.[2] See also: A/B testing, mathematical diagram, regression analysis. References: [1] Field, A. (2005). Discovering Statistics Using SPSS. Sage, London. [2] Adèr, H. J. (2008). "Chapter 12: Modelling." In H. J. Adèr & G. J. Mellenbergh (Eds.), with contributions by D. J. Hand, Advising on Research Methods: A Consultant's Companion (pp. 271-304). Huizen, The Netherlands: Johannes van Kessel Publishing.
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistics b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistics
new file mode 100644
index 00000000..77d0d8fe
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Statistics
@@ -0,0 +1 @@
+Statistics (from Wikipedia, the free encyclopedia). Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data.[1][2] It deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.[1] The word statistics, when referring to the scientific discipline, is singular, as in "statistics is an art."[3] This should not be confused with the word statistic, referring to a quantity (such as mean or median) calculated from a set of data,[4] whose plural is statistics ("this statistic seems wrong" or "these statistics are misleading"). [Figure: a normal distribution; more probability density is found the closer one gets to the expected (mean) value. Statistics used in standardized testing assessment are shown; the scales include standard deviations, cumulative percentages, percentile equivalents, z-scores, T-scores, standard nines, and percentages in standard nines.]
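As a small, purely illustrative sketch of the scales named in the figure caption and of a "statistic" as a quantity computed from data (not from the submitted files; it assumes numpy and scipy, and the scores are invented):

# Hypothetical illustration: descriptive statistics and standard-normal scales.
import numpy as np
from scipy import stats

scores = np.array([52, 61, 45, 70, 58, 66, 49, 63, 55, 60], dtype=float)

mean = scores.mean()            # a "statistic": a quantity calculated from the data
median = np.median(scores)
std = scores.std(ddof=1)        # sample standard deviation

z = (scores - mean) / std                  # z-scores
percentiles = stats.norm.cdf(z) * 100      # cumulative percentages under a normal model
t_scores = 50 + 10 * z                     # T-scores (mean 50, sd 10)

print(f"mean={mean:.1f}, median={median:.1f}, sd={std:.2f}")
print("z:", np.round(z, 2))
print("percentile:", np.round(percentiles, 1))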
Contents: 1 Scope; 2 History; 3 Overview; 4 Statistical methods (4.1 Experimental and observational studies: 4.1.1 Experiments, 4.1.2 Observational study; 4.2 Levels of measurement; 4.3 Key terms used in statistics: 4.3.1 Null hypothesis, 4.3.2 Error, 4.3.3 Interval estimation, 4.3.4 Significance; 4.4 Examples); 5 Specialized disciplines; 6 Statistical computing; 7 Misuse; 8 Statistics applied to mathematics or the arts; 9 See also; 10 References. Scope: Some consider statistics a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data,[5] while others consider it a branch of mathematics[6] concerned with collecting and interpreting data. Because of its empirical roots and its focus on applications, statistics is usually considered a distinct mathematical science rather than a branch of mathematics.[7][8] Much of statistics is non-mathematical: ensuring that data collection is undertaken in a way that produces valid conclusions; coding and archiving data so that information is retained and made useful for international comparisons of official statistics; reporting of results and summarised data (tables and graphs) in ways comprehensible to those who must use them; implementing procedures that ensure the privacy of census information. Statisticians improve data quality by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting (the use of data and statistical models). Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions. Statistical methods can summarize or describe a collection of data; this is called descriptive statistics. This is particularly useful in communicating the results of experiments and research. In addition, data patterns may be modeled in a way that accounts for randomness and uncertainty in the observations; these models can be used to draw inferences about the process or population under study, a practice called inferential statistics. Inference is a vital element of scientific advance, since it provides a way to draw conclusions from data that are subject to random variation. To prove the propositions being investigated further, the conclusions are tested as well, as part of the scientific method. Descriptive statistics and analysis of the new data tend to provide more information as to the truth of the proposition. Applied statistics comprises descriptive statistics and the application of inferential statistics.[9][verification needed] Theoretical statistics concerns both the logical arguments underlying justification of approaches to statistical inference, as well as encompassing mathematical statistics. Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments. Statistics is closely related to probability theory, with which it is often grouped; the difference is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction, inductively inferring from samples to the
parameters of a larger or total population history edit main articles history of statistics and founders of statistics statistical methods date back at least to the 5th century bc the earliest known writing on statistics appears in a 9th century book entitled manuscript on deciphering cryptographic messages written by al kindi in this book al kindi provides a detailed description of how to use statistics and frequency analysis to decipher encrypted messages this was the birth of both statistics and cryptanalysis according to the saudi engineer ibrahim al kadi 10 11 the nuova cronica a 14th century history of florence by the florentine banker and official giovanni villani includes much statistical information on population ordinances commerce education and religious facilities and has been described as the first introduction of statistics as a positive element in history 12 some scholars pinpoint the origin of statistics to 1663 with the publication of natural and political observations upon the bills of mortality by john graunt 13 early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data hence its stat etymology the scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general today statistics is widely employed in government business and natural and social sciences its mathematical foundations were laid in the 17th century with the development of the probability theory by blaise pascal and pierre de fermat probability theory arose from the study of games of chance the method of least squares was first described by carl friedrich gauss around 1794 the use of modern computers has expedited large scale statistical computation and has also made possible new methods that are impractical to perform manually overview edit in applying statistics to a scientific industrial or societal problem it is necessary to begin with a population or process to be studied populations can be diverse topics such as all persons living in a country or every atom composing a crystal a population can also be composed of observations of a process at various times with the data from each observation serving as a different member of the overall group data collected about this kind of population constitutes what is called a time series for practical reasons a chosen subset of the population called a sample is studied as opposed to compiling data about the entire group an operation called census once a sample that is representative of the population is determined data is collected for the sample members in an observational or experimental setting this data can then be subjected to statistical analysis serving two related purposes description and inference descriptive statistics summarize the population data by describing what was observed in the sample numerically or graphically numerical descriptors include mean and standard deviation for continuous data types like heights or weights while frequency and percentage are more useful in terms of describing categorical data like race inferential statistics uses patterns in the sample data to draw inferences about the population represented accounting for randomness these inferences may take the form of answering yes no questions about the data hypothesis testing estimating numerical characteristics of the data estimation describing associations within the data correlation and modeling relationships within the data for example using 
regression analysis inference can extend to forecasting prediction and estimation of unobserved values either in or associated with the population being studied it can include extrapolation and interpolation of time series or spatial data and can also include data mining 14 it is only the manipulation of uncertainty that interests us we are not concerned with the matter that is uncertain thus we do not study the mechanism of rain only whether it will rain dennis lindley 2000 15 the concept of correlation is particularly noteworthy for the potential confusion it can cause statistical analysis of a data set often reveals that two variables properties of the population under consideration tend to vary together as if they were connected for example a study of annual income that also looks at age of death might find that poor people tend to have shorter lives than affluent people the two variables are said to be correlated however they may or may not be the cause of one another the correlation phenomena could be caused by a third previously unconsidered phenomenon called a lurking variable or confounding variable for this reason there is no way to immediately infer the existence of a causal relationship between the two variables see correlation does not imply causation to use a sample as a guide to an entire population it is important that it truly represent the overall population representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole a major problem lies in determining the extent that the sample chosen is actually representative statistics offers methods to estimate and correct for any random trending within the sample and data collection procedures there are also methods of experimental design for experiments that can lessen these issues at the outset of a study strengthening its capability to discern truths about the population randomness is studied using the mathematical discipline of probability theory probability is used in mathematical statistics alternatively statistical theory to study the sampling distributions of sample statistics and more generally the properties of statistical procedures the use of any statistical method is valid when the system or population under consideration satisfies the assumptions of the method misuse of statistics can produce subtle but serious errors in description and interpretation subtle in the sense that even experienced professionals make such errors and serious in the sense that they can lead to devastating decision errors for instance social policy medical practice and the reliability of structures like bridges all rely on the proper use of statistics see below for further discussion even when statistical techniques are correctly applied the results can be difficult to interpret for those lacking expertise the statistical significance of a trend in the data which measures the extent to which a trend could be caused by random variation in the sample may or may not agree with an intuitive sense of its significance the set of basic statistical skills and skepticism that people need to deal with information in their everyday lives properly is referred to as statistical literacy statistical methods edit experimental and observational studies edit a common goal for a statistical research project is to investigate causality and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables or response there are 
two major types of causal statistical studies experimental studies and observational studies in both types of studies the effect of differences of an independent variable or variables on the behavior of the dependent variable are observed the difference between the two types lies in how the study is actually conducted each can be very effective an experimental study involves taking measurements of the system under study manipulating the system and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements in contrast an observational study does not involve experimental manipulation instead data are gathered and correlations between predictors and response are investigated experiments edit the basic steps of a statistical experiment are planning the research including finding the number of replicates of the study using the following information preliminary estimates regarding the size of treatment effects alternative hypotheses and the estimated experimental variability consideration of the selection of experimental subjects and the ethics of research is necessary statisticians recommend that experiments compare at least one new treatment with a standard treatment or control to allow an unbiased estimate of the difference in treatment effects design of experiments using blocking to reduce the influence of confounding variables and randomized assignment of treatments to subjects to allow unbiased estimates of treatment effects and experimental error at this stage the experimenters and statisticians write the experimental protocol that shall guide the performance of the experiment and that specifies the primary analysis of the experimental data performing the experiment following the experimental protocol and analyzing the data following the experimental protocol further examining the data set in secondary analyses to suggest new hypotheses for future study documenting and presenting the results of the study experiments on human behavior have special concerns the famous hawthorne study examined changes to the working environment at the hawthorne plant of the western electric company the researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers the researchers first measured the productivity in the plant then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity it turned out that productivity indeed improved under the experimental conditions however the study is heavily criticized today for errors in experimental procedures specifically for the lack of a control group and blindness the hawthorne effect refers to finding that an outcome in this case worker productivity changed due to observation itself those in the hawthorne study became more productive not because the lighting was changed but because they were being observed citation needed observational study edit an example of an observational study is one that explores the correlation between smoking and lung cancer this type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis in this case the researchers would collect observations of both smokers and non smokers perhaps through a case control study and then look for the number of cases of lung cancer in each group levels of measurement edit main article levels of measurement there are four main levels of 
measurement used in statistics nominal ordinal interval and ratio 16 each of these have different degrees of usefulness in statistical research ratio measurements have both a meaningful zero value and the distances between different measurements defined they provide the greatest flexibility in statistical methods that can be used for analyzing the data citation needed interval measurements have meaningful distances between measurements defined but the zero value is arbitrary as in the case with longitude and temperature measurements in celsius or fahrenheit ordinal measurements have imprecise differences between consecutive values but have a meaningful order to those values nominal measurements have no meaningful rank order among values because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically sometimes they are grouped together as categorical variables whereas ratio and interval measurements are grouped together as quantitative variables which can be either discrete or continuous due to their numerical nature key terms used in statistics edit null hypothesis edit interpretation of statistical information can often involve the development of a null hypothesis in that the assumption is that whatever is proposed as a cause has no effect on the variable being measured the best illustration for a novice is the predicament encountered by a jury trial the null hypothesis h0 asserts that the defendant is innocent whereas the alternative hypothesis h1 asserts that the defendant is guilty the indictment comes because of suspicion of the guilt the h0 status quo stands in opposition to h1 and is maintained unless h1 is supported by evidence beyond a reasonable doubt however failure to reject h0 in this case does not imply innocence but merely that the evidence was insufficient to convict so the jury does not necessarily accept h0 but fails to reject h0 while one can not prove a null hypothesis one can test how close it is to being true with a power test which tests for type ii errors error edit working from a null hypothesis two basic forms of error are recognized type i errors where the null hypothesis is falsely rejected giving a false positive type ii errors where the null hypothesis fails to be rejected and an actual difference between populations is missed giving a false negative error also refers to the extent to which individual observations in a sample differ from a central value such as the sample or population mean many statistical methods seek to minimize the mean squared error and these are called methods of least squares measurement processes that generate statistical data are also subject to error many of these errors are classified as random noise or systematic bias but other important types of errors e g blunder such as when an analyst reports incorrect units can also be important interval estimation edit main article interval estimation most studies only sample part of a population so results don t fully represent the whole population any estimates obtained from the sample only approximate the population value confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population often they are expressed as 95 confidence intervals formally a 95 confidence interval for a value is a range where if the sampling and analysis were repeated under the same conditions yielding a different dataset the interval would include the true population value 95 of the time this does not 
imply that the probability that the true value is in the confidence interval is 95 from the frequentist perspective such a claim does not even make sense as the true value is not a random variable either the true value is or is not within the given interval however it is true that before any data are sampled and given a plan for how to construct the confidence interval the probability is 95 that the yet to be calculated interval will cover the true value at this point the limits of the interval are yet to be observed random variables one approach that does yield an interval that can be interpreted as having a given probability of containing the true value is to use a credible interval from bayesian statistics this approach depends on a different way of interpreting what is meant by probability that is as a bayesian probability significance edit main article statistical significance this section includes a list of references related reading or external links but the sources of this section remain unclear because it lacks inline citations please improve this article by introducing more precise citations may 2012 statistics rarely give a simple yes no type answer to the question asked of them interpretation often comes down to the level of statistical significance applied to the numbers and often refers to the probability of a value accurately rejecting the null hypothesis sometimes referred to as the p value referring to statistical significance does not necessarily mean that the overall result is significant in real world terms for example in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect such that the drug is unlikely to help the patient noticeably criticisms arise because the hypothesis testing approach forces one hypothesis the null hypothesis to be favored and can also seem to exaggerate the importance of minor differences in large studies a difference that is highly statistically significant can still be of no practical significance but it is possible to properly formulate tests in account for this see also criticism of hypothesis testing one response involves going beyond reporting only the significance level to include the p value when reporting whether a hypothesis is rejected or accepted the p value however does not indicate the size of the effect a better and increasingly common approach is to report confidence intervals although these are produced from the same calculations as those of hypothesis tests or p values they describe both the size of the effect and the uncertainty surrounding it examples edit some well known statistical tests and procedures are analysis of variance anova chi squared test correlation factor analysis mann whitney u mean square weighted deviation mswd pearson product moment correlation coefficient regression analysis spearman s rank correlation coefficient student s t test time series analysis specialized disciplines edit main article list of fields of application of statistics statistical techniques are used in a wide range of types of scientific and social research including biostatistics computational biology computational sociology network biology social science sociology and social research some fields of inquiry use applied statistics so extensively that they have specialized terminology these disciplines include actuarial science assesses risk in the insurance and finance industries applied information economics biostatistics business statistics chemometrics for analysis of 
data from chemistry data mining applying statistics and pattern recognition to discover knowledge from data demography econometrics energy statistics engineering statistics epidemiology geography and geographic information systems specifically in spatial analysis image processing psychological statistics reliability engineering social statistics in addition there are particular types of statistical analysis that have also developed their own specialised terminology and methodology bootstrap amp jackknife resampling multivariate statistics statistical classification statistical surveys structured data analysis statistics structural equation modelling survival analysis statistics in various sports particularly baseball and cricket statistics form a key basis tool in business and manufacturing as well it is used to understand measurement systems variability control processes as in statistical process control or spc for summarizing data and to make data driven decisions in these roles it is a key tool and perhaps the only reliable tool statistical computing edit gretl an example of an open source statistical package main article computational statistics the rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science early statistical models were almost always from the class of linear models but powerful computers coupled with suitable numerical algorithms caused an increased interest in nonlinear models such as neural networks as well as the creation of new types such as generalized linear models and multilevel models increased computing power has also led to the growing popularity of computationally intensive methods based on resampling such as permutation tests and the bootstrap while techniques such as gibbs sampling have made use of bayesian models more feasible the computer revolution has implications for the future of statistics with new emphasis on experimental and empirical statistics a large number of both general and special purpose statistical software are now available misuse edit main article misuse of statistics there is a general perception that statistical knowledge is all too frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter 17 a mistrust and misunderstanding of statistics is associated with the quotation there are three kinds of lies lies damned lies and statistics misuse of statistics can be both inadvertent and intentional and the book how to lie with statistics 17 outlines a range of considerations in an attempt to shed light on the use and misuse of statistics reviews of statistical techniques used in particular fields are conducted e g warne lazo ramos and ritter 2012 18 ways to avoid misuse of statistics include using proper diagrams and avoiding bias 19 misuse can occur when conclusions are overgeneralized and claimed to be representative of more than they really are often by either deliberately or unconsciously overlooking sampling bias 20 bar graphs are arguably the easiest diagrams to use and understand and they can be made either by hand or with simple computer programs 19 unfortunately most people do not look for bias or errors so they are not noticed thus people may often believe that something is true even if it is not well represented 20 to make data gathered from statistics believable and accurate the sample taken must be representative of the whole 21 according to huff the dependability of a 
sample can be destroyed by bias allow yourself some degree of skepticism 22 to assist in the understanding of statistics huff proposed a series of questions to be asked in each case 22 who says so does he she have an axe to grind how does he she know does he she have the resources to know the facts what s missing does he she give us a complete picture did someone change the subject does he she offer us the right answer to the wrong problem does it make sense is his her conclusion logical and consistent with what we already know statistics applied to mathematics or the arts edit traditionally statistics was concerned with drawing inferences using a semi standardized methodology that was required learning in most sciences this has changed with use of statistics in non inferential contexts what was once considered a dry subject taken in many fields as a degree requirement is now viewed enthusiastically initially derided by some mathematical purists it is now considered essential methodology in certain areas in number theory scatter plots of data generated by a distribution function may be transformed with familiar tools used in statistics to reveal underlying patterns which may then lead to hypotheses methods of statistics including predictive methods in forecasting are combined with chaos theory and fractal geometry to create video works that are considered to have great beauty the process art of jackson pollock relied on artistic experiments whereby underlying distributions in nature were artistically revealed citation needed with the advent of computers statistical methods were applied to formalize such distribution driven natural processes to make and analyze moving video art citation needed methods of statistics may be used predicatively in performance art as in a card trick based on a markov process that only works some of the time the occasion of which can be predicted using statistical methodology statistics can be used to predicatively create art as in the statistical or stochastic music invented by iannis xenakis where the music is performance specific though this type of artistry does not always come out as expected it does behave in ways that are predictable and tunable using statistics see also edit statistics portal find more about statistics at wikipedia s sister projects definitions and translations from wiktionary media from commons learning resources from wikiversity news stories from wikinews quotations from wikiquote source texts from wikisource textbooks from wikibooks main article outline of statistics glossary of probability and statistics notation in probability and statistics list of statistics articles list of academic statistical associations list of national and international statistical services list of important publications in statistics list of university statistical consulting centers list of statistical packages software foundations of statistics list of statisticians official statistics multivariate analysis of variance references edit a b dodge y 2006 the oxford dictionary of statistical terms oup isbn 0 19 920613 9 the free online dictionary statistics merriam webster online dictionary 160 statistic merriam webster online dictionary 160 moses lincoln e 1986 think and explain with statistics addison wesley isbn 978 0 201 15619 5 pp 1 3 hays william lee 1973 statistics for the social sciences holt rinehart and winston p xii isbn 978 0 03 077945 9 moore david 1992 teaching statistics as a respectable subject in f gordon and s gordon statistics for the twenty 
-first Century. Washington, DC: The Mathematical Association of America, pp. 14-25. ISBN 978-0-88385-078-7. Chance, Beth L.; Rossman, Allan J. (2005). "Preface." Investigating Statistical Concepts, Applications and Methods. Duxbury Press. ISBN 978-0-495-05064-3. Anderson, D. R.; Sweeney, D. J.; Williams, T. A. (1994). Introduction to Statistics: Concepts and Applications, pp. 5-9. West Group. ISBN 978-0-314-03309-3. Al-Kadi, Ibrahim A. (1992). "The origins of cryptology: The Arab contributions." Cryptologia 16 (2): 97-126. doi:10.1080/0161-119291866801. Singh, Simon (2000). The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography (1st Anchor Books ed.). New York: Anchor Books. ISBN 0-385-49532-3. [page needed] "Villani, Giovanni." Encyclopaedia Britannica, Encyclopaedia Britannica 2006 Ultimate Reference Suite DVD. Retrieved 2008-03-04. Willcox, Walter (1938). "The Founder of Statistics." Review of the International Statistical Institute 5 (4): 321-328. JSTOR 1400906. Breiman, Leo (2001). "Statistical Modelling: The Two Cultures." Statistical Science 16 (3): 199-231. doi:10.1214/ss/1009213726. MR 1874152. CiteSeerX 10.1.1.156.4933. Lindley, D. (2000). "The Philosophy of Statistics." Journal of the Royal Statistical Society, Series D 49 (3): 293-337. doi:10.1111/1467-9884.00238. JSTOR 2681060. Thompson, B. (2006). Foundations of Behavioral Statistics. New York, NY: Guilford Press. Huff, Darrell (1954). How to Lie with Statistics. New York, NY: W. W. Norton & Company, Inc. ISBN 0-393-31072-8. Warne, R.; Lazo, M.; Ramos, T.; Ritter, N. (2012). "Statistical methods used in gifted education journals, 2006-2010." Gifted Child Quarterly 56 (3): 134-149. doi:10.1177/0016986212444122. Drennan, Robert D. (2008). "Statistics in archaeology." In Pearsall, Deborah M., Encyclopedia of Archaeology. Elsevier Inc., pp. 2093-2100. ISBN 978-0-12-373962-9. Cohen, Jerome B. (December 1938). "Misuse of Statistics." Journal of the American Statistical Association 33 (204): 657-674. doi:10.1080/01621459.1938.10502344. Freund, J. F. (1988). Modern Elementary Statistics. Credo Reference. Huff, Darrell; Geis, Irving (1954). How to Lie with Statistics. New York: Norton. "The dependability of a sample can be destroyed by bias... allow yourself some degree of skepticism."
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Structure_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Structure_mining
new file mode 100644
index 00000000..a752196f
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Structure_mining
@@ -0,0 +1 @@
+Structure mining (from Wikipedia, the free encyclopedia). Structure mining or structured data mining is the process of finding and extracting useful information from semi-structured data sets. Graph mining is a special case of structured data mining.[citation needed] Contents: 1 Description; 2 See also; 3 References; 4 External links. Description: The growth of the use of semi-structured data has created new opportunities for data mining, which has traditionally been concerned with tabular data sets, reflecting the strong association between data mining and relational databases. Much of the world's interesting and mineable data does not easily fold into relational databases, though a generation of software engineers have been trained to believe this was the only way to handle data, and data mining algorithms have generally been developed only to cope with tabular data. XML, being the most frequent way of representing semi-structured data, is able to represent both tabular data and arbitrary trees. Any particular representation of data to be exchanged between two applications in XML is normally described by a schema, often written in XSD. Practical examples of such schemata, for instance NewsML, are normally very sophisticated, containing multiple optional subtrees used for representing special-case data. Frequently around 90% of a schema is concerned with the definition of these optional data items and sub-trees.
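A tiny, hypothetical illustration of the schema-sparsity point above (not part of the submitted files; the element names and documents are invented, and it uses only the Python standard library): two XML messages following the same loose schema but filling different optional elements, flattened into a table with empty cells.

# Hypothetical sketch: optional XML elements become missing values in a tabular view.
import xml.etree.ElementTree as ET

messages = [
    "<article><title>SVM survey</title><doi>10.1000/x1</doi></article>",
    "<article><title>Graph mining</title><isbn>978-3-16-148410-0</isbn><year>2007</year></article>",
]

columns = ["title", "doi", "isbn", "year"]   # union of fields allowed by the (implied) schema
rows = []
for xml_text in messages:
    root = ET.fromstring(xml_text)
    # Optional elements that are absent become empty strings in the tabular view.
    rows.append({c: (root.findtext(c) or "") for c in columns})

for row in rows:
    print(row)
# With large schemas made mostly of optional sub-trees, such rows are mostly empty,
# which is the sparsity problem the article describes for conventional data mining.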
Messages and data, therefore, that are transmitted or encoded using XML and that conform to the same schema are liable to contain very different data depending on what is being transmitted. Such data presents large problems for conventional data mining: two messages that conform to the same schema may have little data in common. Building a training set from such data means that, if one were to try to format it as tabular data for conventional data mining, large sections of the tables would or could be empty. There is a tacit assumption made in the design of most data mining algorithms that the data presented will be complete. The other desideratum is that the actual mining algorithms employed, whether supervised or unsupervised, must be able to handle sparse data; namely, machine learning algorithms perform badly with incomplete data sets where only part of the information is supplied. For instance, methods based on neural networks[citation needed] or Ross Quinlan's ID3 algorithm[citation needed] are highly accurate with good and representative samples of the problem, but perform badly with biased data. Most of the time, better model presentation, with more careful and unbiased representation of input and output, is enough. A particularly relevant area where finding the appropriate structure and model is the key issue is text mining. XPath is the standard mechanism used to refer to nodes and data items within XML; it has similarities to the standard techniques for navigating directory hierarchies used in operating-system user interfaces. To data- and structure-mine XML data of any form, at least two extensions are required to conventional data mining: the ability to associate an XPath statement with any data pattern and sub-statements with each data node in the data pattern, and the ability to mine the presence and count of any node, or set of nodes, within the document. As an example, if one were to represent a family tree in XML, using these extensions one could create a data set containing all the individuals in the tree, data items such as name and age at death, and counts of related nodes, such as number of children. More sophisticated searches could extract data such as grandparents' lifespans, etc. The addition of these data types related to the structure of a document or message facilitates structure mining. See also: molecule mining, sequence mining, data mining, data warehousing, structured content.
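A hypothetical sketch of the family-tree example just described (not part of the submitted files; the XML is invented and only the limited XPath subset supported by Python's xml.etree.ElementTree is used): per-person features are derived from the document, including counts of related nodes.

# Hypothetical sketch: structure-derived features (node counts) from a family-tree XML.
import xml.etree.ElementTree as ET

xml_text = """
<family>
  <person name="Ada" born="1901" died="1980">
    <person name="Ben" born="1925" died="1999">
      <person name="Cara" born="1950"/>
      <person name="Dan" born="1953"/>
    </person>
  </person>
</family>
"""

root = ET.fromstring(xml_text)
records = []
for person in root.iter("person"):
    born = person.get("born")
    died = person.get("died")
    records.append({
        "name": person.get("name"),
        # Age at death is only defined when a death year is recorded.
        "age_at_death": int(died) - int(born) if died else None,
        # Count of directly nested <person> elements = number of children.
        "n_children": len(person.findall("person")),
        # Count of all nested <person> elements = number of descendants.
        "n_descendants": len(person.findall(".//person")),
    })

for r in records:
    print(r)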
\ No newline at end of file
diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Support_vector_machines b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Support_vector_machines
new file mode 100644
index 00000000..df2d7f13
--- /dev/null
+++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Support_vector_machines
@@ -0,0 +1 @@
+Support vector machine (from Wikipedia, the free encyclopedia; redirected from Support vector machines). Not to be confused with Secure Virtual Machine. In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Contents: 1 Formal definition; 2 History; 3 Motivation; 4 Linear SVM (4.1 Primal form, 4.2 Dual form, 4.3 Biased and unbiased hyperplanes); 5 Soft margin (5.1 Dual form); 6 Nonlinear classification; 7 Properties (7.1 Parameter selection, 7.2 Issues); 8 Extensions (8.1 Multiclass SVM, 8.2 Transductive support vector machines, 8.3 Structured SVM, 8.4 Regression); 9 Implementation; 10 Applications; 11 See also; 12 References; 13 External links; 14 Bibliography. Formal definition: More formally, a support vector machine constructs a hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space.
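As a minimal usage sketch of the classification and kernel-trick ideas above (not part of the submitted files; it assumes scikit-learn and numpy are installed and uses simulated ring-shaped data):

# Hypothetical sketch: linear kernel vs. RBF kernel on data that are not linearly separable.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Two classes arranged in concentric rings: not linearly separable in the input space.
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
angle = rng.uniform(0, 2 * np.pi, 400)
X = np.column_stack([r * np.cos(angle), r * np.sin(angle)])
y = np.concatenate([np.zeros(200), np.ones(200)])

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>6} kernel: mean CV accuracy = {score:.2f}")
# The RBF kernel implicitly maps the points into a space where a separating
# hyperplane exists, so it should clearly outperform the linear kernel here.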
To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function k(x, y) selected to suit the problem.[2] The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters α_i, of images of feature vectors that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation Σ_i α_i k(x_i, x) = constant. Note that if k(x, y) becomes small as y grows further away from x, each element in the sum measures the degree of closeness of the test point x to the corresponding data base point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note the fact that the set of points x mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space. History: The original SVM algorithm was invented by Vladimir N. Vapnik, and the current standard incarnation (soft margin) was proposed by Vapnik and Corinna Cortes in 1995.[1] Motivation: [Figure: three candidate hyperplanes: H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximum margin.] Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier, or equivalently, the perceptron of optimal stability. Linear SVM: Given some training data D, a set of n points of the form D = {(x_i, y_i), i = 1, ..., n}, where y_i is either 1 or −1, indicating the class to which the point x_i belongs, and each x_i is a p-dimensional real vector, we want to find the maximum-margin hyperplane that divides the points having y_i = 1 from those having y_i = −1. Any hyperplane can be written as the set of points x satisfying w · x − b = 0, where · denotes the dot product and w the normal vector to the hyperplane. [Figure: the maximum-margin hyperplane and margins for an SVM trained with samples from two classes; samples on the margin are called the support vectors.] The parameter b/||w|| determines the offset of the hyperplane from the origin along the normal vector w. If the training data are linearly separable, we can select two hyperplanes in a way that they separate the data and there are no points between them, and then try to maximize their distance. The region bounded by them is called the margin. These hyperplanes can be described by the equations w · x − b = 1 and w · x − b = −1. By using geometry, we find the distance between these two hyperplanes is 2/||w||, so we want to minimize ||w||. As we also have to prevent data points from falling into the margin, we add the following constraint: for each i, either w · x_i − b ≥ 1 (for x_i of the first class) or w · x_i − b ≤ −1 (for x_i of the second). This can be rewritten as y_i (w · x_i − b) ≥ 1, for all 1 ≤ i ≤ n. We can put this together to get the optimization problem: minimize ||w|| (in w, b) subject to y_i (w · x_i − b) ≥ 1 for any i = 1, ..., n.
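A hypothetical sketch of the margin geometry just derived (not from the submitted files; it assumes scikit-learn and numpy, and the tiny data set is invented). A very large C approximates the hard-margin problem, so the fitted model exposes w, b, and the margin width 2/||w||:

# Hypothetical sketch: recover w, b and the margin 2/||w|| from a linear SVM fit.
import numpy as np
from sklearn.svm import SVC

# A tiny linearly separable data set; labels follow the +1 / -1 convention above.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin (no slack)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = -clf.intercept_[0]    # offset, so the hyperplane is w . x - b = 0
margin = 2.0 / np.linalg.norm(w)

print("w =", np.round(w, 3), " b =", round(float(b), 3))
print("margin width 2/||w|| =", round(margin, 3))
print("support vectors:\n", clf.support_vectors_)
# The decision values w . x - b are roughly +1 or -1 on the support vectors.
print("values on support vectors:", np.round(X[clf.support_] @ w - b, 2))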
Primal form: The optimization problem presented in the preceding section is difficult to solve because it depends on $\|w\|$, the norm of $w$, which involves a square root. Fortunately it is possible to alter the problem by substituting $\|w\|$ with $\tfrac{1}{2}\|w\|^2$ (the factor of 1/2 being used for mathematical convenience) without changing the solution: the minimum of the original and the modified problem have the same $w$ and $b$. This is a quadratic programming optimization problem. More clearly: minimize (in $w, b$) $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i - b) \ge 1$ for any $i = 1, \dots, n$. By introducing Lagrange multipliers $\alpha_i \ge 0$, the previous constrained problem can be expressed as $\min_{w,b} \max_{\alpha \ge 0} \left\{ \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] \right\}$, that is, we look for a saddle point. In doing so, all the points which can be separated as $y_i (w \cdot x_i - b) - 1 > 0$ do not matter, since we must set the corresponding $\alpha_i$ to zero. This problem can now be solved by standard quadratic programming techniques and programs. The "stationary" Karush-Kuhn-Tucker condition implies that the solution can be expressed as a linear combination of the training vectors, $w = \sum_{i=1}^{n} \alpha_i y_i x_i$. Only a few $\alpha_i$ will be greater than zero; the corresponding $x_i$ are exactly the support vectors, which lie on the margin and satisfy $y_i (w \cdot x_i - b) = 1$. From this one can derive that the support vectors also satisfy $w \cdot x_i - b = 1/y_i = y_i$, which allows one to define the offset $b = w \cdot x_i - y_i$. In practice, it is more robust to average over all $N_{SV}$ support vectors: $b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} (w \cdot x_i - y_i)$. Dual form: Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is only a function of the support vectors, the subset of the training data that lie on the margin. Using the fact that $\|w\|^2 = w \cdot w$ and substituting $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, one can show that the dual of the SVM reduces to the following optimization problem: maximize (in $\alpha_i$) $\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^{\mathsf T} x_j = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ subject to $\alpha_i \ge 0$ for any $i = 1, \dots, n$, and to the constraint from the minimization in $b$: $\sum_{i=1}^{n} \alpha_i y_i = 0$. Here the kernel is defined by $k(x_i, x_j) = x_i \cdot x_j$. $w$ can be computed thanks to the $\alpha$ terms: $w = \sum_i \alpha_i y_i x_i$. Biased and unbiased hyperplanes: For simplicity reasons, sometimes it is required that the hyperplane pass through the origin of the coordinate system. Such hyperplanes are called unbiased, whereas general hyperplanes not necessarily passing through the origin are called biased. An unbiased hyperplane can be enforced by setting $b = 0$ in the primal optimization problem. The corresponding dual is identical to the dual given above without the equality constraint $\sum_i \alpha_i y_i = 0$. Soft margin: In 1995, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea that allows for mislabeled examples.[1] If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces non-negative slack variables $\xi_i$, which measure the degree of misclassification of the data $x_i$: $y_i (w \cdot x_i - b) \ge 1 - \xi_i$, for $1 \le i \le n$. The objective function is then increased by a function which penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes: minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (in $w, \xi, b$) subject to $y_i (w \cdot x_i - b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for any $i = 1, \dots, n$. This constraint, along with the objective of minimizing $\|w\|$, can be handled using Lagrange multipliers as done above; one has then to solve the following problem: $\min_{w,\xi,b} \max_{\alpha,\beta} \left\{ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i - b) - 1 + \xi_i \right] - \sum_i \beta_i \xi_i \right\}$ with $\alpha_i, \beta_i \ge 0$. Dual form: maximize (in $\alpha_i$) $\tilde{L}(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ subject to $0 \le \alpha_i \le C$ for any $i = 1, \dots, n$, and $\sum_{i=1}^{n} \alpha_i y_i = 0$. The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant $C$ appearing only as an additional constraint on the Lagrange multipliers.
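The derivation above works through the QP dual. As a rough illustration, one can also attack the equivalent primal hinge-loss form, $\tfrac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i(w \cdot x_i - b))$, directly with sub-gradient descent. The NumPy sketch below does that on synthetic data; the learning rate, epoch count, and toy data are illustrative assumptions, and a production implementation would rather use one of the dedicated solvers mentioned later (SMO, libsvm, LIBLINEAR).

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Soft-margin linear SVM trained by batch sub-gradient descent on the
    primal hinge-loss objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i - b)).
    This is the unconstrained hinge-loss form of the problem above, not the
    QP dual; it is meant only as an illustration."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1                      # points inside the margin or misclassified
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = C * y[active].sum()              # derivative of the active hinge terms w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two well-separated Gaussian blobs as toy training data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(20, 2)),
               rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w - b) == y))
```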
For this formulation and its huge impact in practice, Cortes and Vapnik received the 2008 ACM Paris Kanellakis Award.[3] Nonlinear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken the problem becomes non-convex, and thus it is considerably more difficult to find a global solution. Nonlinear classification: (Figure: kernel machine.) The original optimal hyperplane algorithm proposed by Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al.[4]) to maximum-margin hyperplanes.[5] The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; thus, though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space. If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions. Maximum margin classifiers are well regularized, so the infinite dimensions do not spoil the results. Some common kernels include: polynomial (homogeneous), $k(x_i, x_j) = (x_i \cdot x_j)^d$; polynomial (inhomogeneous), $k(x_i, x_j) = (x_i \cdot x_j + 1)^d$; Gaussian radial basis function, $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ for $\gamma > 0$, sometimes parametrized using $\gamma = 1/(2\sigma^2)$; hyperbolic tangent, $k(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + c)$ for some (not every) $\kappa > 0$ and $c < 0$. The kernel is related to the transform $\varphi(x)$ by the equation $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. The value $w$ is also in the transformed space, with $w = \sum_i \alpha_i y_i \varphi(x_i)$. Dot products with $w$ for classification can again be computed by the kernel trick, i.e. $w \cdot \varphi(x) = \sum_i \alpha_i y_i\, k(x_i, x)$. However, there does not in general exist a value $w'$ such that $w \cdot \varphi(x) = k(w', x)$. Properties: SVMs belong to a family of generalized linear classifiers and can be interpreted as an extension of the perceptron. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. A comparison of the SVM to other classifiers has been made by Meyer, Leisch and Hornik.[6] Parameter selection: The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. A common choice is a Gaussian kernel, which has a single parameter $\gamma$. The best combination of $C$ and $\gamma$ is often selected by a grid search with exponentially growing sequences of $C$ and $\gamma$, for example $C \in \{2^{-5}, 2^{-3}, \dots, 2^{13}, 2^{15}\}$ and $\gamma \in \{2^{-15}, 2^{-13}, \dots, 2^{1}, 2^{3}\}$. Typically, each combination of parameter choices is checked using cross-validation, and the parameters with the best cross-validation accuracy are picked. The final model, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters.[7]
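As a concrete illustration of this grid search, the sketch below uses scikit-learn (an illustrative library choice, not one the article prescribes) to cross-validate an RBF-kernel SVM over exponentially spaced values of C and gamma on synthetic data; the data set and grid ranges are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data stands in for a real training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Exponentially growing grids for C and gamma, as recommended above.
param_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)

# Final model: retrained on the whole training set with the selected
# parameters (GridSearchCV already does this by default via refit=True).
final_model = search.best_estimator_
```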
Issues: Potential drawbacks of the SVM are the following three aspects: uncalibrated class membership probabilities; the SVM is only directly applicable to two-class tasks, so algorithms that reduce the multi-class task to several binary problems have to be applied (see the multi-class SVM section); and parameters of a solved model are difficult to interpret. Extensions. Multiclass SVM: Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems.[8] Common methods for such reduction include:[8][9] building binary classifiers which distinguish between one of the labels and the rest (one-versus-all) or between every pair of classes (one-versus-one); the directed acyclic graph SVM (DAGSVM);[10] and error-correcting output codes.[11] Classification of new instances in the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of its two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification. Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[12] See also Lee, Lin and Wahba.[13][14] Transductive support vector machines: Transductive support vector machines extend SVMs in that they can also treat partially labeled data in semi-supervised learning, by following the principles of transduction. Here, in addition to the training set, the learner is also given a set of $k$ test examples $x^{\star}_1, \dots, x^{\star}_k$ to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:[15] minimize (in $w, b, y^{\star}$) $\tfrac{1}{2}\|w\|^2$ subject to, for any $i = 1, \dots, n$ and any $j = 1, \dots, k$: $y_i (w \cdot x_i - b) \ge 1$, $y^{\star}_j (w \cdot x^{\star}_j - b) \ge 1$, and $y^{\star}_j \in \{-1, 1\}$. Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998. Structured SVM: SVMs have been generalized to structured SVMs, where the label space is structured and of possibly infinite size. Regression: A version of SVM for regression was proposed in 1996 by Vladimir N. Vapnik, Harris Drucker, Christopher J. C. Burges, Linda Kaufman and Alexander J. Smola.[16] This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction (within a threshold). Another SVM version, known as the least squares support vector machine (LS-SVM), has been proposed by Suykens and Vandewalle.[17] Implementation: The parameters of the maximum-margin hyperplane are derived by solving the optimization. There exist several specialized algorithms for quickly solving the QP problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks. A common method is Platt's sequential minimal optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that may be solved analytically, eliminating the need for a numerical optimization algorithm. Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[18] Instead of solving a sequence of broken-down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick. Applications: SVM can be used to solve
various real world problems svm is helpful in text and hypertext categorization as its application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings classification of images can also be performed using svm experimental results show that svm achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback svm is also useful in medical science to classify protein as up to 90 of the compounds can classify correctly hand written characters can be recognized using svm see also edit in situ adaptive tabulation kernel machines polynomial kernel fisher kernel predictive analytics relevance vector machine a probabilistic sparse kernel model identical in functional form to svm sequential minimal optimization winnow algorithm regularization perspectives on support vector machines references edit a b c cortes corinna and vapnik vladimir n support vector networks machine learning 20 1995 http www springerlink com content k238jx04hm87j80g press william h teukolsky saul a vetterling william t flannery b p 2007 section 16 5 support vector machines numerical recipes the art of scientific computing 3rd ed new york cambridge university press isbn 160 978 0 521 88068 8 160 acm website press release of march 17th 2009 http www acm org press room news releases awards 08 groupa aizerman mark a braverman emmanuel m and rozonoer lev i 1964 theoretical foundations of the potential function method in pattern recognition learning automation and remote control 25 821 837 160 boser bernhard e guyon isabelle m and vapnik vladimir n a training algorithm for optimal margin classifiers in haussler david editor 5th annual acm workshop on colt pages 144 152 pittsburgh pa 1992 acm press meyer david leisch friedrich and hornik kurt the support vector machine under test neurocomputing 55 1 2 169 186 2003 http dx doi org 10 1016 s0925 2312 03 00431 4 hsu chih wei chang chih chung and lin chih jen 2003 a practical guide to support vector classification department of computer science and information engineering national taiwan university http www csie ntu edu tw cjlin papers guide guide pdf a b duan kai bo and keerthi s sathiya 2005 which is the best multiclass svm method an empirical study proceedings of the sixth international workshop on multiple classifier systems lecture notes in computer science 3541 278 doi 10 1007 11494683_28 isbn 160 978 3 540 26306 7 160 hsu chih wei and lin chih jen 2002 a comparison of methods for multiclass support vector machines ieee transactions on neural networks 160 platt john cristianini n and shawe taylor j 2000 large margin dags for multiclass classification in solla sara a leen todd k and m ller klaus robert eds advances in neural information processing systems mit press pp 160 547 553 160 dietterich thomas g and bakiri ghulum bakiri 1995 solving multiclass learning problems via error correcting output codes journal of artificial intelligence research vol 2 2 263 286 arxiv cs 9501101 bibcode 1995cs 1101d 160 unknown parameter class ignored help crammer koby and singer yoram 2001 on the algorithmic implementation of multiclass kernel based vector machines j of machine learning research 2 265 292 160 lee y lin y and wahba g 2001 multicategory support vector machines computing science and statistics 33 160 lee y lin y and wahba g 2004 multicategory support vector machines theory and application to the classification of microarray data and satellite 
radiance data journal of the american statistical association 99 465 67 81 doi 10 1198 016214504000000098 160 joachims thorsten transductive inference for text classification using support vector machines proceedings of the 1999 international conference on machine learning icml 1999 pp 200 209 drucker harris burges christopher j c kaufman linda smola alexander j and vapnik vladimir n 1997 support vector regression machines in advances in neural information processing systems 9 nips 1996 155 161 mit press suykens johan a k vandewalle joos p l least squares support vector machine classifiers neural processing letters vol 9 no 3 jun 1999 pp 293 300 ferris michael c and munson todd s 2002 interior point methods for massive support vector machines siam journal on optimization 13 3 783 804 doi 10 1137 s1052623400374379 160 external links edit burges christopher j c a tutorial on support vector machines for pattern recognition data mining and knowledge discovery 2 121 167 1998 www kernel machines org general information and collection of research papers teknomo k svm tutorial using spreadsheet visual introduction to svm www support vector machines org literature review software links related to support vector machines 160 academic site videolectures net svm related video lectures animation clip svm with polynomial kernel visualization fletcher tristan a very basic svm tutorial for complete beginners karatzoglou alexandros et al support vector machines in r journal of statistical software april 2006 volume 15 issue 9 shogun toolbox contains about 20 different implementations of svms written in c with matlab octave python r java lua ruby and c interffaces libsvm libsvm is a library of svms which is actively patched liblinear liblinear is a library for large linear classification including some svms flssvm flssvm is a least squares svm implementation written in fortran shark shark is a c machine learning library implementing various types of svms dlib dlib is a c library for working with kernel methods and svms svm light is a collection of software tools for learning and classification using svm svmjs live demo is a gui demo for javascript implementation of svms stanford university andrew ng video on svm byvatov e schneider g support vector machine applications in bioinformatics appl bioinformatics 2003 2 2 67 77 simon tong edward chang support vector machine active learning for image retrieval proceeding multimedia 01 proceedings of the ninth acm international conference on multimedia pages 107 118 simon tong daphne koller support vector machine active learning with applications to text classification journal of machine learning research 2001 bibliography edit theodoridis sergios and koutroumbas konstantinos pattern recognition 4th edition academic press 2009 isbn 978 1 59749 272 0 cristianini nello and shawe taylor john an introduction to support vector machines and other kernel based learning methods cambridge university press 2000 isbn 0 521 78019 5 1 svm book huang te ming kecman vojislav and kopriva ivica 2006 kernel based algorithms for mining huge data sets in supervised semi supervised and unsupervised learning springer verlag berlin heidelberg 260 pp 160 96 illus hardcover isbn 3 540 31681 7 2 kecman vojislav learning and soft computing 160 support vector machines neural networks fuzzy logic systems the mit press cambridge ma 2001 3 sch lkopf bernhard and smola alexander j learning with kernels mit press cambridge ma 2002 isbn 0 262 19475 9 sch lkopf bernhard burges christopher j c and 
Smola, Alexander J. (editors). Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999. ISBN 0-262-19416-3. Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. ISBN 0-521-81397-2 (kernel methods book). Steinwart, Ingo and Christmann, Andreas. Support Vector Machines. Springer-Verlag, New York, 2008. ISBN 978-0-387-77241-7 (SVM book). Tan, Peter Jing and Dowe, David L. (2004). "MML Inference of Oblique Decision Trees". Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer-Verlag, pp. 1082-1088. (This paper uses minimum message length (MML) and actually incorporates probabilistic support vector machines in the leaves of decision trees.) Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995. ISBN 0-387-98780-0. Vapnik, Vladimir N. and Kotz, Samuel. Estimation of Dependences Based on Empirical Data. Springer, 2006. ISBN 0-387-30865-2, 510 pages. (This is a reprint of Vapnik's early book describing the philosophy behind the SVM approach; the 2006 appendix describes recent developments.) Fradkin, Dmitriy and Muchnik, Ilya. "Support Vector Machines for Classification". In Abello, J. and Carmode, G. (eds.), Discrete Methods in Epidemiology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 70, pp. 13-20, 2006. (Succinctly describes theoretical ideas behind SVM.) Bennett, Kristin P. and Campbell, Colin. "Support Vector Machines: Hype or Hallelujah?" SIGKDD Explorations 2(2), 2000, 1-13. (Excellent introduction to SVMs with helpful figures.) Ivanciuc, Ovidiu. "Applications of Support Vector Machines in Chemistry". In Reviews in Computational Chemistry, volume 23, 2007, pp. 291-400 (reprint available). Catanzaro, Bryan, Sundaram, Narayanan and Keutzer, Kurt. "Fast Support Vector Machine Training and Classification on Graphics Processors". In International Conference on Machine Learning, 2008. Campbell, Colin and Ying, Yiming. Learning with Support Vector Machines. Morgan and Claypool, 2011. ISBN 978-1-60845-616-1. Retrieved from http://en.wikipedia.org/w/index.php?title=Support_vector_machine&oldid=560723031 \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Text_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Text_mining new file mode 100644 index 00000000..40c75588 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Text_mining @@
-0,0 +1 @@ +text mining wikipedia the free encyclopedia text mining from wikipedia the free encyclopedia jump to navigation search text mining also referred to as text data mining roughly equivalent to text analytics refers to the process of deriving high quality information from text high quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning text mining usually involves the process of structuring the input text usually parsing along with the addition of some derived linguistic features and the removal of others and subsequent insertion into a database deriving patterns within the structured data and finally evaluation and interpretation of the output high quality in text mining usually refers to some combination of relevance novelty and interestingness typical text mining tasks include text categorization text clustering concept entity extraction production of granular taxonomies sentiment analysis document summarization and entity relation modeling i e learning relations between named entities text analysis involves information retrieval lexical analysis to study word frequency distributions pattern recognition tagging annotation information extraction data mining techniques including link and association analysis visualization and predictive analytics the overarching goal is essentially to turn text into data for analysis via application of natural language processing nlp and analytical methods a typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted contents 1 text mining and text analytics 2 history 3 text analysis processes 4 applications 4 1 security applications 4 2 biomedical applications 4 3 software applications 4 4 online media applications 4 5 marketing applications 4 6 sentiment analysis 4 7 academic applications 5 software and applications 5 1 commercial 5 2 open source 6 implications 7 see also 8 notes 9 references 10 external links text mining and text analytics edit the term text analytics describes a set of linguistic statistical and machine learning techniques that model and structure the information content of textual sources for business intelligence exploratory data analysis research or investigation 1 the term is roughly synonymous with text mining indeed ronen feldman modified a 2000 description of text mining 2 in 2004 to describe text analytics 3 the latter term is now used more frequently in business settings while text mining is used in some of the earliest application areas dating to the 1980s 4 notably life sciences research and government intelligence the term text analytics also describes that application of text analytics to respond to business problems whether independently or in conjunction with query and analysis of fielded numerical data it is a truism that 80 percent of business relevant information originates in unstructured form primarily text 5 these techniques and processes discover and present knowledge facts business rules and relationships that is otherwise locked in textual form impenetrable to automated processing history edit labor intensive manual text mining approaches first surfaced in the mid 1980s 6 but technological advances have enabled the field to advance during the past decade text mining is an interdisciplinary field that draws on information retrieval data mining machine learning 
statistics and computational linguistics as most information common estimates say over 80 5 is currently stored as text text mining is believed to have a high commercial potential value increasing interest is being paid to multilingual data mining the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning the challenge of exploiting the large proportion of enterprise information that originates in unstructured form has been recognized for decades 7 it is recognized in the earliest definition of business intelligence bi in an october 1958 ibm journal article by h p luhn a business intelligence system which describes a system that will utilize data processing machines for auto abstracting and auto encoding of documents and for creating interest profiles for each of the action points in an organization both incoming and internally generated documents are automatically abstracted characterized by a word pattern and sent automatically to appropriate action points yet as management information systems developed starting in the 1960s and as bi emerged in the 80s and 90s as a software category and field of practice the emphasis was on numerical data stored in relational databases this is not surprising text in unstructured documents is hard to process the emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application as described by prof marti a hearst in the paper untangling text data mining 8 for almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms in this paper i have attempted to suggest a new emphasis the use of large online text collections to discover new facts and trends about the world itself i suggest that to make progress we do not need fully artificial intelligent text analysis rather a mixture of computationally driven and user guided analysis may open the door to exciting new results hearst s 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later text analysis processes edit subtasks components of a larger text analytics effort typically include information retrieval or identification of a corpus is a preparatory step collecting or identifying a set textual materials on the web or held in a file system database or content management system for analysis although some text analytics systems apply exclusively advanced statistical methods many others apply more extensive natural language processing such as part of speech tagging syntactic parsing and other types of linguistic analysis citation needed named entity recognition is the use of gazetteers or statistical techniques to identify named text features people organizations place names stock ticker symbols certain abbreviations and so on disambiguation the use of contextual clues may be required to decide where for instance ford refers to a former u s president a vehicle manufacturer a movie star glenn or harrison who a river crossing or some other entity recognition of pattern identified entities features such as telephone numbers e mail addresses quantities with units can be discerned via regular expression or other pattern matches coreference identification of noun phrases and other terms that refer to the same object relationship fact and event extraction identification of associations among entities and 
other information in text sentiment analysis involves discerning subjective as opposed to factual material and extracting various forms of attitudinal information sentiment opinion mood and emotion text analytics techniques are helpful in analyzing sentiment at the entity concept or topic level and in distinguishing opinion holder and opinion object 9 quantitative text analysis is a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of usually a casual personal text for the purpose of psychological profiling etc 10 applications edit the technology is now broadly applied for a wide variety of government research and business needs applications can be sorted into a number of categories by analysis type or by business function using this approach to classifying solutions application categories include enterprise business intelligence data mining competitive intelligence e discovery records management national security intelligence scientific discovery especially life sciences sentiment analysis tools listening platforms natural language semantic toolkit or service publishing automated ad placement search information access social media monitoring security applications edit many text mining software packages are marketed for security applications especially monitoring and analysis of online plain text sources such as internet news blogs etc for national security purposes 11 it is also involved in the study of text encryption decryption biomedical applications edit main article biomedical text mining a range of text mining applications in the biomedical literature has been described 12 one online text mining application in the biomedical literature is gopubmed 13 gopubmed was the first semantic search engine on the web citation needed another example is pubgene that combines biomedical text mining with network visualization as an internet service 14 15 tpx is a concept assisted search and navigation tool for biomedical literature analyses 16 it runs on pubmed pmc and can be configured on request to run on local literature repositories too software applications edit text mining methods and software is also being researched and developed by major firms including ibm and microsoft to further automate the mining and analysis processes and by different firms working in the area of search and indexing in general as a way to improve their results within public sector much effort has been concentrated on creating software for tracking and monitoring terrorist activities 17 online media applications edit text mining is being used by large media companies such as the tribune company to clarify information and to provide readers with greater search experiences which in turn increases site stickiness and revenue additionally on the back end editors are benefiting by being able to share associate and package news across properties significantly increasing opportunities to monetize content marketing applications edit text mining is starting to be used in marketing as well more specifically in analytical customer relationship management coussement and van den poel 2008 18 19 apply it to improve predictive analytics models for customer churn customer attrition 18 sentiment analysis edit sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie 20 such an analysis may need a labeled data set or labeling 
of the affectivity of words resources for affectivity of words and concepts have been made for wordnet 21 and conceptnet 22 respectively text has been used to detect emotions in the related area of affective computing 23 text based approaches to affective computing have been used on multiple corpora such as students evaluations children stories and news stories academic applications edit the issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval this is especially true in scientific disciplines in which highly specific information is often contained within written text therefore initiatives have been taken such as nature s proposal for an open text mining interface otmi and the national institutes of health s common journal publishing document type definition dtd that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access academic institutions have also become involved in the text mining initiative the national centre for text mining nactem is the first publicly funded text mining centre in the world nactem is operated by the university of manchester 24 in close collaboration with the tsujii lab 25 university of tokyo 26 nactem provides customised tools research facilities and offers advice to the academic community they are funded by the joint information systems committee jisc and two of the uk research councils epsrc amp bbsrc with an initial focus on text mining in the biological and biomedical sciences research has since expanded into the areas of social sciences in the united states the school of information at university of california berkeley is developing a program called biotext to assist biology researchers in text mining and analysis further private initiatives also offer tools for academic text mining newsanalytics net provides researchers with a free scalable solution for keyword based text analysis the initiative s research apps were developed to support news analytics news analytics but are equally useful for regular text analysis applications software and applications edit text mining computer programs are available from many commercial and open source companies and sources commercial edit aerotext a suite of text mining applications for content analysis content used can be in multiple languages angoss angoss text analytics provides entity and theme extraction topic categorization sentiment analysis and document summarization capabilities via the embedded lexalytics salience engine the software provides the unique capability of merging the output of unstructured text based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis attensity hosted integrated and stand alone text mining analytics software that uses natural language processing technology to address collective intelligence in social media and forums the voice of the customer in surveys and emails customer relationship management e services research and e discovery risk and compliance and intelligence analysis autonomy text mining clustering and categorization software basis technology provides a suite of text analysis modules to identify language enable search in more than 20 languages extract entities and efficiently search for and translate entities clarabridge text analytics text mining software including natural language nlp machine learning clustering and categorization provides saas 
hosted and on premise text and sentiment analytics that enables companies to collect listen to analyze and act on the voice of the customer voc from both external twitter facebook yelp product forums etc and internal sources call center notes crm enterprise data warehouse bi surveys emails etc endeca technologies provides software to analyze and cluster unstructured text expert system s p a suite of semantic technologies and products for developers and knowledge managers fair isaac leading provider of decision management solutions powered by advanced analytics includes text analytics general sentiment social intelligence platform that uses natural language processing to discover affinities between the fans of brands with the fans of traditional television shows in social media stand alone text analytics to capture social knowledge base on billions of topics stored to 2004 ibm languageware the ibm suite for text analytics tools and runtime ibm spss provider of modeler premium previously called ibm spss modeler and ibm spss text analytics which contains advanced nlp based text analysis capabilities multi lingual sentiment event and fact extraction that can be used in conjunction with predictive modeling text analytics for surveys provides the ability to categorize survey responses using nlp based capabilities for further analysis or reporting inxight provider of text analytics search and unstructured visualization technologies inxight was bought by business objects that was bought by sap ag in 2008 languageware text analysis libraries and customization software from ibm language computer corporation text extraction and analysis tools available in multiple languages lexalytics provider of a text analytics engine used in social media monitoring voice of customer survey analysis and other applications lexisnexis provider of business intelligence solutions based on an extensive news and company information content set lexisnexis acquired dataops to pursue search mathematica provides built in tools for text alignment pattern matching clustering and semantic analysis medallia offers one system of record for survey social text written and online feedback omniviz from instem scientific data mining and visual analytics tool 27 sas sas text miner and teragram commercial text analytics natural language processing and taxonomy software used for information management smartlogic semaphore content intelligence platform containing commercial text analytics natural language processing rule based classification ontology taxonomy modelling and information vizualization software used for information management statsoft provides statistica text miner as an optional extension to statistica data miner for predictive analytics solutions sysomos provider social media analytics software platform including text analytics and sentiment analysis on online consumer conversations wordstat content analysis and text mining add on module of qda miner for analyzing large amounts of text data xpresso xpresso an engine developed by the abzooba s core technology group is focused on the automated distillation of expressions in social media conversations 28 thomson data analyzer enables complex analysis on patent information scientific publications and news open source edit querytermanalyzer query term weight analyzer carrot2 text and search results clustering framework gate general architecture for text engineering an open source toolbox for natural language processing and language engineering opennlp natural language processing 
natural language toolkit nltk a suite of libraries and programs for symbolic and statistical natural language processing nlp for the python programming language rapidminer with its text processing extension data and text mining software unstructured information management architecture uima a component framework to analyze unstructured content such as text audio and video originally developed by ibm the programming language r provides a framework for text mining applications in the package tm the knime text processing extension kh coder for content analysis text mining or corpus linguistics the plos text mining collection 29 implications edit until recently websites most often used text based searches which only found documents containing specific user defined words or phrases now through use of a semantic web text mining can find content based on meaning and context rather than just by a specific word additionally text mining software can be used to build large dossiers of information about specific people and events for example large datasets based on data extracted from news reports can be built to facilitate social networks analysis or counter intelligence in effect the text mining software may act in a capacity similar to an intelligence analyst or research librarian albeit with a more limited scope of analysis text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material see also edit approximate nonnegative matrix factorization an algorithm used for text mining biocreative text mining evaluation in biomedical literature concept mining name resolution stop words text classification sometimes is considered a sub task of text mining web mining a task that may involve text mining e g first find appropriate web pages by classifying crawled web pages then extract the desired information from the text content of these pages considered relevant w shingling sequence mining string and sequence mining noisy text analytics named entity recognition identity resolution news analytics notes edit this article uses bare urls for citations please consider adding full citations so that the article remains verifiable several templates and the reflinks tool are available to assist in formatting reflinks documentation april 2013 defining text analytics dead link kdd 2000 workshop on text mining text analytics theory and practice dead link hobbs jerry r walker donald e amsler robert a 1982 natural language access to structured text proceedings of the 9th conference on computational linguistics 1 pp 160 127 32 doi 10 3115 991813 991833 160 a b unstructured data and the 80 percent rule dead link content analysis of verbatim explanations http www b eye network com view 6311 full citation needed hearst marti a 1999 untangling text data mining proceedings of the 37th annual meeting of the association for computational linguistics on computational linguistics pp 160 3 10 doi 10 3115 1034678 1034679 isbn 160 1 55860 609 2 160 http www clarabridge com default aspx tabid 137 amp moduleid 635 amp articleid 722 dead link mehl matthias r 2006 quantitative text analysis handbook of multimethod measurement in psychology p 160 141 doi 10 1037 11383 011 isbn 160 1 59147 318 7 160 zanasi alessandro 2009 virtual weapons for real wars text mining for national security proceedings of the international workshop on computational intelligence in security for information systems cisis 08 advances in soft computing 53 p 
160 53 doi 10 1007 978 3 540 88181 0_7 isbn 160 978 3 540 88180 3 160 cohen k bretonnel hunter lawrence 2008 getting started in text mining plos computational biology 4 1 e20 doi 10 1371 journal pcbi 0040020 pmc 160 2217579 pmid 160 18225946 160 doms a schroeder m 2005 gopubmed exploring pubmed with the gene ontology nucleic acids research 33 web server issue w783 6 doi 10 1093 nar gki470 pmc 160 1160231 pmid 160 15980585 160 jenssen tor kristian l greid astrid komorowski jan hovig eivind 2001 a literature network of human genes for high throughput analysis of gene expression nature genetics 28 1 21 8 doi 10 1038 ng0501 21 pmid 160 11326270 160 masys daniel r 2001 linking microarray data to the literature nature genetics 28 1 9 10 doi 10 1038 ng0501 9 pmid 160 11326264 160 joseph thomas saipradeep vangala g venkat raghavan ganesh sekar srinivasan rajgopal rao aditya kotte sujatha sivadasan naveen 2012 tpx biomedical literature search made easy bioinformation 8 12 578 80 doi 10 6026 97320630008578 pmc 160 3398782 pmid 160 22829734 160 texor a b coussement kristof van den poel dirk 2008 integrating the voice of customers through call center emails into a decision support system for churn prediction information amp management 45 3 164 74 doi 10 1016 j im 2008 01 005 160 coussement kristof van den poel dirk 2008 improving customer complaint management by automatic email classification using linguistic style features as predictors decision support systems 44 4 870 82 doi 10 1016 j dss 2007 10 010 160 pang bo lee lillian vaithyanathan shivakumar 2002 thumbs up proceedings of the acl 02 conference on empirical methods in natural language processing 10 pp 160 79 86 doi 10 3115 1118693 1118704 160 alessandro valitutti carlo strapparava oliviero stock 2005 developing affective lexical resources psychology journal 2 1 61 83 160 erik cambria robert speer catherine havasi and amir hussain 2010 senticnet a publicly available semantic resource for opinion mining proceedings of aaai csk pp 160 14 18 160 calvo rafael a d mello sidney 2010 affect detection an interdisciplinary review of models methods and their applications ieee transactions on affective computing 1 1 18 37 doi 10 1109 t affc 2010 1 160 the university of manchester tsujii laboratory the university of tokyo yang yunyun akers lucy klose thomas barcelon yang cynthia 2008 text mining and visualization tools impressions of emerging capabilities world patent information 30 4 280 doi 10 1016 j wpi 2008 01 007 160 http www abzooba com product html table of contents text mining plos 160 references edit ananiadou s and mcnaught j editors 2006 text mining for biology and biomedicine artech house books isbn 978 1 58053 984 5 bilisoly r 2008 practical text mining with perl new york john wiley amp sons isbn 978 0 470 17643 6 feldman r and sanger j 2006 the text mining handbook new york cambridge university press isbn 978 0 521 83657 9 indurkhya n and damerau f 2010 handbook of natural language processing 2nd edition boca raton fl crc press isbn 978 1 4200 8592 1 kao a and poteet s editors natural language processing and text mining springer isbn 1 84628 175 x konchady m text mining application programming programming series charles river media isbn 1 58450 460 9 manning c and schutze h 1999 foundations of statistical natural language processing cambridge ma mit press isbn 978 0 262 13360 9 miner g elder j hill t nisbet r delen d and fast a 2012 practical text mining and statistical analysis for non structured text data applications elsevier academic 
Press. ISBN 978-0-12-386979-1. McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". DM Review, 21-22. Srivastava, A. and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3. External links: Marti Hearst: What Is Text Mining? (October 2003); Automatic Content Extraction, Linguistic Data Consortium; Automatic Content Extraction, NIST. Retrieved from http://en.wikipedia.org/w/index.php?title=Text_mining&oldid=558684266 \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Uncertain_data b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Uncertain_data new file mode 100644 index 00000000..f7308611 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Uncertain_data @@ -0,0 +1 @@ +Uncertain data (from Wikipedia, the free encyclopedia). In computer science, uncertain data is the notion of data that contains specific uncertainty. Uncertain data is typically found in the area of sensor networks; when representing such data in a database, some indication of the probability of the various values is needed. There are three main models of uncertain data in databases. In attribute uncertainty, each uncertain attribute in a tuple is subject to its own independent probability distribution.[1] For example, if readings are taken of temperature and wind speed, each would be described by its own probability distribution, as knowing the reading for one measurement would not provide any information about the other.
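As a small illustration of attribute uncertainty, the sketch below stores each uncertain attribute of one sensor tuple with its own independent distribution and answers a simple probabilistic query by sampling; the Gaussian parameters and the NumPy-based Monte-Carlo estimate are illustrative assumptions, not part of the article.

```python
import numpy as np

# One sensor reading ("tuple") under attribute uncertainty: each uncertain
# attribute carries its own independent distribution instead of a point value.
# The (mean, std) values are made up for the example.
reading = {
    "temperature": {"mean": 21.5, "std": 0.8},   # degrees Celsius
    "wind_speed":  {"mean": 4.2,  "std": 1.1},   # metres per second
}

def prob_attribute_above(attr, threshold, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of P(attribute > threshold) under its own
    independent distribution; attributes are sampled separately because
    attribute uncertainty assumes no correlation between them."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(attr["mean"], attr["std"], n_samples)
    return float(np.mean(samples > threshold))

print("P(temperature > 22 C):  ", prob_attribute_above(reading["temperature"], 22.0))
print("P(wind speed  > 5 m/s): ", prob_attribute_above(reading["wind_speed"], 5.0))
```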
In correlated uncertainty, multiple attributes may be described by a joint probability distribution.[1] For example, if readings are taken of the position of an object and the x and y coordinates are stored, the probability of different values may depend on the distance from the recorded coordinates; as distance depends on both coordinates, it may be appropriate to use a joint distribution for these coordinates, since they are not independent. In tuple uncertainty, all the attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty, but also includes the case where there is a probability of a tuple not belonging in the relevant relation, which is indicated by all the probabilities not summing to one.[1] For example, assume we have the following tuple from a probabilistic database: (a: 0.4, b: 0.5). Then the tuple has a 10% chance of not existing in the database. References: 1. Prabhakar, Sunil. "Orion: Managing Uncertain Sensor Data." 2. "Error-Aware Density-Based Clustering of Imprecise Measurement Values." Seventh IEEE International Conference on Data Mining Workshops (ICDM Workshops 2007), IEEE. 3. "Clustering Uncertain Data with Possible Worlds." Proceedings of the 1st Workshop on Management and Mining of Uncertain Data, in conjunction with the 25th International Conference on Data Engineering, 2009, IEEE. Retrieved from http://en.wikipedia.org/w/index.php?title=Uncertain_data&oldid=532093389 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ward_s_method b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ward_s_method new file mode 100644 index 00000000..a196080d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Ward_s_method @@ -0,0 +1 @@ +[Wikipedia "Bad title" error page: the requested page title was invalid, so no article content was retrieved for Ward_s_method.]
\ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Web_mining b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Web_mining new file mode 100644 index 00000000..94573741 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/articles_text/Web_mining @@ -0,0 +1 @@ +Web mining (from Wikipedia, the free encyclopedia). Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis targets, web mining can be divided into three different types: web usage mining, web content mining and web structure mining. Contents: 1 Web usage mining; 2 Web structure mining; 3 Web content mining (3.1 Web mining in foreign languages); 4 Web usage mining pros and cons (4.1 Pros, 4.2 Cons); 5 Resources; 6 External links (6.1 Books, 6.2 Bibliographic references); 7 References. Web usage mining: Web usage mining is the process of extracting useful information from server logs, e.g. users' history. It is the process of finding out what users are looking for on the Internet: some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from web data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behavior at a web site.
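To make the notion of mining server logs concrete, here is a minimal sketch that parses Apache-style log lines into (IP address, page, access time) records and counts page visits per client; the log format, the sample lines, and the regular expression are illustrative assumptions, since real server logs vary.

```python
import re
from collections import Counter, defaultdict

# Illustrative Apache "common log format" lines; real server logs vary.
LOG_LINES = [
    '192.168.0.1 - - [10/Jun/2013:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.0.1 - - [10/Jun/2013:13:56:02 +0200] "GET /products.html HTTP/1.1" 200 1045',
    '10.0.0.7 - - [10/Jun/2013:14:01:12 +0200] "GET /index.html HTTP/1.1" 200 2326',
]

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<page>\S+) [^"]*"'
)

visits_per_ip = defaultdict(Counter)
for line in LOG_LINES:
    match = LOG_PATTERN.search(line)
    if match:  # skip malformed lines
        visits_per_ip[match.group("ip")][match.group("page")] += 1

for ip, pages in visits_per_ip.items():
    print(ip, dict(pages))
```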
Web usage mining itself can be classified further depending on the kind of usage data considered. Web server data: the user logs are collected by the web server; typical data includes IP address, page reference and access time. Application server data: commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort; a key feature is the ability to track various kinds of business events and log them in application server logs. Application level data: new kinds of events can be defined in an application, and logging can be turned on for them, thus generating histories of these specially defined events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above. Web structure mining: Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds: (1) extracting patterns from hyperlinks in the web (a hyperlink is a structural component that connects a web page to a different location); (2) mining the document structure, i.e. analysis of the tree-like structure of page structures to describe HTML or XML tag usage. Web content mining: Web content mining is the mining, extraction and integration of useful data, information and knowledge from web page content. The heterogeneity and the lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, mean that automated discovery, organization, search and indexing tools of the Internet and the World Wide Web (such as Lycos, AltaVista, WebCrawler, ALIWEB, MetaCrawler and others) provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information. Web content mining is differentiated from two points of view: the information retrieval view and the database view. R. Kosala et al.[2] summarized the research work done for unstructured and semi-structured data from the information retrieval view. It shows that most research uses the bag-of-words representation, which is based on statistics about single words in isolation, to represent unstructured text, and takes single words found in the training corpus as features. For semi-structured data, all the works utilize the HTML structures inside the documents, and some also utilize the hyperlink structure between documents, for document representation. As for the database view, in order to have better information management and querying on the web, the mining always tries to infer the structure of a web site so as to transform the web site into a database. There are several ways to represent documents; the vector space model is typically used. The documents constitute the whole vector space: if a term t occurs n(d, t) times in document d, the t-th coordinate of d is n(d, t), and the coordinates can be normalized, for example by the maximum term frequency max_t n(d, t). This raw representation does not capture the importance of words in a document; to resolve this, tf-idf (term frequency times inverse document frequency) is introduced.
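Below is a minimal sketch of the tf-idf weighting just described, with term frequency normalized by the document's maximum raw count and idf taken as log(N / df); the toy corpus and this particular normalization are illustrative choices, since several tf-idf variants are in common use.

```python
import math
from collections import Counter

# Tiny toy corpus standing in for a crawled document collection.
docs = [
    "web mining applies data mining techniques to the web".split(),
    "web usage mining extracts patterns from server logs".split(),
    "text mining derives information from text".split(),
]

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    max_tf = max(counts.values())
    # tf normalized by the document's maximum raw term count, times idf.
    return {t: (n / max_tf) * math.log(N / df[t]) for t, n in counts.items()}

for i, doc in enumerate(docs):
    top = sorted(tfidf(doc).items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(f"doc {i}: {top}")
```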
ratio are usually used the classifier and pattern analysis methods of text data mining are very similar to traditional data mining techniques the usual evaluative merits are classification accuracy precision recall and information score web mining in foreign languages edit it should be noted that the language code of chinese words is very complicated compared to that of english the gb code big5 code and hz code are common chinese word codes in web documents before text mining one needs to identify the code standard of the html documents and transform it into inner code then use other data mining techniques to find useful knowledge and patterns web usage mining pros and cons edit pros edit web usage mining essentially has many advantages which makes this technology attractive to corporations including the government agencies this technology has enabled e commerce to do personalized marketing which eventually results in higher trade volumes government agencies are using this technology to classify threats and fight against terrorism the predicting capability of mining applications can benefit society by identifying criminal activities the companies can establish better customer relationship by giving them exactly what they need companies can understand the needs of the customer better and they can react to customer needs faster the companies can find attract and retain customers they can save on production costs by utilizing the acquired insight of customer requirements they can increase profitability by target pricing based on the profiles created they can even find the customer who might default to a competitor the company will try to retain the customer by providing promotional offers to the specific customer thus reducing the risk of losing a customer or customers cons edit web usage mining by itself does not create issues but this technology when used on data of personal nature might cause concerns the most criticized ethical issue involving web usage mining is the invasion of privacy privacy is considered lost when information concerning an individual is obtained used or disseminated especially if this occurs without their knowledge or consent 3 the obtained data will be analyzed and clustered to form profiles the data will be made anonymous before clustering so that there are no personal profiles 3 thus these applications de individualize the users by judging them by their mouse clicks de individualization can be defined as a tendency of judging and treating people on the basis of group characteristics instead of on their own individual characteristics and merits 3 another important concern is that the companies collecting the data for a specific purpose might use the data for a totally different purpose and this essentially violates the user s interests the growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their site this trend has increased the amount of data being captured and traded increasing the likeliness of one s privacy being invaded the companies which buy the data are obliged make it anonymous and these companies are considered authors of any specific release of mining patterns they are legally responsible for the contents of the release any inaccuracies in the release will result in serious lawsuits but there is no law preventing them from trading the data some mining algorithms might use controversial attributes like sex race religion or sexual orientation to categorize individuals these practices might be 
against the anti discrimination legislation 4 the applications make it hard to identify the use of such controversial attributes and there is no strong rule against the usage of such algorithms with such attributes this process could result in denial of service or a privilege to an individual based on his race religion or sexual orientation right now this situation can be avoided by the high ethical standards maintained by the data mining company the collected data is being made anonymous so that the obtained data and the obtained patterns cannot be traced back to an individual it might look as if this poses no threat to one s privacy actually many extra information can be inferred by the application by combining two separate unscrupulous data from the user resources edit this article includes a list of references but its sources remain unclear because it has insufficient inline citations please help to improve this article by introducing more precise citations september 2009 external links edit the future of web sites web services with a section on web scraping open social software directory compare and review web mining programs books edit jesus mena data mining your website digital press 1999 soumen chakrabarti mining the web analysis of hypertext and semi structured data morgan kaufmann 2002 bing liu web data mining exploring hyperlinks contents and usage data springer 2007 advances in web mining and web usage analysis 2005 revised papers from 7 th workshop on knowledge discovery on the web olfa nasraoui osmar zaiane myra spiliopoulou bamshad mobasher philip yu brij masand eds springer lecture notes in artificial intelligence lnai 4198 2006 web mining and web usage analysis 2004 revised papers from 6 th workshop on knowledge discovery on the web bamshad mobasher olfa nasraoui bing liu brij masand eds springer lecture notes in artificial intelligence 2006 mike thelwall link analysis an information science approach 2004 academic press 5 bibliographic references edit baraglia r silvestri f 2007 dynamic personalization of web sites without user intervention in communication of the acm 50 2 63 67 cooley r mobasher b and srivastave j 1997 web mining information and pattern discovery on the world wide web in proceedings of the 9th ieee international conference on tool with artificial intelligence cooley r mobasher b and srivastava j data preparation for mining world wide web browsing patterns journal of knowledge and information system vol 1 issue 1 pp 160 5 32 1999 kohavi r mason l and zheng z 2004 lessons and challenges from mining retail e commerce data machine learning vol 57 pp 160 83 113 lillian clark i hsien ting chris kimble peter wright daniel kudenko 2006 combining ethnographic and clickstream data to identify user web browsing strategies journal of information research vol 11 no 2 january 2006 eirinaki m vazirgiannis m 2003 web mining for web personalization acm transactions on internet technology vol 3 no 1 february 2003 mobasher b cooley r and srivastava j 2000 automatic personalization based on web usage mining communications of the acm vol 43 no 8 pp 160 142 151 mobasher b dai h kuo t and nakagawa m 2001 effective personalization based on association rule discover from web usage data in proceedings of widm 2001 atlanta ga usa pp 160 9 15 nasraoui o petenes c combining web usage mining and fuzzy inference for website personalization in proc of webkdd 2003 kdd workshop on web mining as a premise to effective and intelligent web applications washington dc august 2003 p 160 37 
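The web content mining passage above mentions the vector space model and tf-idf weighting only in passing. As a concrete illustration, the sketch below computes one common tf-idf variant (raw term frequency times log(N/df)) over the token files that the submission's extraction step writes to articles_text/. This is not part of the submitted code; the function name tfidfWeights and the top-ten printout are illustrative, and the usage line assumes a file named Web_mining exists in that directory, as it does in this diff.

<?php
  # Illustrative sketch only; not part of the submitted code.
  # Computes tf-idf weights for the extracted token files in articles_text/.
  function tfidfWeights($dir) {
    $termFreqs = array();   # per-document raw term counts
    $docFreq   = array();   # number of documents each term occurs in
    foreach (glob($dir.'/*') as $file) {
      $tokens = preg_split('/\s+/', strtolower(file_get_contents($file)), -1, PREG_SPLIT_NO_EMPTY);
      $counts = array_count_values($tokens);
      $termFreqs[basename($file)] = $counts;
      foreach (array_keys($counts) as $term)
        $docFreq[$term] = isset($docFreq[$term]) ? $docFreq[$term] + 1 : 1;
    }

    $N = count($termFreqs);  # number of documents in the collection
    $weights = array();
    foreach ($termFreqs as $doc => $counts) {
      foreach ($counts as $term => $tf) {
        # tf-idf: raw term frequency times log of the inverse document frequency
        $weights[$doc][$term] = $tf * log($N / $docFreq[$term]);
      }
    }
    return $weights;
  }

  $weights = tfidfWeights('articles_text');
  arsort($weights['Web_mining']);
  print_r(array_slice($weights['Web_mining'], 0, 10, true)); # ten highest-weighted terms
?>

Because the extracted files are already lowercased word lists, no further normalization is needed before counting.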
nasraoui o frigui h joshi a and krishnapuram r mining web access logs using relational competitive fuzzy clustering proceedings of the eighth international fuzzy systems association congress hsinchu taiwan august 1999 nasraoui o world wide web personalization invited chapter in encyclopedia of data mining and data warehousing j wang ed idea group 2005 pierrakos d paliouras g papatheodorou c spyropoulos c d 2003 web usage mining as a tool for personalization a survey user modelling and user adapted interaction journal vol 13 issue 4 pp 160 311 372 i hsien ting chris kimble daniel kudenko 2005 a pattern restore method for restoring missing patterns in server side clickstream data i hsien ting chris kimble daniel kudenko 2006 ubb mining finding unexpected browsing behaviour in clickstream data to improve a web site s design references edit wang yan web mining and knowledge discovery of usage patterns 160 kosala raymond hendrik blockeel july 2000 web mining research a survey sigkdd explorations 2 1 160 a b c lita van wel and lamb r royakkers 2004 ethical issues in web data mining ethical issues in web data mining 160 kirsten wahlstrom john f roddick vladimir estivill castro denise de vries 2007 legal and technical issues of privacy preservation in data mining legal and technical issues of privacy preservation in data mining 160 data mining by korth retrieved from http en wikipedia org w index php title web_mining amp oldid 557155996 categories data collectiondata mininghidden categories articles needing cleanup from june 2009all articles needing cleanupcleanup tagged articles without a reason field from june 2009wikipedia pages needing cleanup from june 2009articles lacking in text citations from september 2009all articles lacking in text citations navigation menu personal tools create accountlog in namespaces article talk variants views read edit view history actions search navigation main page contents featured content current events random article donate to wikipedia interaction help about wikipedia community portal recent changes contact wikipedia toolbox what links here related changes upload file special pages permanent link page information cite this page print export create a book download as pdf printable version languages deutsch espa ol euskara fran ais hrvatski magyar portugus sloven ina edit links this page was last modified on 28 may 2013 at 11 36 text is available under the creative commons attribution sharealike license additional terms may apply by using this site you agree to the terms of use and privacy policy wikipedia is a registered trademark of the wikimedia foundation inc a non profit organization privacy policy about wikipedia disclaimers contact wikipedia mobile view \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_htmlToText.php b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_htmlToText.php new file mode 100644 index 00000000..af919d2e --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_htmlToText.php @@ -0,0 +1,26 @@ + \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_wikiCrawler.php b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_wikiCrawler.php new file mode 100644 index 00000000..be30ad5c --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/1_wikiCrawler.php @@ -0,0 +1,94 @@ + 0) { + $article = array_shift($articlesToCrawl); + + echo "Current article: ".$article."\n"; + $content = $iwa->get($BASE_URL . 
urlencode($article), array(), array(CURLOPT_FOLLOWLOCATION => true)); + $content = preg_replace("/[\r\n]/", " ", $content); + #$dom = str_get_html($content); + #$dom = file_get_html($BASE_URL . urlencode($article)); + + #if ($dom == null) + # continue; + preg_match_all('/href="(.+?)"/', $content, $hrefMatches); + if (count($hrefMatches[1]) <= 0) + continue; + + file_put_contents("articles_html/".fixChars($article), $content); + + # Find all links to other articles + $nextArticles = array(); + foreach ($hrefMatches[1] as $href) { + if (preg_match("/^\/wiki\/([^:]+)$/", $href, $matches)) { + if ($matches[1] != 'Main_Page') # only crawl non-Main_Page articles + $nextArticles[] = $matches[1]; + + # remember the link to this article + if (! isset($linkedArticles[$article])) + $linkedArticles[$article] = array(); + array_push($linkedArticles[$article], $matches[1]); + } + } # end foreach link + + # Add current article to crawled articles + $crawledArticles[] = $article; + # Add linked articles to articles which should be crawled + $articlesToCrawl = array_diff(array_unique(array_merge($articlesToCrawl, $nextArticles)), $crawledArticles); + #print_r($articlesToCrawl); + + #sleep(1); + } # end while + + # Finally, delete all article links of uncrawled articles and save the + # graph as a DOT formatted file + $dotGraph = "digraph {\n"; + foreach ($linkedArticles as $fromArticle => $toArticles) { + if (strpos($fromArticle, ':') !== false) + continue; # Skip special pages (entry point) + + $fromArticle = fixChars($fromArticle); + + $i = 0; + while ($i < count($toArticles)) { + if (! in_array($toArticles[$i], $crawledArticles)) + array_splice($toArticles, $i, 1); + else + $i++; + } + + foreach (array_unique($toArticles) as $toArticle) { + $toArticle = fixChars($toArticle); + if ($toArticle != $fromArticle) # ignore self-referal links + $dotGraph .= "\t".$fromArticle." -> ".$toArticle."\n"; + } + } + $dotGraph .= "}\n"; + + file_put_contents("wikigraph.dot", $dotGraph); +?> \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_1-3.php b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_1-3.php new file mode 100644 index 00000000..6ff8304a --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_1-3.php @@ -0,0 +1,11 @@ +name()."\n"; + + #$query = array('machine', 'learning'); + #$query = array('frequent', 'itemsets'); + $query = array('web', 'mining'); + doQuery($query, true, true); +?> \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_4.php b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_4.php new file mode 100644 index 00000000..ae2e9a0d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/2_4.php @@ -0,0 +1,28 @@ +name())); + + $ranking = doQuery($query, false); + #print_r($ranking); + $hitsRank = array_search($article->name(), $ranking['HITS']); + $prRank = array_search($article->name(), $ranking['PageRank']); + + if ($hitsRank === false || $prRank === false) + continue; + + $numArticles++; + + $totalHITS += $hitsRank; + $totalPR += $prRank; + echo "HITS: ".$hitsRank." PageRank: ".$prRank."\n"; + } + + echo "Avg. Rankings - HITS: ".($totalHITS / $numArticles)." 
PageRank: ".($totalPR / $numArticles)."\n"; + +?> \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/functions.inc.php b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/functions.inc.php new file mode 100644 index 00000000..31122d98 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/code/functions.inc.php @@ -0,0 +1,360 @@ +(.+?)<\/body>(.*)$/i', 'xx', $html); + $text = preg_replace('/(.*?)<\/script>/', '', $html); + if ($print) + echo $text; + $text = preg_replace('/(.*?)<\/style>/', '', $text); + $text = preg_replace('/<(.+?)>/', '', $text); + $text = strtolower($text); + $text = preg_match_all('/\w+/', $text, $matches); + + return $matches[0]; + } + + function loadDotGraph($file) { + $graphLines = file($file); + $nodesByName = array(); + for ($i = 1; $i < count($graphLines)-1; $i++) { + $elms = explode(' -> ', $graphLines[$i]); + $elms[0] = trim($elms[0]); + $elms[1] = trim($elms[1]); + + if (count($elms) == 2) { + if (! isset($nodesByName[$elms[0]])) + $nodesByName[$elms[0]] = new Article($elms[0]); + if (! isset($nodesByName[$elms[1]])) + $nodesByName[$elms[1]] = new Article($elms[1]); + + $nodesByName[$elms[0]]->addOutLink($nodesByName[$elms[1]]); + } + } + return array_values($nodesByName); + } + + function doQuery($query, $verbose=false, $saveGraph=false) { + $articles = loadDotGraph('wikigraph.dot'); + + echo "Query: ".implode(' ', $query)."\n\n"; + + $N = count($articles); # number of total pages + + # match articles against the query + $rootSet = array(); + $rootSetNames = array(); + $articleByName = array(); + foreach ($articles as $article) { + if ($article->matchesAll($query)) { + array_push($rootSet, $article); + array_push($rootSetNames, $article->name()); + } + $articleByName[$article->name()] = $article; + } + + if ($verbose) { + echo "Root set:\n"; + foreach ($rootSet as $article) + echo $article->name()."\n"; + } + + # contruct the base set (all articles linking to a root-set article + # or being linked by a root-set article) + $baseSet = array(); + $baseSetNames = array(); + foreach ($rootSet as $article) + foreach ($article->inLinks() as $linkingArticle) { + if (in_array($linkingArticle->name(), $baseSetNames)) # ensure unique items + continue; + + array_push($baseSet, $linkingArticle); + array_push($baseSetNames, $linkingArticle->name()); + } + + foreach ($article->outLinks() as $linkedArticle) { + if (in_array($linkedArticle->name(), $baseSetNames)) # ensure unique items + continue; + + array_push($baseSet, $linkedArticle); + array_push($baseSetNames, $linkedArticle->name()); + } + + if ($verbose) { + echo "\nBase set:\n"; + foreach ($baseSet as $article) + echo $article->name()."\n"; + } + + # no results? + if (count($rootSet) <= 0) { + #if ($verbose) + echo "No results!\n"; + return array('HITS' => array(), 'PageRank' => array()); + } + + $iter = 0; + $weightChange = 1; # for logging the weight change + $totalWeight = count($rootSet)*2 + count($baseSet)*2; # every article has an initial hub/authority score (1) + while ($iter < 10000 && $weightChange >= 1.0/10000) { + if ($totalWeight == 0) + return array('HITS' => array(), 'PageRank' => array()); + + if ($verbose) + echo "Iteration #".($iter+1).". 
weightChange = ".$weightChange.", totalWeight = ".$totalWeight."\n"; + $weightChange = 0; + $newTotalWeight = 0; + + # recalculate the hub scores + foreach ($baseSet as $article) { + $linksToRoot = 0; # number of outgoing links to root articles + $hubScore = 0; # the hub score + foreach ($article->outLinks() as $rootArticle) { + if (! in_array($rootArticle->name(), $rootSetNames)) + continue; + + $hubScore += $rootArticle->authScore(); + $linksToRoot++; + } + + $hubScore /= $totalWeight; # normalization + $newTotalWeight += $hubScore; + $weightChange += abs($hubScore - $article->hubScore()); + if ($verbose) echo "New hub score for ".$article->name().": ".$hubScore.", old: ".$article->hubScore().", change: ".abs($hubScore - $article->hubScore())."\n"; + $article->hubScore($hubScore); # update + } + + # recalculate the authority scores + foreach ($rootSet as $article) { + $linksFromBase = 0; # number of ingoing links from base articles + $authScore = 0; # the hub score + foreach ($article->inLinks() as $baseArticle) { + if (! in_array($baseArticle->name(), $baseSetNames)) + continue; + + $authScore += $baseArticle->hubScore(); + $linksFromBase++; + } + + $authScore /= $totalWeight; # normalization + $newTotalWeight += $authScore; + $weightChange += abs($authScore - $article->authScore()); + if ($verbose) echo "New authority score for ".$article->name().": ".$authScore.", old: ".$article->authScore().", change: ".abs($authScore - $article->authScore())."\n"; + $article->authScore($authScore); # update + } + + $iter++; + $totalWeight = $newTotalWeight; + #echo "Total weight: ".$totalWeight."\n\n"; + } + + if ($verbose) + echo "Final weightChange = ".$weightChange."\n\n"; + + # prepare the authority and hub scores and the PageRanks for sorting + $authScores = array(); + foreach ($rootSet as $article) { + $authScores[$article->name()] = $article->authScore(); + } + + $hubScores = array(); + foreach ($baseSet as $article) { + $hubScores[$article->name()] = $article->hubScore(); + } + + $pageRanks = array(); + foreach ($rootSet as $article) { + $iterations = 0; + $pageRanks[$article->name()] = $article->pageRank($N, $iterations); + if ($verbose) echo "Number of iterations for ".$article->name().": $iterations\n"; + } + + asort($authScores, SORT_NUMERIC); + asort($hubScores, SORT_NUMERIC); + asort($pageRanks, SORT_NUMERIC); + $authKeys = array_reverse(array_keys($authScores)); + $hubKeys = array_reverse(array_keys($hubScores)); + $prKeys = array_reverse(array_keys($pageRanks)); + + # print the final scores + if ($verbose) echo "\nFinal hub scores:\n"; + foreach ($hubKeys as $articleName) { + $hubScore = $hubScores[$articleName]; + if ($verbose) echo $articleName." - hub score: ".$hubScore." - inlinks: ".count($articleByName[$articleName]->inLinks())." - outlinks: ".count($articleByName[$articleName]->outLinks())."\n"; + } + + $ret = array('HITS' => array(), 'PageRank' => array()); + if ($verbose) echo "\nFinal authority scores:\n"; + foreach ($authKeys as $articleName) { + $authScore = $authScores[$articleName]; + if ($verbose) echo $articleName." - auth. score: ".$authScore." - inlinks: ".count($articleByName[$articleName]->inLinks())." - outlinks: ".count($articleByName[$articleName]->outLinks())."\n"; + array_push($ret['HITS'], $articleName); + } + + if ($verbose) echo "\nPage Ranks:\n"; + foreach ($prKeys as $articleName) { + $pageRank = $pageRanks[$articleName]; + if ($verbose) echo $articleName." - PageRank: ".$pageRank." - inlinks: ".count($articleByName[$articleName]->inLinks())." 
- outlinks: ".count($articleByName[$articleName]->outLinks())."\n"; + array_push($ret['PageRank'], $articleName); + } + + #[authScore = ".sprintf("%.4f", $baseArticle->hubScore())." , hubScore = ".sprintf("%.4f", $baseArticle->hubScore())." , pageRank = ".sprintf("%.4f", $baseArticle->pageRank())."] + + if ($saveGraph) { + # Save a dot graph of the root and base set + $dotGraph = "digraph {\n"; + + # define custom labels and colors + foreach ($rootSet as $rootArticle) { + $dotGraph .= "\t".$rootArticle->name()." [color=deeppink, label=\"".$rootArticle->name()."\\nauthScore = ".sprintf("%.4f", $rootArticle->authScore())."\\nhubScore = ".sprintf("%.4f", $rootArticle->hubScore())."\\npageRank = ".sprintf("%.4f", $rootArticle->pageRank($N))."\"];\n"; + } + + foreach ($baseSet as $baseArticle) { + if (in_array($baseArticle->name(), $rootSetNames)) + continue; + + $dotGraph .= "\t".$baseArticle->name()." [color=cyan2, label=\"".$baseArticle->name()."\\nhubScore = ".sprintf("%.4f", $baseArticle->hubScore())."\\npageRank = ".sprintf("%.4f", $baseArticle->pageRank($N))."\"];\n"; + } + + foreach ($rootSet as $rootArticle) { + foreach ($rootArticle->inLinks() as $baseArticle) { + if (! in_array($baseArticle->name(), $baseSetNames)) + continue; + + $dotGraph .= "\t".$baseArticle->name()." -> ".$rootArticle->name()."\n"; + } + + foreach ($rootArticle->outLinks() as $otherArticle) { + if (in_array($otherArticle->name(), $baseSetNames) && ! in_array($otherArticle->name(), $rootSetNames)) + $dotGraph .= "\t".$rootArticle->name()." -> ".$otherArticle->name()."\n"; + } + } + + $dotGraph .= "}\n"; + + file_put_contents("wikigraph_rootbase.dot", $dotGraph); + } + + return $ret; + } + + class Article { + private $outLinks = array(); # Page[] + private $outLinkNames = array(); # String[] + private $inLinks = array(); # Page[] + private $inLinkNames = array(); # String[] + private $name; # String + private $hubScore = 1; # float + private $authScore = 1; # float + private $pageRank = -1; # float + + function __construct($name) { + $this->name = $name; + } + + public function addOutLink($page) { + #echo "adding (out)link from ".$this->name()." to ".$page->name()."\n"; + if (in_array($page->name(), $this->outLinkNames) || $page->name() == $this->name()) + return; + + array_push($this->outLinks, $page); + array_push($this->outLinkNames, $page->name()); + $page->addInLink($this); + } + + public function addInLink($page) { + #echo "adding (in)link from ".$page->name()." 
to ".$this->name()."\n"; + if (in_array($page->name(), $this->inLinkNames) || $page->name() == $this->name()) + return; + + array_push($this->inLinks, $page); + array_push($this->inLinkNames, $page->name()); + $page->addOutLink($this); + } + + public function name() { + return $this->name; + } + + public function inLinks() { + return $this->inLinks; + } + + public function outLinks() { + return $this->outLinks; + } + + public function hubScore($hubScore=null) { + if (isset($hubScore)) + $this->hubScore = $hubScore; + return $this->hubScore; + } + + public function authScore($authScore=null) { + if (isset($authScore)) + $this->authScore = $authScore; + return $this->authScore; + } + + public function pageRank($N, &$iterations=null, $visited=array()) { + #echo "Calculating the PageRank of ".$this->name()."\n"; + + array_push($visited, $this->name()); # log visited articles to prevent cycles + #echo "visited: "; + #print_r($visited); + #echo "\n\n"; + + if ($this->pageRank != -1) + return $this->pageRank; + + if (isset($iterations)) + $iterations += 1; + + $inhPr = 0; + foreach ($this->inLinks() as $inPage) { + if (in_array($inPage->name(), $visited)) + continue; + + $inhPr += $inPage->pageRank($N, $iterations, $visited) / count($inPage->outLinks()); + } + + $dampingFactor = 0.85; + $this->pageRank = (1.0 - $dampingFactor) * (1.0 / $N) + $dampingFactor * $inhPr; + return $this->pageRank; + } + + public function matchesAll($query=array()) { + if (count($query) <= 0) + return true; + + $text = file_get_contents('articles_text/' . $this->name); + foreach ($query as $subQuery) { + if (strlen($subQuery) <= 0) + continue; + + if (strpos($text, $subQuery) === false) + return false; + } + + return true; + } + } +?> \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/auth scores.csv b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/auth scores.csv new file mode 100644 index 00000000..57c132d4 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/auth scores.csv @@ -0,0 +1,49 @@ +Article;Authority score;Inlinks;Outlinks +Data_mining;0,481420279557;47;60 +Machine_learning;0,300243806287;26;17 +Predictive_analytics;0,193431219649;14;8 +Regression_analysis;0,190684464005;14;7 +Cluster_analysis;0,138648827906;14;9 +Decision_tree_learning;0,13040241942;8;3 +Data_analysis;0,129149293692;9;9 +Computer_science;0,128318421832;14;6 +Text_mining;0,127891272332;11;10 +Artificial_intelligence;0,121162536689;8;11 +Association_rule_learning;0,115689631449;11;6 +Neural_networks;0,0947669703546;5;6 +Association_for_Computing_Machinery;0,0884180831674;11;5 +Data_set;0,0882976372053;7;2 +Genetic_algorithms;0,0871920311758;4;4 +Information_extraction;0,0859697852294;5;2 +SPSS_Modeler;0,0791032803301;4;8 +Anomaly_detection;0,0712557567539;5;2 +Analytics;0,0701682516193;5;7 +Support_vector_machines;0,0653555399733;3;5 +Web_mining;0,0571017156083;3;1 +Receiver_operating_characteristic;0,0559879366163;4;2 +Gene_expression_programming;0,0464757154709;4;6 +List_of_machine_learning_algorithms;0,0423439347075;2;4 +Automatic_summarization;0,0335523303773;2;2 +Multifactor_dimensionality_reduction;0,0331381912581;1;4 +Online_algorithm;0,0331381912581;1;1 +Data_pre_processing;0,0331381912581;1;2 +Database_system;0,0331381912581;1;9 +Association_rule_mining;0,0331381912581;1;7 +Profiling_practices;0,0331381912581;1;1 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases;0,0331381912581;1;2 
+Nearest_neighbor_search;0,0283998093258;2;2 +ECML_PKDD;0,0262579686702;2;1 +Biomedical_text_mining;0,0139741065714;1;2 +K_optimal_pattern_discovery;0,0136493347825;2;3 +Data_stream_mining;0,0121860939165;1;4 +Formal_concept_analysis;0,0102033736269;1;5 +Concept_drift;0,00976871892171;1;4 +Molecule_mining;0,00761477134993;1;1 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning;1,51340445751E-104;1;4 +Uncertain_data;0;0;1 +Structure_mining;0;0;4 +Document_classification;0;0;7 +Local_outlier_factor;0;0;1 +Accuracy_paradox;0;0;2 +Elastic_map;0;0;1 +Feature_vector;0;0;2 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/hub scores.csv b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/hub scores.csv new file mode 100644 index 00000000..4f335ef5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/hub scores.csv @@ -0,0 +1,85 @@ +Article;Hub score;Inlinks;Outlinks +Data_mining;0,2983;47;60 +Machine_learning;0,2001;26;17 +Artificial_intelligence;0,1811;8;11 +Predictive_analytics;0,1514;14;8 +Neural_networks;0,1481;5;6 +Cluster_analysis;0,1363;14;9 +Data;0,1285;9;11 +Text_mining;0,1258;11;10 +Analytics;0,1225;5;7 +Data_analysis;0,1193;9;9 +Formal_concept_analysis;0,1165;1;5 +Multifactor_dimensionality_reduction;0,1119;1;4 +Statistics;0,1104;18;9 +Computer_science;0,1101;14;6 +Concept_drift;0,1097;1;4 +Business_intelligence;0,1065;12;8 +Gene_expression_programming;0,1054;4;6 +SPSS_Modeler;0,1041;4;8 +Support_vector_machines;0,1003;3;5 +Data_visualization;0,0997;5;10 +Affinity_analysis;0,0961;0;4 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning;0,0948;1;4 +Concept_mining;0,0918;4;6 +Statistical_inference;0,0901;7;7 +Document_classification;0,0899;0;7 +Decision_support_system;0,0884;4;5 +Data_stream_mining;0,0879;1;4 +Data_pre_processing;0,0868;1;2 +Decision_tree_learning;0,0868;8;3 +Receiver_operating_characteristic;0,0868;4;2 +Data_dredging;0,0848;1;3 +Lift_data_mining_;0,0726;2;3 +Structure_mining;0,0685;0;4 +Association_rule_mining;0,0679;1;7 +Data_warehouse;0,0678;7;4 +Data_management;0,0678;4;4 +Evolutionary_data_mining;0,0678;0;3 +Data_Mining_and_Knowledge_Discovery;0,0677;0;2 +Sequence_mining;0,0663;8;4 +K_optimal_pattern_discovery;0,0663;2;3 +FSA_Red_Algorithm;0,0663;1;3 +Contrast_set_learning;0,0663;2;2 +Database_system;0,0633;1;9 +Software_mining;0,0633;0;3 +Cross_Industry_Standard_Process_for_Data_Mining;0,0623;3;3 +Association_rule_learning;0,0550;11;6 +Molecule_mining;0,0535;1;1 +Profiling_practices;0,0535;1;1 +Nothing_to_hide_argument;0,0535;0;1 +Web_mining;0,0535;3;1 +SEMMA;0,0535;2;5 +Regression_analysis;0,0488;14;7 +Genetic_algorithms;0,0483;4;4 +Automatic_summarization;0,0476;2;2 +List_of_machine_learning_algorithms;0,0408;2;4 +Data_collection;0,0366;5;5 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases;0,0363;1;2 +Feature_vector;0,0334;0;2 +Information_extraction;0,0334;5;2 +ECML_PKDD;0,0334;2;1 +Anomaly_detection;0,0283;5;2 +Accuracy_paradox;0,0277;0;2 +Conference_on_Information_and_Knowledge_Management;0,0241;3;2 +CIKM_Conference;0,0241;1;3 +Statistical_model;0,0212;4;2 +Elastic_map;0,0154;0;1 +Nearest_neighbor_search;0,0154;2;2 +Optimal_matching;0,0154;0;1 +Online_algorithm;0,0143;1;1 +IEEE;0,0143;5;1 +Uncertain_data;0,0143;0;1 +Association_for_Computing_Machinery;0,0143;11;5 +Biomedical_text_mining;0,0142;1;2 +Co_occurrence_networks;0,0142;1;1 +Apriori_algorithm;0,0129;7;2 +KDD_Conference;0,0098;1;3 +SIGMOD;0,0098;4;1 
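The hub and authority values in these CSV files come from the iteration loop in doQuery(), which divides each score by the previous iteration's total weight and stops once the summed change falls below 1/10000. For comparison, a textbook-style HITS iteration sums hub scores over in-links to get authority scores, sums authority scores over out-links to get hub scores, and normalizes each vector to unit length every round. A minimal sketch over the same Article objects follows; the function name hitsScores is illustrative, scores are kept in local arrays rather than on the objects, and links leading outside the supplied set are ignored.

<?php
  # Illustrative textbook-style HITS over Article objects from functions.inc.php;
  # this is not the normalization used by the submission's doQuery().
  function hitsScores($articles, $maxIter = 100, $eps = 1e-6) {
    $auth = array();
    $hub  = array();
    foreach ($articles as $a) {                 # start with uniform scores
      $auth[$a->name()] = 1.0;
      $hub[$a->name()]  = 1.0;
    }

    for ($iter = 0; $iter < $maxIter; $iter++) {
      $newAuth = array();
      $newHub  = array();
      foreach ($articles as $a) {               # authority: sum of hub scores of pages linking here
        $s = 0.0;
        foreach ($a->inLinks() as $in)
          if (isset($hub[$in->name()])) $s += $hub[$in->name()];
        $newAuth[$a->name()] = $s;
      }
      foreach ($articles as $a) {               # hub: sum of updated authority scores of linked pages
        $s = 0.0;
        foreach ($a->outLinks() as $out)
          if (isset($newAuth[$out->name()])) $s += $newAuth[$out->name()];
        $newHub[$a->name()] = $s;
      }

      # normalize both vectors to unit Euclidean length
      $na = sqrt(array_sum(array_map(function ($x) { return $x * $x; }, $newAuth)));
      $nh = sqrt(array_sum(array_map(function ($x) { return $x * $x; }, $newHub)));
      foreach ($newAuth as $k => $v) $newAuth[$k] = $na > 0 ? $v / $na : 0.0;
      foreach ($newHub  as $k => $v) $newHub[$k]  = $nh > 0 ? $v / $nh : 0.0;

      # stop once both score vectors have converged
      $delta = 0.0;
      foreach ($newAuth as $k => $v) $delta += abs($v - $auth[$k]);
      foreach ($newHub  as $k => $v) $delta += abs($v - $hub[$k]);
      $auth = $newAuth;
      $hub  = $newHub;
      if ($delta < $eps) break;
    }

    return array('authority' => $auth, 'hub' => $hub);
  }
?>

With unit-length normalization the scores converge to the principal eigenvectors of the link matrix products, so the ranking no longer depends on the size of the previous iteration's totals.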
+Conference_on_Knowledge_Discovery_and_Data_Mining;0,0098;1;3 +SIGKDD;0,0098;6;2 +Data_classification_business_intelligence_;0,0098;0;2 +Local_outlier_factor;0,0079;0;1 +FICO;0,0078;1;1 +ROUGE_metric_;0,0037;0;1 +Anomaly_Detection_at_Multiple_Scales;0,0000;1;1 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/pageranks.csv b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/pageranks.csv new file mode 100644 index 00000000..ade2c314 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/pageranks.csv @@ -0,0 +1,49 @@ +Article;PageRank;Inlinks;Outlinks +Data_mining;0,0313071789372;47;60 +Machine_learning;0,0158538977353;26;17 +Computer_science;0,0126722570331;14;6 +Association_rule_learning;0,00990670646945;11;6 +Cluster_analysis;0,0088622662563;14;9 +Association_for_Computing_Machinery;0,00760814223058;11;5 +Text_mining;0,00722965423835;11;10 +Artificial_intelligence;0,00483748324352;8;11 +Anomaly_detection;0,00464474010422;5;2 +Data_analysis;0,00452250176585;9;9 +Analytics;0,00398360477734;5;7 +Predictive_analytics;0,00377315039256;14;8 +Decision_tree_learning;0,00359688078719;8;3 +Data_set;0,00314473684211;7;2 +Receiver_operating_characteristic;0,00308353986382;4;2 +Neural_networks;0,00302106166424;5;6 +Automatic_summarization;0,00292105263158;2;2 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning;0,00292105263158;1;4 +SPSS_Modeler;0,00250326565026;4;8 +Information_extraction;0,00242076686785;5;2 +Web_mining;0,00238519730199;3;1 +Gene_expression_programming;0,00234245741998;4;6 +ECML_PKDD;0,00225;2;1 +Formal_concept_analysis;0,00206260065804;1;5 +Nearest_neighbor_search;0,0020060725352;2;2 +List_of_machine_learning_algorithms;0,00195275289178;2;4 +Concept_drift;0,00191447368421;1;4 +Molecule_mining;0,00191447368421;1;1 +Regression_analysis;0,00187996240602;14;7 +Support_vector_machines;0,00177067669173;3;5 +K_optimal_pattern_discovery;0,00177067669173;2;3 +Multifactor_dimensionality_reduction;0,00157894736842;1;4 +Database_system;0,00157894736842;1;9 +Local_outlier_factor;0,00157894736842;0;1 +Profiling_practices;0,00157894736842;1;1 +Uncertain_data;0,00157894736842;0;1 +Accuracy_paradox;0,00157894736842;0;2 +Structure_mining;0,00157894736842;0;4 +Data_pre_processing;0,00157894736842;1;2 +Online_algorithm;0,00157894736842;1;1 +Document_classification;0,00157894736842;0;7 +Genetic_algorithms;0,00157894736842;4;4 +Data_stream_mining;0,00157894736842;1;4 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases;0,00157894736842;1;2 +Biomedical_text_mining;0,00157894736842;1;2 +Feature_vector;0,00157894736842;0;2 +Elastic_map;0,00157894736842;0;1 +Association_rule_mining;0,00157894736842;1;7 \ No newline at end of file diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query data mining.txt b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query data mining.txt new file mode 100644 index 00000000..c15fddf5 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query data mining.txt @@ -0,0 +1,427 @@ +Query: machine learning + +Root set: +Data_mining +Analytics +Information_extraction +Data_analysis +Computer_science +Data_set +Artificial_intelligence +Machine_learning +Database_system +Data_pre_processing +Online_algorithm +Cluster_analysis +Anomaly_detection +Association_rule_mining +Predictive_analytics +Regression_analysis +Neural_networks +Genetic_algorithms +Decision_tree_learning +Support_vector_machines +Association_for_Computing_Machinery 
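The values in pageranks.csv above are produced by Article::pageRank(), which evaluates PR(p) = (1-d)/N + d * sum over in-linking pages q of PR(q)/outdegree(q), with d = 0.85, recursively and skips already-visited pages to break cycles. The same formula is more commonly computed by power iteration over the whole graph; a minimal sketch over the Article list returned by loadDotGraph() follows. The function name pageRanks is illustrative, and dropping the rank of dangling pages is just one of several common treatments.

<?php
  # Illustrative power-iteration PageRank over Article objects from functions.inc.php;
  # the submission instead evaluates the formula recursively per page.
  function pageRanks($articles, $d = 0.85, $maxIter = 100, $eps = 1e-8) {
    $N = count($articles);
    if ($N == 0)
      return array();

    $pr = array();
    foreach ($articles as $a)
      $pr[$a->name()] = 1.0 / $N;              # uniform start distribution

    for ($iter = 0; $iter < $maxIter; $iter++) {
      $next = array();
      foreach ($articles as $a)
        $next[$a->name()] = (1.0 - $d) / $N;   # teleportation term
      foreach ($articles as $a) {              # spread each page's rank over its out-links
        $out = $a->outLinks();
        if (count($out) == 0)
          continue;                            # dangling page: its rank is simply dropped here
        $share = $d * $pr[$a->name()] / count($out);
        foreach ($out as $o)
          if (isset($next[$o->name()]))        # only distribute to pages in the supplied set
            $next[$o->name()] += $share;
      }

      $delta = 0.0;                            # total change, used as the convergence test
      foreach ($next as $k => $v)
        $delta += abs($v - $pr[$k]);
      $pr = $next;
      if ($delta < $eps) break;
    }
    arsort($pr);
    return $pr;
  }

  # Usage: $ranks = pageRanks(loadDotGraph('wikigraph.dot'));
?>

Iterating over the whole graph lets every page's score settle jointly, whereas the recursive evaluation fixes each score the first time it is computed and ignores contributions from pages on the current recursion path.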
+European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases +Association_rule_learning +Automatic_summarization +Receiver_operating_characteristic +Multifactor_dimensionality_reduction +SPSS_Modeler +Text_mining +Web_mining +Profiling_practices +Accuracy_paradox +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning +K_optimal_pattern_discovery +Biomedical_text_mining +Nearest_neighbor_search +Concept_drift +Data_stream_mining +Formal_concept_analysis +Document_classification +ECML_PKDD +Elastic_map +Feature_vector +Gene_expression_programming +List_of_machine_learning_algorithms +Local_outlier_factor +Molecule_mining +Structure_mining +Uncertain_data + +Base set: +Affinity_analysis +Association_rule_learning +Cluster_analysis +Concept_drift +Concept_mining +Contrast_set_learning +Data_dredging +Data_Mining_and_Knowledge_Discovery +Data_stream_mining +Decision_tree_learning +Evolutionary_data_mining +Formal_concept_analysis +FSA_Red_Algorithm +K_optimal_pattern_discovery +Lift_data_mining_ +Molecule_mining +Multifactor_dimensionality_reduction +Nothing_to_hide_argument +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning +Profiling_practices +Receiver_operating_characteristic +Sequence_mining +Software_mining +SPSS_Modeler +Structure_mining +Text_mining +Web_mining +Analytics +Data_analysis +Computer_science +Artificial_intelligence +Machine_learning +Statistics +Database_system +Data_management +Data_pre_processing +Statistical_inference +Data_visualization +Data_warehouse +Decision_support_system +Business_intelligence +Association_rule_mining +Predictive_analytics +Data +Neural_networks +Cross_Industry_Standard_Process_for_Data_Mining +SEMMA +Data_mining +FICO +Document_classification +Uncertain_data +Online_algorithm +Genetic_algorithms +Association_for_Computing_Machinery +CIKM_Conference +Conference_on_Information_and_Knowledge_Management +IEEE +Data_classification_business_intelligence_ +Gene_expression_programming +Automatic_summarization +ECML_PKDD +Feature_vector +Information_extraction +Regression_analysis +Support_vector_machines +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases +Anomaly_detection +Elastic_map +Nearest_neighbor_search +Optimal_matching +Data_collection +Local_outlier_factor +Accuracy_paradox +List_of_machine_learning_algorithms +Statistical_model +Conference_on_Knowledge_Discovery_and_Data_Mining +SIGKDD +KDD_Conference +SIGMOD +Apriori_algorithm +ROUGE_metric_ +Biomedical_text_mining +Co_occurrence_networks +Anomaly_Detection_at_Multiple_Scales +Iteration #1. weightChange = 1 +Iteration #2. weightChange = 130,99215163 +Iteration #3. weightChange = 2,53848909943 +Iteration #4. weightChange = 39,5367828957 +Iteration #5. weightChange = 29,9701756704 +Iteration #6. weightChange = 9,28301673042 +Iteration #7. weightChange = 12,3452005519 +Iteration #8. weightChange = 11,9434800892 +Iteration #9. weightChange = 8,39872629851 +Iteration #10. weightChange = 2,716960015 +Iteration #11. weightChange = 3,96835990824 +Iteration #12. weightChange = 3,68852997114 +Iteration #13. weightChange = 1,62022052321 +Iteration #14. weightChange = 2,4953660272 +Iteration #15. weightChange = 1,26152905201 +Iteration #16. weightChange = 1,47376149063 +Iteration #17. weightChange = 0,713186541076 +Iteration #18. weightChange = 0,651543392285 +Iteration #19. weightChange = 0,710120372528 +Iteration #20. weightChange = 0,352334226338 +Iteration #21. 
weightChange = 0,489764902618 +Iteration #22. weightChange = 0,210758230974 +Iteration #23. weightChange = 0,270268234749 +Iteration #24. weightChange = 0,160987314533 +Iteration #25. weightChange = 0,108494890058 +Iteration #26. weightChange = 0,142045825188 +Iteration #27. weightChange = 0,0696068114234 +Iteration #28. weightChange = 0,0939834167458 +Iteration #29. weightChange = 0,0376808362705 +Iteration #30. weightChange = 0,0488871451107 +Iteration #31. weightChange = 0,0346980226486 +Iteration #32. weightChange = 0,0206839827982 +Iteration #33. weightChange = 0,028283920176 +Iteration #34. weightChange = 0,0133233252969 +Iteration #35. weightChange = 0,0177940092064 +Iteration #36. weightChange = 0,00673387406082 +Iteration #37. weightChange = 0,00869253801166 +Iteration #38. weightChange = 0,00729875236984 +Iteration #39. weightChange = 0,00409596506738 +Iteration #40. weightChange = 0,00557165652769 +Iteration #41. weightChange = 0,00250962933041 +Iteration #42. weightChange = 0,00332805780094 +Iteration #43. weightChange = 0,00150357884973 +Iteration #44. weightChange = 0,00151529447295 +Iteration #45. weightChange = 0,0015078755055 +Iteration #46. weightChange = 0,000802149095372 +Iteration #47. weightChange = 0,00108513361219 +Iteration #48. weightChange = 0,000466532367817 +Iteration #49. weightChange = 0,000614575718123 +Iteration #50. weightChange = 0,000336209116251 +Iteration #51. weightChange = 0,000257852685663 +Iteration #52. weightChange = 0,000306807018515 +Iteration #53. weightChange = 0,000155330048824 +Iteration #54. weightChange = 0,000208956172037 +Final weightChange = 8,55713898368E-5 + +Number of iterations for Data_mining: 88 +Number of iterations for Analytics: 0 +Number of iterations for Information_extraction: 0 +Number of iterations for Data_analysis: 0 +Number of iterations for Computer_science: 0 +Number of iterations for Data_set: 0 +Number of iterations for Artificial_intelligence: 0 +Number of iterations for Machine_learning: 0 +Number of iterations for Database_system: 0 +Number of iterations for Data_pre_processing: 0 +Number of iterations for Online_algorithm: 0 +Number of iterations for Cluster_analysis: 0 +Number of iterations for Anomaly_detection: 0 +Number of iterations for Association_rule_mining: 0 +Number of iterations for Predictive_analytics: 0 +Number of iterations for Regression_analysis: 0 +Number of iterations for Neural_networks: 0 +Number of iterations for Genetic_algorithms: 0 +Number of iterations for Decision_tree_learning: 0 +Number of iterations for Support_vector_machines: 0 +Number of iterations for Association_for_Computing_Machinery: 0 +Number of iterations for European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases: 0 +Number of iterations for Association_rule_learning: 0 +Number of iterations for Automatic_summarization: 0 +Number of iterations for Receiver_operating_characteristic: 0 +Number of iterations for Multifactor_dimensionality_reduction: 0 +Number of iterations for SPSS_Modeler: 0 +Number of iterations for Text_mining: 0 +Number of iterations for Web_mining: 0 +Number of iterations for Profiling_practices: 0 +Number of iterations for Accuracy_paradox: 0 +Number of iterations for Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning: 0 +Number of iterations for K_optimal_pattern_discovery: 0 +Number of iterations for Biomedical_text_mining: 0 +Number of iterations for Nearest_neighbor_search: 0 +Number of iterations for Concept_drift: 0 +Number of 
iterations for Data_stream_mining: 0 +Number of iterations for Formal_concept_analysis: 0 +Number of iterations for Document_classification: 0 +Number of iterations for ECML_PKDD: 0 +Number of iterations for Elastic_map: 0 +Number of iterations for Feature_vector: 0 +Number of iterations for Gene_expression_programming: 0 +Number of iterations for List_of_machine_learning_algorithms: 0 +Number of iterations for Local_outlier_factor: 0 +Number of iterations for Molecule_mining: 0 +Number of iterations for Structure_mining: 0 +Number of iterations for Uncertain_data: 0 + +Final hub scores: +Data_mining - hub score: 0,298272125826 - inlinks: 47 - outlinks: 60 +Machine_learning - hub score: 0,200070431167 - inlinks: 26 - outlinks: 17 +Artificial_intelligence - hub score: 0,181061276435 - inlinks: 8 - outlinks: 11 +Predictive_analytics - hub score: 0,151381551863 - inlinks: 14 - outlinks: 8 +Neural_networks - hub score: 0,148140482825 - inlinks: 5 - outlinks: 6 +Cluster_analysis - hub score: 0,136340446151 - inlinks: 14 - outlinks: 9 +Data - hub score: 0,128487292839 - inlinks: 9 - outlinks: 11 +Text_mining - hub score: 0,125778937091 - inlinks: 11 - outlinks: 10 +Analytics - hub score: 0,122539939864 - inlinks: 5 - outlinks: 7 +Data_analysis - hub score: 0,119282180765 - inlinks: 9 - outlinks: 9 +Formal_concept_analysis - hub score: 0,116453708966 - inlinks: 1 - outlinks: 5 +Multifactor_dimensionality_reduction - hub score: 0,111857493134 - inlinks: 1 - outlinks: 4 +Statistics - hub score: 0,110411515736 - inlinks: 18 - outlinks: 9 +Computer_science - hub score: 0,11012560721 - inlinks: 14 - outlinks: 6 +Concept_drift - hub score: 0,109685290596 - inlinks: 1 - outlinks: 4 +Business_intelligence - hub score: 0,106530033057 - inlinks: 12 - outlinks: 8 +Gene_expression_programming - hub score: 0,105399184072 - inlinks: 4 - outlinks: 6 +SPSS_Modeler - hub score: 0,104075972761 - inlinks: 4 - outlinks: 8 +Support_vector_machines - hub score: 0,100341766796 - inlinks: 3 - outlinks: 5 +Data_visualization - hub score: 0,0996944959997 - inlinks: 5 - outlinks: 10 +Affinity_analysis - hub score: 0,0960898168973 - inlinks: 0 - outlinks: 4 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning - hub score: 0,0947579493008 - inlinks: 1 - outlinks: 4 +Concept_mining - hub score: 0,0918391084951 - inlinks: 4 - outlinks: 6 +Statistical_inference - hub score: 0,090073348026 - inlinks: 7 - outlinks: 7 +Document_classification - hub score: 0,0899133224692 - inlinks: 0 - outlinks: 7 +Decision_support_system - hub score: 0,088435810599 - inlinks: 4 - outlinks: 5 +Data_stream_mining - hub score: 0,087926843583 - inlinks: 1 - outlinks: 4 +Data_pre_processing - hub score: 0,0868415554632 - inlinks: 1 - outlinks: 2 +Decision_tree_learning - hub score: 0,0868415554632 - inlinks: 8 - outlinks: 3 +Receiver_operating_characteristic - hub score: 0,0868415554632 - inlinks: 4 - outlinks: 2 +Data_dredging - hub score: 0,0847845759536 - inlinks: 1 - outlinks: 3 +Lift_data_mining_ - hub score: 0,0725580642431 - inlinks: 2 - outlinks: 3 +Structure_mining - hub score: 0,0685394691742 - inlinks: 0 - outlinks: 4 +Association_rule_mining - hub score: 0,0678543173191 - inlinks: 1 - outlinks: 7 +Data_warehouse - hub score: 0,0678332450214 - inlinks: 7 - outlinks: 4 +Data_management - hub score: 0,0678332450214 - inlinks: 4 - outlinks: 4 +Evolutionary_data_mining - hub score: 0,0678332450214 - inlinks: 0 - outlinks: 3 +Data_Mining_and_Knowledge_Discovery - hub score: 0,0677409365656 - inlinks: 0 - outlinks: 2 
+Sequence_mining - hub score: 0,0663378993526 - inlinks: 8 - outlinks: 4 +K_optimal_pattern_discovery - hub score: 0,0663378993526 - inlinks: 2 - outlinks: 3 +FSA_Red_Algorithm - hub score: 0,0663378993526 - inlinks: 1 - outlinks: 3 +Contrast_set_learning - hub score: 0,0663378993526 - inlinks: 2 - outlinks: 2 +Database_system - hub score: 0,0633080765482 - inlinks: 1 - outlinks: 9 +Software_mining - hub score: 0,0632946952059 - inlinks: 0 - outlinks: 3 +Cross_Industry_Standard_Process_for_Data_Mining - hub score: 0,0622732177363 - inlinks: 3 - outlinks: 3 +Association_rule_learning - hub score: 0,055001395293 - inlinks: 11 - outlinks: 6 +Molecule_mining - hub score: 0,0534849773265 - inlinks: 1 - outlinks: 1 +Profiling_practices - hub score: 0,0534849773265 - inlinks: 1 - outlinks: 1 +Nothing_to_hide_argument - hub score: 0,0534849773265 - inlinks: 0 - outlinks: 1 +Web_mining - hub score: 0,0534849773265 - inlinks: 3 - outlinks: 1 +SEMMA - hub score: 0,0534849773265 - inlinks: 2 - outlinks: 5 +Regression_analysis - hub score: 0,0487602279865 - inlinks: 14 - outlinks: 7 +Genetic_algorithms - hub score: 0,0482839348528 - inlinks: 4 - outlinks: 4 +Automatic_summarization - hub score: 0,0475650817901 - inlinks: 2 - outlinks: 2 +List_of_machine_learning_algorithms - hub score: 0,0408355819287 - inlinks: 2 - outlinks: 4 +Data_collection - hub score: 0,0365883706995 - inlinks: 5 - outlinks: 5 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases - hub score: 0,0362737939637 - inlinks: 1 - outlinks: 2 +Feature_vector - hub score: 0,0333565781367 - inlinks: 0 - outlinks: 2 +Information_extraction - hub score: 0,0333565781367 - inlinks: 5 - outlinks: 2 +ECML_PKDD - hub score: 0,0333565781367 - inlinks: 2 - outlinks: 1 +Anomaly_detection - hub score: 0,0282565718759 - inlinks: 5 - outlinks: 2 +Accuracy_paradox - hub score: 0,0277100456381 - inlinks: 0 - outlinks: 2 +Conference_on_Information_and_Knowledge_Management - hub score: 0,0240790584608 - inlinks: 3 - outlinks: 2 +CIKM_Conference - hub score: 0,0240790584608 - inlinks: 1 - outlinks: 3 +Statistical_model - hub score: 0,0211847208497 - inlinks: 4 - outlinks: 2 +Elastic_map - hub score: 0,0154036498498 - inlinks: 0 - outlinks: 1 +Nearest_neighbor_search - hub score: 0,0154036498498 - inlinks: 2 - outlinks: 2 +Optimal_matching - hub score: 0,0154036498498 - inlinks: 0 - outlinks: 1 +Online_algorithm - hub score: 0,0142559592391 - inlinks: 1 - outlinks: 1 +IEEE - hub score: 0,0142559592391 - inlinks: 5 - outlinks: 1 +Uncertain_data - hub score: 0,0142559592391 - inlinks: 0 - outlinks: 1 +Association_for_Computing_Machinery - hub score: 0,0142559592391 - inlinks: 11 - outlinks: 5 +Biomedical_text_mining - hub score: 0,0142085036534 - inlinks: 1 - outlinks: 2 +Co_occurrence_networks - hub score: 0,0142085036534 - inlinks: 1 - outlinks: 1 +Apriori_algorithm - hub score: 0,0128529220261 - inlinks: 7 - outlinks: 2 +KDD_Conference - hub score: 0,00982309922176 - inlinks: 1 - outlinks: 3 +SIGMOD - hub score: 0,00982309922176 - inlinks: 4 - outlinks: 1 +Conference_on_Knowledge_Discovery_and_Data_Mining - hub score: 0,00982309922176 - inlinks: 1 - outlinks: 3 +SIGKDD - hub score: 0,00982309922176 - inlinks: 6 - outlinks: 2 +Data_classification_business_intelligence_ - hub score: 0,00980971787945 - inlinks: 0 - outlinks: 2 +Local_outlier_factor - hub score: 0,00791639383755 - inlinks: 0 - outlinks: 1 +FICO - hub score: 0,0077955738598 - inlinks: 1 - outlinks: 1 +ROUGE_metric_ - hub score: 0,00372760705288 
- inlinks: 0 - outlinks: 1 +Anomaly_Detection_at_Multiple_Scales - hub score: 1,36219373369E-103 - inlinks: 1 - outlinks: 1 + +Final authority scores: +Data_mining - auth. score: 0,481420279557 - inlinks: 47 - outlinks: 60 +Machine_learning - auth. score: 0,300243806287 - inlinks: 26 - outlinks: 17 +Predictive_analytics - auth. score: 0,193431219649 - inlinks: 14 - outlinks: 8 +Regression_analysis - auth. score: 0,190684464005 - inlinks: 14 - outlinks: 7 +Cluster_analysis - auth. score: 0,138648827906 - inlinks: 14 - outlinks: 9 +Decision_tree_learning - auth. score: 0,13040241942 - inlinks: 8 - outlinks: 3 +Data_analysis - auth. score: 0,129149293692 - inlinks: 9 - outlinks: 9 +Computer_science - auth. score: 0,128318421832 - inlinks: 14 - outlinks: 6 +Text_mining - auth. score: 0,127891272332 - inlinks: 11 - outlinks: 10 +Artificial_intelligence - auth. score: 0,121162536689 - inlinks: 8 - outlinks: 11 +Association_rule_learning - auth. score: 0,115689631449 - inlinks: 11 - outlinks: 6 +Neural_networks - auth. score: 0,0947669703546 - inlinks: 5 - outlinks: 6 +Association_for_Computing_Machinery - auth. score: 0,0884180831674 - inlinks: 11 - outlinks: 5 +Data_set - auth. score: 0,0882976372053 - inlinks: 7 - outlinks: 2 +Genetic_algorithms - auth. score: 0,0871920311758 - inlinks: 4 - outlinks: 4 +Information_extraction - auth. score: 0,0859697852294 - inlinks: 5 - outlinks: 2 +SPSS_Modeler - auth. score: 0,0791032803301 - inlinks: 4 - outlinks: 8 +Anomaly_detection - auth. score: 0,0712557567539 - inlinks: 5 - outlinks: 2 +Analytics - auth. score: 0,0701682516193 - inlinks: 5 - outlinks: 7 +Support_vector_machines - auth. score: 0,0653555399733 - inlinks: 3 - outlinks: 5 +Web_mining - auth. score: 0,0571017156083 - inlinks: 3 - outlinks: 1 +Receiver_operating_characteristic - auth. score: 0,0559879366163 - inlinks: 4 - outlinks: 2 +Gene_expression_programming - auth. score: 0,0464757154709 - inlinks: 4 - outlinks: 6 +List_of_machine_learning_algorithms - auth. score: 0,0423439347075 - inlinks: 2 - outlinks: 4 +Automatic_summarization - auth. score: 0,0335523303773 - inlinks: 2 - outlinks: 2 +Multifactor_dimensionality_reduction - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 4 +Online_algorithm - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 1 +Data_pre_processing - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 2 +Database_system - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 9 +Association_rule_mining - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 7 +Profiling_practices - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 1 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases - auth. score: 0,0331381912581 - inlinks: 1 - outlinks: 2 +Nearest_neighbor_search - auth. score: 0,0283998093258 - inlinks: 2 - outlinks: 2 +ECML_PKDD - auth. score: 0,0262579686702 - inlinks: 2 - outlinks: 1 +Biomedical_text_mining - auth. score: 0,0139741065714 - inlinks: 1 - outlinks: 2 +K_optimal_pattern_discovery - auth. score: 0,0136493347825 - inlinks: 2 - outlinks: 3 +Data_stream_mining - auth. score: 0,0121860939165 - inlinks: 1 - outlinks: 4 +Formal_concept_analysis - auth. score: 0,0102033736269 - inlinks: 1 - outlinks: 5 +Concept_drift - auth. score: 0,00976871892171 - inlinks: 1 - outlinks: 4 +Molecule_mining - auth. score: 0,00761477134993 - inlinks: 1 - outlinks: 1 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning - auth. 
score: 1,51340445751E-104 - inlinks: 1 - outlinks: 4 +Uncertain_data - auth. score: 0 - inlinks: 0 - outlinks: 1 +Structure_mining - auth. score: 0 - inlinks: 0 - outlinks: 4 +Document_classification - auth. score: 0 - inlinks: 0 - outlinks: 7 +Local_outlier_factor - auth. score: 0 - inlinks: 0 - outlinks: 1 +Accuracy_paradox - auth. score: 0 - inlinks: 0 - outlinks: 2 +Elastic_map - auth. score: 0 - inlinks: 0 - outlinks: 1 +Feature_vector - auth. score: 0 - inlinks: 0 - outlinks: 2 + +Page Ranks: +Data_mining - PageRank: 0,0313071789372 - inlinks: 47 - outlinks: 60 +Machine_learning - PageRank: 0,0158538977353 - inlinks: 26 - outlinks: 17 +Computer_science - PageRank: 0,0126722570331 - inlinks: 14 - outlinks: 6 +Association_rule_learning - PageRank: 0,00990670646945 - inlinks: 11 - outlinks: 6 +Cluster_analysis - PageRank: 0,0088622662563 - inlinks: 14 - outlinks: 9 +Association_for_Computing_Machinery - PageRank: 0,00760814223058 - inlinks: 11 - outlinks: 5 +Text_mining - PageRank: 0,00722965423835 - inlinks: 11 - outlinks: 10 +Artificial_intelligence - PageRank: 0,00483748324352 - inlinks: 8 - outlinks: 11 +Anomaly_detection - PageRank: 0,00464474010422 - inlinks: 5 - outlinks: 2 +Data_analysis - PageRank: 0,00452250176585 - inlinks: 9 - outlinks: 9 +Analytics - PageRank: 0,00398360477734 - inlinks: 5 - outlinks: 7 +Predictive_analytics - PageRank: 0,00377315039256 - inlinks: 14 - outlinks: 8 +Decision_tree_learning - PageRank: 0,00359688078719 - inlinks: 8 - outlinks: 3 +Data_set - PageRank: 0,00314473684211 - inlinks: 7 - outlinks: 2 +Receiver_operating_characteristic - PageRank: 0,00308353986382 - inlinks: 4 - outlinks: 2 +Neural_networks - PageRank: 0,00302106166424 - inlinks: 5 - outlinks: 6 +Automatic_summarization - PageRank: 0,00292105263158 - inlinks: 2 - outlinks: 2 +Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning - PageRank: 0,00292105263158 - inlinks: 1 - outlinks: 4 +SPSS_Modeler - PageRank: 0,00250326565026 - inlinks: 4 - outlinks: 8 +Information_extraction - PageRank: 0,00242076686785 - inlinks: 5 - outlinks: 2 +Web_mining - PageRank: 0,00238519730199 - inlinks: 3 - outlinks: 1 +Gene_expression_programming - PageRank: 0,00234245741998 - inlinks: 4 - outlinks: 6 +ECML_PKDD - PageRank: 0,00225 - inlinks: 2 - outlinks: 1 +Formal_concept_analysis - PageRank: 0,00206260065804 - inlinks: 1 - outlinks: 5 +Nearest_neighbor_search - PageRank: 0,0020060725352 - inlinks: 2 - outlinks: 2 +List_of_machine_learning_algorithms - PageRank: 0,00195275289178 - inlinks: 2 - outlinks: 4 +Concept_drift - PageRank: 0,00191447368421 - inlinks: 1 - outlinks: 4 +Molecule_mining - PageRank: 0,00191447368421 - inlinks: 1 - outlinks: 1 +Regression_analysis - PageRank: 0,00187996240602 - inlinks: 14 - outlinks: 7 +Support_vector_machines - PageRank: 0,00177067669173 - inlinks: 3 - outlinks: 5 +K_optimal_pattern_discovery - PageRank: 0,00177067669173 - inlinks: 2 - outlinks: 3 +Multifactor_dimensionality_reduction - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 4 +Database_system - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 9 +Local_outlier_factor - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 1 +Profiling_practices - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 1 +Uncertain_data - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 1 +Accuracy_paradox - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 2 +Structure_mining - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 4 +Data_pre_processing - PageRank: 0,00157894736842 - inlinks: 1 - 
outlinks: 2 +Online_algorithm - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 1 +Document_classification - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 7 +Genetic_algorithms - PageRank: 0,00157894736842 - inlinks: 4 - outlinks: 4 +Data_stream_mining - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 4 +European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 2 +Biomedical_text_mining - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 2 +Feature_vector - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 2 +Elastic_map - PageRank: 0,00157894736842 - inlinks: 0 - outlinks: 1 +Association_rule_mining - PageRank: 0,00157894736842 - inlinks: 1 - outlinks: 7 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query frequent itemsets.txt b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query frequent itemsets.txt new file mode 100644 index 00000000..635a8b20 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/query frequent itemsets.txt @@ -0,0 +1,126 @@ +Query: frequent itemsets + +Root set: +Association_rule_mining +Association_rule_learning +Sequence_mining +Apriori_algorithm + +Base set: +Data_mining +Affinity_analysis +Anomaly_detection +Apriori_algorithm +Contrast_set_learning +FSA_Red_Algorithm +K_optimal_pattern_discovery +Lift_data_mining_ +Sequence_mining +Machine_learning +Association_rule_mining +Association_rule_learning +Data_stream_mining +GSP_Algorithm +Structure_mining +Text_mining +List_of_machine_learning_algorithms +SPSS_Modeler +Iteration #1. weightChange = 1 +Iteration #2. weightChange = 21,361053719 +Iteration #3. weightChange = 1,17507732151 +Iteration #4. weightChange = 13,8618290252 +Iteration #5. weightChange = 9,51035435487 +Iteration #6. weightChange = 4,42768707905 +Iteration #7. weightChange = 3,94039537316 +Iteration #8. weightChange = 4,39889465042 +Iteration #9. weightChange = 4,16966106841 +Iteration #10. weightChange = 1,00935430022 +Iteration #11. weightChange = 2,5649785358 +Iteration #12. weightChange = 1,06556791313 +Iteration #13. weightChange = 1,38874274972 +Iteration #14. weightChange = 0,985627745864 +Iteration #15. weightChange = 0,772236012262 +Iteration #16. weightChange = 0,976314448843 +Iteration #17. weightChange = 0,416743891307 +Iteration #18. weightChange = 0,691135838427 +Iteration #19. weightChange = 0,278411066722 +Iteration #20. weightChange = 0,444376759913 +Iteration #21. weightChange = 0,207265309711 +Iteration #22. weightChange = 0,243848859214 +Iteration #23. weightChange = 0,2249308424 +Iteration #24. weightChange = 0,129124604584 +Iteration #25. weightChange = 0,18216496301 +Iteration #26. weightChange = 0,0840570015456 +Iteration #27. weightChange = 0,129467260528 +Iteration #28. weightChange = 0,0423048356336 +Iteration #29. weightChange = 0,078976294988 +Iteration #30. weightChange = 0,0476083203922 +Iteration #31. weightChange = 0,0395283708318 +Iteration #32. weightChange = 0,0448687275205 +Iteration #33. weightChange = 0,0244222589408 +Iteration #34. weightChange = 0,0351593128489 +Iteration #35. weightChange = 0,0143366655559 +Iteration #36. weightChange = 0,0236750977819 +Iteration #37. weightChange = 0,00856845338502 +Iteration #38. weightChange = 0,0136506701717 +Iteration #39. weightChange = 0,0101327087761 +Iteration #40. weightChange = 0,00671851087895 +Iteration #41. weightChange = 0,00893764007155 +Iteration #42. 
weightChange = 0,00440494374158 +Iteration #43. weightChange = 0,00663474533033 +Iteration #44. weightChange = 0,00246501012924 +Iteration #45. weightChange = 0,00426103300755 +Iteration #46. weightChange = 0,00201514083546 +Iteration #47. weightChange = 0,00231050290628 +Iteration #48. weightChange = 0,00210751977271 +Iteration #49. weightChange = 0,00125715755001 +Iteration #50. weightChange = 0,00174350227946 +Iteration #51. weightChange = 0,000786496798444 +Iteration #52. weightChange = 0,00123399552636 +Iteration #53. weightChange = 0,000410611134625 +Iteration #54. weightChange = 0,000753528382703 +Iteration #55. weightChange = 0,000449152173346 +Iteration #56. weightChange = 0,000379709125549 +Iteration #57. weightChange = 0,000427547600907 +Iteration #58. weightChange = 0,000231919305357 +Iteration #59. weightChange = 0,00033470289275 +Iteration #60. weightChange = 0,000137612626441 +Iteration #61. weightChange = 0,000225928294367 +Final weightChange = 8,07391641046E-5 + +Number of iterations for Association_rule_mining: 88 +Number of iterations for Association_rule_learning: 0 +Number of iterations for Sequence_mining: 0 +Number of iterations for Apriori_algorithm: 0 + +Final hub scores: +Association_rule_mining - hub score: 0,295536064344 - inlinks: 1 - outlinks: 7 +Data_mining - hub score: 0,231085810415 - inlinks: 47 - outlinks: 60 +Machine_learning - hub score: 0,215801905321 - inlinks: 26 - outlinks: 17 +FSA_Red_Algorithm - hub score: 0,206412353649 - inlinks: 1 - outlinks: 3 +Sequence_mining - hub score: 0,206412353649 - inlinks: 8 - outlinks: 4 +GSP_Algorithm - hub score: 0,168857869718 - inlinks: 1 - outlinks: 2 +Association_rule_learning - hub score: 0,168857869718 - inlinks: 11 - outlinks: 6 +Contrast_set_learning - hub score: 0,126678194626 - inlinks: 2 - outlinks: 2 +Apriori_algorithm - hub score: 0,126678194626 - inlinks: 7 - outlinks: 2 +Anomaly_detection - hub score: 0,126678194626 - inlinks: 5 - outlinks: 2 +Affinity_analysis - hub score: 0,126678194626 - inlinks: 0 - outlinks: 4 +K_optimal_pattern_discovery - hub score: 0,126678194626 - inlinks: 2 - outlinks: 3 +Lift_data_mining_ - hub score: 0,126678194626 - inlinks: 2 - outlinks: 3 +Data_stream_mining - hub score: 0,0891237106943 - inlinks: 1 - outlinks: 4 +Structure_mining - hub score: 0,0891237106943 - inlinks: 0 - outlinks: 4 +Text_mining - hub score: 0,0891237106943 - inlinks: 11 - outlinks: 10 +SPSS_Modeler - hub score: 0,0797341590232 - inlinks: 4 - outlinks: 8 +List_of_machine_learning_algorithms - hub score: 0,0797341590232 - inlinks: 2 - outlinks: 4 + +Final authority scores: +Association_rule_learning - auth. score: 0,492558858836 - inlinks: 11 - outlinks: 6 +Sequence_mining - auth. score: 0,346536934508 - inlinks: 8 - outlinks: 4 +Apriori_algorithm - auth. score: 0,31002783466 - inlinks: 7 - outlinks: 2 +Association_rule_mining - auth. 
score: 0,0594279297567 - inlinks: 1 - outlinks: 7 + +Page Ranks: +Association_rule_learning - PageRank: 0,0093760793441 - inlinks: 11 - outlinks: 6 +Apriori_algorithm - PageRank: 0,00448536675772 - inlinks: 7 - outlinks: 2 +Sequence_mining - PageRank: 0,00365599706795 - inlinks: 8 - outlinks: 4 +Association_rule_mining - PageRank: 0,00201495426623 - inlinks: 1 - outlinks: 7 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/results 2_1-3.txt b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/results 2_1-3.txt new file mode 100644 index 00000000..635a8b20 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/results/results 2_1-3.txt @@ -0,0 +1,126 @@ +Query: frequent itemsets + +Root set: +Association_rule_mining +Association_rule_learning +Sequence_mining +Apriori_algorithm + +Base set: +Data_mining +Affinity_analysis +Anomaly_detection +Apriori_algorithm +Contrast_set_learning +FSA_Red_Algorithm +K_optimal_pattern_discovery +Lift_data_mining_ +Sequence_mining +Machine_learning +Association_rule_mining +Association_rule_learning +Data_stream_mining +GSP_Algorithm +Structure_mining +Text_mining +List_of_machine_learning_algorithms +SPSS_Modeler +Iteration #1. weightChange = 1 +Iteration #2. weightChange = 21,361053719 +Iteration #3. weightChange = 1,17507732151 +Iteration #4. weightChange = 13,8618290252 +Iteration #5. weightChange = 9,51035435487 +Iteration #6. weightChange = 4,42768707905 +Iteration #7. weightChange = 3,94039537316 +Iteration #8. weightChange = 4,39889465042 +Iteration #9. weightChange = 4,16966106841 +Iteration #10. weightChange = 1,00935430022 +Iteration #11. weightChange = 2,5649785358 +Iteration #12. weightChange = 1,06556791313 +Iteration #13. weightChange = 1,38874274972 +Iteration #14. weightChange = 0,985627745864 +Iteration #15. weightChange = 0,772236012262 +Iteration #16. weightChange = 0,976314448843 +Iteration #17. weightChange = 0,416743891307 +Iteration #18. weightChange = 0,691135838427 +Iteration #19. weightChange = 0,278411066722 +Iteration #20. weightChange = 0,444376759913 +Iteration #21. weightChange = 0,207265309711 +Iteration #22. weightChange = 0,243848859214 +Iteration #23. weightChange = 0,2249308424 +Iteration #24. weightChange = 0,129124604584 +Iteration #25. weightChange = 0,18216496301 +Iteration #26. weightChange = 0,0840570015456 +Iteration #27. weightChange = 0,129467260528 +Iteration #28. weightChange = 0,0423048356336 +Iteration #29. weightChange = 0,078976294988 +Iteration #30. weightChange = 0,0476083203922 +Iteration #31. weightChange = 0,0395283708318 +Iteration #32. weightChange = 0,0448687275205 +Iteration #33. weightChange = 0,0244222589408 +Iteration #34. weightChange = 0,0351593128489 +Iteration #35. weightChange = 0,0143366655559 +Iteration #36. weightChange = 0,0236750977819 +Iteration #37. weightChange = 0,00856845338502 +Iteration #38. weightChange = 0,0136506701717 +Iteration #39. weightChange = 0,0101327087761 +Iteration #40. weightChange = 0,00671851087895 +Iteration #41. weightChange = 0,00893764007155 +Iteration #42. weightChange = 0,00440494374158 +Iteration #43. weightChange = 0,00663474533033 +Iteration #44. weightChange = 0,00246501012924 +Iteration #45. weightChange = 0,00426103300755 +Iteration #46. weightChange = 0,00201514083546 +Iteration #47. weightChange = 0,00231050290628 +Iteration #48. weightChange = 0,00210751977271 +Iteration #49. weightChange = 0,00125715755001 +Iteration #50. weightChange = 0,00174350227946 +Iteration #51. weightChange = 0,000786496798444 +Iteration #52. 
weightChange = 0,00123399552636 +Iteration #53. weightChange = 0,000410611134625 +Iteration #54. weightChange = 0,000753528382703 +Iteration #55. weightChange = 0,000449152173346 +Iteration #56. weightChange = 0,000379709125549 +Iteration #57. weightChange = 0,000427547600907 +Iteration #58. weightChange = 0,000231919305357 +Iteration #59. weightChange = 0,00033470289275 +Iteration #60. weightChange = 0,000137612626441 +Iteration #61. weightChange = 0,000225928294367 +Final weightChange = 8,07391641046E-5 + +Number of iterations for Association_rule_mining: 88 +Number of iterations for Association_rule_learning: 0 +Number of iterations for Sequence_mining: 0 +Number of iterations for Apriori_algorithm: 0 + +Final hub scores: +Association_rule_mining - hub score: 0,295536064344 - inlinks: 1 - outlinks: 7 +Data_mining - hub score: 0,231085810415 - inlinks: 47 - outlinks: 60 +Machine_learning - hub score: 0,215801905321 - inlinks: 26 - outlinks: 17 +FSA_Red_Algorithm - hub score: 0,206412353649 - inlinks: 1 - outlinks: 3 +Sequence_mining - hub score: 0,206412353649 - inlinks: 8 - outlinks: 4 +GSP_Algorithm - hub score: 0,168857869718 - inlinks: 1 - outlinks: 2 +Association_rule_learning - hub score: 0,168857869718 - inlinks: 11 - outlinks: 6 +Contrast_set_learning - hub score: 0,126678194626 - inlinks: 2 - outlinks: 2 +Apriori_algorithm - hub score: 0,126678194626 - inlinks: 7 - outlinks: 2 +Anomaly_detection - hub score: 0,126678194626 - inlinks: 5 - outlinks: 2 +Affinity_analysis - hub score: 0,126678194626 - inlinks: 0 - outlinks: 4 +K_optimal_pattern_discovery - hub score: 0,126678194626 - inlinks: 2 - outlinks: 3 +Lift_data_mining_ - hub score: 0,126678194626 - inlinks: 2 - outlinks: 3 +Data_stream_mining - hub score: 0,0891237106943 - inlinks: 1 - outlinks: 4 +Structure_mining - hub score: 0,0891237106943 - inlinks: 0 - outlinks: 4 +Text_mining - hub score: 0,0891237106943 - inlinks: 11 - outlinks: 10 +SPSS_Modeler - hub score: 0,0797341590232 - inlinks: 4 - outlinks: 8 +List_of_machine_learning_algorithms - hub score: 0,0797341590232 - inlinks: 2 - outlinks: 4 + +Final authority scores: +Association_rule_learning - auth. score: 0,492558858836 - inlinks: 11 - outlinks: 6 +Sequence_mining - auth. score: 0,346536934508 - inlinks: 8 - outlinks: 4 +Apriori_algorithm - auth. score: 0,31002783466 - inlinks: 7 - outlinks: 2 +Association_rule_mining - auth. 
score: 0,0594279297567 - inlinks: 1 - outlinks: 7 + +Page Ranks: +Association_rule_learning - PageRank: 0,0093760793441 - inlinks: 11 - outlinks: 6 +Apriori_algorithm - PageRank: 0,00448536675772 - inlinks: 7 - outlinks: 2 +Sequence_mining - PageRank: 0,00365599706795 - inlinks: 8 - outlinks: 4 +Association_rule_mining - PageRank: 0,00201495426623 - inlinks: 1 - outlinks: 7 diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph.dot b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph.dot new file mode 100644 index 00000000..2d059a05 --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph.dot @@ -0,0 +1,410 @@ +digraph { + Data_mining -> Analytics + Data_mining -> Information_extraction + Data_mining -> Data_analysis + Data_mining -> Computer_science + Data_mining -> Data_set + Data_mining -> Artificial_intelligence + Data_mining -> Machine_learning + Data_mining -> Statistics + Data_mining -> Database_system + Data_mining -> Data_management + Data_mining -> Data_pre_processing + Data_mining -> Statistical_model + Data_mining -> Statistical_inference + Data_mining -> Computational_complexity_theory + Data_mining -> Data_visualization + Data_mining -> Online_algorithm + Data_mining -> Buzzword + Data_mining -> Data_collection + Data_mining -> Data_warehouse + Data_mining -> Decision_support_system + Data_mining -> Business_intelligence + Data_mining -> Discovery_observation_ + Data_mining -> Cluster_analysis + Data_mining -> Anomaly_detection + Data_mining -> Association_rule_mining + Data_mining -> Spatial_index + Data_mining -> Predictive_analytics + Data_mining -> Data_dredging + Data_mining -> FICO + Data_mining -> Data + Data_mining -> Bayes_theorem + Data_mining -> Regression_analysis + Data_mining -> Neural_networks + Data_mining -> Genetic_algorithms + Data_mining -> Decision_tree_learning + Data_mining -> Support_vector_machines + Data_mining -> Association_for_Computing_Machinery + Data_mining -> SIGKDD + Data_mining -> Academic_journal + Data_mining -> CIKM_Conference + Data_mining -> Conference_on_Information_and_Knowledge_Management + Data_mining -> European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases + Data_mining -> IEEE + Data_mining -> KDD_Conference + Data_mining -> Conference_on_Knowledge_Discovery_and_Data_Mining + Data_mining -> Society_for_Industrial_and_Applied_Mathematics + Data_mining -> SIGMOD + Data_mining -> International_Conference_on_Very_Large_Data_Bases + Data_mining -> Cross_Industry_Standard_Process_for_Data_Mining + Data_mining -> SEMMA + Data_mining -> Association_rule_learning + Data_mining -> Automatic_summarization + Data_mining -> Receiver_operating_characteristic + Data_mining -> Sequence_mining + Data_mining -> Multifactor_dimensionality_reduction + Data_mining -> Mining_Software_Repositories + Data_mining -> SPSS_Modeler + Data_mining -> Text_mining + Data_mining -> Web_mining + Data_mining -> Profiling_practices + Accuracy_paradox -> Predictive_analytics + Accuracy_paradox -> Receiver_operating_characteristic + Affinity_analysis -> Data_analysis + Affinity_analysis -> Data_mining + Affinity_analysis -> Cluster_analysis + Affinity_analysis -> Association_rule_learning + Anomaly_detection -> Cluster_analysis + Anomaly_detection -> Association_rule_learning + Anomaly_Detection_at_Multiple_Scales -> Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning + Apriori_algorithm -> Association_rule_learning + 
Apriori_algorithm -> FSA_Red_Algorithm + Association_rule_learning -> Data_mining + Association_rule_learning -> Sequence_mining + Association_rule_learning -> Lift_data_mining_ + Association_rule_learning -> Apriori_algorithm + Association_rule_learning -> Contrast_set_learning + Association_rule_learning -> K_optimal_pattern_discovery + Automatic_summarization -> Machine_learning + Automatic_summarization -> Text_mining + Biomedical_text_mining -> Text_mining + Biomedical_text_mining -> Co_occurrence_networks + Cluster_analysis -> Data_mining + Cluster_analysis -> Statistics + Cluster_analysis -> Data_analysis + Cluster_analysis -> Machine_learning + Cluster_analysis -> Anomaly_detection + Cluster_analysis -> Computer_science + Cluster_analysis -> Nearest_neighbor_search + Cluster_analysis -> Association_for_Computing_Machinery + Cluster_analysis -> SIGKDD + Co_occurrence_networks -> Text_mining + Concept_drift -> Predictive_analytics + Concept_drift -> Machine_learning + Concept_drift -> Data_stream_mining + Concept_drift -> Data_mining + Concept_mining -> Artificial_intelligence + Concept_mining -> Statistics + Concept_mining -> Data_mining + Concept_mining -> Text_mining + Concept_mining -> Formal_concept_analysis + Concept_mining -> Information_extraction + Conference_on_Knowledge_Discovery_and_Data_Mining -> SIGKDD + Conference_on_Knowledge_Discovery_and_Data_Mining -> Association_for_Computing_Machinery + Conference_on_Knowledge_Discovery_and_Data_Mining -> Academic_journal + Contrast_set_learning -> Association_rule_learning + Contrast_set_learning -> Data_mining + Data_classification_business_intelligence_ -> Business_intelligence + Data_classification_business_intelligence_ -> Data_set + Data_dredging -> Data_mining + Data_dredging -> Data_set + Data_dredging -> Predictive_analytics + Data_Mining_and_Knowledge_Discovery -> Computer_science + Data_Mining_and_Knowledge_Discovery -> Data_mining + Data_stream_mining -> Data_mining + Data_stream_mining -> Machine_learning + Data_stream_mining -> Concept_drift + Data_stream_mining -> Sequence_mining + Decision_tree_learning -> Statistics + Decision_tree_learning -> Data_mining + Decision_tree_learning -> Machine_learning + Document_classification -> Computer_science + Document_classification -> Support_vector_machines + Document_classification -> Decision_tree_learning + Document_classification -> Machine_learning + Document_classification -> Text_mining + Document_classification -> Web_mining + Document_classification -> Concept_mining + ECML_PKDD -> Machine_learning + Elastic_map -> Cluster_analysis + Evolutionary_data_mining -> Data_mining + Evolutionary_data_mining -> Data_analysis + Evolutionary_data_mining -> IEEE + Feature_vector -> Machine_learning + Feature_vector -> Statistics + Formal_concept_analysis -> Data_mining + Formal_concept_analysis -> Text_mining + Formal_concept_analysis -> Machine_learning + Formal_concept_analysis -> Cluster_analysis + Formal_concept_analysis -> Concept_mining + FSA_Red_Algorithm -> Apriori_algorithm + FSA_Red_Algorithm -> Data_mining + FSA_Red_Algorithm -> Association_rule_learning + Gene_expression_programming -> Genetic_algorithms + Gene_expression_programming -> Machine_learning + Gene_expression_programming -> Regression_analysis + Gene_expression_programming -> Receiver_operating_characteristic + Gene_expression_programming -> Predictive_analytics + Gene_expression_programming -> Artificial_intelligence + GSP_Algorithm -> Sequence_mining + GSP_Algorithm -> Apriori_algorithm + 
K_optimal_pattern_discovery -> Data_mining + K_optimal_pattern_discovery -> Association_rule_learning + K_optimal_pattern_discovery -> Data + Lift_data_mining_ -> Data_mining + Lift_data_mining_ -> Association_rule_learning + Lift_data_mining_ -> Receiver_operating_characteristic + List_of_machine_learning_algorithms -> Decision_tree_learning + List_of_machine_learning_algorithms -> Gene_expression_programming + List_of_machine_learning_algorithms -> Regression_analysis + List_of_machine_learning_algorithms -> Apriori_algorithm + Local_outlier_factor -> Anomaly_detection + Molecule_mining -> Data_mining + Multifactor_dimensionality_reduction -> Data_mining + Multifactor_dimensionality_reduction -> Machine_learning + Multifactor_dimensionality_reduction -> Decision_tree_learning + Multifactor_dimensionality_reduction -> Neural_networks + Nearest_neighbor_search -> Cluster_analysis + Nearest_neighbor_search -> Spatial_index + Nothing_to_hide_argument -> Data_mining + Optimal_matching -> Cluster_analysis + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Data_mining + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Anomaly_Detection_at_Multiple_Scales + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Machine_learning + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Anomaly_detection + Profiling_practices -> Data_mining + Receiver_operating_characteristic -> Machine_learning + Receiver_operating_characteristic -> Data_mining + ROUGE_metric_ -> Automatic_summarization + Sequence_mining -> Data_mining + Sequence_mining -> Association_rule_learning + Sequence_mining -> Apriori_algorithm + Sequence_mining -> GSP_Algorithm + SIGKDD -> Association_for_Computing_Machinery + SIGKDD -> Academic_journal + Software_mining -> Data_mining + Software_mining -> Data_set + Software_mining -> Mining_Software_Repositories + SPSS_Modeler -> Data_mining + SPSS_Modeler -> Predictive_analytics + SPSS_Modeler -> Business_intelligence + SPSS_Modeler -> Data_warehouse + SPSS_Modeler -> Anomaly_detection + SPSS_Modeler -> Apriori_algorithm + SPSS_Modeler -> Regression_analysis + SPSS_Modeler -> Cross_Industry_Standard_Process_for_Data_Mining + Structure_mining -> Data_mining + Structure_mining -> Text_mining + Structure_mining -> Molecule_mining + Structure_mining -> Sequence_mining + Text_mining -> Data_mining + Text_mining -> Concept_mining + Text_mining -> Information_extraction + Text_mining -> Predictive_analytics + Text_mining -> Machine_learning + Text_mining -> Business_intelligence + Text_mining -> Statistics + Text_mining -> Biomedical_text_mining + Text_mining -> Web_mining + Text_mining -> Sequence_mining + Uncertain_data -> Computer_science + Web_mining -> Data_mining + Analytics -> Statistics + Analytics -> Data_visualization + Analytics -> Text_mining + Analytics -> Business_intelligence + Analytics -> Data_mining + Analytics -> Machine_learning + Analytics -> Predictive_analytics + Information_extraction -> Machine_learning + Information_extraction -> Concept_mining + Data_analysis -> Data + Data_analysis -> Data_mining + Data_analysis -> Business_intelligence + Data_analysis -> Statistics + Data_analysis -> Predictive_analytics + Data_analysis -> Data_visualization + Data_analysis -> Analytics + Data_analysis -> Machine_learning + Data_analysis -> Nearest_neighbor_search + Computer_science -> Computational_complexity_theory + Computer_science -> Artificial_intelligence + Computer_science -> 
Machine_learning + Computer_science -> Statistics + Computer_science -> Association_for_Computing_Machinery + Computer_science -> Data_mining + Data_set -> Data + Data_set -> Statistics + Artificial_intelligence -> Computer_science + Artificial_intelligence -> Data_mining + Artificial_intelligence -> Machine_learning + Artificial_intelligence -> Regression_analysis + Artificial_intelligence -> Text_mining + Artificial_intelligence -> Neural_networks + Artificial_intelligence -> Genetic_algorithms + Artificial_intelligence -> Gene_expression_programming + Artificial_intelligence -> Decision_tree_learning + Artificial_intelligence -> List_of_machine_learning_algorithms + Artificial_intelligence -> Computational_complexity_theory + Machine_learning -> Artificial_intelligence + Machine_learning -> Data_mining + Machine_learning -> Discovery_observation_ + Machine_learning -> ECML_PKDD + Machine_learning -> Statistical_inference + Machine_learning -> List_of_machine_learning_algorithms + Machine_learning -> Decision_tree_learning + Machine_learning -> Association_rule_learning + Machine_learning -> Genetic_algorithms + Machine_learning -> Support_vector_machines + Machine_learning -> Regression_analysis + Machine_learning -> Cluster_analysis + Machine_learning -> Statistics + Machine_learning -> Data_analysis + Machine_learning -> Sequence_mining + Machine_learning -> SPSS_Modeler + Machine_learning -> Predictive_analytics + Statistics -> Data + Statistics -> Data_collection + Statistics -> Statistical_model + Statistics -> Statistical_inference + Statistics -> Regression_analysis + Statistics -> Data_mining + Statistics -> Data_set + Statistics -> Neural_networks + Statistics -> Cluster_analysis + Database_system -> Academic_journal + Database_system -> Association_for_Computing_Machinery + Database_system -> SIGMOD + Database_system -> IEEE + Database_system -> Data_warehouse + Database_system -> Data_mining + Database_system -> Data + Database_system -> Business_intelligence + Database_system -> Decision_support_system + Data_management -> Data + Data_management -> Data_analysis + Data_management -> Business_intelligence + Data_management -> Data_mining + Data_pre_processing -> Data_mining + Data_pre_processing -> Machine_learning + Statistical_model -> Statistical_inference + Statistical_model -> Regression_analysis + Statistical_inference -> Statistics + Statistical_inference -> Statistical_model + Statistical_inference -> Cluster_analysis + Statistical_inference -> Data_mining + Statistical_inference -> Computational_complexity_theory + Statistical_inference -> Regression_analysis + Statistical_inference -> Data_collection + Data_visualization -> Data + Data_visualization -> Computer_science + Data_visualization -> Data_analysis + Data_visualization -> Data_mining + Data_visualization -> Statistics + Data_visualization -> Data_management + Data_visualization -> Business_intelligence + Data_visualization -> Data_set + Data_visualization -> Data_warehouse + Data_visualization -> Analytics + Online_algorithm -> Computer_science + Data_collection -> Data_management + Data_collection -> Statistics + Data_collection -> Statistical_inference + Data_collection -> Regression_analysis + Data_collection -> Cluster_analysis + Data_warehouse -> Data_analysis + Data_warehouse -> Data_mining + Data_warehouse -> Decision_support_system + Data_warehouse -> Business_intelligence + Decision_support_system -> Data_warehouse + Decision_support_system -> Artificial_intelligence + Decision_support_system -> 
Predictive_analytics + Decision_support_system -> Business_intelligence + Decision_support_system -> Data_mining + Business_intelligence -> Analytics + Business_intelligence -> Data_mining + Business_intelligence -> Text_mining + Business_intelligence -> Data_warehouse + Business_intelligence -> Predictive_analytics + Business_intelligence -> Data_visualization + Business_intelligence -> Information_extraction + Business_intelligence -> Decision_support_system + Discovery_observation_ -> IEEE + Association_rule_mining -> Association_rule_learning + Association_rule_mining -> Data_mining + Association_rule_mining -> Sequence_mining + Association_rule_mining -> Lift_data_mining_ + Association_rule_mining -> Apriori_algorithm + Association_rule_mining -> Contrast_set_learning + Association_rule_mining -> K_optimal_pattern_discovery + Predictive_analytics -> Statistics + Predictive_analytics -> Machine_learning + Predictive_analytics -> Data_mining + Predictive_analytics -> Information_extraction + Predictive_analytics -> Regression_analysis + Predictive_analytics -> Decision_tree_learning + Predictive_analytics -> Neural_networks + Predictive_analytics -> SPSS_Modeler + FICO -> Analytics + Data -> Computer_science + Data -> Data_analysis + Data -> Data_management + Data -> Data_mining + Data -> Data_set + Data -> Data_warehouse + Data -> Statistics + Data -> Data_collection + Data -> Statistical_inference + Data -> Regression_analysis + Data -> Cluster_analysis + Regression_analysis -> Statistics + Regression_analysis -> Machine_learning + Regression_analysis -> Data + Regression_analysis -> Statistical_model + Regression_analysis -> Data_collection + Regression_analysis -> Statistical_inference + Regression_analysis -> Cluster_analysis + Neural_networks -> Artificial_intelligence + Neural_networks -> Regression_analysis + Neural_networks -> Data_mining + Neural_networks -> Machine_learning + Neural_networks -> Gene_expression_programming + Neural_networks -> Predictive_analytics + Genetic_algorithms -> Computer_science + Genetic_algorithms -> Artificial_intelligence + Genetic_algorithms -> Gene_expression_programming + Genetic_algorithms -> Cluster_analysis + Support_vector_machines -> Machine_learning + Support_vector_machines -> Regression_analysis + Support_vector_machines -> Association_for_Computing_Machinery + Support_vector_machines -> Predictive_analytics + Support_vector_machines -> Decision_tree_learning + Association_for_Computing_Machinery -> IEEE + Association_for_Computing_Machinery -> Computer_science + Association_for_Computing_Machinery -> SIGKDD + Association_for_Computing_Machinery -> SIGMOD + Association_for_Computing_Machinery -> Conference_on_Information_and_Knowledge_Management + CIKM_Conference -> Conference_on_Information_and_Knowledge_Management + CIKM_Conference -> Association_for_Computing_Machinery + CIKM_Conference -> Computer_science + Conference_on_Information_and_Knowledge_Management -> Association_for_Computing_Machinery + Conference_on_Information_and_Knowledge_Management -> Computer_science + European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases -> ECML_PKDD + European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases -> Machine_learning + IEEE -> Computer_science + KDD_Conference -> SIGKDD + KDD_Conference -> Association_for_Computing_Machinery + KDD_Conference -> Academic_journal + SIGMOD -> Association_for_Computing_Machinery + 
International_Conference_on_Very_Large_Data_Bases -> SIGMOD + International_Conference_on_Very_Large_Data_Bases -> SIGKDD + Cross_Industry_Standard_Process_for_Data_Mining -> Data_mining + Cross_Industry_Standard_Process_for_Data_Mining -> SEMMA + Cross_Industry_Standard_Process_for_Data_Mining -> SPSS_Modeler + SEMMA -> Statistics + SEMMA -> Business_intelligence + SEMMA -> Data_mining + SEMMA -> Cross_Industry_Standard_Process_for_Data_Mining + SEMMA -> Data_visualization +} diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_100.png b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_100.png new file mode 100644 index 00000000..080cd6f7 Binary files /dev/null and b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_100.png differ diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.dot b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.dot new file mode 100644 index 00000000..45d3647d --- /dev/null +++ b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.dot @@ -0,0 +1,472 @@ +digraph { + Data_mining [color=deeppink, label="Data_mining\nauthScore = 0,4549\nhubScore = 0,3423\npageRank = 0,0313"]; + Analytics [color=deeppink, label="Analytics\nauthScore = 0,0844\nhubScore = 0,1630\npageRank = 0,0040"]; + Information_extraction [color=deeppink, label="Information_extraction\nauthScore = 0,0902\nhubScore = 0,0290\npageRank = 0,0024"]; + Computer_science [color=deeppink, label="Computer_science\nauthScore = 0,1201\nhubScore = 0,1319\npageRank = 0,0127"]; + Data_set [color=deeppink, label="Data_set\nauthScore = 0,0923\nhubScore = 0,0384\npageRank = 0,0031"]; + Artificial_intelligence [color=deeppink, label="Artificial_intelligence\nauthScore = 0,1134\nhubScore = 0,1177\npageRank = 0,0048"]; + Machine_learning [color=deeppink, label="Machine_learning\nauthScore = 0,2754\nhubScore = 0,1638\npageRank = 0,0159"]; + Statistics [color=deeppink, label="Statistics\nauthScore = 0,2435\nhubScore = 0,0940\npageRank = 0,0075"]; + Database_system [color=deeppink, label="Database_system\nauthScore = 0,0360\nhubScore = 0,1132\npageRank = 0,0016"]; + Statistical_inference [color=deeppink, label="Statistical_inference\nauthScore = 0,0928\nhubScore = 0,0950\npageRank = 0,0026"]; + Computational_complexity_theory [color=deeppink, label="Computational_complexity_theory\nauthScore = 0,0723\nhubScore = 1,0000\npageRank = 0,0045"]; + Data_visualization [color=deeppink, label="Data_visualization\nauthScore = 0,0965\nhubScore = 0,1472\npageRank = 0,0023"]; + Data_warehouse [color=deeppink, label="Data_warehouse\nauthScore = 0,1136\nhubScore = 0,0729\npageRank = 0,0021"]; + Decision_support_system [color=deeppink, label="Decision_support_system\nauthScore = 0,0691\nhubScore = 0,1093\npageRank = 0,0022"]; + Business_intelligence [color=deeppink, label="Business_intelligence\nauthScore = 0,1686\nhubScore = 0,1283\npageRank = 0,0041"]; + Cluster_analysis [color=deeppink, label="Cluster_analysis\nauthScore = 0,1318\nhubScore = 0,1345\npageRank = 0,0089"]; + Association_rule_mining [color=deeppink, label="Association_rule_mining\nauthScore = 0,0360\nhubScore = 0,0642\npageRank = 0,0016"]; + Spatial_index [color=deeppink, label="Spatial_index\nauthScore = 0,0379\nhubScore = 1,0000\npageRank = 0,0029"]; + Predictive_analytics [color=deeppink, label="Predictive_analytics\nauthScore = 0,1883\nhubScore = 0,1196\npageRank = 0,0038"]; + Data [color=deeppink, label="Data\nauthScore = 
0,1212\nhubScore = 0,1315\npageRank = 0,0042"]; + Genetic_algorithms [color=deeppink, label="Genetic_algorithms\nauthScore = 0,0735\nhubScore = 0,0384\npageRank = 0,0016"]; + Support_vector_machines [color=deeppink, label="Support_vector_machines\nauthScore = 0,0603\nhubScore = 0,0587\npageRank = 0,0018"]; + Association_for_Computing_Machinery [color=deeppink, label="Association_for_Computing_Machinery\nauthScore = 0,0942\nhubScore = 0,0187\npageRank = 0,0076"]; + SIGKDD [color=deeppink, label="SIGKDD\nauthScore = 0,0573\nhubScore = 0,0156\npageRank = 0,0031"]; + Academic_journal [color=deeppink, label="Academic_journal\nauthScore = 0,0541\nhubScore = 1,0000\npageRank = 0,0044"]; + European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases [color=deeppink, label="European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases\nauthScore = 0,0360\nhubScore = 0,0311\npageRank = 0,0016"]; + KDD_Conference [color=deeppink, label="KDD_Conference\nauthScore = 0,0360\nhubScore = 0,0216\npageRank = 0,0016"]; + Conference_on_Knowledge_Discovery_and_Data_Mining [color=deeppink, label="Conference_on_Knowledge_Discovery_and_Data_Mining\nauthScore = 0,0360\nhubScore = 0,0216\npageRank = 0,0016"]; + Society_for_Industrial_and_Applied_Mathematics [color=deeppink, label="Society_for_Industrial_and_Applied_Mathematics\nauthScore = 0,0360\nhubScore = 1,0000\npageRank = 0,0020"]; + Cross_Industry_Standard_Process_for_Data_Mining [color=deeppink, label="Cross_Industry_Standard_Process_for_Data_Mining\nauthScore = 0,0587\nhubScore = 0,0599\npageRank = 0,0018"]; + SEMMA [color=deeppink, label="SEMMA\nauthScore = 0,0423\nhubScore = 0,1076\npageRank = 0,0016"]; + Association_rule_learning [color=deeppink, label="Association_rule_learning\nauthScore = 0,1053\nhubScore = 0,0532\npageRank = 0,0099"]; + Automatic_summarization [color=deeppink, label="Automatic_summarization\nauthScore = 0,0364\nhubScore = 0,0418\npageRank = 0,0029"]; + Receiver_operating_characteristic [color=deeppink, label="Receiver_operating_characteristic\nauthScore = 0,0533\nhubScore = 0,0769\npageRank = 0,0031"]; + SPSS_Modeler [color=deeppink, label="SPSS_Modeler\nauthScore = 0,0722\nhubScore = 0,1075\npageRank = 0,0025"]; + Text_mining [color=deeppink, label="Text_mining\nauthScore = 0,1220\nhubScore = 0,1576\npageRank = 0,0072"]; + Web_mining [color=deeppink, label="Web_mining\nauthScore = 0,0597\nhubScore = 0,0479\npageRank = 0,0024"]; + Profiling_practices [color=deeppink, label="Profiling_practices\nauthScore = 0,0360\nhubScore = 0,0479\npageRank = 0,0016"]; + Affinity_analysis [color=deeppink, label="Affinity_analysis\nauthScore = 0,0000\nhubScore = 0,0728\npageRank = 0,0016"]; + Anomaly_Detection_at_Multiple_Scales [color=deeppink, label="Anomaly_Detection_at_Multiple_Scales\nauthScore = 0,0082\nhubScore = 1,0000\npageRank = 0,0016"]; + Apriori_algorithm [color=deeppink, label="Apriori_algorithm\nauthScore = 0,0378\nhubScore = 0,0111\npageRank = 0,0047"]; + K_optimal_pattern_discovery [color=deeppink, label="K_optimal_pattern_discovery\nauthScore = 0,0124\nhubScore = 0,0717\npageRank = 0,0018"]; + Biomedical_text_mining [color=deeppink, label="Biomedical_text_mining\nauthScore = 0,0166\nhubScore = 0,0130\npageRank = 0,0016"]; + Co_occurrence_networks [color=deeppink, label="Co_occurrence_networks\nauthScore = 0,0014\nhubScore = 0,0128\npageRank = 0,0023"]; + Nearest_neighbor_search [color=deeppink, label="Nearest_neighbor_search\nauthScore = 
0,0326\nhubScore = 0,0179\npageRank = 0,0020"]; + Concept_drift [color=deeppink, label="Concept_drift\nauthScore = 0,0082\nhubScore = 0,0978\npageRank = 0,0019"]; + Data_stream_mining [color=deeppink, label="Data_stream_mining\nauthScore = 0,0103\nhubScore = 0,0777\npageRank = 0,0016"]; + Formal_concept_analysis [color=deeppink, label="Formal_concept_analysis\nauthScore = 0,0115\nhubScore = 0,1036\npageRank = 0,0021"]; + Data_Mining_and_Knowledge_Discovery [color=deeppink, label="Data_Mining_and_Knowledge_Discovery\nauthScore = 0,0000\nhubScore = 0,0605\npageRank = 0,0016"]; + Document_classification [color=deeppink, label="Document_classification\nauthScore = 0,0000\nhubScore = 0,0671\npageRank = 0,0016"]; + ECML_PKDD [color=deeppink, label="ECML_PKDD\nauthScore = 0,0205\nhubScore = 0,0290\npageRank = 0,0023"]; + Nothing_to_hide_argument [color=deeppink, label="Nothing_to_hide_argument\nauthScore = 0,0000\nhubScore = 0,0479\npageRank = 0,0016"]; + ROUGE_metric_ [color=deeppink, label="ROUGE_metric_\nauthScore = 0,0000\nhubScore = 0,0038\npageRank = 0,0016"]; + Software_mining [color=deeppink, label="Software_mining\nauthScore = 0,0000\nhubScore = 0,0576\npageRank = 0,0016"]; + Concept_mining [color=cyan2, label="Concept_mining\nhubScore = 0,1090\npageRank = 0,0034"]; + Contrast_set_learning [color=cyan2, label="Contrast_set_learning\nhubScore = 0,0590\npageRank = 0,0018"]; + Data_dredging [color=cyan2, label="Data_dredging\nhubScore = 0,0774\npageRank = 0,0016"]; + Decision_tree_learning [color=cyan2, label="Decision_tree_learning\nhubScore = 0,1025\npageRank = 0,0036"]; + Evolutionary_data_mining [color=cyan2, label="Evolutionary_data_mining\nhubScore = 0,0479\npageRank = 0,0016"]; + FSA_Red_Algorithm [color=cyan2, label="FSA_Red_Algorithm\nhubScore = 0,0629\npageRank = 0,0016"]; + Lift_data_mining_ [color=cyan2, label="Lift_data_mining_\nhubScore = 0,0646\npageRank = 0,0018"]; + Molecule_mining [color=cyan2, label="Molecule_mining\nhubScore = 0,0479\npageRank = 0,0019"]; + Multifactor_dimensionality_reduction [color=cyan2, label="Multifactor_dimensionality_reduction\nhubScore = 0,0769\npageRank = 0,0016"]; + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning [color=cyan2, label="Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning\nhubScore = 0,0777\npageRank = 0,0029"]; + Sequence_mining [color=cyan2, label="Sequence_mining\nhubScore = 0,0629\npageRank = 0,0038"]; + Structure_mining [color=cyan2, label="Structure_mining\nhubScore = 0,0607\npageRank = 0,0016"]; + Data_analysis [color=cyan2, label="Data_analysis\nhubScore = 0,1753\npageRank = 0,0045"]; + Data_management [color=cyan2, label="Data_management\nhubScore = 0,0784\npageRank = 0,0019"]; + Data_pre_processing [color=cyan2, label="Data_pre_processing\nhubScore = 0,0769\npageRank = 0,0016"]; + Neural_networks [color=cyan2, label="Neural_networks\nhubScore = 0,1086\npageRank = 0,0030"]; + FICO [color=cyan2, label="FICO\nhubScore = 0,0089\npageRank = 0,0016"]; + Uncertain_data [color=cyan2, label="Uncertain_data\nhubScore = 0,0126\npageRank = 0,0016"]; + Online_algorithm [color=cyan2, label="Online_algorithm\nhubScore = 0,0126\npageRank = 0,0016"]; + CIKM_Conference [color=cyan2, label="CIKM_Conference\nhubScore = 0,0225\npageRank = 0,0016"]; + Conference_on_Information_and_Knowledge_Management [color=cyan2, label="Conference_on_Information_and_Knowledge_Management\nhubScore = 0,0225\npageRank = 0,0020"]; + IEEE [color=cyan2, label="IEEE\nhubScore = 0,0126\npageRank = 0,0048"]; + 
Data_classification_business_intelligence_ [color=cyan2, label="Data_classification_business_intelligence_\nhubScore = 0,0275\npageRank = 0,0016"]; + Gene_expression_programming [color=cyan2, label="Gene_expression_programming\nhubScore = 0,0741\npageRank = 0,0023"]; + Feature_vector [color=cyan2, label="Feature_vector\nhubScore = 0,0546\npageRank = 0,0016"]; + Regression_analysis [color=cyan2, label="Regression_analysis\nhubScore = 0,0910\npageRank = 0,0019"]; + Data_collection [color=cyan2, label="Data_collection\nhubScore = 0,0493\npageRank = 0,0021"]; + Statistical_model [color=cyan2, label="Statistical_model\nhubScore = 0,0098\npageRank = 0,0018"]; + Anomaly_detection [color=cyan2, label="Anomaly_detection\nhubScore = 0,0249\npageRank = 0,0046"]; + Elastic_map [color=cyan2, label="Elastic_map\nhubScore = 0,0139\npageRank = 0,0016"]; + Optimal_matching [color=cyan2, label="Optimal_matching\nhubScore = 0,0139\npageRank = 0,0016"]; + Accuracy_paradox [color=cyan2, label="Accuracy_paradox\nhubScore = 0,0254\npageRank = 0,0016"]; + SIGMOD [color=cyan2, label="SIGMOD\nhubScore = 0,0099\npageRank = 0,0024"]; + International_Conference_on_Very_Large_Data_Bases [color=cyan2, label="International_Conference_on_Very_Large_Data_Bases\nhubScore = 0,0060\npageRank = 0,0016"]; + GSP_Algorithm [color=cyan2, label="GSP_Algorithm\nhubScore = 0,0040\npageRank = 0,0024"]; + List_of_machine_learning_algorithms [color=cyan2, label="List_of_machine_learning_algorithms\nhubScore = 0,0040\npageRank = 0,0020"]; + Mining_Software_Repositories [color=cyan2, label="Mining_Software_Repositories\nhubScore = 0,0000\npageRank = 0,0025"]; + Affinity_analysis -> Data_mining + Association_rule_learning -> Data_mining + Cluster_analysis -> Data_mining + Concept_drift -> Data_mining + Concept_mining -> Data_mining + Contrast_set_learning -> Data_mining + Data_dredging -> Data_mining + Data_Mining_and_Knowledge_Discovery -> Data_mining + Data_stream_mining -> Data_mining + Decision_tree_learning -> Data_mining + Evolutionary_data_mining -> Data_mining + Formal_concept_analysis -> Data_mining + FSA_Red_Algorithm -> Data_mining + K_optimal_pattern_discovery -> Data_mining + Lift_data_mining_ -> Data_mining + Molecule_mining -> Data_mining + Multifactor_dimensionality_reduction -> Data_mining + Nothing_to_hide_argument -> Data_mining + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Data_mining + Profiling_practices -> Data_mining + Receiver_operating_characteristic -> Data_mining + Sequence_mining -> Data_mining + Software_mining -> Data_mining + SPSS_Modeler -> Data_mining + Structure_mining -> Data_mining + Text_mining -> Data_mining + Web_mining -> Data_mining + Analytics -> Data_mining + Data_analysis -> Data_mining + Computer_science -> Data_mining + Artificial_intelligence -> Data_mining + Machine_learning -> Data_mining + Statistics -> Data_mining + Database_system -> Data_mining + Data_management -> Data_mining + Data_pre_processing -> Data_mining + Statistical_inference -> Data_mining + Data_visualization -> Data_mining + Data_warehouse -> Data_mining + Decision_support_system -> Data_mining + Business_intelligence -> Data_mining + Association_rule_mining -> Data_mining + Predictive_analytics -> Data_mining + Data -> Data_mining + Neural_networks -> Data_mining + Cross_Industry_Standard_Process_for_Data_Mining -> Data_mining + SEMMA -> Data_mining + Data_mining -> Data_analysis + Data_mining -> Data_management + Data_mining -> Data_pre_processing + Data_mining -> Statistical_model + 
Data_mining -> Online_algorithm + Data_mining -> Data_collection + Data_mining -> Anomaly_detection + Data_mining -> Data_dredging + Data_mining -> FICO + Data_mining -> Regression_analysis + Data_mining -> Neural_networks + Data_mining -> Decision_tree_learning + Data_mining -> CIKM_Conference + Data_mining -> Conference_on_Information_and_Knowledge_Management + Data_mining -> IEEE + Data_mining -> SIGMOD + Data_mining -> International_Conference_on_Very_Large_Data_Bases + Data_mining -> Sequence_mining + Data_mining -> Multifactor_dimensionality_reduction + Data_mining -> Mining_Software_Repositories + Data_mining -> Analytics + Data_analysis -> Analytics + Data_visualization -> Analytics + Business_intelligence -> Analytics + FICO -> Analytics + Data_mining -> Information_extraction + Concept_mining -> Information_extraction + Text_mining -> Information_extraction + Business_intelligence -> Information_extraction + Predictive_analytics -> Information_extraction + Information_extraction -> Concept_mining + Data_mining -> Computer_science + Cluster_analysis -> Computer_science + Data_Mining_and_Knowledge_Discovery -> Computer_science + Document_classification -> Computer_science + Uncertain_data -> Computer_science + Artificial_intelligence -> Computer_science + Data_visualization -> Computer_science + Online_algorithm -> Computer_science + Data -> Computer_science + Genetic_algorithms -> Computer_science + Association_for_Computing_Machinery -> Computer_science + CIKM_Conference -> Computer_science + Conference_on_Information_and_Knowledge_Management -> Computer_science + IEEE -> Computer_science + Data_mining -> Data_set + Data_classification_business_intelligence_ -> Data_set + Data_dredging -> Data_set + Software_mining -> Data_set + Statistics -> Data_set + Data_visualization -> Data_set + Data -> Data_set + Data_mining -> Artificial_intelligence + Concept_mining -> Artificial_intelligence + Gene_expression_programming -> Artificial_intelligence + Computer_science -> Artificial_intelligence + Machine_learning -> Artificial_intelligence + Decision_support_system -> Artificial_intelligence + Neural_networks -> Artificial_intelligence + Genetic_algorithms -> Artificial_intelligence + Artificial_intelligence -> Regression_analysis + Artificial_intelligence -> Neural_networks + Artificial_intelligence -> Gene_expression_programming + Artificial_intelligence -> Decision_tree_learning + Artificial_intelligence -> List_of_machine_learning_algorithms + Data_mining -> Machine_learning + Automatic_summarization -> Machine_learning + Cluster_analysis -> Machine_learning + Concept_drift -> Machine_learning + Data_stream_mining -> Machine_learning + Decision_tree_learning -> Machine_learning + Document_classification -> Machine_learning + ECML_PKDD -> Machine_learning + Feature_vector -> Machine_learning + Formal_concept_analysis -> Machine_learning + Gene_expression_programming -> Machine_learning + Multifactor_dimensionality_reduction -> Machine_learning + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Machine_learning + Receiver_operating_characteristic -> Machine_learning + Text_mining -> Machine_learning + Analytics -> Machine_learning + Information_extraction -> Machine_learning + Data_analysis -> Machine_learning + Computer_science -> Machine_learning + Artificial_intelligence -> Machine_learning + Data_pre_processing -> Machine_learning + Predictive_analytics -> Machine_learning + Regression_analysis -> Machine_learning + Neural_networks -> Machine_learning + 
Support_vector_machines -> Machine_learning + European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases -> Machine_learning + Machine_learning -> List_of_machine_learning_algorithms + Machine_learning -> Decision_tree_learning + Machine_learning -> Regression_analysis + Machine_learning -> Data_analysis + Machine_learning -> Sequence_mining + Data_mining -> Statistics + Cluster_analysis -> Statistics + Concept_mining -> Statistics + Decision_tree_learning -> Statistics + Feature_vector -> Statistics + Text_mining -> Statistics + Analytics -> Statistics + Data_analysis -> Statistics + Computer_science -> Statistics + Data_set -> Statistics + Machine_learning -> Statistics + Statistical_inference -> Statistics + Data_visualization -> Statistics + Data_collection -> Statistics + Predictive_analytics -> Statistics + Data -> Statistics + Regression_analysis -> Statistics + SEMMA -> Statistics + Statistics -> Data_collection + Statistics -> Statistical_model + Statistics -> Regression_analysis + Statistics -> Neural_networks + Data_mining -> Database_system + Database_system -> SIGMOD + Database_system -> IEEE + Data_mining -> Statistical_inference + Machine_learning -> Statistical_inference + Statistics -> Statistical_inference + Statistical_model -> Statistical_inference + Data_collection -> Statistical_inference + Data -> Statistical_inference + Regression_analysis -> Statistical_inference + Statistical_inference -> Statistical_model + Statistical_inference -> Regression_analysis + Statistical_inference -> Data_collection + Data_mining -> Computational_complexity_theory + Computer_science -> Computational_complexity_theory + Artificial_intelligence -> Computational_complexity_theory + Statistical_inference -> Computational_complexity_theory + Data_mining -> Data_visualization + Analytics -> Data_visualization + Data_analysis -> Data_visualization + Business_intelligence -> Data_visualization + SEMMA -> Data_visualization + Data_visualization -> Data_analysis + Data_visualization -> Data_management + Data_mining -> Data_warehouse + SPSS_Modeler -> Data_warehouse + Database_system -> Data_warehouse + Data_visualization -> Data_warehouse + Decision_support_system -> Data_warehouse + Business_intelligence -> Data_warehouse + Data -> Data_warehouse + Data_warehouse -> Data_analysis + Data_mining -> Decision_support_system + Database_system -> Decision_support_system + Data_warehouse -> Decision_support_system + Business_intelligence -> Decision_support_system + Data_mining -> Business_intelligence + Data_classification_business_intelligence_ -> Business_intelligence + SPSS_Modeler -> Business_intelligence + Text_mining -> Business_intelligence + Analytics -> Business_intelligence + Data_analysis -> Business_intelligence + Database_system -> Business_intelligence + Data_management -> Business_intelligence + Data_visualization -> Business_intelligence + Data_warehouse -> Business_intelligence + Decision_support_system -> Business_intelligence + SEMMA -> Business_intelligence + Data_mining -> Cluster_analysis + Affinity_analysis -> Cluster_analysis + Anomaly_detection -> Cluster_analysis + Elastic_map -> Cluster_analysis + Formal_concept_analysis -> Cluster_analysis + Nearest_neighbor_search -> Cluster_analysis + Optimal_matching -> Cluster_analysis + Machine_learning -> Cluster_analysis + Statistics -> Cluster_analysis + Statistical_inference -> Cluster_analysis + Data_collection -> Cluster_analysis + Data -> Cluster_analysis + Regression_analysis -> 
Cluster_analysis + Genetic_algorithms -> Cluster_analysis + Cluster_analysis -> Data_analysis + Cluster_analysis -> Anomaly_detection + Data_mining -> Association_rule_mining + Association_rule_mining -> Sequence_mining + Association_rule_mining -> Lift_data_mining_ + Association_rule_mining -> Contrast_set_learning + Data_mining -> Spatial_index + Nearest_neighbor_search -> Spatial_index + Data_mining -> Predictive_analytics + Accuracy_paradox -> Predictive_analytics + Concept_drift -> Predictive_analytics + Data_dredging -> Predictive_analytics + Gene_expression_programming -> Predictive_analytics + SPSS_Modeler -> Predictive_analytics + Text_mining -> Predictive_analytics + Analytics -> Predictive_analytics + Data_analysis -> Predictive_analytics + Machine_learning -> Predictive_analytics + Decision_support_system -> Predictive_analytics + Business_intelligence -> Predictive_analytics + Neural_networks -> Predictive_analytics + Support_vector_machines -> Predictive_analytics + Predictive_analytics -> Regression_analysis + Predictive_analytics -> Decision_tree_learning + Predictive_analytics -> Neural_networks + Data_mining -> Data + K_optimal_pattern_discovery -> Data + Data_analysis -> Data + Data_set -> Data + Statistics -> Data + Database_system -> Data + Data_management -> Data + Data_visualization -> Data + Regression_analysis -> Data + Data -> Data_analysis + Data -> Data_management + Data -> Data_collection + Data -> Regression_analysis + Data_mining -> Genetic_algorithms + Gene_expression_programming -> Genetic_algorithms + Artificial_intelligence -> Genetic_algorithms + Machine_learning -> Genetic_algorithms + Genetic_algorithms -> Gene_expression_programming + Data_mining -> Support_vector_machines + Document_classification -> Support_vector_machines + Machine_learning -> Support_vector_machines + Support_vector_machines -> Regression_analysis + Support_vector_machines -> Decision_tree_learning + Data_mining -> Association_for_Computing_Machinery + Cluster_analysis -> Association_for_Computing_Machinery + Conference_on_Knowledge_Discovery_and_Data_Mining -> Association_for_Computing_Machinery + SIGKDD -> Association_for_Computing_Machinery + Computer_science -> Association_for_Computing_Machinery + Database_system -> Association_for_Computing_Machinery + Support_vector_machines -> Association_for_Computing_Machinery + CIKM_Conference -> Association_for_Computing_Machinery + Conference_on_Information_and_Knowledge_Management -> Association_for_Computing_Machinery + KDD_Conference -> Association_for_Computing_Machinery + SIGMOD -> Association_for_Computing_Machinery + Association_for_Computing_Machinery -> IEEE + Association_for_Computing_Machinery -> SIGMOD + Association_for_Computing_Machinery -> Conference_on_Information_and_Knowledge_Management + Data_mining -> SIGKDD + Cluster_analysis -> SIGKDD + Conference_on_Knowledge_Discovery_and_Data_Mining -> SIGKDD + Association_for_Computing_Machinery -> SIGKDD + KDD_Conference -> SIGKDD + International_Conference_on_Very_Large_Data_Bases -> SIGKDD + Data_mining -> Academic_journal + Conference_on_Knowledge_Discovery_and_Data_Mining -> Academic_journal + SIGKDD -> Academic_journal + Database_system -> Academic_journal + KDD_Conference -> Academic_journal + Data_mining -> European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases + Data_mining -> KDD_Conference + Data_mining -> Conference_on_Knowledge_Discovery_and_Data_Mining + Data_mining -> 
Society_for_Industrial_and_Applied_Mathematics + Data_mining -> Cross_Industry_Standard_Process_for_Data_Mining + SPSS_Modeler -> Cross_Industry_Standard_Process_for_Data_Mining + SEMMA -> Cross_Industry_Standard_Process_for_Data_Mining + Data_mining -> SEMMA + Cross_Industry_Standard_Process_for_Data_Mining -> SEMMA + Data_mining -> Association_rule_learning + Affinity_analysis -> Association_rule_learning + Anomaly_detection -> Association_rule_learning + Apriori_algorithm -> Association_rule_learning + Contrast_set_learning -> Association_rule_learning + FSA_Red_Algorithm -> Association_rule_learning + K_optimal_pattern_discovery -> Association_rule_learning + Lift_data_mining_ -> Association_rule_learning + Sequence_mining -> Association_rule_learning + Machine_learning -> Association_rule_learning + Association_rule_mining -> Association_rule_learning + Association_rule_learning -> Sequence_mining + Association_rule_learning -> Lift_data_mining_ + Association_rule_learning -> Contrast_set_learning + Data_mining -> Automatic_summarization + ROUGE_metric_ -> Automatic_summarization + Data_mining -> Receiver_operating_characteristic + Accuracy_paradox -> Receiver_operating_characteristic + Gene_expression_programming -> Receiver_operating_characteristic + Lift_data_mining_ -> Receiver_operating_characteristic + Data_mining -> SPSS_Modeler + Machine_learning -> SPSS_Modeler + Predictive_analytics -> SPSS_Modeler + Cross_Industry_Standard_Process_for_Data_Mining -> SPSS_Modeler + SPSS_Modeler -> Anomaly_detection + SPSS_Modeler -> Regression_analysis + Data_mining -> Text_mining + Automatic_summarization -> Text_mining + Biomedical_text_mining -> Text_mining + Co_occurrence_networks -> Text_mining + Concept_mining -> Text_mining + Document_classification -> Text_mining + Formal_concept_analysis -> Text_mining + Structure_mining -> Text_mining + Analytics -> Text_mining + Artificial_intelligence -> Text_mining + Business_intelligence -> Text_mining + Text_mining -> Concept_mining + Text_mining -> Sequence_mining + Data_mining -> Web_mining + Document_classification -> Web_mining + Text_mining -> Web_mining + Data_mining -> Profiling_practices + Affinity_analysis -> Data_analysis + Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning -> Anomaly_Detection_at_Multiple_Scales + Anomaly_Detection_at_Multiple_Scales -> Proactive_Discovery_of_Insider_Threats_Using_Graph_Analysis_and_Learning + Association_rule_learning -> Apriori_algorithm + FSA_Red_Algorithm -> Apriori_algorithm + GSP_Algorithm -> Apriori_algorithm + List_of_machine_learning_algorithms -> Apriori_algorithm + Sequence_mining -> Apriori_algorithm + SPSS_Modeler -> Apriori_algorithm + Association_rule_mining -> Apriori_algorithm + Apriori_algorithm -> FSA_Red_Algorithm + Association_rule_learning -> K_optimal_pattern_discovery + Association_rule_mining -> K_optimal_pattern_discovery + Text_mining -> Biomedical_text_mining + Biomedical_text_mining -> Co_occurrence_networks + Cluster_analysis -> Nearest_neighbor_search + Data_analysis -> Nearest_neighbor_search + Data_stream_mining -> Concept_drift + Concept_drift -> Data_stream_mining + Data_stream_mining -> Sequence_mining + Concept_mining -> Formal_concept_analysis + Formal_concept_analysis -> Concept_mining + Document_classification -> Decision_tree_learning + Document_classification -> Concept_mining + Machine_learning -> ECML_PKDD + European_Conference_on_Machine_Learning_and_Principles_and_Practice_of_Knowledge_Discovery_in_Databases -> ECML_PKDD + 
Software_mining -> Mining_Software_Repositories +} diff --git a/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.png b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.png new file mode 100644 index 00000000..7eaaebe8 Binary files /dev/null and b/ss2013/1_Web Mining/Uebungen/5_Uebung/abgabe/wikigraph/wikigraph_rootbase.png differ
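Note on reproducing the scores above: the hub, authority, and PageRank values logged in the result files can be re-derived from the edge list in wikigraph.dot. The following is a minimal, self-contained Python sketch, not the submission's own code; the file name wikigraph.dot in the working directory, the damping factor 0.85, the fixed iteration counts, and the L2 normalization of the HITS vectors are assumptions, so the printed numbers will only approximate the values shown above (which also use a German decimal-comma locale).

    # Sketch: recompute HITS and PageRank from the wikigraph.dot edge list.
    # Assumptions: wikigraph.dot is in the working directory and contains only
    # simple "A -> B" edge lines; normalization and stopping criteria are chosen
    # freely here, so results will differ slightly from the submission's output.
    import re
    from collections import defaultdict

    def read_edges(path="wikigraph.dot"):            # path is an assumption
        edge_re = re.compile(r"^\s*(\w+)\s*->\s*(\w+)\s*$")
        edges = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                m = edge_re.match(line)
                if m:
                    edges.append((m.group(1), m.group(2)))
        return edges

    def hits(edges, iterations=100):
        out_links, in_links, nodes = defaultdict(list), defaultdict(list), set()
        for u, v in edges:
            out_links[u].append(v)
            in_links[v].append(u)
            nodes.update((u, v))
        hub = {n: 1.0 for n in nodes}
        auth = {n: 1.0 for n in nodes}
        for _ in range(iterations):
            # authority from incoming hubs, then hub from updated authorities
            auth = {n: sum(hub[u] for u in in_links[n]) for n in nodes}
            hub = {n: sum(auth[v] for v in out_links[n]) for n in nodes}
            a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
            h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
            auth = {n: x / a_norm for n, x in auth.items()}
            hub = {n: x / h_norm for n, x in hub.items()}
        return hub, auth

    def pagerank(edges, d=0.85, iterations=100):     # d = 0.85 is an assumption
        out_links, nodes = defaultdict(list), set()
        for u, v in edges:
            out_links[u].append(v)
            nodes.update((u, v))
        n = len(nodes)
        pr = {v: 1.0 / n for v in nodes}
        for _ in range(iterations):
            nxt = {v: (1.0 - d) / n for v in nodes}
            for u in nodes:
                targets = out_links[u]
                if targets:
                    share = d * pr[u] / len(targets)
                    for v in targets:
                        nxt[v] += share
                else:                                # dangling node: spread uniformly
                    for v in nodes:
                        nxt[v] += d * pr[u] / n
            pr = nxt
        return pr

    if __name__ == "__main__":
        edges = read_edges()
        hub, auth = hits(edges)
        pr = pagerank(edges)
        for name, scores in (("hub", hub), ("authority", auth), ("PageRank", pr)):
            top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
            print(name, top)

The PNG renderings committed next to the .dot files can be regenerated with standard Graphviz, e.g. dot -Tpng wikigraph.dot -o wikigraph.png (output file name chosen here for illustration).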