The Association Rules Algorithm Based on Clustering in Mining Research in Corn Yield

Abstract: With the spread of agricultural information technology, using data mining to analyze how the type and content of soil nutrients affect corn yield has become a hot topic in agriculture. Association rule mining is an important part of data mining and can reveal associations among the attributes of agricultural data. This article uses cluster analysis and association rules to analyze the correlation between corn yield and soil nutrients. First, different clustering algorithms are compared to choose the optimal one, the collected data are classified on that basis, and the data are divided into levels according to expert knowledge. Then association rules are used to determine the relationship between the type and content of soil nutrients and corn yield. Finally, the correctness of the algorithm is checked. The results show that: among K-means, hierarchical clustering and PAM, K-means is the optimal clustering algorithm; the value of K can be chosen within a selected interval, and for K from 3 to 10 the Sil value indicates good clustering when K equals 3, 4 or 6; following the principle of association rules, K = 6 is selected for combination with the association rules algorithm; with clustering applied before association rule mining, support, confidence, lift and accuracy are all better than without clustering; and the association rules mined after clustering show that different levels of soil nutrients have a great influence on corn yield. The results provide data for intelligent decision support on corn yield.


Introduction
Corn is one of the world's most widely planted and most productive grain crops, characterized by fast growth and high yield [1]. Moreover, with the development of animal husbandry and the corn deep-processing industry, corn has become one of the world's most important food, forage and cash crops [2]. With the support of the national 863 technology program, Shanghai, Beijing, Heilongjiang, Xinjiang, Jilin and other places have carried out trials of intelligent agriculture and established a number of precision agriculture experiment and demonstration areas, with gratifying results. China is a large agricultural country, and corn yield has a very significant effect on China's grain output. Soil nutrients are one of the main factors affecting corn yield, so it is particularly important to uncover the relationship between the type and content of soil nutrients and corn yield. However, data mining research in China started late and lacks overall strength; a large gap with developed countries remains, which has seriously hampered the development of intelligent agriculture in China.
As data emerge ever faster and the speed of data collection improves significantly, new technologies and tools are urgently needed to turn vast amounts of data into usable information and knowledge. Data mining [3] searches and analyzes large amounts of known data to discover hidden relationships in them in order to predict the future. In the last decade, data mining has developed rapidly as a means of decision support [4]. Important directions in data mining include cluster analysis [5] and association rule mining [6]. This paper first compares different clustering methods to select the optimal clustering algorithm and then combines it with an association rules algorithm to uncover the relationship between the type and content of soil nutrients and corn yield. Applying data mining in agriculture is an effective means of steering agricultural production toward sustained high yield, and it is important for protecting the interests of farmers and national food security.

Clustering and association rules
Clustering simplifies data through data modeling: it uses similarity to divide data into different classes, so that objects in the same class are highly similar and objects in different classes are highly dissimilar. Association rules are used to find links between things. First, the concept and characteristics of each clustering algorithm are analyzed and the optimal clustering algorithm is selected based on the Sil value. Then the data are divided into different numbers of clusters by that algorithm, and the results are compared to choose the most appropriate number of clusters. Finally, association rules are used to determine the relationship between the type and content of soil nutrients and corn yield.

Select Clustering Algorithm
Different algorithms have different characteristics and suit different data. For example, the K-means algorithm [7] clusters well when the clusters are dense and clearly separated from one another, but it is sensitive to noise and isolated data points. The specific steps are as follows: 1) assign each instance to its nearest cluster center, giving K clusters; 2) compute the mean of all instances in each cluster and take it as the new cluster center. Steps 1) and 2) are repeated until the positions of the K cluster centers, and hence the cluster assignments, no longer change.
Hierarchical clustering [8] does not require the number of classes to be given in advance, and the result can be displayed as a clustering tree, which suits hierarchical data; however, it requires computing a distance matrix. The specific steps are as follows: 1) treat each object as its own class, giving N classes that each contain one object; the distance between two classes is the distance between the objects they contain. 2) Find the two closest classes and merge them into one, so the total number of classes decreases by one. 3) Recompute the distances between the new class and all the old classes. 4) Repeat 2) and 3) until everything is merged into a single class.
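The two K-means steps described above can be illustrated with a minimal Python sketch. It is not the MATLAB code used in this study. The function names `squared_distance` and `k_means`, the use of the first K points as initial centers, and the sample points are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    # Assumption: the first k points serve as the initial centers;
    # the text does not say how the centers are initialized.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every instance to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are fixed, so centers are too
            break
        labels = new_labels
        # Step 2: move each center to the mean of the instances assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Hypothetical two-dimensional samples forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

Because the result depends on the initial centers, practical implementations usually run several random initializations and keep the best result.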
The PAM algorithm [9] is less sensitive to noise and is not affected by the order of the input data. Its drawback is the high computational cost of determining the cluster centers, which makes clustering slow on large data sets. The specific steps are as follows: 1) choose K objects as the initial cluster centers. 2) For every sample other than the cluster centers, compute the distance to each cluster center and assign the sample to the nearest center. 3) Within each class, compute for the sample points other than the center the sum of distances to all other points in the class, and take the point with the minimum sum as the new cluster center. 4) Repeat 2) and 3) until the cluster centers are the same in two successive iterations.
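The medoid-update steps described above can be sketched in Python as follows. This is a simplified alternating variant written for illustration, not the full swap-based PAM procedure of [9]. The names `euclidean`, `assign` and `k_medoids`, the choice of the first K objects as initial centers, and keeping the current center among the candidates in step 3 are assumptions.

```python
def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def assign(points, medoids):
    """Label each point with the position (in `medoids`) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: euclidean(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points, k, max_iter=100):
    # Step 1 (assumption): take the first k objects as the initial centers.
    medoids = list(range(k))  # medoids are stored as indices into `points`
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest cluster center.
        labels = assign(points, medoids)
        # Step 3: in each class, the point with the smallest total distance
        # to the other members becomes the new center. The current center
        # stays a candidate (assumption) so that the loop can converge.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # an empty class keeps its previous center
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(euclidean(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        # Step 4: stop once two successive iterations give the same centers.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)
```

Unlike K-means, the centers here are always actual samples, which is why a single outlier distorts them less than it distorts a mean.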
When the structure and background of the data are well understood, one can tell which algorithms will work relatively well. In many cases, however, little is known about the data, so an objective criterion is needed for evaluation [10]. A common approach is to evaluate clustering by within-class compactness and between-class separation, and one of the most classic such measures is the Silhouette index. The Silhouette index can be used both to estimate the optimal number of clusters and to evaluate clustering quality [11]. Therefore, this paper uses the Silhouette index as the objective criterion for selecting the clustering algorithm and determining the number of clusters.
Suppose a data set of n samples is divided into K clusters. For a sample i, let a(i) be the average dissimilarity (distance) between i and all other samples in its own cluster, and let d(i, C_j) be the average dissimilarity (distance) between i and all samples of another cluster C_j, with b(i) = min_j d(i, C_j) taken over the clusters that do not contain i. The Silhouette index of sample i is then calculated as follows:

$$\mathrm{Sil}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1)$$

The average Sil value of all samples in a cluster reflects the compactness and separability of that cluster. The average Sil value of all samples in the data set reflects the quality of the clustering result: the greater the Sil value, the better the clustering. The main steps of the Silhouette-based method for selecting a clustering algorithm are as follows: 1) specify the data set and the candidate clustering algorithms; 2) specify the number of output classes K and cluster the data set with each candidate algorithm; 3) compute the average Sil value of the clustering result of each candidate algorithm; 4) compare the average Sil values and select the candidate algorithm with the largest average Sil value as the optimal algorithm.
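As a hedged illustration of the Silhouette index, the following Python sketch computes the per-sample value from the within-cluster average distance a and the smallest average distance b to another cluster, and then averages over the data set. The function names and the convention of assigning 0 to a sample that is alone in its cluster are assumptions, and Euclidean distance is assumed as the dissimilarity.

```python
def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def silhouette_values(points, labels):
    """Per-sample Silhouette values for a given cluster labelling."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        return sum(euclidean(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        others = [m for lab, m in clusters.items() if lab != label]
        if not own or not others:
            # Assumed convention: singleton clusters (or a single cluster) score 0.
            values.append(0.0)
            continue
        a = mean_distance(i, own)                      # within-cluster compactness
        b = min(mean_distance(i, m) for m in others)   # separation from nearest cluster
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def mean_silhouette(points, labels):
    """Average Silhouette value of the whole data set."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)
```

To compare candidate algorithms or candidate values of K, one would compute `mean_silhouette` for each resulting labelling and keep the one with the largest value.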
This paper uses experimental data from 2005 to 2010 from Nong'an, Jilin, the demonstration base of the national "863" program project "Maize Precise Operation System and Application", and performs the cluster analysis in MATLAB R2014a. When K equals 3, K-means has the highest average Sil value; when K equals 4, K-means again has the highest average Sil value. K-means is therefore taken as the optimal clustering algorithm. The results are shown in Figure 1.

Clustering algorithm K value selection
The value of K strongly affects the clustering result, so K is traversed from 3 to 10 and the corresponding Sil value is computed for each, as shown in Figure 2. Clustering is better when K equals 3, 4 or 6. Because frequent itemsets are the core of the association rules algorithm, this paper selects the K-means algorithm with K = 6 and combines it with the association rules algorithm.

Association Rules Algorithm
In 1993, R. Agrawal et al. designed the Apriori algorithm [12], the most influential algorithm of its kind, which laid the foundation for association rules algorithms. The algorithm has two steps. The first step mines frequent itemsets: itemsets whose support is greater than the minimum support specified by the user are selected as frequent itemsets. The second step generates strong association rules from the frequent itemsets: rules whose support and confidence are greater than or equal to the user-specified minimum support and confidence [13].
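The two steps of Apriori can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study. The names `frequent_itemsets` and `strong_rules` are assumptions, and so is the use of `>=` for both thresholds. Records are assumed to be sets of items.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of records that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets meeting the minimum support."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        survivors = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:  # assumption: threshold comparison uses >=
                survivors[candidate] = s
        frequent.update(survivors)
        size += 1
        # Join: unions of two frequent itemsets that have exactly `size` items.
        joined = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune: every subset with one item fewer must itself be frequent.
        current = [
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        ]
    return frequent


def strong_rules(frequent, min_confidence):
    """Step 2: rules (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules
```

The prune step relies on the fact that every subset of a frequent itemset is itself frequent, which is also why `frequent[antecedent]` is always available in the second step.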
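Support, abbreviated as sup, refers to the number of records that contain both the antecedent and the consequent of a rule as a percentage of the total number of records. The formula is as follows:

$$\mathrm{sup}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|D|} \times 100\% \qquad (2)$$

where X and Y are the antecedent and consequent of the rule and D is the set of all records.

Figure 3. Clustering results and screening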

Dividing Data into Levels
Following expert experience, the data are divided into levels [15]. N, P and K are each divided into six levels, from A to F, and yield is divided into three levels, from A to C. Based on the same expert knowledge, the four frequent itemsets just obtained by clustering are also divided into levels, as shown in Figure 4.
Figure 4. Data divided into levels

Relationship between the various soil nutrients and Yield
The effects of alkali-hydrolyzable N, P and K on yield are observed separately [16]. The results are shown in Figure 5.
Figure 5. Relationship between soil nutrients and yield without clustering and after clustering

Data Association
Abnormal events and low-probability events are ignored. The selection criteria are: support between 15% and 100%, lift greater than 1.0, and, among the yield-related association rules that remain, the five with the highest lift, as shown in Figure 6.
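The screening criteria above can be sketched in Python as follows. Lift is taken in its standard form, the confidence of a rule divided by the support of its consequent. The function names, the dictionary layout and the `"Yield="` item prefix used to recognize yield-related rules are assumptions made for illustration.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent => consequent."""
    n = len(transactions)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    count_ante = sum(1 for t in transactions if antecedent <= t)
    count_cons = sum(1 for t in transactions if consequent <= t)
    confidence = count_both / count_ante if count_ante else 0.0
    lift = confidence / (count_cons / n) if count_cons else 0.0
    return {
        "antecedent": antecedent,
        "consequent": consequent,
        "support": count_both / n,
        "confidence": confidence,
        "lift": lift,
    }


def screen_rules(rules, min_support=0.15, min_lift=1.0, top_n=5,
                 target_prefix="Yield="):
    """Keep yield-related rules with enough support and lift > min_lift,
    then return the top_n rules by lift."""
    kept = [
        r for r in rules
        if min_support <= r["support"] <= 1.0
        and r["lift"] > min_lift
        # Hypothetical naming: yield items are strings such as "Yield=A".
        and any(item.startswith(target_prefix) for item in r["consequent"])
    ]
    return sorted(kept, key=lambda r: r["lift"], reverse=True)[:top_n]
```

A lift above 1.0 means the antecedent makes the consequent more likely than it is on average, which is why rules at or below 1.0 are discarded.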

Examining the Validity of the Algorithm
To test the accuracy of the algorithm, 10-fold cross-validation [18] is performed with the J48 algorithm [17].
The data set is first divided into ten parts. In turn, nine parts are used as training data and the remaining part as test data. Each run yields an accuracy, and the mean of the ten accuracies is taken as the estimate of the method's accuracy, as shown in Figure 7.
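The 10-fold procedure described above can be sketched in Python as follows. The majority-class classifier is only a placeholder standing in for J48, which is not reimplemented here. The function names and the interleaved fold assignment are assumptions, and the sketch assumes at least as many samples as folds.

```python
from collections import Counter


def k_fold_indices(n_samples, n_folds=10):
    """Split sample indices into n_folds parts (fold f holds f, f + n_folds, ...)."""
    return [list(range(f, n_samples, n_folds)) for f in range(n_folds)]


def cross_validate(samples, labels, train_fn, predict_fn, n_folds=10):
    """Mean accuracy over n_folds runs; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n_folds


# Placeholder classifier (NOT J48): always predicts the most common training label.
def train_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model
```

In practice the samples are usually shuffled or stratified before splitting so that each fold has a similar class distribution; the fixed interleaving here only keeps the sketch deterministic.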

Relationship between the various soil nutrients when yield is equal to A
Only the records whose yield is at level A are considered. The relationship among N, P and K for these records is shown in Figure 8.

Conclusion and Analysis
From the above, we can see:
1. The choice of clustering algorithm and of K has a great influence on the clustering results. Comparing K-means, hierarchical clustering and PAM shows that K-means is the optimal clustering algorithm, as shown in Figure 1.
2. The value of K can be determined from the Sil value. For K in the range from 3 to 10, K equal to 3, 4 or 6 gives good clustering results, as shown in Figure 2.
3. The accuracy of the association rules algorithm after clustering is better than without clustering, as shown in Figure 7, which shows that preprocessing the data with cluster analysis before applying association rules is necessary. The association rule diagrams also show that support, confidence and lift are better after clustering than without it, so cluster analysis before association rule mining is significant for practical decision-making.
4. Different levels of soil nutrients have a strong influence on corn yield. When soil N is at level C, P at level C and K at level E, the proportion of grade-A yield is larger.

Discussion and Outlook
In recent years, as our ability to acquire data and access information has grown [19], databases have kept expanding [20]. A growing number of activities in agricultural production, scientific research and business management rely on databases, and this trend will continue. In this era of information explosion, information overload is a problem everyone has to face. How to find useful, hidden knowledge in a sea of information has become a top priority. Only by making full use of data resources and mining the latent, useful knowledge in them can data support decision-making and prediction. Otherwise, a large amount of data may become a waste, or even a burden [21]. Data mining therefore offers strong guidance for future development; it is a hot research field worldwide, with broad application prospects and great practical significance.

Figure 1. Comparison of Sil values for each algorithm

Figure 2. Comparison of Sil values for different K values

Figure 6. Comparison of association rules without clustering and after clustering

Figure 7. Comparison of accuracy without clustering and after clustering

Figure 8. Relationship among soil nutrients when yield is A, without clustering and after clustering