Weighting Features Before Applying Machine Learning Methods to Pulsar Search

In recent years, various artificial intelligence methods have been applied to pulsar search, such as the artificial neural network method, the PEACE sorting algorithm and the real-time classification method. In this paper, a feature-weighting method applied before machine learning (ML) is proposed. We assign a weight to each feature according to its ability to distinguish pulsar from non-pulsar candidates. This ability is determined by the separation degree between the distributions of pulsars and non-pulsars on that feature. ML methods are then used to classify the different types of candidates. The results show that the method is effective: both the accuracy of identifying pulsars and the modeling time improved after weighting.


Introduction
A pulsar is a rapidly rotating neutron star that periodically emits pulsed signals with a short and very stable period. Pulsars play an important part in physics, astronomy and many other fields. In recent years, AI methods such as image pattern recognition [2], artificial neural networks and scheduling algorithms have been used in pulsar search. Lee et al. (2013) proposed the PEACE sorting algorithm for pulsar search and obtained good results. Lyon et al. (2016) used the GH-VFDT (Gaussian-Hellinger Very Fast Decision Tree) to classify candidates, with a pulsar recognition rate of over 90% [3].
While GH-VFDT obtained a high pulsar recognition rate, it does not reflect the differences in how well individual features distinguish pulsars from non-pulsars. Thus, in this paper, we assign different weights to the eight features before the machine learning process, according to their separation degree. Results show that weighting improves both classification accuracy and modeling time.
The rest of this paper is structured as follows: related work is reviewed in section 2; the feature weighting method is proposed in section 3; the corresponding experiments are presented and analyzed in sections 4 and 5; and the conclusion closes the paper.

Related Work

Feature
When searching for pulsar signals with a radio telescope, the raw observational data are obtained first. These data then undergo interference removal, de-dispersion and FFT [4] [5]. Pulsar Feature Lab and Presto [6] are used to process the raw data into the eight features used in this paper.

Dataset
Three separate datasets were used to measure the performance of ML methods on pulsar search. The small-scale dataset, LOTAAS, was obtained during the LOTAAS survey and is currently private. The medium-scale dataset, HTRU2, was obtained during an analysis of HTRU Medium Latitude data by Thornton (2013). The large-scale dataset, HTRU1, was produced by Morello et al. (2014). Detailed information on the three datasets is summarized in Table 1.

Feature Weighting
When a pulsar candidate is classified by its features, a feature with a high degree of separation between pulsars and non-pulsars should weigh more than the other features. Therefore, this paper assigns different weights to the eight features according to their separation degree between the different types of candidates. For a specific feature, the separation degree Ab is defined from the quantities shown in Figure 2: l is the coincident (overlapping) area of the pulsar and non-pulsar distributions, Rp is the width of the distribution of pulsars on the feature, and RI is the width of the distribution of non-pulsars.
The distribution of a feature across candidates can be considered a normal distribution. According to the 3σ principle, the feature values of almost all candidates then fall within the range of the feature box. By analyzing the data from LOTAAS, HTRU2 and HTRU1, this paper obtains the weight Wi (i = 1, ..., 8) of each feature.
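The idea of scoring a feature by how little the pulsar and non-pulsar ranges coincide can be sketched in Python. The exact formula for Ab is not reproduced here, so the score below, 1 - l / (Rp + RI - l), is a hypothetical stand-in built from the same three quantities, with each range taken as mean ± 3σ. The function names and the normalization of scores into weights are also assumptions made for illustration.

```python
import statistics


def three_sigma_range(values):
    """Return (low, high) = mean -/+ 3 * population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma


def separation_score(pulsar_values, non_pulsar_values):
    """Hypothetical separation score in [0, 1]; NOT the paper's Ab formula.

    Uses Rp and RI (widths of the two 3-sigma ranges) and l (length of
    their overlap). 0 means the ranges coincide, 1 means they are disjoint.
    """
    p_lo, p_hi = three_sigma_range(pulsar_values)
    n_lo, n_hi = three_sigma_range(non_pulsar_values)
    r_p = p_hi - p_lo
    r_i = n_hi - n_lo
    overlap = max(0.0, min(p_hi, n_hi) - max(p_lo, n_lo))  # l
    combined = r_p + r_i - overlap
    if combined == 0:
        # Both ranges are single points: separated unless they coincide.
        return 0.0 if p_lo == n_lo else 1.0
    return 1.0 - overlap / combined


def feature_weights(scores):
    """Normalize separation scores so that the weights sum to 1 (assumed)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]
```

With this stand-in, a feature whose two ranges do not overlap at all receives the maximum score, and a feature on which both classes occupy the same range receives zero weight.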

Experiments
In this part, each feature is weighted before ML methods are applied to the datasets. Classification accuracy and modeling time are both used as criteria to judge the performance of the methods. Weighting is considered useful if it improves either the accuracy or the modeling time of a method, with accuracy taking priority over modeling time. In summary, for the five ML methods SMO, IBK, JRIP, J48 and RandomForest, weighting improves either the accuracy or the modeling time; in the worst cases, the results with weighting are the same as without it.
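One possible reading of this criterion is a two-level comparison: accuracy decides first, and modeling time is consulted only when the accuracies tie. The sketch below encodes that reading. The function name, the return labels and the tolerance `tol` are illustrative assumptions, not part of the original experiments.

```python
def compare_weighting(acc_before, acc_after, time_before, time_after, tol=1e-9):
    """Classify the effect of weighting as 'improved', 'same' or 'worse'.

    Accuracy takes priority; modeling time (lower is better) breaks ties.
    `tol` is an assumed tolerance for treating two values as equal.
    """
    if acc_after > acc_before + tol:
        return "improved"
    if acc_after < acc_before - tol:
        return "worse"
    # Accuracies are equal within tolerance: fall back to modeling time.
    if time_after < time_before - tol:
        return "improved"
    if time_after > time_before + tol:
        return "worse"
    return "same"
```

Under this reading, a method whose accuracy rises counts as improved even if its modeling time grows, which matches the statement that accuracy goes before modeling time.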

Discussions
This part explains why SMO, IBK, JRIP, J48 and RandomForest, rather than other ML methods, were selected to test the effects of weighting. In this work, we experimented with various ML methods using WEKA.

Conclusion
Because of their stable periods, pulsars play a very important part in physics, astronomy and many other fields. Traditional pulsar search is manual. In recent years, artificial intelligence has been widely used in various fields with great success, and AI methods have gradually been applied to pulsar search. This paper is based on the principled real-time classification approach. Eight features are used to describe a pulsar candidate.
Before applying ML methods to the datasets, this paper weights each feature according to its separation degree, and finds that either the accuracy or the modeling time is improved after weighting.

Fig. 2. Distribution of pulsars and non-pulsars and their coincident area.
After periodic search, a pulsar candidate is generated, which has some basic features. Lyon et al. (2016) used eight new features to describe a pulsar candidate: the mean (Prof_μ), standard deviation (Prof_σ), excess kurtosis (Prof_k) and skewness (Prof_s) of the integrated profile, and the mean (DM_μ), standard deviation (DM_σ), excess kurtosis (DM_k) and skewness (DM_s) of the DM-SNR curve.
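The four statistics taken from each curve are standard moments, so a small sketch may help make them concrete. It assumes population (biased) moments and a plain list of sample values. Whether Pulsar Feature Lab uses exactly these conventions is not stated here, and the function name is hypothetical.

```python
import math


def profile_statistics(values):
    """Return (mean, std, excess kurtosis, skewness) of a 1-D sequence.

    Population (biased) central moments are assumed. A constant input
    yields zero skewness and zero excess kurtosis by convention here.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, std, excess_kurtosis, skewness
```

Applying the function once to the integrated profile and once to the DM-SNR curve gives the eight values that describe a candidate.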

Table 1. Three pulsar candidate datasets.
To analyze the statistical distribution of the eight features over pulsars and non-pulsars, feature data were extracted via Pulsar Feature Lab from 90,000 labelled pulsar candidates produced by Morello et al. (2014). As shown in Figure 1, the data were scaled to the interval [0, 1]. For each feature, there are two box plots: the orange-red box shows the feature distribution of known pulsars, while the light blue box describes the RFI/noise distribution.
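The scaling method is not specified beyond the target interval. A common choice, assumed here, is min-max scaling of each feature column. The helper name is hypothetical.

```python
def min_max_scale(column):
    """Scale a list of numbers linearly to [0, 1] (assumed min-max scaling).

    A constant column has no spread, so it is mapped to all zeros.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Scaling every feature to the same interval makes the box plots of different features directly comparable on one axis.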

Table 2. Accuracy and modeling time before and after weighting for LOTAAS.

Table 3. Accuracy and modeling time before and after weighting for HTRU2.
For the medium-scale dataset HTRU2, the experimental results are shown in Table 3. After weighting, the accuracy of SMO is improved and the modeling times of JRIP, J48 and RandomForest are improved, while IBK remains the same.

Table 4. Accuracy and modeling time before and after weighting for HTRU1.

Table 5. Accuracy rate of pulsar recognition of various ML methods before weighting.
In Table 5, a number in purple means that the corresponding method performs better than the others. As shown, as the scale of the datasets becomes larger, SMO, IBK, JRIP, J48 and RandomForest perform better than the other algorithms.