Public Opinion Monitoring for Proactive Crime Detection Using Named Entity Recognition

Public opinion monitoring has been well studied in sociology and informatics. Considerable amounts of crime-related information appear on social media platforms every day, but current methods for monitoring public opinion typically rely on rule matching and manual searching instead of automated processing and analysis. Extracting useful information from large volumes of social media data is, therefore, a major challenge in public opinion monitoring. This chapter describes a methodology for extracting key information from large volumes of Chinese text using named entity recognition based on the LSTM-CRF model. Since traditional named entity recognition datasets are small and contain only a few entity types, a custom crime-related corpus was created for training. The results demonstrate that the methodology can automatically extract key attributes such as person, location, organization and crime type with a precision of 87.58%, recall of 83.22% and F1 score of 85.24%.


Introduction
Public opinion monitoring (or social listening) is a promising approach for alerting law enforcement about crimes before they occur, because some crimes are planned using social media [8]. Several such cases were encountered during the protests against the Hong Kong extradition bill of 2019. The demonstrations against the bill began in March and April 2019 and escalated significantly in June 2019 [1]. A significant number of criminal activities occurred during the protests, including intimidation, beatings and looting that seriously impacted public safety and social order. Many of these activities were planned and coordinated using social media platforms and online discussion groups.
Unfortunately, discovering potential crimes is difficult because of the need to sift through large volumes of data and interpret the slang terms used by criminal entities. Current approaches for recognizing criminal activities, which employ simple rule matching or manual processing, are inefficient and error-prone.
Named entity recognition is a fundamental component of many natural language processing applications such as relation extraction, event extraction, knowledge graphs and question-answering systems. It can classify specific and useful entities into appropriate semantic classes such as persons, locations, organizations, dates and times [5].
This chapter describes a named entity recognition methodology for monitoring public opinion in Chinese language posts and extracting crime-related features. Specifically, the LSTM-CRF model, an artificial recurrent neural network [3], is employed to extract key information from a large volume of Chinese text. Since traditional named entity recognition datasets are small and contain only a few entity types, a custom crime-related corpus was created for training. Experiments reveal that the trained LSTM-CRF model was able to recognize special features that did not exist in the training dataset. The methodology automatically extracted key attributes such as person, location, organization and crime type with a precision of 87.58%, recall of 83.22% and F1 score of 85.24%.

Named Entity Recognition
Named entity recognition, also known as sequence labeling, is used to identify special entities in structured or unstructured text. Conventional named entity recognition methods fall into two categories, one based on rules or dictionaries and the other based on statistics [2].
Named entity recognition methods based on rules typically employ finite-state machines to match specific language models. However, the rule maker needs to have sufficient knowledge of the language to construct the finite-state machine. Methods based on dictionaries rely on previously-created dictionaries of persons, locations and organizations. Thus, the methods based on rules and dictionaries require large amounts of time and resources to prepare the supporting materials. Additionally, the methods have high error rates.
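A dictionary-based matcher can be sketched in a few lines; the dictionaries, entity types and sentence below are hypothetical examples, chosen to show why coverage of the dictionaries limits such methods:

```python
# Illustrative sketch of dictionary-based named entity matching.
# The dictionaries and tokens below are hypothetical examples.

def dictionary_ner(tokens, dictionaries):
    """Label each token by looking it up in per-type dictionaries."""
    labels = []
    for token in tokens:
        label = "O"  # default: not a named entity
        for entity_type, words in dictionaries.items():
            if token in words:
                label = entity_type
                break
        labels.append(label)
    return labels

dictionaries = {
    "PER": {"Alice", "Bob"},
    "LOC": {"Beijing", "Hong Kong"},
}
print(dictionary_ner(["Alice", "visited", "Beijing"], dictionaries))
# ['PER', 'O', 'LOC']
```

Any entity absent from the dictionaries (e.g., a new slang term) is labeled O, which is one source of the high error rates noted above.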
Statistics-based named entity recognition methods were developed to address the disadvantages of rule and dictionary based methods. These methods employ n-gram, hidden Markov, maximum entropy, conditional random field, support vector machine or decision tree models. All these models require training datasets. While creating the datasets is not difficult, model performance depends significantly on the quality and quantity of the datasets [15]. The proposed methodology processes large amounts of text using deep learning and transfer learning. The first step is to create the training and testing datasets. Since no public corpora containing crime-related words exist, a custom criminal corpus was created. BIO labels were added to each word and the resulting corpus was divided into a training dataset (90% of the data) and a testing dataset (10% of the data).
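The 90/10 split can be sketched as follows; the corpus contents and seed value are placeholders, not the chapter's actual data:

```python
# Minimal sketch of the 90/10 train/test split described above.
# `corpus` is a hypothetical list of BIO-labeled sentences.
import random

def split_corpus(corpus, train_ratio=0.9, seed=42):
    """Shuffle labeled sentences and split into train and test sets."""
    data = list(corpus)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

corpus = [f"sentence_{i}" for i in range(100)]
train_set, test_set = split_corpus(corpus)
print(len(train_set), len(test_set))  # 90 10
```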

LSTM-CRF Model
A long short-term memory and conditional random field (LSTM-CRF) model combines a long short-term memory (LSTM) model and a conditional random field (CRF) model. The LSTM model is a special type of recurrent neural network that processes long-term dependence better than conventional recurrent neural networks. The CRF model is effective at labeling and segmenting serialized data.
A traditional neural network has an input layer, a hidden layer and an output layer, where all the nodes in each layer are fully connected to nodes in the next layer. The output values of each layer, which are computed from the input values of the layer, are passed as input values to the next layer. Each input value is processed independently and the process has no memory. For input data that is sequential, such as a sentence, it is necessary to process the data in sequence, one element at a time. A recurrent neural network is a special type of neural network that is geared for processing sequential data. Specifically, it iterates through the data in sequence and maintains state information while processing. Figure 1 shows the structure of a recurrent neural network. The LSTM model is a special type of recurrent neural network where the neurons are replaced by memory cells, each with input gates, forget gates and output gates. This special structure makes the LSTM model better at processing long-term dependence than a recurrent neural network; also, it avoids the vanishing and exploding gradient problems [13].
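The stateful iteration that distinguishes a recurrent network from a feed-forward one can be illustrated with a toy scalar recurrence; the weights below are fixed illustrative values, not trained parameters:

```python
import math

# Toy recurrent step: the network iterates over a sequence and carries
# a hidden state forward, unlike a feed-forward network.

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # state depends on history
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0])
# Even with zero inputs at later steps, the states remain nonzero:
# the network "remembers" the first input through its hidden state.
```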
Figure 2 shows an LSTM memory cell. Note that C_{t-1} and x_t are the input values at time t, tanh is a neural network layer, h_t is the state at time t and C_t is the output value at time t.
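The gate arithmetic of a single memory cell can be sketched with scalars; the shared weight and the inputs are illustrative assumptions, and a real cell uses separate learned weight matrices per gate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Scalar sketch of one LSTM memory cell step (all weights illustrative).
# c_prev is the previous cell state C_{t-1}; x is the input x_t.
def lstm_cell(x, h_prev, c_prev, w=0.5):
    f = sigmoid(w * x + w * h_prev)    # forget gate
    i = sigmoid(w * x + w * h_prev)    # input gate
    g = math.tanh(w * x + w * h_prev)  # candidate values
    c = f * c_prev + i * g             # new cell state C_t
    o = sigmoid(w * x + w * h_prev)    # output gate
    h = o * math.tanh(c)               # new hidden state h_t
    return h, c

h, c = lstm_cell(1.0, 0.0, 0.0)
```

The forget gate f decides how much of C_{t-1} to keep, which is what lets the cell preserve information over long sequences.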
The CRF model is a statistical named entity recognition technique. A conditional random field is defined when a random variable Y, conditioned on a set of observations X (i.e., Prob(Y | X)), obeys the Markov property. In this work, X is a set of words and Y is the corresponding label. The CRF model can then be used to learn the relationship between labels. For example, when a word is labeled as B-PER, the label of the next word is strongly believed to be I-PER. Compared with a conventional labeling model, the CRF model is better at using sentence-level label (tag) information and is able to model the transition behavior of different tags. Also, with CRF, the labeling of one character considers the labels of neighboring characters to determine the final label [4,7].
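The role of the transition scores can be shown with a toy example; the scores below are hand-set for illustration, whereas a real CRF learns them from data:

```python
# Toy illustration of CRF-style transition scores between BIO tags.
# Scores are hand-set for the example; a real CRF learns them.

# transition[(prev, cur)]: score of moving from tag prev to tag cur
TRANSITION = {
    ("B-PER", "I-PER"): 2.0,   # I-PER after B-PER is likely
    ("O", "I-PER"): -2.0,      # I-PER should not follow O
}

def sequence_score(tags, emissions):
    """Sum per-tag emission scores plus tag-to-tag transition scores."""
    score = sum(emissions)
    for prev, cur in zip(tags, tags[1:]):
        score += TRANSITION.get((prev, cur), 0.0)
    return score

good = sequence_score(["B-PER", "I-PER"], [1.0, 1.0])
bad = sequence_score(["O", "I-PER"], [1.0, 1.0])
print(good, bad)  # 4.0 0.0
```

With identical emission scores, the sequence that obeys BIO structure receives the higher total score, which is exactly the sentence-level information the CRF contributes.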
The proposed LSTM-CRF model combines the LSTM and CRF models. Figure 3 shows the structural graph of the LSTM-CRF model. It comprises three layers. The first layer is the word embedding layer, which transforms each word into a corresponding vector so that the entire sentence can be represented as an embedding matrix. The second layer is the LSTM layer, which uses forward propagation and backward propagation to extract features automatically. The third layer is the CRF layer, which uses the output of the second layer to label words with maximum probability.
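The decoding performed by the CRF layer is typically the Viterbi algorithm; a minimal sketch follows, where the emission scores stand in for the LSTM layer's output and all numbers are illustrative rather than learned:

```python
# Minimal Viterbi decoder: given per-word emission scores (standing in
# for the LSTM layer's output) and tag-transition scores, pick the tag
# sequence with maximum total score. All scores are illustrative.

def viterbi(emissions, transitions, tags):
    """emissions: one {tag: score} dict per word; transitions: {(prev, cur): score}."""
    best = {t: (emissions[0][t], [t]) for t in tags}
    for emit in emissions[1:]:
        new_best = {}
        for cur in tags:
            score, path = max(
                (best[prev][0] + transitions.get((prev, cur), 0.0), best[prev][1])
                for prev in tags
            )
            new_best[cur] = (score + emit[cur], path + [cur])
        best = new_best
    return max(best.values())[1]

tags = ["O", "B-PER", "I-PER"]
transitions = {("O", "I-PER"): -10.0, ("B-PER", "I-PER"): 1.0}
emissions = [
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 2.0},
]
path = viterbi(emissions, transitions, tags)
print(path)  # ['B-PER', 'I-PER']
```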

Related Work
As mentioned above, a significant number of criminal activities occurred during the protests against the Hong Kong extradition bill of 2019, including intimidation, beatings and looting that impacted public safety and social order. Since many of these activities were planned and coordinated using social media platforms and online discussion groups, Hong Kong government authorities were interested in monitoring public opinion to identify potential crimes and work proactively to mitigate the hazards. However, very limited research has been done on applying deep learning techniques to detect potential criminal events by analyzing Chinese text in social media and online discussion groups.
Wang et al. [12] have employed machine learning for sentimental entity recognition with a precision of 89%. Kleinberg et al. [6] have developed an automated verbal deception detection system that employs the spaCy and Stanford NER tools. Motivated by these efforts, the research described in this chapter extracts information from large volumes of Chinese text using named entity recognition based on the LSTM-CRF model.

Experiments
This section discusses the experimental setup and the classification of named entities.

Experimental Setup
The corpora and LSTM-CRF model are the two key components in the experiments. The corpora comprise a normal part and a crime part. The normal part is a portion of the MSRA corpus [9] whereas the crime part, comprising three Chinese dictionaries specializing in crime, was downloaded from the Sougou platform [11].
Table 1 shows the distribution of data in the named entity recognition corpora. One corpus is the normal part, which contains four types of entities, i.e., person, location, organization and not a named entity (non-named entity). The other corpus is the crime part, which only contains the criminal entity type.
Each entry (sentence) in the corpora was processed to extract a set of tokens (Chinese characters). The BIO tagging style employed for labeling uses O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-CRM and I-CRM, where (i) O means that the word is not a named entity; (ii) B-X means that the word is the beginning of an entity of type X (e.g., B-PER means that the word is the beginning of a person entity); and (iii) I-X means that the word is inside an entity of type X. Figure 4 shows examples that use the BIO tagging style [10].
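The BIO labeling of entity spans can be sketched as follows; the token positions and entity spans are hypothetical examples, with token indices standing in for Chinese characters:

```python
# Sketch of BIO labeling: turn entity spans into per-token tags.
# Spans below are hypothetical; indices stand in for Chinese characters.

def bio_tags(num_tokens, spans):
    """spans: list of (start, end, entity_type), end exclusive."""
    tags = ["O"] * num_tokens  # default: not a named entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # inside the entity
    return tags

# A 6-token sentence with a 2-token person and a 3-token organization
tags = bio_tags(6, [(0, 2, "PER"), (3, 6, "ORG")])
print(tags)  # ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG']
```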

Classification of Named Entities
The classification of named entities involves two steps:
Step 1: Data Processing: Based on the corpus, each word is labeled with a corresponding tag. Next, the corpus is serialized and a dictionary containing non-replicative words is constructed. The dictionary has the form: {"first word": [id1, counts], "second word": [id2, counts], ... }, where a raw word is in quotes and the square brackets contain the identification number of the word and the number of occurrences. Based on the counts, words with low frequencies are eliminated from the dictionary. Figure 5 shows the data processing step.
(Figure examples: a translated general sentence about Beijing cultural relics, and crime-related terms such as "hypnotic water," "psychedelic drug" and "answer package," each labeled with B-CRM/I-CRM tags.)
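The dictionary construction in Step 1 can be sketched as follows; the token list and frequency threshold are illustrative assumptions:

```python
from collections import Counter

# Sketch of the Step 1 dictionary: each distinct word is assigned an
# identification number and an occurrence count, then low-frequency
# words are dropped. The min_count threshold is illustrative.

def build_dictionary(tokens, min_count=2):
    counts = Counter(tokens)
    dictionary = {}
    next_id = 1
    for word, count in counts.items():
        if count >= min_count:
            dictionary[word] = [next_id, count]  # {"word": [id, counts]}
            next_id += 1
    return dictionary

tokens = ["a", "b", "a", "c", "a", "b"]
print(build_dictionary(tokens))  # {'a': [1, 3], 'b': [2, 2]}
```

The rare word "c" falls below the threshold and is eliminated, mirroring the low-frequency filtering described above.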
Step 2: Model Setup: The LSTM-CRF model was employed in the experiments; the LSTM layer is located at the bottom whereas the CRF layer is located on top. The softmax function was used to compute the probabilities of each target class over all possible target classes.
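The softmax computation can be sketched directly; the input scores are illustrative, not model outputs:

```python
import math

# Softmax over per-class scores: converts raw model outputs into a
# probability distribution over the target classes.

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]  # stabilized
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sum to 1 and preserve the ordering of the raw scores
```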
The hyper-parameters used in the experiments are shown in Table 2. The batch size was set to 64, meaning that 64 samples were processed in each training step. The dimension of the hidden state (hidden dim) was 300, the optimizer was Adam, the learning rate (lr) was 0.001 and the gradient clipping threshold (clip) was 5. Table 3 shows the proportions of the training dataset and testing dataset. The amount of training data was set to 1,606,734 and the amount of testing data was set to 178,526, corresponding to proportions of 90% and 10%, respectively. In the experiments, the extracted tokens were used as the basic unit of processing. After each epoch, the loss function value, global step, precision, recall and F1 score were recorded. Lower scores were obtained for the crime-related entities than for the other entity types. This is because the other three parts (person, location and organization) have been studied in public datasets by other researchers, whereas only this research covers the crime part. Also, since the corpus was created for crimes, it is more difficult to train the model to recognize crime-related entities.
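The recorded precision, recall and F1 score follow the standard definitions; the true/false positive and false negative counts below are illustrative, not the chapter's actual tallies:

```python
# Standard precision/recall/F1 from true positives (tp), false
# positives (fp) and false negatives (fn). Counts are illustrative.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```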
As expected, the models produce different results. The named entity recognition technique is limited in that it only extracts key entities from text, but does not analyze the entities. As a consequence, an investigator has to analyze the extracted entities and make decisions manually.

Conclusions
Automated monitoring of social media platforms and online discussion groups can provide insights into potential criminal events, enabling law enforcement to work proactively to mitigate the hazards. The combined LSTM-CRF model described in this chapter is able to extract key information from large volumes of Chinese text using named entity recognition. Experiments indicate that the automated extraction of key attributes such as person, location, organization and crime type is accomplished with a maximum precision of 87.58%, recall of 83.22% and F1 score of 85.24%. These results demonstrate that the methodology is effective at discovering potential criminal events.
Due to the absence of crime-related corpora, custom corpora had to be created for training and testing. Future research will focus on developing richer and larger corpora with criminal events. Training the model using these corpora would improve the overall performance.
A limitation of the methodology is that, while it identifies key entities, it cannot analyze them. Yang and Chow [14] have employed statistical methods to create relationships between entities. Future research will pursue this line of inquiry and also focus on relation extraction and emotional analysis using deep learning techniques.