Moderna's Vaccine Using the K-Nearest Neighbor (KNN) Method: An Analysis of Community Sentiment on Twitter

Abstract: COVID-19 is still present in Indonesia. The government has made efforts to stop the spread of the virus, including a vaccination program. Several types of vaccine are in use, one of which is the Moderna vaccine (mRNA-1273), which is administered intramuscularly. The vaccination program using the Moderna vaccine has drawn differing opinions from the public, especially among Twitter users. The opinions posted on Twitter serve as the data for this study of public sentiment about the Moderna vaccine using the K-Nearest Neighbor (KNN) method. In this study, the TF-IDF method is used to weight words and KNN is used to classify sentiment into two classes, positive and negative. The tools used are RapidMiner to collect tweet data and Python for sentiment classification and evaluation. In tests based on 50 training data with k = 3, the accuracy is 80%, precision is 80%, recall is 100%, and the F-measure is 89%.


Introduction
The world is currently being hit by the COVID-19 outbreak, which has severely disrupted people's activities everywhere, and it cannot yet be predicted when it will end (Baj et al., 2022). The virus can spread very quickly through direct contact or through the air. The government has made various efforts to stop the spread of the virus, one of which is a vaccination program using the Moderna vaccine. The responses of the Indonesian people to the Moderna vaccine have varied (Harapan et al., 2020): some welcome the vaccination program and others oppose it. In the current era of digitalization, people are more inclined to express themselves and their views through social media, one of which is Twitter (Ramadhani & Wahyudin, 2022). Public opinion on Twitter therefore serves as the data for this sentiment analysis of the Moderna vaccine.
Sentiment analysis is a text-analytics stage that draws on various data sources from the internet and social media platforms to obtain the opinions of the users on those platforms (Alshuwaier et al., 2022).
Sentiment analysis is the process of understanding and classifying emotions (positive or negative) contained in writing using text analysis techniques (Veritawati et al., 2015).
Several related studies are used as references in this research. The study entitled "Analysis of Sentiments to Astra Zeneca Vaccination on Twitter Using the Naïve Bayes and KNN Methods" discusses public opinion on Twitter about the Astra Zeneca vaccine, grouped into three sentiment classes: positive, neutral, and negative (Ramadhani & Wahyudin, 2022). On its data, the Naïve Bayes method reached an accuracy of 88.56% +/- 4.71% (micro average: 88.62%), while the KNN method reached 74.78% +/- 3.74% (micro average: 74.77%). Another study, entitled "Classification of Tweets on Twitter using the K-Nearest Neighbor (KNN) Method with TF-IDF weighting", analyzed tweets from the Kompas and detik news media and classified them into technology, health, economics, sports, and automotive groups (Satrio & Fauzi, 2019). According to its results, the smaller the k value used, the more accurate the KNN method. A third study, entitled "Application of Sentiment Analysis on Twitter Users Using the K-Nearest Neighbor Method", discusses sentiment about the 2017 DKI Pilkada, classified into two classes, positive and negative. That study obtained an accuracy of 67.2% for k = 5.
Unlike the research mentioned above, this study uses the Term Frequency-Inverse Document Frequency (TF-IDF) method for word weighting and the K-Nearest Neighbor (KNN) method to classify sentiment into two classes, positive and negative, using RapidMiner and the Python programming language as tools (Khalid et al., 2020).
In this study, the data analyzed were taken from Twitter from May 1 to May 16, 2022. The results are then checked at the evaluation measure stage to confirm whether the KNN method can be used effectively to analyze public sentiment on Twitter regarding the Moderna vaccine.

Method
Data Mining
Data mining is a field that combines statistical techniques, mathematics, artificial intelligence, and machine learning to extract and identify information from complex databases (Aher & Lobo, 2012). Its purpose is to uncover the characteristics of the observed data or object; the results can also serve as a reference for making decisions or for predicting future conditions based on the data analyzed (Silalahi & Simanullang, 2022). The data mining process is described in Figure 1.

Text Mining
Text mining is a technique for discovering previously unknown information, or rediscovering information, from automatically extracted text. The purpose of this technique is to obtain useful information from a collection of documents (Lestari & Saepudin, 2021). Many methods can be used to extract text data, but the first step is always data preprocessing.

K-Nearest Neighbor (KNN)
KNN is a simple method that classifies data according to the training data at the shortest distance (Akbar & Kusumodestoni, 2020; Na'iema et al., 2022). When the method is used to classify text, each word in the documents is first weighted using Term Frequency-Inverse Document Frequency (TF-IDF), which gives better results. The distance between documents is then calculated using the Euclidean distance.

Research Stages
The object of this study is the public opinion of Twitter users regarding the Moderna vaccine. The data used are tweets in Indonesian.
The steps taken in this research are shown in Figure 2.

Data Collection
Data collection in this study is a stage of data mining. Data mining is the process of extracting patterns from data so that the output is useful information. The goal is either to understand the behavior of the observed data, often referred to as description, or to estimate conditions that will occur in the future, referred to as prediction (Nikmatun & Waspada, 2019). The data collection process in this study used RapidMiner. The data collected are tweets about the Moderna vaccine from May 2022, saved in a CSV file.

Preprocessing Text
Preprocessing is the stage that prepares raw data before other processes are carried out (Naresh & Kiran, 2019). In general, it is carried out to eliminate inappropriate data or to change data into a form that is easier for the system to process. The processes carried out at this stage are as follows. Cleansing: removes attributes that have no effect, such as symbols, numbers, and links. Case folding: changes all the letters in the tweet document to lowercase; only the letters a to z are processed, and characters other than letters are left as they are. Tokenizing: cuts sentences into separate words at each space. Filtering: removes unnecessary words so that the calculation focuses on the words that matter most. Stemming: finds the base form of each word by removing its affixes.
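The cleansing, case folding, tokenizing, and filtering steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the RapidMiner process used in the study: the stopword list and the example tweet are hypothetical, and stemming is left out because it would need an Indonesian stemmer that the text does not name.

```python
import re

# Hypothetical stopword list, for illustration only; a real run would use a
# full Indonesian stopword list.
STOPWORDS = {"yang", "di", "dan", "ini", "itu", "sudah"}


def cleanse(text):
    """Cleansing: remove links, then numbers and symbols."""
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace


def preprocess(tweet):
    cleaned = cleanse(tweet)        # cleansing
    folded = cleaned.lower()        # case folding
    tokens = folded.split()         # tokenizing on whitespace
    return [t for t in tokens if t not in STOPWORDS]  # filtering


# Hypothetical example tweet (not taken from the study's data).
example = "Vaksin Moderna sudah tersedia di Puskesmas 24 jam! https://t.co/abc123"
print(preprocess(example))
```

Each step is a separate function call so that intermediate results can be inspected, in the same way the per-stage tables present one example tweet.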

Word Weighting
After the data have been cleaned, each word in each document is given a value using the Term Frequency-Inverse Document Frequency (TF-IDF) weighting formula, where d is the d-th document, t is the t-th word of the keywords, W is the weight of document d with respect to word t, and IDF is the inverse document frequency, which decreases as the word appears in more documents.
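The weighting can be illustrated with a short Python sketch. It assumes a common textbook form, W(d,t) = tf(d,t) × IDF(t) with IDF(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing word t; the exact variant used in the study is not stated here, and the example documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = tf(d, t) * log10(N / df(t)), an assumed textbook variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already preprocessed tweets.
docs = [
    ["vaksin", "moderna", "aman"],
    ["vaksin", "moderna", "demam"],
    ["vaksin", "aman", "aman"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word that occurs in every document, such as "vaksin" in this toy set, receives a weight of zero under this variant, which is why rare words dominate the later distance calculation.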

Classification
Classification is a method for grouping an object into a particular group or class (Rizki & others, 2019). The classification in this study uses the KNN method, which classifies an object based on the training data closest to it: new data are assigned to a class according to their k nearest neighbors. The distance is calculated with the Euclidean distance formula, where A is a test document, B is a training document, and t is the number of terms or words.
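For reference, the standard Euclidean distance between two weight vectors A and B over t terms is d(A, B) = sqrt(sum over i = 1..t of (A_i - B_i)^2). The sketch below applies it in a minimal KNN classifier with majority voting; the vectors and labels are hypothetical and the function names are chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, test_vector, k=3):
    """Label a test vector by majority vote among its k nearest neighbors."""
    distances = sorted(
        (euclidean(test_vector, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-term weight vectors, for illustration only.
train_vectors = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.2], [0.1, 0.9], [0.0, 0.8]]
train_labels = ["positive", "positive", "positive", "negative", "negative"]

print(knn_predict(train_vectors, train_labels, [0.85, 0.05], k=3))
```

Using an odd k with two classes, as in k = 3, avoids tied votes.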

Evaluation Measure
Evaluation measurement aims to measure the performance achieved by the system (Akhtar et al., 2021). Evaluation in this study is used to determine whether the system is optimal in detecting pages that are indicated to have semantic similarities to other pages. The measures used are precision, recall, accuracy, and F-measure (F1-score), computed from the confusion matrix shown in Table 1.
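The four measures follow directly from the confusion matrix counts: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below uses the standard definitions. The counts passed in are hypothetical: they are not taken from the paper's tables and were chosen only because, for 10 test items, they reproduce the percentages this paper reports.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 10 test tweets (an assumption, not the paper's data).
scores = evaluation_measures(tp=8, fp=2, fn=0, tn=0)
print({name: round(value, 2) for name, value in scores.items()})
```

With these counts the F1-score is 2 × 0.8 × 1.0 / 1.8 ≈ 0.89, which shows how a recall of 100% can coexist with an accuracy of 80% when no test item is predicted negative.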

Result and Discussion
Data Collection (Crawling)
The research data were collected using RapidMiner and stored in a CSV file before the preprocessing stage was carried out (Hofmann & Klinkenberg, 2016; Kunnakorntammanop et al., 2019). Of the data collected, 50 items are used as training data. The stages of crawling data with RapidMiner can be seen in Figure 2.

Preprocessing Results
Data preprocessing is a cleaning stage carried out before the data are processed further. In this study it was carried out using the RapidMiner application (Sudarsono et al., 2021). The previously collected data pass through a series of stages in RapidMiner: cleansing, case folding, tokenizing, filtering, and stemming. The cleansing step for one example tweet is shown in Table 2; in this stage, the period symbol is removed. The case folding stage changes all letters to lowercase, and an example tweet that has gone through this stage can be seen in Table 3. The tokenizing stage, which separates sentences into single words, can be seen in Table 4, and the filtering stage, which removes words that have no effect, can be seen in Table 5.

Results of Data Weighting
Data weighting is the stage of assigning a value or weight to the data. Each word in the cleaned data is given a weight using the TF-IDF method. In this study, the weighting was carried out using the RapidMiner application. The following is the result of weighting the data.

Test Results with Test Data
This stage is carried out using the Python programming language, run through Google Colab (Perez & Granger, 2015; Zuraimi & Zaman, 2021). The program classifies sentiment using the KNN method from the sklearn library available in Python, based on the value of k that is input into the program (Nguyen et al., 2018). The output displayed is the sentiment, either positive or negative. The training data consist of 50 items that have been labeled with sentiment: 35 positive and 15 negative, as can be seen in the following figure. The data to be tested are first weighted using the TF-IDF method and then input into the program. The following shows the test data entered into the Python program code.
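A classification step of this kind might look like the sketch below, which uses scikit-learn's `TfidfVectorizer` and `KNeighborsClassifier`. It is an assumption-laden illustration rather than the program used in the study: the tweets and labels are hypothetical, and the TF-IDF weighting is done inside Python here, whereas the study reports doing the weighting in RapidMiner.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical preprocessed tweets and labels, for illustration only.
train_texts = [
    "vaksin aman sehat",
    "vaksin bagus aman",
    "sehat bagus vaksin",
    "vaksin sakit demam",
    "takut sakit vaksin",
    "demam takut vaksin",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# Weight the training tweets with TF-IDF (scikit-learn's own variant).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Classify with KNN using Euclidean distance and a user-chosen k.
k = 3
knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
knn.fit(X_train, train_labels)

# Test tweets must be transformed with the vectorizer fitted on training data.
test_texts = ["aman sehat bagus", "sakit demam takut"]
predictions = knn.predict(vectorizer.transform(test_texts)).tolist()
print(predictions)
```

Fitting the vectorizer only on the training tweets keeps the test tweets from influencing the term weights.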

Evaluation Measure
The evaluation measure is the step taken to test the classification results of the system by checking their correctness (Arslan & Arslan, 2021; Iwendi et al., 2021). The data set is divided into two classes, positive and negative (Isnain et al., 2021; Shah et al., 2020). The evaluation in this study used 50 data items, divided into 80% training data and 20% testing data. The system is evaluated using a confusion matrix whose rows and columns give the true positives, false positives, true negatives, and false negatives. The evaluation measures were calculated using Python. The results of the calculations can be seen in Figure 7.
Figure 8 shows a bar chart of the evaluation measures calculated for the system.
Based on the tests carried out using 50 data items, of which 40 are used as training data and 10 as testing data, the results obtained are an accuracy of 80%, a precision of 80%, a recall of 100%, and an F1-score of 89%.
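The 80%/20% division of 50 labeled items into 40 training and 10 testing items can be sketched with scikit-learn's `train_test_split`. The placeholder texts, the fixed `random_state`, and the stratified sampling are assumptions made for this illustration; the paper does not state how the split was drawn.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 50 labeled items, 35 positive and 15 negative.
texts = [f"tweet {i}" for i in range(50)]
labels = ["positive"] * 35 + ["negative"] * 15

# 80% training, 20% testing; stratify keeps the class ratio in both parts
# (an assumption, not a detail given in the paper).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
print(sum(label == "negative" for label in y_test))
```

With only 10 test items, each misclassified tweet changes the accuracy by 10 percentage points, so the reported measures are best read as indicative.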

Figure 2. Schema of research

Figure 5. Label and training

Figure 6. Test Data Input
Based on the test data above, the weighted tweets are classified into the positive sentiment class.

Figure 8. Bar Chart of the Evaluation Measures

Table 2. Tweets with the Cleansing Process

Table 3. Tweets with Case Folding Process
it. Eid al-Fitr is still Prokes

Table 5. Tweets with Filtering Process