
Econo-ESA in semantic text similarity

Abstract

Explicit semantic analysis (ESA) relies on an immense Wikipedia index matrix in its interpreter part. The interpreter multiplies this large matrix by a term vector to produce a high-dimensional concept vector, and the similarity between two texts is then measured between two such concept vectors with numerous dimensions. Both the interpretation and the similarity measurement steps are therefore expensive. This paper proposes an economical scheme of ESA, named econo-ESA. We investigate two aspects of this proposal: dimensional reduction and experiments with various data, using eight recycled test collections in semantic text similarity. The experimental results show that both the dimensional reduction and the characteristics of the test collections influence the results. They also show that an appropriate concept reduction in econo-ESA can decrease the cost with only minor differences from the results of the original ESA.

Introduction

ESA (Gabrilovich and Markovitch 2007) is a unique approach in information retrieval and related research areas. The method measures the relatedness of two texts in a concept space rather than a term space; in this sense, the relationship is not limited to the lexical form of a text but is expanded to include its meaning. The method uses a straightforward scenario inside a vector space model and is therefore easy to understand. ESA is in fact a variant of the generalized vector space model (GVSM) that uses Wikipedia as its index corpus. Although its theoretical and mathematical foundation is not new, the method is unique because of its use of the constantly growing, openly edited online encyclopedia Wikipedia. Wikipedia is an additional advantage of ESA because it yields good results for word and text-fragment relatedness measurements (Gabrilovich and Markovitch 2007).

Researchers have further developed ESA for use in different environments, including information retrieval (Hassan and Mihalcea 2009; Polajnar et al. 2013; Potthast et al. 2008; Potthast et al. 2012; Scholl et al. 2010; Sorg and Cimiano 2010;2012; Tanase and Kapetanios 2012), image retrieval (Popescu and Grefenstette 2011; Zhang et al. 2012), semantic text similarity (STS) (Aggarwal et al. 2012; Martín et al. 2013; Szarvas et al. 2011), categorization (O’Banion et al. 2012; Szarvas et al. 2011; Szczuka et al. 2011), machine translation (Matsuno and Ishida 2011), and question-answering (Walter et al. 2012). It is also used in knowledge discovery (Yan and Jin 2012), music classification (Aryafar and Shokoufandeh 2011), learning systems (Schmidt et al. 2011), text disambiguation (Fernandez et al. 2011), and case based reasoning systems (Patelia et al. 2011). Thanks to Wikipedia’s language links, there are also numerous implementations in multilingual environments (Potthast et al. 2008; Schönhofen et al. 2008; Schmidt et al. 2011; Sorg and Cimiano 2010;2012; Tanase and Kapetanios 2012).

The method is simple, but the process is expensive because of the following two problems. First, to produce a concept vector, the overall index matrix must be multiplied by a term vector; a large index matrix requires numerous multiplications. Second, if Wikipedia has a million documents, then the concept space has a million dimensions; similarity or relatedness computations between two vectors with numerous dimensions are costly.

Because of these problems, we propose a new scheme called econo-ESA. Econo-ESA reduces the dimensions at the interpretation stage. We call this reduction a “safe dimensional reduction” because we derive a critical point at which the results remain similar despite the reduction. This research provides the following contributions:

  • Runtime reduction. Econo-ESA introduces a dimensional reduction of ESA in order to decrease the processing runtime while achieving similar results. A reduction in the processing time can be utilized by all relevant areas of research and application because ESA has been used in various applications and environments.

  • Experiments with various data. We use eight Glasgow test collections with different characteristics in this study. The Empirical evaluation section reports the experimental results in STS. STS measures the similarity between texts based on their meaning. This measurement is closely related to information retrieval, text disambiguation, and machine translation, where STS can be used as a base. The results allow us to set the amount of dimensional reduction according to our own applications.

Before outlining our method, we briefly explain ESA/GVSM in Section ‘ESA/GVSM overview’. We discuss related improvements to the method in Section ‘Improvements to the method’ and our proposed position toward each improvement. Section ‘Econo-ESA’ outlines our proposal, econo-ESA, in detail. Section ‘Empirical evaluation’ provides the experimental results and discussions of our proposed method and the original scheme. Section ‘Conclusion’ offers conclusions and recommendations for future study.

ESA/GVSM overview

This section briefly explains ESA/GVSM in order to provide a fundamental overview of this method as it relates to the overall discussion in this paper. The mathematical operation of ESA underlies the pseudo-code of the procedure in Section ‘Econo-ESA’. A procedural analysis then evaluates the superiority of our method in Section ‘Empirical evaluation’.

ESA (Gabrilovich and Markovitch 2007) is a variant of GVSM (Wong et al. 1987) and can be considered a Wikipedia-based GVSM (Anderka and Stein 2009). ESA considers each Wikipedia document to hold a unique concept that is described by its text. Preprocessing the overall Wikipedia corpus produces a matrix whose size, m×n, is defined by the number of documents m and the number of vocabulary terms n of Wikipedia; this matrix is typically read as the term vectors of the documents. ESA transposes the matrix into I, with size n×m, whose rows are then read as the concept vectors of the terms. This matrix I is the index matrix of ESA, the heart of the method.

ESA interprets a term vector x (with n dimensions) as a Wikipedia-based concept vector v (with m dimensions) by multiplying the index matrix I by the term vector x. This multiplication maps the term vector of a text into a higher-dimensional vector space that is considered to be a concept space.

Each weight v_j of concept dimension j in vector v is defined as (Gabrilovich and Markovitch 2007):

v_j = \sum_{k=1}^{n} x_k \times i_{jk}     (1)

where x_k is the weight of term k in vector x, and i_{jk} is the weight of concept j for term k. The weight of a term or a concept can be determined by its term frequency, collection frequency, or normalization component (Salton and Buckley 1988).
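As a toy illustration only, Equation (1) is simply a matrix-vector product. The following minimal Python sketch (using NumPy; the sizes and values are hypothetical stand-ins for the real Wikipedia data) shows the interpretation of a term vector into a concept vector:

    import numpy as np

    # Hypothetical toy index matrix, oriented so that row j holds the weights of
    # concept j for all terms, matching the indexing i_jk of Equation (1).
    m, n = 5, 4                          # 5 concepts, 4 vocabulary terms
    I = np.random.rand(m, n)             # stand-in for the Wikipedia index matrix
    x = np.array([0.0, 1.2, 0.0, 0.7])   # TFIDF term vector of an input text (n dimensions)

    v = I @ x                            # Equation (1): one dot product per concept dimension
    print(v.shape)                       # (5,) -> concept vector with m dimensions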

If ESA measures the semantic relatedness of two texts, both texts are first interpreted as two concept vectors u and v. The relatedness can then be computed in the concept vector space by a vector measure such as cosine similarity:

Sim(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}     (2)

This concept is the same as in GVSM. GVSM generates a transformed vector x' from a vector x with the function (Yang et al. 1998):

x' = A x     (3)

where A is an index matrix that is derived from a corpus.

Based on the above explanation, ESA computes the similarity of two texts in two stages: an interpretation stage and a similarity measurement stage. We propose reducing the index matrix I (A in GVSM terms) by reducing the number of documents, thus reducing the number of dimensions of the concept space. This requires fewer multiplications and therefore gives faster processing in both stages of ESA.
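The sketch below (Python/NumPy, with scaled-down hypothetical sizes) illustrates the intended effect: dropping rows of the index matrix shrinks the concept vectors, so both the interpretation products and the cosine computation operate on fewer dimensions. It is only an illustration of the idea, not the implementation used in this paper.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 6000, 2000                          # hypothetical corpus size (documents x terms)
    I = rng.random((m, n), dtype=np.float32)   # stand-in for the full index matrix

    keep = rng.choice(m, size=m // 2, replace=False)   # randomly keep 50% of the documents
    I_half = I[keep, :]                                # reduced index matrix

    x1 = rng.random(n, dtype=np.float32)       # term vectors of two texts (hypothetical)
    x2 = rng.random(n, dtype=np.float32)

    u, v = I_half @ x1, I_half @ x2            # interpretation: half as many multiplications
    cos = float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))  # similarity over m/2 dimensions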

Improvements to the method

This section describes the position of our proposal as compared to other research and discusses related improvements to the method in detail. We compare our proposal to each improvement and summarize related studies based on Sorg and Cimiano’s framework (Sorg and Cimiano 2010). This framework considers the index matrix, association strength, similarity measurement, and dimension limitation as ESA enhancements. We also consider the semantic interpretation process as an additional aspect to the framework. The prior enhancements are described below.

Index matrix

ESA uses a Wikipedia corpus as its index matrix source; the original ESA work also tested an ODP corpus and compared the results with Wikipedia. Hassan and Mihalcea (2009) used Wikipedia’s graph of category and intra-lingual links; they used the category to scale a concept’s weight based on its distance to the root and the intra-lingual link to map a concept into other languages. Martín et al. (2013) used a collection of research papers as an ESA corpus. Potthast et al. (2008) proposed a cross-language ESA (CL-ESA) for cross-language similarity measurements; they defined independent concept spaces that refer to the different language editions of the Wikipedia corpus. Sorg and Cimiano (2012) proposed a Cat-ESA and a Tree-ESA that deploy the category structures of Wikipedia. Szarvas et al. (2011) reported experimental results on the German and English knowledge resources of Wikipedia, Wiktionary, and WordNet/GermaNet for combined concept vector measurements. Aryafar and Shokoufandeh (2011) proposed using a music collection as an index matrix. Popescu and Grefenstette (2011) proposed using categories instead of Wikipedia articles. Szczuka et al. (2011) used DBpedia. Gottron et al. (2011) proposed an index matrix with the same topic as the test collection to provide better results. Scholl et al. (2010) proposed an extended ESA that uses other Wikipedia semantic properties such as article links and categories. Polajnar et al. (2013) proposed a new index matrix semantic kernel of size n×n, thus describing concept-to-concept similarity.

Although econo-ESA, like (Aryafar and Shokoufandeh 2011; Gottron et al. 2011; Hassan and Mihalcea 2009; Martín et al. 2013; Polajnar et al. 2013; Popescu and Grefenstette 2011; Potthast et al. 2008; Scholl et al. 2010; Sorg and Cimiano 2012; Szczuka et al. 2011; Szarvas et al. 2011), works on the index matrix, we do not propose a new index matrix. Instead, we investigate the dimensions of the matrix and show that econo-ESA can decrease the index matrix’s dimensions appropriately.

Association strength

An association strength function derives the values of each cell within an ESA index matrix. The original ESA uses a TFIDF scheme. Hassan and Mihalcea (2009) refined ESA association strength by considering the length of the articles. Sorg and Cimiano (2010) examined five association strength schemes: TF, TFIDF, TFIDF*, BM25, and cosine. They (Sorg and Cimiano 2012) also introduced TFICF, TFICF2, and TFICF3. TFICF provides the same function as TFIDF but for a concept space, while TFICF2 and TFICF3 implement quadratic and cubic values for ICF, respectively. Aryafar and Shokoufandeh (2011) used a mel-frequency cepstral coefficient (MFCC) scheme for their music index matrix. Fernandez et al. (2011) proposed a balanced ESA that considered the differences between short and long articles in Wikipedia.

We do not propose a new association strength scheme such as those in (Aryafar and Shokoufandeh 2011; Fernandez et al. 2011; Hassan and Mihalcea 2009; Sorg and Cimiano 2010;2012). Econo-ESA uses the original ESA scheme, TFIDF, in order to highlight the improvement of our proposal over the original ESA.

Similarity measurement

ESA utilizes a cosine metric to compute the semantic relatedness of two concept vectors. Hassan and Mihalcea (2009) replaced cosine similarity with a Lesk-like metric. Martín et al. (2013) explored two similarity measurements of two concept vectors: cosine and generalized Jaccard. Sorg and Cimiano (2010) explored TFIDF, KL-divergence, and LM schemes in addition to cosine similarity. Aggarwal et al. (2012) proposed TunedESA, an ESA with an added tuning feature; this method tunes ESA with three WordNet-based Lin semantic similarities of word functions in two sentences: the subject, action, and object of each sentence. Yan and Jin (2012) integrated the TFIDF similarity score with the ESA-based similarity score. Szarvas et al. (2011) proposed a new similarity measurement, sim_{avgprod}(t_1, t_2) = \sum_i (w_{c_i}^{t_1} + w_{c_i}^{t_2}) \cdot w_{c_i}^{t_1} \cdot w_{c_i}^{t_2}, where the relatedness of terms t_1 and t_2 is determined by the terms’ concept vector weights, t_1 = \{w_{c_1}^{t_1}, w_{c_2}^{t_1}, \ldots, w_{c_n}^{t_1}\} and t_2 = \{w_{c_1}^{t_2}, w_{c_2}^{t_2}, \ldots, w_{c_n}^{t_2}\}.

Econo-ESA uses the cosine measurement (Equation (2)) for the similarity measurement, the same as ESA, in order to evaluate our proposal directly against the original ESA. Although other schemes could easily be used in our proposal, we do not propose a new similarity scheme such as those in (Aggarwal et al. 2012; Hassan and Mihalcea 2009; Martín et al. 2013; Sorg and Cimiano 2010; Szarvas et al. 2011; Yan and Jin 2012).

Semantic interpretation process

To produce a concept vector, the original ESA multiplies a text vector by an index matrix. Martín et al. (2013) proposed a new method of semantic interpretation that considers the keywords, author, or journal information of a corpus. During the interpretation process, Potthast et al. (2012) proposed an ESA that removes from text d the terms that appear in text d_q, thus forming a new text d'; the similarity between d and d_q is then measured as the similarity between d' and d_q. Schmidt et al. (2011) proposed two mapping scenarios: direct CL and meta CL. Direct CL uses Wikipedia’s inter-language link if it exists; if the link does not exist, the method uses meta CL mapping.

Econo-ESA uses the original scheme of the semantic interpretation process. We do not propose a new strategy such as those in (Martín et al. 2013; Potthast et al. 2012; Schmidt et al. 2011).

Dimension limitation

Limitations of the dimension have come under several proposals: a dimension projection function, dimensional pruning, and the number of index documents. We describe each idea below and outline how econo-ESA differs from these proposals.

Dimension projection function

Sorg and Cimiano (2010) formalized the limitation and proposed four dimension limitation functions: absolute, absolute threshold, relative threshold, and sliding window. All functions work in the concept space. The absolute function keeps only the d dimensions with the highest values. The absolute threshold function limits the dimensions based on their values against a threshold. The relative threshold function uses a partial fraction of the highest-valued dimension. The sliding window function selects the first i dimensions from the results of the relative threshold function. All of the above dimension projection functions operate on the ordered concept vector. As noted in Section ‘ESA/GVSM overview’, this process occurs after the interpretation process.
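A hedged sketch of how these four projection functions might look in Python is shown below; the function names and the use of NumPy are ours, and the authoritative definitions are those of Sorg and Cimiano (2010).

    import numpy as np

    def absolute(v, d):
        """Keep only the d highest-valued concept dimensions; zero out the rest."""
        out = np.zeros_like(v)
        idx = np.argsort(v)[-d:]
        out[idx] = v[idx]
        return out

    def absolute_threshold(v, t):
        """Keep dimensions whose value reaches the threshold t."""
        return np.where(v >= t, v, 0.0)

    def relative_threshold(v, frac):
        """Keep dimensions whose value is at least a fraction of the maximum value."""
        return np.where(v >= frac * v.max(), v, 0.0)

    def sliding_window(v, frac, i):
        """Take the i highest-valued dimensions of the relative-threshold result."""
        vt = relative_threshold(v, frac)
        out = np.zeros_like(vt)
        idx = np.argsort(vt)[-i:]
        out[idx] = vt[idx]
        return out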

Dimensional pruning

Sorg and Cimiano (2012) also proposed dimensional pruning: dimensions with values under a specified threshold are set to zero. Similar to the dimension projection functions, this step is performed after the interpretation step.
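For illustration only, such pruning is a one-line operation on a concept vector (NumPy assumed; the threshold value is hypothetical):

    import numpy as np

    def prune(v, threshold=0.01):
        # Set every concept dimension whose value is below the threshold to zero.
        return np.where(v >= threshold, v, 0.0)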

Number of index documents

Anderka and Stein (2009) showed that a larger index matrix size (i.e., greater than 200,000 documents) provided more stable results. They also showed that additional documents in the index matrix increased the stability. However, their experiments used a test collection of only 50 documents from the Australian Broadcasting Corporation’s news mail service.

Dimension limitation of econo-ESA

To reduce the processing cost, econo-ESA modifies the index matrix of ESA and reduces the number of dimensions. This study is similar to previous proposals (Anderka and Stein 2009; Sorg and Cimiano 2010;2012) but additionally considers the cost reduction of the processes. Econo-ESA reduces costs by limiting the dimensions at the interpretation stage, which is different from previous studies; as a result, the overall processing cost is decreased in both the interpretation and the relatedness measurement stages. Previous proposals decreased computational costs only in the last step, relatedness measurement, and did not decrease the interpretation step costs, so the interpretation step still had to be performed on the entire index matrix of the system. Moreover, the additional vector sorting step required before the reduction (Sorg and Cimiano 2010;2012) also adds to the cost. We used eight test sets with different characteristics, whereas (Anderka and Stein 2009) performed experiments with a single test collection. The goal is to identify the trade-off between dimension reduction, processing time, and performance while considering whether test collection characteristics may affect the results. Our experimental results can therefore guide the selection of an appropriate reduction according to the desired performance and speed, as well as the characteristics of the text to be processed; the results of (Anderka and Stein 2009) cannot be used for this purpose.

Econo-ESA

This section describes an economical variant of ESA, named econo-ESA, which proposes a new research direction: decreasing ESA costs with negligible impact on the results. This research focuses on the index matrix and the limitation of its dimensions. Therefore, we use the original ESA schemes for the following aspects: association strength, similarity measurement, and the semantic interpretation process.

The subsections that follow explain the concept of econo-ESA. The safe dimensional reduction decreases the cost with negligible impact on the results, and the procedural analysis then explains how the reduction leads to a faster procedure.

Safe dimensional reduction

The econo-ESA proposal suggests a way to determine how much the index matrix can be reduced without affecting the results. To our knowledge, this has not been suggested in previous studies (Aggarwal et al. 2012; Anderka and Stein 2009; Aryafar and Shokoufandeh 2011; Fernandez et al. 2011; Gottron et al. 2011; Hassan and Mihalcea 2009; Martín et al. 2013; Polajnar et al. 2013; Popescu and Grefenstette 2011; Potthast et al. 2008; Scholl et al. 2010; Sorg and Cimiano 2010;2012; Szczuka et al. 2011; Szarvas et al. 2011; Yan and Jin 2012). We propose a dimensional reduction of the Wikipedia matrix. The proposal builds on a previous ESA experiment with different index matrix sizes, reported in (Anderka and Stein 2009). Based on these data, we build a fitted model with an appropriate regression technique. After deriving the model from the data, we select the data point with the least squared error. This point corresponds to a number of documents expressed as a percentage of the total corpus.

Table 1 shows the ESA experimental data of Pearson’s correlation coefficient (PCC) under different index matrix conditions. The PCC of two data sets x and y is determined by Equation (4). The experiments were performed for 1,000, 10,000, 50,000, 100,000, 150,000, and 200,000 random Wikipedia documents, with several association strength schemes and several corpora. Table 1 shows the single row of data resulting from the Wikipedia TFIDF index matrix; we chose this row because it uses the original ESA scheme.

PCC = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - (\sum x)^2\right)\left(n \sum y^2 - (\sum y)^2\right)}}     (4)
Table 1 PCC of ESA based on the number of index documents (Anderka and Stein 2009)

Based on Table 1, we then built a model of ESA PCC performance. We chose two nonlinear regression models: logarithmic and logistic. The curves of both models appear to fit the data shown in Table 1. The logarithmic and logistic model functions of the number of documents x and the PCC prediction PCC_pr are shown in Equations (5) and (6), respectively (Walpole et al. 2007).

PCC_{pr} = \beta_0 + \beta_1 \ln(x)     (5)

PCC_{pr} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}     (6)

We used the OpenOffice Calc solver and the MS Office Excel logarithmic trendline to estimate the logarithmic function’s coefficients. Calc produced β0 and β1 as 0.706869 and 0.006565, respectively; Excel produced 0.706 and 0.006. We used the logistic regression procedure of the R statistical software for the logistic model, which gave b0 and b1 as 1.192 and 5.541×10−7, respectively.

We selected one model of the three based on its mean squared error (MSE) (Additional file 1, yellow highlight). The smallest MSE among the three belongs to the logarithmic model generated by the OpenOffice Calc solver, as shown in Table 2. We therefore selected the Calc-generated model for the econo-ESA proposal, as shown in Figure 1.

Table 2 MSE comparison
Figure 1 Logarithmic model.

Now we determine the econo-ESA reduction percentage. From the model shown in Figure 1, we take the point with the least squared error between the model and the data (Additional file 1, blue highlight). This criterion was met at 100,000 documents, half of the 200,000 documents in the experiment. The PCC and PCC_pr at this point are 0.782 and 0.782448, respectively; their similarity, computed with Equation (7), is 0.999427.

PCC_{sim} = 1 - \frac{\left| PCC_{pr} - PCC \right|}{PCC}     (7)

Based on this result, we set our expectation: econo-ESA should be similar to ESA at a 50% reduction of the dimensions. Our experiments are reported in the Empirical evaluation section.

Procedural analysis

This subsection describes the cost savings of econo-ESA. We consider a common procedure of ESA in two parts: the interpretation process procedure and the cosine similarity procedure. We then analyze the processing costs.

Interpretation procedure

Procedure 1 is an implementation of the ESA interpretation step; different implementations may exist depending on the implementer’s choices. The procedural inputs are an array T holding a term vector and a matrix W holding the Wikipedia corpus. The array has n elements, equal to the number of vocabulary terms in the system. Matrix W stores Wikipedia’s m×n inverted index for the interpretation process. The output of this procedure is an array C with m elements, matching the total number of documents in the corpus. Each element of array C holds the weight of one concept dimension in the concept space.

Procedure 1: Interpret a given term vector into a concept vector
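The procedure listing itself is published as a figure and is not reproduced here. The following Python sketch shows the loop structure assumed by the cost analysis below (an outer loop over the m concepts and an inner loop over the n terms); the variable names follow the description above, and the implementation details are our assumption.

    def interpret(T, W):
        """Interpret a term vector T (length n) into a concept vector C (length m).

        W is the index matrix: W[i][j] is the weight of concept i for term j.
        """
        m, n = len(W), len(T)
        C = [0.0] * m
        for i in range(m):           # one pass per concept dimension
            acc = 0.0
            for j in range(n):       # Equation (1): accumulate x_k * i_jk
                acc += T[j] * W[i][j]
            C[i] = acc
        return C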

According to Procedure 1, the computational cost C_I is:

C_I = \sum_{i=0}^{m-1} \left( 2 + \sum_{j=0}^{n-1} 1 \right) = \sum_{i=0}^{m-1} (2 + n) = m(2 + n).

If we decrease the number of documents to 50% of m, then C_I is halved, which means the procedure becomes twice as fast, while we expect results similar to ESA. Variable n also appears in the above analysis; texts with more terms therefore require more running time.

Cosine similarity

Procedure 2 is a cosine similarity implementation. The procedural inputs are arrays C1 and C2, which represent concept vectors c_1 and c_2. The output of this procedure is the similarity value, a fractional number between 0 and 1.

Procedure 2: Measure similarity score of two concept vectors
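Again, the published listing is a figure; a minimal Python sketch of the single pass over the m concept dimensions assumed by the cost analysis below might look as follows (the exact operation count of the real Procedure 2 may differ slightly).

    import math

    def cosine(C1, C2):
        """Cosine similarity of two concept vectors of equal length m (Equation (2))."""
        dot = norm1 = norm2 = 0.0
        for a, b in zip(C1, C2):     # three accumulations per concept dimension
            dot += a * b
            norm1 += a * a
            norm2 += b * b
        if norm1 == 0.0 or norm2 == 0.0:
            return 0.0
        return dot / (math.sqrt(norm1) * math.sqrt(norm2))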

The analysis of Procedure 2 shows that the number of documents m also affects the cost C_M of the similarity computation step:

C_M = 4 + \sum_{i=0}^{m-1} 3 = 4 + 3m \approx 3m.

Similar to Procedure 1, if we decrease the number of documents by 50% of m, then the value of C_M is halved. This further strengthens the advantage of the econo-ESA proposal.

Running cost of typical STS computation

One STS computation by ESA requires two interpretation steps and one cosine similarity step. The overall running cost can be defined as:

C_S = 2\left(m(2 + n)\right) + (4 + 3m) = m(2n + 7) + 4 \approx m(2n + 7).     (8)

If we reduce m to 50% of its original value, the overall cost is halved, meaning that the procedure will be roughly twice as fast.
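As a quick numerical illustration of Equation (8), the cost model predicts the speed-up of each reduction level directly; the values of m and n below are hypothetical and serve only to show the arithmetic.

    def sts_cost(m, n):
        # Equation (8): cost of one STS computation (two interpretations + one cosine).
        return 2 * (m * (2 + n)) + (4 + 3 * m)

    m_full, n = 1_000_000, 50      # hypothetical index size and text length
    for frac in (0.4, 0.5, 0.6, 0.7):
        m_red = round(m_full * frac)
        speedup = sts_cost(m_full, n) / sts_cost(m_red, n)
        print(f"{round(frac * 100)}% index: predicted speed-up = {speedup:.2f}x")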

Empirical evaluation

Experimental setup

To evaluate our proposal against the original ESA, we adopt widely used evaluation metrics. We evaluate the PCC, precision, recall, F-score, and running time of the ESA and econo-ESA systems. PCC, precision, recall, and F-score evaluate the similarity of the two systems’ results, while running time evaluates the costs.

We used the Microsoft Wikipedia corpus (MSwik) from (Yih et al. 2011) in the semantic interpreter part of ESA. MSwik is 218.44 MB in size and can be downloaded from http://research.microsoft.com/en-us/downloads/fd8811ca-a57f-4803-8f5c-41b536bf3a80/. The data set consists of 60,730 samples of Wikipedia English-Spanish articles from 2009; we used only the English articles, with a vocabulary of 20,000 terms. ESA used all 60,730 documents, while econo-ESA used only half, 30,364 documents. For comparison, we also ran econo-ESA with 40%, 60%, and 70% of the documents (24,292, 36,438, and 42,511 documents, respectively). The documents were selected randomly. We refer to the econo-ESA schemes as “econo40”, “econo50”, “econo60”, and “econo70” according to their semantic interpreter sizes. The ESA and econo-ESA implementations were written in Perl 5.12.3 and MySQL 5.5.16, and the experiments ran on a 3.4 GHz Intel Core i7 PC with 8 GB RAM.
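The paper does not reproduce the sampling code; a minimal sketch of how the reduced semantic interpreters could be drawn at random from the full document set (document identifiers and the helper name are hypothetical) is:

    import random

    def sample_index(doc_ids, fraction, seed=42):
        """Randomly keep a fraction of the index documents for a reduced interpreter."""
        k = round(len(doc_ids) * fraction)
        rng = random.Random(seed)
        return sorted(rng.sample(doc_ids, k))

    all_docs = list(range(60730))                     # MSwik English articles
    schemes = {f"econo{round(f * 100)}": sample_index(all_docs, f)
               for f in (0.4, 0.5, 0.6, 0.7)}
    print({name: len(ids) for name, ids in schemes.items()})   # each keeps fraction * 60730 documents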

For evaluation purposes, we used eight recycled test collections generated from the Glasgow test collections. All test sets can be accessed from http://ir.dcs.gla.ac.uk/resources/test_collections/. These eight test collections were fitted to the semantic text similarity task as reported in (Rahutomo et al. 2012). We used these test collections because of their varied characteristics, which are shown in Table 3. The “Docs.”, “Qys.”, and “Rel.” columns give each dataset’s number of documents, queries, and relevant pairs, respectively. The next columns show the term distribution characteristics of the document texts and query texts; the “Min.”, “Q1”, “Med.”, “Q3”, and “Max.” sub-columns give the minimum, first quartile, median, third quartile, and maximum values, respectively. We pre-processed all eight Glasgow test collections in the same way as MSwik, referring to the 20,000-term vocabulary of MSwik. The pre-processing applied neither stemming nor folding to the terms and determined their TFIDF values (a small sketch of this vectorization step follows Table 3). ESA and econo-ESA then transformed each document or query term vector into a concept vector. Finally, the cosine similarity metric measured the semantic similarity between the document and query concept vectors.

Table 3 Test collection characteristics
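For illustration, the vectorization step described above can be sketched as follows; the TFIDF weighting formula and data structures here are a common textbook variant (Salton and Buckley 1988) standing in for the exact weighting used with MSwik.

    import math
    from collections import Counter

    def tfidf_vector(text_terms, vocab, df, n_docs):
        """Build a TFIDF term vector restricted to the fixed MSwik vocabulary.

        vocab maps term -> index (20,000 entries); df maps term -> document frequency.
        """
        tf = Counter(t for t in text_terms if t in vocab)
        vec = [0.0] * len(vocab)
        for term, count in tf.items():
            idf = math.log(n_docs / (1 + df.get(term, 0)))
            vec[vocab[term]] = count * idf
        return vec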

Because the test collections were derived from information retrieval data sets, the human judgments are expressed as binary values. To measure PCC, we selected two similarity thresholds for both the ESA and econo-ESA schemes, 0.5 and 0.6, based on constraints used in previous studies (Islam and Inkpen 2008; Mihalcea et al. 2006).
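The PCC comparison between the two systems can then be sketched as below; thresholding both systems’ scores before computing Equation (4) is our reading of the setup, and the score lists are hypothetical.

    import math

    def pcc(x, y):
        """Pearson's correlation coefficient of two equally long score lists (Equation (4))."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
        return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))

    threshold = 0.5
    esa_scores = [0.71, 0.42, 0.55, 0.12]      # hypothetical cosine scores of document-query pairs
    econo_scores = [0.68, 0.45, 0.52, 0.15]
    esa_bin = [1 if s >= threshold else 0 for s in esa_scores]
    econo_bin = [1 if s >= threshold else 0 for s in econo_scores]
    print(pcc(esa_bin, econo_bin))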

We then measured the running time to assess the processing cost, using the special tag “:hireswallclock” of a benchmark package in our Perl application. We measured the running time of several processes: interpretation, cosine measurement, and STS. For the interpretation process, we measured the running time of document and query texts for both ESA and econo-ESA, using five randomly selected texts from all of the test collections. For the cosine similarity procedure, we measured the running time of five randomly selected document-query pairs from all of the test sets. For the STS process, we combined the previous results of the interpretation and cosine measurement processes. We then calculated and compared the average scores.

Results and discussion

ESA and econo-ESA correlation

Tables 4 and 5 show the PCC evaluation results for the ESA and econo-ESA schemes at the 0.5 and 0.6 similarity thresholds, respectively. The experimental results show that econo-ESA met our first expectation: the econo-ESA results are similar to those of ESA at a 50% reduction of documents. The average PCC values at the 0.5 and 0.6 similarity thresholds were 0.9226 and 0.8276, respectively; the agreement between ESA and econo-ESA is thus best at the 0.5 similarity threshold and worst at the 0.6 threshold. A further inspection of the tables shows that the variation in test collections, whether in collection size or in the term size of the texts, influenced the results.

Table 4 PCC of 0.5 similarity threshold
Table 5 PCC of 0.6 similarity threshold

We present the results differently from those of Anderka and Stein (2009) for several reasons. First, their results were gathered from the PCC between ESA scores and human judgments, where each judgment was the average of eight to twelve individual judgments and therefore a fractional number between 0 and 1. We do not follow this approach because the human judgments in the Glasgow test collections we used are binary values of either 0 or 1. Second, Table 1 gives no PCC values between the results of ESA with different index matrices, and we cannot derive them because we do not have the complete raw results of those experiments; in particular, we cannot determine the Σxy term of Equation (4). For these reasons, we do not compare our results to those of Anderka and Stein, but we do compare the ESA and econo-ESA results.

In addition, the collection size determines the number of rows in the experiment, because similarity computations were performed for all possible pairs of document and query texts. Increasing the threshold decreased the PCC values of all test sets except LISA, with decrements between 0.0164 and 0.5643. The decrement in the PCC of ADI was the largest because ADI is the smallest test collection: it has only 82 documents and 35 queries, producing 2,870 rows, so small differences between the ESA and econo-ESA results yield a greater PCC decrement. In contrast, NPL is the largest collection, with 11,429 documents and 93 queries. A large test collection such as NPL is more stable under threshold tuning, while a minimal collection such as ADI is more susceptible.

As for term size, we cannot discern any pattern from the results. The term sizes of NPL and CACM are small, but their results are very different: NPL’s PCC is the highest among the eight test sets, while CACM’s is the second lowest. Cranfield and Medline have large term sizes along with high PCCs of more than 0.9. Time has a greater term size than Cranfield and Medline, but its PCC is lower than theirs, and NPL has a smaller term size but a higher PCC. Thus term size shows no clear impact on the PCC of each test collection. According to our experiments, a 50% reduction of the Wikipedia index matrix yields similar results across different test collections.

The other econo-ESA schemes show that the PCC increases with the number of documents used in the interpretation. In Table 4, at the 0.5 similarity threshold, all of the test sets show this trend except Cranfield and Medline, for which econo50 is superior to all other schemes; the average value of econo50 is also superior to the other schemes because of the Cranfield and Medline results. In Table 5, at the 0.6 similarity threshold, the results follow the same trends as Table 4, and econo50 is again superior to the other schemes for the Cranfield and Medline test sets. However, Table 5 shows a different average result from Table 4: econo70 is the best scheme.

ESA and econo-ESA STS performance

Tables 6 and 7 show the precision, recall, and F-score performance of the ESA and econo-ESA schemes at the 0.5 and 0.6 similarity thresholds, respectively. Table 6 shows identical precision results for the LISA, NPL, CACM, CISI, Medline, and ADI experiments, and Table 7 shows the same for LISA, NPL, CACM, CISI, and Cranfield. Most of the F-score values of ESA and econo-ESA are the same, and where they differ, the gaps are small. Differing results appear in all of the recall experiments except ADI at the 0.5 similarity threshold; nevertheless, the differences are small. Econo-ESA with a 50% reduction of the ESA index matrix therefore shows results similar to ESA.

Table 6 Precision, Recall, and F-score of ESA and econo-ESA with 0.5 threshold value
Table 7 Precision, Recall, and F-score of ESA and econo-ESA with 0.6 threshold value

Processing cost

The experimental results are shown in Figures 2, 3, 4, 5, and 6. Interpretation experiments were performed for five random queries and documents from all test collections. Figure 2 shows the interpretation cost diagram for all query experiments. We calculated the ratio between the ESA and econo-ESA schemes and report the averages in the first row of Table 8. As shown in the results, econo40 was the fastest scheme, followed by econo50, econo60, and econo70. The number of documents in the index matrix clearly influences the results.

Figure 2 Running time of query texts.

Figure 3 Running time of document texts.

Figure 4 ESA/econo-ESA scatter plot of cosine similarity procedure.

Figure 5 ESA/econo-ESA scatter plot of STS procedure.

Figure 6 Running time of the different schemes.

Table 8 ESA/econo-ESA cost ratio

Figure 3 shows the interpretation costs of all the documents. The cost ratios between ESA and econo-ESA are shown in the second row of Table 8. The trend is the same as in the previous process: the ESA/econo-ESA cost ratio in row 2 increases as the size of the Wikipedia index is decreased.

We measured the cosine similarity processing cost of five pairs from each test collection, collecting 40 measurements across all test collections and schemes. The results are shown in Figure 4, a scatter plot comparing the running times of ESA and econo-ESA on the same text pairs. The average ESA/econo-ESA ratios are shown in the third row of Table 8. In this process, econo60 is faster than econo50. Note, however, that we used only 40 measurements from randomly selected texts; the results might differ slightly with more texts, and in fact the gap between econo50 and econo60 is very small. A further inspection of the results reveals no pattern based on the number of terms: econo50 and econo60 beat each other irregularly. Based on this, the 60% index matrix of econo-ESA is also a good candidate because the PCC of econo60 is better than that of econo50 at the 0.6 similarity threshold. The number of terms in a text has no impact on this process because each term vector has already been interpreted as a concept vector with the same number of dimensions.

Figure 5 shows the running time scatter plot of a typical STS computation. The figure compares 40 measurements between ESA and econo-ESA on the same text pairs from all test collections, for all schemes. We obtained these running costs from the results of the previous interpretation and cosine measurement experiments. The fourth row of Table 8 shows the experimental averages.

In general, additional documents in the index matrix increase the processing cost, as shown in Figure 6. Consistent with the calculations in the Procedural analysis section, econo50 is twice as fast as the original ESA. However, in our experiments econo60 was faster than econo50 in the cosine similarity process. In a typical STS process, which includes interpreting a query and a document and one cosine similarity computation, the number of terms influences the result: as shown in Figure 6, Time is the most costly test set because of its term size. In these experiments, econo50 met our expectation, and econo60 can be considered the next econo-ESA candidate because it showed better results than econo50 in several instances.

Results trade-off and further use

We examined the experimental results in terms of processing time, precision, recall, and F-score, which were not reported in the previous study (Anderka and Stein 2009), so we can provide an overview of the trade-offs. In general, decreasing the index matrix makes the processes run faster but lowers the correlation with the original ESA. Precision, recall, and F-score of ESA and econo50 are the same or very close. We determined that a 50% index matrix reduction is the best choice based on the analysis in Section ‘Safe dimensional reduction’, but the results did not always match this expectation: econo50 is superior for Cranfield and Medline at both the 0.5 and 0.6 similarity thresholds, but not for the other test sets. As for processing time, the ESA/econo-ESA cost ratios of econo50 and econo60 do not differ much in any process, so econo60 can be used for better results than econo50 at little extra cost.

We hoped to obtain a pattern based on the test set characteristics. While we determined that a small test set such as ADI will be strongly impacted by the change in the similarity threshold, we could not show a clear relationship between the amount of dimension reduction, the amount of text in the test set, and the PCC in the results, particularly with Cranfield and Medline. We intend to investigate how we can balance these relationships in the future.

The results indicate that econo50 provides good results, and a slight cost sacrifice for econo60 can improve on them. Econo60 may be preferred for short texts because its running time rises gently, as for CACM and NPL in Figure 6. For longer texts such as Time, the running time rises steeply, so the use of econo50 is advised.

Conclusion

This paper proposed the econo-ESA scheme, which decreases the cost of ESA while providing similar results. We proposed a safe dimensional reduction that uses modelling steps to determine an appropriate reduction percentage of 50%. Our experiments showed that a 50% reduction of the index matrix gave average PCC results of 0.9226 and 0.8276 at the 0.5 and 0.6 similarity thresholds, respectively. Precision, recall, and F-score results of ESA and econo-ESA were also similar; where the results differed, the gaps were small. The running cost experiments showed that econo-ESA was faster than ESA: because the reduction percentage was set to 50%, the procedure was almost twice as fast, as expected.

We also found that using 60% of the index matrix can be faster than using 50% in some cases, and that it gives a better PCC than 50% at the 0.6 similarity threshold. Based on these experiments, the 60% index matrix can also be considered a good candidate for the econo-ESA proposal. For practical use, we recommend econo50 for long texts and econo60 for short texts.

Our future work on this topic will investigate the decremental patterns of the index matrix. Different choices of documents during the dimensional reduction would provide different results; a reduction based on topics using Wikipedia categories seems as reasonable as reductions based on clustering or classification techniques.

References

  • Aggarwal N, Asooja K, Buitelaar P: DERI&UPM: Pushing corpus based relatedness to similarity: shared task system description. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval ’12. Stroudsburg, PA, USA: Association for Computational Linguistics; 2012:643-647. http://dl.acm.org/citation.cfm?id=2387636.2387745

  • Anderka M, Stein B: The ESA retrieval model revisited. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’09. New York, NY, USA: ACM; 2009:670-671. http://doi.acm.org/10.1145/1571941.1572070

  • Aryafar K, Shokoufandeh A: Music genre classification using explicit semantic analysis. In Proceedings of the 1st international ACM workshop on Music information retrieval with user-centered and multimodal strategies, MIRUM ’11. New York, NY, USA: ACM; 2011:33-38. http://doi.acm.org/10.1145/2072529.2072539

  • Fernandez I, Alegria I, Ezeiza N: Semantic relatedness for named entity disambiguation using a small wikipedia. In Text, Speech and Dialogue, Volume 6836 of Lecture Notes in Computer Science. Edited by: Habernal I, Matous̆ek V. Berlin, Heidelberg: Springer; 2011:276-283. http://dx.doi.org/10.1007/978-3-642-23538-2_35

  • Gabrilovich E, Markovitch S: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference on Artificial intelligence, IJCAI’07. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2007:1606-1611.

  • Gottron T, Anderka M, Stein B: Insights into explicit semantic analysis. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM ’11. New York, NY, USA: ACM; 2011:1961-1964. http://doi.acm.org/10.1145/2063576.2063865

  • Hassan S, Mihalcea R: Cross-lingual semantic relatedness using encyclopedic knowledge. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP ’09. Singapore: Association for Computational Linguistics; 2009:1192-1201. http://dl.acm.org/citation.cfm?id=1699665

  • Islam A, Inkpen D: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans Knowl Discov Data 2008, 2(2):25. http://doi.acm.org/10.1145/1376815.1376819

  • Matsuno J, Ishida T: Constraint optimization approach to context based word selection. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI’11. Edited by: Walsh T. IJCAI/AAAI; 2011:1846-1851. http://ijcai.org/papers11/Papers/IJCAI11-309.pdf

  • Martín GH, Schockaert S, Cornelis C, Naessens H: Using semi-structured data for assessing research paper similarity. Inf Sci 2013, 221:245-261. http://dx.doi.org/10.1016/j.ins.2012.09.044

  • Mihalcea R, Corley C, Strapparava C: Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI’06. Edited by: Cohn A. Boston, Massachusetts: AAAI Press; 2006:775-780. http://dl.acm.org/citation.cfm?id=1597538.1597662

  • O’Banion S, Birnbaum L, Hammond K: Social media-driven news personalization. In Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web, RSWeb ’12. New York, NY, USA: ACM; 2012:45-52. http://doi.acm.org/10.1145/2365934.2365943

  • Patelia A, Chakraborti S, Wiratunga N: Selective integration of background knowledge in TCBR systems. In Proceedings of the 19th International Conference on Case-Based Reasoning Research and Development, ICCBR’11. Berlin, Heidelberg: Springer-Verlag; 2011:196-210. http://dx.doi.org/10.1007/978-3-642-23291-6_16

  • Polajnar T, Aggarwal N, Asooja K, Buitelaar P: Improving ESA with document similarity. In Proceedings of the 35th European Conference on Advances in Information Retrieval, ECIR’13. Berlin, Heidelberg: Springer-Verlag; 2013:582-593. http://dx.doi.org/10.1007/978-3-642-36973-5_49

  • Popescu A, Grefenstette G: Social media driven image retrieval. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR ’11. New York, NY, USA: ACM; 2011. http://doi.acm.org/10.1145/1991996.1992029

  • Potthast M, Stein B, Anderka M: A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, Volume 4956 of Lecture Notes in Computer Science. Edited by: Macdonald C, Ounis I, Plachouras V, Ruthven I, White R. Berlin, Heidelberg: Springer; 2008:522-530. http://dx.doi.org/10.1007/978-3-540-78646-7_51

  • Potthast M, Stein B, Loose F, Becker S: Information retrieval in the commentsphere. ACM Trans Intell Syst Technol 2012, 3(4):21. http://doi.acm.org/10.1145/2337542.2337553

  • Rahutomo F, Kitasuka T, Aritsugi M: Test collection recycling for semantic text similarity. In Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services, iiWAS ’12. New York, NY, USA: ACM; 2012:286-289. http://doi.acm.org/10.1145/2428736.2428784

  • Salton G, Buckley C: Term-weighting approaches in automatic text retrieval. Inf Process Manag 1988, 24(5):513-523. http://dx.doi.org/10.1016/0306-4573(88)90021-0

  • Schönhofen P, Benczúr A, Csalogány K, Bíró I: Advances in multilingual and multimodal information retrieval. In Proceedings of the 19th international conference on Case-Based Reasoning Research and Development. Berlin, Heidelberg: Springer-Verlag; 2008:72-79. http://dx.doi.org/10.1007/978-3-540-85760-0_9

  • Scholl P, Böhnstedt D, Domínguez García R, Rensing C, Steinmetz R: Extended explicit semantic analysis for calculating semantic relatedness of web resources. In Sustaining TEL: From Innovation to Learning and Practice, Volume 6383 of Lecture Notes in Computer Science. Edited by: Wolpers M, Kirschner P, Scheffel M, Lindstaedt S, Dimitrova V. Berlin, Heidelberg: Springer; 2010:324-339. http://dx.doi.org/10.1007/978-3-642-16020-2_22

  • Schmidt S, Scholl P, Rensing C, Steinmetz R: Cross-lingual recommendations in a resource-based learning scenario. In Towards Ubiquitous Learning, Volume 6964 of Lecture Notes in Computer Science. Edited by: Kloos C, Gillet D, Crespo García R, Wild F, Wolpers M. Berlin, Heidelberg: Springer; 2011:356-369. http://dx.doi.org/10.1007/978-3-642-23985-4_28

  • Sorg P, Cimiano P: An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In Proceedings of the 14th international conference on Applications of Natural Language to Information Systems, NLDB’09. Berlin, Heidelberg: Springer-Verlag; 2010:36-48. http://dx.doi.org/10.1007/978-3-642-12550-8_4

  • Sorg P, Cimiano P: Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl Eng 2012, 74:26-45. http://dx.doi.org/10.1016/j.datak.2012.02.003

  • Szczuka M, Janusz A, Herba K: Clustering of rough set related documents with use of knowledge from DBpedia. In Rough Sets and Knowledge Technology, Volume 6954 of Lecture Notes in Computer Science. Edited by: Yao J, Ramanna S, Wang G, Suraj Z. Berlin, Heidelberg: Springer; 2011:394-403. http://dx.doi.org/10.1007/978-3-642-24425-4_52

  • Szarvas G, Zesch T, Gurevych I: Combining heterogeneous knowledge resources for improved distributional semantic models. In Computational Linguistics and Intelligent Text Processing, Volume 6608 of Lecture Notes in Computer Science. Edited by: Gelbukh A. Berlin, Heidelberg: Springer; 2011:289-303. http://dx.doi.org/10.1007/978-3-642-19400-9_23

  • Tanase D, Kapetanios E: Are SKOS concept schemes ready for multilingual retrieval applications? In Proceedings of the 8th International Conference on Semantic Systems, I-SEMANTICS ’12. New York, NY, USA: ACM; 2012:149-156. http://doi.acm.org/10.1145/2362499.2362520

  • Walpole RE, Myers RH, Myers SL, Ye K: Probability & statistics for engineers and scientists. Upper Saddle River: Pearson Education; 2007.

  • Walter S, Unger C, Cimiano P, Bär D: Evaluation of a layered approach to question answering over linked data. In Proceedings of the 11th international conference on The Semantic Web - Volume Part II, ISWC’12. Berlin, Heidelberg: Springer-Verlag; 2012:362-374. http://dx.doi.org/10.1007/978-3-642-35173-0_25

  • Wong SK, Ziarko W, Raghavan VV, Wong PC: On modeling of information retrieval concepts in vector spaces. ACM Trans Database Syst 1987, 12(2):299-321. http://doi.acm.org/10.1145/22952.22957

  • Yang Y, Carbonell JG, Brown RD, Frederking RE: Translingual information retrieval: learning from bilingual corpora. Artif Intell 1998, 103(1-2):323-345. http://dx.doi.org/10.1016/S0004-3702(98)00063-0

  • Yan P, Jin W: Improving cross-document knowledge discovery using explicit semantic analysis. In Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery, DaWaK’12. Berlin, Heidelberg: Springer-Verlag; 2012:378-389. http://dx.doi.org/10.1007/978-3-642-32584-7_31

  • Yih W, Toutanova K, Platt JC, Meek C: Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL ’11. Stroudsburg, PA, USA: Association for Computational Linguistics; 2011:247-256.

  • Zhang W, Tian L, Sun X, Wang H, Yu Y: A semantic approach to recommending text advertisements for images. In Proceedings of the sixth ACM conference on Recommender systems, RecSys ’12. New York, NY, USA: ACM; 2012:179-186. http://doi.acm.org/10.1145/2365952.2365987


Author information

Corresponding authors

Correspondence to Faisal Rahutomo or Masayoshi Aritsugi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

FR conducted the experiments and statistical analysis, interpreted the data, and drafted the manuscript. MA directed the research into the targeted contribution and critically revised the manuscript for intellectual content. Both authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: MSE of three models. The MSE file can be opened with MS Excel, MS Excel Viewer, or OpenOffice Calc. This file contains the MSE calculations of the three models mentioned in the Safe dimensional reduction section. The file shows yellow highlights on the model chosen based on the MSE; a logarithmic model is chosen with coefficient values derived from the Calc solver. The file also shows blue highlights on the least squared error, which was used to determine the critical point for the reduction of Wikipedia’s dimensions. (XLS 20 KB)


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Rahutomo, F., Aritsugi, M. Econo-ESA in semantic text similarity. SpringerPlus 3, 149 (2014). https://doi.org/10.1186/2193-1801-3-149

