Multiaspect text categorization

This site documents our research on multiaspect text categorization, a new text categorization problem inspired by the document management processes carried out in public administration institutions and beyond. Here you can find a brief description of the problem, a list of publications describing the results of the project, references to other relevant literature, and more.

Brief description

A new text categorization problem is introduced. As in the classical problem, there is a set of documents and a set of categories. However, in addition to being assigned to a specific category, each document belongs to a certain sequence of documents, referred to as a case. It is assumed that all documents in the same case belong to the same category. An example may be a set of news articles. Their categories may be sport, politics, entertainment, etc. In each category there exist cases, i.e., sequences of documents describing, for example, the evolution of some events. The problem considered is how to classify a document to a proper category and a proper case within this category.

Solutions

Here we briefly present the approaches we have developed, as well as other techniques which have been identified as promising for developing further solutions. We also briefly discuss approaches known in the literature to some similar problems. The details are presented in the papers listed to the right.

Publications and references

Here you can find a list of publications reporting the research results obtained in the framework of this project, as well as a list of other relevant publications. We will keep completing both lists as we continue our research on this topic, which will last beyond the end of the project.

Please contact the authors, i.e., Janusz Kacprzyk, Marek Gajewski and Slawomir Zadrozny at the following e-mail addresses: {kacprzyk,gajewskm,zadrozny}@ibspan.waw.pl

Datasets, scripts, ...

We have implemented the approaches briefly presented in the section "Solutions" and presented in detail in our papers listed in the section "Publications and references". They have been tested on some exemplary datasets. Here you can find links to some of the datasets and the R scripts we have developed.

Brief description of the problem

Textual documents have always been an important component of broadly understood business processes. As more and more documents become available in electronic form, there is an ever-growing need for effective methods of their automatic processing. The excellent tools and techniques developed by the information retrieval community have to address growing challenges and often need some extensions and new ideas. One of the very important tasks to deal with is the automatic classification of documents into a set of predefined categories, usually referred to as text categorization. The categories of interest are often of a topical character, i.e., two documents belong to the same category if they are thematically related. For example, news may be classified in such a way for the purposes of an editorial office of a journal, documents served by a website may be grouped based on their main topics, etc.

We introduce a new problem of textual document classification which is an extension of the standard text categorization problem. The problem is inspired by a real-life task of the acquisition, maintenance and handling of documents in commercial companies and institutions, with an emphasis on public institutions in Poland. These institutions are obliged to organize and manage their documents in a precisely specified way. There is a hierarchy (a tree) of topics, the so-called JRWA, the top level of which is defined by some acts of law. Every business process carried out by an institution is assigned to a node of the JRWA tree, and all documents related to a given instance of such a process form a case. For example, there may be a JRWA tree node corresponding to tenders for equipment purchases. A case is then a particular tender, and the related documents may comprise a tender announcement, submitted offers, protocols of the tender commission meetings, etc. The documents within a case are chronologically ordered according to the date a document has been created or received.

The classification problem considered here may thus be briefly described as follows. Let us assume a JRWA tree (or a list of its bottom-level nodes, as a special case) with a number of cases assigned to its nodes. Usually these cases are at different stages of development. For example, one tender may have just been announced and its list of related documents consists of only the announcement, while other tenders may be close to their completion. The same applies to the cases gathered in other nodes of the JRWA tree. The problem we face is the proper classification of a document which has just been received or produced. It has to be classified both to an appropriate JRWA node and to a specific case within this node. The former classification is somewhat easier and close to the standard text categorization problem. The latter is much more complex and requires taking into account the relationships between subsequent documents in the chronological order of particular cases. Both classification problems are intertwined as, first of all, assigning a document to a case implies its assignment to the JRWA node to which this case belongs. On the other hand, such a direct classification to a case may be difficult, and preceding it by first classifying a document to a JRWA node may be advantageous.

Solutions

The first group of solutions we have developed comprises approaches directly attempting to model the sequences of documents forming the cases. To this aim, in our first approach, we employ a sequence mining algorithm. Frequent sequences of keywords are discovered for the cases within each category, and a newly incoming document is classified based on how well it extends such sequences occurring in the currently on-going cases. In the second approach, Hidden Markov Models are employed to model the succession of documents within the cases. A separate HMM is trained for each category and then, assuming that the category of a new incoming document d* is known, the document is paired with each on-going case and its conditional probability is computed with respect to the given HMM and the current list of documents in the given case. The document is classified to the case for which the probability thus computed is the highest. Details are given in papers [8]-[9] and in another forthcoming paper.
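The HMM-based scoring of on-going cases can be sketched in a few lines of R. This is a toy illustration under assumed parameters, not the authors' implementation: documents are assumed to have already been quantized to a small alphabet of symbols (e.g., via clustering of their term vectors), and a two-state HMM with made-up probabilities stands in for a trained category model.

```r
# Forward algorithm: probability of an observation sequence under a discrete HMM
hmm_likelihood <- function(obs, init, trans, emit) {
  alpha <- init * emit[, obs[1]]
  for (t in seq_along(obs)[-1]) {
    alpha <- as.vector(alpha %*% trans) * emit[, obs[t]]
  }
  sum(alpha)
}

# Toy 2-state HMM for one category, with 3 observation symbols (all values assumed)
init  <- c(0.6, 0.4)
trans <- matrix(c(0.7, 0.3,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
emit  <- matrix(c(0.5, 0.3, 0.2,
                  0.1, 0.4, 0.5), nrow = 2, byrow = TRUE)

# Two on-going cases (as symbol sequences) and a new document d* quantized to symbol 2
cases  <- list(c(1, 3), c(2, 2, 1))
d_star <- 2

# Conditional probability of d* given each case: P(case extended by d*) / P(case)
scores <- sapply(cases, function(s)
  hmm_likelihood(c(s, d_star), init, trans, emit) /
  hmm_likelihood(s, init, trans, emit))
best_case <- which.max(scores)  # d* is assigned to the highest-scoring case
```

The same forward computation would be repeated with the HMM of the category the document has been classified to.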

The second group of solutions is based on the use of specific classification techniques, which include both completely new techniques developed by us especially for the MTC problem as well as adapted classical techniques. The adaptation of the latter required, in particular, developing special representations of the components of our model of the MTC problem, i.e., the categories and cases.

One of the approaches proposed in this second group is based on a representation of documents which refers to a small number of keywords, namely those given the highest weights within the document-term matrix created using one of the popular weighting schemes of the vector space model. In our experiments we have mostly been using the normalized version of the tf x IDF weighting scheme and just 5 keywords per document. Thus, effectively, each document may be represented by a different set of keywords. In fact, in our approach a document is treated as a fuzzy set, and the weights of the keywords, belonging by definition to the interval [0, 1], play the role of membership degrees. Categories are represented by the centroids of the documents belonging to a given category. Cases are represented just as sets of fuzzy sets. Then, the classification of a new document d* is based on the subsethood of the fuzzy set representing d* in the fuzzy sets corresponding to the structures representing categories and cases. To this aim, a special index is proposed which is a weighted sum of the subsethoods of d* in both of these fuzzy sets. The document d* is assigned to the case for which this subsethood is the highest (in fact, the complement of the subsethood is considered, making it possible to interpret this index as a distance). Details are given in paper [10].
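The core of this fuzzy-subsethood index can be illustrated in R. The toy data, the weight w and the choice of the max-union as the case representation are illustrative assumptions, not the code from [10]:

```r
# Degree of subsethood of fuzzy set a in fuzzy set b; both are named
# weight vectors over a common keyword vocabulary, with weights in [0, 1]
subsethood <- function(a, b) sum(pmin(a, b)) / sum(a)

# Toy "short" representations: 3 keywords per document (assumed values)
d_star <- c(tender = 0.9, offer = 0.7, deadline = 0.4)

case_docs <- list(c(tender = 0.8, offer = 0.6, deadline = 0.3),
                  c(tender = 0.7, offer = 0.5, deadline = 0.2))
category_centroid <- c(tender = 0.75, offer = 0.55, deadline = 0.25)

# One possible aggregation: a case as the max-union of its documents
case_rep <- Reduce(pmax, case_docs)

# Weighted sum of the subsethoods of d* in the case and in the category
w <- 0.6   # weight of the case component (assumed)
score <- w * subsethood(d_star, case_rep) +
         (1 - w) * subsethood(d_star, category_centroid)
```

Computing this score for every on-going case and picking the maximum yields the classification of d*.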

Another class of techniques belonging to the second group may be characterized by their two-level approach: the document d* is assumed to be first classified to a category and then to a case within the selected category. Various instances of this group of techniques have been proposed. In [11] we employ the k-nn technique at both levels of classification. At the level of cases, a special form of this technique is used. The documents are represented with the help of the vector space model, and a strong reduction of the dimensionality is applied. The categories are represented by all training documents belonging to them, while particular on-going cases are represented by their last documents. This gives rise to an effective and intuitive classification scheme which is simple to use and has proved to be useful in practice.
The next solution developed in the framework of this group consists in adapting the k-nn technique to the sparsity of the data space, characteristic of the MTC problem. We usually deal here with a rather limited number of training documents representing a case. This may reduce the effectiveness of the k-nn technique because the number of positive examples will often be far smaller than the number of negative examples in the neighbourhood of the classified document d*. The starting point for our approach is the work of Yang et al. [A5] where, basically, the average distance to the positive and negative examples appearing in the neighbourhood of d* is taken into account. In another variant, the neighbourhoods of the positive and negative examples are considered individually. Both variants, proposed for the TDT problem, do not take into account the ordering of the documents within classes/cases, which is important for our MTC problem. Thus, in [13] we have proposed an original extension of the approach of Yang et al. which explicitly considers such an ordering. To this aim, we introduce an index of matching of a document against a case, which is initially formulated using the calculus of linguistically quantified propositions [A6], which secures its intuitive and strict interpretation at the same time. We also consider another modification of the basic k-nn technique where the documents belonging to a case, treated as a set of positive examples, are multiplied to reduce the imbalance in the sizes of the two classes. The task of classifying the document d* to a given case is considered a binary classification. In [13] we proposed and tested a few schemes of such a multiplication of the training data.
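The average-distance idea in the spirit of Yang et al. [A5], together with a simple replication scheme for the positive examples, can be sketched as follows. This is purely illustrative random data, not the experiments or the exact index from [13]:

```r
# Score of d* for a case: compare the average distance to its k nearest
# positive examples (documents of the case) with the average distance to
# its k nearest negative examples (documents of other cases)
knn_avg_score <- function(d_star, positives, negatives, k = 3) {
  dist_to <- function(m) sqrt(colSums((t(m) - d_star)^2))  # Euclidean distances row-wise
  dp <- sort(dist_to(positives))
  dn <- sort(dist_to(negatives))
  mean(head(dn, k)) - mean(head(dp, k))  # larger => d* is closer to the case
}

set.seed(1)
positives <- matrix(rnorm(5 * 4, mean = 1), nrow = 5)     # few documents of the case
negatives <- matrix(rnorm(20 * 4, mean = -1), nrow = 20)  # many documents of other cases
d_star <- rep(1, 4)

score <- knn_avg_score(d_star, positives, negatives)

# Replication variant: multiply the few positive examples to reduce the
# class imbalance before running a standard binary classifier
balanced_pos <- positives[rep(seq_len(nrow(positives)),
                              length.out = nrow(negatives)), ]
```

Neither variant above reflects the document ordering within a case; that is exactly what the extension proposed in [13] adds.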

The next solution we proposed is based on the support vector machine (SVM) technique [14]. The approach proposed in [14] is again meant as a two-stage procedure: first the k-nn technique is assumed to be used to classify document d* to a category, and next a category-specific SVM classifier assigns d* to a case. Hence, we focused on the second stage. A key issue is a proper representation of the documents and cases. We tested a few alternative representations, including a naive but natural one where each on-going case is treated as a separate class and an SVM is trained for the resulting multiclass classification problem using one of the possible generalizations of the basic binary SVM. Documents are represented as in the classical vector space model, i.e., as vectors in the space of keywords. Such an approach has at least two weak points: the number of training documents belonging to particular classes (cases) will usually be rather small; moreover, in such an approach the information on the ordering of documents within cases is not taken into account.
In order to overcome these limitations of the naive approach, we proposed the following representation. A data point, from the point of view of an SVM, is now a pair (a prefix of a case, a document), denoted (σ,d*). The mentioned prefix is a sequence composed of a number of leading documents of a case. Such a pair (σ,d*) is a positive training example if there is a case of which σ is a prefix and the document d* actually immediately follows it; otherwise, the pair is a negative training example. In [14] we propose a specific scheme of generating positive and negative training examples based on the set of on-going cases. Then, the assignment of a document to a case becomes a binary classification problem: document d* is combined with each on-going case to form a pair mentioned earlier, and all these pairs are classified by the earlier trained SVM. The result produced by the SVM is interpreted in terms of the probability that a given data point is a positive example, and document d* is assigned to the case σ such that the probability of (σ, d*) being a positive example is the highest.
However, in order to make such an approach possible, a distance between the data points (i.e., the earlier mentioned pairs) has to be defined. To this aim, in [14] we propose to transform all positive and negative training data points (pairs) into vectors of a fixed length where each coordinate corresponds to the value of a feature function computed for a given data point (one coordinate represents the class, positive or negative, to which a given data point belongs). In our computational experiments those feature functions take the form of various measures of similarity between the two components of a pair forming a data point.
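The mapping of a (prefix, document) pair to a fixed-length feature vector can be sketched in R. The particular similarity-based features below are illustrative assumptions in the spirit of [14], not the actual feature functions used there:

```r
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Turn a pair (case prefix, candidate document) into a fixed-length vector
pair_features <- function(prefix, d_star) {
  # prefix: matrix with one row per document of the case prefix
  centroid <- colMeans(prefix)
  last_doc <- prefix[nrow(prefix), ]
  c(sim_last     = cosine(d_star, last_doc),   # similarity to the most recent document
    sim_centroid = cosine(d_star, centroid),   # similarity to the prefix as a whole
    sim_max      = max(apply(prefix, 1, cosine, b = d_star)))  # best match inside the prefix
}

# Toy term-weight vectors (assumed values)
prefix <- rbind(c(0.9, 0.1, 0.0),
                c(0.7, 0.2, 0.1))
d_star <- c(0.8, 0.2, 0.0)
x <- pair_features(prefix, d_star)
# x, together with a positive/negative label, can now be fed to any
# binary classifier (e.g., an SVM) trained on such pairs
```

Because every pair maps to a vector of the same length regardless of the prefix length, a standard kernel on these vectors induces the required distance between pairs.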

References

Publications prepared in the framework of the project (to be completed)

[1] Gajewski M., Kacprzyk J., Zadrożny S.: Topic detection and tracking: a focused survey and a new variant. Informatyka Stosowana, 1/2014, Wyższa Szkoła Informatyki Stosowanej i Zarządzania, Warszawa, 133-147. PDF
[2] Kacprzyk J., Owsiński J.W., Viattchenin D. A.: A new heuristic possibilistic clustering algorithm for feature selection. Journal of Automation, Mobile Robotics & Intelligent Systems, vol. 8, no. 2, 2014, 40-46. PDF
[3] Kacprzyk J., Zadrożny S.: Power of linguistic data summaries and their protoforms. In: C. Kahraman (ed.) Computational Intelligence Systems in Industrial Engineering, Atlantis Press, 2012, vol. 6, 71–90. DOI: http://dx.doi.org/10.2991/978-94-91216-77-0_4
[4] Olszewski D., Kacprzyk J., Zadrożny S.: Time series visualization using asymmetric self-organizing map. In: M. Tomassini, A. Antonioni, F. Daolio, and P. Buesser (eds.): Adaptive and Natural Computing Algorithms, Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, vol. 7824, 40–49. DOI:http://dx.doi.org/10.1007/978-3-642-37213-1_5
[5] Olszewski D., Kacprzyk J., Zadrożny S.: Asymmetric k-means clustering of the asymmetric self-organizing map. In: L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh, and J. Zurada (eds.): Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Springer International Publishing, 2014, vol. 8468, 772–783. DOI: http://dx.doi.org/10.1007/978-3-319-07176-3_67
[6] Rybnik M., Homenda W.: A harmonization model with partial fuzzy knowledge. In: Proceedings of the IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), 2013, 1366–1371. DOI: http://dx.doi.org/10.1109/IFSA-NAFIPS.2013.6608600
[7] Szymczak M., Zadrożny S., Bronselaer A., Tré G. D.: Coreference detection in an XML schema. Information Sciences, vol. 296, 2015, 237– 262. DOI: http://dx.doi.org/10.1016/j.ins.2014.11.002
[8] Zadrożny S., Kacprzyk J., Gajewski M., Wysocki M.: A novel text classification problem and two approaches to its solution. In: Proceedings of the International Congress on Control and Information Processing 2013 (ICCIP’13). Cracow University of Technology, 2013.
[9] Zadrożny S., Kacprzyk J., Gajewski M., Wysocki M.: A novel text classification problem and its solution, Technical Transactions. Automatic Control, vol. 4-AC, 2013, 7–16. PDF
[10] Zadrożny S., Kacprzyk J., Gajewski M.: A novel approach to sequence-of-documents focused text categorization using the concept of a degree of fuzzy set subsethood. In: Proc. of the Annual Conference of the North American Fuzzy Information Processing Society and 5th World Conference on Soft Computing, Redmond, WA, USA, August 17-19, 2015. DOI: http://dx.doi.org/10.1109/NAFIPS-WConSC.2015.7284173
[11] Zadrożny S., Kacprzyk J., Gajewski M.: A New Two-Stage Approach to the Multiaspect Text Categorization. In: Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, December 8-10, 2015, 1484-1490. DOI: http://dx.doi.org/10.1109/SSCI.2015.210
[12] Zadrożny S., Kacprzyk J., Gajewski M.: On the detection of new cases in multiaspect text categorization: a comparison of approaches. In: Proceedings of the Congress on Information Technology, Computational and Experimental Physics (CITCEP 2015), Kraków, Poland, 18-20 December 2015, 213-218.
[13] Zadrożny S., Kacprzyk J., Gajewski M.: Multiaspect Text Categorization Problem Solving: A Nearest Neighbours Classifier Based Approaches and Beyond. Journal of Automation, Mobile Robotics & Intelligent Systems, vol. 9, no. 4, 2015, 58-70. PDF
[14] Zadrożny S., Kacprzyk J., Gajewski M.: A new approach to the multiaspect text categorization by using the support vector machines. In: G. De Tré et al. (eds.): Challenging problems and solutions in intelligent systems. Springer, 2016, 261-277. DOI: http://dx.doi.org/10.1007/978-3-319-30165-5_13
[15] Zadrożny S., Kacprzyk J., Gajewski M.: A Solution of the Multiaspect Text Categorization Problem by a Hybrid HMM and LDA Based Technique. In: J.P. Carvalho, M.-J. Lesot, U. Kaymak, S. Vieira, B. Bouchon-Meunier, R.R. Yager (Eds.): Information Processing and Management of Uncertainty in Knowledge-based Systems. 16th International Conference, IPMU 2016 Eindhoven, The Netherlands, June 20–24, 2016 Proceedings, Part I, 214-225, Springer International Publishing Switzerland, 2016. DOI: http://dx.doi.org/10.1007/978-3-319-40596-4_19
[16] Zadrożny S., Kacprzyk J., Gajewski M.: A Hierarchy-Aware Approach to the Multiaspect Text Categorization Problem. In: Proceedings of the 6th World Conference on Soft Computing, Berkeley, May 22-25 2016, 303-308.
[17] Zadrożny S., Kacprzyk J., Gajewski M.: On the Detection of New Cases in Multiaspect Text Categorization: A Comparison of Approaches. In: P. Kulczycki, L.T. Kóczy, R. Mesiar, J. Kacprzyk (eds.): Information Technology and Computational Physics, Springer, accepted for publication.

Other relevant literature (to be completed)

[A1] Allan J. (Ed.): Topic Detection and Tracking: Event-based Information. Kluwer Academic Publishers, 2002.
[A2] Bu F., Li H., Zhu X.: String re-writing kernel. In: The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers, 449–458.
[A3] Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: C.E. Brodley, A.P. Danyluk (eds.) Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001),Williams College, USA, June 28 - July 1, 2001, Morgan Kaufmann, 2001, 282–289.
[A4] Rabiner L., Rosenberg A., Levinson S.: Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978, 26(6), 575-582.
[A5] Yang Y., Ault T., Pierce T., Lattimer C. W.: Improving text categorization methods for event tracking. In: SIGIR, 2000, 65-72.
[A6] Zadeh L.A.: A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics with Applications, 9, 1983, 149-184.
[A7] Zaki M. J.: SPADE: an efficient algorithm for mining frequent sequences. Machine Learning, vol. 42, no. 1/2, 2001, 31–60.

Datasets and scripts

Datasets

Our primary intended test datasets are collections of documents managed by companies and recording their business processes. At the beginning, however, we performed our computational experiments with datasets simulating such collections. Namely, we use a corpus of scientific journal articles described here:

Bird, S., et al.: The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In: Proc. of the Language Resources and Evaluation Conference (LREC 08), pp. 1755-1759. Marrakech, Morocco.
which may be downloaded from here:
Link to a ZIP file containing the ACL anthology reference corpus
In our experiments we use the "xml\05" subdirectory of the collection. Each document in the collection is divided into sections (Section XML elements), which we treat as the actual documents, while whole articles are treated as cases. Thus,
  • first, the articles are grouped into categories using a clustering algorithm,
  • and then the articles in a given category correspond to cases in this category, while their sections form the documents within these cases.
The details of the above data preparation process are described in our papers listed in the section "Publications and references".

In order to follow our experiments, one may use the MTC package (still under development!) which is briefly described below in the subsection entitled Scripts. After installing and loading the package, prepare a test data set by executing the following steps:

  • Download the zipped data file following the above-mentioned link to the ACL Anthology reference corpus
  • Unpack this zip file (or a part of it) to a folder (e.g., named "source")
  • Create a target folder (e.g., named "target")
  • Execute: xml2cases("source","target")
The "target" folder will then comprise as many subfolders as there are files in the "source" folder which are "non-empty", i.e., contain some <Section> XML elements which are long enough (longer than 100 characters). In each subfolder there are text files corresponding to the subsequent sections of a given XML document.
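For a single file, the core of this conversion can be illustrated with a minimal base-R analogue. This is an illustration only (the package function xml2cases handles whole folders, file writing and further details): extract the <Section> elements longer than 100 characters, each of which would become one text file of the case.

```r
# Extract the text of sufficiently long <Section> elements from an XML string
extract_sections <- function(xml_text, min_chars = 100) {
  m <- gregexpr("<Section>(.*?)</Section>", xml_text, perl = TRUE)  # non-greedy match
  sections <- regmatches(xml_text, m)[[1]]
  sections <- gsub("</?Section>", "", sections)   # strip the tags
  sections[nchar(sections) > min_chars]           # keep only the long-enough ones
}

# Toy article with one long and one too-short section
xml <- paste0("<Paper><Section>", strrep("a", 150), "</Section>",
              "<Section>too short</Section></Paper>")
secs <- extract_sections(xml)
# each element of `secs` would be written to its own file in the case subfolder
```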

The next step is clustering the documents so as to assign them to categories. In practical applications the categories of particular documents are assumed to be known. However, for the purposes of the tests carried out using the ACL Anthology corpus, they have to be assigned arbitrarily. This step should be executed using the createcat function of the MTC package (see below). This results in assigning the whole articles of the ACL Anthology to some clusters/categories. Based on that, all documents which have emerged from the same article (as its <Section> elements) are assigned to the same category.

We use various representations of documents, cases and categories in our approaches to solving the MTC problem, as described in our papers listed in the section "Publications and references". One of the convenient representations boils down to using a standard vector space model weighting scheme, such as "tf x IDF" or "tf", but limiting the representation of a document to a number of top keyword weights. Such a representation for a subcollection of the ACL Anthology corpus may be obtained using the createshort function of the MTC package (see below). The main result is the mentioned "short" representation of the collection, but some intermediary results such as document-term matrices are also produced.
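The idea behind this "short" representation can be sketched in base R: build a normalized tf x IDF document-term matrix and then keep only the k largest weights per document. The toy corpus and k = 2 are assumptions for illustration; this is not the createshort implementation.

```r
# Toy corpus: each document is a bag of keywords (assumed data)
docs <- list(c("tender", "offer", "tender", "deadline"),
             c("offer", "meeting", "protocol"),
             c("tender", "announcement", "offer"))
vocab <- sort(unique(unlist(docs)))

# Term frequencies, inverse document frequencies, tf x IDF with cosine normalization
tf  <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))
idf <- log(length(docs) / colSums(tf > 0))
w   <- tf * rep(idf, each = nrow(tf))
w   <- w / sqrt(rowSums(w^2))

# "Short" representation: zero out everything except the k largest weights per row
top_k <- function(row, k = 2) {
  keep <- order(row, decreasing = TRUE)[seq_len(min(k, sum(row > 0)))]
  row[-keep] <- 0
  row
}
short <- t(apply(w, 1, top_k))
```

Note that a term occurring in every document (here "offer") gets a zero IDF weight and thus never survives into the short representation.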

Having produced such a "short" representation of a subset of the ACL Anthology, one may use the function classify_doc to run a series of experiments consisting in assigning test documents to cases. This function implements the algorithm presented in [10]. A test data subset is automatically randomly selected for each experiment.

Scripts (package MTC)

Installing the MTC package requires executing under R the following instruction:

install.packages("MTC", repos = c("http://www.ibspan.waw.pl/~zadrozny/MTC/pkg", "http://cran.us.r-project.org"), type="win.binary")
This is a standard R instruction for installing packages; additional parameters may be set according to one's needs.

The package MTC is under development and currently provides the following functions:

  • xml2cases(sourcefoldername, targetfoldername) - makes it possible to convert a collection of XML documents from the ACL anthology reference corpus into a set of cases, as discussed above.
  • createcat(targetfoldername, clusterfoldername) - clustering of the articles belonging to the ACL anthology reference corpus; to be executed on the set of cases produced by xml2cases.
  • createshort(clusterfoldername, targetfoldername, shortfoldername) - produces a short representation of a given collection of documents (a selected subset of the ACL Anthology); to be executed on the set of cases and categories produced by xml2cases and createcat, respectively.
  • classify_doc(shortrepfoldername, resdirfoldername, subsetfuns1, subsetfuns2, range_of_weights1, range_of_weights2, number_of_experiments, number_of_ongoing) - runs number_of_experiments experiments using the fuzzy subsethood based MTC algorithm described in [10]. A data collection, prepared using the createshort function, is read from the shortrepfoldername folder. For a given experiment, number_of_ongoing per cent of the cases present in the read-in collection are randomly selected as the test cases. Each of the number_of_experiments experiments is executed several times, namely for each possible combination of the fuzzy subsethood functions and the weights of the two components of the objective function (the first corresponding to the similarity of a document with respect to a case, and the second corresponding to the similarity of a document with respect to a category). The sets of fuzzy subsethood functions and weights are given by the parameters subsetfuns1, subsetfuns2, range_of_weights1 and range_of_weights2, respectively. The results are written to an .RData file in the folder named by the parameter resdirfoldername. These results comprise the number of properly classified documents for each experiment and each combination of the mentioned parameters. Moreover, the number of documents for which the second or third best classification is correct is given, as well as the number of documents assigned to a correct category (but not necessarily to a correct case within this category).
