Event Registry architecture

Event Registry [1, 2] is a system that analyses news articles and identifies the world events reported in them. The system identifies groups of articles that report on the same event, including articles written in different languages, and merges them into a single event. From the articles in each event, it extracts the event’s core information, such as location, date, persons involved and what it is about. The extracted information is stored in a database. Users can search for events using extensive search options, visualize and aggregate the results, examine individual events, and identify related events.

System architecture

The system for detecting world events consists of a set of components that are illustrated in the figure below. The pipeline contains four main parts:

(a) data collection, which is responsible for collecting news articles,

(b) pre-processing steps, where articles are annotated and information is extracted,

(c) event construction, where articles describing the same event are grouped and event information is extracted, and

(d) event storage, where the events are stored and a means for accessing them is provided to users. Each part is described briefly below.

System architecture

Data collection

The News Feed service is used to gather data. It collects news articles from around 75,000 news sources worldwide. Between 100,000 and 200,000 articles are collected this way each day. The articles are in various languages; the most represented are English (50% of all articles), German (10%), Spanish (8%) and Chinese (5%). These are also the only languages that we syntactically and semantically process in the following steps of the pipeline.

Pre-processing steps

The articles in the mentioned languages are subsequently processed with a set of linguistic tools. An essential component of Event Registry is the “named entity recognizer” which detects the entities mentioned in the articles and disambiguates them. Since events are associated with a date, the system also attempts to identify date mentions in the text using a set of regular expressions for different languages.
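The date detection step can be illustrated with a short sketch. The two patterns below (one English and one German date format) and the function name `find_dates` are assumptions made for illustration; the actual expressions used by the system are not given in this post.

```python
import re
from datetime import date

# Hypothetical patterns for illustration only.
MONTHS_EN = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

PATTERNS = {
    # English, e.g. "March 5, 2014"
    "eng": re.compile(
        r"\b(" + "|".join(MONTHS_EN) + r")\s+(\d{1,2}),\s*(\d{4})\b",
        re.IGNORECASE,
    ),
    # German, e.g. "05.03.2014" (day.month.year)
    "deu": re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b"),
}


def find_dates(text, lang):
    """Return all valid calendar dates mentioned in `text` for language `lang`."""
    found = []
    for m in PATTERNS[lang].finditer(text):
        try:
            if lang == "eng":
                month = MONTHS_EN[m.group(1).lower()]
                found.append(date(int(m.group(3)), month, int(m.group(2))))
            else:
                found.append(date(int(m.group(3)), int(m.group(2)), int(m.group(1))))
        except ValueError:
            continue  # skip impossible dates such as 31.02.2014
    return found
```

Keeping one pattern set per language mirrors the description above; a production system would need many more formats, including relative expressions such as "last Friday".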

An important function of Event Registry is the identification of groups of articles describing the same event regardless of language. To support this, we combine different learning features, one of which is the cross-lingual similarity of articles. The cross-lingual similarity service [3] can compute an approximate similarity between articles written in English, German, Spanish and Chinese. The computation is based on an aligned set of basis vectors obtained using latent semantic indexing and a generalized version of canonical correlation analysis. As output, the service can provide, for each article, a set of the most similar recent articles in other languages together with an approximate similarity score.

Event construction

In the event construction phase, the system identifies groups of articles that report on the same event and extracts event information from them. To identify these groups, we implemented an online clustering algorithm based on [4, 5]. We use a separate clustering system for each of the four languages. Each article is represented using the vector space model based on the article title, body and detected named entities (entities are assigned much higher weights than ordinary words). Based on the computed vector, each new article is placed into the closest cluster. After every n added articles the system re-evaluates the clusters and checks whether any of them need to be merged or split. The Bayesian information criterion is used to decide whether a cluster should be split into two. Conversely, cosine similarity and Lughofer’s ellipsoid criterion [6] are used to decide whether two clusters are similar enough to be merged. Since articles on the same event typically appear only over a few days, we delete clusters (only from the clustering, not from Event Registry) that contain articles more than k days old.
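The assignment step of the online clustering can be sketched as follows. This is a minimal illustration, not the algorithm from [4, 5]: the entity weight, the similarity threshold for opening a new cluster, and the use of summed member vectors as centroids are all assumptions, and the periodic split and merge checks are omitted.

```python
import math
from collections import Counter

ENTITY_WEIGHT = 5.0   # assumed; the text only says entities weigh "much higher"
MIN_SIMILARITY = 0.3  # assumed threshold below which a new cluster is opened


def vectorize(words, entities):
    """Bag-of-words vector in which entity identifiers get a higher weight."""
    vec = Counter()
    for w in words:
        vec[w] += 1.0
    for e in entities:
        vec[e] += ENTITY_WEIGHT
    return vec


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign(clusters, vec):
    """Add `vec` to the closest cluster centroid and return that cluster's index.

    `clusters` is a list of Counter centroids (sums of member vectors).
    """
    best, best_sim = None, 0.0
    for i, centroid in enumerate(clusters):
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is None or best_sim < MIN_SIMILARITY:
        clusters.append(Counter(vec))
        return len(clusters) - 1
    clusters[best].update(vec)  # Counter.update adds the weights
    return best
```

Because cosine similarity ignores vector length, keeping the centroid as a running sum avoids recomputing an average each time an article is added.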

As a result of the clustering, the system offers groups of articles that describe the same event in a single language. In order to group clusters about the same event in different languages, we use an SVM model. The learning data used for building the model consists of pairs of clusters for which experts manually decided whether they describe the same event. For each pair of clusters, we extract various features that are relevant in deciding whether two clusters discuss the same event. One of the main features is the cross-lingual cluster similarity, which is computed from cross-lingual article similarities. For tested clusters C1 and C2, the system checks, for each article ai in C1, how many of its most similar articles are in cluster C2, and vice versa. Another important feature is the similarity of the most frequently annotated entities in both clusters. Since named entities are represented with language-independent identifiers, the system can directly compare clusters based on entities regardless of cluster language. Other learning features include the time difference between the clusters, time variability inside the clusters, cluster qualities, etc. Given the positive and negative learning examples, the SVM model can predict for a new pair of clusters whether they should be merged.
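The cross-lingual cluster similarity feature can be sketched as below. The text only says that the system counts how many of each article's most similar articles fall in the other cluster; turning those counts into fractions and averaging the two directions are assumptions, as are the function names and the `similar` lookup table.

```python
def directed_overlap(source, target, similar):
    """Average share of each source article's cross-lingual neighbours found in `target`.

    `similar` maps an article id to the ids of its most similar
    articles in other languages (as returned by a similarity service).
    """
    target = set(target)
    fractions = []
    for article in source:
        neighbours = similar.get(article, [])
        if neighbours:
            hits = sum(1 for n in neighbours if n in target)
            fractions.append(hits / len(neighbours))
    return sum(fractions) / len(fractions) if fractions else 0.0


def cross_lingual_cluster_similarity(c1, c2, similar):
    """Symmetric feature: mean of the C1->C2 and C2->C1 overlaps."""
    return (directed_overlap(c1, c2, similar) + directed_overlap(c2, c1, similar)) / 2
```

The resulting value would be one entry in the feature vector for a cluster pair, alongside entity overlap and time-based features.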

Once one or more clusters are identified that appear to belong to the same event, we create an event in the Event Registry and assign it a unique ID. To extract event information we analyse the articles in the event’s clusters. The event title and a short text snippet are determined by finding the article closest to the center of the cluster (the medoid article) and using its title and first paragraph. For the event date, we analyse the detected date references in the articles. If the most frequently detected date is frequent enough, the system uses it as the event date. If no date passes the threshold, the system uses the average date of the articles in the cluster as the event date. The average date is used instead of the earliest article date to compensate for clustering errors: we don’t want an older, incorrectly assigned article to be responsible for incorrectly assigning the event date. To set the event location we find frequently detected named entities that are known to be locations (based on GeoNames). Especially high weights are put on the locations that appear at the start of the articles. The location with the highest weight is chosen as the event location. As a way of summarizing what the event is about, the system analyses all detected named entities in the articles and computes their weight based on how commonly they appear in the articles. Events also cover different topics (sports, politics, arts, etc.). To categorize the events we used the DMOZ taxonomy, which contains a categorization of 5 million web pages. We built a DMOZ classifier that can classify each event into a DMOZ category based on the content of the articles in the event.
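The event date rule described above can be sketched as follows. The threshold value and the way "frequent enough" is measured (share of all detected date mentions) are assumptions; the text does not specify them.

```python
from collections import Counter
from datetime import date

MIN_SHARE = 0.3  # assumed threshold for "frequent enough"


def choose_event_date(mentioned_dates, article_dates):
    """Pick the event date from date mentions, falling back to the mean article date.

    `mentioned_dates`: dates detected in the article texts (may be empty).
    `article_dates`: publication dates of the articles (must be non-empty).
    """
    if mentioned_dates:
        top, count = Counter(mentioned_dates).most_common(1)[0]
        if count / len(mentioned_dates) >= MIN_SHARE:
            return top
    # Fallback: average publication date, which is less sensitive than the
    # earliest date to a single old article that was clustered incorrectly.
    ordinals = [d.toordinal() for d in article_dates]
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))
```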

Events with all the extracted information are then stored in the Event Registry database and are searchable using the search API.

Search options for events

The search interface allows the users to search for events based on different criteria. The main input box is an autocomplete field where the user can specify one or more concepts (entities or keywords) of interest. Only events that are associated with all specified concepts will be shown as search results. The user can also specify a condition: the event location, time of interest, event category and the minimum event coverage.
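The search semantics described above (all specified concepts must match, with optional extra conditions) can be sketched as a simple filter. The `Event` record, its field names, the category strings, and the reading of "coverage" as an article count are hypothetical and do not reflect the actual search API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    # Hypothetical in-memory record, for illustration only.
    event_id: str
    concepts: frozenset
    location: str
    event_date: date
    category: str
    article_count: int


def search_events(events, concepts, location=None, date_from=None,
                  date_to=None, category=None, min_articles=0):
    """Return events associated with ALL given concepts and matching every condition set."""
    wanted = set(concepts)
    results = []
    for event in events:
        if not wanted <= event.concepts:
            continue
        if location is not None and event.location != location:
            continue
        if date_from is not None and event.event_date < date_from:
            continue
        if date_to is not None and event.event_date > date_to:
            continue
        # Prefix match so that a parent category also selects its subcategories.
        if category is not None and not event.category.startswith(category):
            continue
        if event.article_count < min_articles:
            continue
        results.append(event)
    return results
```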

Displaying search results

After performing the search, it is not uncommon to find a thousand or more events that match the criteria. To allow the user to better understand the results and possibly refine the query, the system provides different ways of presenting the results. The most common way is to check the list of events, where the main extracted information is displayed for each event. The list of events can be sorted either by date or by relevance to the query.

  • List of events

To enable users to understand the big picture of the search results the system can generate a number of visualizations:

  • The concept visualization displays a bar chart containing the most relevant entities and keywords discussed in the events. Each concept is also associated with a relevance score that describes the average relevance of the concept (on a 0 – 100 scale) in the resulting events.
  • The location visualization displays on a map the locations where the events occur.
  • The timeline visualization shows the distribution of the events over time.
  • The trending concepts graph uses the ThemeRiver visualization to show how the popularity of top concepts changes in the events over time. The visualization is especially useful when viewing events over a longer time period, when themes can change significantly.
  • To understand the co-occurrence of concepts in the events, the entity graph can be used. It displays a network of top entities in the results, where edges are drawn between the entities that frequently co-occur in the same events.
  • Finally, the visualization of categories shows the categorization of events using the DMOZ taxonomy.

Displaying event information

Clicking an individual event in the event list opens the event in a separate window. The top part of the window shows the title, location, date and a short summary of the event. Below is a list of articles describing the event. The articles are grouped by language and a particular language can be selected by clicking the appropriate button. As seen in the example provided, the system automatically identified and merged articles reporting about the same event in three different languages. The content of the article can be accessed by clicking on the title of a particular article.

  • Event articles

In order to quickly understand what the event is about, the concept visualization displays top entities and keywords for the event. To see the trending properties of the event, the article timeline visualization shows when the articles about the event were published. The height of the curve indicates the cumulative number of articles about the event in the last 6 hours. The size of the point indicates the number of articles that were published in the same time frame.
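The curve height on the article timeline can be computed with a sliding window over the sorted publication times. The sketch below assumes the window is the half-open interval ending at the queried time; the function name and that boundary choice are illustrative assumptions.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)


def articles_in_window(sorted_times, at):
    """Count articles published in the interval (at - 6h, at].

    `sorted_times` must be a list of datetimes in ascending order.
    """
    lo = bisect_right(sorted_times, at - WINDOW)
    hi = bisect_right(sorted_times, at)
    return hi - lo
```

Using binary search keeps each lookup logarithmic in the number of articles, which matters when the curve is sampled at many points.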

An important feature when viewing an event is also the ability to display related events. Related events are found by computing the TF-IDF weights of the event concepts and finding other events with similar concept weights (using the cosine similarity measure). The similar events can be shown in two ways – as a bar chart of events, ordered by decreasing similarity, or on a timeline where the order of events is defined by event time. The related events can help a user move beyond a single event and understand which events led up to it and what its consequences were.
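The related-events computation can be sketched as follows. The specific TF-IDF formula (raw concept weight times log of inverse document frequency) and the data layout are assumptions; the post only names TF-IDF and cosine similarity.

```python
import math
from collections import Counter


def tfidf_vectors(event_concepts):
    """event_concepts: {event_id: {concept: raw weight}} -> TF-IDF weighted vectors."""
    n = len(event_concepts)
    # Document frequency: in how many events each concept appears.
    df = Counter(c for concepts in event_concepts.values() for c in concepts)
    return {
        eid: {c: w * math.log(n / df[c]) for c, w in concepts.items()}
        for eid, concepts in event_concepts.items()
    }


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def related_events(event_id, event_concepts, top=5):
    """Return up to `top` (similarity, event_id) pairs, most similar first."""
    vectors = tfidf_vectors(event_concepts)
    query = vectors[event_id]
    scored = [(cosine(query, v), eid) for eid, v in vectors.items() if eid != event_id]
    return sorted(scored, reverse=True)[:top]
```

The sorted list corresponds to the bar chart view; re-sorting the same results by event time would give the timeline view.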

References

[1] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Event Registry – Learning About World Events From News. World Wide Web conference, Seoul, Korea, 2014.

[2] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Cross-lingual detection of world events from news articles. The 13th International Semantic Web Conference, Trentino, Italy, 2014.

[3] J. Rupnik, A. Muhic, and P. Skraba. Cross-lingual document retrieval through hub languages. In NIPS, 2012.

[4] C. C. Aggarwal and P. S. Yu. On clustering massive text and categorical data streams. Knowledge and Information Systems, 24(2):171–196, 2010.

[5] C. C. Aggarwal and P. Yu. A framework for clustering massive text and categorical data streams. In Proceedings of the sixth SIAM international conference on data mining, volume 124, pages 479–483, 2006.

[6] E. Lughofer. A dynamic split-and-merge approach for evolving cluster models. Evolving Systems, 3(3):135–151, 2012.

I’m Gregor Leban

CEO and co-founder of Event Registry

5 Comments

milton 15 September, 2016

My apologies for the typo.
Thank you Mr. Leban.

Firas 10 February, 2017

Very informative, thank you guys.

I would love to know more about your development stack. What tools do you use, DBs, paas, etc.

    Gregor Leban 10 February, 2017

    We use DB engine developed in house. All backend is implemented in C++ for speed, middleware is in node, some data is in mongo. Database is replicated across several instances for speed and redundancy.

Firas 11 February, 2017

Very nice. I have another question, I think you do lots of NLP and machine learning work, what do you use for that?

    Gregor Leban 17 February, 2017

    Hi Firas, yes, NLP and ML are the core parts of our services. All technology we use is built in-house and was built by 20+ people over the last 15 years.
