GERDAQ Dataset

This is a benchmark dataset of annotated search-engine queries. Each entity mention in a query is tagged with the entity it refers to. Wikipedia is used as the knowledge base.

For example, the query armstrong moon landing is tagged with two annotations:

While the query armstrong doping is tagged with:

The dataset was constructed through the CrowdFlower crowdsourcing platform. Queries were drawn randomly from the KDD Cup 2005 dataset.
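To make the annotation scheme concrete, here is a minimal Python sketch of one way an annotated query could be held in memory. The class names, field names, and character-offset convention are assumptions made for illustration and do not reflect the actual XML schema of the dataset. The queries and Wikipedia titles in it are also made up and are not drawn from GERDAQ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    start: int        # offset of the first character of the mention (assumed convention)
    end: int          # exclusive end offset of the mention
    wiki_title: str   # title of the Wikipedia page the mention refers to


@dataclass(frozen=True)
class AnnotatedQuery:
    text: str
    annotations: List[Annotation]

    def mentions(self) -> List[Tuple[str, str]]:
        """Return (mention text, Wikipedia title) pairs in query order."""
        ordered = sorted(self.annotations, key=lambda a: a.start)
        return [(self.text[a.start:a.end], a.wiki_title) for a in ordered]


# Hypothetical queries: the same mention resolves to different entities
# depending on the rest of the query.
car_query = AnnotatedQuery(
    text="jaguar top speed",
    annotations=[Annotation(start=0, end=6, wiki_title="Jaguar Cars")],
)
animal_query = AnnotatedQuery(
    text="jaguar habitat",
    annotations=[Annotation(start=0, end=6, wiki_title="Jaguar")],
)

print(car_query.mentions())
print(animal_query.mentions())
```

Storing offsets rather than the mention string keeps each annotation unambiguous when the same word occurs more than once in a query.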

Data and Resources
  • GERDAQ dataset (XML)

Additional Info
Field: Value
Accessibility: Virtual Access
AccessibilityMode: Download
Area: Natural Language Understanding
Attribution requirements:
Availability: On-Line
Basic rights: Modification
ChildrenData: No
Consent obtained also covers the envisaged transfer of the personal data outside the EU: No
Consent of the data subject: No
CreationDate: 2014-05-19
Creator: Cornolti, Marco
DataProtectionDirective: Data needs no protection.
DiskSize: 0.244
Display requirements:
Distribution requirements:
External Identifier:
Field/Scope of use: Any use
Format: application/xml
Language: eng, English
License term:
ManifestationType: Original
Personal data was manifestly made public by the data subject: No
PersonalData: No
ProcessingDegree: Primary
Requirement of non-disclosure (confidentiality mark):
Restrictions on use:
Semantic Coverage: entities
Size: 244KB
Sublicense rights: No
Territory of use: World Wide
ThematicCluster: Text and Social Media Mining
system:type: Dataset
Management Info
Field: Value
Author: Cornolti Marco
Maintainer: Cornolti Marco
Version: 1
Last Updated: 29 April 2021, 11:19 (CEST)
Created: 29 April 2021, 11:19 (CEST)