Data quality in Real Estate

  • Published on
    18-Mar-2018

Data Quality In Real Estate
Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo
Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

About Geophy
● Goal to map all buildings in the world
● Provide a quality score for each building
  ○ Based on location, building status, history, environmental metrics, etc.
● Semantic platform
  ○ RDF eases the data integration process
● Team of 45 with aim to double by next year

Real Estate is a very complex domain
Really!

Possible constraints on addresses?
● An address will start with, or at least include, a building number.
● When there is a building number, it will be all-numeric.
● No buildings are numbered zero
● Well, at the very least no buildings have negative numbers
● A building number will only be used once per street
● A building will only have one number
● A building name won't also be a number
● [...]
https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/

Geophy [set of] ontologies
● 13 ontologies (+ 9 external)
● 125 Classes
  ○ Buildings
  ○ Addresses
  ○ Companies
  ○ [...]
● 720 properties
  ○ 500 datatype
  ○ 160 relation properties
● Growing...

Quality is expensive
● Quality of source data
  ○ Free, open, closed data sources, etc.
● Data clean up process
  ○ Violations, deduplication, precision, etc.
  ○ How much time and effort can one afford?

How much quality is good enough?
● Fitness for use

Quality of ...
● Source data
  ○ Accuracy of the source
● Translation of source data
  ○ RDF mappings, RML, D2RQ, scripts, etc.
● Model design
  ○ Modelling quality
  ○ Data fitting on schema
● Model definition
  ○ Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc.
  ○ Semantics, i.e. RDFS, OWL DL/RL/FULL, etc.

Evolution & quality
● Data evolves
● so do ontologies
● so do RDF mappings
● so does code
● so do SPARQL queries
● so do constraints
http://aligned-project.eu

Scaling quality ...
● Thousands of triples
● Millions of triples
● Billions of triples
● ?
Try to move validation into the K range (when possible)

Validate closer to the source
● Validate the model
● Validate the RDF mappings
● Validate RDF mapping excerpts
● Validate instance data

Automate, automate & automate
Can you spot the error?
rdfs:label ⇒ rdf:langString
● :foo rdfs:label "foo @en" .
● :foo rdfs:label "foo"@en .

CI/CD is your buddy
● Integrate validation with your CI/CD
  ○ Choose tools & technologies wisely
  ○ Jenkins, Travis, GitLab, TeamCity
● Fail the build until data issues are fixed
● Data integration validation checks
  ○ Standalone datasets can pass CI

Thank you for your attention
Questions?
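
Appendix sketch (not from the slides): the "Can you spot the error?" example is the kind of check that is easy to automate. The Python below is a minimal illustration under stated assumptions: it scans Turtle-like text line by line with regular expressions instead of a real RDF parser, and the names find_misplaced_lang_tags, exit_code_for and SAMPLE are hypothetical. It flags literals whose language tag ended up inside the quotes, and maps the result to an exit code that a CI job could use to fail the build.

```python
import re

# A double-quoted literal, optionally followed by a language tag such as @en.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')
# A lexical form whose tail looks like a language tag, e.g. "foo @en".
TAG_INSIDE = re.compile(r'\s@[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*\s*$')


def find_misplaced_lang_tags(turtle_text):
    """Return (line_number, lexical_form) pairs for suspicious literals."""
    findings = []
    for line_number, line in enumerate(turtle_text.splitlines(), start=1):
        for match in LITERAL.finditer(line):
            lexical_form, lang_tag = match.group(1), match.group(2)
            # Suspicious: no real language tag, but one appears inside the quotes.
            if lang_tag is None and TAG_INSIDE.search(lexical_form):
                findings.append((line_number, lexical_form))
    return findings


def exit_code_for(turtle_text):
    """Return 1 when any suspicious literal is found, otherwise 0."""
    return 1 if find_misplaced_lang_tags(turtle_text) else 0


# The two statements from the slide: the first is wrong, the second is right.
SAMPLE = ':foo rdfs:label "foo @en" .\n:bar rdfs:label "bar"@en .\n'

for line_number, lexical_form in find_misplaced_lang_tags(SAMPLE):
    print(f"line {line_number}: language tag inside literal: {lexical_form!r}")
```

In a CI step, passing the value of exit_code_for to sys.exit would fail the build until the literal is fixed. A production setup would more likely express this constraint with a validation language such as SHACL or ShEx, as the slides suggest.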