Data

OpenCityCorpus

A large‑scale urban data corpus (~200 GB) harmonized from over 200 cities to provide a unified, searchable repository for AI and urban r...

Project Details

OpenCityCorpus was motivated by the problem of data fragmentation in urban research. Open city data is abundant but scattered across disparate platforms such as Socrata, ArcGIS, and CKAN, each with its own schemas and access methods. This fragmentation is a major barrier for researchers and developers who want to run large-scale, cross-city analyses or train data-intensive AI models.

To address this, we developed an ingestion and schema-harmonization pipeline that aggregates heterogeneous structured data and transforms it into a unified, queryable format. The resulting corpus contains over 200,000 datasets, totaling approximately 200 GB, from over 200 U.S. cities. It is designed both for training large-scale AI models and as a knowledge base for Retrieval-Augmented Generation (RAG) systems, enabling more accurate and context-aware urban AI applications.

A critical component of this work was a legal and ethical framework. We analyzed API Terms of Service and relevant case law, such as hiQ Labs v. LinkedIn, to ensure compliance, and we implemented privacy-preserving techniques to mitigate the risk of re-identification in the aggregated data. By providing a unified and ethically grounded resource, OpenCityCorpus lowers the barrier to entry for urban data science and supports the next generation of urban AI research.
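To illustrate the schema-harmonization idea described above, here is a minimal Python sketch that maps metadata records from two platforms onto one shared record type. It is not the project's actual pipeline: the `UnifiedRecord` type, the `FIELD_MAPS` table, the `harmonize` function, and the sample records are all hypothetical, and the source field names are simplified stand-ins rather than exact portal responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedRecord:
    """Hypothetical unified metadata schema (an assumption for this sketch)."""
    source: str
    city: str
    dataset_id: str
    title: str
    description: str


# Hypothetical per-platform mappings: unified field name -> source field name.
# Real portals expose many more fields, and the names may differ.
FIELD_MAPS = {
    "socrata": {"dataset_id": "id", "title": "name", "description": "description"},
    "ckan": {"dataset_id": "id", "title": "title", "description": "notes"},
}


def harmonize(source: str, city: str, raw: dict) -> UnifiedRecord:
    """Translate one raw metadata record into the unified schema.

    Missing or null source fields become empty strings, and surrounding
    whitespace is stripped. An unknown source raises KeyError.
    """
    mapping = FIELD_MAPS[source]
    fields = {
        unified: str(raw.get(src) or "").strip()
        for unified, src in mapping.items()
    }
    return UnifiedRecord(source=source, city=city, **fields)


# Made-up sample records, for illustration only.
records = [
    harmonize("socrata", "city-a",
              {"id": "abcd-1234", "name": " Street Trees ", "description": "Tree inventory"}),
    harmonize("ckan", "city-b",
              {"id": "pkg-42", "title": "Bike Lanes", "notes": None}),
]

for record in records:
    print(record)
```

Keeping the platform differences in a declarative mapping table means that supporting another platform mostly amounts to adding one more entry, while every downstream consumer sees the same `UnifiedRecord` fields.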

Technologies Used:

Python
Data Engineering
Data Harmonization
Urban Data
API