Quality Statement
Languages spoken provides information on which languages, and how many, a person can speak or use. This includes New Zealand Sign Language.
This concept also includes number of languages spoken and official language indicator.
Number of languages spoken
The number of languages an individual speaks.
Official language indicator
The combinations of official languages and English spoken for individuals.
High quality
Data quality processes section below has more detail on the rating.
Priority level 2
A priority level is assigned to all census concepts: priority 1, 2, or 3 (with 1 being highest and 3 being the lowest priority).
Languages spoken is a priority 2 concept. Priority 2 concepts cover key subject populations that are important for policy development, evaluation, or monitoring. These concepts are given second priority in terms of quality, time, and resources across all phases of a census.
The census priority level for languages spoken has increased from priority level 3 in 2018. The change acknowledges key statutory duties, the use of the concept in policy and monitoring, and the importance of te reo Māori data.
The 2023 Census: Final content report has more information on priority ratings for census concepts.
Census usually resident population count
‘Subject population’ means the people, families, households, or dwellings that the variable applies to.
Languages spoken
Languages spoken is classified into the following categories:
Language - Standard Classification 2 V3.0.0 – Level 1 of 4
Code | Category | Code | Category |
---|---|---|---|
0001 | Germanic | 0014 | Austroasiatic |
0002 | Romance | 0015 | Tai-Kadai |
0003 | Greek | 0016 | Central-Eastern Malayo-Polynesian |
0004 | Balto-Slavic | 0017 | Western Malayo-Polynesian |
0005 | Albanian | 0018 | Afro-Asiatic |
0006 | Armenian | 0019 | Niger-Congo (Congo-Kordofanian) |
0007 | Indo-Aryan | 0020 | Pidgins and Creoles |
0008 | Celtic | 0021 | Language Isolates |
0009 | Iranian | 0022 | Miscellaneous Language Groupings |
0010 | Turko-Altaic | 0023 | Artificial Languages |
0011 | Uralic | 0024 | Sign Language |
0012 | Dravidian | 0066 | None (eg, too young to talk) |
0013 | Sino-Tibeto-Burman | 9999 | Not elsewhere included |
Languages spoken uses a 4-level hierarchical classification with level 1 categories presented in the table above. The level 1 residual category “Not elsewhere included” contains the residual categories ‘Don’t know’, ‘Refused to answer’, ‘Response unidentifiable’, ‘Response outside scope’ and ‘Not stated’. Follow the link to examine the classification and find more detail.
The classification for ‘Languages spoken’ has changed since the 2018 Census. The changes are:
- Moriori added
- Miaow-Yao changed to Miao-Yao.
Respondents could give up to six responses to the languages spoken question. Because languages spoken is a multiple-response variable, each person is counted once in every language category they report, so the total number of responses in a table is greater than the total number of people stated.
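As a minimal sketch of why the totals differ for a multiple-response variable, the example below tallies a small set of made-up responses. The people, their languages, and the variable names are hypothetical and are not census figures.

```python
from collections import Counter

# Hypothetical responses: each person may report up to six languages.
MAX_RESPONSES = 6
responses = {
    "person_a": ["English"],
    "person_b": ["English", "Māori"],
    "person_c": ["English", "Samoan", "New Zealand Sign Language"],
}

# Count each person once in every language category they reported.
language_counts = Counter()
for languages in responses.values():
    if len(languages) > MAX_RESPONSES:
        raise ValueError("more than six responses for one person")
    language_counts.update(set(languages))

total_people = len(responses)
total_responses = sum(language_counts.values())

print(language_counts["English"])     # 3
print(total_people, total_responses)  # 3 6
```

Three people give six responses in total, so summing the category counts overstates the number of people.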
Official language indicator
Official language indicator uses a flat (1-level) classification, as presented in the table below. The census official language indicator classification groups combinations of ‘Māori’, ‘English’, ‘NZ Sign Language’, and ‘Other’. Follow the link to examine the classification and find more detail.
Census official language indicator recode V1.0.0 – Level 1 of 1
Code | Category | Code | Category |
---|---|---|---|
00 | No language | 31 | Māori, English and NZ Sign Language (not Other) |
11 | Māori only | 32 | Māori, English, and Other (not NZ Sign Language) |
12 | English only | 34 | English, NZ Sign Language, and Other (not Māori) |
13 | NZ Sign Language only | 41 | Māori, English, NZ Sign Language, and Other |
21 | Māori and English only (not NZ Sign Language) | 51 | Other languages only (neither English, Māori, nor NZ Sign Language) |
24 | English and NZ Sign Language only (not Māori) | 52 | Other combination of Māori, English, NZ Sign Language, and Other |
25 | English and Other only (not Māori or NZ Sign Language) | 99 | Not elsewhere included |
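The grouping in the table above can be sketched as a lookup from the combination of language groups a person speaks to an indicator code. This is an illustration only: the function name, the language labels, and the input representation are hypothetical, and the sketch assumes that code 52 covers every combination not listed separately and that a missing response maps to 99.

```python
OFFICIAL = {"Māori", "English", "NZ Sign Language"}

# Codes from the level 1 table, keyed by the combination of groups spoken.
CODES = {
    frozenset(): "00",
    frozenset({"Māori"}): "11",
    frozenset({"English"}): "12",
    frozenset({"NZ Sign Language"}): "13",
    frozenset({"Māori", "English"}): "21",
    frozenset({"English", "NZ Sign Language"}): "24",
    frozenset({"English", "Other"}): "25",
    frozenset({"Māori", "English", "NZ Sign Language"}): "31",
    frozenset({"Māori", "English", "Other"}): "32",
    frozenset({"English", "NZ Sign Language", "Other"}): "34",
    frozenset({"Māori", "English", "NZ Sign Language", "Other"}): "41",
    frozenset({"Other"}): "51",
}


def official_language_indicator(languages):
    """Return the indicator code for an iterable of language names.

    `None` stands for a missing or residual response (assumed
    representation) and maps to '99'; an empty list means no language.
    """
    if languages is None:
        return "99"
    # Collapse every language that is not Māori, English, or NZSL to 'Other'.
    groups = frozenset(
        lang if lang in OFFICIAL else "Other" for lang in languages
    )
    # Assumption: code 52 is the catch-all for combinations not listed above.
    return CODES.get(groups, "52")


print(official_language_indicator(["English", "Hindi"]))  # 25
print(official_language_indicator(["Māori", "Samoan"]))   # 52
```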
Number of languages spoken
Number of languages spoken uses a 2-level hierarchical classification with level 1 categories presented in the table below. Follow the link to examine the classification and find more detail.
Census number of languages spoken V2.0.0 – Level 1 of 2
Code | Category |
---|---|
00 | None |
01 | One language |
02 | Two languages |
… | … |
06 | Six languages |
99 | Not elsewhere included |
The 2023 Census classification for number of languages spoken is consistent with that used in the 2018 Census.
The level 1 residual category “Not elsewhere included” contains the residual categories ‘Response unidentifiable’, ‘Response outside scope’ and ‘Not stated’.
Data collected on languages spoken is classified through the languages spoken standard classification, as well as a number of additional classifications.
Languages spoken
- Census languages spoken – detailed V1.0.0
- Language – six categories V1.0.0
- Census languages spoken – 17 languages V1.0.0
Number of languages spoken
Note that the ‘Census number of languages spoken – 2 or more recode V1.0.0’ classification counts people who reported speaking either one language or two or more languages (which may or may not include English or Māori).
Standards and classifications has more information on what classifications are, how they are reviewed, where they are stored, and how to provide feedback on them.
Languages spoken is collected on the individual form (paper form question 15).
There were the following changes to the question format:
- the reminder of the possibility to select multiple responses was rephrased and moved in front of the question
- ‘New Zealand Sign Language’ was moved up from fourth response option to third because of its status as an official language of New Zealand.
There were differences in the way a person could respond between the modes of collection (online and paper forms).
On the online form:
- as-you-type functionality helped respondents provide valid languages in the classification.
On the paper form:
- responses outside the valid range were possible. Alternative data sources were used to replace these responses.
Data from the online forms may therefore be of higher overall quality than data from paper forms. However, processing checks and edits were in place to improve the quality of the paper forms.
Stats NZ Store House has samples for both the individual and dwelling paper forms.
Data-use outside Stats NZ:
- to formulate, target, and monitor policies and programmes to revitalise the Māori language as an official language of New Zealand, such as the Maihi Karauna (the Crown’s Strategy for Māori Language Revitalisation 2019–2023)
- to provide information on use of New Zealand Sign Language, to help maintain and promote its use
- to assess the need to provide multi-lingual pamphlets and other translation services in areas such as education, health, and welfare
- to evaluate and monitor existing language education programmes and services
- to provide information for television and radio programmes and services
- to understand the diversity and diversification of the New Zealand population over time, as well as language maintenance, retention, and distribution.
Alternative data sources were used for missing and residual census responses and responses that could not be classified or did not provide the type of information asked for. The table below shows the distribution of data sources for languages spoken data.
Data sources for languages spoken data, as a percentage of census usually resident population count, 2023 Census
Source of languages spoken data | Percent |
---|---|
2023 Census response | 85.3 |
Historical census | 8.8 |
– 2018 Census | 6.3 |
– 2013 Census | 2.5 |
Admin data | 0.0 |
Deterministic derivation | 0.4 |
Statistical imputation | 5.4 |
– Probabilistic imputation | 2.5 |
– CANCEIS(1) donor’s response sourced from 2023 Census form | 2.5 |
– CANCEIS donor’s response sourced from 2018 Census | 0.2 |
– CANCEIS donor’s response sourced from 2013 Census | 0.1 |
– CANCEIS donor’s response sourced from probabilistic imputation | 0.1 |
– CANCEIS donor’s response sourced from deterministic derivation | <0.1 |
No information | 0.0 |
Total | 100.0 |
1. CANCEIS = imputation based on CANadian Census Edit and Imputation System.
Note: Due to rounding, individual figures may not always sum to the stated total(s) or score contributions.
Where appropriate, responses were used from the 2018 and 2013 Censuses to replace missing or residual responses.
If it was not possible to obtain languages spoken data from historical census data, probabilistic imputation was used. This is where the languages spoken by the person closest in age in the same household were used to fill in missing information on languages spoken for the record.
Statistical imputation was used for remaining records with missing or residual responses.
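The order in which sources are tried, as described above, can be sketched as a simple fallback chain. This is a hedged illustration, not the production method: the record layout and key names are hypothetical, trying 2018 before 2013 is an assumption, and so is requiring the household donor to have a 2023 response. The final statistical imputation step is only flagged, not modelled.

```python
def fill_languages(person, household):
    """Return (languages, source) for one person record.

    `person` and each household member are dicts with hypothetical keys:
    'age', 'census_2023', 'census_2018', 'census_2013'. Each census key
    holds a list of languages, or None when missing or residual.
    """
    # 1. A valid 2023 Census response is used as-is.
    if person.get("census_2023") is not None:
        return person["census_2023"], "2023 Census response"

    # 2. Historical census responses (assumed order: most recent first).
    for key, source in (("census_2018", "2018 Census"),
                        ("census_2013", "2013 Census")):
        if person.get(key) is not None:
            return person[key], source

    # 3. Probabilistic imputation: use the household member closest in age
    #    (assumption: only members with a 2023 response can be donors).
    donors = [m for m in household
              if m is not person and m.get("census_2023") is not None]
    if donors:
        donor = min(donors, key=lambda m: abs(m["age"] - person["age"]))
        return donor["census_2023"], "Probabilistic imputation"

    # 4. Remaining records are left for statistical imputation.
    return None, "Statistical imputation required"


parent = {"age": 40, "census_2023": ["English"]}
sibling = {"age": 12, "census_2023": ["English", "Māori"]}
child = {"age": 10, "census_2023": None}
print(fill_languages(child, [parent, sibling, child]))
# (['English', 'Māori'], 'Probabilistic imputation')
```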
Deterministic derivation was used for records of people aged zero that had responses other than ‘None (eg, too young to talk)’. A consistency edit changed the languages spoken in these records to ‘None (eg, too young to talk)’.
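A consistency edit of this kind can be sketched in a few lines. The record layout, key names, and function name below are hypothetical; only the rule itself (age zero implies ‘None (eg, too young to talk)’) comes from the description above.

```python
NONE_CATEGORY = "None (eg, too young to talk)"


def apply_age_zero_edit(record):
    """Return a copy of the record with the age-zero consistency edit applied.

    `record` is a dict with hypothetical keys 'age', 'languages', 'source'.
    """
    edited = dict(record)
    if edited["age"] == 0 and edited["languages"] != [NONE_CATEGORY]:
        # Replace the contradictory response and record how it was sourced.
        edited["languages"] = [NONE_CATEGORY]
        edited["source"] = "Deterministic derivation"
    return edited


infant = {"age": 0, "languages": ["English"], "source": "2023 Census response"}
print(apply_age_zero_edit(infant)["languages"])
# ['None (eg, too young to talk)']
```

Records for people of any other age, and age-zero records that already hold the ‘None’ category, pass through unchanged.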
Editing, data sources, and imputation in the 2023 Census describes how data quality is improved by editing, and how missing and residual responses are filled with alternative data sources (admin data and historical census responses) or statistical imputation. The paper also describes the use of CANCEIS (the CANadian Census Edit and Imputation System), which is used to perform imputation.
Missing and residual responses represent data gaps where respondents either did not provide answers (missing responses) or provided answers that were not valid (residual responses).
Alternative data sources have been used to fill all missing and residual responses in the 2023 and 2018 Censuses.
Percentage of ‘Not stated’ for the census usually resident population count:
- 2023: 0.0 percent
- 2018: 0.0 percent
- 2013: 6.3 percent
For output purposes, the other residual category responses (for example, ‘Response unidentifiable’ and ‘Response outside scope’) are grouped with ‘Not stated’ and are classified as ‘Not elsewhere included’.
Percentage of ‘Not elsewhere included’ for the census usually resident population count:
- 2023: 0.0 percent
- 2018: 0.0 percent
- 2013: 6.5 percent.
Overall quality rating: High
Data has been evaluated to assess whether it meets quality standards and is suitable for use.
Three quality metrics contributed to the overall quality rating:
- data sources and coverage
- consistency and coherence
- accuracy of responses.
The lowest-rated metric determines the overall quality rating.
Data quality assurance in the 2023 Census provides more information on the quality rating scale.
Data sources and coverage: Very high quality
The quality of all the data sources that contribute to the output for the variable has been assessed. To calculate the data sources and coverage quality score for a variable, each data source was given a rating, and that rating was multiplied by the proportion the source contributes to the total output.
The rating for a valid census response is defined as 1.00. Ratings for other sources are the best available estimates of their quality relative to a census response. The weighted ratings for all contributing sources are then summed to give the total score, which determines the metric rating according to the following ranges:
- 0.98–1.00 = very high
- 0.95–<0.98 = high
- 0.90–<0.95 = moderate
- 0.75–<0.90 = poor
- <0.75 = very poor.
The high proportion of data received from the 2023 Census forms, alongside the high quality of alternative data sources, resulted in a score of 0.98, leading to a quality rating of very high.
Data sources and coverage rating calculation for languages spoken data, census usually resident population count, 2023 Census
Source of languages spoken data | Rating | Percent | Score contribution |
---|---|---|---|
2023 Census response | 1.00 | 85.34 | 0.85 |
2018 Census | 0.92 | 6.28 | 0.06 |
2013 Census | 0.89 | 2.52 | 0.02 |
Deterministic derivation | 1.00 | 0.45 | <0.01 |
Probabilistic imputation | 0.80 | 2.50 | 0.02 |
CANCEIS(1) nearest neighbour imputation | 0.80 | 2.92 | 0.02 |
No information | 0.00 | 0.00 | 0.00 |
Total | | 100.00 | 0.98 |
1. CANCEIS = imputation based on CANadian Census Edit and Imputation System.
Note: Due to rounding, individual figures may not always sum to stated total(s) or score contributions.
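The score in the table above can be reproduced as a weighted sum of ratings. The sketch below uses the ratings and percentages from the table; the function names are hypothetical, and applying the rating thresholds to the unrounded score is an assumption.

```python
# (source, rating, percent of output), taken from the table above.
SOURCES = [
    ("2023 Census response", 1.00, 85.34),
    ("2018 Census", 0.92, 6.28),
    ("2013 Census", 0.89, 2.52),
    ("Deterministic derivation", 1.00, 0.45),
    ("Probabilistic imputation", 0.80, 2.50),
    ("CANCEIS nearest neighbour imputation", 0.80, 2.92),
    ("No information", 0.00, 0.00),
]


def coverage_score(sources):
    """Sum each source's rating weighted by its share of the output."""
    return sum(rating * percent / 100 for _, rating, percent in sources)


def metric_rating(score):
    """Map a score to the rating ranges listed above."""
    if score >= 0.98:
        return "very high"
    if score >= 0.95:
        return "high"
    if score >= 0.90:
        return "moderate"
    if score >= 0.75:
        return "poor"
    return "very poor"


score = coverage_score(SOURCES)
print(f"{score:.2f}", metric_rating(score))  # 0.98 very high
```

The unrounded sum is about 0.9815, which sits just above the 0.98 threshold for a very high rating.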
Consistency and coherence: High quality
Languages spoken data is consistent with expectations across nearly all consistency checks. There is some minor variation from expectations or benchmarks, which can be explained by real-world change, the incorporation of other sources of data, or a change in how the variable has been collected.
The Central-Eastern Malayo-Polynesian language group (which includes te reo Māori) has a relatively high proportion of data sourced from historical census and statistical imputation. This, coupled with the ability for type and number of languages spoken to change over time, contributed to the minor variation in data from expectations.
Accuracy of responses: Very high quality
Languages spoken data has no data quality issues that have an observable effect on the data. The quality of coding is very high. Any issues with the variable appear in a very low number of cases (typically fewer than a hundred).
Improvement in scanning repair for paper forms reduced the number of responses needing to be sourced from alternative sources. The consistency edit applied to inconsistent and contradictory responses further improved data quality.
Languages spoken data can be used in a comparable manner to the 2018 and 2013 Censuses.
The language group with the highest imputation rate (CANCEIS and probabilistic) is the Central-Eastern Malayo-Polynesian language group. This category includes te reo Māori and most of the languages spoken in the Pacific. It is recommended users be aware of the proportion of alternatively sourced data when analysing languages spoken data at low levels of geography or lower levels of the classification.
Comparisons to other data sources
Although surveys and sources other than the census collect language data, data users are advised to familiarise themselves with the strengths and limitations of the sources before use.
Key considerations when comparing languages spoken information from the 2023 Census with other sources include the following:
- Census is a key source of information on languages spoken for small areas and small populations. Many other sources do not provide detail at this level.
- Census aims to be a national count of all individuals in a population, whereas other sources that measure this variable, such as Te Kupenga, are based only on a sample of the population.
Contact our Information centre for further information about using this concept.