Languages spoken (information about this variable and its quality)


Languages spoken provides information on which languages, and how many, a person can speak or use.

This includes New Zealand Sign Language.



Variable Details

Other Variable Information

Priority level

Priority level 3

We assign a priority level to all census variables: Priority 1, 2, or 3 (with 1 being highest and 3 being the lowest priority).

Languages spoken is a priority 3 variable. Priority 3 variables do not fit in directly with the main purpose of a census but are still important to certain groups. These variables are given third priority in terms of quality, time, and resources across all phases of a census.

The census priority level for languages spoken remains the same as in 2013.

Quality Management Strategy and the Information by variable for languages spoken (2013) have more information on the priority rating.

Overall quality rating for 2018 Census

High quality

The Data quality processes section below has more detail on the rating for this variable.

The External Data Quality Panel has provided an independent assessment of the quality of this variable and has rated it as very high to poor quality, depending on the language. 2018 Census External Data Quality Panel: Assessment of Variables and Final report of the 2018 Census External Data Quality Panel have more information.

Subject population

Census usually resident population

‘Subject population’ means the people, families, households, or dwellings to whom the variable applies.

How this data is classified

Language - Standard Classification 2V2.0.0

Languages spoken is a hierarchical classification with four levels. Level 1 has 26 categories. Level 2 contains 30 categories. Level 3 contains 49 categories. Level 4 contains 196 categories. The categories in the first level are:

0001 Germanic

0002 Romance

0003 Greek

0004 Balto-Slavic

0005 Albanian

0006 Armenian

0007 Indo-Aryan

0008 Celtic

0009 Iranian

0010 Turko-Altaic

0011 Uralic

0012 Dravidian

0013 Sino-Tibeto-Burman

0014 Austroasiatic

0015 Tai-Kadai

0016 Central-Eastern Malayo-Polynesian

0017 Western Malayo-Polynesian

0018 Afro-Asiatic

0019 Niger-Congo (Congo-Kordafanian)

0020 Pidgins and Creoles

0021 Language Isolates

0022 Miscellaneous Language Groupings

0023 Artificial Languages

0024 Sign Language

0066 None (eg too young to talk)

9999 Not elsewhere included

‘Not elsewhere included’ contains the residual categories such as ‘response unidentifiable’, ‘response outside scope’, and ‘not stated’.

Specific languages such as Italian, Japanese, English, and Te Reo Māori are at level 3 of the classification. Up to 6 languages can be selected across all levels in a valid response.

The classifications for the variables derived from languages spoken are:

Census official language indicator 2V2.0.0

000 No language

011 Māori only

012 English only

013 NZ Sign Language only

021 Māori and English only (not NZ Sign Language)

022 Māori and NZ Sign Language only (not English)

023 Māori and other only (not English or NZ Sign Language)

024 English and NZ Sign Language only (not Māori)

025 English and other only (not Māori or NZ Sign Language)

026 NZ Sign Language and other only (not English or Māori)

031 Māori, English and NZ Sign Language (not other)

032 Māori, English and other (not NZ Sign Language)

033 Māori, NZ Sign Language and other (not English)

034 English, NZ Sign Language and other (not Māori)

041 Māori, English, NZ Sign Language and other

051 Other languages only (neither Māori, English nor NZ Sign Language)

099 Not elsewhere included

‘Not elsewhere included’ contains the residual categories of ‘response unidentifiable’, ‘response outside scope’, and ‘languages not stated’.
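The indicator categories above correspond to every combination of Māori, English, NZ Sign Language, and 'other'. The following is a minimal sketch of how a code could be derived from a person's list of languages. The mapping is inferred from the category labels only; the function name, the input format, and the use of `None` for an unusable response are assumptions for illustration, not the derivation used in census processing.

```python
# Illustrative derivation of the official language indicator code.
# The mapping is inferred from the category labels; names and input
# format are hypothetical, not the production derivation.

OFFICIAL = ("Māori", "English", "NZ Sign Language")

# Keys are (Māori, English, NZ Sign Language, other) flags.
INDICATOR_CODES = {
    (False, False, False, False): "000",
    (True, False, False, False): "011",
    (False, True, False, False): "012",
    (False, False, True, False): "013",
    (True, True, False, False): "021",
    (True, False, True, False): "022",
    (True, False, False, True): "023",
    (False, True, True, False): "024",
    (False, True, False, True): "025",
    (False, False, True, True): "026",
    (True, True, True, False): "031",
    (True, True, False, True): "032",
    (True, False, True, True): "033",
    (False, True, True, True): "034",
    (True, True, True, True): "041",
    (False, False, False, True): "051",
}


def official_language_indicator(languages):
    """Return the indicator code for a collection of language names.

    Assumption: None means no usable response (code 099) and an empty
    collection means the person speaks no language (code 000).
    """
    if languages is None:
        return "099"
    spoken = set(languages)
    flags = tuple(name in spoken for name in OFFICIAL)
    has_other = bool(spoken - set(OFFICIAL))
    return INDICATOR_CODES[flags + (has_other,)]


print(official_language_indicator(["English", "Samoan"]))  # 025
```

Using a lookup table keyed on the four flags keeps the sketch aligned with the published category list, so each of the 16 combinations can be checked against its label.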

Census number of languages spoken V2.0.0

00 None

01 One language

02 Two languages

03 Three languages

04 Four languages

05 Five languages

06 Six languages

99 Not elsewhere included

‘Not elsewhere included’ contains the residual categories of ‘response unidentifiable’, ‘response outside scope’, and ‘languages not stated’.
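As a rough illustration, the number of languages spoken code can be thought of as a zero-padded count of the distinct languages in a valid response. The sketch below assumes a hypothetical input (a list of language names, or `None` where no usable response exists); it is not the derivation used in census processing.

```python
def number_of_languages_code(languages):
    """Return the two-digit 'number of languages spoken' code.

    Assumptions for illustration: None means no usable response
    (code 99), and duplicates in the list are counted once.
    """
    if languages is None:
        return "99"
    count = len(set(languages))
    if count > 6:
        # A valid response can contain at most six languages.
        raise ValueError("a valid response lists at most six languages")
    return f"{count:02d}"


print(number_of_languages_code(["English", "Māori"]))  # 02
```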

The classification of languages spoken in the 2018 Census is consistent with that of both the 2006 Census and the 2013 Census.

The Standards and Classifications page provides background information on classifications and standards.

Question format

Languages spoken is collected on the individual form (question 15 on the paper form).

Stats NZ Store House has samples for both the individual and dwelling paper forms.

There were no differences between the wording or question format in the online and paper versions of this question. However, there were differences in the way a person could respond.

On the online individual form:

  • as-you-type functionality helped respondents provide valid languages in the classification.

On the paper individual forms:

  • responses outside the valid range were possible. Alternative data sources were used to replace these responses.

How this data is used

Outside Stats NZ

  • To formulate, target and monitor policies and programmes to revitalise the Māori language as an official language of New Zealand.
  • As an indicator of iwi vitality and cultural resources.
  • To assess the need to provide multi-lingual pamphlets and other translation services in a variety of areas such as education, health and welfare.
  • To evaluate and monitor existing language education programmes and services.
  • To provide information for television and radio programmes and services.
  • To understand the diversity and diversification of the New Zealand population over time, as well as language maintenance, retention and distribution.

2018 data sources

We used alternative data sources for missing census responses and responses that could not be classified or did not provide the type of information asked for. Where possible, we used responses from the 2013 Census, administrative data from the Integrated Data Infrastructure (IDI), or imputation.

The table below shows the breakdown of the various data sources used for this variable.

2018 languages spoken – census usually resident population

Source                        Percent
Response from 2018 Census     83.8
2013 Census data              8.2
Administrative data           0.0
Statistical imputation        8.0
No information                0.0
Total                         100

Due to rounding, individual figures may not always sum to the stated total(s).
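The percentages in this table can be cross-checked against the more detailed figures in the quality rating calculation table in the Data quality processes section. The sketch below assumes that the four donor rows of that table together make up 'Statistical imputation', and that figures are rounded half up to one decimal place; both the grouping and the rounding rule are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# (detailed source, summary category, percent of total).
# Percentages come from the quality rating calculation table;
# the grouping into summary categories is an assumption.
DETAILED = [
    ("2018 Census form", "Response from 2018 Census", Decimal("83.83")),
    ("2013 Census", "2013 Census data", Decimal("8.22")),
    ("Within household donor", "Statistical imputation", Decimal("1.61")),
    ("Donor's 2018 Census form", "Statistical imputation", Decimal("5.55")),
    ("Donor's response sourced from 2013 Census",
     "Statistical imputation", Decimal("0.61")),
    ("Donor's response sourced from within household",
     "Statistical imputation", Decimal("0.18")),
]


def summarise(rows):
    """Sum detailed percentages by summary category, to one decimal place."""
    totals = {}
    for _, category, percent in rows:
        totals[category] = totals.get(category, Decimal("0")) + percent
    return {
        category: total.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        for category, total in totals.items()
    }


summary = summarise(DETAILED)
for category, percent in summary.items():
    print(f"{category}: {percent} percent")
```

Decimal arithmetic is used so that a sum such as 7.95 rounds predictably, which binary floating point does not guarantee.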

The ‘no information’ percentage is where we were not able to source languages spoken data for a person in the subject population.

Please note that when examining languages spoken data for specific population groups within the subject population, the percentage that is from 2013 Census data and statistical imputation may differ from that for the overall subject population.

Missing and residual responses

‘No information’ in the data sources table above is the percentage of the subject population coded to ‘not stated’. In previous censuses, non-response was the percentage of the subject population coded to ‘not stated’.

In 2018, the percentage of ‘not stated’ is zero due to the use of the additional data sources described above.

Percentage of ‘not stated’ for the census usually resident population:

  • 2018: 0.0 percent
  • 2013: 6.3 percent
  • 2006: 4.9 percent.

In 2018, there were no residual responses remaining in the data due to the use of 2013 Census data and imputation to replace them. In output for previous censuses, responses that could not be classified or did not provide the type of information asked for were grouped with ‘not stated’ and classified as ‘not elsewhere included’.

Percentage of ‘not elsewhere included’ for the census usually resident population:

  • 2018: 0.0 percent
  • 2013: 6.5 percent
  • 2006: 5.1 percent.
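Because ‘not elsewhere included’ groups residual responses with ‘not stated’, the share of residual responses in each census can be approximated as the difference between the two published percentages. The following sketch shows that arithmetic; the published figures are rounded, so the results are approximate, and the variable names are chosen for illustration only.

```python
from decimal import Decimal

# Published percentages for the census usually resident population.
not_stated = {2018: Decimal("0.0"), 2013: Decimal("6.3"), 2006: Decimal("4.9")}
not_elsewhere_included = {
    2018: Decimal("0.0"),
    2013: Decimal("6.5"),
    2006: Decimal("5.1"),
}

# Approximate residual share = 'not elsewhere included' minus 'not stated'.
residual = {
    year: not_elsewhere_included[year] - not_stated[year]
    for year in not_elsewhere_included
}

for year, share in residual.items():
    print(f"{year}: about {share} percent residual responses")
```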

2013 Census data user guide provides more information about non-response in the 2013 Census.

Data quality processes

Overall quality rating: High quality

Data was evaluated to assess whether it meets quality standards and is suitable for use.

Three quality metrics contributed to the overall quality rating:

  • data sources and coverage
  • consistency and coherence
  • data quality.

The lowest rated metric determines the overall quality rating.
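The 'lowest rated metric' rule can be sketched as taking the minimum over an ordered rating scale. The example below assumes the five-step scale from very poor to very high applies to all three metrics; the function and variable names are hypothetical.

```python
# Rating scale ordered from lowest to highest (assumed to be shared
# by all three quality metrics).
RATING_ORDER = ["very poor", "poor", "moderate", "high", "very high"]


def overall_rating(metric_ratings):
    """Return the lowest rating among the metric ratings."""
    return min(metric_ratings.values(), key=RATING_ORDER.index)


languages_spoken = {
    "data sources and coverage": "high",
    "consistency and coherence": "high",
    "data quality": "high",
}
print(overall_rating(languages_spoken))  # high
```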

Data quality assurance for 2018 Census provides more information on the quality rating scale.

Data sources and coverage: High quality

We have assessed the quality of all the data sources that contribute to the output for the variable. To calculate a data sources and coverage quality score for a variable, each data source is rated and multiplied by the proportion it contributes to the total output.

The rating for a valid census response is defined as 1.00. Ratings for other sources are the best estimates available of their quality relative to a census response. The contributions from all sources are summed to give the total score. The total score, expressed as a percentage (for example, 0.96 is 96), then determines the metric rating according to the following ranges:

  • 98–100 = very high
  • 95–<98 = high
  • 90–<95 = moderate
  • 75–<90 = poor
  • <75 = very poor.
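The ranges above can be expressed as a simple threshold lookup. This sketch assumes the score is given on a 0 to 100 scale (so a score of 0.96 would be passed as 96); the function name is hypothetical.

```python
def rating_band(score):
    """Map a data sources and coverage score (0-100) to its rating."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 98:
        return "very high"
    if score >= 95:
        return "high"
    if score >= 90:
        return "moderate"
    if score >= 75:
        return "poor"
    return "very poor"


print(rating_band(96))  # high
```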

2013 Census data was highly comparable to 2018 Census responses while data from within household imputation was mostly comparable to census forms. Data imputed from donor responses was moderately comparable to census forms. The proportions of data from each source along with the ‘no information’ rate of zero percent contributed to the score of 0.96, determining the high quality rating.

Quality rating calculation table for the sources of languages spoken – 2018 census usually resident population

Source                                            Rating    Percent of total    Score contribution
2018 Census form                                  1.00      83.83               0.84
2013 Census                                       0.93      8.22                0.08
Within household donor                            0.70      1.61                0.01
Donor’s 2018 Census form                          0.60      5.55                0.03
Donor’s response sourced from 2013 Census         0.56      0.61                0.00
Donor’s response sourced from within household    0.42      0.18                0.00
No information                                    0.00      0.00                0.00
Total                                                       100.00              0.96

Due to rounding, individual figures may not always sum to the stated total(s) or score contributions.
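The score in the table can be reproduced as a weighted sum of each source's rating and its share of the output. The sketch below uses the ratings and percentages from the table; the variable names are chosen for illustration.

```python
# (source, rating, percent of total) from the calculation table.
SOURCES = [
    ("2018 Census form", 1.00, 83.83),
    ("2013 Census", 0.93, 8.22),
    ("Within household donor", 0.70, 1.61),
    ("Donor's 2018 Census form", 0.60, 5.55),
    ("Donor's response sourced from 2013 Census", 0.56, 0.61),
    ("Donor's response sourced from within household", 0.42, 0.18),
    ("No information", 0.00, 0.00),
]

# Each contribution is rating x proportion of total output.
contributions = {
    name: rating * percent / 100 for name, rating, percent in SOURCES
}
score = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name}: {value:.2f}")
print(f"Total score: {score:.2f}")  # Total score: 0.96
```

The unrounded total is about 0.963, or roughly 96 on the 0 to 100 scale, which falls in the 95–<98 range and so gives the high rating.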

Data sources, editing, and imputation in the 2018 Census has more information on the Canadian census edit and imputation system (CANCEIS) that was used to derive donor responses.

Consistency and coherence: High quality

At level 1 of the classification, this data is highly comparable with the 2013 and 2006 Census data.

Languages spoken data is consistent with expectations across nearly all consistency checks, with some minor variation from expectations or benchmarks that makes sense due to real-world change, incorporation of other sources of data, or a change in how the variable has been collected.

Level 3 of the classification was checked at a national level. There were some inconsistencies with time series at this level. At level 3 of the classification, as-you-type functionality enabled respondents to provide more detailed responses (for example by stating they spoke Fiji Hindi, rather than Hindi). This has resulted in some deviations from time series.

Data quality: High quality

The data quality checks for languages spoken included edits for consistency within the dataset and cross-tabulations to the national and regional council level of geography.

Languages spoken data has only minor data quality issues. The quality of coding and responses within classification categories is high. Any impact of other data sources used is minor. Any issues with the variable appear in a low number of cases (typically in the low hundreds).

Recommendations for use and further information

We recommend that this data be used in a similar way to the 2013 Census data.

When using this data you should be aware that:

  • the language classification category with the highest imputation rate (CANCEIS and probabilistic) is the Central-Eastern Malayo-Polynesian language group. This category includes Te Reo Māori and most of the languages spoken in the Pacific.
  • data has been assessed to be consistent at level 3 of the classification at the national level. Some variation is possible at geographies below this level.

Comparisons with other data sources

Surveys and sources other than the census also collect language data. Data users are advised to familiarise themselves with the strengths and limitations of each source before use.

Key considerations when comparing languages spoken information from the 2018 Census with other sources include:

  • census is a key source of information on languages spoken for small areas and small populations. Many other sources do not provide detail at this level.
  • census aims to count every individual in the population, while other sources that measure this variable, such as Te Kupenga, are based on a sample of the population.

Contact our Information Centre for further information about using this variable.

Revision Information

Currently viewing revision 9 by on 11/03/2020 4:10:14 a.m.

Revision 9 *
11/03/2020 11:39:04 p.m.
Revision 8
19/02/2020 2:59:11 a.m.
Revision 7
27/11/2019 8:24:13 p.m.
Revision 6
19/11/2019 2:19:53 a.m.
Revision 5
3/10/2019 2:16:37 a.m.
Revision 4
22/09/2019 9:53:26 p.m.
