Digitization of the Australian Parliamentary Debates, 1998


Jul 01, 2023

Scientific Data, volume 10, Article number: 567 (2023)


That proceedings in parliament be made known to the public is a tenet of democracy, and these records are an important source for political science research. In Australia, following the British tradition, the written record of what is said in parliament is known as Hansard. While the Australian Hansard has always been publicly available, it could only be obtained as PDF or XML, making it difficult to use for large-scale macro- and micro-level text analysis. Following the lead of the Linked Parliamentary Data project, which accomplished this for Canada, we provide a new, comprehensive, high-quality rectangular database of the official record of the Australian parliamentary debates from 1998 to 2022. The database is publicly available and can be linked to other datasets, such as election results. The creation and accessibility of this database enables the exploration of new questions and serves as a valuable resource for both researchers and policymakers.

The official record of parliamentary debates, formally known as Hansard1, plays a fundamental role in capturing the history of political proceedings and facilitating the exploration of valuable research questions. Originating in the British Parliament, the production of Hansard became a tradition in many other Commonwealth countries, including Canada and Australia2. Given the content and scale of these records, they are of particular significance in the context of political science research. In the Canadian case, Hansard has been digitized from 1901 through 20193. A digitized version of Hansard allows researchers to conduct text analysis and statistical modeling. Following that project's lead, this paper introduces a similar database for Australia. It consists of a separate dataset for each sitting day of the House of Representatives from March 1998 to September 2022, containing details of everything said in parliament in a format that researchers can use immediately. Given the development of large-scale text analysis tools, this database will serve as a resource for understanding Australian political behavior over time.

There are many possible uses for this database. For instance, within Australia there is considerable concern that the "quality" of debate over public policy, however defined, has declined. Our dataset can be used to examine whether debate has in fact deteriorated along particular dimensions, and if so, why. One might also be interested in whether particular subgroups are adequately represented in what is debated in parliament. For example, there is a common concern that regional areas are neglected compared with major metropolitan areas. Again, our database can be used to examine whether this has changed over time. We developed the database in a way that allows it to be linked with similar databases from other countries, enabling comparative analysis. For instance, one might be interested in how the policy focus of parliaments shifts in response to global events such as pandemics and wars. International linkage provides comparative cases in which domestic issues differ but international issues are shared. To enable this linkage, we include Party Facts IDs (https://partyfacts.herokuapp.com) in our database. This allows our database to be linked with other large parliamentary speech collection projects such as ParlaMint4, ParlSpeech5, ParlEE6, and MAPLE7.

The Australian House of Representatives, often called "the House", performs many important government functions, including making new laws and overseeing government spending8, ch. 1. Politicians in the House are referred to as Members of Parliament (MPs). The House operates under a parallel chamber structure, meaning that there are two debating venues in which proceedings take place: the Chamber and the Federation Chamber. Sittings in the Chamber follow a predefined order of business and are regulated by procedural rules known as standing orders8, ch. 8. A typical sitting day in the Chamber involves a number of scheduled proceedings, including debate on government business, 90-second members' statements, and Question Time8, ch. 8. The Federation Chamber was established in 1994 as a subordinate debating venue of the Chamber. It allows House business to proceed in parallel with Chamber proceedings, enabling better time management8, ch. 21. Sittings in the Federation Chamber differ from sittings in the Chamber in the order of business and the scope of debate: business matters debated in the Federation Chamber are largely limited to the intermediate stages of bills and private members' business8, ch. 21. It is the recording and compilation of these proceedings, essentially though not strictly verbatim, that underpins Hansard.

Each Hansard XML file has a root node, <hansard>, which serves as a container for the entire document. This parent node may have up to four child nodes, where the first child node, <session.header>, contains details on the specific sitting day. Next, <chamber.xscript> contains all proceedings of the Chamber, <fedchamb.xscript> contains all proceedings of the Federation Chamber, and <answers.to.questions> contains Question Time proceedings. The Federation Chamber does not meet on every sitting day, so this child element is not present in every XML file. The use of separate child nodes allows for the distinction of proceedings between the Chamber and Federation Chamber. The structure of the <chamber.xscript> and <fedchamb.xscript> nodes is generally the same, where the proceeding begins with <business.start>, which is followed by a series of debates. Debate nodes can contain a <subdebate.1> child node, which has a <subdebate.2> child node nested within it. That said, sometimes <subdebate.2> is not nested within <subdebate.1>. Each of these three elements (i.e., <debate>, <subdebate.1>, and <subdebate.2>), as well as their respective sub-elements, contains important information on the topic of discussion, who is speaking, and what is being said. The <speech> node within each one contains the bulk of the text associated with that debate or sub-debate. A typical <speech> node begins with a <talk.start> sub-node, providing information on the MP whose turn it is to speak and the time of their first statement. Unsurprisingly, speeches rarely go uninterrupted in parliamentary debate settings; they are often composed of a series of interjections and continuations. These statements are categorized under different sub-nodes depending on their nature, such as <interjection> or <continue>. The final key component of Hansard is Question Time, in which questions and answers are classified as unique elements. More detail on the purpose and processing of Question Time will follow.

Figure 1 illustrates this structure. The file opens with the root node <hansard> (highlighted in blue), followed by a child element <session.header> (highlighted in yellow) with sub-child elements such as the date and parliament number, which are all highlighted in pink.
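The nesting described above can be illustrated with a short parsing sketch. This is a minimal, hypothetical Hansard-style file and standard-library code, not the authors' scripts; the sample content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, invented Hansard-style document illustrating the nesting
# described above (real files are far larger and more varied).
sample = """
<hansard>
  <session.header><date>2012-08-14</date><parliament.no>43</parliament.no></session.header>
  <chamber.xscript>
    <business.start><para>The Speaker took the chair.</para></business.start>
    <debate>
      <speech>
        <talk.start><talker><name>Jane Citizen</name></talker></talk.start>
        <talk.text>I rise to speak on the bill.</talk.text>
      </speech>
    </debate>
  </chamber.xscript>
</hansard>
"""

root = ET.fromstring(sample)
# Top-level children separate sitting-day metadata from Chamber proceedings.
children = [child.tag for child in root]
date = root.findtext("session.header/date")
speech_text = root.findtext(".//talk.text")
```

Note that the Federation Chamber and Question Time child nodes are simply absent here, mirroring the fact that not every file contains them.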
Next, there is the child element containing everything that takes place in the Chamber, <chamber.xscript>, which is also highlighted in yellow in Fig. 1. As previously mentioned, the first sub-node of <chamber.xscript> is <business.start>. Its structure can be seen between the nodes highlighted in green in Fig. 1, where the content we parse from the business start is highlighted in orange.

One early task was distinguishing proceedings in <chamber.xscript> versus <fedchamb.xscript>. The next key task stemmed from the fact that the raw text data were not separated by each statement when parsed. In other words, any interjections, comments made by the Speaker or Deputy Speaker, and continuations within an individual speech were all parsed together as a single string. As such, the name, name ID, electorate, and party details were only provided for the person whose turn it was to speak. There were many intricacies in the task of splitting these speeches in a way that would be generalizable across sitting days. Details on these are provided later.

Not every element appears in every file: some sitting days had no <answers.to.questions> content, and some days did not have a Federation Chamber proceeding. To improve the generalizability of these scripts, if-else statements were embedded within the code wherever an error might arise due to a missing element. For example, the entire Federation Chamber block of code is wrapped in an if-else statement for each script, so that it only executes if what the code attempts to parse exists in the file.

The Federation Chamber was formerly known as the Main Committee, and its proceedings are nested under <maincomm.xscript> in all XML files prior to 14 August 2012. Having developed our first script based on Hansard from recent years, all XPath expressions for parsing Federation Chamber proceedings contain the <fedchamb.xscript> specification. To avoid causing issues in our first script, which successfully parses about 10 years of Hansard, we created a second script in which we replaced all occurrences of <fedchamb.xscript> with <maincomm.xscript>. After making this modification and accounting for other small changes such as timestamp formatting, this second script successfully parses all Hansard sitting days from 10 May 2011 to 28 June 2012 (inclusive). In these files, the child nodes of <speech> are typically <talk.start> and <talk.text>.
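The guarded-parsing idea can be sketched as follows. This is an illustrative helper, not the authors' code; the tag names follow the structure discussed above, and the sample documents are invented.

```python
import xml.etree.ElementTree as ET

def federation_chamber_node(root):
    """Return the Federation Chamber node, or None if absent that day.

    Newer files nest these proceedings under fedchamb.xscript; files
    prior to 14 August 2012 used maincomm.xscript (Main Committee).
    """
    node = root.find("fedchamb.xscript")
    if node is None:
        node = root.find("maincomm.xscript")
    return node

# Invented sitting days: one with Federation Chamber proceedings, one without.
with_fed = ET.fromstring("<hansard><chamber.xscript/><fedchamb.xscript/></hansard>")
without_fed = ET.fromstring("<hansard><chamber.xscript/></hansard>")
```

Checking `is None` explicitly matters here: an ElementTree element with no children is falsy, so a bare truth test would wrongly skip an empty but present node.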
The first child node, <talk.start>, contains data on the person whose turn it is to speak, and the second, <talk.text>, contains the entire contents of that speech, including all interjections, comments, and continuations. After the <talk.text> element closes, there are typically a series of other child nodes which provide a skeleton structure for how the speech proceedings went in chronological order. For example, if the speech began, was interrupted by an MP, and then continued uninterrupted until the end, there would be one <interjection> node and one <continue> node following the <talk.text> node. These would contain details on the MP who made each statement, such as their party and electorate.

Hansard XML files preceding 10 May 2011 do not contain a <talk.text> node. Rather than this single child node that contains all speech content, statements are categorized in individual child nodes. This means that, unlike our code for parsing more current Hansards, we cannot specify a single XPath expression such as “chamber.xscript//debate//speech/talk.text” to extract all speeches, in their entirety, at once. This difference in nesting structure made many components of our second script unusable for processing transcripts preceding 10 May 2011, and required us to change our data processing approach considerably.

Given the absence of a <talk.text> node, we found that the most straightforward way to preserve the ordering of statements and to parse all speech contents at once was to parse from the <debate> element directly. The reason we did not use its <speech> child node is that every speech has a unique structure of node children, which makes it difficult to write data-cleaning code that is generalizable across all speeches and sitting days. The challenge with parsing through the <debate> element is that every piece of data stored in that element is parsed as a single string, including all <talk.start> data and all nested sub-debate data. For example, the data shown in Fig. 2 would be parsed as a single string preceding the speech content.
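The single-string behavior described above can be reproduced with a small sketch. The fragment below is an invented pre-2011-style debate (statement-level child nodes, no <talk.text>); flattening all nested text shows how speaker metadata and interjections collapse into one string.

```python
import xml.etree.ElementTree as ET

# Invented pre-May-2011-style debate: no <talk.text>, statements
# categorized in individual child nodes.
old_style = """
<debate>
  <speech>
    <talk.start><talker><name>Jane Citizen</name><time.stamp>10:01</time.stamp></talker></talk.start>
    <para>I move the motion.</para>
    <interjection>
      <talk.start><talker><name>John Other</name></talker></talk.start>
      <para>Shame!</para>
    </interjection>
    <continue><para>As I was saying.</para></continue>
  </speech>
</debate>
"""

debate = ET.fromstring(old_style)
# itertext() walks all nested text in document order, so talker metadata,
# speech text, and interjections all end up in one undifferentiated string.
flat = " ".join(t.strip() for t in debate.itertext() if t.strip())
```

The resulting string interleaves speaker names and timestamps with the spoken content, which is exactly why the statements must later be re-split on known patterns.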

To separate these statements, we extracted the patterns of speaker data that appear in each <talk.start> node, and used them to split statements wherever one of these patterns was found. After separating the statements, we were able to remove these patterns from the body of text. We also used this method of extracting and later removing unwanted patterns for other pieces of data which did not belong to the debate proceedings, such as sub-debate titles.

Questions and answers without notice are embedded in the <chamber.xscript> child node, with sub-child nodes called <question> and <answer> to differentiate the two. Questions in writing, however, are embedded in their own child node, <answers.to.questions>, at the end of the XML file.

The method of parsing speeches used in all four scripts meant that all questions without notice content was already parsed in order. For the first two scripts, questions and answers were already separated onto their own rows. For the third and fourth scripts, just as we did with the rest of the speech content, we used those patterns of data preceding the text to separate questions and answers. Finally, since questions in writing exist in their own child node, we were able to use the same parsing method for all scripts, which was to extract all question and answer elements from the <answers.to.questions> child node.

For the third and fourth scripts, we used the patterns of data from <talk.start> nodes to separate speeches. As evident in Fig. 3, <talk.start> nodes are nested within <interjection> nodes, meaning that the patterns of data from interjection statements were separated out in the process. This meant that we did not need to create lists of names and titles for which to search in the text as we did before. However, we used the same list of general interjection statements on which to separate as was used in the first two scripts. We then did an additional check for statements that may not have been separated due to how they were embedded in the XML, and separated those out where needed. In particular, while most statements were categorized in their own child node and hence captured through pattern-based separation, some were not individually categorized and had to be split manually in this step.

The <talker> nodes contain important data on the MP making each statement.
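The pattern-based splitting step can be sketched as follows. The flattened string and the speaker patterns are invented for illustration; the point is the mechanism of splitting on known patterns and then stripping them from the statement bodies.

```python
import re

# A flattened debate string of the kind produced by parsing an old-format
# <debate> node directly (names, times, and wording are invented).
flat = ("Jane Citizen, 10:01: I move the motion. "
        "John Other: Shame! "
        "Jane Citizen, 10:03: As I was saying.")

# Speaker patterns, as would be extracted from <talk.start> metadata.
patterns = ["Jane Citizen, 10:01:", "John Other:", "Jane Citizen, 10:03:"]

# Split wherever a known pattern occurs; the capturing group keeps each
# pattern so it can be paired with, then removed from, its statement.
splitter = re.compile("(" + "|".join(map(re.escape, patterns)) + ")")
parts = [p.strip() for p in splitter.split(flat) if p.strip()]
statements = [(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
```

Each resulting pair holds the speaker pattern and the statement body, so the pattern can be dropped from the text while still identifying who spoke.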
As such, we could extract those data associated with each pattern by parsing one element inward, using the XPath expression “talk.start/talker”. We created a pattern lookup table with these data, and merged it with the main Hansard dataframe by the first pattern detected in each statement. Figure 6 provides an example of that lookup table. This approach enabled us to fill in missing data on each MP speaking using data extracted directly from the XML. Finally, we then used the AustralianPoliticians dataset to fill in other missing data, and flagged interjections in the same manner as before.

Divisions (formal votes) are recorded separately from speech content in their own <division> nodes, which contain the voting data and division result. Since we focus primarily on the spoken Hansard content, our parsing scripts do not necessarily capture all divisions data from House proceedings. Our approach to parsing Hansard in the third and fourth scripts described in the Script Differences section naturally allowed much of the divisions data to be added to our resulting files for 1998 to March 2011; however, the parsing scripts used for May 2011 to September 2022 Hansard did not. To supplement our database and fill this divisions data gap, we created an additional file containing all divisions data nested under the XPath “//chamber.xscript//division” from the Hansard files in our time frame. To produce this data file, for each Hansard XML we parsed the header, data, and result child-nodes where they existed, extracted any timestamps where available, and did any additional data cleaning as necessary. We used a series of if-else statements in this script to account for variation in the structure of the <division> node over time. Finally, we then added a date variable to distinguish between sitting days.

As a validation check, we confirmed that the date in each file name matched the date recorded in the <session.header> element. All files but one passed this test: we detected a discrepancy in an XML file from 3 June 2009, where its session header contained the wrong date.
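The lookup-table construction can be sketched in a few lines. The fragment and the sub-element names under <talker> are illustrative assumptions, not the exact Hansard schema; the mechanism shown is parsing one element inward from each pattern to recover MP details.

```python
import xml.etree.ElementTree as ET

# Invented talk.start fragment; sub-element names under <talker>
# (name, electorate, party) are assumptions for illustration.
fragment = """
<debate>
  <talk.start>
    <talker>
      <name>Jane Citizen</name>
      <electorate>Example</electorate>
      <party>IND</party>
    </talker>
  </talk.start>
</debate>
"""

debate = ET.fromstring(fragment)
# Parse one element inward from each pattern: talk.start/talker.
lookup = {}
for start in debate.iter("talk.start"):
    talker = start.find("talker")
    lookup[talker.findtext("name")] = {
        "electorate": talker.findtext("electorate"),
        "party": talker.findtext("party"),
    }
```

In the actual pipeline, a table like this would be merged onto the statement-level dataframe by the first pattern detected in each statement, filling in missing MP details.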
We validated that our file name and date were correct by checking the official PDF release from that sitting day.
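A date-consistency check of this kind can be sketched as below. The file-naming scheme (an ISO date prefix) and the <session.header> date format are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def date_matches(file_name: str, xml_string: str) -> bool:
    """Check that the file-name date matches the session header date.

    Assumes file names begin with an ISO date (e.g. "2009-06-03.xml")
    and that the header stores the date in the same format.
    """
    header_date = ET.fromstring(xml_string).findtext("session.header/date")
    return header_date is not None and file_name.startswith(header_date)

ok = date_matches(
    "2009-06-03.xml",
    "<hansard><session.header><date>2009-06-03</date></session.header></hansard>",
)
mismatch = date_matches(
    "2009-06-03.xml",
    "<hansard><session.header><date>2009-06-02</date></session.header></hansard>",
)
```

A file failing this check, as the 3 June 2009 sitting day did, is the signal to consult the official PDF release and decide which of the two dates is authoritative.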