ColossalAI/applications/ColossalQA/colossalqa/text_splitter/chinese_text_splitter.py
YeAnbang e53e729d8e
[Feature] Add document retrieval QA (#5020)
* add langchain

* add langchain

* Add files via upload

* add langchain

* fix style

* fix style: remove extra space

* add pytest; modified retriever

* add pytest; modified retriever

* add tests to build_on_pr.yml

* fix build_on_pr.yml

* fix build on pr; fix environ vars

* seperate unit tests for colossalqa from build from pr

* fix container setting; fix environ vars

* commented dev code

* add incremental update

* remove stale code

* fix style

* change to sha3 224

* fix retriever; fix style; add unit test for document loader

* fix ci workflow config

* fix ci workflow config

* add set cuda visible device script in ci

* fix doc string

* fix style; update readme; refactored

* add force log info

* change build on pr, ignore colossalqa

* fix docstring, captitalize all initial letters

* fix indexing; fix text-splitter

* remove debug code, update reference

* reset previous commit

* update LICENSE update README add key-value mode, fix bugs

* add files back

* revert force push

* remove junk file

* add test files

* fix retriever bug, add intent classification

* change conversation chain design

* rewrite prompt and conversation chain

* add ui v1

* ui v1

* fix atavar

* add header

* Refactor the RAG Code and support Pangu

* Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo.

* resolved conversation. tested scripts under examples. web demo still buggy

* fix ci tests

* Some modifications to add ChatGPT api

* modify llm.py and remove unnecessary files

* Delete applications/ColossalQA/examples/ui/test_frontend_input.json

* Remove OpenAI api key

* add colossalqa

* move files

* move files

* move files

* move files

* fix style

* Add Readme and fix some bugs.

* Add something to readme and modify some code

* modify a directory name for clarity

* remove redundant directory

* Correct a type in  llm.py

* fix AI prefix

* fix test_memory.py

* fix conversation

* fix some erros and typos

* Fix a missing import in RAG_ChatBot.py

* add colossalcloud LLM wrapper, correct issues in code review

---------

Co-authored-by: YeAnbang <anbangy2@outlook.com>
Co-authored-by: Orion-Zheng <zheng_zian@u.nus.edu>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Orion-Zheng <zhengzian@u.nus.edu>
2023-11-23 10:33:48 +08:00

57 lines
2.5 KiB
Python
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

"""
Code for Chinese text splitter
"""
from typing import Any, List, Optional
from colossalqa.text_splitter.utils import get_cleaned_paragraph
from langchain.text_splitter import RecursiveCharacterTextSplitter
class ChineseTextSplitter(RecursiveCharacterTextSplitter):
def __init__(self, separators: Optional[List[str]] = None, is_separator_regrx: bool = False, **kwargs: Any):
self._separators = separators or ["\n\n", "\n", "", "", "", "", "?"]
if "chunk_size" not in kwargs:
kwargs["chunk_size"] = 50
if "chunk_overlap" not in kwargs:
kwargs["chunk_overlap"] = 10
super().__init__(separators=separators, keep_separator=True, **kwargs)
self._is_separator_regex = is_separator_regrx
def split_text(self, text: str) -> List[str]:
"""Return the list of separated text chunks"""
cleaned_paragraph = get_cleaned_paragraph(text)
splitted = []
for paragraph in cleaned_paragraph:
segs = super().split_text(paragraph)
for i in range(len(segs) - 1):
if segs[i][-1] not in self._separators:
pos = text.find(segs[i])
pos_end = pos + len(segs[i])
if i > 0:
last_sentence_start = max([text.rfind(m, 0, pos) for m in ["", "", ""]])
pos = last_sentence_start + 1
segs[i] = str(text[pos:pos_end])
if i != len(segs) - 1:
next_sentence_end = max([text.find(m, pos_end) for m in ["", "", ""]])
segs[i] = str(text[pos : next_sentence_end + 1])
splitted.append(segs[i])
if len(splitted) <= 1:
return splitted
splitted_text = []
i = 1
if splitted[0] not in splitted[1]:
splitted_text.append([splitted[0], 0])
if splitted[-1] not in splitted[-2]:
splitted_text.append([splitted[-1], len(splitted) - 1])
while i < len(splitted) - 1:
if splitted[i] not in splitted[i + 1] and splitted[i] not in splitted[i - 1]:
splitted_text.append([splitted[i], i])
i += 1
splitted_text = sorted(splitted_text, key=lambda x: x[1])
splitted_text = [splitted_text[i][0] for i in range(len(splitted_text))]
ret = []
for s in splitted_text:
if s not in ret:
ret.append(s)
return ret