ShellKode Blog

Official tech blog from ShellKode. We publish posts about our engineering team’s work on AWS, GCP, Data, and open-source tools.


Build Multilingual Translation with Cohere Model on Amazon Bedrock

7 min read · Jul 11, 2024

Seamless communication and access to information across languages is crucial in our increasingly diverse and interconnected world. However, language barriers can often impede an organization’s reach to new markets and its potential target audience. Advancements in artificial intelligence (AI), particularly in multilingual embeddings and advanced conversational systems, provide promising solutions. By leveraging these technologies alongside the extensive foundation models (FMs) and capabilities offered by Amazon Bedrock, organizations can effortlessly build GenAI solutions and applications.

In this blog post, I’ll delve into my recent work utilizing Cohere’s powerful multilingual embedding model and Langchain’s flexible framework to build a multilingual retrieval-based QA system. This system can handle over 100 languages, making information retrieval and question-answering truly accessible on a global scale.

Cultural misunderstandings directly stand in the way of businesses’ efforts to maximize revenue. According to a survey, American businesses lose over $2 billion annually to such misunderstandings. — United States Committee on Economic Development

Understanding the Building Blocks:

At the core of this system lie three key components:

  • Cohere’s Multilingual Embedding Model: This state-of-the-art AI model takes text in various languages and transforms it into numerical vectors — essentially, capturing the semantic meaning of the text in a way that allows for comparison and retrieval across different languages. This unique ability makes it a game-changer in bridging the language gap.
embeddings = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")
  • Anthropic’s Claude on Amazon Bedrock: Anthropic develops the Claude family of large language models (LLMs), which Amazon Bedrock hosts as managed foundation models. This gives us access to sophisticated capabilities such as question answering and text generation, letting us leverage the power of LLMs in our QA system.
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model = BedrockChat(
    client=bedrock_client,
    model_id=model_id,
    model_kwargs=model_kwargs,
)
  • Langchain: This framework serves as the foundation for building text-based AI pipelines. It streamlines the integration of various models and tools like retrievers and summarizers, simplifying the development process and enabling the construction of complex workflows.
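To make the embedding idea concrete, here is a minimal sketch of how semantically similar sentences in different languages end up close together in vector space. The 3-dimensional vectors below are hypothetical stand-ins for the model’s real output (Cohere’s embed-multilingual-v3 produces much higher-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real multilingual embeddings
chunks = {
    "The invoice is due on Friday.": [0.90, 0.10, 0.20],
    "La facture est due vendredi.":  [0.88, 0.12, 0.21],  # French paraphrase
    "The cat sat on the mat.":       [0.10, 0.90, 0.30],  # unrelated
}
query_vec = [0.87, 0.15, 0.20]  # stand-in embedding of "When is the invoice due?"

# Retrieve the top-2 chunks by cosine similarity to the query
top_k = sorted(chunks, key=lambda c: cosine_similarity(chunks[c], query_vec),
               reverse=True)[:2]
print(top_k)
```

Because the English sentence and its French paraphrase share nearly the same direction in vector space, both rank above the unrelated sentence, which is exactly what enables cross-language retrieval.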

Building the Multilingual QA Pipeline:

The system’s functionalities can be broken down into several key steps:

  • Document Loading and Splitting: The process begins by extracting text from a document using a document loader. The extracted text is then divided into smaller, manageable chunks to optimize the retrieval process.
text_splitter = RecursiveCharacterTextSplitter()
  • Multilingual Embedding Generation: Each text chunk is then fed into Cohere’s multilingual embedding model (model_id = cohere.embed-multilingual-v3), generating a dense vector representation. This vector encodes the semantic meaning of the text chunk in a numerical format, enabling efficient comparisons and retrieval based on semantic similarity.
  • PGVector for Efficient Retrieval: Instead of relying on FAISS, we leverage PGVector, an extension of PostgreSQL that enables storing and searching vector data directly within the database. This approach offers several advantages, including:
    - Seamless Integration: PGVector integrates seamlessly with existing PostgreSQL infrastructure, making it easy to manage and query vector data alongside other document information.
    - Scalability: PGVector leverages the inherent scalability of PostgreSQL, allowing the system to handle larger datasets and increase user traffic effectively.
    - Cost-Effectiveness: By utilizing existing database infrastructure, PGVector can be a more cost-effective solution compared to external retrieval engines.
db = PGVector.from_documents(
    embedding=embeddings,
    documents=texts,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=False
)
  • Retrieval-Based QA Chain: Langchain is used to construct a retrieval-based QA chain that ties the retriever and the LLM together:
    - Retriever: This component leverages PGVector’s powerful search capabilities to identify the most relevant text passages from the document based on the user’s query. It finds the passages in the document that are semantically most similar to the query, expressed as a vector.
qa = RetrievalQA.from_chain_type()
  • LLM (Large Language Model): Anthropic’s Claude model takes the retrieved content (relevant text passages) and the user’s question as input and generates an answer that is both informative and strictly adheres to the language of the query. This ensures that users receive answers in their own language, eliminating the need for translation and fostering better understanding.
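Conceptually, the chain above is “retrieve, stuff into the prompt, generate.” A minimal sketch with stand-in components (the real system uses PGVector’s retriever and Claude on Bedrock; the passages and functions below are hypothetical):

```python
def retrieve(query):
    # Stand-in for PGVector similarity search over embedded chunks
    return ["The invoice is due on Friday.", "Payments are accepted by bank transfer."]

def generate(prompt):
    # Stand-in for the Claude model on Bedrock
    return "stub answer for: " + prompt.splitlines()[-1]

TEMPLATE = (
    "Answer only from the context, in the language of the question.\n"
    "Context:\n{context}\n"
    "Question: {question}"
)

def answer(question):
    passages = retrieve(question)
    # "Stuff" chain: concatenate all retrieved passages into a single prompt
    prompt = TEMPLATE.format(context="\n".join(passages), question=question)
    return generate(prompt)

print(answer("When is the invoice due?"))
```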

Experimentation and Results:

The system exhibited promising results in the preliminary stages of testing retrieval-based question-answering (QA) capabilities. Specifically, it demonstrated its proficiency in extracting relevant information from Hindi documents and formulating precise answers in English. This underscores the system’s remarkable ability to transcend linguistic barriers effectively.

Furthermore, regardless of whether the user poses queries in English or Hindi, the system adapts seamlessly, generating responses in the language of the query. This adaptability highlights the system’s versatility in language processing, enhancing its utility across diverse linguistic contexts.

The Potential of Multilingual AI:

This project serves as a stepping stone towards a future where language barriers no longer impede information access and understanding. The potential applications of this technology are vast and transformative, including:

  • Multilingual Search: Transforming search engines to enable seamless retrieval of information across languages, empowering users to find what they need regardless of their language.
  • Multilingual Customer Support: Providing exceptional customer service by catering to users in their native languages, fostering trust, and improving overall satisfaction.
  • Cross-Lingual Content Aggregation and Exploration: Breaking down language barriers in content discovery, allowing users to explore a wider range of information and perspectives, regardless of the language.
  • Zero-Shot Cross-Lingual Text Classification: Categorizing text across diverse languages without the need for explicit training data in each language, opening doors to more efficient and scalable information processing.

Steps in Building a Multilingual Retrieval-Based QA System

Let us walk through the steps to build a complete multilingual retrieval-based QA system, starting with the required dependencies and the programming language to install.

Dependencies: Install the required dependencies:

  • langchain
  • langchain-community
  • streamlit
  • botocore
  • boto3
  • pypdf

Python: Ensure you have Python installed, preferably version 3.8 or higher, as it’s the programming language we’ll use.
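The dependency list above can be installed in one command (pin versions as needed for your environment):

```shell
pip install langchain langchain-community streamlit botocore boto3 pypdf
```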

Step 1: Initializing the Environment

To set up the environment for the AI-powered Multi-Lingual chatbot, follow these steps:

  • Create a Python file (e.g., main.py) and open it in the code editor.
  • Import necessary libraries
  • Set your AWS credentials using the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
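For example, the credentials can be supplied as environment variables, which boto3 picks up automatically (the values below are placeholders; in production, prefer an IAM role or a named AWS CLI profile):

```python
import os

# Placeholder values -- substitute your own credentials
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_ACCESS_KEY"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
```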

Step 2: Processing the Data

  • Upload a PDF document using the provided interface.
  • The backend code will handle the document, splitting it into smaller chunks for processing and storing.
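The splitting step can be approximated with a simple fixed-size chunker. This is a simplified stand-in for Langchain’s RecursiveCharacterTextSplitter, which additionally tries to break on natural separators such as paragraphs and sentences:

```python
def split_text(text, chunk_size=400, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("x" * 1000)
print([len(c) for c in chunks])  # → [400, 400, 200]
```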

Step 3: Leveraging Bedrock Embeddings for Multilingual Text

Bedrock embeddings are essential for converting text data into numerical vectors, a crucial step in developing our intelligent, multilingual chatbot.

Step 4: Uploading data in PG Vector

Next, we will embed our text data and store it in PG Vector. This enables efficient similarity search and vector retrieval, which are core components of our chatbot’s conversational capabilities.
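PGVector is configured through a SQLAlchemy-style connection string. A sketch of assembling one (the host, database, and credentials below are hypothetical placeholders):

```python
# Hypothetical connection details -- replace with your PostgreSQL/RDS values
user, password = "postgres", "secret"
host, port, database = "localhost", 5432, "vectordb"

CONNECTION_STRING = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
print(CONNECTION_STRING)
```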

Step 5: Integrating Claude using Bedrock

At the core of our chatbot is Claude, a highly capable language generation model. We integrate this model using AWS Bedrock to enable our chatbot to understand user queries and generate intelligent responses.

Step 6: Building the QA Retrieval Chain

To create a dynamic and interactive chatbot, we build the QARetrievalChain by combining Claude LLM and the PG vector database. This chain empowers the chatbot to retrieve relevant responses based on user queries.

Step 7: Chatting with the Chatbot

The chatbot responds in the same language as the user’s input query.

Architecture Diagram:

Putting it all together, here is the complete code:

import boto3
import botocore
from langchain_community.chat_models import BedrockChat
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores.pgvector import PGVector
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.embeddings import BedrockEmbeddings

# Configure generous timeouts and retries for long-running Bedrock calls
config = botocore.config.Config(
    read_timeout=1800,
    connect_timeout=1800,
    retries={"max_attempts": 3}
)

bedrock_client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
    config=config
)

model_kwargs = {
    "max_tokens": 4096,
    "temperature": 0.0,
    "top_k": 3,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman"],
}

model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model = BedrockChat(
    client=bedrock_client,
    model_id=model_id,
    model_kwargs=model_kwargs,
)

# Load the source document and split it into chunks
loader = PyPDFLoader(r"")  # upload your document
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

CONNECTION_STRING = ""  # enter your connection string
COLLECTION_NAME = ""  # enter your collection name

# Cohere's multilingual embedding model on Bedrock
embeddings = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")

# Embed the chunks and store them in PGVector
db = PGVector.from_documents(
    embedding=embeddings,
    documents=texts,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=False
)

template = """
Answer truthfully based on the given text.
Instructions:
1. Identify the language of the user's question.
2. Give the response only in the identified language of the question.
3. Answer only from the text provided; do not generate answers on your own.

For example:
1. If the question is asked in the Tamil language, you must respond in the Tamil language only.

{context}
{question}
"""

# Build the retrieval-based QA chain
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 3})
qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])
chain_type_kwargs = {"prompt": qa_prompt, "verbose": False}
qa = RetrievalQA.from_chain_type(
    llm=model,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    verbose=False
)

question = ""  # ask your question
result = qa.run(question)
print(result)

Conclusion

The multilingual QA system empowers businesses to expand their reach and enter emerging markets by transcending linguistic barriers. This innovative solution not only addresses the challenge of retrieving information across multiple languages but also serves a diverse range of industries, from healthcare to logistics. By streamlining operations and fostering connections across varied user groups, it paves the way for enhanced global engagement and efficiency.

To learn more about our solution, you can reach out to us at: https://www.shellkode.com/contact-us

Author

This blog post is written by Bakrudeen K (Head of AI/ML Practice), and Lalit Khatter (PSA, AWS Ambassador APJ Lead).

